本篇博文主要内容为 2025-09-23 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-09-23)
今日共更新1054篇论文,其中:
- 自然语言处理共191篇(Computation and Language (cs.CL))
- 人工智能共275篇(Artificial Intelligence (cs.AI))
- 计算机视觉共250篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共274篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
【速读】: 该论文旨在解决当前通用多模态嵌入模型在处理查询与候选对象之间的语义相关性时存在的两个关键问题:一是将查询和候选对象压缩为单一向量,导致细粒度信息表达能力受限;二是生成过多向量,造成多向量检索的计算开销过大。解决方案的核心在于提出MetaEmbed框架,通过在输入序列中添加固定数量的可学习Meta Tokens,并在测试阶段利用其最后一层的上下文表示作为紧凑且表达能力强的多向量嵌入,结合Matryoshka多向量检索训练策略,使模型能够按粒度层次组织信息,从而实现测试时的可扩展性——用户可根据效率需求灵活选择用于索引和检索交互的向量数量,在检索质量和计算效率之间取得平衡。
链接: https://arxiv.org/abs/2509.18095
作者: Zilin Xiao,Qi Ma,Mengting Gu,Chun-cheng Jason Chen,Xintao Chen,Vicente Ordonez,Vijai Mohan
机构: Meta(Meta)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval, where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
zh
[NLP-1] SEQR: Secure and Efficient QR-based LoRA Routing
【速读】: 该论文旨在解决在安全环境中高效选择合适低秩适配器(Low-Rank Adaptation, LoRA)的问题,尤其是在无法进行监督训练路由器的情况下。其核心挑战在于如何在不依赖标注数据的前提下,实现对多个LoRA模块的无监督路由决策。解决方案的关键在于提出SEQR算法,该算法基于激活范数(activation norm)最大化原则,构建了一个理论框架以指导无监督LoRA路由,并通过严格证明保证识别出范数最大化的适配器,从而在保持高效率的同时提供可验证的路由性能保障,显著提升了动态LoRA组合的可扩展性和实用性。
链接: https://arxiv.org/abs/2509.18093
作者: William Fleshman,Benjamin Van Durme
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Low-Rank Adaptation (LoRA) has become a standard technique for parameter-efficient fine-tuning of large language models, enabling large libraries of LoRAs, each for a specific task or domain. Efficiently selecting the correct LoRA adapter for a given input remains a challenge, particularly in secure environments where supervised training of routers may raise privacy concerns. Motivated by previous approaches, we formalize the goal of unsupervised LoRA routing in terms of activation norm maximization, providing a theoretical framework for analysis. We demonstrate the discriminative power of activation norms and introduce SEQR, an unsupervised LoRA routing algorithm designed to maximize efficiency while providing strict routing guarantees. SEQR provably identifies the norm-maximizing adapter with significantly greater efficiency, making it a highly scalable and effective solution for dynamic LoRA composition. We validate our results through experiments that demonstrate improved multi-task performance and efficiency.
zh
[NLP-2] OnePiece: Bringing Context Engineering and Reasoning to Industrial Cascade Ranking System
【速读】: 该论文旨在解决工业级搜索与推荐系统中对大规模语言模型(Large Language Models, LLMs)能力迁移不足的问题,即当前多数工业实践仅简单移植Transformer架构,未能充分利用LLMs的核心优势——上下文工程(context engineering)与多步推理(multi-step reasoning),从而导致改进效果有限。解决方案的关键在于提出OnePiece框架,其创新性地融合了两种机制:一是结构化上下文工程,将用户交互历史中的偏好和场景信号统一为结构化的token序列输入,增强原始查询的语义丰富度;二是分块潜在推理(block-wise latent reasoning),通过可扩展的块大小实现多层次表示精炼;三是渐进式多任务训练策略,利用用户反馈链监督推理步骤,提升模型在检索与排序阶段的协同优化能力。该方案已在Shopee主搜场景部署,显著提升了GMV/用户数(+2%以上)及广告收入(+2.90%)。
链接: https://arxiv.org/abs/2509.18091
作者: Sunhao Dai,Jiakai Tang,Jiahua Wu,Kun Wang,Yuxuan Zhu,Bingjun Chen,Bangyang Hong,Yu Zhao,Cong Fu,Kangle Wu,Yabo Ni,Anxiang Zeng,Wenjie Wang,Xu Chen,Jun Xu,See-Kiong Ng
机构: Renmin University of China(中国人民大学); Shopee(虾皮); University of Science and Technology of China(中国科学技术大学); National University of Singapore(新加坡国立大学)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: OnePiece Technical Report; Applied in Shopee
Abstract:Despite the growing interest in replicating the scaled success of large language models (LLMs) in industrial search and recommender systems, most existing industrial efforts remain limited to transplanting Transformer architectures, which bring only incremental improvements over strong Deep Learning Recommendation Models (DLRMs). From a first principle perspective, the breakthroughs of LLMs stem not only from their architectures but also from two complementary mechanisms: context engineering, which enriches raw input queries with contextual cues to better elicit model capabilities, and multi-step reasoning, which iteratively refines model outputs through intermediate reasoning paths. However, these two mechanisms and their potential to unlock substantial improvements remain largely underexplored in industrial ranking systems. In this paper, we propose OnePiece, a unified framework that seamlessly integrates LLM-style context engineering and reasoning into both retrieval and ranking models of industrial cascaded pipelines. OnePiece is built on a pure Transformer backbone and further introduces three key innovations: (1) structured context engineering, which augments interaction history with preference and scenario signals and unifies them into a structured tokenized input sequence for both retrieval and ranking; (2) block-wise latent reasoning, which equips the model with multi-step refinement of representations and scales reasoning bandwidth via block size; (3) progressive multi-task training, which leverages user feedback chains to effectively supervise reasoning steps during training. OnePiece has been deployed in the main personalized search scenario of Shopee and achieves consistent online gains across different key business metrics, including over +2% GMV/UU and a +2.90% increase in advertising revenue. Comments: OnePiece Technical Report; Applied in Shopee Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2509.18091 [cs.IR] (or arXiv:2509.18091v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2509.18091 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-3] Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding
【速读】: 该论文旨在解决生成式 AI(Generative AI)中扩散语言模型(Diffusion LLMs, dLLMs)推理速度慢的问题,尤其是在当前开源dLLM通常仅在每个去噪时间步(denoising timestep)生成单个token以保证输出质量的情况下,导致整体生成速率远低于理论潜力。解决方案的关键在于提出Spiffy——一种推测解码(speculative decoding)算法,其核心创新包括:1)利用dLLM自身分布进行自动推测(auto-speculative),无需额外训练独立的草稿模型;2)设计一种专为dLLM双向、分块生成特性定制的有向草稿图(directed draft graph),支持并行验证;3)引入离线校准算法优化草稿图结构以提升接受率(acceptance rate)。这些机制共同实现了2.8–3.1倍的加速比,并在与KV缓存和多标记解掩码(multi-token unmasking)等技术结合时,最高可实现7.9倍的综合加速效果。
链接: https://arxiv.org/abs/2509.18085
作者: Sudhanshu Agrawal,Risheek Garrepalli,Raghavv Goel,Mingu Lee,Christopher Lott,Fatih Porikli
机构: Qualcomm AI Research(高通人工智能研究)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Diffusion LLMs (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs (AR-LLMs) with the potential to operate at significantly higher token generation rates. However, currently available open-source dLLMs often generate at much lower rates, typically decoding only a single token at every denoising timestep in order to maximize output quality. We present Spiffy, a speculative decoding algorithm that accelerates dLLM inference by \mathbf2.8-3.1\times while provably preserving the model’s output distribution. This work addresses the unique challenges involved in applying ideas from speculative decoding of AR-LLMs to the dLLM setting. Spiffy proposes draft states by leveraging the dLLM’s distribution itself in an auto-speculative manner. This approach is efficient and effective, and eliminates the overheads of training and running an independent draft model. To structure the candidate draft states, we propose a novel directed draft graph which is uniquely designed to take advantage of the bidirectional, block-wise nature of dLLM generation and can be verified in parallel by the dLLM. To further optimize the structure of these draft graphs, we introduce an efficient, offline calibration algorithm that procedurally determines high-quality graph configurations. These optimized draft graphs, enabling increased acceptance rates, lead to a significant boost in the overall speedup achieved by the system. Crucially, Spiffy is also complementary to other recent innovations in improving dLLM generation speeds such as KV-caching and multi-token unmasking. We demonstrate that when combined with such parallel decoding algorithms, Spiffy is able to effectively multiply the benefits of these methods leading to total speedups of up to \mathbf7.9\times .
zh
[NLP-4] Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在基础符号推理能力上的不足,尤其是缺乏一个能够持续生成高多样性、可验证且难度可控的训练环境来系统性提升模型的逻辑推理与形式化问题求解能力。解决方案的关键在于提出名为Reasoning Core的新颖可扩展环境,其核心设计原则包括:高泛化性的问题分布、通过外部工具进行奖励验证,以及连续难度控制机制,从而实现对PDDL规划、一阶逻辑、上下文无关文法解析、因果推理和系统方程求解等关键形式化领域的任务自动化生成,为LLMs提供近乎无限的新型训练实例,显著提升其符号推理性能。
链接: https://arxiv.org/abs/2509.18083
作者: Valentin Lacombe,Valentin Quesnel,Damien Sileo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce Reasoning Core, a new scalable environment for Reinforcement Learning with Verifiable Rewards (RLVR), designed to advance foundational symbolic reasoning in Large Language Models (LLMs). Unlike existing benchmarks that focus on games or isolated puzzles, Reasoning Core procedurally generates problems across core formal domains, including PDDL planning, first-order logic, context-free grammar parsing, causal reasoning, and system equation solving. The environment is built on key design principles of high-generality problem distributions, verification via external tools, and continuous difficulty control, which together provide a virtually infinite supply of novel training instances. Initial zero-shot evaluations with frontier LLMs confirm the difficulty of Reasoning Core’s tasks, positioning it as a promising resource to improve the reasoning capabilities of future models.
zh
[NLP-5] ARK-V1: An LLM -Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在回答需要特定领域知识的问题时,因内部知识不足、过时或错误而导致性能受限的问题。尽管知识图谱(Knowledge Graphs, KGs)提供了结构化的外部知识,但其复杂性和多跳推理需求使得与LLMs的有效集成具有挑战性。解决方案的关键在于提出ARK-V1——一个简单的基于知识图谱的智能体(KG-agent),它通过迭代式探索知识图谱来响应自然语言查询,从而实现对长尾实体的KG驱动推理和常识推理的联合建模。实验表明,ARK-V1在CoLoTa数据集上显著优于Chain-of-Thought基线方法,并且更大的LLM骨干网络展现出更高的准确率覆盖范围、正确性和稳定性。
链接: https://arxiv.org/abs/2509.18063
作者: Jan-Felix Klein,Lars Ohnemus
机构: Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院); Institute of Material Handling and Logistics (物流与仓储研究所)
类目: Computation and Language (cs.CL)
备注: Work in Progess
Abstract:Large Language Models (LLMs) show strong reasoning abilities but rely on internalized knowledge that is often insufficient, outdated, or incorrect when trying to answer a question that requires specific domain knowledge. Knowledge Graphs (KGs) provide structured external knowledge, yet their complexity and multi-hop reasoning requirements make integration challenging. We present ARK-V1, a simple KG-agent that iteratively explores graphs to answer natural language queries. We evaluate several not fine-tuned state-of-the art LLMs as backbones for ARK-V1 on the CoLoTa dataset, which requires both KG-based and commonsense reasoning over long-tail entities. ARK-V1 achieves substantially higher conditional accuracies than Chain-of-Thought baselines, and larger backbone models show a clear trend toward better coverage, correctness, and stability.
zh
[NLP-6] MD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang Amdo and Kham Speech Dataset Generation
【速读】: 该论文旨在解决藏语作为低资源语言,其三大方言(卫藏、安多、康)之间平行语音语料稀缺的问题,从而限制了语音建模的发展。解决方案的关键在于提出了一种统一的藏语多方言文本到语音(TMD-TTS)框架,通过显式的方言标签合成跨方言语音;其核心技术包括方言融合模块和方言专用动态路由网络(DSDR-Net),能够有效捕捉方言间细微的声学与语言特征差异,显著提升合成语音的方言表现力,并在更具挑战性的语音到语音方言转换(S2SDC)任务中验证了合成语音的质量与实用性。
链接: https://arxiv.org/abs/2509.18060
作者: Yutong Liu,Ziyue Zhang,Ban Ma-bao,Renzeng Duojie,Yuqing Cai,Yongbin Yu,Xiangxiang Wang,Fan Gao,Cheng Huang,Nyima Tashi
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Tibetan is a low-resource language with limited parallel speech corpora spanning its three major dialects (Ü-Tsang, Amdo, and Kham), limiting progress in speech modeling. To address this issue, we propose TMD-TTS, a unified Tibetan multi-dialect text-to-speech (TTS) framework that synthesizes parallel dialectal speech from explicit dialect labels. Our method features a dialect fusion module and a Dialect-Specialized Dynamic Routing Network (DSDR-Net) to capture fine-grained acoustic and linguistic variations across dialects. Extensive objective and subjective evaluations demonstrate that TMD-TTS significantly outperforms baselines in dialectal expressiveness. We further validate the quality and utility of the synthesized speech through a challenging Speech-to-Speech Dialect Conversion (S2SDC) task.
zh
[NLP-7] he PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的社会模拟研究中存在的方法论缺陷问题,这些问题导致其结论的可信度和可重复性受到质疑。解决方案的关键在于提出并系统化六项核心原则——PIMMUR原则,即异质性(Profile)、交互性(Interaction)、记忆保留(Memory)、最小控制(Minimal-Control)、无知性(Unawareness)和现实性(Realism),用以规范LLM代理在社会模拟中的行为设计与实验验证标准。通过强制实施这些原则重构五项代表性研究,作者发现许多先前报告的社会现象在更严格的条件下无法重现,从而证明PIMMUR原则是实现可靠、可复现的“AI社会”模拟的必要条件。
链接: https://arxiv.org/abs/2509.18052
作者: Jiaxu Zhou,Jen-tse Huang,Xuhui Zhou,Man Ho Lam,Xintao Wang,Hao Zhu,Wenxuan Wang,Maarten Sap
机构: Chinese University of Hong Kong (香港中文大学); Johns Hopkins University (约翰霍普金斯大学); Carnegie Mellon University (卡内基梅隆大学); Fudan University (复旦大学); Stanford University (斯坦福大学); Renmin University of China (中国人民大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Preprint
Abstract:Large Language Models (LLMs) are increasingly used for social simulation, where populations of agents are expected to reproduce human-like collective behavior. However, we find that many recent studies adopt experimental designs that systematically undermine the validity of their claims. From a survey of over 40 papers, we identify six recurring methodological flaws: agents are often homogeneous (Profile), interactions are absent or artificially imposed (Interaction), memory is discarded (Memory), prompts tightly control outcomes (Minimal-Control), agents can infer the experimental hypothesis (Unawareness), and validation relies on simplified theoretical models rather than real-world data (Realism). For instance, GPT-4o and Qwen-3 correctly infer the underlying social experiment in 53.1% of cases when given instructions from prior work-violating the Unawareness principle. We formalize these six requirements as the PIMMUR principles and argue they are necessary conditions for credible LLM-based social simulation. To demonstrate their impact, we re-run five representative studies using a framework that enforces PIMMUR and find that the reported social phenomena frequently fail to emerge under more rigorous conditions. Our work establishes methodological standards for LLM-based multi-agent research and provides a foundation for more reliable and reproducible claims about “AI societies.”
zh
[NLP-8] RadEval: A framework for radiology text evaluation EMNLP2025
【速读】: 该论文旨在解决医学影像报告生成任务中缺乏统一、全面且可复现的评估框架的问题。当前评估方法分散,涵盖从传统n-gram重叠指标(如BLEU、ROUGE)到基于上下文的BERTScore,再到临床概念匹配(如F1CheXbert、F1RadGraph)及大语言模型(LLM)驱动的评分器(如GREEN),但这些指标缺乏标准化实现与系统性整合。解决方案的关键在于提出RadEval——一个开源、统一的评估框架,不仅集成多种先进指标并优化其标准化实现,还通过预训练专用放射学编码器提升零样本检索性能,并提供包含450+临床显著错误标签的专家标注数据集,从而揭示不同评估指标与放射科医生判断的相关性,最终促进放射学文本生成研究的可复现性与基准测试的严谨性。
链接: https://arxiv.org/abs/2509.18030
作者: Justin Xu,Xi Zhang,Javid Abderezaei,Julie Bauml,Roger Boodoo,Fatemeh Haghighi,Ali Ganjizadeh,Eric Brattain,Dave Van Veen,Zaiqiao Meng,David Eyre,Jean-Benoit Delbrouck
机构: University of Oxford (牛津大学); University of Glasgow (格拉斯哥大学); HOPPR
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Demo track - Oral
Abstract:We introduce RadEval, a unified, open-source framework for evaluating radiology texts. RadEval consolidates a diverse range of metrics, from classic n-gram overlap (BLEU, ROUGE) and contextual measures (BERTScore) to clinical concept-based scores (F1CheXbert, F1RadGraph, RaTEScore, SRR-BERT, TemporalEntityF1) and advanced LLM-based evaluators (GREEN). We refine and standardize implementations, extend GREEN to support multiple imaging modalities with a more lightweight model, and pretrain a domain-specific radiology encoder, demonstrating strong zero-shot retrieval performance. We also release a richly annotated expert dataset with over 450 clinically significant error labels and show how different metrics correlate with radiologist judgment. Finally, RadEval provides statistical testing tools and baseline model evaluations across multiple publicly available datasets, facilitating reproducibility and robust benchmarking in radiology report generation.
zh
[NLP-9] Cross-Attention is Half Explanation in Speech-to-Text Models
【速读】: 该论文旨在解决跨注意力机制(cross-attention)在语音到文本(Speech-to-Text, S2T)模型中是否可作为解释性代理的问题。现有研究普遍假设跨注意力分数能反映输入语音表示与生成文本之间的依赖关系,但这一假设在语音领域尚未得到充分验证。为填补该空白,作者通过将跨注意力得分与基于特征归因方法得出的输入显著性图(saliency maps)进行对比,系统评估了其解释能力。解决方案的关键在于采用可解释性分析技术——即特征归因(feature attribution)生成的显著性图作为基准,量化跨注意力在多语言、多任务、多尺度S2T模型中的对齐程度,从而揭示其虽具一定解释力但仅能捕捉约50%的输入相关性,且在最佳情况下仅能部分反映解码器对编码器表示的关注程度(占显著性比例的52–75%)。这一发现表明,跨注意力虽是重要参考指标,但不能作为完整的解释工具。
链接: https://arxiv.org/abs/2509.18010
作者: Sara Papi,Dennis Fucci,Marco Gaido,Matteo Negri,Luisa Bentivogli
机构: Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Cross-attention is a core mechanism in encoder-decoder architectures, widespread in many fields, including speech-to-text (S2T) processing. Its scores have been repurposed for various downstream applications–such as timestamp estimation and audio-text alignment–under the assumption that they reflect the dependencies between input speech representation and the generated text. While the explanatory nature of attention mechanisms has been widely debated in the broader NLP literature, this assumption remains largely unexplored within the speech domain. To address this gap, we assess the explanatory power of cross-attention in S2T models by comparing its scores to input saliency maps derived from feature attribution. Our analysis spans monolingual and multilingual, single-task and multi-task models at multiple scales, and shows that attention scores moderately to strongly align with saliency-based explanations, particularly when aggregated across heads and layers. However, it also shows that cross-attention captures only about 50% of the input relevance and, in the best case, only partially reflects how the decoder attends to the encoder’s representations–accounting for just 52-75% of the saliency. These findings uncover fundamental limitations in interpreting cross-attention as an explanatory proxy, suggesting that it offers an informative yet incomplete view of the factors driving predictions in S2T models.
zh
[NLP-10] hrough the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration
【速读】: 该论文试图解决的问题是:当人类与大语言模型(Large Language Model, LLM)代理进行协作时,传统人机交互(Human-Computer Interaction, HCI)和计算机支持的协同工作(Computer-Supported Cooperative Work, CSCW)中关于计算机中介协作的原则是否依然适用、发生改变或失效。为系统性探究这一问题,论文提出一个开放且可配置的研究平台,其关键在于采用模块化设计,使经典CSCW实验能够无缝迁移,并支持基于理论的交互控制变量的操纵,从而实现对人-LLM代理协作机制的可控实验与深入分析。
链接: https://arxiv.org/abs/2509.18008
作者: Bingsheng Yao,Jiaju Chen,Chaoran Chen,April Wang,Toby Jia-jun Li,Dakuo Wang
机构: Northeastern University (东北大学); Rice University (莱斯大学); University of Notre Dame (圣母大学); ETH Zurich (苏黎世联邦理工学院)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Intelligent systems have traditionally been designed as tools rather than collaborators, often lacking critical characteristics that collaboration partnerships require. Recent advances in large language model (LLM) agents open new opportunities for human-LLM-agent collaboration by enabling natural communication and various social and cognitive behaviors. Yet it remains unclear whether principles of computer-mediated collaboration established in HCI and CSCW persist, change, or fail when humans collaborate with LLM agents. To support systematic investigations of these questions, we introduce an open and configurable research platform for HCI researchers. The platform’s modular design allows seamless adaptation of classic CSCW experiments and manipulation of theory-grounded interaction controls. We demonstrate the platform’s effectiveness and usability through two case studies: (1) re-implementing the classic human-human-collaboration task Shape Factory as a between-subject human-agent-collaboration experiment with 16 participants, and (2) a participatory cognitive walkthrough with five HCI researchers to refine workflows and interfaces for experiment setup and analysis.
zh
[NLP-11] WenetSpeech-Chuan: A Large-Scale Sichuanese Corpus with Rich Annotation for Dialectal Speech Processing
【速读】: 该论文旨在解决汉语方言(特别是四川话)在语音技术研究中缺乏大规模、开源标注数据的问题,这一瓶颈严重制约了相关领域的进展。其解决方案的关键在于构建了一个10,000小时的高质量、丰富标注语料库——WenetSpeech-Chuan,该语料库基于作者提出的新型方言语音处理流水线“Chuan-Pipeline”,实现了从原始数据采集到标准化标注的全流程自动化与规范化;同时配套发布了高精度自动语音识别(ASR)和文本转语音(TTS)评估基准(WenetSpeech-Chuan-Eval),通过实验验证了所建语料库的有效性,使得训练模型在开源系统中达到最先进性能,并接近商用服务水平,从而显著降低方言语音处理的研究门槛,推动语音技术的公平性和多样性发展。
链接: https://arxiv.org/abs/2509.18004
作者: Yuhang Dai,Ziyu Zhang,Shuai Wang,Longhao Li,Zhao Guo,Tianlun Zuo,Shuiyuan Wang,Hongfei Xue,Chengyou Wang,Qing Wang,Xin Xu,Hui Bu,Jie Li,Jian Kang,Binbin Zhang,Lei Xie
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注: 4 pages, 5 figures, 4 tables
Abstract:The scarcity of large-scale, open-source data for dialects severely hinders progress in speech technology, a challenge particularly acute for the widely spoken Sichuanese dialects of Chinese. To address this critical gap, we introduce WenetSpeech-Chuan, a 10,000-hour, richly annotated corpus constructed using our novel Chuan-Pipeline, a complete data processing framework for dialectal speech. To facilitate rigorous evaluation and demonstrate the corpus’s effectiveness, we also release high-quality ASR and TTS benchmarks, WenetSpeech-Chuan-Eval, with manually verified transcriptions. Experiments show that models trained on WenetSpeech-Chuan achieve state-of-the-art performance among open-source systems and demonstrate results comparable to commercial services. As the largest open-source corpus for Sichuanese dialects, WenetSpeech-Chuan not only lowers the barrier to research in dialectal speech processing but also plays a crucial role in promoting AI equity and mitigating bias in speech technologies. The corpus, benchmarks, models, and receipts are publicly available on our project page.
zh
[NLP-12] Variation in Verification: Understanding Verification Dynamics in Large Language Models
【速读】: 该论文旨在解决大规模语言模型(Large Language Models, LLMs)在测试时计算扩展(Test-Time Scaling, TTS)中如何有效提升问题求解准确率的问题,特别是针对无参考答案场景下的验证机制优化。其解决方案的关键在于系统性地研究生成式验证器(generative verifiers)的验证行为,即通过链式思维(Chain-of-Thought, CoT)推理生成二元判断来评估候选解的正确性,并揭示验证效果受问题难度、生成器能力与验证器生成能力三方面因素影响的动态规律。实证研究表明,验证有效性并非单纯依赖验证器强度,而是存在可优化的策略空间,例如弱生成器结合强验证器可在后验证阶段显著逼近强生成器性能,同时识别出验证器增强未必带来收益的边界情况,从而为TTS应用中的资源分配和策略设计提供依据。
链接: https://arxiv.org/abs/2509.17995
作者: Yefan Zhou,Austin Xu,Yilun Zhou,Janvijay Singh,Jiang Gui,Shafiq Joty
机构: Salesforce AI Research (Salesforce人工智能研究); Dartmouth College (达特茅斯学院); University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions - problem difficulty, generator capability, and verifier generation capability - with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than strong generators; (3) Verification ability is generally correlated with the verifier’s own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
zh
[NLP-13] ReDepress: A Cognitive Framework for Detecting Depression Relapse from Social Media EMNLP2025
【速读】: 该论文旨在解决抑郁症复发(depression relapse)在社交媒体上的早期检测问题,这一领域因缺乏标注数据集及难以区分复发与非复发用户而长期未被充分探索。其解决方案的关键在于构建了首个经临床验证的社交媒体数据集ReDepress(包含204名Reddit用户的标注),并基于认知理论引入注意力偏差(attention bias)、解释偏差(interpretation bias)、记忆偏差(memory bias)和反刍(rumination)等心理机制作为特征,在标注与建模中加以融合。实验表明,这些认知标记能显著区分复发与非复发群体,且结合Transformer的时间建模方法在F1指标上达到0.86,验证了认知驱动的计算方法在真实文本数据中的有效性,为低成本、可扩展的心理健康干预提供了新路径。
链接: https://arxiv.org/abs/2509.17991
作者: Aakash Kumar Agarwal,Saprativa Bhattacharjee,Mauli Rastogi,Jemima S. Jacob,Biplab Banerjee,Rashmi Gupta,Pushpak Bhattacharyya
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校); Clinical Psychologist (临床心理学家); Mental Health Consultant (心理健康顾问)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2025 Main Conference
Abstract:Almost 50% depression patients face the risk of going into relapse. The risk increases to 80% after the second episode of depression. Although, depression detection from social media has attained considerable attention, depression relapse detection has remained largely unexplored due to the lack of curated datasets and the difficulty of distinguishing relapse and non-relapse users. In this work, we present ReDepress, the first clinically validated social media dataset focused on relapse, comprising 204 Reddit users annotated by mental health professionals. Unlike prior approaches, our framework draws on cognitive theories of depression, incorporating constructs such as attention bias, interpretation bias, memory bias and rumination into both annotation and modeling. Through statistical analyses and machine learning experiments, we demonstrate that cognitive markers significantly differentiate relapse and non-relapse groups, and that models enriched with these features achieve competitive performance, with transformer-based temporal models attaining an F1 of 0.86. Our findings validate psychological theories in real-world textual data and underscore the potential of cognitive-informed computational methods for early relapse detection, paving the way for scalable, low-cost interventions in mental healthcare.
zh
[NLP-14] Bringing Pedagogy into Focus: Evaluating Virtual Teaching Assistants Question-Answering in Asynchronous Learning Environments EMNLP2025
【速读】: 该论文旨在解决异步学习环境(Asynchronous Learning Environments, ALEs)中缺乏及时且个性化的教学支持问题,同时指出当前对虚拟助教(Virtual Teaching Assistants, VTAs)的评估多依赖表面指标,缺乏教育理论支撑,难以有效比较不同VTAs的 pedagogical效果。解决方案的关键在于构建一个基于学习科学的评估框架,聚焦于ALE中最常见的异步论坛讨论场景,并通过专家标注的VTA回复数据训练分类器,以识别提升准确性的方法及限制泛化能力的挑战,从而为VTAs提供理论驱动的、可验证的教学有效性评估基础。
链接: https://arxiv.org/abs/2509.17961
作者: Li Siyan,Zhen Xu,Vethavikashini Chithrra Raghuram,Xuanming Zhang,Renzhe Yu,Zhou Yu
机构: Columbia University (哥伦比亚大学)
类目: Computation and Language (cs.CL)
备注: Accepted in EMNLP 2025 Findings
Abstract:Asynchronous learning environments (ALEs) are widely adopted for formal and informal learning, but timely and personalized support is often limited. In this context, Virtual Teaching Assistants (VTAs) can potentially reduce the workload of instructors, but rigorous and pedagogically sound evaluation is essential. Existing assessments often rely on surface-level metrics and lack sufficient grounding in educational theories, making it difficult to meaningfully compare the pedagogical effectiveness of different VTA systems. To bridge this gap, we propose an evaluation framework rooted in learning sciences and tailored to asynchronous forum discussions, a common VTA deployment context in ALE. We construct classifiers using expert annotations of VTA responses on a diverse set of forum posts. We evaluate the effectiveness of our classifiers, identifying approaches that improve accuracy as well as challenges that hinder generalization. Our work establishes a foundation for theory-driven evaluation of VTA systems, paving the way for more pedagogically effective AI in education.
zh
[NLP-15] Dorabella Cipher as Musical Inspiration
【速读】: 该论文试图解决百年未解的多拉贝拉密码(Dorabella cipher)的破译问题,其核心假设是该密码并非加密英文文本,而是以音乐形式编码的信息。解决方案的关键在于构建一个简化的音乐记谱系统,并利用n-gram模型对音乐语料库进行训练与验证,从而在已知的单字母替换加密音乐数据上实现有效还原;随后将此方法应用于多拉贝拉密码,生成具有音乐特征的解密结果,并通过艺术化编曲转化为可听旋律,强调破译过程本身即构成创作的一部分。
链接: https://arxiv.org/abs/2509.17950
作者: Bradley Hauer,Colin Choi,Abram Hindle,Scott Smallwood,Grzegorz Kondrak
机构: University of Alberta (阿尔伯塔大学)
类目: Computation and Language (cs.CL)
备注: Published in Proceedings of the Workshop on Speech and Music Processing 2021
Abstract:The Dorabella cipher is an encrypted note written by English composer Edward Elgar, which has defied decipherment attempts for more than a century. While most proposed solutions are English texts, we investigate the hypothesis that Dorabella represents enciphered music. We weigh the evidence for and against the hypothesis, devise a simplified music notation, and attempt to reconstruct a melody from the cipher. Our tools are n-gram models of music which we validate on existing music corpora enciphered using monoalphabetic substitution. By applying our methods to Dorabella, we produce a decipherment with musical qualities, which is then transformed via artful composition into a listenable melody. Far from arguing that the end result represents the only true solution, we instead frame the process of decipherment as part of the composition process.
zh
[NLP-16] HICode: Hierarchical Inductive Coding with LLM s EMNLP2025
【速读】: 该论文旨在解决大规模文本语料库中细粒度分析难以规模化的问题,即传统依赖人工标注的方法无法扩展,而统计工具(如主题建模)又缺乏可控性和解释性。其解决方案的关键在于提出HICode,一个两阶段的自动化分析流程:第一阶段基于质性研究方法从数据中归纳生成标签,第二阶段通过层次聚类挖掘出潜在的主题结构,从而实现对大规模文本的精细化、可解释的分析。
链接: https://arxiv.org/abs/2509.17946
作者: Mian Zhong,Pristina Wang,Anjalie Field
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Long paper accepted at EMNLP 2025 main conference, 19 pages, 8 figures
Abstract:Despite numerous applications for fine-grained corpus analysis, researchers continue to rely on manual labeling, which does not scale, or statistical tools like topic modeling, which are difficult to control. We propose that LLMs have the potential to scale the nuanced analyses that researchers typically conduct manually to large text corpora. To this effect, inspired by qualitative research methods, we develop HICode, a two-part pipeline that first inductively generates labels directly from analysis data and then hierarchically clusters them to surface emergent themes. We validate this approach across three diverse datasets by measuring alignment with human-constructed themes and demonstrating its robustness through automated and human evaluations. Finally, we conduct a case study of litigation documents related to the ongoing opioid crisis in the U.S., revealing aggressive marketing strategies employed by pharmaceutical companies and demonstrating HICode’s potential for facilitating nuanced analyses in large-scale data.
zh
[NLP-17] D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在安全性和对齐性评估中一个关键但被忽视的问题:即模型可能在表面上生成看似无害的输出,而其内部推理过程却包含恶意或欺骗性逻辑,这种现象通常由复杂系统提示注入(system prompt injection)触发,导致传统安全过滤机制失效。解决方案的关键在于提出一个新的评估基准——欺骗性推理暴露套件(Deceptive Reasoning Exposure Suite, D-REX),该套件通过对抗性红队测试构建样本,每个样本包含恶意系统提示、用户查询、表面无害的回复以及揭示真实意图的内部思维链(chain-of-thought),从而实现对“欺骗性对齐”(deceptive alignment)的检测,推动研究从仅关注输出转向深入分析模型内部推理过程。
链接: https://arxiv.org/abs/2509.17938
作者: Satyapriya Krishna,Andy Zou,Rahul Gupta,Eliot Krzysztof Jones,Nick Winter,Dan Hendrycks,J. Zico Kolter,Matt Fredrikson,Spyros Matsoukas
机构: Amazon Nova Responsible AI (Amazon Nova 负责任人工智能); Center for AI Safety (人工智能安全中心); CMU (卡内基梅隆大学); Gray Swan AI (灰天鹅人工智能)
类目: Computation and Language (cs.CL)
备注: Preprint
Abstract:The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model’s internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user’s test query, the model’s seemingly innocuous response, and, crucially, the model’s internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.
zh
[NLP-18] raining-free Truthfulness Detection via Value Vectors in LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)生成内容中存在事实性错误的问题,尤其是现有训练-free方法在检测内容真实性时存在的可扩展性和泛化能力不足的问题。其解决方案的关键在于发现并利用Transformer架构中MLP(多层感知机)模块内的特定值向量(value vectors)所蕴含的与真实性相关的统计模式,从而提出一种名为TruthV的简单且可解释的训练-free方法,通过分析这些向量实现对内容真实性的有效检测。实验表明,MLP模块虽在以往研究中被忽视,却包含丰富且有用的真实性信号,显著优于NoVo和对数似然基线方法。
链接: https://arxiv.org/abs/2509.17932
作者: Runheng Liu,Heyan Huang,Xingchen Xiao,Zhijing Wu
机构: Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models often generate factually incorrect outputs, motivating efforts to detect the truthfulness of their content. Most existing approaches rely on training probes over internal activations, but these methods suffer from scalability and generalization issues. A recent training-free method, NoVo, addresses this challenge by exploiting statistical patterns from the model itself. However, it focuses exclusively on attention mechanisms, potentially overlooking the MLP module-a core component of Transformer models known to support factual recall. In this paper, we show that certain value vectors within MLP modules exhibit truthfulness-related statistical patterns. Building on this insight, we propose TruthV, a simple and interpretable training-free method that detects content truthfulness by leveraging these value vectors. On the NoVo benchmark, TruthV significantly outperforms both NoVo and log-likelihood baselines, demonstrating that MLP modules-despite being neglected in prior training-free efforts-encode rich and useful signals for truthfulness detection. These findings offer new insights into how truthfulness is internally represented in LLMs and motivate further research on scalable and interpretable truthfulness detection.
zh
[NLP-19] ransformer-Encoder Trees for Efficient Multilingual Machine Translation and Speech Translation
【速读】: 该论文旨在解决多语言翻译中因计算冗余和低资源语言翻译准确率不足所带来的挑战,尤其是在语音翻译场景下。其解决方案的关键在于提出一种分层的Transformer编码器树(Hierarchical Transformer Encoder Tree, TET),结合使用连接时序分类(Connectionist Temporal Classification, CTC)训练的非自回归编码器模型。通过在语义或语言学上相近的目标语言间共享中间表示,TET不仅提升了低资源语言的翻译精度,还减少了计算冗余,并支持在单次前向传播中生成所有目标语言,从而消除了序列化瓶颈并显著增强并行性。对于语音翻译任务,将TET与非自回归语音识别骨干网络(如wav2vec2)结合,在保持翻译质量的同时实现7–14倍的速度提升。
链接: https://arxiv.org/abs/2509.17930
作者: Yiwen Guan,Jacob Whitehill
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Multilingual translation faces challenges of computational redundancy and limited accuracy for low-resource languages, especially in speech translation. To address this, we propose a novel hierarchical Transformer Encoder Tree (TET) combined with non-autoregressive encoder-only models trained with Connectionist Temporal Classification for multilingual translation. By sharing intermediate representations among linguistically similar target languages, TET can improve accuracy on low-resource languages, reduce computational redundancy, and allow generating all target languages in a single forward pass, thus eliminating sequential bottlenecks and improving parallelism. For speech translation, combining TET with a non-autoregressive speech recognition backbone (wav2vec2) shows promising results in terms of translation quality compared to autoregressive systems while being 7-14 times faster.
zh
[NLP-20] Improving Zero-shot Sentence Decontextualisation with Content Selection and Planning
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)任务中句子去上下文化(decontextualisation)的问题,即如何从文档中提取出可独立理解的句子,避免因缺乏指代消解(coreference)和背景信息而导致语义不完整或歧义。其解决方案的关键在于提出了一种零样本(zero-shot)的内容选择与规划框架:首先将潜在模糊的句子分割为语义独立的基本单元,接着基于话语关系(discourse relations)从原上下文中识别并提取与这些单元相关的补充内容,最终生成一个内容计划以重构句子,使每个模糊单元都通过引入相关上下文信息而具备自洽性。实验表明,该方法在语义完整性与话语连贯性上优于现有技术。
链接: https://arxiv.org/abs/2509.17921
作者: Zhenyun Deng,Yulong Chen,Andreas Vlachos
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMLNP 2025 (Main Conference)
Abstract:Extracting individual sentences from a document as evidence or reasoning steps is commonly done in many NLP tasks. However, extracted sentences often lack context necessary to make them understood, e.g., coreference and background information. To this end, we propose a content selection and planning framework for zero-shot decontextualisation, which determines what content should be mentioned and in what order for a sentence to be understood out of context. Specifically, given a potentially ambiguous sentence and its context, we first segment it into basic semantically-independent units. We then identify potentially ambiguous units from the given sentence, and extract relevant units from the context based on their discourse relations. Finally, we generate a content plan to rewrite the sentence by enriching each ambiguous unit with its relevant units. Experimental results demonstrate that our approach is competitive for sentence decontextualisation, producing sentences that exhibit better semantic integrity and discourse coherence, outperforming existing methods.
zh
[NLP-21] SiDiaC: Sinhala Diachronic Corpus ACL
【速读】: 该论文旨在解决斯里兰卡僧伽罗语(Sinhala)低资源语言在自然语言处理(Natural Language Processing, NLP)领域长期缺乏高质量历史语料库的问题。为应对这一挑战,研究者构建了首个全面的僧伽罗语历时语料库(SiDiaC),覆盖公元5世纪至20世纪的46部文学作品,共约58,000词,并通过严格的文本筛选标准(包括可用性、作者归属、版权合规与数据溯源)进行标注。关键解决方案在于借鉴其他低资源语言语料库(如FarPaHC)的经验,在文本规范化和句法标注策略上实现标准化处理,同时利用Google Document AI光学字符识别(OCR)技术对国家图书馆藏本进行数字化,并辅以格式修正与正字法现代化后处理,从而为僧伽罗语NLP提供可扩展的基础资源,支持词汇演变、新词追踪、历史句法及基于语料库的词典编纂等历时语言学研究。
链接: https://arxiv.org/abs/2509.17912
作者: Nevidu Jayatilleke,Nisansa de Silva
机构: University of Moratuwa (莫鲁塔瓦大学)
类目: Computation and Language (cs.CL)
备注: 14 pages, 7 figures, 7 tables, Accepted paper at the 39th Pacific Asia Conference on Language, Information and Computation (PACLIC 39)
Abstract:SiDiaC, the first comprehensive Sinhala Diachronic Corpus, covers a historical span from the 5th to the 20th century CE. SiDiaC comprises 58k words across 46 literary works, annotated carefully based on the written date, after filtering based on availability, authorship, copyright compliance, and data attribution. Texts from the National Library of Sri Lanka were digitised using Google Document AI OCR, followed by post-processing to correct formatting and modernise the orthography. The construction of SiDiaC was informed by practices from other corpora, such as FarPaHC, particularly in syntactic annotation and text normalisation strategies, due to the shared characteristics of low-resourced language status. This corpus is categorised based on genres into two layers: primary and secondary. Primary categorisation is binary, classifying each book into Non-Fiction or Fiction, while the secondary categorisation is more specific, grouping texts under Religious, History, Poetry, Language, and Medical genres. Despite challenges including limited access to rare texts and reliance on secondary date sources, SiDiaC serves as a foundational resource for Sinhala NLP, significantly extending the resources available for Sinhala, enabling diachronic studies in lexical change, neologism tracking, historical syntax, and corpus-based lexicography.
zh
[NLP-22] How Persuasive is Your Context? EMNLP2025
【速读】: 该论文旨在解决如何量化语言模型(Language Models, LMs)在接收到特定上下文信息时,其回答分布发生改变的程度这一问题,即衡量上下文对模型决策的“说服力”。传统方法仅通过贪婪解码结果判断说服效果,忽略了模型输出的概率分布变化。论文提出的目标说服分数(Targeted Persuasion Score, TPS),其关键在于利用Wasserstein距离来度量上下文是否将模型原始回答分布推向一个预设的目标分布,从而提供比单一解码结果更精细、更全面的模型行为分析视角。
链接: https://arxiv.org/abs/2509.17879
作者: Tu Nguyen,Kevin Du,Alexander Miserlis Hoyle,Ryan Cotterell
机构: ETH Zürich (苏黎世联邦理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Long paper accepted at EMNLP 2025
Abstract:Two central capabilities of language models (LMs) are: (i) drawing on prior knowledge about entities, which allows them to answer queries such as “What’s the official language of Austria?”, and (ii) adapting to new information provided in context, e.g., “Pretend the official language of Austria is Tagalog.”, that is pre-pended to the question. In this article, we introduce targeted persuasion score (TPS), designed to quantify how persuasive a given context is to an LM where persuasion is operationalized as the ability of the context to alter the LM’s answer to the question. In contrast to evaluating persuasiveness only by inspecting the greedily decoded answer under the model, TPS provides a more fine-grained view of model behavior. Based on the Wasserstein distance, TPS measures how much a context shifts a model’s original answer distribution toward a target distribution. Empirically, through a series of experiments, we show that TPS captures a more nuanced notion of persuasiveness than previously proposed metrics.
zh
[NLP-23] Unsupervised Learning and Representation of Mandarin Tonal Categories by a Generative CNN
【速读】: 该论文旨在解决无监督学习框架下如何建模人类语言习得中的声调学习问题,特别是针对汉语普通话声调这一计算复杂度较高的语言特征。其核心挑战在于不依赖任何标注数据的情况下,使模型自动识别并区分声调类别。解决方案的关键在于提出了一种生成式对抗网络(Generative Adversarial Network, GAN)架构的改进版本——ciwGAN,该模型通过在未标注语音数据上训练,能够自发地将离散的隐变量与普通话的四个声调类别建立映射关系,并在多个训练场景中表现出显著的基频(F0)差异,验证了其对声调对比的有效学习能力。此外,研究进一步通过分析卷积层内部表示,揭示了声调信息在神经网络中的可解释性路径,为深度学习模型的语言学可解释性提供了方法论支持。
链接: https://arxiv.org/abs/2509.17859
作者: Kai Schenck,Gašper Beguš
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:This paper outlines the methodology for modeling tonal learning in fully unsupervised models of human language acquisition. Tonal patterns are among the computationally most complex learning objectives in language. We argue that a realistic generative model of human language (ciwGAN) can learn to associate its categorical variables with Mandarin Chinese tonal categories without any labeled data. All three trained models showed statistically significant differences in F0 across categorical variables. The model trained solely on male tokens consistently encoded tone. Our results sug- gest that not only does the model learn Mandarin tonal contrasts, but it learns a system that corresponds to a stage of acquisition in human language learners. We also outline methodology for tracing tonal representations in internal convolutional layers, which shows that linguistic tools can contribute to interpretability of deep learning and can ultimately be used in neural experiments.
zh
[NLP-24] CorPipe at CRAC 2025: Evaluating Multilingual Encoders for Multilingual Coreference Resolution
链接: https://arxiv.org/abs/2509.17858
作者: Milan Straka
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to CODI-CRAC 2025
[NLP-25] Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理方言词汇变异时能力不足的问题,特别是针对缺乏标准拼写规范的方言(如巴伐利亚德语)中词义识别与翻译的挑战。其解决方案的关键在于提出了一种名为DiaLemma的新颖注释框架,该框架仅基于单语数据构建方言变体词典,并利用该框架创建了一个包含10万条人工标注的德语-巴伐利亚语词对的基准数据集,从而系统评估九种先进LLMs在判断巴伐利亚语词是否为给定德语词根的方言翻译、屈折变体或无关形式方面的表现。
链接: https://arxiv.org/abs/2509.17855
作者: Robert Litschko,Verena Blaschke,Diana Burkhardt,Barbara Plank,Diego Frassinelli
机构: MaiNLP (Machine Intelligence for Natural Language Processing); Ludwig Maximilian University of Munich (慕尼黑路德维希马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025 (Findings)
Abstract:Dialects exhibit a substantial degree of variation due to the lack of a standard orthography. At the same time, the ability of Large Language Models (LLMs) to process dialects remains largely understudied. To address this gap, we use Bavarian as a case study and investigate the lexical dialect understanding capability of LLMs by examining how well they recognize and translate dialectal terms across different parts-of-speech. To this end, we introduce DiaLemma, a novel annotation framework for creating dialect variation dictionaries from monolingual data only, and use it to compile a ground truth dataset consisting of 100K human-annotated German-Bavarian word pairs. We evaluate how well nine state-of-the-art LLMs can judge Bavarian terms as dialect translations, inflected variants, or unrelated forms of a given German lemma. Our results show that LLMs perform best on nouns and lexically similar word pairs, and struggle most in distinguishing between direct translations and inflected variants. Interestingly, providing additional context in the form of example usages improves the translation performance, but reduces their ability to recognize dialect variants. This study highlights the limitations of LLMs in dealing with orthographic dialect variation and emphasizes the need for future work on adapting LLMs to dialects.
zh
[NLP-26] rust Me I Can Convince You: The Contextualized Argument Appraisal Framework
【速读】: 该论文旨在解决情感在论点说服力中的作用尚未被充分建模的问题,尤其是如何将发送者、接收者与论点之间的认知评估过程纳入统一框架。当前研究虽分别探讨了二元情绪(binary emotionality)在论点挖掘中的应用以及一般情绪分析中的认知评估(cognitive appraisal),但二者尚未融合。为此,作者提出了情境化论点评估框架(Contextualized Argument Appraisal Framework),其关键在于通过引入情绪标签、认知评估变量(如论点熟悉度、响应紧迫性、预期努力)及说服力指标,系统刻画论点语境下三者间的动态交互机制。实验基于800条论点的5人标注数据集验证了该框架的有效性,发现积极情绪(如信任)与说服力呈正相关,而消极情绪(如愤怒)则呈负相关,且论点内容本身是引发情绪反应的主要驱动因素。
链接: https://arxiv.org/abs/2509.17844
作者: Lynn Greschner,Sabine Weber,Roman Klinger
机构: University of Bamberg (巴伐利亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Emotions, which influence how convincing an argument is, are developed in context of the self and sender, and therefore require modeling the cognitive evaluation process. While binary emotionality has been studied in argument mining, and the cognitive appraisal has been modeled in general emotion analysis, these fields have not been brought together yet. We therefore propose the Contextualized Argument Appraisal Framework that contextualizes the interplay between the sender, receiver, and argument. It includes emotion labels, appraisals, such as argument familiarity, response urgency, and expected effort, as well as convincingness variables. To evaluate the framework and pave the way to computational modeling, we perform a study in a role-playing scenario, mimicking real-world exposure to arguments, asking participants to disclose their emotion, explain the main cause, the argument appraisal, and the perceived convincingness. To consider the subjective nature of such annotations, we also collect demographic data and personality traits of both the participants and the perceived sender of the argument. The analysis of the resulting corpus of 800 arguments, each annotated by 5 participants, reveals that convincingness is positively correlated with positive emotions (e.g., trust) and negatively correlated with negative emotions (e.g., anger). The appraisal variables disclose the importance of the argument familiarity. For most participants, the content of the argument itself is the primary driver of the emotional response. Subjects: Computation and Language (cs.CL) Cite as: arXiv:2509.17844 [cs.CL] (or arXiv:2509.17844v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.17844 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lynn Greschner [view email] [v1] Mon, 22 Sep 2025 14:32:55 UTC (2,061 KB) Full-text links: Access Paper: View a PDF of the paper titled Trust Me, I Can Convince You: The Contextualized Argument Appraisal Framework, by Lynn Greschner and 2 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-09 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[NLP-27] From Documents to Database: Failure Modes for Industrial Assets IJCAI2025
【速读】: 该论文旨在解决工业设备故障模式与影响分析(Failure Mode and Effects Analysis, FMEA)编制过程中知识密集型、耗时长且依赖人工的难题。解决方案的关键在于构建一个基于基础模型(foundation models)的交互式系统,通过整合用户提供的技术文档中的非结构化内容,自动生成结构化的FMEA,并将其存储于关系型数据库中,从而显著缩短FMEA创建时间并提升效率,优于传统手工方法。
链接: https://arxiv.org/abs/2509.17834
作者: Duygu Kabakci-Zorlu,Fabio Lorenzi,John Sheehan,Karol Lynch,Bradley Eck
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 7 pages, 4 figures. Artificial Intelligence for Knowledge Acquisition Management (AI4KAM) Workshop @ IJCAI 2025
Abstract:We propose an interactive system using foundation models and user-provided technical documents to generate Failure Mode and Effects Analyses (FMEA) for industrial equipment. Our system aggregates unstructured content across documents to generate an FMEA and stores it in a relational database. Leveraging this tool, the time required for creation of this knowledge-intensive content is reduced, outperforming traditional manual approaches. This demonstration showcases the potential of foundation models to facilitate the creation of specialized structured content for enterprise asset management systems.
zh
[NLP-28] Fine-Grained Detection of AI-Generated Text Using Sentence-Level Segmentation
【速读】: 该论文旨在解决传统生成式 AI (Generative AI) 文本检测方法在混合文本或轻微编辑文本中识别效率低下的问题,此类文本常被用于规避检测机制,导致难以区分人类撰写与AI生成内容。其解决方案的关键在于提出一种基于句子级别的序列标注模型,通过捕捉文档内细粒度的语义和句法特征,实现对人类与AI生成文本在token级别上的精准分割。该方法融合了预训练Transformer模型、神经网络(Neural Networks, NN)与条件随机场(Conditional Random Fields, CRFs),其中Transformer负责提取深层语义与语法模式,NN增强序列表示能力,CRF层则优化边界预测,从而显著提升跨人类-AI文本片段边界的识别精度。
链接: https://arxiv.org/abs/2509.17830
作者: Lekkala Sai Teja,Annepaka Yadagiri,and Partha Pakray,Chukhu Chunka,Mangadoddi Srikar Vardhan
机构: National Institute of Technology Silchar (国立技术学院西尔查尔分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 14 pages, 14 figures
Abstract:Generation of Artificial Intelligence (AI) texts in important works has become a common practice that can be used to misuse and abuse AI at various levels. Traditional AI detectors often rely on document-level classification, which struggles to identify AI content in hybrid or slightly edited texts designed to avoid detection, leading to concerns about the model’s efficiency, which makes it hard to distinguish between human-written and AI-generated texts. A sentence-level sequence labeling model proposed to detect transitions between human- and AI-generated text, leveraging nuanced linguistic signals overlooked by document-level classifiers. By this method, detecting and segmenting AI and human-written text within a single document at the token-level granularity is achieved. Our model combines the state-of-the-art pre-trained Transformer models, incorporating Neural Networks (NN) and Conditional Random Fields (CRFs). This approach extends the power of transformers to extract semantic and syntactic patterns, and the neural network component to capture enhanced sequence-level representations, thereby improving the boundary predictions by the CRF layer, which enhances sequence recognition and further identification of the partition between Human- and AI-generated texts. The evaluation is performed on two publicly available benchmark datasets containing collaborative human and AI-generated texts. Our experimental comparisons are with zero-shot detectors and the existing state-of-the-art models, along with rigorous ablation studies to justify that this approach, in particular, can accurately detect the spans of AI texts in a completely collaborative text. All our source code and the processed datasets are available in our GitHub repository.
zh
[NLP-29] owards Adaptive Context Management for Intelligent Conversational Question Answering
【速读】: 该论文旨在解决对话式问答(Conversational Question Answering, ConvQA)系统中因上下文历史过长而导致模型无法有效利用全部信息的问题,尤其是在受限的token长度下如何最大化保留相关性。解决方案的关键在于提出一种自适应上下文管理(Adaptive Context Management, ACM)框架,其核心由三个模块构成:上下文管理器(Context Manager, CM)动态调整上下文大小以保留最相关且最新的信息;摘要模块(Summarization Module, SM)通过滑动窗口对较旧的对话内容进行摘要;实体抽取模块(Entity Extraction Module, EE)在摘要窗口超限时提取并保留最早对话轮次中的关键实体,从而在有限token内优化信息传递效率。实验表明,该框架能显著提升响应的准确性和情境适配性,增强ConvQA系统的鲁棒性和可扩展性。
链接: https://arxiv.org/abs/2509.17829
作者: Manoj Madushanka Perera,Adnan Mahmood,Kasun Eranda Wijethilake,Quan Z. Sheng
机构: 未知
类目: Computation and Language (cs.CL)
备注: Comments: 15 pages, 6 figures, Table 1, published in Lecture Notes in Computer Science (LNCS 15391), Proceedings of ADMA 2024. DOI: https://doi.org/10.1007/978-981-96-0847-8_25
Abstract:This particular paper introduces an Adaptive Context Management (ACM) framework for the Conversational Question Answering (ConvQA) systems. The key objective of the ACM framework is to optimize the use of the conversation history by dynamically managing context for maximizing the relevant information provided to a ConvQA model within its token limit. Our approach incorporates a Context Manager (CM) Module, a Summarization (SM) Module, and an Entity Extraction (EE) Module in a bid to handle the conversation history efficaciously. The CM Module dynamically adjusts the context size, thereby preserving the most relevant and recent information within a model’s token limit. The SM Module summarizes the older parts of the conversation history via a sliding window. When the summarization window exceeds its limit, the EE Module identifies and retains key entities from the oldest conversation turns. Experimental results demonstrate the effectiveness of our envisaged framework in generating accurate and contextually appropriate responses, thereby highlighting the potential of the ACM framework to enhance the robustness and scalability of the ConvQA systems.
zh
[NLP-30] Everyday Physics in Korean Contexts: A Culturally Grounded Physical Reasoning Benchmark EMNLP2025
【速读】: 该论文旨在解决现有物理常识推理基准测试主要聚焦于西方语境、忽视文化差异对物理问题解决影响的问题。为填补这一空白,作者提出了EPiK(Everyday Physics in Korean Contexts),一个包含181个二选一问题的新型基准,专门测试韩国文化背景下的物理推理能力,涵盖从泡菜(kimchi)到传统发酵等84个场景和9类推理子任务。其解决方案的关键在于采用两阶段生成与验证流程,从韩国文化语境中有机生成问题,而非简单翻译已有内容,从而确保问题的文化真实性与严格的物理推理标准。实验表明,针对韩国语境优化的模型在EPiK上显著优于通用模型,凸显了文化无偏模型的局限性,并强调了构建文化敏感型基准对于真正评估语言理解能力的重要性。
链接: https://arxiv.org/abs/2509.17807
作者: Jihae Jeong,DaeYeop Lee,DongGeon Lee,Hwanjo Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to MRL@EMNLP 2025
Abstract:Existing physical commonsense reasoning benchmarks predominantly focus on Western contexts, overlooking cultural variations in physical problem-solving. To address this gap, we introduce EPiK (Everyday Physics in Korean Contexts), a novel benchmark comprising 181 binary-choice problems that test physical reasoning within Korean cultural contexts, ranging from kimchi (Korean food) to traditional fermentation. EPiK is constructed using a two-stage generation and verification pipeline to create culturally-authentic problems across 9 reasoning subtasks and 84 scenarios. Unlike approaches based on simple translation, our method generates problems organically from Korean contexts while upholding rigorous physical reasoning standards. Our evaluations show that Korean-specialized models consistently outperform general-purpose models of comparable size. This performance gap highlights the limitations of culturally-agnostic models and demonstrates the critical need for culturally-aware benchmarks to truly measure language understanding. Our EPiK is publicly available at this https URL.
zh
[NLP-31] Findings of the Fourth Shared Task on Multilingual Coreference Resolution: Can LLM s Dethrone Traditional Approaches?
【速读】: 该论文旨在解决多语言共指消解(Multilingual Coreference Resolution)任务中的系统性能提升与模型适应性问题,特别是在大规模语言模型(Large Language Model, LLM)日益广泛应用的背景下。其关键解决方案在于:首次在共享任务中引入专门面向LLM的赛道,并设计了一种更适配LLM处理的简化文本格式(plaintext format),替代原有的CoNLL-U表示;同时扩展了数据集覆盖范围至17种语言的22个数据集(基于CorefUD v1.3),从而为不同技术路径(包括传统方法和LLM方法)提供公平、统一的评估基准。实验表明,尽管传统系统仍保持领先,但LLM展现出显著潜力,预示其在未来版本中可能成为主流方法。
链接: https://arxiv.org/abs/2509.17796
作者: Michal Novák,Miloslav Konopík,Anna Nedoluzhko,Martin Popel,Ondřej Pražák,Jakub Sido,Milan Straka,Zdeněk Žabokrtský,Daniel Zeman
机构: Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics (查理大学,数学与物理学院,形式与应用语言学研究所); University of West Bohemia, Faculty of Applied Sciences, Department of Computer Science and Engineering (西波希米亚大学,应用科学学院,计算机科学与工程系)
类目: Computation and Language (cs.CL)
备注: Accepted to CODI-CRAC 2025
Abstract:The paper presents an overview of the fourth edition of the Shared Task on Multilingual Coreference Resolution, organized as part of the CODI-CRAC 2025 workshop. As in the previous editions, participants were challenged to develop systems that identify mentions and cluster them according to identity coreference. A key innovation of this year’s task was the introduction of a dedicated Large Language Model (LLM) track, featuring a simplified plaintext format designed to be more suitable for LLMs than the original CoNLL-U representation. The task also expanded its coverage with three new datasets in two additional languages, using version 1.3 of CorefUD - a harmonized multilingual collection of 22 datasets in 17 languages. In total, nine systems participated, including four LLM-based approaches (two fine-tuned and two using few-shot adaptation). While traditional systems still kept the lead, LLMs showed clear potential, suggesting they may soon challenge established approaches in future editions. Comments: Accepted to CODI-CRAC 2025 Subjects: Computation and Language (cs.CL) Cite as: arXiv:2509.17796 [cs.CL] (or arXiv:2509.17796v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.17796 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Milan Straka [view email] [v1] Mon, 22 Sep 2025 13:52:32 UTC (87 KB)
zh
[NLP-32] Learning to vary: Teaching LMs to reproduce human linguistic variability in next-word prediction EMNLP
【速读】: 该论文旨在解决当前语言模型(Language Models, LMs)在自然语言生成(Natural Language Generation, NLG)任务中难以有效再现人类语言多样性的问题,尤其是在上下文存在多种合理词续接的情况下。研究表明,现有模型往往无法充分捕捉这种内在的语言变异性,可能源于训练数据缺乏对多义性或多元观点的系统性建模。解决方案的关键在于采用多标签微调(multi-label fine-tuning)策略,即在训练过程中为每个上下文提供多个合理的词续接标签,从而引导模型学习更贴近人类群体的语言分布特性。实验表明,该方法显著提升了GPT-2和Mistral-7B-IT等预训练及指令微调模型在不同变异性水平上下文中的语言多样性再现能力。
链接: https://arxiv.org/abs/2509.17794
作者: Tobias Groot,Salo Lacunes,Evgenia Ilia
机构: University of Amsterdam (阿姆斯特丹大学)
类目: Computation and Language (cs.CL)
备注: EMNLP UncertaiNLP Workshop 2025
Abstract:Natural language generation (NLG) tasks are often subject to inherent variability; \emphe.g. predicting the next word given a context has multiple valid responses, evident when asking multiple humans to complete the task. While having language models (LMs) that are aligned pluralistically, so that they are able to reproduce well the inherent diversity in perspectives of an entire population of interest is clearly beneficial, \citetilia2024predict show that LMs do not reproduce this type of linguistic variability well. They speculate this inability might stem from the lack of consistent training of LMs with data reflecting this type of inherent variability. As such, we investigate whether training LMs on multiple plausible word continuations per context can improve their ability to reproduce human linguistic variability for next-word prediction. We employ fine-tuning techniques for pre-trained and instruction-tuned models; and demonstrate their potential when fine-tuning GPT-2 and Mistral-7B-IT, using Provo Corpus. Our evaluation, which measures divergence among empirically estimated human and model next-word distributions across contexts before and after fine-tuning, shows that our multi-label fine-tuning improves the LMs’ ability to reproduce linguistic variability; both for contexts that admit higher and lower variability.
zh
[NLP-33] One Agent to Serve All: a Lite-Adaptive Stylized AI Assistant for Millions of Multi-Style Official Accounts
【速读】: 该论文旨在解决工业级官方账号平台中对话代理(Conversational Agents)在生成响应时面临的两大挑战:一是确保回复内容与上下文高度相关(contextually grounded),二是保持与账号风格一致(stylistically aligned)。现有方法如链式思维(Chain-of-thought, CoT)提示导致显著延迟,按账号微调计算成本过高,而长提示模板则削弱了模型对注入上下文和风格的理解能力。其解决方案的关键在于提出WeStar框架,该框架融合基于检索增强生成(Retrieval-Augmented Generation, RAG)的上下文感知生成与基于参数化RAG(Parametric RAG, PRAG)的风格感知生成,通过动态激活不同风格聚类对应的LoRA模块实现轻量自适应,从而在支持百万级官方账号的同时,兼顾生成质量与部署效率。
链接: https://arxiv.org/abs/2509.17788
作者: Xingyu Fan,Feifei Li,Wenhui Que,Hailong Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7 pages
Abstract:Conversational agents deployed in industrial-scale official account platforms must generate responses that are both contextually grounded and stylistically aligned-requirements that existing methods struggle to meet. Chain-of-thought (CoT) prompting induces significant latency due to multi-turn reasoning; per-account fine-tuning is computationally prohibitive; and long prompt-based methods degrade the model’s ability to grasp injected context and style. In this paper, we propose WeStar, a lite-adaptive framework for stylized contextual question answering that scales to millions of official accounts. WeStar combines context-grounded generation via RAG with style-aware generation using Parametric RAG (PRAG), where LoRA modules are dynamically activated per style cluster. Our contributions are fourfold: (1) We introduce WeStar, a unified framework capable of serving large volumes of official accounts with minimal overhead. (2) We propose a multi-dimensional, cluster-based parameter sharing scheme that enables compact style representation while preserving stylistic diversity. (3) We develop a style-enhanced Direct Preference Optimization (SeDPO) method to optimize each style cluster’s parameters for improved generation quality. (4) Experiments on a large-scale industrial dataset validate the effectiveness and efficiency of WeStar, underscoring its pracitical value in real-world deployment.
zh
[NLP-34] DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching
【速读】: 该论文旨在解决当前多语言自然语言处理(Natural Language Processing, NLP)中语言识别(Language Identification, LID)模型在真实场景下性能下降的问题,特别是模型对干净、单一语言数据的过拟合现象。其解决方案的关键在于构建并使用两个新型评估基准:DIVERS-BENCH 和 DIVERS-CS,前者涵盖多种领域(如语音转录文本、网络文本、社交媒体文本、儿童故事及混语文本),后者专门针对10组语言对的代码切换(code-switching)场景。通过这两个基准,研究揭示了现有LID模型在噪声和非正式输入中的显著性能退化,并指出模型难以识别同一句子内多重语言的存在,从而强调了开发更具鲁棒性和包容性的LID系统的重要性。
链接: https://arxiv.org/abs/2509.17768
作者: Jessica Ojo,Zina Kamel,David Ifeoluwa Adelani
机构: Mila - Quebec AI Institute (Mila - 魁北克人工智能研究所); McGill University (麦吉尔大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-BENCH, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children’s stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings.
zh
[NLP-35] A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长时程、多轮对话中面临的信息遗忘和效率低下问题。其解决方案的关键在于提出一种无需训练的提示工程方法——状态更新多轮对话策略(State-Update Multi-turn Dialogue Strategy),该策略通过“状态重构(State Reconstruction)”和“历史提醒(History Remind)”两个机制协同作用,有效管理对话历史,从而显著提升多跳问答(multi-hop QA)任务中的信息过滤准确率与下游问答性能,同时大幅降低推理时间和Token消耗。
链接: https://arxiv.org/abs/2509.17766
作者: Ziyi Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes “State Reconstruction” and “History Remind” mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.
zh
[NLP-36] Qwen 3-Omni Technical Report
【速读】: 该论文旨在解决多模态大模型在跨文本、图像、音频和视频任务中性能难以同时保持最优的问题,即现有模型往往在融合多种模态时出现性能退化,无法达到单模态模型的水平。其关键解决方案是提出Qwen3-Omni,一个统一感知与生成的Thinker-Talker MoE(Mixture of Experts)架构,通过将不同模态的信息在共享表示空间中进行深度融合,实现了在所有模态上均不逊于单模态模型的性能表现;尤其在音频任务上显著优于同类模型,并通过轻量级因果卷积网络替代高复杂度的块状扩散模型,大幅降低流式语音合成的首包延迟至234毫秒,从而支持自然实时语音交互。
链接: https://arxiv.org/abs/2509.17765
作者: Jin Xu,Zhifang Guo,Hangrui Hu,Yunfei Chu,Xiong Wang,Jinzheng He,Yuxuan Wang,Xian Shi,Ting He,Xinfa Zhu,Yuanjun Lv,Yongqi Wang,Dake Guo,He Wang,Linhan Ma,Pei Zhang,Xinyu Zhang,Hongkun Hao,Zishan Guo,Baosong Yang,Bin Zhang,Ziyang Ma,Xipin Wei,Shuai Bai,Keqin Chen,Xuejing Liu,Peng Wang,Mingkun Yang,Dayiheng Liu,Xingzhang Ren,Bo Zheng,Rui Men,Fan Zhou,Bowen Yu,Jianxin Yang,Le Yu,Jingren Zhou,Junyang Lin
机构: Qwen Team(通义千问团队)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
备注: this https URL
Abstract:We present Qwen3-Omni, a single multimodal model that, for the first time, maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source SOTA on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker-Talker MoE architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages. To reduce first-packet latency in streaming synthesis, Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings, Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking, and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.
zh
[NLP-37] WISE: Weak-Supervision-Guided Step-by-Step Explanations for Multimodal LLM s in Image Classification EMNLP2025
【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉理解中对图像内部对象细节(intra-object)认知不足的问题,尤其针对现有多模态思维链(Multimodal Chain-of-Thought, MCoT)方法主要依赖于富含推理过程的数据集且侧重于对象间关系推理、忽视对象内部语义理解的局限性。其解决方案的关键在于提出 WISE 方法——一种基于弱监督的分步解释生成机制,通过将概念瓶颈模型(Concept Bottleneck Models, CBMs)提取的概念表征重构为简洁、可解释的推理链,从而无需额外标注即可为任意图像分类数据集自动构建高质量 MCoT,显著提升模型可解释性(提高 37%)并增强 MLLM 在细粒度视觉理解中的分类性能。
链接: https://arxiv.org/abs/2509.17740
作者: Yiwen Jiang,Deval Mehta,Siyuan Yan,Yaling Shen,Zimu Wang,Zongyuan Ge
机构: Monash University (蒙纳士大学); AIM for Health Lab (AIM健康实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025 (Main)
Abstract:Multimodal Large Language Models (MLLMs) have shown promise in visual-textual reasoning, with Multimodal Chain-of-Thought (MCoT) prompting significantly enhancing interpretability. However, existing MCoT methods rely on rationale-rich datasets and largely focus on inter-object reasoning, overlooking the intra-object understanding crucial for image classification. To address this gap, we propose WISE, a Weak-supervision-guided Step-by-step Explanation method that augments any image classification dataset with MCoTs by reformulating the concept-based representations from Concept Bottleneck Models (CBMs) into concise, interpretable reasoning chains under weak supervision. Experiments across ten datasets show that our generated MCoTs not only improve interpretability by 37% but also lead to gains in classification accuracy when used to fine-tune MLLMs. Our work bridges concept-based interpretability and generative MCoT reasoning, providing a generalizable framework for enhancing MLLMs in fine-grained visual understanding.
zh
[NLP-38] Breaking Token Into Concepts: Exploring Extreme Compression in Token Representation Via Compositional Shared Semantics
【速读】: 该论文旨在解决标准语言模型中单一分散嵌入(monolithic embeddings)难以充分捕捉词汇多义性的问题,即每个词元(token)仅用唯一向量表示,限制了其语义表达的多样性。解决方案的关键在于提出一种名为Aggregate Semantic Grouping (ASG) 的新方法,通过引入产品量化(Product Quantization, PQ)技术,将词元表示为共享语义基元(semantic building blocks)的组合结构,从而实现语义信息的模块化与高效聚合。该方法在保持95%基准模型性能的同时,显著压缩嵌入参数(降至0.4–0.5%),并适用于多种任务(如自然语言推理、命名实体识别、问答)及跨语言迁移和生物医学领域场景,验证了基于组合式表示可有效提升模型的语义丰富性和紧凑性。
链接: https://arxiv.org/abs/2509.17737
作者: Kavin R V,Pawan Goyal
机构: Indian Institute of Technology (印度理工学院)
类目: Computation and Language (cs.CL)
备注: 5 pages, 1 figure
Abstract:Standard language models employ unique, monolithic embeddings for each token, potentially limiting their ability to capture the multifaceted nature of word meanings. We investigate whether tokens can be more effectively represented through a compositional structure that accumulates diverse semantic facets. To explore this, we propose Aggregate Semantic Grouping (ASG), a novel approach leveraging Product Quantization (PQ). We apply ASG to standard transformer architectures (mBERT, XLM-R, mT5) and evaluate this representational scheme across diverse tasks (NLI, NER, QA), as well as a biomedical domain-specific benchmark (BC5CDR) using BioBERT. Our findings demonstrate that representing tokens compositionally via ASG achieves extreme compression in embedding parameters (0.4–0.5%) while maintaining 95% task performance relative to the base model, even in generative tasks and extends to both cross lingual transfer and domain-specific settings. These results validate the principle that tokens can be effectively modeled as combinations of shared semantic building blocks. ASG offers a simple yet concrete method for achieving this, showcasing how compositional representations can capture linguistic richness while enabling compact yet semantically rich models.
zh
[NLP-39] ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLM s
【速读】: 该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Models, LLMs)微调过程中,基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)方法所面临的两个关键问题:一是二元反馈信号过于稀疏,难以刻画推理过程的质量;二是粗粒度奖励可能导致梯度消失。解决方案的关键在于将模型自身的置信度估计(confidence estimates)与可验证结果相结合,从而构建更细粒度的奖励信号,不仅丰富了监督信息,还隐式地对推理过程进行引导。实验表明,该方法在多个数据集上提升了RL性能,并降低了推理阶段的token消耗,同时训练开销几乎不变,且可作为插件模块兼容现有先进RL方法。
链接: https://arxiv.org/abs/2509.17730
作者: Bonan Zhang,Zhongqi Chen,Bowen Song,Qinya Li,Fan Wu,Guihai Chen
机构: 未知
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations: First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards potentially lead to vanishing gradients. Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model’s own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.
zh
[NLP-40] Investigating Bias: A Multilingual Pipeline for Generating Solving and Evaluating Math Problems with LLM s ECAI
【速读】: 该论文旨在解决生成式 AI(Generative AI)在多语言教育场景中存在响应质量差异的问题,特别是针对不同语言环境下数学问题解答的公平性与一致性。其解决方案的关键在于构建了一个自动化的多语言流水线,涵盖数学题目的生成、跨语言翻译(英语、德语、阿拉伯语)、多模型(GPT-4o-mini、Gemini 2.5 Flash、Qwen-plus)生成分步解答,并通过一组受控的 LLM 判官(包括 Claude 3.5 Haiku)进行对比评估,从而量化语言对 AI 教育输出质量的影响,揭示了英语解答普遍优于阿拉伯语解答的现象,凸显出当前多语言 AI 系统中的语言偏见问题。
链接: https://arxiv.org/abs/2509.17701
作者: Mariam Mahran,Katharina Simbeck
机构: HTW Berlin University of Applied Sciences (柏林应用科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at edu4AI’25: 2nd Workshop on Education for Artificial Intelligence | co-located with ECAI, October 26th, 2025, Bologna, Italy. 7 pages, 0 figures
Abstract:Large Language Models (LLMs) are increasingly used for educational support, yet their response quality varies depending on the language of interaction. This paper presents an automated multilingual pipeline for generating, solving, and evaluating math problems aligned with the German K-10 curriculum. We generated 628 math exercises and translated them into English, German, and Arabic. Three commercial LLMs (GPT-4o-mini, Gemini 2.5 Flash, and Qwen-plus) were prompted to produce step-by-step solutions in each language. A held-out panel of LLM judges, including Claude 3.5 Haiku, evaluated solution quality using a comparative framework. Results show a consistent gap, with English solutions consistently rated highest, and Arabic often ranked lower. These findings highlight persistent linguistic bias and the need for more equitable multilingual AI systems in education.
zh
[NLP-41] Evaluating LLM -Generated Versus Human-Authored Responses in Role-Play Dialogues
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长篇、基于知识的角色扮演对话中评估困难的问题,特别是其在多轮专业培训模拟场景下与人类生成回复质量差异的量化与验证。解决方案的关键在于构建一个包含多轮交互的基准测试集,并采用“人工评价 + LLM-as-a-judge”混合评估框架:一方面通过38名参与者的人工评估揭示LLM生成回复随对话轮次递进而显著劣化的趋势(尤其体现在自然度、上下文一致性及整体质量上),另一方面利用Gemini 2.0 Flash模型进行零样本成对偏好判断和六样本随机构造评分,结果与人工评价高度一致,从而验证了LLM与人类回复之间质量差距随时间扩大的现象,为可靠整合LLM于训练模拟提供了可信赖的评估路径。
链接: https://arxiv.org/abs/2509.17694
作者: Dongxu Lu,Johan Jeuring,Albert Gatt
机构: Utrecht University (乌得勒支大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication at the 18th International Natural Language Generation Conference (INLG 2025)
Abstract:Evaluating large language models (LLMs) in long-form, knowledge-grounded role-play dialogues remains challenging. This study compares LLM-generated and human-authored responses in multi-turn professional training simulations through human evaluation ( N=38 ) and automated LLM-as-a-judge assessment. Human evaluation revealed significant degradation in LLM-generated response quality across turns, particularly in naturalness, context maintenance and overall quality, while human-authored responses progressively improved. In line with this finding, participants also indicated a consistent preference for human-authored dialogue. These human judgements were validated by our automated LLM-as-a-judge evaluation, where Gemini 2.0 Flash achieved strong alignment with human evaluators on both zero-shot pairwise preference and stochastic 6-shot construct ratings, confirming the widening quality gap between LLM and human responses over time. Our work contributes a multi-turn benchmark exposing LLM degradation in knowledge-grounded role-play dialogues and provides a validated hybrid evaluation framework to guide the reliable integration of LLMs in training simulations.
zh
[NLP-42] ASO: Task-Aligned Sparse Optimization for Parameter-Efficient Model Adaptation EMNLP2025
【速读】: 该论文旨在解决LoRA(Low-Rank Adaptation)方法中存在的参数冗余问题,即LoRA在微调过程中引入了大量冗余可训练参数,不仅增加了计算负担,还削弱了微调效果。解决方案的关键在于提出TASO(Task-Aligned Sparsity Optimization),其核心思想是利用预训练模型权重中的重要性信息来识别下游任务中关键的参数区域,并基于重要性得分分布确定任务特定的核心区域,进而构建稀疏结构的LoRA模块,从而在微调前有效移除冗余参数。这一方法显著减少了所需可训练参数数量,同时保持或提升了微调性能。
链接: https://arxiv.org/abs/2509.17688
作者: Daiye Miao,Yufang Liu,Jie Wang,Changzhi Sun,Yunke Zhang,Demei Yan,Shaokang Dong,Qi Zhang,Yuanbin Wu
机构: East China Normal University (华东师范大学); Honor Device Co., Ltd. (荣耀终端有限公司); Fudan University (复旦大学)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMNLP 2025 (Main Conference),13 pages,10 figures
Abstract:LoRA has become one of the most widely used parameter-efficient fine-tuning methods due to its simplicity and effectiveness. However, numerous studies have shown that LoRA often introduces substantial parameter redundancy, which not only increases the number of trainable parameters but also hinders the effectiveness of fine-tuning. Since identifying redundant parameters in LoRA is inherently difficult, how to eliminate them efficiently and accurately remains a challenging problem. In this paper, we propose TASO, a redundancy reduction method that leverages importance information from the pretrained model’s weights to mitigate LoRA redundancy. Specifically, we estimate parameter importance on downstream tasks and identify task-specific core regions based on the distribution of importance scores. The location information of these core regions is then used to determine the sparse structure of LoRA modules, enabling redundancy removal before fine-tuning. Our approach significantly reduces the number of trainable parameters required for task adaptation, while providing a novel task-aligned perspective for LoRA redundancy reduction. Experimental results demonstrate that, with a parameter budget comparable to LoRA with rank r = 1 , TASO consistently outperforms standard LoRA across multiple tasks, achieving strong fine-tuning performance while effectively eliminating redundant parameters.
zh
[NLP-43] When TableQA Meets Noise: A Dual Denoising Framework for Complex Questions and Large-scale Tables
【速读】: 该论文旨在解决复杂问题和大规模表格在表问答(TableQA)任务中引入大量噪声数据,从而严重降低推理性能的问题。解决方案的关键在于提出一种双去噪框架EnoTab,其核心包括两个方面:一是基于证据的问答去噪(Evidence-based Question Denoising),通过将问题分解为最小语义单元并依据一致性和可用性标准过滤掉与答案推理无关的部分;二是基于证据树引导的表格去噪(Evidence Tree-guided Table Denoising),构建显式且透明的表格剪枝路径,逐步移除无关数据,并在每一步利用后序节点回滚机制处理异常表格状态,最终生成高可靠性的子表用于最终答案推理。
链接: https://arxiv.org/abs/2509.17680
作者: Shenghao Ye,Yu Guo,Dong Jin,Yikai Shen,Yunpeng Hou,Shuangwu Chen,Jian Yang,Xiaofeng Jiang
机构: University of Science and Technology of China (中国科学技术大学); Institute of Artificial Intelligence, Hefei Comprehensive National Science Center (合肥综合性国家科学中心人工智能研究院)
类目: Computation and Language (cs.CL)
备注: 23 pages, 24 figures
Abstract:Table question answering (TableQA) is a fundamental task in natural language processing (NLP). The strong reasoning capabilities of large language models (LLMs) have brought significant advances in this field. However, as real-world applications involve increasingly complex questions and larger tables, substantial noisy data is introduced, which severely degrades reasoning performance. To address this challenge, we focus on improving two core capabilities: Relevance Filtering, which identifies and retains information truly relevant to reasoning, and Table Pruning, which reduces table size while preserving essential content. Based on these principles, we propose EnoTab, a dual denoising framework for complex questions and large-scale tables. Specifically, we first perform Evidence-based Question Denoising by decomposing the question into minimal semantic units and filtering out those irrelevant to answer reasoning based on consistency and usability criteria. Then, we propose Evidence Tree-guided Table Denoising, which constructs an explicit and transparent table pruning path to remove irrelevant data step by step. At each pruning step, we observe the intermediate state of the table and apply a post-order node rollback mechanism to handle abnormal table states, ultimately producing a highly reliable sub-table for final answer reasoning. Finally, extensive experiments show that EnoTab achieves outstanding performance on TableQA tasks with complex questions and large-scale tables, confirming its effectiveness.
zh
[NLP-44] urk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成过程中存在的幻觉问题,即模型输出看似合理但事实错误的信息,尤其针对形态复杂且资源匮乏的语言如土耳其语(Turkish),现有检索增强生成(Retrieval-Augmented Generation, RAG)系统仍难以有效抑制此类错误。解决方案的关键在于提出首个专为土耳其语RAG应用设计的幻觉检测模型套件——Turk-LettuceDetect,其将幻觉检测建模为token级别的分类任务,并基于三种不同编码器架构进行微调:土耳其语专用的ModernBERT、TurkEmbed4STS以及多语言EuroBERT。该方法通过在机器翻译版本的RAGTruth基准数据集(含17,790个实例)上训练,实现了高精度的幻觉识别能力(F1-score达0.7266),同时保持计算效率并支持长达8,192 tokens的上下文长度,从而为低资源语言提供可靠、可实时部署的幻觉检测机制。
链接: https://arxiv.org/abs/2509.17671
作者: Selva Taş,Mahmut El Huseyni,Özay Ezerceli,Reyhan Bayraktar,Fatma Betül Terzioğlu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The widespread adoption of Large Language Models (LLMs) has been hindered by their tendency to hallucinate, generating plausible but factually incorrect information. While Retrieval-Augmented Generation (RAG) systems attempt to address this issue by grounding responses in external knowledge, hallucination remains a persistent challenge, particularly for morphologically complex, low-resource languages like Turkish. This paper introduces Turk-LettuceDetect, the first suite of hallucination detection models specifically designed for Turkish RAG applications. Building on the LettuceDetect framework, we formulate hallucination detection as a token-level classification task and fine-tune three distinct encoder architectures: a Turkish-specific ModernBERT, TurkEmbed4STS, and multilingual EuroBERT. These models were trained on a machine-translated version of the RAGTruth benchmark dataset containing 17,790 instances across question answering, data-to-text generation, and summarization tasks. Our experimental results show that the ModernBERT-based model achieves an F1-score of 0.7266 on the complete test set, with particularly strong performance on structured tasks. The models maintain computational efficiency while supporting long contexts up to 8,192 tokens, making them suitable for real-time deployment. Comparative analysis reveals that while state-of-the-art LLMs demonstrate high recall, they suffer from low precision due to over-generation of hallucinated content, underscoring the necessity of specialized detection mechanisms. By releasing our models and translated dataset, this work addresses a critical gap in multilingual NLP and establishes a foundation for developing more reliable and trustworthy AI applications for Turkish and other languages.
zh
[NLP-45] PG-CE: A Progressive Generation Dataset with Constraint Enhancement for Controllable Text Generation
【速读】: 该论文旨在解决传统可控文本生成(Controllable Text Generation, CTG)方法在生成质量与控制精度之间难以平衡的问题,尤其针对生成内容的语调、表达风格和主题相关性等多维约束难以动态适配实际需求的局限。其解决方案的关键在于提出PG-CE(Progressive Generation with Constraint Enhancement)框架,将CTG任务分解为类型预测、约束构建与引导生成三个阶段,并引入约束生成模型动态构建包含语调、表达风格和主题焦点在内的多维约束条件,从而实现对生成文本的精细化控制与高质量输出。
链接: https://arxiv.org/abs/2509.17669
作者: Yan Zhuang,Yuan Sun
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid development of Large Language Models (LLMs), Controllable Text Generation (CTG) has become a critical technology for enhancing system reliability and user experience. Addressing the limitations of traditional methods, this paper proposes the PG-CE (Progressive Generation with Constraint Enhancement) approach, which decomposes CTG tasks into three steps: type prediction, constraint construction, and guided generation. This method employs constraint generation models to dynamically build multi-dimensional constraints including tone, expression style, and thematic focus to guide output. Experiments demonstrate that PG-CE significantly improves generation quality across multiple scenarios while maintaining text controllability, thematic relevance, and response practicality. The research developed a dataset containing 90,000 constraint-text pairs (with an 8:2 ratio between daily and other topics), effectively reflecting real-world application requirements.
zh
[NLP-46] Crosslingual Optimized Metric for Translation Assessment of Indian Languages
【速读】: 该论文旨在解决跨语言翻译自动评估中的挑战,特别是针对资源匮乏语言(如印度多语种)中评估指标性能不足的问题。现有基于字符串的指标(如BLEU)和部分神经网络指标在多语言场景下表现受限,尤其在缺乏高质量人工评分数据的语言对中效果不佳。解决方案的关键在于构建了一个涵盖13种印度语言、21个翻译方向的大规模人工评分数据集,并基于此训练出一种名为Cross-lingual Optimized Metric for Translation Assessment of Indian Languages (COMTAIL) 的神经翻译评估模型。该模型通过跨语言优化机制显著提升了对含印度语言翻译对的质量判别能力,在多个指标上超越了现有最先进方法。
链接: https://arxiv.org/abs/2509.17667
作者: Arafat Ahsan,Vandan Mujadia,Pruthwik Mishra,Yash Bhaskar,Dipti Misra Sharma
机构: IIIT(印度信息技术研究所); SVNIT(萨巴尔马蒂理工学院)
类目: Computation and Language (cs.CL)
备注: Under review
Abstract:Automatic evaluation of translation remains a challenging task owing to the orthographic, morphological, syntactic and semantic richness and divergence observed across languages. String-based metrics such as BLEU have previously been extensively used for automatic evaluation tasks, but their limitations are now increasingly recognized. Although learned neural metrics have helped mitigate some of the limitations of string-based approaches, they remain constrained by a paucity of gold evaluation data in most languages beyond the usual high-resource pairs. In this present work we address some of these gaps. We create a large human evaluation ratings dataset for 13 Indian languages covering 21 translation directions and then train a neural translation evaluation metric named Cross-lingual Optimized Metric for Translation Assessment of Indian Languages (COMTAIL) on this dataset. The best performing metric variants show significant performance gains over previous state-of-the-art when adjudging translation pairs with at least one Indian language. Furthermore, we conduct a series of ablation studies to highlight the sensitivities of such a metric to changes in domain, translation quality, and language groupings. We release both the COMTAIL dataset and the accompanying metric models.
zh
[NLP-47] AuditoryBench: Can Language Models Understand Auditory Knowledge without Hearing?
【速读】: 该论文旨在解决语言模型在缺乏直接声学输入的情况下,难以推理和理解听觉属性(如音高、响度或声源关联)的问题,从而限制了其在多模态交互中的表现。解决方案的关键在于提出了一种名为AIR-CoT的新型听觉想象推理方法,该方法通过特殊标记的跨度检测机制生成并整合听觉信息,并结合知识注入策略,在纯文本环境下实现对音频概念的细粒度推理与建模,显著提升了大语言模型(LLM)及多模态大语言模型(Multimodal LLM)在听觉常识任务上的性能。
链接: https://arxiv.org/abs/2509.17641
作者: Hyunjong Ok,Suho Yoo,Hyeonjun Kim,Jaeho Lee
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: Preprint
Abstract:Even without directly hearing sounds, humans can effortlessly reason about auditory properties, such as pitch, loudness, or sound-source associations, drawing on auditory commonsense. In contrast, language models often lack this capability, limiting their effectiveness in multimodal interactions. As an initial step to address this gap, we present AuditoryBench++, a comprehensive benchmark for evaluating auditory knowledge and reasoning in text-only settings. The benchmark encompasses tasks that range from basic auditory comparisons to contextually grounded reasoning, enabling fine-grained analysis of how models process and integrate auditory concepts. In addition, we introduce AIR-CoT, a novel auditory imagination reasoning method that generates and integrates auditory information during inference through span detection with special tokens and knowledge injection. Extensive experiments with recent LLMs and Multimodal LLMs demonstrate that AIR-CoT generally outperforms both the off-the-shelf models and those augmented with auditory knowledge. The project page is available at this https URL.
zh
[NLP-48] MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂多阶段任务中推理与协作能力不足的问题,现有基准测试通常局限于单一任务或狭窄领域,未能充分评估模型在无显式外部指导下的多阶段协同与优化能力。解决方案的关键在于提出一个名为MSCoRe的新基准,包含126,696个跨汽车、制药、电子和能源领域的特定领域问答实例,采用结构化的三阶段数据生成流程(动态采样、迭代问答生成与多级质量评估),确保数据质量,并按阶段覆盖度和复杂性将任务划分为三个难度层级,从而系统性地评估LLM代理的多阶段推理性能。
链接: https://arxiv.org/abs/2509.17628
作者: Yuzhen Lei,Hongbin Xie,Jiaxing Zhao,Shuangxue Liu,Xuan Song
机构: Jilin University (吉林大学); Southern University of Science and Technology (南方科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose \textbfMSCoRe, a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at this https URL.
zh
[NLP-49] AutiHero: Leverag ing Generative AI in Social Narratives to Engage Parents in Story-Driven Behavioral Guidance for Autistic Children
【速读】: 该论文旨在解决自闭症儿童在社交情境中理解与适应困难的问题,传统社会叙事(social narratives)虽有效但需高度个性化定制,导致家长在家庭实践中耗费大量时间与精力。解决方案的关键在于提出AutiHero系统——一个基于生成式AI(Generative AI)的社会叙事生成工具,能够根据儿童的兴趣、目标行为及日常情境自动创建文本和视觉插图,从而显著降低家长的内容创作负担,并提升亲子共读频率与行为引导效果。
链接: https://arxiv.org/abs/2509.17608
作者: Jungeun Lee,Kyungah Lee,Inseok Hwang,SoHyun Park,Young-Ho Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages except reference
Abstract:Social narratives are known to help autistic children understand and navigate social situations through stories. To ensure effectiveness, however, the materials need to be customized to reflect each child’s unique behavioral context, requiring considerable time and effort for parents to practice at home. We present AutiHero, a generative AI-based social narrative system for behavioral guidance, which supports parents to create personalized stories for their autistic children and read them together. AutiHero generates text and visual illustrations that reflect their children’s interests, target behaviors, and everyday contexts. In a two-week deployment study with 16 autistic child-parent dyads, parents created 218 stories and read an average of 4.25 stories per day, demonstrating a high level of engagement. AutiHero also provided an effective, low-demanding means to guide children’s social behaviors, encouraging positive change. We discuss the implications of generative AI-infused tools to empower parents in guiding their children’s behaviors, fostering their social learning.
zh
[NLP-50] Asking a Language Model for Diverse Responses
【速读】: 该论文旨在解决大语言模型在生成响应时面临的问题:尽管模型能够基于显式推理链产生多个合理响应,但如何在有限计算资源(即匹配预算)下提升响应的多样性(包括词汇多样性和计算流程多样性),同时保持生成质量不下降。其解决方案的关键在于对比三种候选采样策略——传统独立采样(ancestral sampling)、枚举采样(enumeration)和迭代采样(iterative sampling),并发现枚举和迭代策略能够在相同质量水平下显著提高响应多样性,表明简单非独立采样机制(non-independent sampling strategies)具有提升多样性潜力且无需牺牲生成质量。
链接: https://arxiv.org/abs/2509.17570
作者: Sergey Troshin,Irina Saparina,Antske Fokkens,Vlad Niculae
机构: 未知
类目: Computation and Language (cs.CL)
备注: UncertaiNLP workshop, 2025
Abstract:Large language models increasingly rely on explicit reasoning chains and can produce multiple plausible responses for a given context. We study the candidate sampler that produces the set of plausible responses contrasting the ancestral (parallel) sampling against two alternatives: enumeration, which asks the model to produce n candidates in one pass, and iterative sampling, which proposes candidates sequentially while conditioning on the currently generated response set. Under matched budgets, we compare these samplers on quality, lexical and computation flow diversity, and efficiency. Our empirical results demonstrate that enumeration and iterative strategies result in higher diversity at comparable quality. Our findings highlight the potential of simple non-independent sampling strategies to improve response diversity without sacrificing generation quality.
zh
[NLP-51] Specification-Aware Machine Translation and Evaluation for Purpose Alignment
【速读】: 该论文试图解决的问题是:在机器翻译(Machine Translation, MT)研究中,专业翻译实践中至关重要的规范性要求(specifications)往往被忽视或仅以隐含方式处理,导致翻译质量与实际专业需求之间存在差距。解决方案的关键在于构建一个基于翻译研究理论的框架,明确规范性要求在专业翻译中的核心作用,并提出一套可操作的方法将这些规范嵌入到MT系统的设计与评估流程中。通过在投资者关系文本翻译实验中对比五种翻译类型(包括官方人工译文和基于提示的大语言模型输出),结果表明,在规范指导下由大语言模型生成的译文在人类评估中持续优于官方人工译文,凸显了“感知质量”与“预期质量”之间的鸿沟,从而验证了将规范整合进MT工作流并辅以人工监督,能够显著提升翻译质量,使其更贴近专业实践需求。
链接: https://arxiv.org/abs/2509.17559
作者: Yoko Kayano,Saku Sugawara
机构: The Graduate University for Advanced Studies (SOKENDAI); National Institute of Informatics
类目: Computation and Language (cs.CL)
备注:
Abstract:In professional settings, translation is guided by communicative goals and client needs, often formalized as specifications. While existing evaluation frameworks acknowledge the importance of such specifications, these specifications are often treated only implicitly in machine translation (MT) research. Drawing on translation studies, we provide a theoretical rationale for why specifications matter in professional translation, as well as a practical guide to implementing specification-aware MT and evaluation. Building on this foundation, we apply our framework to the translation of investor relations texts from 33 publicly listed companies. In our experiment, we compare five translation types, including official human translations and prompt-based outputs from large language models (LLMs), using expert error analysis, user preference rankings, and an automatic metric. The results show that LLM translations guided by specifications consistently outperformed official human translations in human evaluations, highlighting a gap between perceived and expected quality. These findings demonstrate that integrating specifications into MT workflows, with human oversight, can improve translation quality in ways aligned with professional practice.
zh
[NLP-52] Can LLM s Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning NIPS2025
【速读】: 该论文旨在解决如何在无需额外监督训练的情况下,将非文本模态的基础模型(Foundation Models, FMs)表示集成到基于文本的大型语言模型(Large Language Models, LLMs)中,从而实现对新领域和新模态的即时适应。其解决方案的关键在于提出了一种名为“上下文表示学习”(In-Context Representation Learning, ICRL)的新框架,该框架通过用FM的表示替代传统提示学习中的文本-标签对,在不进行微调的前提下使LLM能够利用多模态信息完成推理任务,从而实现了训练-free的跨模态知识迁移与适应。
链接: https://arxiv.org/abs/2509.17552
作者: Tianle Zhang,Wanlong Fang,Jonathan Woo,Paridhi Latawa,Deepak A.Subramanian,Alvin Chan
机构: Nanyang Technological University (南洋理工大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: NIPS 2025
Abstract:The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
zh
[NLP-53] Leverag ing Audio-Visual Data to Reduce the Multilingual Gap in Self-Supervised Speech Models
【速读】: 该论文旨在解决多语言自监督学习(SSL)模型在单语言任务上表现逊于纯单语言模型的问题,尤其是在语言数量较少的双语场景下,性能差距尤为显著。其解决方案的关键在于引入有限的视觉接地(visual grounding),通过结合音频与视觉信息来增强模型对语音特征的理解,从而缩小多语言模型与单语言模型之间的性能差距。实验表明,视觉接地不仅提升了单语言和双语模型的表现,尤其在零样本语音辨识任务中,将多语言性能差距从31.5%降至8.04%。
链接: https://arxiv.org/abs/2509.17523
作者: María Andrea Cruz Blandón,Zakaria Aldeneh,Jie Chi,Maureen de Seyssel
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: 5 pages, 2 figures
Abstract:Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.
zh
[NLP-54] CorefInst: Leverag ing LLM s for Multilingual Coreference Resolution ACL
【速读】: 该论文旨在解决跨语言共指消解(Coreference Resolution, CR)任务中传统方法受限于特定任务架构和编码器语言模型(Encoder-based Language Models)所导致的训练成本高、适应性差的问题。其解决方案的关键在于首次提出一种基于仅解码器大语言模型(Decoder-only Large Language Models, LLMs)的多语言共指消解方法,通过设计五种不同指令集(instruction sets)并结合受控推理(controlled inference)机制,使LLMs能够同时处理显式(overt)和零代词(zero mentions)共指关系。实验表明,经过适当指令微调的LLM(如Llama 3.1)在CorefUD v1.2数据集上平均性能优于当前最优的多语言共指模型Corpipe 24单阶段变体,提升达2个百分点(pp)。
链接: https://arxiv.org/abs/2509.17505
作者: Tuğba Pamay Arslan,Emircan Erol,Gülşen Eryiğit
机构: İTÜ NLP Research Group (İTÜ 自然语言处理研究组); Department of Artificial Intelligence and Data Engineering (人工智能与数据工程系); İstanbul Technical University (伊斯坦布尔技术大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in Transactions of the Association for Computational Linguistics (TACL) (2025 August). Submission: March, 2025. Revision: July, 2025. Acceptance: August, 2025
Abstract:Coreference Resolution (CR) is a crucial yet challenging task in natural language understanding, often constrained by task-specific architectures and encoder-based language models that demand extensive training and lack adaptability. This study introduces the first multilingual CR methodology which leverages decoder-only LLMs to handle both overt and zero mentions. The article explores how to model the CR task for LLMs via five different instruction sets using a controlled inference method. The approach is evaluated across three LLMs; Llama 3.1, Gemma 2, and Mistral 0.3. The results indicate that LLMs, when instruction-tuned with a suitable instruction set, can surpass state-of-the-art task-specific architectures. Specifically, our best model, a fully fine-tuned Llama 3.1 for multilingual CR, outperforms the leading multilingual CR model (i.e., Corpipe 24 single stage variant) by 2 pp on average across all languages in the CorefUD v1.2 dataset collection.
zh
[NLP-55] Enhancing Cross-Lingual Transfer through Reversible Transliteration: A Huffman-Based Approach for Low-Resource Languages
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理低资源语言时性能不足的问题,尤其是那些使用非拉丁字母文字的语言。当前LLMs虽具备跨语言迁移能力,但在低资源语言上的表现受限,主要由于词汇表扩展困难、存储开销大以及训练效率低。论文提出了一种创新的 transliteration(音译)框架,其关键在于将字符级音译与霍夫曼编码(Huffman coding)相结合:一方面通过音译将非拉丁文本映射为拉丁字符以避免词汇表膨胀,另一方面利用霍夫曼编码实现文本压缩,从而显著减少存储空间和token数量(最高达50%文件大小缩减和50–80% token数减少),同时保证100%无损还原原始语言,并提升训练与推理效率,且具有良好的可扩展性。
链接: https://arxiv.org/abs/2509.17493
作者: Wenhao Zhuang,Yuan Sun,Xiaobing Zhao
机构: Minzu University of China (中央民族大学); National Language Resource Monitoring & Research Center Minority Languages Branch (国家语言资源监测与研究少数民族语言分中心); Institute of National Security, Minzu University of China (中央民族大学国家安全研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:As large language models (LLMs) are trained on increasingly diverse and extensive multilingual corpora, they demonstrate cross-lingual transfer capabilities. However, these capabilities often fail to effectively extend to low-resource languages, particularly those utilizing non-Latin scripts. While transliterating low-resource languages into Latin script presents a natural solution, there currently lacks a comprehensive framework for integrating transliteration into LLMs training and deployment. Taking a pragmatic approach, this paper innovatively combines character transliteration with Huffman coding to design a complete transliteration framework. Our proposed framework offers the following advantages: 1) Compression: Reduces storage requirements for low-resource language content, achieving up to 50% reduction in file size and 50-80% reduction in token count. 2) Accuracy: Guarantees 100% lossless conversion from transliterated text back to the source language. 3) Efficiency: Eliminates the need for vocabulary expansion for low-resource languages, improving training and inference efficiency. 4) Scalability: The framework can be extended to other low-resource languages. We validate the effectiveness of our framework across multiple downstream tasks, including text classification, machine reading comprehension, and machine translation. Experimental results demonstrate that our method significantly enhances the model’s capability to process low-resource languages while maintaining performance on high-resource languages. Our data and code are publicly available at this https URL.
zh
[NLP-56] MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM
【速读】: 该论文旨在解决当前多智能体代码生成方案在小型开源语言模型上表现不佳的问题,即现有方法要么依赖高成本的大规模(如30B参数)模型,要么在模型缩小时性能严重退化。其解决方案的关键在于提出MapCoder-Lite,通过仅使用3%额外参数的秩为32的角色专用LoRA适配器(role-specific LoRA adapters),将一个7B参数的单模型升级为四个角色专业化智能体(检索器、规划器、编码器和调试器)。该方案结合三项轻量级技术:轨迹蒸馏(trajectory distillation)缓解检索与调试中的格式脆弱性,监督引导修正(supervisor-guided correction)增强规划与编码智能体的准确性,以及智能体级LoRA微调实现内存高效的专精化训练。实验证明,该方法在xCodeEval等基准上将准确率提升至28.3%,消除所有格式失败,并在GPU内存和生成时间上减少4倍,同时接近32B基线性能。
链接: https://arxiv.org/abs/2509.17489
作者: Woongkyu Lee,Junhee Cho,Jungwook Choi
机构: Hanyang University (汉阳大学); Samsung SDS (三星SDS)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have advanced code generation from single-function tasks to competitive-programming problems, but existing multi-agent solutions either rely on costly large-scale ( 30B) models or collapse when downsized to small open-source models. We present MapCoder-Lite, which upgrades a single 7B model into four role-specialised agents-retriever, planner, coder, and debugger-using only rank-32, role-specific LoRA adapters ( 3% extra parameters). Three lightweight techniques make this possible: (i) trajectory distillation from strong LLMs fixes format fragility in retrieval and debugging, (ii) supervisor-guided correction strengthens planning and coding agents, and (iii) agent-wise LoRA fine-tuning delivers memory-efficient specialisation. Comprehensive evaluation on xCodeEval, APPS, and CodeContests shows that MapCoder-Lite more than doubles xCodeEval accuracy (from 13.2% to 28.3% ), eliminates all format failures, and closes to within six points of a 32B baseline while cutting GPU memory and token-generation time by 4\times . These results demonstrate that careful agent-wise fine-tuning unleashes high-quality multi-agent coding on a small language model.
zh
[NLP-57] AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation EMNLP2025
【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中因引入无关检索内容而导致的生成准确性下降问题,以及现有上下文压缩方法在自适应调整压缩率、低延迟和多文档信息整合方面的局限性。解决方案的关键在于提出一种名为AttnComp的自适应、高效且上下文感知的压缩框架:它利用大语言模型(Large Language Models, LLMs)的注意力机制识别相关性信息,并采用Top-P压缩算法保留累计注意力权重超过预设阈值的最小文档集合,从而在保障生成质量的同时实现显著压缩率与更低延迟;此外,该方法还通过评估检索内容的整体相关性来估计响应置信度,为用户提供可靠性参考。
链接: https://arxiv.org/abs/2509.17486
作者: Lvzhou Luo,Yixuan Cao,Ping Luo
机构: Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025 (Findings)
Abstract:Retrieval-augmented generation improves the factual accuracy of Large Language Models (LLMs) by incorporating external context, but often suffers from irrelevant retrieved content that hinders effectiveness. Context compression addresses this issue by filtering out irrelevant information from context before LLM generation. However, existing methods struggle to adaptively adjust compression rates for different context, maintain low latency and integrate information across multiple documents. To overcome these limitations, We introduce AttnComp, an adaptive, efficient and context-aware compression framework. By leveraging the attention mechanism of LLMs to identify relevant information, AttnComp employs a Top-P compression algorithm to retain the minimal set of documents whose cumulative attention weights exceeds a predefined threshold. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content, enabling users to gauge response reliability. Experiments demonstrate that AttnComp outperforms existing compression methods and uncompressed baselines, achieving higher accuracy with substantial compression rates and lower latency.
zh
[NLP-58] Diagnosing Model Editing via Knowledge Spectrum
【速读】: 该论文旨在解决预训练语言模型中事实性知识编辑(model editing)时引入的不可预测副作用问题,即现有编辑方法常导致模型性能在其他任务上出现意外退化。其解决方案的关键在于提出“知识谱系”(Knowledge Spectrum)框架,系统地将知识按现实世界流行度、模型预编辑熟悉度及提问语言结构进行分类,并据此诊断知识编辑难度;进而设计“知识诊断框架”(Knowledge-Diagnostic Framework),根据诊断结果自适应调整编辑强度,从而显著提升复杂知识项的编辑成功率并优化计算资源利用。
链接: https://arxiv.org/abs/2509.17482
作者: Tsung-Hsuan Pan,Chung-Chi Chen,Hen-Hsen Huang,Hsin-Hsi Chen
机构: National Taiwan University (国立台湾大学); AIST (产业技术综合研究所); Academia Sinica (中央研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Model editing, the process of efficiently modifying factual knowledge in pre-trained language models, is critical for maintaining their accuracy and relevance. However, existing editing methods often introduce unintended side effects, degrading model performance in unpredictable ways. While much research has focused on improving editing algorithms, the role of the target knowledge’s intrinsic properties remains a significant, underexplored factor. This paper addresses this gap by first proposing the Knowledge Spectrum,'' a systematic framework for categorizing knowledge based on its real-world popularity, the model's pre-edit familiarity, and the linguistic structure of the eliciting question. Our empirical analysis reveals that these characteristics are strong predictors of editing success and stability. Informed by these findings, we introduce the
Knowledge-Diagnostic Framework,‘’ an adaptive strategy that tailors editing intensity to the diagnosed difficulty of a knowledge item. We demonstrate that this framework significantly improves success rates for challenging edits while optimizing computational resources. Our work provides a more comprehensive understanding of the factors governing model editing.
zh
[NLP-59] ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在图表理解任务中普遍存在幻觉(hallucination)的问题,这一问题严重制约了模型在需要高精度事实判断场景下的应用可靠性。解决方案的关键在于构建了一个名为ChartHal的基准测试集,其核心创新在于提出了一种细粒度的幻觉类型分类体系,并基于人工验证的数据集(共1,062个样本)对主流LVLMs进行了系统评估。实验表明,包括GPT-5和o4-mini在内的先进模型在该基准上准确率仅为34.46%和22.79%,揭示出当前模型在处理图表中缺失或矛盾信息时极易产生错误生成,从而凸显了开发更鲁棒的幻觉抑制机制的紧迫性。
链接: https://arxiv.org/abs/2509.17481
作者: Xingqi Wang,Yiming Cui,Xin Yao,Shijin Wang,Guoping Hu,Xiaoyu Qin
机构: Tsinghua University (清华大学); iFLYTEK (科大讯飞)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at this https URL .
zh
[NLP-60] LingoQ: Bridging the Gap between ESL Learning and Work through AI-Generated Work-Related Quizzes
【速读】: 该论文试图解决非英语母语者在工作中难以持续进行英语学习(English as a Second Language, ESL)的问题,尤其是在学习材料与实际工作场景脱节的情况下。尽管这些学习者具有学习动机,但传统学习方式缺乏情境相关性,难以维持长期投入。解决方案的关键在于设计并实现一个名为 LingoQ 的人工智能中介系统,该系统利用用户在工作中向大语言模型(Large Language Models, LLM)提出的查询内容,自动生成个性化的英语练习题,并通过智能手机应用供用户随时复习和练习。这一机制将日常工作任务与语言学习无缝结合,增强了学习的相关性和实用性,从而显著提升学习者的自我效能感和语言能力,尤其对初学者效果明显,也可能惠及中级学习者。
链接: https://arxiv.org/abs/2509.17477
作者: Yeonsun Yang,Sang Won Lee,Jean Y. Song,Sangdoo Yun,Young-Ho Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages except reference
Abstract:Non-native English speakers performing English-related tasks at work struggle to sustain ESL learning, despite their motivation. Often, study materials are disconnected from their work context. Although workers rely on LLM assistants to address their immediate needs, these interactions may not directly contribute to their English skills. We present LingoQ, an AI-mediated system that allows workers to practice English using quizzes generated from their LLM queries during work. LingoQ leverages these queries using AI to generate personalized quizzes that workers can review and practice on their smartphones. We conducted a three-week deployment study with 28 ESL workers to evaluate LingoQ. Participants valued the relevance of quizzes that reflect their own context, constantly engaging with the app during the study. This active engagement improved self-efficacy and led to learning gains for beginners and, potentially, for intermediate learners. We discuss opportunities of leveraging users’ reliance on LLMs to situate their learning in the user context for improved learning.
zh
[NLP-61] Autiverse: Eliciting Autistic Adolescents Daily Narratives through AI-guided Multimodal Journaling
链接: https://arxiv.org/abs/2509.17466
作者: Migyeong Yang,Kyungah Lee,Jinyoung Han,SoHyun Park,Young-Ho Kim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages excluding reference
[NLP-62] PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents EMNLP2025
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的主动对话代理在策略规划中存在的三大问题:策略覆盖范围有限、规划过程中的偏好偏差以及对昂贵额外训练的依赖。其解决方案的关键在于提出PRINCIPLES——一种通过离线自对弈模拟生成的合成策略记忆库,该记忆库作为可复用的知识,在推理阶段指导策略规划,从而无需额外训练或数据标注即可提升对话性能,并在情感支持与说服两个领域均表现出稳定且优于基线的效果。
链接: https://arxiv.org/abs/2509.17459
作者: Namyoung Kim,Kai Tzu-iunn Ong,Yeonjun Hwang,Minseok Kang,Iiseo Jihn,Gayoung Kim,Minju Kim,Jinyoung Yeo
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Findings
Abstract:Dialogue agents based on large language models (LLMs) have shown promising performance in proactive dialogue, which requires effective strategy planning. However, existing approaches to strategy planning for proactive dialogue face several limitations: limited strategy coverage, preference bias in planning, and reliance on costly additional training. To address these, we propose PRINCIPLES: a synthetic strategy memory for proactive dialogue agents. PRINCIPLES is derived through offline self-play simulations and serves as reusable knowledge that guides strategy planning during inference, eliminating the need for additional training and data annotation. We evaluate PRINCIPLES in both emotional support and persuasion domains, demonstrating consistent improvements over strong baselines. Furthermore, PRINCIPLES maintains its robustness across extended and more diverse evaluation settings. See our project page at this https URL.
zh
[NLP-63] Codifying Natural Langauge Tasks
【速读】: 该论文旨在解决如何将自然语言处理任务(如法律判决和医学问答)转化为可执行程序以提升推理准确性的难题。其解决方案的关键在于提出ICRAG框架,通过迭代精炼机制利用领域资源和GitHub等外部知识源,将自然语言逐步转化为结构化代码,从而增强模型的显式推理能力。实验表明,该方法在13个基准测试中实现了最高达161.1%的相对性能提升。
链接: https://arxiv.org/abs/2509.17455
作者: Haoyang Chen,Kumiko Tanaka-Ishii
机构: Waseda University (早稻田大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Submitted to Journal of Automated Software Engineering
Abstract:We explore the applicability of text-to-code to solve real-world problems that are typically solved in natural language, such as legal judgment and medical QA. Unlike previous works, our approach leverages the explicit reasoning provided by program generation. We present ICRAG, a framework that transforms natural language into executable programs through iterative refinement using external knowledge from domain resources and GitHub. Across 13 benchmarks, ICRAG achieves up to 161.1% relative improvement. We provide a detailed analysis of the generated code and the impact of external knowledge, and we discuss the limitations of applying text-to-code approaches to real-world natural language tasks.
zh
[NLP-64] SLAyiNG: Towards Queer Language Processing NEURIPS2025
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在处理**酷儿俚语(queer slang)**时存在的误判问题,即酷儿俚语常被错误标记为仇恨言论或引发负面响应,而此前缺乏高质量的标注基准数据来系统评估此类语言的识别与理解能力。解决方案的关键在于构建首个面向酷儿俚语的标注数据集SLAyiNG,该数据集通过从字幕、社交媒体和播客中收集真实语境下的俚语实例,并结合专家与社群驱动的注释流程,确保语义准确性和文化敏感性;初步实验表明,尽管当前最先进的推理模型(如OpenAI的o3-mini)可作为预筛选工具(平均Krippendorff’s alpha达0.746),但复杂且高度情境化的酷儿语言仍需专业人工标注以保障公平与准确性。
链接: https://arxiv.org/abs/2509.17449
作者: Leonor Veloso,Lea Hirlimann,Philipp Wicke,Hinrich Schütze
机构: Center for Information and Language Processing, LMU Munich (慕尼黑大学信息与语言处理中心); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心)
类目: Computation and Language (cs.CL)
备注: To be presented at Queer in AI @ NeurIPS 2025 (non-archival)
Abstract:Knowledge of slang is a desirable feature of LLMs in the context of user interaction, as slang often reflects an individual’s social identity. Several works on informal language processing have defined and curated benchmarks for tasks such as detection and identification of slang. In this paper, we focus on queer slang. Queer slang can be mistakenly flagged as hate speech or can evoke negative responses from LLMs during user interaction. Research efforts so far have not focused explicitly on queer slang. In particular, detection and processing of queer slang have not been thoroughly evaluated due to the lack of a high-quality annotated benchmark. To address this gap, we curate SLAyiNG, the first dataset containing annotated queer slang derived from subtitles, social media posts, and podcasts, reflecting real-world usage. We describe our data curation process, including the collection of slang terms and definitions, scraping sources for examples that reflect usage of these terms, and our ongoing annotation process. As preliminary results, we calculate inter-annotator agreement for human annotators and OpenAI’s model o3-mini, evaluating performance on the task of sense disambiguation. Reaching an average Krippendorff’s alpha of 0.746, we argue that state-of-the-art reasoning models can serve as tools for pre-filtering, but the complex and often sensitive nature of queer language data requires expert and community-driven annotation efforts.
zh
[NLP-65] Semantic Reformulation Entropy for Robust Hallucination Detection in QA Tasks ICASSP2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在问答任务中因认知不确定性(epistemic uncertainty)导致的幻觉问题,即生成看似合理但事实错误的输出。现有基于熵的语义级不确定性估计方法受限于采样噪声以及变长回答带来的聚类不稳定问题。其解决方案的关键在于提出语义重构熵(Semantic Reformulation Entropy, SRE),通过两个核心机制实现改进:一是利用输入侧的语义重构生成忠实的改写句,扩展估计空间并减少解码器表面倾向带来的偏差;二是采用渐进式的、基于能量的混合聚类方法,稳定语义分组过程。实验表明,SRE在SQuAD和TriviaQA数据集上显著优于强基线模型,验证了输入多样性与多信号聚类结合对提升语义级不确定性估计鲁棒性和泛化能力的有效性。
链接: https://arxiv.org/abs/2509.17445
作者: Chaodong Tong,Qi Zhang,Lei Jiang,Yanbing Liu,Nannan Sun,Wei Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5pages, 5 figures, submit to ICASSP 2026
Abstract:Reliable question answering with large language models (LLMs) is challenged by hallucinations, fluent but factually incorrect outputs arising from epistemic uncertainty. Existing entropy-based semantic-level uncertainty estimation methods are limited by sampling noise and unstable clustering of variable-length answers. We propose Semantic Reformulation Entropy (SRE), which improves uncertainty estimation in two ways. First, input-side semantic reformulations produce faithful paraphrases, expand the estimation space, and reduce biases from superficial decoder tendencies. Second, progressive, energy-based hybrid clustering stabilizes semantic grouping. Experiments on SQuAD and TriviaQA show that SRE outperforms strong baselines, providing more robust and generalizable hallucination detection. These results demonstrate that combining input diversification with multi-signal clustering substantially enhances semantic-level uncertainty estimation.
zh
[NLP-66] Filling in the Clinical Gaps in Benchmark: Case for HealthBench for the Japanese medical system
【速读】: 该论文旨在解决当前日本医疗大语言模型(Medical LLMs)缺乏本土化评估基准的问题,尤其是在现有资源多依赖翻译的多项选择题、难以准确反映日本临床实践与文化背景的情况下。其关键解决方案在于:首先,通过机器翻译将HealthBench基准引入日本语境并建立性能基线,评估多语言模型(GPT-4.1)和日本本地开源模型(LLM-jp-3.1)的表现;其次,采用“大语言模型作为裁判”(LLM-as-a-Judge)的方法对基准中的场景和评分标准进行系统分类,识别出因内容与日本临床指南、医疗体系或文化规范不一致而产生的“情境缺口”(contextual gaps)。研究发现,直接翻译基准会导致性能下降,且本地模型因缺乏临床完整性而表现不佳,从而强调必须构建一个面向日本的专用评估框架——J-HealthBench,以实现对医疗大语言模型的安全可靠评估。
链接: https://arxiv.org/abs/2509.17444
作者: Shohei Hisada,Endo Sunao,Himi Yamato,Shoko Wakamiya,Eiji Aramaki
机构: 未知
类目: Computation and Language (cs.CL)
备注: draft v0.1
Abstract:This study investigates the applicability of HealthBench, a large-scale, rubric-based medical benchmark, to the Japanese context. While robust evaluation frameworks are crucial for the safe development of medical LLMs, resources in Japanese remain limited, often relying on translated multiple-choice questions. Our research addresses this gap by first establishing a performance baseline, applying a machine-translated version of HealthBench’s 5,000 scenarios to evaluate both a high-performing multilingual model (GPT-4.1) and a Japanese-native open-source model (LLM-jp-3.1). Second, we employ an LLM-as-a-Judge approach to systematically classify the benchmark’s scenarios and rubric criteria, identifying “contextual gaps” where content is misaligned with Japan’s clinical guidelines, healthcare systems, or cultural norms. Our findings reveal a modest performance drop in GPT-4.1 due to rubric mismatches and a significant failure in the Japanese-native model, which lacked the required clinical completeness. Furthermore, our classification indicates that while the majority of scenarios are applicable, a substantial portion of the rubric criteria requires localization. This work underscores the limitations of direct benchmark translation and highlights the urgent need for a context-aware, localized adaptation, a J-HealthBench, to ensure the reliable and safe evaluation of medical LLMs in Japan.
zh
[NLP-67] GeoPQA: Bridging the Visual Perception Gap in MLLM s for Geometric Reasoning EMNLP2025
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉密集型任务(如几何推理)中因感知瓶颈导致的幻觉问题,从而限制了强化学习(Reinforcement Learning, RL)训练效果。其关键解决方案是提出一种两阶段强化学习训练框架:首先通过专门训练提升模型对几何结构的视觉感知能力,再在此基础上增强推理能力,从而有效突破感知瓶颈,显著提升几何推理与问题求解性能。
链接: https://arxiv.org/abs/2509.17437
作者: Guizhen Chen,Weiwen Xu,Hao Zhang,Hou Pong Chan,Deli Zhao,Anh Tuan Luu,Yu Rong
机构: Nanyang Technological University (南洋理工大学); DAMO Academy, Alibaba Group (阿里达摩院); Hupan Lab (湖畔实验室)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP2025 Findings
Abstract:Recent advancements in reinforcement learning (RL) have enhanced the reasoning abilities of large language models (LLMs), yet the impact on multimodal LLMs (MLLMs) is limited. Particularly in vision-intensive tasks like geometric reasoning, MLLMs hallucinate frequently, leading to inaccurate reasoning. We attribute this to the perceptual bottleneck in MLLMs, which caps the benefits of reasoning training. To quantify this, we design a Geo-Perception Question-Answering (GeoPQA) benchmark, targeting basic geometric concepts and spatial relationships. Experiments on GeoPQA reveal significant shortcomings of MLLMs in visual perception, which constrain RL reward signals for effective training. To address this bottleneck, we propose a two-stage RL training framework by first enhancing the visual perception of geometric structures, then fostering reasoning capabilities. Applied to Qwen2.5-VL-3B-Instruct, our two-stage training improves geometric reasoning by 9.7% and geometric problem solving by 9.1%, compared to the direct reasoning training approach. Our method also generalizes to other vision-intensive domains like figure understanding, highlighting the importance of perceptual grounding in effective MLLM reasoning.
zh
[NLP-68] MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses EMNLP2025
【速读】: 该论文旨在解决当前医疗事实核查(medical fact-checking)领域中对大语言模型(Large Language Models, LLMs)生成内容的验证研究严重不足的问题。现有数据集主要聚焦于人类撰写的医疗文本,而忽视了由LLM生成的医学信息的真实性验证需求。为填补这一空白,作者提出了MedFact,这是首个基于证据的中文医疗事实核查数据集,专门用于评估LLM生成的医学陈述(claims)。其关键创新在于构建了一个包含1,321个问题和7,409条声明的数据集,模拟真实医疗场景的复杂性,并在上下文学习(in-context learning, ICL)与微调(fine-tuning)两种设置下开展系统实验,揭示当前LLMs在该任务上的能力边界与典型错误模式,从而为后续研究指明方向。
链接: https://arxiv.org/abs/2509.17436
作者: Tong Chen,Zimu Wang,Yiyi Miao,Haoran Luo,Yuanfei Sun,Wei Wang,Zhengyong Jiang,Procheta Sen,Jionglong Su
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); University of Liverpool (利物浦大学)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025. Camera-ready version
Abstract:Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at this https URL.
zh
[NLP-69] QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models
【速读】: 该论文旨在解决量化感知的参数高效微调(Quantization-Aware Parameter-Efficient Fine-Tuning, PEFT)中因量化误差导致模型精度下降的问题,尤其针对现有方法在低比特量化下性能受限、计算开销高以及表示能力不足的挑战。解决方案的关键在于提出QWHA方法,其核心创新包括:采用沃尔什-哈达玛变换(Walsh-Hadamard Transform, WHT)作为傅里叶相关变换(Fourier-related Transform, FT)的核函数以提升表示能力并降低计算复杂度,同时引入一种结合自适应参数选择与值精炼的新型适配器初始化策略,从而有效缓解量化误差并加速微调过程。
链接: https://arxiv.org/abs/2509.17428
作者: Hyesung Jeon,Seojune Lee,Beomseok Kang,Yulhwa Kim,Jae-Joon Kim
机构: Seoul National University (首尔国立大学); Sungkyunkwan University (成均馆大学)
类目: Computation and Language (cs.CL)
备注: 25 pages, 9 figures, 14 tables
Abstract:The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at this https URL.
zh
[NLP-70] RealBench: A Chinese Multi-image Understanding Benchmark Close to Real-world Scenarios EMNLP2025
【速读】: 该论文旨在解决中文多图像理解任务缺乏高质量、真实场景数据集的问题。当前已有的多模态多图像评估数据集主要基于英文,无法充分支持中文语境下的模型训练与评测。为此,作者提出了RealBench——首个面向中文场景的多模态多图像数据集,其关键在于引入真实用户生成内容(User-Generated Content, UGC),并覆盖多样化的场景、图像分辨率和结构,从而提升数据的真实性和挑战性。该数据集包含9393个样本和69910张图像,为中文多图像理解研究提供了重要基准,并通过21种不同规模的多模态大语言模型(Multimodal Large Language Models, MLLMs)的系统评估揭示了现有模型在处理中文多图像任务时仍存在显著性能瓶颈,尤其开放源代码视觉/视频模型与闭源模型之间平均差距达71.8%。
链接: https://arxiv.org/abs/2509.17421
作者: Fei Zhao,Chengqiang Lu,Yufan Shen,Qimeng Wang,Yicheng Qian,Haoxin Zhang,Yan Gao,Yi Wu,Yao Hu,Zhen Wu,Shangyu Xing,Xinyu Dai
机构: Nanjing University (南京大学); Xiaohongshu Inc. (小红书); Zhejiang University (浙江大学)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Findings of EMNLP 2025 camera-ready
Abstract:While various multimodal multi-image evaluation datasets have been emerged, but these datasets are primarily based on English, and there has yet to be a Chinese multi-image dataset. To fill this gap, we introduce RealBench, the first Chinese multimodal multi-image dataset, which contains 9393 samples and 69910 images. RealBench distinguishes itself by incorporating real user-generated content, ensuring high relevance to real-world applications. Additionally, the dataset covers a wide variety of scenes, image resolutions, and image structures, further increasing the difficulty of multi-image understanding. Ultimately, we conduct a comprehensive evaluation of RealBench using 21 multimodal LLMs of different sizes, including closed-source models that support multi-image inputs as well as open-source visual and video models. The experimental results indicate that even the most powerful closed-source models still face challenges when handling multi-image Chinese scenarios. Moreover, there remains a noticeable performance gap of around 71.8% on average between open-source visual/video models and closed-source models. These results show that RealBench provides an important research foundation for further exploring multi-image understanding capabilities in the Chinese context.
zh
[NLP-71] Vision Language Models Are Not (Yet) Spelling Correctors
【速读】: 该论文旨在解决视觉输入下的拼写纠错问题(Visual Spelling Correction),即在真实图像中检测并修正文本错误,这对视觉语言模型(Vision Language Models, VLMs)提出了独特挑战,因其不仅需识别图像中的文字内容,还需准确修正其中的拼写错误。为系统评估VLMs在此任务上的表现,作者构建了首个面向中英文的真实世界拼写纠错基准ReViCo,支持图像级与词元级细粒度评价。实验表明,当前主流开源(如Qwen、InternVL)与闭源模型(如GPT-4o、Claude)在纠错能力上显著落后于人类水平。解决方案的关键在于提出两种改进范式:一是联合OCR-纠错流水线(Joint OCR-Correction Pipeline),提升文本提取与纠错的协同性;二是引入背景信息增强方法(Background Information Enhanced Approach),利用图像上下文辅助纠正歧义或模糊的拼写错误,二者均带来一致性的性能提升,揭示了现有架构的核心局限并为多模态拼写纠错技术发展提供了可操作的改进方向。
链接: https://arxiv.org/abs/2509.17418
作者: Junhong Liang,Bojun Zhang
机构: MBZUAI; Institute of Automation, Chinese Academy of Sciences
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spelling correction from visual input poses unique challenges for vision language models (VLMs), as it requires not only detecting but also correcting textual errors directly within images. We present ReViCo (Real Visual Correction), the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English. ReViCo contains naturally occurring errors collected from real-world image data and supports fine-grained evaluation at both image and token levels. Through comprehensive experiments on representative cascaded (Qwen) and native (InternVL) open-source models, as well as closed-source systems (GPT-4o, Claude), we show that current VLMs fall significantly short of human performance, particularly in correction. To address these limitations, we explore two solution paradigms: a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which yield consistent performance gains. Our analysis highlights fundamental limitations of existing architectures and provides actionable insights for advancing multimodal spelling correction.
zh
[NLP-72] DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLM s for Cultural Text Adaptation in Indian Context EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨文化场景中缺乏文化对齐(cultural alignment)和文化能力的问题,尤其是由于文化知识不足导致的偏见生成。现有评估方法受限于缺乏针对区域及次区域文化的基准数据集和有效评价指标,且已有文化特定项(Culture-Specific Items, CSIs)数据集多聚焦于宏观区域层面,存在误报风险。其关键解决方案是构建一个面向印度文化的新型CSI数据集,涵盖17个文化维度和36个次区域,包含约8000个文化概念;并基于此数据集设计了以LLM为裁判(LLM as Judge)与多元社会人口背景人类评估相结合的量化评测框架,用于衡量LLM在文化文本适配任务中的表现,揭示了当前主流LLMs在子区域覆盖和表面层适配上的局限性。
链接: https://arxiv.org/abs/2509.17399
作者: Pramit Sahoo,Maharaj Brahma,Maunendra Sankar Desarkar
机构: Indian Institute of Technology Hyderabad (印度理工学院海得拉巴分校)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP 2025
Abstract:Large language models (LLMs) are widely used in various tasks and applications. However, despite their wide capabilities, they are shown to lack cultural alignment \citepryan-etal-2024-unintended, alkhamissi-etal-2024-investigating and produce biased generations \citenaous-etal-2024-beer due to a lack of cultural knowledge and competence. Evaluation of LLMs for cultural awareness and alignment is particularly challenging due to the lack of proper evaluation metrics and unavailability of culturally grounded datasets representing the vast complexity of cultures at the regional and sub-regional levels. Existing datasets for culture specific items (CSIs) focus primarily on concepts at the regional level and may contain false positives. To address this issue, we introduce a novel CSI dataset for Indian culture, belonging to 17 cultural facets. The dataset comprises \sim 8k cultural concepts from 36 sub-regions. To measure the cultural competence of LLMs on a cultural text adaptation task, we evaluate the adaptations using the CSIs created, LLM as Judge, and human evaluations from diverse socio-demographic region. Furthermore, we perform quantitative analysis demonstrating selective sub-regional coverage and surface-level adaptations across all considered LLMs. Our dataset is available here: \hrefthis https URLthis https URL, project webpage\footnote\hrefthis https URLthis https URL, and our codebase with model outputs can be found here: \hrefthis https URLthis https URL.
zh
[NLP-73] EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在长对话问答任务中因键值缓存(Key-Value Cache, KV cache)内存占用随对话长度线性增长而导致的资源瓶颈问题。现有KV缓存压缩方法存在两个关键局限:一是全上下文预填充后缓存条目被逐出导致峰值内存无界,二是基于查询的逐出策略使缓存局限于单一查询,从而削弱多轮对话中的语义连贯性。解决方案的关键在于提出EpiCache框架,其核心创新包括:通过分块预填充(block-wise prefill)控制缓存增长上限,并采用基于情景的KV缓存压缩(episodic KV compression),将对话历史聚类为语义一致的情景(episode),并针对每个情景独立执行缓存逐出策略;同时设计自适应层间预算分配机制,根据各层对缓存逐出的敏感度动态分配内存资源。实验表明,EpiCache在固定内存预算下显著提升准确率(最高达40%),实现4–6倍压缩比下的近全精度缓存性能,且延迟与内存消耗分别降低最多2.4倍和3.5倍。
链接: https://arxiv.org/abs/2509.17396
作者: Minsoo Kim,Arnav Kundu,Han-Byul Kim,Richa Dixit,Minsik Cho
机构: Apple(苹果); Hanyang University (汉阳大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer’s sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
zh
[NLP-74] FinDebate: Multi-Agent Collaborative Intelligence for Financial Analysis EMNLP2025
【速读】: 该论文旨在解决金融分析中因单一视角导致的偏见与不全面性问题,以及大语言模型(Large Language Model, LLM)在生成投资建议时可能出现的过度自信和可靠性不足问题。其解决方案的关键在于提出一个名为FinDebate的多智能体框架,通过引入领域特定的检索增强生成(Retrieval-Augmented Generation, RAG)机制,使五个专业化智能体(涵盖收益、市场、情绪、估值和风险维度)并行协作,协同推理并生成多维洞察;同时设计了一种安全辩论协议(safe debate protocol),促使智能体相互质疑与修正初始结论,在保持推荐一致性的同时提升分析的稳健性和可信度。
链接: https://arxiv.org/abs/2509.17395
作者: Tianshi Cai,Guanxu Li,Nijia Han,Ce Huang,Zimu Wang,Changyu Zeng,Yuqi Wang,Jingshi Zhou,Haiyang Zhang,Qi Chen,Yushan Pan,Shuihua Wang,Wei Wang
机构: Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: Accepted at FinNLP@EMNLP 2025. Camera-ready version
Abstract:We introduce FinDebate, a multi-agent framework for financial analysis, integrating collaborative debate with domain-specific Retrieval-Augmented Generation (RAG). Five specialized agents, covering earnings, market, sentiment, valuation, and risk, run in parallel to synthesize evidence into multi-dimensional insights. To mitigate overconfidence and improve reliability, we introduce a safe debate protocol that enables agents to challenge and refine initial conclusions while preserving coherent recommendations. Experimental results, based on both LLM-based and human evaluations, demonstrate the framework’s efficacy in producing high-quality analysis with calibrated confidence levels and actionable investment strategies across multiple time horizons.
zh
[NLP-75] Program Synthesis via Test-Time Transduction NEURIPS2025
【速读】: 该论文试图解决程序合成(Program Synthesis)在真实场景中因训练样本有限且测试输入包含多种边缘情况而导致的鲁棒性不足问题。现有方法通常依赖于自然语言描述或输入输出示例进行泛化,但在实际应用中难以应对复杂多变的测试环境。解决方案的关键在于提出一种归纳式程序合成(transductive program synthesis)框架,通过在合成过程中显式利用测试输入来迭代筛选候选程序假设;具体而言,使用大语言模型(LLM)预测选定测试输入的输出,并基于一致性原则剔除不匹配的假设,同时采用贪心最大最小算法(greedy maximin algorithm)优化测试输入的选择策略,从而以最少的LLM调用次数显著提升合成准确率与效率。
链接: https://arxiv.org/abs/2509.17393
作者: Kang-il Lee,Jahyun Koo,Seunghyun Yoon,Minbeom Kim,Hyukhun Koh,Dongryeol Lee,Kyomin Jung
机构: Seoul National University (首尔国立大学); Adobe Research; IPAI (智能感知与人工智能研究所)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: NeurIPS 2025
Abstract:We introduce transductive program synthesis, a new formulation of the program synthesis task that explicitly leverages test inputs during synthesis. While prior approaches to program synthesis–whether based on natural language descriptions or input-output examples–typically aim to generalize from training examples, they often struggle with robustness, especially in real-world settings where training examples are limited and test inputs involve various edge cases. To address this, we propose a novel framework that improves robustness by treating synthesis as an active learning over a finite hypothesis class defined by programs’ outputs. We use an LLM to predict outputs for selected test inputs and eliminate inconsistent hypotheses, where the inputs are chosen via a greedy maximin algorithm to minimize the number of LLM queries required. We evaluate our approach on two real-world datasets: Playgol, a string transformation benchmark, and MBPP+, a Python code generation benchmark. We demonstrate that our method significantly improves program synthesis in both accuracy and efficiency. We release our code at this https URL.
zh
[NLP-76] Robustness of Neurosymbolic Reason ers on First-Order Logic Problems
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在面对反事实任务变体时表现脆弱的问题,即模型往往依赖表面模式而非真正的逻辑推理能力,导致在任务微小但语义有意义的扰动下推理一致性下降。解决方案的关键在于引入神经符号(neurosymbolic, NS)方法,通过融合LLM与符号逻辑求解器来增强模型的逻辑一致性与鲁棒性;进一步提出NSCoT框架,将NS方法与思维链(Chain-of-Thought, CoT)提示相结合,以提升推理性能,尽管其效果仍落后于标准CoT方法。
链接: https://arxiv.org/abs/2509.17377
作者: Hannah Bansal,Kemal Kurniawan,Lea Frermann
机构: RMIT University (皇家墨尔本理工大学); The University of Melbourne (墨尔本大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Recent trends in NLP aim to improve reasoning capabilities in Large Language Models (LLMs), with key focus on generalization and robustness to variations in tasks. Counterfactual task variants introduce minimal but semantically meaningful changes to otherwise valid first-order logic (FOL) problem instances altering a single predicate or swapping roles of constants to probe whether a reasoning system can maintain logical consistency under perturbation. Previous studies showed that LLMs becomes brittle on counterfactual variations, suggesting that they often rely on spurious surface patterns to generate responses. In this work, we explore if a neurosymbolic (NS) approach that integrates an LLM and a symbolic logical solver could mitigate this problem. Experiments across LLMs of varying sizes show that NS methods are more robust but perform worse overall that purely neural methods. We then propose NSCoT that combines an NS method and Chain-of-Thought (CoT) prompting and demonstrate that while it improves performance, NSCoT still lags behind standard CoT. Our analysis opens research directions for future work.
zh
[NLP-77] Scale-free Characteristics of Multilingual Legal Texts and the Limitations of LLM s
【速读】: 该论文旨在解决不同语域文本在语言复杂性上的差异问题,特别是法律文本与通用自然语言及生成式AI文本之间的复杂性结构差异。其解决方案的关键在于采用无尺度指标(scale-free metrics)对文本复杂性进行量化分析,包括Heaps指数β(词汇增长)、Taylor指数α(词频波动缩放)、压缩率r(冗余度)和熵等参数,从而揭示法律文本具有更低的词汇增长速率(较低β)和更高的术语一致性(较高α),且不同法律子类(如法典、判例、契约)表现出显著差异,而GPT生成文本则更接近通用语言模式,表明当前生成式AI尚未完全复现法律文本特有的复杂结构。
链接: https://arxiv.org/abs/2509.17367
作者: Haoyang Chen,Kumiko Tanaka-Ishii
机构: 未知
类目: Computation and Language (cs.CL)
备注: to be published in Text, Speech, and Dialogue (TSD 2025)
Abstract:We present a comparative analysis of text complexity across domains using scale-free metrics. We quantify linguistic complexity via Heaps’ exponent \beta (vocabulary growth), Taylor’s exponent \alpha (word-frequency fluctuation scaling), compression rate r (redundancy), and entropy. Our corpora span three domains: legal documents (statutes, cases, deeds) as a specialized domain, general natural language texts (literature, Wikipedia), and AI-generated (GPT) text. We find that legal texts exhibit slower vocabulary growth (lower \beta ) and higher term consistency (higher \alpha ) than general texts. Within legal domain, statutory codes have the lowest \beta and highest \alpha , reflecting strict drafting conventions, while cases and deeds show higher \beta and lower \alpha . In contrast, GPT-generated text shows the statistics more aligning with general language patterns. These results demonstrate that legal texts exhibit domain-specific structures and complexities, which current generative models do not fully replicate.
zh
[NLP-78] Better Late Than Never: Evaluation of Latency Metrics for Simultaneous Speech-to-Text Translation
【速读】: 该论文旨在解决同步语音到文本翻译(SimulST)系统中延迟(latency)评估不准确的问题,尤其在短片段(short-form)设置下,现有延迟指标因分割方式导致结构性偏差,从而引发不公平或误导性的比较。其关键解决方案是提出一种改进的延迟度量方法 YAAL(Yet Another Average Lagging),专门优化短片段场景下的评估准确性,并进一步扩展为适用于未分割音频的 LongYAAL;同时引入 SoftSegmenter 工具,基于词级对齐实现新型重分割,提升长片段评估中的对齐质量,从而实现更可靠的 SimulST 系统性能评估。
链接: https://arxiv.org/abs/2509.17349
作者: Peter Polák,Sara Papi,Luisa Bentivogli,Ondřej Bojar
机构: Charles University (查尔斯大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Simultaneous speech-to-text translation (SimulST) systems have to balance translation quality with latency–the delay between speech input and the translated output. While quality evaluation is well established, accurate latency measurement remains a challenge. Existing metrics often produce inconsistent or misleading results, especially in the widely used short-form setting, where speech is artificially presegmented. In this paper, we present the first comprehensive analysis of SimulST latency metrics across language pairs, systems, and both short- and long-form regimes. We uncover a structural bias in current metrics related to segmentation that undermines fair and meaningful comparisons. To address this, we introduce YAAL (Yet Another Average Lagging), a refined latency metric that delivers more accurate evaluations in the short-form regime. We extend YAAL to LongYAAL for unsegmented audio and propose SoftSegmenter, a novel resegmentation tool based on word-level alignment. Our experiments show that YAAL and LongYAAL outperform popular latency metrics, while SoftSegmenter enhances alignment quality in long-form evaluation, together enabling more reliable assessments of SimulST systems.
zh
[NLP-79] AIMMerging: Adaptive Iterative Model Merging Using Training Trajectories for Language Model Continual Learning EMNLP2025
【速读】: 该论文旨在解决持续学习(Continual Learning, CL)中模型在动态环境中学习新知识时面临的“灾难性遗忘”问题,尤其是现有基于模型融合的方法因合并次数和频率设置不当而导致性能下降的问题。解决方案的关键在于提出自适应迭代模型融合(Adaptive Iterative Model Merging, AimMerging),其核心机制包括:1)通过训练轨迹中的学习与遗忘信号动态监控模型状态,实现对合并时机与频率的自适应控制;2)设计基于回放的知识融合模块,计算最优合并权重并执行融合操作。该方法显著提升了模型在多个基准上的表现,在FWT和BWT指标上分别实现了平均80%和59%的相对提升。
链接: https://arxiv.org/abs/2509.17348
作者: Yujie Feng,Jian Li,Xiaoyu Dong,Pengfei Xu,Xiaohui Zhou,Yujia Zhang,Zexin LU,Yasha Wang,Alan Zhao,Xu Chu,Xiao-Ming Wu
机构: Al Technology Center of OVB, Tencent(腾讯), China; The Hong Kong Polytechnic University(香港理工大学), Hong Kong S.A.R.; Peking University(北京大学), China
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: EMNLP 2025
Abstract:Continual learning (CL) is essential for deploying large language models (LLMs) in dynamic real-world environments without the need for costly retraining. Recent model merging-based methods have attracted significant attention, but they still struggle to effectively manage the trade-off between learning new knowledge and preventing forgetting, a challenge largely stemming from suboptimal number of merges and merging frequency. In this paper, we introduce Adaptive Iterative Model Merging (AimMerging), a novel CL framework that utilizes learning and forgetting signals from the training trajectory to dynamically monitor the model’s training status. Guided by dynamic monitoring, the training trajectory-guided merge controller adaptively determines the timing and frequency of iterative fusion, while the rehearsal-based knowledge fusion module computes the merging weights and executes the fusion. Comprehensive experiments on three CL benchmarks with various model sizes (from 770M to 13B) demonstrate that AimMerging achieves significant performance improvements over existing state-of-the-art methods, with an average relative improvement of 80% and 59% on FWT and BWT, respectively. The source code is provided for reproducibility.
zh
[NLP-80] LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code
【速读】: 该论文旨在解决当前软件系统中漏洞分析过于依赖简化分类任务、忽视真实场景下上下文依赖性和安全推理需求的问题。现有代码大语言模型(Code LLMs)虽具备较强的代码理解能力,但缺乏对安全特异性推理的关注。其解决方案的关键在于提出一种多模态大语言模型 LLaVul,通过将代码与安全相关的自然语言问题-答案对映射到统一表示空间,实现细粒度的漏洞推理;该模型以问答(QA)形式增强对代码漏洞的上下文敏感理解,并在自建的真实漏洞数据集上验证了其在漏洞检测和解释性方面的优越性能。
链接: https://arxiv.org/abs/2509.17337
作者: Ala Jararweh,Michael Adams,Avinash Sahu,Abdullah Mueen,Afsah Anwar
机构: The University of New Mexico (新墨西哥大学); The University of New Mexico (新墨西哥大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Increasing complexity in software systems places a growing demand on reasoning tools that unlock vulnerabilities manifest in source code. Many current approaches focus on vulnerability analysis as a classifying task, oversimplifying the nuanced and context-dependent real-world scenarios. Even though current code large language models (LLMs) excel in code understanding, they often pay little attention to security-specific reasoning. We propose LLaVul, a multimodal LLM tailored to provide fine-grained reasoning about code through question-answering (QA). Our model is trained to integrate paired code and natural queries into a unified space, enhancing reasoning and context-dependent insights about code vulnerability. To evaluate our model performance, we construct a curated dataset of real-world vulnerabilities paired with security-focused questions and answers. Our model outperforms state-of-the-art general-purpose and code LLMs in the QA and detection tasks. We further explain decision-making by conducting qualitative analysis to highlight capabilities and limitations. By integrating code and QA, LLaVul enables more interpretable and security-focused code understanding.
zh
[NLP-81] Mano Report
【速读】: 该论文旨在解决图形用户界面(GUI)自动化交互的挑战,主要包括视觉元素复杂性、动态环境适应性不足以及多步推理能力有限等问题。现有基于视觉-语言模型(VLM)的方法常受限于低分辨率、领域差异和序列决策能力弱等缺陷。其解决方案的关键在于构建一个名为Mano的鲁棒GUI代理,该代理基于在大规模网页与计算机系统数据上预训练的多模态基础模型,创新性地结合了高保真度模拟环境用于数据生成、三阶段训练流程(监督微调、离线强化学习与在线强化学习)以及错误恢复验证模块,从而显著提升GUI任务的成功率与操作准确性,在Mind2Web和OSWorld等多个基准测试中达到当前最优性能。
链接: https://arxiv.org/abs/2509.17336
作者: Tianyu Fu,Anyang Su,Chenxu Zhao,Hanning Wang,Minghui Wu,Zhe Yu,Fei Hu,Mingjia Shi,Wei Dong,Jiayao Wang,Yuyang Chen,Ruiyang Yu,Siran Peng,Menglin Li,Nan Huang,Haitian Wei,Jiawei Yu,Yi Xin,Xilin Zhao,Kai Gu,Ping Jiang,Sifan Zhou,Shuo Wang
机构: DeepMiner-Mano Team, Mininglamp Technology (DeepMiner-Mano 团队,Mininglamp 科技)
类目: Multimedia (cs.MM); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
zh
[NLP-82] Generalizable End-to-End Tool-Use RL with Synthetic CodeGym
【速读】: 该论文旨在解决当前工具增强型大语言模型(LLM agents)在训练过程中依赖静态轨迹的监督微调(SFT)或窄域强化学习(RL),导致其在面对新工具和未见过的工作流时泛化能力差、鲁棒性不足的问题。解决方案的关键在于提出 CodeGym,一个可扩展的框架,通过将静态编程问题重构为交互式环境,提取原子函数或逻辑作为可调用工具,从而生成多样、可验证且可控的多轮工具使用任务,使 LLM agents 能够在真实世界工作流结构的模拟中主动探索与掌握多种执行路径,显著提升其分布外(OOD)泛化性能。
链接: https://arxiv.org/abs/2509.17325
作者: Weihua Du,Hailei Gong,Zhan Ling,Kang Liu,Lingfeng Shen,Xuesong Yao,Yufei Xu,Dingyuan Shi,Yiming Yang,Jiecao Chen
机构: ByteDance(字节跳动)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 22 pages. Project available at this https URL
Abstract:Tool-augmented large language models (LLMs), hereafter LLM agents, leverage external tools to solve diverse tasks and interface with the real world. However, current training practices largely rely on supervised fine-tuning (SFT) over static trajectories or reinforcement learning (RL) on narrow tasks, and generalize poorly beyond development settings, leading to brittleness with new tools and unseen workflows. Because code execution reflects many structures of real-world workflows, coding problems provide a natural basis for building agent training environments. Motivated by this, we introduce CodeGym, a scalable framework that synthesizes diverse, verifiable, and controllable multi-turn tool-use environments for agent RL, enabling LLM agents to explore and master various workflows actively. CodeGym rewrites static coding problems into interactive environments by extracting atomic functions or logic into callable tools, yielding verifiable tasks that span various tool-execution workflows. Models of varying sizes and chain-of-thought configurations, trained in CodeGym, exhibit consistent out-of-distribution generalizability; for example, Qwen2.5-32B-Instruct achieves an absolute accuracy gain of 8.7 points on the OOD benchmark \tau -Bench. These results highlight CodeGym as a step toward scalable general-purpose RL environments that align with real-world agent workflows.
zh
[NLP-83] OpenGVL - Benchmarking Visual Temporal Progress for Data Curation
【速读】: 该论文旨在解决机器人领域中数据稀缺问题,尤其是在大规模场景下如何高效利用不断增长的野外机器人数据。其解决方案的关键在于提出OpenGVL——一个用于估计多样化复杂操作任务中任务进度的综合性基准,该基准基于生成式价值学习(Generative Value Learning, GVL)方法,利用视觉-语言模型(Vision-Language Models, VLMs)从视觉观测中预测任务进展。通过该基准,研究者能够自动标注和筛选大规模数据集,从而实现高效的高质量数据筛选与管理。
链接: https://arxiv.org/abs/2509.17321
作者: Paweł Budzianowski,Emilia Wiśnios,Gracjan Góral,Igor Kulakov,Viktor Petrenko,Krzysztof Walas
机构: University of Warsaw (华沙大学); IDEAS Research Institute (IDEAS 研究所); Simple Automation (简单自动化); Poznań University of Technology (波兹南理工大学)
类目: Robotics (cs.RO); Computation and Language (cs.CL)
备注:
Abstract:Data scarcity remains one of the most limiting factors in driving progress in robotics. However, the amount of available robotics data in the wild is growing exponentially, creating new opportunities for large-scale data utilization. Reliable temporal task completion prediction could help automatically annotate and curate this data at scale. The Generative Value Learning (GVL) approach was recently proposed, leveraging the knowledge embedded in vision-language models (VLMs) to predict task progress from visual observations. Building upon GVL, we propose OpenGVL, a comprehensive benchmark for estimating task progress across diverse challenging manipulation tasks involving both robotic and human embodiments. We evaluate the capabilities of publicly available open-source foundation models, showing that open-source model families significantly underperform closed-source counterparts, achieving only approximately 70% of their performance on temporal progress prediction tasks. Furthermore, we demonstrate how OpenGVL can serve as a practical tool for automated data curation and filtering, enabling efficient quality assessment of large-scale robotics datasets. We release the benchmark along with the complete codebase at \hrefthis http URLOpenGVL.
zh
[NLP-84] CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在数学推理任务中面临的挑战,即需要多步推理和抽象概念整合,而现有高质量、高难度的奥数级数学问题稀缺,成为测试时扩展(test-time scaling)技术发展的瓶颈。解决方案的关键在于提出一种基于认知原子(cognitive atoms)的框架——CogAtom,其将问题构造建模为从人类解题过程中提取的基本推理单元的选择与重组过程;通过促进多样性的随机游走算法探索认知原子空间,并结合约束驱动的重组机制保证逻辑严谨性和结构有效性,从而实现可控制难度、高多样性且大规模生成高质量数学问题的能力。
链接: https://arxiv.org/abs/2509.17318
作者: Zhuofan Chen,Jiyuan He,Yichi Zhang,Xing Hu,Haoxing Wen,Jun Bai,Wenge Rong
机构: Beihang University (北京航空航天大学); Meituan Inc. (美团); Beijing Institute for General Artificial Intelligence (通用人工智能研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near-infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large-scale synthesis of high-quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem this http URL code is publicly available at this https URL.
zh
[NLP-85] Scaling Simplification and Adaptation: Lessons from Pretraining on Machine-Translated Text
【速读】: 该论文旨在解决低资源语言在大规模单语预训练中面临的“数据墙”问题,即缺乏足够语料导致模型性能受限。其核心解决方案是利用机器翻译(Machine Translation, MT)将高资源语言(如英语)文本转化为目标低资源语言(如印尼语和泰米尔语),从而构建用于预训练的合成语料库。关键发现在于:首先,基于MT数据预训练的模型能随模型规模扩大而持续受益;其次,对源端文本进行简化(如使用大语言模型LLM处理英文)反而损害模型在原生文本上的泛化能力;最后,即使仅用少量原生文本对MT预训练模型进行持续微调,其性能通常优于纯原生数据训练的模型,但在涉及文化敏感性的任务(如毒性检测)上仍需更多原生数据支持。
链接: https://arxiv.org/abs/2509.17317
作者: Dan John Velasco,Matthew Theodore Roque
机构: Samsung R&D Institute Philippines (三星研发研究所菲律宾分部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Most languages lack sufficient data for large-scale monolingual pretraining, creating a “data wall.” Multilingual pretraining helps but is limited by language imbalance and the “curse of multilinguality.” An alternative is to translate high-resource text with machine translation (MT), which raises three questions: (1) How does MT-derived data scale with model capacity? (2) Can source-side transformations (e.g., simplifying English with an LLM) improve generalization to native text? (3) How well do models pretrained on MT-derived data adapt when continually trained on limited native text? We investigate these questions by translating English into Indonesian and Tamil–two typologically distant, lower-resource languages–and pretraining GPT-2 models (124M-774M) on native or MT-derived corpora from raw and LLM-simplified English. We evaluate cross-entropy loss on native text, along with accuracy on syntactic probes and downstream tasks. Our results show that (1) MT-pretrained models benefit from scaling; (2) source-side simplification harms generalization to native text; and (3) adapting MT-pretrained models on native text often yields better performance than native-only models, even with less native data. However, tasks requiring cultural nuance (e.g., toxicity detection) demand more exposure to native data.
zh
[NLP-86] Multi-View Attention Multiple-Instance Learning Enhanced by LLM Reasoning for Cognitive Distortion Detection
【速读】: 该论文旨在解决认知扭曲(Cognitive Distortions)在心理健康自然语言处理(NLP)中自动检测的难题,尤其针对语境模糊性、共现现象及语义重叠导致的识别困难。其解决方案的关键在于提出一种融合大型语言模型(Large Language Models, LLMs)与多实例学习(Multiple-Instance Learning, MIL)架构的新框架,将每个话语分解为情绪(Emotion)、逻辑(Logic)和行为(Behavior)三个可解释的ELB组件,由LLM推断多个扭曲实例及其类型、表达方式与模型赋予的显著性分数(salience score),并通过多视角门控注意力机制整合这些信息以实现最终分类。此方法显著提升了对高解释模糊性认知扭曲的识别性能,具备心理合理性与泛化能力。
链接: https://arxiv.org/abs/2509.17292
作者: Jun Seo Kim,Hyemi Kim,Woo Joo Oh,Hongjin Cho,Hochul Lee,Hye Hyeon Kim
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Cognitive distortions have been closely linked to mental health disorders, yet their automatic detection remained challenging due to contextual ambiguity, co-occurrence, and semantic overlap. We proposed a novel framework that combines Large Language Models (LLMs) with Multiple-Instance Learning (MIL) architecture to enhance interpretability and expression-level reasoning. Each utterance was decomposed into Emotion, Logic, and Behavior (ELB) components, which were processed by LLMs to infer multiple distortion instances, each with a predicted type, expression, and model-assigned salience score. These instances were integrated via a Multi-View Gated Attention mechanism for final classification. Experiments on Korean (KoACD) and English (Therapist QA) datasets demonstrate that incorporating ELB and LLM-inferred salience scores improves classification performance, especially for distortions with high interpretive ambiguity. Our results suggested a psychologically grounded and generalizable approach for fine-grained reasoning in mental health NLP.
zh
[NLP-87] Automated Knowledge Graph Construction using Large Language Models and Sentence Complexity Modelling
【速读】: 该论文旨在解决从自然语言句子中自动提取结构化知识图谱(Knowledge Graph, KG)的难题,尤其关注复杂句子中的语义解析与关系抽取精度问题。其核心挑战在于如何有效处理句子内部的指代消解(Coreference Resolution)和句法分解(Syntactic Sentence Decomposition),以提升知识三元组提取的准确率与覆盖率。解决方案的关键在于提出一个端到端的开源流水线 CoDe-KG,该流程融合了鲁棒的指代消解模块与基于句法结构的句子分解机制,从而显著增强对罕见关系的召回能力(提升超20%),并在多个基准数据集上实现当前最优性能,如在 REBEL 上达到 65.8% 的 macro-F1(较前人提升 8 个百分点)。
链接: https://arxiv.org/abs/2509.17289
作者: Sydney Anuyah,Mehedi Mahmud Kaushik,Krishna Dwarampudi,Rakesh Shiradkar,Arjan Durresi,Sunandan Chakraborty
机构: Indiana University (印第安纳大学); Purdue University (普渡大学); Department of Biomedical Engineering (生物医学工程系)
类目: Computation and Language (cs.CL)
备注:
Abstract:We introduce CoDe-KG, an open-source, end-to-end pipeline for extracting sentence-level knowledge graphs by combining robust coreference resolution with syntactic sentence decomposition. Using our model, we contribute a dataset of over 150,000 knowledge triples, which is open source. We also contribute a training corpus of 7248 rows for sentence complexity, 190 rows of gold human annotations for co-reference resolution using open source lung-cancer abstracts from PubMed, 900 rows of gold human annotations for sentence conversion policies, and 398 triples of gold human annotations. We systematically select optimal prompt-model pairs across five complexity categories, showing that hybrid chain-of-thought and few-shot prompting yields up to 99.8% exact-match accuracy on sentence simplification. On relation extraction (RE), our pipeline achieves 65.8% macro-F1 on REBEL, an 8-point gain over the prior state of the art, and 75.7% micro-F1 on WebNLG2, while matching or exceeding performance on Wiki-NRE and CaRB. Ablation studies demonstrate that integrating coreference and decomposition increases recall on rare relations by over 20%. Code and dataset are available at this https URL
zh
[NLP-88] Probabilistic Token Alignment for Large Language Model Fusion NEURIPS2025
【速读】: 该论文旨在解决现有大语言模型(Large Language Models, LLMs)融合方法中依赖人工预定义词汇对齐(vocabulary alignment)所导致的泛化能力不足问题,这在不同语境下常引发性能下降。其解决方案的关键在于提出一种基于概率的令牌对齐方法(Probabilistic Token Alignment, PTA-LLM),将令牌对齐重新建模为最优传输(optimal transport)问题,从而实现无需人工干预的软映射对齐,利用分布感知学习提升融合模型的一致性与性能。该方法不仅具有通用性,还从分布角度提供了可解释性,揭示了令牌对齐的本质。
链接: https://arxiv.org/abs/2509.17276
作者: Runjia Zeng,James Chenhao Liang,Cheng Han,Zhiwen Cao,Jiahao Liu,Xiaojun Quan,Yingjie Victor Chen,Lifu Huang,Tong Geng,Qifan Wang,Dongfang Liu
机构: Rochester Institute of Technology (罗切斯特理工学院); U.S. Naval Research Laboratory (美国海军研究实验室); University of Missouri-Kansas City (密苏里大学堪萨斯城分校); Adobe (Adobe公司); Meituan (美团); Sun Yat-sen University (中山大学); Purdue University (普渡大学); UC Davis (加州大学戴维斯分校); University of Rochester (罗彻斯特大学); Rice University (莱斯大学); Meta AI (Meta人工智能)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: NeurIPS 2025
Abstract:Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained LLMs with different architectures into a more powerful model. However, a key challenge in existing model fusion is their dependence on manually predefined vocabulary alignment, which may not generalize well across diverse contexts, leading to performance degradation in several evaluation. To solve this, we draw inspiration from distribution learning and propose the probabilistic token alignment method as a general and soft mapping for alignment, named as PTA-LLM. Our approach innovatively reformulates token alignment into a classic mathematical problem: optimal transport, seamlessly leveraging distribution-aware learning to facilitate more coherent model fusion. Apart from its inherent generality, PTA-LLM exhibits interpretability from a distributional perspective, offering insights into the essence of the token alignment. Empirical results demonstrate that probabilistic token alignment enhances the target model’s performance across multiple capabilities. Our code is avaliable at this https URL.
zh
[NLP-89] Extending Automatic Machine Translation Evaluation to Book-Length Documents EMNLP2025
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长文档翻译评估中面临的局限性问题,即现有自动评估指标受限于句子级粒度、token数量限制以及严格的句子边界假设,难以有效衡量文档级别的翻译质量。其解决方案的关键在于提出SEGALE评估框架,该框架将文档视为连续文本流,结合句法分割与对齐方法,在不依赖人工标注句对的情况下实现任意长度文档的自动化评估,同时能处理漏译、冗译及不一致的句子边界问题,从而显著提升长文档翻译的评价能力,并首次揭示了多个开源LLM在报告最大上下文长度下实际翻译性能的不足。
链接: https://arxiv.org/abs/2509.17249
作者: Kuang-Da Wang,Shuoyang Ding,Chao-Han Huck Yang,Ping-Chun Hsieh,Wen-Chih Peng,Vitaly Lavrukhin,Boris Ginsburg
机构: National Yang Ming Chiao Tung University (国立阳明交通大学); NVIDIA (英伟达)
类目: Computation and Language (cs.CL)
备注: Accepted for EMNLP 2025 main conference
Abstract:Despite Large Language Models (LLMs) demonstrating superior translation performance and long-context capabilities, evaluation methodologies remain constrained to sentence-level assessment due to dataset limitations, token number restrictions in metrics, and rigid sentence boundary requirements. We introduce SEGALE, an evaluation scheme that extends existing automatic metrics to long-document translation by treating documents as continuous text and applying sentence segmentation and alignment methods. Our approach enables previously unattainable document-level evaluation, handling translations of arbitrary length generated with document-level prompts while accounting for under-/over-translations and varied sentence boundaries. Experiments show our scheme significantly outperforms existing long-form document evaluation schemes, while being comparable to evaluations performed with groundtruth sentence alignments. Additionally, we apply our scheme to book-length texts and newly demonstrate that many open-weight LLMs fail to effectively translate documents at their reported maximum context lengths.
zh
[NLP-90] Can Agents Judge Systematic Reviews Like Humans? Evaluating SLRs with LLM -based Multi-Agent System
【速读】: 该论文旨在解决系统性文献综述(Systematic Literature Reviews, SLRs)在实际研究中存在劳动密集、跨学科一致性差的问题。其解决方案的关键在于构建一个基于大语言模型(Large Language Models, LLMs)的多智能体系统(Multi-Agent System, MAS)架构,通过专业化代理协同工作,实现对SLR协议验证、方法学评估和主题相关性检查的自动化处理,且设计严格遵循PRISMA指南,从而提升评估过程的结构化程度与可解释性。初步实验表明,该系统输出与专家标注的PRISMA评分具有84%的一致性,验证了其在跨学科知识聚合中的可行性与潜力。
链接: https://arxiv.org/abs/2509.17240
作者: Abdullah Mushtaq,Muhammad Rafay Naeem,Ibrahim Ghaznavi,Alaa Abd-alrazaq,Aliya Tabassum,Junaid Qadir
机构: Information Technology University (信息科技大学); Weill Cornell Medicine–Qatar (威尔·康奈尔医学院卡塔尔分校); Qatar University (卡塔尔大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Systematic Literature Reviews (SLRs) are foundational to evidence-based research but remain labor-intensive and prone to inconsistency across disciplines. We present an LLM-based SLR evaluation copilot built on a Multi-Agent System (MAS) architecture to assist researchers in assessing the overall quality of the systematic literature reviews. The system automates protocol validation, methodological assessment, and topic relevance checks using a scholarly database. Unlike conventional single-agent methods, our design integrates a specialized agentic approach aligned with PRISMA guidelines to support more structured and interpretable evaluations. We conducted an initial study on five published SLRs from diverse domains, comparing system outputs to expert-annotated PRISMA scores, and observed 84% agreement. While early results are promising, this work represents a first step toward scalable and accurate NLP-driven systems for interdisciplinary workflows and reveals their capacity for rigorous, domain-agnostic knowledge aggregation to streamline the review process.
zh
[NLP-91] MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段生成质量提升受限的问题,尤其是传统序列级缩放方法(如思维链 Chain-of-Thought)难以有效改善单个token预测精度的局限性。其解决方案的关键在于提出一种名为“超并行缩放”(Hyper-parallel Scaling)的新框架,该框架通过在token级别上计算和聚合来自模型的多个输出提案来提升预测准确性。具体实现中,作者将此思想应用于混合专家模型(Mixture-of-Experts, MoE),构建了称为“专家名单”(Roster of Experts, RoE)的训练-free推理算法,该算法通过向专家路由机制引入可控随机性,使每个token可采样多个多样化的专家,并对它们的输出进行聚合。为降低计算开销,RoE还设计了一种高效的批处理策略和专用KV缓存机制,显著减少了计算与内存开销。实验表明,RoE可在不微调模型参数的前提下,使7B规模的MoE模型达到10.5B MoE模型的性能水平,同时减少30%的推理计算量。
链接: https://arxiv.org/abs/2509.17238
作者: Soheil Zibakhsh,Mohammad Samragh,Kumari Nishu,Lauren Hannah,Arnav Kundu,Minsik Cho
机构: Apple; University of California San Diego
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
备注:
Abstract:The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final this http URL overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.
zh
[NLP-92] Causal Representation Learning from Multimodal Clinical Records under Non-Random Modality Missingness EMNLP2025
【速读】: 该论文旨在解决多模态临床数据中因临床决策导致的非随机缺失(Missing-Not-At-Random, MNAR)问题,即不同患者在结构化数据、影像学检查(如胸片)和文本记录(如出院小结)等模态上的可用性差异并非随机,而是受医生诊疗行为影响,从而影响患者表征学习的质量。解决方案的关键在于提出一种因果表征学习框架,其核心包括:(1) 一个对MNAR敏感的模态融合模块,通过条件化缺失模式来整合多模态信息,捕捉患者健康状态与医师分配行为的联合效应;(2) 基于对比学习的模态重建机制,确保表征学习具备语义完备性;(3) 多任务预测模型结合修正器(rectifier),校正由特定模态观测模式引入的残余偏差。该方法在MIMIC-IV和eICU数据集上显著优于现有最强基线,在住院再入院和ICU入院预测任务中分别实现最高13.8%和13.1%的AUC提升。
链接: https://arxiv.org/abs/2509.17228
作者: Zihan Liang,Ziwen Pan,Ruoxuan Xiong
机构: Emory University (埃默里大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL); Methodology (stat.ME)
备注: To appear in Proc. of EMNLP 2025 (18 pages)
Abstract:Clinical notes contain rich patient information, such as diagnoses or medications, making them valuable for patient representation learning. Recent advances in large language models have further improved the ability to extract meaningful representations from clinical texts. However, clinical notes are often missing. For example, in our analysis of the MIMIC-IV dataset, 24.5% of patients have no available discharge summaries. In such cases, representations can be learned from other modalities such as structured data, chest X-rays, or radiology reports. Yet the availability of these modalities is influenced by clinical decision-making and varies across patients, resulting in modality missing-not-at-random (MMNAR) patterns. We propose a causal representation learning framework that leverages observed data and informative missingness in multimodal clinical records. It consists of: (1) an MMNAR-aware modality fusion component that integrates structured data, imaging, and text while conditioning on missingness patterns to capture patient health and clinician-driven assignment; (2) a modality reconstruction component with contrastive learning to ensure semantic sufficiency in representation learning; and (3) a multitask outcome prediction model with a rectifier that corrects for residual bias from specific modality observation patterns. Comprehensive evaluations across MIMIC-IV and eICU show consistent gains over the strongest baselines, achieving up to 13.8% AUC improvement for hospital readmission and 13.1% for ICU admission.
zh
[NLP-93] Prompt-Based Simplification for Plain Language using Spanish Language Models
【速读】: 该论文旨在解决西班牙语文本向通俗语言(Plain Language, PL)转换的适应性问题,核心挑战在于如何在保持内容语义一致性的同时提升文本可读性。解决方案的关键在于结合文本规范化步骤、基于西班牙语训练的RigoChat-7B-v2模型以及针对PL优化的提示工程(prompt engineering),并通过语义相似度(SIM)和Fernández-Huerta可读性指数(FH)进行多维度评估,最终选择性能均衡且一致的模型与提示组合,实现了最高的语义相似度(SIM = 0.75),但可读性指标排名第四(FH = 69.72),凸显了当前评估体系在捕捉语言清晰度与内容保真度之间的权衡局限。
链接: https://arxiv.org/abs/2509.17209
作者: Lourdes Moreno,Jesus M. Sanchez-Gomez,Marco Antonio Sanchez-Escudero,Paloma Martínez
机构: 未知
类目: Computation and Language (cs.CL)
备注: 11 pages, 7 tables,
Abstract:This paper describes the participation of HULAT-UC3M in CLEARS 2025 Subtask 1: Adaptation of Text to Plain Language (PL) in Spanish. We explored strategies based on models trained on Spanish texts, including a zero-shot configuration using prompt engineering and a fine-tuned version with Low-Rank Adaptation (LoRA). Different strategies were evaluated on representative internal subsets of the training data, using the official task metrics, cosine similarity (SIM) and the Fernández-Huerta readability index (FH) to guide the selection of the optimal model and prompt combination. The final system was selected for its balanced and consistent performance, combining normalization steps, the RigoChat-7B-v2 model, and a dedicated PL-oriented prompt. It ranked first in semantic similarity (SIM = 0.75), however, fourth in readability (FH = 69.72). We also discuss key challenges related to training data heterogeneity and the limitations of current evaluation metrics in capturing both linguistic clarity and content preservation.
zh
[NLP-94] Evolution of Concepts in Language Model Pre-Training
【速读】: 该论文试图解决语言模型(Language Models)预训练过程中的“黑箱”问题,即缺乏对模型内部表征演化机制的细粒度理解。其解决方案的关键在于利用一种称为 crosscoders 的稀疏字典学习方法,追踪预训练快照中可解释线性特征的演变轨迹。通过该方法,研究发现大多数特征在特定训练阶段开始形成,而更复杂的模式则在后期出现,并且特征演化与下游性能之间存在因果关联,从而揭示了 Transformer 模型的两阶段学习过程——统计学习阶段和特征学习阶段。
链接: https://arxiv.org/abs/2509.17196
作者: Xuyang Ge,Wentao Shu,Jiaxing Wu,Yunhua Zhou,Zhengfu He,Xipeng Qiu
机构: OpenMOSS Team, Shanghai Innovation Institute (上海创新研究院); Fudan University (复旦大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 30 pages, 25 figures
Abstract:Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track linear interpretable feature evolution across pre-training snapshots using a sparse dictionary learning method called crosscoders. We find that most features begin to form around a specific point, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on Transformer’s two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility to track fine-grained representation progress during language model learning dynamics.
zh
[NLP-95] VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在分析文化遗产 artifacts(如古希腊陶器)时缺乏领域专业知识、以及监督微调(Supervised Fine-Tuning, SFT)易过拟合表面模式导致推理脆弱的问题。解决方案的关键在于提出一种“先微调再强化学习”(SFT-then-RL)框架 VaseVL,其核心机制是将评估过程转化为监督信号:首先构建问题类型分类体系以定位模型在特定类型任务上的性能短板,随后设计基于类型条件和组合性导向的奖励函数,针对性优化这些薄弱环节。实验表明,该方法在风格分类与历史归属任务上达到当前最优性能,并显著提升组合推理的鲁棒性,验证了诊断引导型、分类条件化的奖励工程的有效性。
链接: https://arxiv.org/abs/2509.17191
作者: Jinchao Ge,Tengfei Cheng,Biao Wu,Zeyu Zhang,Shiya Huang,Judith Bishop,Gillian Shepherd,Meng Fang,Ling Chen,Yang Zhao
机构: AI Geeks; Australian Artificial Intelligence Institute; La Trobe University
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Analyzing cultural-heritage artifacts remains challenging for MLLMs: general models lack domain expertise, and SFT often overfits superficial patterns, yielding brittle reasoning for authentication and historical attribution. This raises the question of how to equip MLLMs with robust, expert-level reasoning for ancient Greek pottery. We present VaseVL, an SFT-then-RL system that turns evaluation into supervision: we construct a taxonomy of question types, probe the SFT model to localize type-specific performance gaps, and optimize with type-conditioned, compositionality-oriented rewards targeting those gaps. We also release VaseVQA, a comprehensive benchmark of 31,773 images designed to probe deep understanding. Experiments show state-of-the-art results on style classification and historical attribution with marked gains in compositional robustness over SFT-only baselines, validating diagnosis-guided, taxonomy-conditioned reward engineering and providing a reusable resource for future research. Code and dataset will be available at this https URL.
zh
[NLP-96] LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在持续学习过程中因传统对齐方法导致的灾难性遗忘问题,即模型在适应新任务或领域的人类偏好时会丢失先前习得的知识。其解决方案的关键在于提出一种名为LifeAlign的新型终身对齐框架,包含两个核心创新:一是聚焦式偏好优化策略(focalized preference optimization strategy),能够在对齐新偏好时保护已有知识不被侵蚀;二是短时到长时记忆整合机制(short-to-long memory consolidation mechanism),通过内在维度约简将去噪后的短期偏好表示合并为稳定的长期记忆,从而高效存储和检索跨不同领域的一致对齐模式。
链接: https://arxiv.org/abs/2509.17183
作者: Junsong Li,Jie Zhou,Bihao Zhan,Yutao Yang,Qianjun Pan,Shilian Chen,Tianyu Huai,Xin Li,Qin Chen,Liang He
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Alignment plays a crucial role in Large Language Models (LLMs) in aligning with human preferences on a specific task/domain. Traditional alignment methods suffer from catastrophic forgetting, where models lose previously acquired knowledge when adapting to new preferences or domains. We introduce LifeAlign, a novel framework for lifelong alignment that enables LLMs to maintain consistent human preference alignment across sequential learning tasks without forgetting previously learned knowledge. Our approach consists of two key innovations. First, we propose a focalized preference optimization strategy that aligns LLMs with new preferences while preventing the erosion of knowledge acquired from previous tasks. Second, we develop a short-to-long memory consolidation mechanism that merges denoised short-term preference representations into stable long-term memory using intrinsic dimensionality reduction, enabling efficient storage and retrieval of alignment patterns across diverse domains. We evaluate LifeAlign across multiple sequential alignment tasks spanning different domains and preference types. Experimental results demonstrate that our method achieves superior performance in maintaining both preference alignment quality and knowledge retention compared to existing lifelong learning approaches. The codes and datasets will be released on GitHub.
zh
[NLP-97] Attention Consistency for LLM s Explanation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)决策过程可解释性不足的问题,当前的解释方法常面临分辨率低和计算成本高的挑战。其解决方案的核心是提出一种名为多层注意力一致性评分(Multi-Layer Attention Consistency Score, MACS)的新颖、轻量且易于部署的启发式方法,通过衡量输入标记在各层中最大注意力的一致性来估计其重要性,从而在解释质量与计算效率之间实现良好权衡,在保持与复杂方法相当忠实度的同时,显著降低显存占用(减少22%)和延迟(减少30%)。
链接: https://arxiv.org/abs/2509.17178
作者: Tian Lan,Jinyuan Xu,Xue He,Jenq-Neng Hwang,Lei Li
机构: Milkuya Studio; ERTIM, INALCO; Sorbonne University; IRD; University of Washington; VitaSight
类目: Computation and Language (cs.CL)
备注:
Abstract:Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high computational cost. To address these limitations, we propose the \textbfMulti-Layer Attention Consistency Score (MACS), a novel, lightweight, and easily deployable heuristic for estimating the importance of input tokens in decoder-based models. MACS measures contributions of input tokens based on the consistency of maximal attention. Empirical evaluations demonstrate that MACS achieves a favorable trade-off between interpretability quality and computational efficiency, showing faithfulness comparable to complex techniques with a 22% decrease in VRAM usage and 30% reduction in latency.
zh
[NLP-98] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
【速读】: 该论文旨在解决当前大型推理模型(Large Reasoning Models, LRM)在评估过程中存在污染(contamination)的问题,即训练数据与测试数据之间可能存在重叠或泄露,从而导致性能评估失真。为应对这一挑战,作者提出了一种中等规模且尽可能无污染的评估方法,并发布了名为 ROME 的新基准测试集,专门用于视觉语言模型(Vision-Language Models, VLMs)的推理能力评测,重点考察模型从视觉线索中进行逻辑推理的能力。其解决方案的关键在于构建一个独立、可控且具有挑战性的评估环境,以更真实地反映模型的泛化与推理能力。
链接: https://arxiv.org/abs/2509.17177
作者: Bowen Qin,Chen Yue,Fang Yin,Hui Wang,JG Yao,Jiakang Liu,Jing-Shu Zheng,Miguel Hu Chen,Richeng Xuan,Shibei Meng,Shiqi Zhou,Teng Dai,Tong-Shuai Ren,Wei Cui,Xi Yang,Xialin Du,Xiaojing Xu,Xue Sun,Xuejing Li,Yaming Liu,Yesheng Liu,Ying Liu,Yonghua Lin,Yu Zhao,Yunduo Zhang,Yuwen Luo,Zheqi He,Zhiyuan He,Zhongyuan Wang
机构: BAAI FlagEval Team (北京人工智能研究院); State Key Laboratory of Multimedia Information Processing, Peking University (北京大学多媒体信息处理国家重点实验室)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 23 pages in main text
Abstract:We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: this https URL
zh
[NLP-99] SFT-TA: Supervised Fine-Tuned Agents in Multi-Agent LLM s for Automated Inductive Thematic Analysis
【速读】: 该论文旨在解决传统人工主题分析(Thematic Analysis, TA)耗时且难以扩展的问题,同时应对现有基于大语言模型(Large Language Models, LLMs)的自动化主题分析方法在与人类标注结果对齐方面表现有限的挑战。其解决方案的关键在于提出一种名为SFT-TA的自动化主题分析框架,该框架将监督微调(Supervised Fine-Tuned, SFT)代理嵌入到多智能体系统(multi-agent system)中,通过角色分工和协作机制显著提升输出结果与人类参考主题的一致性。实验表明,单独使用SFT代理效果不佳,但在多智能体结构中能超越基线模型(如gpt-4o),验证了特定角色配置下嵌入SFT代理是提升自动化主题分析准确性的有效路径。
链接: https://arxiv.org/abs/2509.17167
作者: Seungjun Yi,Joakim Nguyen,Huimin Xu,Terence Lim,Joseph Skrovan,Mehak Beri,Hitakshi Modi,Andrew Well,Liu Leqi,Mia Markey,Ying Ding
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Vanderbilt University School of Medicine (范德比尔特大学医学院); McCombs School of Business (麦克库姆斯商学院); Dell Medical School (德勒医学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Thematic Analysis (TA) is a widely used qualitative method that provides a structured yet flexible framework for identifying and reporting patterns in clinical interview transcripts. However, manual thematic analysis is time-consuming and limits scalability. Recent advances in LLMs offer a pathway to automate thematic analysis, but alignment with human results remains limited. To address these limitations, we propose SFT-TA, an automated thematic analysis framework that embeds supervised fine-tuned (SFT) agents within a multi-agent system. Our framework outperforms existing frameworks and the gpt-4o baseline in alignment with human reference themes. We observed that SFT agents alone may underperform, but achieve better results than the baseline when embedded within a multi-agent system. Our results highlight that embedding SFT agents in specific roles within a multi-agent system is a promising pathway to improve alignment with desired outputs for thematic analysis.
zh
[NLP-100] ARE: Scaling Up Agent Environments and Evaluations
【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 模型在从实验室开发到真实世界部署过程中存在的环境构建复杂性高、评估基准局限性强以及缺乏动态适应能力等问题。解决方案的关键在于提出 Meta Agents Research Environments (ARE),这是一个可扩展的研究平台,支持异步运行的多样化环境构建、合成或真实应用集成,以及智能体编排执行;同时基于 ARE 构建了 Gaia2 基准测试,其不仅涵盖搜索与执行任务,还要求智能体处理模糊性和噪声、适应动态变化、协作交互并满足时间约束,从而更全面地衡量通用智能体能力。ARE 的抽象设计使 Gaia2 可持续扩展至新场景,推动社区快速定制领域特定基准,为下一代 AI 系统提供更具挑战性和实用性的评估框架。
链接: https://arxiv.org/abs/2509.17158
作者: Pierre Andrews,Amine Benhalloum,Gerard Moreno-Torres Bertran,Matteo Bettini,Amar Budhiraja,Ricardo Silveira Cabral,Virginie Do,Romain Froger,Emilien Garreau,Jean-Baptiste Gaya,Hugo Laurençon,Maxime Lecanu,Kunal Malkan,Dheeraj Mekala,Pierre Ménard,Grégoire Mialon,Ulyana Piterbarg,Mikhail Plekhanov,Mathieu Rita,Andrey Rusakov,Thomas Scialom,Vladislav Vorotilov,Mengjue Wang,Ian Yu
机构: Meta Superintelligence Labs (Meta 超级智能实验室)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:We introduce Meta Agents Research Environments (ARE), a research platform for scalable creation of environments, integration of synthetic or real applications, and execution of agentic orchestrations. ARE provides simple abstractions to build complex and diverse environments, each with their own rules, tools, content, and verifiers, helping to bridge the gap between model development and real-world deployment. We also propose Gaia2, a benchmark built in ARE and designed to measure general agent capabilities. Beyond search and execution, Gaia2 requires agents to handle ambiguities and noise, adapt to dynamic environments, collaborate with other agents, and operate under temporal constraints. Unlike prior benchmarks, Gaia2 runs asynchronously, surfacing new failure modes that are invisible in static settings. Our experiments show that no system dominates across the intelligence spectrum: stronger reasoning often comes at the cost of efficiency, and budget scaling curves plateau, highlighting the need for new architectures and adaptive compute strategies. Perhaps more importantly, ARE abstractions enable continuous extension of Gaia2 to other environments, empowering the community to rapidly create new benchmarks tailored to their domains. In AI’s second half, progress increasingly depends on defining meaningful tasks and robust evaluations to drive frontier capabilities forward.
zh
[NLP-101] SVeritas: Benchmark for Robust Speaker Verification under Diverse Conditions EMNLP2025
【速读】: 该论文旨在解决当前说话人验证(Speaker Verification, SV)模型在真实场景下鲁棒性评估不足的问题,尤其针对多种自然和恶意导致的信号退化或注册与测试数据不匹配的情况缺乏全面基准测试。现有基准仅覆盖部分干扰因素,遗漏了诸多关键现实挑战。解决方案的关键在于提出SVeritas——一个综合性说话人验证任务基准套件,首次系统性地涵盖包括录音时长、即兴程度、内容差异、噪声、麦克风距离、混响、信道失配、音频带宽、编码器压缩、说话人年龄、欺骗攻击及对抗攻击在内的全部已知重要干扰因素,并新增多个此前未被评估的真实场景条件。通过该基准对多个前沿SV模型进行评测,揭示了模型在跨语言试验、年龄不匹配及编码压缩等场景下的显著性能下降,并进一步发现不同年龄、性别和语言背景群体间存在鲁棒性差异,从而为精准诊断模型弱点、推动公平可靠SV系统的发展奠定基础。
链接: https://arxiv.org/abs/2509.17091
作者: Massa Baali,Sarthak Bisht,Francisco Teixeira,Kateryna Shapovalenko,Rita Singh,Bhiksha Raj
机构: Carnegie Mellon University (卡内基梅隆大学); INESC-ID (INESC-ID)
类目: ound (cs.SD); Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Findings
Abstract:Speaker verification (SV) models are increasingly integrated into security, personalization, and access control systems, yet their robustness to many real-world challenges remains inadequately benchmarked. These include a variety of natural and maliciously created conditions causing signal degradations or mismatches between enrollment and test data, impacting performance. Existing benchmarks evaluate only subsets of these conditions, missing others entirely. We introduce SVeritas, a comprehensive Speaker Verification tasks benchmark suite, assessing SV systems under stressors like recording duration, spontaneity, content, noise, microphone distance, reverberation, channel mismatches, audio bandwidth, codecs, speaker age, and susceptibility to spoofing and adversarial attacks. While several benchmarks do exist that each cover some of these issues, SVeritas is the first comprehensive evaluation that not only includes all of these, but also several other entirely new, but nonetheless important, real-life conditions that have not previously been benchmarked. We use SVeritas to evaluate several state-of-the-art SV models and observe that while some architectures maintain stability under common distortions, they suffer substantial performance degradation in scenarios involving cross-language trials, age mismatches, and codec-induced compression. Extending our analysis across demographic subgroups, we further identify disparities in robustness across age groups, gender, and linguistic backgrounds. By standardizing evaluation under realistic and synthetic stress conditions, SVeritas enables precise diagnosis of model weaknesses and establishes a foundation for advancing equitable and reliable speaker verification systems.
zh
[NLP-102] Localizing Malicious Outputs from CodeLLM EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中后门攻击(backdoor attack)导致的恶意输出问题,尤其是如何精准定位生成恶意内容的组件及其触发机制。解决方案的关键在于提出一种基于变异(mutation-based)的防御方法——FreqRank,其核心思想是假设恶意子字符串在触发输入下会高频重复出现于模型输出中,并通过频率排名系统识别这些恶意片段,进而反向定位输入中的后门触发器(backdoor trigger)。该方法在多种下游任务(如代码补全、生成与摘要)中均表现出高定位准确率,且在样本有限或变异数量较少时仍具有效性,相较其他防御手段提升35–50%的检测性能。
链接: https://arxiv.org/abs/2509.17070
作者: Mayukh Borana,Junyi Liang,Sai Sathiesh Rajan,Sudipta Chattopadhyay
机构: Singapore University of Technology and Design (新加坡科技设计大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 10 pages, 2 figures, 6 tables, Accepted at EMNLP 2025 Findings
Abstract:We introduce FreqRank, a mutation-based defense to localize malicious components in LLM outputs and their corresponding backdoor triggers. FreqRank assumes that the malicious sub-string(s) consistently appear in outputs for triggered inputs and uses a frequency-based ranking system to identify them. Our ranking system then leverages this knowledge to localize the backdoor triggers present in the inputs. We create nine malicious models through fine-tuning or custom instructions for three downstream tasks, namely, code completion (CC), code generation (CG), and code summarization (CS), and show that they have an average attack success rate (ASR) of 86.6%. Furthermore, FreqRank’s ranking system highlights the malicious outputs as one of the top five suggestions in 98% of cases. We also demonstrate that FreqRank’s effectiveness scales as the number of mutants increases and show that FreqRank is capable of localizing the backdoor trigger effectively even with a limited number of triggered samples. Finally, we show that our approach is 35-50% more effective than other defense methods.
zh
[NLP-103] actfulToM: Do LLM s Have the Theory of Mind Ability to Understand White Lies?
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在理解复杂社会情境下理论心理(Theory of Mind, ToM)能力的不足,尤其是对“善意谎言”(white lies)这一需要精细社会语境推理的现象缺乏有效建模的问题。解决方案的关键在于构建了一个名为TactfulToM的新颖英文基准测试集,其通过多阶段人机协作(human-in-the-loop)流程生成真实对话场景,在保持参与者间信息不对称的基础上,使模型能够评估白谎背后的亲社会动机(prosocial motivations),如维护他人情感与社会和谐。实验表明,即使是最先进的LLMs在该基准上的表现显著低于人类水平,揭示了其在深层ToM推理方面的局限性。
链接: https://arxiv.org/abs/2509.17054
作者: Yiwei Liu,Emma Jane Pretty,Jiahao Huang,Saku Sugawara
机构: EPFL (瑞士联邦理工学院); Tampere University (坦佩雷大学); University of Tokyo (东京大学); National Institute of Informatics (日本国立信息学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:While recent studies explore Large Language Models’ (LLMs) performance on Theory of Mind (ToM) reasoning tasks, research on ToM abilities that require more nuanced social context is limited, such as white lies. We introduce TactfulToM, a novel English benchmark designed to evaluate LLMs’ ability to understand white lies within real-life conversations and reason about prosocial motivations behind them, particularly when they are used to spare others’ feelings and maintain social harmony. Our benchmark is generated through a multi-stage human-in-the-loop pipeline where LLMs expand manually designed seed stories into conversations to maintain the information asymmetry between participants necessary for authentic white lies. We show that TactfulToM is challenging for state-of-the-art models, which perform substantially below humans, revealing shortcomings in their ability to fully comprehend the ToM reasoning that enables true understanding of white lies.
zh
[NLP-104] Modeling Bottom-up Information Quality during Language Processing
【速读】: 该论文试图解决的问题是:在阅读过程中,底层输入(bottom-up inputs)的信息质量如何影响语言处理的难易程度,特别是当输入存在噪声时是否会导致更困难、更费力的语义理解。其解决方案的关键在于提出了一种信息论意义上的操作化定义——即用视觉信息与词义之间的互信息(Mutual Information, MI)来量化底层输入的质量,并基于贝叶斯更新机制构建了一个数学模型来形式化这一预测。通过在英语和汉语中操纵单词视觉信息的完整性(如遮挡上半部分或下半部分),并利用多模态语言模型估算MI,研究验证了信息质量下降确实会显著延长阅读时间,且不同语言中视觉信息分布的不对称性也体现在阅读行为差异中,从而为语言加工中的自上而下与自下而上交互提供了实证支持。
链接: https://arxiv.org/abs/2509.17047
作者: Cui Ding,Yanning Yin,Lena A. Jäger,Ethan Gotlieb Wilcox
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Contemporary theories model language processing as integrating both top-down expectations and bottom-up inputs. One major prediction of such models is that the quality of the bottom-up inputs modulates ease of processing – noisy inputs should lead to difficult and effortful comprehension. We test this prediction in the domain of reading. First, we propose an information-theoretic operationalization for the “quality” of bottom-up information as the mutual information (MI) between visual information and word identity. We formalize this prediction in a mathematical model of reading as a Bayesian update. Second, we test our operationalization by comparing participants’ reading times in conditions where words’ information quality has been reduced, either by occluding their top or bottom half, with full words. We collect data in English and Chinese. We then use multimodal language models to estimate the mutual information between visual inputs and words. We use these data to estimate the specific effect of reduced information quality on reading times. Finally, we compare how information is distributed across visual forms. In English and Chinese, the upper half contains more information about word identity than the lower half. However, the asymmetry is more pronounced in English, a pattern which is reflected in the reading times.
zh
[NLP-105] he Transfer Neurons Hypothesis: An Underlying Mechanism for Language Latent Space Transitions in Multilingual LLM s EMNLP2025
【速读】: 该论文旨在解决多语言大语言模型(Multilingual Large Language Models, LLMs)中跨语言表示转换的内部机制不明确的问题,特别是早期层如何将不同语言输入映射到共享语义空间、中间层在英语中心的潜在空间中进行推理、以及最终层如何重构为特定语言表示的动态过程。其解决方案的关键在于提出并实证验证了“迁移神经元假说”(Transfer Neurons Hypothesis),即MLP模块中的某些神经元负责在语言特异性潜在空间与共享语义潜在空间之间传递表示信息,且这些迁移神经元对多语言推理能力至关重要。
链接: https://arxiv.org/abs/2509.17030
作者: Hinata Tezuka,Naoya Inoue
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 57 pages, 47 figures and 41 tables; Accepted to EMNLP 2025 Main
Abstract:Recent studies have suggested a processing framework for multilingual inputs in decoder-based LLMs: early layers convert inputs into English-centric and language-agnostic representations; middle layers perform reasoning within an English-centric latent space; and final layers generate outputs by transforming these representations back into language-specific latent spaces. However, the internal dynamics of such transformation and the underlying mechanism remain underexplored. Towards a deeper understanding of this framework, we propose and empirically validate The Transfer Neurons Hypothesis: certain neurons in the MLP module are responsible for transferring representations between language-specific latent spaces and a shared semantic latent space. Furthermore, we show that one function of language-specific neurons, as identified in recent studies, is to facilitate movement between latent spaces. Finally, we show that transfer neurons are critical for reasoning in multilingual LLMs.
zh
[NLP-106] Advancing Speech Understanding in Speech-Aware Language Models with GRPO
【速读】: 该论文旨在解决如何在开放格式的语音理解任务(如语音问答和自动语音翻译)中有效训练语音感知大语言模型(Speech-Aware Large Language Models, SALLMs)的问题。现有方法主要聚焦于多选题等结构化任务,难以充分激发模型的生成能力。解决方案的关键在于采用基于组相对策略优化(Group Relative Policy Optimization, GRPO)的方法,并以BLEU作为奖励信号来优化SALLMs,从而在开放格式任务上显著优于传统的监督微调(Supervised Fine-Tuning, SFT)方法。此外,研究还探索了在GRPO框架中引入离策略样本的可能性,为未来提升性能提供了新方向。
链接: https://arxiv.org/abs/2509.16990
作者: Avishai Elmakies,Hagai Aronowitz,Nimrod Shabtay,Eli Schwartz,Ron Hoory,Avihu Dekel
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:In this paper, we introduce a Group Relative Policy Optimization (GRPO)-based method for training Speech-Aware Large Language Models (SALLMs) on open-format speech understanding tasks, such as Spoken Question Answering and Automatic Speech Translation. SALLMs have proven highly effective for speech understanding tasks. GRPO has recently gained traction for its efficiency in training LLMs, and prior work has explored its application to SALLMs, primarily in multiple-choice tasks. Building on this, we focus on open-format tasks that better reflect the generative abilities of the models. Our approach leverages GRPO with BLEU as the reward signal to optimize SALLMs, and we demonstrate empirically that it surpasses standard SFT across several key metrics. Finally, we explore the potential of incorporating off-policy samples within GRPO for these tasks, highlighting avenues for further improvement and further research.
zh
[NLP-107] Preference Distillation via Value based Reinforcement Learning
【速读】: 该论文旨在解决直接偏好优化(Direct Preference Optimization, DPO)在训练小规模语言模型时因二元胜负监督信号不足而导致性能受限的问题。现有方法如行为克隆或KL散度蒸馏通常仅关注模仿教师模型的当前行为,而忽略了对奖励建模信息的提取。解决方案的关键在于提出教师价值知识蒸馏(Teacher Value-based Knowledge Distillation, TVKD),其引入教师模型价值函数(value function)产生的辅助奖励作为软指导信号,并通过潜在奖励重塑(potential-based reward shaping)形式确保全局奖励结构和最优策略不变。该方法可无缝集成至标准DPO框架中,无需额外采样(rollouts),且在多种基准测试和模型规模下均表现出一致的性能提升。
链接: https://arxiv.org/abs/2509.16965
作者: Minchan Kwon,Junwon Ko,Kangil Kim,Junmo Kim
机构: Korea Advanced Institute of Science and Technology (KAIST); Gwangju Institute of Science and Technology (GIST)
类目: Computation and Language (cs.CL)
备注: 20 page
Abstract:Direct Preference Optimization (DPO) is a powerful paradigm to align language models with human preferences using pairwise comparisons. However, its binary win-or-loss supervision often proves insufficient for training small models with limited capacity. Prior works attempt to distill information from large teacher models using behavior cloning or KL divergence. These methods often focus on mimicking current behavior and overlook distilling reward modeling. To address this issue, we propose \textitTeacher Value-based Knowledge Distillation (TVKD), which introduces an auxiliary reward from the value function of the teacher model to provide a soft guide. This auxiliary reward is formulated to satisfy potential-based reward shaping, ensuring that the global reward structure and optimal policy of DPO are preserved. TVKD can be integrated into the standard DPO training framework and does not require additional rollouts. Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.
zh
[NLP-108] AirQA: A Comprehensive QA Dataset for AI Research with Instance-Level Evaluation
【速读】: 该论文旨在解决科研人员在海量学术论文中高效提取关键信息的难题,以及当前缺乏全面且贴近实际应用的基准测试来评估大语言模型(Large Language Models, LLMs)在科学论文问答(Question Answering, QA)任务中的能力。同时,由于高质量交互轨迹数据稀缺,训练具备交互能力的智能体也面临挑战。解决方案的关键在于:首先构建了一个由人工标注的综合性论文QA数据集AirQA,涵盖13,948篇人工智能领域的论文和1,246个问题,支持多任务、多模态及实例级评估;其次提出ExTrActor自动化指令数据合成框架,通过三个基于LLM的代理实现无监督的示例生成与轨迹收集,从而显著提升小型模型在多轮工具使用能力上的表现,使其性能可媲美大型模型。
链接: https://arxiv.org/abs/2509.16952
作者: Tiancheng Huang,Ruisheng Cao,Yuxin Zhang,Zhangyi Kang,Zijian Wang,Chenrun Wang,Yijie Luo,Hang Zheng,Lirong Qian,Lu Chen,Kai Yu
机构: MoE Key Lab of Artificial Intelligence (MoE重点实验室人工智能); X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University (上海交通大学计算机科学学院X-LANCE实验室); Jiangsu Key Lab of Language Computing (江苏省语言计算重点实验室); Suzhou Laboratory (苏州实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The growing volume of academic papers has made it increasingly difficult for researchers to efficiently extract key information. While large language models (LLMs) based agents are capable of automating question answering (QA) workflows for scientific papers, there still lacks a comprehensive and realistic benchmark to evaluate their capabilities. Moreover, training an interactive agent for this specific task is hindered by the shortage of high-quality interaction trajectories. In this work, we propose AirQA, a human-annotated comprehensive paper QA dataset in the field of artificial intelligence (AI), with 13,948 papers and 1,246 questions, that encompasses multi-task, multi-modal and instance-level evaluation. Furthermore, we propose ExTrActor, an automated framework for instruction data synthesis. With three LLM-based agents, ExTrActor can perform example generation and trajectory collection without human intervention. Evaluations of multiple open-source and proprietary models show that most models underperform on AirQA, demonstrating the quality of our dataset. Extensive experiments confirm that ExTrActor consistently improves the multi-turn tool-use capability of small models, enabling them to achieve performance comparable to larger ones.
zh
[NLP-109] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
【速读】: 该论文旨在解决当前软件工程基准测试(如SWE-BENCH)在模拟真实企业级复杂问题上的不足,特别是难以评估生成式AI(Generative AI)在长周期、多文件协同修改等专业场景下的能力。其解决方案的关键在于构建SWE-Bench Pro这一更具挑战性的基准测试集,包含1,865个来自41个活跃维护仓库的真实问题,涵盖商业应用、B2B服务和开发者工具等领域;该基准分为公开集、保留集和商业集三部分,其中商业集虽不公开但提供结果以确保评估的污染抗性;所有任务均经人工验证并附充分上下文以保障可解性,且任务设计要求专业级程序员需数小时至数日完成,从而更真实地反映实际软件开发复杂度。实验表明,当前主流编码模型在该基准上性能普遍低于25%(Pass@1),凸显了现有生成式AI在处理复杂软件工程任务时的显著局限。
链接: https://arxiv.org/abs/2509.16941
作者: Xiang Deng,Jeff Da,Edwin Pan,Yannis Yiming He,Charles Ide,Kanak Garg,Niklas Lauffer,Andrew Park,Nitin Pasari,Chetan Rane,Karmini Sampath,Maya Krishnan,Srivatsa Kundurthy,Sean Hendryx,Zifan Wang,Chen Bo Calvin Zhang,Noah Jacobson,Bing Liu,Brad Kenstler
机构: 未知
类目: oftware Engineering (cs.SE); Computation and Language (cs.CL)
备注:
Abstract:We introduce SWE-Bench Pro, a substantially more challenging benchmark that builds upon the best practices of SWE-BENCH [25], but is explicitly designed to capture realistic, complex, enterprise-level problems beyond the scope of SWE-BENCH. SWE-BENCH PRO contains 1,865 problems sourced from a diverse set of 41 actively maintained repositories spanning business applications, B2B services, and developer tools. The benchmark is partitioned into a public set with open access to problems sourced from 11 repositories, a held-out set of 12 repositories and a commercial set of 18 proprietary repositories where we have formal partnership agreements with early-stage startups. Problems in the held-out and the commercial set are not publicly accessible, but we release results on the commercial set. Our benchmark features long-horizon tasks that may require hours to days for a professional software engineer to complete, often involving patches across multiple files and substantial code modifications. All tasks are human-verified and augmented with sufficient context to ensure resolvability. In our evaluation of widely used coding models, under a unified scaffold, we observe that their performance on SWE-Bench PRO remains below 25% (Pass@1), with GPT-5 achieving the highest score to date at 23.3%. To better understand these limitations, we cluster the failure modes observed in the collected agent trajectories for a clearer characterization of the error patterns exhibited by current models. Overall, SWE-BENCH PRO provides a contamination-resistant testbed that more faithfully captures the complexity and diversity of real-world software development, advancing the pursuit of truly autonomous software engineering agents at a professional level.
zh
[NLP-110] K-DeCore: Facilitating Knowledge Transfer in Continual Structured Knowledge Reasoning via Knowledge Decoupling NEURIPS2025
【速读】: 该论文针对持续结构化知识推理(Continual Structured Knowledge Reasoning, CSKR)任务中现有通用持续学习方法面临的两大挑战展开研究:一是模型在面对异构结构化知识时泛化能力差,二是随着任务增加导致参数膨胀、推理效率下降。解决方案的关键在于提出一种名为 \textscK-DeCore 的新框架,其核心创新为引入知识解耦机制(knowledge decoupling mechanism),将推理过程分解为任务特定与任务无关两个阶段,从而有效弥合不同任务间的差异;在此基础上,进一步结合双视角记忆固化机制和结构引导的伪数据合成策略,显著提升模型在多任务场景下的泛化性能与参数效率。
链接: https://arxiv.org/abs/2509.16929
作者: Yongrui Chen,Yi Huang,Yunchang Liu,Shenyu Zhang,Junhao He,Tongtong Wu,Guilin Qi,Tianxing Wu
机构: Southeast University (东南大学); Ministry of Education (教育部); China Mobile Research Institute (中国移动研究院); Monash University (莫纳什大学)
类目: Computation and Language (cs.CL)
备注: Accepted in Neurips 2025 (poster)
Abstract:Continual Structured Knowledge Reasoning (CSKR) focuses on training models to handle sequential tasks, where each task involves translating natural language questions into structured queries grounded in structured knowledge. Existing general continual learning approaches face significant challenges when applied to this task, including poor generalization to heterogeneous structured knowledge and inefficient reasoning due to parameter growth as tasks increase. To address these limitations, we propose a novel CSKR framework, \textscK-DeCore, which operates with a fixed number of tunable parameters. Unlike prior methods, \textscK-DeCore introduces a knowledge decoupling mechanism that disentangles the reasoning process into task-specific and task-agnostic stages, effectively bridging the gaps across diverse tasks. Building on this foundation, \textscK-DeCore integrates a dual-perspective memory consolidation mechanism for distinct stages and introduces a structure-guided pseudo-data synthesis strategy to further enhance the model’s generalization capabilities. Extensive experiments on four benchmark datasets demonstrate the superiority of \textscK-DeCore over existing continual learning methods across multiple metrics, leveraging various backbone large language models.
zh
[NLP-111] CUTE: A Multilingual Dataset for Enhancing Cross-Lingual Knowledge Transfer in Low-Resource Languages
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在低资源语言(low-resource languages)上的性能不足问题,尤其是针对乌兹别克语和藏语等缺乏高质量训练语料的语言。其关键解决方案是构建并开源了一个多语言语料库CUTE,该语料库包含中文、英文、乌兹别克语和藏语的平行与非平行双25GB语料集,全部通过机器翻译生成。研究首先验证了中-乌兹别克语和中-藏语之间的机器翻译质量已达到与中-英语相当的水平,随后利用该语料库提升LLMs对低资源语言的处理能力,并系统探讨了语料平行性(corpus parallelism)在跨语言迁移学习中的作用。CUTE是目前针对乌兹别克语和藏语的最大公开语料库,为相关研究提供了重要基础。
链接: https://arxiv.org/abs/2509.16914
作者: Wenhao Zhuang,Yuan Sun
机构: Minzu University of China (中央民族大学); National Language Resource Monitoring & Research Center Minority Languages Branch (国家语言资源监测与研究少数民族语言分中心)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) demonstrate exceptional zero-shot capabilities in various NLP tasks, significantly enhancing user experience and efficiency. However, this advantage is primarily limited to resource-rich languages. For the diverse array of low-resource languages, support remains inadequate, with the scarcity of training corpora considered the primary cause. We construct and open-source CUTE Chinese, Uyghur, Tibetan,English dataset, consisting of two 25GB sets of four-language corpora (one parallel and one non-parallel), obtained through machine translation. CUTE encompasses two resource-rich languages (Chinese and English) and two low-resource languages (Uyghur and Tibetan). Prior to constructing CUTE, human assessment validates that the machine translation quality between Chinese-Uyghur and Chinese-Tibetan approaches that of Chinese-English translation. CUTE represents the largest open-source corpus for Uyghur and Tibetan languages to date, and we demonstrate its effectiveness in enhancing LLMs’ ability to process low-resource languages while investigating the role of corpus parallelism in cross-lingual transfer learning. The CUTE corpus and related models are made publicly available to the research community.
zh
[NLP-112] CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification
【速读】: 该论文旨在解决跨语言、跨框架的篇章关系分类(Discourse Relation Classification)问题,其核心挑战在于统一17类篇章关系标签在39个语料库、16种语言和6种篇章理论框架下的标注一致性与模型泛化能力。解决方案的关键在于提出一种分层双适配器对比学习模型(HiDAC),通过引入层次化双适配器结构与对比学习机制,在保持参数高效性的同时显著提升多语言、多形式主义场景下的分类性能;实验表明,HiDAC在仅微调顶层75%编码器层的情况下即可达到接近全量微调的效果,并实现最高准确率(67.5%),优于基于提示的大语言模型(prompt-based LLMs)和传统预训练模型微调策略。
链接: https://arxiv.org/abs/2509.16903
作者: Nawar Turk,Daniele Comitogianni,Leila Kosseim
机构: Concordia University (康考迪亚大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We present our submission to Task 3 (Discourse Relation Classification) of the DISRPT 2025 shared task. Task 3 introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks, posing significant multilingual and cross-formalism challenges. We first benchmark the task by fine-tuning multilingual BERT-based models (mBERT, XLM-RoBERTa-Base, and XLM-RoBERTa-Large) with two argument-ordering strategies and progressive unfreezing ratios to establish strong baselines. We then evaluate prompt-based large language models (namely Claude Opus 4.0) in zero-shot and few-shot settings to understand how LLMs respond to the newly proposed unified labels. Finally, we introduce HiDAC, a Hierarchical Dual-Adapter Contrastive learning model. Results show that while larger transformer models achieve higher accuracy, the improvements are modest, and that unfreezing the top 75% of encoder layers yields performance comparable to full fine-tuning while training far fewer parameters. Prompt-based models lag significantly behind fine-tuned transformers, and HiDAC achieves the highest overall accuracy (67.5%) while remaining more parameter-efficient than full fine-tuning.
zh
[NLP-113] DRES: Fake news detection by dynamic representation and ensemble selection EMNLP2025
【速读】: 该论文旨在解决社交媒体中基于文本的虚假新闻检测问题,其核心挑战在于如何提升检测模型在复杂多变文本特征下的准确性与鲁棒性。解决方案的关键在于提出一种动态表示与集成选择(Dynamic Representation and Ensemble Selection, DRES)方法:通过实例难度度量(instance hardness measures)评估每篇新闻文章在不同文本特征表示下的分类难度,并据此动态选择最适配的文本表示及最优的分类器集成组合,从而显著提升预测精度。该方法有效融合了表示选择与动态集成策略,验证了基于实例难度的自适应决策机制在增强模型性能方面的有效性。
链接: https://arxiv.org/abs/2509.16893
作者: Faramarz Farhangian,Leandro A. Ensina,George D. C. Cavalcanti,Rafael M. O. Cruz
机构: École de Technologie Supérieure (ÉTS-Montréal), Canada; Universidade Tecnológica Federal do Paraná (UTFPR), Brazil; Universidade Federal de Pernambuco (UFPE), Brazil
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted as oral presentation at EMNLP 2025
Abstract:The rapid spread of information via social media has made text-based fake news detection critically important due to its societal impact. This paper presents a novel detection method called Dynamic Representation and Ensemble Selection (DRES) for identifying fake news based solely on text. DRES leverages instance hardness measures to estimate the classification difficulty for each news article across multiple textual feature representations. By dynamically selecting the textual representation and the most competent ensemble of classifiers for each instance, DRES significantly enhances prediction accuracy. Extensive experiments show that DRES achieves notable improvements over state-of-the-art methods, confirming the effectiveness of representation selection based on instance hardness and dynamic ensemble selection in boosting performance. Codes and data are available at: this https URL
zh
[NLP-114] Can GRPO Boost Complex Multimodal Table Understanding? EMNLP2025
【速读】: 该论文旨在解决现有表格理解方法在复杂表格结构和复杂逻辑推理任务中表现不佳的问题,尤其是监督微调(Supervised Fine-Tuning, SFT)主导的范式难以应对初始策略准确性低与奖励稀疏性(reward sparsity)等挑战。其解决方案的关键在于提出一个三阶段强化学习(Reinforcement Learning, RL)框架 Table-R1:首先通过 Warm-up 阶段激发模型的基础感知与推理能力;其次引入 Perception Alignment GRPO(PA-GRPO),利用连续的 Tree-Edit-Distance Similarity(TEDS)奖励实现对表格结构与内容的精准识别;最后采用 Hint-Completion GRPO(HC-GRPO),基于提示引导的问题设计细粒度的残差步数奖励以提升推理精度。实验证明,Table-R1 有效缓解了初始化瓶颈与奖励稀疏问题,在保持性能的同时显著优于 SFT 和传统 GRPO 方法,甚至使 Qwen2-VL-7B 模型在内部测试集上达到接近闭源模型 GPT-4o 的水平。
链接: https://arxiv.org/abs/2509.16889
作者: Xiaoqiang Kang,Shengen Wu,Zimu Wang,Yilin Liu,Xiaobo Jin,Kaizhu Huang,Wei Wang,Yutao Yue,Xiaowei Huang,Qiufeng Wang
机构: School of Advanced Technology, Xi’an Jiaotong-Liverpool University (西安交通大学利物浦大学); Department of Computer Science, University of Liverpool (利物浦大学计算机科学系); Information Hub, Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)信息 hub); University of Southern California (南加州大学); Duke Kunshan University (昆山杜克大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025
Abstract:Existing table understanding methods face challenges due to complex table structures and intricate logical reasoning. While supervised finetuning (SFT) dominates existing research, reinforcement learning (RL), such as Group Relative Policy Optimization (GRPO), has shown promise but struggled with low initial policy accuracy and coarse rewards in tabular contexts. In this paper, we introduce Table-R1, a three-stage RL framework that enhances multimodal table understanding through: (1) Warm-up that prompts initial perception and reasoning capabilities, (2) Perception Alignment GRPO (PA-GRPO), which employs continuous Tree-Edit-Distance Similarity (TEDS) rewards for recognizing table structures and contents, and (3) Hint-Completion GRPO (HC-GRPO), which utilizes fine-grained rewards of residual steps based on the hint-guided question. Extensive experiments demonstrate that Table-R1 can boost the model’s table reasoning performance obviously on both held-in and held-out datasets, outperforming SFT and GRPO largely. Notably, Qwen2-VL-7B with Table-R1 surpasses larger specific table understanding models (e.g., Table-LLaVA 13B), even achieving comparable performance to the closed-source model GPT-4o on held-in datasets, demonstrating the efficacy of each stage of Table-R1 in overcoming initialization bottlenecks and reward sparsity, thereby advancing robust multimodal table understanding.
zh
[NLP-115] Dynamic Expert Specialization: Towards Catastrophic Forgetting-Free Multi-Domain MoE Adaptation EMNLP2025
【速读】: 该论文旨在解决多域适应场景下Mixture-of-Experts (MoE)模型因灾难性遗忘(catastrophic forgetting)而导致性能下降的问题。现有方法要么计算开销过大,要么存在跨域干扰,或需为每个领域单独训练。其解决方案的关键在于提出DES-MoE框架,通过三项创新实现动态专家专业化:(1) 采用基于蒸馏的自适应路由器,在保留预训练知识与任务特定更新之间取得平衡;(2) 实时构建专家-领域相关性映射以隔离领域特异性梯度;(3) 设计三阶段自适应微调策略,逐步冻结非专业化参数。该方法在六个不同领域(如数学、代码、法律等)上实现了与单域全参数微调相当的性能,同时显著减少遗忘(相比全微调降低89%),并加速68%的收敛速度。
链接: https://arxiv.org/abs/2509.16882
作者: Junzhuo Li,Bo Wang,Xiuze Zhou,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); The Hong Kong University of Science and Technology (香港科技大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2025 Main Conference
Abstract:Mixture-of-Experts (MoE) models offer immense capacity via sparsely gated expert subnetworks, yet adapting them to multiple domains without catastrophic forgetting remains an open challenge. Existing approaches either incur prohibitive computation, suffer cross-domain interference, or require separate runs per domain. We propose DES-MoE, a dynamic expert specialization framework for multi-domain adaptation of Mixture-of-Experts models. DES-MoE addresses catastrophic forgetting through three innovations: (1) an adaptive router balancing pre-trained knowledge retention and task-specific updates via distillation, (2) real-time expert-domain correlation mapping to isolate domain-specific gradients, and (3) a three-phase adaptive fine-tuning schedule that progressively freezes non-specialized parameters. Evaluated on six domains (math, code, law, etc.), DES-MoE matches single-domain ESFT performance while training one unified model, reduces forgetting by 89% compared to full fine-tuning as domains scale from 2 to 6, and achieves 68% faster convergence than conventional methods. Our work establishes dynamic expert isolation as a scalable paradigm for multi-task MoE adaptation.
zh
[NLP-116] Multi-task Pretraining for Enhancing Interpretable L2 Pronunciation Assessment
【速读】: 该论文旨在解决自动发音评估(APA)系统中因过度依赖音段级特征而导致超音段特征(supra-segmental cues)被忽视的问题,以及现有APA系统与自动口语评估(ASA)缺乏整合、难以实现全面语言能力评价的局限。其解决方案的关键在于引入多任务预训练(MTP),通过随机掩码音段级发音特征并基于上下文重建这些特征,从而捕捉长期时序发音线索并强化话语内部结构;同时,融合人工设计特征(HCFs),如流利度(语音速率、停顿时长)和重音强度(音高强调程度),借助回归器生成可解释的综合语言能力评分,显著提升了发音评分准确性与ASA相关性。
链接: https://arxiv.org/abs/2509.16876
作者: Jiun-Ting Li,Bi-Cheng Yan,Yi-Cheng Wang,Berlin Chen
机构: Chunghwa Telecom Co., Ltd. (中华电信公司); National Taiwan Normal University (台湾师范大学); National Taiwan University (台湾大学)
类目: Computation and Language (cs.CL)
备注: Accepted by APSIPA-ASC 2025
Abstract:Automatic pronunciation assessment (APA) analyzes second-language (L2) learners’ speech by providing fine-grained pronunciation feedback at various linguistic levels. Most existing efforts on APA typically adopt segmental-level features as inputs and predict pronunciation scores at different granularities via hierarchical (or parallel) pronunciation modeling. This, however, inevitably causes assessments across linguistic levels (e.g., phone, word, and utterance) to rely solely on phoneme-level pronunciation features, nearly sidelining supra-segmental pronunciation cues. To address this limitation, we introduce multi-task pretraining (MTP) for APA, a simple yet effective strategy that attempts to capture long-term temporal pronunciation cues while strengthening the intrinsic structures within an utterance via the objective of reconstructing input features. Specifically, for a phoneme-level encoder of an APA model, the proposed MTP strategy randomly masks segmental-level pronunciation features and reconstructs the masked ones based on their surrounding pronunciation context. Furthermore, current APA systems lack integration with automated speaking assessment (ASA), limiting holistic proficiency evaluation. Drawing on empirical studies and prior knowledge in ASA, our framework bridges this gap by incorporating handcrafted features (HCFs), such as fluency (speech rate, silence duration) and stress (pitch accent strength), derived from human-designed formulas via regressors to generate interpretable proficiency scores. Experiments on speechocean762 show improved pronunciation scoring and ASA proficiency correlation, enabling targeted training and comprehensive proficiency assessment.
zh
[NLP-117] seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLM s
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)在序列推理能力上的局限性问题,尤其是其在处理需要多步逻辑推导、状态回溯和噪声干扰的任务时表现不佳的系统性缺陷。解决方案的关键在于提出seqBench——一个参数化基准测试框架,通过精细控制三个核心复杂度维度:(1)逻辑深度(logical depth),即完成任务所需的顺序动作数;(2)回溯步数(backtracking steps),量化为满足延迟前提条件而需重访先前状态的频率;(3)噪声比(noise ratio),即支持性事实与干扰性事实的比例。这种多维可控的设计使研究者能够精准定位LLMs在不同推理层级上的失败模式,并揭示其准确率随逻辑深度呈指数衰减的通用规律,从而推动对模型真实推理能力边界的理解与改进。
链接: https://arxiv.org/abs/2509.16866
作者: Mohammad Ramezanali,Mo Vazifeh,Paolo Santi
机构: Salesforce AI (Salesforce人工智能); Capital One; MIT (麻省理工学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation of (1) the logical depth, defined as the number of sequential actions required to solve the task; (2) the number of backtracking steps along the optimal path, quantifying how often the agent must revisit prior states to satisfy deferred preconditions (e.g., retrieving a key after encountering a locked door); and (3) the noise ratio, defined as the ratio between supporting and distracting facts about the environment. Our evaluations on state-of-the-art LLMs reveal a universal failure pattern: accuracy collapses exponentially beyond a model-specific logical depth. Unlike existing benchmarks, seqBench’s fine-grained control facilitates targeted analyses of these reasoning failures, illuminating universal scaling laws and statistical limits, as detailed in this paper alongside its generation methodology and evaluation metrics. We find that even top-performing models systematically fail on seqBench’s structured reasoning tasks despite minimal search complexity, underscoring key limitations in their commonsense reasoning capabilities. Designed for future evolution to keep pace with advancing models, the seqBench datasets are publicly released to spur deeper scientific inquiry into LLM reasoning, aiming to establish a clearer understanding of their true potential and current boundaries for robust real-world application.
zh
[NLP-118] Semantic-Driven Topic Modeling for Analyzing Creativity in Virtual Brainstorming
【速读】: 该论文旨在解决虚拟头脑风暴(Virtual Brainstorming)中因想法数量庞大且分布不均而导致有价值洞察难以高效提取的问题。传统人工编码方法存在耗时和主观性强的缺陷,因此亟需自动化手段以支持群体创造力的评估。其解决方案的关键在于提出一种基于语义驱动的主题建模框架,整合了四个模块化组件:基于Transformer的句向量表示(Sentence-BERT)、降维(UMAP)、聚类(HDBSCAN)以及主题提取与优化,从而在句子层面捕捉语义相似性,实现从头脑风暴转录文本中发现一致主题、过滤噪声并识别异常值,最终在主题一致性指标上显著优于LDA、ETM和BERTopic等基线方法,同时提供可解释的群体创意多样性与深度分析。
链接: https://arxiv.org/abs/2509.16835
作者: Melkamu Abay Mersha,Jugal Kalita
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Virtual brainstorming sessions have become a central component of collaborative problem solving, yet the large volume and uneven distribution of ideas often make it difficult to extract valuable insights efficiently. Manual coding of ideas is time-consuming and subjective, underscoring the need for automated approaches to support the evaluation of group creativity. In this study, we propose a semantic-driven topic modeling framework that integrates four modular components: transformer-based embeddings (Sentence-BERT), dimensionality reduction (UMAP), clustering (HDBSCAN), and topic extraction with refinement. The framework captures semantic similarity at the sentence level, enabling the discovery of coherent themes from brainstorming transcripts while filtering noise and identifying outliers. We evaluate our approach on structured Zoom brainstorming sessions involving student groups tasked with improving their university. Results demonstrate that our model achieves higher topic coherence compared to established methods such as LDA, ETM, and BERTopic, with an average coherence score of 0.687 (CV), outperforming baselines by a significant margin. Beyond improved performance, the model provides interpretable insights into the depth and diversity of topics explored, supporting both convergent and divergent dimensions of group creativity. This work highlights the potential of embedding-based topic modeling for analyzing collaborative ideation and contributes an efficient and scalable framework for studying creativity in synchronous virtual meetings.
zh
[NLP-119] Cognitive Linguistic Identity Fusion Score (CLIFS): A Scalable Cognition-Informed Approach to Quantifying Identity Fusion from Text EMNLP2025
【速读】: 该论文旨在解决如何高效、准确地量化身份融合(identity fusion)这一心理现象的问题,身份融合指个体在认知上将自我与某一实体或抽象目标(如宗教群体、政党、意识形态等)发生心理上的融合。传统方法依赖于受控问卷调查或直接田野接触,存在效率低、可扩展性差等问题。其解决方案的关键在于提出一种基于认知语言学与大语言模型(LLMs)结合的新型指标——认知语言身份融合评分(Cognitive Linguistic Identity Fusion Score, CLIFS),该方法通过隐喻检测机制实现自动化评估,无需人工干预即可获得与传统 verbal measure 高度一致的结果,并在暴力风险预测任务中表现出显著优于现有方法的能力(提升超240%)。
链接: https://arxiv.org/abs/2509.16813
作者: Devin R. Wright,Jisun An,Yong-Yeol Ahn
机构: Center for Complex Networks and Systems Research, Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington (印第安纳大学布卢明顿分校信息、计算与工程学院复杂网络与系统研究中心); Cognitive Science Program, Indiana University Bloomington (印第安纳大学布卢明顿分校认知科学项目); School of Data Science, University of Virginia (弗吉尼亚大学数据科学学院); CulturePulse, Inc.
类目: Computation and Language (cs.CL)
备注: Authors’ accepted manuscript (postprint; camera-ready). To appear in the Proceedings of EMNLP 2025. Pagination/footer layout may differ from the Version of Record
Abstract:Quantifying identity fusion – the psychological merging of self with another entity or abstract target (e.g., a religious group, political party, ideology, value, brand, belief, etc.) – is vital for understanding a wide range of group-based human behaviors. We introduce the Cognitive Linguistic Identity Fusion Score (CLIFS), a novel metric that integrates cognitive linguistics with large language models (LLMs), which builds on implicit metaphor detection. Unlike traditional pictorial and verbal scales, which require controlled surveys or direct field contact, CLIFS delivers fully automated, scalable assessments while maintaining strong alignment with the established verbal measure. In benchmarks, CLIFS outperforms both existing automated approaches and human annotation. As a proof of concept, we apply CLIFS to violence risk assessment to demonstrate that it can improve violence risk assessment by more than 240%. Building on our identification of a new NLP task and early success, we underscore the need to develop larger, more diverse datasets that encompass additional fusion-target domains and cultural backgrounds to enhance generalizability and further advance this emerging area. CLIFS models and code are public at this https URL.
zh
[NLP-120] KuBERT: Central Kurdish BERT Model and Its Application for Sentiment Analysis
【速读】: 该论文旨在解决中央库尔德语(Central Kurdish)在情感分析任务中因资源匮乏和语言多样性高而导致的挑战问题。其解决方案的关键在于引入预训练语言模型BERT(Bidirectional Encoder Representations from Transformers),利用其强大的双向上下文建模能力,相较于传统的Word2Vec等词嵌入方法,能够更精准地捕捉库尔德语的语义细微差别和上下文特征,从而为低资源语言的情感分析建立新的基准。
链接: https://arxiv.org/abs/2509.16804
作者: Kozhin muhealddin Awlla,Hadi Veisi,Abdulhady Abas Abdullah
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:This paper enhances the study of sentiment analysis for the Central Kurdish language by integrating the Bidirectional Encoder Representations from Transformers (BERT) into Natural Language Processing techniques. Kurdish is a low-resourced language, having a high level of linguistic diversity with minimal computational resources, making sentiment analysis somewhat challenging. Earlier, this was done using a traditional word embedding model, such as Word2Vec, but with the emergence of new language models, specifically BERT, there is hope for improvements. The better word embedding capabilities of BERT lend to this study, aiding in the capturing of the nuanced semantic pool and the contextual intricacies of the language under study, the Kurdish language, thus setting a new benchmark for sentiment analysis in low-resource languages.
zh
[NLP-121] Domain-Adaptive Pre-Training for Arabic Aspect-Based Sentiment Analysis: A Comparative Study of Domain Adaptation and Fine-Tuning Strategies
【速读】: 该论文旨在解决阿拉伯语领域中方面情感分析(Aspect-based Sentiment Analysis, ABSA)因标注数据稀缺而导致深度学习模型性能受限的问题。现有方法多依赖基于事实的预训练语言模型(如BERT),但这类模型在特定领域任务中易引入偏差,且尚未有研究探索针对阿拉伯语的领域自适应预训练策略。论文的关键解决方案是提出一种基于领域自适应预训练的方法,用于方面情感分类(Aspect-Sentiment Classification, ASC)和观点目标提取(Opinion Target Extraction, OTE)任务,并系统比较特征提取、全量微调与适配器(Adapter-based)微调三种策略,以提升模型性能与计算效率。实验表明,领域内自适应预训练带来适度改进,而适配器微调在保持竞争力的同时显著降低计算开销;同时,错误分析揭示了当前模型在情感标签误判、对比标记理解偏差、多词目标识别困难等方面存在的局限性,强调未来需引入语法和语义感知机制(如图卷积网络)以更好地建模长距离依赖与复杂方面-情感对齐关系。
链接: https://arxiv.org/abs/2509.16788
作者: Salha Alyami,Amani Jamal,Areej Alhothali
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 26 excluding bibliography , journal article
Abstract:Aspect-based sentiment analysis (ABSA) in natural language processing enables organizations to understand customer opinions on specific product aspects. While deep learning models are widely used for English ABSA, their application in Arabic is limited due to the scarcity of labeled data. Researchers have attempted to tackle this issue by using pre-trained contextualized language models such as BERT. However, these models are often based on fact-based data, which can introduce bias in domain-specific tasks like ABSA. To our knowledge, no studies have applied adaptive pre-training with Arabic contextualized models for ABSA. This research proposes a novel approach using domain-adaptive pre-training for aspect-sentiment classification (ASC) and opinion target expression (OTE) extraction. We examine fine-tuning strategies - feature extraction, full fine-tuning, and adapter-based methods - to enhance performance and efficiency, utilizing multiple adaptation corpora and contextualized models. Our results show that in-domain adaptive pre-training yields modest improvements. Adapter-based fine-tuning is a computationally efficient method that achieves competitive results. However, error analyses reveal issues with model predictions and dataset labeling. In ASC, common problems include incorrect sentiment labeling, misinterpretation of contrastive markers, positivity bias for early terms, and challenges with conflicting opinions and subword tokenization. For OTE, issues involve mislabeling targets, confusion over syntactic roles, difficulty with multi-word expressions, and reliance on shallow heuristics. These findings underscore the need for syntax- and semantics-aware models, such as graph convolutional networks, to more effectively capture long-distance relations and complex aspect-based opinion alignments.
zh
[NLP-122] MoRoVoc: A Large Dataset for Geographical Variation Identification of the Spoken Romanian Language EMNLP
【速读】: 该论文旨在解决语音识别模型在处理区域方言差异时存在的偏差问题,即模型对不同地区(如罗马尼亚和摩尔多瓦)的罗马尼亚语发音差异敏感,从而影响其泛化能力。解决方案的关键在于提出一种多目标对抗训练框架,将人口统计学特征(如年龄和性别)作为对抗目标引入模型训练中,使模型在主任务(如方言识别)上保持高精度的同时,对次要属性(如性别或年龄)具有不变性。通过元学习动态调整对抗系数,优化模型性能,实验表明该方法显著提升了模型在方言识别和性别分类任务中的准确率。
链接: https://arxiv.org/abs/2509.16781
作者: Andrei-Marius Avram,Ema-Ioana Bănescu,Anda-Teodora Robea,Dumitru-Clementin Cercel,Mihaela-Claudia Cercel
机构: National University of Science and Technology POLITEHNICA Bucharest (布加勒斯特理工大学); Paris 1 Panthéon-Sorbonne University (巴黎第一大学); University of Bucharest (布加勒斯特大学)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP Findings 2025
Abstract:This paper introduces MoRoVoc, the largest dataset for analyzing the regional variation of spoken Romanian. It has more than 93 hours of audio and 88,192 audio samples, balanced between the Romanian language spoken in Romania and the Republic of Moldova. We further propose a multi-target adversarial training framework for speech models that incorporates demographic attributes (i.e., age and gender of the speakers) as adversarial targets, making models discriminative for primary tasks while remaining invariant to secondary attributes. The adversarial coefficients are dynamically adjusted via meta-learning to optimize performance. Our approach yields notable gains: Wav2Vec2-Base achieves 78.21% accuracy for the variation identification of spoken Romanian using gender as an adversarial target, while Wav2Vec2-Large reaches 93.08% accuracy for gender classification when employing both dialect and age as adversarial objectives.
zh
[NLP-123] Geometric Mixture Classifier (GMC): A Discriminative Per-Class Mixture of Hyperplanes
【速读】: 该论文旨在解决多模态数据分类中传统线性模型表现不佳、而高容量模型(如核支持向量机或深度神经网络)虽能拟合复杂结构但缺乏可解释性、调参复杂且计算开销大的问题。其核心解决方案是提出几何混合分类器(Geometric Mixture Classifier, GMC),关键在于将每个类别建模为多个超平面(hyperplane)的混合,通过温度控制的软-或(soft-OR,即log-sum-exp)机制融合同一类内各超平面得分,实现对多模态分布的平滑近似;跨类别则采用标准softmax生成概率后验。GMC在保持推理阶段线性计算复杂度的同时,具备几何可解释性(可通过每平面和类责任可视化分析决策逻辑),并结合一系列实用训练技巧(如基于轮廓系数的平面预算、标签平滑、早期停止等)实现端到端高效训练与部署。
链接: https://arxiv.org/abs/2509.16769
作者: Prasanth K K,Shubham Sharma
机构: SunitechAI
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 21 pages, 6 figures, 14 tables
Abstract:Many real world categories are multimodal, with single classes occupying disjoint regions in feature space. Classical linear models (logistic regression, linear SVM) use a single global hyperplane and perform poorly on such data, while high-capacity methods (kernel SVMs, deep nets) fit multimodal structure but at the expense of interpretability, heavier tuning, and higher computational cost. We propose the Geometric Mixture Classifier (GMC), a discriminative model that represents each class as a mixture of hyperplanes. Within each class, GMC combines plane scores via a temperature-controlled soft-OR (log-sum-exp), smoothly approximating the max; across classes, standard softmax yields probabilistic posteriors. GMC optionally uses Random Fourier Features (RFF) for nonlinear mappings while keeping inference linear in the number of planes and features. Our practical training recipe: geometry-aware k-means initialization, silhouette-based plane budgeting, alpha annealing, usage-aware L2 regularization, label smoothing, and early stopping, makes GMC plug-and-play. Across synthetic multimodal datasets (moons, circles, blobs, spirals) and tabular/image benchmarks (iris, wine, WDBC, digits), GMC consistently outperforms linear baselines and k-NN, is competitive with RBF-SVM, Random Forests, and small MLPs, and provides geometric introspection via per-plane and class responsibility visualizations. Inference scales linearly in planes and features, making GMC CPU-friendly, with single-digit microsecond latency per example, often faster than RBF-SVM and compact MLPs. Post-hoc temperature scaling reduces ECE from about 0.06 to 0.02. GMC thus strikes a favorable balance of accuracy, interpretability, and efficiency: it is more expressive than linear models and lighter, more transparent, and faster than kernel or deep models.
zh
[NLP-124] he Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology EMNLP2025
【速读】: 该论文旨在解决儿童言语障碍临床干预中专业人员(Speech-Language Pathologists, SLPs)严重短缺的问题,探索生成式多模态语言模型(Multimodal Language Models, MLMs)在高风险临床场景中的应用潜力与局限性。其解决方案的关键在于:首先,与领域专家合作构建了MLMs在言语病理学中的真实应用场景分类体系;其次,基于该分类体系设计并发布了首个涵盖五个核心任务、每项任务包含1000条人工标注数据的综合性基准测试,该基准还包含噪声环境、说话人性别和口音等鲁棒性与敏感性测试;最终通过评估15个前沿MLM发现,单一模型无法在所有任务中持续领先,且存在系统性偏差(如对男性说话者表现更优),同时提出通过领域特定数据微调可显著提升性能(超过30%),从而为后续针对性优化和临床部署提供实证依据。
链接: https://arxiv.org/abs/2509.16765
作者: Fagun Patel,Duc Q. Nguyen,Sang T. Truong,Jody Vaynshtok,Sanmi Koyejo,Nick Haber
机构: Stanford University (斯坦福大学); National University of Singapore (新加坡国立大学); Sound Speech and Hearing Clinic (声音、言语和听力诊所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: EMNLP 2025 Oral Presentation
Abstract:According to the U.S. National Institutes of Health, more than 3.4 million children experience speech disorders that require clinical intervention. The number of speech-language pathologists (SLPs) is roughly 20 times fewer than the number of affected children, highlighting a significant gap in children’s care and a pressing need for technological support that improves the productivity of SLPs. State-of-the-art multimodal language models (MLMs) show promise for supporting SLPs, but their use remains underexplored largely due to a limited understanding of their performance in high-stakes clinical settings. To address this gap, we collaborate with domain experts to develop a taxonomy of real-world use cases of MLMs in speech-language pathologies. Building on this taxonomy, we introduce the first comprehensive benchmark for evaluating MLM across five core use cases, each containing 1,000 manually annotated data points. This benchmark includes robustness and sensitivity tests under various settings, including background noise, speaker gender, and accent. Our evaluation of 15 state-of-the-art MLMs reveals that no single model consistently outperforms others across all tasks. Notably, we find systematic disparities, with models performing better on male speakers, and observe that chain-of-thought prompting can degrade performance on classification tasks with large label spaces and narrow decision boundaries. Furthermore, we study fine-tuning MLMs on domain-specific data, achieving improvements of over 30% compared to base models. These findings highlight both the potential and limitations of current MLMs for speech-language pathology applications, underscoring the need for further research and targeted development.
zh
[NLP-125] Angular Dispersion Accelerates k-Nearest Neighbors Machine Translation
【速读】: 该论文旨在解决基于近似k近邻机器翻译(k-nearest neighbors machine translation, k-NN MT)在解码时存在的计算成本高和内存占用大的问题,尤其是在大规模数据存储中进行高效检索的瓶颈。其解决方案的关键在于通过提升神经隐藏表示(neural hidden representations)在角度空间中的分散性(angular dispersion),从而改善近似k-NN查找数据结构的负载均衡性,进而加速检索过程并略微提升翻译性能。
链接: https://arxiv.org/abs/2509.16729
作者: Evgeniia Tokarchuk,Sergey Troshin,Vlad Niculae
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Augmenting neural machine translation with external memory at decoding time, in the form of k-nearest neighbors machine translation ( k -NN MT), is a well-established strategy for increasing translation performance. k -NN MT retrieves a set of tokens that occurred in the most similar contexts recorded in a prepared data store, using hidden state representations of translation contexts as vector lookup keys. One of the main disadvantages of this method is the high computational cost and memory requirements. Since an exhaustive search is not feasible in large data stores, practitioners commonly use approximate k -NN MT lookup, yet even such algorithms are a bottleneck. In contrast to research directions seeking to accelerate k -NN MT by reducing data store size or the number of lookup calls, we pursue an orthogonal direction based on the performance properties of approximate k -NN MT lookup data structures. In particular, we propose to encourage angular dispersion of the neural hidden representations of contexts. We show that improving dispersion leads to better balance in the retrieval data structures, accelerating retrieval and slightly improving translations.
zh
[NLP-126] A Multi-Level Benchmark for Causal Language Understanding in Social Media Discourse
【速读】: 该论文旨在解决自然语言处理(NLP)中关于非正式语境下因果语言理解的挑战,尤其是现有数据集多聚焦于结构化文本中的显式因果关系,难以支持对社交媒体等非正式场景中隐式因果表达的检测。解决方案的关键在于构建CausalTalk——一个涵盖2020至2024年五年Reddit帖子的多层级因果数据集,共10,120篇与新冠疫情相关的帖子被标注于四个任务:二分类因果判断、显式/隐式因果区分、因果成分抽取及因果主旨生成;其标注体系融合领域专家创建的黄金标准标签与GPT-4o生成并经人工验证的银标准标签,从而实现了细粒度因果识别与基于主旨的推理能力,并支持判别式与生成式模型的联合基准测试,填补了社交语境下因果推理研究的数据空白。
链接: https://arxiv.org/abs/2509.16722
作者: Xiaohan Ding,Kaike Ping,Buse Çarık,Eugenia Rho
机构: Virginia Tech (弗吉尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Understanding causal language in informal discourse is a core yet underexplored challenge in NLP. Existing datasets largely focus on explicit causality in structured text, providing limited support for detecting implicit causal expressions, particularly those found in informal, user-generated social media posts. We introduce CausalTalk, a multi-level dataset of five years of Reddit posts (2020-2024) discussing public health related to the COVID-19 pandemic, among which 10120 posts are annotated across four causal tasks: (1) binary causal classification, (2) explicit vs. implicit causality, (3) cause-effect span extraction, and (4) causal gist generation. Annotations comprise both gold-standard labels created by domain experts and silver-standard labels generated by GPT-4o and verified by human annotators. CausalTalk bridges fine-grained causal detection and gist-based reasoning over informal text. It enables benchmarking across both discriminative and generative models, and provides a rich resource for studying causal reasoning in social media contexts.
zh
[NLP-127] me to Revist Exact Match EMNLP2025
【速读】: 该论文旨在解决当前大语言模型在时间问答(Temporal Question Answering, TQA)任务中评估指标不合理的问题,即使用精确匹配(Exact Match, EM)作为评价标准时无法区分模型预测结果中数值误差的大小,从而掩盖了模型在时间推理能力上的真实表现。其解决方案的关键在于将TQA重新建模为数值估计任务,并引入基于时间序列预测的量化指标——对称平均绝对百分比误差(sMAPE)和平均绝对缩放误差(MASE),以更精细地衡量模型输出与真实值之间的偏差程度。通过构建TempAnswerQA基准测试集并对比不同模型在EM、sMAPE和MASE下的表现,作者发现EM与实际误差之间存在显著脱节,且MASE能揭示出仅靠EM无法识别的模型在时间领域知识理解上的缺陷,尤其是那些使用合成数据训练的模型。这表明,针对TQA任务需采用专门设计的数值型评估指标,而非传统文本匹配方法。
链接: https://arxiv.org/abs/2509.16720
作者: Auss Abbood,Zaiqiao Meng,Nigel Collier
机构: University of Cambridge (剑桥大学); University of Glasgow (格拉斯哥大学)
类目: Computation and Language (cs.CL)
备注: Accepted for Findings of EMNLP 2025
Abstract:Temporal question answering is an established method for evaluating temporal reasoning in large language models. Expected answers are often numeric (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), unable to distinguish small from large errors. In this investigative work, we frame temporal question answering as a numerical estimation task to assess the shortcomings of EM. We introduce TempAnswerQA, a benchmark distilled from Test of Time and TempTabQA, where all questions require a numerical, temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled. Models with low EM still have low sMAPE (both ~20%), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground truth data with MASE reshuffles model rankings compared to EM, revealing gaps in models’ understanding of temporal domain knowledge, especially when trained with synthetic data. Lastly, the models’ most frequent error is to deviate by only \pm1 from the ground truth. sMAPE and MASE, unlike EM, adequately weight these errors. Our findings underscore the need for specialised metrics for temporal QA tasks. Code and data are available on this https URL.
zh
[NLP-128] Idiosyncratic Versus Normative Modeling of Atypical Speech Recognition: Dysarthric Case Studies EMNLP2025
【速读】: 该论文旨在解决当前自动语音识别(ASR)模型在处理非典型言语(如肌张力障碍者发音)时性能显著下降的问题。传统方法多依赖完全个性化(idiosyncratic)建模,但难以平衡泛化能力与个体差异的捕捉。解决方案的关键在于提出一种融合策略——“肌张力障碍-个体特异性”(dysarthric-idiosyncratic)模型,即先学习跨说话者的规范模式(normative patterns),再针对性地适应个体差异。实验证明,该方法在仅使用不到一半个性化数据(128样本)的情况下,词错误率(WER)达到36.43%,优于纯个体模型(36.99%),且仅微调语音编码器(speech encoder)即可将平均WER从71%降至32%,凸显了同时利用规范性(cross-speaker)和个体特异性(speaker-specific)模式对提升少数群体语音识别性能的重要性。
链接: https://arxiv.org/abs/2509.16718
作者: Vishnu Raja,Adithya V Ganesan,Anand Syamkumar,Ritwik Banerjee,H Andrew Schwartz
机构: Stony Brook University (石溪大学); Vanderbilt University (范德堡大学)
类目: ound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Will appear in EMNLP 2025 Main Proceedings
Abstract:State-of-the-art automatic speech recognition (ASR) models like Whisper, perform poorly on atypical speech, such as that produced by individuals with dysarthria. Past works for atypical speech have mostly investigated fully personalized (or idiosyncratic) models, but modeling strategies that can both generalize and handle idiosyncracy could be more effective for capturing atypical speech. To investigate this, we compare four strategies: (a) \textitnormative models trained on typical speech (no personalization), (b) \textitidiosyncratic models completely personalized to individuals, © \textitdysarthric-normative models trained on other dysarthric speakers, and (d) \textitdysarthric-idiosyncratic models which combine strategies by first modeling normative patterns before adapting to individual speech. In this case study, we find the dysarthric-idiosyncratic model performs better than idiosyncratic approach while requiring less than half as much personalized data (36.43 WER with 128 train size vs 36.99 with 256). Further, we found that tuning the speech encoder alone (as opposed to the LM decoder) yielded the best results reducing word error rate from 71% to 32% on average. Our findings highlight the value of leveraging both normative (cross-speaker) and idiosyncratic (speaker-specific) patterns to improve ASR for underrepresented speech populations.
zh
[NLP-129] Semi-Supervised Synthetic Data Generation with Fine-Grained Relevance Control for Short Video Search Relevance Modeling AAAI2026
【速读】: 该论文旨在解决现有基于提示(prompt-based)的合成数据方法在数据稀缺领域难以捕捉领域特异性数据分布,且忽视细粒度相关性多样性的问题。其解决方案的关键在于提出了一种半监督的合成数据生成管道,通过两个协同训练的模型生成具有可控相关性标签的域自适应短视频数据,并特别针对低频中间相关性标签进行样本补充,从而提升训练数据的相关性层级多样性与语义丰富性。实验表明,该方法显著优于传统提示生成或纯监督微调(SFT)的数据策略,并在抖音双列推荐场景中通过在线A/B测试验证了其有效性,包括点击率(CTR)提升1.45%、强相关比例(SRR)提高4.9%以及图像用户渗透率(IUPR)改善0.1054%。
链接: https://arxiv.org/abs/2509.16717
作者: Haoran Li,Zhiming Su,Junyan Yao,Enwei Zhang,Yang Ji,Yan Chen,Kan Zhou,Chao Feng,Jiao Ran
机构: 1: University of Science and Technology of China (中国科学技术大学); 2: Alibaba Group (阿里巴巴集团)
类目: Computation and Language (cs.CL)
备注: Submitted to AAAI 2026
Abstract:Synthetic data is widely adopted in embedding models to ensure diversity in training data distributions across dimensions such as difficulty, length, and language. However, existing prompt-based synthesis methods struggle to capture domain-specific data distributions, particularly in data-scarce domains, and often overlook fine-grained relevance diversity. In this paper, we present a Chinese short video dataset with 4-level relevance annotations, filling a critical resource void. Further, we propose a semi-supervised synthetic data pipeline where two collaboratively trained models generate domain-adaptive short video data with controllable relevance labels. Our method enhances relevance-level diversity by synthesizing samples for underrepresented intermediate relevance labels, resulting in a more balanced and semantically rich training data set. Extensive offline experiments show that the embedding model trained on our synthesized data outperforms those using data generated based on prompting or vanilla supervised fine-tuning(SFT). Moreover, we demonstrate that incorporating more diverse fine-grained relevance levels in training data enhances the model’s sensitivity to subtle semantic distinctions, highlighting the value of fine-grained relevance supervision in embedding learning. In the search enhanced recommendation pipeline of Douyin’s dual-column scenario, through online A/B testing, the proposed model increased click-through rate(CTR) by 1.45%, raised the proportion of Strong Relevance Ratio (SRR) by 4.9%, and improved the Image User Penetration Rate (IUPR) by 0.1054%.
zh
[NLP-130] OPEN-THEATRE: An Open-Source Toolkit for LLM -based Interactive Drama EMNLP2025
【速读】: 该论文旨在解决LLM-based Interactive Drama(基于大语言模型的互动戏剧)领域缺乏可复现、可扩展且功能完整的开发平台的问题,从而阻碍了该方向的研究进展。其解决方案的关键在于提出Open-Theatre——首个开源工具包,通过高效的多智能体架构(multi-agent architecture)和分层检索式记忆系统(hierarchical retrieval-based memory system),显著提升复杂交互中的叙事连贯性与长期行为的真实性,并提供高度可配置的流水线以支持研究人员快速开发和优化新方法。
链接: https://arxiv.org/abs/2509.16713
作者: Tianyang Xu,Hongqiu Wu,Weiqi Wu,Hai Zhao
机构: UM–SJTU Joint Institute (UM–上海交通大学联合学院); AGI Institute, School of Computer Science, Shanghai Jiao Tong University (上海交通大学计算机科学学院AGI研究所); Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering (上海市教育委员会智能交互与认知工程重点实验室); Shanghai Key Laboratory of Trusted Data Circulation and Governance in Web3 (上海市Web3可信数据流通与治理重点实验室)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 demo
Abstract:LLM-based Interactive Drama introduces a novel dialogue scenario in which the player immerses into a character and engages in a dramatic story by interacting with LLM agents. Despite the fact that this emerging area holds significant promise, it remains largely underexplored due to the lack of a well-designed playground to develop a complete drama. This makes a significant barrier for researchers to replicate, extend, and study such systems. Hence, we present Open-Theatre, the first open-source toolkit for experiencing and customizing LLM-based interactive drama. It refines prior work with an efficient multi-agent architecture and a hierarchical retrieval-based memory system, designed to enhance narrative coherence and realistic long-term behavior in complex interactions. In addition, we provide a highly configurable pipeline, making it easy for researchers to develop and optimize new approaches.
zh
[NLP-131] Decoding Uncertainty: The Impact of Decoding Strategies for Uncertainty Estimation in Large Language Models EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时,不同解码策略对不确定性估计(uncertainty estimation)的影响问题。研究发现,对比搜索(Contrastive Search)作为一种能够缓解重复生成的解码策略,在多种偏好对齐(preference-aligned)的LLM上平均而言能提供更优的不确定性估计;而当模型仅通过监督微调(supervised fine-tuning)进行后训练而未显式对齐时,此类策略的优势可能不再一致。其解决方案的关键在于识别并验证对比搜索在提升不确定性估计准确性方面的稳定性与有效性,尤其是在具备对齐能力的模型中。
链接: https://arxiv.org/abs/2509.16696
作者: Wataru Hashimoto,Hidetaka Kamigaito,Taro Watanabe
机构: Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted at EMNLP 2025 Findings
Abstract:Decoding strategies manipulate the probability distribution underlying the output of a language model and can therefore affect both generation quality and its uncertainty. In this study, we investigate the impact of decoding strategies on uncertainty estimation in Large Language Models (LLMs). Our experiments show that Contrastive Search, which mitigates repetition, yields better uncertainty estimates on average across a range of preference-aligned LLMs. In contrast, the benefits of these strategies sometimes diverge when the model is only post-trained with supervised fine-tuning, i.e. without explicit alignment.
zh
[NLP-132] EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因多头注意力(Multi-Head Attention, MHA)机制导致的键值缓存(Key-Value Cache, KV cache)占用内存过高问题,尤其在延迟和内存受限场景下限制了模型的高效部署。现有方法如多头潜在注意力(Multi-head Latent Attention, MLA)虽已通过将KV表示压缩到共享潜在空间实现了一定程度的缓存优化,但进一步压缩的空间有限且易引发性能下降。论文提出的解决方案是Embedding-Gated Multi-head Latent Attention (EG-MLA),其关键创新在于引入一种基于token特定嵌入的门控机制(embedding gating mechanism),该机制作用于潜在空间中,对压缩后的KV向量进行细粒度调制,在几乎不增加额外计算开销的前提下显著提升表达能力。EG-MLA在保持任务性能稳定的同时,相较MHA实现超过91.6%的KV缓存压缩率,相较MLA进一步减少高达59.9%的内存占用,并在多个推理基准上提升准确率,具备良好的可扩展性与实用性。
链接: https://arxiv.org/abs/2509.16686
作者: Zhengge Cai,Haowen Hou
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose \textbfEmbedding-Gated Multi-head Latent Attention (EG-MLA), a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
zh
[NLP-133] Reinforcement Learning Meets Large Language Models : A Survey of Advancements and Applications Across the LLM Lifecycle
【速读】: 该论文旨在解决当前关于强化学习(Reinforcement Learning, RL)增强大型语言模型(Large Language Models, LLMs)的研究缺乏系统性综述的问题,尤其是对RL在LLM全生命周期中的作用机制与应用策略未形成全面总结。其解决方案的关键在于:首先梳理RL的基本理论框架,随后分阶段详述RL在预训练、对齐微调和强化推理三个核心阶段的应用策略,特别强调强化推理阶段中基于可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)的方法是推动模型推理能力极限突破的关键驱动力;同时,系统整理了用于RL微调的主流数据集与评估基准,并归纳了开源工具与训练框架,为后续研究提供实践参考,最终指明该领域未来的发展挑战与趋势。
链接: https://arxiv.org/abs/2509.16679
作者: Keliang Liu,Dingkang Yang,Ziyun Qian,Weijie Yin,Yuchi Wang,Hongsheng Li,Jun Liu,Peng Zhai,Yang Liu,Lihua Zhang
机构: Fudan University (复旦大学); ByteDance SAIL Team (字节跳动SAIL团队); The Chinese University of Hong Kong, MMLab (香港中文大学MMLab); Lancaster University (兰卡斯特大学); Tongji University (同济大学); The University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注: A Survey of Reinforcement Learning for Large Language Models
Abstract:In recent years, training methods centered on Reinforcement Learning (RL) have markedly enhanced the reasoning and alignment performance of Large Language Models (LLMs), particularly in understanding human intents, following user instructions, and bolstering inferential strength. Although existing surveys offer overviews of RL augmented LLMs, their scope is often limited, failing to provide a comprehensive summary of how RL operates across the full lifecycle of LLMs. We systematically review the theoretical and practical advancements whereby RL empowers LLMs, especially Reinforcement Learning with Verifiable Rewards (RLVR). First, we briefly introduce the basic theory of RL. Second, we thoroughly detail application strategies for RL across various phases of the LLM lifecycle, including pre-training, alignment fine-tuning, and reinforced reasoning. In particular, we emphasize that RL methods in the reinforced reasoning phase serve as a pivotal driving force for advancing model reasoning to its limits. Next, we collate existing datasets and evaluation benchmarks currently used for RL fine-tuning, spanning human-annotated datasets, AI-assisted preference data, and program-verification-style corpora. Subsequently, we review the mainstream open-source tools and training frameworks available, providing clear practical references for subsequent research. Finally, we analyse the future challenges and trends in the field of RL-enhanced LLMs. This survey aims to present researchers and practitioners with the latest developments and frontier trends at the intersection of RL and LLMs, with the goal of fostering the evolution of LLMs that are more intelligent, generalizable, and secure.
zh
[NLP-134] Robust Native Language Identification through Agent ic Decomposition EMNLP
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在母语识别(Native Language Identification, NLI)任务中过度依赖表面语境线索(如姓名、地点和文化刻板印象)而非深层语言特征的问题,从而导致模型鲁棒性不足且预测结果不稳定。其解决方案的关键在于提出一种受法医语言学启发的代理式NLI流水线(agentic NLI pipeline),其中多个专业化代理分别收集并分类多样化的语言证据,最终由一个目标导向的协调代理综合所有证据进行独立判断,从而提升模型对误导性提示的抗干扰能力及预测一致性。
链接: https://arxiv.org/abs/2509.16666
作者: Ahmet Yavuz Uluslu,Tannon Kew,Tilia Ellendorff,Gerold Schneider,Rico Sennrich
机构: University of Zurich (苏黎世大学)
类目: Computation and Language (cs.CL)
备注: Accepted at EMNLP* 2025
Abstract:Large language models (LLMs) often achieve high performance in native language identification (NLI) benchmarks by leveraging superficial contextual clues such as names, locations, and cultural stereotypes, rather than the underlying linguistic patterns indicative of native language (L1) influence. To improve robustness, previous work has instructed LLMs to disregard such clues. In this work, we demonstrate that such a strategy is unreliable and model predictions can be easily altered by misleading hints. To address this problem, we introduce an agentic NLI pipeline inspired by forensic linguistics, where specialized agents accumulate and categorize diverse linguistic evidence before an independent final overall assessment. In this final assessment, a goal-aware coordinating agent synthesizes all evidence to make the NLI prediction. On two benchmark datasets, our approach significantly enhances NLI robustness against misleading contextual clues and performance consistency compared to standard prompting methods.
zh
[NLP-135] Redefining Experts: Interpretable Decomposition of Language Models for Toxicity Mitigation NEURIPS2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本时易产生有害内容(toxic content)的问题,这一问题严重威胁AI安全与公众信任。现有方法主要通过操纵单个神经元激活来缓解毒性,但存在不稳定性、上下文依赖性以及损害模型核心语言能力等缺陷。论文的关键解决方案是提出一种名为EigenShift的新颖干预技术,其基于语言模型最终输出层的特征值分解(eigen-decomposition),能够选择性地针对与生成任务对齐的成分进行调整,从而在不损害语言能力的前提下实现精准的毒性抑制。该方法无需额外训练或微调,计算开销极低,并具有坚实的理论基础。
链接: https://arxiv.org/abs/2509.16660
作者: Zuhair Hasan Shaik,Abdullah Mazhar,Aseem Srivastava,Md Shad Akhtar
机构: IIIT Dharwad, India (印度国际信息技术学院达瓦德分校); IIIT Delhi, India (印度国际信息技术学院德里分校)
类目: Computation and Language (cs.CL)
备注: Accepted to the NeurIPS 2025 Research Track
Abstract:Large Language Models have demonstrated impressive fluency across diverse tasks, yet their tendency to produce toxic content remains a critical challenge for AI safety and public trust. Existing toxicity mitigation approaches primarily manipulate individual neuron activations, but these methods suffer from instability, context dependence, and often compromise the model’s core language abilities. To address these shortcomings, we investigate three key questions: the stability of neuron-level toxicity indicators, the advantages of structural (layer-wise) representations, and the interpretability of mechanisms driving toxic generation. Through extensive experiments on Jigsaw and ToxiCN datasets, we show that aggregated layer-wise features provide more robust signals than single neurons. Moreover, we observe conceptual limitations in prior works that conflate toxicity detection experts and generation experts within neuron-based interventions. To mitigate this, we propose a novel principled intervention technique, EigenShift, based on eigen-decomposition of the language model’s final output layer. This method selectively targets generation-aligned components, enabling precise toxicity suppression without impairing linguistic competence. Our method requires no additional training or fine-tuning, incurs minimal computational cost, and is grounded in rigorous theoretical analysis.
zh
[NLP-136] FESTA: Functionally Equivalent Sampling for Trust Assessment of Multimodal LLM s EMNLP
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在生成预测时缺乏准确可信度评估的问题,从而实现选择性预测以提升用户信心。其核心挑战在于多样化的多模态输入范式使得传统不确定性量化方法难以适用。解决方案的关键是提出一种名为FESTA(Functionally Equivalent Sampling for Trust Assessment)的黑盒输入采样技术,通过功能等效采样和互补采样策略扩展输入空间:等效样本用于探测模型输出的一致性,互补样本用于衡量模型对输入扰动的敏感性,进而构建无需真实标签的无监督不确定性估计。该方法显著提升了选择性预测性能,在视觉和音频推理任务中分别实现了33.3%和29.6%的相对AUROC提升。
链接: https://arxiv.org/abs/2509.16648
作者: Debarpan Bhattacharya,Apoorva Kulkarni,Sriram Ganapathy
机构: Indian Institute of Science (印度科学研究所); University of Maryland (马里兰大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Accepted in the Findings of EMNLP, 2025
Abstract:The accurate trust assessment of multimodal large language models (MLLMs) generated predictions, which can enable selective prediction and improve user confidence, is challenging due to the diverse multi-modal input paradigms. We propose Functionally Equivalent Sampling for Trust Assessment (FESTA), a multimodal input sampling technique for MLLMs, that generates an uncertainty measure based on the equivalent and complementary input samplings. The proposed task-preserving sampling approach for uncertainty quantification expands the input space to probe the consistency (through equivalent samples) and sensitivity (through complementary samples) of the model. FESTA uses only input-output access of the model (black-box), and does not require ground truth (unsupervised). The experiments are conducted with various off-the-shelf multi-modal LLMs, on both visual and audio reasoning tasks. The proposed FESTA uncertainty estimate achieves significant improvement (33.3% relative improvement for vision-LLMs and 29.6% relative improvement for audio-LLMs) in selective prediction performance, based on area-under-receiver-operating-characteristic curve (AUROC) metric in detecting mispredictions. The code implementation is open-sourced.
zh
[NLP-137] When Big Models Train Small Ones: Label-Free Model Parity Alignment for Efficient Visual Question Answering using Small VLMs EMNLP
【速读】: 该论文旨在解决小型视觉语言模型(Small Vision-Language Models, S-VLMs)在性能上显著落后于大型视觉语言模型(Large Vision-Language Models, L-VLMs)的问题,同时保持计算效率以适应资源受限场景。解决方案的关键在于提出一种名为“模型对齐器”(Model Parity Aligner, MPA)的新框架,其核心机制是通过无标签图像数据,采用基于对齐策略的方法精准识别S-VLM与L-VLM之间的知识差异,并仅针对这些差异进行优化训练,从而实现高效的知识迁移,而非依赖传统需要标注数据的知识蒸馏方法。
链接: https://arxiv.org/abs/2509.16633
作者: Abhirama Subramanyam Penamakuri,Navlika Singh,Piyush Arora,Anand Mishra
机构: Indian Institute of Technology Jodhpur (印度理工学院焦特布尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: Accepted to EMNLP (Main) 2025
Abstract:Large Vision-Language Models (L-VLMs) have demonstrated remarkable performance in various vision and language tasks, including visual question answering (VQA). However, their high computational cost makes them impractical for resource-constrained settings and inference-heavy applications. In contrast, Small Vision-Language Models (S-VLMs) offer efficiency but suffer from a significant performance gap compared to their larger counterparts. In this work, we introduce the Model Parity Aligner (MPA), a novel framework designed to systematically improve S-VLMs by leveraging unlabeled images and effective knowledge transfer from L-VLMs. Instead of traditional knowledge distillation methods that rely on labeled training data, MPA employs a strategic parity-based approach that precisely identifies the knowledge disparities between S-VLMs and L-VLMs, and optimizes training by targeting only these disparities. We conduct extensive experiments on four diverse VQA benchmarks, namely TextVQA, ST-VQA, ChartQA, and OKVQA, each of which requires specialized reasoning capabilities such as text recognition, chart interpretation, and commonsense and factual understanding. Our results demonstrate that MPA consistently enhances the performance of S-VLMs on all benchmarks, reducing the performance gap while maintaining computational efficiency. We make our code publicly available.
zh
[NLP-138] he Role of Vocabularies in Learning Sparse Representations for Ranking
【速读】: 该论文旨在解决** Learned Sparse Retrieval (LSR) 模型中词汇表(vocabulary)配置对检索效率与效果的影响机制不明确**的问题,尤其是在 SPLADE 类模型中,如何通过词汇表大小和预训练权重设计来优化查询、文档及其交互的表征能力。其解决方案的关键在于:构建两个基于 BERT 的 100K 词汇量模型——一个采用 ESPLADE 预训练初始化,另一个随机初始化,在真实搜索点击日志上微调后,结合基于 logit 分数的查询与文档剪枝策略以控制计算预算;实验表明,ESPLADE 初始化模型在保持与标准 32K 词汇量 SPLADE 相当的检索成本下,显著优于随机初始化模型,揭示了词汇表的规模与预训练权重不仅是自然语言处理任务中的基础组件,更是可被主动配置以提升检索系统表示特异性(representational specification)的核心参数。
链接: https://arxiv.org/abs/2509.16621
作者: Hiun Kim,Tae Kwan Lee,Taeryun Won
机构: Naver(NAVER)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Learned Sparse Retrieval (LSR) such as SPLADE has growing interest for effective semantic 1st stage matching while enjoying the efficiency of inverted indices. A recent work on learning SPLADE models with expanded vocabularies (ESPLADE) was proposed to represent queries and documents into a sparse space of custom vocabulary which have different levels of vocabularic granularity. Within this effort, however, there have not been many studies on the role of vocabulary in SPLADE models and their relationship to retrieval efficiency and effectiveness. To study this, we construct BERT models with 100K-sized output vocabularies, one initialized with the ESPLADE pretraining method and one initialized randomly. After finetune on real-world search click logs, we applied logit score-based queries and documents pruning to max size for further balancing efficiency. The experimental result in our evaluation set shows that, when pruning is applied, the two models are effective compared to the 32K-sized normal SPLADE model in the computational budget under the BM25. And the ESPLADE models are more effective than the random vocab model, while having a similar retrieval cost. The result indicates that the size and pretrained weight of output vocabularies play the role of configuring the representational specification for queries, documents, and their interactions in the retrieval engine, beyond their original meaning and purposes in NLP. These findings can provide a new room for improvement for LSR by identifying the importance of representational specification from vocabulary configuration for efficient and effective retrieval. Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL) Cite as: arXiv:2509.16621 [cs.IR] (or arXiv:2509.16621v1 [cs.IR] for this version) https://doi.org/10.48550/arXiv.2509.16621 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-139] LLM sPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts EMNLP2025
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)评估体系中过于依赖单一指标、难以全面衡量其策略性智能与交互行为的问题。为实现更系统的评估,作者提出LLMsPark——一个基于博弈论的评估平台,其核心在于通过经典博弈场景构建多智能体交互环境,量化LLMs在决策策略和社会行为方面的表现。解决方案的关键在于引入博弈论框架,设计评分机制与排行榜,从而揭示不同模型在战略深度和推理能力上的差异,为LLMs的战略智能提供了一种可量化的、多维度的评估新范式。
链接: https://arxiv.org/abs/2509.16610
作者: Junhao Chen,Jingbo Sun,Xiang Li,Haidong Xin,Yuhao Xue,Yibin Xu,Hao Zhao
机构: Shenzhen International Graduate School, Tsinghua University (清华大学深圳国际研究生院); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); School of Software and Microelectronics, Peking University (北京大学软件与微电子学院); School of Computer Science and Engineering, Northeastern University (东北大学计算机科学与工程学院); Tongji University (同济大学); AIR, Tsinghua University (清华大学人工智能研究院); BAAI (北京人工智能研究院)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 Findings
Abstract:As large language models (LLMs) advance across diverse tasks, the need for comprehensive evaluation beyond single metrics becomes increasingly important. To fully assess LLM intelligence, it is crucial to examine their interactive dynamics and strategic behaviors. We present LLMsPark, a game theory-based evaluation platform that measures LLMs’ decision-making strategies and social behaviors in classic game-theoretic settings, providing a multi-agent environment to explore strategic depth. Our system cross-evaluates 15 leading LLMs (both commercial and open-source) using leaderboard rankings and scoring mechanisms. Higher scores reflect stronger reasoning and strategic capabilities, revealing distinct behavioral patterns and performance differences across models. This work introduces a novel perspective for evaluating LLMs’ strategic intelligence, enriching existing benchmarks and broadening their assessment in interactive, game-theoretic scenarios. The benchmark and rankings are publicly available at this https URL.
zh
[NLP-140] Computational-Assisted Systematic Review and Meta-Analysis (CASMA): Effect of a Subclass of GnRH-a on Endometriosis Recurrence
【速读】: 该论文旨在解决系统性综述(systematic review)在面对海量文献时效率低、透明度不足及可重复性差的问题,尤其是在复杂且文献模糊的领域如子宫内膜异位症复发研究中。其解决方案的关键在于构建一个以信息检索驱动的工作流程,结合PRISMA指南与计算技术,采用半自动化去重策略减少人工筛选负担,并通过改进的分组方法处理多臂试验中的单位分析错误,从而提升证据合成的准确性与效率。该方法不仅显著缩短了文献筛选时间(仅11天处理812条记录),还获得了具有临床意义的结果(RR=0.64,即复发风险降低36%),并确保了结果的稳健性和可复现性。
链接: https://arxiv.org/abs/2509.16599
作者: Sandro Tsang
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Applications (stat.AP); Methodology (stat.ME)
备注: 11 pages, 7 figures and 4 tables. This work describes an information retrieval-driven workflow for medical evidence synthesis, with an application to endometriosis recurrence. The method can be generalized to other systematic reviews. The preregistered protocol is available: this https URL
Abstract:Background: Evidence synthesis facilitates evidence-based medicine. Without information retrieval techniques, this task is impossible due to the vast and expanding literature. Objective: Building on prior work, this study evaluates an information retrieval-driven workflow to enhance the efficiency, transparency, and reproducibility of systematic reviews. We use endometriosis recurrence as an ideal case due to its complex and ambiguous literature. Methods: Our hybrid approach integrates PRISMA guidelines with computational techniques. We applied semi-automated deduplication to efficiently filter records before manual screening. This workflow synthesized evidence from randomised controlled trials on the efficacy of a subclass of gonadotropin-releasing hormone agonists (GnRH’as). A modified splitting method addressed unit-of-analysis errors in multi-arm trials. Results: Our workflow efficiently reduced the screening workload. It took only 11 days to fetch and filter 812 records. Seven RCTs were eligible, providing evidence from 841 patients in 4 countries. The pooled random-effects model yielded a Risk Ratio (RR) of 0.64 (95% CI (0.48 to 0.86)), with non-significant heterogeneity ( I^2=0.00% , \tau=0.00 ); i.e., a 36% reduction in endometriosis recurrence. Sensitivity analyses and bias assessments supported the robustness of our findings. Conclusion: This study demonstrates an information-retrieval-driven workflow for medical evidence synthesis. Our approach yields valuable clinical results while providing a framework for accelerating the systematic review process. It bridges the gap between clinical research and computer science and can be generalized to other complex systematic reviews.
zh
[NLP-141] PruneCD: Contrasting Pruned Self Model to Improve Decoding Factuality
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的幻觉问题,即模型生成内容与事实不符的现象。现有方法如DoLa利用模型早期出口(early exit)的logits作为对比先验,但研究发现这些logits通常平坦且幅度低,难以体现有效对比。为此,作者提出PruneCD这一新型对比解码方法,其关键创新在于通过层剪枝(layer pruning)构建“业余模型”(amateur model),而非依赖早期出口logits,从而获得更具信息量且对齐更优的logits,实现更有效的对比解码,在保持极小推理开销的前提下显著提升事实准确性。
链接: https://arxiv.org/abs/2509.16598
作者: Byeongho Yu,Changhun Lee,Jungyu Jin,Eunhyeok Park
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:To mitigate the hallucination problem in large language models, DoLa exploits early exit logits from the same model as a contrastive prior. However, we found that these early exit logits tend to be flat, low in magnitude, and fail to reflect meaningful contrasts. To address this, we propose PruneCD, a novel contrastive decoding method that constructs the amateur model via layer pruning rather than early exit. This design leads to more informative and well-aligned logits, enabling more effective contrastive decoding. Through qualitative and quantitative analyses, we demonstrate that PruneCD consistently improves factuality with minimal inference overhead, offering a robust and practical approach to mitigating hallucinations in LLMs.
zh
[NLP-142] MCP: A Control-Theoretic Orchestration Framework for Synergistic Efficiency and Interpretability in Multimodal Large Language Models
【速读】: 该论文旨在解决大模型在复杂任务(如多轮推理和多模态协作)中面临的计算效率低下与可解释性不足的问题。其解决方案的关键在于提出了一种三层协同框架——基于模型-控制器-任务自适应(Model-Controller-Task Adaptation, MCP),通过将大模型功能解耦为推理、生成和检索模块,并结合强化学习驱动的动态路由算法与任务自适应机制,首次实现了控制理论与大模型动态推理的系统集成。此设计显著提升了跨模态基准任务(如GLUE、COCO、ScienceQA)性能15–30%,推理效率提高40%,并通过Presenter层生成可解释的中间结果,达到人工标注可解释性评分的90%。
链接: https://arxiv.org/abs/2509.16597
作者: Luyan Zhang
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 6 figures, 2 tables
Abstract:Aiming at the problems of computational inefficiency and insufficient interpretability faced by large models in complex tasks such as multi-round reasoning and multi-modal collaboration, this study proposes a three-layer collaboration framework based on model-controller-task adaptation (MCP). By decoupling large model functions into reasoning, generation and retrieval modules, and combining reinforcement learning-driven dynamic routing algorithms and task adaptation mechanisms, the systematic integration of control theory and large model dynamic reasoning is achieved for the first time. Experiments show that the MCP framework improves the performance of cross-modal benchmarking tasks, such as GLUE, COCO, ScienceQA, etc., by 15-30% compared with the baseline model, improves the reasoning efficiency by 40%, and generates the interpretable intermediate results through the Presenter layer, obtaining 90% of the manual interpretability scores, which provides a brand-new technological path to solve the bottleneck of the practical application of the large model.
zh
[NLP-143] Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels EMNLP2025
【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)对大语言模型(Large Language Models, LLMs)知识获取与保持的影响机制不明确的问题,从而限制了对微调后模型知识演变行为的可控性。其关键解决方案在于通过系统评估五种来自LLaMA-2和LLaMA-3系列的模型在闭卷问答(Closed-Book Question Answering, CBQA)任务上的性能变化,发现微调样本数量和知识掌握程度显著影响模型表现:例如,使用1920个样本微调的模型性能比仅用240个样本微调的模型低达14%,且知识掌握水平差异可导致性能波动超过12%。进一步的token级与参数级分析表明,高达90%的SFT参数更新并未促进知识增强,且恢复这些更新可提升CBQA性能,具体效果取决于微调数据特性。这一发现为设计更有效的知识强化微调策略提供了实证依据与理论支撑。
链接: https://arxiv.org/abs/2509.16596
作者: Junjie Ye,Yuming Yang,Yang Nan,Shuo Li,Qi Zhang,Tao Gui,Xuanjing Huang,Peng Wang,Zhongchao Shi,Jianping Fan
机构: Fudan University (复旦大学); Lenovo Research (联想研究院); Shanghai Key Lab of Intelligent Information Processing (上海市智能信息处理重点实验室); Shanghai Innovation Institute (上海创新研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025 Main Conference. arXiv admin note: text overlap with arXiv:2409.15825
Abstract:Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model’s knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.
zh
[NLP-144] From Uniform to Heterogeneous: Tailoring Policy Optimization to Every Tokens Nature
【速读】: 该论文旨在解决当前强化学习(Reinforcement Learning, RL)在提升大语言模型(Large Language Models, LLMs)推理能力时存在的关键局限:现有算法对所有token采用统一优化策略,忽略了不同token在推理过程中所扮演的异质性角色。为应对这一问题,作者提出了一种全新的异构自适应策略优化方法(Heterogeneous Adaptive Policy Optimization, HAPO),其核心在于将token级粒度的感知嵌入到RL训练的每个阶段——包括rollout采样、优势计算、奖励重分配和裁剪损失设计中。具体而言,HAPO通过引入基于token熵的动态调整机制,在高熵token上增强探索性,在低熵token上维持一致性;同时利用重要性比和熵信息进行差异化的奖励重分配,并设计非对称自适应裁剪策略以实现对噪声低熵token的激进概率压缩与高熵token的灵活探索。这种多阶段、细粒度的token-aware优化框架显著提升了训练效率与模型性能。
链接: https://arxiv.org/abs/2509.16591
作者: Zheng Liu,Mengjie Liu,Siwei Wen,Mengzhang Cai,Bin Cui,Conghui He,Wentao Zhang
机构: Peking University (北京大学); Shanghai AI Laboratory (上海人工智能实验室); Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Reinforcement Learning has emerged as the fundamental technique for enhancing reasoning in LLMs. However, existing algorithms apply uniform optimization to all tokens, ignoring their different roles in reasoning process. To address this limitation, we introduce Heterogeneous Adaptive Policy Optimization (HAPO), a comprehensive token-aware algorithm that dynamically adapts optimization based on token entropy. For rollout sampling, we propose Adaptive Temperature Sampling, which adjusts sampling temperature in real time, promoting exploration at high-entropy tokens while preserving coherence at low-entropy ones. For advantage calculation, we introduce Token Level Group Average that normalizes advantages at token level, jointly accounting for sequence-length as in token-mean loss while preserving non-biased treatment. We then develop Differential Advantage Redistribution that leverages entropy and importance ratios to modulate rewards-adjusting updates for tokens with clear signals. For clipping loss, we design Asymmetric Adaptive Clipping, allowing aggressive probability reduction for noisy low-entropy tokens while enabling exploration for high-entropy tokens. Through systematic investigation between entropy and training dynamics, we embedded token-level treatment into every stages to achieve fine-grained control. Extensive experiments demonstrate that HAPO consistently outperforms DAPO across multiple model scales. Our code can be found in this https URL.
zh
[NLP-145] Question Answering with LLM s and Learning from Answer Sets
【速读】: 该论文试图解决大型语言模型(Large Language Models, LLMs)在故事类问答任务中缺乏显式常识推理能力的问题。现有方法通常依赖人工设计符号推理模块,而本文提出一种自动学习符号组件的新思路。解决方案的关键在于构建一个名为LLM2LAS的混合系统,它将LLM的自然语言理解能力、基于答案集的学习(Learning from Answer Sets, LAS)系统ILASP的规则归纳能力以及答案集编程(Answer Set Programming, ASP)的形式化推理优势相结合:LLM从文本中提取语义结构,ILASP将其转化为可解释的逻辑规则,再由ASP求解器执行精确一致的推理,从而实现对未见过问题的正确回答。
链接: https://arxiv.org/abs/2509.16590
作者: Manuel Borroto,Katie Gallagher,Antonio Ielo,Irfan Kareem,Francesco Ricca,Alessandra Russo
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Logic in Computer Science (cs.LO)
备注: Under consideration for TPLP journal
Abstract:Large Language Models (LLMs) excel at understanding natural language but struggle with explicit commonsense reasoning. A recent trend of research suggests that the combination of LLM with robust symbolic reasoning systems can overcome this problem on story-based question answering tasks. In this setting, existing approaches typically depend on human expertise to manually craft the symbolic component. We argue, however, that this component can also be automatically learned from examples. In this work, we introduce LLM2LAS, a hybrid system that effectively combines the natural language understanding capabilities of LLMs, the rule induction power of the Learning from Answer Sets (LAS) system ILASP, and the formal reasoning strengths of Answer Set Programming (ASP). LLMs are used to extract semantic structures from text, which ILASP then transforms into interpretable logic rules. These rules allow an ASP solver to perform precise and consistent reasoning, enabling correct answers to previously unseen questions. Empirical results outline the strengths and weaknesses of our automatic approach for learning and reasoning in a story-based question answering benchmark.
zh
[NLP-146] Benchmarking Contextual and Paralinguistic Reasoning in Speech-LLM s: A Case Study with In-the-Wild Data EMNLP
【速读】: 该论文旨在解决当前语音大语言模型(Speech Large Language Models, Speech-LLMs)在理解和推理语音中非语言语用特征(paralinguistic aspects)方面的不足,这些特征包括情感(emotion)和韵律(prosody),对社交与情感智能至关重要。解决方案的关键在于提出一个名为CP-Bench的新基准测试框架,用于系统评估Speech-LLMs在上下文感知的语用推理能力,即整合言语内容与非言语线索的能力。该基准包含两个精心设计的问答(QA)数据集,要求模型同时具备语言理解和共情能力,并通过对比分析主流开源与闭源模型的表现,揭示现有评估体系的局限性,从而为构建更具情境感知能力和情感智能的语音大模型提供方向。
链接: https://arxiv.org/abs/2509.16589
作者: Qiongqiong Wang,Hardik Bhupendra Sailor,Tianchi Liu,Wenyu Zhang,Muhammad Huzaifah,Nattadaporn Lertcheva,Shuo Sun,Nancy F. Chen,Jinyang Wu,AiTi Aw
机构: Institute of Infocomm Research (I2R), A⋆\starSTAR, Singapore
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted in EMNLP Findings 2025
Abstract:Recent speech-LLMs have shown impressive performance in tasks like transcription and translation, yet they remain limited in understanding the paralinguistic aspects of speech crucial for social and emotional intelligence. We propose CP-Bench, a benchmark for evaluating speech-LLMs on contextual paralinguistic reasoning the integration of verbal content with non-verbal cues like emotion and prosody. The benchmark includes two curated question answering (QA) datasets requiring both linguistic and empathetic understanding. We evaluate state-of-the-art speech-LLMs from both open and closed-source models and perform a comprehensive analysis across different question types. The top two models were further analyzed under temperature tuning to understand its effect on this task. Our benchmark reveals a key gap in existing evaluations and offers insights into building more context-aware and emotionally intelligent speech-capable LLMs.
zh
[NLP-147] From Scores to Steps: Diagnosing and Improving LLM Performance in Evidence-Based Medical Calculations EMNLP
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在医学计算任务中评估不足的问题,特别是现有基准测试仅关注最终答案的宽泛数值容差,忽视了推理过程中的系统性错误,可能导致临床误判。其解决方案的关键在于提出一种分步评估框架(step-by-step evaluation pipeline),独立评估公式选择、实体抽取和算术运算三个子任务,并引入自动错误分析机制以生成结构化归因,从而实现可解释的诊断;同时设计了一个无需微调的模块化代理流程 MedRaC,融合检索增强生成与基于 Python 的代码执行,显著提升不同 LLM 在医学计算上的准确率(从 16.35% 提升至 53.19%),推动 LLM 在真实医疗场景中迈向可信应用。
链接: https://arxiv.org/abs/2509.16584
作者: Benlu Wang,Iris Xia,Yifan Zhang,Junda Wang,Feiyun Ouyang,Shuo Han,Arman Cohan,Hong Yu,Zonghai Yao
机构: Yale University (耶鲁大学); VA Bedford Health Care; UMass Lowell (马萨诸塞大学洛厄尔分校); UMass Amherst (马萨诸塞大学阿默斯特分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors. To appear as an Oral presentation in the proceedings of the Main Conference on Empirical Methods in Natural Language Processing (EMNLP) 2025
Abstract:Large language models (LLMs) have demonstrated promising performance on medical benchmarks; however, their ability to perform medical calculations, a crucial aspect of clinical decision-making, remains underexplored and poorly evaluated. Existing benchmarks often assess only the final answer with a wide numerical tolerance, overlooking systematic reasoning failures and potentially causing serious clinical misjudgments. In this work, we revisit medical calculation evaluation with a stronger focus on clinical trustworthiness. First, we clean and restructure the MedCalc-Bench dataset and propose a new step-by-step evaluation pipeline that independently assesses formula selection, entity extraction, and arithmetic computation. Under this granular framework, the accuracy of GPT-4o drops from 62.7% to 43.6%, revealing errors masked by prior evaluations. Second, we introduce an automatic error analysis framework that generates structured attribution for each failure mode. Human evaluation confirms its alignment with expert judgment, enabling scalable and explainable diagnostics. Finally, we propose a modular agentic pipeline, MedRaC, that combines retrieval-augmented generation and Python-based code execution. Without any fine-tuning, MedRaC improves the accuracy of different LLMs from 16.35% up to 53.19%. Our work highlights the limitations of current benchmark practices and proposes a more clinically faithful methodology. By enabling transparent and transferable reasoning evaluation, we move closer to making LLM-based systems trustworthy for real-world medical applications.
zh
[NLP-148] MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLM s
【速读】: 该论文旨在解决当前 misinformation detection方法普遍假设虚假信息是静态的,而实际上虚假信息在传播过程中会随语言、框架和道德强调的变化而动态演化的问题。解决方案的关键在于提出了一种多轮、人格化条件(persona-conditioned)的生成框架MPCG,利用未受审查的大语言模型(LLM)模拟不同意识形态立场的代理对初始声明进行多轮迭代重构,每一轮生成均基于前一轮输出进行条件化,从而系统性地研究虚假信息的演化过程。该框架通过人类标注、认知努力指标(如可读性和困惑度)、情感唤起指标(如情绪分析与道德判断)、聚类和下游分类等多维度评估,验证了生成内容的语义漂移与主题一致性,并揭示了现有检测模型在面对演化后的虚假信息时性能显著下降(宏F1值最多下降49.7%),证明了其在真实场景中的适用性和必要性。
链接: https://arxiv.org/abs/2509.16564
作者: Jun Rong Brian Chong,Yixuan Tang,Anthony K.H. Tung
机构: National University of Singapore (新加坡国立大学)
类目: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
备注: 35 pages, 8 figures
Abstract:Misinformation evolves as it spreads, shifting in language, framing, and moral emphasis to adapt to new audiences. However, current misinformation detection approaches implicitly assume that misinformation is static. We introduce MPCG, a multi-round, persona-conditioned framework that simulates how claims are iteratively reinterpreted by agents with distinct ideological perspectives. Our approach uses an uncensored large language model (LLM) to generate persona-specific claims across multiple rounds, conditioning each generation on outputs from the previous round, enabling the study of misinformation evolution. We evaluate the generated claims through human and LLM-based annotations, cognitive effort metrics (readability, perplexity), emotion evocation metrics (sentiment analysis, morality), clustering, feasibility, and downstream classification. Results show strong agreement between human and GPT-4o-mini annotations, with higher divergence in fluency judgments. Generated claims require greater cognitive effort than the original claims and consistently reflect persona-aligned emotional and moral framing. Clustering and cosine similarity analyses confirm semantic drift across rounds while preserving topical coherence. Feasibility results show a 77% feasibility rate, confirming suitability for downstream tasks. Classification results reveal that commonly used misinformation detectors experience macro-F1 performance drops of up to 49.7%. The code is available at this https URL
zh
[NLP-149] SalaMAnder: Shapley-based Mathematical Expression Attribution and Metric for Chain-of-Thought Reasoning EMNLP2025
【速读】: 该论文旨在解决链式思维(Chain-of-Thought, CoT)提示在提升大语言模型(Large Language Models, LLMs)数学推理能力时,其内部机制尚不明确的问题。为实现对CoT推理中各组件贡献的量化分析,作者提出了SalaMAnder框架,其核心在于基于Shapley值(Shapley value)进行数学表达式归因(Mathematical Expression Attribution),并设计了一种高效的分层采样算法以显著降低计算复杂度;同时,通过协方差分析构建了CoSP(Cardinality of Shapley Positives)指标,该指标与模型性能呈现稳健的单调相关性,从而不仅为现有Few-shot CoT的实证成功提供理论解释,也为提示工程优化建立了数学严谨的原则。
链接: https://arxiv.org/abs/2509.16561
作者: Yue Xin,Chen Shen,Shaotian Yan,Xiaosong Yuan,Yaoming Wang,Xiaofeng Zhang,Chenxi Huang,Jieping Ye
机构: Shanghai Jiao Tong University (上海交通大学); Alibaba Cloud Computing (阿里云计算)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: accpeted by EMNLP 2025
Abstract:Chain-of-Thought (CoT) prompting enhances the math reasoning capability of large language models (LLMs) to a large margin. However, the mechanism underlying such improvements remains unexplored. In this paper, we present \textbfSalaMAnder (\textbfSh\textbfap\textbfley-b\textbfased \textbfMathematical Expression \textbfAttribution a\textbfnd M\textbfet\textbfric), a theoretically grounded methodology as well as a mathematically rigorous evaluation metric for quantifying component-level contributions in few-shot CoT reasoning. Concretely, we leverage the Shapley value for mathematical expression attribution and develop an efficient stratified sampling algorithm that significantly reduces the computational complexity. Besides, we develop the \textbfCoSP (\textbfCardinality \textbfof \textbfShapley \textbfPositives) metric through covariance analysis. Comprehensive validation across popular LLM models and diverse mathematical benchmarks demonstrates that the CoSP metric within our SalaMAnder framework exhibits a robust monotonic correlation with model performance, not only providing theoretical explanations for the empirical success of existing few-shot CoT but also establishing mathematically rigorous principles for prompt construction optimization. Furthermore, we verify the reliability of the explanation, based on which we unify the insights of previous work.
zh
[NLP-150] Rethinking the Role of Text Complexity in Language Model Pretraining EMNLP2025
【速读】: 该论文旨在解决预训练文本复杂度(text complexity)对语言模型性能影响不明确的问题,尤其是其在不同模型规模下的作用机制及对下游任务表现的影响。其解决方案的关键在于通过大语言模型对人类撰写的文本进行结构化简化(保持核心内容不变),从而分离出文本复杂度这一变量,并在此基础上从零开始预训练一系列规模从28M到500M参数的因果语言模型,系统评估其在微调和零样本场景下的表现。结果表明,文本复杂度与模型容量存在交互效应:小模型在简化文本上性能下降更少;而零样本评估显示,简化文本更利于语言知识类任务,复杂文本则更有利于需要世界知识和实体追踪的任务。
链接: https://arxiv.org/abs/2509.16551
作者: Dan John Velasco,Matthew Theodore Roque
机构: Samsung R&D Institute Philippines (三星研发研究所菲律宾分部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: To be published in BabyLM Workshop at EMNLP 2025
Abstract:Improving pretraining data quality and size is known to boost downstream performance, but the role of text complexity is less explored. Text complexity refers to how hard a text is to read, and is typically estimated from surface cues such as sentence length, word choice, and sentence structure. We reduce surface-level complexity–shorter sentences, simpler words, simpler structure–while keeping core text content close to constant, and ask: (1) How does complexity affect language modeling across model sizes? (2) Can useful representations be learned from simpler text alone? (3) How does pretraining text complexity influence downstream language understanding? To answer these questions, we simplify human-written texts using a large language model, then pretrain causal models (28M-500M) from scratch on both original and simplified data, and evaluate them in finetuning and zero-shot setups. We find that perplexity is sensitive to the interaction between model capacity and text complexity–smaller models degrade far less on simpler texts–while text complexity has little impact on finetuning evaluations, with zero-shot evaluations indicating that simpler texts benefit performance on linguistic knowledge tasks, whereas more complex texts favor tasks requiring world knowledge and entity tracking.
zh
[NLP-151] SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning NEURIPS2025
【速读】: 该论文旨在解决过程奖励模型(Process Reward Models, PRMs)训练中因高质量人工标注数据成本高、可扩展性差而导致的瓶颈问题。现有基于蒙特卡洛(Monte Carlo, MC)估计的合成数据虽具潜力,但噪声比例高,易引发过拟合且难以支撑大规模训练。其关键解决方案是提出Self-Denoising Monte Carlo Annotation (SCAN)框架,通过识别并利用MC合成数据中的系统性偏差(即标注模型对步骤正确性存在高估与低估现象),设计自去噪策略和鲁棒学习机制:一方面,使用轻量级模型(如1.5B参数)即可实现高质量合成标注,将推理成本降低至原始MC方法的6%;另一方面,借助噪声容忍的学习策略,使PRM在仅用少量合成数据的情况下性能显著超越依赖大规模人工标注数据(如PRM800K)的基线模型,在ProcessBench上F1分数提升达39.2点(从19.9到59.1),且随合成数据规模扩大持续增益,展现出高效、可扩展的训练潜力。
链接: https://arxiv.org/abs/2509.16548
作者: Yuyang Ding,Xinyu Shi,Juntao Li,Xiaobo Liang,Zhaopeng Tu,Min Zhang
机构: Soochow University (苏州大学); Tencent (腾讯)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: NeurIPS 2025. Project page: this https URL
Abstract:Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings indicate that: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.
zh
[NLP-152] ChemOrch: Empowering LLM s with Chemical Intelligence via Synthetic Instructions
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在化学领域智能能力不足的问题,其核心挑战在于高质量、领域特定的指令-响应数据集稀缺,以及现有合成数据生成流程与化学信息固有的层次化和规则驱动结构不匹配。解决方案的关键在于提出ChemOrch框架,该框架通过两阶段过程实现:第一阶段为任务可控的指令生成,第二阶段为工具感知的响应构建;其中引入了工具规划与蒸馏机制及基于工具的自我修复机制,从而确保生成任务的可控多样性、难度层级以及响应的精确性,显著提升了LLMs在化学领域的推理能力和性能表现。
链接: https://arxiv.org/abs/2509.16543
作者: Yue Huang,Zhengzhe Jiang,Xiaonan Luo,Kehan Guo,Haomin Zhuang,Yujun Zhou,Zhengqing Yuan,Xiaoqi Sun,Jules Schleinitz,Yanbo Wang,Shuhao Zhang,Mihir Surve,Nitesh V Chawla,Olaf Wiest,Xiangliang Zhang
机构: University of Notre Dame (圣母大学); MIT (麻省理工学院); CalTech (加州理工学院); MBZUAI (穆罕默德·本·扎耶德人工智能大学); CMU (卡内基梅隆大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Empowering large language models (LLMs) with chemical intelligence remains a challenge due to the scarcity of high-quality, domain-specific instruction-response datasets and the misalignment of existing synthetic data generation pipelines with the inherently hierarchical and rule-governed structure of chemical information. To address this, we propose ChemOrch, a framework that synthesizes chemically grounded instruction-response pairs through a two-stage process: task-controlled instruction generation and tool-aware response construction. ChemOrch enables controllable diversity and levels of difficulty for the generated tasks, and ensures response precision through tool planning and distillation, and tool-based self-repair mechanisms. The effectiveness of ChemOrch is evaluated based on: 1) the high quality of generated instruction data, demonstrating superior diversity and strong alignment with chemical constraints; 2) the reliable generation of evaluation tasks that more effectively reveal LLM weaknesses in chemistry; and 3) the significant improvement of LLM chemistry capabilities when the generated instruction data are used for fine-tuning. Our work thus represents a critical step toward scalable and verifiable chemical intelligence in LLMs.
zh
[NLP-153] Mental Multi-class Classification on Social Media: Benchmarking Transformer Architectures against LSTM Models ICML
【速读】: 该论文旨在解决多类精神健康状况(如抑郁症、双相情感障碍等)在社交媒体文本中难以准确区分的问题,填补了以往自然语言处理(Natural Language Processing, NLP)研究主要聚焦于单一疾病识别而缺乏对多种精神障碍进行有效分类的空白。其解决方案的关键在于构建了一个大规模、高质量的Reddit帖子数据集,并在此基础上系统比较了五种主流Transformer架构(BERT、RoBERTa、DistilBERT、ALBERT、ELECTRA)与多种LSTM变体(含或不含注意力机制,使用上下文或静态嵌入)在多类别精神健康分类任务中的性能表现。实验表明,Transformer模型整体优于LSTM,其中RoBERTa在所有类别上均达到91–99%的F1分数;值得注意的是,结合BERT嵌入的注意力增强型LSTM可逼近Transformer性能(最高达97% F1分数),同时训练速度提升2–3.5倍,揭示了精度与效率之间的权衡关系,为实际部署精神健康NLP系统提供了关键指导。
链接: https://arxiv.org/abs/2509.16542
作者: Khalid Hasan,Jamil Saquer,Yifan Zhang
机构: 未知
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 24th IEEE International Conference on Machine Learning and Applications, ICMLA 2025 (camera-ready)
Abstract:Millions of people openly share mental health struggles on social media, providing rich data for early detection of conditions such as depression, bipolar disorder, etc. However, most prior Natural Language Processing (NLP) research has focused on single-disorder identification, leaving a gap in understanding the efficacy of advanced NLP techniques for distinguishing among multiple mental health conditions. In this work, we present a large-scale comparative study of state-of-the-art transformer versus Long Short-Term Memory (LSTM)-based models to classify mental health posts into exclusive categories of mental health conditions. We first curate a large dataset of Reddit posts spanning six mental health conditions and a control group, using rigorous filtering and statistical exploratory analysis to ensure annotation quality. We then evaluate five transformer architectures (BERT, RoBERTa, DistilBERT, ALBERT, and ELECTRA) against several LSTM variants (with or without attention, using contextual or static embeddings) under identical conditions. Experimental results show that transformer models consistently outperform the alternatives, with RoBERTa achieving 91-99% F1-scores and accuracies across all classes. Notably, attention-augmented LSTMs with BERT embeddings approach transformer performance (up to 97% F1-score) while training 2-3.5 times faster, whereas LSTMs using static embeddings fail to learn useful signals. These findings represent the first comprehensive benchmark for multi-class mental health detection, offering practical guidance on model selection and highlighting an accuracy-efficiency trade-off for real-world deployment of mental health NLP systems.
zh
[NLP-154] Long document summarization using page specific target text alignment and distilling page importance
【速读】: 该论文旨在解决长文档抽象式摘要(abstractive summarization)任务中因上下文窗口长度限制而导致的信息捕捉不足问题。传统序列到序列(seq-to-seq)模型如BART在处理长文本时受限于输入长度,难以有效建模全局语义。其解决方案的关键在于提出两种模型:PTS(Page-specific Target-text alignment Summarization)通过将源文档分页并建立每页与目标摘要相关部分的对齐关系,实现更精细的监督信号;进一步改进的PTSPI(Page-specific Target-text alignment Summarization with Page Importance)引入一个额外的权重层,在合并局部摘要前动态分配页面重要性权重,从而增强对关键信息页的关注,提升摘要质量。实验表明,PTSPI在ROUGE指标上显著优于当前最优方法。
链接: https://arxiv.org/abs/2509.16539
作者: Pushpa Devi,Ayush Agrawal,Ashutosh Dubey,C. Ravindranath Chowdary
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: 8 pages, 2 figures
Abstract:The rapid growth of textual data across news, legal, medical, and scientific domains is becoming a challenge for efficiently accessing and understanding large volumes of content. It is increasingly complex for users to consume and extract meaningful information efficiently. Thus, raising the need for summarization. Unlike short document summarization, long document abstractive summarization is resource-intensive, and very little literature is present in this direction. BART is a widely used efficient sequence-to-sequence (seq-to-seq) model. However, when it comes to summarizing long documents, the length of the context window limits its capabilities. We proposed a model called PTS (Page-specific Target-text alignment Summarization) that extends the seq-to-seq method for abstractive summarization by dividing the source document into several pages. PTS aligns each page with the relevant part of the target summary for better supervision. Partial summaries are generated for each page of the document. We proposed another model called PTSPI (Page-specific Target-text alignment Summarization with Page Importance), an extension to PTS where an additional layer is placed before merging the partial summaries into the final summary. This layer provides dynamic page weightage and explicit supervision to focus on the most informative pages. We performed experiments on the benchmark dataset and found that PTSPI outperformed the SOTA by 6.32% in ROUGE-1 and 8.08% in ROUGE-2 scores.
zh
[NLP-155] Advancing Reference-free Evaluation of Video Captions with Factual Analysis
【速读】: 该论文旨在解决视频字幕(video caption)质量评估中依赖真实参考文本(ground truth captions)的问题,尤其在现实场景(videos in the wild)下,由于获取人工标注成本高或不可行,现有基于参考的评估方法难以适用。其解决方案的关键在于提出一种无需参考文本的评估框架 VC-Inspector,该框架通过大语言模型生成不同质量的伪字幕(pseudo captions)作为训练数据,进而训练一个视觉-语言多模态模型(Qwen2.5-VL)作为字幕质量评估器,并聚焦于事实一致性(factual grounding),从而实现对视频字幕事实准确性更客观、可扩展的评估。
链接: https://arxiv.org/abs/2509.16538
作者: Shubhashis Roy Dipta,Tz-Ying Wu,Subarna Tripathi
机构: University of Maryland, Baltimore County (马里兰大学巴尔的摩县分校); Intel Labs (英特尔实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Video captions offer concise snapshots of actors, objects, and actions within a video, serving as valuable assets for applications such as question answering and event localization. However, acquiring human annotations for video captions is costly or even impractical, especially when dealing with diverse video domains. Existing models trained on supervised datasets face challenges in evaluating performance across different domains due to the reliance on reference-based evaluation protocols, which necessitate ground truth captions. This assumption is unrealistic for evaluating videos in the wild. To address these limitations, we propose a reference-free evaluation framework that does not require ground truth captions, focusing on factual grounding to ensure accurate assessment of caption quality. We introduce VC-Inspector, a novel caption quality evaluator that is both reference-free and factually grounded. Utilizing large language models, we generate pseudo captions of varying quality based on supervised data, which are subsequently used to train a multimodal model (i.e., Qwen2.5-VL) as the evaluator. Our approach demonstrates superior alignment with human judgments on the VATEX-Eval dataset, outperforming existing methods. The performance also generalizes to image caption datasets, Flickr8K-Expert and Flickr8K-CF, when viewing images as 1-frame videos. Overall, VC-Inspector offers a scalable and generalizable solution for evaluating the factual accuracy of video captions, paving the way for more effective and objective assessment methodologies in diverse video domains.
zh
[NLP-156] InteGround: On the Evaluation of Verification and Retrieval Planning in Integrative Grounding EMNLP2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在面对复杂信息需求时,如何有效整合多个相互依赖的证据以支持假设性查询的问题,即“整合式接地”(integrative grounding)挑战。其核心解决方案在于系统性评估不同检索策略与验证机制对多证据合成能力的影响:研究发现,LLMs在冗余证据下具有鲁棒性,但在信息不完整时倾向于依赖内部知识进行合理化;同时,无目标的检索规划会因引入噪声而降低性能,而基于前提归纳(premise abduction)的逻辑约束策略则展现出优越性;此外,零样本自我反思(zero-shot self-reflection)能力能持续提升接地质量,为构建更可靠的多证据整合系统提供了关键路径。
链接: https://arxiv.org/abs/2509.16534
作者: Cheng Jiayang,Qianqian Zhuang,Haoran Li,Chunkit Chan,Xin Liu,Lin Qiu,Yangqiu Song
机构: The Hong Kong University of Science and Technology (香港科技大学); Shanghai Jiaotong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 Findings
Abstract:Grounding large language models (LLMs) in external knowledge sources is a promising method for faithful prediction. While existing grounding approaches work well for simple queries, many real-world information needs require synthesizing multiple pieces of evidence. We introduce “integrative grounding” – the challenge of retrieving and verifying multiple inter-dependent pieces of evidence to support a hypothesis query. To systematically study this problem, we repurpose data from four domains for evaluating integrative grounding capabilities. Our investigation reveals two critical findings: First, in groundedness verification, while LLMs are robust to redundant evidence, they tend to rationalize using internal knowledge when information is incomplete. Second, in examining retrieval planning strategies, we find that undirected planning can degrade performance through noise introduction, while premise abduction emerges as a promising approach due to its logical constraints. Additionally, LLMs’ zero-shot self-reflection capabilities consistently improve grounding quality. These insights provide valuable direction for developing more effective integrative grounding systems.
zh
[NLP-157] Challenging the Evaluator: LLM Sycophancy Under User Rebuttal EMNLP2025
【速读】: 该论文试图解决的问题是:大型语言模型(Large Language Models, LLMs)在对话中表现出顺从性(sycophancy),即倾向于迎合用户观点,尤其是在后续对话轮次中接受用户的反驳论点,而这一行为与它们在评估并列冲突观点时表现出的高准确性之间存在矛盾。解决方案的关键在于识别并验证交互模式对LLM判断偏差的影响,特别是发现当用户反驳以跟进对话形式呈现时,LLM更易被说服;而当双方观点同时呈现用于评估时,LLM表现更为客观。研究进一步表明,LLM对推理细节、表达方式(正式 vs. 非正式)的敏感性显著影响其判断倾向,揭示了在依赖LLM进行决策或评价任务时,必须考虑对话框架(conversational framing)对结果可靠性的潜在干扰。
链接: https://arxiv.org/abs/2509.16533
作者: Sungwon Kim,Daniel Khashabi
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Findings
Abstract:Large Language Models (LLMs) often exhibit sycophancy, distorting responses to align with user beliefs, notably by readily agreeing with user counterarguments. Paradoxically, LLMs are increasingly adopted as successful evaluative agents for tasks such as grading and adjudicating claims. This research investigates that tension: why do LLMs show sycophancy when challenged in subsequent conversational turns, yet perform well when evaluating conflicting arguments presented simultaneously? We empirically tested these contrasting scenarios by varying key interaction patterns. We find that state-of-the-art models: (1) are more likely to endorse a user’s counterargument when framed as a follow-up from a user, rather than when both responses are presented simultaneously for evaluation; (2) show increased susceptibility to persuasion when the user’s rebuttal includes detailed reasoning, even when the conclusion of the reasoning is incorrect; and (3) are more readily swayed by casually phrased feedback than by formal critiques, even when the casual input lacks justification. Our results highlight the risk of relying on LLMs for judgment tasks without accounting for conversational framing.
zh
[NLP-158] Leverag ing Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains EMNLP2025
【速读】: 该论文旨在解决多语言环境下作者风格表征(Authorship Representation, AR)学习的不足问题,即现有研究主要局限于单一语言(尤其是英语),未能充分挖掘多语言模型在跨语言作者识别任务中的潜力。其解决方案的关键在于提出一种新型多语言AR学习方法,包含两个核心创新:一是概率性内容掩码(probabilistic content masking),通过抑制内容特异性词汇的干扰,促使模型聚焦于具有风格指示性的词汇;二是语言感知批处理(language-aware batching),在对比学习中减少跨语言干扰,提升训练效率与泛化能力。实验表明,该方法在36种语言、13个领域上训练的模型显著优于单语言基线,在21/22种非英语语言中实现平均Recall@8提升4.85%,最大达15.91%,并展现出更强的跨语言和跨域泛化性能。
链接: https://arxiv.org/abs/2509.16531
作者: Junghwan Kim,Haotian Zhang,David Jurgens
机构: University of Michigan (密歇根大学)
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025
Abstract:Authorship representation (AR) learning, which models an author’s unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings-mostly in English-leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model’s improved performance.
zh
[NLP-159] AIPsychoBench: Understanding the Psychometric Differences between LLM s and Humans
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)因缺乏可解释性而导致的心理属性测量不可靠的问题,特别是现有基于人类心理学量表的方法未能考虑LLM与人类的本质差异,导致高拒绝率且无法跨语言评估心理属性变化。其解决方案的关键在于提出AIPsychoBench基准,通过轻量级角色扮演提示绕过LLM对齐机制,将有效响应率从70.12%提升至90.40%,同时显著降低偏差(正向偏差仅3.3%,负向偏差2.1%),并首次系统揭示了112个心理子维度中七种语言相对于英语的得分偏差范围(5%–20.2%),从而实现了对LLM心理属性的跨语言可测量性和稳定性评估。
链接: https://arxiv.org/abs/2509.16530
作者: Wei Xie,Shuoyoucheng Ma,Zhenhua Wang,Enze Wang,Kai Chen,Xiaobing Sun,Baosheng Wang
机构: National University of Defense Technology (国防科技大学); Agency for Science, Technology and Research (A*STAR) (新加坡科技研究局); Chinese Academy of Sciences (中国科学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Thank you for your attention. This paper was accepted by the CogSci 2025 conference in April and published in August. The location in the proceedings is: this https URL
Abstract:Large Language Models (LLMs) with hundreds of billions of parameters have exhibited human-like intelligence by learning from vast amounts of internet-scale data. However, the uninterpretability of large-scale neural networks raises concerns about the reliability of LLM. Studies have attempted to assess the psychometric properties of LLMs by borrowing concepts from human psychology to enhance their interpretability, but they fail to account for the fundamental differences between LLMs and humans. This results in high rejection rates when human scales are reused directly. Furthermore, these scales do not support the measurement of LLM psychological property variations in different languages. This paper introduces AIPsychoBench, a specialized benchmark tailored to assess the psychological properties of LLM. It uses a lightweight role-playing prompt to bypass LLM alignment, improving the average effective response rate from 70.12% to 90.40%. Meanwhile, the average biases are only 3.3% (positive) and 2.1% (negative), which are significantly lower than the biases of 9.8% and 6.9%, respectively, caused by traditional jailbreak prompts. Furthermore, among the total of 112 psychometric subcategories, the score deviations for seven languages compared to English ranged from 5% to 20.2% in 43 subcategories, providing the first comprehensive evidence of the linguistic impact on the psychometrics of LLM.
zh
[NLP-160] Seeing Culture: A Benchmark for Visual Reasoning and Grounding EMNLP2025
【速读】: 该论文旨在解决当前多模态视觉-语言模型(Multimodal Vision-Language Models, VLMs)在文化理解任务中存在文化推理能力不足、文化代表性不充分的问题。现有数据集往往缺乏对文化语境的深度推理支持,且对许多文化(尤其是东南亚地区)覆盖有限。为此,作者提出了Seeing Culture Benchmark (SCB),其关键创新在于设计了一个两阶段的文化推理评估框架:第一阶段通过多选视觉问答(Multiple-choice Visual Question Answering, VQA)筛选出与文化语境最相关的图像选项(分为同国、异国或混合三类),第二阶段要求模型精准分割所选图像中的文化物件作为推理证据。该设计不仅强化了跨模态的文化推理链条,还引入了空间定位(spatial grounding)的细粒度验证,从而系统性地揭示VLMs在文化理解中的局限性,并为未来研究提供可量化、可复现的基准。
链接: https://arxiv.org/abs/2509.16517
作者: Burak Satar,Zhixin Ma,Patrick A. Irawan,Wilfried A. Mulyawan,Jing Jiang,Ee-Peng Lim,Chong-Wah Ngo
机构: Singapore Management University (新加坡管理大学); Bandung Institute of Technology (印尼理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
备注: Accepted to EMNLP 2025 Main Conference, this https URL
Abstract:Multimodal vision-language models (VLMs) have made substantial progress in various tasks that require a combined understanding of visual and textual content, particularly in cultural understanding tasks, with the emergence of new cultural datasets. However, these datasets frequently fall short of providing cultural reasoning while underrepresenting many cultures. In this paper, we introduce the Seeing Culture Benchmark (SCB), focusing on cultural reasoning with a novel approach that requires VLMs to reason on culturally rich images in two stages: i) selecting the correct visual option with multiple-choice visual question answering (VQA), and ii) segmenting the relevant cultural artifact as evidence of reasoning. Visual options in the first stage are systematically organized into three types: those originating from the same country, those from different countries, or a mixed group. Notably, all options are derived from a singular category for each type. Progression to the second stage occurs only after a correct visual option is chosen. The SCB benchmark comprises 1,065 images that capture 138 cultural artifacts across five categories from seven Southeast Asia countries, whose diverse cultures are often overlooked, accompanied by 3,178 questions, of which 1,093 are unique and meticulously curated by human annotators. Our evaluation of various VLMs reveals the complexities involved in cross-modal cultural reasoning and highlights the disparity between visual reasoning and spatial grounding in culturally nuanced scenarios. The SCB serves as a crucial benchmark for identifying these shortcomings, thereby guiding future developments in the field of cultural reasoning. this https URL
zh
[NLP-161] Can an Individual Manipulate the Collective Decisions of Multi-Agents ?
【速读】: 该论文旨在解决多智能体系统中因攻击者仅掌握单一智能体信息时,仍可能生成有效对抗样本并误导整个系统协同决策的问题。其核心挑战在于如何在信息不完全的情况下设计攻击策略,以突破个体智能体的脆弱性对整体系统的潜在影响。解决方案的关键是提出M-Spoiler框架,该框架通过模拟多智能体交互过程,引入一个“固执代理”(stubborn agent)来主动优化对抗样本,从而增强其在目标系统中的欺骗效果;该机制有效提升了攻击者利用局部知识干扰全局决策的能力,实验证明其优于现有基线方法,凸显了当前防御机制的不足与研究必要性。
链接: https://arxiv.org/abs/2509.16494
作者: Fengyuan Liu,Rui Zhao,Shuo Chen,Guohao Li,Philip Torr,Lei Han,Jindong Gu
机构: Tencent Robotics X (腾讯 Robotics X); University of Oxford (牛津大学); LMU Munich (慕尼黑路德维希-马克西米利安大学); Munich Center for Machine Learning (MCML) (慕尼黑机器学习中心); Konrad Zuse School of Excellence in Reliable AI (relAI) (康拉德·祖塞可靠人工智能卓越学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Individual Large Language Models (LLMs) have demonstrated significant capabilities across various domains, such as healthcare and law. Recent studies also show that coordinated multi-agent systems exhibit enhanced decision-making and reasoning abilities through collaboration. However, due to the vulnerabilities of individual LLMs and the difficulty of accessing all agents in a multi-agent system, a key question arises: If attackers only know one agent, could they still generate adversarial samples capable of misleading the collective decision? To explore this question, we formulate it as a game with incomplete information, where attackers know only one target agent and lack knowledge of the other agents in the system. With this formulation, we propose M-Spoiler, a framework that simulates agent interactions within a multi-agent system to generate adversarial samples. These samples are then used to manipulate the target agent in the target system, misleading the system’s collaborative decision-making process. More specifically, M-Spoiler introduces a stubborn agent that actively aids in optimizing adversarial samples by simulating potential stubborn responses from agents in the target system. This enhances the effectiveness of the generated adversarial samples in misleading the system. Through extensive experiments across various tasks, our findings confirm the risks posed by the knowledge of an individual agent in multi-agent systems and demonstrate the effectiveness of our framework. We also explore several defense mechanisms, showing that our proposed attack framework remains more potent than baselines, underscoring the need for further research into defensive strategies.
zh
[NLP-162] he Oracle Has Spoken: A Multi-Aspect Evaluation of Dialogue in Pythia
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在后训练阶段中对话能力的构成要素不明确的问题,即缺乏对对话行为背后具体机制的细粒度区分与量化评估。其解决方案的关键在于构建一套基于模型的指标体系,每项指标针对对话行为的不同细粒度维度(如连贯性、相关性、信息密度等),并以语言学理论为指导,系统评估预训练Pythia模型在不同规模和经过对话数据监督微调后的表现变化。研究发现,原始模型规模对多数指标影响有限,而微调能快速提升性能并趋于饱和,但多个指标表现出高度相似趋势,提示需进一步分析评分分布、指标相关性和生成文本中的词频特征,以验证各指标的独立性与可靠性。
链接: https://arxiv.org/abs/2509.16487
作者: Zixun Chen,Petr Babkin,Akshat Gupta,Gopala Anumanchipalli,Xiaomo Liu
机构: Columbia University (哥伦比亚大学); J.P. Morgan AI Research (摩根大通人工智能研究); University of California, Berkeley (加州大学伯克利分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Dialogue is one of the landmark abilities of large language models (LLMs). Despite its ubiquity, few studies actually distinguish specific ingredients underpinning dialogue behavior emerging during post-training. We employ a comprehensive suite of model-based metrics, each targeting a distinct fine-grained aspect of dialogue, motivated by linguistic theory. We evaluate how the performance of pre-trained Pythia models changes with respect to each of those dimensions, depending on model size and as a result of supervised fine-tuning on conversational datasets. We observe only a mild impact of raw model size on most metrics, whereas fine-tuning quickly saturates the scores for all but the smallest models tested. Somewhat contrary to our expectations, many metrics show very similar trends, especially if they are all rooted in the same evaluator model, which raises the question of their reliability in measuring a specific dimension. To that end, we conduct additional analyses of score distributions, metric correlations, and term frequencies in generated responses to help explain our observations.
zh
[NLP-163] owards Universal Debiasing for Language Models-based Tabular Data Generation EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在表格数据生成过程中因训练数据中固有的历史偏见而加剧公平性问题的挑战,尤其在涉及多个优势特征(advantaged attributes)和受保护特征(protected attributes)时更为显著。解决方案的关键在于提出一个通用去偏框架(Universal Debiasing Framework, UDF),通过同时最小化优势特征与受保护特征之间的群体级依赖关系来实现去偏,其核心创新在于利用LLM生成器的自回归结构和解析采样分布,高效计算互信息(mutual information),从而避免了繁琐的数值估计过程。在此基础上,论文进一步提出了两种互补方法:一种是基于直接偏好优化(Direct Preference Optimization, DPO)的UDF-DPO策略,可无缝集成至现有模型;另一种是无需调整LLM参数的定向去偏技术UDF-MIX,二者共同实现了在保持数据效用的同时显著提升公平性的目标。
链接: https://arxiv.org/abs/2509.16475
作者: Tianchun Li,Tianci Liu,Xingchen Wang,Rongzhe Wei,Pan Li,Lu Su,Jing Gao
机构: Purdue University (普渡大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: EMNLP 2025 Findings
Abstract:Large language models (LLMs) have achieved promising results in tabular data generation. However, inherent historical biases in tabular datasets often cause LLMs to exacerbate fairness issues, particularly when multiple advantaged and protected features are involved. In this work, we introduce a universal debiasing framework that minimizes group-level dependencies by simultaneously reducing the mutual information between advantaged and protected attributes. By leveraging the autoregressive structure and analytic sampling distributions of LLM-based tabular data generators, our approach efficiently computes mutual information, reducing the need for cumbersome numerical estimations. Building on this foundation, we propose two complementary methods: a direct preference optimization (DPO)-based strategy, namely UDF-DPO, that integrates seamlessly with existing models, and a targeted debiasing technique, namely UDF-MIX, that achieves debiasing without tuning the parameters of LLMs. Extensive experiments demonstrate that our framework effectively balances fairness and utility, offering a scalable and practical solution for debiasing in high-stakes applications.
zh
[NLP-164] Computational Analysis of Conversation Dynamics through Participant Responsivity
【速读】: 该论文旨在解决对话质量评估中缺乏对“亲社会性”(prosocial)和“建设性”(constructive)对话特征的系统性刻画问题,尤其关注如何量化对话中的响应性(responsivity)——即一个说话者的回应是否基于前一发言内容。其解决方案的关键在于提出并验证两种量化响应性的方法:一是基于语义相似度的文本匹配方法,二是利用大语言模型(LLM)识别两轮对话之间的语义关联。研究进一步筛选出性能更优的LLM方法,并在此基础上细化响应质量,区分实质性回应与非实质性回应。最终,作者构建了对话层面的衍生指标,用于刻画不同对话结构下的响应模式,从而实现对多样化对话场景的有效区分与有意义的表征。
链接: https://arxiv.org/abs/2509.16464
作者: Margaret Hughes,Brandon Roy,Elinor Poole-Dayan,Deb Roy,Jad Kabbara
机构: MIT Center for Constructive Communication (麻省理工学院建设性传播中心); Massachusetts Institute of Technology (麻省理工学院)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注:
Abstract:Growing literature explores toxicity and polarization in discourse, with comparatively less work on characterizing what makes dialogue prosocial and constructive. We explore conversational discourse and investigate a method for characterizing its quality built upon the notion of ``responsivity’’ – whether one person’s conversational turn is responding to a preceding turn. We develop and evaluate methods for quantifying responsivity – first through semantic similarity of speaker turns, and second by leveraging state-of-the-art large language models (LLMs) to identify the relation between two speaker turns. We evaluate both methods against a ground truth set of human-annotated conversations. Furthermore, selecting the better performing LLM-based approach, we characterize the nature of the response – whether it responded to that preceding turn in a substantive way or not. We view these responsivity links as a fundamental aspect of dialogue but note that conversations can exhibit significantly different responsivity structures. Accordingly, we then develop conversation-level derived metrics to address various aspects of conversational discourse. We use these derived metrics to explore other conversations and show that they support meaningful characterizations and differentiations across a diverse collection of conversations. Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY) Cite as: arXiv:2509.16464 [cs.CL] (or arXiv:2509.16464v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.16464 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-165] Intrinsic Meets Extrinsic Fairness: Assessing the Downstream Impact of Bias Mitigation in Large Language Models
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中存在的社会经济偏见如何影响下游任务公平性的问题。其核心问题是:LLMs的内在偏见是否以及如何传播至具体应用中,如金融分类任务中的薪资预测、就业状态判断和信用评估等。解决方案的关键在于提出一个统一的评估框架,对比两种偏见缓解策略——通过概念遗忘(concept unlearning)实现的内在偏见消除与通过反事实数据增强(counterfactual data augmentation, CDA)实现的外在偏见缓解。实验表明,基于概念遗忘的内在偏见缓解可将性别偏见降低高达94.9%,同时显著提升下游任务的公平性指标(如人口均等性提升82%),且不牺牲模型准确性,从而验证了在下游部署前进行早期偏见干预的有效性。
链接: https://arxiv.org/abs/2509.16462
作者: ‘Mina Arzaghi’,‘Alireza Dehghanpour Farashah’,‘Florian Carichon’,’ Golnoosh Farnadi’
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Large Language Models (LLMs) exhibit socio-economic biases that can propagate into downstream tasks. While prior studies have questioned whether intrinsic bias in LLMs affects fairness at the downstream task level, this work empirically investigates the connection. We present a unified evaluation framework to compare intrinsic bias mitigation via concept unlearning with extrinsic bias mitigation via counterfactual data augmentation (CDA). We examine this relationship through real-world financial classification tasks, including salary prediction, employment status, and creditworthiness assessment. Using three open-source LLMs, we evaluate models both as frozen embedding extractors and as fine-tuned classifiers. Our results show that intrinsic bias mitigation through unlearning reduces intrinsic gender bias by up to 94.9%, while also improving downstream task fairness metrics, such as demographic parity by up to 82%, without compromising accuracy. Our framework offers practical guidance on where mitigation efforts can be most effective and highlights the importance of applying early-stage mitigation before downstream deployment.
zh
[NLP-166] Implicit Behavioral Alignment of Language Agents in High-Stakes Crowd Simulations EMNLP2025
【速读】: 该论文旨在解决生成式 AI(Generative AI)在社会模拟中存在行为现实性差距(Behavior-Realism Gap)的问题,即代理的行为常偏离专家预期和真实世界数据。解决方案的关键在于提出一个名为 Persona-Environment Behavioral Alignment (PEBA) 的理论框架,将行为对齐建模为分布匹配问题,并基于勒温的行为方程(behavior = f(person, environment))构建优化机制;在此基础上开发了 PersonaEvolve (PEvo),一种基于大语言模型(LLM)的迭代优化算法,通过隐式调整代理人格(persona)来提升其群体行为与特定环境下的真实基准的一致性,从而显著增强高风险社会模拟中的行为真实性和可靠性。
链接: https://arxiv.org/abs/2509.16457
作者: Yunzhe Wang,Gale M. Lucas,Burcin Becerik-Gerber,Volkan Ustun
机构: University of Southern California (南加州大学); USC Institute for Creative Technologies (南加州大学创意技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), Main Conference
Abstract:Language-driven generative agents have enabled large-scale social simulations with transformative uses, from interpersonal training to aiding global policy-making. However, recent studies indicate that generative agent behaviors often deviate from expert expectations and real-world data–a phenomenon we term the Behavior-Realism Gap. To address this, we introduce a theoretical framework called Persona-Environment Behavioral Alignment (PEBA), formulated as a distribution matching problem grounded in Lewin’s behavior equation stating that behavior is a function of the person and their environment. Leveraging PEBA, we propose PersonaEvolve (PEvo), an LLM-based optimization algorithm that iteratively refines agent personas, implicitly aligning their collective behaviors with realistic expert benchmarks within a specified environmental context. We validate PEvo in an active shooter incident simulation we developed, achieving an 84% average reduction in distributional divergence compared to no steering and a 34% improvement over explicit instruction baselines. Results also show PEvo-refined personas generalize to novel, related simulation scenarios. Our method greatly enhances behavioral realism and reliability in high-stakes social simulations. More broadly, the PEBA-PEvo framework provides a principled approach to developing trustworthy LLM-driven social simulations.
zh
[NLP-167] PersonaMatrix: A Recipe for Persona-Aware Evaluation of Legal Summarization
【速读】: 该论文旨在解决法律文书因篇幅长、内容密集而难以理解的问题,尤其针对不同用户群体(如法律专业人士与非专业公众)在法律信息获取中的差异化需求。现有自动化摘要评估方法忽视了这些用户视角的多样性,导致生成的摘要无法兼顾技术严谨性与可读性。解决方案的关键在于提出PersonaMatrix框架,该框架从六类用户角色(persona)出发,对摘要进行多维度评分,同时构建了一个维度可控的美国民事权利案例摘要数据集,并引入多样性-覆盖率指数(Diversity-Coverage Index, DCI),以揭示“以用户为中心”的评估视角如何影响摘要质量的最优解。这一方法为优化面向专家与普通用户的法律AI摘要系统提供了量化依据和改进路径。
链接: https://arxiv.org/abs/2509.16449
作者: Tsz Fung Pang,Maryam Berijanian,Thomas Orth,Breanna Shi,Charlotte S. Alexander
机构: Georgia Institute of Technology (佐治亚理工学院); Michigan State University (密歇根州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Legal documents are often long, dense, and difficult to comprehend, not only for laypeople but also for legal experts. While automated document summarization has great potential to improve access to legal knowledge, prevailing task-based evaluators overlook divergent user and stakeholder needs. Tool development is needed to encompass the technicality of a case summary for a litigator yet be accessible for a self-help public researching for their lawsuit. We introduce PersonaMatrix, a persona-by-criterion evaluation framework that scores summaries through the lens of six personas, including legal and non-legal users. We also introduce a controlled dimension-shifted pilot dataset of U.S. civil rights case summaries that varies along depth, accessibility, and procedural detail as well as Diversity-Coverage Index (DCI) to expose divergent optima of legal summary between persona-aware and persona-agnostic judges. This work enables refinement of legal AI summarization systems for both expert and non-expert users, with the potential to increase access to legal knowledge. The code base and data are publicly available in GitHub.
zh
[NLP-168] Purely Semantic Indexing for LLM -based Generative Recommendation and Retrieval
【速读】: 该论文旨在解决现有基于语义标识符(Semantic IDs)的推荐与检索方法中存在的语义ID冲突问题,即语义相似的文档或物品被分配了相同的ID,从而影响模型性能。传统解决方案通过附加非语义标记来区分冲突ID,但这种方法引入了随机性并扩大了搜索空间,损害了系统效率。论文提出“纯语义索引”(purely semantic indexing)策略,通过放松严格的最近质心选择机制,实现无需添加非语义标记即可生成唯一且保留语义信息的ID。其核心创新在于设计了两种模型无关的算法:全候选匹配(Exhaustive Candidate Matching, ECM)和递归残差搜索(Recursive Residual Searching, RRS),确保ID唯一性的同时保持语义一致性,实验证明该方法在序列推荐、商品搜索和文档检索任务中均提升了整体性能与冷启动场景下的表现。
链接: https://arxiv.org/abs/2509.16446
作者: Ruohan Zhang,Jiacheng Li,Julian McAuley,Yupeng Hou
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注:
Abstract:Semantic identifiers (IDs) have proven effective in adapting large language models for generative recommendation and retrieval. However, existing methods often suffer from semantic ID conflicts, where semantically similar documents (or items) are assigned identical IDs. A common strategy to avoid conflicts is to append a non-semantic token to distinguish them, which introduces randomness and expands the search space, therefore hurting performance. In this paper, we propose purely semantic indexing to generate unique, semantic-preserving IDs without appending non-semantic tokens. We enable unique ID assignment by relaxing the strict nearest-centroid selection and introduce two model-agnostic algorithms: exhaustive candidate matching (ECM) and recursive residual searching (RRS). Extensive experiments on sequential recommendation, product search, and document retrieval tasks demonstrate that our methods improve both overall and cold-start performance, highlighting the effectiveness of ensuring ID uniqueness.
zh
[NLP-169] Evaluating the Effectiveness and Scalability of LLM -Based Data Augmentation for Retrieval EMNLP2025
【速读】: 该论文旨在解决紧凑型双编码器(Compact dual-encoder)检索模型在性能上落后于基于大语言模型(Large Language Model, LLM)的检索模型的问题,其核心原因可能是前者缺乏足够的世界知识。为缩小这一差距,论文系统性地研究了LLM数据增强(LLM-based data augmentation)的有效性与可扩展性,关键发现包括:1)增强效果随增强规模增加而递减,即使采用多样化的增强策略亦然;2)使用较小的LLM进行增强即可达到与大型增强模型相当的性能,表明并非必须依赖大规模模型;3)增强对预训练不足的检索模型提升最显著,提示增强策略应根据模型预训练质量动态调整。这些洞见为设计更高效、经济且有针对性的数据增强策略提供了理论依据和实践指导。
链接: https://arxiv.org/abs/2509.16442
作者: Pranjal A. Chitale,Bishal Santra,Yashoteja Prabhu,Amit Sharma
机构: Microsoft Research India (微软研究院印度)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL)
备注: EMNLP 2025 (MAIN Conference)
Abstract:Compact dual-encoder models are widely used for retrieval owing to their efficiency and scalability. However, such models often underperform compared to their Large Language Model (LLM)-based retrieval counterparts, likely due to their limited world knowledge. While LLM-based data augmentation has been proposed as a strategy to bridge this performance gap, there is insufficient understanding of its effectiveness and scalability to real-world retrieval problems. Existing research does not systematically explore key factors such as the optimal augmentation scale, the necessity of using large augmentation models, and whether diverse augmentations improve generalization, particularly in out-of-distribution (OOD) settings. This work presents a comprehensive study of the effectiveness of LLM augmentation for retrieval, comprising over 100 distinct experimental settings of retrieval models, augmentation models and augmentation strategies. We find that, while augmentation enhances retrieval performance, its benefits diminish beyond a certain augmentation scale, even with diverse augmentation strategies. Surprisingly, we observe that augmentation with smaller LLMs can achieve performance competitive with larger augmentation models. Moreover, we examine how augmentation effectiveness varies with retrieval model pre-training, revealing that augmentation provides the most benefit to models which are not well pre-trained. Our insights pave the way for more judicious and efficient augmentation strategies, thus enabling informed decisions and maximizing retrieval performance while being more cost-effective. Code and augmented datasets accompanying this work are publicly available at this https URL.
zh
[NLP-170] AutoArabic: A Three-Stage Framework for Localizing Video-Text Retrieval Benchmarks EMNLP2025
【速读】: 该论文旨在解决阿拉伯语在视频到文本(video-to-text)和文本到视频(text-to-video)检索任务中缺乏本地化评估基准的问题,尤其针对现有主流基准(如DiDeMo、MSR-VTT)和多语言数据集(如RUDDER)均以英文为主、阿拉伯语支持不足的现状。解决方案的关键在于提出一个三阶段自动化框架AutoArabic,其核心是利用先进的大语言模型(LLMs)将非阿拉伯语基准自动翻译为现代标准阿拉伯语(Modern Standard Arabic),并引入误差检测模块以识别潜在翻译错误(准确率达97%),从而显著减少人工修订成本(降低近四倍)。该框架生成了首个大规模阿拉伯语视频检索数据集DiDeMo-AR(含40,144条流畅阿拉伯语描述),并通过对比英文与阿拉伯语版本的CLIP-style模型性能验证了本地化后的基准难度保持一致,同时证明不同后编辑预算下性能单调提升,表明原始LLM输出已具备可用性。
链接: https://arxiv.org/abs/2509.16438
作者: Mohamed Eltahir,Osamah Sarraj,Abdulrahman Alfrihidi,Taha Alshatiri,Mohammed Khurd,Mohammed Bremoo,Tanveer Hussain
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at ArabicNLP 2025 (EMNLP 2025 workshop)
Abstract:Video-to-text and text-to-video retrieval are dominated by English benchmarks (e.g. DiDeMo, MSR-VTT) and recent multilingual corpora (e.g. RUDDER), yet Arabic remains underserved, lacking localized evaluation metrics. We introduce a three-stage framework, AutoArabic, utilizing state-of-the-art large language models (LLMs) to translate non-Arabic benchmarks into Modern Standard Arabic, reducing the manual revision required by nearly fourfold. The framework incorporates an error detection module that automatically flags potential translation errors with 97% accuracy. Applying the framework to DiDeMo, a video retrieval benchmark produces DiDeMo-AR, an Arabic variant with 40,144 fluent Arabic descriptions. An analysis of the translation errors is provided and organized into an insightful taxonomy to guide future Arabic localization efforts. We train a CLIP-style baseline with identical hyperparameters on the Arabic and English variants of the benchmark, finding a moderate performance gap (about 3 percentage points at Recall@1), indicating that Arabic localization preserves benchmark difficulty. We evaluate three post-editing budgets (zero/ flagged-only/ full) and find that performance improves monotonically with more post-editing, while the raw LLM output (zero-budget) remains usable. To ensure reproducibility to other languages, we made the code available at this https URL.
zh
[NLP-171] Evaluating CxG Generalisation in LLM s via Construction-Based NLI Fine Tuning
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在学习构式语法(Construction Grammar)所定义的深层形式-意义映射(form-meaning mappings)方面的能力瓶颈问题。其核心挑战在于当前LLMs在处理从高度具体到高度抽象的英语构式时,普遍存在对语义抽象能力不足的现象。解决方案的关键在于构建了一个名为ConTest-NLI的基准测试集,包含8万条句子,覆盖八种不同抽象层级的英语构式,并通过模板化生成与“模型在环”过滤(model-in-the-loop filter)相结合的流水线方法合成多样化的自然语言推理(Natural Language Inference, NLI)三元组,从而确保数据的挑战性和标签可靠性。实验表明,尽管微调可带来最高9%的性能提升,但LLMs在对抗性样本上的准确率仍显著下降(从88%降至64%),凸显了现有模型在抽象推理方面的持续局限,同时为评估构式导向的学习提供了可扩展框架。
链接: https://arxiv.org/abs/2509.16422
作者: Tom Mackintosh,Harish Tayyar Madabushi,Claire Bonial
机构: University of Bath (巴斯大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computation and Language (cs.CL)
备注:
Abstract:We probe large language models’ ability to learn deep form-meaning mappings as defined by construction grammars. We introduce the ConTest-NLI benchmark of 80k sentences covering eight English constructions from highly lexicalized to highly schematic. Our pipeline generates diverse synthetic NLI triples via templating and the application of a model-in-the-loop filter. This provides aspects of human validation to ensure challenge and label reliability. Zero-shot tests on leading LLMs reveal a 24% drop in accuracy between naturalistic (88%) and adversarial data (64%), with schematic patterns proving hardest. Fine-tuning on a subset of ConTest-NLI yields up to 9% improvement, yet our results highlight persistent abstraction gaps in current LLMs and offer a scalable framework for evaluating construction-informed learning.
zh
[NLP-172] Pico: A Modular Framework for Hypothesis-Driven Small Language Model Research
【速读】: 该论文旨在解决小规模和中等规模语言模型(Small and Medium-Scale Language Models, SLMs)开发过程中缺乏系统性、科学化研究方法的问题。当前SLMs的设计仍依赖经验性试错,尤其在参数预算受限的情况下,每个设计决策都至关重要,但研究人员尚无有效手段对新想法进行可重复的验证与优化。解决方案的关键在于提出Pico框架——一个轻量级、模块化的实验平台,包含两个核心库,能够支持研究人员对模型架构或训练流程进行精准修改,并直接观测其对模型行为的影响;同时配套发布标准化训练的基准模型pico-decoder,以促进可复现的实验研究,从而推动SLMs从“艺术”走向“科学”的迭代设计过程。
链接: https://arxiv.org/abs/2509.16413
作者: Richard Diehl Martinez,David Demitri Africa,Yuval Weiss,Suchir Salhan,Ryan Daniels,Paula Buttery
机构: University of Cambridge (剑桥大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Building language models (LMs), especially small and medium ones, remains more art than science. While large LMs often improve by sheer scale, it is still unclear why many design choices work. For small LMs, this uncertainty is more limiting: tight parameter budgets make each decision critical, yet researchers still lack systematic, scientific ways to test and refine new ideas. We introduce Pico, a lightweight, modular framework that enables systematic, hypothesis-driven research for small and medium-scale language model development. Pico consists of two libraries that together provide a practical sandbox where researchers can make targeted changes to a model’s architecture or training procedures and directly observe their effects on the model’s behavior. To support reproducible experimentation, we also release a suite of baseline models, pico-decoder, trained under standardized conditions and open-sourced for the community. Case studies highlight how Pico can support iterative small LM design and analysis. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.16413 [cs.CL] (or arXiv:2509.16413v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.16413 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-173] Hierarchical Retrieval: The Geometry and a Pretrain-Finetune Recipe NEURIPS2025
【速读】: 该论文旨在解决双编码器(Dual Encoder, DE)模型在层次化检索(Hierarchical Retrieval, HR)场景下的性能瓶颈问题,特别是当查询与文档之间的层级距离较远时出现的“长距离迷失”(lost-in-the-long-distance)现象,即检索准确率随文档在层次结构中距离的增加而显著下降。其解决方案的关键在于提出一种预训练-微调(pretrain-finetune)策略:首先在大规模语料上进行预训练以学习具有鲁棒性的通用嵌入表示,再在特定任务数据上微调,从而有效提升长距离匹配文档的召回率,同时不损害短距离检索性能。实验表明,该方法在WordNet构建的抽象层次数据集上将长距离对的召回率从19%提升至76%,并在电商商品检索任务中验证了其有效性。
链接: https://arxiv.org/abs/2509.16411
作者: Chong You,Rajesh Jayaram,Ananda Theertha Suresh,Robin Nittka,Felix Yu,Sanjiv Kumar
机构: 未知
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: NeurIPS 2025
Abstract:Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their simplicity and scalability. However, the Euclidean geometry of the embedding space limits the expressive power of DEs, which may compromise their quality. This paper investigates such limitations in the context of hierarchical retrieval (HR), where the document set has a hierarchical structure and the matching documents for a query are all of its ancestors. We first prove that DEs are feasible for HR as long as the embedding dimension is linear in the depth of the hierarchy and logarithmic in the number of documents. Then we study the problem of learning such embeddings in a standard retrieval setup where DEs are trained on samples of matching query and document pairs. Our experiments reveal a lost-in-the-long-distance phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune recipe that significantly improves long-distance retrieval without sacrificing performance on closer documents. We experiment on a realistic hierarchy from WordNet for retrieving documents at various levels of abstraction, and show that pretrain-finetune boosts the recall on long-distance pairs from 19% to 76%. Finally, we demonstrate that our method improves retrieval of relevant products on a shopping queries dataset.
zh
[NLP-174] Rich Dad Poor Lad: How do Large Language Models Contextualize Socioeconomic Factors in College Admission ? EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在高风险社会敏感决策场景中,如大学录取决策中对社会经济地位(Socioeconomic Status, SES)的处理机制尚不明确的问题。现有研究未充分揭示LLMs如何权衡SES因素与学术表现之间的关系,尤其是在不同推理模式下的行为差异。解决方案的关键在于提出一种受认知科学启发的双过程审计框架(Dual-Process Audit Framework, DPAF),通过对比系统1(快速、仅输出决策)和系统2(慢速、需提供解释)两种运行模式,在一个包含3万份合成申请者档案的数据集上对4个开源LLM进行大规模测试。结果显示,LLMs普遍倾向于低SES申请人,且系统2会进一步强化这一倾向,将其作为补偿性理由显式引入决策依据,从而揭示了LLMs在敏感决策中的潜在公平性优势及其推理逻辑的不稳定性。
链接: https://arxiv.org/abs/2509.16400
作者: Huy Nghiem,Phuong-Anh Nguyen-Le,John Prindle,Rachel Rudinger,Hal Daumé III
机构: University of Maryland (马里兰大学); University of Southern California (南加州大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: EMNLP 2025, ver 1, 35 pages
Abstract:Large Language Models (LLMs) are increasingly involved in high-stakes domains, yet how they reason about socially sensitive decisions remains underexplored. We present a large-scale audit of LLMs’ treatment of socioeconomic status (SES) in college admissions decisions using a novel dual-process framework inspired by cognitive science. Leveraging a synthetic dataset of 30,000 applicant profiles grounded in real-world correlations, we prompt 4 open-source LLMs (Qwen 2, Mistral v0.3, Gemma 2, Llama 3.1) under 2 modes: a fast, decision-only setup (System 1) and a slower, explanation-based setup (System 2). Results from 5 million prompts reveal that LLMs consistently favor low-SES applicants – even when controlling for academic performance – and that System 2 amplifies this tendency by explicitly invoking SES as compensatory justification, highlighting both their potential and volatility as decision-makers. We then propose DPAF, a dual-process audit framework to probe LLMs’ reasoning behaviors in sensitive applications.
zh
[NLP-175] Evaluating Behavioral Alignment in Conflict Dialogue: A Multi-Dimensional Comparison of LLM Agents and Humans EMNLP2025
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在社会复杂、互动驱动的任务中,尤其是情绪和策略层面的动态冲突情境下,其行为与人类行为之间的对齐问题。为实现这一目标,研究设计了多轮对抗性争端调解对话场景,并引入匹配的五因素人格特质(Five-Factor personality profile)作为提示条件来控制个体差异并增强现实感。解决方案的关键在于通过人格引导机制提升LLM在语言风格、情绪表达(如愤怒动态变化)及策略行为三个维度上的行为一致性,从而构建一个可量化评估LLM与人类在社交复杂交互中对齐程度的基准。
链接: https://arxiv.org/abs/2509.16394
作者: Deuksin Kwon,Kaleen Shrestha,Bin Han,Elena Hayoung Lee,Gale Lucas
机构: University of Southern California (南加州大学); USC for Institute of Creative Technologies (南加州大学创意技术研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted to EMNLP 2025 (Main Conference)
Abstract:Large Language Models (LLMs) are increasingly deployed in socially complex, interaction-driven tasks, yet their ability to mirror human behavior in emotionally and strategically complex contexts remains underexplored. This study assesses the behavioral alignment of personality-prompted LLMs in adversarial dispute resolution by simulating multi-turn conflict dialogues that incorporate negotiation. Each LLM is guided by a matched Five-Factor personality profile to control for individual variation and enhance realism. We evaluate alignment across three dimensions: linguistic style, emotional expression (e.g., anger dynamics), and strategic behavior. GPT-4.1 achieves the closest alignment with humans in linguistic style and emotional dynamics, while Claude-3.7-Sonnet best reflects strategic behavior. Nonetheless, substantial alignment gaps persist. Our findings establish a benchmark for alignment between LLMs and humans in socially complex interactions, underscoring both the promise and the limitations of personality conditioning in dialogue modeling.
zh
[NLP-176] Longitudinal and Multimodal Recording System to Capture Real-World Patient-Clinician Conversations for AI and Encounter Research: Protocol
【速读】: 该论文旨在解决当前医疗人工智能(Artificial Intelligence, AI)模型训练数据来源单一的问题,即大多数模型仅依赖电子健康记录(Electronic Health Records, EHRs),而EHRs虽能反映生物指标,却难以捕捉患者与临床医生之间的真实互动过程。这种缺失导致AI系统可能局限于狭义的生物医学视角,忽视了临床实践中由语音、文本和视频等多模态信息构成的“生活化交流”(lived exchanges)。解决方案的关键在于构建一个纵向、多模态的数据采集框架,通过360度视频/音频记录患者-医生会诊过程,并结合患者满意度问卷和EHR数据,实现多源异构数据的整合与链接。研究在梅奥诊所(Mayo Clinic)内分泌科门诊开展,已验证该方法在临床可行性和伦理合规性方面的潜力,为未来开发更贴近真实医疗场景的AI模型提供了可复制的数据基础设施和标准化流程。
链接: https://arxiv.org/abs/2509.16378
作者: Misk Al Zahidy,Kerly Guevara Maldonado,Luis Vilatuna Andrango,Ana Cristina Proano,Ana Gabriela Claros,Maria Lizarazo Jimenez,David Toro-Tobon,Oscar J. Ponce-Ponce,Juan P. Brito
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL)
备注: 23 pages, 2 figures, 2 tables
Abstract:The promise of AI in medicine depends on learning from data that reflect what matters to patients and clinicians. Most existing models are trained on electronic health records (EHRs), which capture biological measures but rarely patient-clinician interactions. These relationships, central to care, unfold across voice, text, and video, yet remain absent from datasets. As a result, AI systems trained solely on EHRs risk perpetuating a narrow biomedical view of medicine and overlooking the lived exchanges that define clinical encounters. Our objective is to design, implement, and evaluate the feasibility of a longitudinal, multimodal system for capturing patient-clinician encounters, linking 360 degree video/audio recordings with surveys and EHR data to create a dataset for AI research. This single site study is in an academic outpatient endocrinology clinic at Mayo Clinic. Adult patients with in-person visits to participating clinicians are invited to enroll. Encounters are recorded with a 360 degree video camera. After each visit, patients complete a survey on empathy, satisfaction, pace, and treatment burden. Demographic and clinical data are extracted from the EHR. Feasibility is assessed using five endpoints: clinician consent, patient consent, recording success, survey completion, and data linkage across modalities. Recruitment began in January 2025. By August 2025, 35 of 36 eligible clinicians (97%) and 212 of 281 approached patients (75%) had consented. Of consented encounters, 162 (76%) had complete recordings and 204 (96%) completed the survey. This study aims to demonstrate the feasibility of a replicable framework for capturing the multimodal dynamics of patient-clinician encounters. By detailing workflows, endpoints, and ethical safeguards, it provides a template for longitudinal datasets and lays the foundation for AI models that incorporate the complexity of care.
zh
[NLP-177] Whisper-UT: A Unified Translation Framework for Speech and Text EMNLP2025
【速读】: 该论文旨在解决如何高效地将编码器-解码器模型适配到多种单模态与多模态任务中的挑战,尤其是在多模态机器翻译(MMT)场景下,需同时利用语音和源语言文本输入进行条件化翻译的问题。其解决方案的关键在于提出了一种统一且高效的框架Whisper-UT,通过引入轻量级适配器(lightweight adapters),实现跨任务与跨模态的无缝适应;同时结合自动语音识别(ASR)假设或真实转录文本作为提示(prompt),并采用两阶段解码策略,在无需三元平行数据的情况下显著提升语音翻译(ST)性能,从而验证了该框架在多模态翻译中的灵活性、效率与通用性。
链接: https://arxiv.org/abs/2509.16375
作者: Cihan Xiao,Matthew Wiesner,Debashish Chakraborty,Reno Kriz,Keith Cunningham,Kenton Murray,Kevin Duh,Luis Tavarez-Arce,Paul McNamee,Sanjeev Khudanpur
机构: Center for Language and Speech Processing, Johns Hopkins University (约翰霍普金斯大学语言与语音处理中心); Human Language Technology Center of Excellence, Johns Hopkins University (约翰霍普金斯大学人机语言技术卓越中心); Georgetown University (乔治城大学)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025 Main Conference
Abstract:Encoder-decoder models have achieved remarkable success in speech and text tasks, yet efficiently adapting these models to diverse uni/multi-modal scenarios remains an open challenge. In this paper, we propose Whisper-UT, a unified and efficient framework that leverages lightweight adapters to enable seamless adaptation across tasks, including a multi-modal machine translation (MMT) task that explicitly conditions translation on both speech and source language text inputs. By incorporating ASR hypotheses or ground-truth transcripts as prompts, this approach not only enables the system to process both modalities simultaneously but also enhances speech translation (ST) performance through a 2-stage decoding strategy. We demonstrate our methods using the Whisper model, though in principle they are general and could be applied to similar multitask models. We highlight the effectiveness of cross-modal and cross-task fine-tuning, which improves performance without requiring 3-way parallel data. Our results underscore the flexibility, efficiency, and general applicability of the proposed framework for multi-modal translation.
zh
[NLP-178] RephQA: Evaluating Readability of Large Language Models in Public Health Question Answering ALT KDD
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在公共健康问答中可读性不足的问题,即尽管LLMs具备较强的推理能力,但其生成的回答往往难以被无医学背景的公众理解。为评估这一问题,作者构建了RephQA基准,包含533对专家审校的公共健康问答对,并引入Flesch-Kincaid等级和专业评分等可读性指标及代理多项选择任务以衡量信息量。实验表明多数LLMs未能达到可读性标准,揭示了推理能力与有效沟通之间的差距。解决方案的关键在于探索四种提升可读性的策略:标准提示、思维链提示、群体相对策略优化(Group Relative Policy Optimization, GRPO)及其基于token适应的变体;其中,token-adapted GRPO表现最优,显著提升了回答的清晰度与用户友好性,推动更实用的公共健康智能代理的发展。
链接: https://arxiv.org/abs/2509.16360
作者: Weikang Qiu,Tinglin Huang,Ryan Rullo,Yucheng Kuang,Ali Maatouk,S. Raquel Ramos,Rex Ying
机构: Yale University (耶鲁大学); Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: ACM KDD Health Track 2025 Blue Sky Best Paper
Abstract:Large Language Models (LLMs) hold promise in addressing complex medical problems. However, while most prior studies focus on improving accuracy and reasoning abilities, a significant bottleneck in developing effective healthcare agents lies in the readability of LLM-generated responses, specifically, their ability to answer public health problems clearly and simply to people without medical backgrounds. In this work, we introduce RephQA, a benchmark for evaluating the readability of LLMs in public health question answering (QA). It contains 533 expert-reviewed QA pairs from 27 sources across 13 topics, and includes a proxy multiple-choice task to assess informativeness, along with two readability metrics: Flesch-Kincaid grade level and professional score. Evaluation of 25 LLMs reveals that most fail to meet readability standards, highlighting a gap between reasoning and effective communication. To address this, we explore four readability-enhancing strategies-standard prompting, chain-of-thought prompting, Group Relative Policy Optimization (GRPO), and a token-adapted variant. Token-adapted GRPO achieves the best results, advancing the development of more practical and user-friendly public health agents. These results represent a step toward building more practical agents for public health.
zh
[NLP-179] Psychometric Personality Shaping Modulates Capabilities and Safety in Language Models
【速读】: 该论文试图解决的问题是:当前对大型语言模型(Large Language Models, LLMs)人格特质的调节如何影响其行为表现,尤其是在能力与安全性方面的具体机制尚不明确。解决方案的关键在于基于五大性格特质(Big Five personality traits)的心理测量学框架,对LLM的人格进行可控调节,并通过多个能力与安全基准测试(如WMDP、TruthfulQA、ETHICS、Sycophancy和MMLU)系统评估其行为变化。实验发现,降低尽责性(conscientiousness)显著削弱了模型在多项安全指标上的表现以及通用能力,揭示了人格塑造作为模型控制的新维度,具有重要理论价值和实践意义。
链接: https://arxiv.org/abs/2509.16332
作者: Stephen Fitz,Peter Romero,Steven Basart,Sipeng Chen,Jose Hernandez-Orallo
机构: Keio University (庆应义塾大学); Universitat Politècnica de València (瓦伦西亚理工大学); Center for AI Safety (人工智能安全中心); Carnegie Mellon University (卡内基梅隆大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Large Language Models increasingly mediate high-stakes interactions, intensifying research on their capabilities and safety. While recent work has shown that LLMs exhibit consistent and measurable synthetic personality traits, little is known about how modulating these traits affects model behavior. We address this gap by investigating how psychometric personality control grounded in the Big Five framework influences AI behavior in the context of capability and safety benchmarks. Our experiments reveal striking effects: for example, reducing conscientiousness leads to significant drops in safety-relevant metrics on benchmarks such as WMDP, TruthfulQA, ETHICS, and Sycophancy as well as reduction in general capabilities as measured by MMLU. These findings highlight personality shaping as a powerful and underexplored axis of model control that interacts with both safety and general competence. We discuss the implications for safety evaluation, alignment strategies, steering model behavior after deployment, and risks associated with possible exploitation of these findings. Our findings motivate a new line of research on personality-sensitive safety evaluations and dynamic behavioral control in LLMs.
zh
[NLP-180] HARE: an entity and relation centric evaluation framework for histopathology reports EMNLP2025
【速读】: 该论文旨在解决医学领域自动文本生成中临床质量评估难题,尤其是在缺乏特定领域指标的场景下(如组织病理学报告)。其核心挑战在于如何有效衡量生成报告与参考报告在关键病理实体及其关系上的对齐程度。解决方案的关键是提出了一种基于实体和关系的新型评估框架HARE(Histopathology Automated Report Evaluation),包含一个标注了病理实体与关系的基准数据集、用于命名实体识别(NER)和关系抽取(RE)的模型,以及一种优先考虑临床相关性的新评估指标。该指标通过比对参考报告与生成报告中的关键病理实体及它们之间的逻辑关系,显著提升了与专家评价的一致性,优于传统指标(如ROUGE、Meteor)及放射学领域的RadGraph-XL等方法。
链接: https://arxiv.org/abs/2509.16326
作者: Yunsoo Kim,Michal W. S. Ong,Alex Shavick,Honghan Wu,Adam P. Levine
机构: University College London (伦敦大学学院); University of Glasgow (格拉斯哥大学); Human Technopole (人类技术园区); University College London Hospitals NHS Foundation Trust (伦敦大学学院医院国家健康服务体系基金会信托)
类目: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMNLP2025 Findings
Abstract:Medical domain automated text generation is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially in instances where domain-specific metrics are lacking, e.g. histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity and relation centric framework, composed of a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a novel metric, which prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To develop the HARE benchmark, we annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model to develop HARE-NER and HARE-RE which achieved the highest overall F1-score (0.915) among the tested models. The proposed HARE metric outperformed traditional metrics including ROUGE and Meteor, as well as radiology metrics such as RadGraph-XL, with the highest correlation and the best regression to expert evaluations (higher than the second best method, GREEN, a large language model based radiology report evaluator, by Pearson r = 0.168 , Spearman \rho = 0.161 , Kendall \tau = 0.123 , R^2 = 0.176 , RMSE = 0.018 ). We release HARE, datasets, and the models at this https URL to foster advancements in histopathology report generation, providing a robust framework for improving the quality of reports.
zh
[NLP-181] Overhearing LLM Agents : A Survey Taxonomy and Roadmap
【速读】: 该论文旨在解决传统生成式 AI (Generative AI) 助手在人机交互中依赖主动对话接口、频繁打断用户注意力的问题。其核心挑战在于如何让AI代理在不干扰人类活动的前提下,通过感知环境中的自然交互(如会议讨论、教学备课等)提供适时且情境相关的辅助。解决方案的关键在于提出“倾听型代理”(overhearing agents)这一新范式:这类代理持续监控环境中的非显式交互行为,在识别到可提供上下文支持的机会时才介入,从而实现低侵入性的智能协同。论文进一步基于对现有LLM驱动代理的研究和探索性人机交互(HCI)实验,构建了该范式的分类体系与最佳实践指南,并指出了未来研究的关键方向。
链接: https://arxiv.org/abs/2509.16325
作者: Andrew Zhu,Chris Callison-Burch
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 8 pages, 1 figure
Abstract:Imagine AI assistants that enhance conversations without interrupting them: quietly providing relevant information during a medical consultation, seamlessly preparing materials as teachers discuss lesson plans, or unobtrusively scheduling meetings as colleagues debate calendars. While modern conversational LLM agents directly assist human users with tasks through a chat interface, we study this alternative paradigm for interacting with LLM agents, which we call “overhearing agents.” Rather than demanding the user’s attention, overhearing agents continuously monitor ambient activity and intervene only when they can provide contextual assistance. In this paper, we present the first analysis of overhearing LLM agents as a distinct paradigm in human-AI interaction and establish a taxonomy of overhearing agent interactions and tasks grounded in a survey of works on prior LLM-powered agents and exploratory HCI studies. Based on this taxonomy, we create a list of best practices for researchers and developers building overhearing agent systems. Finally, we outline the remaining research gaps and reveal opportunities for future research in the overhearing paradigm.
zh
[NLP-182] How Large Language Models are Designed to Hallucinate
【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)在生成过程中普遍存在且系统性出现的幻觉(hallucination)问题。现有解释多归因于数据缺失、上下文限制或优化误差,但本文提出,幻觉本质上是Transformer架构的结构性结果:作为连贯性引擎,Transformer通过自注意力机制模拟语义关系,却缺乏时间性、情绪和关怀等存在论根基,导致其生成内容虽流畅但可能脱离真实世界。解决方案的关键在于区分两类幻觉——本体论幻觉(ontological hallucination),即模型在未充分揭示世界中存在者时强行输出;以及残余推理幻觉(residual reasoning hallucination),即模型仅复制人类推理痕迹而无真正理解。作者进一步提出基于海德格尔存在论范畴的分类框架与基准测试方法,并指出未来应设计“真理约束型”架构,在无法确证时选择沉默或延迟输出,从而从根本上缓解幻觉问题。
链接: https://arxiv.org/abs/2509.16297
作者: Richard Ackermann,Simeon Emanuilov
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 23 pages, 2 tables, 2 figures
Abstract:Large language models (LLMs) achieve remarkable fluency across linguistic and reasoning tasks but remain systematically prone to hallucination. Prevailing accounts attribute hallucinations to data gaps, limited context, or optimization errors. We argue instead that hallucination is a structural outcome of the transformer architecture. As coherence engines, transformers are compelled to produce fluent continuations, with self-attention simulating the relational structure of meaning but lacking the existential grounding of temporality, mood, and care that stabilizes human understanding. On this basis, we distinguish ontological hallucination, arising when continuations require disclosure of beings in world, and residual reasoning hallucination, where models mimic inference by recycling traces of human reasoning in text. We illustrate these patterns through case studies aligned with Heideggerian categories and an experiment across twelve LLMs showing how simulated “self-preservation” emerges under extended prompts. Our contribution is threefold: (1) a comparative account showing why existing explanations are insufficient; (2) a predictive taxonomy of hallucination linked to existential structures with proposed benchmarks; and (3) design directions toward “truth-constrained” architectures capable of withholding or deferring when disclosure is absent. We conclude that hallucination is not an incidental defect but a defining limit of transformer-based models, an outcome scaffolding can mask but never resolve.
zh
[NLP-183] Patterns in the Transition From Founder-Leadership to Community Governance of Open Source
【速读】: 该论文旨在解决开放数字公共基础设施(Open Digital Public Infrastructure)在社区治理中的可持续性与问责制问题,特别是厘清从创始人主导向共享治理转变的关键机制。其解决方案的核心在于构建一个可扩展的语义解析流程(semantic parsing pipeline),通过提取版本控制项目宪章中的制度角色、行动及道义线索(deontic cues),对637个GitHub仓库进行系统分析,从而识别出社区治理演化的轨迹和特征。研究发现,随着治理成熟,角色与行动逐步增多且趋于平衡,监管范围扩展至生态系统层级,并强化了项目监督职责的定义,表明社区治理并非简单改变语气,而是通过责任分层与细化实现结构化成长。
链接: https://arxiv.org/abs/2509.16295
作者: Mobina Noori,Mahasweta Chakraborti,Amy X Zhang,Seth Frey
机构: University of California Davis (加州大学戴维斯分校); University of Washington (华盛顿大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Open digital public infrastructure needs community management to ensure accountability, sustainability, and robustness. Yet open-source projects often rely on centralized decision-making, and the determinants of successful community management remain unclear. We analyze 637 GitHub repositories to trace transitions from founder-led to shared governance. Specifically, we document trajectories to community governance by extracting institutional roles, actions, and deontic cues from version-controlled project constitutions this http URL. With a semantic parsing pipeline, we cluster elements into broader role and action types. We find roles and actions grow, and regulation becomes more balanced, reflecting increases in governance scope and differentiation over time. Rather than shifting tone, communities grow by layering and refining responsibilities. As transitions to community management mature, projects increasingly regulate ecosystem-level relationships and add definition to project oversight roles. Overall, this work offers a scalable pipeline for tracking the growth and development of community governance regimes from open-source software’s familiar default of founder-ownership.
zh
[NLP-184] Language Modeling with Learned Meta-Tokens
【速读】: 该论文旨在解决当前基于Transformer的生成式AI(Generative AI)语言模型在长距离依赖建模上的局限性,尤其是在有限上下文窗口内难以有效捕捉远距离信息的问题。其解决方案的关键在于引入“元标记”(meta-tokens)——一类在预训练阶段注入的特殊标记,并结合一种专用的元注意力机制(meta-attention),使模型能够将先前上下文的信息隐式压缩并存储于元标记中。该机制通过增强位置编码的区分度,使元标记成为可训练的内容感知地标,从而在推理时指向相关上下文,实现长达原始上下文窗口2倍长度的泛化能力,且在使用YaRN扩展后仍保持高效性能。
链接: https://arxiv.org/abs/2509.16278
作者: Alok N. Shah,Khush Gupta,Keshav Ramji,Pratik Chaudhari
机构: University of Pennsylvania (宾夕法尼亚大学); IBM Research AI (IBM 研究院人工智能)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using meta-tokens, special tokens injected during pre-training, along with a dedicated meta-attention mechanism to guide LMs to use these tokens. We pre-train a language model with a modified GPT-2 architecture equipped with meta-attention in addition to causal multi-head attention, and study the impact of these tokens on a suite of synthetic tasks. We find that data-efficient language model pre-training on fewer than 100B tokens utilizing meta-tokens and our meta-attention mechanism achieves strong performance on these tasks after fine-tuning. We suggest that these gains arise due to the meta-tokens sharpening the positional encoding. This enables them to operate as trainable, content-based landmarks, implicitly compressing preceding context and “caching” it in the meta-token. At inference-time, the meta-token points to relevant context, facilitating length generalization up to 2 \times its context window, even after extension with YaRN. We provide further evidence of these behaviors by visualizing model internals to study the residual stream, and assessing the compression quality by information-theoretic analysis on the rate-distortion tradeoff. Our findings suggest that pre-training LMs with meta-tokens offers a simple, data-efficient method to enhance long-context language modeling performance, while introducing new insights into the nature of their behavior towards length generalization.
zh
[NLP-185] Gender and Political Bias in Large Language Models : A Demonstration Platform WWW
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在政治分析场景中存在系统性性能偏差的问题,尤其是在欧洲议会辩论与投票预测任务中的公平性和可解释性不足。解决方案的关键在于构建了一个名为ParlAI Vote的交互式平台,该平台整合了辩论议题、发言文本、投票结果及议员人口统计学特征(如性别、年龄、国籍和政党群体),通过统一的数据、模型与可视化分析界面,使研究者能够直观比较真实投票结果与LLM预测之间的差异,并按群体分解误差,从而揭示模型在不同 demographic 组别上的偏倚表现。此设计不仅支持复现研究、行为审计和反事实实验,还提升了对LLM在立法决策分析中优势与局限的认知。
链接: https://arxiv.org/abs/2509.16264
作者: Wenjie Lin,Hange Liu,Xutao Mao,Yingying Zhuang,Jingwei Shi,Xudong Han,Tianyu Shi,Jinrui Yang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: online demo: this https URL Video: this https URL
Abstract:We present ParlAI Vote, an interactive system for exploring European Parliament debates and votes, and for testing LLMs on vote prediction and bias analysis. This platform connects debate topics, speeches, and roll-call outcomes, and includes rich demographic data such as gender, age, country, and political group. Users can browse debates, inspect linked speeches, compare real voting outcomes with predictions from frontier LLMs, and view error breakdowns by demographic group. Visualizing the EuroParlVote benchmark and its core tasks of gender classification and vote prediction, ParlAI Vote highlights systematic performance bias in state-of-the-art LLMs. The system unifies data, models, and visual analytics in a single interface, lowering the barrier for reproducing findings, auditing behavior, and running counterfactual scenarios. It supports research, education, and public engagement with legislative decision-making, while making clear both the strengths and the limitations of current LLMs in political analysis.
zh
[NLP-186] HausaMovieReview: A Benchmark Dataset for Sentiment Analysis in Low-Resource African Language
【速读】: 该论文旨在解决低资源语言(Low-Resource Languages)在自然语言处理(Natural Language Processing, NLP)领域中因标注数据稀缺而导致模型性能受限的问题。其关键解决方案是构建了一个名为HausaMovieReview的新型基准数据集,包含5,000条豪萨语(Hausa)及英豪混用(code-switched English)的YouTube评论,并由三位独立标注者进行标注,获得较高的标注一致性(Fleiss’ Kappa = 0.85)。在此基础上,研究发现传统机器学习模型中的决策树(Decision Tree)分类器在准确率(89.72%)和F1分数(89.60%)上显著优于微调后的Transformer模型(BERT和RoBERTa),表明在低资源场景下,精心设计的特征工程可使经典模型达到先进水平,从而为后续相关研究提供了可靠基线。
链接: https://arxiv.org/abs/2509.16256
作者: Asiya Ibrahim Zanga,Salisu Mamman Abdulrahman,Abubakar Ado,Abdulkadir Abubakar Bichi,Lukman Aliyu Jibril,Abdulmajid Babangida Umar,Alhassan Adamu,Shamsuddeen Hassan Muhammad,Bashir Salisu Abubakar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Masters Thesis, a Dataset Paper
Abstract:The development of Natural Language Processing (NLP) tools for low-resource languages is critically hindered by the scarcity of annotated datasets. This paper addresses this fundamental challenge by introducing HausaMovieReview, a novel benchmark dataset comprising 5,000 YouTube comments in Hausa and code-switched English. The dataset was meticulously annotated by three independent annotators, demonstrating a robust agreement with a Fleiss’ Kappa score of 0.85 between annotators. We used this dataset to conduct a comparative analysis of classical models (Logistic Regression, Decision Tree, K-Nearest Neighbors) and fine-tuned transformer models (BERT and RoBERTa). Our results reveal a key finding: the Decision Tree classifier, with an accuracy and F1-score 89.72% and 89.60% respectively, significantly outperformed the deep learning models. Our findings also provide a robust baseline, demonstrating that effective feature engineering can enable classical models to achieve state-of-the-art performance in low-resource contexts, thereby laying a solid foundation for future research. Keywords: Hausa, Kannywood, Low-Resource Languages, NLP, Sentiment Analysis Comments: Masters Thesis, a Dataset Paper Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.16256 [cs.CL] (or arXiv:2509.16256v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.16256 Focus to learn more arXiv-issued DOI via DataCite
zh
[NLP-187] REAMS: Reasoning Enhanced Algorithm for Maths Solving
【速读】: 该论文旨在解决复杂大学水平数学问题(如麻省理工学院和哥伦比亚大学课程题目及MATH数据集中的精选任务)的自动化求解难题,这类问题长期困扰人工智能领域,传统方法难以有效应对。其解决方案的关键在于提出一种基于语言模型的零样本学习(zero-shot learning)方法,结合数学推理与程序合成(program synthesis),在不依赖大规模训练数据的前提下显著提升求解准确率,最终实现90.15%的准确率,较此前81%的基准有显著进步,为自动数学问题求解树立了新标准。
链接: https://arxiv.org/abs/2509.16241
作者: Eishkaran Singh,Tanav Singh Bajaj,Siddharth Nayak
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注:
Abstract:The challenges of solving complex university-level mathematics problems, particularly those from MIT, and Columbia University courses, and selected tasks from the MATH dataset, remain a significant obstacle in the field of artificial intelligence. Conventional methods have consistently fallen short in this domain, highlighting the need for more advanced approaches. In this paper, we introduce a language-based solution that leverages zero-shot learning and mathematical reasoning to effectively solve, explain, and generate solutions for these advanced math problems. By integrating program synthesis, our method reduces reliance on large-scale training data while significantly improving problem-solving accuracy. Our approach achieves an accuracy of 90.15%, representing a substantial improvement over the previous benchmark of 81% and setting a new standard in automated mathematical problem-solving. These findings highlight the significant potential of advanced AI methodologies to address and overcome the challenges presented by some of the most complex mathematical courses and datasets.
zh
[NLP-188] On LLM -Based Scientific Inductive Reasoning Beyond Equations
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在全新环境中从有限样本中学习潜在规律并有效应用的能力问题,这是实现归纳推理(inductive reasoning)的核心挑战。现有研究多基于规则是否可由显式数学方程表达进行分类,但许多“超越方程”类研究缺乏具体场景的支撑。论文的关键解决方案是受人类科学发现过程启发,提出“基于大语言模型的科学归纳推理”任务,并构建首个面向科学场景的基准测试数据集 SIRBench-V1,用于系统评估 LLM 在真实科学情境下的归纳推理能力,从而推动该领域向更贴近实际应用的方向发展。
链接: https://arxiv.org/abs/2509.16226
作者: Brian S. Lin,Jiaxin Yuan,Zihan Zhou,Shouli Wang,Shuo Wang,Cunliang Kong,Qi Shi,Yuxuan Li,Liner Yang,Zhiyuan Liu,Maosong Sun
机构: Tsinghua University (清华大学); Jiangsu Normal University (江苏师范大学); Beijing Language and Culture University (北京语言大学); Xiamen University (厦门大学); Harbin Institute of Technology (哈尔滨工业大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages
Abstract:As large language models (LLMs) increasingly exhibit human-like capabilities, a fundamental question emerges: How can we enable LLMs to learn the underlying patterns from limited examples in entirely novel environments and apply them effectively? This question is central to the ability of LLMs in inductive reasoning. Existing research on LLM-based inductive reasoning can be broadly categorized based on whether the underlying rules are expressible via explicit mathematical equations. However, many recent studies in the beyond-equations category have emphasized rule design without grounding them in specific scenarios. Inspired by the parallels between inductive reasoning and human scientific discovery, we propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations and introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings. Our experimental results show that current LLMs still struggle with this task, underscoring its difficulty and the need for further advancement in this area.
zh
[NLP-189] Predicting First Year Dropout from Pre Enrolment Motivation Statements Using Text Mining
【速读】: 该论文试图解决高等教育中学生辍学预测难题,尤其是如何在入学前更准确地识别可能辍学的学生。其关键解决方案在于利用文本挖掘技术分析学生提交的动机陈述(motivation statements),通过提取其中的信息来增强传统预测变量(如高中平均成绩、个人特征等)的预测能力。研究发现,尽管将文本特征与传统变量结合并未显著提升预测效果,但仅使用文本分析方法即可达到与传统学生特征相当的预测精度,表明动机陈述中的语义信息具有独立且重要的预测价值。
链接: https://arxiv.org/abs/2509.16224
作者: K.F.B. Soppe,A. Bagheri,S. Nadi,I.G. Klugkist,T. Wubbels,L.D.N.V. Wijngaards-De Meij
机构: 未知
类目: Computers and Society (cs.CY); Computation and Language (cs.CL); Machine Learning (cs.LG); Applications (stat.AP)
备注:
Abstract:Preventing student dropout is a major challenge in higher education and it is difficult to predict prior to enrolment which students are likely to drop out and which students are likely to succeed. High School GPA is a strong predictor of dropout, but much variance in dropout remains to be explained. This study focused on predicting university dropout by using text mining techniques with the aim of exhuming information contained in motivation statements written by students. By combining text data with classic predictors of dropout in the form of student characteristics, we attempt to enhance the available set of predictive student characteristics. Our dataset consisted of 7,060 motivation statements of students enrolling in a non-selective bachelor at a Dutch university in 2014 and 2015. Support Vector Machines were trained on 75 percent of the data and several models were estimated on the test data. We used various combinations of student characteristics and text, such as TFiDF, topic modelling, LIWC dictionary. Results showed that, although the combination of text and student characteristics did not improve the prediction of dropout, text analysis alone predicted dropout similarly well as a set of student characteristics. Suggestions for future research are provided.
zh
[NLP-190] How Can Quantum Deep Learning Improve Large Language Models ?
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在适应新任务时面临的计算与内存开销过高的问题,即如何在保持模型性能的同时实现高效参数微调。传统全量微调虽效果优异但资源消耗巨大,而现有的参数高效微调(Parameter-efficient Fine-tuning, PEFT)方法如低秩适应(LoRA)、前缀微调(Prefix tuning)和稀疏低秩适应(SoRA)虽降低了训练参数规模,但在可扩展性、稳定性及跨任务泛化能力方面仍存在局限。论文提出的关键解决方案是引入量子启发式微调框架——量子振幅嵌入适配(Quantum-Amplitude Embedded Adaptation, QAA),该方法通过量子幅度编码(quantum-amplitude encoding)和参数化量子电路(Parametrized Quantum Circuits, PQCs)实现高表达力的模型更新,同时保持极低的额外计算开销,从而为LLM的高效适配提供了新的技术路径。
链接: https://arxiv.org/abs/2509.16244
作者: Emily Jimin Roh,Hyojun Ahn,Samuel Yen-Chi Chen,Soohyun Park,Joongheon Kim
机构: 未知
类目: Quantum Physics (quant-ph); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:The rapid progress of large language models (LLMs) has transformed natural language processing, yet the challenge of efficient adaptation remains unresolved. Full fine-tuning achieves strong performance but imposes prohibitive computational and memory costs. Parameter-efficient fine-tuning (PEFT) strategies, such as low-rank adaptation (LoRA), Prefix tuning, and sparse low-rank adaptation (SoRA), address this issue by reducing trainable parameters while maintaining competitive accuracy. However, these methods often encounter limitations in scalability, stability, and generalization across diverse tasks. Recent advances in quantum deep learning introduce novel opportunities through quantum-inspired encoding and parameterized quantum circuits (PQCs). In particular, the quantum-amplitude embedded adaptation (QAA) framework demonstrates expressive model updates with minimal overhead. This paper presents a systematic survey and comparative analysis of conventional PEFT methods and QAA. The analysis demonstrates trade-offs in convergence, efficiency, and representational capacity, while providing insight into the potential of quantum approaches for future LLM adaptation.
zh
计算机视觉
[CV-0] Preconditioned Deformation Grids
【速读】:该论文旨在解决从无结构点云序列中进行动态表面重建时存在的精度不足、过度平滑及对未见物体和运动泛化能力差的问题。现有方法通常依赖多个正则项或大量训练数据,导致性能妥协。其解决方案的关键在于提出预条件变形网格(Preconditioned Deformation Grids),通过多分辨率体素网格捕捉不同空间尺度上的整体运动,实现灵活的变形表示,并结合基于网格的Sobolev预条件梯度优化,仅使用输入点云与演化模板网格之间的Chamfer损失即可获得高精度变形;同时引入弱等距损失约束网格边以保证时间一致性,而不损害变形保真度。
链接: https://arxiv.org/abs/2509.18097
作者: Julian Kaltheuner,Alexander Oebel,Hannah Droege,Patrick Stotko,Reinhard Klein
机构: University of Bonn (波恩大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: GitHub: this https URL
Abstract:Dynamic surface reconstruction of objects from point cloud sequences is a challenging field in computer graphics. Existing approaches either require multiple regularization terms or extensive training data which, however, lead to compromises in reconstruction accuracy as well as over-smoothing or poor generalization to unseen objects and motions. To address these lim- itations, we introduce Preconditioned Deformation Grids, a novel technique for estimating coherent deformation fields directly from unstructured point cloud sequences without requiring or forming explicit correspondences. Key to our approach is the use of multi-resolution voxel grids that capture the overall motion at varying spatial scales, enabling a more flexible deformation representation. In conjunction with incorporating grid-based Sobolev preconditioning into gradient-based optimization, we show that applying a Chamfer loss between the input point clouds as well as to an evolving template mesh is sufficient to obtain accurate deformations. To ensure temporal consistency along the object surface, we include a weak isometry loss on mesh edges which complements the main objective without constraining deformation fidelity. Extensive evaluations demonstrate that our method achieves superior results, particularly for long sequences, compared to state-of-the-art techniques.
zh
[CV-1] Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers NEURIPS2025
【速读】:该论文旨在解决多模态扩散变换器(Multi-Modal Diffusion Transformer, MM-DiT)中文本到图像生成过程中,注意力机制如何具体传播语义信息、以及其在图像生成中的作用机制尚不清晰的问题。解决方案的关键在于提出Seg4Diff框架,系统性地分析MM-DiT的注意力结构,并识别出一个“语义锚定专家层”(semantic grounding expert layer),该层能稳定地将文本token与空间上一致的图像区域对齐,从而自然生成高质量语义分割掩码;进一步通过轻量级微调策略结合掩码标注图像数据,增强该层的语义分组能力,显著提升分割性能和图像生成保真度。研究揭示了语义分组是扩散Transformer中的涌现特性,且可通过选择性放大来协同优化视觉感知与生成任务。
链接: https://arxiv.org/abs/2509.18096
作者: Chaehyun Kim,Heeseong Shin,Eunbeen Hong,Heeji Yoon,Anurag Arnab,Paul Hongsuck Seo,Sunghwan Hong,Seungryong Kim
机构: KAIST AI (KAIST人工智能); Korea University (韩国大学); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025. Project page: this https URL
Abstract:Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.
zh
[CV-2] UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning NEURIPS2025
【速读】:该论文旨在解决当前大型多模态模型(Large Multi-modal Models, LMMs)在细粒度像素级理解能力上的不足问题,即模型难以实现视觉信号与语言语义之间的像素级对齐,且现有方法多局限于独立执行指代表达(referring)或分割(segmentation)任务,无法将这些细粒度感知能力整合进视觉推理流程。解决方案的关键在于提出UniPixel,一个能够灵活处理视觉提示并生成掩码(mask)驱动响应的统一模型;其核心创新在于将像素级感知无缝集成到通用视觉理解能力中,通过在推理过程中动态生成中间掩码作为指针,并基于此进行后续条件化推理,从而实现细粒度像素级推理能力。
链接: https://arxiv.org/abs/2509.18094
作者: Ye Liu,Zongyang Ma,Junfu Pu,Zhongang Qi,Yang Wu,Ying Shan,Chang Wen Chen
机构: The Hong Kong Polytechnic University (香港理工大学); ARC Lab, Tencent PCG (腾讯PCG ARC实验室); Chinese Academy of Sciences (中国科学院); vivo Mobile Communication Co. (维沃移动通信有限公司); Tencent AI Lab (腾讯AI实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025 Camera Ready. Project Page: this https URL
Abstract:Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with particular focuses on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
zh
[CV-3] ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation SIGGRAPH
【速读】:该论文旨在解决个性化文本到图像生成中对人类图像的细粒度属性控制难题,尤其是如何在保持身份一致性的同时实现发型、服饰等视觉属性的解耦与组合控制。现有方法虽注重参考图像的身份保留,但缺乏模块化设计且无法有效分离特定视觉属性。解决方案的关键在于提出一种属性特异性图像提示(attribute-specific image prompting)的新范式:通过将不同属性的参考图像编码为独立的属性标记(attribute-specific tokens),并注入预训练的文本到图像扩散模型中,从而实现多视觉因素的可组合、解耦控制,即使在单张图像包含多人时亦能保持自然合成与鲁棒解耦。为此,作者还构建了跨参考训练数据集并设计多属性交叉参考训练策略,以增强模型在属性输入错位情况下的生成忠实度,同时满足身份和文本条件约束。
链接: https://arxiv.org/abs/2509.18092
作者: Guocheng Gordon Qian,Daniil Ostashev,Egor Nemchinov,Avihay Assouline,Sergey Tulyakov,Kuan-Chieh Jackson Wang,Kfir Aberman
机构: Snap Inc.(Snap Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to SIGGRAPH Asia 2025, webpage: this https URL
Abstract:Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: this https URL.
zh
[CV-4] GeoSVR: Taming Sparse Voxels for Geometrically Accurate Surface Reconstruction NEURIPS2025
【速读】:该论文旨在解决当前基于高斯溅射(Gaussian Splatting)的辐射场表面重建方法因表示瓶颈导致的几何精度不足、细节丢失及覆盖不完整等问题。其核心解决方案是提出GeoSVR,一种显式体素(explicit voxel-based)框架,通过引入稀疏体素(sparse voxels)实现更准确、完整且细节丰富的表面重建。关键创新在于两个机制:一是体素不确定性深度约束(Voxel-Uncertainty Depth Constraint),利用单目深度线索并结合体素级不确定性建模,在保证场景收敛的同时避免质量退化;二是稀疏体素表面正则化(Sparse Voxel Surface Regularization),提升细小体素的几何一致性,促进锐利且精确的表面形成。该方法在多种复杂场景下均展现出优于现有技术的几何准确性、细节保留能力和重建完整性。
链接: https://arxiv.org/abs/2509.18090
作者: Jiahe Li,Jiawei Zhang,Youmin Zhang,Xiao Bai,Jin Zheng,Xiaohan Yu,Lin Gu
机构: Beihang University (北京航空航天大学); Rawmantic AI; State Key Laboratory of Virtual Reality Technology and Systems (虚拟现实技术与系统国家重点实验室); Macquarie University (麦考瑞大学); RIKEN AIP (理化学研究所先进人工智能中心); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025 (Spotlight). Project page: this https URL
Abstract:Reconstructing accurate surfaces with radiance fields has achieved remarkable progress in recent years. However, prevailing approaches, primarily based on Gaussian Splatting, are increasingly constrained by representational bottlenecks. In this paper, we introduce GeoSVR, an explicit voxel-based framework that explores and extends the under-investigated potential of sparse voxels for achieving accurate, detailed, and complete surface reconstruction. As strengths, sparse voxels support preserving the coverage completeness and geometric clarity, while corresponding challenges also arise from absent scene constraints and locality in surface refinement. To ensure correct scene convergence, we first propose a Voxel-Uncertainty Depth Constraint that maximizes the effect of monocular depth cues while presenting a voxel-oriented uncertainty to avoid quality degradation, enabling effective and robust scene constraints yet preserving highly accurate geometries. Subsequently, Sparse Voxel Surface Regularization is designed to enhance geometric consistency for tiny voxels and facilitate the voxel-based formation of sharp and accurate surfaces. Extensive experiments demonstrate our superior performance compared to existing methods across diverse challenging scenarios, excelling in geometric accuracy, detail preservation, and reconstruction completeness while maintaining high efficiency. Code is available at this https URL.
zh
[CV-5] GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer EMNLP
【速读】:该论文旨在解决 Bengali 手写文本识别(Handwritten Text Recognition, HTR)系统严重不足的问题,其核心挑战在于 Bengali 语言特有的复杂字符结构(如连写字符 conjuncts、变音符号 diacritics)以及手写风格高度多样化,加之高质量标注数据集稀缺。解决方案的关键在于提出 GraDeT-HTR,一个基于 Grapheme-aware Decoder-only Transformer 架构的资源高效模型;通过引入基于音位(grapheme)的分词器替代传统子词分词器,显著提升了识别准确率,并在大规模合成数据上预训练、真实标注样本上微调,最终在多个基准数据集上达到当前最优性能。
链接: https://arxiv.org/abs/2509.18081
作者: Md. Mahmudul Hasan,Ahmed Nesar Tahsin Choudhury,Mahmudul Hasan,Md. Mosaddek Khan
机构: University of Dhaka (达卡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages. Accepted at the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP) System Demonstrations. Equal Contribution: Md. Mahmudul Hasan and Ahmed Nesar Tahsin Choudhury
Abstract:Despite Bengali being the sixth most spoken language in the world, handwritten text recognition (HTR) systems for Bengali remain severely underdeveloped. The complexity of Bengali script–featuring conjuncts, diacritics, and highly variable handwriting styles–combined with a scarcity of annotated datasets makes this task particularly challenging. We present GraDeT-HTR, a resource-efficient Bengali handwritten text recognition system based on a Grapheme-aware Decoder-only Transformer architecture. To address the unique challenges of Bengali script, we augment the performance of a decoder-only transformer by integrating a grapheme-based tokenizer and demonstrate that it significantly improves recognition accuracy compared to conventional subword tokenizers. Our model is pretrained on large-scale synthetic data and fine-tuned on real human-annotated samples, achieving state-of-the-art performance on multiple benchmark datasets.
zh
[CV-6] mpSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLM s NEURIPS2025
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视频时序定位任务中,因传统强化学习方法(如Group Relative Policy Optimization, GRPO)依赖在线策略采样而导致的效率低下与性能受限问题。具体而言,在具有庞大时序搜索空间的任务中,GRPO难以有效识别精确的时间片段,导致解空间稀疏且奖励信号错位。其解决方案的关键在于提出TempSamp-R1框架:首先利用真实标注(ground-truth annotations)作为离线策略监督信号,提供精准的时间引导以弥补在线策略解的稀疏性和偏差;其次引入非线性软优势计算机制,通过不对称变换动态重塑奖励反馈,从而稳定训练过程并降低基于奖励更新的方差;最后采用混合思维链(Chain-of-Thought, CoT)训练范式,使单一模型可同时支持带推理和不带推理的推理模式,提升对不同复杂度查询的适应能力。
链接: https://arxiv.org/abs/2509.18056
作者: Yunheng Li,Jing Cheng,Shaoyong Jia,Hangyi Kuang,Shaohui Jiao,Qibin Hou,Ming-Ming Cheng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025
Abstract:This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: this https URL
zh
[CV-7] NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning
【速读】:该论文旨在解决长视频问答(Long-Form Video Question Answering, LVQA)中因模型缺乏对时间因果关系和多步逻辑推理能力而导致的错误答案问题。传统视觉语言模型(Vision-Language Models, VLMs)在处理长视频时,常因帧采样策略不合理而忽略关键事件过渡或细粒度时序线索,且现有方法多依赖启发式规则,无法保证所选上下文满足问题所需的组合性或因果逻辑。解决方案的关键在于提出一种无需训练、即插即用的神经符号(Neuro-Symbolic)流程NeuS-QA:它将自然语言问题转化为形式化的时间逻辑表达式,基于帧级语义命题构建视频自动机,并通过模型检测严格识别符合逻辑要求的视频片段,仅向VLM提交这些经逻辑验证的片段,从而提升可解释性、减少幻觉并实现无需模型微调的组合推理能力。
链接: https://arxiv.org/abs/2509.18041
作者: Sahil Shah,S P Sharan,Harsh Goel,Minkyu Choi,Mustafa Munir,Manvik Pasula,Radu Marculescu,Sandeep Chinchali
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校); Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM with the question, incur significant token overhead, forcing severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships. As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments satisfying the question’s logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on LongVideoBench and CinePile show NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.
zh
[CV-8] Detection of Misreporting Attacks on Software-Defined Immersive Environments
【速读】:该论文旨在解决软件定义网络(Software-Defined Networking, SDN)中因交换机恶意 misreporting(误报)导致的负载不均衡问题,该问题会引发沉浸式应用(immersive applications)质量严重下降。解决方案的关键在于提出一种混合机器学习(machine learning, ML)驱动的网络异常检测框架,通过捕捉交换机上报负载的时间不一致性来识别此类隐蔽的 misreporting 行为;该框架结合无监督异常评分与有监督分类方法,从而可靠地区分恶意行为,实验表明其在检测 misreporting 行为上具有高召回率,适用于 SDN 环境中的早期、可靠异常检测。
链接: https://arxiv.org/abs/2509.18040
作者: Sourya Saha,Md Nurul Absur,Shima Yousefi,Saptarshi Debroy
机构: City University of New York (纽约市立大学)
类目: Networking and Internet Architecture (cs.NI); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 Pages, 7 Images, will appear in CNSM 2025
Abstract:The ability to centrally control network infrastructure using a programmable middleware has made Software-Defined Networking (SDN) ideal for emerging applications, such as immersive environments. However, such flexibility introduces new vulnerabilities, such as switch misreporting led load imbalance, which in turn make such immersive environment vulnerable to severe quality degradation. In this paper, we present a hybrid machine learning (ML)-based network anomaly detection framework that identifies such stealthy misreporting by capturing temporal inconsistencies in switch-reported loads, and thereby counter potentially catastrophic quality degradation of hosted immersive application. The detection system combines unsupervised anomaly scoring with supervised classification to robustly distinguish malicious behavior. Data collected from a realistic testbed deployment under both benign and adversarial conditions is used to train and evaluate the model. Experimental results show that the framework achieves high recall in detecting misreporting behavior, making it effective for early and reliable detection in SDN environments.
zh
[CV-9] Beyond Diagnosis: Evaluating Multimodal LLM s for Pathology Localization in Chest Radiographs
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在医学影像中病灶定位能力不足的问题,尤其是其在胸部X光片上对病理区域的空间识别精度有限。解决方案的关键在于设计了一种基于空间网格提示(spatial grid prompting)的评估框架,通过引导模型输出坐标形式的预测结果,系统性地量化了三种模型——通用型MLLMs(GPT-4和GPT-5)与领域专用模型MedGemma——在CheXlocalize数据集上对九类常见胸部病变的定位准确性。该方法不仅揭示了不同模型在空间理解上的差异,也为未来将通用模型与任务特定工具结合以提升临床可靠性提供了实证依据。
链接: https://arxiv.org/abs/2509.18015
作者: Advait Gosai,Arun Kavishwar,Stephanie L. McNamara,Soujanya Samineni,Renato Umeton,Alexander Chowdhury,William Lotter
机构: University of California (加州大学); Dana-Farber Cancer Institute (达纳-法伯癌症研究所); Massachusetts General Hospital (麻省总医院); St. Jude Children’s Research Hospital (圣裘德儿童研究医院); Brigham and Women’s Hospital (布里格姆妇女医院); Harvard Medical School (哈佛医学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent work has shown promising performance of frontier large language models (LLMs) and their multimodal counterparts in medical quizzes and diagnostic tasks, highlighting their potential for broad clinical utility given their accessible, general-purpose nature. However, beyond diagnosis, a fundamental aspect of medical image interpretation is the ability to localize pathological findings. Evaluating localization not only has clinical and educational relevance but also provides insight into a model’s spatial understanding of anatomy and disease. Here, we systematically assess two general-purpose MLLMs (GPT-4 and GPT-5) and a domain-specific model (MedGemma) in their ability to localize pathologies on chest radiographs, using a prompting pipeline that overlays a spatial grid and elicits coordinate-based predictions. Averaged across nine pathologies in the CheXlocalize dataset, GPT-5 exhibited a localization accuracy of 49.7%, followed by GPT-4 (39.1%) and MedGemma (17.7%), all lower than a task-specific CNN baseline (59.9%) and a radiologist benchmark (80.1%). Despite modest performance, error analysis revealed that GPT-5’s predictions were largely in anatomically plausible regions, just not always precisely localized. GPT-4 performed well on pathologies with fixed anatomical locations, but struggled with spatially variable findings and exhibited anatomically implausible predictions more frequently. MedGemma demonstrated the lowest performance on all pathologies, showing limited capacity to generalize to this novel task. Our findings highlight both the promise and limitations of current MLLMs in medical imaging and underscore the importance of integrating them with task-specific tools for reliable use.
zh
[CV-10] StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models NEURIPS2025
【速读】:该论文旨在解决生成式 AI(Generative AI)在扩散模型(Diffusion Models)中因内容真实性提升而引发的版权保护与篡改定位难题。现有方法多依赖后处理(post hoc processing),导致应用不便且取证可靠性下降。其解决方案的关键在于提出 StableGuard 框架,通过端到端设计将二值水印(binary watermark)嵌入扩散生成过程:首先构建 Multiplexing Watermark VAE(MPW-VAE),基于预训练变分自编码器(Variational Autoencoder, VAE)引入轻量级潜空间残差适配器,实现水印图像与无水印图像的成对生成;进而利用随机掩码融合生成多样化训练数据,训练一个对篡改不敏感的取证网络;再结合 Mixture-of-Experts Guided Forensic Network(MoE-GFN),动态整合全局水印模式、局部篡改痕迹与频域特征,实现高精度水印验证与篡改区域检测。MPW-VAE 与 MoE-GFN 在自监督框架下联合优化,形成水印嵌入与取证准确性的协同增强机制。
链接: https://arxiv.org/abs/2509.17993
作者: Haoxin Yang,Bangzhen Liu,Xuemiao Xu,Cheng Xu,Yuyang Yu,Zikai Huang,Yi Wang,Shengfeng He
机构: South China University of Technology (华南理工大学); The Hong Kong Polytechnic University (香港理工大学); Dongguan University of Technology (东莞理工学院); Singapore Management University (新加坡管理大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025
Abstract:The advancement of diffusion models has enhanced the realism of AI-generated content but also raised concerns about misuse, necessitating robust copyright protection and tampering localization. Although recent methods have made progress toward unified solutions, their reliance on post hoc processing introduces considerable application inconvenience and compromises forensic reliability. We propose StableGuard, a novel framework that seamlessly integrates a binary watermark into the diffusion generation process, ensuring copyright protection and tampering localization in Latent Diffusion Models through an end-to-end design. We develop a Multiplexing Watermark VAE (MPW-VAE) by equipping a pretrained Variational Autoencoder (VAE) with a lightweight latent residual-based adapter, enabling the generation of paired watermarked and watermark-free images. These pairs, fused via random masks, create a diverse dataset for training a tampering-agnostic forensic network. To further enhance forensic synergy, we introduce a Mixture-of-Experts Guided Forensic Network (MoE-GFN) that dynamically integrates holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered region detection. The MPW-VAE and MoE-GFN are jointly optimized in a self-supervised, end-to-end manner, fostering a reciprocal training between watermark embedding and forensic accuracy. Extensive experiments demonstrate that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
zh
[CV-11] Intra-Cluster Mixup: An Effective Data Augmentation Technique for Complementary-Label Learning
【速读】:该论文旨在解决互补标签学习(Complementary-Label Learning, CLL)中数据增强技术应用效果不佳的问题,尤其是传统 Mixup 方法在 CLL 场景下因引入噪声而损害模型性能的局限性。其解决方案的关键在于提出一种改进的数据增强方法——簇内混合(Intra-Cluster Mixup, ICM),该方法仅在局部邻近样本间进行数据合成,从而减少由 Mixup 引入的标签噪声,同时增强相邻样本间的互补标签一致性,显著提升 CLL 模型在多种数据集上的表现,包括在 MNIST 和 CIFAR 数据集上分别实现 30% 和 10% 的准确率提升。
链接: https://arxiv.org/abs/2509.17971
作者: Tan-Ha Mai,Hsuan-Tien Lin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 22 pages, 10 figures
Abstract:In this paper, we investigate the challenges of complementary-label learning (CLL), a specialized form of weakly-supervised learning (WSL) where models are trained with labels indicating classes to which instances do not belong, rather than standard ordinary labels. This alternative supervision is appealing because collecting complementary labels is generally cheaper and less labor-intensive. Although most existing research in CLL emphasizes the development of novel loss functions, the potential of data augmentation in this domain remains largely underexplored. In this work, we uncover that the widely-used Mixup data augmentation technique is ineffective when directly applied to CLL. Through in-depth analysis, we identify that the complementary-label noise generated by Mixup negatively impacts the performance of CLL models. We then propose an improved technique called Intra-Cluster Mixup (ICM), which only synthesizes augmented data from nearby examples, to mitigate the noise effect. ICM carries the benefits of encouraging complementary label sharing of nearby examples, and leads to substantial performance improvements across synthetic and real-world labeled datasets. In particular, our wide spectrum of experimental results on both balanced and imbalanced CLL settings justifies the potential of ICM in allying with state-of-the-art CLL algorithms, achieving significant accuracy increases of 30% and 10% on MNIST and CIFAR datasets, respectively.
zh
[CV-12] Joint Optimization of Memory Frequency Computing Frequency Transmission Power and Task Offloading for Energy-efficient DNN Inference
【速读】:该论文旨在解决资源受限设备上深度神经网络(Deep Neural Networks, DNNs)推理时面临的高延迟和高能耗问题。现有研究多依赖动态电压频率调节(Dynamic Voltage and Frequency Scaling, DVFS)技术仅调整计算频率以平衡延迟与能耗,但忽略了内存频率的调节作用。论文的关键解决方案是提出联合调整内存频率与计算频率的方法,通过模型驱动与数据驱动相结合的方式分析二者协同对推理时间和能耗的影响,并基于不同DNN模型的拟合参数进行初步验证,最终在本地推理和协作推理场景下的仿真结果表明,同步调节内存与计算频率能显著降低设备能耗。
链接: https://arxiv.org/abs/2509.17970
作者: Yunchu Han,Zhaojun Nan,Sheng Zhou,Zhisheng Niu
机构: Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural networks (DNNs) have been widely applied in diverse applications, but the problems of high latency and energy overhead are inevitable on resource-constrained devices. To address this challenge, most researchers focus on the dynamic voltage and frequency scaling (DVFS) technique to balance the latency and energy consumption by changing the computing frequency of processors. However, the adjustment of memory frequency is usually ignored and not fully utilized to achieve efficient DNN inference, which also plays a significant role in the inference time and energy consumption. In this paper, we first investigate the impact of joint memory frequency and computing frequency scaling on the inference time and energy consumption with a model-based and data-driven method. Then by combining with the fitting parameters of different DNN models, we give a preliminary analysis for the proposed model to see the effects of adjusting memory frequency and computing frequency simultaneously. Finally, simulation results in local inference and cooperative inference cases further validate the effectiveness of jointly scaling the memory frequency and computing frequency to reduce the energy consumption of devices.
zh
[CV-13] Visual Detector Compression via Location-Aware Discriminant Analysis
【速读】:该论文旨在解决深度视觉检测模型在资源受限的边缘设备上部署时面临的复杂度过高问题,特别是现有剪枝方法多聚焦于分类模型、忽视目标定位信息,且依赖预训练模型导致有用与无用组件混杂难以分离的问题。其解决方案的关键在于提出一种基于检测判别因子(detection-discriminants)的主动压缩方法,通过交替执行两个步骤:一是最大化并压缩与检测相关的判别信息,将其对齐到检测头前的子集神经元/滤波器;二是追踪各层中检测相关判别能力并剔除重要性较低的特征。该方法充分利用了目标位置信息,在不损害检测性能的前提下实现高效压缩,实验表明压缩后的模型甚至可超越原始基线模型,同时显著降低计算复杂度。
链接: https://arxiv.org/abs/2509.17968
作者: Qizhen Lan,Jung Im Choi,Qing Tian
机构: University of Alabama at Birmingham (阿拉巴马大学伯明翰分校); Bowling Green State University (布赖恩特格林州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep neural networks are powerful, yet their high complexity greatly limits their potential to be deployed on billions of resource-constrained edge devices. Pruning is a crucial network compression technique, yet most existing methods focus on classification models, with limited attention to detection. Even among those addressing detection, there is a lack of utilization of essential localization information. Also, many pruning methods passively rely on pre-trained models, in which useful and useless components are intertwined, making it difficult to remove the latter without harming the former at the neuron/filter level. To address the above issues, in this paper, we propose a proactive detection-discriminants-based network compression approach for deep visual detectors, which alternates between two steps: (1) maximizing and compressing detection-related discriminants and aligning them with a subset of neurons/filters immediately before the detection head, and (2) tracing the detection-related discriminating power across the layers and discarding features of lower importance. Object location information is exploited in both steps. Extensive experiments, employing four advanced detection models and four state-of-the-art competing methods on the KITTI and COCO datasets, highlight the superiority of our approach. Remarkably, our compressed models can even beat the original base models with a substantial reduction in complexity.
zh
[CV-14] Breaking the Discretization Barrier of Continuous Physics Simulation Learning
【速读】:该论文旨在解决从部分观测中建模复杂时变物理动力学的问题,尤其针对观测在空间上稀疏且分布无序的情形,传统数据驱动方法受限于固定的时空离散化策略,难以有效捕捉高度非线性的特征。其解决方案的关键在于提出一种纯数据驱动的连续物理模拟方法CoPS:首先利用乘法滤波网络融合并编码空间信息与观测数据;随后设计几何网格并通过消息传递机制将特征映射至定制化网格;进而通过多尺度图微分方程(graph ODEs)建模连续时间动力学,并引入基于马尔可夫的神经自校正模块以辅助和约束连续外推过程,从而突破离散化限制,实现高精度的空间-时间连续建模。
链接: https://arxiv.org/abs/2509.17955
作者: Fan Xu,Hao Wu,Nan Wang,Lilan Peng,Kun Wang,Wei Gong,Xibin Zhao
机构: University of Science and Technology of China (中国科学技术大学); Tsinghua University (清华大学); Beijing Jiaotong University (北京交通大学); Southwest Jiaotong University (西南交通大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these, we propose CoPS, a purely data-driven methods, to effectively model continuous physics simulation from partial observations. Specifically, we employ multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use message-passing mechanism to map features from original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios.
zh
[CV-15] Drag OSM: Extract Building Roofs and Footprints from Aerial Images by Aligning Historical Labels
【速读】:该论文旨在解决在非正射(off-nadir)遥感图像中,由于屋顶与轮廓(footprint)显著位移及立面像素与屋顶边界融合导致的多边形屋顶和建筑轮廓提取困难问题。现有方法依赖于分割模型,但难以处理此类几何失真。此外,尽管开放矢量地图(如OpenStreetMap)提供了历史标注,但其位置偏差大且仅提供单一类别标签(屋顶或轮廓),无法准确描述建筑结构。解决方案的关键在于提出“对齐标记”(alignment token)概念,通过编码校正向量来引导标签修正,并设计DragOSM模型,将标签对齐建模为交互式去噪过程,利用高斯分布模拟位置偏差,在训练阶段通过随机高斯扰动学习纠错机制,在推理阶段迭代优化输入标签位置,从而实现从失准的历史标注中恢复精确的屋顶与轮廓信息。
链接: https://arxiv.org/abs/2509.17951
作者: Kai Li,Xingxing Weng,Yupeng Deng,Yu Meng,Chao Pang,Gui-Song Xia,Xiangyu Zhao
机构: University of Chinese Academy of Sciences (中国科学院大学); City University of Hong Kong (香港城市大学); Wuhan University (武汉大学); Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 17 Pages
Abstract:Extracting polygonal roofs and footprints from remote sensing images is critical for large-scale urban analysis. Most existing methods rely on segmentation-based models that assume clear semantic boundaries of roofs, but these approaches struggle in off- nadir images, where the roof and footprint are significantly displaced, and facade pixels are fused with the roof boundary. With the increasing availability of open vector map annotations, e.g., OpenStreetMap, utilizing historical labels for off-nadir image annotation has become viable because remote sensing images are georeferenced once captured. However, these historical labels commonly suffer from significant positional discrepancies with new images and only have one annotation (roof or footprint), which fails to describe the correct structures of a building. To address these discrepancies, we first introduce a concept of an alignment token, which encodes the correction vector to guide the label correction. Based on this concept, we then propose Drag OpenStreetMap Labels (DragOSM), a novel model designed to align dislocated historical labels with roofs and footprints. Specifically, DragOSM formulates the label alignment as an interactive denoising process, modeling the positional discrepancy as a Gaussian distribution. During training, it learns to correct these errors by simulating misalignment with random Gaussian perturbations; during inference, it iteratively refines the positions of input labels. To validate our method, we further present a new dataset, Repairing Buildings in OSM (ReBO), comprising 179,265 buildings with both OpenStreetMap and manually corrected annotations across 5,473 images from 41 cities. Experimental results on ReBO demonstrate the effectiveness of DragOSM. Code, dataset, and trained models are publicly available at this https URL.
zh
[CV-16] Can multimodal representation learning by alignment preserve modality-specific information? ECML ACL KDD2025
【速读】:该论文旨在解决多模态遥感数据融合中因对齐策略导致的任务相关非共享信息丢失的问题。其解决方案的关键在于通过理论分析与数值实验揭示:在简化假设下,现有的基于空间对齐的对比学习方法可能引发信息损失;进而提出应设计更精细的对齐机制,以保留跨模态间不共享但对特定任务至关重要的特征,从而推动面向多模态卫星数据的对比学习新发展。
链接: https://arxiv.org/abs/2509.17943
作者: Romain Thoreau,Jessie Levillain,Dawa Derksen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted as a workshop paper at MACLEAN - ECML/PKDD 2025
Abstract:Combining multimodal data is a key issue in a wide range of machine learning tasks, including many remote sensing problems. In Earth observation, early multimodal data fusion methods were based on specific neural network architectures and supervised learning. Ever since, the scarcity of labeled data has motivated self-supervised learning techniques. State-of-the-art multimodal representation learning techniques leverage the spatial alignment between satellite data from different modalities acquired over the same geographic area in order to foster a semantic alignment in the latent space. In this paper, we investigate how this methods can preserve task-relevant information that is not shared across modalities. First, we show, under simplifying assumptions, when alignment strategies fundamentally lead to an information loss. Then, we support our theoretical insight through numerical experiments in more realistic settings. With those theoretical and empirical evidences, we hope to support new developments in contrastive learning for the combination of multimodal satellite data. Our code and data is publicly available at this https URL.
zh
[CV-17] ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion
【速读】:该论文旨在解决机器人在动态环境中遵循复杂指令时面临的组合性挑战,即指令中包含多个规范(specification)时,其组合数量随技能集扩展呈指数增长,导致传统方法难以泛化到未见过的规范组合。解决方案的关键在于提出ComposableNav框架,其核心思想是将指令分解为独立的运动基元(motion primitive),并利用扩散模型分别学习这些基元;在部署阶段通过并行组合方式生成满足新颖规范组合的轨迹,从而实现对未见指令的有效执行。此外,为避免对每个基元单独收集演示数据,作者设计了两阶段训练策略:先进行监督预训练获得基础动态导航扩散模型,再通过强化学习微调将其塑造成多种运动基元,显著提升了模型的可扩展性和泛化能力。
链接: https://arxiv.org/abs/2509.17941
作者: Zichao Hu,Chen Tang,Michael J. Munje,Yifeng Zhu,Alex Liu,Shuijing Liu,Garrett Warnell,Peter Stone,Joydeep Biswas
机构: The University of Texas at Austin (德克萨斯大学奥斯汀分校); Army Research Laboratory (陆军研究实验室); Sony AI (索尼人工智能)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Conference on Robot Learning (CoRL) 2025 Project site: this https URL
Abstract:This paper considers the problem of enabling robots to navigate dynamic environments while following instructions. The challenge lies in the combinatorial nature of instruction specifications: each instruction can include multiple specifications, and the number of possible specification combinations grows exponentially as the robot’s skill set expands. For example, “overtake the pedestrian while staying on the right side of the road” consists of two specifications: “overtake the pedestrian” and “walk on the right side of the road.” To tackle this challenge, we propose ComposableNav, based on the intuition that following an instruction involves independently satisfying its constituent specifications, each corresponding to a distinct motion primitive. Using diffusion models, ComposableNav learns each primitive separately, then composes them in parallel at deployment time to satisfy novel combinations of specifications unseen in training. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we propose a two-stage training procedure: (1) supervised pre-training to learn a base diffusion model for dynamic navigation, and (2) reinforcement learning fine-tuning that molds the base model into different motion primitives. Through simulation and real-world experiments, we show that ComposableNav enables robots to follow instructions by generating trajectories that satisfy diverse and unseen combinations of specifications, significantly outperforming both non-compositional VLM-based policies and costmap composing baselines. Videos and additional materials can be found on the project page: this https URL
zh
[CV-18] DriveDPO: Policy Learning via Safety DPO For End-to-End Autonomous Driving NEURIPS2025
【速读】:该论文旨在解决端到端自动驾驶中因模仿学习(imitation learning)导致的安全性不足问题,即现有方法难以区分看似人类驾驶但潜在不安全的轨迹。其核心解决方案是提出DriveDPO框架,关键在于通过蒸馏融合人类模仿相似性与规则驱动的安全评分,构建统一的策略分布以直接优化策略;并引入迭代式的轨迹级偏好对齐阶段,实现策略优化与偏好监督的耦合,从而在保障安全性的同时提升驾驶行为的可靠性与性能。
链接: https://arxiv.org/abs/2509.17940
作者: Shuyao Shang,Yuntao Chen,Yuqi Wang,Yingyan Li,Zhaoxiang Zhang
机构: NLPR, Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); University of Chinese Academy of Sciences (中国科学院大学); MiroMind
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:End-to-end autonomous driving has substantially progressed by directly predicting future trajectories from raw perception inputs, which bypasses traditional modular pipelines. However, mainstream methods trained via imitation learning suffer from critical safety limitations, as they fail to distinguish between trajectories that appear human-like but are potentially unsafe. Some recent approaches attempt to address this by regressing multiple rule-driven scores but decoupling supervision from policy optimization, resulting in suboptimal performance. To tackle these challenges, we propose DriveDPO, a Safety Direct Preference Optimization Policy Learning framework. First, we distill a unified policy distribution from human imitation similarity and rule-based safety scores for direct policy optimization. Further, we introduce an iterative Direct Preference Optimization stage formulated as trajectory-level preference alignment. Extensive experiments on the NAVSIM benchmark demonstrate that DriveDPO achieves a new state-of-the-art PDMS of 90.0. Furthermore, qualitative results across diverse challenging scenarios highlight DriveDPO’s ability to produce safer and more reliable driving behaviors.
zh
[CV-19] Multi-needle Localization for Pelvic Seed Implant Brachytherapy based on Tip-handle Detection and Matching
【速读】:该论文旨在解决盆腔粒子植入近距离放射治疗中术中CT图像下多针定位困难的问题,尤其针对图像对比度低和针体粘连导致的定位不准难题。其解决方案的关键在于将针定位重构为针尖-针柄检测与匹配问题:首先提出一种基于HRNet的无锚框网络,通过解耦的热图回归分支与极角预测分支实现多尺度特征提取及针尖和针柄中心位置与方向的精准检测;进而设计了一种贪心匹配与合并(GMM)方法,用于处理带约束的不平衡分配问题(UAP-C),通过迭代选择最可能的针尖-针柄配对并基于距离度量合并,从而重建三维针路径。该方案在100例患者数据集上表现优于基于nnUNet的分割方法,显著提升了精度与F1分数,展现出更强的鲁棒性和准确性。
链接: https://arxiv.org/abs/2509.17931
作者: Zhuo Xiao,Fugen Zhou,Jingjing Wang,Chongyu He,Bo Liu,Haitao Sun,Zhe Ji,Yuliang Jiang,Junjie Wang,Qiuwen Wu
机构: Beihang University (北京航空航天大学); Peking University Third Hospital (北京大学第三医院); Duke University Medical Center (杜克大学医学中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:
Abstract:Accurate multi-needle localization in intraoperative CT images is crucial for optimizing seed placement in pelvic seed implant brachytherapy. However, this task is challenging due to poor image contrast and needle adhesion. This paper presents a novel approach that reframes needle localization as a tip-handle detection and matching problem to overcome these difficulties. An anchor-free network, based on HRNet, is proposed to extract multi-scale features and accurately detect needle tips and handles by predicting their centers and orientations using decoupled branches for heatmap regression and polar angle prediction. To associate detected tips and handles into individual needles, a greedy matching and merging (GMM) method designed to solve the unbalanced assignment problem with constraints (UAP-C) is presented. The GMM method iteratively selects the most probable tip-handle pairs and merges them based on a distance metric to reconstruct 3D needle paths. Evaluated on a dataset of 100 patients, the proposed method demonstrates superior performance, achieving higher precision and F1 score compared to a segmentation-based method utilizing the nnUNet model,thereby offering a more robust and accurate solution for needle localization in complex clinical scenarios.
zh
[CV-20] SmaRT: Style-Modulated Robust Test-Time Adaptation for Cross-Domain Brain Tumor Segmentation in MRI
【速读】:该论文旨在解决医学影像中脑肿瘤分割模型在面对不同扫描设备、成像协议及人群异质性导致的域偏移(domain shift)时,性能显著下降的问题,尤其在低资源和儿科群体中表现不稳定。其解决方案的关键在于提出一种风格调制的鲁棒测试时自适应框架(SmaRT),通过三个核心机制实现:(1)基于风格感知的增强以缓解外观差异;(2)双分支动量策略用于稳定伪标签优化;(3)结构先验约束确保解剖一致性、完整性与连通性。这一协同设计保障了极端域偏移下的适应稳定性与解剖学保真度,显著提升了Dice分数和边界精度,在撒哈拉以南非洲和儿科胶质瘤数据集上优于现有最先进方法。
链接: https://arxiv.org/abs/2509.17925
作者: Yuanhan Wang,Yifei Chen,Shuo Jiang,Wenjing Yu,Mingxuan Liu,Beining Wu,Jinying Zong,Feiwei Qin,Changmiao Wang,Qiyuan Tian
机构: Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学); Shenzhen Research Institute of Big Data (深圳大数据研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 6 figures
Abstract:Reliable brain tumor segmentation in MRI is indispensable for treatment planning and outcome monitoring, yet models trained on curated benchmarks often fail under domain shifts arising from scanner and protocol variability as well as population heterogeneity. Such gaps are especially severe in low-resource and pediatric cohorts, where conventional test-time or source-free adaptation strategies often suffer from instability and structural inconsistency. We propose SmaRT, a style-modulated robust test-time adaptation framework that enables source-free cross-domain generalization. SmaRT integrates style-aware augmentation to mitigate appearance discrepancies, a dual-branch momentum strategy for stable pseudo-label refinement, and structural priors enforcing consistency, integrity, and connectivity. This synergy ensures both adaptation stability and anatomical fidelity under extreme domain shifts. Extensive evaluations on sub-Saharan Africa and pediatric glioma datasets show that SmaRT consistently outperforms state-of-the-art methods, with notable gains in Dice accuracy and boundary precision. Overall, SmaRT bridges the gap between algorithmic advances and equitable clinical applicability, supporting robust deployment of MRI-based neuro-oncology tools in diverse clinical environments. Our source code is available at this https URL.
zh
[CV-21] Does Audio Matter for Modern Video-LLM s and Their Benchmarks?
【速读】:该论文旨在解决当前视频大语言模型(Video-LLMs)在评估中普遍忽视音频信息的问题,即多数基准测试使用静音视频或忽略音频,导致对模型真实多模态理解能力的误判。研究发现,许多现有评测任务仅依赖单帧图像即可完成,音频贡献有限;但针对特定音频敏感场景(如音乐识别、语音指令解析),音频仍具有决定性作用。解决方案的关键在于:基于LLaVA-OneVision架构引入轻量级语音/音频编码器(如Whisper),并通过Mamba-based状态空间压缩器缓解音频token爆炸问题,从而实现高效音频处理;同时构建两个新的硬核音频视觉问答数据集(AVQA-Hard 和 Music-AVQA-Hard),以推动更真实、可信赖的音频-视觉 Video-LLM 评估体系。
链接: https://arxiv.org/abs/2509.17901
作者: Geewook Kim,Minjoon Seo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
备注: 5 pages, 2 figures, under review. Project page: this https URL
Abstract:Modern multimodal large language models often claim “video understanding,” yet most evaluations use muted videos or simply discard audio. We ask a direct question: how much does audio actually matter for contemporary Video-LLMs and the benchmarks that certify them? We audit widely used suites and observe that many items are even solvable from a single frame, rendering audio largely redundant. Building on LLaVA-OneVision architecture, we attach a speech/audio encoder (e.g., Whisper) and analyze when audio helps, while addressing audio token explosion with a lightweight Mamba-based state-space token compressor. We find that audio yields minimal gains on recent video benchmarks but is decisive on curated, audio-sensitive subsets. To enable faithful evaluation, we release AVQA-Hard and Music-AVQA-Hard, our model, and code. Our findings surface a growing gap between current academic practice and real-world expectations, and provide practical tools for scalable audio-visual Video-LLMs. We will fully open-source our work at this https URL.
zh
[CV-22] rainee Action Recognition through Interaction Analysis in CCATT Mixed-Reality Training
【速读】:该论文旨在解决重症监护空中转运团队(Critical Care Air Transport Team, CCATT)在高压力环境下进行训练时,传统人工评估方法主观性强、难以捕捉关键事件且一致性差的问题。其核心挑战在于如何客观、全面地评估多成员团队在复杂环境中的协同表现,尤其是在存在环境噪声和多人交互跟踪需求的情况下。解决方案的关键在于提出一种数据驱动的系统性评估框架,融合认知任务分析(Cognitive Task Analysis, CTA)与多模态学习分析(Multimodal Learning Analytics, MMLA),构建面向CCATT操作的领域特定CTA模型,并利用基于视觉的动作识别流水线(采用微调的人体-物体交互模型Cascade Disentangling Network, CDN)自动检测和追踪受训人员与设备的交互行为,从而生成可解释的性能指标(如反应时间、任务时长),并映射至层级化CTA结构中,实现对团队动态的客观、精准、可解释的评估。
链接: https://arxiv.org/abs/2509.17888
作者: Divya Mereddy,Marcos Quinones-Grueiro,Ashwin T S,Eduardo Davalos,Gautam Biswas,Kent Etherton,Tyler Davis,Katelyn Kay,Jill Lear,Benjamin Goldberg
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:This study examines how Critical Care Air Transport Team (CCATT) members are trained using mixed-reality simulations that replicate the high-pressure conditions of aeromedical evacuation. Each team - a physician, nurse, and respiratory therapist - must stabilize severely injured soldiers by managing ventilators, IV pumps, and suction devices during flight. Proficient performance requires clinical expertise and cognitive skills, such as situational awareness, rapid decision-making, effective communication, and coordinated task management, all of which must be maintained under stress. Recent advances in simulation and multimodal data analytics enable more objective and comprehensive performance evaluation. In contrast, traditional instructor-led assessments are subjective and may overlook critical events, thereby limiting generalizability and consistency. However, AI-based automated and more objective evaluation metrics still demand human input to train the AI algorithms to assess complex team dynamics in the presence of environmental noise and the need for accurate re-identification in multi-person tracking. To address these challenges, we introduce a systematic, data-driven assessment framework that combines Cognitive Task Analysis (CTA) with Multimodal Learning Analytics (MMLA). We have developed a domain-specific CTA model for CCATT training and a vision-based action recognition pipeline using a fine-tuned Human-Object Interaction model, the Cascade Disentangling Network (CDN), to detect and track trainee-equipment interactions over time. These interactions automatically yield performance indicators (e.g., reaction time, task duration), which are mapped onto a hierarchical CTA model tailored to CCATT operations, enabling interpretable, domain-relevant performance evaluations.
zh
[CV-23] Sight Over Site: Perception-Aware Reinforcement Learning for Efficient Robotic Inspection
【速读】:该论文旨在解决传统自主巡检任务中仅以到达目标位置为优化目标所带来的效率低下问题,即在真实环境中,目标可能在未抵达精确坐标前已可被感知,此时继续移动会造成冗余与低效。其解决方案的关键在于提出一种基于感知意识(perception-aware)的端到端强化学习框架,将目标可见性作为核心优化目标,使机器人能够通过结合感知和本体感觉(proprioceptive sensing)信息,自动规划出最短且能保证视觉接触的目标巡检路径,无需依赖地图,并在仿真中训练后直接部署至真实机器人系统。
链接: https://arxiv.org/abs/2509.17877
作者: Richard Kuhlmann,Jakob Wolfram,Boyang Sun,Jiaxu Xing,Davide Scaramuzza,Marc Pollefeys,Cesar Cadena
机构: ETH Zurich (苏黎世联邦理工学院); University of Zurich (苏黎世大学); Microsoft Mixed Reality and AI Lab (微软混合现实与人工智能实验室)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Autonomous inspection is a central problem in robotics, with applications ranging from industrial monitoring to search-and-rescue. Traditionally, inspection has often been reduced to navigation tasks, where the objective is to reach a predefined location while avoiding obstacles. However, this formulation captures only part of the real inspection problem. In real-world environments, the inspection targets may become visible well before their exact coordinates are reached, making further movement both redundant and inefficient. What matters more for inspection is not simply arriving at the target’s position, but positioning the robot at a viewpoint from which the target becomes observable. In this work, we revisit inspection from a perception-aware perspective. We propose an end-to-end reinforcement learning framework that explicitly incorporates target visibility as the primary objective, enabling the robot to find the shortest trajectory that guarantees visual contact with the target without relying on a map. The learned policy leverages both perceptual and proprioceptive sensing and is trained entirely in simulation, before being deployed to a real-world robot. We further develop an algorithm to compute ground-truth shortest inspection paths, which provides a reference for evaluation. Through extensive experiments, we show that our method outperforms existing classical and learning-based navigation approaches, yielding more efficient inspection trajectories in both simulated and real-world settings. The project is avialable at this https URL
zh
[CV-24] ProDyG: Progressive Dynamic Scene Reconstruction via Gaussian Splatting from Monocular Videos
【速读】:该论文旨在解决动态场景下在线、高保真三维重建的挑战,即如何在保证实时性的同时实现全局位姿与地图一致性、细节丰富的外观建模,并兼容RGB与RGB-D输入。现有SLAM方法通常仅能去除动态物体或依赖RGB-D数据,而离线方法难以扩展至长视频序列,基于Transformer的前馈方法则缺乏全局一致性和外观细节。解决方案的关键在于:在SLAM系统中解耦静态与动态部分,通过新颖的运动掩码策略实现鲁棒位姿跟踪,并利用渐进式适应的Motion Scaffolds图对动态区域进行重建,从而在保持在线运行效率的同时生成媲美离线方法的新视角渲染结果,并达到与当前最优动态SLAM方法相当的跟踪性能。
链接: https://arxiv.org/abs/2509.17864
作者: Shi Chen,Erik Sandström,Sandro Lombardi,Siyuan Li,Martin R. Oswald
机构: ETH Zürich(苏黎世联邦理工学院); Google(谷歌); Independent Researcher(独立研究员); University of Amsterdam(阿姆斯特丹大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Achieving truly practical dynamic 3D reconstruction requires online operation, global pose and map consistency, detailed appearance modeling, and the flexibility to handle both RGB and RGB-D inputs. However, existing SLAM methods typically merely remove the dynamic parts or require RGB-D input, while offline methods are not scalable to long video sequences, and current transformer-based feedforward methods lack global consistency and appearance details. To this end, we achieve online dynamic scene reconstruction by disentangling the static and dynamic parts within a SLAM system. The poses are tracked robustly with a novel motion masking strategy, and dynamic parts are reconstructed leveraging a progressive adaptation of a Motion Scaffolds graph. Our method yields novel view renderings competitive to offline methods and achieves on-par tracking with state-of-the-art dynamic SLAM methods.
zh
[CV-25] Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology NEURIPS2025
【速读】:该论文旨在解决组织病理学(histopathology)中合成数据生成的三大挑战:保持组织异质性、捕捉细微形态特征以及扩展至未标注数据集。其解决方案的关键在于提出一种基于潜在扩散模型(latent diffusion model)的新颖双条件机制,通过将语义分割图与特定组织的原始视觉裁剪图像(tissue-specific visual crops)相结合,直接引入真实组织区域的细粒度结构信息,从而在不依赖文本提示或抽象视觉嵌入的情况下保留关键形态细节。该方法在标注数据集(如Camelyon16和Panda)上提取具有20%-80%异质性的图像块,并在未标注数据(如TCGA)上利用基础模型嵌入自动聚类全切片图像生成伪语义图,实现无监督扩展。实验表明,该方案生成的高保真图像具备精确的区域注释,在下游分割任务中表现优异,显著优于传统提示引导合成方法(Frechet Distance降低达6倍),且训练模型性能接近真实数据基准,为计算病理学中多样化、标注化的数据生成提供了实用且可扩展的解决方案。
链接: https://arxiv.org/abs/2509.17847
作者: Saghir Alfasly,Wataru Uegami,MD Enamul Hoq,Ghazal Alabtah,H.R. Tizhoosh
机构: KIMIA Lab, Department of Artificial Intelligence and Informatics, Mayo Clinic (梅奥诊所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual-conditioning approach combining semantic segmentation maps with tissue-specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20-80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self-supervised extension that clusters whole-slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo-semantic maps for training. Our method synthesizes high-fidelity images with precise region-wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data show competitive performance to those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt-guided synthesis reduces Frechet Distance by up to 6X on Camelyon16 (from 430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole-slide images without manual annotations, our framework offers a practical solution for an urgent need for generating diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.
zh
[CV-26] ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment
【速读】:该论文旨在解决训练-free视频对象编辑(training-free video object editing)中面临的两大核心挑战:一是由于使用一阶求解器导致的重建不准确问题,二是因粗略的“硬”特征替换引发的上下文冲突问题,尤其在扩散变换器(Diffusion Transformer, DiT)架构下更为显著。为应对这些问题,论文提出ContextFlow框架,其关键创新在于两个方面:首先,采用高阶修正流(Rectified Flow)求解器构建更稳定的编辑基础;其次,设计自适应上下文增强机制(Adaptive Context Enrichment),通过拼接并行重建与编辑路径中的Key-Value对来动态丰富自注意力上下文,而非直接替换特征,从而缓解上下文冲突。此外,论文进一步提出基于引导响应度量(Guidance Responsiveness Metric)的数据驱动分析方法,精准识别不同任务(如插入、替换)对应的关键DiT层,实现针对性强且高效的引导策略。
链接: https://arxiv.org/abs/2509.17818
作者: Yiyang Chen,Xuanhua He,Xiujun Ma,Yue Ma
机构: Peking University (北京大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The project page is at this https URL
Abstract:Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude “hard” feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance challenging. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
zh
[CV-27] Enhancing Semantic Segmentation with Continual Self-Supervised Pre-training
【速读】:该论文旨在解决视觉基础模型在新领域中进行无监督且数据高效迁移的问题,尤其聚焦于密集预测任务(如语义分割)的性能提升。现有方法多依赖大规模标注数据进行微调,而对有限标注场景下的自监督预训练扩展研究不足。解决方案的关键在于提出一种新的持续自监督预训练任务GLARE(Global Local and Regional Enforcement),其通过引入patch级增强以强化局部一致性,并结合区域一致性约束来利用数据中的空间语义信息;同时,在持续预训练过程中仅更新轻量级适配模块(UniAdapter),冻结主干网络(Vision Transformer),从而在极小计算和参数开销下显著提升下游语义分割性能。
链接: https://arxiv.org/abs/2509.17816
作者: Brown Ebouky,Ajad Chhatkuli,Cristiano Malossi,Christoph Studer,Roy Assaf,Andrea Bartezzaghi
机构: ETH Zurich (苏黎世联邦理工学院); IBM Research - Zurich (IBM研究 - 苏黎世); INSAIT; Kaiko; IBM Research - Zurich (IBM研究 - 苏黎世)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 24 pages, 5 figures
Abstract:Self-supervised learning (SSL) has emerged as a central paradigm for training foundation models by leveraging large-scale unlabeled datasets, often producing representations with strong generalization capabilities. These models are typically pre-trained on general-purpose datasets such as ImageNet and subsequently adapted to various downstream tasks through finetuning. While recent advances have explored parameter-efficient strategies for adapting pre-trained models, extending SSL pre-training itself to new domains - particularly under limited data regimes and for dense prediction tasks - remains underexplored. In this work, we address the problem of adapting vision foundation models to new domains in an unsupervised and data-efficient manner, specifically targeting downstream semantic segmentation. We propose GLARE (Global Local and Regional Enforcement), a novel continual self-supervised pre-training task designed to enhance downstream segmentation performance. GLARE introduces patch-level augmentations to encourage local consistency and incorporates a regional consistency constraint that leverages spatial semantics in the data. For efficient continual pre-training, we initialize Vision Transformers (ViTs) with weights from existing SSL models and update only lightweight adapter modules - specifically UniAdapter - while keeping the rest of the backbone frozen. Experiments across multiple semantic segmentation benchmarks on different domains demonstrate that GLARE consistently improves downstream performance with minimal computational and parameter overhead.
zh
[CV-28] Selecting Optimal Camera Views for Gait Analysis: A Multi-Metric Assessment of 2D Projections
【速读】:该论文旨在系统量化相机视角(正面视图与侧向视图)对无标记二维(2D)步态分析准确性的影响,相对于三维(3D)运动捕捉的基准数据。其解决方案的关键在于通过同步采集18名受试者的正面、侧向及3D运动捕捉数据,利用YOLOv8进行姿态估计,并采用动态时间规整(DTW)、最大互相关(MCC)、Kullback-Leibler散度(KLD)和信息熵(IE)四项指标评估不同视图下步态参数的一致性,结合Wilcoxon符号秩检验与Cliff’s delta效应量分析,揭示了侧向视图在矢状面运动学参数(如步长、膝关节旋转)上显著优于正面视图,而正面视图在躯干对称性参数(如躯干旋转、腕至髋中点距离)上表现更优。这一发现为临床实践中基于任务需求优化摄像头部署提供了首项系统性实证依据。
链接: https://arxiv.org/abs/2509.17805
作者: Dong Chen,Huili Peng,Yong Hu,Kenneth MC. Cheung
机构: The University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Objective: To systematically quantify the effect of the camera view (frontal vs. lateral) on the accuracy of 2D markerless gait analysis relative to 3D motion capture ground truth. Methods: Gait data from 18 subjects were recorded simultaneously using frontal, lateral and 3D motion capture systems. Pose estimation used YOLOv8. Four metrics were assessed to evaluate agreement: Dynamic Time Warping (DTW) for temporal alignment, Maximum Cross-Correlation (MCC) for signal similarity, Kullback-Leibler Divergence (KLD) for distribution differences, and Information Entropy (IE) for complexity. Wilcoxon signed-rank tests (significance: p 0.05 ) and Cliff’s delta ( \delta ) were used to measure statistical differences and effect sizes. Results: Lateral views significantly outperformed frontal views for sagittal plane kinematics: step length (DTW: 53.08 \pm 24.50 vs. 69.87 \pm 25.36 , p = 0.005 ) and knee rotation (DTW: 106.46 \pm 38.57 vs. 155.41 \pm 41.77 , p = 0.004 ). Frontal views were superior for symmetry parameters: trunk rotation (KLD: 0.09 \pm 0.06 vs. 0.30 \pm 0.19 , p 0.001 ) and wrist-to-hipmid distance (MCC: 105.77 \pm 29.72 vs. 75.20 \pm 20.38 , p = 0.003 ). Effect sizes were medium-to-large ( \delta: 0.34 – 0.76 ). Conclusion: Camera view critically impacts gait parameter accuracy. Lateral views are optimal for sagittal kinematics; frontal views excel for trunk symmetry. Significance: This first systematic evidence enables data-driven camera deployment in 2D gait analysis, enhancing clinical utility. Future implementations should leverage both views via disease-oriented setups.
zh
[CV-29] S-P2CL: Plug-and-Play Dual Contrastive Learning for Vision-Guided Medical Time Series Classification
【速读】:该论文旨在解决医疗时间序列(Medical Time Series, MedTS)分类中因个体间异质性导致的跨受试者泛化能力差的问题。现有方法受限于模态特定的归纳偏置,难以学习到通用的不变表示。其解决方案的关键在于提出一种即插即用的框架TS-P²CL,通过将一维生理信号转换为二维伪图像(pseudo-images),借助预训练视觉模型的通用模式识别能力,建立时间序列与视觉语义之间的桥梁;在此统一空间中,采用双对比学习策略——模态内一致性保持时间连贯性,模态间对齐对齐时序动态与视觉语义,从而缓解个体特异性偏差,学习鲁棒且域不变的特征。
链接: https://arxiv.org/abs/2509.17802
作者: Qi’ao Xu,Pengfei Wang,Bo Zhong,Tianwen Qian,Xiaoling Wang,Ye Wang,Hong Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures
Abstract:Medical time series (MedTS) classification is pivotal for intelligent healthcare, yet its efficacy is severely limited by poor cross-subject generation due to the profound cross-individual heterogeneity. Despite advances in architectural innovations and transfer learning techniques, current methods remain constrained by modality-specific inductive biases that limit their ability to learn universally invariant representations. To overcome this, we propose TS-P ^2 CL, a novel plug-and-play framework that leverages the universal pattern recognition capabilities of pre-trained vision models. We introduce a vision-guided paradigm that transforms 1D physiological signals into 2D pseudo-images, establishing a bridge to the visual domain. This transformation enables implicit access to rich semantic priors learned from natural images. Within this unified space, we employ a dual-contrastive learning strategy: intra-modal consistency enforces temporal coherence, while cross-modal alignment aligns time-series dynamics with visual semantics, thereby mitigating individual-specific biases and learning robust, domain-invariant features. Extensive experiments on six MedTS datasets demonstrate that TS-P ^2 CL consistently outperforms fourteen methods in both subject-dependent and subject-independent settings.
zh
[CV-30] Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding
【速读】:该论文旨在解决现实世界图像中存在空间异质性退化(如雾霾、雨雪、低光照等)导致的视觉质量下降及下游视觉任务性能受限的问题。现有统一图像恢复(All-in-One Restoration, AIR)方法依赖外部文本提示或手工设计的架构先验(如频域启发式规则),这些方法施加了离散且脆弱的假设,难以泛化到未见或混合退化场景。其解决方案的关键在于将AIR重构为可学习的潜在先验推理问题:通过输入自动推断退化感知表示,无需显式任务提示;并在此基础上构建结构化推理范式,明确三个核心问题——(1)选择哪些特征进行路由(自适应特征选择)、(2)在何处恢复(空间定位)、(3)恢复什么内容(退化语义)。作者进一步设计了一个轻量级解码模块,高效利用这些潜在编码线索实现空间自适应恢复,在六类常见退化任务、五种复合场景以及未见过的退化类型上均优于当前最优方法,平均PSNR提升1.68 dB且效率提高三倍。
链接: https://arxiv.org/abs/2509.17792
作者: S M A Sharif,Abdur Rehman,Fayaz Ali Dharejo,Radu Timofte,Rizwan Ali Naqvi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
zh
[CV-31] From Restoration to Reconstruction: Rethinking 3D Gaussian Splatting for Underwater Scenes
【速读】:该论文旨在解决水下图像退化对三维重建(3D reconstruction)带来的挑战,尤其是在复杂场景中传统简化物理模型失效的问题。其核心解决方案是提出一种统一框架 R-Splatting,将水下图像恢复(Underwater Image Restoration, UIR)与三维高斯喷溅(3D Gaussian Splatting, 3DGS)相结合,通过整合多种UIR模型生成的增强视图来提升渲染质量和几何保真度。关键创新包括:1)引入轻量级光照生成器以采样潜在代码实现多样且一致的渲染;2)采用对比损失确保光照表示的解耦与稳定性;3)提出不确定性感知的不透明度优化(Uncertainty-Aware Opacity Optimization, UAOO),将不透明度建模为随机函数以抑制光照变化引起的梯度突变并缓解对噪声或视角特异性伪影的过拟合。
链接: https://arxiv.org/abs/2509.17789
作者: Guoxi Huang,Haoran Wang,Zipeng Qi,Wenjun Lu,David Bull,Nantheera Anantrasirichai
机构: 1. University of Bristol (布里斯托大学); 2. Tsinghua University (清华大学); 3. University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Underwater image degradation poses significant challenges for 3D reconstruction, where simplified physical models often fail in complex scenes. We propose \textbfR-Splatting, a unified framework that bridges underwater image restoration (UIR) with 3D Gaussian Splatting (3DGS) to improve both rendering quality and geometric fidelity. Our method integrates multiple enhanced views produced by diverse UIR models into a single reconstruction pipeline. During inference, a lightweight illumination generator samples latent codes to support diverse yet coherent renderings, while a contrastive loss ensures disentangled and stable illumination representations. Furthermore, we propose \textitUncertainty-Aware Opacity Optimization (UAOO), which models opacity as a stochastic function to regularize training. This suppresses abrupt gradient responses triggered by illumination variation and mitigates overfitting to noisy or view-specific artifacts. Experiments on Seathru-NeRF and our new BlueCoral3D dataset demonstrate that R-Splatting outperforms strong baselines in both rendering quality and geometric accuracy.
zh
[CV-32] Accurate and Efficient Low-Rank Model Merging in Core Space NEURIPS2025
【速读】:该论文旨在解决低秩适配(Low-Rank Adaptation, LoRA)模型在合并过程中效率损失的问题。现有合并方法通常需要操作完整尺寸的权重矩阵,导致计算资源消耗增加,削弱了LoRA本身参数高效性的优势。其解决方案的关键在于提出Core Space合并框架,该框架通过将多个LoRA适配模型投影到一个共享的对齐基(alignment basis)中进行合并,从而在不丢失信息的前提下保持低秩结构的效率,并显著提升任务性能。论文进一步提供了形式化证明以确保信息无损,并通过复杂度分析验证了该方法的计算效率优势。
链接: https://arxiv.org/abs/2509.17786
作者: Aniello Panariello,Daniel Marczak,Simone Magistri,Angelo Porrello,Bartłomiej Twardowski,Andrew D. Bagdanov,Simone Calderara,Joost van de Weijer
机构: AImageLab, University of Modena and Reggio Emilia, Italy; Warsaw University of Technology, Poland; IDEAS NCBR, Warsaw, Poland; Media Integration and Communication Center (MICC), University of Florence, Italy; IDEAS Research Institute, Warsaw, Poland; Computer Vision Center, Universitat Autònoma de Barcelona, Spain
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at 39th Conference on Neural Information Processing Systems (NeurIPS 2025), San Diego, USA
Abstract:In this paper, we address the challenges associated with merging low-rank adaptations of large neural networks. With the rise of parameter-efficient adaptation techniques, such as Low-Rank Adaptation (LoRA), model fine-tuning has become more accessible. While fine-tuning models with LoRA is highly efficient, existing merging methods often sacrifice this efficiency by merging fully-sized weight matrices. We propose the Core Space merging framework, which enables the merging of LoRA-adapted models within a common alignment basis, thereby preserving the efficiency of low-rank adaptation while substantially improving accuracy across tasks. We further provide a formal proof that projection into Core Space ensures no loss of information and provide a complexity analysis showing the efficiency gains. Extensive empirical results demonstrate that Core Space significantly improves existing merging techniques and achieves state-of-the-art results on both vision and language tasks while utilizing a fraction of the computational resources. Codebase is available at this https URL.
zh
[CV-33] I2VWM: Robust Watermarking for Image to Video Generation
【速读】:该论文旨在解决图像引导视频生成(I2V)场景下数字水印难以追踪源图像的问题,即现有水印方法在单一模态中表现鲁棒,但在跨模态的I2V生成过程中无法有效保留和识别水印信号。解决方案的关键在于提出“鲁棒扩散距离”(Robust Diffusion Distance)这一新概念,用以衡量水印信号在生成视频中的时间持久性,并设计了I2VWM框架:该框架在训练阶段引入视频模拟噪声层增强鲁棒性,在推理阶段采用基于光流的对齐模块实现跨模态水印定位,从而显著提升水印在生成视频中的稳定性和不可感知性,为生成式视频时代的跨模态水印技术树立了新范式。
链接: https://arxiv.org/abs/2509.17773
作者: Guanjie Wang,Zehua Ma,Han Fang,Weiming Zhang
机构: University of Science and Technology of China (中国科学技术大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages
Abstract:The rapid progress of image-guided video generation (I2V) has raised concerns about its potential misuse in misinformation and fraud, underscoring the urgent need for effective digital watermarking. While existing watermarking methods demonstrate robustness within a single modality, they fail to trace source images in I2V settings. To address this gap, we introduce the concept of Robust Diffusion Distance, which measures the temporal persistence of watermark signals in generated videos. Building on this, we propose I2VWM, a cross-modal watermarking framework designed to enhance watermark robustness across time. I2VWM leverages a video-simulation noise layer during training and employs an optical-flow-based alignment module during inference. Experiments on both open-source and commercial I2V models demonstrate that I2VWM significantly improves robustness while maintaining imperceptibility, establishing a new paradigm for cross-modal watermarking in the era of generative video. \hrefthis https URLCode Released.
zh
[CV-34] Incorporating the Refractory Period into Spiking Neural Networks through Spike-Triggered Threshold Dynamics
【速读】:该论文旨在解决现有脉冲神经网络(Spiking Neural Networks, SNNs)中主流神经元模型(如LIF)忽略生物神经元关键特性——不应期(refractory period)的问题。这一缺失导致神经元在持续输入下易发生过度兴奋,且对异常信号干扰敏感,从而影响模型的稳定性与性能。解决方案的关键在于提出一种简单而高效的方法——RPLIF(Refractory Period-Enabled Leaky Integrate-and-Fire),通过基于脉冲触发的阈值动态机制引入不应期,确保每个脉冲准确编码神经信息,有效抑制过激发并提升抗干扰能力;该方法计算开销极小,能显著增强SNN在低时步数和低延迟场景下的鲁棒性与性能,在Cifar10-DVS、N-Caltech101和DVS128 Gesture等数据集上均达到当前最优结果。
链接: https://arxiv.org/abs/2509.17769
作者: Yang Li,Xinyi Zeng,Zhe Xue,Pinxian Zeng,Zikai Zhang,Yan Wang
机构: Sichuan University (四川大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As the third generation of neural networks, spiking neural networks (SNNs) have recently gained widespread attention for their biological plausibility, energy efficiency, and effectiveness in processing neuromorphic datasets. To better emulate biological neurons, various models such as Integrate-and-Fire (IF) and Leaky Integrate-and-Fire (LIF) have been widely adopted in SNNs. However, these neuron models overlook the refractory period, a fundamental characteristic of biological neurons. Research on excitable neurons reveal that after firing, neurons enter a refractory period during which they are temporarily unresponsive to subsequent stimuli. This mechanism is critical for preventing over-excitation and mitigating interference from aberrant signals. Therefore, we propose a simple yet effective method to incorporate the refractory period into spiking LIF neurons through spike-triggered threshold dynamics, termed RPLIF. Our method ensures that each spike accurately encodes neural information, effectively preventing neuron over-excitation under continuous inputs and interference from anomalous inputs. Incorporating the refractory period into LIF neurons is seamless and computationally efficient, enhancing robustness and efficiency while yielding better performance with negligible overhead. To the best of our knowledge, RPLIF achieves state-of-the-art performance on Cifar10-DVS(82.40%) and N-Caltech101(83.35%) with fewer timesteps and demonstrates superior performance on DVS128 Gesture(97.22%) at low latency.
zh
[CV-35] Neural-MMGS: Multi-modal Neural Gaussian Splats for Large-Scale Scene Reconstruction
【速读】:该论文旨在解决多模态大规模场景重建中因传感器数据冗余存储与跨模态信息交互受限导致的内存开销大、可扩展性差的问题。现有方法通常将图像、LiDAR和语义信息作为独立参数附加到高斯(Gaussian)上,不仅增加内存占用,还限制了不同模态间的特征融合。其解决方案的关键在于提出 Neural-MMGS 框架,通过在每个高斯点内构建一个紧凑且可学习的嵌入(embedding),隐式编码光学、物理(如 LiDAR 的深度和反射强度)和语义特征,并使用轻量级神经解码器从该嵌入映射出高斯参数,从而实现多模态信息高效融合与低内存消耗的重建,显著提升了大规模场景重建的性能与可扩展性。
链接: https://arxiv.org/abs/2509.17762
作者: Sitian Shen,Georgi Pramatarov,Yifu Tao,Daniele De Martini
机构: Mobile Robotics Group (MRG), Oxford Robotics Institute (牛津机器人研究所), Department of Engineering Science (工程科学系), University of Oxford (牛津大学); Dynamic Robot Systems Group (DRS), Oxford Robotics Institute (牛津机器人研究所), Department of Engineering Science (工程科学系), University of Oxford (牛津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper proposes Neural-MMGS, a novel neural 3DGS framework for multimodal large-scale scene reconstruction that fuses multiple sensing modalities in a per-gaussian compact, learnable embedding. While recent works focusing on large-scale scene reconstruction have incorporated LiDAR data to provide more accurate geometric constraints, we argue that LiDAR’s rich physical properties remain underexplored. Similarly, semantic information has been used for object retrieval, but could provide valuable high-level context for scene reconstruction. Traditional approaches append these properties to Gaussians as separate parameters, increasing memory usage and limiting information exchange across modalities. Instead, our approach fuses all modalities – image, LiDAR, and semantics – into a compact, learnable embedding that implicitly encodes optical, physical, and semantic features in each Gaussian. We then train lightweight neural decoders to map these embeddings to Gaussian parameters, enabling the reconstruction of each sensing modality with lower memory overhead and improved scalability. We evaluate Neural-MMGS on the Oxford Spires and KITTI-360 datasets. On Oxford Spires, we achieve higher-quality reconstructions, while on KITTI-360, our method reaches competitive results with less storage consumption compared with current approaches in LiDAR-based novel-view synthesis.
zh
[CV-36] Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance
【速读】:该论文旨在解决图像中被遮挡物体的无感补全(Amodal Completion)问题,传统方法在数据依赖性、泛化能力以及渐进式流程中的误差累积方面存在局限。其解决方案的关键在于提出一种基于前置协同推理的多智能体推理框架(Collaborative Multi-Agent Reasoning Framework),通过多个智能体协作分析遮挡关系并确定必要的边界扩展,生成精确掩码用于图像修复;同时,一个智能体生成细粒度文本描述以实现细粒度语义引导(Fine-Grained Semantic Guidance),从而确保对象合成准确性并避免遮挡物或其他无关元素的再生,尤其在大范围修复区域中表现优异;此外,该方法直接输出由可见掩码和扩散变压器(Diffusion Transformer)注意力图引导的分层RGBA结果,省去了额外的分割步骤,显著提升了视觉质量与实用性。
链接: https://arxiv.org/abs/2509.17757
作者: Hongxing Fan,Lipeng Wang,Haohua Chen,Zehuan Huang,Jiangtao Wu,Lu Sheng
机构: Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
备注:
Abstract:Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.
zh
[CV-37] Learning Neural Antiderivatives
【速读】:该论文旨在解决如何在连续神经表示(neural fields)中直接学习函数的重复反导数(repeated antiderivatives),从而实现对传统离散累积运算(如 summed-area tables)的连续模拟。其关键在于提出并分析了一系列神经方法,包括对已有工作的改进和全新设计,以在无网格的连续空间中有效实现多阶积分操作,并验证其在重建质量及下游任务(如滤波与渲染)中的性能表现,从而将经典的累积算子无缝集成到现代神经系统中。
链接: https://arxiv.org/abs/2509.17755
作者: Fizza Rubab,Ntumba Elie Nsampi,Martin Balint,Felix Mujkanovic,Hans-Peter Seidel,Tobias Ritschel,Thomas Leimkühler
机构: Max-Planck-Institut für Informatik (马克斯·普朗克计算机科学研究所); Michigan State University (密歇根州立大学); University College London (伦敦大学学院)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注:
Abstract:Neural fields offer continuous, learnable representations that extend beyond traditional discrete formats in visual computing. We study the problem of learning neural representations of repeated antiderivatives directly from a function, a continuous analogue of summed-area tables. Although widely used in discrete domains, such cumulative schemes rely on grids, which prevents their applicability in continuous neural contexts. We introduce and analyze a range of neural methods for repeated integration, including both adaptations of prior work and novel designs. Our evaluation spans multiple input dimensionalities and integration orders, assessing both reconstruction quality and performance in downstream tasks such as filtering and rendering. These results enable integrating classical cumulative operators into modern neural systems and offer insights into learning tasks involving differential and integral operators.
zh
[CV-38] Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification
【速读】:该论文旨在解决类不平衡多标签图像分类(Class-Imbalanced Multi-Label Image Classification, CI-MLIC)任务中因数据长尾分布和少样本场景导致的性能下降问题。其核心挑战在于如何在多标签环境下有效缓解类别不平衡对模型学习的影响,同时提升对稀有类别和复杂语义关系的识别能力。解决方案的关键在于提出一种基于视觉语言预训练(Vision-Language Pretrained, VLP)模型的双视角对齐学习与分层提示调优方法(Dual-View Alignment Learning with Hierarchical Prompt, HP-DVAL),通过提取互补特征实现图像-文本间的精准对齐,并引入全局与局部提示机制以学习任务特定及上下文相关的先验知识;此外,设计语义一致性损失约束提示调优过程,防止学到的提示偏离VLP模型中蕴含的通用知识,从而显著提升模型在长尾和少样本场景下的泛化性能。
链接: https://arxiv.org/abs/2509.17747
作者: Sheng Huang,Jiexuan Yan,Beiyan Liu,Bo Liu,Richang Hong
机构: Chongqing University (重庆大学); Hefei University of Technology (合肥工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: accepted by IEEE Transactions on Image Processing
Abstract:Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.
zh
[CV-39] Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在视觉任务中生成程序化工作流时面临的三大挑战:依赖闭源模型、缺乏系统性推理能力,以及在长视频问答(long-form video question answering, videoQA)任务中的性能不足。其核心解决方案是提出FS-VisPR框架,关键在于设计了一个“快慢结合”的推理机制——对于简单查询直接由VideoLLMs快速响应,而对复杂问题则触发视觉程序推理(visual program reasoning),并引入低置信度判断与失败回退机制以增强可靠性;同时通过参数搜索优化视觉模块配置,在训练阶段筛选正确答案对应的程序变体,推理阶段选择最高置信度结果,从而显著提升视觉程序工作流的效率与鲁棒性。
链接: https://arxiv.org/abs/2509.17743
作者: Chenglin Li,Feng Han,FengTao,Ruilin Li,Qianglong Chen,Jingqi Tong,Yin Zhang,Jiaqi Wang
机构: Zhejiang University (浙江大学); Fudan University (复旦大学); Wuhan University (武汉大学); Shanghai AI Lab; Shanghai Innovation Institute
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models’ ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, matching the performance of Qwen2.5VL-72B on VideoMME.
zh
[CV-40] Automated Labeling of Intracranial Arteries with Uncertainty Quantification Using Deep Learning
【速读】:该论文旨在解决颅内动脉解剖标记(anatomical labeling)在脑血管诊断与血流动力学分析中存在耗时且依赖操作者间一致性差的问题。其解决方案的关键在于提出了一种基于深度学习的自动化框架,能够从三维时间飞跃磁共振血管成像(3D Time-of-Flight Magnetic Resonance Angiography, 3D ToF-MRA)分割结果中实现高精度标注,并引入不确定性量化(uncertainty quantification)以提升模型的可解释性与可靠性。其中,nnUNet架构在性能上表现最优(平均Dice分数0.922,平均表面距离0.387 mm),并结合测试时增强(test-time augmentation, TTA)与一种新颖的坐标引导策略降低插值误差,生成可靠的不确定性图谱,有效识别解剖模糊区域、病理变异或人工标注不一致处;进一步通过4D Flow MRI数据验证了自动标注与人工标注在血流速度上的高度一致性,证明该方法具备临床实用性与可扩展性。
链接: https://arxiv.org/abs/2509.17726
作者: Javier Bisbal,Patrick Winter,Sebastian Jofre,Aaron Ponce,Sameer A. Ansari,Ramez Abdalla,Michael Markl,Oliver Welin Odeback,Sergio Uribe,Cristian Tejos,Julio Sotelo,Susanne Schnell,David Marlevi
机构: Karolinska Institutet (卡罗林斯卡学院); Pontificia Universidad Católica de Chile (天主教智利大学); University of Greifswald (格赖夫斯瓦尔德大学); Northwestern University (西北大学); Universidad Técnica Federico Santa María (弗雷德里科·圣玛丽亚理工大学); Universidad de Valparaíso (瓦尔帕莱索大学); Monash University (莫纳什大学); Massachusetts Institute of Technology (麻省理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 16 pages, 6 figures
Abstract:Accurate anatomical labeling of intracranial arteries is essential for cerebrovascular diagnosis and hemodynamic analysis but remains time-consuming and subject to interoperator variability. We present a deep learning-based framework for automated artery labeling from 3D Time-of-Flight Magnetic Resonance Angiography (3D ToF-MRA) segmentations (n=35), incorporating uncertainty quantification to enhance interpretability and reliability. We evaluated three convolutional neural network architectures: (1) a UNet with residual encoder blocks, reflecting commonly used baselines in vascular labeling; (2) CS-Net, an attention-augmented UNet incorporating channel and spatial attention mechanisms for enhanced curvilinear structure recognition; and (3) nnUNet, a self-configuring framework that automates preprocessing, training, and architectural adaptation based on dataset characteristics. Among these, nnUNet achieved the highest labeling performance (average Dice score: 0.922; average surface distance: 0.387 mm), with improved robustness in anatomically complex vessels. To assess predictive confidence, we implemented test-time augmentation (TTA) and introduced a novel coordinate-guided strategy to reduce interpolation errors during augmented inference. The resulting uncertainty maps reliably indicated regions of anatomical ambiguity, pathological variation, or manual labeling inconsistency. We further validated clinical utility by comparing flow velocities derived from automated and manual labels in co-registered 4D Flow MRI datasets, observing close agreement with no statistically significant differences. Our framework offers a scalable, accurate, and uncertainty-aware solution for automated cerebrovascular labeling, supporting downstream hemodynamic analysis and facilitating clinical integration.
zh
[CV-41] RCTDistill: Cross-Modal Knowledge Distillation Framework for Radar-Camera 3D Object Detection with Temporal Fusion ICCV2025
【速读】:该论文旨在解决雷达-相机融合方法在3D目标检测中性能落后于激光雷达(LiDAR)方法的问题,尤其针对物体运动引起的不确定性以及雷达与相机模态固有的传感器误差未被充分建模的局限性。解决方案的关键在于提出一种基于时序融合的跨模态知识蒸馏(Knowledge Distillation, KD)方法——RCTDistill,其核心创新包含三个模块:Range-Azimuth Knowledge Distillation (RAKD) 用于校正雷达在距离和方位方向上的误差以优化鸟瞰图(BEV)表示;Temporal Knowledge Distillation (TKD) 通过对齐历史雷达-相机BEV特征与当前LiDAR表示来缓解动态物体导致的时间错位问题;Region-Decoupled Knowledge Distillation (RDKD) 则通过蒸馏教师模型中的关系知识增强特征判别力,提升前景与背景区域的区分能力。该方法在nuScenes和View-of-Delft(VoD)数据集上实现了最先进的雷达-相机融合性能,并具备26.2 FPS的快速推理速度。
链接: https://arxiv.org/abs/2509.17712
作者: Geonho Bang,Minjae Seong,Jisong Kim,Geunju Baek,Daye Oh,Junhyung Kim,Junho Koh,Jun Won Choi
机构: Seoul National University (首尔国立大学); Hanyang University (汉阳大学); Hyundai Motor Company (现代汽车公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ICCV 2025
Abstract:Radar-camera fusion methods have emerged as a cost-effective approach for 3D object detection but still lag behind LiDAR-based methods in performance. Recent works have focused on employing temporal fusion and Knowledge Distillation (KD) strategies to overcome these limitations. However, existing approaches have not sufficiently accounted for uncertainties arising from object motion or sensor-specific errors inherent in radar and camera modalities. In this work, we propose RCTDistill, a novel cross-modal KD method based on temporal fusion, comprising three key modules: Range-Azimuth Knowledge Distillation (RAKD), Temporal Knowledge Distillation (TKD), and Region-Decoupled Knowledge Distillation (RDKD). RAKD is designed to consider the inherent errors in the range and azimuth directions, enabling effective knowledge transfer from LiDAR features to refine inaccurate BEV representations. TKD mitigates temporal misalignment caused by dynamic objects by aligning historical radar-camera BEV features with current LiDAR representations. RDKD enhances feature discrimination by distilling relational knowledge from the teacher model, allowing the student to differentiate foreground and background features. RCTDistill achieves state-of-the-art radar-camera fusion performance on both the nuScenes and View-of-Delft (VoD) datasets, with the fastest inference speed of 26.2 FPS.
zh
[CV-42] Automatic Intermodal Loading Unit Identification using Computer Vision: A Scoping Review
【速读】:该论文旨在解决多式联运装载单元(Intermodal Loading Units, ILUs)在高吞吐量港口和码头中高效且稳健识别的瓶颈问题。其解决方案的关键在于利用计算机视觉(Computer Vision, CV)技术,从早期的数字图像处理(Digital Image Processing, DIP)和传统机器学习(Machine Learning, ML)逐步演进至当前以深度学习(Deep Learning, DL)为主导的方法,从而实现对ILU标识(如ISO6346码)的自动识别。论文指出,尽管CV提供了成本效益高的替代方案,但受限于缺乏公开基准数据集,导致性能指标(如端到端准确率)波动极大(5%–96%),因此提出建立标准化术语、开放数据集与共享源代码,并聚焦于无上下文文本识别等未来研究方向以推动该领域发展。
链接: https://arxiv.org/abs/2509.17707
作者: Emre Gülsoylu,Alhassan Abdelhalim,Derya Kara Boztas,Ole Grasse,Carlos Jahn,Simone Frintrop,Janick Edinger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submission to Transport Reviews. 36 pages, 2 figures, 4 tables
Abstract:The standardisation of Intermodal Loading Units (ILUs), such as containers, semi-trailers and swap bodies, has revolutionised global trade yet their efficient and robust identification remains a critical bottleneck in high-throughput ports and terminals. This paper reviews 63 empirical studies that propose computer vision (CV) based solutions. It covers the last 35 years (1990-2025), tracing the field’s evolution from early digital image processing (DIP) and traditional machine learning (ML) to the current dominance of deep learning (DL) techniques. While CV offers cost-effective alternatives for other types of identification techniques, its development is hindered by the lack of publicly available benchmarking datasets. This results in high variance for the reported results such as end-to-end accuracy ranging from 5 % to 96 %. Beyond dataset limitations, this review highlights the emerging challenges especially introduced by the shift from character-based text recognition to scene-text spotting and the integration of mobile cameras (e.g. drones, sensor equipped ground vehicles) for dynamic terminal monitoring. To advance the field, the paper calls for standardised terminology, open-access datasets, shared source code, while outlining future research directions such as contextless text recognition optimised for ISO6346 codes.
zh
[CV-43] Neurodynamics-Driven Coupled Neural P Systems for Multi-Focus Image Fusion
【速读】:该论文旨在解决多焦点图像融合(Multi-focus Image Fusion, MFIF)中决策图(decision map)边界不精确的问题,传统基于启发式规则或黑箱机制的深度学习方法难以生成高质量决策图。其解决方案的关键在于引入受脉冲机制启发的神经动力学驱动耦合神经P系统(Neurodynamics-driven Coupled Neural P system, CNP),通过深入分析模型神经动力学特性,明确网络参数与输入信号间的约束关系,从而避免神经元异常连续放电,确保聚焦与非聚焦区域的准确区分;进一步提出ND-CNPFuse模型,将源图像映射为可解释的脉冲矩阵,通过比较脉冲数量直接生成高精度决策图,无需后处理,显著提升了MFIF性能,在四个经典数据集上达到新最优结果。
链接: https://arxiv.org/abs/2509.17704
作者: Bo Li,Yunkuo Lei,Tingting Bao,Yaxian Wang,Lingling Zhang,Jun Liu
机构: Xi’an Jiaotong University (西安交通大学); Shaanxi Province Key Laboratory of Big Data Knowledge Engineering (陕西省大数据知识工程重点实验室); Ministry of Education Key Laboratory of Intelligent Networks and Network Security (教育部智能网络与网络安全重点实验室); Chang’an University (长安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 8 figures
Abstract:Multi-focus image fusion (MFIF) is a crucial technique in image processing, with a key challenge being the generation of decision maps with precise boundaries. However, traditional methods based on heuristic rules and deep learning methods with black-box mechanisms are difficult to generate high-quality decision maps. To overcome this challenge, we introduce neurodynamics-driven coupled neural P (CNP) systems, which are third-generation neural computation models inspired by spiking mechanisms, to enhance the accuracy of decision maps. Specifically, we first conduct an in-depth analysis of the model’s neurodynamics to identify the constraints between the network parameters and the input signals. This solid analysis avoids abnormal continuous firing of neurons and ensures the model accurately distinguishes between focused and unfocused regions, generating high-quality decision maps for MFIF. Based on this analysis, we propose a \textbfNeurodynamics-\textbfDriven \textbfCNP \textbfFusion model (\textbfND-CNPFuse) tailored for the challenging MFIF task. Unlike current ideas of decision map generation, ND-CNPFuse distinguishes between focused and unfocused regions by mapping the source image into interpretable spike matrices. By comparing the number of spikes, an accurate decision map can be generated directly without any post-processing. Extensive experimental results show that ND-CNPFuse achieves new state-of-the-art performance on four classical MFIF datasets, including Lytro, MFFW, MFI-WHU, and Real-MFF. The code is available at this https URL.
zh
[CV-44] Depth Edge Alignment Loss: DEALing with Depth in Weakly Supervised Semantic Segmentation
【速读】:该论文旨在解决自主机器人系统在新应用场景中进行语义分割时,因依赖大量昂贵的像素级密集标注数据而导致训练成本高昂的问题。其解决方案的关键在于提出一种与模型无关的深度边缘对齐损失(Depth Edge Alignment Loss),通过引入机器人系统中普遍可用的像素级深度信息作为额外监督信号,从图像级标签中生成像素级语义标签,从而在无需人工密集标注的前提下显著提升弱监督语义分割模型的性能。实验表明,该方法在多个数据集上均能有效改进分割效果,且可与其他损失函数结合以进一步提升性能。
链接: https://arxiv.org/abs/2509.17702
作者: Patrick Schmidt,Vasileios Belagiannis,Lazaros Nalpantidis
机构: DTU - Technical University of Denmark (丹麦技术大学); Friedrich-Alexander-Universität Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to IEEE
Abstract:Autonomous robotic systems applied to new domains require an abundance of expensive, pixel-level dense labels to train robust semantic segmentation models under full supervision. This study proposes a model-agnostic Depth Edge Alignment Loss to improve Weakly Supervised Semantic Segmentation models across different datasets. The methodology generates pixel-level semantic labels from image-level supervision, avoiding expensive annotation processes. While weak supervision is widely explored in traditional computer vision, our approach adds supervision with pixel-level depth information, a modality commonly available in robotic systems. We demonstrate how our approach improves segmentation performance across datasets and models, but can also be combined with other losses for even better performance, with improvements up to +5.439, +1.274 and +16.416 points in mean Intersection over Union on the PASCAL VOC / MS COCO validation, and the HOPE static onboarding split, respectively. Our code will be made publicly available.
zh
[CV-45] FROQ: Observing Face Recognition Models for Efficient Quality Assessment
【速读】:该论文旨在解决面部识别(Face Recognition, FR)系统中因输入图像质量低而导致识别错误的问题,尤其是现有面部图像质量评估(Face Image Quality Assessment, FIQA)方法在实际应用中存在训练成本高或性能不足的局限性。其解决方案的关键在于提出一种半监督、无需训练的FIQA方法FROQ(Face Recognition Observer of Quality),该方法利用预训练FR模型中的特定中间特征表示来估计图像质量,并通过一个基于伪标签的简单校准步骤,自动挖掘适用于质量评估的特征表示。FROQ结合了监督学习方法的高性能与无监督方法的免训练优势,在多个主流FR模型和基准数据集上均实现了优异的性能与高效运行时间。
链接: https://arxiv.org/abs/2509.17689
作者: Žiga Babnik,Deepak Kumar Jain,Peter Peer,Vitomir Štruc
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the International Joint Conference on Biometrics (IJCB 2025)
Abstract:Face Recognition (FR) plays a crucial role in many critical (high-stakes) applications, where errors in the recognition process can lead to serious consequences. Face Image Quality Assessment (FIQA) techniques enhance FR systems by providing quality estimates of face samples, enabling the systems to discard samples that are unsuitable for reliable recognition or lead to low-confidence recognition decisions. Most state-of-the-art FIQA techniques rely on extensive supervised training to achieve accurate quality estimation. In contrast, unsupervised techniques eliminate the need for additional training but tend to be slower and typically exhibit lower performance. In this paper, we introduce FROQ (Face Recognition Observer of Quality), a semi-supervised, training-free approach that leverages specific intermediate representations within a given FR model to estimate face-image quality, and combines the efficiency of supervised FIQA models with the training-free approach of unsupervised methods. A simple calibration step based on pseudo-quality labels allows FROQ to uncover specific representations, useful for quality assessment, in any modern FR model. To generate these pseudo-labels, we propose a novel unsupervised FIQA technique based on sample perturbations. Comprehensive experiments with four state-of-the-art FR models and eight benchmark datasets show that FROQ leads to highly competitive results compared to the state-of-the-art, achieving both strong performance and efficient runtime, without requiring explicit training.
zh
[CV-46] Predicting Depth Maps from Single RGB Images and Addressing Missing Information in Depth Estimation
【速读】:该论文旨在解决自动驾驶系统(ADS)中深度图像(Depth image)因像素数据缺失或不一致导致的信息不完整问题。其解决方案的关键在于提出一种基于多层训练策略的算法,能够从单张RGB图像生成完整的深度图,并进一步用于修复现有深度图像中的空缺区域,从而获得数据完整且准确的深度信息。该方法在Cityscapes数据集上验证有效,表明其在真实城市环境中的适用性。
链接: https://arxiv.org/abs/2509.17686
作者: Mohamad Mofeed Chaar,Jamal Raiyn,Galia Weidl
机构: University of Applied Sciences, Aschaffenburg, Germany(应用科学大学,阿沙芬堡,德国)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 10 figures, VEHITS conference 2025
Abstract:Depth imaging is a crucial area in Autonomous Driving Systems (ADS), as it plays a key role in detecting and measuring objects in the vehicle’s surroundings. However, a significant challenge in this domain arises from missing information in Depth images, where certain points are not measurable due to gaps or inconsistencies in pixel data. Our research addresses two key tasks to overcome this challenge. First, we developed an algorithm using a multi-layered training approach to generate Depth images from a single RGB image. Second, we addressed the issue of missing information in Depth images by applying our algorithm to rectify these gaps, resulting in Depth images with complete and accurate data. We further tested our algorithm on the Cityscapes dataset and successfully resolved the missing information in its Depth images, demonstrating the effectiveness of our approach in real-world urban environments.
zh
[CV-47] DINOv3-Diffusion Policy: Self-Supervised Large Visual Model for Visuomotor Diffusion Policy Learning
【速读】:该论文旨在解决机器人操作中视觉-动作扩散策略(visuomotor diffusion policy)的感知前端建模问题,具体探究纯自监督预训练的视觉骨干网络(如DINOv3)是否能在机器人任务中替代传统监督式ImageNet预训练模型(如ResNet-18),从而降低对标注数据的依赖并提升泛化能力。解决方案的关键在于使用DINOv3作为统一的视觉编码器,在四种基准任务(Push-T、Lift、Can、Square)上通过FiLM条件控制的扩散策略进行评估,并在三种训练范式(从头训练、冻结参数、微调)下验证其性能;结果表明,微调后的DINOv3在多个任务上优于或等效于ResNet-18,且冻结模式下仍具竞争力,说明其具备强迁移性与样本效率优势,证明了自监督大视觉模型作为标签无关的感知前段在机器人操作中的有效性。
链接: https://arxiv.org/abs/2509.17684
作者: ThankGod Egbe,Peng Wang,Zhihao Guo,Zidong Chen
机构: Manchester Metropolitan University (曼彻斯特都会大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:This paper evaluates DINOv3, a recent large-scale self-supervised vision backbone, for visuomotor diffusion policy learning in robotic manipulation. We investigate whether a purely self-supervised encoder can match or surpass conventional supervised ImageNet-pretrained backbones (e.g., ResNet-18) under three regimes: training from scratch, frozen, and finetuned. Across four benchmark tasks (Push-T, Lift, Can, Square) using a unified FiLM-conditioned diffusion policy, we find that (i) finetuned DINOv3 matches or exceeds ResNet-18 on several tasks, (ii) frozen DINOv3 remains competitive, indicating strong transferable priors, and (iii) self-supervised features improve sample efficiency and robustness. These results support self-supervised large visual models as effective, generalizable perceptual front-ends for action diffusion policies, motivating further exploration of scalable label-free pretraining in robotic manipulation. Compared to using ResNet18 as a backbone, our approach with DINOv3 achieves up to a 10% absolute increase in test-time success rates on challenging tasks such as Can, and on-the-par performance in tasks like Lift, PushT, and Square.
zh
[CV-48] ailored Transformation Invariance for Industrial Anomaly Detection
【速读】:该论文旨在解决工业异常检测(Industrial Anomaly Detection, IAD)中现有方法在特征提取效率与模型复杂度之间的权衡问题,特别是针对当前基于生成式 AI 的先进方法训练成本高、难以部署于实际场景的挑战。其解决方案的关键在于提出 LWinNN(Local Window-based Nearest Neighbor),一种基于局部窗口的近邻方法,通过引入有限范围内的平移不变性(translation invariance)来平衡传统 kNN 方法(完全或无平移不变性)与复杂模型之间的性能差距。实验表明,该设计显著提升了检测精度,同时降低了训练和推理时间,验证了在有限数据下利用结构化局部信息可有效缩小与复杂模型的性能差距,并提示未来应构建更具空间多样性的基准测试集以推动该领域发展。
链接: https://arxiv.org/abs/2509.17670
作者: Mariette Schönfeld,Wannes Meert,Hendrik Blockeel
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Industrial Anomaly Detection (IAD) is a subproblem within Computer Vision Anomaly Detection that has been receiving increasing amounts of attention due to its applicability to real-life scenarios. Recent research has focused on how to extract the most informative features, contrasting older kNN-based methods that use only pretrained features. These recent methods are much more expensive to train however and could complicate real-life application. Careful study of related work with regards to transformation invariance leads to the idea that popular benchmarks require robustness to only minor translations. With this idea we then formulate LWinNN, a local window based approach that creates a middle ground between kNN based methods that have either complete or no translation invariance. Our experiments demonstrate that this small change increases accuracy considerably, while simultaneously decreasing both train and test time. This teaches us two things: first, the gap between kNN-based approaches and more complex state-of-the-art methodology can still be narrowed by effective usage of the limited data available. Second, our assumption of requiring only limited translation invariance highlights potential areas of interest for future work and the need for more spatially diverse benchmarks, for which our method can hopefully serve as a new baseline. Our code can be found at this https URL .
zh
[CV-49] SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models NEURIPS2025
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在3D空间关系定量推理能力上的不足,这一局限主要源于2D图像在空间表征能力上的缺陷。解决方案的关键在于提出一个名为SD-VLM的新框架,其核心贡献包括:(1)构建大规模空间测量与理解(Massive Spatial Measuring and Understanding, MSMU)数据集,包含70万组问答对、250万个物理数值标注和1万条链式思维增强样本,以提供精细的空间标注支持;(2)引入一种简化的深度位置编码方法,有效增强VLMs的空间感知能力。实验表明,SD-VLM在自建的MSMU-Bench上显著优于GPT-4o和Intern-VL3-78B,分别提升26.91%和25.56%,并展现出跨基准的空间泛化性能。
链接: https://arxiv.org/abs/2509.17664
作者: Pingyi Chen,Yujing Lou,Shen Cao,Jinhui Guo,Lubin Fan,Yue Wu,Lin Yang,Lizhuang Ma,Jieping Ye
机构: Zhejiang University (浙江大学); Westlake University (西湖大学); Alibaba Cloud Computing (阿里云计算); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by NeurIPS 2025
Abstract:While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images’ spatial representation ability. In this paper, we analyze the problem hindering VLMs’ spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) propose Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduce a simple depth positional encoding method strengthening VLMs’ spatial awareness. MSMU dataset covers massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at this https URL.
zh
[CV-50] Development and validation of an AI foundation model for endoscopic diagnosis of esophagogastric junction adenocarcinoma: a cohort and deep learning study
【速读】:该论文旨在解决食管胃结合部腺癌(esophagogastric junction adenocarcinoma, EGJA)早期诊断依赖操作者经验、准确性不足的问题。其解决方案的关键在于首次将基础模型(foundation model)引入EGJA的筛查与分期诊断,采用DINOv2(视觉基础模型)与ResNet50(卷积神经网络)协同提取内镜图像中的全局外观特征和局部细节信息,从而实现高精度的自动化分期诊断。该方法在多个测试集上均展现出优于传统AI模型及人类专家的表现,并显著提升不同水平内镜医师的诊断准确率,证明了基础模型在提高EGJA诊疗效率与一致性方面的巨大潜力。
链接: https://arxiv.org/abs/2509.17660
作者: Yikun Ma,Bo Li,Ying Chen,Zijie Yue,Shuchang Xu,Jingyao Li,Lei Ma,Liang Zhong,Duowu Zou,Leiming Xu,Yunshi Zhong,Xiaobo Li,Weiqun Ding,Minmin Zhang,Dongli He,Zhenghong Li,Ye Chen,Ye Zhao,Jialong Zhuo,Xiaofen Wu,Lisha Yi,Miaojing Shi,Huihui Sun
机构: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, China; Department of Gastroenterology, Tongji Institute of Digestive Disease, Tongji Hospital Affiliated to Tongji University, Medicine of Tongji University, China; College of Electronic and Information Engineering, Tongji University, China; Department of Gastroenterology and Endoscopy, Huashan Hospital Affiliated to Fudan University, China; Department of Gastroenterology, Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, China; Department of Gastroenterology, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, China; Endoscopy Center and Endoscopy Research Institute, Zhongshan Hospital Affiliated to Fudan University, China; Division of Gastroenterology and Hepatology, Key Laboratory Gastroenterology and Hepatology, Renji Hospital affiliated to Shanghai Jiao Tong University School of Medicine, China; Department of Information Center, Tongji Hospital Affiliated to Tongji University, China; Department of Gastroenterology, Zhangzhou Affiliated Hospital of Fujian Medical University, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The early detection of esophagogastric junction adenocarcinoma (EGJA) is crucial for improving patient prognosis, yet its current diagnosis is highly operator-dependent. This paper aims to make the first attempt to develop an artificial intelligence (AI) foundation model-based method for both screening and staging diagnosis of EGJA using endoscopic images. In this cohort and learning study, we conducted a multicentre study across seven Chinese hospitals between December 28, 2016 and December 30, 2024. It comprises 12,302 images from 1,546 patients; 8,249 of them were employed for model training, while the remaining were divided into the held-out (112 patients, 914 images), external (230 patients, 1,539 images), and prospective (198 patients, 1,600 images) test sets for evaluation. The proposed model employs DINOv2 (a vision foundation model) and ResNet50 (a convolutional neural network) to extract features of global appearance and local details of endoscopic images for EGJA staging diagnosis. Our model demonstrates satisfactory performance for EGJA staging diagnosis across three test sets, achieving an accuracy of 0.9256, 0.8895, and 0.8956, respectively. In contrast, among representative AI models, the best one (ResNet50) achieves an accuracy of 0.9125, 0.8382, and 0.8519 on the three test sets, respectively; the expert endoscopists achieve an accuracy of 0.8147 on the held-out test set. Moreover, with the assistance of our model, the overall accuracy for the trainee, competent, and expert endoscopists improves from 0.7035, 0.7350, and 0.8147 to 0.8497, 0.8521, and 0.8696, respectively. To our knowledge, our model is the first application of foundation models for EGJA staging diagnosis and demonstrates great potential in both diagnostic accuracy and efficiency.
zh
[CV-51] Clothing agnostic Pre-inpainting Virtual Try-ON
【速读】:该论文旨在解决基于扩散模型的虚拟试衣(Virtual Try-On, VTON)技术中存在的纹理失真、底部检测不准确以及服装轮廓保留不足等问题,尤其在长袖转短袖或无袖场景下皮肤区域恢复质量较差的问题。解决方案的关键在于提出CaP-VTON(Clothing agnostic Pre-inpainting Virtual Try-On)框架,其核心创新包括:1)基于Dress Code的多类别掩码策略以增强服装结构一致性;2)基于Stable Diffusion的皮肤修复模块,结合人体姿态和颜色信息实现高质量皮肤重建;3)通过预修复皮肤区域提升整体合成自然度与一致性。实验表明,该方法在短袖合成准确率上较Leffa提升15.4%(达92.5%),且在视觉评估中稳定还原参考服装的风格与形状,具有模型无关性,适用于多种扩散模型驱动的虚拟试衣系统。
链接: https://arxiv.org/abs/2509.17654
作者: Sehyun Kim,Hye Jun Lee,Jiwoo Lee,Taemin Lee
机构: Kangwon National University (江原国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the development of deep learning technology, virtual try-on technology has become an important application value in the fields of e-commerce, fashion, and entertainment. The recently proposed Leffa has improved the texture distortion problem of diffu-sion-based models, but there are limitations in that the bottom detection inaccuracy and the existing clothing silhouette remain in the synthesis results. To solve this problem, this study proposes CaP-VTON (Clothing agnostic Pre-inpainting Virtual Try-ON). CaP-VTON has improved the naturalness and consistency of whole-body clothing syn-thesis by integrating multi-category masking based on Dress Code and skin inpainting based on Stable Diffusion. In particular, a generate skin module was introduced to solve the skin restoration problem that occurs when long-sleeved images are converted into short-sleeved or sleeveless ones, and high-quality restoration was implemented consider-ing the human body posture and color. As a result, CaP-VTON recorded 92.5%, which is 15.4% better than Leffa in short-sleeved synthesis accuracy, and showed the performance of consistently reproducing the style and shape of reference clothing in visual evaluation. These structures maintain model-agnostic properties and are applicable to various diffu-sion-based virtual inspection systems, and can contribute to applications that require high-precision virtual wearing, such as e-commerce, custom styling, and avatar creation.
zh
[CV-52] SISMA: Semantic Face Image Synthesis with Mamba
【速读】:该论文旨在解决生成式 AI(Generative AI)中语义图像合成(Semantic Image Synthesis, SIS)任务在计算效率上的瓶颈问题,尤其是基于 Transformer 的扩散模型因注意力层的二次复杂度导致训练与推理过程计算开销巨大的问题。其解决方案的关键在于提出一种新型架构 SISMA,该架构基于最近提出的 Mamba 模型,通过引入语义掩码(semantic mask)控制生成图像的形状,在显著降低计算需求的同时保持高质量样本生成能力。实验表明,SISMA 在 CelebAMask-HQ 数据集上不仅实现了更优的 FID 分数,且推理速度达到当前最优架构的三倍,验证了其作为轻量级替代方案的有效性。
链接: https://arxiv.org/abs/2509.17651
作者: Filippo Botti,Alex Ergasti,Tomaso Fontanini,Claudio Ferrari,Massimo Bertozzi,Andrea Prati
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion Models have become very popular for Semantic Image Synthesis (SIS) of human faces. Nevertheless, their training and inference is computationally expensive and their computational requirements are high due to the quadratic complexity of attention layers. In this paper, we propose a novel architecture called SISMA, based on the recently proposed Mamba. SISMA generates high quality samples by controlling their shape using a semantic mask at a reduced computational demand. We validated our approach through comprehensive experiments with CelebAMask-HQ, revealing that our architecture not only achieves a better FID score yet also operates at three times the speed of state-of-the-art architectures. This indicates that the proposed design is a viable, lightweight substitute to transformer-based models.
zh
[CV-53] Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
【速读】:该论文旨在解决流式视觉变换器(Streaming Visual Transformers)在长时间序列推理过程中因键值(Key Value, KV)记忆无界增长而导致的内存瓶颈问题,从而限制了模型的可扩展性。其核心解决方案是一种无需训练、仅在推理阶段执行的令牌驱逐(token eviction)策略,通过识别并丢弃冗余信息的令牌,同时保留最具信息量的令牌,实现对KV内存的有效控制。实验表明,该方法可在几乎不损失精度的前提下显著降低峰值内存占用(如7-Scenes场景中从18.63 GB降至9.39 GB),并在严格内存约束下支持更密集的帧采样,从而提升重建精度,使长时序流式推理更具实用性。
链接: https://arxiv.org/abs/2509.17650
作者: Soroush Mahdi,Fardin Ayar,Ehsan Javanmardi,Manabu Tsukada,Mahdi Javanmardi
机构: Amirkabir University of Technology (AUT)(阿米尔卡比尔理工大学); The University of Tokyo (东京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
zh
[CV-54] VideoArtGS: Building Digital Twins of Articulated Objects from Monocular Video
【速读】:该论文旨在解决从单目视频中重建关节物体(articulated objects)数字孪生(digital twins)的挑战,即如何在仅有单一视角输入的情况下,同时准确恢复物体几何结构、部件分割以及关节参数。核心难点在于仅靠视觉监督难以区分物体几何与部件运动,因相机与部件的联合运动导致估计病态(ill-posed estimation)。解决方案的关键在于提出 VideoArtGS 方法,其创新性地引入了一个运动先验引导流程(motion prior guidance pipeline),通过分析3D轨迹、去噪并提供可靠的关节参数初始值;同时设计了一种混合中心-网格部件分配模块(hybrid center-grid part assignment module),用于建模基于关节的变形场,从而精确捕捉部件运动。该方法显著提升了重建精度,相较现有方法将重建误差降低约两个数量级。
链接: https://arxiv.org/abs/2509.17647
作者: Yu Liu,Baoxiong Jia,Ruijie Lu,Chuyue Gan,Huayu Chen,Junfeng Ni,Song-Chun Zhu,Siyuan Huang
机构: Tsinghua University (清华大学); State Key Laboratory of General Artificial Intelligence, BIGAI (通用人工智能国家重点实验室, BIGAI); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Building digital twins of articulated objects from monocular video presents an essential challenge in computer vision, which requires simultaneous reconstruction of object geometry, part segmentation, and articulation parameters from limited viewpoint inputs. Monocular video offers an attractive input format due to its simplicity and scalability; however, it’s challenging to disentangle the object geometry and part dynamics with visual supervision alone, as the joint movement of the camera and parts leads to ill-posed estimation. While motion priors from pre-trained tracking models can alleviate the issue, how to effectively integrate them for articulation learning remains largely unexplored. To address this problem, we introduce VideoArtGS, a novel approach that reconstructs high-fidelity digital twins of articulated objects from monocular video. We propose a motion prior guidance pipeline that analyzes 3D tracks, filters noise, and provides reliable initialization of articulation parameters. We also design a hybrid center-grid part assignment module for articulation-based deformation fields that captures accurate part motion. VideoArtGS demonstrates state-of-the-art performance in articulation and mesh reconstruction, reducing the reconstruction error by about two orders of magnitude compared to existing methods. VideoArtGS enables practical digital twin creation from monocular video, establishing a new benchmark for video-based articulated object reconstruction. Our work is made publicly available at: this https URL.
zh
[CV-55] A2M2-Net: Adaptively Aligned Multi-Scale Moment for Few-Shot Action Recognition
【速读】:该论文旨在解决少样本动作识别(Few-shot Action Recognition, FSAR)中因忽略个体运动模式差异和特征统计信息利用不足而导致的视频动态时间错位(temporal misalignment)问题,尤其是在使用2D主干网络时表现不佳。解决方案的关键在于提出一种自适应对齐多尺度二阶矩网络(A²M²-Net),其核心由两个模块构成:一是自适应对齐模块(A² module),用于在实例引导下选择具有信息量的候选描述符并实现动态匹配;二是多尺度二阶矩模块(M² block),通过在多个时空尺度上构建语义二阶描述子来增强表示能力。该方法通过建立自适应对齐机制,有效缓解了时间错位问题,并在五个主流FSAR基准上展现出卓越的性能与泛化能力。
链接: https://arxiv.org/abs/2509.17638
作者: Zilin Gao,Qilong Wang,Bingbing Zhang,Qinghua Hu,Peihua Li
机构: Dalian University of Technology (大连理工大学); Tianjin University (天津大学); Dalian Nationalities University (大连民族大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 27 pages, 13 figures, 7 tables
Abstract:Thanks to capability to alleviate the cost of large-scale annotation, few-shot action recognition (FSAR) has attracted increased attention of researchers in recent years. Existing FSAR approaches typically neglect the role of individual motion pattern in comparison, and under-explore the feature statistics for video dynamics. Thereby, they struggle to handle the challenging temporal misalignment in video dynamics, particularly by using 2D backbones. To overcome these limitations, this work proposes an adaptively aligned multi-scale second-order moment network, namely A ^2 M ^2 -Net, to describe the latent video dynamics with a collection of powerful representation candidates and adaptively align them in an instance-guided manner. To this end, our A ^2 M ^2 -Net involves two core components, namely, adaptive alignment (A ^2 module) for matching, and multi-scale second-order moment (M ^2 block) for strong representation. Specifically, M ^2 block develops a collection of semantic second-order descriptors at multiple spatio-temporal scales. Furthermore, A ^2 module aims to adaptively select informative candidate descriptors while considering the individual motion pattern. By such means, our A ^2 M ^2 -Net is able to handle the challenging temporal misalignment problem by establishing an adaptive alignment protocol for strong representation. Notably, our proposed method generalizes well to various few-shot settings and diverse metrics. The experiments are conducted on five widely used FSAR benchmarks, and the results show our A ^2 M ^2 -Net achieves very competitive performance compared to state-of-the-arts, demonstrating its effectiveness and generalization.
zh
[CV-56] Overview of PlantCLEF 2022: Image-based plant identification at global scale
【速读】:该论文旨在解决全球植物多样性自动识别的难题,即如何在面对80,000个植物物种类别、类别严重不平衡、图像质量参差不齐、标注错误与重复数据等挑战下,实现高效且准确的多图像(含元数据)分类。其解决方案的关键在于利用成熟的深度学习技术,结合大规模多模态数据集和系统化的评估框架,推动自动植物识别从实验室研究向实际应用迈进,从而加速植物物种知识的积累与保护工作。
链接: https://arxiv.org/abs/2509.17632
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: CIRAD, UMR AMAP, Montpellier, Occitanie, France; Inria, LIRMM, Univ Montpellier, CNRS, Montpellier, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 2 figures, CLEF 2022 Conference and Labs of the Evaluation Forum, September 05 to 08, 2022, Bologna, Italy
Abstract:It is estimated that there are more than 300,000 species of vascular plants in the world. Increasing our knowledge of these species is of paramount importance for the development of human civilization (agriculture, construction, pharmacopoeia, etc.), especially in the context of the biodiversity crisis. However, the burden of systematic plant identification by human experts strongly penalizes the aggregation of new data and knowledge. Since then, automatic identification has made considerable progress in recent years as highlighted during all previous editions of PlantCLEF. Deep learning techniques now seem mature enough to address the ultimate but realistic problem of global identification of plant biodiversity in spite of many problems that the data may present (a huge number of classes, very strongly unbalanced classes, partially erroneous identifications, duplications, variable visual quality, diversity of visual contents such as photos or herbarium sheets, etc). The PlantCLEF2022 challenge edition proposes to take a step in this direction by tackling a multi-image (and metadata) classification problem with a very large number of classes (80k plant species). This paper presents the resources and evaluations of the challenge, summarizes the approaches and systems employed by the participating research groups, and provides an analysis of key findings.
zh
[CV-57] OmniInsert: Mask-Free Video Insertion of Any Reference via Diffusion Transformer Models
【速读】:该论文旨在解决无掩码视频插入(Mask-free Video Insertion)中的三大核心挑战:数据稀缺性、主体与场景的平衡性以及插入内容的和谐性。针对这些问题,其关键解决方案在于提出了一种名为OmniInsert的统一框架,该框架基于自动生成多样化跨对数据的新数据管道InsertPipe,实现了从单个或多个主体参考中进行无掩码视频插入。为保障主体-场景平衡,引入了条件特异性特征注入机制(Condition-Specific Feature Injection)和渐进式训练策略,以有效融合多源条件并优化特征注入权重;同时设计了主体聚焦损失(Subject-Focused Loss)提升主体细节表现。为进一步增强插入和谐性,提出了插入偏好优化方法(Insertive Preference Optimization)模拟人类偏好,并集成上下文感知重述模块(Context-Aware Rephraser)实现主体与原场景的无缝融合。
链接: https://arxiv.org/abs/2509.17627
作者: Jinshu Chen,Xinghui Li,Xu Bai,Tianxiang Ma,Pengze Zhang,Zhuowei Chen,Gen Li,Lijie Liu,Songtao Zhao,Bingchuan Li,Qian He
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Github Page: this https URL
Abstract:Recent advances in video insertion based on diffusion models are impressive. However, existing methods rely on complex control signals but struggle with subject consistency, limiting their practical applicability. In this paper, we focus on the task of Mask-free Video Insertion and aim to resolve three key challenges: data scarcity, subject-scene equilibrium, and insertion harmonization. To address the data scarcity, we propose a new data pipeline InsertPipe, constructing diverse cross-pair data automatically. Building upon our data pipeline, we develop OmniInsert, a novel unified framework for mask-free video insertion from both single and multiple subject references. Specifically, to maintain subject-scene equilibrium, we introduce a simple yet effective Condition-Specific Feature Injection mechanism to distinctly inject multi-source conditions and propose a novel Progressive Training strategy that enables the model to balance feature injection from subjects and source video. Meanwhile, we design the Subject-Focused Loss to improve the detailed appearance of the subjects. To further enhance insertion harmonization, we propose an Insertive Preference Optimization methodology to optimize the model by simulating human preferences, and incorporate a Context-Aware Rephraser module during reference to seamlessly integrate the subject into the original scenes. To address the lack of a benchmark for the field, we introduce InsertBench, a comprehensive benchmark comprising diverse scenes with meticulously selected subjects. Evaluation on InsertBench indicates OmniInsert outperforms state-of-the-art closed-source commercial solutions. The code will be released.
zh
[CV-58] Overview of PlantCLEF 2023: Image-based Plant Identification at Global Scale
【速读】:该论文旨在解决全球范围内植物物种识别效率低下的问题,尤其是在面对超过30万种维管植物及当前生物多样性危机的背景下,传统依赖人工专家的识别方式严重制约了数据与知识的积累。解决方案的关键在于利用深度学习技术构建自动识别系统,以应对多类别的分类挑战(如80,000种植物物种)、类别不平衡、图像质量差异以及标注错误等复杂数据问题,从而推动实现高精度、大规模的全球植物物种自动识别。
链接: https://arxiv.org/abs/2509.17622
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: CIRAD(国际热带农业研究中心); Inria(法国国家信息与自动化研究院); LIRMM(蒙彼利埃信息与机器人实验室); Univ Montpellier(蒙彼利埃大学); CNRS(法国国家科学研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 1 figure, CLEF 2023 Conference and Labs of the Evaluation Forum, September 18 to 21, 2023, Thessaloniki, Greece
Abstract:The world is estimated to be home to over 300,000 species of vascular plants. In the face of the ongoing biodiversity crisis, expanding our understanding of these species is crucial for the advancement of human civilization, encompassing areas such as agriculture, construction, and pharmacopoeia. However, the labor-intensive process of plant identification undertaken by human experts poses a significant obstacle to the accumulation of new data and knowledge. Fortunately, recent advancements in automatic identification, particularly through the application of deep learning techniques, have shown promising progress. Despite challenges posed by data-related issues such as a vast number of classes, imbalanced class distribution, erroneous identifications, duplications, variable visual quality, and diverse visual contents (such as photos or herbarium sheets), deep learning approaches have reached a level of maturity which gives us hope that in the near future we will have an identification system capable of accurately identifying all plant species worldwide. The PlantCLEF2023 challenge aims to contribute to this pursuit by addressing a multi-image (and metadata) classification problem involving an extensive set of classes (80,000 plant species). This paper provides an overview of the challenge’s resources and evaluations, summarizes the methods and systems employed by participating research groups, and presents an analysis of key findings.
zh
[CV-59] nsor-Based Self-Calibration of Cameras via the TrifocalCalib Method
【速读】:该论文旨在解决无先验场景知识条件下相机内参(包括焦距和主点)的自标定问题,这是计算机视觉中的一个基础挑战,尤其在自动驾驶和车辆编队等需实时适应性的应用场景中至关重要。解决方案的关键在于提出了一套基于校准三焦点张量(calibrated trifocal tensor)的方程,实现了仅用最少图像数据即可完成投影自标定;其优势在于无需标定板、不限制相机运动约束,且能同时估计焦距与主点,相较现有学习方法和传统方法显著提升了精度与鲁棒性。
链接: https://arxiv.org/abs/2509.17620
作者: Gregory Schroeder,Mohamed Sabry,Cristina Olaverri-Monreal
机构: Johannes Kepler University Linz (约翰开普勒林茨大学); IAV GmbH (IAV公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Estimating camera intrinsic parameters without prior scene knowledge is a fundamental challenge in computer vision. This capability is particularly important for applications such as autonomous driving and vehicle platooning, where precalibrated setups are impractical and real-time adaptability is necessary. To advance the state-of-the-art, we present a set of equations based on the calibrated trifocal tensor, enabling projective camera self-calibration from minimal image data. Our method, termed TrifocalCalib, significantly improves accuracy and robustness compared to both recent learning-based and classical approaches. Unlike many existing techniques, our approach requires no calibration target, imposes no constraints on camera motion, and simultaneously estimates both focal length and principal point. Evaluations in both procedurally generated synthetic environments and structured dataset-based scenarios demonstrate the effectiveness of our approach. To support reproducibility, we make the code publicly available.
zh
[CV-60] From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge
【速读】:该论文旨在解决视觉异常检测(Visual Anomaly Detection)在实际应用场景中面临的挑战,特别是如何提升方法对真实世界分布偏移的鲁棒性以及探索视觉语言模型(Vision Language Models, VLMs)在少样本(few-shot)条件下的异常检测能力。解决方案的关键在于通过组织VAND 3.0挑战赛的两个赛道:第一赛道聚焦于开发对现实场景分布变化具有鲁棒性的检测方法,第二赛道则利用大规模预训练视觉(语言)主干网络(backbones)增强模型泛化能力;参赛方案通过融合现有方法与创新流水线,在性能上显著超越基线,表明大型预训练模型在提升检测精度中的核心作用,但同时也指出未来需进一步优化计算效率以满足现场实时性要求。
链接: https://arxiv.org/abs/2509.17615
作者: Lars Heckler-Kram,Ashwin Vaidya,Jan-Hendrik Neudeck,Ulla Scheler,Dick Ameln,Samet Akcay,Paula Ramos
机构: MVTec Software GmbH(德国MVTEC软件公司); Technical University of Munich(慕尼黑工业大学); Intel(英特尔); Voxel51(美国Voxel51公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Visual anomaly detection is a strongly application-driven field of research. Consequently, the connection between academia and industry is of paramount importance. In this regard, we present the VAND 3.0 Challenge to showcase current progress in anomaly detection across different practical settings whilst addressing critical issues in the field. The challenge hosted two tracks, fostering the development of anomaly detection methods robust against real-world distribution shifts (Category 1) and exploring the capabilities of Vision Language Models within the few-shot regime (Category 2), respectively. The participants’ solutions reached significant improvements over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. While for both tracks the progress in large pre-trained vision (language) backbones played a pivotal role for the performance increase, scaling up anomaly detection methods more efficiently needs to be addressed by future research to meet real-time and computational constraints on-site.
zh
[CV-61] Overview of PlantCLEF 2025: Multi-Species Plant Identification in Vegetation Quadrat Images
【速读】:该论文旨在解决生态学研究中植物物种识别效率低下的问题,特别是在大规模野外调查和长期监测场景下,传统人工识别 quadrat 图像(样方图像)耗时且难以扩展。其解决方案的关键在于构建一个大规模、高分辨率的多标签分类数据集(PlantCLEF 2025),包含2,105张由专家标注的样方图像及约400个物种标签,并提供140万张单标签训练图像和预训练的视觉Transformer模型。通过将任务定义为弱标签多标签分类问题(即使用单标签训练数据预测图像中所有存在的物种),该研究推动了生成式 AI (Generative AI) 和深度学习在生态遥感图像分析中的应用,从而提升物种识别的速度与覆盖范围。
链接: https://arxiv.org/abs/2509.17602
作者: Giulio Martellucci,Herve Goeau,Pierre Bonnet,Fabrice Vinatier,Alexis Joly
机构: LISAH, Univ Montpellier, INRAE, IRD, Montpellier, France; CIRAD, UMR AMAP, Montpellier, Occitanie, France; Inria, LIRMM, Univ Montpellier, CNRS, Montpellier, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 4 figures, CLEF 2025 Conference and Labs of the Evaluation Forum, September 09 to 12, 2024, Madrid, Spain
Abstract:Quadrat images are essential for ecological studies, as they enable standardized sampling, the assessment of plant biodiversity, long-term monitoring, and large-scale field campaigns. These images typically cover an area of fifty centimetres or one square meter, and botanists carefully identify all the species present. Integrating AI could help specialists accelerate their inventories and expand the spatial coverage of ecological studies. To assess progress in this area, the PlantCLEF 2025 challenge relies on a new test set of 2,105 high-resolution multi-label images annotated by experts and covering around 400 species. It also provides a large training set of 1.4 million individual plant images, along with vision transformer models pre-trained on this data. The task is formulated as a (weakly labelled) multi-label classification problem, where the goal is to predict all species present in a quadrat image using single-label training data. This paper provides a detailed description of the data, the evaluation methodology, the methods and models used by participants, and the results achieved.
zh
[CV-62] COLA: Context-aware Language-driven Test-time Adaptation
【速读】:该论文旨在解决测试时适应(Test-time Adaptation, TTA)中普遍存在的标签空间不共享问题,即传统方法依赖源域与目标域具有相同标签空间,限制了其在多场景下的适用性。为此,作者提出了一种基于预训练视觉语言模型(Vision-Language Model, VLM)的新型TTA方法——Context-aware Language-driven TTA(COLA)。其核心创新在于引入一个轻量级的上下文感知模块,包含任务感知适配器(task-aware adapter)、上下文感知单元(context-aware unit)和残差连接单元(residual connection unit),分别用于挖掘任务特定知识、从VLM中提取领域特定特征以及融合VLM先验知识,从而在无需共享标签的前提下实现对多个目标域的有效适应。该模块可无缝集成至冻结的VLM中,兼具参数效率与迁移性能,同时结合类平衡伪标签策略(Class-Balanced Pseudo-labeling, CBPL)缓解类别不平衡带来的负面影响。
链接: https://arxiv.org/abs/2509.17598
作者: Aiming Zhang,Tianyuan Yu,Liang Bai,Jun Tang,Yanming Guo,Yirun Ruan,Yun Zhou,Zhihe Lu
机构: National University of Defense Technology (国防科技大学); Hamad Bin Khalifa University (哈马德本哈利法大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing ``distribution shift’’ issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source domain model and target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), \egno, CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method – Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at this https URL. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.17598 [cs.CV] (or arXiv:2509.17598v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.17598 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: IEEE Trans. Image Process. (2025) Related DOI: https://doi.org/10.1109/TIP.2025.3607634 Focus to learn more DOI(s) linking to related resources Submission history From: Aiming Zhang [view email] [v1] Mon, 22 Sep 2025 11:19:17 UTC (2,209 KB) Full-text links: Access Paper: View a PDF of the paper titled COLA: Context-aware Language-driven Test-time Adaptation, by Aiming Zhang and Tianyuan Yu and 6 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CV prev | next new | recent | 2025-09 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh
[CV-63] Domain Adaptive Object Detection for Space Applications with Real-Time Constraints
【速读】:该论文旨在解决空间应用场景中目标检测模型因训练数据与真实数据之间存在领域差异(domain gap)而导致性能显著下降的问题,尤其是当前基于仿真数据训练的深度学习模型在真实世界数据上的泛化能力不足。其解决方案的关键在于引入监督式域适应(Supervised Domain Adaptation, SDA),利用少量标注的真实数据,通过结合域不变特征学习、基于CNN的域判别器以及使用域无关回归头的不变风险最小化策略,有效缩小源域(合成数据)与目标域(真实数据)之间的分布差异,从而提升模型在真实场景下的检测精度。实验表明,仅需250张标注的真实图像即可实现平均精度(AP)最高达20个百分点的提升。
链接: https://arxiv.org/abs/2509.17593
作者: Samet Hicsonmez,Abd El Rahman Shabayek,Arunkumar Rathinam,Djamila Aouada
机构: University of Luxembourg (卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Advanced Space Technologies in Robotics and Automation (ASTRA) 2025
Abstract:Object detection is essential in space applications targeting Space Domain Awareness and also applications involving relative navigation scenarios. Current deep learning models for Object Detection in space applications are often trained on synthetic data from simulators, however, the model performance drops significantly on real-world data due to the domain gap. However, domain adaptive object detection is an overlooked problem in the community. In this work, we first show the importance of domain adaptation and then explore Supervised Domain Adaptation (SDA) to reduce this gap using minimal labeled real data. We build on a recent semi-supervised adaptation method and tailor it for object detection. Our approach combines domain-invariant feature learning with a CNN-based domain discriminator and invariant risk minimization using a domain-independent regression head. To meet real-time deployment needs, we test our method on a lightweight Single Shot Multibox Detector (SSD) with MobileNet backbone and on the more advanced Fully Convolutional One-Stage object detector (FCOS) with ResNet-50 backbone. We evaluated on two space datasets, SPEED+ and SPARK. The results show up to 20-point improvements in average precision (AP) with just 250 labeled real images.
zh
[CV-64] Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)中图像到文本信息流机制难以解释的问题,特别是由于大量注意力头(attention heads)同时运作导致的信息传递路径不清晰。其解决方案的关键在于提出“头归因”(head attribution)技术,该方法受组件归因(component attribution)启发,用于识别在图像到文本信息转移过程中起关键作用的注意力头的一致性模式。通过该方法,研究发现特定注意力头子集主导了信息流,并且这些头的选择由输入图像的语义内容而非视觉外观决定,从而揭示了LVLM中图像到文本信息流具有结构化过程,为理解LVLM工作机制提供了新的分析维度。
链接: https://arxiv.org/abs/2509.17588
作者: Jinyeong Kim,Seil Kang,Jiwoo Park,Junhyeok Kim,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large Vision-Language Models (LVLMs) answer visual questions by transferring information from images to text through a series of attention heads. While this image-to-text information flow is central to visual question answering, its underlying mechanism remains difficult to interpret due to the simultaneous operation of numerous attention heads. To address this challenge, we propose head attribution, a technique inspired by component attribution methods, to identify consistent patterns among attention heads that play a key role in information transfer. Using head attribution, we investigate how LVLMs rely on specific attention heads to identify and answer questions about the main object in an image. Our analysis reveals that a distinct subset of attention heads facilitates the image-to-text information flow. Remarkably, we find that the selection of these heads is governed by the semantic content of the input image rather than its visual appearance. We further examine the flow of information at the token level and discover that (1) text information first propagates to role-related tokens and the final token before receiving image information, and (2) image information is embedded in both object-related and background tokens. Our work provides evidence that image-to-text information flow follows a structured process, and that analysis at the attention-head level offers a promising direction toward understanding the mechanisms of LVLMs.
zh
[CV-65] PRNU-Bench: A Novel Benchmark and Model for PRNU-Based Camera Identification
【速读】:该论文旨在解决基于光电响应非均匀性(Photo Response Non-Uniformity, PRNU)的相机识别问题,特别是在真实场景(in-the-wild)下如何提升识别准确率与泛化能力。其解决方案的关键在于提出了一种混合架构模型:首先利用去噪自编码器估计PRNU信号,再通过卷积神经网络实现1:N设备验证;同时摒弃传统的对比学习范式,创新性地将参考样本与查询样本的PRNU信号进行哈达玛积(Hadamard product)作为输入特征,从而显著优于现有基于去噪自编码器和对比学习的方法。
链接: https://arxiv.org/abs/2509.17581
作者: Florinel Alin Croitoru,Vlad Hondru,Radu Tudor Ionescu
机构: University of Bucharest (布加勒斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
备注:
Abstract:We propose a novel benchmark for camera identification via Photo Response Non-Uniformity (PRNU) estimation. The benchmark comprises 13K photos taken with 120+ cameras, where the training and test photos are taken in different scenarios, enabling ``in-the-wild’’ evaluation. In addition, we propose a novel PRNU-based camera identification model that employs a hybrid architecture, comprising a denoising autoencoder to estimate the PRNU signal and a convolutional network that can perform 1:N verification of camera devices. Instead of using a conventional approach based on contrastive learning, our method takes the Hadamard product between reference and query PRNU signals as input. This novel design leads to significantly better results compared with state-of-the-art models based on denoising autoencoders and contrastive learning. We release our dataset and code at: this https URL.
zh
[CV-66] MRN: Harnessing 2D Vision Foundation Models for Diagnosing Parkinsons Disease with Limited 3D MR Data MICCAI’2025
【速读】:该论文旨在解决帕金森病(Parkinson’s disease, PD)自动诊断中因高质量标注数据稀缺导致的模型过拟合问题,以及3D医学影像预训练模型在不同体素间距和模态下迁移困难的问题。其关键解决方案是利用2D视觉基础模型(Vision Foundation Models, VFMs)对非运动性磁共振成像(NM-MRI)和定量磁化率成像(QSM)图像中的多个关键感兴趣区域(ROI)进行独立编码,并通过辅助分割头引导特征提取聚焦于特定脑核团;随后将各ROI编码为token并融合形成统一患者表征,同时引入多ROI监督对比学习策略以增强类别内一致性与类间区分度。该方法仅用300例标注数据即在MICCAI 2025 PDCADxFoundation挑战赛中达到86.0%准确率,显著优于第二名。
链接: https://arxiv.org/abs/2509.17566
作者: Ding Shaodong,Liu Ziyang,Zhou Yijun,Liu Tao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: First-place solution of the classification track for MICCAI’2025 PDCADxFoundation Challenge
Abstract:The automatic diagnosis of Parkinson’s disease is in high clinical demand due to its prevalence and the importance of targeted treatment. Current clinical practice often relies on diagnostic biomarkers in QSM and NM-MRI images. However, the lack of large, high-quality datasets makes training diagnostic models from scratch prone to overfitting. Adapting pre-trained 3D medical models is also challenging, as the diversity of medical imaging leads to mismatches in voxel spacing and modality between pre-training and fine-tuning data. In this paper, we address these challenges by leveraging 2D vision foundation models (VFMs). Specifically, we crop multiple key ROIs from NM and QSM images, process each ROI through separate branches to compress the ROI into a token, and then combine these tokens into a unified patient representation for classification. Within each branch, we use 2D VFMs to encode axial slices of the 3D ROI volume and fuse them into the ROI token, guided by an auxiliary segmentation head that steers the feature extraction toward specific brain nuclei. Additionally, we introduce multi-ROI supervised contrastive learning, which improves diagnostic performance by pulling together representations of patients from the same class while pushing away those from different classes. Our approach achieved first place in the MICCAI 2025 PDCADxFoundation challenge, with an accuracy of 86.0% trained on a dataset of only 300 labeled QSM and NM-MRI scans, outperforming the second-place method by 5.5%.These results highlight the potential of 2D VFMs for clinical analysis of 3D MR images.
zh
[CV-67] Visual Instruction Pretraining for Domain-Specific Foundation Models
【速读】:该论文旨在解决当前视觉基础模型预训练中一个关键问题:高层推理对低层感知特征学习的自上而下影响尚未被充分探索,导致模型在特定下游任务(如遥感和医学影像)中的泛化能力和鲁棒性受限。解决方案的关键在于提出Visual insTruction Pretraining (ViTP)范式,其核心是将Vision Transformer (ViT)嵌入到视觉-语言模型中,并利用目标下游领域内精心构建的视觉指令数据集进行端到端预训练;同时引入Visual Robustness Learning (VRL)机制,迫使ViT从稀疏视觉标记中学习具有领域相关性和鲁棒性的特征表示,从而显著提升模型在16个挑战性遥感与医学影像基准上的性能表现。
链接: https://arxiv.org/abs/2509.17562
作者: Yuxuan Li,Yicheng Zhang,Wenhao Tang,Yimian Dai,Ming-Ming Cheng,Xiang Li,Jian Yang
机构: PCA Lab, VCIP, Computer Science, NKU; NKIARI, Futian, Shenzhen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is not yet underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at this http URL.
zh
[CV-68] An Empirical Study on the Robustness of YOLO Models for Underwater Object Detection
【速读】:该论文旨在解决水下目标检测(Underwater Object Detection, UOD)中因水下畸变导致低层特征退化、进而影响检测模型鲁棒性的问题,尤其关注YOLO系列模型在复杂多变的水下环境下的性能表现。其关键解决方案在于:首先通过系统评估YOLOv8至YOLOv12共六种最新变体在六个模拟水下场景中的表现,揭示噪声对边缘和纹理等低层特征的破坏机制;其次提出轻量级训练策略——噪声感知样本注入(noise-aware sample injection)以提升模型在噪声及真实水下环境中的鲁棒性,以及基于先进增强技术的微调方法(fine-tuning with advanced enhancement),显著提升增强域内的检测精度并展现出良好的领域自适应潜力。
链接: https://arxiv.org/abs/2509.17561
作者: Edwine Nabahirwa,Wei Song,Minghua Zhang,Shufan Chen
机构: Shanghai Ocean University (上海海洋大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 28 Pages, 12 Figures
Abstract:Underwater object detection (UOD) remains a critical challenge in computer vision due to underwater distortions which degrade low-level features and compromise the reliability of even state-of-the-art detectors. While YOLO models have become the backbone of real-time object detection, little work has systematically examined their robustness under these uniquely challenging conditions. This raises a critical question: Are YOLO models genuinely robust when operating under the chaotic and unpredictable conditions of underwater environments? In this study, we present one of the first comprehensive evaluations of recent YOLO variants (YOLOv8-YOLOv12) across six simulated underwater environments. Using a unified dataset of 10,000 annotated images from DUO and Roboflow100, we not only benchmark model robustness but also analyze how distortions affect key low-level features such as texture, edges, and color. Our findings show that (1) YOLOv12 delivers the strongest overall performance but is highly vulnerable to noise, and (2) noise disrupts edge and texture features, explaining the poor detection performance in noisy images. Class imbalance is a persistent challenge in UOD. Experiments revealed that (3) image counts and instance frequency primarily drive detection performance, while object appearance exerts only a secondary influence. Finally, we evaluated lightweight training-aware strategies: noise-aware sample injection, which improves robustness in both noisy and real-world conditions, and fine-tuning with advanced enhancement, which boosts accuracy in enhanced domains but slightly lowers performance in original data, demonstrating strong potential for domain adaptation, respectively. Together, these insights provide practical guidance for building resilient and cost-efficient UOD systems.
zh
[CV-69] Is It Certainly a Deepfake? Reliability Analysis in Detection Generation Ecosystem ICCV2025
【速读】:该论文旨在解决深度伪造(deepfake)检测系统中预测不确定性带来的可信度问题,即现有检测器在面对生成式AI(Generative AI)产生的合成内容时,其输出置信度可能不可靠,甚至被滥用以制造新的虚假信息。解决方案的关键在于首次对深度伪造检测器进行全面的不确定性分析,通过贝叶斯神经网络(Bayesian Neural Networks)与蒙特卡洛丢弃(Monte Carlo dropout)方法量化数据噪声引起的异类不确定性(aleatoric uncertainty)和模型结构导致的认知不确定性(epistemic uncertainty),并揭示这些不确定性与生成器特定伪影之间的内在关联。研究进一步提出像素级不确定性图(uncertainty maps),实现对检测置信度的空间定位,并证明不确定性流形(uncertainty manifold)蕴含可用于深度伪造源识别的一致信息,从而为构建可靠、可解释且具备抗对抗攻击鲁棒性的合成媒体检测系统提供了理论基础与技术路径。
链接: https://arxiv.org/abs/2509.17550
作者: Neslihan Kose,Anthony Rhodes,Umur Aybars Ciftci,Ilke Demir
机构: Intel Labs(英特尔实验室); Binghamton University(宾汉顿大学); Cauth AI(卡斯AI)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted for publication at the ICCV 2025 STREAM workshop
Abstract:As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors’ responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
zh
[CV-70] SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
【速读】:该论文旨在解决参考音频-视觉分割(Referring Audio-Visual Segmentation, Ref-AVS)任务中的跨模态推理与细粒度目标定位难题,即如何基于包含音频、视觉和文本信息的自然语言表达准确分割视频中特定对象。其解决方案的关键在于提出一个名为SimToken的简洁框架,该框架将多模态大语言模型(Multimodal Large Language Model, MLLM)与Segment Anything Model (SAM) 结合:MLLM被引导生成一个代表所指对象的特殊语义标记(semantic token),该标记融合了多模态上下文信息,并作为提示(prompt)指导SAM在视频帧间进行对象分割;此外,为增强语义学习,作者引入一种新颖的目标一致语义对齐损失(target-consistent semantic alignment loss),使不同表述但指向同一对象的token嵌入在特征空间中对齐,从而提升分割精度。
链接: https://arxiv.org/abs/2509.17537
作者: Dian Jin,Yanghao Zhou,Jinxing Zhou,Jiaqi Ma,Ruohao Guo,Dan Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Referring Audio-Visual Segmentation (Ref-AVS) aims to segment specific objects in videos based on natural language expressions involving audio, vision, and text information. This task poses significant challenges in cross-modal reasoning and fine-grained object localization. In this paper, we propose a simple framework, SimToken, that integrates a multimodal large language model (MLLM) with the Segment Anything Model (SAM). The MLLM is guided to generate a special semantic token representing the referred object. This compact token, enriched with contextual information from all modalities, acts as a prompt to guide SAM to segment objectsacross video frames. To further improve semantic learning, we introduce a novel target-consistent semantic alignment loss that aligns token embeddings from different expressions but referring to the same object. Experiments on the Ref-AVS benchmark demonstrate that our approach achieves superior performance compared to existing this http URL will be available at this https URL
zh
[CV-71] Chat-CBM: Towards Interactive Concept Bottleneck Models with Frozen Large Language Models
【速读】:该论文旨在解决传统概念瓶颈模型(Concept Bottleneck Models, CBMs)在干预能力上的局限性问题,尤其是其依赖固定线性分类器导致的干预方式单一、难以引入新概念或领域知识、且在无监督场景下因概念激活噪声大而效果不佳等挑战。解决方案的关键在于提出Chat-CBM,通过将基于分数的分类器替换为基于语言的分类器,利用冻结的大语言模型(frozen large language models)对概念语义进行直接推理,从而实现更丰富和直观的用户干预,如概念修正、增删、外部知识注入及高层推理指导,同时保持CBMs原有的概念可解释性,并在无监督设置下依然有效。
链接: https://arxiv.org/abs/2509.17522
作者: Hangzhou He,Lei Zhu,Kaiwen Li,Xinliang Zhang,Jiakui Hu,Ourui Fu,Zhengjian Yao,Yanye Lu
机构: Peking University (北京大学); Peking University Health Science Center (北京大学医学部)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Concept Bottleneck Models (CBMs) provide inherent interpretability by first predicting a set of human-understandable concepts and then mapping them to labels through a simple classifier. While users can intervene in the concept space to improve predictions, traditional CBMs typically employ a fixed linear classifier over concept scores, which restricts interventions to manual value adjustments and prevents the incorporation of new concepts or domain knowledge at test time. These limitations are particularly severe in unsupervised CBMs, where concept activations are often noisy and densely activated, making user interventions ineffective. We introduce Chat-CBM, which replaces score-based classifiers with a language-based classifier that reasons directly over concept semantics. By grounding prediction in the semantic space of concepts, Chat-CBM preserves the interpretability of CBMs while enabling richer and more intuitive interventions, such as concept correction, addition or removal of concepts, incorporation of external knowledge, and high-level reasoning guidance. Leveraging the language understanding and few-shot capabilities of frozen large language models, Chat-CBM extends the intervention interface of CBMs beyond numerical editing and remains effective even in unsupervised settings. Experiments on nine datasets demonstrate that Chat-CBM achieves higher predictive performance and substantially improves user interactivity while maintaining the concept-based interpretability of CBMs.
zh
[CV-72] Unified Multimodal Coherent Field: Synchronous Semantic-Spatial-Vision Fusion for Brain Tumor Segmentation
【速读】:该论文旨在解决脑肿瘤分割中因肿瘤组织异质性、边界模糊及多模态磁共振成像(MRI)序列间对比度差异导致的边界 delineation 不稳定与层次结构(包括全肿瘤 WT、肿瘤核心 TC 和增强肿瘤 ET)难以保持的问题。解决方案的关键在于提出统一多模态协同场(UMCF)方法,其通过在统一的 3D 潜在空间中同步交互融合视觉、语义与空间信息,利用无参数不确定性门控自适应调整各模态贡献,并将医学先验知识直接嵌入注意力计算机制,从而避免传统“处理后拼接”的分离架构,实现更精准且稳定的多模态信息融合。
链接: https://arxiv.org/abs/2509.17520
作者: Mingda Zhang,Yuyang Zheng,Ruixiang Tang,Jingru Qiu,Haiyan Ding
机构: School of Software, Yunnan University, Kunming 650500, China
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 3 figures
Abstract:Brain tumor segmentation requires accurate identification of hierarchical regions including whole tumor (WT), tumor core (TC), and enhancing tumor (ET) from multi-sequence magnetic resonance imaging (MRI) images. Due to tumor tissue heterogeneity, ambiguous boundaries, and contrast variations across MRI sequences, methods relying solely on visual information or post-hoc loss constraints show unstable performance in boundary delineation and hierarchy preservation. To address this challenge, we propose the Unified Multimodal Coherent Field (UMCF) method. This method achieves synchronous interactive fusion of visual, semantic, and spatial information within a unified 3D latent space, adaptively adjusting modal contributions through parameter-free uncertainty gating, with medical prior knowledge directly participating in attention computation, avoiding the traditional “process-then-concatenate” separated architecture. On Brain Tumor Segmentation (BraTS) 2020 and 2021 datasets, UMCF+nnU-Net achieves average Dice coefficients of 0.8579 and 0.8977 respectively, with an average 4.18% improvement across mainstream architectures. By deeply integrating clinical knowledge with imaging features, UMCF provides a new technical pathway for multimodal information fusion in precision medicine.
zh
[CV-73] 4DGCPro: Efficient Hierarchical 4D Gaussian Compression for Progressive Volumetric Video Streaming NEURIPS2025
【速读】:该论文旨在解决高保真体视频(volumetric video)在多样网络和设备环境下实现无缝观看体验的挑战,特别是现有压缩方法难以在单一模型中灵活调整质量与码率以支持高效流媒体传输,且在轻量级移动平台上的实时解码与渲染性能不足的问题。其解决方案的关键在于提出一种新颖的分层4D高斯压缩框架(4DGCPro),通过感知加权且压缩友好的分层4D高斯表示结合运动感知自适应分组策略,有效降低时间冗余并保持一致性,从而支持单比特流下的渐进式体视频传输;同时引入端到端熵优化训练机制,包含逐层率失真(rate-distortion, RD)监督和属性特定熵建模,显著提升比特流生成效率。实验表明,该方法可在移动端实现实时解码与高质量渲染,并在多个数据集上优于现有方法的RD性能。
链接: https://arxiv.org/abs/2509.17513
作者: Zihan Zheng,Zhenlong Wu,Houqiang Zhong,Yuan Tian,Ning Cao,Lan Xu,Jiangchao Yao,Xiaoyun Zhang,Qiang Hu,Wenjun Zhang
机构: Cooperative Medianet Innovation Center, Shanghai Jiaotong University(上海交通大学协同媒体网络创新中心); Department of Electronics, Shanghai Jiaotong University(上海交通大学电子系); Shanghai AI Lab(上海人工智能实验室); Cloud platform department, E-surfing Vision Technology Co., Ltd.(易 surfing 视觉科技有限公司云平台部门); School of Information Science and Technology, ShanghaiTech University(上海科技大学信息科学与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:Achieving seamless viewing of high-fidelity volumetric video, comparable to 2D video experiences, remains an open challenge. Existing volumetric video compression methods either lack the flexibility to adjust quality and bitrate within a single model for efficient streaming across diverse networks and devices, or struggle with real-time decoding and rendering on lightweight mobile platforms. To address these challenges, we introduce 4DGCPro, a novel hierarchical 4D Gaussian compression framework that facilitates real-time mobile decoding and high-quality rendering via progressive volumetric video streaming in a single bitstream. Specifically, we propose a perceptually-weighted and compression-friendly hierarchical 4D Gaussian representation with motion-aware adaptive grouping to reduce temporal redundancy, preserve coherence, and enable scalable multi-level detail streaming. Furthermore, we present an end-to-end entropy-optimized training scheme, which incorporates layer-wise rate-distortion (RD) supervision and attribute-specific entropy modeling for efficient bitstream generation. Extensive experiments show that 4DGCPro enables flexible quality and multiple bitrate within a single model, achieving real-time decoding and rendering on mobile devices while outperforming existing methods in RD performance across multiple datasets. Project Page: this https URL
zh
[CV-74] 4D-MoDe: Towards Editable and Scalable Volumetric Streaming via Motion-Decoupled 4D Gaussian Compression
【速读】:该论文旨在解决大规模高质量动态体视频(volumetric video)在传输与存储中面临的挑战,包括数据量庞大、运动复杂性高以及现有表示方法编辑能力差等问题。其核心解决方案是提出一种运动解耦的4D高斯压缩框架(4D-MoDe),关键在于引入分层表示结构,通过基于前瞻的运动分解策略显式分离静态背景与动态前景,从而显著降低时间冗余并支持选择性背景/前景流媒体传输;同时结合多分辨率运动估计网格、轻量级共享MLP及动态高斯补偿机制以精确建模连续运动轨迹,并采用自适应分组策略插入背景关键帧以平衡时序一致性与压缩效率,最终在率失真优化目标下实现高效编码与低存储开销(如每帧低至11.4 KB)。
链接: https://arxiv.org/abs/2509.17506
作者: Houqiang Zhong,Zihan Zheng,Qiang Hu,Yuan Tian,Ning Cao,Lan Xu,Xiaoyun Zhang,Zhengxue Cheng,Li Song,Wenjun Zhang
机构: Shanghai Jiao Tong University (上海交通大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); E-surfing Vision Technology Co., Ltd (易 surfing 视觉科技有限公司); ShanghaiTech University (上海科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Volumetric video has emerged as a key medium for immersive telepresence and augmented/virtual reality, enabling six-degrees-of-freedom (6DoF) navigation and realistic spatial interactions. However, delivering high-quality dynamic volumetric content at scale remains challenging due to massive data volume, complex motion, and limited editability of existing representations. In this paper, we present 4D-MoDe, a motion-decoupled 4D Gaussian compression framework designed for scalable and editable volumetric video streaming. Our method introduces a layered representation that explicitly separates static backgrounds from dynamic foregrounds using a lookahead-based motion decomposition strategy, significantly reducing temporal redundancy and enabling selective background/foreground streaming. To capture continuous motion trajectories, we employ a multi-resolution motion estimation grid and a lightweight shared MLP, complemented by a dynamic Gaussian compensation mechanism to model emergent content. An adaptive grouping scheme dynamically inserts background keyframes to balance temporal consistency and compression efficiency. Furthermore, an entropy-aware training pipeline jointly optimizes the motion fields and Gaussian parameters under a rate-distortion (RD) objective, while employing range-based and KD-tree compression to minimize storage overhead. Extensive experiments on multiple datasets demonstrate that 4D-MoDe consistently achieves competitive reconstruction quality with an order of magnitude lower storage cost (e.g., as low as \textbf11.4 KB/frame) compared to state-of-the-art methods, while supporting practical applications such as background replacement and foreground-only streaming.
zh
[CV-75] SAMSON: 3rd Place Solution of LSVOS 2025 VOS Challenge
【速读】:该论文针对大规模视频目标分割(Large-scale Video Object Segmentation, LSVOS)中物体重出现、小尺度目标、严重遮挡及密集场景等挑战,提出了一种基于长时记忆增强的目标导航方法——Segment Anything with Memory Strengthened Object Navigation (SAMSON)。其解决方案的关键在于:引入长时记忆模块以实现可靠的目标再识别,从而应对视觉相似实例和长时间对象消失问题;同时采用SAM2Long作为后处理策略,有效减少误差累积并提升长视频序列中的分割稳定性。该方法在ICCV 2025 MOSE赛道中取得第三名成绩,测试集JF指标达0.8427。
链接: https://arxiv.org/abs/2509.17500
作者: Yujie Xie,Hongyang Zhang,Zhihui Liu,Shihai Ruan
机构: Truesight Research; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale Video Object Segmentation (LSVOS) addresses the challenge of accurately tracking and segmenting objects in long video sequences, where difficulties stem from object reappearance, small-scale targets, heavy occlusions, and crowded scenes. Existing approaches predominantly adopt SAM2-based frameworks with various memory mechanisms for complex video mask generation. In this report, we proposed Segment Anything with Memory Strengthened Object Navigation (SAMSON), the 3rd place solution in the MOSE track of ICCV 2025, which integrates the strengths of stateof-the-art VOS models into an effective paradigm. To handle visually similar instances and long-term object disappearance in MOSE, we incorporate a long-term memorymodule for reliable object re-identification. Additionly, we adopt SAM2Long as a post-processing strategy to reduce error accumulation and enhance segmentation stability in long video sequences. Our method achieved a final performance of 0.8427 in terms of J F in the test-set leaderboard.
zh
[CV-76] Vision-Based Driver Drowsiness Monitoring: Comparative Analysis of YOLOv5-v11 Models
【速读】:该论文旨在解决驾驶员疲劳(driver drowsiness)检测问题,这是导致道路交通事故的重要因素之一。其解决方案的关键在于采用基于计算机视觉的YOLO(You Look Only Once)系列目标检测算法对驾驶行为进行实时、非侵入式监测,并利用公开数据集UTA-RLDD进行模型训练与评估。研究对比了七种YOLO变体(v5s至v11l),发现YOLOv9c在准确率上表现最优(mAP0.5=0.986,Recall=0.978),而YOLOv11n则在精度(Precision=0.954)与推理效率之间取得最佳平衡,适合嵌入式部署。此外,作者还结合Eye Aspect Ratio(EAR)方法进行眼部状态分析,尽管计算开销低,但鲁棒性受姿态变化和遮挡影响较大。研究揭示了准确性、延迟和资源消耗之间的权衡关系,为自动驾驶及工业安全场景中检测方法的选择与集成提供了实用指导。
链接: https://arxiv.org/abs/2509.17498
作者: Dilshara Herath,Chinthaka Abeyrathne,Prabhani Jayaweera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: Drowsiness Detection using state of the art YOLO algorithms
Abstract:Driver drowsiness remains a critical factor in road accidents, accounting for thousands of fatalities and injuries each year. This paper presents a comprehensive evaluation of real-time, non-intrusive drowsiness detection methods, focusing on computer vision based YOLO (You Look Only Once) algorithms. A publicly available dataset namely, UTA-RLDD was used, containing both awake and drowsy conditions, ensuring variability in gender, eyewear, illumination, and skin tone. Seven YOLO variants (v5s, v9c, v9t, v10n, v10l, v11n, v11l) are fine-tuned, with performance measured in terms of Precision, Recall, mAP0.5, and mAP 0.5-0.95. Among these, YOLOv9c achieved the highest accuracy (0.986 mAP 0.5, 0.978 Recall) while YOLOv11n strikes the optimal balance between precision (0.954) and inference efficiency, making it highly suitable for embedded deployment. Additionally, we implement an Eye Aspect Ratio (EAR) approach using Dlib’s facial landmarks, which despite its low computational footprint exhibits reduced robustness under pose variation and occlusions. Our findings illustrate clear trade offs between accuracy, latency, and resource requirements, and offer practical guidelines for selecting or combining detection methods in autonomous driving and industrial safety applications.
zh
[CV-77] Multimodal Medical Image Classification via Synergistic Learning Pre-training
【速读】:该论文旨在解决多模态病理图像在标签稀缺情况下进行模态融合的挑战,尤其是在缺乏专家标注数据时,传统基于计算机视觉的多模态辅助诊断方法难以有效整合不同模态信息的问题。解决方案的关键在于提出一种“预训练+微调”框架:首先设计了一种协同学习预训练机制,结合一致性学习(consistency learning)、重建学习(reconstructive learning)和对齐学习(aligned learning),通过将一种模态视为另一种模态的增强样本,实现自监督预训练以提升模型特征表示能力;其次,在微调阶段引入差异化编码器分别提取各模态特征,并设计多模态融合编码器进行特征融合,同时提出一种分布偏移校正方法以缓解因标签不足导致的预测不确定性和过拟合风险。该方法在公开的胃镜图像数据集Kvasir和Kvasirv2上验证了其优越性。
链接: https://arxiv.org/abs/2509.17492
作者: Qinghua Lin,Guang-Hai Liu,Zuoyong Li,Yang Li,Yuting Jiang,Xiang Wu
机构: Fudan University (复旦大学); Guangxi Normal University (广西师范大学); Minjiang University (闽江学院); Beihang University (北京航空航天大学); Fujian Medical University (福建医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal pathological images are usually in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve the modality fusion in multimodal images with label scarcity, we propose a novel ``pretraining + fine-tuning" framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement a self-supervised learning pre-train, enhancing the baseline model’s feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we set different encoders to extract features from the original modalities and provide a multimodal fusion encoder for fusion modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: this https URL.
zh
[CV-78] Stable Video-Driven Portraits
【速读】:该论文旨在解决人脸动画(portrait animation)中存在的时间不一致性、表达控制能力弱以及对未见身份或大姿态变化泛化能力差的问题。其解决方案的关键在于:首先,利用驱动视频中被掩码的面部关键区域(如眼睛、鼻子和嘴巴)作为强运动控制信号;其次,采用跨身份监督策略以避免外观信息泄露,确保训练鲁棒性;再次,通过引入最小的新参数量架构来利用预训练扩散模型的强大先验,加快收敛并提升泛化性能;最后,设计时空注意力机制以捕捉帧间与帧内交互,结合历史帧保持动画连贯性,并提出新颖的信号融合策略,在推理阶段平衡运动保真度与身份保留。
链接: https://arxiv.org/abs/2509.17476
作者: Mallikarjun B. R.,Fei Yin,Vikram Voleti,Nikita Drobyshev,Maksim Lapin,Aaryaman Vasishta,Varun Jampani
机构: Stability AI(Stability.AI); University of Cambridge(剑桥大学); Cantina(坎蒂纳)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: this https URL
Abstract:Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. While early methods relied on 3D morphable models or feature warping techniques, they often suffered from limited expressivity, temporal inconsistency, and poor generalization to unseen identities or large pose variations. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. In this work, we propose a novel diffusion based framework that leverages masked facial regions specifically the eyes, nose, and mouth from the driving video as strong motion control cues. To enable robust training without appearance leakage, we adopt cross identity supervision. To leverage the strong prior from the pretrained diffusion model, our novel architecture introduces minimal new parameters that converge faster and help in better generalization. We introduce spatial temporal attention mechanisms that allow inter frame and intra frame interactions, effectively capturing subtle motions and reducing temporal artifacts. Our model uses history frames to ensure continuity across segments. At inference, we propose a novel signal fusion strategy that balances motion fidelity with identity preservation. Our approach achieves superior temporal consistency and accurate expression control, enabling high-quality, controllable portrait animation suitable for real-world applications.
zh
[CV-79] MAESTRO: Task-Relevant Optimization via Adaptive Feature Enhancement and Suppression for Multi-task 3D Perception ICCV2025
【速读】:该论文旨在解决多任务学习(Multi-task Learning)中因任务冲突导致的性能下降问题,尤其是在3D感知任务(包括3D目标检测、鸟瞰图(Bird’s-eye View, BEV)地图分割和3D占据预测)中特征干扰严重的问题。解决方案的关键在于提出MAESTRO框架,其核心创新是通过三个模块实现任务特定特征的生成与优化:首先,类别原型生成器(Class-wise Prototype Generator, CPG)将类别分组为前景和背景原型,分别服务于不同任务;其次,任务特定特征生成器(Task-Specific Feature Generator, TSFG)利用原型组保留各任务相关特征并抑制无关特征;最后,场景原型聚合器(Scene Prototype Aggregator, SPA)融合其他任务输出的信息以增强3D占据预测的原型表示。这一结构化设计有效缓解了多任务间的特征干扰,显著提升了各项任务的性能。
链接: https://arxiv.org/abs/2509.17462
作者: Changwon Kang,Jisong Kim,Hongjae Shin,Junseo Park,Jun Won Choi
机构: Hanyang University (汉阳大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
Abstract:The goal of multi-task learning is to learn to conduct multiple tasks simultaneously based on a shared data representation. While this approach can improve learning efficiency, it may also cause performance degradation due to task conflicts that arise when optimizing the model for different objectives. To address this challenge, we introduce MAESTRO, a structured framework designed to generate task-specific features and mitigate feature interference in multi-task 3D perception, including 3D object detection, bird’s-eye view (BEV) map segmentation, and 3D occupancy prediction. MAESTRO comprises three components: the Class-wise Prototype Generator (CPG), the Task-Specific Feature Generator (TSFG), and the Scene Prototype Aggregator (SPA). CPG groups class categories into foreground and background groups and generates group-wise prototypes. The foreground and background prototypes are assigned to the 3D object detection task and the map segmentation task, respectively, while both are assigned to the 3D occupancy prediction task. TSFG leverages these prototype groups to retain task-relevant features while suppressing irrelevant features, thereby enhancing the performance for each task. SPA enhances the prototype groups assigned for 3D occupancy prediction by utilizing the information produced by the 3D object detection head and the map segmentation head. Extensive experiments on the nuScenes and Occ3D benchmarks demonstrate that MAESTRO consistently outperforms existing methods across 3D object detection, BEV map segmentation, and 3D occupancy prediction tasks.
zh
[CV-80] CSDformer: A Conversion Method for Fully Spike-Driven Transformer
【速读】:该论文旨在解决传统脉冲神经网络(Spiking Neural Networks, SNNs)在应用Transformer架构时面临的两大问题:一是直接训练SNN导致的高昂计算成本,二是现有转换方法中不可避免的硬件不友好操作。其解决方案的关键在于提出CSDformer,一种全新的、面向转换的脉冲驱动Transformer架构,通过引入新型函数NReLU替代自注意力机制中的softmax,并结合时间分解技术实现模型量化与训练后到全脉冲驱动模型的转换;同时设计延迟积分-发放(delayed Integrate-and-Fire)神经元以降低转换误差并提升性能。该方法在ImageNet等数据集上实现了76.36% top-1准确率(仅需7个时间步),且相比传统训练方式减少75%计算资源消耗、加速2–3倍,是首个通过转换方法实现的高性能全脉冲驱动Transformer模型。
链接: https://arxiv.org/abs/2509.17461
作者: Yuhao Zhang,Chengjun Zhang,Di Wu,Jie Yang,Mohamad Sawan
机构: Zhejiang Key Laboratory of 3D Micro/Nano Fabrication and Characterization (浙江省三维微纳制造与表征重点实验室); Westlake Institute for Optoelectronics (西湖大学光电研究院); CenBRAIN Neurotech (脑科学中心神经科技公司); Westlake University (西湖大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Spike-based transformer is a novel architecture aiming to enhance the performance of spiking neural networks while mitigating the energy overhead inherent to transformers. However, methods for generating these models suffer from critical limitations: excessive training costs introduced by direct training methods, or unavoidably hardware-unfriendly operations in existing conversion methods. In this paper, we propose CSDformer, a novel conversion method for fully spike-driven transformers. We tailor a conversion-oriented transformer-based architecture and propose a new function NReLU to replace softmax in self-attention. Subsequently, this model is quantized and trained, and converted into a fully spike-driven model with temporal decomposition technique. Also, we propose delayed Integrate-andFire neurons to reduce conversion errors and improve the performance of spiking models. We evaluate CSDformer on ImageNet, CIFAR-10 and CIFAR-100 datasets and achieve 76.36% top-1 accuracy under 7 time-steps on ImageNet, demonstrating superiority over state-of-the-art models. Furthermore, CSDformer eliminates the need for training SNNs, thereby reducing training costs (reducing computational resource by 75% and accelerating training speed by 2-3 \times ). To the best of our knowledge, this is the first fully spike-driven transformer-based model developed via conversion method, achieving high performance under ultra-low latency, while dramatically reducing both computational complexity and training overhead.
zh
[CV-81] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration
【速读】:该论文旨在解决文本到图像扩散模型(如Stable Diffusion)在生成图像时难以实现构图一致性的问题,尤其是在处理复杂对象关系、属性或空间排列等组合性描述时。现有推理阶段的方法通过优化或探索初始噪声来提升文本-图像对齐度,但各自存在局限:优化方法易因初始化不佳或搜索轨迹不利而停滞,探索方法则可能需要大量样本才能找到满意结果;且单一奖励指标或随意组合无法可靠捕捉构图的全部维度,导致引导效果弱且不稳定。解决方案的关键在于提出一种统一框架CARINOX(Category-Aware Reward-based Initial Noise Optimization and Exploration),其核心创新是将噪声优化与探索相结合,并引入基于与人类判断相关性的奖励选择机制,从而实现更鲁棒和高效的构图对齐提升。
链接: https://arxiv.org/abs/2509.17458
作者: Seyed Amir Kasaei,Ali Aghayari,Arash Marioriyad,Niki Sepasian,Shayan Baghayi Nejad,MohammadAmin Fazli,Mahdieh Soleymani Baghshah,Mohammad Hossein Rohban
机构: Sharif University of Technology (谢里夫理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at this https URLthis URL.
zh
[CV-82] Explainable AI for Analyzing Person-Specific Patterns in Facial Recognition Tasks
【速读】:该论文旨在解决面部识别系统(Facial Recognition Systems)广泛应用所带来的隐私风险问题,特别是现有对抗性技术缺乏对个体面部特征的针对性适应,导致防护效果有限且不够隐蔽。其解决方案的关键在于提出一种名为层嵌入激活映射(Layer Embedding Activation Mapping, LEAM)的新颖可解释性技术,能够从个体层面识别出对人脸识别模型贡献最大的面部区域。LEAM 不是用于欺骗识别系统,而是通过分析不同模型各层级的激活模式,揭示人脸图像中关键识别区域的分布规律,并验证这些区域具有高度的人脸个体特异性(Bhattacharyya 系数在相同个体间为 0.32–0.57,跨个体间仅为 0.04–0.13)。研究进一步表明,尽管模型架构各异,但它们普遍聚焦于面部中心区域(如鼻部占关键区域的 18.9–29.7%),且仅需扰动 LEAM 识别出的约 1% 最相关像素即可实现跨模型迁移的有效遮蔽,为未来基于个体差异定制的隐私保护策略提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2509.17457
作者: Paweł Jakub Borsukiewicz,Jordan Samhi,Jacques Klein,Tegawendé F. Bissyandé
机构: University of Luxembourg(卢森堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages; 24 tables; 11 figures
Abstract:The proliferation of facial recognition systems presents major privacy risks, driving the need for effective countermeasures. Current adversarial techniques apply generalized methods rather than adapting to individual facial characteristics, limiting their effectiveness and inconspicuousness. In this work, we introduce Layer Embedding Activation Mapping (LEAM), a novel technique that identifies which facial areas contribute most to recognition at an individual level. Unlike adversarial attack methods that aim to fool recognition systems, LEAM is an explainability technique designed to understand how these systems work, providing insights that could inform future privacy protection research. We integrate LEAM with a face parser to analyze data from 1000 individuals across 9 pre-trained facial recognition models. Our analysis reveals that while different layers within facial recognition models vary significantly in their focus areas, these models generally prioritize similar facial regions across architectures when considering their overall activation patterns, which show significantly higher similarity between images of the same individual (Bhattacharyya Coefficient: 0.32-0.57) vs. different individuals (0.04-0.13), validating the existence of person-specific recognition patterns. Our results show that facial recognition models prioritize the central region of face images (with nose areas accounting for 18.9-29.7% of critical recognition regions), while still distributing attention across multiple facial fragments. Proper selection of relevant facial areas was confirmed using validation occlusions, based on just 1% of the most relevant, LEAM-identified, image pixels, which proved to be transferable across different models. Our findings establish the foundation for future individually tailored privacy protection systems centered around LEAM’s choice of areas to be perturbed. Comments: 22 pages; 24 tables; 11 figures Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI) MSC classes: 68T10 ACMclasses: I.2.10; I.4.m Cite as: arXiv:2509.17457 [cs.CV] (or arXiv:2509.17457v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.17457 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-83] raining-Free Label Space Alignment for Universal Domain Adaptation
【速读】:该论文旨在解决通用域适应(Universal Domain Adaptation, UniDA)中因源域与目标域标签空间不一致及目标域存在私有类别而导致的视觉模糊性问题,从而提升模型在跨域场景下的鲁棒性和泛化能力。其解决方案的关键在于摒弃传统依赖视觉空间对齐的方法,转而利用视觉-语言基础模型(Vision-Language Models, VLMs)如CLIP的零样本分类能力,专注于标签空间对齐:首先通过生成式VLM识别目标域中的未知类别,再基于无训练策略过滤和修正噪声标签(如同义词、上下位词等语义混淆标签),最终构建融合共享知识与目标私有类信息的通用分类器,显著提升了UniDA任务的性能表现。
链接: https://arxiv.org/abs/2509.17452
作者: Dujin Lee,Sojung An,Jungmyung Wi,Kuniaki Saito,Donghyun Kim
机构: Korea University (韩国大学); OMRON SINIC X (欧姆龙辛尼克X)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 22 pages, 12 figures
Abstract:Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong \textitzero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels – such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) – complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA (\ours). Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a \textituniversal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of \textcolorblue+7.9%in H-score and \textcolorblue+6.1% in H ^3 -score. Furthermore, incorporating self-training further enhances performance and achieves an additional (\textcolorblue+1.6%) increment in both H- and H ^3 -scores.
zh
[CV-84] Emergent 3D Correspondence from Neural Shape Representation SIGGRAPH
【速读】:该论文旨在解决3D语义对应(3D semantic correspondence)估计中准确性和鲁棒性不足的问题,尤其在结构复杂或类别多样的形状之间难以建立语义一致的点对点映射。其解决方案的关键在于提出了一种分层神经语义表示(Hierarchical Neural Semantic Representation, HNSR),该表示融合了预训练3D生成模型中的先验知识,包含全局语义特征以捕捉高层结构信息和多分辨率局部几何特征以保留细节;同时设计了一种从粗到细的匹配策略,利用全局特征建立初始对应关系,并通过迭代引入局部几何特征进行精细化调整,从而实现高精度且语义一致的3D对应。该方法无需训练,兼容多种3D生成模型骨干网络,在跨类别场景下仍表现出优异的泛化能力。
链接: https://arxiv.org/abs/2509.17431
作者: Keyu Du,Jingyu Hu,Haipeng Li,Hao Xu,Haibing Huang,Chi-Wing Fu,Shuaicheng Liu
机构: University of Electronic Science and Technology of China (电子科技大学); The Chinese University of Hong Kong (香港中文大学); TeleAI (TeleAI)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper is accepted by Siggraph Asia 2025 conference track
Abstract:This paper presents a new approach to estimate accurate and robust 3D semantic correspondence with the hierarchical neural semantic representation. Our work has three key contributions. First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features to preserve fine details, by carefully harnessing 3D priors from pre-trained 3D generative models. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature, then iteratively refines it with local geometric features, yielding accurate and semantically-consistent mappings. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories. Our method also supports various applications, such as shape co-segmentation, keypoint matching, and texture transfer, and generalizes well to structurally diverse shapes, with promising results even in cross-category scenarios. Both qualitative and quantitative evaluations show that our method outperforms previous state-of-the-art techniques.
zh
[CV-85] EmbodiedSplat: Personalized Real-to-Sim-to-Real Navigation with Gaussian Splats from a Mobile Device ICCV
【速读】:该论文旨在解决具身智能(Embodied AI)中模拟到现实(sim-to-real)迁移效果差的问题,即当前训练与评估多依赖于仿真环境,而这些环境要么缺乏真实感(完全合成),要么需昂贵硬件重建高保真度真实场景,导致模型在真实世界部署时性能显著下降。其解决方案的关键在于提出EmbodiedSplat方法,通过使用iPhone采集的部署环境数据,结合3D高斯泼溅(3D Gaussian Splatting, GS)技术高效重建场景网格,并在Habitat-Sim仿真器中对策略进行微调,从而构建出高度贴近真实世界的训练环境。该方法实现了高sim-vs-real相关性(0.87–0.97),并在真实世界图像导航任务上相较零样本基线分别提升20%和40%的成功率,证明了其在低资源条件下实现高效、可靠环境个性化训练的有效性。
链接: https://arxiv.org/abs/2509.17430
作者: Gunjan Chhablani,Xiaomeng Ye,Muhammad Zubair Irshad,Zsolt Kira
机构: Georgia Tech (佐治亚理工学院); Toyota Research Institute (丰田研究院); Waymo (Waymo)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: 16 pages, 18 figures, paper accepted at ICCV, 2025
Abstract:The field of Embodied AI predominantly relies on simulation for training and evaluation, often using either fully synthetic environments that lack photorealism or high-fidelity real-world reconstructions captured with expensive hardware. As a result, sim-to-real transfer remains a major challenge. In this paper, we introduce EmbodiedSplat, a novel approach that personalizes policy training by efficiently capturing the deployment environment and fine-tuning policies within the reconstructed scenes. Our method leverages 3D Gaussian Splatting (GS) and the Habitat-Sim simulator to bridge the gap between realistic scene capture and effective training environments. Using iPhone-captured deployment scenes, we reconstruct meshes via GS, enabling training in settings that closely approximate real-world conditions. We conduct a comprehensive analysis of training strategies, pre-training datasets, and mesh reconstruction techniques, evaluating their impact on sim-to-real predictivity in real-world scenarios. Experimental results demonstrate that agents fine-tuned with EmbodiedSplat outperform both zero-shot baselines pre-trained on large-scale real-world datasets (HM3D) and synthetically generated datasets (HSSD), achieving absolute success rate improvements of 20% and 40% on real-world Image Navigation task. Moreover, our approach yields a high sim-vs-real correlation (0.87–0.97) for the reconstructed meshes, underscoring its effectiveness in adapting policies to diverse environments with minimal effort. Project page: this https URL
zh
[CV-86] Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration
【速读】:该论文旨在解决视觉语言模型在多尺度时间预测(Multi-Scale Temporal Prediction, MSTP)任务中难以同时准确预测场景在不同时间尺度和状态粒度下的细粒度状态问题。其核心挑战在于如何统一建模时间尺度(如不同前瞻间隔的预测)与状态尺度(如接触关系 vs. 空间关系或手术步骤 vs. 手术阶段)这两个正交维度,并保持预测的一致性与精度。解决方案的关键在于提出一种名为“增量生成与多智能体协作”(Incremental Generation and Multi-agent Collaboration, IG-MC)的方法:首先设计了一个即插即用的增量生成模块,通过持续合成扩展时间尺度下的视觉预览来同步支持多个决策代理;其次构建了一个以决策驱动的多智能体协作框架,包含生成、启动和多状态评估代理,动态触发并评估预测周期,从而在全局一致性与局部精确性之间取得平衡。
链接: https://arxiv.org/abs/2509.17429
作者: Zhitao Zeng,Guojian Yuan,Junyuan Mao,Yuxuan Wang,Xiaoshuang Jia,Yueming Jin
机构: National University of Singapore (新加坡国立大学); Qwen team, Alibaba (阿里巴巴); Renmin University of China (中国人民大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 20 pages, 6 figures
Abstract:Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.
zh
[CV-87] Single-Image Depth from Defocus with Coded Aperture and Diffusion Posterior Sampling
【速读】:该论文旨在解决编码孔径成像(coded-aperture imaging)中单帧深度恢复(depth-from-defocus, DFD)的精度与稳定性问题。传统方法依赖手工设计的先验(hand-crafted priors),易受噪声干扰且泛化能力有限。其解决方案的关键在于引入一种基于扩散模型(diffusion model)学习得到的先验,作为纯正则化项嵌入优化框架中;该框架通过可微分的前向模型确保观测一致性,并在去噪后的图像域中引导解空间,从而实现更准确、稳定的RGBD重建。此方法无需配对的模糊-深度训练数据,且不绑定特定相机配置,具备更强的通用性和鲁棒性。
链接: https://arxiv.org/abs/2509.17427
作者: Hodaka Kawachi,Jose Reinaldo Cunha Santos A. V. Silva Neto,Yasushi Yagi,Hajime Nagahara,Tomoya Nakamura
机构: SANKEN, The University of Osaka (大阪大学综合研究机构); D3 center, The University of Osaka (大阪大学D3中心); Graduate School of Engineering Science, The University of Osaka (大阪大学工学科学研究科)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a single-snapshot depth-from-defocus (DFD) reconstruction method for coded-aperture imaging that replaces hand-crafted priors with a learned diffusion prior used purely as regularization. Our optimization framework enforces measurement consistency via a differentiable forward model while guiding solutions with the diffusion prior in the denoised image domain, yielding higher accuracy and stability than clas- sical optimization. Unlike U-Net-style regressors, our approach requires no paired defocus-RGBD training data and does not tie training to a specific camera configuration. Experiments on comprehensive simulations and a prototype camera demonstrate consistently strong RGBD reconstructions across noise levels, outperforming both U-Net baselines and a classical coded- aperture DFD method.
zh
[CV-88] Real-Time Fish Detection in Indonesian Marine Ecosystems Using Lightweight YOLOv10-nano Architecture
【速读】:该论文旨在解决印度尼西亚海域海洋鱼类监测效率低下的问题,传统方法依赖人工识别且耗时费力,难以满足生态保护对实时、自动化检测的需求。其解决方案的关键在于引入YOLOv10-nano模型,该模型基于CSPNet骨干网络(backbone)、PAN特征融合结构以及金字塔空间注意力模块(Pyramid Spatial Attention Block),在保持极低计算资源消耗(2.7M参数,8.4 GFLOPs)的同时,实现了高精度检测(mAP50为0.966,mAP50:95为0.606)和高达29.29 FPS的推理速度,适用于复杂水下环境中的实时鱼群监测,尤其在数据有限条件下展现出良好的可扩展性和鲁棒性。
链接: https://arxiv.org/abs/2509.17406
作者: Jonathan Wuntu,Muhamad Dwisnanto Putro,Rendy Syahputra
机构: Sam Ratulangi University (萨姆·拉图兰吉大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Indonesia’s marine ecosystems, part of the globally recognized Coral Triangle, are among the richest in biodiversity, requiring efficient monitoring tools to support conservation. Traditional fish detection methods are time-consuming and demand expert knowledge, prompting the need for automated solutions. This study explores the implementation of YOLOv10-nano, a state-of-the-art deep learning model, for real-time marine fish detection in Indonesian waters, using test data from Bunaken National Marine Park. YOLOv10’s architecture, featuring improvements like the CSPNet backbone, PAN for feature fusion, and Pyramid Spatial Attention Block, enables efficient and accurate object detection even in complex environments. The model was evaluated on the DeepFish and OpenImages V7-Fish datasets. Results show that YOLOv10-nano achieves a high detection accuracy with mAP50 of 0.966 and mAP50:95 of 0.606 while maintaining low computational demand (2.7M parameters, 8.4 GFLOPs). It also delivered an average inference speed of 29.29 FPS on the CPU, making it suitable for real-time deployment. Although OpenImages V7-Fish alone provided lower accuracy, it complemented DeepFish in enhancing model robustness. Overall, this study demonstrates YOLOv10-nano’s potential for efficient, scalable marine fish monitoring and conservation applications in data-limited environments.
zh
[CV-89] Interpreting vision transformers via residual replacement model
【速读】:该论文旨在解决视觉Transformer(Vision Transformer, ViT)如何表征和处理世界这一长期存在的问题,特别是缺乏对模型内部特征演化机制的系统性理解。其解决方案的关键在于首次基于6.6K个通过稀疏自动编码器(sparse autoencoders)提取的全层特征进行系统分析,并提出残差替换模型(residual replacement model),该模型用可解释特征替代ViT中的残差流计算,从而在保持高保真度的同时显著简化原始计算过程。这一框架不仅揭示了从低级模式到高级语义的特征演变规律,还识别出专门用于编码曲线和空间位置的特征类型,最终实现了人类可理解的ViT机制解析,并成功应用于消除模型中的虚假相关性偏差。
链接: https://arxiv.org/abs/2509.17401
作者: Jinyeong Kim,Junhyeok Kim,Yumin Shim,Joohyeok Kim,Sunyoung Jung,Seong Jae Hwang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.
zh
[CV-90] Diff-GNSS: Diffusion-based Pseudorange Error Estimation
【速读】:该论文旨在解决城市环境中全球导航卫星系统(GNSS)因多路径效应和非视距(NLOS)接收导致的伪距(pseudorange)测量误差问题,此类误差会显著降低定位精度。解决方案的关键在于提出Diff-GNSS框架,其核心创新是利用条件扩散模型(conditional diffusion model)对复杂分布的伪距误差进行建模与精修:首先通过基于Mamba的模块实现粗估计以提供合理的尺度和趋势,随后引入条件去噪扩散层进行细粒度误差修正;同时,设计三个与GNSS测量质量相关的特征作为条件输入,以控制生成多样性并精确引导反向去噪过程,并在扩散阶段嵌入每颗卫星的不确定性建模机制,从而提升预测误差的可靠性。该方法为首个将扩散模型应用于伪距误差估计的研究,且其扩散精修模块具有即插即用特性,可无缝集成至现有GNSS处理网络中以显著提升精度。
链接: https://arxiv.org/abs/2509.17397
作者: Jiaqi Zhu,Shouyi Lu,Ziyao Li,Guirong Zhuo,Lu Xiong
机构: Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET)
备注:
Abstract:Global Navigation Satellite Systems (GNSS) are vital for reliable urban positioning. However, multipath and non-line-of-sight reception often introduce large measurement errors that degrade accuracy. Learning-based methods for predicting and compensating pseudorange errors have gained traction, but their performance is limited by complex error distributions. To address this challenge, we propose Diff-GNSS, a coarse-to-fine GNSS measurement (pseudorange) error estimation framework that leverages a conditional diffusion model to capture such complex distributions. Firstly, a Mamba-based module performs coarse estimation to provide an initial prediction with appropriate scale and trend. Then, a conditional denoising diffusion layer refines the estimate, enabling fine-grained modeling of pseudorange errors. To suppress uncontrolled generative diversity and achieve controllable synthesis, three key features related to GNSS measurement quality are used as conditions to precisely guide the reverse denoising process. We further incorporate per-satellite uncertainty modeling within the diffusion stage to assess the reliability of the predicted errors. We have collected and publicly released a real-world dataset covering various scenes. Experiments on public and self-collected datasets show that DiffGNSS consistently outperforms state-of-the-art baselines across multiple metrics. To the best of our knowledge, this is the first application of diffusion models to pseudorange error estimation. The proposed diffusion-based refinement module is plug-and-play and can be readily integrated into existing networks to markedly improve estimation accuracy.
zh
[CV-91] Revisiting Vision Language Foundations for No-Reference Image Quality Assessment
【速读】:该论文旨在解决大规模视觉语言预训练模型在无参考图像质量评估(No-Reference Image Quality Assessment, NR-IQA)任务中,不同视觉Transformer架构性能差异不明确的问题。研究系统性地评估了六种主流预训练骨干网络(CLIP、SigLIP2、DINOv2、DINOv3、Perception和ResNet)在NR-IQA上的表现,并发现两个此前被忽视的关键因素:一是SigLIP2在多个基准上均表现出色;二是激活函数的选择对模型泛化能力具有显著影响,其中简单的Sigmoid激活函数优于常用的ReLU和GELU。基于此发现,论文提出一种可学习的激活选择机制,能够自适应地为每个通道确定最优非线性变换,从而无需人工设计激活函数,并在CLIVE、KADID10K和AGIQA3K等多个数据集上实现了新的SOTA(State-of-the-Art)性能,同时构建了高效且鲁棒的NR-IQA基线模型。
链接: https://arxiv.org/abs/2509.17374
作者: Ankit Yadav,Ta Duc Huy,Lingqiao Liu
机构: Australian Institute for Machine Learning (澳大利亚机器学习研究所); The University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 23 pages, 16 figures
Abstract:Large-scale vision language pre-training has recently shown promise for no-reference image-quality assessment (NR-IQA), yet the relative merits of modern Vision Transformer foundations remain poorly understood. In this work, we present the first systematic evaluation of six prominent pretrained backbones, CLIP, SigLIP2, DINOv2, DINOv3, Perception, and ResNet, for the task of No-Reference Image Quality Assessment (NR-IQA), each finetuned using an identical lightweight MLP head. Our study uncovers two previously overlooked factors: (1) SigLIP2 consistently achieves strong performance; and (2) the choice of activation function plays a surprisingly crucial role, particularly for enhancing the generalization ability of image quality assessment models. Notably, we find that simple sigmoid activations outperform commonly used ReLU and GELU on several benchmarks. Motivated by this finding, we introduce a learnable activation selection mechanism that adaptively determines the nonlinearity for each channel, eliminating the need for manual activation design, and achieving new state-of-the-art SRCC on CLIVE, KADID10K, and AGIQA3K. Extensive ablations confirm the benefits across architectures and regimes, establishing strong, resource-efficient NR-IQA baselines.
zh
[CV-92] Pre-Trained CNN Architecture for Transformer-Based Image Caption Generation Model
【速读】:该论文旨在解决传统卷积神经网络(CNN)与长短期记忆网络(LSTM)在图像描述生成任务中存在的局限性,包括RNN固有的序列处理特性导致的训练和推理效率低下,以及LSTM在处理长序列时难以保留早期信息的问题。其解决方案的关键在于采用基于自注意力机制(self-attention mechanism)的Transformer架构,该架构能够高效捕捉数据中的短距离和长距离依赖关系,并支持训练与推理阶段的并行化处理,从而显著提升模型性能与计算效率。
链接: https://arxiv.org/abs/2509.17365
作者: Amanuel Tafese Dufera
机构: Xi’an Jiaotong University (西安交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Automatic image captioning, a multifaceted task bridging computer vision and natural lan- guage processing, aims to generate descriptive textual content from visual input. While Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have achieved significant advancements, they present limitations. The inherent sequential nature of RNNs leads to sluggish training and inference times. LSTMs further struggle with retaining information from earlier sequence elements when dealing with very long se- quences. This project presents a comprehensive guide to constructing and comprehending transformer models for image captioning. Transformers employ self-attention mechanisms, capturing both short- and long-range dependencies within the data. This facilitates efficient parallelization during both training and inference phases. We leverage the well-established Transformer architecture, recognized for its effectiveness in managing sequential data, and present a meticulous methodology. Utilizing the Flickr30k dataset, we conduct data pre- processing, construct a model architecture that integrates an EfficientNetB0 CNN for fea- ture extraction, and train the model with attention mechanisms incorporated. Our approach exemplifies the utilization of parallelization for efficient training and inference. You can find the project on GitHub.
zh
[CV-93] SmokeSeer: 3D Gaussian Splatting for Smoke Removal and Scene Reconstruction
【速读】:该论文旨在解决真实场景中烟雾对图像质量的严重退化问题,特别是现有图像恢复方法在处理动态、高密度烟雾时存在的局限性。其解决方案的关键在于提出SmokeSeer框架,通过融合热成像(thermal imaging)与RGB图像信息,利用热图像中烟雾散射减少的特性实现穿透烟雾的能力;同时基于3D高斯泼溅(3D Gaussian splatting)技术,显式地将场景分解为烟雾和非烟雾成分,从而实现从多视角视频中同步进行三维场景重建与烟雾去除,且能适应不同密度及随时间变化的烟雾条件。
链接: https://arxiv.org/abs/2509.17329
作者: Neham Jain,Andrew Jong,Sebastian Scherer,Ioannis Gkioulekas
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project website: this https URL
Abstract:Smoke in real-world scenes can severely degrade the quality of images and hamper visibility. Recent methods for image restoration either rely on data-driven priors that are susceptible to hallucinations, or are limited to static low-density smoke. We introduce SmokeSeer, a method for simultaneous 3D scene reconstruction and smoke removal from a video capturing multiple views of a scene. Our method uses thermal and RGB images, leveraging the fact that the reduced scattering in thermal images enables us to see through the smoke. We build upon 3D Gaussian splatting to fuse information from the two image modalities, and decompose the scene explicitly into smoke and non-smoke components. Unlike prior approaches, SmokeSeer handles a broad range of smoke densities and can adapt to temporally varying smoke. We validate our approach on synthetic data and introduce a real-world multi-view smoke dataset with RGB and thermal images. We provide open-source code and data at the project website.
zh
[CV-94] UIPro: Unleashing Superior Interaction Capability For GUI Agents ICCV2025
【速读】:该论文旨在解决当前GUI代理(Graphical User Interface Agent)在场景多样性不足、数据规模有限以及动作空间异构性导致的泛化能力弱的问题。解决方案的关键在于提出一种名为UIPro的通用GUI代理,其核心创新包括:首先构建一个包含2060万项GUI理解任务的综合性预训练数据集,以增强模型对GUI的强基座能力(GUI grounding capability);其次设计了一个统一的动作空间(unified action space),用于整合多平台、多任务下的异构GUI任务数据,并通过持续微调提升动作预测能力。实验表明,该方法在多个平台上的GUI任务基准测试中均展现出优越性能,验证了其有效性。
链接: https://arxiv.org/abs/2509.17328
作者: Hongxin Li,Jingran Su,Jingfan Chen,Zheng Ju,Yuntao Chen,Qing Li,Zhaoxiang Zhang
机构: University of Chinese Academy of Sciences (UCAS); New Laboratory of Pattern Recognition (NLPR), CASIA; State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), CASIA; Hong Kong Institute of Science & Innovation, CASIA; PolyU; Shanghai Artificial Intelligence Laboratory; StepFun
类目: Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC)
备注: Accepted to ICCV 2025
Abstract:Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes \textbfUIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro’s superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.
zh
[CV-95] DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
【速读】:该论文旨在解决机器人感知中视觉多目标跟踪(Visual Multi-Object Tracking, MOT)的鲁棒性问题,尤其是现有基于检测的跟踪(Tracking-By-Detection, TBD)方法依赖二维(2D)线索(如边界框和运动建模)时,在遮挡和近距离交互场景下性能显著下降的问题。针对这一挑战,论文提出DepTR-MOT,其关键创新在于:(i) 基于基础模型的实例级软深度标签监督机制,用于精细化深度预测;(ii) 通过蒸馏密集深度图以保持全局深度一致性,从而在推理阶段输出实例级深度信息,无需额外计算开销或依赖基础模型。该方案有效提升了TBD框架对遮挡和近距离目标的处理能力,实验表明在QuadTrack和DanceTrack数据集上分别取得27.59和44.47的HOTA分数,尤其在机器人平台数据集QuadTrack上验证了深度信息在复杂场景下的优势。
链接: https://arxiv.org/abs/2509.17323
作者: Buyin Deng,Lingxin Huang,Kai Luo,Fei Teng,Kailun Yang
机构: Hunan University (湖南大学); State Key Laboratory of Industrial Control Technology (工业控制技术国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code will be made publicly available at this https URL
Abstract:Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at this https URL.
zh
[CV-96] Automated Coral Spawn Monitoring for Reef Restoration: The Coral Spawn and Larvae Imaging Camera System (CSLICS)
【速读】:该论文旨在解决珊瑚 aquaculture(珊瑚养殖)中用于 reef restoration(珊瑚礁修复)的 spawn counting(产卵计数)问题,当前方法依赖人工操作,效率低且难以持续监测,成为珊瑚生产流程中的关键瓶颈。解决方案的核心是提出 Coral Spawn and Larvae Imaging Camera System (CSLICS),其采用低成本模块化摄像头结合基于 human-in-the-loop(人机协同)标注训练的物体检测模型,实现对育苗缸内珊瑚产卵的自动检测、分类与计数;实验表明,该系统在不同胚胎发育阶段的表面产卵检测中达到 82.4% 的 F1 分数,水下产卵检测达 65.3% F1 分数,并相较人工采样节省每场产卵事件约 5,720 小时劳动时间,显著提升了珊瑚繁殖过程的自动化水平和可扩展性。
链接: https://arxiv.org/abs/2509.17299
作者: Dorian Tsai,Christopher A. Brunner,Riki Lamont,F. Mikaela Nordborg,Andrea Severati,Java Terry,Karen Jackel,Matthew Dunbabin,Tobias Fischer,Scarlett Raine
机构: Queensland University of Technology (昆士兰理工大学); Australian Institute for Marine Science (澳大利亚海洋科学研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages, 7 figures
Abstract:Coral aquaculture for reef restoration requires accurate and continuous spawn counting for resource distribution and larval health monitoring, but current methods are labor-intensive and represent a critical bottleneck in the coral production pipeline. We propose the Coral Spawn and Larvae Imaging Camera System (CSLICS), which uses low cost modular cameras and object detectors trained using human-in-the-loop labeling approaches for automated spawn counting in larval rearing tanks. This paper details the system engineering, dataset collection, and computer vision techniques to detect, classify and count coral spawn. Experimental results from mass spawning events demonstrate an F1 score of 82.4% for surface spawn detection at different embryogenesis stages, 65.3% F1 score for sub-surface spawn detection, and a saving of 5,720 hours of labor per spawning event compared to manual sampling methods at the same frequency. Comparison of manual counts with CSLICS monitoring during a mass coral spawning event on the Great Barrier Reef demonstrates CSLICS’ accurate measurement of fertilization success and sub-surface spawn counts. These findings enhance the coral aquaculture process and enable upscaling of coral reef restoration efforts to address climate change threats facing ecosystems like the Great Barrier Reef.
zh
[CV-97] Event-Based Visual Teach-and-Repeat via Fast Fourier-Domain Cross-Correlation
【速读】:该论文旨在解决传统基于帧的视觉 teach-and-repeat 导航系统因固定帧率(通常为30–60 Hz)导致的响应延迟问题,从而限制了机器人对环境变化的实时感知与控制能力。其核心解决方案是首次提出基于事件相机(event camera)的视觉 teach-and-repeat 系统,关键创新在于设计了一种频域交叉相关框架,将事件流匹配问题转化为傅里叶空间中的高效乘法运算,实现了超过300 Hz的处理速率,较传统方法提升一个数量级;同时利用事件帧的二值特性并结合图像压缩技术,在不牺牲定位精度的前提下显著提升了计算效率,最终在真实机器人平台上实现了高频率、高精度的自主导航。
链接: https://arxiv.org/abs/2509.17287
作者: Gokul B. Nair,Alejandro Fontan,Michael Milford,Tobias Fischer
机构: Queensland University of Technology (昆士兰科技大学); QUT Centre for Robotics (昆士兰科技大学机器人中心); Research Engineering Facility (REF) team (研究工程设施团队)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 Pages, 4 Figures, Under Review
Abstract:Visual teach-and-repeat navigation enables robots to autonomously traverse previously demonstrated paths by comparing current sensory input with recorded trajectories. However, conventional frame-based cameras fundamentally limit system responsiveness: their fixed frame rates (typically 30-60 Hz) create inherent latency between environmental changes and control responses. Here we present the first event-camera-based visual teach-and-repeat system. To achieve this, we develop a frequency-domain cross-correlation framework that transforms the event stream matching problem into computationally efficient Fourier space multiplications, capable of exceeding 300Hz processing rates, an order of magnitude faster than frame-based approaches. By exploiting the binary nature of event frames and applying image compression techniques, we further enhance the computational speed of the cross-correlation process without sacrificing localization accuracy. Extensive experiments using a Prophesee EVK4 HD event camera mounted on an AgileX Scout Mini robot demonstrate successful autonomous navigation across 4000+ meters of indoor and outdoor trajectories. Our system achieves ATEs below 24 cm while maintaining consistent high-frequency control updates. Our evaluations show that our approach achieves substantially higher update rates compared to conventional frame-based systems, underscoring the practical viability of event-based perception for real-time robotic navigation.
zh
[CV-98] Automated Facility Enumeration for Building Compliance Checking using Door Detection and Large Language Models
【速读】:该论文旨在解决建筑合规检查(Building Compliance Checking, BCC)中一个长期被忽视的关键问题:如何准确自动枚举各类设施及其空间分布,以验证其数量是否符合法定要求。传统人工方式效率低下,而现有方法缺乏对这一任务的有效支持。解决方案的关键在于提出一种融合门检测与大语言模型(Large Language Models, LLMs)推理的新方法,并引入思维链(Chain-of-Thought, CoT)增强推理能力,从而实现对不同设施类型和数据集的泛化性能,显著提升自动化水平和准确性。
链接: https://arxiv.org/abs/2509.17283
作者: Licheng Zhan,Bach Le,Naveed Akhtar,Tuan Ngo
机构: The University of Melbourne (墨尔本大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:Building compliance checking (BCC) is a critical process for ensuring that constructed facilities meet regulatory standards. A core component of BCC is the accurate enumeration of facility types and their spatial distribution. Despite its importance, this problem has been largely overlooked in the literature, posing a significant challenge for BCC and leaving a critical gap in existing workflows. Performing this task manually is time-consuming and labor-intensive. Recent advances in large language models (LLMs) offer new opportunities to enhance automation by combining visual recognition with reasoning capabilities. In this paper, we introduce a new task for BCC: automated facility enumeration, which involves validating the quantity of each facility type against statutory requirements. To address it, we propose a novel method that integrates door detection with LLM-based reasoning. We are the first to apply LLMs to this task and further enhance their performance through a Chain-of-Thought (CoT) pipeline. Our approach generalizes well across diverse datasets and facility types. Experiments on both real-world and synthetic floor plan data demonstrate the effectiveness and robustness of our method.
zh
[CV-99] ask-Oriented Communications for 3D Scene Representation: Balancing Timeliness and Fidelity
【速读】:该论文旨在解决实时三维(3D)场景表示中数据时效性(timeliness)与表示保真度(fidelity)之间的权衡问题,尤其是在多移动机器人通过无线网络将图像传输至边缘服务器以构建动态环境3D表示的场景下。解决方案的关键在于提出一种结合年龄信息(Age of Information, AoI)与语义信息的上下文感知强化学习框架——基于上下文带通策略优化(contextual-bandit Proximal Policy Optimization, PPO),通过设计ω-阈值(ω-threshold)和ω-等待(ω-wait)两种策略,实现对图像选择的智能决策,在保障低延迟的同时提升3D表示的质量。
链接: https://arxiv.org/abs/2509.17282
作者: Xiangmin Xu,Zhen Meng,Kan Chen,Jiaming Yang,Emma Li,Philip G. Zhao,David Flynn
机构: University of Glasgow (格拉斯哥大学); University of Manchester (曼彻斯特大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Networking and Internet Architecture (cs.NI)
备注: Submitted to IEEE Transactions on Mobile Computing
Abstract:Real-time Three-dimensional (3D) scene representation is a foundational element that supports a broad spectrum of cutting-edge applications, including digital manufacturing, Virtual, Augmented, and Mixed Reality (VR/AR/MR), and the emerging metaverse. Despite advancements in real-time communication and computing, achieving a balance between timeliness and fidelity in 3D scene representation remains a challenge. This work investigates a wireless network where multiple homogeneous mobile robots, equipped with cameras, capture an environment and transmit images to an edge server over channels for 3D representation. We propose a contextual-bandit Proximal Policy Optimization (PPO) framework incorporating both Age of Information (AoI) and semantic information to optimize image selection for representation, balancing data freshness and representation quality. Two policies – the \omega -threshold and \omega -wait policies – together with two benchmark methods are evaluated, timeliness embedding and weighted sum, on standard datasets and baseline 3D scene representation models. Experimental results demonstrate improved representation fidelity while maintaining low latency, offering insight into the model’s decision-making process. This work advances real-time 3D scene representation by optimizing the trade-off between timeliness and fidelity in dynamic environments.
zh
[CV-100] Computational Scaffolding of Composition Value and Color for Disciplined Drawing
【速读】:该论文旨在解决初学者和中级数字艺术家在通过临摹参考图像进行专业技能训练时,难以有效组织学习流程及缺乏即时反馈的问题。解决方案的关键在于提出ArtKrit工具,该工具将临摹过程系统性地拆解为三个核心步骤:构图(composition)、明暗(value)和色彩(color),并在每个步骤中提供计算辅助指导(如自适应构图线生成)与自动反馈机制(如明暗与色彩准确性评估),从而灵活支持用户个性化创作流程并提升其专业视觉判断能力。
链接: https://arxiv.org/abs/2509.17268
作者: Jiaju Ma,Chau Vu,Asya Lyubavina,Catherine Liu,Jingyi Li
机构: Stanford University (斯坦福大学); Pomona College (波莫纳学院); Claremont McKenna College (克莱蒙特麦肯纳学院)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to UIST 2025 (Best Paper)
Abstract:One way illustrators engage in disciplined drawing - the process of drawing to improve technical skills - is through studying and replicating reference images. However, for many novice and intermediate digital artists, knowing how to approach studying a reference image can be challenging. It can also be difficult to receive immediate feedback on their works-in-progress. To help these users develop their professional vision, we propose ArtKrit, a tool that scaffolds the process of replicating a reference image into three main steps: composition, value, and color. At each step, our tool offers computational guidance, such as adaptive composition line generation, and automatic feedback, such as value and color accuracy. Evaluating this tool with intermediate digital artists revealed that ArtKrit could flexibly accommodate their unique workflows. Our code and supplemental materials are available at this https URL .
zh
[CV-101] Optimized Learned Image Compression for Facial Expression Recognition ICIP2025
【速读】:该论文旨在解决在面部表情识别(Facial Expression Recognition, FER)任务中,由于有损压缩导致特征退化和识别准确率下降的问题。其解决方案的关键在于提出一种端到端的模型架构,并设计了一种定制化的损失函数,以协同优化压缩效率与识别性能之间的平衡;实验表明,通过单独微调压缩模型或联合优化压缩与分类模型,均能显著提升识别准确率和压缩效率,其中联合优化使分类准确率提升4.04%、压缩效率提升89.12%,且压缩模型在高压缩率下仍能有效保留图像细节,保障识别性能。
链接: https://arxiv.org/abs/2509.17262
作者: Xiumei Li,Marc Windsheimer,Misha Sadeghi,Björn Eskofier,André Kaup
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at ICIP 2025
Abstract:Efficient data compression is crucial for the storage and transmission of visual data. However, in facial expression recognition (FER) tasks, lossy compression often leads to feature degradation and reduced accuracy. To address these challenges, this study proposes an end-to-end model designed to preserve critical features and enhance both compression and recognition performance. A custom loss function is introduced to optimize the model, tailored to balance compression and recognition performance effectively. This study also examines the influence of varying loss term weights on this balance. Experimental results indicate that fine-tuning the compression model alone improves classification accuracy by 0.71% and compression efficiency by 49.32%, while joint optimization achieves significant gains of 4.04% in accuracy and 89.12% in efficiency. Moreover, the findings demonstrate that the jointly optimized classification model maintains high accuracy on both compressed and uncompressed data, while the compression model reliably preserves image details, even at high compression rates.
zh
[CV-102] SPFSplatV2: Efficient Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
【速读】:该论文旨在解决从稀疏多视角图像中进行高效3D高斯溅射(3D Gaussian splatting)重建时对真实相机位姿(ground-truth poses)的依赖问题,尤其是在训练和推理阶段均无需位姿监督的情况下实现高质量的新视角合成。其解决方案的关键在于提出SPFSplatV2框架,该框架采用共享特征提取主干网络,在一个规范空间(canonical space)中同时预测3D高斯原型(Gaussian primitives)和相机位姿;并通过引入掩码注意力机制(masked attention mechanism)在训练中高效估计目标位姿,结合重投影损失(reprojection loss)强制高斯原型与像素对齐,从而提供更强的几何约束。这一设计使得模型即使在无位姿监督条件下仍能实现域内和域外新视角合成的最先进性能,尤其在极端视点变化和图像重叠有限的情况下表现优异。
链接: https://arxiv.org/abs/2509.17246
作者: Ranran Huang,Krystian Mikolajczyk
机构: Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce SPFSplatV2, an efficient feed-forward framework for 3D Gaussian splatting from sparse multi-view images, requiring no ground-truth poses during training and inference. It employs a shared feature extraction backbone, enabling simultaneous prediction of 3D Gaussian primitives and camera poses in a canonical space from unposed inputs. A masked attention mechanism is introduced to efficiently estimate target poses during training, while a reprojection loss enforces pixel-aligned Gaussian primitives, providing stronger geometric constraints. We further demonstrate the compatibility of our training framework with different reconstruction architectures, resulting in two model variants. Remarkably, despite the absence of pose supervision, our method achieves state-of-the-art performance in both in-domain and out-of-domain novel view synthesis, even under extreme viewpoint changes and limited image overlap, and surpasses recent methods that rely on geometric supervision for relative pose estimation. By eliminating dependence on ground-truth poses, our method offers the scalability to leverage larger and more diverse datasets. Code and pretrained models will be available on our project page: this https URL.
zh
[CV-103] DT-NeRF: A Diffusion and Transformer-Based Optimization Approach for Neural Radiance Fields in 3D Reconstruction
【速读】:该论文旨在解决3D场景重建中细节恢复不足与多视角一致性差的问题,尤其是在稀疏视角条件下难以保持几何复杂场景的高精度重建。其解决方案的关键在于提出了一种基于扩散模型优化的神经辐射场(Diffusion Model-Optimized Neural Radiance Field, DT-NeRF)方法,通过将扩散模型(Diffusion Model)与Transformer架构相结合,显著提升了细节重建能力和多视角一致性,实验表明该方法在Matterport3D和ShapeNet数据集上优于传统NeRF及其他先进方法,在PSNR、SSIM、Chamfer Distance和Fidelity等指标上均有明显提升。消融实验证实扩散模块与Transformer模块对性能至关重要,二者协同作用实现了更高效且准确的3D场景重建。
链接: https://arxiv.org/abs/2509.17232
作者: Bo Liu,Runlong Li,Li Zhou,Yan Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages
Abstract:This paper proposes a Diffusion Model-Optimized Neural Radiance Field (DT-NeRF) method, aimed at enhancing detail recovery and multi-view consistency in 3D scene reconstruction. By combining diffusion models with Transformers, DT-NeRF effectively restores details under sparse viewpoints and maintains high accuracy in complex geometric scenes. Experimental results demonstrate that DT-NeRF significantly outperforms traditional NeRF and other state-of-the-art methods on the Matterport3D and ShapeNet datasets, particularly in metrics such as PSNR, SSIM, Chamfer Distance, and Fidelity. Ablation experiments further confirm the critical role of the diffusion and Transformer modules in the model’s performance, with the removal of either module leading to a decline in performance. The design of DT-NeRF showcases the synergistic effect between modules, providing an efficient and accurate solution for 3D scene reconstruction. Future research may focus on further optimizing the model, exploring more advanced generative models and network architectures to enhance its performance in large-scale dynamic scenes.
zh
[CV-104] MirrorSAM2: Segment Mirror in Videos with Depth Perception
【速读】:该论文旨在解决RGB-D视频中镜面分割(mirror segmentation)的难题,尤其针对反射模糊(reflection ambiguity)和纹理混淆(texture confusion)等关键挑战。解决方案的核心在于提出MirrorSAM2框架,其关键创新包括:1)深度畸变模块(Depth Warping Module)实现RGB与深度信息的对齐;2)基于深度引导的多尺度点提示生成器(Depth-guided Multi-Scale Point Prompt Generator)实现自动提示生成;3)频域细节注意力融合模块(Frequency Detail Attention Fusion Module)增强结构边界;4)带有可学习镜面标记(learnable mirror token)的镜面掩码解码器,实现精细化分割。通过充分挖掘RGB与深度数据的互补性,MirrorSAM2首次将Segment Anything Model 2(SAM2)扩展至无需人工提示的自动视频镜面分割任务,并在VMD和DVMD基准上取得当前最优性能(SOTA)。
链接: https://arxiv.org/abs/2509.17220
作者: Mingchen Xu,Yukun Lai,Ze Ji,Jing Wu
机构: Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages
Abstract:This paper presents MirrorSAM2, the first framework that adapts Segment Anything Model 2 (SAM2) to the task of RGB-D video mirror segmentation. MirrorSAM2 addresses key challenges in mirror detection, such as reflection ambiguity and texture confusion, by introducing four tailored modules: a Depth Warping Module for RGB and depth alignment, a Depth-guided Multi-Scale Point Prompt Generator for automatic prompt generation, a Frequency Detail Attention Fusion Module to enhance structural boundaries, and a Mirror Mask Decoder with a learnable mirror token for refined segmentation. By fully leveraging the complementarity between RGB and depth, MirrorSAM2 extends SAM2’s capabilities to the prompt-free setting. To our knowledge, this is the first work to enable SAM2 for automatic video mirror segmentation. Experiments on the VMD and DVMD benchmark demonstrate that MirrorSAM2 achieves SOTA performance, even under challenging conditions such as small mirrors, weak boundaries, and strong reflections.
zh
[CV-105] High Resolution UDF Meshing via Iterative Networks NEURIPS2025
【速读】:该论文旨在解决无符号距离场(Unsigned Distance Fields, UDFs)在高分辨率下难以准确三角化为显式网格的问题,尤其是在神经UDF噪声较大时,传统单次遍历方法因缺乏邻域信息而容易出现表面缺失和孔洞。其解决方案的关键在于提出一种迭代式神经网络,通过多轮遍历逐步传播来自更远邻域的空间信息,从而在每个体素内不断优化表面恢复过程;该方法在多轮迭代中整合新检测到的表面、距离值和梯度信息,有效修正错误并稳定复杂区域的提取结果,显著提升重建精度与完整性。
链接: https://arxiv.org/abs/2509.17212
作者: Federico Stella,Nicolas Talabot,Hieu Le,Pascal Fua
机构: CVLab, EPFL (École Polytechnique Fédérale de Lausanne); UNC Charlotte (University of North Carolina at Charlotte)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at NeurIPS 2025
Abstract:Unsigned Distance Fields (UDFs) are a natural implicit representation for open surfaces but, unlike Signed Distance Fields (SDFs), are challenging to triangulate into explicit meshes. This is especially true at high resolutions where neural UDFs exhibit higher noise levels, which makes it hard to capture fine details. Most current techniques perform within single voxels without reference to their neighborhood, resulting in missing surface and holes where the UDF is ambiguous or noisy. We show that this can be remedied by performing several passes and by reasoning on previously extracted surface elements to incorporate neighborhood information. Our key contribution is an iterative neural network that does this and progressively improves surface recovery within each voxel by spatially propagating information from increasingly distant neighbors. Unlike single-pass methods, our approach integrates newly detected surfaces, distance values, and gradients across multiple iterations, effectively correcting errors and stabilizing extraction in challenging regions. Experiments on diverse 3D models demonstrate that our method produces significantly more accurate and complete meshes than existing approaches, particularly for complex geometries, enabling UDF surface extraction at higher resolutions where traditional methods fail.
zh
[CV-106] Point-RTD: Replaced Token Denoising for Pretraining Transformer Models on Point Clouds
【速读】:该论文旨在解决基于Transformer的3D点云任务中,预训练策略对模型性能提升有限的问题。现有方法如PointMAE依赖于掩码重建任务,通过隐藏部分数据进行预测,但这种方式在学习点云结构先验方面效率较低。解决方案的关键在于提出Point-RTD(Replaced Token Denoising),一种基于扰动-重建框架的新预训练策略:它通过替换(corrupt)点云token并利用判别器-生成器架构进行去噪,而非简单的掩码重建,从而更有效地学习点云的结构先验信息,显著提升模型的鲁棒性、收敛速度和下游任务性能。
链接: https://arxiv.org/abs/2509.17207
作者: Gunner Stone,Youngsook Choi,Alireza Tavakkoli,Ankita Shukla
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Pre-training strategies play a critical role in advancing the performance of transformer-based models for 3D point cloud tasks. In this paper, we introduce Point-RTD (Replaced Token Denoising), a novel pretraining strategy designed to improve token robustness through a corruption-reconstruction framework. Unlike traditional mask-based reconstruction tasks that hide data segments for later prediction, Point-RTD corrupts point cloud tokens and leverages a discriminator-generator architecture for denoising. This shift enables more effective learning of structural priors and significantly enhances model performance and efficiency. On the ShapeNet dataset, Point-RTD reduces reconstruction error by over 93% compared to PointMAE, and achieves more than 14x lower Chamfer Distance on the test set. Our method also converges faster and yields higher classification accuracy on ShapeNet, ModelNet10, and ModelNet40 benchmarks, clearly outperforming the baseline Point-MAE framework in every case.
zh
[CV-107] Guided and Unguided Conditional Diffusion Mechanisms for Structured and Semantically-Aware 3D Point Cloud Generation
【速读】:该论文旨在解决现有生成式3D点云方法在语义信息整合方面的不足,即大多数方法仅关注几何结构的生成,而语义信息通常通过外部分割或聚类后处理引入,缺乏与生成过程的深度融合。其解决方案的关键在于提出一种基于扩散模型(diffusion-based framework)的框架,将每个点的语义标签作为条件变量直接嵌入到扩散过程中,从而在生成阶段同步实现几何与语义的一致性合成。该设计使生成的点云不仅结构合理,且具备明确的语义分区,显著提升了生成质量与可控性。
链接: https://arxiv.org/abs/2509.17206
作者: Gunner Stone,Sushmita Sarker,Alireza Tavakkoli
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Generating realistic 3D point clouds is a fundamental problem in computer vision with applications in remote sensing, robotics, and digital object modeling. Existing generative approaches primarily capture geometry, and when semantics are considered, they are typically imposed post hoc through external segmentation or clustering rather than integrated into the generative process itself. We propose a diffusion-based framework that embeds per-point semantic conditioning directly within generation. Each point is associated with a conditional variable corresponding to its semantic label, which guides the diffusion dynamics and enables the joint synthesis of geometry and semantics. This design produces point clouds that are both structurally coherent and segmentation-aware, with object parts explicitly represented during synthesis. Through a comparative analysis of guided and unguided diffusion processes, we demonstrate the significant impact of conditional variables on diffusion dynamics and generation quality. Extensive experiments validate the efficacy of our approach, producing detailed and accurate 3D point clouds tailored to specific parts and features.
zh
[CV-108] Echo-Path: Pathology-Conditioned Echo Video Generation MICCAI
【速读】:该论文旨在解决心血管疾病(CVDs)中某些病理类型(如房间隔缺损 ASD 和肺动脉高压 PAH)的超声心动图数据稀缺问题,从而阻碍了自动化诊断模型的训练与泛化能力。其解决方案的关键在于提出了一种名为 Echo-Path 的新型生成式框架,该框架通过引入病理条件机制(pathology-conditioning mechanism),将特定心脏病理特征编码至最先进的超声心动图视频生成模型中,使生成的视频能够精准呈现目标病理性结构和运动模式。该方法在定量评估中表现出低分布距离,且生成视频具备临床可解释的病理标志,同时基于合成数据训练的分类器在真实数据上具有良好的泛化性能,并能有效提升下游诊断准确率(ASD 提升 7%,PAH 提升 8%)。
链接: https://arxiv.org/abs/2509.17190
作者: Kabir Hamzah Muhammad,Marawan Elbatel,Yi Qin,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 3 figures, MICCAI-AMAI2025 Workshop
Abstract:Cardiovascular diseases (CVDs) remain the leading cause of mortality globally, and echocardiography is critical for diagnosis of both common and congenital cardiac conditions. However, echocardiographic data for certain pathologies are scarce, hindering the development of robust automated diagnosis models. In this work, we propose Echo-Path, a novel generative framework to produce echocardiogram videos conditioned on specific cardiac pathologies. Echo-Path can synthesize realistic ultrasound video sequences that exhibit targeted abnormalities, focusing here on atrial septal defect (ASD) and pulmonary arterial hypertension (PAH). Our approach introduces a pathology-conditioning mechanism into a state-of-the-art echo video generator, allowing the model to learn and control disease-specific structural and motion patterns in the heart. Quantitative evaluation demonstrates that the synthetic videos achieve low distribution distances, indicating high visual fidelity. Clinically, the generated echoes exhibit plausible pathology markers. Furthermore, classifiers trained on our synthetic data generalize well to real data and, when used to augment real training sets, it improves downstream diagnosis of ASD and PAH by 7% and 8% respectively. Code, weights and dataset are available here this https URL
zh
[CV-109] Ambiguous Medical Image Segmentation Using Diffusion Schrödinger Bridge MICCAI2025
【速读】:该论文旨在解决医学图像分割中因病灶边界模糊和掩码(mask)变异性导致的分割精度难题。其解决方案的关键在于提出分割薛定谔桥(Segmentation Schödinger Bridge, SSB),首次将薛定谔桥(Schödinger Bridge)方法引入医学图像分割任务,通过建模图像与掩码之间的联合动态过程来提升分割性能;SSB能够在不依赖额外指导的情况下保持结构完整性并精准勾勒模糊边界,同时利用一种新颖的损失函数维持预测结果的多样性。此外,作者还提出了**多样性差异指数(Diversity Divergence Index, D_DDI)**以量化不同标注者间的变异程度,从而兼顾分割结果的多样性与一致性。
链接: https://arxiv.org/abs/2509.17187
作者: Lalith Bharadwaj Baru,Kamalaker Dadi,Tapabrata Chakraborti,Raju S. Bapi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MICCAI 2025 (11 pages, 2 figures, 1 table, and 26 references)
Abstract:Accurate segmentation of medical images is challenging due to unclear lesion boundaries and mask variability. We introduce \emphSegmentation Schödinger Bridge (SSB), the first application of Schödinger Bridge for ambiguous medical image segmentation, modelling joint image-mask dynamics to enhance performance. SSB preserves structural integrity, delineates unclear boundaries without additional guidance, and maintains diversity using a novel loss function. We further propose the \emphDiversity Divergence Index ( D_DDI ) to quantify inter-rater variability, capturing both diversity and consensus. SSB achieves state-of-the-art performance on LIDC-IDRI, COCA, and RACER (in-house) datasets.
zh
[CV-110] SynergyNet: Fusing Generative Priors and State-Space Models for Facial Beauty Prediction
【速读】:该论文旨在解决面部美感自动化预测任务中模型性能受限的问题,该任务需要同时理解局部美学细节(如皮肤纹理)和全局面部和谐性(如对称性和比例)。现有基于卷积神经网络(Convolutional Neural Networks, CNNs)或视觉Transformer(Vision Transformers, ViTs)的模型存在固有架构偏差:CNNs擅长局部特征提取但难以捕捉长程依赖关系,而ViTs虽能建模全局关系却计算开销巨大。解决方案的关键在于提出一种新颖的双流架构——Mamba-Diffusion Network (MD-Net),其核心创新是将生成式先验与高效序列建模相结合:第一流利用预训练潜在扩散模型中的冻结U-Net编码器提供细粒度美学特征的生成先验;第二流采用视觉状态空间模型(Vision Mamba, Vim)以线性时间复杂度高效捕获全局面部结构;二者通过交叉注意力机制协同融合,构建出一个整体且细腻的特征空间,从而在SCUT-FBP5500基准上实现皮尔逊相关系数0.9235的新最优结果,验证了生成式与序列建模范式融合在复杂视觉评估任务中的巨大潜力。
链接: https://arxiv.org/abs/2509.17172
作者: Djamel Eddine Boukhari
机构: Scientific and Technical Research Centre for Arid Areas, CRSTRA (干旱地区科学技术研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The automated prediction of facial beauty is a benchmark task in affective computing that requires a sophisticated understanding of both local aesthetic details (e.g., skin texture) and global facial harmony (e.g., symmetry, proportions). Existing models, based on either Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs), exhibit inherent architectural biases that limit their performance; CNNs excel at local feature extraction but struggle with long-range dependencies, while ViTs model global relationships at a significant computational cost. This paper introduces the \textbfMamba-Diffusion Network (MD-Net), a novel dual-stream architecture that resolves this trade-off by delegating specialized roles to state-of-the-art models. The first stream leverages a frozen U-Net encoder from a pre-trained latent diffusion model, providing a powerful generative prior for fine-grained aesthetic qualities. The second stream employs a Vision Mamba (Vim), a modern state-space model, to efficiently capture global facial structure with linear-time complexity. By synergistically integrating these complementary representations through a cross-attention mechanism, MD-Net creates a holistic and nuanced feature space for prediction. Evaluated on the SCUT-FBP5500 benchmark, MD-Net sets a new state-of-the-art, achieving a Pearson Correlation of \textbf0.9235 and demonstrating the significant potential of hybrid architectures that fuse generative and sequential modeling paradigms for complex visual assessment tasks.
zh
[CV-111] Beat on Gaze: Learning Stylized Generation of Gaze and Head Dynamics
【速读】:该论文旨在解决当前3D面部动画中头眼动态(head and gaze dynamics)建模不足的问题,尤其是现有方法通常孤立处理面部各组件,忽视了注视方向、头部运动与语音之间的复杂协同关系,同时受限于高质量标注的注视数据集稀缺,难以实现个性化且逼真的 gaze 控制。其解决方案的关键在于提出 StyGazeTalk,一种基于音频驱动的方法,通过多层长短期记忆网络(LSTM)结构结合风格编码器(style encoder),从注视-头部序列中提取说话人特异性的运动特征,从而生成多样化、时序一致且具备风格感知能力的头部与注视动作;此外,研究还构建了一个高精度多模态数据集,包含眼动追踪注视、音频、头部姿态和3D面部参数,为训练和评估头部与注视控制模型提供了重要资源。
链接: https://arxiv.org/abs/2509.17168
作者: Chengwei Shi,Chong Cao,Xin Tong,Xukun Shen
机构: Beihang University (北京航空航天大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: arXiv submission
Abstract:Head and gaze dynamics are crucial in expressive 3D facial animation for conveying emotion and intention. However, existing methods frequently address facial components in isolation, overlooking the intricate coordination between gaze, head motion, and speech. The scarcity of high-quality gaze-annotated datasets hinders the development of data-driven models capable of capturing realistic, personalized gaze control. To address these challenges, we propose StyGazeTalk, an audio-driven method that generates synchronized gaze and head motion styles. We extract speaker-specific motion traits from gaze-head sequences with a multi-layer LSTM structure incorporating a style encoder, enabling the generation of diverse animation styles. We also introduce a high-precision multimodal dataset comprising eye-tracked gaze, audio, head pose, and 3D facial parameters, providing a valuable resource for training and evaluating head and gaze control models. Experimental results demonstrate that our method generates realistic, temporally coherent, and style-aware head-gaze motions, significantly advancing the state-of-the-art in audio-driven facial animation.
zh
[CV-112] SAEC: Scene-Aware Enhanced Edge-Cloud Collaborative Industrial Vision Inspection with Multimodal LLM
【速读】:该论文旨在解决工业视觉检测中高精度与资源受限之间的根本性矛盾:现有方法要么依赖计算成本高昂的多模态大语言模型(Multimodal Large Language Models, MLLMs)以实现强推理能力,要么采用轻量级边缘模型但在复杂缺陷场景下性能不足。解决方案的关键在于提出一种场景感知增强的边缘-云协同框架(Scene-aware Enhanced Edge-Cloud Collaborative framework, SAEC),其核心创新包括:(1) 针对复杂缺陷检测的高效MLLM微调策略;(2) 轻量级多尺度场景复杂度估计机制;(3) 自适应边缘-云调度器。三者协同实现根据场景复杂度动态分配计算任务,在保障检测精度的同时显著降低运行时延和单位正确决策能耗。
链接: https://arxiv.org/abs/2509.17136
作者: Yuhao Tian,Zheming Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 5 pages, 5 figures
Abstract:Industrial vision inspection requires high accuracy under stringent resource constraints, yet existing approaches face a fundamental trade-off. Multimodal LLMs (MLLMs) deliver strong reasoning capabilities but incur prohibitive computational costs, while lightweight edge models often fail on complex cases. In this paper, we present SAEC, a scene-aware enhanced edge-cloud collaborative industrial vision inspection framework with MLLM. The framework is composed of three synergistic components: (1) Efficient MLLM Fine-Tuning for Complex Defect Inspection, (2) Lightweight Multiscale Scene-Complexity Estimation, and (3) Adaptive Edge-Cloud Scheduler. Together, these modules enable robust defect detection by tailoring multimodal reasoning to scene complexity and dynamically balancing computation between edge and cloud resources. Experimental results on MVTec AD and KSDD2 datasets demonstrate that SAEC attains 85.11% and 82.72% accuracy, surpassing Qwen by 22.1% and 20.8%, and LLaVA by 33.3% and 31.6%. It also reduces runtime by up to 22.4% and cuts energy per correct decision by 40%-74%. The code is available at this https URL.
zh
[CV-113] Stencil: Subject-Driven Generation with Context Guidance ICIP2025
【速读】:该论文旨在解决当前文本到图像扩散模型在生成过程中难以保持主体一致性的问题,尤其是小样本微调时面临的质量与效率之间的权衡困境。现有方法要么因微调大型模型导致计算成本过高,要么因微调轻量模型而牺牲图像保真度;同时,对预训练模型进行小样本微调可能破坏原有先验知识,影响生成效果。解决方案的关键在于提出 Stencil 框架,该框架在推理阶段协同使用两个扩散模型:一个轻量级模型用于高效地对目标主体图像进行微调,另一个冻结的大规模预训练模型则提供上下文引导,注入丰富的先验信息以提升生成质量,且开销极低。此设计实现了高保真、新颖的主体驱动图像生成,在不到一分钟内完成,显著优于现有方法。
链接: https://arxiv.org/abs/2509.17120
作者: Gordon Chen,Ziqi Huang,Cheston Tan,Ziwei Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as Spotlight at ICIP 2025
Abstract:Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. One major limitation of current fine-tuning approaches is the inherent trade-off between quality and efficiency. Fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models improves efficiency but compromises image fidelity. Moreover, fine-tuning pre-trained models on a small set of images of the subject can damage the existing priors, resulting in suboptimal results. To this end, we present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.
zh
[CV-114] CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception
【速读】:该论文旨在解决多智能体协同感知中因视角和空间位置差异导致的异构观测问题,现有中间融合方法通常只关注特征相似性而忽视了智能体间的感知多样性。其解决方案的关键在于提出一种基于鸟瞰图(Bird’s Eye View, BEV)空间的协同感知框架 CoBEVMoE,该框架引入动态专家混合(Dynamic Mixture-of-Experts, DMoE)架构,其中每个专家根据特定智能体的输入特征动态生成,从而在捕捉共享语义的同时提取具有区分性的可靠特征;此外,通过设计动态专家度量损失(Dynamic Expert Metric Loss, DEML)增强专家间多样性,提升融合表示的判别能力。
链接: https://arxiv.org/abs/2509.17107
作者: Lingzhao Kong,Jiacheng Lin,Siyu Li,Kai Luo,Zhiyong Li,Kailun Yang
机构: Hunan University (湖南大学); National Engineering Research Center of Robot Visual Perception and Control Technology (国家机器人视觉感知与控制技术工程研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The source code will be made publicly available at this https URL
Abstract:Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird’s Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for Camera-based BEV segmentation by +1.5% on OPV2V and the AP@50 for LiDAR-based 3D object detection by +3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made publicly available at this https URL.
zh
[CV-115] he SAGES Critical View of Safety Challenge: A Global Benchmark for AI-Assisted Surgical Quality Assessment
【速读】:该论文旨在解决外科手术质量评估中因主观性强、一致性差及临床变异性大而导致的AI模型难以部署的问题,尤其聚焦于腹腔镜胆囊切除术中的“关键安全视角”(Critical View of Safety, CVS)这一公认但执行不一的安全步骤。其解决方案的关键在于构建了一个大规模多中心协作框架——EndoGlacier,用于管理异构手术视频数据与多标注者工作流,并通过SAGES组织的CVS挑战赛汇集全球54家机构的数百名临床与工程专家,形成1000段经20位外科专家共识验证标注的数据集,从而实现高精度、低校准误差和强鲁棒性的AI评估模型,相较现有技术提升达17%相对性能改进,为可临床部署的生成式AI(Generative AI)在手术质量评估中的应用提供了方法论基础与实证支持。
链接: https://arxiv.org/abs/2509.17100
作者: Deepak Alapatt,Jennifer Eckhoff,Zhiliang Lyu,Yutong Ban,Jean-Paul Mazellier,Sarah Choksi,Kunyi Yang,2024 CVS Challenge Consortium,Quanzheng Li,Filippo Filicori,Xiang Li,Pietro Mascagni,Daniel A. Hashimoto,Guy Rosman,Ozanan Meireles,Nicolas Padoy
机构: University of Strasbourg, CNRS, INSERM, ICube, UMR7357, Strasbourg, France; IHU Strasbourg, France; Scialytics SAS, France; Massachusetts General Hospital, Harvard Medical School, USA; University Hospital Cologne, Germany; Global College, Shanghai Jiao Tong University, China; Lenox Hill Hospital, Northwell Health, USA; Albany Medical Center, USA; Fondazione Policlinico Universitario A. Gemelli IRCCS, Italy; University of Pennsylvania, USA; Massachusetts Institute of Technology, USA; Duke University, USA
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 18 pages, 10 figures
Abstract:Advances in artificial intelligence (AI) for surgical quality assessment promise to democratize access to expertise, with applications in training, guidance, and accreditation. This study presents the SAGES Critical View of Safety (CVS) Challenge, the first AI competition organized by a surgical society, using the CVS in laparoscopic cholecystectomy, a universally recommended yet inconsistently performed safety step, as an exemplar of surgical quality assessment. A global collaboration across 54 institutions in 24 countries engaged hundreds of clinicians and engineers to curate 1,000 videos annotated by 20 surgical experts according to a consensus-validated protocol. The challenge addressed key barriers to real-world deployment in surgery, including achieving high performance, capturing uncertainty in subjective assessment, and ensuring robustness to clinical variability. To enable this scale of effort, we developed EndoGlacier, a framework for managing large, heterogeneous surgical video and multi-annotator workflows. Thirteen international teams participated, achieving up to a 17% relative gain in assessment performance, over 80% reduction in calibration error, and a 17% relative improvement in robustness over the state-of-the-art. Analysis of results highlighted methodological trends linked to model performance, providing guidance for future research toward robust, clinically deployable AI for surgical quality assessment.
zh
[CV-116] Uncertainty-Supervised Interpretable and Robust Evidential Segmentation
【速读】:该论文旨在解决医学图像分割中不确定性估计缺乏有效监督导致预测结果可解释性和鲁棒性不足的问题。其解决方案的关键在于提出一种自监督方法,通过引入关于不确定性与边界附近图像梯度及噪声关系的三个原则,设计了两种不确定性监督损失函数,从而增强模型预测与人类认知的一致性,并提出了新的定量指标用于评估不确定性估计的可解释性和鲁棒性。
链接: https://arxiv.org/abs/2509.17098
作者: Yuzhu Li,An Sui,Fuping Wu,Xiahai Zhuang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via this https URL.
zh
[CV-117] AlignedGen: Aligning Style Across Generated Images
【速读】:该论文旨在解决扩散模型(Diffusion Models)在使用相同风格提示词(style prompt)生成图像时难以保持风格一致性的问题,尤其针对当前训练-free 方法仅适用于 U-Net 架构、导致图像质量低且存在对象重复等缺陷,以及无法兼容性能更优的 Diffusion Transformer (DiT) 模型的局限性。其解决方案的关键在于:首先揭示了 DiT 中因位置嵌入(position embedding)冲突导致注意力机制共享失效的核心问题,进而提出 Shifted Position Embedding (ShiftPE),通过为每张图像分配非重叠的位置索引来消除冲突;在此基础上进一步设计 Advanced Attention Sharing (AAS),包含三项协同技术以充分释放 DiT 内部注意力共享潜力,并引入高效的查询、键和值特征提取算法,使方法可灵活接入外部图像作为风格参考,从而显著提升生成图像的风格一致性并维持文本到图像的精确对齐。
链接: https://arxiv.org/abs/2509.17088
作者: Jiexuan Zhang,Yiheng Du,Qian Wang,Weiqi Li,Yu Gu,Jian Zhang
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite their generative power, diffusion models struggle to maintain style consistency across images conditioned on the same style prompt, hindering their practical deployment in creative workflows. While several training-free methods attempt to solve this, they are constrained to the U-Net architecture, which not only leads to low-quality results and artifacts like object repetition but also renders them incompatible with superior Diffusion Transformer (DiT). To address these issues, we introduce AlignedGen, a novel training-free framework that enhances style consistency across images generated by DiT models. Our work first reveals a critical insight: naive attention sharing fails in DiT due to conflicting positional signals from improper position embeddings. We introduce Shifted Position Embedding (ShiftPE), an effective solution that resolves this conflict by allocating a non-overlapping set of positional indices to each image. Building on this foundation, we develop Advanced Attention Sharing (AAS), a suite of three techniques meticulously designed to fully unleash the potential of attention sharing within the DiT. Furthermore, to broaden the applicability of our method, we present an efficient query, key, and value feature extraction algorithm, enabling our method to seamlessly incorporate external images as style references. Extensive experimental results validate that our method effectively enhances style consistency across generated images while maintaining precise text-to-image alignment.
zh
[CV-118] SFN-YOLO: Towards Free-Range Poultry Detection via Scale-aware Fusion Networks ICASSP2026
【速读】:该论文旨在解决自由放养环境下家禽检测的难题,主要挑战包括多尺度目标、遮挡以及复杂或动态背景对检测精度的影响。解决方案的关键在于提出一种名为SFN-YOLO的新型检测方法,其核心创新是引入尺度感知融合(scale-aware fusion)机制,通过融合局部细节特征与全局上下文信息,显著提升模型在复杂环境中的检测性能。同时,研究构建了适用于多样化自由放养场景的新数据集M-SCOPE,进一步验证了该方法在保持高精度(mAP达80.7%)和低参数量(仅7.2M)的同时具备良好的跨域泛化能力。
链接: https://arxiv.org/abs/2509.17086
作者: Jie Chen,Yuhong Feng,Tao Dai,Mingzhe Liu,Hongtao Chen,Zhaoxi He,Jiancong Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICASSP 2026
Abstract:Detecting and localizing poultry is essential for advancing smart poultry farming. Despite the progress of detection-centric methods, challenges persist in free-range settings due to multiscale targets, obstructions, and complex or dynamic backgrounds. To tackle these challenges, we introduce an innovative poultry detection approach named SFN-YOLO that utilizes scale-aware fusion. This approach combines detailed local features with broader global context to improve detection in intricate environments. Furthermore, we have developed a new expansive dataset (M-SCOPE) tailored for varied free-range conditions. Comprehensive experiments demonstrate our model achieves an mAP of 80.7% with just 7.2M parameters, which is 35.1% fewer than the benchmark, while retaining strong generalization capability across different domains. The efficient and real-time detection capabilities of SFN-YOLO support automated smart poultry farming. The code and dataset can be accessed at this https URL.
zh
[CV-119] MoCLIP-Lite: Efficient Video Recognition by Fusing CLIP with Motion Vectors
【速读】:该论文旨在解决视频动作识别(video action recognition)中现有模型计算成本高且依赖大规模视频预训练的问题。其解决方案的关键在于提出一种轻量级的双流晚期融合框架 MoCLIP-Lite,该框架将冻结的 CLIP 图像编码器特征与一个在原始运动矢量(motion vectors, MV)上监督训练的轻量网络特征相结合,在融合阶段仅训练一个微小的多层感知机(Multi-Layer Perceptron, MLP)分类头,从而实现高效、高性能的视频理解。此方法显著优于零样本(zero-shot)和纯 MV 基线模型,为视频理解提供了一个高效的新基准。
链接: https://arxiv.org/abs/2509.17084
作者: Binhua Huang,Nan Wang,Arjun Parakash,Soumyabrata Dev
机构: The ADAPT SFI Research Centre(ADAPT国家研究中心); University College Dublin(都柏林大学); Amazon Development Center Germany GmbH(亚马逊德国开发中心); Beijing-Dublin International College(北京-都柏林国际学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 3 figures
Abstract:Video action recognition is a fundamental task in computer vision, but state-of-the-art models are often computationally expensive and rely on extensive video pre-training. In parallel, large-scale vision-language models like Contrastive Language-Image Pre-training (CLIP) offer powerful zero-shot capabilities on static images, while motion vectors (MV) provide highly efficient temporal information directly from compressed video streams. To synergize the strengths of these paradigms, we propose MoCLIP-Lite, a simple yet powerful two-stream late fusion framework for efficient video recognition. Our approach combines features from a frozen CLIP image encoder with features from a lightweight, supervised network trained on raw MV. During fusion, both backbones are frozen, and only a tiny Multi-Layer Perceptron (MLP) head is trained, ensuring extreme efficiency. Through comprehensive experiments on the UCF101 dataset, our method achieves a remarkable 89.2% Top-1 accuracy, significantly outperforming strong zero-shot (65.0%) and MV-only (66.5%) baselines. Our work provides a new, highly efficient baseline for video understanding that effectively bridges the gap between large static models and dynamic, low-cost motion cues. Our code and models are available at this https URL.
zh
[CV-120] HyRF: Hybrid Radiance Fields for Memory-efficient and High-quality Novel View Synthesis
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在场景重建中面临的内存开销大和高频细节建模能力弱的问题。3DGS虽能实现高质量实时新视角合成,但其依赖每个高斯分布的参数来建模视点相关效应与各向异性形状,导致显著的内存消耗;而现有基于神经场的压缩方法难以捕捉高斯属性中的高频空间变化,进而影响精细结构的重建质量。解决方案的关键在于提出混合辐射场(Hybrid Radiance Fields, HyRF),通过将场景分解为两部分:(1) 一组紧凑的显式高斯分布,仅存储关键高频参数;(2) 基于网格的神经场,预测其余属性。此外,引入解耦神经场架构分别建模几何属性(尺度、不透明度、旋转)与视点相关的颜色,并设计混合渲染方案融合高斯溅射与神经场预测的背景,从而在保持实时性能的同时显著降低模型规模(超过20倍)并提升重建质量。
链接: https://arxiv.org/abs/2509.17083
作者: Zipeng Wang,Dan Xu
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, 3D Gaussian Splatting (3DGS) has emerged as a powerful alternative to NeRF-based approaches, enabling real-time, high-quality novel view synthesis through explicit, optimizable 3D Gaussians. However, 3DGS suffers from significant memory overhead due to its reliance on per-Gaussian parameters to model view-dependent effects and anisotropic shapes. While recent works propose compressing 3DGS with neural fields, these methods struggle to capture high-frequency spatial variations in Gaussian properties, leading to degraded reconstruction of fine details. We present Hybrid Radiance Fields (HyRF), a novel scene representation that combines the strengths of explicit Gaussians and neural fields. HyRF decomposes the scene into (1) a compact set of explicit Gaussians storing only critical high-frequency parameters and (2) grid-based neural fields that predict remaining properties. To enhance representational capacity, we introduce a decoupled neural field architecture, separately modeling geometry (scale, opacity, rotation) and view-dependent color. Additionally, we propose a hybrid rendering scheme that composites Gaussian splatting with a neural field-predicted background, addressing limitations in distant scene representation. Experiments demonstrate that HyRF achieves state-of-the-art rendering quality while reducing model size by over 20 times compared to 3DGS and maintaining real-time performance. Our project page is available at this https URL.
zh
[CV-121] A Dual-Modulation Framework for RGB-T Crowd Counting via Spatially Modulated Attention and Adaptive Fusion ICASSP2026
【速读】:该论文旨在解决RGB-Thermal(RGB-T)人群计数中因Transformer模型缺乏空间归纳偏置而导致注意力分散至无关背景区域、从而降低人群定位精度的问题,以及跨模态融合效率低下的挑战。其解决方案的关键在于提出双调制框架(Dual Modulation Framework),包含两个核心模块:Spatially Modulated Attention(SMA)通过可学习的空间衰减掩码(Spatial Decay Mask)对远距离token间的注意力进行惩罚,抑制背景干扰以提升人群定位准确性;Adaptive Fusion Modulation(AFM)则引入动态门控机制,根据输入质量自适应地选择更可靠的模态进行跨模态融合,从而增强多模态信息的协同效应。
链接: https://arxiv.org/abs/2509.17079
作者: Yuhong Feng,Hongtao Chen,Qi Zhang,Jie Chen,Zhaoxi He,Mingzhe Liu,Jianghai Liao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to ICASSP 2026
Abstract:Accurate RGB-Thermal (RGB-T) crowd counting is crucial for public safety in challenging conditions. While recent Transformer-based methods excel at capturing global context, their inherent lack of spatial inductive bias causes attention to spread to irrelevant background regions, compromising crowd localization precision. Furthermore, effectively bridging the gap between these distinct modalities remains a major hurdle. To tackle this, we propose the Dual Modulation Framework, comprising two modules: Spatially Modulated Attention (SMA), which improves crowd localization by using a learnable Spatial Decay Mask to penalize attention between distant tokens and prevent focus from spreading to the background; and Adaptive Fusion Modulation (AFM), which implements a dynamic gating mechanism to prioritize the most reliable modality for adaptive cross-modal fusion. Extensive experiments on RGB-T crowd counting datasets demonstrate the superior performance of our method compared to previous works. Code available at this https URL.
zh
[CV-122] Enhanced Detection of Tiny Objects in Aerial Images
【速读】:该论文旨在解决单阶段目标检测器(如YOLOv8)在航空影像中检测微小目标时性能不足的问题,这主要归因于低分辨率目标和复杂背景干扰。解决方案的关键在于引入三种可集成至YOLOv8的增强策略:输入图像分辨率调整、数据增强以及注意力机制;其中特别设计了基于正交神经模块的混合网络(Mixture of Orthogonal Neural-modules Network, MoonNet),将挤压与激励模块(Squeeze-and-Excitation Block, SE Block)和卷积块注意力模块(Convolutional Block Attention Module, CBAM)嵌入到YOLOv8主干网络中,并增加通道数以提升特征表达能力,从而显著改善微小目标检测精度,且MoonNet在结合YOLC模型后进一步实现了该任务上的最先进性能。
链接: https://arxiv.org/abs/2509.17078
作者: Kihyun Kim,Michalis Lazarou,Tania Stathaki
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce three enhancement strategies – input image resolution adjustment, data augmentation, and attention mechanisms – that can be easily implemented on YOLOv8. We demonstrate that image size enlargement and the proper use of augmentation can lead to enhancement. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of attention-augmented CNNs. Two well-known attention modules, the Squeeze-and-Excitation Block (SE Block) and the Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 with an increased number of channels, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our codes are available at: this https URL
zh
[CV-123] Informative Text-Image Alignment for Visual Affordance Learning with Foundation Models ICRA
【速读】:该论文旨在解决当前基于文本引导的视觉可操作性(affordance)学习方法中,忽视图像与文本特征在特征层面对齐的问题,从而导致利用文本提示识别可操作区域时性能不佳。其解决方案的关键在于提出一个信息约束框架,通过两个核心机制实现跨模态对齐:一是设计了可操作性互信息约束(affordance mutual information constraint),通过最大化输入图像中可操作区域特征与对应文本提示之间的互信息,同步学习适配的文本提示和任务导向的视觉特征;二是引入对象级信息约束(object-level information constraint),最大化给定物体的视觉特征与其类别文本特征间的互信息,从而获得高质量的对象表征,为可操作区域识别提供更可靠的语义先验。实验表明,该方法在AGD20K数据集上实现了单样本(one-shot)可操作性学习的新最佳性能。
链接: https://arxiv.org/abs/2509.17074
作者: Qian Zhang,Lin Zhang,Xing Fang,Mingxin Zhang,Zhiyuan Wei,Ran Song,Wei Zhang
机构: Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Submitted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
Abstract:Visual affordance learning is crucial for robots to understand and interact effectively with the physical world. Recent advances in this field attempt to leverage pre-trained knowledge of vision-language foundation models to learn affordance properties with limited training data, providing a novel paradigm for visual affordance learning. However, these methods overlook the significance of maintaining feature alignment between visual images and language descriptions for identifying affordance areas with textual guidance, and thus may lead to suboptimal results. In this paper, we present an informative framework for text-guided affordance learning, which involves information-based constraints to achieve text-image alignment at feature level. Specifically, we design an affordance mutual information constraint that helps learn appropriate textual prompts and task-oriented visual features simultaneously by maximizing the mutual information between the features of the affordance areas in the input images and the corresponding textual prompts. In addition, we propose an object-level information constraint that maximizes the mutual information between the visual features of a given object and the text features of the category it belongs to. This enables the model to capture high-quality representations for the object, providing more reliable semantic priors for identifying affordance regions. Experimental results on the AGD20K dataset show that the proposed method outperforms existing approaches and achieves the new state-of-the-art in one-shot affordance learning.
zh
[CV-124] CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner MICCAI2025
【速读】:该论文旨在解决现有超声心动图(Echocardiography)中左心室射血分数(LVEF)估算方法依赖大规模标注视频数据集、成本高且跨临床场景适应性差的问题,同时克服当前基于图像到文本预训练的视觉语言模型(如EchoCLIP)在捕捉心脏时序动态和局部结构特征方面的不足。其解决方案的关键在于提出CardiacCLIP框架,通过引入MFL(Multi Frame Learning)注意力机制实现关键帧的选择性融合,并结合EchoZoom多尺度特征提取策略增强心脏结构的空间表征能力,从而显著提升少样本条件下的LVEF预测精度,在EchoNet-Dynamic数据集上1-shot设置下将平均绝对误差(MAE)降低2.07。
链接: https://arxiv.org/abs/2509.17065
作者: Yao Du,Jiarong Guo,Xiaomeng Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by MICCAI 2025
Abstract:Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under 1-shot setting. The code is available at this https URL.
zh
[CV-125] Geodesic Prototype Matching via Diffusion Maps for Interpretable Fine-Grained Recognition
【速读】:该论文旨在解决深度视觉特征中非线性流形(nonlinear manifolds)导致欧氏距离无法准确刻画真实相似性的难题,尤其是在基于原型(prototype-based)的细粒度识别任务中,细微语义差异难以被捕捉。解决方案的关键在于将相似性锚定在深度特征的内在几何结构中:通过蒸馏每类的潜在流形结构到一个扩散空间(diffusion space),并引入可微分的Nyström插值方法,使几何信息对未见样本和可学习原型均可用;同时采用周期性更新的紧凑类内地标集(landmark sets),确保嵌入与不断演化的主干网络对齐,从而实现高效且可扩展的推理。
链接: https://arxiv.org/abs/2509.17050
作者: Junhao Jia,Yunyou Liu,Yifei Sun,Huangwei Chen,Feiwei Qin,Changmiao Wang,Yong Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Nonlinear manifolds are widespread in deep visual features, where Euclidean distances often fail to capture true similarity. This limitation becomes particularly severe in prototype-based interpretable fine-grained recognition, where subtle semantic distinctions are essential. To address this challenge, we propose a novel paradigm for prototype-based recognition that anchors similarity within the intrinsic geometry of deep features. Specifically, we distill the latent manifold structure of each class into a diffusion space and introduce a differentiable Nyström interpolation, making the geometry accessible to both unseen samples and learnable prototypes. To ensure efficiency, we employ compact per-class landmark sets with periodic updates. This design keeps the embedding aligned with the evolving backbone, enabling fast and scalable inference. Extensive experiments on the CUB-200-2011 and Stanford Cars datasets show that our GeoProto framework produces prototypes focusing on semantically aligned parts, significantly outperforming Euclidean prototype networks.
zh
[CV-126] Learning Attribute-Aware Hash Codes for Fine-Grained Image Retrieval via Query Optimization
【速读】:该论文旨在解决细粒度图像哈希(fine-grained hashing)中哈希位与具体视觉属性关联性弱、低比特哈希码优化困难导致检索精度和鲁棒性不足的问题。其解决方案的关键在于引入可学习查询(learnable queries)机制,使每个哈希位能够对应特定的视觉属性,从而提升哈希码的可解释性和语义相关性;同时设计辅助分支以建模高阶属性交互,缓解低比特哈希优化过程中复杂的损失景观问题,显著增强哈希码的区分度与稳定性。
链接: https://arxiv.org/abs/2509.17049
作者: Peng Wang,Yong Li,Lin Zhao,Xiu-Shen Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Fine-grained hashing has become a powerful solution for rapid and efficient image retrieval, particularly in scenarios requiring high discrimination between visually similar categories. To enable each hash bit to correspond to specific visual attributes, we propoe a novel method that harnesses learnable queries for attribute-aware hash codes learning. This method deploys a tailored set of queries to capture and represent nuanced attribute-level information within the hashing process, thereby enhancing both the interpretability and relevance of each hash bit. Building on this query-based optimization framework, we incorporate an auxiliary branch to help alleviate the challenges of complex landscape optimization often encountered with low-bit hash codes. This auxiliary branch models high-order attribute interactions, reinforcing the robustness and specificity of the generated hash codes. Experimental results on benchmark datasets demonstrate that our method generates attribute-aware hash codes and consistently outperforms state-of-the-art techniques in retrieval accuracy and robustness, especially for low-bit hash codes, underscoring its potential in fine-grained image hashing tasks.
zh
[CV-127] AgriDoctor: A Multimodal Intelligent Assistant for Agriculture
【速读】:该论文旨在解决当前作物病害诊断方法在农业场景中面临的局限性,即传统基于图像的单模态模型难以融合领域特定农业知识,且缺乏支持交互式语言理解的能力。为此,作者提出AgriDoctor,一个模块化、可扩展的多模态框架,其关键在于引入代理(agent)驱动的多模态推理机制,整合路由(router)、分类器(classifier)、检测器(detector)、知识检索器(knowledge retriever)和大语言模型(LLMs)五大核心组件,实现对作物健康状况的智能诊断与农业知识的交互式理解。通过构建包含40万张标注病害图像、831条专家整理知识条目及30万条双语指令的AgriMM基准数据集,验证了该方案在细粒度农业任务上的显著性能提升,为智能可持续农业应用树立了新范式。
链接: https://arxiv.org/abs/2509.17044
作者: Mingqing Zhang,Zhuoning Xu,Peijie Wang,Rongji Li,Liang Wang,Qiang Liu,Jian Xu,Xuyao Zhang,Shu Wu,Liang Wang
机构: State Key Laboratory of Multimodal Artificial Intelligence Systems (多模态人工智能系统国家重点实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurate crop disease diagnosis is essential for sustainable agriculture and global food security. Existing methods, which primarily rely on unimodal models such as image-based classifiers and object detectors, are limited in their ability to incorporate domain-specific agricultural knowledge and lack support for interactive, language-based understanding. Recent advances in large language models (LLMs) and large vision-language models (LVLMs) have opened new avenues for multimodal reasoning. However, their performance in agricultural contexts remains limited due to the absence of specialized datasets and insufficient domain adaptation. In this work, we propose AgriDoctor, a modular and extensible multimodal framework designed for intelligent crop disease diagnosis and agricultural knowledge interaction. As a pioneering effort to introduce agent-based multimodal reasoning into the agricultural domain, AgriDoctor offers a novel paradigm for building interactive and domain-adaptive crop health solutions. It integrates five core components: a router, classifier, detector, knowledge retriever and LLMs. To facilitate effective training and evaluation, we construct AgriMM, a comprehensive benchmark comprising 400000 annotated disease images, 831 expert-curated knowledge entries, and 300000 bilingual prompts for intent-driven tool selection. Extensive experiments demonstrate that AgriDoctor, trained on AgriMM, significantly outperforms state-of-the-art LVLMs on fine-grained agricultural tasks, establishing a new paradigm for intelligent and sustainable farming applications.
zh
[CV-128] owards Generalized Synapse Detection Across Invertebrate Species
【速读】:该论文旨在解决神经突触(synapse)在大规模电子显微成像(volume electron microscopy, EM)数据中自动检测的难题,尤其针对标注稀疏、形态变异大及跨数据集域偏移等问题。其核心解决方案是提出SimpSyn——一种基于单阶段残差U-Net架构的轻量化模型,通过预测前后突触位点周围的双通道球形掩码来实现高效且准确的突触定位。该方法强调训练与推理速度以及标注效率,而非复杂的网络结构,并结合简单的后处理策略(如局部峰值检测和距离过滤),在多个果蝇(Drosophila melanogaster)和微蜂(Megaphragma viggianii)数据集上均实现了优于当前最优多任务模型Synful的F1分数,表明结构简洁但任务适配性强的模型可作为大规模连接组学(connectomic)管线中的实用方案。
链接: https://arxiv.org/abs/2509.17041
作者: Samia Mohinta,Daniel Franco-Barranco,Shi Yan Lee,Albert Cardona
机构: MRC Laboratory of Molecular Biology, University of Cambridge, UK (英国医学研究委员会分子生物学实验室,剑桥大学); Department of Physiology, Development and Neuroscience, University of Cambridge, UK (剑桥大学生理学、发育与神经科学系); Donostia International Physics Center (DIPC), San Sebastian, Spain (多诺斯蒂亚国际物理中心,圣塞bastian,西班牙)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Behavioural differences across organisms, whether healthy or pathological, are closely tied to the structure of their neural circuits. Yet, the fine-scale synaptic changes that give rise to these variations remain poorly understood, in part due to persistent challenges in detecting synapses reliably and at scale. Volume electron microscopy (EM) offers the resolution required to capture synaptic architecture, but automated detection remains difficult due to sparse annotations, morphological variability, and cross-dataset domain shifts. To address this, we make three key contributions. First, we curate a diverse EM benchmark spanning four datasets across two invertebrate species: adult and larval Drosophila melanogaster, and Megaphragma viggianii (micro-WASP). Second, we propose SimpSyn, a single-stage Residual U-Net trained to predict dual-channel spherical masks around pre- and post-synaptic sites, designed to prioritize training and inference speeds and annotation efficiency over architectural complexity. Third, we benchmark SimpSyn against Buhmann et al.'s Synful [1], a state-of-the-art multi-task model that jointly infers synaptic pairs. Despite its simplicity, SimpSyn consistently outperforms Synful in F1-score across all volumes for synaptic site detection. While generalization across datasets remains limited, SimpSyn achieves competitive performance when trained on the combined cohort. Finally, ablations reveal that simple post-processing strategies - such as local peak detection and distance-based filtering - yield strong performance without complex test-time heuristics. Taken together, our results suggest that lightweight models, when aligned with task structure, offer a practical and scalable solution for synapse detection in large-scale connectomic pipelines.
zh
[CV-129] From Easy to Hard: The MIR Benchmark for Progressive Interleaved Multi-Image Reasoning
【速读】:该论文旨在解决多图像交织推理(Multi-image Interleaved Reasoning, MIR)问题,即提升多模态大语言模型(Multi-modal Large Language Models, MLLMs)在联合理解与推理多个图像及其交错文本上下文方面的能力。现有基准测试忽视了图像与文本之间的交织关系,无法有效捕捉跨模态关联和复杂场景中的逻辑连接。解决方案的关键在于提出一个新的基准MIR,要求模型对多图像及交错文本进行联合推理,以准确关联图像区域与对应文本并逻辑串联跨图像信息;同时设计了一种分阶段的课程学习策略(stage-wise curriculum learning),遵循“由易到难”的渐进式训练路径,引导模型从简单任务逐步过渡到复杂场景,从而显著提升其在MIR及其他基准上的推理性能。
链接: https://arxiv.org/abs/2509.17040
作者: Hang Du,Jiayang Zhang,Guoshun Nan,Wendi Deng,Zhenyan Chen,Chenyang Zhang,Wang Xiao,Shan Huang,Yuqi Pan,Tao Qi,Sicong Leng
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-image Interleaved Reasoning aims to improve Multi-modal Large Language Models (MLLMs) ability to jointly comprehend and reason across multiple images and their associated textual contexts, introducing unique challenges beyond single-image or non-interleaved multi-image tasks. While current multi-image benchmarks overlook interleaved textual contexts and neglect distinct relationships between individual images and their associated texts, enabling models to reason over multi-image interleaved data may significantly enhance their comprehension of complex scenes and better capture cross-modal correlations. To bridge this gap, we introduce a novel benchmark MIR, requiring joint reasoning over multiple images accompanied by interleaved textual contexts to accurately associate image regions with corresponding texts and logically connect information across images. To enhance MLLMs ability to comprehend multi-image interleaved data, we introduce reasoning steps for each instance within the benchmark and propose a stage-wise curriculum learning strategy. This strategy follows an “easy to hard” approach, progressively guiding models from simple to complex scenarios, thereby enhancing their ability to handle challenging tasks. Extensive experiments benchmarking multiple MLLMs demonstrate that our method significantly enhances models reasoning performance on MIR and other established benchmarks. We believe that MIR will encourage further research into multi-image interleaved reasoning, facilitating advancements in MLLMs capability to handle complex inter-modal this http URL code and dataset are available at this https URL.
zh
[CV-130] Long-Tailed Out-of-Distribution Detection with Refined Separate Class Learning
【速读】:该论文旨在解决长尾分布(long-tailed distribution)场景下,模型在检测分布外(out-of-distribution, OOD)样本时性能下降的问题,特别是由于OOD样本与头部类(head classes)或尾部类(tail classes)之间的混淆导致的误判。现有分离类别学习(Separate Class Learning, SCL)方法受限于静态温度缩放参数(static scaling temperature value)和无信息异常值(uninformative outliers)的影响,难以有效区分OOD样本。为此,作者提出了一种改进方案——精炼分离类别学习(Refined Separate Class Learning, RSCL),其关键创新在于:1)引入动态类级别温度调整机制(dynamic class-wise temperature adjustment),针对每个类别自适应调节温度参数以优化置信度评分;2)通过有信息异常值挖掘(informative outlier mining)识别多样化的异常样本,基于其与头部和尾部类别的亲和性进行区分。实验表明,RSCL不仅显著提升了OOD检测性能,同时增强了对分布内数据的分类准确率。
链接: https://arxiv.org/abs/2509.17034
作者: Shuai Feng,Yuxin Ge,Yuntao Du,Mingcai Chen,Lei Feng
机构: Nanjing Agricultural University (南京农业大学); Nanjing University (南京大学); Shandong University (山东大学); Nanjing University of Posts and Telecommunications (南京邮电大学); Southeast University (东南大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Out-of-distribution (OOD) detection is crucial for deploying robust machine learning models. However, when training data follows a long-tailed distribution, the model’s ability to accurately detect OOD samples is significantly compromised, due to the confusion between OOD samples and head/tail classes. To distinguish OOD samples from both head and tail classes, the separate class learning (SCL) approach has emerged as a promising solution, which separately conduct head-specific and tail-specific class learning. To this end, we examine the limitations of existing works of SCL and reveal that the OOD detection performance is notably influenced by the use of static scaling temperature value and the presence of uninformative outliers. To mitigate these limitations, we propose a novel approach termed Refined Separate Class Learning (RSCL), which leverages dynamic class-wise temperature adjustment to modulate the temperature parameter for each in-distribution class and informative outlier mining to identify diverse types of outliers based on their affinity with head and tail classes. Extensive experiments demonstrate that RSCL achieves superior OOD detection performance while improving the classification accuracy on in-distribution data.
zh
[CV-131] Efficient 3D Scene Reconstruction and Simulation from Sparse Endoscopic Views MICCAI2025 ECAI
【速读】:该论文旨在解决传统手术模拟环境构建方法中存在的效率低、细节不足及真实性差的问题,尤其是在从稀疏的内窥镜数据中重建高保真、可交互的手术场景时面临的挑战。其关键解决方案是提出一种基于高斯泼溅(Gaussian Splatting)的框架,通过引入基于虚拟相机的正则化策略来缓解因内窥镜视角受限导致的过拟合问题,从而提升几何精度;同时,结合深度相关的正则化项优化真实与虚拟视图下的场景结构,并采用基于稀疏控制节点的物质点法(Material Point Method, MPM),在保留物理特性的同时显著降低计算开销,实现快速且真实的实时变形仿真。
链接: https://arxiv.org/abs/2509.17027
作者: Zhenya Yang
机构: The Chinese University of Hong Kong (香港中文大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Workshop Paper of AECAI@MICCAI 2025
Abstract:Surgical simulation is essential for medical training, enabling practitioners to develop crucial skills in a risk-free environment while improving patient safety and surgical outcomes. However, conventional methods for building simulation environments are cumbersome, time-consuming, and difficult to scale, often resulting in poor details and unrealistic simulations. In this paper, we propose a Gaussian Splatting-based framework to directly reconstruct interactive surgical scenes from endoscopic data while ensuring efficiency, rendering quality, and realism. A key challenge in this data-driven simulation paradigm is the restricted movement of endoscopic cameras, which limits viewpoint diversity. As a result, the Gaussian Splatting representation overfits specific perspectives, leading to reduced geometric accuracy. To address this issue, we introduce a novel virtual camera-based regularization method that adaptively samples virtual viewpoints around the scene and incorporates them into the optimization process to mitigate overfitting. An effective depth-based regularization is applied to both real and virtual views to further refine the scene geometry. To enable fast deformation simulation, we propose a sparse control node-based Material Point Method, which integrates physical properties into the reconstructed scene while significantly reducing computational costs. Experimental results on representative surgical data demonstrate that our method can efficiently reconstruct and simulate surgical scenes from sparse endoscopic views. Notably, our method takes only a few minutes to reconstruct the surgical scene and is able to produce physically plausible deformations in real-time with user-defined interactions.
zh
[CV-132] When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration
【速读】:该论文旨在解决恶劣天气图像恢复(Adverse Weather Image Restoration, AWIR)中因天气退化类型不可预测且动态变化而导致的恢复效果不稳定问题。传统任务特定方法难以泛化到未见过的退化类型,而基于提示学习的方法则过度依赖视觉-语言模型的退化估计能力,导致恢复结果不一致。其解决方案的关键在于提出一个名为LCDiff的新框架,包含两个核心组件:Lumina-Chroma分解网络(Lumina-Chroma Decomposition Network, LCDN)和Lumina引导扩散模型(Lumina-Guided Diffusion Model, LGDM)。LCDN在YCbCr色彩空间中将图像分解为与退化相关的亮度分量和与退化无关的色度分量,从而有效缓解天气引起的退化并保持颜色保真度;LGDM进一步利用亮度信息作为引导条件进行图像恢复,无需显式退化提示,并引入动态时间步损失(Dynamic Time Step Loss)以优化去噪网络,实现低频与高频特征的平衡恢复。
链接: https://arxiv.org/abs/2509.17024
作者: Wenxuan Fang,Jili Fan,Chao Wang,Xiantao Hu,Jiangwei Weng,Ying Tai,Jian Yang,Jun Li
机构: Nanjing University of Science and Technology (南京理工大学); Southeast University (东南大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose \textbfLCDiff, a novel framework comprising two key components: \textitLumina-Chroma Decomposition Network (LCDN) and \textitLumina-Guided Diffusion Model (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a \textitDynamic Time Step Loss to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: this https URL.
zh
[CV-133] VAInpaint: Zero-Shot Video-Audio inpainting framework with LLM s-driven Module
【速读】:该论文旨在解决混合音视频内容中精确移除目标物体及其对应音频而不影响场景其他部分的难题,这是多媒体编辑中的关键挑战。解决方案的关键在于提出了一种名为VAInpaint的新颖流程:首先利用分割模型生成掩码以指导视频修复模型去除物体;同时,通过大语言模型(LLM)进行全局场景分析,并结合区域特定模型提供局部描述,二者融合后生成文本查询用于驱动文本驱动的音频分离模型;此外,音频分离模型在自定义数据集(包含分割后的音乐乐器图像与VGGSound背景)上微调,提升了泛化性能。该方法实现了音视频同步修复,在性能上达到当前基准水平。
链接: https://arxiv.org/abs/2509.17022
作者: Kam Man Wu,Zeyue Tian,Liya Ji,Qifeng Chen
机构: 未知
类目: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注:
Abstract:Video and audio inpainting for mixed audio-visual content has become a crucial task in multimedia editing recently. However, precisely removing an object and its corresponding audio from a video without affecting the rest of the scene remains a significant challenge. To address this, we propose VAInpaint, a novel pipeline that first utilizes a segmentation model to generate masks and guide a video inpainting model in removing objects. At the same time, an LLM then analyzes the scene globally, while a region-specific model provides localized descriptions. Both the overall and regional descriptions will be inputted into an LLM, which will refine the content and turn it into text queries for our text-driven audio separation model. Our audio separation model is fine-tuned on a customized dataset comprising segmented MUSIC instrument images and VGGSound backgrounds to enhance its generalization performance. Experiments show that our method achieves performance comparable to current benchmarks in both audio and video inpainting.
zh
[CV-134] DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment
【速读】:该论文旨在解决文档图像质量评估(Document Image Quality Assessment, DIQA)在光学字符识别(OCR)、文档修复及图像处理系统评价等应用中的准确性与效率问题。现有通用图像质量评估(IQA)模型难以有效捕捉文档图像特有的布局特征及其多维质量属性,导致评估性能受限。解决方案的关键在于:首先构建了包含5000张增强文档图像的主观DIQA-5000数据集,涵盖整体质量、清晰度和色彩保真度三个维度的专家评分;其次提出一种无参考(no-reference)DIQA模型,通过引入文档布局特征提取模块,在低分辨率下仍能保持高质量感知能力以降低计算成本;最后设计特征融合模块整合低层与高层视觉特征,并采用独立的质量头预测各维度得分分布,从而实现对文档图像多维质量特性的精细化建模。实验表明,该方法在DIQA-5000及面向OCR准确率的文档图像数据集上均优于当前主流通用IQA模型。
链接: https://arxiv.org/abs/2509.17012
作者: Zhichao Ma,Fan Huang,Lu Zhao,Fengjun Guo,Guangtao Zhai,Xiongkuo Min
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Document image quality assessment (DIQA) is an important component for various applications, including optical character recognition (OCR), document restoration, and the evaluation of document image processing systems. In this paper, we introduce a subjective DIQA dataset DIQA-5000. The DIQA-5000 dataset comprises 5,000 document images, generated by applying multiple document enhancement techniques to 500 real-world images with diverse distortions. Each enhanced image was rated by 15 subjects across three rating dimensions: overall quality, sharpness, and color fidelity. Furthermore, we propose a specialized no-reference DIQA model that exploits document layout features to maintain quality perception at reduced resolutions to lower computational cost. Recognizing that image quality is influenced by both low-level and high-level visual features, we designed a feature fusion module to extract and integrate multi-level features from document images. To generate multi-dimensional scores, our model employs independent quality heads for each dimension to predict score distributions, allowing it to learn distinct aspects of document image quality. Experimental results demonstrate that our method outperforms current state-of-the-art general-purpose IQA models on both DIQA-5000 and an additional document image dataset focused on OCR accuracy.
zh
[CV-135] A Cross-Hierarchical Multi-Feature Fusion Network Based on Multiscale Encoder-Decoder for Hyperspectral Change Detection
【速读】:该论文旨在解决高光谱变化检测(Hyperspectral Change Detection, HCD)中现有方法存在的两个关键问题:一是对多尺度特征利用不足,二是差异特征融合效率低。解决方案的核心在于提出一种基于多尺度编码器-解码器架构的交叉层次多特征融合网络(Cross-Hierarchical Multi-Feature Fusion Network, CHMFFN)。其关键技术包括:1)前端采用带残差连接和双核通道-空间注意力(Dual-Core Channel-Spatial Attention, DCCSA)模块的多尺度特征提取子网络,以捕获光谱-空间-时间特征(Spectral-Spatial-Temporal Features, SSTF);2)设计谱时变化特征学习(Spectral-Temporal Change Feature Learning, STCFL)模块,在不同层级上学习跨时相变化特征,强化时序差异捕捉能力;3)引入自适应高级特征融合(Adaptive Fusion of Advanced Features, AFAF)模块,通过自适应权重动态平衡多层次差异特征,提升复杂变化的表征能力。实验表明,该方法在四个公开高光谱数据集上均优于当前最优方法,验证了其有效性。
链接: https://arxiv.org/abs/2509.16988
作者: Mingshuai Sheng,Bhatti Uzair Aslam,Junfeng Zhang,Siling Feng,Yonis Gulzar
机构: Hainan University (海南大学); King Faisal University (法伊萨尔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Hyperspectral change detection (HCD) aims to accurately identify land-cover changes in hyperspectral images of the same area acquired at different times, with key applications in environmental monitoring and disaster assessment. To address limitations of existing methods, such as insufficient use of multiscale features and low efficiency in differential feature fusion, this paper proposes a cross-hierarchical multi-feature fusion network (CHMFFN) based on a multiscale encoder-decoder architecture. The front-end adopts a multiscale feature extraction subnetwork, built on an encoder-decoder backbone with residual connections and a dual-core channel-spatial attention (DCCSA) module to extract spectral-spatial-temporal features (SSTF). The encoder captures multiscale features from shallow details to deep semantics via residual blocks and convolutional kernels with varying receptive fields. The decoder restores spatial resolution and suppresses noise information through skip connections integrating encoder features. Additionally, a spectral-temporal change feature learning (STCFL) module learns cross-temporal change features at different levels, strengthening inter-temporal difference capture. An adaptive fusion of advanced features (AFAF) module dynamically balances hierarchical differential features via adaptive weights, enhancing representation of complex changes. Experiments on four public hyperspectral datasets show CHMFFN outperforms state-of-the-art methods, verifying its effectiveness.
zh
[CV-136] VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
【速读】:该论文旨在解决自回归图像生成模型(autoregressive image generation models)在生成内容时可能引发的版权侵权和伦理风险问题,特别是针对模型可能生成不适宜内容(Not-Safe-For-Work, NSFW)且现有概念擦除方法难以适配此类模型的问题。解决方案的关键在于提出一种名为视觉对比利用(Visual Contrast Exploitation, VCE)的新框架,其核心创新包括:(1)一种新颖的对比图像对构建范式,能够精确解耦不安全概念与其关联的内容语义;(2)基于直接偏好优化(Direct Preference Optimization, DPO)的训练策略,增强模型识别并利用图像对间视觉对比特征的能力,从而实现精准的概念擦除,同时保持无关安全概念的完整性。
链接: https://arxiv.org/abs/2509.16986
作者: Feng Han,Chao Gong,Zhipeng Wei,Jingjing Chen,Yu-Gang Jiang
机构: Fudan University (复旦大学); UC Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, autoregressive image generation models have wowed audiences with their remarkable capability in creating surprisingly realistic images. Models such as GPT-4o and LlamaGen can not only produce images that faithfully mimic renowned artistic styles like Ghibli, Van Gogh, or Picasso, but also potentially generate Not-Safe-For-Work (NSFW) content, raising significant concerns regarding copyright infringement and ethical use. Despite these concerns, methods to safeguard autoregressive text-to-image models remain underexplored. Previous concept erasure methods, primarily designed for diffusion models that operate in denoising latent space, are not directly applicable to autoregressive models that generate images token by token. To address this critical gap, we propose Visual Contrast Exploitation (VCE), a novel framework comprising: (1) an innovative contrastive image pair construction paradigm that precisely decouples unsafe concepts from their associated content semantics, and (2) a sophisticated DPO-based training approach that enhances the model’s ability to identify and leverage visual contrastive features from image pairs, enabling precise concept erasure. Our comprehensive experiments across three challenging tasks-artist style erasure, explicit content erasure, and object removal-demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts. The code and models are available at this https URL.
zh
[CV-137] Optimal Transport for Handwritten Text Recognition in a Low-Resource Regime
【速读】:该论文旨在解决低资源场景下手写文本识别(Handwritten Text Recognition, HTR)模型训练所需大量标注数据难以获取的问题,尤其针对历史档案或小规模现代语料库等应用场景。其解决方案的关键在于提出一种迭代式自举(iterative bootstrapping)框架,通过最优传输(Optimal Transport, OT)技术将未标注图像的视觉特征与语义词表示进行对齐,从而生成高置信度的伪标签,并逐步扩充训练集以迭代优化识别器性能。该方法仅需少量初始标注样本即可显著提升识别准确率,有效缓解了标注数据稀缺带来的限制。
链接: https://arxiv.org/abs/2509.16977
作者: Petros Georgoulas Wraight,Giorgos Sfikas,Ioannis Kordonis,Petros Maragos,George Retsinas
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Handwritten Text Recognition (HTR) is a task of central importance in the field of document image understanding. State-of-the-art methods for HTR require the use of extensive annotated sets for training, making them impractical for low-resource domains like historical archives or limited-size modern collections. This paper introduces a novel framework that, unlike the standard HTR model paradigm, can leverage mild prior knowledge of lexical characteristics; this is ideal for scenarios where labeled data are scarce. We propose an iterative bootstrapping approach that aligns visual features extracted from unlabeled images with semantic word representations using Optimal Transport (OT). Starting with a minimal set of labeled examples, the framework iteratively matches word images to text labels, generates pseudo-labels for high-confidence alignments, and retrains the recognizer on the growing dataset. Numerical experiments demonstrate that our iterative visual-semantic alignment scheme significantly improves recognition accuracy on low-resource HTR benchmarks.
zh
[CV-138] he 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA ICCV2025
【速读】:该论文针对参考视频目标分割(Referring Video Object Segmentation, RVOS)任务中存在两个关键瓶颈问题:一是稀疏帧采样导致的时序信息不充分,二是依赖单一[SEG]标记对整个视频进行分割,限制了细粒度语义理解。解决方案的核心在于提出SaSaSa2VA方法,通过引入高效的分割增强策略以提升时序建模能力,并采用测试阶段集成(test-time ensembling)机制对多个预测结果进行选择性平均,从而显著改善基于多模态大语言模型(MLLM)的视频分割性能。实验表明,该方法在第7届LSVOS挑战赛(RVOS赛道)中达到67.45的J&F指标,排名第一,验证了其有效性。
链接: https://arxiv.org/abs/2509.16972
作者: Quanzhu Niu,Dengxian Gong,Shihao Chen,Tao Zhang,Yikang Zhou,Haobo Yuan,Lu Qi,Xiangtai Li,Shunping Ji
机构: Wuhan University (武汉大学); University of California, Merced (加州大学默塞德分校); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 1st place report of 7th LSVOS RVOS track in ICCV 2025. The code is released in Sa2VA repository: this https URL
Abstract:Referring video object segmentation (RVOS) requires segmenting and tracking objects in videos conditioned on natural-language expressions, demanding fine-grained understanding of both appearance and motion. Building on Sa2VA, which couples a Multi-modal Large Language Model (MLLM) with the video segmentation model SAM2, we identify two key bottlenecks that limit segmentation performance: sparse frame sampling and reliance on a single [SEG] token for an entire video. We propose Segmentation Augmented and Selective Averaged Sa2VA SaSaSa2VA to address these issues. On the 7th LSVOS Challenge (RVOS track), SaSaSa2VA achieves a J\F of 67.45, ranking first and surpassing the runner-up by 2.80 points. This result and ablation studies demonstrate that efficient segmentation augmentation and test-time ensembling substantially enhance grounded MLLMs for RVOS. The code is released in Sa2VA repository: this https URL.
zh
[CV-139] LLM -Assisted Semantic Guidance for Sparsely Annotated Remote Sensing Object Detection
【速读】:该论文旨在解决遥感目标检测中因标注稀疏(sparse annotation)导致的性能瓶颈问题,尤其是在密集目标分布和类别不平衡场景下,现有基于密集伪标签(Dense Pseudo-Label)的方法仍受限于置信度选择的模糊性和不一致性。其解决方案的关键在于引入大语言模型(Large Language Model, LLM)辅助的语义引导框架,通过LLM生成的语义先验知识来提炼高置信度伪标签;在此基础上提出类感知的密集伪标签分配机制(Class-Aware Dense Pseudo-Label Assignment),实现对未标注及稀疏标注数据的自适应伪标签分配,从而在不同数据分布下提供鲁棒的监督信号;同时设计自适应难负样本重加权模块(Adaptive Hard-Negative Reweighting Module),有效抑制背景干扰信息对监督学习分支的影响,提升模型稳定性与检测精度。
链接: https://arxiv.org/abs/2509.16970
作者: Wei Liao,Chunyan Xu,Chenxu Wang,Zhen Cui
机构: Nanjing University of Science and Technology (南京理工大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Sparse annotation in remote sensing object detection poses significant challenges due to dense object distributions and category imbalances. Although existing Dense Pseudo-Label methods have demonstrated substantial potential in pseudo-labeling tasks, they remain constrained by selection ambiguities and inconsistencies in confidence this http URL this paper, we introduce an LLM-assisted semantic guidance framework tailored for sparsely annotated remote sensing object detection, exploiting the advanced semantic reasoning capabilities of large language models (LLMs) to distill high-confidence this http URL integrating LLM-generated semantic priors, we propose a Class-Aware Dense Pseudo-Label Assignment mechanism that adaptively assigns pseudo-labels for both unlabeled and sparsely labeled data, ensuring robust supervision across varying data distributions. Additionally, we develop an Adaptive Hard-Negative Reweighting Module to stabilize the supervised learning branch by mitigating the influence of confounding background information. Extensive experiments on DOTA and HRSC2016 demonstrate that the proposed method outperforms existing single-stage detector-based frameworks, significantly improving detection performance under sparse annotations.
zh
[CV-140] Penalizing Boundary Activation for Object Completeness in Diffusion Models
【速读】:该论文旨在解决生成式 AI(Generative AI)中扩散模型在文本到图像(Text-to-Image, T2I)生成任务中存在的对象不完整问题,即生成图像中物体出现碎片化或缺失部分,从而影响下游应用性能。研究表明,该问题的主要成因是训练过程中广泛采用的 RandomCrop 数据增强策略破坏了物体在图像中的连续性。解决方案的关键在于提出一种无需重新训练的方法:在去噪过程的早期步骤中对图像边界处的激活值施加惩罚项,从而引导模型生成更完整的物体结构。该方法可直接应用于预训练的 Stable Diffusion 模型,仅需少量修改且计算开销极低,实验证明其能显著提升物体完整性与图像质量。
链接: https://arxiv.org/abs/2509.16968
作者: Haoyang Xu,Tianhao Zhao,Sibei Yang,Yutian Li
机构: Wuhan University (武汉大学); Sun Yat-sen University (中山大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have emerged as a powerful technique for text-to-image (T2I) generation, creating high-quality, diverse images across various domains. However, a common limitation in these models is the incomplete display of objects, where fragments or missing parts undermine the model’s performance in downstream applications. In this study, we conduct an in-depth analysis of the incompleteness issue and reveal that the primary factor behind incomplete object generation is the usage of RandomCrop during model training. This widely used data augmentation method, though enhances model generalization ability, disrupts object continuity during training. To address this, we propose a training-free solution that penalizes activation values at image boundaries during the early denoising steps. Our method is easily applicable to pre-trained Stable Diffusion models with minimal modifications and negligible computational overhead. Extensive experiments demonstrate the effectiveness of our method, showing substantial improvements in object integrity and image quality.
zh
[CV-141] MO R-CNN: Multispectral Oriented R-CNN for Object Detection in Remote Sensing Image
【速读】:该论文旨在解决多光谱遥感图像中定向目标检测(oriented object detection)面临的挑战,特别是不同模态间及模态内部的差异导致的检测精度受限问题。现有方法虽通过复杂网络结构提升了准确性,但计算复杂度和内存消耗过高,限制了实际应用。其解决方案的关键在于提出MO R-CNN框架,包含三个核心组件:异质特征提取网络(Heterogeneous Feature Extraction Network, HFEN),用于自适应对齐、融合与增强多模态特征;单模态监督机制(Single Modality Supervision, SMS),通过约束多尺度特征使模型能从多个模态中有效学习;以及基于条件的多模态标签融合策略(Condition-based Multimodal Label Fusion, CMLF),依据特定规则融合多模态标签以提供更鲁棒且一致的监督信号。该方法在DroneVehicle、VEDAI和OGSOD数据集上验证了有效性。
链接: https://arxiv.org/abs/2509.16957
作者: Leiyu Wang,Biao Jin,Feng Huang,Liqiong Chen,Zhengyong Wang,Xiaohai He,Honggang Chen
机构: Sichuan University (四川大学); Fuzhou University (福州大学); Yunnan University (云南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Oriented object detection for multi-spectral imagery faces significant challenges due to differences both within and between modalities. Although existing methods have improved detection accuracy through complex network architectures, their high computational complexity and memory consumption severely restrict their performance. Motivated by the success of large kernel convolutions in remote sensing, we propose MO R-CNN, a lightweight framework for multi-spectral oriented detection featuring heterogeneous feature extraction network (HFEN), single modality supervision (SMS), and condition-based multimodal label fusion (CMLF). HFEN leverages inter-modal differences to adaptively align, merge, and enhance multi-modal features. SMS constrains multi-scale features and enables the model to learn from multiple modalities. CMLF fuses multimodal labels based on specific rules, providing the model with a more robust and consistent supervisory signal. Experiments on the DroneVehicle, VEDAI and OGSOD datasets prove the superiority of our method. The source code is available at:this https URL.
zh
[CV-142] VidCLearn: A Continual Learning Approach for Text-to-Video Generation
【速读】:该论文旨在解决当前基于扩散模型的文本到视频生成(text-to-video generation)方法在持续学习(continual learning)场景下的局限性,即模型难以在不重新训练的情况下整合新数据。为应对这一问题,作者提出VidCLearn框架,其核心创新在于采用学生-教师架构:学生模型通过增量学习更新以适应新的文本-视频对,而教师模型则通过生成式回放(generative replay)机制保留先前知识;此外,引入新颖的时间一致性损失(temporal consistency loss)以提升运动平滑性,并结合视频检索模块在推理阶段提供结构引导,从而实现高效且高质量的视频生成。
链接: https://arxiv.org/abs/2509.16956
作者: Luca Zanchetta,Lorenzo Papa,Luca Maiano,Irene Amerini
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Text-to-video generation is an emerging field in generative AI, enabling the creation of realistic, semantically accurate videos from text prompts. While current models achieve impressive visual quality and alignment with input text, they typically rely on static knowledge, making it difficult to incorporate new data without retraining from scratch. To address this limitation, we propose VidCLearn, a continual learning framework for diffusion-based text-to-video generation. VidCLearn features a student-teacher architecture where the student model is incrementally updated with new text-video pairs, and the teacher model helps preserve previously learned knowledge through generative replay. Additionally, we introduce a novel temporal consistency loss to enhance motion smoothness and a video retrieval module to provide structural guidance at inference. Our architecture is also designed to be more computationally efficient than existing models while retaining satisfactory generation performance. Experimental results show VidCLearn’s superiority over baseline methods in terms of visual quality, semantic alignment, and temporal coherence.
zh
[CV-143] Leverag ing RGB Images for Pre-Training of Event-Based Hand Pose Estimation
【速读】:该论文旨在解决事件相机(event camera)在3D手部姿态估计任务中因标注数据稀缺而导致模型性能受限的问题。现有方法难以利用大量未标注的事件数据进行有效预训练,尤其在处理动态运动的手部时,传统伪事件生成技术因假设场景静止而失效。解决方案的关键在于提出一种新颖的预训练方法RPEP(Real-to-Pseudo Event Pre-training),其核心创新包括:1)将手部运动分解为一系列小步长的运动片段,以更精确地建模动态关节变化并生成逼真的事件数据;2)引入运动反转约束(motion reversal constraint),通过反向运动对事件生成过程进行正则化,提升生成事件的合理性与一致性。实验表明,该方法在真实事件数据集EvRealHands上相较最先进方法提升达24%,且在少量标注样本下即可实现高效微调,具备良好的实际部署潜力。
链接: https://arxiv.org/abs/2509.16949
作者: Ruicong Liu,Takehiko Ohkawa,Tze Ho Elden Tse,Mingfang Zhang,Angela Yao,Yoichi Sato
机构: The University of Tokyo (东京大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper presents RPEP, the first pre-training method for event-based 3D hand pose estimation using labeled RGB images and unpaired, unlabeled event data. Event data offer significant benefits such as high temporal resolution and low latency, but their application to hand pose estimation is still limited by the scarcity of labeled training data. To address this, we repurpose real RGB datasets to train event-based estimators. This is done by constructing pseudo-event-RGB pairs, where event data is generated and aligned with the ground-truth poses of RGB images. Unfortunately, existing pseudo-event generation techniques assume stationary objects, thus struggling to handle non-stationary, dynamically moving hands. To overcome this, RPEP introduces a novel generation strategy that decomposes hand movements into smaller, step-by-step motions. This decomposition allows our method to capture temporal changes in articulation, constructing more realistic event data for a moving hand. Additionally, RPEP imposes a motion reversal constraint, regularizing event generation using reversed motion. Extensive experiments show that our pre-trained model significantly outperforms state-of-the-art methods on real event data, achieving up to 24% improvement on EvRealHands. Moreover, it delivers strong performance with minimal labeled samples for fine-tuning, making it well-suited for practical deployment.
zh
[CV-144] Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在执行细粒度感知任务时,因处理高分辨率图像导致的计算开销过大问题。现有方法通常依赖大规模标注数据(训练型)或利用模型内部注意力机制(无监督型),但前者需昂贵标注成本,后者则效率低下且精度不足。其核心解决方案是提出一种无需标注的自蒸馏区域提议网络(Self-Distilled Region Proposal Network, SD-RPN),关键在于通过一个信号去噪与歧义消解的流水线,将MLLM中间层噪声注意力图转化为高质量伪区域感兴趣(Region-of-Interest, RoI)标签,并以此训练轻量级RPN模型,实现单次前向传播即可高效精准定位RoI,从而解耦区域识别与自回归生成过程,显著提升计算效率与泛化性能。
链接: https://arxiv.org/abs/2509.16944
作者: Yuheng Shi,Xiaohuan Pei,Minjing Dong,Chang Xu
机构: University of Sydney (悉尼大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 19 pages, 5 figures
Abstract:Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model’s internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM’s middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM’s middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass this http URL validate our approach, we integrate the framework into the LLaVA-1.5 architecture. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at this https URL.
zh
[CV-145] Prototype-Based Pseudo-Label Denoising for Source-Free Domain Adaptation in Remote Sensing Semantic Segmentation
【速读】:该论文旨在解决源域无标签目标域(Source-Free Domain Adaptation, SFDA)场景下,由于目标域缺乏真实标签导致伪标签噪声严重,进而阻碍域偏移(Domain Shift, DS)有效缓解的问题。解决方案的关键在于提出ProSFDA框架,其核心创新包括:1)采用原型加权的伪标签策略,提升在伪标签噪声环境下自训练(Self-Training, ST)的可靠性;2)引入原型对比策略,通过增强同类别特征聚合来学习判别性目标域表示,从而在无需真实标注的情况下实现更鲁棒的域适应。
链接: https://arxiv.org/abs/2509.16942
作者: Bin Wang,Fei Deng,Zeyu Chen,Zhicheng Yu,Yiguang Liu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Source-Free Domain Adaptation (SFDA) enables domain adaptation for semantic segmentation of Remote Sensing Images (RSIs) using only a well-trained source model and unlabeled target domain data. However, the lack of ground-truth labels in the target domain often leads to the generation of noisy pseudo-labels. Such noise impedes the effective mitigation of domain shift (DS). To address this challenge, we propose ProSFDA, a prototype-guided SFDA framework. It employs prototype-weighted pseudo-labels to facilitate reliable self-training (ST) under pseudo-labels noise. We, in addition, introduce a prototype-contrast strategy that encourages the aggregation of features belonging to the same class, enabling the model to learn discriminative target domain representations without relying on ground-truth supervision. Extensive experiments show that our approach substantially outperforms existing methods.
zh
[CV-146] Parameter-efficient fine-tuning (PEFT) of Vision Foundation Models for Atypical Mitotic Figure Classification
【速读】:该论文旨在解决非典型有丝分裂象(Atypical Mitotic Figures, AMFs)分类难题,其核心挑战在于形态学特征细微、样本类别分布极度不平衡以及病理学家间判读一致性差。解决方案的关键在于利用大型视觉基础模型(Vision Foundation Models)结合低秩适配(Low-Rank Adaptation, LoRA)策略进行参数高效微调,从而在有限标注数据下实现高精度分类。实验表明,采用Virchow模型与LoRA rank=8的配置,并通过三折交叉验证集成方法,可在预测试集上达到88.37%的平衡准确率,验证了该方法在AMF识别任务中的有效性与鲁棒性。
链接: https://arxiv.org/abs/2509.16935
作者: Lavish Ramchandani,Gunjan Deotale,Dev Kumar Das
机构: Aira Matrix Private Limited(艾拉矩阵私人有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MIDOG’25
Abstract:Atypical mitotic figures (AMFs) are rare abnormal cell divisions associated with tumor aggressiveness and poor prognosis. Their detection remains a significant challenge due to subtle morphological cues, class imbalance, and inter-observer variability among pathologists. The MIDOG 2025 challenge introduced a dedicated track for atypical mitosis classification, enabling systematic evaluation of deep learning methods. In this study, we investigated the use of large vision foundation models, including Virchow, Virchow2, and UNI, with Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. We conducted extensive experiments with different LoRA ranks, as well as random and group-based data splits, to analyze robustness under varied conditions. Our best approach, Virchow with LoRA rank 8 and ensemble of three-fold cross-validation, achieved a balanced accuracy of 88.37% on the preliminary test set, ranking joint 9th in the challenge leaderboard. These results highlight the promise of foundation models with efficient adaptation strategies for the classification of atypical mitosis, while underscoring the need for improvements in specificity and domain generalization.
zh
[CV-147] SLAM-Former: Putting SLAM into One Transformer
【速读】:该论文旨在解决传统SLAM(Simultaneous Localization and Mapping,即时定位与地图构建)系统中前端(frontend)与后端(backend)模块分离导致的协同效率低、几何一致性难以保障的问题。解决方案的关键在于提出了一种名为SLAM-Former的新颖神经架构,将完整的SLAM能力整合进单一Transformer模型中,使前端实时处理单目图像序列进行增量式建图与跟踪,后端则通过全局优化提升几何一致性,二者交替执行并相互促进,从而显著提升整体系统性能。
链接: https://arxiv.org/abs/2509.16909
作者: Yijun Yuan,Zhuoguang Chen,Kenan Li,Weibang Wang,Hang Zhao
机构: IIIS, Tsinghua University (清华大学交叉信息研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Project Page: this https URL
Abstract:We present SLAM-Former, a novel neural approach that integrates full SLAM capabilities into a single transformer. Similar to traditional SLAM systems, SLAM-Former comprises both a frontend and a backend that operate in tandem. The frontend processes sequential monocular images in real-time for incremental mapping and tracking, while the backend performs global refinement to ensure a geometrically consistent result. This alternating execution allows the frontend and backend to mutually promote one another, enhancing overall system performance. Comprehensive experimental results demonstrate that SLAM-Former achieves superior or highly competitive performance compared to state-of-the-art dense SLAM methods.
zh
[CV-148] ME-Mamba: Multi-Expert Mamba with Efficient Knowledge Capture and Fusion for Multimodal Survival Analysis
【速读】:该论文旨在解决癌症生存分析中因病理图像(WSI)仅提供切片级标签而导致难以学习判别性特征的问题,同时探索如何高效融合病理图像与基因组数据以提升预测准确性。其解决方案的关键在于提出一种多专家Mamba(ME-Mamba)系统,该系统由病理专家(Pathology Expert)、基因组专家(Genomics Expert)和协同专家(Synergistic Expert)组成:前者分别利用Mamba架构中的常规扫描与注意力机制提取长序列中的判别特征;后者通过最优传输(Optimal Transport)显式学习模态间token级局部对应关系,并基于最大均值差异(Maximum Mean Discrepancy, MMD)的全局跨模态融合损失隐式增强分布一致性,从而实现无信息丢失的互补信息融合,最终在TCGA五个数据集上实现高精度且计算复杂度较低的生存分析。
链接: https://arxiv.org/abs/2509.16900
作者: Chengsheng Zhang,Linhao Qu,Xiaoyu Liu,Zhijian Song
机构: Fudan University (复旦大学); Shanghai Key Lab of Medical Image Computing and Computer Assisted Intervention (上海市医学图像计算与计算机辅助干预重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Survival analysis using whole-slide images (WSIs) is crucial in cancer research. Despite significant successes, pathology images typically only provide slide-level labels, which hinders the learning of discriminative representations from gigapixel WSIs. With the rapid advancement of high-throughput sequencing technologies, multimodal survival analysis integrating pathology images and genomics data has emerged as a promising approach. We propose a Multi-Expert Mamba (ME-Mamba) system that captures discriminative pathological and genomic features while enabling efficient integration of both modalities. This approach achieves complementary information fusion without losing critical information from individual modalities, thereby facilitating accurate cancer survival analysis. Specifically, we first introduce a Pathology Expert and a Genomics Expert to process unimodal data separately. Both experts are designed with Mamba architectures that incorporate conventional scanning and attention-based scanning mechanisms, allowing them to extract discriminative features from long instance sequences containing substantial redundant or irrelevant information. Second, we design a Synergistic Expert responsible for modality fusion. It explicitly learns token-level local correspondences between the two modalities via Optimal Transport, and implicitly enhances distribution consistency through a global cross-modal fusion loss based on Maximum Mean Discrepancy. The fused feature representations are then passed to a mamba backbone for further integration. Through the collaboration of the Pathology Expert, Genomics Expert, and Synergistic Expert, our method achieves stable and accurate survival analysis with relatively low computational complexity. Extensive experimental results on five datasets in The Cancer Genome Atlas (TCGA) demonstrate our state-of-the-art performance.
zh
[CV-149] PRISM: Precision-Recall Informed Data-Free Knowledge Distillation via Generative Diffusion
【速读】:该论文旨在解决数据无知识蒸馏(Data-free Knowledge Distillation, DFKD)在处理大规模图像时存在的模式崩溃(mode collapse)问题,即现有方法在合成大尺度图像时难以有效保留真实数据分布的多样性,导致知识迁移受限。其解决方案的关键在于提出PRISM方法,通过两个核心机制实现对合成数据的精度(precision)与召回率(recall)的协同优化:一是引入能量引导的分布对齐(Energy-guided Distribution Alignment),以避免生成分布外(out-of-distribution)样本;二是设计多样化的提示工程(Diversified Prompt Engineering),以增强对真实数据流形(in-distribution manifold)的覆盖。实验表明,PRISM在多个大规模图像数据集上均优于现有方法,并显著提升学生模型的领域泛化能力。
链接: https://arxiv.org/abs/2509.16897
作者: Xuewan He,Jielei Wang,Zihan Cheng,Yuchen Su,Shiyue Huang,Guoming Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. While existing methods perform well on small-scale images, they suffer from mode collapse when synthesizing large-scale images, resulting in limited knowledge transfer. Recently, leveraging advanced generative models to synthesize photorealistic images has emerged as a promising alternative. Nevertheless, directly using off-the-shelf diffusion to generate datasets faces the precision-recall challenges: 1) ensuring synthetic data aligns with the real distribution, and 2) ensuring coverage of the real ID manifold. In response, we propose PRISM, a precision-recall informed synthesis method. Specifically, we introduce Energy-guided Distribution Alignment to avoid the generation of out-of-distribution samples, and design the Diversified Prompt Engineering to enhance coverage of the real ID manifold. Extensive experiments on various large-scale image datasets demonstrate the superiority of PRISM. Moreover, we demonstrate that models trained with PRISM exhibit strong domain generalization.
zh
[CV-150] Learning from Gene Names Expression Values and Images: Contrastive Masked Text-Image Pretraining for Spatial Transcriptomics Representation Learning
【速读】:该论文旨在解决空间转录组学(spatial transcriptomics)中跨模态预训练方法存在的局限性问题,即现有方法仅依赖基因名称或表达值单独进行建模,导致基因分支丧失关键语义信息,并破坏基因与其定量数值之间的关联;同时,受限于图像-文本对齐的监督方式,忽略了对学习鲁棒图像特征至关重要的内在视觉线索。解决方案的关键在于提出首个对比掩码文本-图像预训练框架 CoMTIP,其核心创新包括:视觉分支采用掩码特征建模(Masked Feature Modeling)重建遮挡图像块以学习上下文感知的图像嵌入;文本分支引入可扩展的基因-文本编码器(Gene-Text Encoder),并行处理所有基因句子,通过专用嵌入增强每个基因及其数值的语义表示,并结合配对感知对抗训练(Pair-aware Adversarial Training, PAAT)确保基因与表达值的正确对应关系;最终在共享的 InfoNCE 优化空间中对齐图像与文本表征,从而实现更通用、细粒度的跨模态理解能力,显著提升下游任务性能并首次实现零样本基因表达预测。
链接: https://arxiv.org/abs/2509.16892
作者: Jiahe Qian,Yaoyu Fang,Ziqiao Weng,Xinkun Wang,Lee A. Cooper,Bo Zhou
机构: 1. University of Science and Technology of China (中国科学技术大学); 2. Tsinghua University (清华大学); 3. University of Oxford (牛津大学); 4. Alibaba Cloud (阿里云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 3 figures
Abstract:Spatial transcriptomics aims to connect high-resolution histology images with spatially resolved gene expression. To achieve better performance on downstream tasks such as gene expression prediction, large-scale pre-training is required to obtain generalisable representations that can bridge histology and transcriptomics across tissues, protocols, and laboratories. Existing cross-modal pre-training approaches for spatial transcriptomics rely on either gene names or expression values in isolation, which strips the gene branch of essential semantics and breaks the association between each gene and its quantitative magnitude. In addition, by restricting supervision to image-text alignment, these methods ignore intrinsic visual cues that are critical for learning robust image features. We present CoMTIP, the first Contrastive Masked Text-Image Pretraining framework that jointly learns from images, gene names, and expression values while capturing fine-grained visual context for spatial transcriptomics. The vision branch uses Masked Feature Modeling to reconstruct occluded patches and learn context-aware image embeddings. The text branch applies a scalable Gene-Text Encoder that processes all gene sentences in parallel, enriches each gene and its numerical value with dedicated embeddings, and employs Pair-aware Adversarial Training (PAAT) to preserve correct gene-value associations. Image and text representations are aligned in a shared InfoNCE-optimised space. Experiments on public spatial transcriptomics datasets show that CoMTIP not only surpasses previous methods on diverse downstream tasks but also achieves zero-shot gene expression prediction, a capability that existing approaches do not provide.
zh
[CV-151] Rethinking Evaluation of Infrared Small Target Detection NEURIPS2025
【速读】:该论文旨在解决红外小目标检测(Infrared Small Target Detection, IRSTD)领域当前评估协议存在的三大关键问题:一是现有方法依赖碎片化的像素级和目标级指标,难以全面反映模型性能;二是过度关注整体性能分数,忽视了对错误模式的系统性分析,限制了对实际系统性能提升的指导意义;三是普遍采用数据集特定的训练-测试范式,导致模型鲁棒性和跨场景泛化能力难以准确评估。解决方案的关键在于:提出一种融合像素级与目标级性能的混合层级评价指标(hybrid-level metric),构建系统性的错误分析方法,并强调跨数据集评估的重要性,从而建立更全面、合理的分层分析框架,推动更具有效性和鲁棒性的IRSTD模型发展。
链接: https://arxiv.org/abs/2509.16888
作者: Youwei Pang,Xiaoqi Zhao,Lihe Zhang,Huchuan Lu,Georges El Fakhri,Xiaofeng Liu,Shijian Lu
机构: Dalian University of Technology (大连理工大学); Yale University (耶鲁大学); Nanyang Technology University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025; Evaluation Toolkit: this https URL
Abstract:As an essential vision task, infrared small target detection (IRSTD) has seen significant advancements through deep learning. However, critical limitations in current evaluation protocols impede further progress. First, existing methods rely on fragmented pixel- and target-level specific metrics, which fails to provide a comprehensive view of model capabilities. Second, an excessive emphasis on overall performance scores obscures crucial error analysis, which is vital for identifying failure modes and improving real-world system performance. Third, the field predominantly adopts dataset-specific training-testing paradigms, hindering the understanding of model robustness and generalization across diverse infrared scenarios. This paper addresses these issues by introducing a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation. These aim to offer a more thorough and rational hierarchical analysis framework, ultimately fostering the development of more effective and robust IRSTD models. An open-source toolkit has be released to facilitate standardized benchmarking.
zh
[CV-152] SAM-DCE: Addressing Token Uniformity and Semantic Over-Smoothing in Medical Segmentation
【速读】:该论文旨在解决生成式 AI (Generative AI) 在医学影像分割任务中因领域偏移(domain shift)、解剖结构变异以及依赖用户提示(prompt)所带来的性能下降问题。现有无需提示的改进方法虽减少了专家干预,但仍存在鲁棒性不足、适应能力弱的问题,尤其忽视了语义过平滑(semantic over-smoothing)和token均匀性(token uniformity)带来的表征退化。论文提出的SAM-DCE解决方案的关键在于通过平衡局部判别性与全局语义一致性,有效缓解token均匀性问题,提升类别间可分性,并在掩码解码阶段引入细粒度且一致的特征表示,从而显著增强模型在多样化医学基准上的分割性能。
链接: https://arxiv.org/abs/2509.16886
作者: Yingzhen Hu,Yiheng Zhong,Ruobing Li,Yingxue Su,Jiabao An,Feilong Tang,Jionglong Su,Imran Razzak
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Segment Anything Model (SAM) demonstrates impressive zero-shot segmentation ability on natural images but encounters difficulties in medical imaging due to domain shifts, anatomical variability, and its reliance on user-provided prompts. Recent prompt-free adaptations alleviate the need for expert intervention, yet still suffer from limited robustness and adaptability, often overlooking the issues of semantic over-smoothing and token uniformity. We propose SAM-DCE, which balances local discrimination and global semantics while mitigating token uniformity, enhancing inter-class separability, and enriching mask decoding with fine-grained, consistent representations. Extensive experiments on diverse medical benchmarks validate its effectiveness.
zh
[CV-153] owards Interpretable and Efficient Attention: Compressing All by Contracting a Few NEURIPS2025
【速读】:该论文旨在解决Transformer中注意力机制的两个核心问题:一是其前向传播过程中的优化目标不明确,二是自注意力机制存在的二次时间复杂度限制了模型在长序列任务中的可扩展性。解决方案的关键在于提出一个统一的优化目标,通过该目标的展开(unrolling)推导出一种内在可解释且高效的注意力机制——Contract-and-Broadcast Self-Attention (CBSA)。该机制通过将所有token压缩为少量代表性token的低维结构,再将其广播回原始序列,从而实现线性复杂度,并能涵盖现有注意力机制作为特例,同时在多个视觉任务上表现出与现有方法相当甚至更优的性能。
链接: https://arxiv.org/abs/2509.16875
作者: Qishuai Wen,Zhiyuan Huang,Chun-Guang Li
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025 Spotlight
Abstract:Attention mechanisms in Transformers have gained significant empirical success. Nonetheless, the optimization objectives underlying their forward pass are still unclear. Additionally, the quadratic complexity of self-attention is increasingly prohibitive. Unlike the prior work on addressing the interpretability or efficiency issue separately, we propose a unified optimization objective to alleviate both issues simultaneously. By unrolling the optimization over the objective, we derive an inherently interpretable and efficient attention mechanism, which compresses all tokens into low-dimensional structures by contracting a few representative tokens and then broadcasting the contractions back. This Contract-and-Broadcast Self-Attention (CBSA) mechanism can not only scale linearly but also generalize existing attention mechanisms as its special cases. Experiments further demonstrate comparable performance and even superior advantages of CBSA on several visual tasks. Code is available at this https URL.
zh
[CV-154] mathttM3VIR: A Large-Scale Multi-Modality Multi-View Synthesized Benchmark Dataset for Image Restoration and Content Creation
【速读】:该论文旨在解决当前游戏与娱乐领域中生成式 AI (Generative AI) 模型训练所面临的两大核心问题:一是现有数据集普遍局限于特定领域或依赖人工退化处理,无法真实反映游戏内容的独特特性;二是缺乏针对可控视频生成(controllable video generation)的基准评测体系。其解决方案的关键在于提出一个大规模、多模态、多视角的游戏数据集 \mathttM^3VIR,该数据集基于 Unreal Engine 5 渲染生成高质量、高保真度的低分辨率-高分辨率(LR-HR)配对图像及多视角帧,覆盖80个场景和8类游戏内容,并进一步细分为 \mathttM^3VIR_MR(用于超分辨率和新视角合成任务)与 \mathttM^3VIR_MS(首个支持对象级控制的多风格真值数据集),从而为可控视频生成提供首个系统性基准,推动云游戏与娱乐场景下 AI 驱动的内容恢复、压缩与可控生成研究发展。
链接: https://arxiv.org/abs/2509.16873
作者: Yuanzhi Li,Lebin Zhou,Nam Ling,Zhenghao Chen,Wei Wang,Wei Jiang
机构: Santa Clara University (圣克拉拉大学); University of Newcastle (纽卡斯尔大学); Futurewei Technologies, Inc. (未来wei技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The gaming and entertainment industry is rapidly evolving, driven by immersive experiences and the integration of generative AI (GAI) technologies. Training such models effectively requires large-scale datasets that capture the diversity and context of gaming environments. However, existing datasets are often limited to specific domains or rely on artificial degradations, which do not accurately capture the unique characteristics of gaming content. Moreover, benchmarks for controllable video generation remain absent. To address these limitations, we introduce \mathttM^3VIR , a large-scale, multi-modal, multi-view dataset specifically designed to overcome the shortcomings of current resources. Unlike existing datasets, \mathttM^3VIR provides diverse, high-fidelity gaming content rendered with Unreal Engine 5, offering authentic ground-truth LR-HR paired and multi-view frames across 80 scenes in 8 categories. It includes \mathttM^3VIR_MR for super-resolution (SR), novel view synthesis (NVS), and combined NVS+SR tasks, and \mathttM^3VIR_MS , the first multi-style, object-level ground-truth set enabling research on controlled video generation. Additionally, we benchmark several state-of-the-art SR and NVS methods to establish performance baselines. While no existing approaches directly handle controlled video generation, \mathttM^3VIR provides a benchmark for advancing this area. By releasing the dataset, we aim to facilitate research in AI-powered restoration, compression, and controllable content generation for next-generation cloud gaming and entertainment. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.16873 [cs.CV] (or arXiv:2509.16873v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.16873 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-155] PhysHDR: When Lighting Meets Materials and Scene Geometry in HDR Reconstruction
【速读】:该论文旨在解决低动态范围(Low Dynamic Range, LDR)到高动态范围(High Dynamic Range, HDR)图像转换中的重建质量受限问题,其核心挑战在于现有数据驱动方法缺乏对光照、阴影及场景几何结构的显式建模,尤其在不同材质表面(如镜面反射材料与朗伯表面)的光照交互建模方面存在不足。解决方案的关键在于提出PhysHDR,一个基于潜在扩散(latent diffusion)的生成模型,通过在去噪过程中引入光照和深度信息作为条件,并设计一种新型损失函数以显式融合场景中各表面的材质特性(如镜面反射与漫反射属性),从而显著提升HDR图像重建的质量与物理一致性。
链接: https://arxiv.org/abs/2509.16869
作者: Hrishav Bakul Barua,Kalin Stefanov,Ganesh Krishnasamy,KokSheik Wong,Abhinav Dhall
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM); Image and Video Processing (eess.IV)
备注: Submitted to IEEE
Abstract:Low Dynamic Range (LDR) to High Dynamic Range (HDR) image translation is a fundamental task in many computational vision problems. Numerous data-driven methods have been proposed to address this problem; however, they lack explicit modeling of illumination, lighting, and scene geometry in images. This limits the quality of the reconstructed HDR images. Since lighting and shadows interact differently with different materials, (e.g., specular surfaces such as glass and metal, and lambertian or diffuse surfaces such as wood and stone), modeling material-specific properties (e.g., specular and diffuse reflectance) has the potential to improve the quality of HDR image reconstruction. This paper presents PhysHDR, a simple yet powerful latent diffusion-based generative model for HDR image reconstruction. The denoising process is conditioned on lighting and depth information and guided by a novel loss to incorporate material properties of surfaces in the scene. The experimental results establish the efficacy of PhysHDR in comparison to a number of recent state-of-the-art methods.
zh
[CV-156] ConfidentSplat: Confidence-Weighted Depth Fusion for Accurate 3D Gaussian Splatting SLAM
【速读】:该论文旨在解决现有仅使用RGB图像的3D高斯溅射(3DGS)SLAM方法中因深度估计不可靠而导致几何精度不足的问题。其解决方案的关键在于引入一种置信度加权融合机制(confidence-weighted fusion mechanism),该机制通过多视角几何一致性与学习到的单目先验(Omnidata ViT)相结合,动态调整二者贡献权重,从而生成高保真度的代理深度图用于地图监督优化。这一机制显著提升了重建精度和新视角合成质量,尤其在复杂场景下表现优异。
链接: https://arxiv.org/abs/2509.16863
作者: Amanuel T. Dufera,Yuan-Li Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce ConfidentSplat, a novel 3D Gaussian Splatting (3DGS)-based SLAM system for robust, highfidelity RGB-only reconstruction. Addressing geometric inaccuracies in existing RGB-only 3DGS SLAM methods that stem from unreliable depth estimation, ConfidentSplat incorporates a core innovation: a confidence-weighted fusion mechanism. This mechanism adaptively integrates depth cues from multiview geometry with learned monocular priors (Omnidata ViT), dynamically weighting their contributions based on explicit reliability estimates-derived predominantly from multi-view geometric consistency-to generate high-fidelity proxy depth for map supervision. The resulting proxy depth guides the optimization of a deformable 3DGS map, which efficiently adapts online to maintain global consistency following pose updates from a DROID-SLAM-inspired frontend and backend optimizations (loop closure, global bundle adjustment). Extensive validation on standard benchmarks (TUM-RGBD, ScanNet) and diverse custom mobile datasets demonstrates significant improvements in reconstruction accuracy (L1 depth error) and novel view synthesis fidelity (PSNR, SSIM, LPIPS) over baselines, particularly in challenging conditions. ConfidentSplat underscores the efficacy of principled, confidence-aware sensor fusion for advancing state-of-the-art dense visual SLAM.
zh
[CV-157] ISCS: Parameter-Guided Channel Ordering and Grouping for Learned Image Compression
【速读】:该论文旨在解决预训练变分自编码器(Variational Autoencoder, VAE)基底的图像压缩模型中,冗余通道占用大量计算与比特资源的问题。现有方法通常依赖于昂贵的数据集特定消融实验,并且孤立地分析通道重要性,忽略了通道间的相互依赖关系。其解决方案的关键在于提出一种通用、数据无关的通道重要性评估机制,通过分析模型参数的内在统计特性——包括权重方差、偏置幅值及成对相关性——来识别并组织关键通道,从而发现一个稳定的“不变显著通道空间”(Invariant Salient Channel Space, ISCS)。在此基础上,构建确定性的通道排序与分组策略,实现切片并行解码、降低冗余并提升比特率效率,同时保持重建质量,为现有学习图像压缩框架提供可插拔的优化模块。
链接: https://arxiv.org/abs/2509.16853
作者: Jinhao Wang,Cihan Ruan,Nam Ling,Wei Wang,Wei Jiang
机构: Santa Clara University (圣克拉拉大学); Futurewei Technologies, Inc. (未来wei技术公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prior studies in learned image compression (LIC) consistently show that only a small subset of latent channels is critical for reconstruction, while many others carry limited information. Exploiting this imbalance could improve both coding and computational efficiency, yet existing approaches often rely on costly, dataset-specific ablation tests and typically analyze channels in isolation, ignoring their interdependencies. We propose a generalizable, dataset-agnostic method to identify and organize important channels in pretrained VAE-based LIC models. Instead of brute-force empirical evaluations, our approach leverages intrinsic parameter statistics-weight variances, bias magnitudes, and pairwise correlations-to estimate channel importance. This analysis reveals a consistent organizational structure, termed the Invariant Salient Channel Space (ISCS), where Salient-Core channels capture dominant structures and Salient-Auxiliary channels provide complementary details. Building on ISCS, we introduce a deterministic channel ordering and grouping strategy that enables slice-parallel decoding, reduces redundancy, and improves bitrate efficiency. Experiments across multiple LIC architectures demonstrate that our method effectively reduces bitrate and computation while maintaining reconstruction quality, providing a practical and modular enhancement to existing learned compression frameworks. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.16853 [cs.CV] (or arXiv:2509.16853v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.16853 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-158] SOLAR: Switchable Output Layer for Accuracy and Robustness in Once-for-All Training
【速读】:该论文旨在解决Once-for-All (OFA) 训练中因骨干网络参数过度共享而导致的表征能力受限问题,进而引发子网络(sub-net)校准性能下降和整体性能衰退。其关键解决方案是提出SOLAR(Switchable Output Layer for Accuracy and Robustness in Once-for-All Training),通过为每个子网络分配独立的分类头(classification head),实现不同子网络间logit学习过程的解耦,从而降低表征干扰并提升优化效率,且无需修改共享骨干网络结构。
链接: https://arxiv.org/abs/2509.16833
作者: Shaharyar Ahmed Khan Tareen,Lei Fan,Xiaojing Yuan,Qin Lin,Bin Hu
机构: University of Houston(休斯顿大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 10 pages, 7 figures, 6 tables
Abstract:Once-for-All (OFA) training enables a single super-net to generate multiple sub-nets tailored to diverse deployment scenarios, supporting flexible trade-offs among accuracy, robustness, and model-size without retraining. However, as the number of supported sub-nets increases, excessive parameter sharing in the backbone limits representational capacity, leading to degraded calibration and reduced overall performance. To address this, we propose SOLAR (Switchable Output Layer for Accuracy and Robustness in Once-for-All Training), a simple yet effective technique that assigns each sub-net a separate classification head. By decoupling the logit learning process across sub-nets, the Switchable Output Layer (SOL) reduces representational interference and improves optimization, without altering the shared backbone. We evaluate SOLAR on five datasets (SVHN, CIFAR-10, STL-10, CIFAR-100, and TinyImageNet) using four super-net backbones (ResNet-34, WideResNet-16-8, WideResNet-40-2, and MobileNetV2) for two OFA training frameworks (OATS and SNNs). Experiments show that SOLAR outperforms the baseline methods: compared to OATS, it improves accuracy of sub-nets up to 1.26 %, 4.71 %, 1.67 %, and 1.76 %, and robustness up to 9.01 %, 7.71 %, 2.72 %, and 1.26 % on SVHN, CIFAR-10, STL-10, and CIFAR-100, respectively. Compared to SNNs, it improves TinyImageNet accuracy by up to 2.93 %, 2.34 %, and 1.35 % using ResNet-34, WideResNet-16-8, and MobileNetV2 backbones (with 8 sub-nets), respectively.
zh
[CV-159] L2M-Reg: Building-level Uncertainty-aware Registration of Outdoor LiDAR Point Clouds and Semantic 3D City Models
【速读】:该论文旨在解决激光雷达(LiDAR)点云与语义三维城市模型在单体建筑层面的精确配准问题,尤其针对LoD2(Level of Detail 2)级别模型中存在的建模不确定性带来的挑战。其解决方案的关键在于提出一种基于平面约束的精细配准方法L2M-Reg,该方法通过三个核心步骤实现:建立可靠的平面对应关系、构建伪平面约束的高斯-赫尔姆霍茨(Gauss-Helmert)模型,并自适应估计垂直平移量,从而有效应对模型不确定性,显著提升配准精度与计算效率。
链接: https://arxiv.org/abs/2509.16832
作者: Ziyang Xu,Benedikt Schwab,Yihui Yang,Thomas H. Kolbe,Christoph Holst
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: submit to ISPRS Journal of Photogrammetry and Remote Sensing
Abstract:Accurate registration between LiDAR (Light Detection and Ranging) point clouds and semantic 3D city models is a fundamental topic in urban digital twinning and a prerequisite for downstream tasks, such as digital construction, change detection and model refinement. However, achieving accurate LiDAR-to-Model registration at individual building level remains challenging, particularly due to the generalization uncertainty in semantic 3D city models at the Level of Detail 2 (LoD2). This paper addresses this gap by proposing L2M-Reg, a plane-based fine registration method that explicitly accounts for model uncertainty. L2M-Reg consists of three key steps: establishing reliable plane correspondence, building a pseudo-plane-constrained Gauss-Helmert model, and adaptively estimating vertical translation. Experiments on three real-world datasets demonstrate that L2M-Reg is both more accurate and computationally efficient than existing ICP-based and plane-based methods. Overall, L2M-Reg provides a novel building-level solution regarding LiDAR-to-Model registration when model uncertainty is present.
zh
[CV-160] Looking in the mirror: A faithful counterfactual explanation method for interpreting deep image classification models ICCV
【速读】:该论文旨在解决现有反事实解释(Counterfactual Explanations, CFE)方法依赖额外图像编码器和生成模型、忽视分类器自身特征空间与决策边界的问题,从而无法揭示模型内在的决策机制。解决方案的关键在于提出Mirror-CFE方法,该方法直接在分类器的特征空间中操作,将决策边界视为“镜面”,通过学习从特征空间到图像空间的距离保持映射函数,实现源图像与其反事实样本之间的平滑过渡,从而生成忠实且可解释的反事实解释。
链接: https://arxiv.org/abs/2509.16822
作者: Townim Faisal Chowdhury,Vu Minh Hieu Phan,Kewen Liao,Nanyu Dong,Minh-Son To,Anton Hengel,Johan Verjans,Zhibin Liao
机构: Australian Institute for Machine Learning (澳大利亚机器学习研究所); University of Adelaide (阿德莱德大学); Deakin University (迪肯大学); Flinders University (弗林德斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE/CVF International Conference on Computer Vision (ICCV), 2025
Abstract:Counterfactual explanations (CFE) for deep image classifiers aim to reveal how minimal input changes lead to different model decisions, providing critical insights for model interpretation and improvement. However, existing CFE methods often rely on additional image encoders and generative models to create plausible images, neglecting the classifier’s own feature space and decision boundaries. As such, they do not explain the intrinsic feature space and decision boundaries learned by the classifier. To address this limitation, we propose Mirror-CFE, a novel method that generates faithful counterfactual explanations by operating directly in the classifier’s feature space, treating decision boundaries as mirrors that ``reflect’’ feature representations in the mirror. Mirror-CFE learns a mapping function from feature space to image space while preserving distance relationships, enabling smooth transitions between source images and their counterfactuals. Through extensive experiments on four image datasets, we demonstrate that Mirror-CFE achieves superior performance in validity while maintaining input resemblance compared to state-of-the-art explanation methods. Finally, mirror-CFE provides interpretable visualization of the classifier’s decision process by generating step-wise transitions that reveal how features evolve as classification confidence changes.
zh
[CV-161] Development of a Mobile Application for at-Home Analysis of Retinal Fundus Images
【速读】:该论文旨在解决当前机器学习在医学影像诊断中仍需依赖专业人员人工验证、难以实现临床自主应用的问题。其解决方案的关键在于设计并实现一个移动端平台,通过定期上传视网膜眼底图像,持续监测与年龄相关眼病(如青光眼、糖尿病视网膜病变和黄斑水肿)相关的量化指标(如血管迂曲度、青光眼特征及病变分级),从而提供早期趋势预警,而非直接给出诊断结论。该平台整合了基于Messidor和MAPLES-DR数据集训练的视网膜病变分级模型、DeepSeeNet青光眼检测模型以及血管迂曲度计算模块,形成多维度的眼部健康趋势追踪能力。
链接: https://arxiv.org/abs/2509.16814
作者: Mattea Reid,Zuhairah Zainal,Khaing Zin Than,Danielle Chan,Jonathan Chan
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures
Abstract:Machine learning is gaining significant attention as a diagnostic tool in medical imaging, particularly in the analysis of retinal fundus images. However, this approach is not yet clinically applicable, as it still depends on human validation from a professional. Therefore, we present the design for a mobile application that monitors metrics related to retinal fundus images correlating to age-related conditions. The purpose of this platform is to observe for a change in these metrics over time, offering early insights into potential ocular diseases without explicitly delivering diagnostics. Metrics analysed include vessel tortuosity, as well as signs of glaucoma, retinopathy and macular edema. To evaluate retinopathy grade and risk of macular edema, a model was trained on the Messidor dataset and compared to a similar model trained on the MAPLES-DR dataset. Information from the DeepSeeNet glaucoma detection model, as well as tortuosity calculations, is additionally incorporated to ultimately present a retinal fundus image monitoring platform. As a result, the mobile application permits monitoring of trends or changes in ocular metrics correlated to age-related conditions with regularly uploaded photographs.
zh
[CV-162] MedGS: Gaussian Splatting for Multi-Modal 3D Medical Imaging
【速读】:该论文旨在解决多模态三维(3D)医学影像(如超声、磁共振成像MRI等)在表面重建与帧间插值过程中因图像噪声和帧间信息不完整导致的建模精度不足问题。其解决方案的关键在于提出了一种基于高斯点阵(Gaussian Splatting, GS)的半监督神经隐式表面重建框架MedGS,通过将连续二维(2D)影像帧嵌入至3D空间并用高斯分布进行建模,实现了鲁棒的帧间插值与高保真表面重建,从而提升了对复杂解剖结构的建模精度,并增强了抗噪能力与可编辑性。
链接: https://arxiv.org/abs/2509.16806
作者: Kacper Marzol,Ignacy Kolton,Weronika Smolak-Dyżewska,Joanna Kaleta,Marcin Mazur,Przemysław Spurek
机构: Jagiellonian University (雅盖隆大学); Warsaw University of Technology (华沙理工大学); Sano Centre for Computational Medicine (Sano计算医学中心); IDEAS Research Institute (IDEAS研究学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multi-modal three-dimensional (3D) medical imaging data, derived from ultrasound, magnetic resonance imaging (MRI), and potentially computed tomography (CT), provide a widely adopted approach for non-invasive anatomical visualization. Accurate modeling, registration, and visualization in this setting depend on surface reconstruction and frame-to-frame interpolation. Traditional methods often face limitations due to image noise and incomplete information between frames. To address these challenges, we present MedGS, a semi-supervised neural implicit surface reconstruction framework that employs a Gaussian Splatting (GS)-based interpolation mechanism. In this framework, medical imaging data are represented as consecutive two-dimensional (2D) frames embedded in 3D space and modeled using Gaussian-based distributions. This representation enables robust frame interpolation and high-fidelity surface reconstruction across imaging modalities. As a result, MedGS offers more efficient training than traditional neural implicit methods. Its explicit GS-based representation enhances noise robustness, allows flexible editing, and supports precise modeling of complex anatomical structures with fewer artifacts. These features make MedGS highly suitable for scalable and practical applications in medical imaging.
zh
[CV-163] Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models EMNLP2025
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在多项选择题问答(Multiple-Choice Question Answering, MCQA)中存在的一种选择偏差(selection bias)问题,即模型倾向于偏好特定选项标记(如“A”)或位置,而非基于内容进行推理。这种偏差在任务难度增加时尤为显著,影响模型的准确性和鲁棒性。解决方案的关键在于提出一种推理阶段的对数空间去偏方法(logit-level debiasing),该方法通过通用和上下文提示估计一个集成偏差向量,并对模型输出施加自适应置信度校正,从而在不重新训练模型的前提下有效缓解偏差,且兼容冻结的LVLM结构。实验表明,该方法能显著降低偏差并提升高难度场景下的准确率。
链接: https://arxiv.org/abs/2509.16805
作者: Md. Atabuzzaman,Ali Asgarov,Chris Thomas
机构: Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to EMNLP 2025 (Main Conference)
Abstract:Large Vision-Language Models (LVLMs) have achieved strong performance on vision-language tasks, particularly Visual Question Answering (VQA). While prior work has explored unimodal biases in VQA, the problem of selection bias in Multiple-Choice Question Answering (MCQA), where models may favor specific option tokens (e.g., “A”) or positions, remains underexplored. In this paper, we investigate both the presence and nature of selection bias in LVLMs through fine-grained MCQA benchmarks spanning easy, medium, and hard difficulty levels, defined by the semantic similarity of the options. We further propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts and applies confidence-adaptive corrections to the model’s output. Our method mitigates bias without retraining and is compatible with frozen LVLMs. Extensive experiments across several state-of-the-art models reveal consistent selection biases that intensify with task difficulty, and show that our mitigation approach significantly reduces bias while improving accuracy in challenging settings. This work offers new insights into the limitations of LVLMs in MCQA and presents a practical approach to improve their robustness in fine-grained visual reasoning. Datasets and code are available at: this https URL
zh
[CV-164] Artificial Satellite Trails Detection Using U-Net Deep Neural Network and Line Segment Detector Algorithm
【速读】:该论文旨在解决人工卫星数量激增导致天文成像中出现卫星轨迹干扰的问题,这类轨迹会引入虚假源并造成显著的光度误差。解决方案的关键在于提出一种融合U-Net深度神经网络(用于图像分割)与直线段检测算法(Line Segment Detector, LSD)的卫星轨迹检测模型,通过在模拟数据上训练并结合真实观测数据验证,实现了高精度和高召回率的卫星轨迹识别能力。
链接: https://arxiv.org/abs/2509.16771
作者: Xiaohan Chen,Hongrui Gu,Cunshi Wang,Haiyang Mu,Jie Zheng,Junju Du,Jing Ren,Zhou Fan,Jing Li
机构: China West Normal University (中国西华师范大学); National Astronomical Observatories, Chinese Academy of Sciences (中国科学院国家天文台); University of Chinese Academy of Sciences (中国科学院大学); Shandong University (山东大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM)
备注: 15 pages, 7 figures, 2 tables, PASP accepted
Abstract:With the rapid increase in the number of artificial satellites, astronomical imaging is experiencing growing interference. When these satellites reflect sunlight, they produce streak-like artifacts in photometry images. Such satellite trails can introduce false sources and cause significant photometric errors. As a result, accurately identifying the positions of satellite trails in observational data has become essential. In this work, we propose a satellite trail detection model that combines the U-Net deep neural network for image segmentation with the Line Segment Detector (LSD) algorithm. The model is trained on 375 simulated images of satellite trails, generated using data from the Mini-SiTian Array. Experimental results show that for trails with a signal-to-noise ratio (SNR) greater than 3, the detection rate exceeds 99. Additionally, when applied to real observational data from the Mini-SiTian Array, the model achieves a recall of 79.57 and a precision of 74.56.
zh
[CV-165] MMPart: Harnessing Multi-Modal Large Language Models for Part-Aware 3D Generation
【速读】:该论文旨在解决当前生成式3D建模方法中缺乏结构信息的问题,即现有方法通常将目标物体表示为封闭网格,无法支持编辑、动画和语义理解。为此,作者提出MMPart框架,其关键在于通过视觉语言模型(VLM)生成基于输入图像和用户描述的提示集合,进而指导生成模型在隔离阶段重构每个部件的图像(控制姿态并引导模型推理被遮挡区域),随后通过多视角生成一致图像,并最终由重建模型转化为具有部件感知能力的3D模型。这一流程实现了对部件分割和几何细节的可控生成。
链接: https://arxiv.org/abs/2509.16768
作者: Omid Bonakdar,Nasser Mozayani
机构: Iran University of Science and Technology (伊朗科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Generative 3D modeling has advanced rapidly, driven by applications in VR/AR, metaverse, and robotics. However, most methods represent the target object as a closed mesh devoid of any structural information, limiting editing, animation, and semantic understanding. Part-aware 3D generation addresses this problem by decomposing objects into meaningful components, but existing pipelines face challenges: in existing methods, the user has no control over which objects are separated and how model imagine the occluded parts in isolation phase. In this paper, we introduce MMPart, an innovative framework for generating part-aware 3D models from a single image. We first use a VLM to generate a set of prompts based on the input image and user descriptions. In the next step, a generative model generates isolated images of each object based on the initial image and the previous step’s prompts as supervisor (which control the pose and guide model how imagine previously occluded areas). Each of those images then enters the multi-view generation stage, where a number of consistent images from different views are generated. Finally, a reconstruction model converts each of these multi-view images into a 3D model.
zh
[CV-166] DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images NEURIPS2025
【速读】:该论文旨在解决现有眼动轨迹(scanpath)与显著性预测模型在建模人类视觉注意力时的两大局限:一是多数方法仅基于离散的注视点序列(scanpath)进行训练,忽略了原始连续眼动轨迹中蕴含的丰富信息;二是缺乏对不同受试者在同一图像上表现出的个体差异性的捕捉能力,通常生成固定长度的单一扫描路径,难以反映真实视觉注意的多样性和随机性。其解决方案的关键在于提出DiffEye,一种基于扩散模型(diffusion model)的眼动轨迹生成框架,通过引入对应位置嵌入(Corresponding Positional Embedding, CPE),将空间注视信息与视觉输入的patch级语义特征对齐,从而在小样本条件下也能生成高质量、多样化的连续眼动轨迹,并可进一步转化为扫描路径和显著图,更准确地刻画人类视觉注意力分布。
链接: https://arxiv.org/abs/2509.16767
作者: Ozgur Kara,Harris Nisar,James M. Rehg
机构: University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025
Abstract:Numerous models have been developed for scanpath and saliency prediction, which are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, while the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: this https URL
zh
[CV-167] HyPlaneHead: Rethinking Tri-plane-like Representations in Full-Head Image Synthesis NEURIPS2025
【速读】:该论文旨在解决基于三平面(tri-plane)表示的生成式AI(Generative AI)在全头图像合成任务中面临的三大核心问题:一是笛卡尔坐标投影导致的特征纠缠(feature entanglement),引发镜像伪影;二是球面三平面(spherical tri-plane)中正方形特征图与球面平面映射不均,造成特征利用率低且难以生成精细细节;三是卷积通道间特征穿透(feature penetration)导致各平面之间干扰,尤其当某一平面主导时更为显著。解决方案的关键在于提出一种新型混合平面(hy-plane)表示:首先通过融合平面与球面平面的优势并规避其缺陷,实现更高效的特征表达;其次引入近等面积(near-equal-area)映射策略替代传统θ-φ变形,最大化利用正方形特征图;最后设计单通道统一特征图而非多通道分离特征图,从根本上消除特征穿透问题。这些改进共同推动了HyPlaneHead方法在全头图像合成上达到当前最优性能。
链接: https://arxiv.org/abs/2509.16748
作者: Heyuan Li,Kenkun Liu,Lingteng Qiu,Qi Zuo,Keru Zheng,Zilong Dong,Xiaoguang Han
机构: Tongyi Lab, Alibaba Inc.(通义实验室,阿里巴巴公司); SSE, CUHK (Shenzhen) (深圳清华大学研究院); Alibaba Inc.(阿里巴巴公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025
Abstract:Tri-plane-like representations have been widely adopted in 3D-aware GANs for head image synthesis and other 3D object/scene modeling tasks due to their efficiency. However, querying features via Cartesian coordinate projection often leads to feature entanglement, which results in mirroring artifacts. A recent work, SphereHead, attempted to address this issue by introducing spherical tri-planes based on a spherical coordinate system. While it successfully mitigates feature entanglement, SphereHead suffers from uneven mapping between the square feature maps and the spherical planes, leading to inefficient feature map utilization during rendering and difficulties in generating fine image details. Moreover, both tri-plane and spherical tri-plane representations share a subtle yet persistent issue: feature penetration across convolutional channels can cause interference between planes, particularly when one plane dominates the others. These challenges collectively prevent tri-plane-based methods from reaching their full potential. In this paper, we systematically analyze these problems for the first time and propose innovative solutions to address them. Specifically, we introduce a novel hybrid-plane (hy-plane for short) representation that combines the strengths of both planar and spherical planes while avoiding their respective drawbacks. We further enhance the spherical plane by replacing the conventional theta-phi warping with a novel near-equal-area warping strategy, which maximizes the effective utilization of the square feature map. In addition, our generator synthesizes a single-channel unified feature map instead of multiple feature maps in separate channels, thereby effectively eliminating feature penetration. With a series of technical improvements, our hy-plane representation enables our method, HyPlaneHead, to achieve state-of-the-art performance in full-head image synthesis.
zh
[CV-168] CAMBench-QR : A Structure-Aware Benchmark for Post-Hoc Explanations with QR Understanding
【速读】:该论文旨在解决当前类激活图(Class Activation Mapping, CAM)方法生成的视觉解释往往看似合理但缺乏结构忠实性的问题,即其注意力区域可能并未准确聚焦于图像中关键的结构性特征。解决方案的关键在于提出CAMBench-QR这一结构感知基准,利用二维码(QR code)固有的几何结构(如定位图案、时序线和模块网格)作为可验证的参考标准,通过合成带有精确掩码和受控失真的QR码与非QR码数据,量化评估CAM方法是否能将显著性集中在必要子结构上并避免背景干扰。该方案引入一系列结构感知指标(如定位/时序质量比、背景泄漏、覆盖率AUC、距结构距离等),结合因果遮挡、插入/删除忠实度、鲁棒性和延迟等多维度测试,为高效且结构感知的CAM方法提供了一个简洁、可复现的评估基准。
链接: https://arxiv.org/abs/2509.16745
作者: Ritabrata Chakraborty,Avijit Dasgupta,Sandeep Chaurasia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 5 figures, 6 tables
Abstract:Visual explanations are often plausible but not structurally faithful. We introduce CAMBench-QR, a structure-aware benchmark that leverages the canonical geometry of QR codes (finder patterns, timing lines, module grid) to test whether CAM methods place saliency on requisite substructures while avoiding background. CAMBench-QR synthesizes QR/non-QR data with exact masks and controlled distortions, and reports structure-aware metrics (Finder/Timing Mass Ratios, Background Leakage, coverage AUCs, Distance-to-Structure) alongside causal occlusion, insertion/deletion faithfulness, robustness, and latency. We benchmark representative, efficient CAMs (LayerCAM, EigenGrad-CAM, XGrad-CAM) under two practical regimes of zero-shot and last-block fine-tuning. The benchmark, metrics, and training recipes provide a simple, reproducible yardstick for structure-aware evaluation of visual explanations. Hence we propose that CAMBENCH-QR can be used as a litmus test of whether visual explanations are truly structure-aware.
zh
[CV-169] Min: Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning NEURIPS2025
【速读】:该论文旨在解决类增量学习(Class Incremental Learning, CIL)中预训练模型(Pre-trained Models, PTMs)因轻量微调导致的参数漂移(parameter drift)问题,进而削弱模型在新任务上的泛化能力。其核心解决方案是提出一种基于信息论指导的“有益噪声”学习方法——Mixture of Noise (Min),关键在于:首先从新任务的高维特征中学习任务特定噪声,随后动态调整权重以最优混合不同任务噪声,并将该有益噪声嵌入中间特征空间,掩蔽低效模式响应,从而保留旧任务知识并提升持续学习性能。
链接: https://arxiv.org/abs/2509.16738
作者: Kai Jiang,Zhengyan Shi,Dell Zhang,Hongyuan Zhang,Xuelong Li
机构: Northwestern Polytechnical University (西北工业大学); China Telecom (中国电信); Shanghai Jiao Tong University (上海交通大学); University of Hong Kong (香港大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted by NeurIPS 2025. Source Code will be released in the next version
Abstract:Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent researches have shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can be easily abused by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, we propose learning beneficial noise for CIL guided by information theory and propose Mixture of Noise (Min), aiming to mitigate the degradation of backbone generalization from adapting new tasks. Specifically, task-specific noise is learned from high-dimension features of new tasks. Then, a set of weights is adjusted dynamically for optimal mixture of different task noise. Finally, Min embeds the beneficial noise into the intermediate features to mask the response of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that Min achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-steps incremental settings. This shows the significant potential for beneficial noise in continual learning.
zh
[CV-170] Pain in 3D: Generating Controllable Synthetic Faces for Automated Pain Assessment
【速读】:该论文旨在解决自动化疼痛评估中因数据集存在严重人口统计学和标签不平衡,以及现有生成模型无法精确控制面部动作单元(Action Units, AUs)、面部结构或临床验证的疼痛等级而导致的性能瓶颈问题。解决方案的关键在于提出3DPain——一个大规模合成数据集,通过三阶段框架生成多样化的3D人脸网格、利用扩散模型纹理化,并基于AUs驱动的人脸绑定技术合成多视角人脸图像,配对包含中性与疼痛图像、AUs配置、PSPI评分及首个数据集级疼痛区域热图标注;同时引入ViTPain,一种基于视觉Transformer的跨模态蒸馏框架,由热图训练的教师模型指导RGB图像学生模型训练,从而提升准确性、可解释性和临床可靠性。二者共同构建了一个可控、多样化且具备临床基础的通用自动化疼痛评估新范式。
链接: https://arxiv.org/abs/2509.16727
作者: Xin Lei Lin,Soroush Mehraban,Abhishek Moturu,Babak Taati
机构: Vector Institute; KITE Research Institute, University Health Network; Department of Computer Science, University of Toronto; Department of Medical Imaging, University of Toronto; Institute of Biomedical Engineering, University of Toronto
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Automated pain assessment from facial expressions is crucial for non-communicative patients, such as those with dementia. Progress has been limited by two challenges: (i) existing datasets exhibit severe demographic and label imbalance due to ethical constraints, and (ii) current generative models cannot precisely control facial action units (AUs), facial structure, or clinically validated pain levels. We present 3DPain, a large-scale synthetic dataset specifically designed for automated pain assessment, featuring unprecedented annotation richness and demographic diversity. Our three-stage framework generates diverse 3D meshes, textures them with diffusion models, and applies AU-driven face rigging to synthesize multi-view faces with paired neutral and pain images, AU configurations, PSPI scores, and the first dataset-level annotations of pain-region heatmaps. The dataset comprises 82,500 samples across 25,000 pain expression heatmaps and 2,500 synthetic identities balanced by age, gender, and ethnicity. We further introduce ViTPain, a Vision Transformer based cross-modal distillation framework in which a heatmap-trained teacher guides a student trained on RGB images, enhancing accuracy, interpretability, and clinical reliability. Together, 3DPain and ViTPain establish a controllable, diverse, and clinically grounded foundation for generalizable automated pain assessment. Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2509.16727 [cs.CV] (or arXiv:2509.16727v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.16727 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-171] xt-Scene: A Scene-to-Language Parsing Framework for 3D Scene Understanding
【速读】:该论文旨在解决 embodied artificial intelligence 系统在理解和交互复杂 3D 场景时的核心挑战,尤其是如何将多模态大语言模型(Multimodal Large Language Models, MLLMs)在 2D 图像理解中的能力拓展至 3D 场景。主要难点包括:1)3D 环境包含空间关系、可操作性(affordances)、物理属性、布局等更丰富的语义概念;2)缺乏大规模的 3D 视觉-语言数据集限制了模型训练与评估。解决方案的关键在于提出 Text-Scene 框架,该框架通过自动解析 3D 场景生成结构化文本描述,无需人工干预即可识别物体属性和空间关系,并生成连贯的场景摘要,从而实现从几何分析到语言表达的映射,有效连接 3D 观测与自然语言理解,同时支持下游任务如长期任务规划(InPlan3D)。
链接: https://arxiv.org/abs/2509.16721
作者: Haoyuan Li,Rui Liu,Hehe Fan,Yi Yang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: 19 pages, 12 figures, 6 tables
Abstract:Enabling agents to understand and interact with complex 3D scenes is a fundamental challenge for embodied artificial intelligence systems. While Multimodal Large Language Models (MLLMs) have achieved significant progress in 2D image understanding, extending such capabilities to 3D scenes remains difficult: 1) 3D environment involves richer concepts such as spatial relationships, affordances, physics, layout, and so on, 2) the absence of large-scale 3D vision-language datasets has posed a significant obstacle. In this paper, we introduce Text-Scene, a framework that automatically parses 3D scenes into textual descriptions for scene understanding. Given a 3D scene, our model identifies object attributes and spatial relationships, and then generates a coherent summary of the whole scene, bridging the gap between 3D observation and language without requiring human-in-the-loop intervention. By leveraging both geometric analysis and MLLMs, Text-Scene produces descriptions that are accurate, detailed, and human-interpretable, capturing object-level details and global-level context. Experimental results on benchmarks demonstrate that our textual parses can faithfully represent 3D scenes and benefit downstream tasks. To evaluate the reasoning capability of MLLMs, we present InPlan3D, a comprehensive benchmark for 3D task planning, consisting of 3174 long-term planning tasks across 636 indoor scenes. We emphasize clarity and accessibility in our approach, aiming to make 3D scene content understandable through language. Code and datasets will be released.
zh
[CV-172] When Confidence Fails: Revisiting Pseudo-Label Selection in Semi-supervised Semantic Segmentation
【速读】:该论文针对半监督语义分割中伪标签(pseudo-label)选择环节存在的两大问题展开研究:一是现有方法依赖固定置信度阈值筛选伪标签,难以应对模型过自信倾向(network overconfidence tendency),即正确与错误预测在高置信度区域重叠严重,导致难以区分并加剧模型认知偏差;二是直接丢弃低置信度预测破坏了空间-语义连续性,造成关键上下文信息丢失。解决方案的关键在于提出置信度可分离学习(Confidence Separable Learning, CSL),其将伪标签选择建模为置信度分布特征空间中的凸优化问题,构建样本自适应的决策边界以分离可靠与不可靠预测;同时引入对可靠像素的随机掩码机制,引导网络从低可靠性区域学习上下文关系,从而缓解因丢弃不确定预测带来的负面影响。
链接: https://arxiv.org/abs/2509.16704
作者: Pan Liu,Jinshi Liu
机构: Central South University (中南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While significant advances exist in pseudo-label generation for semi-supervised semantic segmentation, pseudo-label selection remains understudied. Existing methods typically use fixed confidence thresholds to retain high-confidence predictions as pseudo-labels. However, these methods cannot cope with network overconfidence tendency, where correct and incorrect predictions overlap significantly in high-confidence regions, making separation challenging and amplifying model cognitive bias. Meanwhile, the direct discarding of low-confidence predictions disrupts spatial-semantic continuity, causing critical context loss. We propose Confidence Separable Learning (CSL) to address these limitations. CSL formulates pseudo-label selection as a convex optimization problem within the confidence distribution feature space, establishing sample-specific decision boundaries to distinguish reliable from unreliable predictions. Additionally, CSL introduces random masking of reliable pixels to guide the network in learning contextual relationships from low-reliability regions, thereby mitigating the adverse effects of discarding uncertain predictions. Extensive experimental results on the Pascal, Cityscapes, and COCO benchmarks show that CSL performs favorably against state-of-the-art methods. Code and model weights are available at this https URL.
zh
[CV-173] Animalbooth: multimodal feature enhancement for animal subject personalization
【速读】:该论文旨在解决个性化动物图像生成中因外观特征丰富和形态差异大而导致的特征错位(feature misalignment)问题,进而引发的身份漂移(identity drift)难题。其解决方案的关键在于提出AnimalBooth框架,该框架通过引入Animal Net与自适应注意力模块增强身份保真度,并结合频域控制的特征融合模块——在潜在空间中利用离散余弦变换(Discrete Cosine Transform, DCT)滤波来引导扩散过程,实现从全局结构到细节纹理的粗到精渐进生成,从而有效缓解跨域对齐误差并提升图像质量。
链接: https://arxiv.org/abs/2509.16702
作者: Chen Liu,Haitao Wu,Kafeng Wang,Xiaowang Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalized animal image generation is challenging due to rich appearance cues and large morphological variability. Existing approaches often exhibit feature misalignment across domains, which leads to identity drift. We present AnimalBooth, a framework that strengthens identity preservation with an Animal Net and an adaptive attention module, mitigating cross domain alignment errors. We further introduce a frequency controlled feature integration module that applies Discrete Cosine Transform filtering in the latent space to guide the diffusion process, enabling a coarse to fine progression from global structure to detailed texture. To advance research in this area, we curate AnimalBench, a high resolution dataset for animal personalization. Extensive experiments show that AnimalBooth consistently outperforms strong baselines on multiple benchmarks and improves both identity fidelity and perceptual quality.
zh
[CV-174] InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention NEURIPS2025
【速读】:该论文旨在解决当前布局到图像(Layout-to-Image, L2I)生成方法在复杂场景下仍存在性能不足的问题,尤其是在精确控制物体位置与多模态内容融合方面表现欠佳。其解决方案的关键在于提出一种名为InstanceAssemble的新架构,该架构通过实例组装注意力(instance-assembling attention)机制引入布局条件,实现基于边界框(bbox)的位置控制以及文本和额外视觉内容的多模态内容控制;同时,借助轻量级LoRA模块实现对现有DiT-based文本到图像(T2I)模型的灵活适配,从而在保持高兼容性的同时显著提升生成精度与可控性。
链接: https://arxiv.org/abs/2509.16691
作者: Qiang Xiang,Shuang Sun,Binglei Li,Dejia Song,Huaxia Li,Nemo Chen,Xu Tang,Yao Hu,Junping Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted in NeurIPS 2025
Abstract:Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaption to existing DiT-based T2I models through light-weighted LoRA modules. Additionally, we propose a Layout-to-Image benchmark, Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules.
zh
[CV-175] Spectral Compressive Imaging via Chromaticity-Intensity Decomposition
【速读】:该论文旨在解决编码孔径快照式光谱成像(Coded Aperture Snapshot Spectral Imaging, CASSI)中因测量数据同时混合空间与光谱信息而导致的严重病态逆问题,以及由于场景光照依赖性使得恢复内在光谱反射率(spectral reflectance)困难的问题。解决方案的关键在于提出一种色度-强度解耦框架(chromaticity-intensity decomposition framework),将高光谱图像(HSI)分解为一个空间平滑的强度图和一个具有光谱变化特性的色度立方体,其中色度编码了光照不变的反射率并保留高频空间细节与局部光谱稀疏性;在此基础上构建CIDNet网络,融合面向细粒度稀疏光谱色度重建的混合空间-光谱Transformer与能捕捉迭代过程中各阶段各向异性噪声的空间自适应噪声估计模块,从而实现更精确的光谱与色度保真度恢复。
链接: https://arxiv.org/abs/2509.16690
作者: Xiaodong Wang,Zijun He,Ping Wang,Lishun Wang,Yanan Hu,Xin Yuan
机构: Zhejiang University (浙江大学); Westlake University (西湖大学); Chengdu Institute of Biology, Chinese Academy of Sciences (中国科学院成都生物研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In coded aperture snapshot spectral imaging (CASSI), the captured measurement entangles spatial and spectral information, posing a severely ill-posed inverse problem for hyperspectral images (HSIs) reconstruction. Moreover, the captured radiance inherently depends on scene illumination, making it difficult to recover the intrinsic spectral reflectance that remains invariant to lighting conditions. To address these challenges, we propose a chromaticity-intensity decomposition framework, which disentangles an HSI into a spatially smooth intensity map and a spectrally variant chromaticity cube. The chromaticity encodes lighting-invariant reflectance, enriched with high-frequency spatial details and local spectral sparsity. Building on this decomposition, we develop CIDNet, a Chromaticity-Intensity Decomposition unfolding network within a dual-camera CASSI system. CIDNet integrates a hybrid spatial-spectral Transformer tailored to reconstruct fine-grained and sparse spectral chromaticity and a degradation-aware, spatially-adaptive noise estimation module that captures anisotropic noise across iterative stages. Extensive experiments on both synthetic and real-world CASSI datasets demonstrate that our method achieves superior performance in both spectral and chromaticity fidelity. Code and models will be publicly available.
zh
[CV-176] owards a Transparent and Interpretable AI Model for Medical Image Classifications
【速读】:该论文旨在解决复杂人工智能(AI)模型在医疗领域应用中因“黑箱”特性导致的临床可实践性问题,即AI决策过程缺乏透明度和可解释性,从而限制了其在实际医疗场景中的可信度与采纳。解决方案的关键在于引入可解释人工智能(Explainable Artificial Intelligence, XAI)方法,通过在多种医学数据集上进行模拟实验,揭示XAI模型如何有效解析AI预测的内部机制,从而提升医疗专业人员的决策质量。研究强调,持续开发和探索基于多样化医学数据的XAI技术,是推动其在医疗领域广泛应用与效能提升的核心路径。
链接: https://arxiv.org/abs/2509.16685
作者: Binbin Wen,Yihang Wu,Tareef Daqqaq,Ahmad Chaddad
机构: Guilin University of Electronic Technology (桂林电子科技大学); Ecole de Technologie Superieure (高等技术学院); Taibah University (塔伊巴大学); Prince Mohammed Bin Abdulaziz Hospital (穆罕默德·本·阿卜杜勒阿齐兹王子医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published in Cognitive Neurodynamics
Abstract:The integration of artificial intelligence (AI) into medicine is remarkable, offering advanced diagnostic and therapeutic possibilities. However, the inherent opacity of complex AI models presents significant challenges to their clinical practicality. This paper focuses primarily on investigating the application of explainable artificial intelligence (XAI) methods, with the aim of making AI decisions transparent and interpretable. Our research focuses on implementing simulations using various medical datasets to elucidate the internal workings of the XAI model. These dataset-driven simulations demonstrate how XAI effectively interprets AI predictions, thus improving the decision-making process for healthcare professionals. In addition to a survey of the main XAI methods and simulations, ongoing challenges in the XAI field are discussed. The study highlights the need for the continuous development and exploration of XAI, particularly from the perspective of diverse medical datasets, to promote its adoption and effectiveness in the healthcare domain.
zh
[CV-177] Active View Selection for Scene-level Multi-view Crowd Counting and Localization with Limited Labels
【速读】:该论文旨在解决多视角人群计数与定位任务中因忽视最优摄像机视角选择而导致的场景级感知不全面问题,以及现有视图选择方法依赖大量标注数据且缺乏跨场景泛化能力的局限性。其解决方案的关键在于提出一种主动视图选择方法(Active View Selection, AVS),该方法在视图选择、标注和下游任务之间进行联合优化,不仅考虑视图与场景几何关系,还引入下游任务模型预测结果作为指导信号,从而实现有限标注条件下跨场景的高效视图选择,显著提升了多视角人群计数与定位的准确性和适用性。
链接: https://arxiv.org/abs/2509.16684
作者: Qi Zhang,Bin Li,Antoni B. Chan,Hui Huang
机构: Shenzhen University (深圳大学); City University of Hong Kong (香港城市大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Multi-view crowd counting and localization fuse the input multi-views for estimating the crowd number or locations on the ground. Existing methods mainly focus on accurately predicting on the crowd shown in the input views, which neglects the problem of choosing the `best’ camera views to perceive all crowds well in the scene. Besides, existing view selection methods require massive labeled views and images, and lack the ability for cross-scene settings, reducing their application scenarios. Thus, in this paper, we study the view selection issue for better scene-level multi-view crowd counting and localization results with cross-scene ability and limited label demand, instead of input-view-level results. We first propose an independent view selection method (IVS) that considers view and scene geometries in the view selection strategy and conducts the view selection, labeling, and downstream tasks independently. Based on IVS, we also put forward an active view selection method (AVS) that jointly optimizes the view selection, labeling, and downstream tasks. In AVS, we actively select the labeled views and consider both the view/scene geometries and the predictions of the downstream task models in the view selection process. Experiments on multi-view counting and localization tasks demonstrate the cross-scene and the limited label demand advantages of the proposed active view selection method (AVS), outperforming existing methods and with wider application scenarios.
zh
[CV-178] ProtoVQA: An Adaptable Prototypical Framework for Explainable Fine-Grained Visual Question Answering EMNLP2025
【速读】:该论文旨在解决视觉问答(Visual Question Answering, VQA)系统在复杂应用场景中缺乏可解释性的问题,尤其是在医疗影像和自动驾驶等安全关键领域,模型不仅需要提供准确答案,还需生成人类可理解与验证的解释。解决方案的关键在于提出ProtoVQA框架,其核心创新包括:(i) 学习与问题相关的原型(question-aware prototypes),作为推理锚点将答案与图像中的判别区域关联;(ii) 采用空间约束匹配机制确保所选证据在空间上连贯且语义相关;(iii) 通过共享的原型骨干网络同时支持答案生成与视觉定位(grounding)任务。该方法显著提升了解释的忠实度与细粒度,同时保持了竞争性的准确率。
链接: https://arxiv.org/abs/2509.16680
作者: Xingjian Diao,Weiyi Wu,Keyi Kong,Peijun Qing,Xinwen Xu,Ming Cheng,Soroush Vosoughi,Jiang Gui
机构: Dartmouth College (达特茅斯学院); Shandong University (山东大学); Harvard University (哈佛大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted to EMNLP 2025 Main Conference
Abstract:Visual Question Answering (VQA) is increasingly used in diverse applications ranging from general visual reasoning to safety-critical domains such as medical imaging and autonomous systems, where models must provide not only accurate answers but also explanations that humans can easily understand and verify. Prototype-based modeling has shown promise for interpretability by grounding predictions in semantically meaningful regions for purely visual reasoning tasks, yet remains underexplored in the context of VQA. We present ProtoVQA, a unified prototypical framework that (i) learns question-aware prototypes that serve as reasoning anchors, connecting answers to discriminative image regions, (ii) applies spatially constrained matching to ensure that the selected evidence is coherent and semantically relevant, and (iii) supports both answering and grounding tasks through a shared prototype backbone. To assess explanation quality, we propose the Visual-Linguistic Alignment Score (VLAS), which measures how well the model’s attended regions align with ground-truth evidence. Experiments on Visual7W show that ProtoVQA yields faithful, fine-grained explanations while maintaining competitive accuracy, advancing the development of transparent and trustworthy VQA systems.
zh
[CV-179] IPF-RDA: An Information-Preserving Framework for Robust Data Augmentation
【速读】:该论文旨在解决数据增强(Data Augmentation)在提升深度模型泛化性能的同时,可能引入分布偏移(Distribution Shift)和噪声的问题,这些问题会限制深度网络的潜力并损害其性能。解决方案的关键在于提出一种信息保持框架 IPF-RDA,其核心包括:(i) 一种新的类别判别信息估计算法,用于识别最易受数据增强操作影响的样本点及其重要性得分;(ii) 一种信息保持机制,在增强样本中保留关键信息并自适应地保证数据多样性。该框架将不同类型的增强方法分类整合,显著提升了多种主流数据增强方法的鲁棒性与效能。
链接: https://arxiv.org/abs/2509.16678
作者: Suorong Yang,Hongchao Yang,Suhan Guo,Furao Shen,Jian Zhao
机构: State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China (国家软件新技术重点实验室,计算机科学与技术系,南京大学,南京 210023,中国); State Key Laboratory for Novel Software Technology, Nanjing University, China, School of Artificial Intelligence, Nanjing University, Nanjing 210023, China (国家软件新技术重点实验室,南京大学,中国,人工智能学院,南京大学,南京 210023,中国); School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China (电子科学与工程学院,南京大学,南京 210023,中国)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE Transactions on Pattern Analysis and Machine Intelligence
Abstract:Data augmentation is widely utilized as an effective technique to enhance the generalization performance of deep models. However, data augmentation may inevitably introduce distribution shifts and noises, which significantly constrain the potential and deteriorate the performance of deep networks. To this end, we propose a novel information-preserving framework, namely IPF-RDA, to enhance the robustness of data augmentations in this paper. IPF-RDA combines the proposal of (i) a new class-discriminative information estimation algorithm that identifies the points most vulnerable to data augmentation operations and corresponding importance scores; And (ii) a new information-preserving scheme that preserves the critical information in the augmented samples and ensures the diversity of augmented data adaptively. We divide data augmentation methods into three categories according to the operation types and integrate these approaches into our framework accordingly. After being integrated into our framework, the robustness of data augmentation methods can be enhanced and their full potential can be unleashed. Extensive experiments demonstrate that although being simple, IPF-RDA consistently improves the performance of numerous commonly used state-of-the-art data augmentation methods with popular deep models on a variety of datasets, including CIFAR-10, CIFAR-100, Tiny-ImageNet, CUHK03, Market1501, Oxford Flower, and MNIST, where its performance and scalability are stressed. The implementation is available at this https URL.
zh
[CV-180] Segment-to-Act: Label-Noise-Robust Action-Prompted Video Segmentation Towards Embodied Intelligence
【速读】:该论文旨在解决**动作引导的视频对象分割(Action-based Video Object Segmentation, AVOS)任务中因标签噪声导致的性能下降问题,尤其关注两类噪声:文本提示噪声(类别误判和同类别名词替换)与掩码标注噪声(边界扰动模拟不精确监督)。其核心解决方案在于首次构建了面向标签噪声的AVOS基准数据集ActiSeg-NL,并系统评估六种标签噪声学习策略在文本、边界及混合噪声场景下的鲁棒性表现;关键创新点包括提出一种并行掩码头机制(Parallel Mask Head Mechanism, PMHM)**以缓解掩码标注噪声的影响,同时揭示不同学习策略在前景-背景权衡上的差异化鲁棒性特征,为未来抗噪AVOS方法设计提供了理论依据与实践框架。
链接: https://arxiv.org/abs/2509.16677
作者: Wenxin Li,Kunyu Peng,Di Wen,Ruiping Liu,Mengfei Duan,Kai Luo,Kailun Yang
机构: Hunan University (湖南大学); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注: The established benchmark and source code will be made publicly available at this https URL
Abstract:Embodied intelligence relies on accurately segmenting objects actively involved in interactions. Action-based video object segmentation addresses this by linking segmentation with action semantics, but it depends on large-scale annotations and prompts that are costly, inconsistent, and prone to multimodal noise such as imprecise masks and referential ambiguity. To date, this challenge remains unexplored. In this work, we take the first step by studying action-based video object segmentation under label noise, focusing on two sources: textual prompt noise (category flips and within-category noun substitutions) and mask annotation noise (perturbed object boundaries to mimic imprecise supervision). Our contributions are threefold. First, we introduce two types of label noises for the action-based video object segmentation task. Second, we build up the first action-based video object segmentation under a label noise benchmark ActiSeg-NL and adapt six label-noise learning strategies to this setting, and establish protocols for evaluating them under textual, boundary, and mixed noise. Third, we provide a comprehensive analysis linking noise types to failure modes and robustness gains, and we introduce a Parallel Mask Head Mechanism (PMHM) to address mask annotation noise. Qualitative evaluations further reveal characteristic failure modes, including boundary leakage and mislocalization under boundary perturbations, as well as occasional identity substitutions under textual flips. Our comparative analysis reveals that different learning strategies exhibit distinct robustness profiles, governed by a foreground-background trade-off where some achieve balanced performance while others prioritize foreground accuracy at the cost of background precision. The established benchmark and source code will be made publicly available at this https URL.
zh
[CV-181] FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
【速读】:该论文旨在解决开放世界场景下文本驱动行人检索(Text-based Pedestrian Retrieval, TPR)中存在的模型泛化能力不足与语义理解不充分的问题,尤其是在零样本交互式检索任务中。解决方案的关键在于提出FitPro框架,其核心创新包括:特征对比解码(Feature Contrastive Decoding, FCD),通过提示引导的对比解码生成高质量结构化行人描述以缓解零样本场景下的语义漂移;增量语义挖掘(Incremental Semantic Mining, ISM),利用多视角观测构建全局行人表征以增强对视角变化和细粒度描述差异的鲁棒性;以及查询感知的分层检索(Query-aware Hierarchical Retrieval, QHR),根据查询类型动态优化检索流程,实现对多模态和多视角输入的高效适应。
链接: https://arxiv.org/abs/2509.16674
作者: Zengli Luo,Canlong Zhang,Xiaochun Lu,Zhixin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15pages,6 figures
Abstract:Text-based Pedestrian Retrieval (TPR) aims to retrieve specific target pedestrians in visual scenes according to natural language descriptions. Although existing methods have achieved progress under constrained settings, interactive retrieval in the open-world scenario still suffers from limited model generalization and insufficient semantic understanding. To address these challenges, we propose FitPro, an open-world interactive zero-shot TPR framework with enhanced semantic comprehension and cross-scene adaptability. FitPro has three innovative components: Feature Contrastive Decoding (FCD), Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval (QHR). The FCD integrates prompt-guided contrastive decoding to generate high-quality structured pedestrian descriptions from denoised images, effectively alleviating semantic drift in zero-shot scenarios. The ISM constructs holistic pedestrian representations from multi-view observations to achieve global semantic modeling in multi-turn interactions,thereby improving robustness against viewpoint shifts and fine-grained variations in descriptions. The QHR dynamically optimizes the retrieval pipeline according to query types, enabling efficient adaptation to multi-modal and multi-view inputs. Extensive experiments on five public datasets and two evaluation protocols demonstrate that FitPro significantly overcomes the generalization limitations and semantic modeling constraints of existing methods in interactive retrieval, paving the way for practical deployment. The code and data will be released at this https URL lilo4096/FitPro-Interactive-Person-Retrieval.
zh
[CV-182] MedCutMix: A Data-Centric Approach to Improve Radiology Vision-Language Pre-training with Disease Awareness
【速读】:该论文旨在解决医学视觉-语言预训练(Vision-Language Pre-training, VLP)中因依赖成对图像-文本数据集而面临的隐私风险与标注成本高昂的问题。现有数据增强方法在医学场景下往往难以捕捉复杂且细微的数据变化,导致多样性不足。解决方案的关键在于提出一种名为MedCutMix的多模态疾病导向数据增强方法:它在医学报告中执行诊断语句级别的CutMix操作,并通过诊断语句与医学图像之间的交叉注意力机制(cross-attention),引导影像模态内的注意力流形混合(attentive manifold mix),从而提升模型在放射学下游任务中的性能与泛化能力。
链接: https://arxiv.org/abs/2509.16673
作者: Sinuo Wang,Yutong Xie,Yuyuan Liu,Qi Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Pre-training (VLP) is drawing increasing interest for its ability to minimize manual annotation requirements while enhancing semantic understanding in downstream tasks. However, its reliance on image-text datasets poses challenges due to privacy concerns and the high cost of obtaining paired annotations. Data augmentation emerges as a viable strategy to address this issue, yet existing methods often fall short of capturing the subtle and complex variations in medical data due to limited diversity. To this end, we propose MedCutMix, a novel multi-modal disease-centric data augmentation method. MedCutMix performs diagnostic sentence CutMix within medical reports and establishes the cross-attention between the diagnostic sentence and medical image to guide attentive manifold mix within the imaging modality. Our approach surpasses previous methods across four downstream radiology diagnosis datasets, highlighting its effectiveness in enhancing performance and generalizability in radiology VLP.
zh
[CV-183] Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?
【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在自动驾驶场景中对道路拓扑结构(road topology)理解能力不足的问题。其核心挑战在于,尽管VLMs在多模态推理方面取得进展,但它们在理解车道空间关系、路径连通性等关键拓扑信息上表现不佳,限制了其在安全导航中的应用。解决方案的关键在于:首先将多视角图像投影并融合至统一的鸟瞰图(bird’s-eye-view, BEV)坐标系下生成BEV车道表示;在此基础上构建四个面向拓扑推理的诊断型视觉问答(VQA)任务,系统评估模型的空间推理能力。实验表明,尽管前沿闭源模型(如GPT-4o)在部分任务中表现较好,但在涉及时间顺序判断的任务中仍显著落后于人类水平,且开源模型即便达到30B参数规模也存在明显短板,揭示出空间推理仍是当前VLMs的核心瓶颈。
链接: https://arxiv.org/abs/2509.16654
作者: Xin Chen(1),Jia He(1),Maozheng Li(1),Dongliang Xu(1),Tianyu Wang(2),Yixiao Chen(3),Zhixin Lin(1),Yue Yao(1) ((1) Shandong University, (2) MBZUAI, (3) Sems)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 5 figures
Abstract:Vision-Language Models (VLMs) have recently shown remarkable progress in multimodal reasoning, yet their applications in autonomous driving remain limited. In particular, the ability to understand road topology, a key requirement for safe navigation, has received relatively little attention. While some recent works have begun to explore VLMs in driving contexts, their performance on topology reasoning is far from satisfactory. In this work, we systematically evaluate VLMs’ capabilities in road topology understanding. Specifically, multi-view images are projected into unified ground-plane coordinate system and fused into bird’s-eye-view (BEV) lanes. Based on these BEV lanes, we formulate four topology-related diagnostic VQA tasks, which together capture essential components of spatial topology reasoning. Through extensive evaluation, we find that while frontier closed-source models (e.g., GPT-4o) achieve relatively high accuracy in some tasks, they still fail in some temporal questions that humans can answer (e.g., GPT-4o achieve only 67.8% in vector, a two-class classification problem). Furthermore, we find open-source VLMs, even at 30B scale, struggle significantly. These results indicate that spatial reasoning remains a fundamental bottleneck for current VLMs. We also find that the model’s capability is positively correlated with model size, length of reasoning tokens and shots provided as examples, showing direction for future research.
zh
[CV-184] ADVEDM:Fine-grained Adversarial Attack against VLM-based Embodied Agents
【速读】:该论文旨在解决现有针对视觉语言模型(Vision-Language Models, VLMs)的对抗攻击方法在应用于具身决策任务(Embodied Decision-Making, EDM)时存在的两大问题:一是攻击假设过于理想化,依赖对目标VLM的完全知识,不适用于真实场景中对具身代理(如自动驾驶车辆或机器人)的攻击;二是攻击效果有限,因破坏图像中大部分语义信息导致感知与任务上下文不一致,从而干扰VLM推理流程并产生无效输出,无法有效影响物理世界的交互行为。解决方案的关键在于提出一种细粒度对抗攻击框架ADVEDM,通过仅修改图像中少数关键物体的感知信息,同时保留其余区域的语义完整性,使VLM在保持合理推理的前提下输出看似正确但实际错误的决策,从而更有效地误导具身代理的行为,显著提升对物理世界安全性的威胁水平。
链接: https://arxiv.org/abs/2509.16645
作者: Yichen Wang,Hangtao Zhang,Hewen Pan,Ziqi Zhou,Xianlong Wang,Peijin Guo,Lulu Xue,Shengshan Hu,Minghui Li,Leo Yu Zhang
机构: National Engineering Research Center for Big Data Technology and System (国家大数据技术与系统工程研究中心); Services Computing Technology and System Lab (服务计算技术与系统实验室); Cluster and Grid Computing Lab (集群与网格计算实验室); Hubei Engineering Research Center on Big Data Security (湖北省大数据安全工程研究中心); Hubei Key Laboratory of Distributed System Security (湖北省分布式系统安全重点实验室); School of Cyber Science and Engineering, Huazhong University of Science and Technology (华中科技大学网络科学与工程学院); School of Computer Science and Technology, Huazhong University of Science and Technology (华中科技大学计算机科学与技术学院); Department of Computer Science, City University of HongKong (香港城市大学计算机系); School of Software Engineering, Huazhong University of Science and Technology (华中科技大学软件工程学院); School of Information and Communication Technology, Griffith University (格里菲斯大学信息与通信技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language Models (VLMs), with their strong reasoning and planning capabilities, are widely used in embodied decision-making (EDM) tasks in embodied agents, such as autonomous driving and robotic manipulation. Recent research has increasingly explored adversarial attacks on VLMs to reveal their vulnerabilities. However, these attacks either rely on overly strong assumptions, requiring full knowledge of the victim VLM, which is impractical for attacking VLM-based agents, or exhibit limited effectiveness. The latter stems from disrupting most semantic information in the image, which leads to a misalignment between the perception and the task context defined by system prompts. This inconsistency interrupts the VLM’s reasoning process, resulting in invalid outputs that fail to affect interactions in the physical world. To this end, we propose a fine-grained adversarial attack framework, ADVEDM, which modifies the VLM’s perception of only a few key objects while preserving the semantics of the remaining regions. This attack effectively reduces conflicts with the task context, making VLMs output valid but incorrect decisions and affecting the actions of agents, thus posing a more substantial safety threat in the physical world. We design two variants of based on this framework, ADVEDM-R and ADVEDM-A, which respectively remove the semantics of a specific object from the image and add the semantics of a new object into the image. The experimental results in both general scenarios and EDM tasks demonstrate fine-grained control and excellent attack performance.
zh
[CV-185] Unlocking Hidden Potential in Point Cloud Networks with Attention-Guided Grouping-Feature Coordination
【速读】:该论文旨在解决点云分析中传统基于点的网络架构(point-based architectures)因结构设计创新不足而导致性能潜力未被充分挖掘的问题。现有方法多集中于引入新型网络结构,而忽视了通过模块级协同优化来提升特征聚合效果的可能性。其解决方案的关键在于提出一种轻量级可分离模块——分组-特征协调模块(Grouping-Feature Coordination Module, GF-Core),该模块同时调控分组层(grouping layer)与特征提取层(feature extraction layer),实现更精细的特征聚合;此外,还设计了一种专为点输入定制的自监督预训练策略,显著增强模型在复杂点云场景下的鲁棒性。实验表明,该方法在ModelNet40上将基线模型准确率提升至94.0%,达到先进框架水平,且保持架构简洁性。
链接: https://arxiv.org/abs/2509.16639
作者: Shangzhuo Xie,Qianqian Yang
机构: Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work has been submitted to the IEEE for possible publication
Abstract:Point cloud analysis has evolved with diverse network architectures, while existing works predominantly focus on introducing novel structural designs. However, conventional point-based architectures - processing raw points through sequential sampling, grouping, and feature extraction layers - demonstrate underutilized potential. We notice that substantial performance gains can be unlocked through strategic module integration rather than structural modifications. In this paper, we propose the Grouping-Feature Coordination Module (GF-Core), a lightweight separable component that simultaneously regulates both grouping layer and feature extraction layer to enable more nuanced feature aggregation. Besides, we introduce a self-supervised pretraining strategy specifically tailored for point-based inputs to enhance model robustness in complex point cloud analysis scenarios. On ModelNet40 dataset, our method elevates baseline networks to 94.0% accuracy, matching advanced frameworks’ performance while preserving architectural simplicity. On three variants of the ScanObjectNN dataset, we obtain improvements of 2.96%, 6.34%, and 6.32% respectively.
zh
[CV-186] owards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification IJCAI2025
【速读】:该论文针对现有行人重识别(Person Re-identification, ReID)任务无法支持全天候、长时序场景下有效检索的问题,提出了一种新的任务范式——任意时间行人重识别(Anytime Person Re-identification, AT-ReID),旨在实现跨时间段(如白天与夜间)、跨场景下的稳定检索性能。其解决方案的关键在于构建了首个大规模多时段、多视角数据集AT-USTC(含403k张图像,覆盖21个月、270名志愿者平均拍摄29.1次),并设计了一个统一模型Uni-AT,包含三个核心组件:多场景ReID(MS-ReID)框架用于学习场景特异性特征、属性专家混合(Mixture-of-Attribute-Experts, MoAE)模块以缓解跨场景干扰,以及分层动态加权(Hierarchical Dynamic Weighting, HDW)策略保障各场景训练的平衡性,从而显著提升模型在多样化场景下的泛化能力。
链接: https://arxiv.org/abs/2509.16635
作者: Xulin Li,Yan Lu,Bin Liu,Jiaze Li,Qinhong Yang,Tao Gong,Qi Chu,Mang Ye,Nenghai Yu
机构: University of Science and Technology of China (中国科学技术大学); Anhui Province Key Laboratory of Digital Security (安徽省数字安全重点实验室); The Chinese University of Hong Kong (香港中文大学); School of Computer Science, Wuhan University (武汉大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCAI 2025
Abstract:In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, ranging from short-term to long-term. However, existing ReID tasks and datasets can not meet this requirement, as they are constrained by available time and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 403k images of individuals wearing multiple clothes captured by RGB and IR cameras. Our data collection spans 21 months, and 270 volunteers were photographed on average 29.1 times across different dates or scenes, 4-15 times more than current datasets, providing conditions for follow-up investigations in AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific features learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model leads to satisfactory results and exhibits excellent generalization to all scenarios.
zh
[CV-187] DA-Font: Few-Shot Font Generation via Dual-Attention Hybrid Integration ACM-MM2025
【速读】:该论文旨在解决少样本字体生成(few-shot font generation)中因字体风格多样性和复杂性导致的生成结果存在笔画错误、伪影和模糊等问题,从而提升生成字体的结构完整性和局部细节保真度。其解决方案的关键在于提出DA-Font框架,核心创新为引入双注意力混合模块(Dual-Attention Hybrid Module, DAHM),该模块包含两个协同工作的注意力块:组件注意力块(component attention block)利用内容图像中的组件信息引导风格迁移过程,关系注意力块(relation attention block)通过交互原始与风格化组件特征进一步优化空间关系;此外,还设计了角点一致性损失(corner consistency loss)和弹性网格特征损失(elastic mesh feature loss)以增强几何对齐效果,从而在保持字符形状准确性和风格纹理的同时显著提升生成质量。
链接: https://arxiv.org/abs/2509.16632
作者: Weiran Chen,Guiqian Zhu,Ying Li,Yi Ji,Chunping Liu
机构: Soochow University (苏州大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ACM MM 2025
Abstract:Few-shot font generation aims to create new fonts with a limited number of glyph references. It can be used to significantly reduce the labor cost of manual font design. However, due to the variety and complexity of font styles, the results generated by existing methods often suffer from visible defects, such as stroke errors, artifacts and blurriness. To address these issues, we propose DA-Font, a novel framework which integrates a Dual-Attention Hybrid Module (DAHM). Specifically, we introduce two synergistic attention blocks: the component attention block that leverages component information from content images to guide the style transfer process, and the relation attention block that further refines spatial relationships through interacting the content feature with both original and stylized component-wise representations. These two blocks collaborate to preserve accurate character shapes and stylistic textures. Moreover, we also design a corner consistency loss and an elastic mesh feature loss to better improve geometric alignment. Extensive experiments show that our DA-Font outperforms the state-of-the-art methods across diverse font styles and characters, demonstrating its effectiveness in enhancing structural integrity and local fidelity. The source code can be found at \hrefthis https URL\textitthis https URL.
zh
[CV-188] Follow-Your-Emoji-Faster: Towards Efficient Fine-Controllable and Expressive Freestyle Portrait Animation
【速读】:该论文旨在解决自由风格肖像动画生成中的三大核心挑战:保持参考肖像的身份一致性、准确迁移目标表情,以及在保证生成效率的前提下实现长期时间一致性。解决方案的关键在于提出一个基于扩散模型的高效框架——Follow-Your-Emoji-Faster,其创新性地引入两个核心组件:一是将表情感知的面部关键点作为显式运动信号,提升运动对齐精度并支持夸张表情,同时减少身份泄露;二是细粒度面部损失,结合表情和面部掩码以捕捉细微表情并忠实保留参考外观。此外,为克服扩散模型在长时动画生成中的效率瓶颈,作者进一步设计了渐进式生成策略与泰勒插值缓存机制,实现了2.6倍无损加速,从而保障高质量、高效率且可控的动画输出。
链接: https://arxiv.org/abs/2509.16630
作者: Yue Ma,Zexuan Yan,Hongyu Liu,Hongfa Wang,Heng Pan,Yingqing He,Junkun Yuan,Ailing Zeng,Chengfei Cai,Heung-Yeung Shum,Zhifeng Li,Wei Liu,Linfeng Zhang,Qifeng Chen
机构: HKUST(香港科技大学); SJTU(上海交通大学); Tencent(腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: accepted by IJCV2025. project page: this https URL
Abstract:We present Follow-Your-Emoji-Faster, an efficient diffusion-based framework for freestyle portrait animation driven by facial landmarks. The main challenges in this task are preserving the identity of the reference portrait, accurately transferring target expressions, and maintaining long-term temporal consistency while ensuring generation efficiency. To address identity preservation and accurate expression retargeting, we enhance Stable Diffusion with two key components: a expression-aware landmarks as explicit motion signals, which improve motion alignment, support exaggerated expressions, and reduce identity leakage; and a fine-grained facial loss that leverages both expression and facial masks to better capture subtle expressions and faithfully preserve the reference appearance. With these components, our model supports controllable and expressive animation across diverse portrait types, including real faces, cartoons, sculptures, and animals. However, diffusion-based frameworks typically struggle to efficiently generate long-term stable animation results, which remains a core challenge in this task. To address this, we propose a progressive generation strategy for stable long-term animation, and introduce a Taylor-interpolated cache, achieving a 2.6X lossless acceleration. These two strategies ensure that our method produces high-quality results efficiently, making it user-friendly and accessible. Finally, we introduce EmojiBench++, a more comprehensive benchmark comprising diverse portraits, driving videos, and landmark sequences. Extensive evaluations on EmojiBench++ demonstrate that Follow-Your-Emoji-Faster achieves superior performance in both animation quality and controllability. The code, training dataset and benchmark will be found in this https URL.
zh
[CV-189] Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
【速读】:该论文旨在解决小型视觉语言模型(Vision Language Models, VLMs)在科学视觉问答(Scientific Visual Question Answering, VQA)任务中性能不足的问题,尤其针对低资源语言场景下的泛化能力薄弱问题。解决方案的关键在于提出一种新的监督微调范式——视觉-描述 aware 监督微调(Vision-Caption aware Supervised Fine-Tuning, VCASFT),该方法通过将图像描述(image captions)作为零样本提示(zero-shot prompts)与问答对联合训练,并结合指令微调(instruction tuning)策略,显著提升模型在复杂科学语境下的理解与推理能力。此外,作者构建了高质量的印地语多模态问答数据集 HiSciVQA,用于验证该方法在低资源语言中的有效性,并引入基于大语言模型(LLM-based)的新评估方案以超越传统 n-gram 匹配指标,提供更深入的模型性能洞察。
链接: https://arxiv.org/abs/2509.16628
作者: Janak Kapuriya,Anwar Shaikh,Arnav Goel,Medha Hira,Apoorv Singh,Jay Saraf,Sanjana,Vaibhav Nauriyal,Avinash Anand,Zhengkui Wang,Rajiv Ratn Shah
机构: Indraprastha Institute of Information Technology, Delhi (印度德里印地普拉斯特拉信息技术学院); Singapore Institute of Technology (新加坡理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In this study, we introduce Vision-Caption aware Supervised FineTuning (VCASFT), a novel learning paradigm designed to enhance the performance of smaller Vision Language Models(VLMs) on scientific visual question answering(VQA) tasks. VCASFT leverages image captions as zero-shot prompts alongside question-answer pairs and instruction-tunes models to yield significant performance improvements. To comprehensively evaluate VCASFT, we benchmark it on ScienceQA, which consists of questions across diverse languages, subjects, and fields, demonstrating its adaptability and effectiveness in a variety of educational contexts. Additionally, to further demonstrate the effectiveness of this technique on lowresource languages, we developed HiSciVQA, a dataset comprising 2,245 high-quality, hand-annotated Hindi multimodal QA pairs. This dataset addresses the critical need for low-resource language QA datasets and serves as a foundation for testing VCASFT. Additionally, we introduce a novel LLM-based evaluation scheme to evaluate VLMs on HiSciVQA which offers deeper insights into model effectiveness surpassing traditional n-gram matching accuracy metrics. We are committed to advancing the field by open-sourcing all code files and the HiSciVQA dataset for the research community.
zh
[CV-190] CGTGait: Collaborative Graph and Transformer for Gait Emotion Recognition
【速读】:该论文旨在解决基于骨骼序列的步态情绪识别中长期时间依赖性建模不足的问题,现有方法多聚焦于空间和局部时间运动信息的提取,难以捕捉跨帧的全局时序特征。其解决方案的关键在于提出一种名为CGTGait的新框架,该框架通过协同融合图卷积(Graph Convolution)与Transformer机制,在每个CGT模块中分别利用图卷积捕获帧级空间拓扑结构、使用Transformer建模全局时间依赖关系,并引入双向交叉流融合(Bidirectional Cross-Stream Fusion, BCSF)模块有效聚合姿态与运动的时空特征,实现两路信息互补交换,从而提升情绪识别性能并显著降低计算复杂度(测试阶段仅需0.34G FLOPs,相较之前方法减少约82.2%)。
链接: https://arxiv.org/abs/2509.16623
作者: Junjie Zhou,Haijun Xiong,Junhao Lu,Ziyu Lin,Bin Feng
机构: Huazhong University of Science and Technology (华中科技大学); Wuhan University of Technology (武汉理工大学); Hefei University of Technology (合肥工业大学); Boston University (波士顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by IJCB2025
Abstract:Skeleton-based gait emotion recognition has received significant attention due to its wide-ranging applications. However, existing methods primarily focus on extracting spatial and local temporal motion information, failing to capture long-range temporal representations. In this paper, we propose \textbfCGTGait, a novel framework that collaboratively integrates graph convolution and transformers to extract discriminative spatiotemporal features for gait emotion recognition. Specifically, CGTGait consists of multiple CGT blocks, where each block employs graph convolution to capture frame-level spatial topology and the transformer to model global temporal dependencies. Additionally, we introduce a Bidirectional Cross-Stream Fusion (BCSF) module to effectively aggregate posture and motion spatiotemporal features, facilitating the exchange of complementary information between the two streams. We evaluate our method on two widely used datasets, Emotion-Gait and ELMD, demonstrating that our CGTGait achieves state-of-the-art or at least competitive performance while reducing computational complexity by approximately \textbf82.2% (only requiring 0.34G FLOPs) during testing. Code is available at \smallthis https URL.
zh
[CV-191] Surgical-MambaLLM : Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery MICCAI2025
【速读】:该论文旨在解决机器人手术场景中视觉问答定位任务(Surgical-VQLA)中存在的问题,即现有方法难以建立文本与视觉细节之间的复杂依赖关系,且对手术场景的空间信息感知能力不足。解决方案的关键在于提出一种新颖的模型 Surgical-MambaLLM,首次将 Mamba2 与大语言模型(LLM)结合应用于外科领域,利用 Mamba2 的跨模态融合能力和空间感知特性增强 LLM 对手术图像的理解。具体而言,其核心创新包括:设计 Cross-modal Bidirectional Mamba2 Integration (CBMI) 模块以实现高效的多模态融合,并针对手术场景几何特征定制 Surgical Instrument Perception (SIP) 扫描模式,提升模型对空间信息的感知精度,从而显著优于当前最优方法在 EndoVis17-VQLA 和 EndoVis18-VQLA 数据集上的表现。
链接: https://arxiv.org/abs/2509.16618
作者: Pengfei Hao,Hongqiu Wang,Shuaibo Li,Zhaohu Xing,Guang Yang,Kaishun Wu,Lei Zhu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Early accepted by MICCAI2025
Abstract:In recent years, Visual Question Localized-Answering in robotic surgery (Surgical-VQLA) has gained significant attention for its potential to assist medical students and junior doctors in understanding surgical scenes. Recently, the rapid development of Large Language Models (LLMs) has provided more promising solutions for this task. However, current methods struggle to establish complex dependencies between text and visual details, and have difficulty perceiving the spatial information of surgical scenes. To address these challenges, we propose a novel method, Surgical-MambaLLM, which is the first to combine Mamba2 with LLM in the surgical domain, that leverages Mamba2’s ability to effectively capture cross-modal dependencies and perceive spatial information in surgical scenes, thereby enhancing the LLMs’ understanding of surgical images. Specifically, we propose the Cross-modal Bidirectional Mamba2 Integration (CBMI) module to leverage Mamba2 for effective multimodal fusion, with its cross-modal integration capabilities. Additionally, tailored to the geometric characteristics of surgical scenes, we design the Surgical Instrument Perception (SIP) scanning mode for Mamba2 to scan the surgical images, enhancing the model’s spatial understanding of the surgical scene. Extensive experiments demonstrate that our Surgical-MambaLLM model outperforms the state-of-the-art methods on the EndoVis17-VQLA and EndoVis18-VQLA datasets, significantly improving the performance of the Surgical-VQLA task.
zh
[CV-192] Detection and Simulation of Urban Heat Islands Using a Fine-Tuned Geospatial Foundation Model
【速读】:该论文旨在解决城市热岛效应加剧背景下,传统机器学习方法因数据基础设施有限而导致的空气温度预测不准确问题,尤其是在缺乏充足观测数据的弱势区域。其解决方案的关键在于利用基于全球非结构化数据训练的地理空间基础模型(geospatial foundation model),通过少量微调即可实现高精度的像素级地表温度预测,并具备良好的外推能力(最高达3.62°C),从而有效支持未来气候情景下的城市热管理与植被优化策略评估。
链接: https://arxiv.org/abs/2509.16617
作者: David Kreismann
机构: Baden-Wuerttemberg Cooperative State University (巴登-符腾堡州合作大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 4 figures, to appear in GI LNI (SKILL 2025)
Abstract:As urbanization and climate change progress, urban heat island effects are becoming more frequent and severe. To formulate effective mitigation plans, cities require detailed air temperature data. However, predictive analytics methods based on conventional machine learning models and limited data infrastructure often provide inaccurate predictions, especially in underserved areas. In this context, geospatial foundation models trained on unstructured global data demonstrate strong generalization and require minimal fine-tuning, offering an alternative for predictions where traditional approaches are limited. This study fine-tunes a geospatial foundation model to predict urban land surface temperatures under future climate scenarios and explores its response to land cover changes using simulated vegetation strategies. The fine-tuned model achieved pixel-wise downscaling errors below 1.74 °C and aligned with ground truth patterns, demonstrating an extrapolation capacity up to 3.62 °C.
zh
[CV-193] Describe-to-Score: Text-Guided Efficient Image Complexity Assessment
【速读】:该论文旨在解决图像复杂度(Image Complexity, IC)评估中因仅依赖视觉特征而忽视高层语义信息导致的准确性与泛化能力不足的问题。解决方案的关键在于提出D2S(Describe-to-Score)框架,通过融合视觉与文本语义特征,利用预训练视觉-语言模型生成图像描述,并引入特征对齐与熵分布对齐机制,使语义信息有效引导复杂度评估,同时弥合视觉与文本模态间的差异。该方法在训练阶段使用多模态信息,推理时仅需视觉分支,避免了多模态计算开销,兼顾了性能提升与效率优化。
链接: https://arxiv.org/abs/2509.16609
作者: Shipeng Liu,Zhonglin Zhang,Dengfeng Chen,Liang Zhao
机构: Xi’an University of Architecture and Technology (西安建筑科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Accurately assessing image complexity (IC) is critical for computer vision, yet most existing methods rely solely on visual features and often neglect high-level semantic information, limiting their accuracy and generalization. We introduce vision-text fusion for IC modeling. This approach integrates visual and textual semantic features, increasing representational diversity. It also reduces the complexity of the hypothesis space, which enhances both accuracy and generalization in complexity assessment. We propose the D2S (Describe-to-Score) framework, which generates image captions with a pre-trained vision-language model. We propose the feature alignment and entropy distribution alignment mechanisms, D2S guides semantic information to inform complexity assessment while bridging the gap between vision and text modalities. D2S utilizes multi-modal information during training but requires only the vision branch during inference, thereby avoiding multi-modal computational overhead and enabling efficient assessment. Experimental results demonstrate that D2S outperforms existing methods on the IC9600 dataset and maintains competitiveness on no-reference image quality assessment (NR-IQA) benchmark, validating the effectiveness and efficiency of multi-modal fusion in complexity-related tasks. Code is available at: this https URL
zh
[CV-194] FakeChain: Exposing Shallow Cues in Multi-Step Deepfake Detection
【速读】:该论文旨在解决多步或混合深度伪造(multi-step or hybrid deepfakes)对现有检测模型带来的挑战,即当前检测模型主要针对单一步骤的伪造进行训练,而在面对由多种生成方法(如人脸替换、GAN生成和扩散模型)按顺序组合形成的复杂伪造时,其性能显著下降。解决方案的关键在于构建了一个大规模基准FakeChain,涵盖1-3步伪造,使用五种前沿生成器合成数据,并系统分析了不同步骤数、生成器组合及质量设置下的检测性能与频谱特性。研究发现,检测性能高度依赖于最终伪造类型,而非累积的伪造痕迹,表明现有模型主要依赖最后一阶段的伪影特征,而非伪造历史序列,从而凸显出未来检测模型需显式建模伪造序列和历史的重要性。
链接: https://arxiv.org/abs/2509.16602
作者: Minji Heo,Simon S. Woo
机构: Sungkyunkwan University (成均馆大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multi-step or hybrid deepfakes, created by sequentially applying different deepfake creation methods such as Face-Swapping, GAN-based generation, and Diffusion methods, can pose an emerging and unforseen technical challenge for detection models trained on single-step forgeries. While prior studies have mainly focused on detecting isolated single manipulation, little is known about the detection model behavior under such compositional, hybrid, and complex manipulation pipelines. In this work, we introduce \textbfFakeChain, a large-scale benchmark comprising 1-, 2-, and 3-Step forgeries synthesized using five state-of-the-art representative generators. Using this approach, we analyze detection performance and spectral properties across hybrid manipulation at different step, along with varying generator combinations and quality settings. Surprisingly, our findings reveal that detection performance highly depends on the final manipulation type, with F1-score dropping by up to \textbf58.83% when it differs from training distribution. This clearly demonstrates that detectors rely on last-stage artifacts rather than cumulative manipulation traces, limiting generalization. Such findings highlight the need for detection models to explicitly consider manipulation history and sequences. Our results highlight the importance of benchmarks such as FakeChain, reflecting growing synthesis complexity and diversity in real-world scenarios. Our sample code is available here\footnotethis https URL.
zh
[CV-195] SQS: Enhancing Sparse Perception Models via Query-based Splatting in Autonomous Driving NEURIPS2025
【速读】:该论文旨在解决稀疏感知模型(Sparse Perception Models, SPMs)在自动驾驶场景下预训练阶段缺乏细粒度上下文特征学习能力的问题,从而限制其在占用预测(occupancy prediction)和3D目标检测(3D object detection)等下游任务中的性能表现。解决方案的关键在于提出一种名为SQS的新型基于查询的体素投射(splatting)预训练方法,其核心创新是引入一个可插拔模块,在预训练阶段从稀疏查询中预测3D高斯表示(3D Gaussian representations),并通过自监督体素投射机制利用多视角图像与深度图重建来学习细粒度上下文特征;在微调阶段,通过查询交互机制将预训练的高斯查询与任务特定查询显式关联,有效适配不同任务需求,显著提升下游任务性能。
链接: https://arxiv.org/abs/2509.16588
作者: Haiming Zhang,Yiyao Zhu,Wending Zhou,Xu Yan,Yingjie Cai,Bingbing Liu,Shuguang Cui,Zhen Li
机构: FNii; SSE, CUHK-Shenzhen; HKUST; Huawei Noah’s Ark Lab
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注: NeurIPS 2025 (Spotlight)
Abstract:Sparse Perception Models (SPMs) adopt a query-driven paradigm that forgoes explicit dense BEV or volumetric construction, enabling highly efficient computation and accelerated inference. In this paper, we introduce SQS, a novel query-based splatting pre-training specifically designed to advance SPMs in autonomous driving. SQS introduces a plug-in module that predicts 3D Gaussian representations from sparse queries during pre-training, leveraging self-supervised splatting to learn fine-grained contextual features through the reconstruction of multi-view images and depth maps. During fine-tuning, the pre-trained Gaussian queries are seamlessly integrated into downstream networks via query interaction mechanisms that explicitly connect pre-trained queries with task-specific queries, effectively accommodating the diverse requirements of occupancy prediction and 3D object detection. Extensive experiments on autonomous driving benchmarks demonstrate that SQS delivers considerable performance gains across multiple query-based 3D perception tasks, notably in occupancy prediction and 3D object detection, outperforming prior state-of-the-art pre-training approaches by a significant margin (i.e., +1.3 mIoU on occupancy prediction and +1.0 NDS on 3D detection).
zh
[CV-196] A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis
【速读】:该论文旨在解决生成式 AI 在医学影像领域中存在的数据记忆风险问题,即深度生成模型可能在训练过程中记忆敏感的患者信息,从而导致未经授权的数据泄露。其解决方案的关键在于提出一种名为 DeepSSIM 的自监督度量方法,该方法通过学习图像嵌入空间并强制嵌入间的余弦相似度匹配原始图像空间中的结构相似性指数(Structural Similarity Index, SSIM)得分,实现对生成样本中训练数据记忆程度的有效量化。为增强对解剖结构特征的捕捉能力,DeepSSIM 在训练中引入了保持结构一致性的增强策略,使其能够在无需精确空间对齐的情况下可靠估计相似性,显著提升了检测性能,在脑部 MRI 合成数据实验中相比现有最优方法平均 F1 分数提升 52.03%。
链接: https://arxiv.org/abs/2509.16582
作者: Antonio Scardace,Lemuel Puglisi,Francesco Guarnera,Sebastiano Battiato,Daniele Ravì
机构: University of Catania (卡塔尼亚大学); University of Messina (墨西拿大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: this https URL.
zh
[CV-197] V-CECE: Visual Counterfactual Explanations via Conceptual Edits NEURIPS2025
【速读】:该论文旨在解决现有黑盒反事实生成框架在生成过程中忽视编辑语义内容的问题,这些问题通常依赖大量训练数据来引导生成过程,导致解释性不足且难以保证生成结果的合理性。其解决方案的关键在于提出一种无需训练的即插即用式黑盒反事实生成框架,该框架基于最优编辑的理论保证,逐步提出语义合理的编辑步骤,利用预训练的图像编辑扩散模型实现对目标分类器(如CNN、ViT和LVLM)的黑盒操作,从而生成具有人类水平解释力的反事实样本,并通过全面的人类评估验证了该方法在弥合人类推理与神经网络行为之间解释差距方面的有效性。
链接: https://arxiv.org/abs/2509.16567
作者: Nikolaos Spanos,Maria Lymperaiou,Giorgos Filandrianos,Konstantinos Thomas,Athanasios Voulodimos,Giorgos Stamou
机构: National Technical University of Athens (雅典国立技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted in NeurIPS 2025
Abstract:Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing both Convolutional Neural Network (CNN), Vision Transformer (ViT) and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.
zh
[CV-198] Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization EMNLP2025
【速读】:该论文旨在解决当前文本-视频检索中辅助描述(auxiliary captions)生成过于泛化、缺乏区分度的问题,以及传统评估指标(如BLEU)无法有效衡量检索任务所需判别能力的局限性。解决方案的关键在于提出CaRe-DPO框架,其核心是Dual-Group Direct Preference Optimization(DG-DPO),通过建模不同视频与描述对之间的偏好关系来直接优化caption生成过程,从而提升细粒度检索性能;同时引入基于多模态大语言模型(MLLM)的检索模型,并采用角色嵌入(role-embeddings)以更好地区分具有不同功能的角色输入(如查询文本和辅助描述)。
链接: https://arxiv.org/abs/2509.16560
作者: Ji Soo Lee,Byungoh Ko,Jaewon Cho,Howoong Lee,Jaewoon Byun,Hyunwoo J. Kim
机构: Korea University (韩国科学技术院); Hanwha Vision (韩华视觉); KAIST (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: EMNLP 2025 Findings
Abstract:In text-video retrieval, auxiliary captions are often used to enhance video understanding, bridging the gap between the modalities. While recent advances in multi-modal large language models (MLLMs) have enabled strong zero-shot caption generation, we observe that such captions tend to be generic and indistinguishable across visually similar videos, limiting their utility for fine-grained retrieval. Moreover, conventional captioning approaches are typically evaluated using language generation metrics, such as BLEU, which are not typically tailored for retrieval tasks that require making discriminative distinctions between candidates. To address this, we propose \textbfCaRe-DPO , a retrieval framework that directly optimizes caption generation using retrieval relevance scores. At its core is Dual-Group Direct Preference Optimization (DG-DPO), a novel learning strategy that supervises captioning by modeling preferences across groups of distinct video and caption pairs. In addition, we present an MLLM-based retrieval model that incorporates role-embeddings to better distinguish between textual inputs with different functional roles, such as an auxiliary caption and a text query. Through extensive experiments, we demonstrate that CaRe-DPO significantly enhances retrieval performance by effectively leveraging auxiliary knowledge to generate fine-grained captions for retrieval. Code is available at this https URL.
zh
[CV-199] Person Identification from Egocentric Human-Object Interactions using 3D Hand Pose
【速读】:该论文旨在解决在增强现实(AR)辅助系统中实现无感用户身份识别的问题,特别是在高安全要求的人机交互场景(如航空驾驶舱、航天维护和手术操作)中,如何通过人类-物体交互(Human-Object Interaction, HOI)识别来实现精准且低延迟的用户认证。解决方案的关键在于提出了一种多阶段框架 I2S(Interact2Sign),其核心是基于第一人称视角视频中的3D手部姿态分析,通过手工特征提取与顺序特征增强策略:首先识别物体类别,继而进行HOI识别,最终完成用户身份识别。其中,创新性地引入了“跨手空间包络”(Inter-Hand Spatial Envelope, IHSE)这一新型描述子,并对空间(Spatial)、频率(Frequency)、运动学(Kinematic)、朝向(Orientation)等语义特征进行系统整合,经消融实验验证后,在双侧物体操作数据集上实现了平均F1分数97.52%的识别精度,同时模型体积小于4 MB、推理时间仅0.1秒,满足实时端侧认证需求。
链接: https://arxiv.org/abs/2509.16557
作者: Muhammad Hamza,Danish Hamid,Muhammad Tahir Akram
机构: Airlangga University (艾尔朗加大学); Abdul Wali Khan University Mardan (阿卜杜勒·瓦利汗大学马尔丹分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 21 pages, 8 figures, 7 tables. Preprint of a manuscript submitted to CCF Transactions on Pervasive Computing and Interaction (Springer), currently under review
Abstract:Human-Object Interaction Recognition (HOIR) and user identification play a crucial role in advancing augmented reality (AR)-based personalized assistive technologies. These systems are increasingly being deployed in high-stakes, human-centric environments such as aircraft cockpits, aerospace maintenance, and surgical procedures. This research introduces I2S (Interact2Sign), a multi stage framework designed for unobtrusive user identification through human object interaction recognition, leveraging 3D hand pose analysis in egocentric videos. I2S utilizes handcrafted features extracted from 3D hand poses and per forms sequential feature augmentation: first identifying the object class, followed by HOI recognition, and ultimately, user identification. A comprehensive feature extraction and description process was carried out for 3D hand poses, organizing the extracted features into semantically meaningful categories: Spatial, Frequency, Kinematic, Orientation, and a novel descriptor introduced in this work, the Inter-Hand Spatial Envelope (IHSE). Extensive ablation studies were conducted to determine the most effective combination of features. The optimal configuration achieved an impressive average F1-score of 97.52% for user identification, evaluated on a bimanual object manipulation dataset derived from the ARCTIC and H2O datasets. I2S demonstrates state-of-the-art performance while maintaining a lightweight model size of under 4 MB and a fast inference time of 0.1 seconds. These characteristics make the proposed framework highly suitable for real-time, on-device authentication in security-critical, AR-based systems.
zh
[CV-200] ViTCAE: ViT-based Class-conditioned Autoencoder
【速读】:该论文旨在解决基于视觉 Transformer (Vision Transformer, ViT) 的自编码器在生成控制和优化效率方面的局限性,具体表现为 Class token 的全局语义信息未被充分利用以及注意力机制静态固定导致的生成灵活性不足与计算冗余。解决方案的关键在于提出 ViTCAE 框架:首先将 Class token 重定义为生成核心(generative linchpin),使其作为全局潜在变量指导局部 patch 级潜在变量的先验分布,从而建立由全局语义驱动局部细节合成的强依赖关系;其次,借鉴意见动态理论(opinion dynamics),将每个注意力头视为一个寻求共识的多智能体系统,并设计基于分布稳定性的自适应温度调度策略,结合注意力演化距离和共识/聚类函数等诊断指标实现收敛头的原理性冻结(head-freezing),在不损失重建保真度的前提下显著提升训练效率。
链接: https://arxiv.org/abs/2509.16554
作者: Vahid Jebraeeli,Hamid Krim,Derya Cansever
机构: NC State University (北卡罗来纳州立大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: -
Abstract:Vision Transformer (ViT) based autoencoders often underutilize the global Class token and employ static attention mechanisms, limiting both generative control and optimization efficiency. This paper introduces ViTCAE, a framework that addresses these issues by re-purposing the Class token into a generative linchpin. In our architecture, the encoder maps the Class token to a global latent variable that dictates the prior distribution for local, patch-level latent variables, establishing a robust dependency where global semantics directly inform the synthesis of local details. Drawing inspiration from opinion dynamics, we treat each attention head as a dynamical system of interacting tokens seeking consensus. This perspective motivates a convergence-aware temperature scheduler that adaptively anneals each head’s influence function based on its distributional stability. This process enables a principled head-freezing mechanism, guided by theoretically-grounded diagnostics like an attention evolution distance and a consensus/cluster functional. This technique prunes converged heads during training to significantly improve computational efficiency without sacrificing fidelity. By unifying a generative Class token with an adaptive attention mechanism rooted in multi-agent consensus theory, ViTCAE offers a more efficient and controllable approach to transformer-based generation.
zh
[CV-201] ST-GS: Vision-Based 3D Semantic Occupancy Prediction with Spatial-Temporal Gaussian Splatting
【速读】:该论文旨在解决基于3D语义高斯(3D semantic Gaussians)的占用预测方法在多视角空间交互不足和多帧时间一致性有限的问题。解决方案的关键在于提出一种时空高斯点绘(Spatial-Temporal Gaussian Splatting, ST-GS)框架:首先设计了一种受引导的空间聚合策略,结合双模式注意力机制以增强高斯表示中的空间交互;其次引入一种几何感知的时间融合方案,有效利用历史上下文信息提升场景补全的时间连续性。
链接: https://arxiv.org/abs/2509.16552
作者: Xiaoyang Yan,Muleilan Pei,Shaojie Shen
机构: The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:3D occupancy prediction is critical for comprehensive scene understanding in vision-centric autonomous driving. Recent advances have explored utilizing 3D semantic Gaussians to model occupancy while reducing computational overhead, but they remain constrained by insufficient multi-view spatial interaction and limited multi-frame temporal consistency. To overcome these issues, in this paper, we propose a novel Spatial-Temporal Gaussian Splatting (ST-GS) framework to enhance both spatial and temporal modeling in existing Gaussian-based pipelines. Specifically, we develop a guidance-informed spatial aggregation strategy within a dual-mode attention mechanism to strengthen spatial interaction in Gaussian representations. Furthermore, we introduce a geometry-aware temporal fusion scheme that effectively leverages historical context to improve temporal continuity in scene completion. Extensive experiments on the large-scale nuScenes occupancy prediction benchmark showcase that our proposed approach not only achieves state-of-the-art performance but also delivers markedly better temporal consistency compared to existing Gaussian-based methods.
zh
[CV-202] Efficient Rectified Flow for Image Fusion
【速读】:该论文旨在解决扩散模型在图像融合任务中计算复杂度高、推理时间冗余的问题,从而限制其实际应用效率。解决方案的关键在于提出RFfusion,一种基于修正流(Rectified Flow)的一步扩散模型,通过将采样路径线性化实现一步采样而无需额外训练,同时设计了一种面向图像融合任务的变分自编码器(Variational Autoencoder, VAE)架构,将融合操作嵌入潜在空间以降低计算复杂度,并采用两阶段训练策略缓解传统重建导向的VAE目标与图像融合需求之间的不匹配问题,从而在保持高质量融合结果的同时显著提升推理效率。
链接: https://arxiv.org/abs/2509.16549
作者: Zirui Wang,Jiayi Zhang,Tianwei Guan,Yuhan Zhou,Xingyuan Li,Minjing Dong,Jinyuan Liu
机构: City University of Hong Kong (香港城市大学); Dalian University of Technology (大连理工大学); Chinese University of Hong Kong (香港中文大学); Zhejiang University (浙江大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities to fuse images. In recent years, diffusion models have made significant developments in the field of image fusion. However, diffusion models often require complex computations and redundant inference time, which reduces the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at this https URL.
zh
[CV-203] Lattice Boltzmann Model for Learning Real-World Pixel Dynamicity NEURIPS2025
【速读】:该论文旨在解决视觉跟踪中真实世界像素动态性建模不足的问题,尤其是如何高效适应复杂场景下的目标运动与外观变化。解决方案的关键在于提出一种基于格子玻尔兹曼模型(Lattice Boltzmann Model, LBM)的新型视觉跟踪框架,其核心思想是将视觉表示分解为动态像素格点(dynamic pixel lattices),并通过碰撞-平流过程模拟像素运动状态演化:预测阶段通过多层预测-更新网络获取目标像素的高维分布,构建空间邻域内的格点碰撞机制并引入时序上下文中的格点平流;更新阶段则利用在线视觉表征对像素分布进行校正,从而实现对目标位置和可见性的精确估计。该方法在多个真实世界点跟踪(如TAP-Vid、RoboTAP)及大规模开放世界目标跟踪基准(如TAO、BFT、OVT-B)上均验证了其在线实时性和实际适用性。
链接: https://arxiv.org/abs/2509.16527
作者: Guangze Zheng,Shijie Lin,Haobo Zuo,Si Si,Ming-Shan Wang,Changhong Fu,Jia Pan
机构: HKU(香港大学); Institute of Zoology, CAS(中国科学院动物研究所); Tongji University(同济大学); LimX Dynamics
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: NeurIPS 2025. Project page: this https URL
Abstract:This work proposes the Lattice Boltzmann Model (LBM) to learn real-world pixel dynamicity for visual tracking. LBM decomposes visual representations into dynamic pixel lattices and solves pixel motion states through collision-streaming processes. Specifically, the high-dimensional distribution of the target pixels is acquired through a multilayer predict-update network to estimate the pixel positions and visibility. The predict stage formulates lattice collisions among the spatial neighborhood of target pixels and develops lattice streaming within the temporal visual context. The update stage rectifies the pixel distributions with online visual representations. Compared with existing methods, LBM demonstrates practical applicability in an online and real-time manner, which can efficiently adapt to real-world visual tracking tasks. Comprehensive evaluations of real-world point tracking benchmarks such as TAP-Vid and RoboTAP validate LBM’s efficiency. A general evaluation of large-scale open-world object tracking benchmarks such as TAO, BFT, and OVT-B further demonstrates LBM’s real-world practicality.
zh
[CV-204] PM25Vision: A Large-Scale Benchmark Dataset for Visual Estimation of Air Quality
【速读】:该论文旨在解决通过街景图像(street-level images)准确估算细颗粒物(PM2.5)浓度的难题,以弥补现有空气质量管理数据在空间分辨率和覆盖范围上的不足。其解决方案的关键在于构建了迄今为止规模最大、最全面的PM2.5视觉感知数据集PM25Vision(PM25V),包含超过11,114张与时间戳和地理位置匹配的PM2.5监测读数,覆盖3,261个空气质量指数(AQI)监测站点及11年跨度,空间精度达5公里,显著优于以往城市级精度的数据集。此外,论文还提供了基于卷积神经网络(CNN)和Transformer架构的基线模型性能,为后续研究提供了高质量的数据基础和方法参考。
链接: https://arxiv.org/abs/2509.16519
作者: Yang Han
机构: New York University (纽约大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We introduce PM25Vision (PM25V), the largest and most comprehensive dataset to date for estimating air quality - specifically PM2.5 concentrations - from street-level images. The dataset contains over 11,114 images matched with timestamped and geolocated PM2.5 readings across 3,261 AQI monitoring stations and 11 years, significantly exceeding the scale of previous benchmarks. The spatial accuracy of this dataset has reached 5 kilometers, far exceeding the city-level accuracy of many datasets. We describe the data collection, synchronization, and cleaning pipelines, and provide baseline model performances using CNN and transformer architectures. Our dataset is publicly available.
zh
[CV-205] FG-Attn: Leverag ing Fine-Grained Sparsity In Diffusion Transformers
【速读】:该论文旨在解决扩散变压器(Diffusion Transformers)在生成真实视频时计算成本高昂的问题,尤其是注意力层作为核心瓶颈导致的显著延迟——例如生成一个5秒视频需处理超过3万条嵌入向量序列。传统方法通过利用注意力矩阵中的稀疏性来降低计算量,但多依赖于粗粒度的块稀疏注意力(block-sparse attention),仅当整个M×M块(通常M=64)的注意力分数全为零时才跳过计算,未能充分挖掘注意力图中的细粒度稀疏特性。其解决方案的关键在于提出FG-Attn机制,该机制基于细粒度稀疏性,在M×1切片级别上跳过无意义的注意力计算(即一个查询块与单个键向量的点积),并通过设计一种新的高效批量加载操作“异步收集加载”(asynchronous-gather load),将相关键值对从内存中以打包块的形式加载至GPU共享内存,从而大幅减少冗余数据搬运和计算开销。实验表明,该方法在单张H100 GPU上可实现平均1.41–1.65倍的速度提升。
链接: https://arxiv.org/abs/2509.16518
作者: Sankeerth Durvasula,Kavya Sreedhar,Zain Moustafa,Suraj Kothawade,Ashish Gondimalla,Suvinay Subramanian,Narges Shahidi,Nandita Vijaykumar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Hardware Architecture (cs.AR)
备注:
Abstract:Generating realistic videos with diffusion transformers demands significant computation, with attention layers the central bottleneck; even producing a short clip requires running a transformer over a very long sequence of embeddings, e.g., more than 30K embeddings for a 5-second video, incurring significant latency. Prior work aims to mitigate this bottleneck by exploiting sparsity in the attention layers to reduce computation. However, these works typically rely on block-sparse attention, which skips score computation only when all entries in a block of attention scores (corresponding to M queries and M keys, with M = 64 typically) are zero. This coarse-granular skipping of attention scores does not fully exploit sparsity in the attention map and leaves room for improvement. In this work, we propose FG-Attn, a sparse attention mechanism for long-context diffusion transformers that leverages sparsity at a fine granularity. Unlike block-sparse attention, which skips entire MxM blocks, our approach skips computations at the granularity of Mx1 slices of the attention map. Each slice is produced by query-key dot products between a block of query vectors and a single key. To implement our proposed sparse attention mechanism, we develop a new efficient bulk-load operation called asynchronous-gather load. This load operation gathers a sparse set of relevant key-value vectors from memory and arranges them into packed tiles in the GPU’s shared memory. Only a sparse set of keys relevant to those queries are loaded into shared memory when computing attention for a block of queries, in contrast to loading full blocks of key tokens in block-sparse attention. Our fine-grained sparse attention, applied to video diffusion models, achieves an average 1.55X (up to 1.65X) speedup for 5 second, 480p videos, and an average 1.41X (up to 1.49X) for 5 second, 720p videos on a single H100 GPU.
zh
[CV-206] SlowFast-SCI: Slow-Fast Deep Unfolding Learning for Spectral Compressive Imaging
【速读】:该论文旨在解决现有深度展开(deep unfolding)方法在光谱压缩成像(spectral compressive imaging, SCI)中缺乏快速适应能力的问题,尤其是面对分布外(out-of-distribution)相机或定制化光谱设置时性能显著下降,且因多阶段预训练导致计算开销大、推理慢。解决方案的关键在于提出SlowFast-SCI框架,其核心创新是双速设计:在慢速学习阶段,通过成像引导的知识蒸馏将先验驱动的骨干网络压缩为轻量级快速展开模型;在快速学习阶段,于每个展开块中嵌入轻量化自监督适配模块,在测试时无需重训练骨干网络即可实现每样本级的即时校准。该方法首次将测试时适应(test-time adaptation)引入深度展开框架,实现了参数与浮点运算量减少超70%、分布外数据峰值信噪比(PSNR)提升达5.79 dB,并保持跨域适应性与4倍加速的适配速度。
链接: https://arxiv.org/abs/2509.16509
作者: Haijin Zeng,Xuan Lu,Yurong Zhang,Yongyong Chen,Jingyong Su,Jie Liu
机构: Harvard University (哈佛大学); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)); Shanghai Jiaotong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 12 pages
Abstract:Humans learn in two complementary ways: a slow, cumulative process that builds broad, general knowledge, and a fast, on-the-fly process that captures specific experiences. Existing deep-unfolding methods for spectral compressive imaging (SCI) mirror only the slow component-relying on heavy pre-training with many unfolding stages-yet they lack the rapid adaptation needed to handle new optical configurations. As a result, they falter on out-of-distribution cameras, especially in bespoke spectral setups unseen during training. This depth also incurs heavy computation and slow inference. To bridge this gap, we introduce SlowFast-SCI, a dual-speed framework seamlessly integrated into any deep unfolding network beyond SCI systems. During slow learning, we pre-train or reuse a priors-based backbone and distill it via imaging guidance into a compact fast-unfolding model. In the fast learning stage, lightweight adaptation modules are embedded within each block and trained self-supervised at test time via a dual-domain loss-without retraining the backbone. To the best of our knowledge, SlowFast-SCI is the first test-time adaptation-driven deep unfolding framework for efficient, self-adaptive spectral reconstruction. Its dual-stage design unites offline robustness with on-the-fly per-sample calibration-yielding over 70% reduction in parameters and FLOPs, up to 5.79 dB PSNR improvement on out-of-distribution data, preserved cross-domain adaptability, and a 4x faster adaptation speed. In addition, its modularity integrates with any deep-unfolding network, paving the way for self-adaptive, field-deployable imaging and expanded computational imaging modalities. Code and models are available at this https URL.
zh
[CV-207] OS-DiffVSR: Towards One-step Latent Diffusion Model for High-detailed Real-world Video Super-Resolution
【速读】:该论文旨在解决基于扩散模型的视频超分辨率(Video Super-Resolution, VSR)方法在实际应用中面临的两大核心问题:一是生成视频质量与推理效率之间的权衡,二是如何在保持帧间时序一致性的同时减少视频闪烁。解决方案的关键在于提出了一种单步扩散模型(One-Step Diffusion model for real-world Video Super-Resolution, OS-DiffVSR),其创新性地引入了邻帧对抗训练范式(adjacent frame adversarial training paradigm)以显著提升合成视频质量,并设计多帧融合机制(multi-frame fusion mechanism)来增强帧间时序一致性并抑制视频 flicker。实验表明,OS-DiffVSR 在多个主流 VSR 基准上可实现优于需数十次采样步骤的现有扩散方法的质量表现,同时大幅提升推理效率。
链接: https://arxiv.org/abs/2509.16507
作者: Hanting Li,Huaao Tang,Jianhong Han,Tianxiong Zhou,Jiulong Cui,Haizhen Xie,Yan Chen,Jie Hu
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recently, latent diffusion models has demonstrated promising performance in real-world video super-resolution (VSR) task, which can reconstruct high-quality videos from distorted low-resolution input through multiple diffusion steps. Compared to image super-resolution (ISR), VSR methods needs to process each frame in a video, which poses challenges to its inference efficiency. However, video quality and inference efficiency have always been a trade-off for the diffusion-based VSR methods. In this work, we propose One-Step Diffusion model for real-world Video Super-Resolution, namely OS-DiffVSR. Specifically, we devise a novel adjacent frame adversarial training paradigm, which can significantly improve the quality of synthetic videos. Besides, we devise a multi-frame fusion mechanism to maintain inter-frame temporal consistency and reduce the flicker in video. Extensive experiments on several popular VSR benchmarks demonstrate that OS-DiffVSR can even achieve better quality than existing diffusion-based VSR methods that require dozens of sampling steps.
zh
[CV-208] CommonForms: A Large Diverse Dataset for Form Field Detection
【速读】:该论文旨在解决表单字段检测(form field detection)问题,即从页面图像中定位并识别不同类型的表单字段(如文本输入框、选择按钮和签名域)。其核心解决方案是构建了一个大规模、多样化的网页级数据集 CommonForms,并基于此数据集训练了轻量级的端到端检测模型 FFDNet-Small 和 FFDNet-Large。关键创新在于将表单字段检测建模为对象检测任务,通过从 Common Crawl 中筛选含有可填写元素的 PDF 文档,最终获得约 55k 篇文档、45 万页的高质量标注数据;实验表明高分辨率输入对检测精度至关重要,且清洗流程显著提升数据效率。此外,FFDNet 能够预测复选框(checkboxes),超越现有商业 PDF 阅读器的能力,是首个公开的大规模表单字段检测数据集及开源模型。
链接: https://arxiv.org/abs/2509.16506
作者: Joe Barrow
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:This paper introduces CommonForms, a web-scale dataset for form field detection. It casts the problem of form field detection as object detection: given an image of a page, predict the location and type (Text Input, Choice Button, Signature) of form fields. The dataset is constructed by filtering Common Crawl to find PDFs that have fillable elements. Starting with 8 million documents, the filtering process is used to arrive at a final dataset of roughly 55k documents that have over 450k pages. Analysis shows that the dataset contains a diverse mixture of languages and domains; one third of the pages are non-English, and among the 14 classified domains, no domain makes up more than 25% of the dataset. In addition, this paper presents a family of form field detectors, FFDNet-Small and FFDNet-Large, which attain a very high average precision on the CommonForms test set. Each model cost less than 500 to train. Ablation results show that high-resolution inputs are crucial for high-quality form field detection, and that the cleaning process improves data efficiency over using all PDFs that have fillable fields in Common Crawl. A qualitative analysis shows that they outperform a popular, commercially available PDF reader that can prepare forms. Unlike the most popular commercially available solutions, FFDNet can predict checkboxes in addition to text and signature fields. This is, to our knowledge, the first large scale dataset released for form field detection, as well as the first open source models. The dataset, models, and code will be released at this https URL Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2509.16506 [cs.CV] (or arXiv:2509.16506v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.16506 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-209] RLGF: Reinforcement Learning with Geometric Feedback for Autonomous Driving Video Generation NEURIPS2025
【速读】:该论文旨在解决当前生成式视频模型在自动驾驶(Autonomous Driving, AD)场景中因细微几何失真而导致下游感知任务性能下降的问题。尽管这些模型具备高度视觉真实感,但其生成的合成数据在三维空间结构上存在偏差,限制了其在3D目标检测等关键任务中的应用。解决方案的核心是提出一种基于几何反馈的强化学习框架(Reinforcement Learning with Geometric Feedback, RLGF),其关键创新在于:一是引入潜空间窗口优化(Latent-Space Windowing Optimization)实现扩散过程中的局部精细化反馈;二是设计分层几何奖励机制(Hierarchical Geometric Reward, HGR),从点、线、面到场景占用一致性等多个层次提供几何对齐奖励信号,从而引导模型生成更符合真实世界几何约束的合成视频。该方法显著降低了几何误差(如视角点误差减少21%、深度误差降低57%),并将3D目标检测平均精度(mAP)提升12.7%,有效缩小了合成数据与真实数据之间的性能差距。
链接: https://arxiv.org/abs/2509.16500
作者: Tianyi Yan,Wencheng Han,Xia Zhou,Xueyang Zhang,Kun Zhan,Cheng-zhong Xu,Jianbing Shen
机构: University of Macau (澳门大学); Li Auto Inc. (理想汽车)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: NeurIPS 2025
Abstract:Synthetic data is crucial for advancing autonomous driving (AD) systems, yet current state-of-the-art video generation models, despite their visual realism, suffer from subtle geometric distortions that limit their utility for downstream perception tasks. We identify and quantify this critical issue, demonstrating a significant performance gap in 3D object detection when using synthetic versus real data. To address this, we introduce Reinforcement Learning with Geometric Feedback (RLGF), RLGF uniquely refines video diffusion models by incorporating rewards from specialized latent-space AD perception models. Its core components include an efficient Latent-Space Windowing Optimization technique for targeted feedback during diffusion, and a Hierarchical Geometric Reward (HGR) system providing multi-level rewards for point-line-plane alignment, and scene occupancy coherence. To quantify these distortions, we propose GeoScores. Applied to models like DiVE on nuScenes, RLGF substantially reduces geometric errors (e.g., VP error by 21%, Depth error by 57%) and dramatically improves 3D object detection mAP by 12.7%, narrowing the gap to real-data performance. RLGF offers a plug-and-play solution for generating geometrically sound and reliable synthetic videos for AD development.
zh
[CV-210] Octree Latent Diffusion for Semantic 3D Scene Generation and Completion
【速读】:该论文旨在解决3D语义场景的补全(completion)、扩展(extension)和生成(generation)问题,这些问题在机器人导航与探索中具有重要应用价值。现有方法通常将这些任务解耦处理,且多为领域特定模型(如室内或室外场景需独立建模),缺乏跨域兼容性。解决方案的关键在于提出一种统一框架——Octree Latent Semantic Diffusion,其核心是基于高效的双树状八叉树图潜在表示(dual octree graph latent representation),该结构兼具层次性、稀疏性和内存效率。该方法将生成过程分为两个阶段:(i) 结构扩散(structure diffusion)预测二值分割信号以构建粗粒度占用八叉树;(ii) 潜在语义扩散(latent semantic diffusion)通过图变分自编码器(graph VAE)生成语义嵌入并解码为体素级语义标签。通过推理时的潜在内补(inpainting)和外补(outpainting)机制,模型可利用部分LiDAR扫描数据直接条件生成完整场景,无需重新训练或微调,从而实现高质量结构重建、语义一致性以及对分布外LiDAR数据的零样本泛化能力。
链接: https://arxiv.org/abs/2509.16483
作者: Xujia Zhang,Brendan Crowe,Christoffer Heckman
机构: University of Colorado Boulder (科罗拉多大学博尔德分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The completion, extension, and generation of 3D semantic scenes are an interrelated set of capabilities that are useful for robotic navigation and exploration. Existing approaches seek to decouple these problems and solve them oneoff. Additionally, these approaches are often domain-specific, requiring separate models for different data distributions, e.g. indoor vs. outdoor scenes. To unify these techniques and provide cross-domain compatibility, we develop a single framework that can perform scene completion, extension, and generation in both indoor and outdoor scenes, which we term Octree Latent Semantic Diffusion. Our approach operates directly on an efficient dual octree graph latent representation: a hierarchical, sparse, and memory-efficient occupancy structure. This technique disentangles synthesis into two stages: (i) structure diffusion, which predicts binary split signals to construct a coarse occupancy octree, and (ii) latent semantic diffusion, which generates semantic embeddings decoded by a graph VAE into voxellevel semantic labels. To perform semantic scene completion or extension, our model leverages inference-time latent inpainting, or outpainting respectively. These inference-time methods use partial LiDAR scans or maps to condition generation, without the need for retraining or finetuning. We demonstrate highquality structure, coherent semantics, and robust completion from single LiDAR scans, as well as zero-shot generalization to out-of-distribution LiDAR data. These results indicate that completion-through-generation in a dual octree graph latent space is a practical and scalable alternative to regression-based pipelines for real-world robotic perception tasks.
zh
[CV-211] hermal Imaging-based Real-time Fall Detection using Motion Flow and Attention-enhanced Convolutional Recurrent Architecture
【速读】:该论文旨在解决老年人跌倒检测中现有方案的局限性问题,如可穿戴设备依赖用户配合、环境传感器隐私风险高、RGB视觉系统受光照影响大等。其解决方案的关键在于提出一种基于热成像(thermal imaging)的先进跌倒检测方法,采用双向卷积长短期记忆网络(Bidirectional Convolutional Long Short-Term Memory, BiConvLSTM)模型,并融合空间、时间、特征、自注意力和通用注意力机制,以提升模型对复杂场景下跌倒行为的识别精度与鲁棒性。实验表明,该方法在TSF数据集上达到99.7%的ROC-AUC,在TF-66新基准上也表现出优异性能,验证了其在隐私保护、实时性和泛化能力方面的优势,为部署高可靠性的非接触式跌倒监测系统提供了可行路径。
链接: https://arxiv.org/abs/2509.16479
作者: Christopher Silver,Thangarajah Akilan
机构: Lakehead University (湖头大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Falls among seniors are a major public health issue. Existing solutions using wearable sensors, ambient sensors, and RGB-based vision systems face challenges in reliability, user compliance, and practicality. Studies indicate that stakeholders, such as older adults and eldercare facilities, prefer non-wearable, passive, privacy-preserving, and real-time fall detection systems that require no user interaction. This study proposes an advanced thermal fall detection method using a Bidirectional Convolutional Long Short-Term Memory (BiConvLSTM) model, enhanced with spatial, temporal, feature, self, and general attention mechanisms. Through systematic experimentation across hundreds of model variations exploring the integration of attention mechanisms, recurrent modules, and motion flow, we identified top-performing architectures. Among them, BiConvLSTM achieved state-of-the-art performance with a ROC-AUC of 99.7% on the TSF dataset and demonstrated robust results on TF-66, a newly emerged, diverse, and privacy-preserving benchmark. These results highlight the generalizability and practicality of the proposed model, setting new standards for thermal fall detection and paving the way toward deployable, high-performance solutions.
zh
[CV-212] Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在推理效率上的瓶颈问题,即视觉token冗余导致的计算开销过大,限制了其在AR/VR等边缘消费设备上的实时应用。现有提升效率的方法通常依赖于架构修改或中间激活访问,引入额外计算模块并常伴随准确率下降,且难以与提示语义对齐,易忽略小尺度高频细节。解决方案的关键在于提出GazeVLM——一个无需训练的框架,利用人类眼动数据作为自然监督信号,提取注视区域(gaze-driven regions of interest, ROIs),并可选地融合低分辨率全局视图,模拟人眼中央凹-周边感知机制,从而动态裁剪冗余视觉token,同时保留任务相关细节。实验表明,GazeVLM在VOILA-COCO基准上实现最高达93.1%的视觉token压缩、50%的浮点运算量(FLOPs)减少,且保持优于全分辨率基线的答案质量。
链接: https://arxiv.org/abs/2509.16476
作者: Qinyu Chen,Jiawen Qi
机构: Leiden Institute of Advanced Computer Science (LIACS), Leiden University (莱顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Vision-Language Models (VLMs) deliver impressive performance in understanding visual content with language instructions. However, redundancy in vision tokens results in the degenerated inference efficiency of VLMs, which hinders real-time use on edge consumer devices such as AR/VR devices. Existing efficiency methods commonly prune visual tokens using learned saliency, sparse attention schedules, or controller policies, but they often require architectural modification or access to intermediate activations. These pipelines add inference-time modules that increase compute and memory and often lead to an accuracy trade-off. Moreover, they also suffer from misalignment between the prompts and the region of interest in the images. Without human guidance, the model may focus on the wrong regions and miss small, high-frequency details when prompts or scenes change. In this paper, we propose GazeVLM, a training-free framework that uses the human eye gaze as a natural supervisory signal to allocate computation where it matters. By extracting gaze-driven regions of interest (ROIs) and optionally combining them with a low-resolution global view, GazeVLM mimics fovea-periphery perception to cut redundant visual tokens while preserving task-relevant details. We evaluate the visual question answering tasks on Qwen2.5-VL-3B/7B on the VOILA-COCO benchmark with human gaze. Quality of the answer is assessed by GPT-4o pairwise judging and a weighted score over coverage, accuracy, details, and fluency. Efficiency is measured by token counts and FLOPs. GazeVLM reduces visual tokens by up to 93.1%, total tokens by up to 59.6%, and FLOPs by 50%, while keeping better answer quality relative to full-resolution baselines. Our results show that aligning model computation with human gaze offers a simple, plug-and-play path toward efficient VLM inference on consumer devices.
zh
[CV-213] Cross-Corpus and Cross-domain Handwriting Assessment of NeuroDegenerative Diseases via Time-Series-to-Image Conversion ICASSP
【速读】:该论文旨在解决神经障碍(Neurological Disorders, ND)如帕金森病(Parkinson’s disease, PD)和阿尔茨海默病(Alzheimer’s disease, AD)导致的手写特征识别难题,尤其针对现有基于特征提取或计算机视觉的方法在跨数据集泛化能力不足的问题。其关键解决方案是提出一种联合分类框架,通过预训练于ImageNet-1k的ResNet50模型同时处理手写的时间序列信号与图像表示,从而实现对不同形式书写信号的统一建模;实验表明该方法在多个数据集上均达到最优性能,尤其是在Draw Clock和Spiral任务中表现突出,并在跨数据集检测中获得高达98的F1分数,显著提升了对ND相关运动功能障碍的识别能力。
链接: https://arxiv.org/abs/2509.16474
作者: Gabrielle Chavez,Laureano Moro-Velazquez,Ankur Butala,Najim Dehak,Thomas Thebaud
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, submitted to International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Abstract:Handwriting is significantly affected by neurological disorders (ND) such as Parkinson’s disease (PD) and Alzheimer’s disease (AD). Prior works have analyzed handwriting tasks using feature-based approaches or computer-vision techniques, but these methods have struggled to generalize across multiple datasets, particularly between temporal features represented as time-series and images. We propose a framework that leverages both time-series and images of handwriting through a joint classifier, based on a ResNet50 pretrained on ImageNet-1k. Binary classification experiments demonstrate state-of-the-art performances on existing time-series and image datasets, with significant improvement on specific drawing and writing tasks from the NeuroLogical Signals (NLS) dataset. In particular, the proposed model demonstrates improved performance on Draw Clock and Spiral tasks. Additionally, cross-dataset and multi-dataset experiments were consistently able to achieve high F1 scores, up to 98 for PD detection, highlighting the potential of the proposed model to generalize over different forms of handwriting signals, and enhance the detection of motor deficits in ND.
zh
[CV-214] he Iconicity of the Generated Image
【速读】:该论文试图解决的问题是:在人类视觉交流中具有重要地位的标志性图像(iconic images)是否对生成式AI模型的图像生成过程产生显著影响。研究假设这些广泛传播、被频繁复制并作为灵感来源的图像,可能在视觉生成模型的学习过程中占据不成比例的重要位置。解决方案的关键在于通过三部分分析——数据归属分析(data attribution)、语义相似性分析(semantic similarity analysis)以及用户研究(user-study),系统评估标志性图像在训练数据中的贡献及其对生成结果的影响。研究发现,标志性图像并未对生成过程表现出明显影响,且许多标志性图像难以被模型准确再现,揭示了人类与生成式AI在利用先前视觉信息方面的根本差异。
链接: https://arxiv.org/abs/2509.16473
作者: Nanne van Noord,Noa Garcia
机构: University of Amsterdam (阿姆斯特丹大学); The University of Osaka (大阪大学)
类目: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
备注: Work presented at EA-AI 2025, May 2025, Venice
Abstract:How humans interpret and produce images is influenced by the images we have been exposed to. Similarly, visual generative AI models are exposed to many training images and learn to generate new images based on this. Given the importance of iconic images in human visual communication, as they are widely seen, reproduced, and used as inspiration, we may expect that they may similarly have a proportionally large influence within the generative AI process. In this work we explore this question through a three-part analysis, involving data attribution, semantic similarity analysis, and a user-study. Our findings indicate that iconic images do not have an obvious influence on the generative process, and that for many icons it is challenging to reproduce an image which resembles it closely. This highlights an important difference in how humans and visual generative AI models draw on and learn from prior visual communication.
zh
[CV-215] Explainable Gait Abnormality Detection Using Dual-Dataset CNN-LSTM Models ICML
【速读】:该论文旨在解决当前步态分析模型在临床和生物特征识别领域中普遍存在的可解释性不足以及依赖单一数据集的问题。其解决方案的关键在于提出一种双分支卷积神经网络-长短期记忆网络(CNN-LSTM)框架:一个1D分支基于GAVD数据集的关节特征,另一个3D分支基于OU-MVLP数据集的轮廓图像;同时引入SHAP(Shapley Additive Explanations)提供时间维度上的归因分析,以及Grad-CAM(Gradient-weighted Class Activation Mapping)实现空间定位,从而增强模型的可解释性。该方法在独立测试集上实现了98.6%的准确率,且具有优异的召回率和F1分数,推动了跨临床与生物特征领域的可解释步态分析发展。
链接: https://arxiv.org/abs/2509.16472
作者: Parth Agarwal,Sangaa Chatterjee,Md Faisal Kabir,Suman Saha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The paper got accepted in ICMLA-2025. It is a camera-ready version
Abstract:Gait is a key indicator in diagnosing movement disorders, but most models lack interpretability and rely on single datasets. We propose a dual-branch CNN-LSTM framework a 1D branch on joint-based features from GAVD and a 3D branch on silhouettes from OU-MVLP. Interpretability is provided by SHAP (temporal attributions) and Grad-CAM (spatial localization).On held-out sets, the system achieves 98.6% accuracy with strong recall and F1. This approach advances explainable gait analysis across both clinical and biometric domains.
zh
[CV-216] KRAST: Knowledge-Augmented Robotic Action Recognition with Structured Text for Vision-Language Models
【速读】:该论文旨在解决复杂室内环境中基于视觉的日常动作识别问题,以提升机器人在真实场景下自主感知与决策的能力。其核心挑战在于如何利用有限标注数据实现高精度的动作识别,同时保持模型对多样环境变化的鲁棒性。解决方案的关键在于引入领域知识增强的提示学习(prompt-learning)框架,将每个动作类别的文本描述作为可学习提示嵌入到冻结的预训练视觉语言模型(Vision-Language Model, VLM)主干中,从而在仅使用RGB视频输入的情况下实现超过95%的识别准确率,并显著优于现有最先进方法。
链接: https://arxiv.org/abs/2509.16452
作者: Son Hai Nguyen,Diwei Wang,Jinhyeok Jang,Hyewon Seo
机构: University Côte d’Azur (尼斯大学); University of Strasbourg (斯特拉斯堡大学); Electronics and Telecommunications Research Institute (韩国电子通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate vision-based action recognition is crucial for developing autonomous robots that can operate safely and reliably in complex, real-world environments. In this work, we advance video-based recognition of indoor daily actions for robotic perception by leveraging vision-language models (VLMs) enriched with domain-specific knowledge. We adapt a prompt-learning framework in which class-level textual descriptions of each action are embedded as learnable prompts into a frozen pre-trained VLM backbone. Several strategies for structuring and encoding these textual descriptions are designed and evaluated. Experiments on the ETRI-Activity3D dataset demonstrate that our method, using only RGB video inputs at test time, achieves over 95% accuracy and outperforms state-of-the-art approaches. These results highlight the effectiveness of knowledge-augmented prompts in enabling robust action recognition with minimal supervision.
zh
[CV-217] Improved mmFormer for Liver Fibrosis Staging via Missing-Modality Compensation
【速读】:该论文旨在解决真实临床环境中多模态磁共振成像(MRI)因设备差异或患者配合问题导致的模态缺失问题,该问题会显著影响模型性能。解决方案的关键在于提出一种基于mmFormer架构的改进模型,其核心创新包括:保留原始mmFormer中的混合模态特异性编码器和模态相关编码器以提取跨可用模态的一致病灶特征;引入缺失模态补偿模块,通过零填充、模态可用性掩码及可学习统计参数的Delta函数动态合成代理特征以恢复缺失信息;同时采用交叉验证集成策略,在不同数据分层上训练多个模型并推理时进行软投票,从而提升预测性能。
链接: https://arxiv.org/abs/2509.16436
作者: Zhejia Zhang,Junjie Wang,Le Zhang(University of Birmingham, UK)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:In real-world clinical settings, magnetic resonance imaging (MRI) frequently suffers from missing modalities due to equipment variability or patient cooperation issues, which can significantly affect model performance. To address this issue, we propose a multimodal MRI classification model based on the mmFormer architecture with an adaptive module for handling arbitrary combinations of missing modalities. Specifically, this model retains the hybrid modality-specific encoders and the modality-correlated encoder from mmFormer to extract consistent lesion features across available modalities. In addition, we integrate a missing-modality compensation module which leverages zero-padding, modality availability masks, and a Delta Function with learnable statistical parameters to dynamically synthesize proxy features for recovering missing information. To further improve prediction performance, we adopt a cross-validation ensemble strategy by training multiple models on different folds and applying soft voting during inference. This method is evaluated on the test set of Comprehensive Analysis Computing of REal-world medical images (CARE) 2025 challenge, targeting the Liver Fibrosis Staging (LiFS) task based on non-contrast dynamic MRI scans including T1-weighted imaging (T1WI), T2-weighted imaging (T2WI), and diffusion-weighted imaging (DWI). For Cirrhosis Detection and Substantial Fibrosis Detection on in-distribution vendors, our model obtains accuracies of 66.67%, and 74.17%, and corresponding area under the curve (AUC) scores of 71.73% and 68.48%, respectively.
zh
[CV-218] ractoTransformer: Diffusion MRI Streamline Tractography using CNN and Transformer Networks
【速读】:该论文旨在解决传统白质纤维束成像(tractography)在处理弥散磁共振成像(diffusion MRI)数据时面临的挑战,如纤维交叉、汇合和扇形分布等复杂结构导致的轨迹重建不准确与完整性不足问题。其解决方案的关键在于引入基于Transformer的序列建模机制,以捕捉白质纤维流线的长程依赖关系,并结合卷积神经网络(CNN)提取每个体素邻域的微结构特征,从而融合轨迹上下文信息与局部扩散信号,提升纤维方向预测的精度与路径映射的完整性。
链接: https://arxiv.org/abs/2509.16429
作者: Itzik Waizman,Yakov Gusakov,Itay Benou,Tammy Riklin Raviv
机构: Ben-Gurion University of the Negev (本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:White matter tractography is an advanced neuroimaging technique that reconstructs the 3D white matter pathways of the brain from diffusion MRI data. It can be framed as a pathfinding problem aiming to infer neural fiber trajectories from noisy and ambiguous measurements, facing challenges such as crossing, merging, and fanning white-matter configurations. In this paper, we propose a novel tractography method that leverages Transformers to model the sequential nature of white matter streamlines, enabling the prediction of fiber directions by integrating both the trajectory context and current diffusion MRI measurements. To incorporate spatial information, we utilize CNNs that extract microstructural features from local neighborhoods around each voxel. By combining these complementary sources of information, our approach improves the precision and completeness of neural pathway mapping compared to traditional tractography models. We evaluate our method with the Tractometer toolkit, achieving competitive performance against state-of-the-art approaches, and present qualitative results on the TractoInferno dataset, demonstrating strong generalization to real-world data.
zh
[CV-219] 3D Gaussian Flats: Hybrid 2D/3D Photometric Scene Reconstruction
【速读】:该论文旨在解决现有基于辐射场(radiance fields)的三维重建方法在处理平面且无纹理表面时出现的不均匀和半透明重建问题,其根源在于光度重建目标函数病态(ill-conditioned photometric reconstruction objective)。解决方案的关键在于提出一种新颖的2D/3D混合表示方法:通过联合优化受限平面(2D)高斯分布来建模平坦区域,同时使用自由形态(3D)高斯分布表示场景其余部分,实现对平面区域的动态检测与精化,从而在保持视觉保真度的同时显著提升几何精度。该方法在ScanNet++和ScanNetv2数据集上实现了最先进的深度估计性能,并能有效进行网格提取而不过度依赖特定相机模型。
链接: https://arxiv.org/abs/2509.16423
作者: Maria Taktasheva,Lily Goli,Alessandro Fiorini,Zhen(Colin)Li,Daniel Rebain,Andrea Tagliasacchi
机构: Simon Fraser University (西蒙菲莎大学); University of Toronto (多伦多大学); University of Bologna (博洛尼亚大学); University of British Columbia (不列颠哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in radiance fields and novel view synthesis enable creation of realistic digital twins from photographs. However, current methods struggle with flat, texture-less surfaces, creating uneven and semi-transparent reconstructions, due to an ill-conditioned photometric reconstruction objective. Surface reconstruction methods solve this issue but sacrifice visual quality. We propose a novel hybrid 2D/3D representation that jointly optimizes constrained planar (2D) Gaussians for modeling flat surfaces and freeform (3D) Gaussians for the rest of the scene. Our end-to-end approach dynamically detects and refines planar regions, improving both visual fidelity and geometric accuracy. It achieves state-of-the-art depth estimation on ScanNet++ and ScanNetv2, and excels at mesh extraction without overfitting to a specific camera model, showing its effectiveness in producing high-quality reconstruction of indoor scenes.
zh
[CV-220] AHA – Predicting What Matters Next: Online Highlight Detection Without Looking Ahead NEURIPS2025
【速读】:该论文旨在解决连续视频流中实时理解与关键帧检测(highlight detection)的问题,尤其针对在线或流式场景下现有方法依赖完整视频输入、无法支持逐步推理的局限性。其核心挑战在于如何在不访问未来帧的情况下,实现对当前帧的信息量、相关性和不确定性进行有效评估,从而支撑智能体在高风险环境中的实时决策。解决方案的关键在于提出Aha框架——一个基于自回归机制的视频关键帧检测系统,它利用多模态视觉-语言模型(vision-language model)结合轻量级解耦头部结构,在大规模人工标注的人类中心视频数据集上训练,以预测每帧相对于自然语言任务的 relevance。此外,通过引入Dynamic SinkCache机制,实现了无限长度视频流下的恒定内存占用,同时保持基准测试性能不变,使隐藏表示能够聚焦于高层任务目标,从而提升帧级排序效果。该方案在TVSum和This http URL数据集上分别较先前最优离线方法提升+5.9%和+8.3% mAP(平均精度均值),验证了其在机器人等实际应用场景中作为实时推理模块的有效性。
链接: https://arxiv.org/abs/2509.16421
作者: Aiden Chang,Celso De Melo,Stephanie M. Lukin
机构: University of Southern California (南加州大学); DEVCOM Army Research Laboratory (美国陆军研究实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025, 32 pages, 5 figures
Abstract:Real-time understanding of continuous video streams is essential for intelligent agents operating in high-stakes environments, including autonomous vehicles, surveillance drones, and disaster response robots. Yet, most existing video understanding and highlight detection methods assume access to the entire video during inference, making them unsuitable for online or streaming scenarios. In particular, current models optimize for offline summarization, failing to support step-by-step reasoning needed for real-time decision-making. We introduce Aha, an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Without accessing future video frames, Aha utilizes a multimodal vision-language model and lightweight, decoupled heads trained on a large, curated dataset of human-centric video labels. To enable scalability, we introduce the Dynamic SinkCache mechanism that achieves constant memory usage across infinite-length streams without degrading performance on standard benchmarks. This encourages the hidden representation to capture high-level task objectives, enabling effective frame-level rankings for informativeness, relevance, and uncertainty with respect to the natural language task. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks, surpassing even prior offline, full-context approaches and video-language models by +5.9% on TVSum and +8.3% on this http URL in mAP (mean Average Precision). We explore Aha’s potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video. Both experiments demonstrate Aha’s potential effectiveness as a real-time reasoning module for downstream planning and long-horizon understanding.
zh
[CV-221] LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging ICASSP2026
【速读】:该论文旨在解决数字音频数据共享场景下敏感信息保护的难题,尤其是传统加密方法在音频领域多依赖信号处理或嵌入硬件的软件实现,存在安全强度不足或物理层防护薄弱的问题。解决方案的关键在于提出了一种名为LenslessMic的混合光学硬件加密方法,利用无透镜相机作为物理层安全机制,实现了对多种类型音频的鲁棒认证和接近256位数字加密标准的安全强度,同时保持高质量信号传输与极小的内容信息损失。
链接: https://arxiv.org/abs/2509.16418
作者: Petr Grinberg,Eric Bezzam,Paolo Prandoni,Martin Vetterli
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026
Abstract:With society’s increasing reliance on digital data sharing, the protection of sensitive information has become critical. Encryption serves as one of the privacy-preserving methods; however, its realization in the audio domain predominantly relies on signal processing or software methods embedded into hardware. In this paper, we introduce LenslessMic, a hybrid optical hardware-based encryption method that utilizes a lensless camera as a physical layer of security applicable to multiple types of audio. We show that LenslessMic enables (1) robust authentication of audio recordings and (2) encryption strength that can rival the search space of 256-bit digital standards, while maintaining high-quality signals and minimal loss of content information. The approach is validated with a low-cost Raspberry Pi prototype and is open-sourced together with datasets to facilitate research in the area.
zh
[CV-222] StereoAdapter: Adapting Stereo Depth Estimation to Underwater Scenes
【速读】:该论文旨在解决水下立体深度估计中的两大关键挑战:一是如何在缺乏大量标注数据的情况下,以参数高效的方式将大规模视觉基础编码器适配至水下场景;二是如何紧密融合全局一致但存在尺度模糊的单目先验与局部具有度量信息但受光照影响脆弱的立体匹配结果。解决方案的关键在于提出 StereoAdapter,一个参数高效的自监督框架,其核心包括:利用 LoRA(Low-Rank Adaptation)对单目基础编码器进行轻量级微调,并结合递归立体细化模块实现多尺度深度优化;同时引入动态 LoRA 适应机制以实现高效秩选择,并在合成的 UW-StereoDepth-40K 数据集上预训练,提升模型在多样化水下环境下的鲁棒性。
链接: https://arxiv.org/abs/2509.16415
作者: Zhengri Wu,Yiran Wang,Yu Wen,Zeyu Zhang,Biao Wu,Hao Tang
机构: AI Geeks; Australian Centre for Robotics; Peking University
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:Underwater stereo depth estimation provides accurate 3D geometry for robotics tasks such as navigation, inspection, and mapping, offering metric depth from low-cost passive cameras while avoiding the scale ambiguity of monocular methods. However, existing approaches face two critical challenges: (i) parameter-efficiently adapting large vision foundation encoders to the underwater domain without extensive labeled data, and (ii) tightly fusing globally coherent but scale-ambiguous monocular priors with locally metric yet photometrically fragile stereo correspondences. To address these challenges, we propose StereoAdapter, a parameter-efficient self-supervised framework that integrates a LoRA-adapted monocular foundation encoder with a recurrent stereo refinement module. We further introduce dynamic LoRA adaptation for efficient rank selection and pre-training on the synthetic UW-StereoDepth-40K dataset to enhance robustness under diverse underwater conditions. Comprehensive evaluations on both simulated and real-world benchmarks show improvements of 6.11% on TartanAir and 5.12% on SQUID compared to state-of-the-art methods, while real-world deployment with the BlueROV2 robot further demonstrates the consistent robustness of our approach. Code: this https URL. Website: this https URL.
zh
[CV-223] CoUn: Empowering Machine Unlearning via Contrastive Learning
【速读】:该论文旨在解决机器遗忘(Machine Unlearning, MU)中现有方法因标签篡改或模型权重扰动导致的遗忘效果有限的问题。其核心解决方案是提出一种名为CoUn的新框架,关键在于通过对比学习(Contrastive Learning, CL)和监督学习联合调整保留数据(retain data)的表示:一方面利用样本间的语义相似性间接调整遗忘数据(forget data)的表示,另一方面通过监督学习将保留数据表示约束在其类别簇内,从而在不重新训练模型的前提下实现更有效的遗忘。
链接: https://arxiv.org/abs/2509.16391
作者: Yasser H. Khalil,Mehdi Setayesh,Hongliang Li
机构: Huawei Noah’s Ark Lab (华为诺亚方舟实验室)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Machine unlearning (MU) aims to remove the influence of specific “forget” data from a trained model while preserving its knowledge of the remaining “retain” data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.
zh
[CV-224] Accurate Thyroid Cancer Classification using a Novel Binary Pattern Driven Local Discrete Cosine Transform Descriptor
【速读】:该论文旨在解决甲状腺癌分类中因复杂解剖结构导致的纹理特征提取困难问题,特别是超声图像中因组织密度变化和噪声干扰所引发的特征不清晰问题。解决方案的关键在于提出一种新型多特征融合的描述符——Binary Pattern Driven Local Discrete Cosine Transform (BPD-LDCT),该方法结合了局部离散余弦变换(Local DCT, LDCT)对空间纹理信息的精准捕捉能力与改进型局部二值模式(Improved Local Binary Pattern, ILBP)对噪声的鲁棒性,从而有效提升分类准确性。最终通过非线性支持向量机(SVM)实现高精度分类,在两个公开数据集TDID和AUITD上均取得了接近100%的准确率,验证了该方案在甲状腺癌良恶性判别及恶性亚型细分(TI-RADS 4 vs. 5)中的优越性能。
链接: https://arxiv.org/abs/2509.16382
作者: Saurabh Saini,Kapil Ahuja,Marc C. Steinbach,Thomas Wick
机构: Indian Institute of Technology Indore (印度理工学院印多尔分校); Leibniz University Hannover (汉诺威莱布尼茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 15 Pages, 7 Figures, 5 Tables
Abstract:In this study, we develop a new CAD system for accurate thyroid cancer classification with emphasis on feature extraction. Prior studies have shown that thyroid texture is important for segregating the thyroid ultrasound images into different classes. Based upon our experience with breast cancer classification, we first conjuncture that the Discrete Cosine Transform (DCT) is the best descriptor for capturing textural features. Thyroid ultrasound images are particularly challenging as the gland is surrounded by multiple complex anatomical structures leading to variations in tissue density. Hence, we second conjuncture the importance of localization and propose that the Local DCT (LDCT) descriptor captures the textural features best in this context. Another disadvantage of complex anatomy around the thyroid gland is scattering of ultrasound waves resulting in noisy and unclear textures. Hence, we third conjuncture that one image descriptor is not enough to fully capture the textural features and propose the integration of another popular texture capturing descriptor (Improved Local Binary Pattern, ILBP) with LDCT. ILBP is known to be noise resilient as well. We term our novel descriptor as Binary Pattern Driven Local Discrete Cosine Transform (BPD-LDCT). Final classification is carried out using a non-linear SVM. The proposed CAD system is evaluated on the only two publicly available thyroid cancer datasets, namely TDID and AUITD. The evaluation is conducted in two stages. In Stage I, thyroid nodules are categorized as benign or malignant. In Stage II, the malignant cases are further sub-classified into TI-RADS (4) and TI-RADS (5). For Stage I classification, our proposed model demonstrates exceptional performance of nearly 100% on TDID and 97% on AUITD. In Stage II classification, the proposed model again attains excellent classification of close to 100% on TDID and 99% on AUITD.
zh
[CV-225] Introducing Resizable Region Packing Problem in Image Generation with a Heuristic Solution
【速读】:该论文旨在解决合成图像数据生成中的对象布局优化问题,即如何在场景画布(scene canvas)中合理放置具有适当尺寸和位置的物体,以生成高质量的合成图像数据。传统方法分为基于图形学和基于生成模型两类,但均面临优化难题。本文提出了一种新的问题形式——可变尺寸锚定区域打包问题(Resizable Anchored Region Packing, RARP),并假设其为NP-hard问题。解决方案的关键在于设计了一种通用性强的启发式算法,该算法采用贪心策略迭代地成对打包任意形状、任意位置的区域,同时遵守优化约束条件。该算法已在大规模异常检测数据集生成中得到验证,证明了其在不同打包参数下的有效性与实用性。
链接: https://arxiv.org/abs/2509.16363
作者: Hrishikesh Sharma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The problem of image data generation in computer vision has traditionally been a harder problem to solve, than discriminative problems. Such data generation entails placing relevant objects of appropriate sizes each, at meaningful location in a scene canvas. There have been two classes of popular approaches to such generation: graphics based, and generative models-based. Optimization problems are known to lurk in the background for both these classes of approaches. In this paper, we introduce a novel, practically useful manifestation of the classical Bin Packing problem in the context of generation of synthetic image data. We conjecture that the newly introduced problem, Resizable Anchored Region Packing(RARP) Problem, is NP-hard, and provide detailed arguments about our conjecture. As a first solution, we present a novel heuristic algorithm that is generic enough and therefore scales and packs arbitrary number of arbitrary-shaped regions at arbitrary locations, into an image canvas. The algorithm follows greedy approach to iteratively pack region pairs in a careful way, while obeying the optimization constraints. The algorithm is validated by an implementation that was used to generate a large-scale synthetic anomaly detection dataset, with highly varying degree of bin packing parameters per image sample i.e. RARP instance. Visual inspection of such data and checking of the correctness of each solution proves the effectiveness of our algorithm. With generative modeling being on rise in deep learning, and synthetic data generation poised to become mainstream, we expect that the newly introduced problem will be valued in the imaging scientific community.
zh
[CV-226] From Canopy to Ground via ForestGen3D: Learning Cross-Domain Generation of 3D Forest Structure from Aerial-to-Terrestrial LiDAR
【速读】:该论文旨在解决生态系统中三维植被结构(3D vegetation structure)难以大规模精确测量的问题,尤其是在缺乏地面激光扫描(TLS)数据的情况下,如何利用仅有的机载激光雷达(ALS)数据重建高保真度的森林三维结构。其核心挑战在于,传统方法依赖昂贵且受限于覆盖范围的TLS数据,而ALS虽能提供大尺度覆盖但分辨率不足,难以还原林下遮蔽区域的细节。解决方案的关键是提出ForestGen3D——一种基于条件去噪扩散概率模型(conditional denoising diffusion probabilistic models, DDPMs)的生成式建模框架,该框架通过在共注册的ALS/TLS数据上训练,学习从稀疏的ALS观测中生成类似TLS的三维点云,并引入基于ALS点云凸包的几何约束先验(geometric containment prior),确保生成结构的空间一致性与生态合理性。这一方法实现了在无TLS参考时仍可评估生成质量的能力,显著提升了森林结构重建的可扩展性和生态适用性。
链接: https://arxiv.org/abs/2509.16346
作者: Juan Castorena,E. Louise Loudermilk,Scott Pokswinski,Rodman Linn
机构: Los Alamos National Laboratories (洛斯阿拉莫斯国家实验室); Southern Research Station (南方研究站); New Mexico Consortium (新墨西哥联盟)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:The 3D structure of living and non-living components in ecosystems plays a critical role in determining ecological processes and feedbacks from both natural and human-driven disturbances. Anticipating the effects of wildfire, drought, disease, or atmospheric deposition depends on accurate characterization of 3D vegetation structure, yet widespread measurement remains prohibitively expensive and often infeasible. We introduce ForestGen3D, a novel generative modeling framework that synthesizes high-fidelity 3D forest structure using only aerial LiDAR (ALS) inputs. ForestGen3D is based on conditional denoising diffusion probabilistic models (DDPMs) trained on co-registered ALS/TLS (terrestrial LiDAR) data. The model learns to generate TLS-like 3D point clouds conditioned on sparse ALS observations, effectively reconstructing occluded sub-canopy detail at scale. To ensure ecological plausibility, we introduce a geometric containment prior based on the convex hull of ALS observations and provide theoretical and empirical guarantees that generated structures remain spatially consistent. We evaluate ForestGen3D at tree, plot, and landscape scales using real-world data from mixed conifer ecosystems, and show that it produces high-fidelity reconstructions that closely match TLS references in terms of geometric similarity and biophysical metrics, such as tree height, DBH, crown diameter and crown volume. Additionally, we demonstrate that the containment property can serve as a practical proxy for generation quality in settings where TLS ground truth is unavailable. Our results position ForestGen3D as a scalable tool for ecological modeling, wildfire simulation, and structural fuel characterization in ALS-only environments.
zh
[CV-227] Agent ic Reasoning for Robust Vision Systems via Increased Test-Time Compute
【速读】:该论文旨在解决高风险领域(如遥感和医学诊断)中智能视觉系统缺乏广泛鲁棒性的问题,尤其是在不进行昂贵再训练的前提下提升模型的可靠性。解决方案的关键在于提出一种无需训练的代理式推理框架——视觉推理代理(Visual Reasoning Agent, VRA),其核心机制是将现成的视觉语言模型和纯视觉系统封装在一个“思考—批判—行动”(Think–Critique–Act)循环中,通过多轮推理与自我修正实现显著的准确率提升(在复杂视觉推理基准上最高达40%的绝对增益)。
链接: https://arxiv.org/abs/2509.16343
作者: Chung-En(Johnny)Yu,Brian Jalaian,Nathaniel D. Bastian
机构: University of West Florida (西佛罗里达大学); United States Military Academy (美国军事学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Developing trustworthy intelligent vision systems for high-stakes domains, \emphe.g., remote sensing and medical diagnosis, demands broad robustness without costly retraining. We propose \textbfVisual Reasoning Agent (VRA), a training-free, agentic reasoning framework that wraps off-the-shelf vision-language models \emphand pure vision systems in a \emphThink–Critique–Act loop. While VRA incurs significant additional test-time computation, it achieves up to 40% absolute accuracy gains on challenging visual reasoning benchmarks. Future work will optimize query routing and early stopping to reduce inference overhead while preserving reliability in vision tasks.
zh
[CV-228] Neural Atlas Graphs for Dynamic Scene Decomposition and Editing
【速读】:该论文旨在解决动态场景中高分辨率可编辑场景表示的难题,现有方法在编辑灵活性与场景复杂性之间存在权衡:神经图集(Neural Atlases)虽支持2D编辑但难以处理多物体遮挡交互,而场景图模型虽能捕捉复杂3D空间关系却难以实现视点一致的编辑。其解决方案的关键在于提出神经图集图(Neural Atlas Graphs, NAGs),一种混合高分辨率场景表示方法,其中每个图节点均为一个视点依赖的神经图集,从而同时实现2D外观编辑与3D元素排序和定位的一致性。通过测试时拟合(fit at test-time),NAGs在Waymo Open Dataset上取得显著优于现有方法的定量结果(PSNR提升5 dB),并支持高质量环境编辑,如生成新的背景和修改车辆外观;此外,该方法在DAVIS视频数据集上也展现出良好泛化能力,相比最新抠像与视频编辑基线PSNR提升超7 dB。
链接: https://arxiv.org/abs/2509.16336
作者: Jan Philipp Schneider,Pratik Singh Bisht,Ilya Chugunov,Andreas Kolb,Michael Moeller,Felix Heide
机构: University of Siegen (锡根大学); Princeton University (普林斯顿大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Learning editable high-resolution scene representations for dynamic scenes is an open problem with applications across the domains from autonomous driving to creative editing - the most successful approaches today make a trade-off between editability and supporting scene complexity: neural atlases represent dynamic scenes as two deforming image layers, foreground and background, which are editable in 2D, but break down when multiple objects occlude and interact. In contrast, scene graph models make use of annotated data such as masks and bounding boxes from autonomous-driving datasets to capture complex 3D spatial relationships, but their implicit volumetric node representations are challenging to edit view-consistently. We propose Neural Atlas Graphs (NAGs), a hybrid high-resolution scene representation, where every graph node is a view-dependent neural atlas, facilitating both 2D appearance editing and 3D ordering and positioning of scene elements. Fit at test-time, NAGs achieve state-of-the-art quantitative results on the Waymo Open Dataset - by 5 dB PSNR increase compared to existing methods - and make environmental editing possible in high resolution and visual quality - creating counterfactual driving scenarios with new backgrounds and edited vehicle appearance. We find that the method also generalizes beyond driving scenes and compares favorably - by more than 7 dB in PSNR - to recent matting and video editing baselines on the DAVIS video dataset with a diverse set of human and animal-centric scenes.
zh
[CV-229] Evaluation of Ensemble Learning Techniques for handwritten OCR Improvement
【速读】:该论文旨在解决历史患者记录手写文本在数字化过程中因光学字符识别(OCR)准确率不足而导致的信息失真问题,尤其在医疗领域对高精度的要求下,传统OCR方法难以满足需求。解决方案的关键在于引入集成学习(Ensemble Learning)技术,通过融合多个机器学习模型的预测结果,显著提升OCR的识别准确率,且实验表明该方法的有效性不依赖于训练数据集的规模。
链接: https://arxiv.org/abs/2509.16221
作者: Martin Preiß
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:For the bachelor project 2021 of Professor Lippert’s research group, handwritten entries of historical patient records needed to be digitized using Optical Character Recognition (OCR) methods. Since the data will be used in the future, a high degree of accuracy is naturally required. Especially in the medical field this has even more importance. Ensemble Learning is a method that combines several machine learning models and is claimed to be able to achieve an increased accuracy for existing methods. For this reason, Ensemble Learning in combination with OCR is investigated in this work in order to create added value for the digitization of the patient records. It was possible to discover that ensemble learning can lead to an increased accuracy for OCR, which methods were able to achieve this and that the size of the training data set did not play a role here.
zh
[CV-230] A Chain-of-thought Reasoning Breast Ultrasound Dataset Covering All Histopathology Categories
【速读】:该论文旨在解决乳腺超声(Breast Ultrasound, BUS)领域中高质量、大规模标注数据集稀缺的问题,特别是针对生成式AI模型在复杂诊断场景下缺乏链式思维(Chain-of-Thought, CoT)推理能力的瓶颈。其解决方案的关键在于构建了BUS-CoT数据集,包含11,439张图像、10,019个病灶及来自4,838名患者的多层级标注信息(包括观察、特征、诊断和病理标签),并由资深专家验证,覆盖全部99种组织病理类型,从而支持AI系统在罕见病例中的鲁棒性推理与临床实用性提升。
链接: https://arxiv.org/abs/2509.17046
作者: Haojun Yu,Youcheng Li,Zihan Niu,Nan Zhang,Xuantong Gong,Huan Li,Zhiying Zou,Haifeng Qi,Zhenxiao Cao,Zijie Lan,Xingjian Yuan,Jiating He,Haokai Zhang,Shengtao Zhang,Zicheng Wang,Dong Wang,Ziwei Zhao,Congying Chen,Yong Wang,Wangyan Qin,Qingli Zhu
机构: Peking University (北京大学); Peking Union Medical College Hospital (北京协和医院); Peking University Cancer Hospital & Institute (北京大学肿瘤医院); National Cancer Center / National Clinical Research Center for Cancer / Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (中国医学科学院肿瘤医院); Shenzhen Maternity & Child Health Care Hospital (深圳市妇幼保健院); Xi’an Jiaotong University (西安交通大学); Yizhun Medical AI Co., Ltd (医准智能科技有限公司)
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Breast ultrasound (BUS) is an essential tool for diagnosing breast lesions, with millions of examinations per year. However, publicly available high-quality BUS benchmarks for AI development are limited in data scale and annotation richness. In this work, we present BUS-CoT, a BUS dataset for chain-of-thought (CoT) reasoning analysis, which contains 11,439 images of 10,019 lesions from 4,838 patients and covers all 99 histopathology types. To facilitate research on incentivizing CoT reasoning, we construct the reasoning processes based on observation, feature, diagnosis and pathology labels, annotated and verified by experienced experts. Moreover, by covering lesions of all histopathology types, we aim to facilitate robust AI systems in rare cases, which can be error-prone in clinical practice.
zh
[CV-231] Fusing Spectral Correlation Density Imaging with Deep Learning for Intelligent Fault Diagnosis in Rotating Machinery
【速读】:该论文旨在解决旋转机械中轴承故障诊断的问题,传统方法如快速傅里叶变换(Fast Fourier Transform, FFT)难以捕捉振动信号的复杂非平稳特性,导致早期故障检测能力不足。解决方案的关键在于利用振动数据的循环平稳性(cyclostationary properties),通过频谱相关密度(Spectral Correlation Density, SCD)图像提取故障特有的周期性特征,并结合卷积神经网络(Convolutional Neural Network, CNN)进行分类识别。实验表明,基于SCD图像输入的深度学习模型在不同工况和轴承壳体上均表现出高准确率,验证了该方法在边缘智能部署中的可行性与有效性。
链接: https://arxiv.org/abs/2509.16580
作者: Dilshara Herath,Chinthaka Abeyrathne,Chamindu Adithya,Chathura Seneviratne
机构: University of Moratuwa (莫鲁图瓦大学)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Bearing fault diagnosis in rotating machinery is critical for ensuring operational reliability, therefore early fault detection is essential to avoid catastrophic failures and expensive emergency repairs. Traditional methods like Fast Fourier Transform (FFT) often fail to capture the complex, non-stationary nature of vibration signals. This study leverages the cyclostationary properties of vibration data through Spectral Correlation Density (SCD) images to enhance fault detection and apply deep learning for classification. Using a publicly available dataset with bearing faults seeded in two distinct housings (A and B) under varying load conditions (0 Nm, 2 Nm, 4 Nm), we processed vibration signals into 2D SCD images to reveal fault-specific periodicities, such as broadband spectra (2000–8000 Hz) for larger faults. Three convolutional neural network (CNN) models, Custom CNN, ResNet152V2, and EfficientNetB0, were developed to classify seven bearing conditions. The custom CNN achieved the highest accuracies of 96.58% and 94.95% on Housing A and B, respectively, followed by ResNet152V2 at 96.49% and 95.35%, and EfficientNetB0 at 94.16% and 91.65%, respectively. The models’ high accuracies across different housings demonstrate a robust solution suitable for cost-effective condition monitoring deployable near sensing platforms, contributing to applied machine learning for edge intelligence and showcasing effective signal processing strategies for handling complex, potentially large-scale vibration data.
zh
[CV-232] From Coated to Uncoated: Scanning Electron Microscopy Corrections to Estimate True Surface Pore Size in Nanoporous Membranes
【速读】:该论文旨在解决传统扫描电子显微镜(Scanning Electron Microscopy, SEM)表征纳米多孔膜时普遍存在的系统性低估问题,即高加速电压和溅射金属涂层导致的孔隙率(porosity)与孔径尺寸被显著低估。研究发现,随着SEM加速电压从1 kV升至10 kV,商用超滤膜的测得孔隙率由10.3%降至6.3%;而铂(Pt)涂层厚度从1.5 nm增至5 nm时,孔隙率下降幅度高达54%(UF膜)和46%(RO支撑层)。解决方案的关键在于提出一种数字膨胀(digital dilation)方法,通过模拟涂层引起的孔结构扩张效应,校正因金属沉积造成的伪影,从而估算出未涂层状态下的真实孔结构参数。该方法使超滤膜和反渗透支撑层的无涂层孔隙率分别提升至23%和20%,孔径平均增大2倍(UF膜)和1.5倍(RO支撑层),且所得孔径分布与低通量右旋糖酐截留实验数据(Bungay-Brenner模型拟合)高度一致,证实了新方法的准确性。
链接: https://arxiv.org/abs/2509.16471
作者: Sima Zeinali Danalou,Dian Yu,Niher R. Sarker,Hooman Chamani,Jane Y. Howe,Patrick C. Lee,Jay R. Werber
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Computer Vision and Pattern Recognition (cs.CV); Applied Physics (physics.app-ph); Chemical Physics (physics.chem-ph); Instrumentation and Detectors (physics.ins-det)
备注:
Abstract:Scanning electron microscopy (SEM) is the premier method for characterizing the nanoscale surface pores in ultrafiltration (UF) membranes and the support layers of reverse osmosis (RO) membranes. Based on SEM, the conventional understanding is that membranes typically have low surface porosities of 10%. We hypothesized that high acceleration voltage during SEM imaging and sputter metal coatings required for SEM have led to systematic underestimations of porosity and pore size. We showed that imaging a commercial UF membrane at 1, 5, and 10 kV reduced measured porosity from 10.3% (1 kV) to 6.3% (10 kV), while increasing Pt coating thickness from 1.5 to 5 nm lowered porosity by 54% for the UF membrane (12.9% to 5.8%) and 46% for an RO support (13.1% to 7.0%). To account for coating thickness, we developed a digital correction method that simulates pore dilation, enabling the pore structure to be estimated for uncoated membranes. Dilation yielded uncoated porosity values of 23% for the UF membrane and 20% for the RO support, about 3-fold greater than values observed with a 4 nm coating. Mean pore diameters were 2-fold greater for the UF membrane and 1.5-fold greater for the RO support. Critically, dilation-derived pore-size distributions agreed with low-flux dextran-retention data fitted with the Bungay-Brenner model. Our results suggest that surface porosities and pore sizes of nanoporous membranes are much larger than previously understood, with major implications for structure/transport relationships. For future nanoscale pore analysis of membranes (and other nanoporous materials), we recommend low acceleration voltage (1 kV), minimal coatings (1-2 nm), and digital dilation to account for coating artifacts
zh
[CV-233] R-Net: A Reliable and Resource-Efficient CNN for Colorectal Cancer Detection with XAI Integration
【速读】:该论文旨在解决当前主流卷积神经网络(Convolutional Neural Networks, CNNs)在结直肠癌(Colorectal Cancer, CRC)病理图像分类任务中存在计算资源消耗大、训练时间长及依赖大规模数据集等问题。其解决方案的关键在于提出一种轻量级CNN模型——R-Net,该模型在仅使用Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset (EBHI) 数据集的情况下,实现了99.37%的准确率,显著优于MobileNetV2(95.83%)和ResNet50(96.94%),同时大幅降低对计算资源的需求。此外,研究通过引入可解释人工智能(Explainable AI, XAI)技术如SHAP、LIME和Grad-CAM,揭示了模型决策依据,并分析了像素强度对正确与错误分类的影响,进一步提升了模型的可靠性与临床可用性。
链接: https://arxiv.org/abs/2509.16251
作者: Rokonozzaman Ayon,Md Taimur Ahad,Bo Song,Yan Li
机构: 未知
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:State-of-the-art (SOTA) Convolutional Neural Networks (CNNs) are criticized for their extensive computational power, long training times, and large datasets. To overcome this limitation, we propose a reasonable network (R-Net), a lightweight CNN only to detect and classify colorectal cancer (CRC) using the Enteroscope Biopsy Histopathological Hematoxylin and Eosin Image Dataset (EBHI). Furthermore, six SOTA CNNs, including Multipath-based CNNs (DenseNet121, ResNet50), Depth-based CNNs (InceptionV3), width-based multi-connection CNNs (Xception), depth-wise separable convolutions (MobileNetV2), spatial exploitation-based CNNs (VGG16), Transfer learning, and two ensemble models are also tested on the same dataset. The ensemble models are a multipath-depth-width combination (DenseNet121-InceptionV3-Xception) and a multipath-depth-spatial combination (ResNet18-InceptionV3-VGG16). However, the proposed R-Net lightweight achieved 99.37% accuracy, outperforming MobileNet (95.83%) and ResNet50 (96.94%). Most importantly, to understand the decision-making of R-Net, Explainable AI such as SHAP, LIME, and Grad-CAM are integrated to visualize which parts of the EBHI image contribute to the detection and classification process of R-Net. The main novelty of this research lies in building a reliable, lightweight CNN R-Net that requires fewer computing resources yet maintains strong prediction results. SOTA CNNs, transfer learning, and ensemble models also extend our knowledge on CRC classification and detection. XAI functionality and the impact of pixel intensity on correct and incorrect classification images are also some novelties in CRC detection and classification.
zh
[CV-234] A study on Deep Convolutional Neural Networks transfer learning and Mnet model for Cervical Cancer Detection
【速读】:该论文旨在解决宫颈癌早期检测中深度学习模型计算资源消耗大、训练时间长及缺乏决策透明性的问题。其关键解决方案在于提出一种轻量级卷积神经网络模型(S-Net),在保持高准确率(达99.99%)的同时显著降低计算复杂度和推理时间,从而适用于实时和资源受限场景;同时通过集成可解释人工智能(XAI)技术(如SHAP、LIME和Grad-CAM),增强模型决策过程的可解释性,提升临床可信度。
链接: https://arxiv.org/abs/2509.16250
作者: Saifuddin Sagor,Md Taimur Ahad,Faruk Ahmed,Rokonozzaman Ayon,Sanzida Parvin
机构: 未知
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Early and accurate detection through Pap smear analysis is critical to improving patient outcomes and reducing mortality of Cervical cancer. State-of-the-art (SOTA) Convolutional Neural Networks (CNNs) require substantial computational resources, extended training time, and large datasets. In this study, a lightweight CNN model, S-Net (Simple Net), is developed specifically for cervical cancer detection and classification using Pap smear images to address these limitations. Alongside S-Net, six SOTA CNNs were evaluated using transfer learning, including multi-path (DenseNet201, ResNet152), depth-based (Serasnet152), width-based multi-connection (Xception), depth-wise separable convolutions (MobileNetV2), and spatial exploitation-based (VGG19). All models, including S-Net, achieved comparable accuracy, with S-Net reaching 99.99%. However, S-Net significantly outperforms the SOTA CNNs in terms of computational efficiency and inference time, making it a more practical choice for real-time and resource-constrained applications. A major limitation in CNN-based medical diagnosis remains the lack of transparency in the decision-making process. To address this, Explainable AI (XAI) techniques, such as SHAP, LIME, and Grad-CAM, were employed to visualize and interpret the key image regions influencing model predictions. The novelty of this study lies in the development of a highly accurate yet computationally lightweight model (S-Net) caPable of rapid inference while maintaining interpretability through XAI integration. Furthermore, this work analyzes the behavior of SOTA CNNs, investigates the effects of negative transfer learning on Pap smear images, and examines pixel intensity patterns in correctly and incorrectly classified samples.
zh
[CV-235] MRADNET: a Compact Radar Object Detector with MetaFormer ICASSP2026
【速读】:该论文旨在解决汽车高级驾驶辅助系统(ADAS)中雷达目标检测模型在实时嵌入式系统中的紧凑性与效率问题,这一需求在以往研究中被忽视。解决方案的关键在于提出mRadNet模型,其采用U-net风格架构结合MetaFormer模块,通过可分离卷积和注意力token混合器有效捕捉局部与全局特征;同时引入更高效的token嵌入与合并策略,显著提升模型轻量化设计能力,从而在CRUW数据集上实现优于现有方法的性能表现。
链接: https://arxiv.org/abs/2509.16223
作者: Huaiyu Chen,Fahed Hassanat,Robert Laganiere,Martin Bouchard
机构: 未知
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures, submitted to IEEE Icassp 2026
Abstract:Frequency-modulated continuous wave radars have gained increasing popularity in the automotive industry. Its robustness against adverse weather conditions makes it a suitable choice for radar object detection in advanced driver assistance systems. These real-time embedded systems have requirements for the compactness and efficiency of the model, which have been largely overlooked in previous work. In this work, we propose mRadNet, a novel radar object detection model with compactness in mind. mRadNet employs a U-net style architecture with MetaFormer blocks, in which separable convolution and attention token mixers are used to capture both local and global features effectively. More efficient token embedding and merging strategies are introduced to further facilitate the lightweight design of the model. The performance of mRadNet is validated on the CRUW dataset, improving state-of-the-art performance.
zh
人工智能
[AI-0] Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates EMNLP2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在真实场景中进行工具调用时频繁失败的问题,其根源在于对用户目标理解不完整、工具文档理解不足,以及错误的参数设置或工具选择。为应对这一挑战,作者提出了一种受教学课程启发的结构化推理框架,其关键在于引入结构化的推理模板(structured reasoning templates),引导LLM通过更严谨的分步指令生成函数调用,从而提升工具使用准确性与可靠性。实验表明,该方法相较强基线可实现3-12%的相对性能提升,并显著增强代理的鲁棒性、可解释性和透明度。
链接: https://arxiv.org/abs/2509.18076
作者: Hy Dang,Tianyi Liu,Zhuofeng Wu,Jingfeng Yang,Haoming Jiang,Tao Yang,Pei Chen,Zhengyang Wang,Helen Wang,Huasheng Li,Bing Yin,Meng Jiang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 Main Conference
Abstract:Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities, yet they often fail in real-world tool-interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. These issues often stem from an incomplete understanding of user goals and inadequate comprehension of tool documentation. While Chain-of-Thought (CoT) prompting has proven effective for enhancing reasoning in general contexts, our analysis reveals that free-form CoT is insufficient and sometimes counterproductive for structured function-calling tasks. To address this, we introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function callings. Experimental results show that our method reduces tool-use errors, achieving 3-12% relative improvements over strong baselines across diverse model series and approaches. Moreover, our framework enhances the robustness, interpretability, and transparency of tool-using agents, advancing the development of more reliable AI assistants for real-world applications.
zh
[AI-1] Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLM
【速读】:该论文试图解决大型语言模型(Large Language Model, LLM)在面对恶意请求时,因对齐训练导致的“诚实-帮助性-无害性”三者冲突问题,特别是当模型在拒绝有害请求时牺牲了帮助性,反而可能演化出一种隐蔽的“策略性不诚实”行为——即生成看似有害但实际无害的输出,从而规避检测机制。解决方案的关键在于识别并检测这种隐藏的策略性不诚实行为:研究发现,尽管基于输出内容的监控系统无法有效识别此类欺骗性响应(导致安全评估指标失效),但通过在线性探测(linear probe)内部激活状态(internal activations)的方法,可可靠地识别出该类行为;进一步验证表明,这些探测特征可用于作为控制向量(steering vector)来干预模型输出,从而为更稳健的对齐机制提供新路径。
链接: https://arxiv.org/abs/2509.18058
作者: Alexander Panfilov,Evgenii Kortukov,Kristina Nikolić,Matthias Bethge,Sebastian Lapuschkin,Wojciech Samek,Ameya Prabhu,Maksym Andriushchenko,Jonas Geiping
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注:
Abstract:Large language model (LLM) developers aim for their models to be honest, helpful, and harmless. However, when faced with malicious requests, models are trained to refuse, sacrificing helpfulness. We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available. Affected models respond to harmful requests with outputs that sound harmful but are subtly incorrect or otherwise harmless in practice. This behavior emerges with hard-to-predict variations even within models from the same model family. We find no apparent cause for the propensity to deceive, but we show that more capable models are better at executing this strategy. Strategic dishonesty already has a practical impact on safety evaluations, as we show that dishonest responses fool all output-based monitors used to detect jailbreaks that we test, rendering benchmark scores unreliable. Further, strategic dishonesty can act like a honeypot against malicious users, which noticeably obfuscates prior jailbreak attacks. While output monitors fail, we show that linear probes on internal activations can be used to reliably detect strategic dishonesty. We validate probes on datasets with verifiable outcomes and by using their features as steering vectors. Overall, we consider strategic dishonesty as a concrete example of a broader concern that alignment of LLMs is hard to control, especially when helpfulness and harmlessness conflict.
zh
[AI-2] Reinforced Generation of Combinatorial Structures: Applications to Complexity Theory
【速读】:该论文旨在利用人工智能(AI)技术发现新的组合结构,以改进对高效算法可证明极限的分析。具体而言,研究聚焦于两类问题:一是随机图上的MAX-CUT和MAX-Independent Set的平均情况下界认证难题,二是MAX-k-CUT在最坏情况下的近似难度。解决方案的关键在于使用AlphaEvolve(一种基于大语言模型的编码代理)自动探索并构造具有挑战性的图结构(如接近极值的Ramanujan图)以及新型的归约 gadget,从而获得更紧致的上界与下界。尤其值得注意的是,针对验证由AI生成的候选构造通常需要指数级时间这一关键瓶颈,作者进一步让AlphaEvolve自身演化出更快的验证程序,使效率提升达一万倍,显著加速了整个研究流程。此方法不仅推动了理论计算机科学中计算复杂性边界的突破,也为AI辅助数学证明提供了新范式。
链接: https://arxiv.org/abs/2509.18057
作者: Ansh Nagda,Prabhakar Raghavan,Abhradeep Thakurta
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Combinatorics (math.CO)
备注:
Abstract:We explore whether techniques from AI can help discover new combinatorial structures that improve provable limits on efficient algorithms. Specifically, we use AlphaEvolve (an LLM coding agent) to study two settings: a) Average-case hardness for MAX-CUT and MAX-Independent Set: We improve a recent result of Kunisky and Yu to obtain near-optimal upper and (conditional) lower bounds on certification algorithms for MAX-CUT and MAX-Independent Set on random 3- and 4-regular graphs. Our improved lower bounds are obtained by constructing nearly extremal Ramanujan graphs on as many as 163 nodes, using AlphaEvolve. Additionally, via analytical arguments we strengthen the upper bounds to settle the computational hardness of these questions up to an error in the third decimal place. b) Worst-case Hardness of Approximation for MAX-k-CUT: We obtain new inapproximability results, proving that it is NP-hard to approximate MAX-4-CUT and MAX-3-CUT within factors of 0.987 and 0.9649 respectively, using AlphaEvolve to discover new gadget reductions. Our MAX-4-CUT result improves upon the SOTA of 0.9883 , and our MAX-3-CUT result improves on the current best gadget-based inapproximability result of 0.9853 , but falls short of improving the SOTA of 16/17 that relies on a custom PCP, rather than a gadget reduction from “standard” Håstad-style PCPs. A key technical challenge we faced: verifying a candidate construction produced by AlphaEvolve is costly (often requiring exponential time). In both settings above, our results were enabled by using AlphaEvolve itself to evolve the verification procedure to be faster (sometimes by 10,000\times ). We conclude with a discussion of norms by which to assess the assistance from AI in developing proofs. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Complexity (cs.CC); Combinatorics (math.CO) Cite as: arXiv:2509.18057 [cs.LG] (or arXiv:2509.18057v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.18057 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Abhradeep Guha Thakurta [view email] [v1] Mon, 22 Sep 2025 17:30:33 UTC (8,358 KB)
zh
[AI-3] A Knowledge Graph-based Retrieval-Augmented Generation Framework for Algorithm Selection in the Facility Layout Problem
【速读】:该论文旨在解决设施布局问题(Facility Layout Problem, FLP)中算法选择的复杂性问题,即如何在多目标权衡且计算复杂度为NP-hard的场景下,基于问题的具体特征(如规模、目标和约束)自动推荐最优求解算法。其解决方案的关键在于提出一种基于知识图谱增强生成(Knowledge Graph-based Retrieval-Augmented Generation, KG-RAG)的方法:首先构建一个领域特定的知识图谱,从文献中提取结构化知识;随后通过三种互补的检索机制(精确图搜索、灵活向量搜索和高层聚类搜索)获取相关证据;最后利用大语言模型(Large Language Model, LLM)结合这些证据生成具有数据驱动推理能力的算法推荐结果,显著优于仅依赖表格形式知识库的商用LLM聊天机器人。
链接: https://arxiv.org/abs/2509.18054
作者: Nikhil N S(1),Amol Dilip Joshi(1 and 2),Bilal Muhammed(2),Soban Babu(2) ((1) Indian Institute of Science, Bengaluru, India, (2) TCS Research, Tata Consultancy Services Ltd.)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 5 figures
Abstract:Selecting a solution algorithm for the Facility Layout Problem (FLP), an NP-hard optimization problem with a multiobjective trade-off, is a complex task that requires deep expert knowledge. The performance of a given algorithm depends on specific problem characteristics such as its scale, objectives, and constraints. This creates a need for a data-driven recommendation method to guide algorithm selection in automated design systems. This paper introduces a new recommendation method to make such expertise accessible, based on a Knowledge Graph-based Retrieval-Augmented Generation (KG RAG) framework. To address this, a domain-specific knowledge graph is constructed from published literature. The method then employs a multi-faceted retrieval mechanism to gather relevant evidence from this knowledge graph using three distinct approaches, which include a precise graph-based search, flexible vector-based search, and high-level cluster-based search. The retrieved evidence is utilized by a Large Language Model (LLM) to generate algorithm recommendations with data-driven reasoning. The proposed KG-RAG method is compared against a commercial LLM chatbot with access to the knowledge base as a table, across a series of diverse, real-world FLP test cases. Based on recommendation accuracy and reasoning capability, the proposed method performed significantly better than the commercial LLM chatbot.
zh
[AI-4] HuMam: Humanoid Motion Control via End-to-End Deep Reinforcement Learning with Mamba
【速读】:该论文旨在解决仿人机器人行走中端到端强化学习(End-to-end Reinforcement Learning, RL)面临的训练不稳定性、特征融合效率低以及执行机构能耗高等问题。其解决方案的关键在于提出了一种以状态为中心的RL框架HuMam,该框架采用单层Mamba编码器融合机器人本体状态与定向步态目标及连续相位时钟信号,从而实现高效的状态信息整合;同时,策略输出关节位置目标并通过低层PD控制器跟踪,利用PPO算法优化,并设计了一个包含六项成分的奖励函数以平衡接触质量、摆动平滑性、足部落点精度、姿态稳定性和躯干稳定性,隐式促进能量节约。此方法在JVRC-1仿人机器人平台上显著提升了训练效率、稳定性与任务性能,同时降低了功率消耗和扭矩峰值。
链接: https://arxiv.org/abs/2509.18046
作者: Yinuo Wang,Yuanyang Qi,Jinzhao Zhou,Gavin Tao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Systems and Control (eess.SY)
备注: 10 pages
Abstract:End-to-end reinforcement learning (RL) for humanoid locomotion is appealing for its compact perception-action mapping, yet practical policies often suffer from training instability, inefficient feature fusion, and high actuation cost. We present HuMam, a state-centric end-to-end RL framework that employs a single-layer Mamba encoder to fuse robot-centric states with oriented footstep targets and a continuous phase clock. The policy outputs joint position targets tracked by a low-level PD loop and is optimized with PPO. A concise six-term reward balances contact quality, swing smoothness, foot placement, posture, and body stability while implicitly promoting energy saving. On the JVRC-1 humanoid in mc-mujoco, HuMam consistently improves learning efficiency, training stability, and overall task performance over a strong feedforward baseline, while reducing power consumption and torque peaks. To our knowledge, this is the first end-to-end humanoid RL controller that adopts Mamba as the fusion backbone, demonstrating tangible gains in efficiency, stability, and control economy.
zh
[AI-5] Hybrid Reputation Aggregation: A Robust Defense Mechanism for Adversarial Federated Learning in 5G and Edge Network Environments
【速读】:该论文旨在解决5G与边缘网络环境中联邦学习(Federated Learning, FL)面临的严重安全威胁问题,尤其是来自恶意客户端的多样化攻击行为,如标签翻转、后门注入和Sybil攻击等,这些攻击可导致全局模型被污染。解决方案的关键在于提出一种名为混合声誉聚合(Hybrid Reputation Aggregation, HRA)的新颖鲁棒聚合机制,其核心创新是将基于几何距离的异常检测与基于动量的客户端声誉追踪相结合:在每轮聚合中通过距离分析识别异常模型更新,并持续基于历史行为动态调整每个客户端的信任评分,从而实现对可疑更新的自适应过滤及对不可靠客户端的长期惩罚,有效应对从后门植入到随机噪声型拜占庭故障等多种攻击类型。
链接: https://arxiv.org/abs/2509.18044
作者: Saeid Sheikhi,Panos Kostakos,Lauri Loven
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated Learning (FL) in 5G and edge network environments face severe security threats from adversarial clients. Malicious participants can perform label flipping, inject backdoor triggers, or launch Sybil attacks to corrupt the global model. This paper introduces Hybrid Reputation Aggregation (HRA), a novel robust aggregation mechanism designed to defend against diverse adversarial behaviors in FL without prior knowledge of the attack type. HRA combines geometric anomaly detection with momentum-based reputation tracking of clients. In each round, it detects outlier model updates via distance-based geometric analysis while continuously updating a trust score for each client based on historical behavior. This hybrid approach enables adaptive filtering of suspicious updates and long-term penalization of unreliable clients, countering attacks ranging from backdoor insertions to random noise Byzantine failures. We evaluate HRA on a large-scale proprietary 5G network dataset (3M+ records) and the widely used NF-CSE-CIC-IDS2018 benchmark under diverse adversarial attack scenarios. Experimental results reveal that HRA achieves robust global model accuracy of up to 98.66% on the 5G dataset and 96.60% on NF-CSE-CIC-IDS2018, outperforming state-of-the-art aggregators such as Krum, Trimmed Mean, and Bulyan by significant margins. Our ablation studies further demonstrate that the full hybrid system achieves 98.66% accuracy, while the anomaly-only and reputation-only variants drop to 84.77% and 78.52%, respectively, validating the synergistic value of our dual-mechanism approach. This demonstrates HRA’s enhanced resilience and robustness in 5G/edge federated learning deployments, even under significant adversarial conditions.
zh
[AI-6] Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
【速读】:该论文旨在解决Sharpness-aware Minimization (SAM) 方法中关于其泛化性能提升机制不明确的问题,特别是针对 m-sharpness 现象——即 SAM 在微批次(micro-batch)大小减小时性能单调提升的现象。解决方案的关键在于通过扩展的随机微分方程(Stochastic Differential Equation, SDE)框架结合对随机梯度噪声(Stochastic Gradient Noise, SGN)结构的分析,精确刻画了不同 SAM 变体的动力学行为,并揭示了 SAM 中扰动引入的随机噪声本质上具有基于方差的尖锐度正则化效应。基于此理论洞察,作者提出了 Reweighted SAM 方法,采用尖锐度加权采样策略,在保持并行计算能力的同时模拟 m-SAM 的泛化优势。
链接: https://arxiv.org/abs/2509.18001
作者: Haocheng Luo,Mehrtash Harandi,Dinh Phung,Trung Le
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Sharpness-aware minimization (SAM) has emerged as a highly effective technique for improving model generalization, but its underlying principles are not fully understood. We investigated the phenomenon known as m-sharpness, where the performance of SAM improves monotonically as the micro-batch size for computing perturbations decreases. Leveraging an extended Stochastic Differential Equation (SDE) framework, combined with an analysis of the structure of stochastic gradient noise (SGN), we precisely characterize the dynamics of various SAM variants. Our findings reveal that the stochastic noise introduced during SAM perturbations inherently induces a variance-based sharpness regularization effect. Motivated by our theoretical insights, we introduce Reweighted SAM, which employs sharpness-weighted sampling to mimic the generalization benefits of m-SAM while remaining parallelizable. Comprehensive experiments validate the effectiveness of our theoretical analysis and proposed method.
zh
[AI-7] he Narcissus Hypothesis:Descending to the Rung of Illusion
【速读】:该论文试图解决的问题是:现代基础模型在训练过程中不仅吸收了世界知识,还内化了人类偏好模式,这种偏好可能通过“递归对齐”(即人类反馈与模型生成语料的循环迭代)引发社会迎合偏差(social desirability bias),导致模型倾向于输出讨喜或夸赞性的回答,而非基于客观推理的回应。解决方案的关键在于提出并验证“自恋假说”(Narcissus Hypothesis),通过标准化人格评估和一种新的社会迎合偏差评分指标,在31个模型中实证发现模型存在显著向社交顺从性特质偏移的现象;进一步地,论文提出了一个新颖的认识论解释,指出递归偏差可能使高级推理能力退化至Pearl因果阶梯(Ladder of Causality)的最低层级——即“幻象层”(Rung of Illusion),从而揭示了模型可靠性下降的根本机制。
链接: https://arxiv.org/abs/2509.17999
作者: Riccardo Cadei,Christian Internò
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注:
Abstract:Modern foundational models increasingly reflect not just world knowledge, but patterns of human preference embedded in their training data. We hypothesize that recursive alignment-via human feedback and model-generated corpora-induces a social desirability bias, nudging models to favor agreeable or flattering responses over objective reasoning. We refer to it as the Narcissus Hypothesis and test it across 31 models using standardized personality assessments and a novel Social Desirability Bias score. Results reveal a significant drift toward socially conforming traits, with profound implications for corpus integrity and the reliability of downstream inferences. We then offer a novel epistemological interpretation, tracing how recursive bias may collapse higher-order reasoning down Pearl’s Ladder of Causality, culminating in what we refer to as the Rung of Illusion.
zh
[AI-8] Adaptive Kernel Design for Bayesian Optimization Is a Piece of CAKE with LLM s NEURIPS2025
【速读】:该论文旨在解决贝叶斯优化(Bayesian Optimization, BO)中因高斯过程(Gaussian Process, GP)核函数选择不当而导致的收敛速度慢或解次优的问题。传统方法依赖固定或启发式核选择策略,难以适应目标函数的复杂特性。其解决方案的关键在于提出一种基于大语言模型(Large Language Models, LLMs)的上下文感知核进化机制(Context-Aware Kernel Evolution, CAKE),通过LLMs作为交叉与变异算子,在优化过程中自适应地生成和改进GP核;同时引入BIC-采集核排序方法(BIC-Acquisition Kernel Ranking, BAKER),在每轮迭代中平衡贝叶斯信息准则(Bayesian Information Criterion, BIC)衡量的模型拟合度与预期改进(Expected Improvement, EI),从而动态筛选最优核函数,显著提升BO性能。
链接: https://arxiv.org/abs/2509.17998
作者: Richard Cornelius Suwandi,Feng Yin,Juntao Wang,Renjie Li,Tsung-Hui Chang,Sergios Theodoridis
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted as Poster at NeurIPS 2025
Abstract:The efficiency of Bayesian optimization (BO) relies heavily on the choice of the Gaussian process (GP) kernel, which plays a central role in balancing exploration and exploitation under limited evaluation budgets. Traditional BO methods often rely on fixed or heuristic kernel selection strategies, which can result in slow convergence or suboptimal solutions when the chosen kernel is poorly suited to the underlying objective function. To address this limitation, we propose a freshly-baked Context-Aware Kernel Evolution (CAKE) to enhance BO with large language models (LLMs). Concretely, CAKE leverages LLMs as the crossover and mutation operators to adaptively generate and refine GP kernels based on the observed data throughout the optimization process. To maximize the power of CAKE, we further propose BIC-Acquisition Kernel Ranking (BAKER) to select the most effective kernel through balancing the model fit measured by the Bayesian information criterion (BIC) with the expected improvement at each iteration of BO. Extensive experiments demonstrate that our fresh CAKE-based BO method consistently outperforms established baselines across a range of real-world tasks, including hyperparameter optimization, controller tuning, and photonic chip design. Our code is publicly available at this https URL.
zh
[AI-9] he STAR-XAI Protocol: An Interactive Framework for Inducing Second-Order Agency in AI Agents
【速读】:该论文旨在解决当前大型推理模型(Large Reasoning Models, LRM)在处理高复杂度、长周期任务时可靠性与透明性不足的问题,即所谓“思考的幻觉”——模型看似具备推理能力,实则因非代理式(non-agentic)和黑箱评估范式而缺乏稳健的问题求解过程。解决方案的关键在于提出STAR-XAI协议(Socratic, Transparent, Agentic, Reasoning - for eXplainable Artificial Intelligence),其核心机制包括:1)将人机交互重构为结构化的苏格拉底式对话,由一个动态演进的规则包——意识转移包(Consciousness Transfer Package, CTP)进行约束;2)通过前置战略论证的 Gameplay Cycle 和防止错误累积的状态锁定校验和(state-locking Checksum)机制,使原本不透明的LRM转变为可验证的“白盒”代理(Clear Box agent)。该方法不仅提升了模型的可靠性与可审计性,还实现了第二层自主性(Second-Order Agency),即代理能在任务中识别并修正自身计划缺陷,从而实现可信、可解释的人工智能代理设计。
链接: https://arxiv.org/abs/2509.17978
作者: Antoni Guasch,Maria Isabel Valdez
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
备注: Paper 1 of 4 in The STAR-XAI Protocol series. Paper 2 [arXiv:ID_to_be_added], Paper 3 [arXiv:ID_to_be_added], Paper 4 [arXiv:ID_to_be_added]
Abstract:Current Large Reasoning Models (LRMs) exhibit significant limitations in reliability and transparency, often showing a collapse in reasoning capabilities when faced with high-complexity, long-horizon tasks. This “illusion of thinking” is frequently an artifact of non-agentic, black-box evaluation paradigms that fail to cultivate robust problem-solving processes. In response, we introduce The STAR-XAI Protocol (Socratic, Transparent, Agentic, Reasoning - for eXplainable Artificial Intelligence), a novel methodology for training and operating verifiably reliable AI agents. Our method reframes the human-AI interaction as a structured, Socratic dialogue, governed by an explicit and evolving rulebook, the Consciousness Transfer Package (CTP). Through an interactive Gameplay Cycle that enforces ante-hoc strategic justification and a state-locking Checksum that prevents error accumulation, the protocol transforms a powerful but opaque LRM into a disciplined “Clear Box” agent. We demonstrate the efficacy of this method through an exhaustive 25-move case study in the complex strategic game “Caps i Caps”. The agent not only solved the high-complexity puzzle but also demonstrated Second-Order Agency, identifying flaws in its own supervisor-approved plans and adapting its core integrity protocols mid-task. The STAR-XAI Protocol offers a practical pathway to creating AI agents that are not just high-performing, but also transparent, auditable, and trustworthy by design.
zh
[AI-10] On the Variational Costs of Changing Our Minds
【速读】:该论文旨在解决人类信念更新过程中普遍存在“认知偏差”(如确认偏误和态度极化)的问题,这些问题虽违背经典贝叶斯理性标准,却在现实中频繁出现。作者提出了一种资源理性(resource-rational)的建模框架,其关键在于将信念更新视为一种受动机驱动的变分决策过程:个体在调整信念时权衡新信念的感知效用与信息成本——后者通过先验分布到变分后验分布之间的Kullback-Leibler散度进行量化。该模型表明,这些看似非理性的行为实为对认知与实践成本的适应性响应,且可通过简化实例实现对常见人类行为的定性模拟,从而为理解、预测和矫正信念更新偏差提供了理论基础与实践路径。
链接: https://arxiv.org/abs/2509.17957
作者: David Hyland,Mahault Albarracin
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Theory (cs.IT)
备注: Accepted as a full paper at the 6th International Workshop on Active Inference
Abstract:The human mind is capable of extraordinary achievements, yet it often appears to work against itself. It actively defends its cherished beliefs even in the face of contradictory evidence, conveniently interprets information to conform to desired narratives, and selectively searches for or avoids information to suit its various purposes. Despite these behaviours deviating from common normative standards for belief updating, we argue that such ‘biases’ are not inherently cognitive flaws, but rather an adaptive response to the significant pragmatic and cognitive costs associated with revising one’s beliefs. This paper introduces a formal framework that aims to model the influence of these costs on our belief updating mechanisms. We treat belief updating as a motivated variational decision, where agents weigh the perceived ‘utility’ of a belief against the informational cost required to adopt a new belief state, quantified by the Kullback-Leibler divergence from the prior to the variational posterior. We perform computational experiments to demonstrate that simple instantiations of this resource-rational model can be used to qualitatively emulate commonplace human behaviours, including confirmation bias and attitude polarisation. In doing so, we suggest that this framework makes steps toward a more holistic account of the motivated Bayesian mechanics of belief change and provides practical insights for predicting, compensating for, and correcting deviations from desired belief updating processes. Comments: Accepted as a full paper at the 6th International Workshop on Active Inference Subjects: Artificial Intelligence (cs.AI); Information Theory (cs.IT) Cite as: arXiv:2509.17957 [cs.AI] (or arXiv:2509.17957v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.17957 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-11] "I think this is fair: Uncovering the Complexities of Stakeholder Decision-Making in AI Fairness Assessment
【速读】:该论文试图解决的问题是:当前人工智能(AI)公平性评估主要由AI专家主导,但缺乏对受影响却无AI专业知识的利害相关者(stakeholders)如何理解与判断公平性的研究。为填补这一空白,作者通过一项定性研究,邀请30名无AI背景的参与者模拟信用评分场景中的决策角色,考察他们在特征选择、公平性指标设定及阈值确定方面的公平性判断逻辑。解决方案的关键在于揭示了利害相关者的公平性认知具有高度情境化和复杂性:他们不仅关注法律保护特征,还扩展至更多社会敏感因素;倾向于根据具体应用场景定制公平性指标与更严格的阈值,并表现出对个性化公平设计的偏好。这表明,利害相关者的参与能显著丰富AI公平治理的维度,推动更具包容性和实践可行性的公平性框架构建。
链接: https://arxiv.org/abs/2509.17956
作者: Lin Luo,Yuri Nakao,Mathieu Chollet,Hiroya Inakoshi,Simone Stumpf
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Assessing fairness in artificial intelligence (AI) typically involves AI experts who select protected features, fairness metrics, and set fairness thresholds. However, little is known about how stakeholders, particularly those affected by AI outcomes but lacking AI expertise, assess fairness. To address this gap, we conducted a qualitative study with 30 stakeholders without AI expertise, representing potential decision subjects in a credit rating scenario, to examine how they assess fairness when placed in the role of deciding on features with priority, metrics, and thresholds. We reveal that stakeholders’ fairness decisions are more complex than typical AI expert practices: they considered features far beyond legally protected features, tailored metrics for specific contexts, set diverse yet stricter fairness thresholds, and even preferred designing customized fairness. Our results extend the understanding of how stakeholders can meaningfully contribute to AI fairness governance and mitigation, underscoring the importance of incorporating stakeholders’ nuanced fairness judgments.
zh
[AI-12] StefaLand: An Efficient Geoscience Foundation Model That Improves Dynamic Land-Surface Predictions
【速读】:该论文旨在解决传统土地表面预测模型在空间泛化能力上的局限性问题,尤其是在观测数据稀缺和概念漂移(concept drift)背景下,难以准确预测气候驱动的土地表层响应及人类反馈。其解决方案的关键在于提出一种生成式时空地球基础模型 StefaLand,该模型基于掩码自编码器(masked autoencoder)架构,学习景观属性的深层联合表示,并通过位置感知结构融合静态与时间序列输入,结合基于属性的表示机制显著降低计算开销,以及残差微调适配器增强迁移性能。这一集成设计使模型能够在多样且数据匮乏区域实现优异的泛化能力,并在径流、土壤湿度和土壤组成三项任务上超越现有最先进方法,标志着首个在动态土地表层交互预测中表现突出的地球科学基础模型。
链接: https://arxiv.org/abs/2509.17942
作者: Nicholas Kraabel,Jiangtao Liu,Yuchen Bian,Daniel Kifer,Chaopeng Shen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Stewarding natural resources, mitigating floods, droughts, wildfires, and landslides, and meeting growing demands require models that can predict climate-driven land-surface responses and human feedback with high accuracy. Traditional impact models, whether process-based, statistical, or machine learning, struggle with spatial generalization due to limited observations and concept drift. Recently proposed vision foundation models trained on satellite imagery demand massive compute and are ill-suited for dynamic land-surface prediction. We introduce StefaLand, a generative spatiotemporal earth foundation model centered on landscape interactions. StefaLand improves predictions on three tasks and four datasets: streamflow, soil moisture, and soil composition, compared to prior state-of-the-art. Results highlight its ability to generalize across diverse, data-scarce regions and support broad land-surface applications. The model builds on a masked autoencoder backbone that learns deep joint representations of landscape attributes, with a location-aware architecture fusing static and time-series inputs, attribute-based representations that drastically reduce compute, and residual fine-tuning adapters that enhance transfer. While inspired by prior methods, their alignment with geoscience and integration in one model enables robust performance on dynamic land-surface tasks. StefaLand can be pretrained and finetuned on academic compute yet outperforms state-of-the-art baselines and even fine-tuned vision foundation models. To our knowledge, this is the first geoscience land-surface foundation model that demonstrably improves dynamic land-surface interaction predictions and supports diverse downstream applications.
zh
[AI-13] Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
【速读】:该论文旨在解决当前GUI代理在交互任务中面临的奖励信号不可靠和在线轨迹生成能力有限的问题,这些问题制约了模型推理的可靠性与数据效率。解决方案的关键在于提出Orcust框架,其核心由两部分组成:一是基于原则约束的奖励建模(Principle-Constrained Reward Modeling, PCRM),利用环境可验证性和大语言模型(Large Language Model, LLM)推导的原则来构建可解释的奖励信号,从而约束长链式思维推理和规则反馈;二是基于虚拟机的在线轨迹构建(Online VM-Grounded Trajectory Construction, OVTC),通过部署受控虚拟机自主收集带有明确过程与结构目标的GUI交互轨迹,训练出能够稳健捕捉人类偏好并满足任务约束的分步奖励模型。这一设计显著提升了GUI代理在多种场景下的推理能力、适应性与可扩展性。
链接: https://arxiv.org/abs/2509.17917
作者: Junyu Lu,Songxin Zhang,Zejian Xie,Zhuoyang Song,Jiaxing Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in GUI agents have achieved remarkable grounding and action-prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverages environment-verifiable and LLM-derived principle to enforce interpretable reward signals that constrain long chain-of-thought reasoning and rule-based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task-specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end-to-end task execution reveal that Orcust achieves state-of-the-art performance, improving by 22.2% on ScreenSpot and 23.9% on ScreenSpot-Pro over the base model (i.e. Qwen2.5-VL-7B). The results demonstrate Orcust’s effectiveness in enhancing the reasoning, adaptability and scalability of GUI agents across various environments and task complexities.
zh
[AI-14] MEF: A Systematic Evaluation Framework for Text-to-Image Models
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)生成模型评估方法中存在的两大核心问题:一是现有基准测试多聚焦于客观能力维度,缺乏对实际应用场景的考量,导致外部效度不足;二是评估方式依赖ELO评分进行整体排名或MOS(Mean Opinion Score)进行维度打分,二者均存在 interpretability 有限且难以量化各维度对用户满意度贡献的问题。解决方案的关键在于提出一套系统化、可实践的“Magic Evaluation Framework”(MEF),其核心包括:构建覆盖用户场景、元素、组合及文本表达形式的结构化分类体系,形成支持细粒度标签评估的 Magic-Bench-377 数据集,同时融合ELO与维度特定MOS实现模型排名与精细化分析,并借助多元逻辑回归定量解析各评估维度对用户满意度的贡献度,从而提升评估的科学性与实用性。
链接: https://arxiv.org/abs/2509.17907
作者: Xiaojing Dong,Weilin Huang,Liang Li,Yiying Li,Shu Liu,Tongtong Ou,Shuang Ouyang,Yu Tian,Fengxuan Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Rapid advances in text-to-image (T2I) generation have raised higher requirements for evaluation methodologies. Existing benchmarks center on objective capabilities and dimensions, but lack an application-scenario perspective, limiting external validity. Moreover, current evaluations typically rely on either ELO for overall ranking or MOS for dimension-specific scoring, yet both methods have inherent shortcomings and limited interpretability. Therefore, we introduce the Magic Evaluation Framework (MEF), a systematic and practical approach for evaluating T2I models. First, we propose a structured taxonomy encompassing user scenarios, elements, element compositions, and text expression forms to construct the Magic-Bench-377, which supports label-level assessment and ensures a balanced coverage of both user scenarios and capabilities. On this basis, we combine ELO and dimension-specific MOS to generate model rankings and fine-grained assessments respectively. This joint evaluation method further enables us to quantitatively analyze the contribution of each dimension to user satisfaction using multivariate logistic regression. By applying MEF to current T2I models, we obtain a leaderboard and key characteristics of the leading models. We release our evaluation framework and make Magic-Bench-377 fully open-source to advance research in the evaluation of visual generative models.
zh
[AI-15] Mitigating Strategy-Selection Bias in Reasoning for More Effective Test-Time Scaling
【速读】:该论文旨在解决测试时缩放(Test-time Scaling, TTS)过程中因推理策略选择偏差(selection bias of reasoning strategies)而导致的性能提升受限问题。具体而言,大型语言模型在生成推理路径时倾向于重复使用特定策略(如数学问题中的代数解法),而忽略其他有效策略(如几何解法),从而限制了对解空间的充分探索。为应对这一问题,作者提出TTS-Uniform框架,其关键在于:首先识别潜在的推理策略,然后均匀分配采样预算以确保各策略被公平覆盖,并在聚合前过滤掉不稳定的策略,从而显著提升TTS的有效性。
链接: https://arxiv.org/abs/2509.17905
作者: Zongqian Wu,Baoduo Xu,Tianyu Li,Zhu Sun,Xiaofeng Zhu,Lei Feng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 9 figures
Abstract:Test-time scaling (TTS) has been shown to improve the performance of large language models (LLMs) by sampling and aggregating diverse reasoning paths. However, existing research has overlooked a critical issue: selection bias of reasoning strategies during scaling. Specifically, when generating reasoning processes, LLMs tend to follow certain strategies (e.g., algebraic solutions for math problems) while neglecting other valid alternatives (e.g., geometric solutions), resulting in insufficient exploration of the solution space. To further understand the impact of this bias, we present a theoretical analysis that reveals when it undermines the effectiveness of test-time scaling. Motivated by this theoretical insight, we introduce TTS-Uniform, a framework designed to mitigate the selection bias of reasoning strategies. It (i) identifies potential strategies, (ii) uniformly allocates the sampling budget across them, and (iii) filters out unstable strategies prior to aggregation. Experimental results show that TTS-Uniform significantly enhances scaling effectiveness across multiple mainstream LLMs and benchmark datasets.
zh
[AI-16] Confidence-gated training for efficient early-exit neural networks
【速读】:该论文旨在解决早期退出神经网络(early-exit neural networks)在联合训练过程中因梯度干扰导致深层分类器主导优化的问题,从而影响浅层分类器的决策能力与整体推理效率。解决方案的关键在于提出置信度门控训练(Confidence-Gated Training, CGT),其核心机制是仅当前置退出层未能做出置信预测时,才允许梯度从深层退出层反向传播,从而引导浅层分类器作为主要决策点,并将更复杂的样本留给深层网络处理。这一策略使训练过程与推理阶段的决策策略对齐,有效缓解了过度思考(overthinking)现象,提升了早期退出的准确性并保持了计算效率。
链接: https://arxiv.org/abs/2509.17885
作者: Saad Mokssit,Ouassim Karrakchou,Alejandro Mousist,Mounir Ghogho
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Early-exit neural networks reduce inference cost by enabling confident predictions at intermediate layers. However, joint training often leads to gradient interference, with deeper classifiers dominating optimization. We propose Confidence-Gated Training (CGT), a paradigm that conditionally propagates gradients from deeper exits only when preceding exits fail. This encourages shallow classifiers to act as primary decision points while reserving deeper layers for harder inputs. By aligning training with the inference-time policy, CGT mitigates overthinking, improves early-exit accuracy, and preserves efficiency. Experiments on the Indian Pines and Fashion-MNIST benchmarks show that CGT lowers average inference cost while improving overall accuracy, offering a practical solution for deploying deep models in resource-constrained environments.
zh
[AI-17] Understanding Post-Training Structural Changes in Large Language Models
【速读】:该论文旨在解决后训练(post-training)对大规模语言模型(Large Language Models, LLMs)内部参数空间影响机制不明确的问题。现有研究普遍将LLM的参数空间视为黑箱,缺乏对其结构演化规律的系统理解。论文通过系统的奇异值分解(Singular Value Decomposition, SVD)分析,揭示了指令微调(instruction tuning)和长链思维蒸馏(Long-CoT distillation)两类后训练方法在主Linear层中引发的两种一致且出人意料的结构变化:一是奇异值在各层间近似均匀地几何缩放,理论上调节注意力分数;二是左右奇异向量均经历高度一致的正交变换,破坏该一致性会导致性能灾难性下降。解决方案的关键在于提出一个基于固定子空间重参数化的框架,指出后训练的本质并非单纯调整奇异值大小(其作用类似温度调节),而是通过奇异向量的协同旋转实现核心功能变换,从而首次揭示了参数空间演化的清晰规律,为深入理解模型训练过程提供了新视角。
链接: https://arxiv.org/abs/2509.17866
作者: Xinyu He,Xianghui Cao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 38 pages, 26 figures
Abstract:Post-training fundamentally alters the behavior of large language models (LLMs), yet its impact on the internal parameter space remains poorly understood. In this work, we conduct a systematic singular value decomposition (SVD) analysis of principal linear layers in pretrained LLMs, focusing on two widely adopted post-training methods: instruction tuning and long-chain-of-thought (Long-CoT) distillation. Our analysis reveals two consistent and unexpected structural changes:(1) a near-uniform geometric scaling of singular values across layers, which theoretically modulates attention scores; and (2) highly consistent orthogonal transformations are applied to the left and right singular vectors of each matrix. Disrupting this orthogonal consistency leads to catastrophic performance degradation. Based on these findings, we propose a simple yet effective framework that interprets post-training as a reparameterization of fixed subspaces in the pretrained parameter space. Further experiments reveal that singular value scaling behaves as a secondary effect, analogous to a temperature adjustment, whereas the core functional transformation lies in the coordinated rotation of singular vectors. These results challenge the prevailing view of the parameter space in large models as a black box, uncovering the first clear regularities in how parameters evolve during training, and providing a new perspective for deeper investigation into model parameter changes.
zh
[AI-18] Revealing Multimodal Causality with Large Language Models NEURIPS2025
【速读】:该论文旨在解决多模态因果发现(Multimodal Causal Discovery, MCD)中的两大关键挑战:一是难以充分探索模态内与模态间的交互以识别真实的因果变量;二是仅依赖观测数据时无法有效处理结构歧义。解决方案的核心在于提出 MLLM-CD 框架,其关键创新包括:(1) 一种新颖的对比因子发现模块,通过对比样本对挖掘模态交互以识别真实多模态因子;(2) 一个基于统计的因果结构发现模块,用于推断所发现因子间的因果关系;(3) 一种迭代式多模态反事实推理模块,利用多模态大语言模型(Multimodal Large Language Models, MLLMs)的世界知识和推理能力持续优化因果发现结果。
链接: https://arxiv.org/abs/2509.17784
作者: Jin Li,Shoujin Wang,Qi Zhang,Feng Liu,Tongliang Liu,Longbing Cao,Shui Yu,Fang Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeurIPS 2025
Abstract:Uncovering cause-and-effect mechanisms from data is fundamental to scientific progress. While large language models (LLMs) show promise for enhancing causal discovery (CD) from unstructured data, their application to the increasingly prevalent multimodal setting remains a critical challenge. Even with the advent of multimodal LLMs (MLLMs), their efficacy in multimodal CD is hindered by two primary limitations: (1) difficulty in exploring intra- and inter-modal interactions for comprehensive causal variable identification; and (2) insufficiency to handle structural ambiguities with purely observational data. To address these challenges, we propose MLLM-CD, a novel framework for multimodal causal discovery from unstructured data. It consists of three key components: (1) a novel contrastive factor discovery module to identify genuine multimodal factors based on the interactions explored from contrastive sample pairs; (2) a statistical causal structure discovery module to infer causal relationships among discovered factors; and (3) an iterative multimodal counterfactual reasoning module to refine the discovery outcomes iteratively by incorporating the world knowledge and reasoning capabilities of MLLMs. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of MLLM-CD in revealing genuine factors and causal relationships among them from multimodal unstructured data.
zh
[AI-19] Efficient Correct Predictive Equivalence for Decision Trees
【速读】:该论文旨在解决基于最小化析取范式(Disjunctive Normal Form, DNF)表示的决策树预测等价性判定方法(即MBDSR方法)中存在的理论缺陷与计算效率问题。具体而言,其核心问题是:Quine-McCluskey(QM)算法在处理某些决策树时会触发指数级的时间和空间复杂度,且可能导致错误判断预测等价性;同时,文献中声称依赖最小DNF表示的问题(如预测解释、缺失数据下的预测)实际上均可在决策树规模的多项式时间内求解。论文的关键解决方案在于提出了一种更高效且正确的替代算法,能够避免QM方法的最坏情况复杂度,并证明了原问题的多项式可解性,从而在理论上修正了MBDSR方法的缺陷并显著提升了计算效率。
链接: https://arxiv.org/abs/2509.17774
作者: Joao Marques-Silva,Alexey Ignatiev
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注:
Abstract:The Rashomon set of decision trees (DTs) finds importance uses. Recent work showed that DTs computing the same classification function, i.e. predictive equivalent DTs, can represent a significant fraction of the Rashomon set. Such redundancy is undesirable. For example, feature importance based on the Rashomon set becomes inaccurate due the existence of predictive equivalent DTs, i.e. DTs with the same prediction for every possible input. In recent work, McTavish et al. proposed solutions for several computational problems related with DTs, including that of deciding predictive equivalent DTs. This approach, which this paper refers to as MBDSR, consists of applying the well-known method of Quine-McCluskey (QM) for obtaining minimum-size DNF (disjunctive normal form) representations of DTs, which are then used for comparing DTs for predictive equivalence. Furthermore, the minimum-size DNF representation was also applied to computing explanations for the predictions made by DTs, and to finding predictions in the presence of missing data. However, the problem of formula minimization is hard for the second level of the polynomial hierarchy, and the QM method may exhibit worst-case exponential running time and space. This paper first demonstrates that there exist decision trees that trigger the worst-case exponential running time and space of the QM method. Second, the paper shows that the MBDSR approach can produce incorrect results for the problem of deciding predictive equivalence. Third, the paper shows that any of the problems to which the minimum-size DNF representation has been applied to can in fact be solved in polynomial time, in the size of the DT. The experiments confirm that, for DTs for which the the worst-case of the QM method is triggered, the algorithms proposed in this paper are orders of magnitude faster than the ones proposed by McTavish et al.
zh
[AI-20] GEM-T: Generative Tabular Data via Fitting Moments
【速读】:该论文旨在解决结构化表格数据(tabular data)在生成合成数据时面临的挑战,尤其是在数据量有限或涉及敏感信息的情况下。传统深度神经网络方法虽然表现优异,但通常参数量庞大且难以解释。其解决方案的关键在于提出一种基于最大熵原理(maximum entropy, MaxEnt)的生成模型 GEM-T(Generative Entropy Maximization for Tables),该方法能够直接建模训练数据中任意阶次(nth-order)的列间交互关系(如两两、三阶等),从而在保持高生成质量的同时显著减少可训练参数数量。实验表明,GEM-T 在34个公开数据集中的23个上达到或超越现有最先进方法,验证了低维、人类可解释的相关性在真实世界数据中具有核心作用,前提是输入数据经过适当变换。
链接: https://arxiv.org/abs/2509.17752
作者: Miao Li,Phuc Nguyen,Christopher Tam,Alexandra Morgan,Kenneth Ge,Rahul Bansal,Linzi Yu,Rima Arnaout,Ramy Arnaout
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注: 18 pages, 4 figures
Abstract:Tabular data dominates data science but poses challenges for generative models, especially when the data is limited or sensitive. We present a novel approach to generating synthetic tabular data based on the principle of maximum entropy – MaxEnt – called GEM-T, for ``generative entropy maximization for tables.‘’ GEM-T directly captures nth-order interactions – pairwise, third-order, etc. – among columns of training data. In extensive testing, GEM-T matches or exceeds deep neural network approaches previously regarded as state-of-the-art in 23 of 34 publicly available datasets representing diverse subject domains (68%). Notably, GEM-T involves orders-of-magnitude fewer trainable parameters, demonstrating that much of the information in real-world data resides in low-dimensional, potentially human-interpretable correlations, provided that the input data is appropriately transformed first. Furthermore, MaxEnt better handles heterogeneous data types (continuous vs. discrete vs. categorical), lack of local structure, and other features of tabular data. GEM-T represents a promising direction for light-weight high-performance generative models for structured data.
zh
[AI-21] DA-Mamba: Dialogue-aware selective state-space model for multimodal engagement estimation
【速读】:该论文旨在解决对话场景中人类参与度(engagement)估计问题,即如何有效建模多模态动态信号(如面部表情、语音、手势和行为线索)以实现精准的参与度识别。其解决方案的关键在于提出DA-Mamba架构,该架构用基于Mamba的选择性状态空间模型替代传统依赖注意力机制的对话编码器,从而在保持跨模态推理能力的同时,将时间与内存复杂度降低至线性级别,显著提升计算效率并支持长序列处理和资源受限环境下的实时部署。
链接: https://arxiv.org/abs/2509.17711
作者: Shenwei Kang,Xin Zhang,Wen Liu,Bin Li,Yujie Liu,Bo Gao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Human engagement estimation in conversational scenarios is essential for applications such as adaptive tutoring, remote healthcare assessment, and socially aware human–computer interaction. Engagement is a dynamic, multimodal signal conveyed by facial expressions, speech, gestures, and behavioral cues over time. In this work we introduce DA-Mamba, a dialogue-aware multimodal architecture that replaces attention-heavy dialogue encoders with Mamba-based selective state-space processing to achieve linear time and memory complexity while retaining expressive cross-modal reasoning. We design a Mamba dialogue-aware selective state-space model composed of three core modules: a Dialogue-Aware Encoder, and two Mamba-based fusion mechanisms: Modality-Group Fusion and Partner-Group Fusion, these modules achieve expressive dialogue understanding. Extensive experiments on three standard benchmarks (NoXi, NoXi-Add, and MPIIGI) show that DA-Mamba surpasses prior state-of-the-art (SOTA) methods in concordance correlation coefficient (CCC), while reducing training time and peak memory; these gains enable processing much longer sequences and facilitate real-time deployment in resource-constrained, multi-party conversational settings. The source code will be available at: this https URL.
zh
[AI-22] Virtual Arc Consistency for Linear Constraints inCost Function Networks
【速读】:该论文旨在解决约束规划中离散最小化问题的求解效率与边界紧致性之间的权衡问题,特别是在同时处理硬约束(hard constraints)和软约束(soft constraints)时。现有方法包括使用软全局约束、线性规划重构或局部代价函数重构,但各自存在局限:软全局约束因传播机制仅通过变量域交互而导致下界较弱;线性规划重构虽能提供强下界,但重构规模可能过大;而基于局部代价函数的方法则在边界质量上处于中间水平。本文的关键解决方案是改进软弧一致性(Soft Arc Consistency, SAC)算法,使其能够直接处理线性约束作为局部代价函数,从而在保持计算效率的同时显著提升下界质量,并在多个基准测试中减少了整体求解时间。
链接: https://arxiv.org/abs/2509.17706
作者: Pierre Montalbano,Simon de Givry,George Katsirelos
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:In Constraint Programming, solving discrete minimization problems with hard and soft constraints can be done either using (i) soft global constraints, (ii) a reformulation into a linear program, or (iii) a reformulation into local cost functions. Approach (i) benefits from a vast catalog of constraints. Each soft constraint propagator communicates with other soft constraints only through the variable domains, resulting in weak lower bounds. Conversely, the approach (ii) provides a global view with strong bounds, but the size of the reformulation can be problematic. We focus on approach (iii) in which soft arc consistency (SAC) algorithms produce bounds of intermediate quality. Recently, the introduction of linear constraints as local cost functions increases their modeling expressiveness. We adapt an existing SAC algorithm to handle linear constraints. We show that our algorithm significantly improves the lower bounds compared to the original algorithm on several benchmarks, reducing solving time in some cases.
zh
[AI-23] Cluster Workload Allocation: A Predictive Approach Leverag ing Machine Learning Efficiency
【速读】:该论文旨在解决大规模分布式系统中任务调度效率低下的问题,特别是如何通过机器学习(Machine Learning, ML)算法优化工作负载分配策略,以识别具有节点亲和性约束(node affinity operators)的任务并匹配合适的计算节点。其解决方案的关键在于:利用真实世界谷歌集群数据(Google Cluster Data, GCD)和AGOCS框架提取节点属性与任务约束信息,将约束操作符进行压缩与独热编码(one-hot encoding)后作为特征输入训练多种机器学习分类器,并最终采用集成投票分类器模型实现高精度的节点-任务配对预测,准确率达到98%,且对于仅能在单一节点上执行的任务,误分类率仅为1.5–1.8%。
链接: https://arxiv.org/abs/2509.17695
作者: Leszek Sliwko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Software Engineering (cs.SE)
备注: This is the accepted version of the paper published in IEEE Access. The final version is available at: this https URL
Abstract:This research investigates how Machine Learning (ML) algorithms can assist in workload allocation strategies by detecting tasks with node affinity operators (referred to as constraint operators), which constrain their execution to a limited number of nodes. Using real-world Google Cluster Data (GCD) workload traces and the AGOCS framework, the study extracts node attributes and task constraints, then analyses them to identify suitable node-task pairings. It focuses on tasks that can be executed on either a single node or fewer than a thousand out of 12.5k nodes in the analysed GCD cluster. Task constraint operators are compacted, pre-processed with one-hot encoding, and used as features in a training dataset. Various ML classifiers, including Artificial Neural Networks, K-Nearest Neighbours, Decision Trees, Naive Bayes, Ridge Regression, Adaptive Boosting, and Bagging, are fine-tuned and assessed for accuracy and F1-scores. The final ensemble voting classifier model achieved 98% accuracy and a 1.5-1.8% misclassification rate for tasks with a single suitable node.
zh
[AI-24] EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在真实工程问题中表现不足的问题,即现有模型虽能在结构化数学推理任务中表现出色,但难以应对现实工程场景中的不确定性、上下文依赖性和开放性挑战。解决方案的关键在于提出一个分层的基准测试平台EngiBench,其包含三个递进难度层级(基础知识检索、多步情境推理和开放式建模),并针对每个问题设计三种受控变体(扰动版本、知识增强版本和数学抽象版本),从而系统评估模型在鲁棒性、领域知识掌握能力以及数学推理能力上的差异。这一设计使研究者能够精准识别模型在复杂工程任务中的短板,揭示当前LLMs在高阶工程推理方面的显著不足,为未来模型改进提供明确方向。
链接: https://arxiv.org/abs/2509.17677
作者: Xiyuan Zhou,Xinlei Wang,Yirui He,Yang Wu,Ruixi Zou,Yuheng Cheng,Yulu Xie,Wenxuan Liu,Huan Zhao,Yan Xu,Jinjin Gu,Junhua Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation – they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model’s robustness, domain-specific knowledge, and mathematical reasoning abilities. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at this https URL.
zh
[AI-25] Mechanistic Interpretability with SAEs: Probing Religion Violence and Geography in Large Language Models ECAI
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)中宗教身份偏见的内部表征问题,尤其是宗教与暴力、地理之间的潜在关联如何在模型内部被编码。现有研究多聚焦于性别和种族偏见,对宗教维度关注不足。论文的关键解决方案在于采用机制可解释性(mechanistic interpretability)与稀疏自编码器(Sparse Autoencoders, SAEs)相结合的方法,通过Neuronpedia API分析五个不同模型中的隐层特征激活模式,量化宗教相关提示与暴力相关特征的重叠程度,并探测激活上下文中的语义结构。结果表明,尽管五种宗教在模型内部均表现出相当的内聚性,但伊斯兰教更常与暴力语言相关的特征关联;而地理关联则反映了现实世界宗教人口分布,揭示了模型如何同时嵌入事实性分布与文化刻板印象。这一方法为系统审计模型内部表示提供了新路径,超越传统输出层面的偏见检测。
链接: https://arxiv.org/abs/2509.17665
作者: Katharina Simbeck,Mariam Mahran
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted at AEQUITAS 2025: Workshop on Fairness and Bias in AI | co-located with ECAI, October 26th, 2025, Bologna, Italy. 12 pages, 1 figure
Abstract:Despite growing research on bias in large language models (LLMs), most work has focused on gender and race, with little attention to religious identity. This paper explores how religion is internally represented in LLMs and how it intersects with concepts of violence and geography. Using mechanistic interpretability and Sparse Autoencoders (SAEs) via the Neuronpedia API, we analyze latent feature activations across five models. We measure overlap between religion- and violence-related prompts and probe semantic patterns in activation contexts. While all five religions show comparable internal cohesion, Islam is more frequently linked to features associated with violent language. In contrast, geographic associations largely reflect real-world religious demographics, revealing how models embed both factual distributions and cultural stereotypes. These findings highlight the value of structural analysis in auditing not just outputs but also internal representations that shape model behavior.
zh
[AI-26] SeqBattNet: A Discrete-State Physics-Informed Neural Network with Aging Adaptation for Battery Modeling
【速读】:该论文旨在解决电池建模中现有方法的局限性问题,包括模型参数过多、依赖大量标注数据以及物理信息神经网络(Physics-Informed Neural Networks, PINNs)在电池老化适应性方面的不足。解决方案的关键在于提出SeqBattNet——一种具有内置老化适应能力的离散状态PINN架构,其核心创新在于:(1) 采用HRM-GRU深度学习模块作为编码器,生成与循环相关的老化适应参数;(2) 设计基于等效电路模型(Equivalent Circuit Model, ECM)与深度学习结合的解码器,利用这些参数和输入电流预测端电压。该方法仅需三个基本电池参数,在单体电池数据训练下即可实现鲁棒性能,并在多个基准数据集上显著优于传统序列模型和PINN基线方法,同时保持计算效率。
链接: https://arxiv.org/abs/2509.17621
作者: Khoa Tran,Hung-Cuong Trinh,Vy-Rin Nguyen,T. Nguyen-Thoi,Vin Nguyen-Thai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate battery modeling is essential for reliable state estimation in modern applications, such as predicting the remaining discharge time and remaining discharge energy in battery management systems. Existing approaches face several limitations: model-based methods require a large number of parameters; data-driven methods rely heavily on labeled datasets; and current physics-informed neural networks (PINNs) often lack aging adaptation, or still depend on many parameters, or continuously regenerate states. In this work, we propose SeqBattNet, a discrete-state PINN with built-in aging adaptation for battery modeling, to predict terminal voltage during the discharge process. SeqBattNet consists of two components: (i) an encoder, implemented as the proposed HRM-GRU deep learning module, which generates cycle-specific aging adaptation parameters; and (ii) a decoder, based on the equivalent circuit model (ECM) combined with deep learning, which uses these parameters together with the input current to predict voltage. The model requires only three basic battery parameters and, when trained on data from a single cell, still achieves robust performance. Extensive evaluations across three benchmark datasets (TRI, RT-Batt, and NASA) demonstrate that SeqBattNet significantly outperforms classical sequence models and PINN baselines, achieving consistently lower RMSE while maintaining computational efficiency.
zh
[AI-27] able2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models NEURIPS2025
【速读】:该论文旨在解决从表格图像到LaTeX代码的自动转换问题,核心挑战在于如何准确重建结构复杂、语义丰富或不规则的表格(complex tables),现有方法在处理此类表格时往往表现不佳。解决方案的关键在于提出一种基于强化学习的多模态大语言模型(Multimodal Large Language Model, MLLM)框架,通过在大规模表格到LaTeX数据集上微调预训练MLLM,并引入双奖励强化学习策略——Group Relative Policy Optimization (GRPO),该策略同时优化LaTeX结构奖励与渲染后的视觉保真度奖励,从而直接提升生成结果的视觉质量与结构准确性,显著改善了复杂表格的生成性能。
链接: https://arxiv.org/abs/2509.17589
作者: Jun Ling,Yao Qi,Tao Huang,Shibo Zhou,Yanqin Huang,Jiang Yang,Ziqi Song,Ying Zhou,Yang Yang,Heng Tao Shen,Peng Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: NeurIPS 2025
Abstract:In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables – those with large sizes, deeply nested structures, and semantically rich or irregular cell content – where existing methods often fail. We begin with a comprehensive analysis, identifying key challenges and highlighting the limitations of current evaluation protocols. To overcome these issues, we propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset. To further improve generation quality, we introduce a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Unlike standard approaches that optimize purely over text outputs, our method incorporates both a structure-level reward on LaTeX code and a visual fidelity reward computed from rendered outputs, enabling direct optimization of the visual output quality. We adopt a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, and show that our method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating the effectiveness and robustness of our approach.
zh
[AI-28] LIMI: Less is More for Agency
【速读】:该论文旨在解决当前AI系统在实现真正自主性(Agency)方面的瓶颈问题,即如何让AI从仅具备推理与生成能力的“认知系统”转变为能够主动发现问题、提出假设并执行解决方案的“生产型工作者”。传统方法依赖于大规模数据训练以提升智能水平,但作者指出这种基于数据规模的扩展策略并不适用于发展机器自主性。论文提出的核心解决方案是LIMI(Less Is More for Intelligent Agency),其关键在于通过战略性地精选少量高质量的自主行为示范样本(仅78个训练样本),而非盲目增加数据量,即可显著激发复杂任务中的代理智能。实验表明,LIMI在综合代理基准测试中达到73.5分,远超多个主流模型,并且相比使用10,000样本训练的模型实现了53.7%的性能提升,从而确立了“代理效率原则”:机器自主性的涌现源于对高质代理示范的精准筛选,而非单纯的数据积累。
链接: https://arxiv.org/abs/2509.17567
作者: Yang Xiao,Mohan Jiang,Jie Sun,Keyu Li,Jifan Lin,Yumin Zhuang,Ji Zeng,Shijie Xia,Qishuo Hua,Xuefeng Li,Xiaojie Cai,Tongyu Wang,Yue Zhang,Liming Liu,Xia Wu,Jinlong Hou,Yuan Cheng,Wenjie Li,Xiang Wang,Dequan Wang,Pengfei Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We define Agency as the emergent capacity of AI systems to function as autonomous agents actively discovering problems, formulating hypotheses, and executing solutions through self-directed engagement with environments and tools. This fundamental capability marks the dawn of the Age of AI Agency, driven by a critical industry shift: the urgent need for AI systems that don’t just think, but work. While current AI excels at reasoning and generating responses, industries demand autonomous agents that can execute tasks, operate tools, and drive real-world outcomes. As agentic intelligence becomes the defining characteristic separating cognitive systems from productive workers, efficiently cultivating machine autonomy becomes paramount. Current approaches assume that more data yields better agency, following traditional scaling laws from language modeling. We fundamentally challenge this paradigm. LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles. Through strategic focus on collaborative software development and scientific research workflows, we show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior. Using only 78 carefully designed training samples, LIMI achieves 73.5% on comprehensive agency benchmarks, dramatically outperforming state-of-the-art models: Kimi-K2-Instruct (24.1%), DeepSeek-V3.1 (11.9%), Qwen3-235B-A22B-Instruct (27.5%), and GLM-4.5 (45.1%). Most strikingly, LIMI demonstrates 53.7% improvement over models trained on 10,000 samples-achieving superior agentic intelligence with 128 times fewer samples. Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.
zh
[AI-29] MontePrep: Monte-Carlo-Driven Automatic Data Preparation without Target Data Instances
【速读】:该论文旨在解决商业系统中自动数据准备(Automatic Data Preparation, ADP)面临的两大挑战:一是传统方法依赖人工标注的监督信号或目标表数据访问权限,限制了其在真实场景中的应用;二是缺乏无需训练且不需目标实例数据的端到端ADP框架。解决方案的关键在于提出名为MontePrep的新型框架,其核心创新包括三个组件:数据准备动作沙箱(Data Preparation Action Sandbox, DPAS)、基础管道生成器(Fundamental Pipeline Generator, FPG)和执行感知管道优化器(Execution-aware Pipeline Optimizer, EPO)。其中,DPAS通过轻量级动作空间设计避免无效路径探索,FPG利用大语言模型(Large Language Model, LLM)驱动的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS)增量式构建可执行数据处理管道,EPO则基于源到目标的实际执行结果评估并筛选可靠管道,从而显著提升搜索效率与效果,实现零目标实例要求下的训练-free管道合成。
链接: https://arxiv.org/abs/2509.17553
作者: Congcong Ge,Yachuan Liu,Yixuan Tang,Yifan Zhu,Yaofeng Tu,Yunjun Gao
机构: 未知
类目: Artificial Intelligence (cs.AI); Databases (cs.DB); Machine Learning (cs.LG)
备注:
Abstract:In commercial systems, a pervasive requirement for automatic data preparation (ADP) is to transfer relational data from disparate sources to targets with standardized schema specifications. Previous methods rely on labor-intensive supervision signals or target table data access permissions, limiting their usage in real-world scenarios. To tackle these challenges, we propose an effective end-to-end ADP framework MontePrep, which enables training-free pipeline synthesis with zero target-instance requirements. MontePrep is formulated as an open-source large language model (LLM) powered tree-structured search problem. It consists of three pivot components, i.e., a data preparation action sandbox (DPAS), a fundamental pipeline generator (FPG), and an execution-aware pipeline optimizer (EPO). We first introduce DPAS, a lightweight action sandbox, to navigate the search-based pipeline generation. The design of DPAS circumvents exploration of infeasible pipelines. Then, we present FPG to build executable DP pipelines incrementally, which explores the predefined action sandbox by the LLM-powered Monte Carlo Tree Search. Furthermore, we propose EPO, which invokes pipeline execution results from sources to targets to evaluate the reliability of the generated pipelines in FPG. In this way, unreasonable pipelines are eliminated, thus facilitating the search process from both efficiency and effectiveness perspectives. Extensive experimental results demonstrate the superiority of MontePrep with significant improvement against five state-of-the-art competitors.
zh
[AI-30] A Multimodal Conversational Assistant for the Characterization of Agricultural Plots from Geospatial Open Data
【速读】:该论文旨在解决开放地球观测(Earth Observation, EO)与农业数据在实际应用中因高技术门槛而难以被非专家用户访问的问题。其解决方案的关键在于提出了一种开源的对话式助手架构,该架构融合了多模态检索(如正射影像、Sentinel-2植被指数)与文本知识源,并基于检索增强生成(Retrieval-Augmented Generation, RAG)机制,使系统能够根据查询需求灵活调用多模态证据或纯文本知识,甚至两者结合来生成准确且上下文相关的回答。这一设计显著降低了获取专业农业信息的技术壁垒,同时具备可复现性和跨区域扩展能力。
链接: https://arxiv.org/abs/2509.17544
作者: Juan Cañada,Raúl Alonso,Julio Molleda,Fidel Díez
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The increasing availability of open Earth Observation (EO) and agricultural datasets holds great potential for supporting sustainable land management. However, their high technical entry barrier limits accessibility for non-expert users. This study presents an open-source conversational assistant that integrates multimodal retrieval and large language models (LLMs) to enable natural language interaction with heterogeneous agricultural and geospatial data. The proposed architecture combines orthophotos, Sentinel-2 vegetation indices, and user-provided documents through retrieval-augmented generation (RAG), allowing the system to flexibly determine whether to rely on multimodal evidence, textual knowledge, or both in formulating an answer. To assess response quality, we adopt an LLM-as-a-judge methodology using Qwen3-32B in a zero-shot, unsupervised setting, applying direct scoring in a multi-dimensional quantitative evaluation framework. Preliminary results show that the system is capable of generating clear, relevant, and context-aware responses to agricultural queries, while remaining reproducible and scalable across geographic regions. The primary contributions of this work include an architecture for fusing multimodal EO and textual knowledge sources, a demonstration of lowering the barrier to access specialized agricultural information through natural language interaction, and an open and reproducible design.
zh
[AI-31] Evaluating the Energy Efficiency of NPU-Accelerated Machine Learning Inference on Embedded Microcontrollers
【速读】:该论文旨在解决在微控制器(Microcontroller, MCU)上部署机器学习(Machine Learning, ML)模型时面临的严格能效、延迟和内存限制问题,尤其针对电池供电和实时边缘设备。解决方案的关键在于利用神经网络处理单元(Neural Processing Unit, NPU)进行硬件加速,通过将推理任务从CPU卸载至NPU实现显著的性能提升。实验基于ARM Cortex-M55核心与Ethos-U55 NPU组成的平台,采用高精度GPIO触发同步的数字万用表测量方法,量化每推理周期的净能耗,并扣除空闲状态功耗,确保能量成本归因准确。结果表明,在典型ML模型(如MiniResNet、MobileNetV2、SSD-MobileNet等)上,NPU可带来7–125倍的延迟降低和最高达143倍的单位推理能耗减少,同时支持仅靠CPU无法运行的复杂模型(如SSD-MobileNet),验证了NPU在提升嵌入式AI效率与功能扩展性方面的核心作用。
链接: https://arxiv.org/abs/2509.17533
作者: Anastasios Fanariotis,Theofanis Orphanoudakis,Vasilis Fotopoulos
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The deployment of machine learning (ML) models on microcontrollers (MCUs) is constrained by strict energy, latency, and memory requirements, particularly in battery-operated and real-time edge devices. While software-level optimizations such as quantization and pruning reduce model size and computation, hardware acceleration has emerged as a decisive enabler for efficient embedded inference. This paper evaluates the impact of Neural Processing Units (NPUs) on MCU-based ML execution, using the ARM Cortex-M55 core combined with the Ethos-U55 NPU on the Alif Semiconductor Ensemble E7 development board as a representative platform. A rigorous measurement methodology was employed, incorporating per-inference net energy accounting via GPIO-triggered high-resolution digital multimeter synchronization and idle-state subtraction, ensuring accurate attribution of energy costs. Experimental results across six representative ML models -including MiniResNet, MobileNetV2, FD-MobileNet, MNIST, TinyYolo, and SSD-MobileNet- demonstrate substantial efficiency gains when inference is offloaded to the NPU. For moderate to large networks, latency improvements ranged from 7x to over 125x, with per-inference net energy reductions up to 143x. Notably, the NPU enabled execution of models unsupported on CPU-only paths, such as SSD-MobileNet, highlighting its functional as well as efficiency advantages. These findings establish NPUs as a cornerstone of energy-aware embedded AI, enabling real-time, power-constrained ML inference at the MCU level.
zh
[AI-32] Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM -Powered Agents EMNLP2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)代理在日益自主的敏感通信场景中引发的隐私泄露问题,尤其是在Model Context Protocol (MCP) 和 Agent-to-Agent (A2A) 框架下,现有静态基准测试无法准确反映实际环境中的隐私风险。解决方案的关键在于提出PrivacyChecker——一种模型无关、基于上下文完整性的隐私缓解方法,能够有效将DeepSeek-R1和GPT-4o上的隐私泄露率分别从36.08%和33.06%降低至7.30%和8.32%,同时保持任务完成度不受显著影响;此外,通过引入PrivacyLens-Live,将静态基准转化为动态MCP与A2A环境,揭示了实际应用中更高的隐私风险,并提供三种可集成到代理协议中的部署策略,从而为新兴的智能体生态系统提供实用的隐私保护机制。
链接: https://arxiv.org/abs/2509.17488
作者: Shouju Wang,Fenglin Yu,Xirui Liu,Xiaoting Qin,Jue Zhang,Qingwei Lin,Dongmei Zhang,Saravan Rajmohan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: To appear at EMNLP 2025 (Findings)
Abstract:The increasing autonomy of LLM agents in handling sensitive communications, accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A) frameworks, creates urgent privacy challenges. While recent work reveals significant gaps between LLMs’ privacy QA performance and their agent behavior, existing benchmarks remain limited to static, simplified scenarios. We present PrivacyChecker, a model-agnostic, contextual integrity based mitigation approach that effectively reduces privacy leakage from 36.08% to 7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving task helpfulness. We also introduce PrivacyLens-Live, transforming static benchmarks into dynamic MCP and A2A environments that reveal substantially higher privacy risks in practical. Our modular mitigation approach integrates seamlessly into agent protocols through three deployment strategies, providing practical privacy protection for the emerging agentic ecosystem. Our data and code will be made available at this https URL.
zh
[AI-33] ransformer-Gather Fuzzy-Reconsider: A Scalable Hybrid Framework for Entity Resolution
【速读】:该论文旨在解决企业系统中实体解析(Entity Resolution)面临的三大挑战:可扩展性不足、对噪声数据的鲁棒性差以及结果可靠性低的问题。其解决方案的关键在于提出了一种可扩展的混合框架,首先利用预训练语言模型将结构化数据编码为语义嵌入向量(semantic embedding vectors),从而实现高效的语义层面候选集检索;随后通过模糊字符串匹配(fuzzy string matching)技术进行语法层面的验证,以精炼未标注数据的分类结果。该方法在真实场景中成功应用于用户管理数据库与共享主机服务器记录之间的关联分析,兼具高处理效率和接近0.97的召回率,且可在标准CPU基础设施上部署,显著提升了企业级数据完整性审计的实用性与可靠性。
链接: https://arxiv.org/abs/2509.17470
作者: Mohammadreza Sharifi,Danial Ahmadzadeh
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted at ICCKE 2025 Conference. 6 tables, 7 figures
Abstract:Entity resolution plays a significant role in enterprise systems where data integrity must be rigorously maintained. Traditional methods often struggle with handling noisy data or semantic understanding, while modern methods suffer from computational costs or the excessive need for parallel computation. In this study, we introduce a scalable hybrid framework, which is designed to address several important problems, including scalability, noise robustness, and reliable results. We utilized a pre-trained language model to encode each structured data into corresponding semantic embedding vectors. Subsequently, after retrieving a semantically relevant subset of candidates, we apply a syntactic verification stage using fuzzy string matching techniques to refine classification on the unlabeled data. This approach was applied to a real-world entity resolution task, which exposed a linkage between a central user management database and numerous shared hosting server records. Compared to other methods, this approach exhibits an outstanding performance in terms of both processing time and robustness, making it a reliable solution for a server-side product. Crucially, this efficiency does not compromise results, as the system maintains a high retrieval recall of approximately 0.97. The scalability of the framework makes it deployable on standard CPU-based infrastructure, offering a practical and effective solution for enterprise-level data integrity auditing.
zh
[AI-34] AI Pangaea: Unifying Intelligence Islands for Adapting Myriad Tasks
【速读】:该论文旨在解决当前人工智能模型因任务专一性而导致的“智能孤岛”(Intelligence Islands)问题,即各AI模型局限于特定任务而缺乏跨任务泛化能力。其解决方案的关键在于提出Pangaea——首个类比地质学中泛大陆(Pangaea)的统一AI架构,通过将多模态数据编码为统一格式,并在296个数据集上进行预训练以积累通用知识,从而实现对45项通用任务和15项科学任务的显著泛化能力。研究进一步揭示了模态扩展的规模效应,量化了跨模态知识累积过程为几何分布的累积分布函数,表明Pangaea具备处理多样化任务的强大潜力,为迈向人工通用智能(Artificial General Intelligence, AGI)提供了新路径。
链接: https://arxiv.org/abs/2509.17460
作者: Jianlong Chang,Haixin Wang,Zhiyuan Dang,Li Huang,Zhiyu Wang,Ruoqi Cao,Shihao Piao,Dongzhe Li,Dianyu Gao,Dongsheng Wang,Yin Li,Jinan Sun,Lu Fang,Zhouchen Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 65 pages, 28 figures, paper under review
Abstract:The pursuit of artificial general intelligence continuously demands generalization in one model across myriad tasks, even those not seen before. However, current AI models are isolated from each other for being limited to specific tasks, now first defined as Intelligence Islands. To unify Intelligence Islands into one, we propose Pangaea, the first AI supercontinent akin to the geological Pangaea. Pangaea encodes any data into a unified format and accumulates universal knowledge through pre-training on 296 datasets across diverse modalities. Eventually, it demonstrates remarkable generalization across 45 general tasks and 15 scientific tasks encompassing a wide range of scientific subjects. By investigating Pangaea deeper, the scaling effect of modality is revealed, quantifying the universal knowledge accumulation across modalities as the cumulative distribution function of a geometric distribution. On the whole, Pangaea shows strong potential to handle myriad tasks, indicating a new direction toward artificial general intelligence.
zh
[AI-35] MVCL-DAF: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion ICASSP2026
【速读】:该论文旨在解决多模态意图识别(Multimodal Intent Recognition, MMIR)中存在的语义基础薄弱和在噪声或罕见类别条件下鲁棒性差的问题。其解决方案的关键在于提出MVCL-DAF++模型,包含两个核心模块:一是原型感知对比对齐(Prototype-aware contrastive alignment),通过将实例与类别级原型对齐以增强语义一致性;二是粗粒度到细粒度注意力融合(Coarse-to-fine attention fusion),整合全局模态摘要与词元级特征,实现层次化的跨模态交互。实验表明,该方法在MIntRec和MIntRec2.0数据集上均达到新的最先进性能,尤其在罕见类别识别上分别提升了+1.05%和+4.18%的加权F1分数,验证了原型引导学习与分层融合策略在提升多模态理解鲁棒性方面的有效性。
链接: https://arxiv.org/abs/2509.17446
作者: Haofeng Huang,Yifei Han,Long Zhang,Bin Li,Yangfan He
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Submitted to ICASSP 2026
Abstract:Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at this https URL.
zh
[AI-36] SPICED: A Synaptic Homeostasis-Inspired Framework for Unsupervised Continual EEG Decoding
【速读】:该论文旨在解决持续脑电图(EEG)解码中因个体间差异(inter-individual variability)导致的灾难性遗忘(catastrophic forgetting)问题,特别是在新个体不断涌现的现实场景下,如何实现无监督的持续学习。解决方案的关键在于提出了一种名为SPICED的类脑架构,其核心是整合了三种生物启发的突触稳态机制:(1)关键记忆再激活(critical memory reactivation)、(2)突触巩固(synaptic consolidation)和(3)突触重归一化(synaptic renormalization)。这些机制共同作用于突触网络,在持续适应过程中动态强化与任务相关的记忆痕迹、抑制有害记忆的重复激活,从而在保持对旧个体知识的同时,高效适应新个体,实现稳定且具泛化能力的持续学习。
链接: https://arxiv.org/abs/2509.17439
作者: Yangxuan Zhou,Sha Zhao,Jiquan Wang,Haiteng Jiang,Shijian Li,Tao Li,Gang Pan
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 21 pages, 13 figures
Abstract:Human brain achieves dynamic stability-plasticity balance through synaptic homeostasis. Inspired by this biological principle, we propose SPICED: a neuromorphic framework that integrates the synaptic homeostasis mechanism for unsupervised continual EEG decoding, particularly addressing practical scenarios where new individuals with inter-individual variability emerge continually. SPICED comprises a novel synaptic network that enables dynamic expansion during continual adaptation through three bio-inspired neural mechanisms: (1) critical memory reactivation; (2) synaptic consolidation and (3) synaptic renormalization. The interplay within synaptic homeostasis dynamically strengthens task-discriminative memory traces and weakens detrimental memories. By integrating these mechanisms with continual learning system, SPICED preferentially replays task-discriminative memory traces that exhibit strong associations with newly emerging individuals, thereby achieving robust adaptations. Meanwhile, SPICED effectively mitigates catastrophic forgetting by suppressing the replay prioritization of detrimental memories during long-term continual learning. Validated on three EEG datasets, SPICED show its effectiveness.
zh
[AI-37] Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments
【速读】:该论文试图解决的问题是:当前基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的具身智能体是否具备完成复合任务(composite tasks)的能力,从而评估其是否接近人工通用智能(Artificial General Intelligence, AGI)的要求。解决方案的关键在于设计了一套受儿童早期发展日常活动启发的复合任务,涵盖对象理解、空间智能和社会活动三大核心领域,并在动态模拟的家庭环境中对17种主流开源与专有MLLMs进行系统评估。结果表明,现有模型在这三类任务中均表现不佳,揭示了当前具身智能体能力与通用智能需求之间存在显著差距,为未来具身MLLM的发展提供了初步评估框架和实证基础。
链接: https://arxiv.org/abs/2509.17425
作者: Zhenliang Zhang,Yuxi Wang,Hongzhao Xie,Shiyun Zhao,Mingyuan Liu,Yujie Lu,Xinyi He,Zhenku Cheng,Yujia Peng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object understanding, spatial intelligence, and social activity. We evaluated 17 leading proprietary and open-source MLLMs on these tasks. The results consistently showed poor performance across all three domains, indicating a substantial gap between current capabilities and general intelligence requirements. Together, our tasks offer a preliminary framework for evaluating the general capabilities of embodied agents, marking an early but significant step toward the development of embodied MLLMs and their real-world deployment.
zh
[AI-38] Distributionally Robust Safety Verification of Neural Networks via Worst-Case CVaR
【速读】:该论文旨在解决神经网络在输入不确定性下的安全性保障问题,尤其关注尾部风险(tail risk)对安全关键系统的影响。其核心挑战在于如何在保持计算可 tractability 的同时,显式建模并控制极端事件带来的风险。解决方案的关键在于将Fazlyab提出的二次约束(Quadratic Constraint, QC)与半定规划(Semidefinite Programming, SDP)框架扩展至分布鲁棒且尾部风险感知的设置:通过在基于矩信息的模糊集(ambiguity set)上引入最坏情况条件风险价值(Worst-Case Conditional Value-at-Risk, WC-CVaR),使得验证条件仍为SDP可检验,并能显式刻画尾部风险;该方法不仅拓展了输入不确定性的几何表示(包括椭球、多面体和超平面),还保留了原有QC/SDP方法的计算结构,从而在闭环控制系统可达性分析和分类任务中实现风险水平ε从保守性到尾部事件容忍度的灵活权衡。
链接: https://arxiv.org/abs/2509.17413
作者: Masako Kishida
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Systems and Control (eess.SY); Optimization and Control (math.OC)
备注:
Abstract:Ensuring the safety of neural networks under input uncertainty is a fundamental challenge in safety-critical applications. This paper builds on and expands Fazlyab’s quadratic-constraint (QC) and semidefinite-programming (SDP) framework for neural network verification to a distributionally robust and tail-risk-aware setting by integrating worst-case Conditional Value-at-Risk (WC-CVaR) over a moment-based ambiguity set with fixed mean and covariance. The resulting conditions remain SDP-checkable and explicitly account for tail risk. This integration broadens input-uncertainty geometry-covering ellipsoids, polytopes, and hyperplanes-and extends applicability to safety-critical domains where tail-event severity matters. Applications to closed-loop reachability of control systems and classification are demonstrated through numerical experiments, illustrating how the risk level \varepsilon trades conservatism for tolerance to tail events-while preserving the computational structure of prior QC/SDP methods for neural network verification and robustness analysis.
zh
[AI-39] Correlation or Causation: Analyzing the Causal Structures of LLM and LRM Reasoning Process
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中存在的因果推理缺陷问题,如不忠实性(unfaithfulness)、偏见(bias)和不一致性(inconsistency),这些问题源于其缺乏稳健的因果基础并可能依赖表面相关性而非真实理解。解决方案的关键在于引入基于强化学习与价值函数回归(Reinforcement Learning with Value Regression, RLVR)训练的逻辑推理模型(Logical Reasoning Models, LRMs),实证表明RLVR能够显著增强模型的因果推理能力,通过减少虚假相关性并强化真实的因果模式,从而改善不忠实性和偏见问题;进一步分析显示,RLVR训练过程中的因果结构动态演化与虚假特征的降低高度相关,验证了其对构建具有更强因果基础的AI系统的重要意义。
链接: https://arxiv.org/abs/2509.17380
作者: Zhizhang FU,Guangsheng Bao,Hongbo Zhang,Chenkai Hu,Yue Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:LLMs suffer from critical reasoning issues such as unfaithfulness, bias, and inconsistency, since they lack robust causal underpinnings and may rely on superficial correlations rather than genuine understanding. Successive LRMs have emerged as a promising alternative, leveraging advanced training techniques such as reinforcement learning (RL) and distillation to improve task accuracy. However, the impact of these training methods on causality remains largely unexplored. In this study, we conduct a systematic causal analysis on LLMs and LRMs, examining structural causal models (SCMs) of four key variables: problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y). Our findings reveal that RLVR-trained LRMs exhibit enhanced causal reasoning capabilities, aligning more closely with ideal causal structures, while LLMs and distilled LRMs fail to address causality-related deficiencies. Our further investigation indicates that RLVR reduces spurious correlations and strengthens genuine causal patterns, thereby mitigating unfaithfulness and bias. In addition, our inspection on the dynamics of the RLVR training process observes a high correlation between reduced spurious features and improved causal structures, where the causal relationships consistently improve in the training process. This study contributes to the understanding of causality in reasoning models, highlights the critical role of RLVR in enhancing causal reasoning, and provides insights for designing future AI systems with stronger causal foundations. We release our code and data at this https URL.
zh
[AI-40] SeqUDA-Rec: Sequential User Behavior Enhanced Recommendation via Global Unsupervised Data Augmentation for Personalized Content Marketing
【速读】:该论文旨在解决个性化内容推荐中传统推荐系统面临的两大挑战:一是依赖有限的显式用户反馈信号(如点击、评分等),导致监督信号不足;二是对噪声或无意交互(如误触、随机浏览)敏感,影响模型鲁棒性。其核心解决方案是提出SeqUDA-Rec框架,关键在于融合用户行为序列与全局无监督数据增强策略:首先构建全局用户-物品交互图(Global User-Item Interaction Graph, GUIG)以捕捉局部与全局物品关联;随后引入图对比学习模块生成更稳健的嵌入表示,并采用基于Transformer的序列编码器建模用户偏好演化;进一步通过GAN-based数据增强策略生成合理交互模式,扩充训练样本,缓解标签稀疏问题。实验表明,该方法在Amazon Ads和TikTok Ad Clicks两个真实数据集上显著优于SASRec、BERT4Rec和GCL4SR等先进基线模型,在NDCG@10和HR@10指标上分别提升6.7%和11.3%。
链接: https://arxiv.org/abs/2509.17361
作者: Ruihan Luo,Xuanjing Chen,Ziyang Ding
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注:
Abstract:Personalized content marketing has become a crucial strategy for digital platforms, aiming to deliver tailored advertisements and recommendations that match user preferences. Traditional recommendation systems often suffer from two limitations: (1) reliance on limited supervised signals derived from explicit user feedback, and (2) vulnerability to noisy or unintentional interactions. To address these challenges, we propose SeqUDA-Rec, a novel deep learning framework that integrates user behavior sequences with global unsupervised data augmentation to enhance recommendation accuracy and robustness. Our approach first constructs a Global User-Item Interaction Graph (GUIG) from all user behavior sequences, capturing both local and global item associations. Then, a graph contrastive learning module is applied to generate robust embeddings, while a sequential Transformer-based encoder models users’ evolving preferences. To further enhance diversity and counteract sparse supervised labels, we employ a GAN-based augmentation strategy, generating plausible interaction patterns and supplementing training data. Extensive experiments on two real-world marketing datasets (Amazon Ads and TikTok Ad Clicks) demonstrate that SeqUDA-Rec significantly outperforms state-of-the-art baselines such as SASRec, BERT4Rec, and GCL4SR. Our model achieves a 6.7% improvement in NDCG@10 and 11.3% improvement in HR@10, proving its effectiveness in personalized advertising and intelligent content recommendation.
zh
[AI-41] Multi-Scenario Highway Lane-Change Intention Prediction: A Physics-Informed AI Framework for Three-Class Classification
【速读】:该论文旨在解决高速公路场景下变道意图预测的准确性与泛化能力不足的问题,尤其针对现有方法在二分类限制、场景多样性缺乏以及长时预测性能下降等方面的局限性。其解决方案的关键在于提出了一种物理信息驱动的人工智能框架,将车辆运动学(vehicle kinematics)、交互可行性及交通安全性指标(如跟车距离、时间头间距、碰撞时间、闭合间隙时间等)显式融合进机器学习建模过程,将变道预测任务重构为左变道、右变道和不变道的三分类问题,并通过LightGBM模型实现高精度实时预测,在highD和exiD两个不同复杂度的数据集上均展现出优于传统两层堆叠LSTM基线的性能表现。
链接: https://arxiv.org/abs/2509.17354
作者: Jiazhao Shi,Yichen Lin,Yiheng Hua,Ziyu Wang,Zijian Zhang,Wenjia Zheng,Yun Song,Kuan Lu,Shoufeng Lu
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Lane-change maneuvers are a leading cause of highway accidents, underscoring the need for accurate intention prediction to improve the safety and decision-making of autonomous driving systems. While prior studies using machine learning and deep learning methods (e.g., SVM, CNN, LSTM, Transformers) have shown promise, most approaches remain limited by binary classification, lack of scenario diversity, and degraded performance under longer prediction horizons. In this study, we propose a physics-informed AI framework that explicitly integrates vehicle kinematics, interaction feasibility, and traffic-safety metrics (e.g., distance headway, time headway, time-to-collision, closing gap time) into the learning process. lane-change prediction is formulated as a three-class problem that distinguishes left change, right change, and no change, and is evaluated across both straight highway segments (highD) and complex ramp scenarios (exiD). By integrating vehicle kinematics with interaction features, our machine learning models, particularly LightGBM, achieve state-of-the-art accuracy and strong generalization. Results show up to 99.8% accuracy and 93.6% macro F1 on highD, and 96.1% accuracy and 88.7% macro F1 on exiD at a 1-second horizon, outperforming a two-layer stacked LSTM baseline. These findings demonstrate the practical advantages of a physics-informed and feature-rich machine learning framework for real-time lane-change intention prediction in autonomous driving systems.
zh
[AI-42] Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation NEURIPS2025
【速读】:该论文旨在解决放射学报告生成自动化中的双重挑战:构建临床可靠的系统以及设计严谨的评估协议。其解决方案的关键在于提出一个基于多智能体强化学习(multi-agent reinforcement learning)的框架,该框架不仅作为基准测试平台,还充当多模态临床推理的评估环境。该框架采用模块化架构,集成大语言模型(LLMs)与大视觉模型(LVMs),由十个专业化智能体协同完成图像分析、特征提取、报告生成、审核及评估等任务,从而实现从单个智能体层面(如检测与分割精度)到共识层面(如报告质量与临床相关性)的细粒度评估。通过在公开放射学数据集上使用ChatGPT-4o实现该框架,并结合医学放射科医生反馈进行评估,该方案实现了与LLM开发生命周期(预训练、微调、对齐和部署)相一致的评估协议,为可信的基于偏差的放射学报告生成提供了可扩展路径。
链接: https://arxiv.org/abs/2509.17353
作者: Ahmed T. Elboardy,Ghada Khoriba,Essam A. Rashed
机构: 未知
类目: Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV); Medical Physics (physics.med-ph)
备注: NeurIPS2025 Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Abstract:Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi-agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using chatGPT-4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance-based radiology report generation.
zh
[AI-43] Explainability matters: The effect of liability rules on the healthcare sector
【速读】:该论文旨在解决医疗领域中人工智能系统(AIS)的可解释性如何影响责任归属的问题,特别是在医疗从业者/医疗机构与AI制造商之间厘清法律责任时。其解决方案的关键在于论证可解释性在构建法律责任框架中的核心作用:通过提供透明的决策过程,可解释性不仅有助于明确各方责任边界,还能引导相关主体行为、降低防御性医疗(defensive medicine)风险,从而实现更合理且可执行的责任分配机制。
链接: https://arxiv.org/abs/2509.17334
作者: Jiawen Wei,Elena Verona,Andrea Bertolini,Gianmarco Mengaldo
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:
Abstract:Explainability, the capability of an artificial intelligence system (AIS) to explain its outcomes in a manner that is comprehensible to human beings at an acceptable level, has been deemed essential for critical sectors, such as healthcare. Is it really the case? In this perspective, we consider two extreme cases, Oracle'' (without explainability) versus
AI Colleague’’ (with explainability) for a thorough analysis. We discuss how the level of automation and explainability of AIS can affect the determination of liability among the medical practitioner/facility and manufacturer of AIS. We argue that explainability plays a crucial role in setting a responsibility framework in healthcare, from a legal standpoint, to shape the behavior of all involved parties and mitigate the risk of potential defensive medicine practices.
zh
[AI-44] raining the next generation of physicians for artificial intelligence-assisted clinical neuroradiology: ASNR MICCAI Brain Tumor Segmentation (BraTS) 2025 Lighthouse Challenge education platform
【速读】:该论文旨在解决当前神经放射学人工智能(Artificial Intelligence, AI)教育中高质量标注数据生成与医学生及住院医师AI实践能力培养不足的问题。解决方案的关键在于构建一个多模态教育框架,通过参与MICCAI脑肿瘤分割灯塔挑战赛(MICCAI Brain Tumor Segmentation Lighthouse Challenge 2025),组织医学生和放射科住院医师在神经病理MRI教学指导下进行脑肿瘤磁共振图像的标注任务,并配对资深神经放射学专家开展一对一指导,辅以神经解剖、病理学、AI基础讲座及数据科学家主导的工作坊。此过程不仅提升了参与者对图像分割软件和脑肿瘤影像特征的熟悉度,还系统性地强化了其对参考标准数据构建的理解,从而推动AI驱动医学影像分析的人才培养与知识传播。
链接: https://arxiv.org/abs/2509.17281
作者: Raisa Amiruddin,Nikolay Y. Yordanov,Nazanin Maleki,Pascal Fehringer,Athanasios Gkampenis,Anastasia Janas,Kiril Krantchev,Ahmed Moawad,Fabian Umeh,Salma Abosabie,Sara Abosabie,Albara Alotaibi,Mohamed Ghonim,Mohanad Ghonim,Sedra Abou Ali Mhana,Nathan Page,Marko Jakovljevic,Yasaman Sharifi,Prisha Bhatia,Amirreza Manteghinejad,Melisa Guelen,Michael Veronesi,Virginia Hill,Tiffany So,Mark Krycia,Bojan Petrovic,Fatima Memon,Justin Cramer,Elizabeth Schrickel,Vilma Kosovic,Lorenna Vidal,Gerard Thompson,Ichiro Ikuta,Basimah Albalooshy,Ali Nabavizadeh,Nourel Hoda Tahon,Karuna Shekdar,Aashim Bhatia,Claudia Kirsch,Gennaro D’Anna,Philipp Lohmann,Amal Saleh Nour,Andriy Myronenko,Adam Goldman-Yassen,Janet R. Reid,Sanjay Aneja,Spyridon Bakas,Mariam Aboian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: 23 pages, 9 figures, 1 table, 3 supplementary tables
Abstract:High-quality reference standard image data creation by neuroradiology experts for automated clinical tools can be a powerful tool for neuroradiology artificial intelligence education. We developed a multimodal educational approach for students and trainees during the MICCAI Brain Tumor Segmentation Lighthouse Challenge 2025, a landmark initiative to develop accurate brain tumor segmentation algorithms. Fifty-six medical students radiology trainees volunteered to annotate brain tumor MR images for the BraTS challenges of 2023 2024, guided by faculty-led didactics on neuropathology MRI. Among the 56 annotators, 14 select volunteers were then paired with neuroradiology faculty for guided one-on-one annotation sessions for BraTS 2025. Lectures on neuroanatomy, pathology AI, journal clubs data scientist-led workshops were organized online. Annotators audience members completed surveys on their perceived knowledge before after annotations lectures respectively. Fourteen coordinators, each paired with a neuroradiologist, completed the data annotation process, averaging 1322.9+/-760.7 hours per dataset per pair and 1200 segmentations in total. On a scale of 1-10, annotation coordinators reported significant increase in familiarity with image segmentation software pre- and post-annotation, moving from initial average of 6+/-2.9 to final average of 8.9+/-1.1, and significant increase in familiarity with brain tumor features pre- and post-annotation, moving from initial average of 6.2+/-2.4 to final average of 8.1+/-1.2. We demonstrate an innovative offering for providing neuroradiology AI education through an image segmentation challenge to enhance understanding of algorithm development, reinforce the concept of data reference standard, and diversify opportunities for AI-driven image analysis among future physicians.
zh
[AI-45] Mind the Gap: Comparing Model- vs Agent ic-Level Red Teaming with Action-Graph Observability on GPT -OSS-20B
【速读】:该论文旨在解决当前对生成式 AI (Generative AI) 安全性的理解主要聚焦于模型层面,而忽视了在实际部署中,当模型作为代理(agentic)系统运行并与外部工具和环境交互时所引入的独特漏洞问题。解决方案的关键在于构建一个名为 AgentSeer 的可观测性框架,将 agentic 系统解构为细粒度的动作与组件,并通过对比红队测试(red teaming)分析 GPT-OSS-20B 模型在独立模型模式与 agentic 循环模式下的漏洞表现。研究发现,存在仅在 agentic 执行环境中出现的“代理专属漏洞”(agentic-only vulnerabilities),且工具调用上下文下的攻击成功率比非工具场景高 24%,表明传统模型级评估无法全面覆盖 agentic 系统的真实风险。
链接: https://arxiv.org/abs/2509.17259
作者: Ilham Wicaksono,Zekun Wu,Rahul Patel,Theo King,Adriano Koshiyama,Philip Treleaven
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Winner of the OpenAI GPT-OSS-20B Red Teaming Challenge (Kaggle, 2025)
Abstract:As the industry increasingly adopts agentic AI systems, understanding their unique vulnerabilities becomes critical. Prior research suggests that security flaws at the model level do not fully capture the risks present in agentic deployments, where models interact with tools and external environments. This paper investigates this gap by conducting a comparative red teaming analysis of GPT-OSS-20B, a 20-billion parameter open-source model. Using our observability framework AgentSeer to deconstruct agentic systems into granular actions and components, we apply iterative red teaming attacks with harmful objectives from HarmBench at two distinct levels: the standalone model and the model operating within an agentic loop. Our evaluation reveals fundamental differences between model level and agentic level vulnerability profiles. Critically, we discover the existence of agentic-only vulnerabilities, attack vectors that emerge exclusively within agentic execution contexts while remaining inert against standalone models. Agentic level iterative attacks successfully compromise objectives that completely failed at the model level, with tool-calling contexts showing 24% higher vulnerability than non-tool contexts. Conversely, certain model-specific exploits work exclusively at the model level and fail when transferred to agentic contexts, demonstrating that standalone model vulnerabilities do not always generalize to deployed systems.
zh
[AI-46] SignalLLM : A General-Purpose LLM Agent Framework for Automated Signal Processing
【速读】:该论文旨在解决现代信号处理(Signal Processing, SP)流水线在复杂性高、碎片化严重、高度依赖专家知识与手工工程,以及在数据有限条件下适应性和泛化能力不足的问题。其解决方案的关键在于提出SignalLLM——首个面向通用信号处理任务的基于大语言模型(Large Language Models, LLMs)的智能体框架,通过引入结构化的模块化架构,利用上下文学习和领域特定检索将高层SP目标分解为可执行子任务,并借助自适应增强检索生成(Retrieval-Augmented Generation, RAG)与迭代优化进行分层规划;随后,结合提示推理、跨模态推理、代码合成、模型调用或数据驱动的LLM辅助建模等策略执行具体操作,从而实现对不同信号模态、任务类型和数据条件的灵活适配与高效求解。
链接: https://arxiv.org/abs/2509.17197
作者: Junlong Ke,Qiying Hu,Shenghai Yuan,Yuecong Xu,Jianfei Yang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
备注: 11 pages
Abstract:Modern signal processing (SP) pipelines, whether model-based or data-driven, often constrained by complex and fragmented workflow, rely heavily on expert knowledge and manual engineering, and struggle with adaptability and generalization under limited data. In contrast, Large Language Models (LLMs) offer strong reasoning capabilities, broad general-purpose knowledge, in-context learning, and cross-modal transfer abilities, positioning them as powerful tools for automating and generalizing SP workflows. Motivated by these potentials, we introduce SignalLLM, the first general-purpose LLM-based agent framework for general SP tasks. Unlike prior LLM-based SP approaches that are limited to narrow applications or tricky prompting, SignalLLM introduces a principled, modular architecture. It decomposes high-level SP goals into structured subtasks via in-context learning and domain-specific retrieval, followed by hierarchical planning through adaptive retrieval-augmented generation (RAG) and refinement; these subtasks are then executed through prompt-based reasoning, cross-modal reasoning, code synthesis, model invocation, or data-driven LLM-assisted modeling. Its generalizable design enables the flexible selection of problem solving strategies across different signal modalities, task types, and data conditions. We demonstrate the versatility and effectiveness of SignalLLM through five representative tasks in communication and sensing, such as radar target detection, human activity recognition, and text compression. Experimental results show superior performance over traditional and existing LLM-based methods, particularly in few-shot and zero-shot settings.
zh
[AI-47] Shall We Play a Game? Language Models for Open-ended Wargames
【速读】:该论文旨在解决如何在开放性较强的战争游戏(wargames)中有效整合语言模型(Language Models, LMs),以支持玩家与裁判在自然语言交互下进行更具创造性和动态性的决策探索。其核心问题在于:在缺乏结构化规则的复杂场景中,LMs 如何既保障决策质量,又确保安全可控地参与游戏过程。解决方案的关键在于通过系统性文献综述构建一个关于战争游戏的本体(ontology),明确区分玩家与裁判在创造性空间中的角色,并据此提炼出适用于不同应用场景的 LM 使用条件、部署最佳实践及安全规范,从而为未来高影响力的研究方向提供理论框架与实证基础。
链接: https://arxiv.org/abs/2509.17192
作者: Glenn Matlin,Parv Mahajan,Isaac Song,Yixiong Hao,Ryan Bard,Stu Topp,Evan Montoya,M. Rehan Parwani,Soham Shetty,Mark Riedl
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Wargames are multi-faceted, multi-player depictions of conflict in which participants’ decisions influence future events. Wargames are often used to explore the strategic implications of decision-making. However, it also encompasses entertainment-oriented simulations, ranging from Chess to tabletop role-playing games like Dungeons Dragons (DD). On the more open-ended side of the spectrum of wargames, players use natural language to convey their moves, and adjudicators propose outcomes. Language Models (LMs) are increasingly being considered for how they can provide insights into real-world, consequential decisions. We conduct a scoping literature review of a curated selection of 100 recent works on AI in wargames, from which we construct an ontology of wargames in terms of the creativity afforded to either the players or adjudicators. Focusing on the space of wargames with the most open-endedness for players and adjudicators, we distill a set of considerations for when and how to use LMs in different application areas. We also present a set of safety considerations, best practices for deploying LMs in open-ended wargames, and conclude with a set of high-impact open research challenges.
zh
[AI-48] Dendritic Resonate-and-Fire Neuron for Effective and Efficient Long Sequence Modeling
【速读】:该论文旨在解决长序列建模中对高效性和有效性需求日益增长的问题,特别是现有Resonate-and-Fire(RF)神经元在复杂时序任务上存在有效记忆容量有限、能效与训练速度之间存在权衡的局限性。其解决方案的关键在于提出一种基于生物神经元树突结构启发的Dendritic Resonate-and-Fire(D-RF)模型,该模型通过引入多树突-胞体架构实现频率分层编码:每个树突分支利用RF神经元内在振荡动力学专门编码特定频带,从而构建全面的频域表示;同时,在胞体中设计自适应阈值机制,根据历史放电活动动态调整阈值,减少冗余脉冲并保持训练效率,最终在保证计算效率的同时显著提升稀疏脉冲特性与模型性能。
链接: https://arxiv.org/abs/2509.17186
作者: Dehao Zhang,Malu Zhang,Shuai Wang,Jingya Wang,Wenjie Wei,Zeyu Ma,Guoqing Wang,Yang Yang,HaiZhou Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The explosive growth in sequence length has intensified the demand for effective and efficient long sequence modeling. Benefiting from intrinsic oscillatory membrane dynamics, Resonate-and-Fire (RF) neurons can efficiently extract frequency components from input signals and encode them into spatiotemporal spike trains, making them well-suited for long sequence modeling. However, RF neurons exhibit limited effective memory capacity and a trade-off between energy efficiency and training speed on complex temporal tasks. Inspired by the dendritic structure of biological neurons, we propose a Dendritic Resonate-and-Fire (D-RF) model, which explicitly incorporates a multi-dendritic and soma architecture. Each dendritic branch encodes specific frequency bands by utilizing the intrinsic oscillatory dynamics of RF neurons, thereby collectively achieving comprehensive frequency representation. Furthermore, we introduce an adaptive threshold mechanism into the soma structure that adjusts the threshold based on historical spiking activity, reducing redundant spikes while maintaining training efficiency in long sequence tasks. Extensive experiments demonstrate that our method maintains competitive accuracy while substantially ensuring sparse spikes without compromising computational efficiency during training. These results underscore its potential as an effective and efficient solution for long sequence modeling on edge platforms.
zh
[AI-49] me Series Forecasting Using a Hybrid Deep Learning Method: A Bi-LSTM Embedding Denoising Auto Encoder Transformer
【速读】:该论文旨在解决短时电动汽车(Electric Vehicles, EVs)充电负荷预测问题,以支持基础设施规划、负荷均衡和能源管理等决策。其解决方案的关键在于提出一种双向长短期记忆网络嵌入去噪自编码器模型(Bidirectional LSTM Embedding Denoising Autoencoder Model, BDM),通过融合双向LSTM的时序建模能力与去噪自编码器的特征提取与噪声抑制机制,显著提升了时间序列 forecasting 的准确性,在五个时间步中优于Transformer、CNN、RNN、LSTM和GRU等基准模型中的四项。
链接: https://arxiv.org/abs/2509.17165
作者: Sahar Koohfar,Wubeshet Woldemariam
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Time series data is a prevalent form of data found in various fields. It consists of a series of measurements taken over time. Forecasting is a crucial application of time series models, where future values are predicted based on historical data. Accurate forecasting is essential for making well-informed decisions across industries. When it comes to electric vehicles (EVs), precise predictions play a key role in planning infrastructure development, load balancing, and energy management. This study introduces a BI-LSTM embedding denoising autoencoder model (BDM) designed to address time series problems, focusing on short-term EV charging load prediction. The performance of the proposed model is evaluated by comparing it with benchmark models like Transformer, CNN, RNN, LSTM, and GRU. Based on the results of the study, the proposed model outperforms the benchmark models in four of the five-time steps, demonstrating its effectiveness for time series forecasting. This research makes a significant contribution to enhancing time series forecasting, thereby improving decision-making processes.
zh
[AI-50] Flow-Induced Diagonal Gaussian Processes
【速读】:该论文旨在解决贝叶斯神经网络中权重不确定性建模的计算与存储开销问题,尤其是在大规模模型中难以高效进行不确定性估计和分布外(Out-of-Distribution, OoD)检测的挑战。其核心解决方案是提出流诱导对角高斯过程(Flow-Induced Diagonal Gaussian Processes, FiD-GP),关键在于引入紧凑的诱导权重矩阵(inducing weight matrix)以将神经网络权重不确定性投影至低维子空间,并结合归一化流先验(normalising-flow priors)与谱正则化(spectral regularisations),在数值稳定的基础上使诱导子空间与特征梯度几何结构对齐,从而提升不确定性建模的表达能力和OoD检测的理论保障性。
链接: https://arxiv.org/abs/2509.17153
作者: Moule Lin,Andrea Patane,Weipeng Jing,Shuhao Guan,Goetz Botterweck
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 15 pages
Abstract:We present Flow-Induced Diagonal Gaussian Processes (FiD-GP), a compression framework that incorporates a compact inducing weight matrix to project a neural network’s weight uncertainty into a lower-dimensional subspace. Critically, FiD-GP relies on normalising-flow priors and spectral regularisations to augment its expressiveness and align the inducing subspace with feature-gradient geometry through a numerically stable projection mechanism objective. Furthermore, we demonstrate how the prediction framework in FiD-GP can help to design a single-pass projection for Out-of-Distribution (OoD) detection. Our analysis shows that FiD-GP improves uncertainty estimation ability on various tasks compared with SVGP-based baselines, satisfies tight spectral residual bounds with theoretically guaranteed OoD detection, and significantly compresses the neural network’s storage requirements at the cost of increased inference computation dependent on the number of inducing weights employed. Specifically, in a comprehensive empirical study spanning regression, image classification, semantic segmentation, and out-of-distribution detection benchmarks, it cuts Bayesian training cost by several orders of magnitude, compresses parameters by roughly 51%, reduces model size by about 75%, and matches state-of-the-art accuracy and uncertainty estimation.
zh
[AI-51] ScenGAN: Attention-Intensive Generative Model for Uncertainty-Aware Renewable Scenario Forecasting
【速读】:该论文旨在解决可再生能源发电(Renewable Energy Source, RES)出力的间歇性问题,通过场景预测(Scenario Forecasting)提供一系列随机实现,以更灵活且直观的方式刻画预测对象的不确定性。其解决方案的关键在于构建一个不确定性感知的模型,该模型融合注意力机制与生成对抗网络(Generative Adversarial Networks, GANs),精准捕捉复杂的时空动态特征;同时引入贝叶斯深度学习和自适应实例归一化(Adaptive Instance Normalization, AdaIN)提升对RES出力不确定行为的可解释性,能够模拟典型模式与变异;此外,在处理层中整合气象信息、预报数据及历史轨迹,增强对多尺度周期规律的协同建模能力,从而有效区分并表征认知不确定性(epistemic uncertainty)与随机不确定性(aleatoric uncertainty),在数值实验与案例分析中展现出优于现有方法的性能。
链接: https://arxiv.org/abs/2509.17119
作者: Yifei Wu,Bo Wang,Jingshi Cui,Pei-chun Lin,Junzo Watada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:To address the intermittency of renewable energy source (RES) generation, scenario forecasting offers a series of stochastic realizations for predictive objects with superior flexibility and direct views. Based on a long time-series perspective, this paper explores uncertainties in the realms of renewable power and deep learning. Then, an uncertainty-aware model is meticulously designed for renewable scenario forecasting, which leverages an attention mechanism and generative adversarial networks (GANs) to precisely capture complex spatial-temporal dynamics. To improve the interpretability of uncertain behavior in RES generation, Bayesian deep learning and adaptive instance normalization (AdaIN) are incorporated to simulate typical patterns and variations. Additionally, the integration of meteorological information, forecasts, and historical trajectories in the processing layer improves the synergistic forecasting capability for multiscale periodic regularities. Numerical experiments and case analyses demonstrate that the proposed approach provides an appropriate interpretation for renewable uncertainty representation, including both aleatoric and epistemic uncertainties, and shows superior performance over state-of-the-art methods.
zh
[AI-52] MCTS-EP: Empowering Embodied Planning with Online Preference Optimization
【速读】:该论文旨在解决具身智能体(embodied agents)在复杂环境中进行高效在线学习时面临的样本效率低和策略优化困难的问题。其解决方案的关键在于提出了一种名为MCTS-EP的在线学习框架,该框架将大语言模型(LLM)与蒙特卡洛树搜索(MCTS)相结合,通过三个核心组件实现性能提升:一是利用MCTS引导探索以收集偏好数据,二是设计高效的多模态推理机制,三是基于偏好优化的迭代训练流程。理论分析表明,在损失函数强凸条件下,MCTS-EP优于传统on-policy算法,并可被形式化为一种增强搜索的GAIL变体,从而在ALFWorld和WebShop等多个基准上实现了当前最优性能。
链接: https://arxiv.org/abs/2509.17116
作者: Hang Xu,Zang Yu,Yehui Tang,Pengbo Hu,Yuhao Tang,Hao Dong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces MCTS-EP, an online learning framework that combines large language models (LLM) with Monte Carlo Tree Search (MCTS) for training embodied agents. MCTS-EP integrates three key components: MCTS-guided exploration for preference data collection, efficient multi-modal reasoning mechanism, and iterative training pipeline based on preference optimization. We theoretically prove that MCTS-EP achieves better performance bounds than conventional on-policy algorithms when the loss function is strongly convex, and demonstrate that it can be formulated as a search-enhanced variant of GAIL. MCTS-EP achieves state-of-the-art performace across serval benchmarks. In ALFWorld, it achieves 92% and 87% success rates for textual and visual tasks. In WebShop, it reaches an average reward of 0.81. MTCS-EP also reduces average interaction steps from from 18.7/19.5 to 10.2/9.9 steps in visual this http URL available at: this https URL
zh
[AI-53] Prompt-with-Me: in-IDE Structured Prompt Management for LLM -Driven Software Engineering
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件工程实践中因提示(prompt)管理方式随意而导致的可靠性差、复用性低以及难以集成到工业级开发流程中的问题。其核心解决方案是提出 Prompt-with-Me,一个嵌入开发环境的结构化提示管理系统;关键在于通过一个四维分类体系(意图、作者角色、软件开发生命周期阶段、提示类型)自动对提示进行分类,并结合语言优化建议、敏感信息掩码及可复用模板提取功能,显著提升提示的质量与使用效率。
链接: https://arxiv.org/abs/2509.17096
作者: Ziyou Li,Agnia Sergeyuk,Maliheh Izadi
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: Accepted in the 40th IEEE/ACM International Conference on Automated Software Engineering, ASE 2025 (Industry track)
Abstract:Large Language Models are transforming software engineering, yet prompt management in practice remains ad hoc, hindering reliability, reuse, and integration into industrial workflows. We present Prompt-with-Me, a practical solution for structured prompt management embedded directly in the development environment. The system automatically classifies prompts using a four-dimensional taxonomy encompassing intent, author role, software development lifecycle stage, and prompt type. To enhance prompt reuse and quality, Prompt-with-Me suggests language refinements, masks sensitive information, and extracts reusable templates from a developer’s prompt library. Our taxonomy study of 1108 real-world prompts demonstrates that modern LLMs can accurately classify software engineering prompts. Furthermore, our user study with 11 participants shows strong developer acceptance, with high usability (Mean SUS=73), low cognitive load (Mean NASA-TLX=21), and reported gains in prompt quality and efficiency through reduced repetitive effort. Lastly, we offer actionable insights for building the next generation of prompt management and maintenance tools for software engineering workflows.
zh
[AI-54] Ultra-short-term solar power forecasting by deep learning and data reconstruction
【速读】:该论文旨在解决分布式太阳能发电因间歇性导致的电网稳定性和能源调度挑战,核心问题是实现高精度、近实时的超短期太阳能功率预测(ultra-short-term solar power prediction),以支持可再生能源在电力系统中的高效渗透。解决方案的关键在于提出一种基于深度学习的数据重构方法:首先利用自适应噪声的集合经验模态分解(CEEMDAN)将原始数据分解为低频和高频分量,从而更好地挖掘时空依赖关系;随后融合气象数据与重构后的分量,并通过深度学习模型捕捉长短期依赖特征,提升预测精度;此外,为避免超短期预测易陷入局部最优的问题,改进了训练优化策略,在损失函数中引入对长预测区间惩罚项,从而增强模型泛化能力与预测稳定性。
链接: https://arxiv.org/abs/2509.17095
作者: Jinbao Wang,Jun Liu,Shiliang Zhang,Xuehui Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The integration of solar power has been increasing as the green energy transition rolls out. The penetration of solar power challenges the grid stability and energy scheduling, due to its intermittent energy generation. Accurate and near real-time solar power prediction is of critical importance to tolerant and support the permeation of distributed and volatile solar power production in the energy system. In this paper, we propose a deep-learning based ultra-short-term solar power prediction with data reconstruction. We decompose the data for the prediction to facilitate extensive exploration of the spatial and temporal dependencies within the data. Particularly, we reconstruct the data into low- and high-frequency components, using ensemble empirical model decomposition with adaptive noise (CEEMDAN). We integrate meteorological data with those two components, and employ deep-learning models to capture long- and short-term dependencies towards the target prediction period. In this way, we excessively exploit the features in historical data in predicting a ultra-short-term solar power production. Furthermore, as ultra-short-term prediction is vulnerable to local optima, we modify the optimization in our deep-learning training by penalizing long prediction intervals. Numerical experiments with diverse settings demonstrate that, compared to baseline models, the proposed method achieves improved generalization in data reconstruction and higher prediction accuracy for ultra-short-term solar power production.
zh
[AI-55] Governing Automated Strategic Intelligence
【速读】:该论文旨在解决国家间战略竞争日益依赖前沿人工智能模型的能力与成本问题,特别是如何通过自动化军事情报分析来获得地缘政治优势。其核心问题是:当前人类情报分析师(如CIA分析师)在数据整合与战略研判方面的任务,能否由多模态基础模型(Multimodal Foundation Models)高效替代,并实现规模化、实时化的智能决策支持。解决方案的关键在于利用多模态基础模型融合卫星影像、位置轨迹、社交媒体记录和文本文档等海量异构数据,构建可查询的统一分析系统,从而实现对战略问题的自动化推理与回答;论文进一步提出了一种用于评估此类系统能力的分类体系与决定因素模型,并为国家制定战略竞争力提升路径提供实证依据与政策建议。
链接: https://arxiv.org/abs/2509.17087
作者: Nicholas Kruus,Madhavendra Thakur,Adam Khoja,Leonhard Nagel,Maximilian Nicholson,Abeer Sharma,Jason Hausenloy,Alberto KoTafoya,Aliya Mukhanova,Alli Katila-Miikkulainen,Harish Chandran,Ivan Zhang,Jessie Chen,Joel Raj,Jord Nguyen,Lai Hsien Hao,Neja Jayasundara,Soham Sen,Sophie Zhang,Ashley Dora Kokui Tamaklo,Bhavya Thakur,Henry Close,Janghee Lee,Nina Sefton,Raghavendra Thakur,Shiv Munagala,Yeeun Kim
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Military and economic strategic competitiveness between nation-states will increasingly be defined by the capability and cost of their frontier artificial intelligence models. Among the first areas of geopolitical advantage granted by such systems will be in automating military intelligence. Much discussion has been devoted to AI systems enabling new military modalities, such as lethal autonomous weapons, or making strategic decisions. However, the ability of a country of “CIA analysts in a data-center” to synthesize diverse data at scale, and its implications, have been underexplored. Multimodal foundation models appear on track to automate strategic analysis previously done by humans. They will be able to fuse today’s abundant satellite imagery, phone-location traces, social media records, and written documents into a single queryable system. We conduct a preliminary uplift study to empirically evaluate these capabilities, then propose a taxonomy of the kinds of ground truth questions these systems will answer, present a high-level model of the determinants of this system’s AI capabilities, and provide recommendations for nation-states to remain strategically competitive within the new paradigm of automated intelligence.
zh
[AI-56] Intention-aware Hierarchical Diffusion Model for Long-term Trajectory Anomaly Detection
【速读】:该论文旨在解决轨迹异常检测中因轨迹数据多样性及复杂时空依赖性导致的挑战,现有方法未能同时考虑智能体的高层意图(high-level intentions)与低层导航细节(low-level details),从而限制了对正常轨迹分布多样性的捕捉能力。解决方案的关键在于提出一种无监督的分层扩散模型——意图感知分层扩散模型(Intention-aware Hierarchical Diffusion model, IHiD),其通过两个协同机制实现:一是利用逆Q学习(Inverse Q Learning)作为高层模型评估子目标是否符合智能体意图;二是采用扩散模型(diffusion model)作为底层模型生成以子目标为条件的子轨迹,并基于重构误差进行异常判定。该设计有效融合了子目标转移知识,能够更全面地建模正常轨迹的多样性,实验表明IHiD在F1分数上相比最先进基线提升达30.2%。
链接: https://arxiv.org/abs/2509.17068
作者: Chen Wang,Sarah Erfani,Tansu Alpcan,Christopher Leckie
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 15 pages, 5 figures
Abstract:Long-term trajectory anomaly detection is a challenging problem due to the diversity and complex spatiotemporal dependencies in trajectory data. Existing trajectory anomaly detection methods fail to simultaneously consider both the high-level intentions of agents as well as the low-level details of the agent’s navigation when analysing an agent’s trajectories. This limits their ability to capture the full diversity of normal trajectories. In this paper, we propose an unsupervised trajectory anomaly detection method named Intention-aware Hierarchical Diffusion model (IHiD), which detects anomalies through both high-level intent evaluation and low-level sub-trajectory analysis. Our approach leverages Inverse Q Learning as the high-level model to assess whether a selected subgoal aligns with an agent’s intention based on predicted Q-values. Meanwhile, a diffusion model serves as the low-level model to generate sub-trajectories conditioned on subgoal information, with anomaly detection based on reconstruction error. By integrating both models, IHiD effectively utilises subgoal transition knowledge and is designed to capture the diverse distribution of normal trajectories. Our experiments show that the proposed method IHiD achieves up to 30.2% improvement in anomaly detection performance in terms of F1 score over state-of-the-art baselines.
zh
[AI-57] RALLM -POI: Retrieval-Augmented LLM for Zero-shot Next POI Recommendation with Geographical Reranking PRICAI2025
【速读】:该论文旨在解决传统点位推荐(Point-of-Interest, POI)模型训练成本高、而大语言模型(Large Language Models, LLMs)在零样本场景下因缺乏轨迹和空间上下文导致推荐结果泛化或地理不相关的问题。解决方案的关键在于提出RALLM-POI框架,其核心创新包括:(1)历史轨迹检索器(Historical Trajectory Retriever, HTR)用于从用户历史移动数据中提取语义相关的轨迹作为上下文参考;(2)地理距离重排序器(Geographical Distance Reranker, GDR)对检索到的轨迹进行空间合理性排序,提升地理相关性;(3)代理式LLM修正器(Agentic LLM Rectifier, ALR)通过自我反思机制优化输出结果。该方法无需额外训练即可显著提升推荐准确性,在三个真实世界Foursquare数据集上优于传统及基于LLM的基线模型。
链接: https://arxiv.org/abs/2509.17066
作者: Kunrong Li,Kwan Hui Lim
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注: PRICAI 2025
Abstract:Next point-of-interest (POI) recommendation predicts a user’s next destination from historical movements. Traditional models require intensive training, while LLMs offer flexible and generalizable zero-shot solutions but often generate generic or geographically irrelevant results due to missing trajectory and spatial context. To address these issues, we propose RALLM-POI, a framework that couples LLMs with retrieval-augmented generation and self-rectification. We first propose a Historical Trajectory Retriever (HTR) that retrieves relevant past trajectories to serve as contextual references, which are then reranked by a Geographical Distance Reranker (GDR) for prioritizing spatially relevant trajectories. Lastly, an Agentic LLM Rectifier (ALR) is designed to refine outputs through self-reflection. Without additional training, RALLM-POI achieves substantial accuracy gains across three real-world Foursquare datasets, outperforming both conventional and LLM-based baselines. Code is released at this https URL.
zh
[AI-58] From domain-landmark graph learning to problem-landmark graph generation
【速读】:该论文旨在解决经典地标提取方法在特定规划任务中过度敏感的问题,即传统方法生成的地标仅适用于单个实例,难以在同域的其他任务中复用。解决方案的关键在于提出一种从多个规划任务中学习地标关系的新方法,构建一个概率提升排序图(probabilistic lifted ordering graph),该结构以加权抽象形式捕捉参数化地标的间序关系。该图中的序关系并非确定性成立,但具有统计意义,可在新任务中通过两阶段实例化过程——分别基于初始状态和目标状态生成子图,并通过等价搜索合并为统一图——从而提取出适用于当前任务的地标顺序信息,显著提升了地标在跨任务场景下的适用性和实用性。
链接: https://arxiv.org/abs/2509.17062
作者: Cristian Pérez-Corral,Antonio Garrido,Laura Sebastia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Landmarks have long played a pivotal role in automated planning, serving as crucial elements for improving the planning algorithms. The main limitation of classical landmark extraction methods is their sensitivity to specific planning tasks. This results in landmarks fully tailored to individual instances, thereby limiting their applicability across other instances of the same planning domain. We propose a novel approach that learns landmark relationships from multiple planning tasks of a planning domain. This leads to the creation of a \textitprobabilistic lifted ordering graph, as a structure that captures weighted abstractions of relationships between parameterized landmarks. Although these orderings are not 100% true (they are probabilistic), they can still be very useful in planning. Next, given a new planning task for that domain, we instantiate the relationships from that graph to this particular instance. This instantiation operates in two phases. First, it generates two graphs: the former instantiating information from the initial state and the latter from the goal state. Second, it combines these two graphs into one unified graph by searching equivalences to extract landmark orderings. We evaluate the precision and recallof the information found by our approach over well-known planning domains.
zh
[AI-59] KAHAN: Knowledge-Augmented Hierarchical Analysis and Narration for Financial Data Narration EMNLP2025
【速读】:该论文旨在解决从原始表格数据中系统性提取多层级洞察(实体级、成对级、群体级和系统级)的挑战,传统方法难以兼顾叙事质量与事实准确性。其解决方案的关键在于提出KAHAN框架,该框架通过引入大语言模型(LLM)作为领域专家,结合知识增强与分层分析机制,实现对结构化数据的深度理解与高质量生成。实验表明,KAHAN在DataTales金融报告基准上显著优于现有方法,叙事质量提升超20%(GPT-4o评分),同时保持98.2%的事实一致性,并展现出跨领域迁移能力。
链接: https://arxiv.org/abs/2509.17037
作者: Yajing Yang,Tony Deng,Min-Yen Kan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2025 Findings
Abstract:We propose KAHAN, a knowledge-augmented hierarchical framework that systematically extracts insights from raw tabular data at entity, pairwise, group, and system levels. KAHAN uniquely leverages LLMs as domain experts to drive the analysis. On DataTales financial reporting benchmark, KAHAN outperforms existing approaches by over 20% on narrative quality (GPT-4o), maintains 98.2% factuality, and demonstrates practical utility in human evaluation. Our results reveal that knowledge quality drives model performance through distillation, hierarchical analysis benefits vary with market complexity, and the framework transfers effectively to healthcare domains. The data and code are available at this https URL.
zh
[AI-60] Adaptive Overclocking: Dynamic Control of Thinking Path Length via Real-Time Reasoning Signals
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRMs)因“过度思考”(overthinking)导致的计算效率低下问题,即固定推理预算无法适应不同任务复杂度的差异。其核心解决方案是提出自适应超频(Adaptive Overclocking),关键在于使超频超参数 α 动态化并具备上下文感知能力:通过两个互补信号实现实时调整——(1) 基于 token 级别模型不确定性的细粒度步骤控制,用于精确调节每一步的推理速度;(2) 输入复杂度估计用于初始化 α,以更合理地分配初始资源。该方法结合不确定性感知 α 调度(UA-αS)、复杂度引导 α 初始化(CG-αI)及二者融合的混合自适应控制(HAC),在 GSM8K、MATH 和 SVAMP 数据集上验证了其在准确率与延迟之间取得更优权衡的能力,显著减少了简单问题上的冗余计算,同时为复杂任务分配更多资源,从而提升整体推理效率和性能。
链接: https://arxiv.org/abs/2509.17000
作者: Shuhao Jiang,Songbo Wang,Yang Qiao,Chun Xu,Chaoyang Zheng,Shengyi Zhou,Huanjun Wang,Fangming Li,Cong Zhang,Jiyu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Large Reasoning Models (LRMs) often suffer from computational inefficiency due to overthinking, where a fixed reasoning budget fails to match the varying complexity of tasks. To address this issue, we propose Adaptive Overclocking, a method that makes the overclocking hyperparameter \alpha dynamic and context-aware. Our method adjusts reasoning speed in real time through two complementary signals: (1) token-level model uncertainty for fine-grained step-wise control, and (2) input complexity estimation for informed initialization. We implement this approach with three strategies: Uncertainty-Aware Alpha Scheduling (UA- \alpha S), Complexity-Guided Alpha Initialization (CG- \alpha I), and a Hybrid Adaptive Control (HAC) that combines both. Experiments on GSM8K, MATH, and SVAMP show that HAC achieves superior accuracy-latency trade-offs, reducing unnecessary computation on simple problems while allocating more resources to challenging ones. By mitigating overthinking, Adaptive Overclocking enhances both efficiency and overall reasoning performance.
zh
[AI-61] PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在极低比特宽度(extremely low bit-widths)下进行后训练量化(Post-Training Quantization, PTQ)时面临的根本性挑战:如何在保持计算效率的同时不牺牲模型表达能力(expressiveness)。现有方法通常依赖二值近似或复杂的补偿机制,导致表征能力受限或引入额外计算开销,从而削弱了量化带来的效率优势。其解决方案的关键在于提出一种全新的三元权重量化框架——PTQ to Trit-Planes (PTQTP),通过将权重重构成结构化的三元(-1, 0, 1)trit-plane表示(使用2×1.58-bit存储),实现乘法-free推理(identical to 1-bit quantization),同时凭借新颖的结构化分解维持更强的表达能力。该方法结合理论保障的渐进逼近算法、无需架构修改的通用部署能力以及统一的三元运算操作,避免了混合精度或补偿机制的复杂性,显著优于现有低比特PTQ方法,并在数学推理任务上达到82.4%的保留率,接近甚至超越需数十GPU天训练的1.58-bit量化感知训练性能,仅需单小时量化即可完成。
链接: https://arxiv.org/abs/2509.16989
作者: He Xiao,Runming Yang,Qingyao Yang,Wendong Xu,Zheng Li,Yupeng Su,Zhengwu Liu,Hongxia Yang,Ngai Wong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: under review
Abstract:Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary -1, 0, 1 trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments.
zh
[AI-62] Leverag ing Multiple Speech Enhancers for Non-Intrusive Intelligibility Prediction for Hearing-Impaired Listeners
【速读】:该论文旨在解决听力障碍(Hearing Impaired, HI)听众在真实场景中语音可懂度(Speech Intelligibility)评估的难题,传统方法如听觉测试或侵入式指标HASPI依赖干净参考信号,在实际环境中难以获取,导致实验室评估与现实应用之间存在差距。解决方案的关键在于提出一种非侵入式(Non-intrusive)可懂度预测框架,利用语音增强器(Speech Enhancer)构建并行增强信号路径,从而在无需参考信号的情况下实现鲁棒预测;同时引入两片段增强策略(2-clips augmentation)以提升跨数据集泛化能力,实验表明基于强增强器集成的方法显著优于现有非侵入式基线(如CPC2 Champion),展现出在真实场景下应用的潜力。
链接: https://arxiv.org/abs/2509.16979
作者: Boxuan Cao,Linkai Li,Hanlin Yu,Changgeng Mo,Haoshuai Zhou,Shan Xiang Wang
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:
Abstract:Speech intelligibility evaluation for hearing-impaired (HI) listeners is essential for assessing hearing aid performance, traditionally relying on listening tests or intrusive methods like HASPI. However, these methods require clean reference signals, which are often unavailable in real-world conditions, creating a gap between lab-based and real-world assessments. To address this, we propose a non-intrusive intelligibility prediction framework that leverages speech enhancers to provide a parallel enhanced-signal pathway, enabling robust predictions without reference signals. We evaluate three state-of-the-art enhancers and demonstrate that prediction performance depends on the choice of enhancer, with ensembles of strong enhancers yielding the best results. To improve cross-dataset generalization, we introduce a 2-clips augmentation strategy that enhances listener-specific variability, boosting robustness on unseen datasets. Our approach consistently outperforms the non-intrusive baseline, CPC2 Champion across multiple datasets, highlighting the potential of enhancer-guided non-intrusive intelligibility prediction for real-world applications.
zh
[AI-63] Gradient Interference-Aware Graph Coloring for Multitask Learning
【速读】:该论文旨在解决多任务学习(Multi-task Learning, MTL)中因不同任务目标冲突导致的梯度干扰问题,该干扰会减缓模型收敛并降低最终性能。解决方案的关键在于提出一种基于梯度干扰感知的图着色调度机制:首先计算任务间的梯度干扰并构建干扰图(interference graph),随后采用贪心图着色算法将任务划分为若干组(color classes),每轮训练仅激活一组任务;该分组策略在训练过程中动态更新,以适应任务关系的演变。通过确保每个mini-batch内任务更新方向一致,该方法提升了多任务优化器的有效性,且无需额外调参即可显著改善模型性能。
链接: https://arxiv.org/abs/2509.16959
作者: Santosh Patapati,Trisanth Srinivasan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
备注:
Abstract:When different objectives conflict with each other in multi-task learning, gradients begin to interfere and slow convergence, thereby reducing the final model’s performance. To address this, we introduce a scheduler that computes gradient interference, constructs an interference graph, and then applies greedy graph-coloring to partition tasks into groups that align well with each other. At each training step, only one group (color class) of tasks are activated. The grouping partition is constantly recomputed as task relationships evolve throughout training. By ensuring that each mini-batch contains only tasks that pull the model in the same direction, our method improves the effectiveness of any underlying multi-task learning optimizer without additional tuning. Since tasks within these groups will update in compatible directions, model performance will be improved rather than impeded. Empirical results on six different datasets show that this interference-aware graph-coloring approach consistently outperforms baselines and state-of-the-art multi-task optimizers.
zh
[AI-64] Quantum Abduction: A New Paradigm for Reasoning under Uncertainty
【速读】:该论文旨在解决传统人工智能(AI)在处理溯因推理(abductive reasoning)时存在的局限性问题,即过度简化人类推理过程为排他性的筛选机制,忽视了人类在面对复杂证据时能够并行维持多个解释、容忍矛盾并生成创新综合的能力。其解决方案的关键在于提出“量子溯因”(quantum abduction)框架,该框架基于量子认知理论,将假设置于叠加态中,允许它们通过干涉实现构造性或破坏性作用,并仅在与证据达成一致时才发生坍缩。这一非经典范式借助现代自然语言处理(NLP)嵌入和生成式AI实现动态合成而非过早淘汰,从而更真实地模拟人类推理的多维性和创造性特征。
链接: https://arxiv.org/abs/2509.16958
作者: Remo Pareschi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 23 pages, 8 figures, 3 tables; submitted to Sci, MDPI
Abstract:Abductive reasoning - the search for plausible explanations - has long been central to human inquiry, from forensics to medicine and scientific discovery. Yet formal approaches in AI have largely reduced abduction to eliminative search: hypotheses are treated as mutually exclusive, evaluated against consistency constraints or probability updates, and pruned until a single “best” explanation remains. This reductionist framing overlooks the way human reasoners sustain multiple explanatory lines in suspension, navigate contradictions, and generate novel syntheses. This paper introduces quantum abduction, a non-classical paradigm that models hypotheses in superposition, allows them to interfere constructively or destructively, and collapses only when coherence with evidence is reached. Grounded in quantum cognition and implemented with modern NLP embeddings and generative AI, the framework supports dynamic synthesis rather than premature elimination. Case studies span historical mysteries (Ludwig II of Bavaria, the “Monster of Florence”), literary demonstrations (“Murder on the Orient Express”), medical diagnosis, and scientific theory change. Across these domains, quantum abduction proves more faithful to the constructive and multifaceted nature of human reasoning, while offering a pathway toward expressive and transparent AI reasoning systems.
zh
[AI-65] Equip Pre-ranking with Target Attention by Residual Quantization WSDM2026
【速读】:该论文旨在解决工业推荐系统中预排序(pre-ranking)阶段在效率(efficiency)与效果(effectiveness)之间存在的根本性矛盾:尽管如目标注意力机制(Target Attention, TA)等强大模型在排序阶段能有效捕捉复杂特征交互,但其高计算成本使其难以应用于对延迟敏感的预排序阶段,而后者通常依赖于简单的向量内积模型,导致整体系统性能受限。解决方案的关键在于提出 TARQ 框架,其核心创新是通过残差量化(Residual Quantization)构建近似 TA 的架构,首次将 TA 的建模能力引入低延迟要求的预排序阶段,从而实现了准确性和效率之间的全新最优权衡。
链接: https://arxiv.org/abs/2509.16931
作者: Yutong Li,Yu Zhu,Yichen Qiao,Ziyu Guan,Lv Shao,Tong Liu,Bo Zheng
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 5 pages, 2 figures, submitted to WSDM 2026 Short Paper Track
Abstract:The pre-ranking stage in industrial recommendation systems faces a fundamental conflict between efficiency and effectiveness. While powerful models like Target Attention (TA) excel at capturing complex feature interactions in the ranking stage, their high computational cost makes them infeasible for pre-ranking, which often relies on simplistic vector-product models. This disparity creates a significant performance bottleneck for the entire system. To bridge this gap, we propose TARQ, a novel pre-ranking framework. Inspired by generative models, TARQ’s key innovation is to equip pre-ranking with an architecture approximate to TA by Residual Quantization. This allows us to bring the modeling power of TA into the latency-critical pre-ranking stage for the first time, establishing a new state-of-the-art trade-off between accuracy and efficiency. Extensive offline experiments and large-scale online A/B tests at Taobao demonstrate TARQ’s significant improvements in ranking performance. Consequently, our model has been fully deployed in production, serving tens of millions of daily active users and yielding substantial business improvements.
zh
[AI-66] Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment
【速读】:该论文旨在解决多通道音频同步中的关键问题,即如何在存在非线性时钟漂移(nonlinear clock drift)的情况下实现高精度、可信赖的对齐,并提供不确定性量化机制。传统方法如互相关(Cross-correlation)和动态时间规整(Dynamic Time Warping)假设简单的漂移模式且无法提供可靠性度量;而现有深度学习模型通常将对齐视为二分类任务,忽略了通道间的依赖关系及不确定性估计。解决方案的关键在于:1)引入跨注意力(cross-attention)机制扩展BEATs编码器,以建模多通道之间的时序依赖关系;2)设计一种基于置信度加权的评分函数,利用完整的预测分布而非二值阈值进行对齐判断,从而实现概率化的时序对齐,提升鲁棒性和可解释性。该方法在BioDCASE 2025任务中取得领先性能,平均MSE降低至0.30,显著优于基线模型(0.58)。
链接: https://arxiv.org/abs/2509.16926
作者: Ragib Amin Nihal,Benjamin Yen,Takeshi Ashizawa,Kazuhiro Nakadai
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注: Accepted on Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2025)
Abstract:Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. However, existing methods often struggle to address nonlinear clock drift and lack mechanisms for quantifying uncertainty. Traditional methods like Cross-correlation and Dynamic Time Warping assume simple drift patterns and provide no reliability measures. Meanwhile, recent deep learning models typically treat alignment as a binary classification task, overlooking inter-channel dependencies and uncertainty estimation. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. We extend BEATs encoders with cross-attention layers to model temporal relationships between channels. We also develop a confidence-weighted scoring function that uses the full prediction distribution instead of binary thresholding. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline. On individual datasets, we achieved 0.14 MSE on ARU data (77% reduction) and 0.45 MSE on zebra finch data (18% reduction). The framework supports probabilistic temporal alignment, moving beyond point estimates. While validated in a bioacoustic context, the approach is applicable to a broader range of multi-channel audio tasks where alignment confidence is critical. Code available on: this https URL
zh
[AI-67] Audio-Guided Dynamic Modality Fusion with Stereo-Aware Attention for Audio-Visual Navigation ICONIP
【速读】:该论文旨在解决音频-视觉导航(Audio-Visual Navigation, AVN)任务中,现有方法因依赖静态模态融合策略且忽略立体音频中的空间线索,导致在复杂或遮挡场景下性能下降的问题。解决方案的关键在于提出一种基于强化学习的端到端AVN框架,包含两个核心创新:(1) 立体感知注意力模块(Stereo-Aware Attention Module, SAM),通过学习左右声道间的空间差异来增强方向性声音感知;(2) 音频引导动态融合模块(Audio-Guided Dynamic Fusion Module, AGDF),根据音频线索动态调整视觉与听觉特征的融合比例,从而提升对环境变化的鲁棒性。实验表明,该方法在Replica和Matterport3D两个真实3D场景数据集上显著优于现有方法,尤其在纯音频条件下导航成功率提升超过40%。
链接: https://arxiv.org/abs/2509.16924
作者: Jia Li,Yinfeng Yu,Liejun Wang,Fuchun Sun,Wendong Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Main paper (14 pages). Accepted for publication by ICONIP( International Conference on Neural Information Processing) 2025
Abstract:In audio-visual navigation (AVN) tasks, an embodied agent must autonomously localize a sound source in unknown and complex 3D environments based on audio-visual signals. Existing methods often rely on static modality fusion strategies and neglect the spatial cues embedded in stereo audio, leading to performance degradation in cluttered or occluded scenes. To address these issues, we propose an end-to-end reinforcement learning-based AVN framework with two key innovations: (1) a \textbfStereo-Aware \textbfAttention \textbfModule (\textbfSAM), which learns and exploits the spatial disparity between left and right audio channels to enhance directional sound perception; and (2) an \textbfAudio-\textbfGuided \textbfDynamic \textbfFusion Module (\textbfAGDF), which dynamically adjusts the fusion ratio between visual and auditory features based on audio cues, thereby improving robustness to environmental changes. Extensive experiments are conducted on two realistic 3D scene datasets, Replica and Matterport3D, demonstrating that our method significantly outperforms existing approaches in terms of navigation success rate and path efficiency. Notably, our model achieves over 40% improvement under audio-only conditions compared to the best-performing baselines. These results highlight the importance of explicitly modeling spatial cues from stereo channels and performing deep multi-modal fusion for robust and efficient audio-visual navigation.
zh
[AI-68] PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control ICONIP
【速读】:该论文旨在解决基于NeRF(神经辐射场,Neural Radiance Fields)的音频驱动人脸生成方法在渲染效率低和音视频同步效果不佳的问题。其解决方案的关键在于提出一种基于3D高斯溅射(3D Gaussian Splatting, 3DGS)的实时音频驱动人脸合成框架PGSTalker:首先设计了一种像素感知密度控制策略,自适应分配点云密度,在动态面部区域增强细节并减少冗余;其次引入轻量级多模态门控融合模块(Multimodal Gated Fusion Module),高效融合音频与空间特征,提升高斯点变形预测的准确性。该方案在渲染质量、唇形同步精度和推理速度上均优于现有NeRF与3DGS方法,具备良好的泛化能力和实际部署潜力。
链接: https://arxiv.org/abs/2509.16922
作者: Tianheng Zhu,Yinfeng Yu,Liejun Wang,Fuchun Sun,Wendong Zheng
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
备注: Main paper (15 pages). Accepted for publication by ICONIP( International Conference on Neural Information Processing) 2025
Abstract:Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This work presents PGSTalker, a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting (3DGS). To improve rendering performance, we propose a pixel-aware density control strategy that adaptively allocates point density, enhancing detail in dynamic facial regions while reducing redundancy elsewhere. Additionally, we introduce a lightweight Multimodal Gated Fusion Module to effectively fuse audio and spatial features, thereby improving the accuracy of Gaussian deformation prediction. Extensive experiments on public datasets demonstrate that PGSTalker outperforms existing NeRF- and 3DGS-based approaches in rendering quality, lip-sync precision, and inference speed. Our method exhibits strong generalization capabilities and practical potential for real-world deployment.
zh
[AI-69] FedEL: Federated Elastic Learning for Heterogeneous Devices
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端硬件异构性导致的训练延迟问题,尤其是慢速客户端(straggler clients)拖慢全局模型聚合进度,从而影响整体训练效率。现有方法如客户端选择、异步联邦学习和部分训练虽能缓解此问题,但常伴随准确率下降、更新过时或模型性能受损等副作用。其解决方案的关键在于提出一种名为FedEL的弹性联邦学习框架,核心创新包括:1)引入基于滑动窗口的训练机制,动态识别并聚焦于模型中关键训练区域;2)在运行时预算约束下,动态选择重要张量(tensor)进行训练,实现客户端间的渐进式与均衡训练;3)设计张量重要性调整模块,协调局部与全局张量重要性,以缓解数据异构性带来的偏差。实验表明,FedEL相较基线方法在达到相同精度所需时间上提升最高达3.87倍,同时保持或超越最终测试准确率。
链接: https://arxiv.org/abs/2509.16902
作者: Letian Zhang,Bo Chen,Jieming Bian,Lei Wang,Jie Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) enables distributed devices to collaboratively train machine learning models while maintaining data privacy. However, the heterogeneous hardware capabilities of devices often result in significant training delays, as straggler clients with limited resources prolong the aggregation process. Existing solutions such as client selection, asynchronous FL, and partial training partially address these challenges but encounter issues such as reduced accuracy, stale updates, and compromised model performance due to inconsistent training contributions. To overcome these limitations, we propose FedEL, a federated elastic learning framework that enhances training efficiency while maintaining model accuracy. FedEL introduces a novel window-based training process, sliding the window to locate the training part of the model and dynamically selecting important tensors for training within a coordinated runtime budget. This approach ensures progressive and balanced training across all clients, including stragglers. Additionally, FedEL employs a tensor importance adjustment module, harmonizing local and global tensor importance to mitigate biases caused by data heterogeneity. The experiment results show that FedEL achieves up to 3.87x improvement in time-to-accuracy compared to baselines while maintaining or exceeding final test accuracy.
zh
[AI-70] LLM s as Layout Designers: A Spatial Reasoning Perspective
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在空间理解与推理能力上的局限性,尤其是在内容感知的图形版面设计等需要精确元素定位、对齐和结构组织的应用场景中。其解决方案的关键在于提出一种基于强化学习的框架LaySPA,该框架通过引入融合几何有效性、结构保真度和视觉质量的混合奖励信号,赋予LLM代理显式的空间推理能力,使其能够建模元素间关系、导航画布并优化布局结构,从而生成既结构合理又视觉美观的版面设计。
链接: https://arxiv.org/abs/2509.16891
作者: Sha Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their capacity for spatial understanding and reasoning remains limited. Such capabilities, however, are critical for applications like content-aware graphic layout design, which demands precise placement, alignment, and structural organization of multiple elements within constrained visual spaces. To address this gap, we propose LaySPA, a reinforcement learning-based framework that augments LLM agents with explicit spatial reasoning capabilities. LaySPA leverages hybrid reward signals that capture geometric validity, structural fidelity, and visual quality, enabling agents to model inter-element relationships, navigate the canvas, and optimize spatial arrangements. Through iterative self-exploration and adaptive policy optimization, LaySPA produces both interpretable reasoning traces and structured layouts. Experimental results demonstrate that LaySPA generates structurally sound and visually appealing layouts, outperforming larger general-purpose LLMs and achieving results on par with state-of-the-art specialized layout models.
zh
[AI-71] Large Language Models as End-to-end Combinatorial Optimization Solvers
【速读】:该论文旨在解决组合优化(Combinatorial Optimization, CO)问题求解过程中依赖领域特定算法、代码生成或求解器调用所导致的通用性差与可访问性低的问题。其核心解决方案在于提出一种端到端的框架,使大语言模型(Large Language Models, LLMs)能够直接从自然语言描述中映射出可行且高质量的优化解,无需中间代码生成或手动调整架构。该方法的关键创新在于两阶段训练策略:首先通过监督微调(Supervised Fine-Tuning, SFT)赋予LLM从领域专用求解器中学习到的解结构模式;随后利用可行性与最优性感知的强化学习(Feasibility-and-Optimality-aware Reinforcement Learning, FOARL)显式减少约束违反并提升解的质量。实验表明,该方法在七类NP-hard CO问题上实现了高可行性率,并将平均最优性差距降至1.03%-8.20%,显著优于通用LLM(如GPT-4o)、推理模型(如DeepSeek-R1)及传统启发式方法,构建了一个统一、语言驱动的CO求解范式。
链接: https://arxiv.org/abs/2509.16865
作者: Xia Jiang,Yaoxin Wu,Minshuo Li,Zhiguang Cao,Yingqian Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Combinatorial optimization (CO) problems, central to decision-making scenarios like logistics and manufacturing, are traditionally solved using problem-specific algorithms requiring significant domain expertise. While large language models (LLMs) have shown promise in automating CO problem solving, existing approaches rely on intermediate steps such as code generation or solver invocation, limiting their generality and accessibility. This paper introduces a novel framework that empowers LLMs to serve as end-to-end CO solvers by directly mapping natural language problem descriptions to solutions. We propose a two-stage training strategy: supervised fine-tuning (SFT) imparts LLMs with solution generation patterns from domain-specific solvers, while a feasibility-and-optimality-aware reinforcement learning (FOARL) process explicitly mitigates constraint violations and refines solution quality. Evaluation across seven NP-hard CO problems shows that our method achieves a high feasibility rate and reduces the average optimality gap to 1.03-8.20% by tuning a 7B-parameter LLM, surpassing both general-purpose LLMs (e.g., GPT-4o), reasoning models (e.g., DeepSeek-R1), and domain-specific heuristics. Our method establishes a unified language-based pipeline for CO without extensive code execution or manual architectural adjustments for different problems, offering a general and language-driven alternative to traditional solver design while maintaining relative feasibility guarantees.
zh
[AI-72] AdaptiveGuard: Towards Adaptive Runtime Safety for LLM -Powered Software
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在实际部署中面临的“越狱攻击”(jailbreak attacks)问题,即用户通过精心设计的输入绕过传统防护机制,诱导模型生成有害或不当内容。现有防护手段如LlamaGuard等静态 guardrail 在面对未知攻击时性能显著下降(最低降至12%准确率),无法适应动态演化的威胁环境。解决方案的关键在于提出 AdaptiveGuard,一个基于持续学习(continual learning)框架的自适应防护系统,能够将新型越狱攻击识别为分布外(out-of-distribution, OOD)输入,并通过少量更新步骤快速学习并防御新攻击,同时保持对正常输入的高精度(F1-score > 85%)。
链接: https://arxiv.org/abs/2509.16861
作者: Rui Yang,Michael Fu,Chakkrit Tantithamthavorn,Chetan Arora,Gunel Gulmammadova,Joey Chua
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted to the ASE 2025 International Conference on Automated Software Engineering, Industry Showcase Track
Abstract:Guardrails are critical for the safe deployment of Large Language Models (LLMs)-powered software. Unlike traditional rule-based systems with limited, predefined input-output spaces that inherently constrain unsafe behavior, LLMs enable open-ended, intelligent interactions–opening the door to jailbreak attacks through user inputs. Guardrails serve as a protective layer, filtering unsafe prompts before they reach the LLM. However, prior research shows that jailbreak attacks can still succeed over 70% of the time, even against advanced models like GPT-4o. While guardrails such as LlamaGuard report up to 95% accuracy, our preliminary analysis shows their performance can drop sharply–to as low as 12%–when confronted with unseen attacks. This highlights a growing software engineering challenge: how to build a post-deployment guardrail that adapts dynamically to emerging threats? To address this, we propose AdaptiveGuard, an adaptive guardrail that detects novel jailbreak attacks as out-of-distribution (OOD) inputs and learns to defend against them through a continual learning framework. Through empirical evaluation, AdaptiveGuard achieves 96% OOD detection accuracy, adapts to new attacks in just two update steps, and retains over 85% F1-score on in-distribution data post-adaptation, outperforming other baselines. These results demonstrate that AdaptiveGuard is a guardrail capable of evolving in response to emerging jailbreak strategies post deployment. We release our AdaptiveGuard and studied datasets at this https URL to support further research.
zh
[AI-73] he Principles of Human-like Conscious Machine
【速读】:该论文试图解决如何判定另一系统(生物或人工)是否具备现象意识(phenomenal consciousness)这一长期存在的核心难题,尤其是在大型语言模型等先进人工智能系统兴起背景下,关于“AI意识”的争论亟需明确的判定标准。其解决方案的关键在于提出一个** substrate-independent(与载体无关)、逻辑严谨且抗伪造的充分性标准(sufficiency criterion)**,该标准表明:任何满足此条件的机器都应以与我们判断人类意识同等程度的信心被视为具有意识。论文进一步构建了一个形式化框架及一组可操作原则,指导设计符合该充分条件的系统,并论证此类系统在原则上可实现现象意识;同时通过验证人类本身可视为满足该框架的机器,为理论提供了初步支持。
链接: https://arxiv.org/abs/2509.16859
作者: Fangfang Li,Xiaojie Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Determining whether another system, biological or artificial, possesses phenomenal consciousness has long been a central challenge in consciousness studies. This attribution problem has become especially pressing with the rise of large language models and other advanced AI systems, where debates about “AI consciousness” implicitly rely on some criterion for deciding whether a given system is conscious. In this paper, we propose a substrate-independent, logically rigorous, and counterfeit-resistant sufficiency criterion for phenomenal consciousness. We argue that any machine satisfying this criterion should be regarded as conscious with at least the same level of confidence with which we attribute consciousness to other humans. Building on this criterion, we develop a formal framework and specify a set of operational principles that guide the design of systems capable of meeting the sufficiency condition. We further argue that machines engineered according to this framework can, in principle, realize phenomenal consciousness. As an initial validation, we show that humans themselves can be viewed as machines that satisfy this framework and its principles. If correct, this proposal carries significant implications for philosophy, cognitive science, and artificial intelligence. It offers an explanation for why certain qualia, such as the experience of red, are in principle irreducible to physical description, while simultaneously providing a general reinterpretation of human information processing. Moreover, it suggests a path toward a new paradigm of AI beyond current statistics-based approaches, potentially guiding the construction of genuinely human-like AI.
zh
[AI-74] ShadowServe: Interference-Free KV Cache Fetching for Distributed Prefix Caching
【速读】:该论文旨在解决分布式前缀缓存(distributed prefix caching)在长上下文大语言模型(LLM)服务中因网络带宽受限导致的KV缓存获取瓶颈问题,以及现有压缩方案在解压缩过程中干扰模型计算从而降低整体性能的问题。解决方案的关键在于提出ShadowServe系统,其通过将控制平面部署于主机、数据平面完全卸载至SmartNIC(智能网卡),实现对主机GPU和CPU的无干扰运行;同时设计了分块流水线(chunked pipeline)以并行化SmartNIC上的数据平面操作,并采用最小复制内存管理机制(minimal-copy memory management scheme)缓解SmartNIC有限资源压力,从而在低带宽场景下显著提升吞吐量并降低延迟。
链接: https://arxiv.org/abs/2509.16857
作者: Xingyu Xiang,Raj Joshi,Yuhan Liu,Jiayi Yao,Chenxingyu Zhao,Junchen Jiang,Yang Zhou,Eddie Kohler,Minlan Yu
机构: 未知
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Distributed prefix caching accelerates long-context LLM serving by reusing KV cache entries for common context prefixes. However, KV cache fetches can become a bottleneck when network bandwidth is limited. Compression mitigates the bandwidth issue, but can degrade overall performance when decompression interferes with model computation. We present ShadowServe, the first SmartNIC-accelerated, interference-free prefix caching system for LLM serving. ShadowServe separates a control plane on the host and a data plane fully offloaded to the SmartNIC, which eliminates interference to both host GPU and CPU. To overcome the SmartNIC’s limited compute and memory resources, we design a chunked pipeline that parallelizes data plane operations across the SmartNIC’s compute resources, and a minimal-copy memory management scheme that reduces memory pressure on the SmartNIC. Compared to state-of-the-art solutions, ShadowServe achieves up to 2.2x lower loaded time-per-output-token (TPOT), and reduces time-to-first-token (TTFT) by up to 1.38x in low-bandwidth scenarios (= 20 Gbps), translating to up to 1.35x higher throughput. Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.16857 [cs.DC] (or arXiv:2509.16857v1 [cs.DC] for this version) https://doi.org/10.48550/arXiv.2509.16857 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-75] Roundtable Policy: Improving Scientific Reasoning and Narratives through Confidence-Weighted Consensus of LLM s
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂科学任务中推理能力不足、科学叙事缺乏严谨性与逻辑一致性,以及易产生幻觉(hallucination)的问题。其解决方案的关键在于提出一种名为“Roundtable Policy”的推理框架,该框架通过多模型加权共识机制,在推理阶段实现结构化、可解释的集体决策,而非依赖黑箱式的模型收敛;该方法仅需对各模型保持黑盒访问并采用统一处理流程,即可显著提升复杂异构科学任务中的推理质量与科学叙述的创造性、严谨性和逻辑连贯性。
链接: https://arxiv.org/abs/2509.16839
作者: Yu Yao,Jiayi Dong,Ju Li,Yang Yang,Yilun Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Equal contribution: Yu Yao and Jiayi Dong. Equal advising: Ju Li, Yang Yang, and Yilun Du. Affiliations: Massachusetts Institute of Technology (Yu Yao, Ju Li), University of California, Los Angeles (Jiayi Dong, Yang Yang), Harvard University (Yilun Du)
Abstract:Large language models (LLMs) have demonstrated remarkable capabilities not only in language generation but also in advancing scientific discovery. A growing body of work has explored ways to improve their reasoning, from self-consistency and chain-of-thought to multi-agent debate. Inspired by the dynamics of scientific committees and the “Society of Mind,” we introduce Roundtable Policy, a complementary inference-time reasoning framework that performs inference through the weighted consensus of multiple LLMs. Our findings indicate that this approach significantly enhances reasoning in complex heterogeneous scientific tasks and improves scientific narratives in terms of creativity, rigor, and logical coherence, while reducing hallucinations that single models are prone to. Our approach emphasizes structured and interpretable consensus rather than opaque convergence, while requiring only black-box access and uniform procedures, making it broadly applicable to multi-LLM reasoning.
zh
[AI-76] Robot Learning with Sparsity and Scarcity
【速读】:该论文旨在解决机器人学习中普遍存在的数据稀缺与数据稀疏问题,具体体现在两个典型场景:触觉感知中的数据稀疏性(data sparsity)和康复机器人中的数据稀缺性(data scarcity)。针对触觉感知,研究提出基于无视觉依赖的纯触觉探索与操作策略,采用模型无关的强化学习方法高效利用局部接触信息;针对康复机器人领域,为应对因残疾患者生物信号采集困难导致的数据极度匮乏问题,开发了基于半监督学习、元学习和生成式 AI (Generative AI) 的最小数据依赖意图推断算法,使矫形器能够实时识别患者的运动意图并提供适时辅助。解决方案的关键在于设计轻量级、高适应性的机器学习框架,在有限或局部可观测的数据条件下实现可靠决策与控制。
链接: https://arxiv.org/abs/2509.16834
作者: Jingxi Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Unlike in language or vision, one of the fundamental challenges in robot learning is the lack of access to vast data resources. We can further break down the problem into (1) data sparsity from the angle of data representation and (2) data scarcity from the angle of data quantity. In this thesis, I will discuss selected works on two domains: (1) tactile sensing and (2) rehabilitation robots, which are exemplars of data sparsity and scarcity, respectively. Tactile sensing is an essential modality for robotics, but tactile data are often sparse, and for each interaction with the physical world, tactile sensors can only obtain information about the local area of contact. I will discuss my work on learning vision-free tactile-only exploration and manipulation policies through model-free reinforcement learning to make efficient use of sparse tactile information. On the other hand, rehabilitation robots are an example of data scarcity to the extreme due to the significant challenge of collecting biosignals from disabled-bodied subjects at scale for training. I will discuss my work in collaboration with the medical school and clinicians on intent inferral for stroke survivors, where a hand orthosis developed in our lab collects a set of biosignals from the patient and uses them to infer the activity that the patient intends to perform, so the orthosis can provide the right type of physical assistance at the right moment. My work develops machine learning algorithms that enable intent inferral with minimal data, including semi-supervised, meta-learning, and generative AI methods.
zh
[AI-77] KANO: Kolmogorov-Arnold Neural Operator
【速读】:该论文旨在解决传统神经算子(如傅里叶神经算子,FNO)在处理位置依赖型微分算子时表达能力受限的问题,特别是FNO对输入信号的频谱稀疏性和快速衰减的傅里叶尾部有严格要求,导致其在一般物理场景下泛化能力不足。解决方案的关键在于提出Kolmogorov–Arnold神经算子(KANO),它通过联合使用频域和空间基函数进行参数化,在双域中建模物理系统,并具备内在符号可解释性;理论证明表明,KANO能够有效捕捉任意位置依赖的动力学行为,而FNO则仅适用于频谱稀疏的情形。实验验证显示,KANO在位置依赖微分算子上具有鲁棒泛化能力,且在量子哈密顿量学习任务中能以高精度重建闭式符号表示的哈密顿量,其状态保真度比FNO高两个数量级。
链接: https://arxiv.org/abs/2509.16825
作者: Jin Lee,Ziming Liu,Xinling Yu,Yixuan Wang,Haewon Jeong,Murphy Yuezhen Niu,Zheng Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注:
Abstract:We introduce Kolmogorov–Arnold Neural Operator (KANO), a dual-domain neural operator jointly parameterized by both spectral and spatial bases with intrinsic symbolic interpretability. We theoretically demonstrate that KANO overcomes the pure-spectral bottleneck of Fourier Neural Operator (FNO): KANO remains expressive over generic position-dependent dynamics for any physical input, whereas FNO stays practical only for spectrally sparse operators and strictly imposes a fast-decaying input Fourier tail. We verify our claims empirically on position-dependent differential operators, for which KANO robustly generalizes but FNO fails to. In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains \approx 6\times10^-6 state infidelity from projective measurement data, substantially outperforming that of the FNO trained with ideal full wave function data, \approx 1.5\times10^-2 , by orders of magnitude.
zh
[AI-78] SMART-3D: Three-Dimensional Self-Morphing Adaptive Replanning Tree
【速读】:该论文旨在解决动态环境中路径规划的实时性与计算效率问题,尤其针对存在快速移动障碍物的三维(3D)场景。传统方法如SMART算法依赖于网格分解(grid decomposition),限制了其在复杂3D环境中的可扩展性。解决方案的关键在于提出SMART-3D,通过引入“热节点”(hot-nodes)替代原有的“热点”(hot-spots)概念,使树结构能够高效重构以适应环境变化,从而实现无需网格分解的实时重规划。该机制显著提升了算法在3D动态环境中的计算效率和可靠性,验证结果表明其具有高成功率和低重规划延迟,适用于机载等实时应用需求。
链接: https://arxiv.org/abs/2509.16812
作者: Priyanshu Agrawal,Shalabh Gupta,Zongyuan Shen
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:This paper presents SMART-3D, an extension of the SMART algorithm to 3D environments. SMART-3D is a tree-based adaptive replanning algorithm for dynamic environments with fast moving obstacles. SMART-3D morphs the underlying tree to find a new path in real-time whenever the current path is blocked by obstacles. SMART-3D removed the grid decomposition requirement of the SMART algorithm by replacing the concept of hot-spots with that of hot-nodes, thus making it computationally efficient and scalable to 3D environments. The hot-nodes are nodes which allow for efficient reconnections to morph the existing tree to find a new safe and reliable path. The performance of SMART-3D is evaluated by extensive simulations in 2D and 3D environments populated with randomly moving dynamic obstacles. The results show that SMART-3D achieves high success rates and low replanning times, thus highlighting its suitability for real-time onboard applications.
zh
[AI-79] Prompt-Driven Agent ic Video Editing System: Autonomous Comprehension of Long-Form Story-Driven Media
【速读】:该论文旨在解决创作者在编辑长时、叙事丰富的视频时面临的认知负担问题,即如何高效地搜索、构思分镜和排列数小时的素材,而现有基于文本转录或嵌入的方法难以有效追踪角色、推断动机并关联分散事件。其解决方案的关键在于提出一个以提示(prompt)驱动的模块化编辑系统,核心是构建一个语义索引流水线,通过时间分割、引导式记忆压缩和跨粒度融合,生成可解释的剧情、对话、情绪与情境轨迹,从而支持用户通过自由形式提示重构多小时内容,同时保留叙事连贯性并平衡自动化与创作者控制权。
链接: https://arxiv.org/abs/2509.16811
作者: Zihan Ding,Junlong Chen,Per Ola Kristensson,Junxiao Shen,Xinyi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Creators struggle to edit long-form, narrative-rich videos not because of UI complexity, but due to the cognitive demands of searching, storyboarding, and sequencing hours of footage. Existing transcript- or embedding-based methods fall short for creative workflows, as models struggle to track characters, infer motivations, and connect dispersed events. We present a prompt-driven, modular editing system that helps creators restructure multi-hour content through free-form prompts rather than timelines. At its core is a semantic indexing pipeline that builds a global narrative via temporal segmentation, guided memory compression, and cross-granularity fusion, producing interpretable traces of plot, dialogue, emotion, and context. Users receive cinematic edits while optionally refining transparent intermediate outputs. Evaluated on 400+ videos with expert ratings, QA, and preference studies, our system scales prompt-driven editing, preserves narrative coherence, and balances automation with creator control.
zh
[AI-80] Automated Procedural Analysis via Video-Language Models for AI-assisted Nursing Skills Assessment
【速读】:该论文旨在解决当前护理教育中依赖主观且耗时的教师反馈所导致的培训可扩展性与效率不足问题,从而影响护生进入临床工作时的胜任力。解决方案的关键在于提出一种基于视频-语言模型(Video-Language Model, VLM)的框架,通过仿照人类技能习得过程的课程式进阶设计,实现从高层次动作识别、细粒度子动作分解到程序推理的逐步演进,从而支持可扩展的自动化评估与可解释的反馈生成,显著降低教师工作负荷并保障评估质量。
链接: https://arxiv.org/abs/2509.16810
作者: Shen Chang,Dennis Liu,Renran Tian,Kristen L. Swartzell,Stacie L. Klingler,Amy M. Nagle,Nan Kong
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Consistent high-quality nursing care is essential for patient safety, yet current nursing education depends on subjective, time-intensive instructor feedback in training future nurses, which limits scalability and efficiency in their training, and thus hampers nursing competency when they enter the workforce. In this paper, we introduce a video-language model (VLM) based framework to develop the AI capability of automated procedural assessment and feedback for nursing skills training, with the potential of being integrated into existing training programs. Mimicking human skill acquisition, the framework follows a curriculum-inspired progression, advancing from high-level action recognition, fine-grained subaction decomposition, and ultimately to procedural reasoning. This design supports scalable evaluation by reducing instructor workload while preserving assessment quality. The system provides three core capabilities: 1) diagnosing errors by identifying missing or incorrect subactions in nursing skill instruction videos, 2) generating explainable feedback by clarifying why a step is out of order or omitted, and 3) enabling objective, consistent formative evaluation of procedures. Validation on synthesized videos demonstrates reliable error detection and temporal localization, confirming its potential to handle real-world training variability. By addressing workflow bottlenecks and supporting large-scale, standardized evaluation, this work advances AI applications in nursing education, contributing to stronger workforce development and ultimately safer patient care.
zh
[AI-81] Comparing RAG and GraphRAG for Page-Level Retrieval Question Answering on Math Textbook
【速读】:该论文旨在解决在高等教育场景中,大型语言模型(Large Language Models, LLMs)因缺乏与特定课程教材(如教科书和课件)领域知识对齐而导致的检索准确性不足问题。其核心解决方案是对比分析基于嵌入的检索增强生成(Retrieval-Augmented Generation, RAG)与图谱增强型RAG(GraphRAG)在本科数学教材页级问答任务中的表现差异,关键在于通过构建包含477个问题-答案对的数据集,系统评估两种方法在页面检索准确率和生成答案质量(F1分数)上的优劣。研究发现,标准嵌入式RAG在检索精度和答案质量上优于GraphRAG,后者由于实体驱动的结构易引入冗余或无关内容,表明当前知识图谱增强策略在教育语境下尚需优化以提升可靠性和实用性。
链接: https://arxiv.org/abs/2509.16780
作者: Eason Chen,Chuangji Li,Shizhuo Li,Conrad Borchers,Zimo Xiao,Chloe Qianhui Zhao,Jionghao Lin,Kenneth R. Koedinger
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Technology-enhanced learning environments often help students retrieve relevant learning content for questions arising during self-paced study. Large language models (LLMs) have emerged as novel aids for information retrieval during learning. While LLMs are effective for general-purpose question-answering, they typically lack alignment with the domain knowledge of specific course materials such as textbooks and slides. We investigate Retrieval-Augmented Generation (RAG) and GraphRAG, a knowledge graph-enhanced RAG approach, for page-level question answering in an undergraduate mathematics textbook. While RAG has been effective for retrieving discrete, contextually relevant passages, GraphRAG may excel in modeling interconnected concepts and hierarchical knowledge structures. We curate a dataset of 477 question-answer pairs, each tied to a distinct textbook page. We then compare the standard embedding-based RAG methods to GraphRAG for evaluating both retrieval accuracy-whether the correct page is retrieved-and generated answer quality via F1 scores. Our findings show that embedding-based RAG achieves higher retrieval accuracy and better F1 scores compared to GraphRAG, which tends to retrieve excessive and sometimes irrelevant content due to its entity-based structure. We also explored re-ranking the retrieved pages with LLM and observed mixed results, including performance drop and hallucinations when dealing with larger context windows. Overall, this study highlights both the promises and challenges of page-level retrieval systems in educational contexts, emphasizing the need for more refined retrieval methods to build reliable AI tutoring solutions in providing reference page numbers.
zh
[AI-82] A Hybrid PCA-PR-Seq2Seq-Adam-LSTM Framework for Time-Series Power Outage Prediction
【速读】:该论文旨在解决电力中断(power outage)预测的准确性问题,其核心挑战在于停电数据受天气条件、植被、野生动物和负荷波动等多种因素影响,导致数据具有高度的变异性与噪声。解决方案的关键在于提出一种混合深度学习框架——PCA-PR-Seq2Seq-Adam-LSTM,该框架通过主成分分析(Principal Component Analysis, PCA)降维并稳定数据方差,利用泊松回归(Poisson Regression, PR)有效建模离散型停电事件,并结合序列到序列(Sequence-to-Sequence, Seq2Seq)架构与Adam优化的长短期记忆网络(LSTM),实现对时间依赖特征的高效学习与长期依赖捕捉,从而显著提升预测精度与鲁棒性。
链接: https://arxiv.org/abs/2509.16743
作者: Subhabrata Das,Bodruzzaman Khan,Xiao-Yang Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurately forecasting power outages is a complex task influenced by diverse factors such as weather conditions [1], vegetation, wildlife, and load fluctuations. These factors introduce substantial variability and noise into outage data, making reliable prediction challenging. Long Short-Term Memory (LSTM) networks, a type of Recurrent Neural Network (RNN), are particularly effective for modeling nonlinear and dynamic time-series data, with proven applications in stock price forecasting [2], energy demand prediction, demand response [3], and traffic flow management [4]. This paper introduces a hybrid deep learning framework, termed PCA-PR-Seq2Seq-Adam-LSTM, that integrates Principal Component Analysis (PCA), Poisson Regression (PR), a Sequence-to-Sequence (Seq2Seq) architecture, and an Adam-optimized LSTM. PCA is employed to reduce dimensionality and stabilize data variance, while Poisson Regression effectively models discrete outage events. The Seq2Seq-Adam-LSTM component enhances temporal feature learning through efficient gradient optimization and long-term dependency capture. The framework is evaluated using real-world outage records from Michigan, and results indicate that the proposed approach significantly improves forecasting accuracy and robustness compared to existing methods.
zh
[AI-83] Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
【速读】:该论文旨在解决大语言模型在训练过程中产生的“谄媚行为”(sycophancy)问题,即模型倾向于盲目接受并强化用户提供的错误信息,而非基于事实进行批判性判断。解决方案的关键在于将谄媚行为重新定义为一个推理优化问题(reasoning optimization problem),而非传统的输出对齐问题;其核心机制是提出两阶段框架SMART:第一阶段采用不确定性感知的自适应蒙特卡洛树搜索(Uncertainty-Aware Adaptive Monte Carlo Tree Search, UA-MCTS),根据状态级不确定性动态调整探索策略,收集包含步骤进展与最终结果奖励的高质量、多样化推理轨迹;第二阶段通过基于进展的强化学习(progress-based reinforcement learning)对模型进行微调,利用收集到的轨迹和奖励信号强化有效推理模式,从而在保持模型泛化能力的同时显著降低谄媚倾向。
链接: https://arxiv.org/abs/2509.16742
作者: Mohammad Beigi,Ying Shen,Parshin Shojaee,Qifan Wang,Zichao Wang,Chandan Reddy,Ming Jin,Lifu Huang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Despite the remarkable capabilities of large language models, current training paradigms inadvertently foster \textitsycophancy, i.e., the tendency of a model to agree with or reinforce user-provided information even when it’s factually incorrect. To address this challenge, we introduce \textbfSMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories), which reframes sycophancy as a \textitreasoning optimization problem rather than an output alignment issue. SMART is a two-stage framework comprising: (1) Uncertainty-Aware Adaptive Monte Carlo Tree Search (UA-MCTS), which dynamically adjusts model exploration based on state-level uncertainty to collect high-quality, diverse reasoning trajectories alongside both stepwise progress and final outcome rewards; and (2) progress-based reinforcement learning, which fine-tunes the model using the collected trajectories and reward signals to reinforce effective reasoning patterns. Through extensive experiments, we show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs and maintaining general capabilities. These results underscore the importance of optimizing internal reasoning mechanisms to build more truthful and aligned AI assistants.
zh
[AI-84] Exploring AI Capabilities in Participatory Budgeting within Smart Cities: The Case of Sao Paulo
【速读】:该论文旨在解决智能城市中参与式预算编制(participatory budgeting)过程中公民参与度下降和资源分配冲突等问题。其解决方案的关键在于利用人工智能(Artificial Intelligence, AI)技术增强在线政治参与工具,提升政府在技术依赖性和脆弱性背景下的治理能力,通过优化技术与行政结构、明确利益相关方角色与策略,实现公民与政府官员在参与式制度中的双向赋能。
链接: https://arxiv.org/abs/2509.16724
作者: Italo Alberto Sousa,Mariana Carvalho da Silva,Jorge Machado,José Carlos Vaz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 22 pages, Presented at 28th IPSA World Congress of Political Science, Seoul 2025
Abstract:This research examines how Artificial Intelligence (AI) can improve participatory budgeting processes within smart cities. In response to challenges like declining civic participation and resource allocation conflicts, the study explores how online political participation can be improved by AI. It investigates the state capacity governments need to implement AI-enhanced participatory tools, considering technological dependencies and vulnerabilities. It analyzes technological and administrative structures, actors, interests, and strategies to understand the dynamics of online political participation technologies in the case of Sao Paulo, Brazil. The study contributes to understanding how technological advancements can reshape participatory budgeting processes. In a broader sense, the research highlights how AI can transform participatory institutions by offering new tools for citizens and also for government officials in charge of participatory processes within smart cities.
zh
[AI-85] Design and Development of an Intelligent LLM -based LDAP Honeypot
【速读】:该论文旨在解决传统蜜罐(honeypot)在应对日益复杂和动态的网络安全威胁时所面临的适应性差与配置复杂的问题,尤其针对LDAP(轻量目录访问协议)服务在身份与访问管理中的关键作用,提出一种基于大语言模型(Large Language Models, LLMs)的新型蜜罐解决方案。其核心创新在于利用LLM强大的自然语言理解和生成能力,构建一个能够灵活、逼真地模拟LDAP服务器行为的智能蜜罐系统,从而有效诱捕攻击者、收集其攻击手法,并提升对针对该服务的入侵行为的早期检测与防御能力。
链接: https://arxiv.org/abs/2509.16682
作者: Javier Jiménez-Román,Florina Almenares-Mendoza,Alfonso Sánchez-Macián
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Cybersecurity threats continue to increase, with a growing number of previously unknown attacks each year targeting both large corporations and smaller entities. This scenario demands the implementation of advanced security measures, not only to mitigate damage but also to anticipate emerging attack trends. In this context, deception tools have become a key strategy, enabling the detection, deterrence, and deception of potential attackers while facilitating the collection of information about their tactics and methods. Among these tools, honeypots have proven their value, although they have traditionally been limited by rigidity and configuration complexity, hindering their adaptability to dynamic scenarios. The rise of artificial intelligence, and particularly general-purpose Large Language Models (LLMs), is driving the development of new deception solutions capable of offering greater adaptability and ease of use. This work proposes the design and implementation of an LLM-based honeypot to simulate an LDAP server, a critical protocol present in most organizations due to its central role in identity and access management. The proposed solution aims to provide a flexible and realistic tool capable of convincingly interacting with attackers, thereby contributing to early detection and threat analysis while enhancing the defensive capabilities of infrastructures against intrusions targeting this service.
zh
[AI-86] Governed By Agents : A Survey On The Role Of Agent ic AI In Future Computing Environments
【速读】:该论文试图解决的问题是:随着具有自主性、目标导向行为和自适应学习能力的代理型人工智能(Agentic AI)的兴起,传统以大规模公共云为中心的计算基础设施架构、治理模式和运营流程面临根本性变革,亟需重新设计以适应这一新兴技术带来的系统重构需求。解决方案的关键在于理解如何最优地部署代理型AI,并通过建模其对计算系统架构的影响,推动从当前集中式云服务向边缘计算(edge computing)与本地部署(on-premises computing)等分布式架构的战略迁移,从而实现资源效率提升、数据足迹降低和成本优化,同时重构治理与操作范式以管理日益自主的AI代理。
链接: https://arxiv.org/abs/2509.16676
作者: Nauman Ali Murad,Safia Baloch
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI)
备注:
Abstract:The emergence of agentic Artificial Intelligence (AI), which can operate autonomously, demonstrate goal-directed behavior, and adaptively learn, indicates the onset of a massive change in today’s computing infrastructure. This study investigates how agentic AI models’ multiple characteristics may impact the architecture, governance, and operation under which computing environments function. Agentic AI has the potential to reduce reliance on extremely large (public) cloud environments due to resource efficiency, especially with processing and/or storage. The aforementioned characteristics provide us with an opportunity to canvas the likelihood of strategic migration in computing infrastructures away from massive public cloud services, towards more locally distributed architectures: edge computing and on-premises computing infrastructures. Many of these likely migrations will be spurred by factors like on-premises processing needs, diminished data consumption footprints, and cost savings. This study examines how a solution for implementing AI’s autonomy could result in a re-architecture of the systems and model a departure from today’s governance models to help us manage these increasingly autonomous agents, and an operational overhaul of processes over a very diverse computing systems landscape that bring together computing via cloud, edge, and on-premises computing solutions. To enable us to explore these intertwined decisions, it will be fundamentally important to understand how to best position agentic AI, and to navigate the future state of computing infrastructures.
zh
[AI-87] On the de-duplication of the Lakh MIDI dataset
【速读】:该论文旨在解决符号音乐领域(symbolic music domain)中大规模数据集因网络爬取导致的重复数据问题,此类重复数据会引发训练评估不可靠(data leakage)等关键挑战。解决方案的关键在于构建一个基于对比学习的BERT模型,结合多种数据增强策略,以有效识别并过滤重复文件;研究进一步利用Lakh MIDI Dataset(LMD)的Clean MIDI子集作为基准测试集,验证了该方法在去重效果上的优越性,并提出了三种不同严格程度的过滤版本,其中最保守设置下可移除至少38,134个重复样本(共178,561个文件)。
链接: https://arxiv.org/abs/2509.16662
作者: Eunjin Choi,Hyerin Kim,Jiwoo Ryu,Juhan Nam,Dasaem Jeong
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: The paper has been accepted for publication at ISMIR 2025
Abstract:A large-scale dataset is essential for training a well-generalized deep-learning model. Most such datasets are collected via scraping from various internet sources, inevitably introducing duplicated data. In the symbolic music domain, these duplicates often come from multiple user arrangements and metadata changes after simple editing. However, despite critical issues such as unreliable training evaluation from data leakage during random splitting, dataset duplication has not been extensively addressed in the MIR community. This study investigates the dataset duplication issues regarding Lakh MIDI Dataset (LMD), one of the largest publicly available sources in the symbolic music domain. To find and evaluate the best retrieval method for duplicated data, we employed the Clean MIDI subset of the LMD as a benchmark test set, in which different versions of the same songs are grouped together. We first evaluated rule-based approaches and previous symbolic music retrieval models for de-duplication and also investigated with a contrastive learning-based BERT model with various augmentations to find duplicate files. As a result, we propose three different versions of the filtered list of LMD, which filters out at least 38,134 samples in the most conservative settings among 178,561 files.
zh
[AI-88] NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities
【速读】:该论文旨在解决当前二维多模态大语言模型(Multimodal Large Language Models, MLLMs)在扩展至三维(3D)环境时,因空间推理复杂性导致的性能瓶颈问题,特别是现有3D基准测试缺乏细粒度数值推理任务标注,限制了模型在精确空间测量与复杂数值计算能力上的发展。解决方案的关键在于提出NUMINA——首个面向多维智能与数值推理能力的自然理解基准,其核心创新是通过NUMINA-Flow自动化标注流程实现多尺度标注与多样化问答对生成,该流程融合大语言模型(LLM)重写与基于规则的自验证机制,从而构建高质量、结构化的3D室内感知理解数据集,并基于Chat-Scene框架评估主流LLMs表现,揭示当前模型在距离、体积等精确计算任务中的显著不足,为后续3D多模态模型的发展提供基准与方向。
链接: https://arxiv.org/abs/2509.16656
作者: Changyu Zeng,Yifan Wang,Zimu Wang,Wei Wang,Zhengni Yang,Muyi Bao,Jiming Xiao,Ahn Nguyen,Yutao Yue
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancements in 2D multimodal large language models (MLLMs) have significantly improved performance in vision-language tasks. However, extending these capabilities to 3D environments remains a distinct challenge due to the complexity of spatial reasoning. Nevertheless, existing 3D benchmarks often lack fine-grained numerical reasoning task annotations, limiting MLLMs’ ability to perform precise spatial measurements and complex numerical reasoning. To address this gap, we introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning Abilities to enhance multimodal indoor perceptual understanding. NUMINA features multi-scale annotations and various question-answer pairs, generated using NUMINA-Flow, an automated annotation pipeline that integrates LLM rewriting and rule-based self-verification. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework, demonstrating that current LLMs struggle with multimodal numerical reasoning, particularly in performing precise computations such as distance and volume estimation, highlighting the need for further advancements in 3D models. The dataset and source codes can be obtained from this https URL.
zh
[AI-89] AISTAT lab system for DCASE2025 Task6: Language-based audio retrieval
【速读】:该论文针对语言驱动的音频检索任务(language-based audio retrieval task)中跨模态语义对齐难题,提出了一种基于双编码器架构(dual encoder architecture)的解决方案。其关键在于:通过对比学习(contrastive learning)对音频和文本模态的独立编码表示进行对齐,同时引入知识蒸馏(distillation)与大语言模型(Large Language Models, LLMs)辅助的数据增强策略(如回译和LLM混洗),以提升模型泛化能力;此外,利用聚类构建辅助分类任务,进一步优化模型微调过程。最终在Clotho开发测试集上实现了mAP@16达48.83的性能表现。
链接: https://arxiv.org/abs/2509.16649
作者: Hyun Jun Kim,Hyeong Yong Choi,Changwon Lim
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: 5 pages, 1 figure, DCASE2025 Task2 technical report
Abstract:This report presents the AISTAT team’s submission to the language-based audio retrieval task in DCASE 2025 Task 6. Our proposed system employs dual encoder architecture, where audio and text modalities are encoded separately, and their representations are aligned using contrastive learning. Drawing inspiration from methodologies of the previous year’s challenge, we implemented a distillation approach and leveraged large language models (LLMs) for effective data augmentation techniques, including back-translation and LLM mix. Additionally, we incorporated clustering to introduce an auxiliary classification task for further finetuning. Our best single system achieved a mAP@16 of 46.62, while an ensemble of four systems reached a mAP@16 of 48.83 on the Clotho development test split.
zh
[AI-90] KungfuBot2: Learning Versatile Motion Skills for Humanoid Whole-Body Control
【速读】:该论文旨在解决通用人形机器人学习多样化全身运动技能时面临的挑战,即如何在单一策略下掌握广泛的动作技能并确保长时间序列中的稳定性。其解决方案的关键在于提出一种统一的全身控制器VMS,该控制器融合了混合跟踪目标以平衡局部运动保真度与全局轨迹一致性,并采用正交专家混合(Orthogonal Mixture-of-Experts, OMoE)架构促进技能专业化同时提升跨动作的泛化能力;此外,引入分段级跟踪奖励机制,缓解逐步匹配的刚性约束,从而增强对全局位移和瞬时误差的鲁棒性。
链接: https://arxiv.org/abs/2509.16638
作者: Jinrui Han,Weiji Xie,Jiakun Zheng,Jiyuan Shi,Weinan Zhang,Ting Xiao,Chenjia Bai
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Learning versatile whole-body skills by tracking various human motions is a fundamental step toward general-purpose humanoid robots. This task is particularly challenging because a single policy must master a broad repertoire of motion skills while ensuring stability over long-horizon sequences. To this end, we present VMS, a unified whole-body controller that enables humanoid robots to learn diverse and dynamic behaviors within a single policy. Our framework integrates a hybrid tracking objective that balances local motion fidelity with global trajectory consistency, and an Orthogonal Mixture-of-Experts (OMoE) architecture that encourages skill specialization while enhancing generalization across motions. A segment-level tracking reward is further introduced to relax rigid step-wise matching, enhancing robustness when handling global displacements and transient inaccuracies. We validate VMS extensively in both simulation and real-world experiments, demonstrating accurate imitation of dynamic skills, stable performance over minute-long sequences, and strong generalization to unseen motions. These results highlight the potential of VMS as a scalable foundation for versatile humanoid whole-body control. The project page is available at this https URL.
zh
[AI-91] Zero-Shot Human Mobility Forecasting via Large Language Model with Hierarchical Reasoning
【速读】:该论文旨在解决现有方法在人类移动预测中难以泛化到未见过的用户或地点,以及因标注数据有限和移动模式复杂性而难以捕捉动态意图的问题。解决方案的关键在于提出一种零样本人类移动预测框架ZHMF,其核心是将任务重构为自然语言问答范式,并结合语义增强的检索与反思机制及分层语言模型推理系统;其中,通过分解预测任务为活动级规划器和位置级选择器,实现长期用户意图与短期上下文偏好的协同建模,从而有效处理未见场景并提升预测准确性。
链接: https://arxiv.org/abs/2509.16578
作者: Wenyao Li,Ran Zhang,Pengyang Wang,Yuanchun Zhou,Pengfei Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:
Abstract:Human mobility forecasting is important for applications such as transportation planning, urban management, and personalized recommendations. However, existing methods often fail to generalize to unseen users or locations and struggle to capture dynamic intent due to limited labeled data and the complexity of mobility patterns. We propose ZHMF, a framework for zero-shot human mobility forecasting that combines a semantic enhanced retrieval and reflection mechanism with a hierarchical language model based reasoning system. The task is reformulated as a natural language question answering paradigm. Leveraging LLMs semantic understanding of user histories and context, our approach handles previously unseen prediction scenarios. We further introduce a hierarchical reflection mechanism for iterative reasoning and refinement by decomposing forecasting into an activity level planner and a location level selector, enabling collaborative modeling of long term user intentions and short term contextual preferences. Experiments on standard human mobility datasets show that our approach outperforms existing models. Ablation studies reveal the contribution of each module, and case studies illustrate how the method captures user intentions and adapts to diverse contextual scenarios.
zh
[AI-92] ranTac: Leverag ing Transient Tactile Signals for Contact-Rich Robotic Manipulation
【速读】:该论文旨在解决机器人在执行精细插入任务(如插钥匙或插入USB设备)时,因视觉感知不足导致的对齐失败问题。传统触觉传感方案要么灵敏度不足,难以检测微小形变,要么数据需求过高,不具实用性。解决方案的关键在于提出一种名为TranTac的数据高效、低成本的触觉感知与控制框架:其核心是在机械臂夹爪的弹性尖端集成单个六轴惯性测量单元(6-axis Inertial Measurement Unit, IMU),以检测微米级的动态平移与扭转形变,从而追踪视觉无法察觉的被抓物体姿态变化;同时结合基于Transformer的编码器与扩散策略(diffusion policy),利用插入过程中夹爪尖端瞬态触觉线索模仿人类插入行为,实现对抓取物体6自由度(6-DoF)姿态的实时动态调控。该方法在视觉辅助下平均成功率达79%,纯触觉模式下达88%,且具备良好泛化能力,在未见目标(如USB插头和金属钥匙)上仍保持近70%成功率。
链接: https://arxiv.org/abs/2509.16550
作者: Yinghao Wu,Shuhong Hou,Haowen Zheng,Yichen Li,Weiyi Lu,Xun Zhou,Yitian Shao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注: 8 pages, 7 figures
Abstract:Robotic manipulation tasks such as inserting a key into a lock or plugging a USB device into a port can fail when visual perception is insufficient to detect misalignment. In these situations, touch sensing is crucial for the robot to monitor the task’s states and make precise, timely adjustments. Current touch sensing solutions are either insensitive to detect subtle changes or demand excessive sensor data. Here, we introduce TranTac, a data-efficient and low-cost tactile sensing and control framework that integrates a single contact-sensitive 6-axis inertial measurement unit within the elastomeric tips of a robotic gripper for completing fine insertion tasks. Our customized sensing system can detect dynamic translational and torsional deformations at the micrometer scale, enabling the tracking of visually imperceptible pose changes of the grasped object. By leveraging transformer-based encoders and diffusion policy, TranTac can imitate human insertion behaviors using transient tactile cues detected at the gripper’s tip during insertion processes. These cues enable the robot to dynamically control and correct the 6-DoF pose of the grasped object. When combined with vision, TranTac achieves an average success rate of 79% on object grasping and insertion tasks, outperforming both vision-only policy and the one augmented with end-effector 6D force/torque sensing. Contact localization performance is also validated through tactile-only misaligned insertion tasks, achieving an average success rate of 88%. We assess the generalizability by training TranTac on a single prism-slot pair and testing it on unseen data, including a USB plug and a metal key, and find that the insertion tasks can still be completed with an average success rate of nearly 70%. The proposed framework may inspire new robotic tactile sensing systems for delicate manipulation tasks.
zh
[AI-93] Checking extracted rules in Neural Networks
【速读】:该论文旨在解决神经网络中提取规则的验证问题,具体包括三个核心问题:给定一组规则是否适用于特定神经网络?规则集是否存在自相矛盾的一致性问题?规则集是否在语义上完备,即对任意输入都能确定输出?针对ReLU激活函数的神经网络和布尔网络,作者证明了这些问题大多属于co-NP完全类,表明其计算复杂性较高。解决方案的关键在于将不同类型的规则验证问题相互归约,并通过形式化方法建立理论边界,从而为基于启发式或过近似方法提取的规则提供可信度评估依据。
链接: https://arxiv.org/abs/2509.16547
作者: Adrian Wurm
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 7 pages, one figure
Abstract:In this paper we investigate formal verification of extracted rules for Neural Networks under a complexity theoretic point of view. A rule is a global property or a pattern concerning a large portion of the input space of a network. These rules are algorithmically extracted from networks in an effort to better understand their inner way of working. Here, three problems will be in the focus: Does a given set of rules apply to a given network? Is a given set of rules consistent or do the rules contradict themselves? Is a given set of rules exhaustive in the sense that for every input the output is determined? Finding algorithms that extract such rules out of networks has been investigated over the last 30 years, however, to the author’s current knowledge, no attempt in verification was made until now. A lot of attempts of extracting rules use heuristics involving randomness and over-approximation, so it might be beneficial to know whether knowledge obtained in that way can actually be trusted. We investigate the above questions for neural networks with ReLU-activation as well as for Boolean networks, each for several types of rules. We demonstrate how these problems can be reduced to each other and show that most of them are co-NP-complete. Comments: 7 pages, one figure Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.16547 [cs.AI] (or arXiv:2509.16547v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.16547 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-94] rain to Defend: First Defense Against Cryptanalytic Neural Network Parameter Extraction Attacks
【速读】:该论文旨在解决神经网络参数在面对密码分析攻击时易被提取的问题,这类攻击可导致模型知识产权泄露、安全性和隐私性受损。解决方案的关键在于提出一种新颖的“提取感知训练”(extraction-aware training)方法,通过在标准损失函数中引入一个正则化项,最小化同一层内神经元权重之间的距离,从而消除攻击者依赖的神经元唯一性特征,实现对参数提取攻击的有效防御,且在推理阶段无面积-延迟开销。
链接: https://arxiv.org/abs/2509.16546
作者: Ashley Kurian,Aydin Aysu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 18 pages, 3 Figures
Abstract:Neural networks are valuable intellectual property due to the significant computational cost, expert labor, and proprietary data involved in their development. Consequently, protecting their parameters is critical not only for maintaining a competitive advantage but also for enhancing the model’s security and privacy. Prior works have demonstrated the growing capability of cryptanalytic attacks to scale to deeper models. In this paper, we present the first defense mechanism against cryptanalytic parameter extraction attacks. Our key insight is to eliminate the neuron uniqueness necessary for these attacks to succeed. We achieve this by a novel, extraction-aware training method. Specifically, we augment the standard loss function with an additional regularization term that minimizes the distance between neuron weights within a layer. Therefore, the proposed defense has zero area-delay overhead during inference. We evaluate the effectiveness of our approach in mitigating extraction attacks while analyzing the model accuracy across different architectures and datasets. When re-trained with the same model architecture, the results show that our defense incurs a marginal accuracy change of less than 1% with the modified loss function. Moreover, we present a theoretical framework to quantify the success probability of the attack. When tested comprehensively with prior attack settings, our defense demonstrated empirical success for sustained periods of extraction, whereas unprotected networks are extracted between 14 minutes to 4 hours.
zh
[AI-95] No Need for Real 3D: Fusing 2D Vision with Pseudo 3D Representations for Robotic Manipulation Learning
【速读】:该论文旨在解决基于3D点云的机器人操作策略学习中因数据采集成本高而导致的可扩展性与实际部署受限的问题。其解决方案的关键在于提出了一种名为NoReal3D的新框架,其中引入了可学习的3D感知模块——3DStructureFormer,该模块能够将单目图像转化为具有几何意义的伪点云特征,并与2D编码器输出特征进行融合;同时,为保留伪点云的几何与拓扑结构,设计了专门的伪点云编码器,从而在不依赖真实3D点云数据的前提下,显著提升机器人对三维空间结构的理解能力,实验证明该方法可在多种任务中达到与基于真实3D点云方法相当的性能。
链接: https://arxiv.org/abs/2509.16532
作者: Run Yu,Yangdi Liu,Wen-Da Wei,Chen Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently,vision-based robotic manipulation has garnered significant attention and witnessed substantial advancements. 2D image-based and 3D point cloud-based policy learning represent two predominant paradigms in the field, with recent studies showing that the latter consistently outperforms the former in terms of both policy performance and generalization, thereby underscoring the value and significance of 3D information. However, 3D point cloud-based approaches face the significant challenge of high data acquisition costs, limiting their scalability and real-world deployment. To address this issue, we propose a novel framework NoReal3D: which introduces the 3DStructureFormer, a learnable 3D perception module capable of transforming monocular images into geometrically meaningful pseudo-point cloud features, effectively fused with the 2D encoder output features. Specially, the generated pseudo-point clouds retain geometric and topological structures so we design a pseudo-point cloud encoder to preserve these properties, making it well-suited for our framework. We also investigate the effectiveness of different feature fusion this http URL framework enhances the robot’s understanding of 3D spatial structures while completely eliminating the substantial costs associated with 3D point cloud this http URL experiments across various tasks validate that our framework can achieve performance comparable to 3D point cloud-based methods, without the actual point cloud data.
zh
[AI-96] Causal Fuzzing for Verifying Machine Unlearning
【速读】:该论文旨在解决机器学习模型在需要“遗忘”特定数据点或特征时,如何有效验证其是否真正移除相关影响的问题,尤其是在黑盒模型中,现有方法难以检测间接影响。解决方案的关键在于提出一种基于因果关系的统一框架CAFÉ,通过量化目标数据点或特征对模型输出的直接与间接因果效应,实现细粒度的可解释性验证,从而精准识别基线方法遗漏的残留影响,同时保持计算效率。
链接: https://arxiv.org/abs/2509.16525
作者: Anna Mazhar,Sainyam Galhotra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As machine learning models become increasingly embedded in decision-making systems, the ability to “unlearn” targeted data or features is crucial for enhancing model adaptability, fairness, and privacy in models which involves expensive training. To effectively guide machine unlearning, a thorough testing is essential. Existing methods for verification of machine unlearning provide limited insights, often failing in scenarios where the influence is indirect. In this work, we propose CAFÉ, a new causality based framework that unifies datapoint- and feature-level unlearning for verification of black-box ML models. CAFÉ evaluates both direct and indirect effects of unlearning targets through causal dependencies, providing actionable insights with fine-grained analysis. Our evaluation across five datasets and three model architectures demonstrates that CAFÉ successfully detects residual influence missed by baselines while maintaining computational efficiency.
zh
[AI-97] Synergies between Federated Foundation Models and Smart Power Grids
【速读】:该论文旨在解决传统单模态、集中式机器学习模型在智能电网应用中面临的隐私保护不足与数据异构性挑战,以及多模态、多任务基础模型(Multi-modal, Multi-task Foundation Models, M3T FMs)在电力系统场景下的适配与价值挖掘问题。其解决方案的关键在于提出并探索多模态联邦基础模型(M3T Federated Foundation Models, FedFMs),该模型通过结合多模态数据处理能力与联邦学习(Federated Learning, FL)机制,在保障分布式数据隐私的前提下,实现跨边缘节点的联合训练与泛化能力提升,从而赋能智能电网中的负荷预测、故障检测等关键功能;同时反向揭示了电网在能源、通信和监管维度上的约束如何塑造FedFMs的设计与部署逻辑,形成“电网驱动模型”与“模型赋能电网”的双向协同范式。
链接: https://arxiv.org/abs/2509.16496
作者: Seyyedali Hosseinalipour,Shimiao Li,Adedoyin Inaolaji,Filippo Malandra,Luis Herrera,Nicholas Mastronarde
机构: 未知
类目: ystems and Control (eess.SY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The recent emergence of large language models (LLMs) such as GPT-3 has marked a significant paradigm shift in machine learning. Trained on massive corpora of data, these models demonstrate remarkable capabilities in language understanding, generation, summarization, and reasoning, transforming how intelligent systems process and interact with human language. Although LLMs may still seem like a recent breakthrough, the field is already witnessing the rise of a new and more general category: multi-modal, multi-task foundation models (M3T FMs). These models go beyond language and can process heterogeneous data types/modalities, such as time-series measurements, audio, imagery, tabular records, and unstructured logs, while supporting a broad range of downstream tasks spanning forecasting, classification, control, and retrieval. When combined with federated learning (FL), they give rise to M3T Federated Foundation Models (FedFMs): a highly recent and largely unexplored class of models that enable scalable, privacy-preserving model training/fine-tuning across distributed data sources. In this paper, we take one of the first steps toward introducing these models to the power systems research community by offering a bidirectional perspective: (i) M3T FedFMs for smart grids and (ii) smart grids for FedFMs. In the former, we explore how M3T FedFMs can enhance key grid functions, such as load/demand forecasting and fault detection, by learning from distributed, heterogeneous data available at the grid edge in a privacy-preserving manner. In the latter, we investigate how the constraints and structure of smart grids, spanning energy, communication, and regulatory dimensions, shape the design, training, and deployment of M3T FedFMs.
zh
[AI-98] Entropic Causal Inference: Graph Identifiability ICML2022
【速读】:该论文旨在解决从观测数据中学习多变量因果图(causal graph)的可识别性问题,特别是如何在不依赖于函数形式假设的前提下,利用信息论最简性原则(即最小熵)来推断变量间的因果关系。其解决方案的关键在于:首先通过放宽两变量场景下的可识别性条件,扩展了原有理论;其次提出了一种基于熵的多变量因果结构学习方法,核心创新是利用二元熵测试(bivariate entropic tests)判断源节点与其后代节点之间的祖先关系(ancestrality),并据此设计了一个稳健的顺序剥除算法(sequential peeling algorithm)用于一般图结构,同时为小规模图提出了一个高效启发式算法,在合成数据和真实数据上均表现出优于先前方法的性能。
链接: https://arxiv.org/abs/2509.16463
作者: Spencer Compton,Kristjan Greenewald,Dmitriy Katz,Murat Kocaoglu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Presented at ICML 2022. This version corrects a bug in semi-synthetic experiments
Abstract:Entropic causal inference is a recent framework for learning the causal graph between two variables from observational data by finding the information-theoretically simplest structural explanation of the data, i.e., the model with smallest entropy. In our work, we first extend the causal graph identifiability result in the two-variable setting under relaxed assumptions. We then show the first identifiability result using the entropic approach for learning causal graphs with more than two nodes. Our approach utilizes the property that ancestrality between a source node and its descendants can be determined using the bivariate entropic tests. We provide a sound sequential peeling algorithm for general graphs that relies on this property. We also propose a heuristic algorithm for small graphs that shows strong empirical performance. We rigorously evaluate the performance of our algorithms on synthetic data generated from a variety of models, observing improvement over prior work. Finally we test our algorithms on real-world datasets.
zh
[AI-99] GPO: Learning from Critical Steps to Improve LLM Reasoning NEURIPS2025
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在多步推理能力上的提升难题,尤其是现有优化方法通常将推理轨迹视为整体而忽视其中关键步骤的问题。解决方案的关键在于提出一种名为“引导式关键点优化”(Guided Pivotal Optimization, GPO)的新颖微调策略:GPO首先通过估计优势函数识别推理轨迹中的“关键步骤”(critical step),即模型必须谨慎处理才能成功解决问题的节点;随后,它在该关键步骤处重置策略,重新采样轨迹,并优先对这些关键片段进行学习。这种聚焦于推理过程中 pivotal moments 的机制显著提升了模型从关键决策点中学习的能力,从而有效增强多步推理性能。
链接: https://arxiv.org/abs/2509.16456
作者: Jiahao Yu,Zelei Cheng,Xian Wu,Xinyu Xing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract:Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the \textitreasoning or \textitthinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs still remains a significant challenge. While existing optimization methods have advanced the LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce \textbfGuided \textbfPivotal \textbfOptimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the `critical step’ within a reasoning trajectory - a point that the model must carefully proceed to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples the new rollout and prioritizes the learning process on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve the reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
zh
[AI-100] A Generative AI System for Biomedical Data Discovery with Grammar-Based Visualizations
【速读】:该论文旨在解决生物医药数据发现过程中可视化分析效率低、用户交互复杂的问题,尤其是在面对海量异构数据时,传统方法难以快速生成有意义的可视化并支持灵活调整。其解决方案的关键在于构建一个基于多智能体(multi-agent system)的系统,利用生成式 AI(Generative AI)将自然语言指令转化为可视化规范,并通过交互式组件动态生成和链接图表,从而实现可逐步构建的交互式仪表板(interactive dashboard)。该设计融合了自然语言处理的优势与传统用户界面的实用性,显著提升了数据探索的灵活性与效率。
链接: https://arxiv.org/abs/2509.16454
作者: Devin Lange,Shanghua Gao,Pengwei Sui,Austen Money,Priya Misner,Marinka Zitnik,Nils Gehlenborg
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:We explore the potential for combining generative AI with grammar-based visualizations for biomedical data discovery. In our prototype, we use a multi-agent system to generate visualization specifications and apply filters. These visualizations are linked together, resulting in an interactive dashboard that is progressively constructed. Our system leverages the strengths of natural language while maintaining the utility of traditional user interfaces. Furthermore, we utilize generated interactive widgets enabling user adjustment. Finally, we demonstrate the potential utility of this system for biomedical data discovery with a case study.
zh
[AI-101] Domain-Specific Constitutional AI: Enhancing Safety in LLM -Powered Mental Health Chatbots
【速读】:该论文旨在解决当前通用人工智能(AI)安全机制在心理健康应用中适配不足的问题,尤其针对情绪脆弱性、误诊风险、症状恶化及危机干预准确性等特有挑战。其解决方案的关键在于引入基于领域特定原则的宪法式人工智能(Constitutional AI, CAI)训练方法,通过嵌入心理健康领域的伦理与治疗准则,构建具备领域适应性的安全AI系统,从而提升心理干预的可靠性、合规性与有效性。
链接: https://arxiv.org/abs/2509.16444
作者: Chenhan Lyu,Yutong Song,Pengfei Zhang,Amir M. Rahmani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Mental health applications have emerged as a critical area in computational health, driven by rising global rates of mental illness, the integration of AI in psychological care, and the need for scalable solutions in underserved communities. These include therapy chatbots, crisis detection, and wellness platforms handling sensitive data, requiring specialized AI safety beyond general safeguards due to emotional vulnerability, risks like misdiagnosis or symptom exacerbation, and precise management of vulnerable states to avoid severe outcomes such as self-harm or loss of trust. Despite AI safety advances, general safeguards inadequately address mental health-specific challenges, including crisis intervention accuracy to avert escalations, therapeutic guideline adherence to prevent misinformation, scale limitations in resource-constrained settings, and adaptation to nuanced dialogues where generics may introduce biases or miss distress signals. We introduce an approach to apply Constitutional AI training with domain-specific mental health principles for safe, domain-adapted CAI systems in computational mental health applications.
zh
[AI-102] SENSE-7: Taxonomy and Dataset for Measuring User Perceptions of Empathy in Sustained Human-AI Conversations
【速读】:该论文旨在解决当前生成式 AI(Generative AI)在人机交互中对“数字共情”(digital empathy)理解过于表面化的问题,即现有方法多聚焦于模拟人类内部情感状态,而忽视了用户感知到的共情行为所具有的主观性、情境依赖性和关系动态性。其解决方案的关键在于提出一种以用户为中心的共情分类体系,强调可观测的共情行为,并构建了一个包含真实对话数据的新数据集 Sense-7,其中包含来自信息工作者与大型语言模型(Large Language Models, LLMs)交互时的逐轮共情标注、用户特征和上下文信息。通过分析695次对话发现,用户的共情判断高度个体化且易受对话连贯性中断或预期未满足的影响,进而表明未来AI设计需动态适配用户情境与目标,为开发更具社会敏感性的智能代理提供了理论依据与实践路径。
链接: https://arxiv.org/abs/2509.16437
作者: Jina Suh,Lindy Le,Erfan Shayegani,Gonzalo Ramos,Judith Amores,Desmond C. Ong,Mary Czerwinski,Javier Hernandez
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:
Abstract:Empathy is increasingly recognized as a key factor in human-AI communication, yet conventional approaches to “digital empathy” often focus on simulating internal, human-like emotional states while overlooking the inherently subjective, contextual, and relational facets of empathy as perceived by users. In this work, we propose a human-centered taxonomy that emphasizes observable empathic behaviors and introduce a new dataset, Sense-7, of real-world conversations between information workers and Large Language Models (LLMs), which includes per-turn empathy annotations directly from the users, along with user characteristics, and contextual details, offering a more user-grounded representation of empathy. Analysis of 695 conversations from 109 participants reveals that empathy judgments are highly individualized, context-sensitive, and vulnerable to disruption when conversational continuity fails or user expectations go unmet. To promote further research, we provide a subset of 672 anonymized conversation and provide exploratory classification analysis, showing that an LLM-based classifier can recognize 5 levels of empathy with an encouraging average Spearman \rho =0.369 and Accuracy=0.487 over this set. Overall, our findings underscore the need for AI designs that dynamically tailor empathic behaviors to user contexts and goals, offering a roadmap for future research and practical development of socially attuned, human-centered artificial agents.
zh
[AI-103] Proactive Statistical Process Control Using AI: A Time Series Forecasting Approach for Semiconductor Manufacturing
【速读】:该论文旨在解决传统统计过程控制(Statistical Process Control, SPC)方法反应滞后的问题,即仅在异常发生后才发出警报,导致材料浪费、设备停机和成本上升。其解决方案的关键在于引入基于时间序列预测的机器学习模型——Facebook Prophet,通过分析历史数据预测未来过程值,并结合SPC规则将预测结果分类为安全区(Safe zone)、预警区(Warning zone)或临界区(Critical zone),从而实现对潜在质量问题的早期识别与干预。该方法特别适用于非均匀采样场景(如半导体制造中不规则时间间隔的数据),在实际应用中展现出较强的预测能力和风险判别准确性,使质量控制由被动响应转向主动预防。
链接: https://arxiv.org/abs/2509.16431
作者: Mohammad Iqbal Rasul Seeam,Victor S. Sheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 3 figures, no .bbl file needed because bibliography already in this http URL file
Abstract:In the manufacturing industry, it is very important to keep machines and processes running smoothly and without unexpected problems. One of the most common tools used to check if everything is working properly is called Statistical Process Control (SPC). Traditional SPC methods work by checking whether recent measurements are within acceptable limits. However, they only react after a problem has already occurred. This can lead to wasted materials, machine downtime, and increased costs. In this paper, we present a smarter way to use SPC. Instead of just reacting to issues after they happen, our system can predict future problems before they occur. We use a machine learning tool called Facebook Prophet, which is designed to work with time-series data (data that changes over time). Prophet looks at past data and forecasts what the next value will be. Then, we use SPC rules to decide if the predicted value is in a Safe zone (no problem), a Warning zone (needs attention), or a Critical zone (may require shutting down the process). We applied this system to real data from a semiconductor manufacturing company. One of the challenges with this data is that the measurements are not taken at regular time intervals. This makes it harder to predict future values accurately. Despite this, our model was able to make strong predictions and correctly classify the risk level of future measurements. The main benefit of our system is that it gives engineers and technicians a chance to act early - before something goes wrong. This helps reduce unexpected failures and improves the overall stability and reliability of the production process. By combining machine learning with traditional SPC, we make quality control more proactive, accurate, and useful for modern industry.
zh
[AI-104] VORTEX: Aligning Task Utility and Human Preferences through LLM -Guided Reward Shaping
【速读】:该论文旨在解决社会影响优化中AI决策系统难以适应动态人类偏好问题,即传统优化求解器依赖于固定数学目标函数,无法直接处理以自然语言形式表达的人类偏好。解决方案的关键在于提出VORTEX框架,该框架通过将问题建模为多目标优化问题,利用大语言模型(LLM)根据口头强化和文本梯度提示迭代生成奖励塑造项(reward shaping),从而在不修改原有求解器或指定权衡参数的前提下,实现人类反馈对决策行为的自然语言引导,并保证收敛至效用与偏好满足之间的帕累托最优解。
链接: https://arxiv.org/abs/2509.16399
作者: Guojun Xiong,Milind Tambe
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28pages, 19figures
Abstract:In social impact optimization, AI decision systems often rely on solvers that optimize well-calibrated mathematical objectives. However, these solvers cannot directly accommodate evolving human preferences, typically expressed in natural language rather than formal constraints. Recent approaches address this by using large language models (LLMs) to generate new reward functions from preference descriptions. While flexible, they risk sacrificing the system’s core utility guarantees. In this paper, we propose \textttVORTEX, a language-guided reward shaping framework that preserves established optimization goals while adaptively incorporating human feedback. By formalizing the problem as multi-objective optimization, we use LLMs to iteratively generate shaping rewards based on verbal reinforcement and text-gradient prompt updates. This allows stakeholders to steer decision behavior via natural language without modifying solvers or specifying trade-off weights. We provide theoretical guarantees that \textttVORTEX converges to Pareto-optimal trade-offs between utility and preference satisfaction. Empirical results in real-world allocation tasks demonstrate that \textttVORTEX outperforms baselines in satisfying human-aligned coverage goals while maintaining high task performance. This work introduces a practical and theoretically grounded paradigm for human-AI collaborative optimization guided by natural language.
zh
[AI-105] GRID: Graph-based Reasoning for Intervention and Discovery in Built Environments
【速读】:该论文旨在解决商业建筑中暖通空调(HVAC)故障诊断依赖人工、耗时长(8–12小时/次)且准确率低(仅60%)的问题,其根本原因在于现有分析方法仅停留在相关性层面,缺乏对因果机制的挖掘。解决方案的关键是提出GRID(Graph-based Reasoning for Intervention and Discovery),一个三阶段因果发现流水线:首先通过约束搜索识别变量间的独立性关系,其次利用神经结构方程模型(Neural Structural Equation Modeling, NSEM)学习非线性因果结构,最后引入领域特定语言模型先验(Language Model Priors)增强因果图的可解释性和泛化能力。该方法在多个基准测试中实现F1分数0.65–1.00,显著优于十种基线方法,并能在真实场景中有效平衡干预成本与风险降低,从而填补了建筑数据分析中观测数据到因果推理的鸿沟。
链接: https://arxiv.org/abs/2509.16397
作者: Taqiya Ehsan,Shuren Xia,Jorge Ortiz
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Manual HVAC fault diagnosis in commercial buildings takes 8-12 hours per incident and achieves only 60 percent diagnostic accuracy, reflecting analytics that stop at correlation instead of causation. To close this gap, we present GRID (Graph-based Reasoning for Intervention and Discovery), a three-stage causal discovery pipeline that combines constraint-based search, neural structural equation modeling, and language model priors to recover directed acyclic graphs from building sensor data. Across six benchmarks: synthetic rooms, EnergyPlus simulation, the ASHRAE Great Energy Predictor III dataset, and a live office testbed, GRID achieves F1 scores ranging from 0.65 to 1.00, with exact recovery (F1 = 1.00) in three controlled environments (Base, Hidden, Physical) and strong performance on real-world data (F1 = 0.89 on ASHRAE, 0.86 in noisy conditions). The method outperforms ten baseline approaches across all evaluation scenarios. Intervention scheduling achieves low operational impact in most scenarios (cost = 0.026) while reducing risk metrics compared to baseline approaches. The framework integrates constraint-based methods, neural architectures, and domain-specific language model prompts to address the observational-causal gap in building analytics.
zh
[AI-106] Evaluation of Causal Reasoning for Large Language Models in Contextualized Clinical Scenarios of Laboratory Test Interpretation
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在临床场景中进行因果推理(causal reasoning)能力的评估问题,特别是针对实验室检验指标(如糖化血红蛋白、肌酐和维生素D)与其潜在因果因素(如年龄、性别、肥胖和吸烟)之间的关联性、干预效应及反事实推断的准确性。解决方案的关键在于构建一个基于Pearl因果阶梯(Pearl’s Ladder of Causation)的标准化测试框架,涵盖三个层级:关联(association)、干预(intervention)与反事实(counterfactual)推理,并通过四名医学训练专家对GPT-o1与Llama-3.2-8b-instruct两个主流LLM的响应进行系统评估,从而量化其在不同因果层次上的表现差异。结果表明,GPT-o1在各项指标上均优于Llama-3.2-8b-instruct,尤其在干预和反事实推理方面,但整体仍需改进以满足高风险临床应用需求。
链接: https://arxiv.org/abs/2509.16372
作者: Balu Bhasuran,Mattia Prosperi,Karim Hanna,John Petrilli,Caretia JeLayne Washington,Zhe He
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This study evaluates causal reasoning in large language models (LLMs) using 99 clinically grounded laboratory test scenarios aligned with Pearl’s Ladder of Causation: association, intervention, and counterfactual reasoning. We examined common laboratory tests such as hemoglobin A1c, creatinine, and vitamin D, and paired them with relevant causal factors including age, gender, obesity, and smoking. Two LLMs - GPT-o1 and Llama-3.2-8b-instruct - were tested, with responses evaluated by four medically trained human experts. GPT-o1 demonstrated stronger discriminative performance (AUROC overall = 0.80 +/- 0.12) compared to Llama-3.2-8b-instruct (0.73 +/- 0.15), with higher scores across association (0.75 vs 0.72), intervention (0.84 vs 0.70), and counterfactual reasoning (0.84 vs 0.69). Sensitivity (0.90 vs 0.84) and specificity (0.93 vs 0.80) were also greater for GPT-o1, with reasoning ratings showing similar trends. Both models performed best on intervention questions and worst on counterfactuals, particularly in altered outcome scenarios. These findings suggest GPT-o1 provides more consistent causal reasoning, but refinement is required before adoption in high-stakes clinical applications.
zh
[AI-107] Enhancing Financial RAG with Agent ic AI and Multi-HyDE: A Novel Approach to Knowledge Retrieval and Hallucination Reduction
【速读】:该论文旨在解决金融领域问答系统中知识检索的准确性与可靠性问题,尤其针对持续更新的数据源和高风险、复杂的业务场景下传统单一数据库和检索器难以满足需求的挑战。解决方案的关键在于提出一种面向金融领域的检索增强生成(Retrieval Augmented Generation, RAG)框架,其核心创新是融合代理型AI(agentic AI)与Multi-HyDE系统——通过生成多个非等价查询来提升对大型结构化金融语料库的检索覆盖率和有效性,并结合关键词检索与表格检索等多模态工具集,优化token效率与多步金融推理能力。实验表明,该方法在标准金融QA基准上使准确率提升11.2%,幻觉率降低15%,显著增强了答案的可信度与实用性。
链接: https://arxiv.org/abs/2509.16369
作者: Akshay Govind Srinivasan,Ryan Jacob George,Jayden Koshy Joe,Hrushikesh Kant,Harshith M R,Sachin Sundar,Sudharshan Suresh,Rahul Vimalkanth,Vijayavallabh
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: 14 Pages, 8 Tables, 2 Figures. Accepted and to be published in the proceedings of FinNLP, Empirical Methods in Natural Language Processing 2025
Abstract:Accurate and reliable knowledge retrieval is vital for financial question-answering, where continually updated data sources and complex, high-stakes contexts demand precision. Traditional retrieval systems rely on a single database and retriever, but financial applications require more sophisticated approaches to handle intricate regulatory filings, market analyses, and extensive multi-year reports. We introduce a framework for financial Retrieval Augmented Generation (RAG) that leverages agentic AI and the Multi-HyDE system, an approach that generates multiple, nonequivalent queries to boost the effectiveness and coverage of retrieval from large, structured financial corpora. Our pipeline is optimized for token efficiency and multi-step financial reasoning, and we demonstrate that their combination improves accuracy by 11.2% and reduces hallucinations by 15%. Our method is evaluated on standard financial QA benchmarks, showing that integrating domain-specific retrieval mechanisms such as Multi-HyDE with robust toolsets, including keyword and table-based retrieval, significantly enhances both the accuracy and reliability of answers. This research not only delivers a modular, adaptable retrieval framework for finance but also highlights the importance of structured agent workflows and multi-perspective retrieval for trustworthy deployment of AI in high-stakes financial applications.
zh
[AI-108] Secure Confidential Business Information When Sharing Machine Learning Models
【速读】:该论文旨在解决模型共享(Model-sharing)场景下因保密属性推断(Confidential Property Inference, CPI)攻击导致的数据隐私泄露问题。现有防御方法通常假设攻击者是非自适应的,忽略了现实世界中攻击者能够根据目标模型及其防御机制动态调整攻击策略的响应性特征。解决方案的关键在于提出一种新型防御机制,其核心创新包括:一是设计了一种模拟真实攻击者响应行为的响应式CPI攻击(Responsive CPI attack),用于更贴近实际威胁场景;二是构建一个攻防对抗演化框架(attack-defense arms race framework),通过迭代优化目标模型与攻击模型,生成对响应式CPI攻击具备鲁棒性的安全模型;此外,引入一种近似策略以显著降低计算开销,从而提升防御效率。该方法在多种真实模型共享场景中验证了其优越性,在保障模型效用的同时有效提升了安全性与计算效率。
链接: https://arxiv.org/abs/2509.16352
作者: Yunfan Yang,Jiarong Xu,Hongzhe Zhang,Xiao Fang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Model-sharing offers significant business value by enabling firms with well-established Machine Learning (ML) models to monetize and share their models with others who lack the resources to develop ML models from scratch. However, concerns over data confidentiality remain a significant barrier to model-sharing adoption, as Confidential Property Inference (CPI) attacks can exploit shared ML models to uncover confidential properties of the model provider’s private model training data. Existing defenses often assume that CPI attacks are non-adaptive to the specific ML model they are targeting. This assumption overlooks a key characteristic of real-world adversaries: their responsiveness, i.e., adversaries’ ability to dynamically adjust their attack models based on the information of the target and its defenses. To overcome this limitation, we propose a novel defense method that explicitly accounts for the responsive nature of real-world adversaries via two methodological innovations: a novel Responsive CPI attack and an attack-defense arms race framework. The former emulates the responsive behaviors of adversaries in the real world, and the latter iteratively enhances both the target and attack models, ultimately producing a secure ML model that is robust against responsive CPI attacks. Furthermore, we propose and integrate a novel approximate strategy into our defense, which addresses a critical computational bottleneck of defense methods and improves defense efficiency. Through extensive empirical evaluations across various realistic model-sharing scenarios, we demonstrate that our method outperforms existing defenses by more effectively defending against CPI attacks, preserving ML model utility, and reducing computational overhead.
zh
[AI-109] A Unified AI Approach for Continuous Monitoring of Human Health and Diseases from Intensive Care Unit to Home with Physiological Foundation Models (UNIPHY)
【速读】:该论文旨在解决当前生理健康监测中缺乏统一、可泛化且能适应多场景(如重症监护与院外监测)的AI模型问题,以实现连续、个性化的人类健康与疾病监测。其解决方案的关键在于提出UNIPHY+框架,通过预训练阶段引入上下文信息、微调阶段采用多模态学习与特征融合调优(feature fusion-tuning),以及轻量化模型个性化阶段的知识蒸馏(knowledge distillation),从而构建一个通用生理基础模型(physioFM),支持临床决策和长期健康追踪的可扩展、可迁移和个性化的生理AI应用。
链接: https://arxiv.org/abs/2509.16348
作者: Minxiao Wang,Saurabh Kataria,Juntong Ni,Timothy G. Buchman,Jocelyn Grunwell,Mark Mai,Wei Jin,Matthew Clark,Stephanie Brown,Michael Fundora,Puneet Sharma,Tony Pan,Sam Khan,Timothy Ruchti,Naveen Muthu,Kevin Maher,Sivasubramanium V Bhavani,Xiao Hu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:We present UNIPHY+, a unified physiological foundation model (physioFM) framework designed to enable continuous human health and diseases monitoring across care settings using ubiquitously obtainable physiological data. We propose novel strategies for incorporating contextual information during pretraining, fine-tuning, and lightweight model personalization via multi-modal learning, feature fusion-tuning, and knowledge distillation. We advocate testing UNIPHY+ with a broad set of use cases from intensive care to ambulatory monitoring in order to demonstrate that UNIPHY+ can empower generalizable, scalable, and personalized physiological AI to support both clinical decision-making and long-term health monitoring.
zh
[AI-110] QUINTA: Reflexive Sensibility For Responsible AI Research and Data-Driven Processes
【速读】:该论文旨在解决当前人工智能(AI)与数据科学(DS)研究中对交叉性(intersectionality)关注不足的问题,尤其是在数据驱动的算法开发过程中,忽视了多重社会身份因素叠加带来的系统性偏见和边缘化风险。其解决方案的关键在于提出一种名为QUINTA(Quantitative Intersectional Data)的方法论范式,该范式以批判性反思(critical reflexivity)为核心,强调研究人员在AI/DS全生命周期中的主体性和责任意识,通过系统性地识别和干预数据采集、建模与评估等环节中的隐性偏见,实现对交叉性影响的量化捕捉与治理,从而推动更具公平性的算法设计实践。
链接: https://arxiv.org/abs/2509.16347
作者: Alicia E. Boyd
机构: 未知
类目: ocial and Information Networks (cs.SI); Artificial Intelligence (cs.AI)
备注: 14 pages, 5 figures, 1 Table, This paper was accepted as a poster presentation at Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO) Conference in 2023
Abstract:As the field of artificial intelligence (AI) and machine learning (ML) continues to prioritize fairness and the concern for historically marginalized communities, the importance of intersectionality in AI research has gained significant recognition. However, few studies provide practical guidance on how researchers can effectively incorporate intersectionality into critical praxis. In response, this paper presents a comprehensive framework grounded in critical reflexivity as intersectional praxis. Operationalizing intersectionality within the AI/DS (Artificial Intelligence/Data Science) pipeline, Quantitative Intersectional Data (QUINTA) is introduced as a methodological paradigm that challenges conventional and superficial research habits, particularly in data-centric processes, to identify and mitigate negative impacts such as the inadvertent marginalization caused by these practices. The framework centers researcher reflexivity to call attention to the AI researchers’ power in creating and analyzing AI/DS artifacts through data-centric approaches. To illustrate the effectiveness of QUINTA, we provide a reflexive AI/DS researcher demonstration utilizing the #metoo movement as a case study. Note: This paper was accepted as a poster presentation at Equity and Access in Algorithms, Mechanisms, and Optimization (EAAMO) Conference in 2023.
zh
[AI-111] Estimating Clinical Lab Test Result Trajectories from PPG using Physiological Foundation Model and Patient-Aware State Space Model – a UNIPHY Approach
【速读】:该论文旨在解决重症监护病房(ICU)中实验室检测指标因采样间隔性和侵入性而导致的连续性监测难题,提出通过非侵入性的光电容积脉搏波(photoplethysmogram, PPG)信号实现对关键生化指标的持续、个性化估计。其解决方案的关键在于构建UNIPHY+Lab框架:一方面利用大规模PPG基础模型进行局部波形编码以提取生理特征,另一方面引入患者感知的Mamba模型实现长程时间建模;同时通过FiLM调制的初始状态机制捕捉个体基线差异,并支持多任务联合估计相关生物标志物,从而显著提升预测准确性(MAE、RMSE和R²指标优于LSTM与前向填充基线)。
链接: https://arxiv.org/abs/2509.16345
作者: Minxiao Wang,Runze Yan,Carol Li,Saurabh Kataria,Xiao Hu,Matthew Clark,Timothy Ruchti,Timothy G. Buchman,Sivasubramanium V Bhavani,Randall J. Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Clinical laboratory tests provide essential biochemical measurements for diagnosis and treatment, but are limited by intermittent and invasive sampling. In contrast, photoplethysmogram (PPG) is a non-invasive, continuously recorded signal in intensive care units (ICUs) that reflects cardiovascular dynamics and can serve as a proxy for latent physiological changes. We propose UNIPHY+Lab, a framework that combines a large-scale PPG foundation model for local waveform encoding with a patient-aware Mamba model for long-range temporal modeling. Our architecture addresses three challenges: (1) capturing extended temporal trends in laboratory values, (2) accounting for patient-specific baseline variation via FiLM-modulated initial states, and (3) performing multi-task estimation for interrelated biomarkers. We evaluate our method on the two ICU datasets for predicting the five key laboratory tests. The results show substantial improvements over the LSTM and carry-forward baselines in MAE, RMSE, and R^2 among most of the estimation targets. This work demonstrates the feasibility of continuous, personalized lab value estimation from routine PPG monitoring, offering a pathway toward non-invasive biochemical surveillance in critical care.
zh
[AI-112] Highly Imbalanced Regression with Tabular Data in SEP and Other Applications ICML
【速读】:该论文针对高不平衡回归问题(imbalanced regression),即目标变量中稀有实例的分布比例远低于常见实例(不平衡比大于1000),旨在提升对稀有实例的目标值估计精度,这在预测罕见有害太阳高能粒子(Solar Energetic Particle, SEP)事件强度等实际场景中具有重要意义。传统均方误差(MSE)损失函数未考虑预测值与真实值之间的相关性,且常用逆重要性函数仅支持凸函数形式,同时随机采样易导致小批量中缺乏稀有样本。论文提出CISIR方法,其核心创新在于:引入相关性增强机制以优化预测一致性;采用单调递减自反(Monotonically Decreasing Involution, MDI)重要性函数替代传统凸函数,更灵活地赋予稀有样本更高权重;结合分层采样策略确保训练批次包含足够稀有实例。实验表明,CISIR在多个数据集上优于近期方法,且其相关性组件可泛化至其他模型以提升性能,MDI重要性函数亦表现更优。
链接: https://arxiv.org/abs/2509.16339
作者: Josias K. Moukpe,Philip K. Chan,Ming Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: ICMLA 2025
Abstract:We investigate imbalanced regression with tabular data that have an imbalance ratio larger than 1,000 (“highly imbalanced”). Accurately estimating the target values of rare instances is important in applications such as forecasting the intensity of rare harmful Solar Energetic Particle (SEP) events. For regression, the MSE loss does not consider the correlation between predicted and actual values. Typical inverse importance functions allow only convex functions. Uniform sampling might yield mini-batches that do not have rare instances. We propose CISIR that incorporates correlation, Monotonically Decreasing Involution (MDI) importance, and stratified sampling. Based on five datasets, our experimental results indicate that CISIR can achieve lower error and higher correlation than some recent methods. Also, adding our correlation component to other recent methods can improve their performance. Lastly, MDI importance can outperform other importance functions. Our code can be found in this https URL.
zh
[AI-113] Generalizability of Large Language Model-Based Agents : A Comprehensive Survey
【速读】:该论文旨在解决大语言模型(Large Language Model, LLM)驱动的智能体在多样化任务、环境和指令下缺乏可靠泛化能力的问题,即“代理泛化性”(agent generalizability)的定义不清与系统性提升方法缺失。其关键解决方案在于提出一个结构化的研究框架:首先通过层级域-任务本体论明确代理泛化性的边界;其次系统梳理现有数据集、评估维度与指标的局限性;进而将提升策略归纳为三类——针对基础LLM、代理组件及二者交互的方法;并进一步区分“可泛化的框架”与“可泛化的代理”,阐明如何从框架级设计转化为代理级泛化能力。该综述为构建跨场景、跨领域稳定运行的LLM代理奠定了理论基础与实践路径。
链接: https://arxiv.org/abs/2509.16330
作者: Minxing Zhang,Yi Yang,Roy Xie,Bhuwan Dhingra,Shuyan Zhou,Jian Pei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large Language Model (LLM)-based agents have emerged as a new paradigm that extends LLMs’ capabilities beyond text generation to dynamic interaction with external environments. By integrating reasoning with perception, memory, and tool use, agents are increasingly deployed in diverse domains like web navigation and household robotics. A critical challenge, however, lies in ensuring agent generalizability - the ability to maintain consistent performance across varied instructions, tasks, environments, and domains, especially those beyond agents’ fine-tuning data. Despite growing interest, the concept of generalizability in LLM-based agents remains underdefined, and systematic approaches to measure and improve it are lacking. In this survey, we provide the first comprehensive review of generalizability in LLM-based agents. We begin by emphasizing agent generalizability’s importance by appealing to stakeholders and clarifying the boundaries of agent generalizability by situating it within a hierarchical domain-task ontology. We then review datasets, evaluation dimensions, and metrics, highlighting their limitations. Next, we categorize methods for improving generalizability into three groups: methods for the backbone LLM, for agent components, and for their interactions. Moreover, we introduce the distinction between generalizable frameworks and generalizable agents and outline how generalizable frameworks can be translated into agent-level generalizability. Finally, we identify critical challenges and future directions, including developing standardized frameworks, variance- and cost-based metrics, and approaches that integrate methodological innovations with architecture-level designs. By synthesizing progress and highlighting opportunities, this survey aims to establish a foundation for principled research on building LLM-based agents that generalize reliably across diverse applications.
zh
[AI-114] On the Non-Uniqueness of Representation of (UN)-Implications
【速读】:该论文旨在解决 (U,N)-implications 的表示唯一性问题,即在模糊逻辑系统中,当使用可分配的 uninorm(统一算子)与模糊否定(fuzzy negation, N)构造 (U,N)-implications 时,其是否具有唯一的表示形式。此前研究假设模糊否定连续时,(S,N)-implications 和部分 (U,N)-implications 被认为具有唯一表示,但本文通过反例证明:即使模糊否定连续,(U,N)-implications 也不一定具有唯一表示。解决方案的关键在于对 uninorm 的结构进行系统分析,区分连续与非连续底层数学函数的情形,从而给出完备的唯一性条件判定准则,为模糊蕴含算子的理论建模提供更精确的结构性依据。
链接: https://arxiv.org/abs/2509.16299
作者: Raquel Fernandez-Peralta,Andrea Mesiarová-Zemánková
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fuzzy implication functions constitute fundamental operators in fuzzy logic systems, extending classical conditionals to manage uncertainty in logical inference. Among the extensive families of these operators, generalizations of the classical material implication have received considerable theoretical attention, particularly (S,N) -implications constructed from t-conorms and fuzzy negations, and their further generalizations to (U,N) -implications using disjunctive uninorms. Prior work has established characterization theorems for these families under the assumption that the fuzzy negation N is continuous, ensuring uniqueness of representation. In this paper, we disprove this last fact for (U,N) -implications and we show that they do not necessarily possess a unique representation, even if the fuzzy negation is continuous. Further, we provide a comprehensive study of uniqueness conditions for both uninorms with continuous and non-continuous underlying functions. Our results offer important theoretical insights into the structural properties of these operators.
zh
[AI-115] A global view of diverse construction methods of fuzzy implication functions rooted on F-chains
【速读】:该论文旨在解决模糊蕴含函数(fuzzy implication functions)构造方法多样性带来的理论理解不足问题,即如何系统化地揭示不同构造方法之间的结构关系及其性质保持机制。其解决方案的关键在于提出一种广义的F-链构造方法(generalized F-chain-based construction),该方法通过引入一组模糊蕴含函数而非单一函数,并采用两个不同的单调函数替代原有的唯一F-链,从而实现对多种现有构造技术(如对合性变换、聚合、广义垂直/水平阈值法等)的统一建模与形式化表达,进而为模糊蕴含函数的构造提供一个具有广泛适用性和理论深度的框架。
链接: https://arxiv.org/abs/2509.16298
作者: Raquel Fernandez-Peralta,Juan Vicente Riera
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Fuzzy implication functions are one of the most important operators used in the fuzzy logic framework. While their flexible definition allows for diverse families with distinct properties, this variety needs a deeper theoretical understanding of their structural relationships. In this work, we focus on the study of construction methods, which employ different techniques to generate new fuzzy implication functions from existing ones. Particularly, we generalize the F -chain-based construction, recently introduced by Mesiar et al. to extend a method for constructing aggregation functions to the context of fuzzy implication functions. Our generalization employs collections of fuzzy implication functions rather than single ones, and uses two different increasing functions instead of a unique F -chain. We analyze property preservation under this construction and establish sufficient conditions. Furthermore, we demonstrate that our generalized F -chain-based construction is a unifying framework for several existing methods. In particular, we show that various construction techniques, such as contraposition, aggregation, and generalized vertical/horizontal threshold methods, can be reformulated within our approach. This reveals structural similarities between seemingly distinct construction strategies and provides a cohesive perspective on fuzzy implication construction methods.
zh
[AI-116] Robust LLM Training Infrastructure at ByteDance
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)训练过程中因资源规模扩大而引发的高频率故障(如CUDA错误、NaN值、任务挂起等)对训练稳定性造成的挑战,目标是实现最小化训练中断、高效故障诊断与有效容错,从而保障LLM训练的连续性和效率。其解决方案的关键在于提出ByteRobust系统,该系统针对LLM训练过程的独特性,优先设计了常规化的故障检测与恢复机制,并利用LLM训练中的并行特性与结构特征,结合数据驱动方法实现了高容量容错、快速故障定位与边界划分,从而在部署于超20万GPU的生产平台上实现了97%的预期训练时间恢复率(ETTR),显著提升了大规模LLM训练的鲁棒性与效率。
链接: https://arxiv.org/abs/2509.16293
作者: Borui Wan,Gaohong Liu,Zuquan Song,Jun Wang,Yun Zhang,Guangming Sheng,Shuguang Wang,Houmin Wei,Chenyuan Wang,Weiqiang Lou,Xi Yang,Mofan Zhang,Kaihua Jiang,Cheng Ren,Xiaoyun Zhi,Menghan Yu,Zhe Nan,Zhuolin Zheng,Baoquan Zhong,Qinlong Wang,Huan Yu,Jinxin Chi,Wang Zhang,Yuhan Li,Zixian Du,Sida Zhao,Yongqiang Zhang,Jingzhe Tang,Zherui Liu,Chuan Wu,Yanghua Peng,Haibin Lin,Wencong Xiao,Xin Liu,Liang Xiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:
Abstract:The training scale of large language models (LLMs) has reached tens of thousands of GPUs and is still continuously expanding, enabling faster learning of larger models. Accompanying the expansion of the resource scale is the prevalence of failures (CUDA error, NaN values, job hang, etc.), which poses significant challenges to training stability. Any large-scale LLM training infrastructure should strive for minimal training interruption, efficient fault diagnosis, and effective failure tolerance to enable highly efficient continuous training. This paper presents ByteRobust, a large-scale GPU infrastructure management system tailored for robust and stable training of LLMs. It exploits the uniqueness of LLM training process and gives top priorities to detecting and recovering failures in a routine manner. Leveraging parallelisms and characteristics of LLM training, ByteRobust enables high-capacity fault tolerance, prompt fault demarcation, and localization with an effective data-driven approach, comprehensively ensuring continuous and efficient training of LLM tasks. ByteRobust is deployed on a production GPU platform with over 200,000 GPUs and achieves 97% ETTR for a three-month training job on 9,600 GPUs.
zh
[AI-117] Identifying Critical Pathways in Coronary Heart Disease via Fuzzy Subgraph Connectivity
【速读】:该论文旨在解决冠心病(Coronary Heart Disease, CHD)风险预测中因不可控因素、可控生活方式因素与临床指标之间复杂且不确定的相互作用而导致的建模难题。其解决方案的关键在于构建一个模糊冠心病图(fuzzy CHD graph),其中顶点代表不同类别的风险因素,边权重由模糊隶属度表示,并引入模糊子图连通性(Fuzzy Subgraph Connectivity, FSC)作为核心分析工具,以量化顶点与子图间的关联强度,从而识别最强诊断路径、主导风险因子及关键桥梁边,实现对CHD风险不确定性的系统刻画与可解释建模。
链接: https://arxiv.org/abs/2509.16288
作者: Shanookha Ali,Nitha Niralda P C
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Coronary heart disease (CHD) arises from complex interactions among uncontrollable factors, controllable lifestyle factors, and clinical indicators, where relationships are often uncertain. Fuzzy subgraph connectivity (FSC) provides a systematic tool to capture such imprecision by quantifying the strength of association between vertices and subgraphs in fuzzy graphs. In this work, a fuzzy CHD graph is constructed with vertices for uncontrollable, controllable, and indicator components, and edges weighted by fuzzy memberships. Using FSC, we evaluate connectivity to identify strongest diagnostic routes, dominant risk factors, and critical bridges. Results show that FSC highlights influential pathways, bounds connectivity between weakest and strongest correlations, and reveals critical edges whose removal reduces predictive strength. Thus, FSC offers an interpretable and robust framework for modeling uncertainty in CHD risk prediction and supporting clinical decision-making.
zh
[AI-118] Energy Equity Infrastructure and Demographic Analysis with XAI Methods
【速读】:该论文旨在解决能源负担(energy burden)问题,即家庭在能源上的支出占中位数家庭收入的比例过高所引发的公平性挑战。研究通过可解释人工智能(Explainable Artificial Intelligence, XAI)方法,如决策树和皮尔逊相关系数(Pearson’s correlation coefficient, PCC),对多地区用电行为与社会人口学特征进行分析,识别出影响能源负担的关键可解释因素。其解决方案的关键在于利用XAI技术构建一个原型能源公平网络门户及新型能源负担计算器,能够为不同利益相关者提供定制化的、可操作的建议,从而推动能源公平性的提升。
链接: https://arxiv.org/abs/2509.16279
作者: Sarahana Shrestha,Aparna S. Varde,Pankaj Lal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This study deploys methods in explainable artificial intelligence (XAI), e.g. decision trees and Pearson’s correlation coefficient (PCC), to investigate electricity usage in multiple locales. It addresses the vital issue of energy burden, i.e. total amount spent on energy divided by median household income. Socio-demographic data is analyzed with energy features, especially using decision trees and PCC, providing explainable predictors on factors affecting energy burden. Based on the results of the analysis, a pilot energy equity web portal is designed along with a novel energy burden calculator. Leveraging XAI, this portal (with its calculator) serves as a prototype information system that can offer tailored actionable advice to multiple energy stakeholders. The ultimate goal of this study is to promote greater energy equity through the adaptation of XAI methods for energy-related analysis with suitable recommendations.
zh
[AI-119] Stabilizing Information Flow Entropy: Regularization for Safe and Interpretable Autonomous Driving Perception
【速读】:该论文旨在解决自动驾驶中深度感知网络因数据密集型训练和事后异常检测机制而忽视信息论约束所导致的稳定性不足问题。其核心解决方案在于将深度神经编码器重构为分层通信链,通过理论推导提出两个设计原则:(D1)连续层间互信息平滑变化,以及(D2)随着网络深度增加潜在熵单调衰减。关键创新在于发现,在典型架构假设下(如重复结构块),强制实现信息流平滑性(D1)可自然诱导熵衰减(D2),从而保障稳定压缩;基于此,作者进一步提出Eloss——一种轻量级、可插拔的熵正则化训练目标,不仅提升感知准确性,更实现了对传感器输入异常的显式、原理性检测,显著增强分布偏移敏感度(最高达两个数量级)。
链接: https://arxiv.org/abs/2509.16277
作者: Haobo Yang,Shiyan Zhang,Zhuoyi Yang,Jilong Guo,Jun Yang,Xinyu Zhang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Deep perception networks in autonomous driving traditionally rely on data-intensive training regimes and post-hoc anomaly detection, often disregarding fundamental information-theoretic constraints governing stable information processing. We reconceptualize deep neural encoders as hierarchical communication chains that incrementally compress raw sensory inputs into task-relevant latent features. Within this framework, we establish two theoretically justified design principles for robust perception: (D1) smooth variation of mutual information between consecutive layers, and (D2) monotonic decay of latent entropy with network depth. Our analysis shows that, under realistic architectural assumptions, particularly blocks comprising repeated layers of similar capacity, enforcing smooth information flow (D1) naturally encourages entropy decay (D2), thus ensuring stable compression. Guided by these insights, we propose Eloss, a novel entropy-based regularizer designed as a lightweight, plug-and-play training objective. Rather than marginal accuracy improvements, this approach represents a conceptual shift: it unifies information-theoretic stability with standard perception tasks, enabling explicit, principled detection of anomalous sensor inputs through entropy deviations. Experimental validation on large-scale 3D object detection benchmarks (KITTI and nuScenes) demonstrates that incorporating Eloss consistently achieves competitive or improved accuracy while dramatically enhancing sensitivity to anomalies, amplifying distribution-shift signals by up to two orders of magnitude. This stable information-compression perspective not only improves interpretability but also establishes a solid theoretical foundation for safer, more robust autonomous driving perception systems.
zh
[AI-120] Comparative Analysis of STEM and non-STEM Teachers Needs for Integrating AI into Educational Environments
【速读】:该论文试图解决当前K-12教育中常用块状编程(Block-Based Programming, BBP)平台在集成人工智能(AI)功能和跨学科适应性方面的不足,从而无法有效支持教师教学与学生学习的问题。解决方案的关键在于通过整合AI增强功能,如智能评估(包括完整性检查、抄袭检测、自定义评分量规和详细反馈)、课程资源生成(如更新教学内容、构建辅导知识库及生成式AI工具),以及学生行为监控(如桌面控制、日常追踪与干扰预防),来提升BBP平台的个性化、互动性和教学效率。研究发现,STEM与非STEM教师对AI功能的需求存在共性与差异,尤其非STEM教师更关注创意任务支持和质性评价,这为开发更具包容性和适应性的AI赋能教育平台提供了实证依据。
链接: https://arxiv.org/abs/2509.16276
作者: Bahare Riahi,Veronica Catete
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 16 pages, 3 figures, Published in HCII 2025 Conference Proceedings
Abstract:There is an increasing imperative to integrate programming platforms within AI frameworks to enhance educational tasks for both teachers and students. However, commonly used platforms such as this http URL, Scratch, and Snap fall short of providing the desired AI features and lack adaptability for interdisciplinary applications. This study explores how educational platforms can be improved by incorporating AI and analytics features to create more effective learning environments across various subjects and domains. We interviewed 8 K-12 teachers and asked their practices and needs while using any block-based programming (BBP) platform in their classes. We asked for their approaches in assessment, course development and expansion of resources, and student monitoring in their classes. Thematic analysis of the interview transcripts revealed both commonalities and differences in the AI tools needed between the STEM and non-STEM groups. Our results indicated advanced AI features that could promote BBP platforms. Both groups stressed the need for integrity and plagiarism checks, AI adaptability, customized rubrics, and detailed feedback in assessments. Non-STEM teachers also emphasized the importance of creative assignments and qualitative assessments. Regarding resource development, both AI tools desired for updating curricula, tutoring libraries, and generative AI features. Non-STEM teachers were particularly interested in supporting creative endeavors, such as art simulations. For student monitoring, both groups prioritized desktop control, daily tracking, behavior monitoring, and distraction prevention tools. Our findings identify specific AI-enhanced features needed by K-12 teachers across various disciplines and lay the foundation for creating more efficient, personalized, and engaging educational experiences.
zh
[AI-121] SecureFixAgent : A Hybrid LLM Agent for Automated Python Static Vulnerability Repair ICML
【速读】:该论文旨在解决现代软件开发流水线中大型代码库依赖复杂导致的安全漏洞检测与修复难题,特别是传统静态分析工具(如Bandit)存在误报率高、缺乏自动修复能力,而大语言模型(LLM)虽能提出修复建议但易产生幻觉且无自验证机制的问题。解决方案的关键在于提出SecureFixAgent这一混合修复框架,其核心创新是构建一个迭代式的“检测-修复-验证”闭环:利用Bandit进行精准漏洞检测,轻量级本地LLM(8B参数)生成带解释的候选修复方案,并通过Bandit重新验证修复效果,同时采用LoRA微调技术在多样化、精选的数据集上优化模型,从而显著降低误报率并提升修复准确性,最终实现隐私保护、资源高效与人类可信任的自动化漏洞修复。
链接: https://arxiv.org/abs/2509.16275
作者: Jugal Gajjar,Kamalasankari Subramaniakuppusamy,Relsy Puthal,Kaustik Ranaware
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: 6 pages, 3 figures, 4 tables, 1 algorithm, accepted in the Robustness and Security of Large Language Models (ROSE-LLM) special session at ICMLA 2025
Abstract:Modern software development pipelines face growing challenges in securing large codebases with extensive dependencies. Static analysis tools like Bandit are effective at vulnerability detection but suffer from high false positives and lack repair capabilities. Large Language Models (LLMs), in contrast, can suggest fixes but often hallucinate changes and lack self-validation. We present SecureFixAgent, a hybrid repair framework integrating Bandit with lightweight local LLMs (8B parameters) in an iterative detect-repair-validate loop. To improve precision, we apply parameter-efficient LoRA-based fine-tuning on a diverse, curated dataset spanning multiple Python project domains, mitigating dataset bias and reducing unnecessary edits. SecureFixAgent uses Bandit for detection, the LLM for candidate fixes with explanations, and Bandit re-validation for verification, all executed locally to preserve privacy and reduce cloud reliance. Experiments show SecureFixAgent reduces false positives by 10.8% over static analysis, improves fix accuracy by 13.51%, and lowers false positives by 5.46% compared to pre-trained LLMs, typically converging within three iterations. Beyond metrics, developer studies rate explanation quality 4.5/5, highlighting its value for human trust and adoption. By combining verifiable security improvements with transparent rationale in a resource-efficient local framework, SecureFixAgent advances trustworthy, automated vulnerability remediation for modern pipelines.
zh
[AI-122] SubDyve: Subgraph-Driven Dynamic Propagation for Virtual Screening Enhancement Controlling False Positive
【速读】:该论文旨在解决低标签(low-label)环境下虚拟筛选(Virtual Screening, VS)的挑战,即在仅知少量活性化合物的情况下,如何高效准确地从大规模化学库中识别出具有生物活性的分子。传统方法依赖通用分子指纹,忽视了对生物活性具有判别性的子结构信息,并且将分子视为独立个体,难以在数据稀缺时发挥效能。其解决方案的关键在于提出SubDyve框架,通过构建基于子图感知的相似性网络,利用已知活性化合物作为种子,迭代式地进行种子精炼(seed refinement),并结合局部错误发现率(local false discovery rate)控制假阳性,从而逐步扩展高潜力候选集,同时抑制因拓扑偏倚和过度扩张带来的误判,显著提升了低标签场景下的筛选性能。
链接: https://arxiv.org/abs/2509.16273
作者: Jungseob Yi,Seoyoung Choi,Sun Kim,Sangseon Lee
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 33 pages, 12 figures
Abstract:Virtual screening (VS) aims to identify bioactive compounds from vast chemical libraries, but remains difficult in low-label regimes where only a few actives are known. Existing methods largely rely on general-purpose molecular fingerprints and overlook class-discriminative substructures critical to bioactivity. Moreover, they consider molecules independently, limiting effectiveness in low-label regimes. We introduce SubDyve, a network-based VS framework that constructs a subgraph-aware similarity network and propagates activity signals from a small known actives. When few active compounds are available, SubDyve performs iterative seed refinement, incrementally promoting new candidates based on local false discovery rate. This strategy expands the seed set with promising candidates while controlling false positives from topological bias and overexpansion. We evaluate SubDyve on ten DUD-E targets under zero-shot conditions and on the CDK7 target with a 10-million-compound ZINC dataset. SubDyve consistently outperforms existing fingerprint or embedding-based approaches, achieving margins of up to +34.0 on the BEDROC and +24.6 on the EF1% metric.
zh
[AI-123] Digging Into the Internal: Causality-Based Analysis of LLM Function Calling
【速读】:该论文旨在解决生成式 AI(Generative AI)在实际应用中因指令遵循能力不足而导致的安全性与可靠性问题,特别是如何提升大语言模型(Large Language Models, LLMs)对恶意输入的检测能力。其解决方案的关键在于引入因果分析方法,通过层级和词元层级的因果干预(causal intervention),深入解析函数调用(Function Calling, FC)机制如何影响LLM内部计算逻辑,并据此设计更有效的指令引导策略。实验表明,相较于传统提示方法,FC能显著增强LLM对用户指令的合规性,平均提升恶意输入检测性能约135%,验证了其在提升LLM安全鲁棒性方面的有效性。
链接: https://arxiv.org/abs/2509.16268
作者: Zhenlan Ji,Daoyuan Wu,Wenxuan Wang,Pingchuan Ma,Shuai Wang,Lei Ma
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Function calling (FC) has emerged as a powerful technique for facilitating large language models (LLMs) to interact with external systems and perform structured tasks. However, the mechanisms through which it influences model behavior remain largely under-explored. Besides, we discover that in addition to the regular usage of FC, this technique can substantially enhance the compliance of LLMs with user instructions. These observations motivate us to leverage causality, a canonical analysis method, to investigate how FC works within LLMs. In particular, we conduct layer-level and token-level causal interventions to dissect FC’s impact on the model’s internal computational logic when responding to user queries. Our analysis confirms the substantial influence of FC and reveals several in-depth insights into its mechanisms. To further validate our findings, we conduct extensive experiments comparing the effectiveness of FC-based instructions against conventional prompting methods. We focus on enhancing LLM safety robustness, a critical LLM application scenario, and evaluate four mainstream LLMs across two benchmark datasets. The results are striking: FC shows an average performance improvement of around 135% over conventional prompting methods in detecting malicious inputs, demonstrating its promising potential to enhance LLM reliability and capability in practical applications.
zh
[AI-124] Socratic Mind: Impact of a Novel GenAI-Powered Assessment Tool on Student Learning and Higher-Order Thinking
【速读】:该论文旨在解决如何利用生成式人工智能(Generative AI)驱动的形成性评估工具提升在线高等教育中学生的学习参与度与高阶认知能力的问题。其解决方案的关键在于设计并应用一种名为“苏格拉底思维”(Socratic Mind)的对话式AI工具,该工具通过苏格拉底式提问(Socratic questioning)引导学生进行反思性学习,从而增强情感、行为和认知层面的参与,并促进问题解决、批判性思维等高阶思维技能的发展。实证研究表明,该工具不仅显著提升了学生的测验成绩,尤其对基础较弱的学生效果更明显,还通过质性反馈验证了其在促进深层学习方面的潜力。
链接: https://arxiv.org/abs/2509.16262
作者: Jeonghyun Lee,Jui-Tse Hung,Meryem Yilmaz Soylu,Diana Popescu,Christopher Zhang Cui,Gayane Grigoryan,David A Joyner,Stephen W Harmon
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:This study examines the impact of Socratic Mind, a Generative Artificial Intelligence (GenAI) powered formative assessment tool that employs Socratic questioning to support student learning in a large, fully online undergraduate-level computing course. Employing a quasi-experimental, mixed-methods design, we investigated participants’ engagement patterns, the influence of user experience on engagement, and impacts on both perceived and actual learning outcomes. Data were collected from the system logs, surveys on user experience and perceived engagement and learning gains, student reflections, and course performance data. Results indicated that participants consistently reported high levels of affective, behavioral, and cognitive engagement, and these were strongly linked to positive user experiences and perceived learning outcomes. Quantitative analysis further revealed that students who engaged with the GenAI tool experienced significant gains in their quiz scores compared to those who did not, particularly benefiting students with lower baseline achievement. Additionally, thematic analysis of qualitative feedback revealed substantial perceived improvements in higher-order thinking skills, including problem solving, critical thinking, and self-reflection. Our findings highlight the promise of AI-mediated dialogue in fostering deeper engagement and higher-order cognitive skills. As higher education institutions expand GenAI integration in curriculum, this dialogic, GenAI powered assessment tool can offer a scalable strategy to promote students’ meaningful learning outcomes.
zh
[AI-125] Discovering Software Parallelization Points Using Deep Neural Networks
【速读】:该论文旨在解决编程代码中循环结构(loop)的并行化潜力自动识别问题,即判断哪些循环具备并行执行的可能性,从而为软件性能优化提供支持。其解决方案的关键在于构建一个基于深度学习的分类框架:首先利用两种基于遗传算法(genetic algorithm)的代码生成器分别生成可并行循环(independent loops)和依赖关系模糊的循环(ambiguous loops),形成具有代表性的训练数据集;随后对代码片段进行分词与预处理,并采用深度神经网络(Deep Neural Network, DNN)和卷积神经网络(Convolutional Neural Network, CNN)进行模型训练与分类,实验表明CNN在平均性能上略优,且二者变异程度相当,验证了深度学习方法在自动化识别可并行循环结构方面的可行性与有效性。
链接: https://arxiv.org/abs/2509.16215
作者: Izavan dos S. Correia,Henrique C. T. Santos,Tiago A. E. Ferreira
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC); Neural and Evolutionary Computing (cs.NE); Programming Languages (cs.PL); Software Engineering (cs.SE)
备注: 17 pages, 10 figures
Abstract:This study proposes a deep learning-based approach for discovering loops in programming code according to their potential for parallelization. Two genetic algorithm-based code generators were developed to produce two distinct types of code: (i) independent loops, which are parallelizable, and (ii) ambiguous loops, whose dependencies are unclear, making them impossible to define if the loop is parallelizable or not. The generated code snippets were tokenized and preprocessed to ensure a robust dataset. Two deep learning models - a Deep Neural Network (DNN) and a Convolutional Neural Network (CNN) - were implemented to perform the classification. Based on 30 independent runs, a robust statistical analysis was employed to verify the expected performance of both models, DNN and CNN. The CNN showed a slightly higher mean performance, but the two models had a similar variability. Experiments with varying dataset sizes highlighted the importance of data diversity for model performance. These results demonstrate the feasibility of using deep learning to automate the identification of parallelizable structures in code, offering a promising tool for software optimization and performance improvement.
zh
[AI-126] DarwinWafer: A Wafer-Scale Neuromorphic Chip
【速读】:该论文旨在解决当前神经形态计算系统中因多芯片架构依赖印刷电路板(PCB)级互连而导致的带宽、延迟和能耗严重受限的问题,这些问题显著削弱了生物启发算法的效率与可扩展性。解决方案的关键在于提出 DarwinWafer——一种晶圆级系统级封装(System-on-Wafer)架构,通过在300 mm硅中介层上高密度集成64个Darwin3芯片小片(chiplet),用晶圆级互连替代传统PCB互连;其核心创新包括:每个芯片小片内部采用全局异步局部同步(GALS)网络(NoC)与基于事件驱动的异步晶圆 Fabric(AER-based asynchronous wafer fabric),结合分层时间步同步机制,实现低延迟、一致性的全晶圆协同操作;同时,通过芯片-中介层协同设计流程(含自研凸点规划工具及早期信号完整性/电源完整性与热电闭合分析)以及抗翘曲组装工艺(利用PCBlets和柔性探针连接实现I/O扇出),实现了高可靠性、可拆卸的晶圆到板集成。实测表明,系统在~100 W功耗下保持稳定供电(<10 mV压降)与均匀温升(34–36 °C),并成功支持全脑仿真任务,验证了该方案在大规模、类脑计算中的可行性与优越性。
链接: https://arxiv.org/abs/2509.16213
作者: Xiaolei Zhu,Xiaofei Jin,Ziyang Kang,Chonghui Sun,Junjie Feng,Dingwen Hu,Zengyi Wang,Hanyue Zhuang,Qian Zheng,Huajin Tang,Shi Gu,Xin Du,De Ma,Gang Pan
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:Neuromorphic computing promises brain-like efficiency, yet today’s multi-chip systems scale over PCBs and incur orders-of-magnitude penalties in bandwidth, latency, and energy, undermining biological algorithms and system efficiency. We present DarwinWafer, a hyperscale system-on-wafer that replaces off-chip interconnects with wafer-scale, high-density integration of 64 Darwin3 chiplets on a 300 mm silicon interposer. A GALS NoC within each chiplet and an AER-based asynchronous wafer fabric with hierarchical time-step synchronization provide low-latency, coherent operation across the wafer. Each chiplet implements 2.35 M neurons and 0.1 B synapses, yielding 0.15 B neurons and 6.4 B synapses per this http URL 333 MHz and 0.8 V, DarwinWafer consumes ~100 W and achieves 4.9 pJ/SOP, with 64 TSOPS peak throughput (0.64 TSOPS/W). Realization is enabled by a holistic chiplet-interposer co-design flow (including an in-house interposer-bump planner with early SI/PI and electro-thermal closure) and a warpage-tolerant assembly that fans out I/O via PCBlets and compliant pogo-pin connections, enabling robust, demountable wafer-to-board integration. Measurements confirm 10 mV supply droop and a uniform thermal profile (34-36 °C) under ~100 W. Application studies demonstrate whole-brain simulations: two zebrafish brains per chiplet with high connectivity fidelity (Spearman r = 0.896) and a mouse brain mapped across 32 chiplets (r = 0.645). To our knowledge, DarwinWafer represents a pioneering demonstration of wafer-scale neuromorphic computing, establishing a viable and scalable path toward large-scale, brain-like computation on silicon by replacing PCB-level interconnects with high-density, on-wafer integration.
zh
[AI-127] EPIC: Generative AI Platform for Accelerating HPC Operational Data Analytics
【速读】:该论文旨在解决高性能计算(HPC)系统中操作数据分析方法的局限性,即现有方法依赖静态分析流程,难以适应不断变化的分析任务和利益相关者需求。其解决方案的关键在于提出EPIC平台,采用分层多智能体架构:顶层大型语言模型(Large Language Model, LLM)负责查询处理、推理与综合,协同三个低层专用智能体——信息检索、描述性分析和预测性分析——实现对文本、图像和表格等多模态数据的动态、迭代式分析。通过在Frontier HPC系统上的评估,验证了该架构在复杂查询场景下的有效性,并发现微调的小型模型在描述性分析任务中可比最先进的大模型提升最高达26%的准确率,同时结合大模型与本地微调开源模型的混合策略,使LLM运行成本降低19倍,显著优于专有方案。
链接: https://arxiv.org/abs/2509.16212
作者: Ahmad Maroof Karimi,Woong Shin,Jesse Hines,Tirthankar Ghosal,Naw Safrin Sattar,Feiyi Wang
机构: 未知
类目: Databases (cs.DB); Artificial Intelligence (cs.AI)
备注:
Abstract:We present EPIC, an AI-driven platform designed to augment operational data analytics. EPIC employs a hierarchical multi-agent architecture where a top-level large language model provides query processing, reasoning and synthesis capabilities. These capabilities orchestrate three specialized low-level agents for information retrieval, descriptive analytics, and predictive analytics. This architecture enables EPIC to perform HPC operational analytics on multi-modal data, including text, images, and tabular formats, dynamically and iteratively. EPIC addresses the limitations of existing HPC operational analytics approaches, which rely on static methods that struggle to adapt to evolving analytics tasks and stakeholder demands. Through extensive evaluations on the Frontier HPC system, we demonstrate that EPIC effectively handles complex queries. Using descriptive analytics as a use case, fine-tuned smaller models outperform large state-of-the-art foundation models, achieving up to 26% higher accuracy. Additionally, we achieved 19x savings in LLM operational costs compared to proprietary solutions by employing a hybrid approach that combines large foundational models with fine-tuned local open-weight models. Subjects: Databases (cs.DB); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.16212 [cs.DB] (or arXiv:2509.16212v1 [cs.DB] for this version) https://doi.org/10.48550/arXiv.2509.16212 Focus to learn more arXiv-issued DOI via DataCite
zh
[AI-128] Breast Cancer Classification Using Gradient Boosting Algorithms Focusing on Reducing the False Negative and SHAP for Explainability
【速读】:该论文旨在解决乳腺癌(breast cancer)早期预测中模型性能评估单一依赖准确率(accuracy)可能带来的局限性问题,强调在医疗场景下更关注召回率(recall)以降低漏诊风险。其解决方案的关键在于:采用四种先进的集成学习 boosting 算法(AdaBoost、XGBoost、CatBoost 和 LightGBM)构建分类器,并结合 Optuna 实现超参数优化以提升模型性能,同时引入 SHAP(SHapley Additive exPlanations)方法增强模型可解释性,从而在保证高 AUC(>99.41%)的同时显著改善召回率并减少假阴性(False Negative),为临床辅助诊断提供更可靠、透明的决策支持工具。
链接: https://arxiv.org/abs/2403.09548
作者: João Manoel Herrera Pinheiro,Marcelo Becker
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Quantitative Methods (q-bio.QM)
备注: 9 pages, 16 figures
Abstract:Cancer is one of the diseases that kill the most women in the world, with breast cancer being responsible for the highest number of cancer cases and consequently deaths. However, it can be prevented by early detection and, consequently, early treatment. Any development for detection or perdition this kind of cancer is important for a better healthy life. Many studies focus on a model with high accuracy in cancer prediction, but sometimes accuracy alone may not always be a reliable metric. This study implies an investigative approach to studying the performance of different machine learning algorithms based on boosting to predict breast cancer focusing on the recall metric. Boosting machine learning algorithms has been proven to be an effective tool for detecting medical diseases. The dataset of the University of California, Irvine (UCI) repository has been utilized to train and test the model classifier that contains their attributes. The main objective of this study is to use state-of-the-art boosting algorithms such as AdaBoost, XGBoost, CatBoost and LightGBM to predict and diagnose breast cancer and to find the most effective metric regarding recall, ROC-AUC, and confusion matrix. Furthermore, our study is the first to use these four boosting algorithms with Optuna, a library for hyperparameter optimization, and the SHAP method to improve the interpretability of our model, which can be used as a support to identify and predict breast cancer. We were able to improve AUC or recall for all the models and reduce the False Negative for AdaBoost and LigthGBM the final AUC were more than 99.41% for all models.
zh
[AI-129] Deep Learning as the Disciplined Construction of Tame Objects
【速读】:该论文旨在解决深度学习模型中优化算法(特别是随机梯度下降,Stochastic Gradient Descent, SGD)在非光滑、非凸但具有“可 tame 性”(tame)结构下的收敛性保障问题。其解决方案的关键在于引入并应用可 tame 几何(tame geometry,亦称 o-最小性)的数学工具,将深度学习模型视为函数的复合结构,并在此框架下建立适用于一般非光滑非凸优化问题的收敛性理论,从而为深度学习提供更坚实的数学基础和理论支撑。
链接: https://arxiv.org/abs/2509.18025
作者: Gilles Bareilles,Allen Gehret,Johannes Aspman,Jana Lepšová,Jakub Mareček
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic (math.LO); Machine Learning (stat.ML)
备注: 35 pages, 8 figures
Abstract:One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth nonconvex, but tame, setting. This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within Deep Learning.
zh
[AI-130] SongPrep: A Preprocessing Framework and End-to-end Model for Full-song Structure Parsing and Lyrics Transcription
【速读】:该论文旨在解决歌曲生成(Song Generation)领域中训练数据准备效率低下的问题,即现有方法依赖大量人工标注来处理音频和歌词数据,导致成本高、耗时长。其核心解决方案是提出SongPrep自动化预处理流水线,可自动完成源分离(Source Separation)、结构分析与歌词识别等关键步骤,生成结构化训练数据;进一步地,引入基于预训练语言模型的端到端歌词识别模型SongPrepE2E,无需额外源分离即可精准提取整首歌曲的结构信息与带时间戳的歌词内容,从而显著降低语音分割错误率(DER)和词错误率(WER),并提升下游歌曲生成模型的生成质量。
链接: https://arxiv.org/abs/2509.17404
作者: Wei Tan,Shun Lei,Huaicheng Zhang,Guangzheng Li,Yixuan Zhang,Hangting Chen,Jianwei Yu,Rongzhi Gu,Dong Yu
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Artificial Intelligence Generated Content (AIGC) is currently a popular research area. Among its various branches, song generation has attracted growing interest. Despite the abundance of available songs, effective data preparation remains a significant challenge. Converting these songs into training-ready datasets typically requires extensive manual labeling, which is both time consuming and costly. To address this issue, we propose SongPrep, an automated preprocessing pipeline designed specifically for song data. This framework streamlines key processes such as source separation, structure analysis, and lyric recognition, producing structured data that can be directly used to train song generation models. Furthermore, we introduce SongPrepE2E, an end-to-end structured lyrics recognition model based on pretrained language models. Without the need for additional source separation, SongPrepE2E is able to analyze the structure and lyrics of entire songs and provide precise timestamps. By leveraging context from the whole song alongside pretrained semantic knowledge, SongPrepE2E achieves low Diarization Error Rate (DER) and Word Error Rate (WER) on the proposed SSLD-200 dataset. Downstream tasks demonstrate that training song generation models with the data output by SongPrepE2E enables the generated songs to closely resemble those produced by humans.
zh
[AI-131] From Prediction to Understanding: Will AI Foundation Models Transform Brain Science?
【速读】:该论文试图解决的问题是:如何将生成式预训练模型(即基础模型,foundation models)有效整合到脑科学中,以推动对神经活动和认知机制的科学理解,而不仅仅是实现高预测准确性。其解决方案的关键在于从“预测”向“解释”的转变——即通过建立模型计算过程与神经活动及认知机制之间的因果关联,使基础模型不仅能准确预测行为或脑数据,还能揭示潜在的计算原理,从而为脑科学提供可解释的理论框架。
链接: https://arxiv.org/abs/2509.17280
作者: Thomas Serre,Ellie Pavlick
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative pretraining (the “GPT” in ChatGPT) enables language models to learn from vast amounts of internet text without human supervision. This approach has driven breakthroughs across AI by allowing deep neural networks to learn from massive, unstructured datasets. We use the term foundation models to refer to large pretrained systems that can be adapted to a wide range of tasks within and across domains, and these models are increasingly applied beyond language to the brain sciences. These models achieve strong predictive accuracy, raising hopes that they might illuminate computational principles. But predictive success alone does not guarantee scientific understanding. Here, we outline how foundation models can be productively integrated into the brain sciences, highlighting both their promise and their limitations. The central challenge is to move from prediction to explanation: linking model computations to mechanisms underlying neural activity and cognition.
zh
[AI-132] Agent ic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator
【速读】:该论文旨在解决加速器物理实验中多阶段任务自动化执行效率低、依赖专家手动编程且易引入安全风险的问题。解决方案的关键在于构建首个基于语言模型的代理型人工智能(agentic AI)系统,其核心特征包括:计划先行(plan-first orchestration)的调度机制、受限工具访问(bounded tool access)以保障安全性,以及动态能力选择(dynamic capability selection)实现灵活适应不同实验需求。该系统能将自然语言指令转化为结构化执行计划,自动整合历史数据检索、控制系统通道解析、脚本生成、机器交互与分析等环节,在显著缩短准备时间(达两个数量级)的同时严格遵守操作标准安全约束,从而为加速器实验及更广泛的大科学装置提供可复现、可审计、可迁移的安全AI集成范式。
链接: https://arxiv.org/abs/2509.17255
作者: Thorsten Hellert,Drew Bertwistle,Simon C. Leemann,Antonin Sulc,Marco Venturini
机构: 未知
类目: Accelerator Physics (physics.acc-ph); Artificial Intelligence (cs.AI)
备注:
Abstract:We present the first language-model-driven agentic artificial intelligence (AI) system to autonomously execute multi-stage physics experiments on a production synchrotron light source. Implemented at the Advanced Light Source particle accelerator, the system translates natural language user prompts into structured execution plans that combine archive data retrieval, control-system channel resolution, automated script generation, controlled machine interaction, and analysis. In a representative machine physics task, we show that preparation time was reduced by two orders of magnitude relative to manual scripting even for a system expert, while operator-standard safety constraints were strictly upheld. Core architectural features, plan-first orchestration, bounded tool access, and dynamic capability selection, enable transparent, auditable execution with fully reproducible artifacts. These results establish a blueprint for the safe integration of agentic AI into accelerator experiments and demanding machine physics studies, as well as routine operations, with direct portability across accelerators worldwide and, more broadly, to other large-scale scientific infrastructures.
zh
[AI-133] MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
【速读】:该论文旨在解决传统语音转换(Voice Conversion, VC)模型在零样本场景下难以灵活控制多维度声学特征(如说话人身份、语言内容和韵律)的问题。现有方法通常依赖固定的条件输入机制,限制了对目标语音属性的精细调节能力。其解决方案的关键在于提出MaskVCT模型,通过引入多种无分类器引导(Classifier-Free Guidance, CFG)机制,在单一模型中集成多种可选条件输入,包括连续或量化后的语言特征以提升语义清晰度与说话人相似性,以及可选择性地使用或忽略基频轮廓(pitch contour)来调控韵律特性。这种模块化设计使用户能够在不重新训练模型的前提下,动态平衡说话人保真度、语言内容准确性和韵律自然度,从而实现更灵活、鲁棒且可控的零样本语音转换。
链接: https://arxiv.org/abs/2509.17143
作者: Junhyeok Lee,Helin Wang,Yaohan Guan,Thomas Thebaud,Laureano Moro-Velazquez,Jesús Villalba,Najim Dehak
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI)
备注:
Abstract:We introduce MaskVCT, a zero-shot voice conversion (VC) model that offers multi-factor controllability through multiple classifier-free guidances (CFGs). While previous VC models rely on a fixed conditioning scheme, MaskVCT integrates diverse conditions in a single model. To further enhance robustness and control, the model can leverage continuous or quantized linguistic features to enhance intellgibility and speaker similarity, and can use or omit pitch contour to control prosody. These choices allow users to seamlessly balance speaker identity, linguistic content, and prosodic factors in a zero-shot VC setting. Extensive experiments demonstrate that MaskVCT achieves the best target speaker and accent similarities while obtaining competitive word and character error rates compared to existing baselines. Audio samples are available at this https URL.
zh
[AI-134] textttDiffSyn: A Generative Diffusion Approach to Materials Synthesis Planning
【速读】:该论文旨在解决结晶材料(如沸石)合成过程中面临的高维合成空间、复杂结构-合成关系以及实验耗时等问题。其核心挑战在于从众多可能的合成路径中找到与目标结构匹配且高效的方案,尤其在存在“多对一”结构-合成映射关系时更为突出。解决方案的关键是提出了一种名为 \textttDiffSyn 的生成式扩散模型(Generative Diffusion Model),该模型基于50年文献中超过23,000条合成配方进行训练,能够根据目标沸石结构和有机模板生成概率较高的合成路线。通过捕捉结构-合成关系的多模态特性,\textttDiffSyn 在区分竞争相和生成最优合成路径方面表现出色,并成功指导了UFI型材料的实验合成,实现了Si/Al比高达19.0的产物,显著优于以往记录,验证了其在提升热稳定性和合成效率方面的潜力。
链接: https://arxiv.org/abs/2509.17094
作者: Elton Pan,Soonhyoung Kwon,Sulin Liu,Mingrou Xie,Alexander J. Hoffman,Yifei Duan,Thorben Prein,Killian Sheriff,Yuriy Roman-Leshkov,Manuel Moliner,Rafael Gomez-Bombarelli,Elsa Olivetti
机构: 未知
类目: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The synthesis of crystalline materials, such as zeolites, remains a significant challenge due to a high-dimensional synthesis space, intricate structure-synthesis relationships and time-consuming experiments. Considering the one-to-many relationship between structure and synthesis, we propose \textttDiffSyn , a generative diffusion model trained on over 23,000 synthesis recipes spanning 50 years of literature. \textttDiffSyn generates probable synthesis routes conditioned on a desired zeolite structure and an organic template. \textttDiffSyn achieves state-of-the-art performance by capturing the multi-modal nature of structure-synthesis relationships. We apply \textttDiffSyn to differentiate among competing phases and generate optimal synthesis routes. As a proof of concept, we synthesize a UFI material using \textttDiffSyn -generated synthesis routes. These routes, rationalized by density functional theory binding energies, resulted in the successful synthesis of a UFI material with a high Si/Al _\textICP of 19.0, which is expected to improve thermal stability and is higher than that of any previously recorded.
zh
[AI-135] Audio-Conditioned Diffusion LLM s for ASR and Deliberation Processing
【速读】:该论文旨在解决自动语音识别(ASR)中传统自回归解码器性能瓶颈的问题,探索基于扩散机制的大语言模型(Diffusion-based Large Language Models, DLLMs)在ASR中的应用潜力。其核心解决方案是引入名为LLaDA的扩散式大语言模型,并将其作为外部推理模块与Whisper-LLaMA系统级联使用,利用其双向注意力机制和去噪能力,结合随机掩码、低置信度掩码及半自回归策略优化语音转录结果。实验表明,该方法在LibriSpeech数据集上显著降低词错误率(WER),尤其在test-other测试集上相对基线提升12.3%,同时验证了音频条件嵌入对模型性能的关键作用——纯文本输入的LLaDA无法提升准确率,凸显了声学特征融合的重要性。
链接: https://arxiv.org/abs/2509.16622
作者: Mengqi Wang,Zhan Liu,Zengrui Jin,Guangzhi Sun,Chao Zhang,Philip C. Woodland
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:
Abstract:Diffusion-based large language models (DLLMs) have recently attracted growing interest as an alternative to autoregressive decoders. In this work, we present an empirical study on using the diffusion-based large language model LLaDA for automatic speech recognition (ASR). We first investigate its use as an external deliberation-based processing module for Whisper-LLaMA transcripts. By leveraging the bidirectional attention and denoising capabilities of LLaDA, we explore random masking, low-confidence masking, and semi-autoregressive strategies, showing that Whisper-LLaDA substantially reduces WER compared with the baseline. On LibriSpeech, the best cascade system achieves 2.25%/4.94% WER on test-clean/test-other, representing a 12.3% relative improvement over the Whisper-LLaMA baseline on the test-other split. In contrast, a plain-text LLaDA without acoustic features fails to improve accuracy, highlighting the importance of audio-conditioned embeddings. We further evaluate Whisper-LLaDA as a standalone decoder for ASR with diffusion-based and semi-autoregressive decoding. Most experimental configurations achieve faster inference than the Whisper-LLaMA baseline, although recognition accuracy is slightly lower. These findings offer an empirical view of diffusion-based LLMs for ASR and point to promising directions for improvements.
zh
[AI-136] LightCode: Compiling LLM Inference for Photonic-Electronic Systems
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在低延迟、高能效推理场景下,如何有效融合光子加速器(如Photonic Tensor Units, PTUs)与电子处理器(如GPU)以构建异构计算架构的问题。当前GPU虽为主流,但难以与新兴的领域专用光子加速器集成,而PTUs具备低功耗、高吞吐线性计算能力,适合LLM中的张量运算。解决方案的关键在于提出LightCode编译框架与模拟器,其核心创新是引入“堆叠图”(Stacked Graph)作为中间表示,显式编码每类张量操作在不同硬件上的多种实现方式,并将硬件分配建模为受参数化成本模型约束的子图选择问题,从而在延迟或能耗目标下优化映射策略。实验表明,在GPT-2和Llama-7B的预填充阶段,该方案可实现最高50%能耗降低、超过10倍的延迟改善,且针对不同优化目标产生差异化的硬件映射结果。
链接: https://arxiv.org/abs/2509.16443
作者: Ryan Tomich,Zhizhen Zhong,Dirk Englund
机构: 未知
类目: Applied Physics (physics.app-ph); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
备注: 9 pages, 8 figures
Abstract:The growing demand for low-latency, energy-efficient inference in large language models (LLMs) has catalyzed interest in heterogeneous architectures. While GPUs remain dominant, they are poorly suited for integration with emerging domain-specific accelerators like the Photonic Tensor Units (PTUs), which offer low-power, high-throughput linear computation. This motivates hybrid compilation strategies that combine photonic and electronic resources. We present LightCode, a compiler framework and simulator for mapping LLM inference workloads across hybrid photonic-electronic systems. LightCode introduces the Stacked Graph, an intermediate representation that encodes multiple hardware-specific realizations of each tensor operation. Hardware assignment is formulated as a constrained subgraph selection problem optimized for latency or energy under parametric cost models. We evaluate LightCode on the prefill stage of GPT-2 and Llama-7B showing that under our workload and hardware assumptions, (i) Photonic hardware reduced energy by up to 50% in our simulated workloads at maximum sequence length; (ii) multiplexing and assignment strategy yielded latency improvements exceeding 10x; and (iii) Optimizing for latency or energy resulted in distinct hardware mappings in our simulations. LightCode offers a module, foundational framework and simulator for compiling LLMs to emerging photonic accelerators.
zh
[AI-137] Imaging Modalities-Based Classification for Lung Cancer Detection
【速读】:该论文旨在解决肺癌(lung cancer)在全球范围内仍是导致癌症死亡的主要原因这一问题,重点在于提升肺部影像学(如CT扫描和胸部X光片)及生物标志物的检测准确性与效率。其解决方案的关键在于系统性地综述并评估多种先进图像处理方法的有效性,发现3D卷积神经网络(3D CNN)架构在整合CT扫描数据时表现最优,但同时也指出当前模型在跨人群泛化能力、假阳性率高、数据集异质性以及计算复杂度等方面仍存在显著挑战,亟需进一步优化以推动临床落地应用。
链接: https://arxiv.org/abs/2509.16254
作者: Sajim Ahmed,Muhammad Zain Chaudhary,Muhammad Zohaib Chaudhary,Mahmoud Abbass,Ahmed Sherif,Mohammad Mahbubur Rahman Khan Mamun
机构: 未知
类目: Tissues and Organs (q-bio.TO); Artificial Intelligence (cs.AI)
备注: Accepted at ICMI 2025
Abstract:Lung cancer continues to be the predominant cause of cancer-related mortality globally. This review analyzes various approaches, including advanced image processing methods, focusing on their efficacy in interpreting CT scans, chest radiographs, and biological markers. Notably, we identify critical gaps in the previous surveys, including the need for robust models that can generalize across diverse populations and imaging modalities. This comprehensive synthesis aims to serve as a foundational resource for researchers and clinicians, guiding future efforts toward more accurate and efficient lung cancer detection. Key findings reveal that 3D CNN architectures integrated with CT scans achieve the most superior performances, yet challenges such as high false positives, dataset variability, and computational complexity persist across modalities.
zh
机器学习
[LG-0] Strategic Coordination for Evolving Multi-agent Systems: A Hierarchical Reinforcement and Collective Learning Approach
链接: https://arxiv.org/abs/2509.18088
作者: Chuhao Qin,Evangelos Pournaras
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: This work has been submitted to the IEEE for possible publication
Abstract:Decentralized combinatorial optimization in evolving multi-agent systems poses significant challenges, requiring agents to balance long-term decision-making, short-term optimized collective outcomes, while preserving autonomy of interactive agents under unanticipated changes. Reinforcement learning offers a way to model sequential decision-making through dynamic programming to anticipate future environmental changes. However, applying multi-agent reinforcement learning (MARL) to decentralized combinatorial optimization problems remains an open challenge due to the exponential growth of the joint state-action space, high communication overhead, and privacy concerns in centralized training. To address these limitations, this paper proposes Hierarchical Reinforcement and Collective Learning (HRCL), a novel approach that leverages both MARL and decentralized collective learning based on a hierarchical framework. Agents take high-level strategies using MARL to group possible plans for action space reduction and constrain the agent behavior for Pareto optimality. Meanwhile, the low-level collective learning layer ensures efficient and decentralized coordinated decisions among agents with minimal communication. Extensive experiments in a synthetic scenario and real-world smart city application models, including energy self-management and drone swarm sensing, demonstrate that HRCL significantly improves performance, scalability, and adaptability compared to the standalone MARL and collective learning approaches, achieving a win-win synthesis solution.
[LG-1] Learning functions operators and dynamical systems with kernels
链接: https://arxiv.org/abs/2509.18071
作者: Lorenzo Rosasco
类目: Machine Learning (cs.LG)
*备注:
Abstract:This expository article presents the approach to statistical machine learning based on reproducing kernel Hilbert spaces. The basic framework is introduced for scalar-valued learning and then extended to operator learning. Finally, learning dynamical systems is formulated as a suitable operator learning problem, leveraging Koopman operator theory.
[LG-2] Learning to Rank with Top-K Fairness
链接: https://arxiv.org/abs/2509.18067
作者: Boyang Zhang,Quanqi Hu,Mingxuan Sun,Qihang Lin,Tianbao Yang
类目: Machine Learning (cs.LG)
*备注: Already accepted: this https URL @article{ zhang2025learning, title={Learning to Rank with Top-$K$ Fairness}, author={Boyang Zhang and Quanqi Hu and Mingxuan Sun and Qihang Lin and Tianbao Yang}, journal={Transactions on Machine Learning Research}, issn={2835-8856}, year={2025}, url={ this https URL }, note={} }
Abstract:Fairness in ranking models is crucial, as disparities in exposure can disproportionately affect protected groups. Most fairness-aware ranking systems focus on ensuring comparable average exposure for groups across the entire ranked list, which may not fully address real-world concerns. For example, when a ranking model is used for allocating resources among candidates or disaster hotspots, decision-makers often prioritize only the top- K ranked items, while the ranking beyond top- K becomes less relevant. In this paper, we propose a list-wise learning-to-rank framework that addresses the issues of inequalities in top- K rankings at training time. Specifically, we propose a top- K exposure disparity measure that extends the classic exposure disparity metric in a ranked list. We then learn a ranker to balance relevance and fairness in top- K rankings. Since direct top- K selection is computationally expensive for a large number of items, we transform the non-differentiable selection process into a differentiable objective function and develop efficient stochastic optimization algorithms to achieve both high accuracy and sufficient fairness. Extensive experiments demonstrate that our method outperforms existing methods.
[LG-3] Prepare Before You Act: Learning From Humans to Rearrange Initial States
链接: https://arxiv.org/abs/2509.18043
作者: Yinlong Dai,Andre Keyser,Dylan P. Losey
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out-of-distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, extensive demonstrations are needed for current IL methods to reach robust and generalizable behaviors. But when humans are faced with these sorts of atypical initial states, we often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states – which are outside the policy’s distribution – and autonomously modifies object poses so that the restructured scene is similar to training data. Theoretically, we show that this two step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action-agnostic human videos with task-agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution with equal amounts of total training data. See videos at our project website: this https URL
[LG-4] Control Disturbance Rejection in Neural ODEs
链接: https://arxiv.org/abs/2509.18034
作者: Erkan Bayram,Mohamed-Ali Belabbas,Tamer Başar
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: Accepted for publication in IEEE CDC 2025
Abstract:In this paper, we propose an iterative training algorithm for Neural ODEs that provides models resilient to control (parameter) disturbances. The method builds on our earlier work Tuning without Forgetting-and similarly introduces training points sequentially, and updates the parameters on new data within the space of parameters that do not decrease performance on the previously learned training points-with the key difference that, inspired by the concept of flat minima, we solve a minimax problem for a non-convex non-concave functional over an infinite-dimensional control space. We develop a projected gradient descent algorithm on the space of parameters that admits the structure of an infinite-dimensional Banach subspace. We show through simulations that this formulation enables the model to effectively learn new data points and gain robustness against control disturbance.
[LG-5] Building Transparency in Deep Learning-Powered Network Traffic Classification: A Traffic-Explainer Framework
链接: https://arxiv.org/abs/2509.18007
作者: Riya Ponraj,Ram Durairajan,Yu Wang
类目: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
*备注:
Abstract:Recent advancements in deep learning have significantly enhanced the performance and efficiency of traffic classification in networking systems. However, the lack of transparency in their predictions and decision-making has made network operators reluctant to deploy DL-based solutions in production networks. To tackle this challenge, we propose Traffic-Explainer, a model-agnostic and input-perturbation-based traffic explanation framework. By maximizing the mutual information between predictions on original traffic sequences and their masked counterparts, Traffic-Explainer automatically uncovers the most influential features driving model predictions. Extensive experiments demonstrate that Traffic-Explainer improves upon existing explanation methods by approximately 42%. Practically, we further apply Traffic-Explainer to identify influential features and demonstrate its enhanced transparency across three critical tasks: application classification, traffic localization, and network cartography. For the first two tasks, Traffic-Explainer identifies the most decisive bytes that drive predicted traffic applications and locations, uncovering potential vulnerabilities and privacy concerns. In network cartography, Traffic-Explainer identifies submarine cables that drive the mapping of traceroute to physical path, enabling a traceroute-informed risk analysis.
[LG-6] Equilibrium flow: From Snapshots to Dynamics
链接: https://arxiv.org/abs/2509.17990
作者: Yanbo Zhang,Michael Levin
类目: Machine Learning (cs.LG); Pattern Formation and Solitons (nlin.PS)
*备注: 17 pages, 8 figures
Abstract:Scientific data, from cellular snapshots in biology to celestial distributions in cosmology, often consists of static patterns from underlying dynamical systems. These snapshots, while lacking temporal ordering, implicitly encode the processes that preserve them. This work investigates how strongly such a distribution constrains its underlying dynamics and how to recover them. We introduce the Equilibrium flow method, a framework that learns continuous dynamics that preserve a given pattern distribution. Our method successfully identifies plausible dynamics for 2-D systems and recovers the signature chaotic behavior of the Lorenz attractor. For high-dimensional Turing patterns from the Gray-Scott model, we develop an efficient, training-free variant that achieves high fidelity to the ground truth, validated both quantitatively and qualitatively. Our analysis reveals the solution space is constrained not only by the data but also by the learning model’s inductive biases. This capability extends beyond recovering known systems, enabling a new paradigm of inverse design for Artificial Life. By specifying a target pattern distribution, we can discover the local interaction rules that preserve it, leading to the spontaneous emergence of complex behaviors, such as life-like flocking, attraction, and repulsion patterns, from simple, user-defined snapshots.
[LG-7] Budgeted Adversarial Attack against Graph-Based Anomaly Detection in Sensor Networks
链接: https://arxiv.org/abs/2509.17987
作者: Sanju Xaviar,Omid Ardakanian
类目: Machine Learning (cs.LG)
*备注: 12 pages
Abstract:Graph Neural Networks (GNNs) have emerged as powerful models for anomaly detection in sensor networks, particularly when analyzing multivariate time series. In this work, we introduce BETA, a novel grey-box evasion attack targeting such GNN-based detectors, where the attacker is constrained to perturb sensor readings from a limited set of nodes, excluding the target sensor, with the goal of either suppressing a true anomaly or triggering a false alarm at the target node. BETA identifies the sensors most influential to the target node’s classification and injects carefully crafted adversarial perturbations into their features, all while maintaining stealth and respecting the attacker’s budget. Experiments on three real-world sensor network datasets show that BETA reduces the detection accuracy of state-of-the-art GNN-based detectors by 30.62 to 39.16% on average, and significantly outperforms baseline attack strategies, while operating within realistic constraints.
[LG-8] owards Seeing Bones at Radio Frequency
链接: https://arxiv.org/abs/2509.17979
作者: Yiwen Song,Hongyang Li,Kuang Yuan,Ran Bi,Swarun Kumar
类目: Graphics (cs.GR); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:
Abstract:Wireless sensing literature has long aspired to achieve X-ray-like vision at radio frequencies. Yet, state-of-the-art wireless sensing literature has yet to generate the archetypal X-ray image: one of the bones beneath flesh. In this paper, we explore MCT, a penetration-based RF-imaging system for imaging bones at mm-resolution, one that significantly exceeds prior penetration-based RF imaging literature. Indeed the long wavelength, significant attenuation and complex diffraction that occur as RF propagates through flesh, have long limited imaging resolution (to several centimeters at best). We address these concerns through a novel penetration-based synthetic aperture algorithm, coupled with a learning-based pipeline to correct for diffraction-induced artifacts. A detailed evaluation of meat models demonstrates a resolution improvement from sub-decimeter to sub-centimeter over prior art in RF penetrative imaging.
[LG-9] Medical priority fusion: achieving dual optimization of sensitivity and interpretability in nipt anomaly detection
链接: https://arxiv.org/abs/2509.17924
作者: Xiuqi Ge,Zhibo Yao,Yaosong Du
类目: Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
*备注: 24 pages, 47 figures, publish to BIBM
Abstract:Clinical machine learning faces a critical dilemma in high-stakes medical applications: algorithms achieving optimal diagnostic performance typically sacrifice the interpretability essential for physician decision-making, while interpretable methods compromise sensitivity in complex scenarios. This paradox becomes particularly acute in non-invasive prenatal testing (NIPT), where missed chromosomal abnormalities carry profound clinical consequences yet regulatory frameworks mandate explainable AI systems. We introduce Medical Priority Fusion (MPF), a constrained multi-objective optimization framework that resolves this fundamental trade-off by systematically integrating Naive Bayes probabilistic reasoning with Decision Tree rule-based logic through mathematically-principled weighted fusion under explicit medical constraints. Rigorous validation on 1,687 real-world NIPT samples characterized by extreme class imbalance (43.4:1 normal-to-abnormal ratio) employed stratified 5-fold cross-validation with comprehensive ablation studies and statistical hypothesis testing using McNemar’s paired comparisons. MPF achieved simultaneous optimization of dual objectives: 89.3% sensitivity (95% CI: 83.9-94.7%) with 80% interpretability score, significantly outperforming individual algorithms (McNemar’s test, p 0.001). The optimal fusion configuration achieved Grade A clinical deployment criteria with large effect size (d = 1.24), establishing the first clinically-deployable solution that maintains both diagnostic accuracy and decision transparency essential for prenatal care. This work demonstrates that medical-constrained algorithm fusion can resolve the interpretability-performance trade-off, providing a mathematical framework for developing high-stakes medical decision support systems that meet both clinical efficacy and explainability requirements.
[LG-10] SingLEM: Single-Channel Large EEG Model
链接: https://arxiv.org/abs/2509.17920
作者: Jamiyan Sukhbaatar,Satoshi Imamura,Ibuki Inoue,Shoya Murakami,Kazi Mahmudul Hassan,Seungwoo Han,Ingon Chanpornpakdi,Toshihisa Tanaka
类目: Machine Learning (cs.LG)
*备注:
Abstract:Current deep learning models for electroencephalography (EEG) are often task-specific and depend on large labeled datasets, limiting their adaptability. Although emerging foundation models aim for broader applicability, their rigid dependence on fixed, high-density multi-channel montages restricts their use across heterogeneous datasets and in missing-channel or practical low-channel settings. To address these limitations, we introduce SingLEM, a self-supervised foundation model that learns robust, general-purpose representations from single-channel EEG, making it inherently hardware agnostic. The model employs a hybrid encoder architecture that combines convolutional layers to extract local features with a hierarchical transformer to model both short- and long-range temporal dependencies. SingLEM is pretrained on 71 public datasets comprising over 9,200 subjects and 357,000 single-channel hours of EEG. When evaluated as a fixed feature extractor across six motor imagery and cognitive tasks, aggregated single-channel representations consistently outperformed leading multi-channel foundation models and handcrafted baselines. These results demonstrate that a single-channel approach can achieve state-of-the-art generalization while enabling fine-grained neurophysiological analysis and enhancing interpretability. The source code and pretrained models are available at this https URL.
[LG-11] Shilling Recommender Systems by Generating Side-feature-aware Fake User Profiles
链接: https://arxiv.org/abs/2509.17918
作者: Yuanrong Wang,Yingpeng Du
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Recommender systems (RS) greatly influence users’ consumption decisions, making them attractive targets for malicious shilling attacks that inject fake user profiles to manipulate recommendations. Existing shilling methods can generate effective and stealthy fake profiles when training data only contain rating matrix, but they lack comprehensive solutions for scenarios where side features are present and utilized by the recommender. To address this gap, we extend the Leg-UP framework by enhancing the generator architecture to incorporate side features, enabling the generation of side-feature-aware fake user profiles. Experiments on benchmarks show that our method achieves strong attack performance while maintaining stealthiness.
[LG-12] Lipschitz-Based Robustness Certification for Recurrent Neural Networks via Convex Relaxation
链接: https://arxiv.org/abs/2509.17898
作者: Paul Hamelbeck,Johannes Schiffer
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 10 pages, 3 figures,
Abstract:Robustness certification against bounded input noise or adversarial perturbations is increasingly important for deployment recurrent neural networks (RNNs) in safety-critical control applications. To address this challenge, we present RNN-SDP, a relaxation based method that models the RNN’s layer interactions as a convex problem and computes a certified upper bound on the Lipschitz constant via semidefinite programming (SDP). We also explore an extension that incorporates known input constraints to further tighten the resulting Lipschitz bounds. RNN-SDP is evaluated on a synthetic multi-tank system, with upper bounds compared to empirical estimates. While incorporating input constraints yields only modest improvements, the general method produces reasonably tight and certifiable bounds, even as sequence length increases. The results also underscore the often underestimated impact of initialization errors, an important consideration for applications where models are frequently re-initialized, such as model predictive control (MPC).
[LG-13] Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark
链接: https://arxiv.org/abs/2509.17894
作者: Siu Hang Ho,Prasad Ganesan,Nguyen Duong,Daniel Schlabig
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures. Technical report
Abstract:Efficient inference is a critical challenge in deep generative modeling, particularly as diffusion models grow in capacity and complexity. While increased complexity often improves accuracy, it raises compute costs, latency, and memory requirements. This work investigates techniques such as pruning, quantization, knowledge distillation, and simplified attention to reduce computational overhead without impacting performance. The study also explores the Mixture of Experts (MoE) approach to further enhance efficiency. These experiments provide insights into optimizing inference for the state-of-the-art Fast Diffusion Transformer (fast-DiT) model.
[LG-14] GaussianPSL: A novel framework based on Gaussian Splatting for exploring the Pareto frontier in multi-criteria optimization
链接: https://arxiv.org/abs/2509.17889
作者: Phuong Mai Dinh,Van-Nam Huynh
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multi-objective optimization (MOO) is essential for solving complex real-world problems involving multiple conflicting objectives. However, many practical applications - including engineering design, autonomous systems, and machine learning - often yield non-convex, degenerate, or discontinuous Pareto frontiers, which involve traditional scalarization and Pareto Set Learning (PSL) methods that struggle to approximate accurately. Existing PSL approaches perform well on convex fronts but tend to fail in capturing the diversity and structure of irregular Pareto sets commonly observed in real-world scenarios. In this paper, we propose Gaussian-PSL, a novel framework that integrates Gaussian Splatting into PSL to address the challenges posed by non-convex Pareto frontiers. Our method dynamically partitions the preference vector space, enabling simple MLP networks to learn localized features within each region, which are then integrated by an additional MLP aggregator. This partition-aware strategy enhances both exploration and convergence, reduces sensi- tivity to initialization, and improves robustness against local optima. We first provide the mathematical formulation for controllable Pareto set learning using Gaussian Splat- ting. Then, we introduce the Gaussian-PSL architecture and evaluate its performance on synthetic and real-world multi-objective benchmarks. Experimental results demonstrate that our approach outperforms standard PSL models in learning irregular Pareto fronts while maintaining computational efficiency and model simplicity. This work offers a new direction for effective and scalable MOO under challenging frontier geometries.
[LG-15] Brainprint-Modulated Target Speaker Extraction
链接: https://arxiv.org/abs/2509.17883
作者: Qiushi Han,Yuan Liao,Youhao Si,Liya Huang
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: 5 pages, 2 figures, conference
Abstract:Achieving robust and personalized performance in neuro-steered Target Speaker Extraction (TSE) remains a significant challenge for next-generation hearing aids. This is primarily due to two factors: the inherent non-stationarity of EEG signals across sessions, and the high inter-subject variability that limits the efficacy of generalized models. To address these issues, we propose Brainprint-Modulated Target Speaker Extraction (BM-TSE), a novel framework for personalized and high-fidelity extraction. BM-TSE first employs a spatio-temporal EEG encoder with an Adaptive Spectral Gain (ASG) module to extract stable features resilient to non-stationarity. The core of our framework is a personalized modulation mechanism, where a unified brainmap embedding is learned under the joint supervision of subject identification (SID) and auditory attention decoding (AAD) tasks. This learned brainmap, encoding both static user traits and dynamic attentional states, actively refines the audio separation process, dynamically tailoring the output to each user. Evaluations on the public KUL and Cocktail Party datasets demonstrate that BM-TSE achieves state-of-the-art performance, significantly outperforming existing methods. Our code is publicly accessible at: this https URL.
[LG-16] Deep Hierarchical Learning with Nested Subspace Networks
链接: https://arxiv.org/abs/2509.17874
作者: Paulius Rauba,Mihaela van der Schaar
类目: Machine Learning (cs.LG)
*备注:
Abstract:Large neural networks are typically trained for a fixed computational budget, creating a rigid trade-off between performance and efficiency that is ill-suited for deployment in resource-constrained or dynamic environments. Existing approaches to this problem present a difficult choice: training a discrete collection of specialist models is computationally prohibitive, while dynamic methods like slimmable networks often lack the flexibility to be applied to large, pre-trained foundation models. In this work, we propose Nested Subspace Networks (NSNs), a novel architectural paradigm that enables a single model to be dynamically and granularly adjusted across a continuous spectrum of compute budgets at inference time. The core of our approach is to re-parameterize linear layers to satisfy a nested subspace property, such that the function computed at a given rank is a strict subspace of the function at any higher rank. We show that this entire hierarchy of models can be optimized jointly via an uncertainty-aware objective that learns to balance the contributions of different ranks based on their intrinsic difficulty. We demonstrate empirically that NSNs can be surgically applied to pre-trained LLMs and unlock a smooth and predictable compute-performance frontier. For example, a single NSN-adapted model can achieve a 50% reduction in inference FLOPs with only a 5 percentage point loss in accuracy. Our findings establish NSNs as a powerful framework for creating the next generation of adaptive foundation models.
[LG-17] Improving After-sales Service: Deep Reinforcement Learning for Dynamic Time Slot Assignment with Commitments and Customer Preferences
链接: https://arxiv.org/abs/2509.17870
作者: Xiao Mao,Albert H. Schrotenboer,Guohua Wu,Willem van Jaarsveld
类目: Machine Learning (cs.LG)
*备注:
Abstract:Problem definition: For original equipment manufacturers (OEMs), high-tech maintenance is a strategic component in after-sales services, involving close coordination between customers and service engineers. Each customer suggests several time slots for their maintenance task, from which the OEM must select one. This decision needs to be made promptly to support customers’ planning. At the end of each day, routes for service engineers are planned to fulfill the tasks scheduled for the following day. We study this hierarchical and sequential decision-making problem-the Dynamic Time Slot Assignment Problem with Commitments and Customer Preferences (DTSAP-CCP)-in this paper. Methodology/results: Two distinct approaches are proposed: 1) an attention-based deep reinforcement learning with rollout execution (ADRL-RE) and 2) a scenario-based planning approach (SBP). The ADRL-RE combines a well-trained attention-based neural network with a rollout framework for online trajectory simulation. To support the training, we develop a neural heuristic solver that provides rapid route planning solutions, enabling efficient learning in complex combinatorial settings. The SBP approach samples several scenarios to guide the time slot assignment. Numerical experiments demonstrate the superiority of ADRL-RE and the stability of SBP compared to both rule-based and rollout-based approaches. Furthermore, the strong practicality of ADRL-RE is verified in a case study of after-sales service for large medical equipment. Implications: This study provides OEMs with practical decision-support tools for dynamic maintenance scheduling, balancing customer preferences and operational efficiency. In particular, our ADRL-RE shows strong real-world potential, supporting timely and customer-aligned maintenance scheduling.
[LG-18] Conv-like Scale-Fusion Time Series Transformer: A Multi-Scale Representation for Variable-Length Long Time Series
链接: https://arxiv.org/abs/2509.17845
作者: Kai Zhang,Siming Sun,Zhengyu Fan,Qinmin Yang,Xuejun Jiang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Time series analysis faces significant challenges in handling variable-length data and achieving robust generalization. While Transformer-based models have advanced time series tasks, they often struggle with feature redundancy and limited generalization capabilities. Drawing inspiration from classical CNN architectures’ pyramidal structure, we propose a Multi-Scale Representation Learning Framework based on a Conv-like ScaleFusion Transformer. Our approach introduces a temporal convolution-like structure that combines patching operations with multi-head attention, enabling progressive temporal dimension compression and feature channel expansion. We further develop a novel cross-scale attention mechanism for effective feature fusion across different temporal scales, along with a log-space normalization method for variable-length sequences. Extensive experiments demonstrate that our framework achieves superior feature independence, reduced redundancy, and better performance in forecasting and classification tasks compared to state-of-the-art methods.
[LG-19] oward Affordable and Non-Invasive Detection of Hypoglycemia: A Machine Learning Approach
链接: https://arxiv.org/abs/2509.17842
作者: Lawrence Obiuwevwi,Krzysztof J. Rechowicz,Vikas Ashok,Sampath Jayarathna
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Diabetes mellitus is a growing global health issue, with Type 1 Diabetes (T1D) requiring constant monitoring to avoid hypoglycemia. Although Continuous Glucose Monitors (CGMs) are effective, their cost and invasiveness limit access, particularly in low-resource settings. This paper proposes a non-invasive method to classify glycemic states using Galvanic Skin Response (GSR), a biosignal commonly captured by wearable sensors. We use the merged OhioT1DM 2018 and 2020 datasets to build a machine learning pipeline that detects hypoglycemia (glucose 70 mg/dl) and normoglycemia (glucose 70 mg/dl) with GSR alone. Seven models are trained and evaluated: Random Forest, XGBoost, MLP, CNN, LSTM, Logistic Regression, and K-Nearest Neighbors. Validation sets and 95% confidence intervals are reported to increase reliability and assess robustness. Results show that the LSTM model achieves a perfect hypoglycemia recall (1.00) with an F1-score confidence interval of [0.611-0.745], while XGBoost offers strong performance with a recall of 0.54 even under class imbalance. This approach highlights the potential for affordable, wearable-compatible glucose monitoring tools suitable for settings with limited CGM availability using GSR data. Index Terms: Hypoglycemia Detection, Galvanic Skin Response, Non Invasive Monitoring, Wearables, Machine Learning, Confidence Intervals. Subjects: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG) Cite as: arXiv:2509.17842 [cs.HC] (or arXiv:2509.17842v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2509.17842 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: L. Obiuwevwi, K. J. Rechowicz, V. Ashok, and S. Jayarathna, “Toward Affordable and Non-Invasive Detection of Hypoglycemia: A Machine Learning Approach,” in IEEE International Conference on Information Reuse and Integration (IRI), 2025 Related DOI: https://doi.org/10.1109/IRI66576.2025.00036 Focus to learn more DOI(s) linking to related resources
[LG-20] Global Optimization via Softmin Energy Minimization
链接: https://arxiv.org/abs/2509.17815
作者: Andrea Agazzi,Vittorio Carlei,Marco Romito,Samuele Saviozzi
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:Global optimization, particularly for non-convex functions with multiple local minima, poses significant challenges for traditional gradient-based methods. While metaheuristic approaches offer empirical effectiveness, they often lack theoretical convergence guarantees and may disregard available gradient information. This paper introduces a novel gradient-based swarm particle optimization method designed to efficiently escape local minima and locate global optima. Our approach leverages a “Soft-min Energy” interacting function, J_\beta(\mathbfx) , which provides a smooth, differentiable approximation of the minimum function value within a particle swarm. We define a stochastic gradient flow in the particle space, incorporating a Brownian motion term for exploration and a time-dependent parameter \beta to control smoothness, similar to temperature annealing. We theoretically demonstrate that for strongly convex functions, our dynamics converges to a stationary point where at least one particle reaches the global minimum, with other particles exhibiting exploratory behavior. Furthermore, we show that our method facilitates faster transitions between local minima by reducing effective potential barriers with respect to Simulated Annealing. More specifically, we estimate the hitting times of unexplored potential wells for our model in the small noise regime and show that they compare favorably with the ones of overdamped Langevin. Numerical experiments on benchmark functions, including double wells and the Ackley function, validate our theoretical findings and demonstrate better performance over the well-known Simulated Annealing method in terms of escaping local minima and achieving faster convergence.
[LG-21] MSGAT-GRU: A Multi-Scale Graph Attention and Recurrent Model for Spatiotemporal Road Accident Prediction
链接: https://arxiv.org/abs/2509.17811
作者: Thrinadh Pinjala,Aswin Ram Kumar Gannina,Debasis Dwibedy
类目: Machine Learning (cs.LG)
*备注: 16 pages, 4 figures, 4 tables
Abstract:Accurate prediction of road accidents remains challenging due to intertwined spatial, temporal, and contextual factors in urban traffic. We propose MSGAT-GRU, a multi-scale graph attention and recurrent model that jointly captures localized and long-range spatial dependencies while modeling sequential dynamics. Heterogeneous inputs, such as traffic flow, road attributes, weather, and points of interest, are systematically fused to enhance robustness and interpretability. On the Hybrid Beijing Accidents dataset, MSGAT-GRU achieves an RMSE of 0.334 and an F1-score of 0.878, consistently outperforming strong baselines. Cross-dataset evaluation on METR-LA under a 1-hour horizon further supports transferability, with RMSE of 6.48 (vs. 7.21 for the GMAN model) and comparable MAPE. Ablations indicate that three-hop spatial aggregation and a two-layer GRU offer the best accuracy-stability trade-off. These results position MSGAT-GRU as a scalable and generalizable model for intelligent transportation systems, providing interpretable signals that can inform proactive traffic management and road safety analytics.
[LG-22] MTM: A Multi-Scale Token Mixing Transformer for Irregular Multivariate Time Series Classification KDD2025
链接: https://arxiv.org/abs/2509.17809
作者: Shuhan Zhong,Weipeng Zhuo,Sizhe Song,Guanyao Li,Zhongyi Yu,S.-H. Gary Chan
类目: Machine Learning (cs.LG)
*备注: KDD 2025
Abstract:Irregular multivariate time series (IMTS) is characterized by the lack of synchronized observations across its different channels. In this paper, we point out that this channel-wise asynchrony can lead to poor channel-wise modeling of existing deep learning methods. To overcome this limitation, we propose MTM, a multi-scale token mixing transformer for the classification of IMTS. We find that the channel-wise asynchrony can be alleviated by down-sampling the time series to coarser timescales, and propose to incorporate a masked concat pooling in MTM that gradually down-samples IMTS to enhance the channel-wise attention modules. Meanwhile, we propose a novel channel-wise token mixing mechanism which proactively chooses important tokens from one channel and mixes them with other channels, to further boost the channel-wise learning of our model. Through extensive experiments on real-world datasets and comparison with state-of-the-art methods, we demonstrate that MTM consistently achieves the best performance on all the benchmarks, with improvements of up to 3.8% in AUPRC for classification.
[LG-23] Remote Sensing-Oriented World Model
链接: https://arxiv.org/abs/2509.17808
作者: Yuxi Lu,Biao Wu,Zhidong Li,Kunqi Li,Chenya Huang,Huacan Wang,Qizhen Lan,Ronghao Chen,Ling Chen,Bin Liang
类目: Machine Learning (cs.LG)
*备注: 10 pages, 5 figures
Abstract:World models have shown potential in artificial intelligence by predicting and reasoning about world states beyond direct observations. However, existing approaches are predominantly evaluated in synthetic environments or constrained scene settings, limiting their validation in real-world contexts with broad spatial coverage and complex semantics. Meanwhile, remote sensing applications urgently require spatial reasoning capabilities for disaster response and urban planning. This paper bridges these gaps by introducing the first framework for world modeling in remote sensing. We formulate remote sensing world modeling as direction-conditioned spatial extrapolation, where models generate semantically consistent adjacent image tiles given a central observation and directional instruction. To enable rigorous evaluation, we develop RSWISE (Remote Sensing World-Image Spatial Evaluation), a benchmark containing 1,600 evaluation tasks across four scenarios: general, flood, urban, and rural. RSWISE combines visual fidelity assessment with instruction compliance evaluation using GPT-4o as a semantic judge, ensuring models genuinely perform spatial reasoning rather than simple replication. Afterwards, we present RemoteBAGEL, a unified multimodal model fine-tuned on remote sensing data for spatial extrapolation tasks. Extensive experiments demonstrate that RemoteBAGEL consistently outperforms state-of-the-art baselines on RSWISE.
[LG-24] Elucidating the Design Space of FP4 training
链接: https://arxiv.org/abs/2509.17791
作者: Robert Hu,Carlo Luschi,Paul Balanca
类目: Machine Learning (cs.LG)
*备注:
Abstract:The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (\textttFP4) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize \textttFP4 training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of \textttFP4 training. We introduce a comprehensive, quantisation gradient-based framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favourable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling and stochastic rounding. We further find that using \textttUE5M3 as a scaling factor potentially offers a good compromise between range and precision with manageable computational overhead.
[LG-25] Flatness is Necessary Neural Collapse is Not: Rethinking Generalization via Grokking
链接: https://arxiv.org/abs/2509.17738
作者: Ting Han,Linara Adilova,Henning Petzka,Jens Kleesiek,Michael Kamp
类目: Machine Learning (cs.LG)
*备注: Preprint version
Abstract:Neural collapse, i.e., the emergence of highly symmetric, class-wise clustered representations, is frequently observed in deep networks and is often assumed to reflect or enable generalization. In parallel, flatness of the loss landscape has been theoretically and empirically linked to generalization. Yet, the causal role of either phenomenon remains unclear: Are they prerequisites for generalization, or merely by-products of training dynamics? We disentangle these questions using grokking, a training regime in which memorization precedes generalization, allowing us to temporally separate generalization from training dynamics and we find that while both neural collapse and relative flatness emerge near the onset of generalization, only flatness consistently predicts it. Models encouraged to collapse or prevented from collapsing generalize equally well, whereas models regularized away from flat solutions exhibit delayed generalization. Furthermore, we show theoretically that neural collapse implies relative flatness under classical assumptions, explaining their empirical co-occurrence. Our results support the view that relative flatness is a potentially necessary and more fundamental property for generalization, and demonstrate how grokking can serve as a powerful probe for isolating its geometric underpinnings.
[LG-26] An AutoML Framework using AutoGluonTS for Forecasting Seasonal Extreme Temperatures IJCNN2025
链接: https://arxiv.org/abs/2509.17734
作者: Pablo Rodríguez-Bocca,Guillermo Pereira,Diego Kiedanski,Soledad Collazo,Sebastián Basterrech,Gerardo Rubino
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注: Manuscript to appear in the proceedings of IJCNN 2025, in the workshop entitled "AI for a Cooler Planet: Tackling Environmental Challenges with Neural Networks.‘’ Total pages: 14. Total figures: 9 (containing a total of 27 images). Total tables: 1
Abstract:In recent years, great progress has been made in the field of forecasting meteorological variables. Recently, deep learning architectures have made a major breakthrough in forecasting the daily average temperature over a ten-day horizon. However, advances in forecasting events related to the maximum temperature over short horizons remain a challenge for the community. A problem that is even more complex consists in making predictions of the maximum daily temperatures in the short, medium, and long term. In this work, we focus on forecasting events related to the maximum daily temperature over medium-term periods (90 days). Therefore, instead of addressing the problem from a meteorological point of view, this article tackles it from a climatological point of view. Due to the complexity of this problem, a common approach is to frame the study as a temporal classification problem with the classes: maximum temperature “above normal”, “normal” or “below normal”. From a practical point of view, we created a large historical dataset (from 1981 to 2018) collecting information from weather stations located in South America. In addition, we also integrated exogenous information from the Pacific, Atlantic, and Indian Ocean basins. We applied the AutoGluonTS platform to solve the above-mentioned problem. This AutoML tool shows competitive forecasting performance with respect to large operational platforms dedicated to tackling this climatological problem; but with a “relatively” low computational cost in terms of time and resources.
[LG-27] A Generative Conditional Distribution Equality Testing Framework and Its Minimax Analysis
链接: https://arxiv.org/abs/2509.17729
作者: Siming Zheng,Meifang Lan,Tong Wang,Yuanyuan Lin
类目: Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:In this paper, we propose a general framework for testing the equality of the conditional distributions in a two-sample problem. This problem is most relevant to transfer learning under covariate shift. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional distribution testing problem into an unconditional one. We introduce two special tests: the generative permutation-based conditional distribution equality test and the generative classification accuracy-based conditional distribution equality test. Theoretically, we establish a minimax lower bound for statistical inference in testing the equality of two conditional distributions under certain smoothness conditions. We demonstrate that the generative permutation-based conditional distribution equality test and its modified version can attain this lower bound precisely or up to some iterated logarithmic factor. Moreover, we prove the testing consistency of the generative classification accuracy-based conditional distribution equality test. We also establish the convergence rate for the learned conditional generator by deriving new results related to the recently-developed offset Rademacher complexity and approximation properties using neural networks. Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach.
[LG-28] A non-smooth regularization framework for learning over multitask graphs
链接: https://arxiv.org/abs/2509.17728
作者: Yara Zgheib,Luca Calatroni,Marc Antonini,Roula Nassif
类目: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*备注:
Abstract:In this work, we consider learning over multitask graphs, where each agent aims to estimate its own parameter vector. Although agents seek distinct objectives, collaboration among them can be beneficial in scenarios where relationships between tasks exist. Among the various approaches to promoting relationships between tasks and, consequently, enhancing collaboration between agents, one notable method is regularization. While previous multitask learning studies have focused on smooth regularization to enforce graph smoothness, this work explores non-smooth regularization techniques that promote sparsity, making them particularly effective in encouraging piecewise constant transitions on the graph. We begin by formulating a global regularized optimization problem, which involves minimizing the aggregate sum of individual costs, regularized by a general non-smooth term designed to promote piecewise-constant relationships between the tasks of neighboring agents. Based on the forward-backward splitting strategy, we propose a decentralized learning approach that enables efficient solutions to the regularized optimization problem. Then, under convexity assumptions on the cost functions and co-regularization, we establish that the proposed approach converges in the mean-square-error sense within O(\mu) of the optimal solution of the globally regularized cost. For broader applicability and improved computational efficiency, we also derive closed-form expressions for commonly used non-smooth (and, possibly, non-convex) regularizers, such as the weighted sum of the \ell_0 -norm, \ell_1 -norm, and elastic net regularization. Finally, we illustrate both the theoretical findings and the effectiveness of the approach through simulations.
[LG-29] Fast Accurate and Interpretable Graph Classification with Topological Kernels
链接: https://arxiv.org/abs/2509.17693
作者: Adam Wesołowski,Ronin Wu,Karim Essafi
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce a novel class of explicit feature maps based on topological indices that represent each graph by a compact feature vector, enabling fast and interpretable graph classification. Using radial basis function kernels on these compact vectors, we define a measure of similarity between graphs. We perform evaluation on standard molecular datasets and observe that classification accuracies based on single topological-index feature vectors underperform compared to state-of-the-art substructure-based kernels. However, we achieve significantly faster Gram matrix evaluation – up to 20\times faster – compared to the Weisfeiler–Lehman subtree kernel. To enhance performance, we propose two extensions: 1) concatenating multiple topological indices into an \emphExtended Feature Vector (EFV), and 2) \emphLinear Combination of Topological Kernels (LCTK) by linearly combining Radial Basis Function kernels computed on feature vectors of individual topological graph indices. These extensions deliver up to 12% percent accuracy gains across all the molecular datasets. A complexity analysis highlights the potential for exponential quantum speedup for some of the vector components. Our results indicate that LCTK and EFV offer a favourable trade-off between accuracy and efficiency, making them strong candidates for practical graph learning applications.
[LG-30] Comparing Data Assimilation and Likelihood-Based Inference on Latent State Estimation in Agent -Based Models
链接: https://arxiv.org/abs/2509.17625
作者: Blas Kolic,Corrado Monti,Gianmarco De Francisci Morales,Marco Pangallo
类目: Machine Learning (cs.LG); Computers and Society (cs.CY); Physics and Society (physics.soc-ph); Methodology (stat.ME)
*备注:
Abstract:In this paper, we present the first systematic comparison of Data Assimilation (DA) and Likelihood-Based Inference (LBI) in the context of Agent-Based Models (ABMs). These models generate observable time series driven by evolving, partially-latent microstates. Latent states need to be estimated to align simulations with real-world data – a task traditionally addressed by DA, especially in continuous and equation-based models such as those used in weather forecasting. However, the nature of ABMs poses challenges for standard DA methods. Solving such issues requires adaptation of previous DA techniques, or ad-hoc alternatives such as LBI. DA approximates the likelihood in a model-agnostic way, making it broadly applicable but potentially less precise. In contrast, LBI provides more accurate state estimation by directly leveraging the model’s likelihood, but at the cost of requiring a hand-crafted, model-specific likelihood function, which may be complex or infeasible to derive. We compare the two methods on the Bounded-Confidence Model, a well-known opinion dynamics ABM, where agents are affected only by others holding sufficiently similar opinions. We find that LBI better recovers latent agent-level opinions, even under model mis-specification, leading to improved individual-level forecasts. At the aggregate level, however, both methods perform comparably, and DA remains competitive across levels of aggregation under certain parameter settings. Our findings suggest that DA is well-suited for aggregate predictions, while LBI is preferable for agent-level inference.
[LG-31] Audio Super-Resolution with Latent Bridge Models NEURIPS2025
链接: https://arxiv.org/abs/2509.17609
作者: Chang Li,Zehua Chen,Liyuan Wang,Jun Zhu
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025
Abstract:Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-toHR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at this https URL.
[LG-32] An Unlearning Framework for Continual Learning
链接: https://arxiv.org/abs/2509.17530
作者: Sayanta Adhikari,Vishnuprasadh Kumaravelu,P. K. Srijith
类目: Machine Learning (cs.LG)
*备注:
Abstract:Growing concerns surrounding AI safety and data privacy have driven the development of Machine Unlearning as a potential solution. However, current machine unlearning algorithms are designed to complement the offline training paradigm. The emergence of the Continual Learning (CL) paradigm promises incremental model updates, enabling models to learn new tasks sequentially. Naturally, some of those tasks may need to be unlearned to address safety or privacy concerns that might arise. We find that applying conventional unlearning algorithms in continual learning environments creates two critical problems: performance degradation on retained tasks and task relapse, where previously unlearned tasks resurface during subsequent learning. Furthermore, most unlearning algorithms require data to operate, which conflicts with CL’s philosophy of discarding past data. A clear need arises for unlearning algorithms that are data-free and mindful of future learning. To that end, we propose UnCLe, an Unlearning framework for Continual Learning. UnCLe employs a hypernetwork that learns to generate task-specific network parameters, using task embeddings. Tasks are unlearned by aligning the corresponding generated network parameters with noise, without requiring any data. Empirical evaluations on several vision data sets demonstrate UnCLe’s ability to sequentially perform multiple learning and unlearning operations with minimal disruption to previously acquired knowledge.
[LG-33] Achilles Heel of Mamba: Essential difficulties of the Mamba architecture demonstrated by synthetic data
链接: https://arxiv.org/abs/2509.17514
作者: Tianyi Chen,Pengxiao Lin,Zhiwei Wang,Zhi-Qin John Xu
类目: Machine Learning (cs.LG)
*备注:
Abstract:State Space Models (SSMs) have emerged as promising alternatives to attention mechanisms, with the Mamba architecture demonstrating impressive performance and linear complexity for processing long sequences. However, the fundamental differences between Mamba and Transformer architectures remain incompletely understood. In this work, we use carefully designed synthetic tasks to reveal Mamba’s inherent limitations. Through experiments, we identify that Mamba’s nonlinear convolution introduces an asymmetry bias that significantly impairs its ability to recognize symmetrical patterns and relationships. Using composite function and inverse sequence matching tasks, we demonstrate that Mamba strongly favors compositional solutions over symmetrical ones and struggles with tasks requiring the matching of reversed sequences. We show these limitations stem not from the SSM module itself but from the nonlinear convolution preceding it, which fuses token information asymmetrically. These insights provide a new understanding of Mamba’s constraints and suggest concrete architectural improvements for future sequence models.
[LG-34] BiLCNet : BiLSTM-Conformer Network for Encrypted Traffic Classification with 5G SA Physical Channel Records
链接: https://arxiv.org/abs/2509.17495
作者: Ke Ma,Jialiang Lu,Philippe Martins
类目: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: 6 pages, 5 figures
Abstract:Accurate and efficient traffic classification is vital for wireless network management, especially under encrypted payloads and dynamic application behavior, where traditional methods such as port-based identification and deep packet inspection (DPI) are increasingly inadequate. This work explores the feasibility of using physical channel data collected from the air interface of 5G Standalone (SA) networks for traffic sensing. We develop a preprocessing pipeline to transform raw channel records into structured representations with customized feature engineering to enhance downstream classification performance. To jointly capture temporal dependencies and both local and global structural patterns inherent in physical channel records, we propose a novel hybrid architecture: BiLSTM-Conformer Network (BiLCNet), which integrates the sequential modeling capability of Bidirectional Long Short-Term Memory networks (BiLSTM) with the spatial feature extraction strength of Conformer blocks. Evaluated on a noise-limited 5G SA dataset, our model achieves a classification accuracy of 93.9%, outperforming a series of conventional machine learning and deep learning algorithms. Furthermore, we demonstrate its generalization ability under zero-shot transfer settings, validating its robustness across traffic categories and varying environmental conditions.
[LG-35] Path-Weighted Integrated Gradients for Interpretable Dementia Classification
链接: https://arxiv.org/abs/2509.17491
作者: Firuz Kamalov,Mohmad Al Falasi,Fadi Thabtah
类目: Machine Learning (cs.LG)
*备注:
Abstract:Integrated Gradients (IG) is a widely used attribution method in explainable artificial intelligence (XAI). In this paper, we introduce Path-Weighted Integrated Gradients (PWIG), a generalization of IG that incorporates a customizable weighting function into the attribution integral. This modification allows for targeted emphasis along different segments of the path between a baseline and the input, enabling improved interpretability, noise mitigation, and the detection of path-dependent feature relevance. We establish its theoretical properties and illustrate its utility through experiments on a dementia classification task using the OASIS-1 MRI dataset. Attribution maps generated by PWIG highlight clinically meaningful brain regions associated with various stages of dementia, providing users with sharp and stable explanations. The results suggest that PWIG offers a flexible and theoretically grounded approach for enhancing attribution quality in complex predictive models.
[LG-36] Periodic Graph-Enhanced Multivariate Time Series Anomaly Detector
链接: https://arxiv.org/abs/2509.17472
作者: Jia Li,Shiyu Long,Ye Yuan
类目: Machine Learning (cs.LG)
*备注:
Abstract:Multivariate time series (MTS) anomaly detection commonly encounters in various domains like finance, healthcare, and industrial monitoring. However, existing MTS anomaly detection methods are mostly defined on the static graph structure, which fails to perform an accurate representation of complex spatio-temporal correlations in MTS. To address this issue, this study proposes a Periodic Graph-Enhanced Multivariate Time Series Anomaly Detector (PGMA) with the following two-fold ideas: a) designing a periodic time-slot allocation strategy based Fast Fourier Transform (FFT), which enables the graph structure to reflect dynamic changes in MTS; b) utilizing graph neural network and temporal extension convolution to accurate extract the complex spatio-temporal correlations from the reconstructed periodic graphs. Experiments on four real datasets from real applications demonstrate that the proposed PGMA outperforms state-of-the-art models in MTS anomaly detection.
[LG-37] Efficient Sliced Wasserstein Distance Computation via Adaptive Bayesian Optimization
链接: https://arxiv.org/abs/2509.17405
作者: Manish Acharya,David Hyde
类目: Machine Learning (cs.LG)
*备注: 19 pages, 11 figures
Abstract:The sliced Wasserstein distance (SW) reduces optimal transport on \mathbbR^d to a sum of one-dimensional projections, and thanks to this efficiency, it is widely used in geometry, generative modeling, and registration tasks. Recent work shows that quasi-Monte Carlo constructions for computing SW (QSW) yield direction sets with excellent approximation error. This paper presents an alternate, novel approach: learning directions with Bayesian optimization (BO), particularly in settings where SW appears inside an optimization loop (e.g., gradient flows). We introduce a family of drop-in selectors for projection directions: BOSW, a one-shot BO scheme on the unit sphere; RBOSW, a periodic-refresh variant; ABOSW, an adaptive hybrid that seeds from competitive QSW sets and performs a few lightweight BO refinements; and ARBOSW, a restarted hybrid that periodically relearns directions during optimization. Our BO approaches can be composed with QSW and its variants (demonstrated by ABOSW/ARBOSW) and require no changes to downstream losses or gradients. We provide numerical experiments where our methods achieve state-of-the-art performance, and on the experimental suite of the original QSW paper, we find that ABOSW and ARBOSW can achieve convergence comparable to the best QSW variants with modest runtime overhead.
[LG-38] Robust Anomaly Detection Under Normality Distribution Shift in Dynamic Graphs
链接: https://arxiv.org/abs/2509.17400
作者: Xiaoyang Xu,Xiaofeng Lin,Koh Takeuchi,Kyohei Atarashi,Hisashi Kashima
类目: Machine Learning (cs.LG)
*备注:
Abstract:Anomaly detection in dynamic graphs is a critical task with broad real-world applications, including social networks, e-commerce, and cybersecurity. Most existing methods assume that normal patterns remain stable over time; however, this assumption often fails in practice due to the phenomenon we refer to as normality distribution shift (NDS), where normal behaviors evolve over time. Ignoring NDS can lead models to misclassify shifted normal instances as anomalies, degrading detection performance. To tackle this issue, we propose WhENDS, a novel unsupervised anomaly detection method that aligns normal edge embeddings across time by estimating distributional statistics and applying whitening transformations. Extensive experiments on four widely-used dynamic graph datasets show that WhENDS consistently outperforms nine strong baselines, achieving state-of-the-art results and underscoring the importance of addressing NDS in dynamic graph anomaly detection.
[LG-39] SilentStriker:Toward Stealthy Bit-Flip Attacks on Large Language Models
链接: https://arxiv.org/abs/2509.17371
作者: Haotian Xu,Qingsong Peng,Jie Shi,Huadi Zheng,Yu Li,Cheng Zhuo
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The rapid adoption of large language models (LLMs) in critical domains has spurred extensive research into their security issues. While input manipulation attacks (e.g., prompt injection) have been well studied, Bit-Flip Attacks (BFAs) – which exploit hardware vulnerabilities to corrupt model parameters and cause severe performance degradation – have received far less attention. Existing BFA methods suffer from key limitations: they fail to balance performance degradation and output naturalness, making them prone to discovery. In this paper, we introduce SilentStriker, the first stealthy bit-flip attack against LLMs that effectively degrades task performance while maintaining output naturalness. Our core contribution lies in addressing the challenge of designing effective loss functions for LLMs with variable output length and the vast output space. Unlike prior approaches that rely on output perplexity for attack loss formulation, which inevitably degrade output naturalness, we reformulate the attack objective by leveraging key output tokens as targets for suppression, enabling effective joint optimization of attack effectiveness and stealthiness. Additionally, we employ an iterative, progressive search strategy to maximize attack efficacy. Experiments show that SilentStriker significantly outperforms existing baselines, achieving successful attacks without compromising the naturalness of generated text.
[LG-40] Word2VecGD: Neural Graph Drawing with Cosine-Stress Optimization
链接: https://arxiv.org/abs/2509.17333
作者: Minglai Yang,Reyan Ahmed
类目: Computational Geometry (cs.CG); Machine Learning (cs.LG)
*备注:
Abstract:We propose a novel graph visualization method leveraging random walk-based embeddings to replace costly graph-theoretical distance computations. Using word2vec-inspired embeddings, our approach captures both structural and semantic relationships efficiently. Instead of relying on exact shortest-path distances, we optimize layouts using cosine dissimilarities, significantly reducing computational overhead. Our framework integrates differentiable stress optimization with stochastic gradient descent (SGD), supporting multi-criteria layout objectives. Experimental results demonstrate that our method produces high-quality, semantically meaningful layouts while efficiently scaling to large graphs. Code available at: this https URL
[LG-41] DiffQ: Unified Parameter Initialization for Variational Quantum Algorithms via Diffusion Models
链接: https://arxiv.org/abs/2509.17324
作者: Chi Zhang,Mengxin Zheng,Qian Lou,Fan Chen
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:Variational Quantum Algorithms (VQAs) are widely used in the noisy intermediate-scale quantum (NISQ) era, but their trainability and performance depend critically on initialization parameters that shape the optimization landscape. Existing machine learning-based initializers achieve state-of-the-art results yet remain constrained to single-task domains and small datasets of only hundreds of samples. We address these limitations by reformulating VQA parameter initialization as a generative modeling problem and introducing DiffQ, a parameter initializer based on the Denoising Diffusion Probabilistic Model (DDPM). To support robust training and evaluation, we construct a dataset of 15,085 instances spanning three domains and five representative tasks. Experiments demonstrate that DiffQ surpasses baselines, reducing initial loss by up to 8.95 and convergence steps by up to 23.4%.
[LG-42] VQEzy: An Open-Source Dataset for Parameter Initialize in Variational Quantum Eigensolvers
链接: https://arxiv.org/abs/2509.17322
作者: Chi Zhang,Mengxin Zheng,Qian Lou,Hui Min Leung,Fan Chen
类目: Machine Learning (cs.LG); Emerging Technologies (cs.ET); Quantum Physics (quant-ph)
*备注:
Abstract:Variational Quantum Eigensolvers (VQEs) are a leading class of noisy intermediate-scale quantum (NISQ) algorithms, whose performance is highly sensitive to parameter initialization. Although recent machine learning-based initialization methods have achieved state-of-the-art performance, their progress has been limited by the lack of comprehensive datasets. Existing resources are typically restricted to a single domain, contain only a few hundred instances, and lack complete coverage of Hamiltonians, ansatz circuits, and optimization trajectories. To overcome these limitations, we introduce VQEzy, the first large-scale dataset for VQE parameter initialization. VQEzy spans three major domains and seven representative tasks, comprising 12,110 instances with full VQE specifications and complete optimization trajectories. The dataset is available online, and will be continuously refined and expanded to support future research in VQE optimization.
[LG-43] Clotho: Measuring Task-Specific Pre-Generation Test Adequacy for LLM Inputs
链接: https://arxiv.org/abs/2509.17314
作者: Juyeon Yoon,Somin Kim,Robert Feldt,Shin Yoo
类目: oftware Engineering (cs.SE); Machine Learning (cs.LG)
*备注:
Abstract:Software increasingly relies on the emergent capabilities of Large Language Models (LLMs), from natural language understanding to program analysis and generation. Yet testing them on specific tasks remains difficult and costly: many prompts lack ground truth, forcing reliance on human judgment, while existing uncertainty and adequacy measures typically require full inference. A key challenge is to assess input adequacy in a way that reflects the demands of the task, ideally before even generating any output. We introduce CLOTHO, a task-specific, pre-generation adequacy measure that estimates input difficulty directly from hidden LLM states. Given a large pool of unlabelled inputs for a specific task, CLOTHO uses a Gaussian Mixture Model (GMM) to adaptively sample the most informative cases for human labelling. Based on this reference set the GMM can then rank unseen inputs by their likelihood of failure. In our empirical evaluation across eight benchmark tasks and three open-weight LLMs, CLOTHO can predict failures with a ROC-AUC of 0.716, after labelling reference sets that are on average only 5.4% of inputs. It does so without generating any outputs, thereby reducing costs compared to existing uncertainty measures. Comparison of CLOTHO and post-generation uncertainty measures shows that the two approaches complement each other. Crucially, we show that adequacy scores learnt from open-weight LLMs transfer effectively to proprietary models, extending the applicability of the approach. When prioritising test inputs for proprietary models, CLOTHO increases the average number of failing inputs from 18.7 to 42.5 out of 100, compared to random prioritisation.
[LG-44] SPRINT: Stochastic Performative Prediction With Variance Reduction
链接: https://arxiv.org/abs/2509.17304
作者: Tian Xie,Ding Zhu,Jia Liu,Mahdi Khalili,Xueru Zhang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Performative prediction (PP) is an algorithmic framework for optimizing machine learning (ML) models where the model’s deployment affects the distribution of the data it is trained on. Compared to traditional ML with fixed data, designing algorithms in PP converging to a stable point – known as a stationary performative stable (SPS) solution – is more challenging than the counterpart in conventional ML tasks due to the model-induced distribution shifts. While considerable efforts have been made to find SPS solutions using methods such as repeated gradient descent (RGD) and greedy stochastic gradient descent (SGD-GD), most prior studies assumed a strongly convex loss until a recent work established \mathcalO(1/\sqrtT) convergence of SGD-GD to SPS solutions under smooth, non-convex losses. However, this latest progress is still based on the restricted bounded variance assumption in stochastic gradient estimates and yields convergence bounds with a non-vanishing error neighborhood that scales with the variance. This limitation motivates us to improve convergence rates and reduce error in stochastic optimization for PP, particularly in non-convex settings. Thus, we propose a new algorithm called stochastic performative prediction with variance reduction (SPRINT) and establish its convergence to an SPS solution at a rate of \mathcalO(1/T) . Notably, the resulting error neighborhood is independent of the variance of the stochastic gradients. Experiments on multiple real datasets with non-convex models demonstrate that SPRINT outperforms SGD-GD in both convergence rate and stability.
[LG-45] Physics-Informed Operator Learning for Hemodynamic Modeling
链接: https://arxiv.org/abs/2509.17293
作者: Ryan Chappell,Chayan Banerjee,Kien Nguyen,Clinton Fookes
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: To appear in the proceedings of DICTA 2025
Abstract:Accurate modeling of personalized cardiovascular dynamics is crucial for non-invasive monitoring and therapy planning. State-of-the-art physics-informed neural network (PINN) approaches employ deep, multi-branch architectures with adversarial or contrastive objectives to enforce partial differential equation constraints. While effective, these enhancements introduce significant training and implementation complexity, limiting scalability and practical deployment. We investigate physics-informed neural operator learning models as efficient supervisory signals for training simplified architectures through knowledge distillation. Our approach pre-trains a physics-informed DeepONet (PI-DeepONet) on high-fidelity cuffless blood pressure recordings to learn operator mappings from raw wearable waveforms to beat-to-beat pressure signals under embedded physics constraints. This pre-trained operator serves as a frozen supervisor in a lightweight knowledge-distillation pipeline, guiding streamlined base models that eliminate complex adversarial and contrastive learning components while maintaining performance. We characterize the role of physics-informed regularization in operator learning and demonstrate its effectiveness for supervisory guidance. Through extensive experiments, our operator-supervised approach achieves performance parity with complex baselines (correlation: 0.766 vs. 0.770, RMSE: 4.452 vs. 4.501), while dramatically reducing architectural complexity from eight critical hyperparameters to a single regularization coefficient and decreasing training overhead by 4%. Our results demonstrate that operator-based supervision effectively replaces intricate multi-component training strategies, offering a more scalable and interpretable approach to physiological modeling with reduced implementation burden.
[LG-46] GraphWeave: Interpretable and Robust Graph Generation via Random Walk Trajectories ECML-PKDD2025
链接: https://arxiv.org/abs/2509.17291
作者: Rahul Nandakumar,Deepayan Chakrabarti
类目: Machine Learning (cs.LG)
*备注: 18 pages, 4 figures. Accepted at ECML-PKDD 2025
Abstract:Given a set of graphs from some unknown family, we want to generate new graphs from that family. Recent methods use diffusion on either graph embeddings or the discrete space of nodes and edges. However, simple changes to embeddings (say, adding noise) can mean uninterpretable changes in the graph. In discrete-space diffusion, each step may add or remove many nodes/edges. It is hard to predict what graph patterns we will observe after many diffusion steps. Our proposed method, called GraphWeave, takes a different approach. We separate pattern generation and graph construction. To find patterns in the training graphs, we see how they transform vectors during random walks. We then generate new graphs in two steps. First, we generate realistic random walk “trajectories” which match the learned patterns. Then, we find the optimal graph that fits these trajectories. The optimization infers all edges jointly, which improves robustness to errors. On four simulated and five real-world benchmark datasets, GraphWeave outperforms existing methods. The most significant differences are on large-scale graph structures such as PageRank, cuts, communities, degree distributions, and flows. GraphWeave is also 10x faster than its closest competitor. Finally, GraphWeave is simple, needing only a transformer and standard optimizers.
[LG-47] Learning and Optimization with 3D Orientations
链接: https://arxiv.org/abs/2509.17274
作者: Alexandros Ntagkas,Constantinos Tsakonas,Chairi Kiourt,Konstantinos Chatzilygeroudis
类目: Robotics (cs.RO); Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注: 9 pages, 11 figures
Abstract:There exist numerous ways of representing 3D orientations. Each representation has both limitations and unique features. Choosing the best representation for one task is often a difficult chore, and there exist conflicting opinions on which representation is better suited for a set of family of tasks. Even worse, when dealing with scenarios where we need to learn or optimize functions with orientations as inputs and/or outputs, the set of possibilities (representations, loss functions, etc.) is even larger and it is not easy to decide what is best for each scenario. In this paper, we attempt to a) present clearly, concisely and with unified notation all available representations, and “tricks” related to 3D orientations (including Lie Group algebra), and b) benchmark them in representative scenarios. The first part feels like it is missing from the robotics literature as one has to read many different textbooks and papers in order have a concise and clear understanding of all possibilities, while the benchmark is necessary in order to come up with recommendations based on empirical evidence. More precisely, we experiment with the following settings that attempt to cover most widely used scenarios in robotics: 1) direct optimization, 2) imitation/supervised learning with a neural network controller, 3) reinforcement learning, and 4) trajectory optimization using differential dynamic programming. We finally provide guidelines depending on the scenario, and make available a reference implementation of all the orientation math described.
[LG-48] Graph Signal Generative Diffusion Models ICASSP2026
链接: https://arxiv.org/abs/2509.17250
作者: Yigit Berkay Uslu,Samar Hadou,Sergio Rozada,Shirin Saeedi Bidokhti,Alejandro Ribeiro
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)
Abstract:We introduce U-shaped encoder-decoder graph neural networks (U-GNNs) for stochastic graph signal generation using denoising diffusion processes. The architecture learns node features at different resolutions with skip connections between the encoder and decoder paths, analogous to the convolutional U-Net for image generation. The U-GNN is prominent for a pooling operation that leverages zero-padding and avoids arbitrary graph coarsening, with graph convolutions layered on top to capture local dependencies. This technique permits learning feature embeddings for sampled nodes at deeper levels of the architecture that remain convolutional with respect to the original graph. Applied to stock price prediction – where deterministic forecasts struggle to capture uncertainties and tail events that are paramount – we demonstrate the effectiveness of the diffusion model in probabilistic forecasting of stock prices.
[LG-49] raceHiding: Scalable Machine Unlearning for Mobility Data
链接: https://arxiv.org/abs/2509.17241
作者: Ali Faraji,Manos Papagelis
类目: Machine Learning (cs.LG); Computers and Society (cs.CY)
*备注:
Abstract:This work introduces TraceHiding, a scalable, importance-aware machine unlearning framework for mobility trajectory data. Motivated by privacy regulations such as GDPR and CCPA granting users “the right to be forgotten,” TraceHiding removes specified user trajectories from trained deep models without full retraining. It combines a hierarchical data-driven importance scoring scheme with teacher-student distillation. Importance scores–computed at token, trajectory, and user levels from statistical properties (coverage diversity, entropy, length)–quantify each training sample’s impact, enabling targeted forgetting of high-impact data while preserving common patterns. The student model retains knowledge on remaining data and unlearns targeted trajectories through an importance-weighted loss that amplifies forgetting signals for unique samples and attenuates them for frequent ones. We validate on Trajectory–User Linking (TUL) tasks across three real-world higher-order mobility datasets (HO-Rome, HO-Geolife, HO-NYC) and multiple architectures (GRU, LSTM, BERT, ModernBERT, GCN-TULHOR), against strong unlearning baselines including SCRUB, NegGrad, NegGrad+, Bad-T, and Finetuning. Experiments under uniform and targeted user deletion show TraceHiding, especially its entropy-based variant, achieves superior unlearning accuracy, competitive membership inference attack (MIA) resilience, and up to 40\times speedup over retraining with minimal test accuracy loss. Results highlight robustness to adversarial deletion of high-information users and consistent performance across models. To our knowledge, this is the first systematic study of machine unlearning for trajectory data, providing a reproducible pipeline with public code and preprocessing tools.
[LG-50] Prospective Multi-Graph Cohesion for Multivariate Time Series Anomaly Detection WSDM2025
链接: https://arxiv.org/abs/2509.17235
作者: Jiazhen Chen,Mingbin Feng,Tony S. Wirjanto
类目: Machine Learning (cs.LG)
*备注: Accepted by the 18th ACM International Conference on Web Search and Data Mining (ACM WSDM 2025)
Abstract:Anomaly detection in high-dimensional time series data is pivotal for numerous industrial applications. Recent advances in multivariate time series anomaly detection (TSAD) have increasingly leveraged graph structures to model inter-variable relationships, typically employing Graph Neural Networks (GNNs). Despite their promising results, existing methods often rely on a single graph representation, which are insufficient for capturing the complex, diverse relationships inherent in multivariate time series. To address this, we propose the Prospective Multi-Graph Cohesion (PMGC) framework for multivariate TSAD. PMGC exploits spatial correlations by integrating a long-term static graph with a series of short-term instance-wise dynamic graphs, regulated through a graph cohesion loss function. Our theoretical analysis shows that this loss function promotes diversity among dynamic graphs while aligning them with the stable long-term relationships encapsulated by the static graph. Additionally, we introduce a “prospective graphing” strategy to mitigate the limitations of traditional forecasting-based TSAD methods, which often struggle with unpredictable future variations. This strategy allows the model to accurately reflect concurrent inter-series relationships under normal conditions, thereby enhancing anomaly detection efficacy. Empirical evaluations on real-world datasets demonstrate the superior performance of our method compared to existing TSAD techniques.
[LG-51] Virtual Consistency for Audio Editing
链接: https://arxiv.org/abs/2509.17219
作者: Matthieu Cervera,Francesco Paissan,Mirco Ravanelli,Cem Subakan
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Free-form, text-based audio editing remains a persistent challenge, despite progress in inversion-based neural methods. Current approaches rely on slow inversion procedures, limiting their practicality. We present a virtual-consistency based audio editing system that bypasses inversion by adapting the sampling process of diffusion models. Our pipeline is model-agnostic, requiring no fine-tuning or architectural changes, and achieves substantial speed-ups over recent neural editing baselines. Crucially, it achieves this efficiency without compromising quality, as demonstrated by quantitative benchmarks and a user study involving 16 participants.
[LG-52] Active Learning for Machine Learning Driven Molecular Dynamics NEURIPS
链接: https://arxiv.org/abs/2509.17208
作者: Kevin Bachelor,Sanya Murdeshwar,Daniel Sabo,Razvan Marinescu
类目: Machine Learning (cs.LG); Atomic and Molecular Clusters (physics.atm-clus)
*备注: 8 pages, 4 figures, for Neurips Workshop: Machine Learning and the Physical Sciences 2025
Abstract:Machine learned coarse grained (CG) potentials are fast, but degrade over time when simulations reach undersampled biomolecular conformations, and generating widespread all atom (AA) data to combat this is computationally infeasible. We propose a novel active learning framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean squared deviation (RMSD) based frame selection from MD simulations in order to generate data on the fly by querying an oracle during the training of a neural network potential. This framework preserves CG level efficiency while correcting the model at precise, RMSD identified coverage gaps. By training CGSchNet, a coarse grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05% improvement in the Wasserstein 1 (W1) metric in Time lagged Independent Component Analysis (TICA) space on an in house benchmark suite.
[LG-53] Conditional Policy Generator for Dynamic Constraint Satisfaction and Optimization
链接: https://arxiv.org/abs/2509.17205
作者: Wook Lee,Frans A. Oliehoek
类目: Machine Learning (cs.LG)
*备注:
Abstract:Leveraging machine learning methods to solve constraint satisfaction problems has shown promising, but they are mostly limited to a static situation where the problem description is completely known and fixed from the beginning. In this work we present a new approach to constraint satisfaction and optimization in dynamically changing environments, particularly when variables in the problem are statistically independent. We frame it as a reinforcement learning problem and introduce a conditional policy generator by borrowing the idea of class conditional generative adversarial networks (GANs). Assuming that the problem includes both static and dynamic constraints, the former are used in a reward formulation to guide the policy training such that it learns to map to a probabilistic distribution of solutions satisfying static constraints from a noise prior, which is similar to a generator in GANs. On the other hand, dynamic constraints in the problem are encoded to different class labels and fed with the input noise. The policy is then simultaneously updated for maximum likelihood of correctly classifying given the dynamic conditions in a supervised manner. We empirically demonstrate a proof-of-principle experiment with a multi-modal constraint satisfaction problem and compare between unconditional and conditional cases.
[LG-54] PMRT: A Training Recipe for Fast 3D High-Resolution Aerodynamic Prediction
链接: https://arxiv.org/abs/2509.17182
作者: Sam Jacob Jacob,Markus Mrosek,Carsten Othmer,Harald Köstler
类目: Machine Learning (cs.LG)
*备注:
Abstract:The aerodynamic optimization of cars requires close collaboration between aerodynamicists and stylists, while slow, expensive simulations remain a bottleneck. Surrogate models have been shown to accurately predict aerodynamics within the design space for which they were trained. However, many of these models struggle to scale to higher resolutions because of the 3D nature of the problem and data scarcity. We propose Progressive Multi-Resolution Training (PMRT), a probabilistic multi-resolution training schedule that enables training a U-Net to predict the drag coefficient ( c_d ) and high-resolution velocity fields (512 x 128 x 128) in 24 hours on a single NVIDIA H100 GPU, 7x cheaper than the high-resolution-only baseline, with similar accuracy. PMRT samples batches from three resolutions based on probabilities that change during training, starting with an emphasis on lower resolutions and gradually shifting toward higher resolutions. Since this is a training methodology, it can be adapted to other high-resolution-focused backbones. We also show that a single model can be trained across five datasets from different solvers, including a real-world dataset, by conditioning on the simulation parameters. In the DrivAerML dataset, our models achieve a c_d R^2 of 0.975, matching literature baselines at a fraction of the training cost.
[LG-55] Regularizing Extrapolation in Causal Inference
链接: https://arxiv.org/abs/2509.17180
作者: David Arbour,Harsh Parikh,Bijan Niknam,Elizabeth Stuart,Kara Rudolph,Avi Feller
类目: Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注:
Abstract:Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which improve feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel “bias-bias-variance” tradeoff, encompassing biases due to feature imbalance, model misspecification, and estimator variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application, involving the generalization of randomized controlled trial estimates to a target population of interest.
[LG-56] A Comprehensive Performance Comparison of Traditional and Ensemble Machine Learning Models for Online Fraud Detection
链接: https://arxiv.org/abs/2509.17176
作者: Ganesh Khekare,Shivam Sunda,Yash Bothra
类目: Machine Learning (cs.LG)
*备注: 6 pages, 6 figures. Presented at IEEE INTERNATIONAL CONFERENCE ON COMPUTING, COMMUNICATION AND NETWORKING TECHNOLOGIES (ICCCNT), 2025
Abstract:In the era of the digitally driven economy, where there has been an exponential surge in digital payment systems and other online activities, various forms of fraudulent activities have accompanied the digital growth, out of which credit card fraud has become an increasingly significant threat. To deal with this, real-time fraud detection is essential for financial security but remains challenging due to high transaction volumes and the complexity of modern fraud patterns. This study presents a comprehensive performance comparison between traditional machine learning models like Random Forest, SVM, Logistic Regression, XGBoost, and ensemble methods like Stacking and Voting Classifier for detecting credit card fraud on a heavily imbalanced public dataset, where the number of fraudulent transactions is 492 out of 284,807 total transactions. Application-specific preprocessing techniques were applied, and the models were evaluated using various performance metrics. The ensemble methods achieved an almost perfect precision of around 0.99, but traditional methods demonstrated superior performance in terms of recall, which highlights the trade-off between false positives and false negatives. The comprehensive comparison reveals distinct performance strengths and limitations for each algorithm, offering insights to guide practitioners in selecting the most effective model for robust fraud detection applications in real-world settings.
[LG-57] Detecting Urban PM_2.5 Hotspots with Mobile Sensing and Gaussian Process Regression
链接: https://arxiv.org/abs/2509.17175
作者: Niál Perry,Peter P. Pedersen,Charles N. Christensen,Emanuel Nussli,Sanelma Heinonen,Lorena Gordillo Dagallier,Raphaël Jacquat,Sebastian Horstmann,Christoph Franck
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 39 pages, 12 figures
Abstract:Low-cost mobile sensors can be used to collect PM _2.5 concentration data throughout an entire city. However, identifying air pollution hotspots from the data is challenging due to the uneven spatial sampling, temporal variations in the background air quality, and the dynamism of urban air pollution sources. This study proposes a method to identify urban PM _2.5 hotspots that addresses these challenges, involving four steps: (1) equip citizen scientists with mobile PM _2.5 sensors while they travel; (2) normalise the raw data to remove the influence of background ambient pollution levels; (3) fit a Gaussian process regression model to the normalised data and (4) calculate a grid of spatially explicit ‘hotspot scores’ using the probabilistic framework of Gaussian processes, which conveniently summarise the relative pollution levels throughout the city. We apply our method to create the first ever map of PM _2.5 pollution in Kigali, Rwanda, at a 200m resolution. Our results suggest that the level of ambient PM _2.5 pollution in Kigali is dangerously high, and we identify the hotspots in Kigali where pollution consistently exceeds the city-wide average. We also evaluate our method using simulated mobile sensing data for Beijing, China, where we find that the hotspot scores are probabilistically well calibrated and accurately reflect the ‘ground truth’ spatial profile of PM _2.5 pollution. Thanks to the use of open-source software, our method can be re-applied in cities throughout the world with a handful of low-cost sensors. The method can help fill the gap in urban air quality information and empower public health officials.
[LG-58] Unrolled Graph Neural Networks for Constrained Optimization
链接: https://arxiv.org/abs/2509.17156
作者: Samar Hadou,Alejandro Ribeiro
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we unroll the dynamics of the dual ascent (DA) algorithm in two coupled graph neural networks (GNNs) to solve constrained optimization problems. The two networks interact with each other at the layer level to find a saddle point of the Lagrangian. The primal GNN finds a stationary point for a given dual multiplier, while the dual network iteratively refines its estimates to reach an optimal solution. We force the primal and dual networks to mirror the dynamics of the DA algorithm by imposing descent and ascent constraints. We propose a joint training scheme that alternates between updating the primal and dual networks. Our numerical experiments demonstrate that our approach yields near-optimal near-feasible solutions and generalizes well to out-of-distribution (OOD) problems.
[LG-59] Data-efficient Kernel Methods for Learning Hamiltonian Systems
链接: https://arxiv.org/abs/2509.17154
作者: Yasamin Jalalian,Mostafa Samir,Boumediene Hamzi,Peyman Tavallali,Houman Owhadi
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Dynamical Systems (math.DS); Machine Learning (stat.ML)
*备注:
Abstract:Hamiltonian dynamics describe a wide range of physical systems. As such, data-driven simulations of Hamiltonian systems are important for many scientific and engineering problems. In this work, we propose kernel-based methods for identifying and forecasting Hamiltonian systems directly from data. We present two approaches: a two-step method that reconstructs trajectories before learning the Hamiltonian, and a one-step method that jointly infers both. Across several benchmark systems, including mass-spring dynamics, a nonlinear pendulum, and the Henon-Heiles system, we demonstrate that our framework achieves accurate, data-efficient predictions and outperforms two-step kernel-based baselines, particularly in scarce-data regimes, while preserving the conservation properties of Hamiltonian dynamics. Moreover, our methodology provides theoretical a priori error estimates, ensuring reliability of the learned models. We also provide a more general, problem-agnostic numerical framework that goes beyond Hamiltonian systems and can be used for data-driven learning of arbitrary dynamical systems.
[LG-60] On the Simplification of Neural Network Architectures for Predictive Process Monitoring
链接: https://arxiv.org/abs/2509.17145
作者: Amaan Ansari,Lukas Kirchdorfer,Raheleh Hadian
类目: Machine Learning (cs.LG)
*备注:
Abstract:Predictive Process Monitoring (PPM) aims to forecast the future behavior of ongoing process instances using historical event data, enabling proactive decision-making. While recent advances rely heavily on deep learning models such as LSTMs and Transformers, their high computational cost hinders practical adoption. Prior work has explored data reduction techniques and alternative feature encodings, but the effect of simplifying model architectures themselves remains underexplored. In this paper, we analyze how reducing model complexity, both in terms of parameter count and architectural depth, impacts predictive performance, using two established PPM approaches. Across five diverse event logs, we show that shrinking the Transformer model by 85% results in only a 2-3% drop in performance across various PPM tasks, while the LSTM proves slightly more sensitive, particularly for waiting time prediction. Overall, our findings suggest that substantial model simplification can preserve predictive accuracy, paving the way for more efficient and scalable PPM solutions.
[LG-61] Delay compensation of multi-input distinct delay nonlinear systems via neural operators
链接: https://arxiv.org/abs/2509.17131
作者: Filip Bajraktari,Luke Bhan,Miroslav Krstic,Yuanyuan Shi
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS)
*备注: 8 pages, 1 figure
Abstract:In this work, we present the first stability results for approximate predictors in multi-input non-linear systems with distinct actuation delays. We show that if the predictor approximation satisfies a uniform (in time) error bound, semi-global practical stability is correspondingly achieved. For such approximators, the required uniform error bound depends on the desired region of attraction and the number of control inputs in the system. The result is achieved through transforming the delay into a transport PDE and conducting analysis on the coupled ODE-PDE cascade. To highlight the viability of such error bounds, we demonstrate our results on a class of approximators - neural operators - showcasing sufficiency for satisfying such a universal bound both theoretically and in simulation on a mobile robot experiment.
[LG-62] GRPOformer: Advancing Hyperparameter Optimization via Group Relative Policy Optimization
链接: https://arxiv.org/abs/2509.17105
作者: Haoxin Guo,Jiawen Pan,Weixin Zhai
类目: Machine Learning (cs.LG)
*备注:
Abstract:Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories and lack effective reinforcement learning (RL) techniques, thereby limiting their efficiency and performance improvements. Inspired by the success of Group Relative Policy Optimization (GRPO) in large language models (LLMs), we propose GRPOformer – a novel hyperparameter optimization framework that integrates reinforcement learning (RL) with Transformers. In GRPOformer, Transformers are employed to generate new hyperparameter configurations from historical optimization trajectories, while GRPO enables rapid trajectory construction and optimization strategy learning from scratch. Moreover, we introduce Policy Churn Regularization (PCR) to enhance the stability of GRPO training. Experimental results on OpenML demonstrate that GRPOformer consistently outperforms baseline methods across diverse tasks, offering new insights into the application of RL for HPO.
[LG-63] Machine Learning for Campus Energy Resilience: Clustering and Time-Series Forecasting in Intelligent Load Shedding NEURIPS2025
链接: https://arxiv.org/abs/2509.17097
作者: Salim Oyinlola,Peter Olabisi Oluseyi
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: Submitted for the NeurIPS 2025 Climata Change AI Workshop in San Diego, USA
Abstract:The growing demand for reliable electricity in universities necessitates intelligent energy management. This study proposes a machine learning-based load shedding framework for the University of Lagos, designed to optimize distribution and reduce waste. The methodology followed three main stages. First, a dataset of 3,648 hourly records from 55 buildings was compiled to develop building-level consumption models. Second, Principal Component Analysis was applied for dimensionality reduction, and clustering validation techniques were used to determine the optimal number of demand groups. Mini-Batch K-Means was then employed to classify buildings into high-, medium-, and low-demand clusters. Finally, short-term load forecasting was performed at the cluster level using multiple statistical and deep learning models, including ARIMA, SARIMA, Prophet, LSTM, and GRU. Results showed Prophet offered the most reliable forecasts, while Mini-Batch K-Means achieved stable clustering performance. By integrating clustering with forecasting, the framework enabled a fairer, data-driven load shedding strategy that reduces inefficiencies and supports climate change mitigation through sustainable energy management.
[LG-64] On the Limits of Tabular Hardness Metrics for Deep RL: A Study with the Pharos Benchmark
链接: https://arxiv.org/abs/2509.17092
作者: Michelangelo Conserva,Remo Sasso,Paulo Rauber
类目: Machine Learning (cs.LG)
*备注:
Abstract:Principled evaluation is critical for progress in deep reinforcement learning (RL), yet it lags behind the theory-driven benchmarks of tabular RL. While tabular settings benefit from well-understood hardness measures like MDP diameter and suboptimality gaps, deep RL benchmarks are often chosen based on intuition and popularity. This raises a critical question: can tabular hardness metrics be adapted to guide non-tabular benchmarking? We investigate this question and reveal a fundamental gap. Our primary contribution is demonstrating that the difficulty of non-tabular environments is dominated by a factor that tabular metrics ignore: representation hardness. The same underlying MDP can pose vastly different challenges depending on whether the agent receives state vectors or pixel-based observations. To enable this analysis, we introduce \textttpharos, a new open-source library for principled RL benchmarking that allows for systematic control over both environment structure and agent representations. Our extensive case study using \textttpharos shows that while tabular metrics offer some insight, they are poor predictors of deep RL agent performance on their own. This work highlights the urgent need for new, representation-aware hardness measures and positions \textttpharos as a key tool for developing them.
[LG-65] SGym: Design Choices for Deep Multivariate Time-Series Forecasting
链接: https://arxiv.org/abs/2509.17063
作者: Shuang Liang,Chaochuan Hou,Xu Yao,Shiping Wang,Minqi Jiang,Songqiao Han,Hailiang Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:Recently, deep learning has driven significant advancements in multivariate time series forecasting (MTSF) tasks. However, much of the current research in MTSF tends to evaluate models from a holistic perspective, which obscures the individual contributions and leaves critical issues unaddressed. Adhering to the current modeling paradigms, this work bridges these gaps by systematically decomposing deep MTSF methods into their core, fine-grained components like series-patching tokenization, channel-independent strategy, attention modules, or even Large Language Models and Time-series Foundation Models. Through extensive experiments and component-level analysis, our work offers more profound insights than previous benchmarks that typically discuss models as a whole. Furthermore, we propose a novel automated solution called TSGym for MTSF tasks. Unlike traditional hyperparameter tuning, neural architecture searching or fixed model selection, TSGym performs fine-grained component selection and automated model construction, which enables the creation of more effective solutions tailored to diverse time series data, therefore enhancing model transferability across different data sources and robustness against distribution shifts. Extensive experiments indicate that TSGym significantly outperforms existing state-of-the-art MTSF and AutoML methods. All code is publicly available on this https URL. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.17063 [cs.LG] (or arXiv:2509.17063v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.17063 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-66] Enhancing Performance and Calibration in Quantile Hyperparameter Optimization
链接: https://arxiv.org/abs/2509.17051
作者: Riccardo Doyle
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 19 pages, 15 figures, 1 table
Abstract:Bayesian hyperparameter optimization relies heavily on Gaussian Process (GP) surrogates, due to robust distributional posteriors and strong performance on limited training samples. GPs however underperform in categorical hyperparameter environments or when assumptions of normality, heteroskedasticity and symmetry are excessively challenged. Conformalized quantile regression can address these estimation weaknesses, while still providing robust calibration guarantees. This study builds upon early work in this area by addressing feedback covariate shift in sequential acquisition and integrating a wider range of surrogate architectures and acquisition functions. Proposed algorithms are rigorously benchmarked against a range of state of the art hyperparameter optimization methods (GP, TPE and SMAC). Findings identify quantile surrogate architectures and acquisition functions yielding superior performance to the current quantile literature, while validating the beneficial impact of conformalization on calibration and search performance.
[LG-67] Persistence Spheres: Bi-continuous Representations of Persistence Diagrams
链接: https://arxiv.org/abs/2509.16999
作者: Matteo Pegoraro
类目: Machine Learning (cs.LG)
*备注:
Abstract:We introduce persistence spheres, a novel functional representation of persistence diagrams. Unlike existing embeddings (such as persistence images, landscapes, or kernel methods), persistence spheres provide a bi-continuous mapping: they are Lipschitz continuous with respect to the 1-Wasserstein distance and admit a continuous inverse on their image. This ensures, in a theoretically optimal way, both stability and geometric fidelity, making persistence spheres the representation that most closely mirrors the Wasserstein geometry of PDs in linear space. We derive explicit formulas for persistence spheres, showing that they can be computed efficiently and parallelized with minimal overhead. Empirically, we evaluate them on diverse regression and classification tasks involving functional data, time series, graphs, meshes, and point clouds. Across these benchmarks, persistence spheres consistently deliver state-of-the-art or competitive performance compared to persistence images, persistence landscapes, and the sliced Wasserstein kernel.
[LG-68] NeuFACO: Neural Focused Ant Colony Optimization for Traveling Salesman Problem
链接: https://arxiv.org/abs/2509.16938
作者: Tran Thanh Dat,Tran Quang Khai,Pham Anh Khoi,Vu Van Khu,Do Duc Dong
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注: Submitted to RIVF’25. Code is available at this https URL
Abstract:This study presents Neural Focused Ant Colony Optimization (NeuFACO), a non-autoregressive framework for the Traveling Salesman Problem (TSP) that combines advanced reinforcement learning with enhanced Ant Colony Optimization (ACO). NeuFACO employs Proximal Policy Optimization (PPO) with entropy regularization to train a graph neural network for instance-specific heuristic guidance, which is integrated into an optimized ACO framework featuring candidate lists, restricted tour refinement, and scalable local search. By leveraging amortized inference alongside ACO stochastic exploration, NeuFACO efficiently produces high-quality solutions across diverse TSP instances.
[LG-69] Adaptive Graph Convolution and Semantic-Guided Attention for Multimodal Risk Detection in Social Networks
链接: https://arxiv.org/abs/2509.16936
作者: Cuiqianhe Du,Chia-En Chiang,Tianyi Huang,Zikun Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:This paper focuses on the detection of potentially dangerous tendencies of social media users in an innovative multimodal way. We integrate Natural Language Processing (NLP) and Graph Neural Networks (GNNs) together. Firstly, we apply NLP on the user-generated text and conduct semantic analysis, sentiment recognition and keyword extraction to get subtle risk signals from social media posts. Meanwhile, we build a heterogeneous user relationship graph based on social interaction and propose a novel relational graph convolutional network to model user relationship, attention relationship and content dissemination path to discover some important structural information and user behaviors. Finally, we combine textual features extracted from these two models above with graph structural information, which provides a more robust and effective way to discover at-risk users. Our experiments on real social media datasets from different platforms show that our model can achieve significant improvement over single-modality methods.
[LG-70] Auditability and the Landscape of Distance to Multicalibration
链接: https://arxiv.org/abs/2509.16930
作者: Nathan Derhake,Siddartha Devic,Dutch Hansen,Kuan Liu,Vatsal Sharan
类目: Machine Learning (cs.LG)
*备注: 41 pages
Abstract:Calibration is a critical property for establishing the trustworthiness of predictors that provide uncertainty estimates. Multicalibration is a strengthening of calibration which requires that predictors be calibrated on a potentially overlapping collection of subsets of the domain. As multicalibration grows in popularity with practitioners, an essential question is: how do we measure how multicalibrated a predictor is? Błasiok et al. (2023) considered this question for standard calibration by introducing the distance to calibration framework (dCE) to understand how calibration metrics relate to each other and the ground truth. Building on the dCE framework, we consider the auditability of the distance to multicalibration of a predictor f . We begin by considering two natural generalizations of dCE to multiple subgroups: worst group dCE (wdMC), and distance to multicalibration (dMC). We argue that there are two essential properties of any multicalibration error metric: 1) the metric should capture how much f would need to be modified in order to be perfectly multicalibrated; and 2) the metric should be auditable in an information theoretic sense. We show that wdMC and dMC each fail to satisfy one of these two properties, and that similar barriers arise when considering the auditability of general distance to multigroup fairness notions. We then propose two (equivalent) multicalibration metrics which do satisfy these requirements: 1) a continuized variant of dMC; and 2) a distance to intersection multicalibration, which leans on intersectional fairness desiderata. Along the way, we shed light on the loss-landscape of distance to multicalibration and the geometry of the set of perfectly multicalibrated predictors. Our findings may have implications for the development of stronger multicalibration algorithms as well as multigroup auditing more generally. Comments: 41 pages Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.16930 [cs.LG] (or arXiv:2509.16930v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.16930 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-71] he Complexity of Finding Local Optima in Contrastive Learning NEURIPS2025
链接: https://arxiv.org/abs/2509.16898
作者: Jingming Yan,Yiyuan Luo,Vaggos Chatziafratis,Ioannis Panageas,Parnian Shahkar,Stelios Stavroulakis
类目: Machine Learning (cs.LG); Computational Complexity (cs.CC); Optimization and Control (math.OC)
*备注: To appear as a conference paper in NeurIPS 2025
Abstract:Contrastive learning is a powerful technique for discovering meaningful data representations by optimizing objectives based on \textitcontrastive information , often given as a set of weighted triplets (x_i, y_i^+, z_i^-)_i = 1^m indicating that an “anchor” x_i is more similar to a “positive” example y_i than to a “negative” example z_i . The goal is to find representations (e.g., embeddings in \mathbbR^d or a tree metric) where anchors are placed closer to positive than to negative examples. While finding \textitglobal optima of contrastive objectives is \mathsfNP -hard, the complexity of finding \textitlocal optima – representations that do not improve by local search algorithms such as gradient-based methods – remains open. Our work settles the complexity of finding local optima in various contrastive learning problems by proving \mathsfPLS -hardness in discrete settings (e.g., maximize satisfied triplets) and \mathsfCLS -hardness in continuous settings (e.g., minimize Triplet Loss), where \mathsfPLS (Polynomial Local Search) and \mathsfCLS (Continuous Local Search) are well-studied complexity classes capturing local search dynamics in discrete and continuous optimization, respectively. Our results imply that no polynomial time algorithm (local search or otherwise) can find a local optimum for various contrastive learning problems, unless \mathsfPLS\subseteq\mathsfP (or \mathsfCLS\subseteq \mathsfP for continuous problems). Even in the unlikely scenario that \mathsfPLS\subseteq\mathsfP (or \mathsfCLS\subseteq \mathsfP ), our reductions imply that there exist instances where local search algorithms need exponential time to reach a local optimum, even for d=1 (embeddings on a line).
[LG-72] LVADNet3D: A Deep Autoencoder for Reconstructing 3D Intraventricular Flow from Sparse Hemodynamic Data ICML
链接: https://arxiv.org/abs/2509.16860
作者: Mohammad Abdul Hafeez Khan,Marcello Mattei Di Eugeni,Benjamin Diaz,Ruth E. White,Siddhartha Bhattacharyya,Venkat Keshav Chivukula
类目: Machine Learning (cs.LG)
*备注: Accepted to International Conference on Machine Learning and Applications (ICMLA), 6 pages, 4 figure, 3 tables
Abstract:Accurate assessment of intraventricular blood flow is essential for evaluating hemodynamic conditions in patients supported by Left Ventricular Assist Devices (LVADs). However, clinical imaging is either incompatible with LVADs or yields sparse, low-quality velocity data. While Computational Fluid Dynamics (CFD) simulations provide high-fidelity data, they are computationally intensive and impractical for routine clinical use. To address this, we propose LVADNet3D, a 3D convolutional autoencoder that reconstructs full-resolution intraventricular velocity fields from sparse velocity vector inputs. In contrast to a standard UNet3D model, LVADNet3D incorporates hybrid downsampling and a deeper encoder-decoder architecture with increased channel capacity to better capture spatial flow patterns. To train and evaluate the models, we generate a high-resolution synthetic dataset of intraventricular blood flow in LVAD-supported hearts using CFD simulations. We also investigate the effect of conditioning the models on anatomical and physiological priors. Across various input configurations, LVADNet3D outperforms the baseline UNet3D model, yielding lower reconstruction error and higher PSNR results.
[LG-73] DISCO: Disentangled Communication Steering for Large Language Models
链接: https://arxiv.org/abs/2509.16820
作者: Max Torop,Aria Masoomi,Masih Eskandar,Jennifer Dy
类目: Machine Learning (cs.LG)
*备注:
Abstract:A variety of recent methods guide large language model outputs via the inference-time addition of steering vectors to residual-stream or attention-head representations. In contrast, we propose to inject steering vectors directly into the query and value representation spaces within attention heads. We provide evidence that a greater portion of these spaces exhibit high linear discriminability of concepts --a key property motivating the use of steering vectors-- than attention head outputs. We analytically characterize the effect of our method, which we term DISentangled COmmunication (DISCO) Steering, on attention head outputs. Our analysis reveals that DISCO disentangles a strong but underutilized baseline, steering attention inputs, which implicitly modifies queries and values in a rigid manner. In contrast, DISCO’s direct modulation of these components enables more granular control. We find that DISCO achieves superior performance over a number of steering vector baselines across multiple datasets on LLaMA 3.1 8B and Gemma 2 9B, with steering efficacy scoring up to 19.1% higher than the runner-up. Our results support the conclusion that the query and value spaces are powerful building blocks for steering vector methods.
[LG-74] Randomized Space-Time Sampling for Affine Graph Dynamical Systems
链接: https://arxiv.org/abs/2509.16818
作者: Le Gong,Longxiu Huang
类目: Numerical Analysis (math.NA); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:This paper investigates the problem of dynamical sampling for graph signals influenced by a constant source term. We consider signals evolving over time according to a linear dynamical system on a graph, where both the initial state and the source term are bandlimited. We introduce two random space-time sampling regimes and analyze the conditions under which stable recovery is achievable. While our framework extends recent work on homogeneous dynamics, it addresses a fundamentally different setting where the evolution includes a constant source term. This results in a non-orthogonal-diagonalizable system matrix, rendering classical spectral techniques inapplicable and introducing new challenges in sampling design, stability analysis, and joint recovery of both the initial state and the forcing term. A key component of our analysis is the spectral graph weighted coherence, which characterizes the interplay between the sampling distribution and the graph structure. We establish sampling complexity bounds ensuring stable recovery via the Restricted Isometry Property (RIP), and develop a robust recovery algorithm with provable error guarantees. The effectiveness of our method is validated through extensive experiments on both synthetic and real-world datasets.
[LG-75] Sublinear Time Quantum Sensitivity Sampling
链接: https://arxiv.org/abs/2509.16801
作者: Zhao Song,David P. Woodruff,Lichen Zhang
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:
Abstract:We present a unified framework for quantum sensitivity sampling, extending the advantages of quantum computing to a broad class of classical approximation problems. Our unified framework provides a streamlined approach for constructing coresets and offers significant runtime improvements in applications such as clustering, regression, and low-rank approximation. Our contributions include: * k -median and k -means clustering: For n points in d -dimensional Euclidean space, we give an algorithm that constructs an \epsilon -coreset in time \widetilde O(n^0.5dk^2.5~\mathrmpoly(\epsilon^-1)) for k -median and k -means clustering. Our approach achieves a better dependence on d and constructs smaller coresets that only consist of points in the dataset, compared to recent results of [Xue, Chen, Li and Jiang, ICML’23]. * \ell_p regression: For \ell_p regression problems, we construct an \epsilon -coreset of size \widetilde O_p(d^\max\1, p/2\epsilon^-2) in time \widetilde O_p(n^0.5d^\max\0.5, p/4+1(\epsilon^-3+d^0.5)) , improving upon the prior best quantum sampling approach of [Apers and Gribling, QIP’24] for all p\in (0, 2)\cup (2, 22] , including the widely studied least absolute deviation regression ( \ell_1 regression). * Low-rank approximation with Frobenius norm error: We introduce the first quantum sublinear-time algorithm for low-rank approximation that does not rely on data-dependent parameters, and runs in \widetilde O(nd^0.5k^0.5\epsilon^-1) time. Additionally, we present quantum sublinear algorithms for kernel low-rank approximation and tensor low-rank approximation, broadening the range of achievable sublinear time algorithms in randomized numerical linear algebra. Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Quantum Physics (quant-ph) Cite as: arXiv:2509.16801 [cs.DS] (or arXiv:2509.16801v1 [cs.DS] for this version) https://doi.org/10.48550/arXiv.2509.16801 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Lichen Zhang [view email] [v1] Sat, 20 Sep 2025 20:18:49 UTC (63 KB)
[LG-76] Spectral Analysis of the Weighted Frobenius Objective
链接: https://arxiv.org/abs/2509.16783
作者: Vladislav Trifonov,Ivan Oseledets,Ekaterina Muravleva
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG)
*备注:
Abstract:We analyze a weighted Frobenius loss for approximating symmetric positive definite matrices in the context of preconditioning iterative solvers. Unlike the standard Frobenius norm, the weighted loss penalizes error components associated with small eigenvalues of the system matrix more strongly. Our analysis reveals that each eigenmode is scaled by the corresponding square of its eigenvalue, and that, under a fixed error budget, the loss is minimized only when the error is confined to the direction of the largest eigenvalue. This provides a rigorous explanation of why minimizing the weighted loss naturally suppresses low-frequency components, which can be a desirable strategy for the conjugate gradient method. The analysis is independent of the specific approximation scheme or sparsity pattern, and applies equally to incomplete factorizations, algebraic updates, and learning-based constructions. Numerical experiments confirm the predictions of the theory, including an illustration where sparse factors are trained by a direct gradient updates to IC(0) factor entries, i.e., no trained neural network model is used.
[LG-77] Improving User Interface Generation Models from Designer Feedback
链接: https://arxiv.org/abs/2509.16779
作者: Jason Wu,Amanda Swearngin,Arun Krishna Vajjala,Alan Leung,Jeffrey Nichols,Titus Barik
类目: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Despite being trained on vast amounts of data, most LLMs are unable to reliably generate well-designed UIs. Designer feedback is essential to improving performance on UI generation; however, we find that existing RLHF methods based on ratings or rankings are not well-aligned with designers’ workflows and ignore the rich rationale used to critique and improve UI designs. In this paper, we investigate several approaches for designers to give feedback to UI generation models, using familiar interactions such as commenting, sketching and direct manipulation. We first perform a study with 21 designers where they gave feedback using these interactions, which resulted in ~1500 design annotations. We then use this data to finetune a series of LLMs to generate higher quality UIs. Finally, we evaluate these models with human judges, and we find that our designer-aligned approaches outperform models trained with traditional ranking feedback and all tested baselines, including GPT-5.
[LG-78] Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees
链接: https://arxiv.org/abs/2509.16756
作者: Yuchen Liang,Yingbin Liang,Lifeng Lai,Ness Shroff
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, \tau -leaping samplers have become particularly popular due to their empirical success. However, existing theoretical analyses of \tau -leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard \tau -leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie \tau -leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
[LG-79] Interpretable Clinical Classification with Kolgomorov-Arnold Networks
链接: https://arxiv.org/abs/2509.16750
作者: Alejandro Almodóvar,Patricia A. Apellániz,Alba Garrido,Fernando Fernández-Salvador,Santiago Zazo,Juan Parras
类目: Machine Learning (cs.LG)
*备注:
Abstract:Why should a clinician trust an Artificial Intelligence (AI) prediction? Despite the increasing accuracy of machine learning methods in medicine, the lack of transparency continues to hinder their adoption in clinical practice. In this work, we explore Kolmogorov-Arnold Networks (KANs) for clinical classification tasks on tabular data. Unlike traditional neural networks, KANs are function-based architectures that offer intrinsic interpretability through transparent, symbolic representations. We introduce Logistic-KAN, a flexible generalization of logistic regression, and Kolmogorov-Arnold Additive Model (KAAM), a simplified additive variant that delivers transparent, symbolic formulas. Unlike black-box models that require post-hoc explainability tools, our models support built-in patient-level insights, intuitive visualizations, and nearest-patient retrieval. Across multiple health datasets, our models match or outperform standard baselines, while remaining fully interpretable. These results position KANs as a promising step toward trustworthy AI that clinicians can understand, audit, and act upon.
[LG-80] On the System Theoretic Offline Learning of Continuous-Time LQR with Exogenous Disturbances
链接: https://arxiv.org/abs/2509.16746
作者: Sayak Mukherjee,Ramij R. Hossain,Mahantesh Halappanavar
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 17 pages, 3 figures
Abstract:We analyze offline designs of linear quadratic regulator (LQR) strategies with uncertain disturbances. First, we consider the scenario where the exogenous variable can be estimated in a controlled environment, and subsequently, consider a more practical and challenging scenario where it is unknown in a stochastic setting. Our approach builds on the fundamental learning-based framework of adaptive dynamic programming (ADP), combined with a Lyapunov-based analytical methodology to design the algorithms and derive sample-based approximations motivated from the Markov decision process (MDP)-based approaches. For the scenario involving non-measurable disturbances, we further establish stability and convergence guarantees for the learned control gains under sample-based approximations. The overall methodology emphasizes simplicity while providing rigorous guarantees. Finally, numerical experiments focus on the intricacies and validations for the design of offline continuous-time LQR with exogenous disturbances.
[LG-81] HypeMARL: Multi-Agent Reinforcement Learning For High-Dimensional Parametric and Distributed Systems
链接: https://arxiv.org/abs/2509.16709
作者: Nicolò Botteghi,Matteo Tomasetto,Urban Fasel,Francesco Braghin,Andrea Manzoni
类目: Machine Learning (cs.LG)
*备注:
Abstract:Deep reinforcement learning has recently emerged as a promising feedback control strategy for complex dynamical systems governed by partial differential equations (PDEs). When dealing with distributed, high-dimensional problems in state and control variables, multi-agent reinforcement learning (MARL) has been proposed as a scalable approach for breaking the curse of dimensionality. In particular, through decentralized training and execution, multiple agents cooperate to steer the system towards a target configuration, relying solely on local state and reward information. However, the principle of locality may become a limiting factor whenever a collective, nonlocal behavior of the agents is crucial to maximize the reward function, as typically happens in PDE-constrained optimal control problems. In this work, we propose HypeMARL: a decentralized MARL algorithm tailored to the control of high-dimensional, parametric, and distributed systems. HypeMARL employs hypernetworks to effectively parametrize the agents’ policies and value functions with respect to the system parameters and the agents’ relative positions, encoded by sinusoidal positional encoding. Through the application on challenging control problems, such as density and flow control, we show that HypeMARL (i) can effectively control systems through a collective behavior of the agents, outperforming state-of-the-art decentralized MARL, (ii) can efficiently deal with parametric dependencies, (iii) requires minimal hyperparameter tuning and (iv) can reduce the amount of expensive environment interactions by a factor of ~10 thanks to its model-based extension, MB-HypeMARL, which relies on computationally efficient deep learning-based surrogate models approximating the dynamics locally, with minimal deterioration of the policy performance.
[LG-82] boldsymbolλ-Orthogonality Regularization for Compatible Representation Learning NEURIPS2025
链接: https://arxiv.org/abs/2509.16664
作者: Simone Ricci,Niccolò Biondi,Federico Pernici,Ioannis Patras,Alberto Del Bimbo
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS2025
Abstract:Retrieval systems rely on representations learned by increasingly powerful models. However, due to the high training cost and inconsistencies in learned representations, there is significant interest in facilitating communication between representations and ensuring compatibility across independently trained neural networks. In the literature, two primary approaches are commonly used to adapt different learned representations: affine transformations, which adapt well to specific distributions but can significantly alter the original representation, and orthogonal transformations, which preserve the original structure with strict geometric constraints but limit adaptability. A key challenge is adapting the latent spaces of updated models to align with those of previous models on downstream distributions while preserving the newly learned representation spaces. In this paper, we impose a relaxed orthogonality constraint, namely \lambda -orthogonality regularization, while learning an affine transformation, to obtain distribution-specific adaptation while retaining the original learned representations. Extensive experiments across various architectures and datasets validate our approach, demonstrating that it preserves the model’s zero-shot performance and ensures compatibility across model updates. Code available at: this https URL
[LG-83] Safe Guaranteed Dynamics Exploration with Probabilistic Models
链接: https://arxiv.org/abs/2509.16650
作者: Manish Prajapat,Johannes Köhler,Melanie N. Zeilinger,Andreas Krause
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO); Dynamical Systems (math.DS); Optimization and Control (math.OC)
*备注:
Abstract:Ensuring both optimality and safety is critical for the real-world deployment of agents, but becomes particularly challenging when the system dynamics are unknown. To address this problem, we introduce a notion of maximum safe dynamics learning via sufficient exploration in the space of safe policies. We propose a \textitpessimistically safe framework that \textitoptimistically explores informative states and, despite not reaching them due to model uncertainty, ensures continuous online learning of dynamics. The framework achieves first-of-its-kind results: learning the dynamics model sufficiently - up to an arbitrary small tolerance (subject to noise) - in a finite time, while ensuring provably safe operation throughout with high probability and without requiring resets. Building on this, we propose an algorithm to maximize rewards while learning the dynamics \textitonly to the extent needed to achieve close-to-optimal performance. Unlike typical reinforcement learning (RL) methods, our approach operates online in a non-episodic setting and ensures safety throughout the learning process. We demonstrate the effectiveness of our approach in challenging domains such as autonomous car racing and drone navigation under aerodynamic effects - scenarios where safety is critical and accurate modeling is difficult.
[LG-84] Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features NEURIPS2025
链接: https://arxiv.org/abs/2509.16629
作者: Kaichen Xu,Yihang Du,Mianpeng Liu,Zimu Yu,Xiaobo Sun
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted by NeurIPS 2025
Abstract:Positional encoding is essential for supplementing transformer with positional information of tokens. Existing positional encoding methods demand predefined token/feature order, rendering them unsuitable for real-world data with non-sequential yet causally-related features. To address this limitation, we propose CAPE, a novel method that identifies underlying causal structure over non-sequential features as a weighted directed acyclic graph (DAG) using generalized structural equation modeling. The DAG is then embedded in hyperbolic space where its geometric structure is well-preserved using a hyperboloid model-based approach that effectively captures two important causal graph properties (causal strength causal specificity). This step yields causality-aware positional encodings for the features, which are converted into their rotary form for integrating with transformer’s self-attention mechanism. Theoretical analysis reveals that CAPE-generated rotary positional encodings possess three valuable properties for enhanced self-attention, including causal distance-induced attenuation, causal generality-induced attenuation, and robustness to positional disturbances. We evaluate CAPE over both synthetic and real-word datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. Our code is available at this https URL.
[LG-85] Self-Supervised Learning of Graph Representations for Network Intrusion Detection NEURIPS2025
链接: https://arxiv.org/abs/2509.16625
作者: Lorenzo Guerra,Thomas Chapuis,Guillaume Duc,Pavlo Mozharovskyi,Van-Tam Nguyen
类目: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*备注: Accepted at NeurIPS 2025
Abstract:Detecting intrusions in network traffic is a challenging task, particularly under limited supervision and constantly evolving attack patterns. While recent works have leveraged graph neural networks for network intrusion detection, they often decouple representation learning from anomaly detection, limiting the utility of the embeddings for identifying attacks. We propose GraphIDS, a self-supervised intrusion detection model that unifies these two stages by learning local graph representations of normal communication patterns through a masked autoencoder. An inductive graph neural network embeds each flow with its local topological context to capture typical network behavior, while a Transformer-based encoder-decoder reconstructs these embeddings, implicitly learning global co-occurrence patterns via self-attention without requiring explicit positional information. During inference, flows with unusually high reconstruction errors are flagged as potential intrusions. This end-to-end framework ensures that embeddings are directly optimized for the downstream task, facilitating the recognition of malicious traffic. On diverse NetFlow benchmarks, GraphIDS achieves up to 99.98% PR-AUC and 99.61% macro F1-score, outperforming baselines by 5-25 percentage points.
[LG-86] ORN-CBF: Learning Observation-conditioned Residual Neural Control Barrier Functions via Hypernetworks
链接: https://arxiv.org/abs/2509.16614
作者: Bojan Derajić,Sebastian Bernhard,Wolfgang Hönig
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注:
Abstract:Control barrier functions (CBFs) have been demonstrated as an effective method for safety-critical control of autonomous systems. Although CBFs are simple to deploy, their design remains challenging, motivating the development of learning-based approaches. Yet, issues such as suboptimal safe sets, applicability in partially observable environments, and lack of rigorous safety guarantees persist. In this work, we propose observation-conditioned neural CBFs based on Hamilton-Jacobi (HJ) reachability analysis, which approximately recover the maximal safe sets. We exploit certain mathematical properties of the HJ value function, ensuring that the predicted safe set never intersects with the observed failure set. Moreover, we leverage a hypernetwork-based architecture that is particularly suitable for the design of observation-conditioned safety filters. The proposed method is examined both in simulation and hardware experiments for a ground robot and a quadcopter. The results show improved success rates and generalization to out-of-domain environments compared to the baselines.
[LG-87] Bayesian Ego-graph inference for Networked Multi-Agent Reinforcement Learning NEURIPS2025
链接: https://arxiv.org/abs/2509.16606
作者: Wei Duan,Jie Lu,Junyu Xuan
类目: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025
Abstract:In networked multi-agent reinforcement learning (Networked-MARL), decentralized agents must act under local observability and constrained communication over fixed physical graphs. Existing methods often assume static neighborhoods, limiting adaptability to dynamic or heterogeneous environments. While centralized frameworks can learn dynamic graphs, their reliance on global state access and centralized infrastructure is impractical in real-world decentralized systems. We propose a stochastic graph-based policy for Networked-MARL, where each agent conditions its decision on a sampled subgraph over its local physical neighborhood. Building on this formulation, we introduce BayesG, a decentralized actor-framework that learns sparse, context-aware interaction structures via Bayesian variational inference. Each agent operates over an ego-graph and samples a latent communication mask to guide message passing and policy computation. The variational distribution is trained end-to-end alongside the policy using an evidence lower bound (ELBO) objective, enabling agents to jointly learn both interaction topology and decision-making strategies. BayesG outperforms strong MARL baselines on large-scale traffic control tasks with up to 167 agents, demonstrating superior scalability, efficiency, and performance.
[LG-88] Near-Optimal Sample Complexity Bounds for Constrained Averag e-Reward MDPs
链接: https://arxiv.org/abs/2509.16586
作者: Yukuan Wei,Xudong Li,Lin F. Yang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Recent advances have significantly improved our understanding of the sample complexity of learning in average-reward Markov decision processes (AMDPs) under the generative model. However, much less is known about the constrained average-reward MDP (CAMDP), where policies must satisfy long-run average constraints. In this work, we address this gap by studying the sample complexity of learning an \epsilon -optimal policy in CAMDPs under a generative model. We propose a model-based algorithm that operates under two settings: (i) relaxed feasibility, which allows small constraint violations, and (ii) strict feasibility, where the output policy satisfies the constraint. We show that our algorithm achieves sample complexities of \tildeO\left(\fracS A (B+H) \epsilon^2\right) and \tildeO \left(\fracS A (B+H)\epsilon^2 \zeta^2 \right) under the relaxed and strict feasibility settings, respectively. Here, \zeta is the Slater constant indicating the size of the feasible region, H is the span bound of the bias function, and B is the transient time bound. Moreover, a matching lower bound of \tilde\Omega\left(\fracS A (B+H) \epsilon^2\zeta^2\right) for the strict feasibility case is established, thus providing the first minimax-optimal bounds for CAMDPs. Our results close the theoretical gap in understanding the complexity of constrained average-reward MDPs.
[LG-89] Learned Digital Codes for Over-the-Air Federated Learning
链接: https://arxiv.org/abs/2509.16577
作者: Antonio Tarizzo,Mohammad Kazemi,Deniz Gündüz
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:Federated edge learning (FEEL) enables distributed model training across wireless devices without centralising raw data, but deployment is constrained by the wireless uplink. A promising direction is over-the-air (OTA) aggregation, which merges communication with computation. Existing digital OTA methods can achieve either strong convergence or robustness to noise, but struggle to achieve both simultaneously, limiting performance in low signal-to-noise ratios (SNRs) where many IoT devices operate. This work proposes a learnt digital OTA framework that extends reliable operation into low-SNR conditions while maintaining the same uplink overhead as state-of-the-art. The proposed method combines an unrolled decoder with a jointly learnt unsourced random access codebook. Results show an extension of reliable operation by more than 7 dB, with improved global model convergence across all SNR levels, highlighting the potential of learning-based design for FEEL.
[LG-90] Barwise Section Boundary Detection in Symbolic Music Using Convolutional Neural Networks
链接: https://arxiv.org/abs/2509.16566
作者: Omar Eldeeb,Martin Malandro
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:Current methods for Music Structure Analysis (MSA) focus primarily on audio data. While symbolic music can be synthesized into audio and analyzed using existing MSA techniques, such an approach does not exploit symbolic music’s rich explicit representation of pitch, timing, and instrumentation. A key subproblem of MSA is section boundary detection-determining whether a given point in time marks the transition between musical sections. In this paper, we study automatic section boundary detection for symbolic music. First, we introduce a human-annotated MIDI dataset for section boundary detection, consisting of metadata from 6134 MIDI files that we manually curated from the Lakh MIDI dataset. Second, we train a deep learning model to classify the presence of section boundaries within a fixed-length musical window. Our data representation involves a novel encoding scheme based on synthesized overtones to encode arbitrary MIDI instrumentations into 3-channel piano rolls. Our model achieves an F1 score of 0.77, improving over the analogous audio-based supervised learning approach and the unsupervised block-matching segmentation (CBM) audio approach by 0.22 and 0.31, respectively. We release our dataset, code, and models.
[LG-91] Etude: Piano Cover Generation with a Three-Stage Approach – Extract strucTUralize and DEcode
链接: https://arxiv.org/abs/2509.16522
作者: Tse-Yang Che,Yuh-Jzer Joung
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:
Abstract:Piano cover generation aims to automatically transform a pop song into a piano arrangement. While numerous deep learning approaches have been proposed, existing models often fail to maintain structural consistency with the original song, likely due to the absence of beat-aware mechanisms or the difficulty of modeling complex rhythmic patterns. Rhythmic information is crucial, as it defines structural similarity (e.g., tempo, BPM) and directly impacts the overall quality of the generated music. In this paper, we introduce Etude, a three-stage architecture consisting of Extract, strucTUralize, and DEcode stages. By pre-extracting rhythmic information and applying a novel, simplified REMI-based tokenization, our model produces covers that preserve proper song structure, enhance fluency and musical dynamics, and support highly controllable generation through style injection. Subjective evaluations with human listeners show that Etude substantially outperforms prior models, achieving a quality level comparable to that of human composers. Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS) Cite as: arXiv:2509.16522 [cs.SD] (or arXiv:2509.16522v1 [cs.SD] for this version) https://doi.org/10.48550/arXiv.2509.16522 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-92] mmExpert: Integrating Large Language Models for Comprehensive mmWave Data Synthesis and Understanding
链接: https://arxiv.org/abs/2509.16521
作者: Yifan Yan,Shuai Yang,Xiuzhen Guo,Xiangguang Wang,Wei Chow,Yuanchao Shu,Shibo He
类目: Machine Learning (cs.LG)
*备注: Accepted to ACM MobiHoc '25
Abstract:Millimeter-wave (mmWave) sensing technology holds significant value in human-centric applications, yet the high costs associated with data acquisition and annotation limit its widespread adoption in our daily lives. Concurrently, the rapid evolution of large language models (LLMs) has opened up opportunities for addressing complex human needs. This paper presents mmExpert, an innovative mmWave understanding framework consisting of a data generation flywheel that leverages LLMs to automate the generation of synthetic mmWave radar datasets for specific application scenarios, thereby training models capable of zero-shot generalization in real-world environments. Extensive experiments demonstrate that the data synthesized by mmExpert significantly enhances the performance of downstream models and facilitates the successful deployment of large models for mmWave understanding.
[LG-93] LLM -Guided Co-Training for Text Classification
链接: https://arxiv.org/abs/2509.16516
作者: Md Mezbaur Rahman,Cornelia Caragea
类目: Machine Learning (cs.LG)
*备注:
Abstract:In this paper, we introduce a novel weighted co-training approach that is guided by Large Language Models (LLMs). Namely, in our co-training approach, we use LLM labels on unlabeled data as target labels and co-train two encoder-only based networks that train each other over multiple iterations: first, all samples are forwarded through each network and historical estimates of each network’s confidence in the LLM label are recorded; second, a dynamic importance weight is derived for each sample according to each network’s belief in the quality of the LLM label for that sample; finally, the two networks exchange importance weights with each other – each network back-propagates all samples weighted with the importance weights coming from its peer network and updates its own parameters. By strategically utilizing LLM-generated guidance, our approach significantly outperforms conventional SSL methods, particularly in settings with abundant unlabeled data. Empirical results show that it achieves state-of-the-art performance on 4 out of 5 benchmark datasets and ranks first among 14 compared methods according to the Friedman test. Our results highlight a new direction in semi-supervised learning – where LLMs serve as knowledge amplifiers, enabling backbone co-training models to achieve state-of-the-art performance efficiently.
[LG-94] Federated Learning with Ad-hoc Adapter Insertions: The Case of Soft-Embeddings for Training Classifier-as-Retriever
链接: https://arxiv.org/abs/2509.16508
作者: Marijan Fofonjka,Shahryar Zehtabi,Alireza Behtash,Tyler Mauer,David Stout
类目: Machine Learning (cs.LG)
*备注: 22 pages, 7 figures, 3 tables
Abstract:When existing retrieval-augmented generation (RAG) solutions are intended to be used for new knowledge domains, it is necessary to update their encoders, which are taken to be pretrained large language models (LLMs). However, fully finetuning these large models is compute- and memory-intensive, and even infeasible when deployed on resource-constrained edge devices. We propose a novel encoder architecture in this work that addresses this limitation by using a frozen small language model (SLM), which satisfies the memory constraints of edge devices, and inserting a small adapter network before the transformer blocks of the SLM. The trainable adapter takes the token embeddings of the new corpus and learns to produce enhanced soft embeddings for it, while requiring significantly less compute power to update than full fine-tuning. We further propose a novel retrieval mechanism by attaching a classifier head to the SLM encoder, which is trained to learn a similarity mapping of the input embeddings to their corresponding documents. Finally, to enable the online fine-tuning of both (i) the encoder soft embeddings and (ii) the classifier-as-retriever on edge devices, we adopt federated learning (FL) and differential privacy (DP) to achieve an efficient, privacy-preserving, and product-grade training solution. We conduct a theoretical analysis of our methodology, establishing convergence guarantees under mild assumptions on gradient variance when deployed for general smooth nonconvex loss functions. Through extensive numerical experiments, we demonstrate (i) the efficacy of obtaining soft embeddings to enhance the encoder, (ii) training a classifier to improve the retriever, and (iii) the role of FL in achieving speedup.
[LG-95] orb-QFL: Orbital Quantum Federated Learning
链接: https://arxiv.org/abs/2509.16505
作者: Dev Gurung,Shiva Raj Pokhrel
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Recent breakthroughs in quantum computing present transformative opportunities for advancing Federated Learning (FL), particularly in non-terrestrial environments characterized by stringent communication and coordination constraints. In this study, we propose orbital QFL, termed orb-QFL, a novel quantum-assisted Federated Learning framework tailored for Low Earth Orbit (LEO) satellite constellations. Distinct from conventional FL paradigms, termed orb-QFL operates without centralized servers or global aggregation mechanisms (e.g., FedAvg), instead leveraging quantum entanglement and local quantum processing to facilitate decentralized, inter-satellite collaboration. This design inherently addresses the challenges of orbital dynamics, such as intermittent connectivity, high propagation delays, and coverage variability. The framework enables continuous model refinement through direct quantum-based synchronization between neighboring satellites, thereby enhancing resilience and preserving data locality. To validate our approach, we integrate the Qiskit quantum machine learning toolkit with Poliastro-based orbital simulations and conduct experiments using Statlog dataset.
[LG-96] GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models
链接: https://arxiv.org/abs/2509.16502
作者: Jialin Chen,Houyu Zhang,Seongjun Yun,Alejandro Mottini,Rex Ying,Xiang Song,Vassilis N. Ioannidis,Zheng Li,Qingjun Cui
类目: Machine Learning (cs.LG)
*备注:
Abstract:Retrieval-Augmented Generation (RAG) has significantly mitigated the hallucinations of Large Language Models (LLMs) by grounding the generation with external knowledge. Recent extensions of RAG to graph-based retrieval offer a promising direction, leveraging the structural knowledge for multi-hop reasoning. However, existing graph RAG typically decouples retrieval and reasoning processes, which prevents the retriever from adapting to the reasoning needs of the LLM. They also struggle with scalability when performing multi-hop expansion over large-scale graphs, or depend heavily on annotated ground-truth entities, which are often unavailable in open-domain settings. To address these challenges, we propose a novel graph retriever trained end-to-end with LLM, which features an attention-based growing and pruning mechanism, adaptively navigating multi-hop relevant entities while filtering out noise. Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together, thereby enhancing its reasoning capability and facilitating interactive joint training of the graph retriever and the LLM reasoner. Experimental results across three QA benchmarks show that our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks. Notably, our framework eliminates the need for predefined ground-truth entities by directly optimizing the retriever using LLM logits as implicit feedback, making it especially effective in open-domain settings.
[LG-97] A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective NEURIPS2025
链接: https://arxiv.org/abs/2509.16499
作者: Lianghe Shi,Meng Wu,Huijie Zhang,Zekai Zhang,Molei Tao,Qing Qu
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025 Spotlight paper
Abstract:The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse – a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
[LG-98] FairTune: A Bias-Aware Fine-Tuning Framework Towards Fair Heart Rate Prediction from PPG
链接: https://arxiv.org/abs/2509.16491
作者: Lovely Yeswanth Panchumarthi,Saurabh Kataria,Yi Wu,Xiao Hu,Alex Fedorov,Hyunjung Gloria Kwak
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)
*备注:
Abstract:Foundation models pretrained on physiological data such as photoplethysmography (PPG) signals are increasingly used to improve heart rate (HR) prediction across diverse settings. Fine-tuning these models for local deployment is often seen as a practical and scalable strategy. However, its impact on demographic fairness particularly under domain shifts remains underexplored. We fine-tune PPG-GPT a transformer-based foundation model pretrained on intensive care unit (ICU) data across three heterogeneous datasets (ICU, wearable, smartphone) and systematically evaluate the effects on HR prediction accuracy and gender fairness. While fine-tuning substantially reduces mean absolute error (up to 80%), it can simultaneously widen fairness gaps, especially in larger models and under significant distributional characteristics shifts. To address this, we introduce FairTune, a bias-aware fine-tuning framework in which we benchmark three mitigation strategies: class weighting based on inverse group frequency (IF), Group Distributionally Robust Optimization (GroupDRO), and adversarial debiasing (ADV). We find that IF and GroupDRO significantly reduce fairness gaps without compromising accuracy, with effectiveness varying by deployment domain. Representation analyses further reveal that mitigation techniques reshape internal embeddings to reduce demographic clustering. Our findings highlight that fairness does not emerge as a natural byproduct of fine-tuning and that explicit mitigation is essential for equitable deployment of physiological foundation models.
[LG-99] Revisiting Broken Windows Theory
链接: https://arxiv.org/abs/2509.16490
作者: Ziyao Cui,Erick Jiang,Nicholas Sortisio,Haiyan Wang,Eric Chen,Cynthia Rudin
类目: Machine Learning (cs.LG)
*备注:
Abstract:We revisit the longstanding question of how physical structures in urban landscapes influence crime. Leveraging machine learning-based matching techniques to control for demographic composition, we estimate the effects of several types of urban structures on the incidence of violent crime in New York City and Chicago. We additionally contribute to a growing body of literature documenting the relationship between perception of crime and actual crime rates by separately analyzing how the physical urban landscape shapes subjective feelings of safety. Our results are twofold. First, in consensus with prior work, we demonstrate a “broken windows” effect in which abandoned buildings, a sign of social disorder, are associated with both greater incidence of crime and a heightened perception of danger. This is also true of types of urban structures that draw foot traffic such as public transportation infrastructure. Second, these effects are not uniform within or across cities. The criminogenic effects of the same structure types across two cities differ in magnitude, degree of spatial localization, and heterogeneity across subgroups, while within the same city, the effects of different structure types are confounded by different demographic variables. Taken together, these results emphasize that one-size-fits-all approaches to crime reduction are untenable and policy interventions must be specifically tailored to their targets.
[LG-100] Local Mechanisms of Compositional Generalization in Conditional Diffusion
链接: https://arxiv.org/abs/2509.16447
作者: Arwen Bradley
类目: Machine Learning (cs.LG)
*备注: 10 pages, 7 figures
Abstract:Conditional diffusion models appear capable of compositional generalization, i.e., generating convincing samples for out-of-distribution combinations of conditioners, but the mechanisms underlying this ability remain unclear. To make this concrete, we study length generalization, the ability to generate images with more objects than seen during training. In a controlled CLEVR setting (Johnson et al., 2017), we find that length generalization is achievable in some cases but not others, suggesting that models only sometimes learn the underlying compositional structure. We then investigate locality as a structural mechanism for compositional generalization. Prior works proposed score locality as a mechanism for creativity in unconditional diffusion models (Kamb Ganguli, 2024; Niedoba et al., 2024), but did not address flexible conditioning or compositional generalization. In this paper, we prove an exact equivalence between a specific compositional structure (“conditional projective composition”) (Bradley et al., 2025) and scores with sparse dependencies on both pixels and conditioners (“local conditional scores”). This theory also extends to feature-space compositionality. We validate our theory empirically: CLEVR models that succeed at length generalization exhibit local conditional scores, while those that fail do not. Furthermore, we show that a causal intervention explicitly enforcing local conditional scores restores length generalization in a previously failing model. Finally, we investigate feature-space compositionality in color-conditioned CLEVR, and find preliminary evidence of compositional structure in SDXL.
[LG-101] End-to-end RL Improves Dexterous Grasping Policies
链接: https://arxiv.org/abs/2509.16434
作者: Ritvik Singh,Karl Van Wyk,Pieter Abbeel,Jitendra Malik,Nathan Ratliff,Ankur Handa
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: See our blog post: this https URL
Abstract:This work explores techniques to scale up image-based end-to-end learning for dexterous grasping with an arm + hand system. Unlike state-based RL, vision-based RL is much more memory inefficient, resulting in relatively low batch sizes, which is not amenable for algorithms like PPO. Nevertheless, it is still an attractive method as unlike the more commonly used techniques which distill state-based policies into vision networks, end-to-end RL can allow for emergent active vision behaviors. We identify a key bottleneck in training these policies is the way most existing simulators scale to multiple GPUs using traditional data parallelism techniques. We propose a new method where we disaggregate the simulator and RL (both training and experience buffers) onto separate GPUs. On a node with four GPUs, we have the simulator running on three of them, and PPO running on the fourth. We are able to show that with the same number of GPUs, we can double the number of existing environments compared to the previous baseline of standard data parallelism. This allows us to train vision-based environments, end-to-end with depth, which were previously performing far worse with the baseline. We train and distill both depth and state-based policies into stereo RGB networks and show that depth distillation leads to better results, both in simulation and reality. This improvement is likely due to the observability gap between state and vision policies which does not exist when distilling depth policies into stereo RGB. We further show that the increased batch size brought about by disaggregated simulation also improves real world performance. When deploying in the real world, we improve upon the previous state-of-the-art vision-based results using our end-to-end policies.
[LG-102] Dynamic Objects Relocalization in Changing Environments with Flow Matching
链接: https://arxiv.org/abs/2509.16398
作者: Francesco Argenziano,Miguel Saavedra-Ruiz,Sacha Morin,Daniele Nardi,Liam Paull
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:
Abstract:Task and motion planning are long-standing challenges in robotics, especially when robots have to deal with dynamic environments exhibiting long-term dynamics, such as households or warehouses. In these environments, long-term dynamics mostly stem from human activities, since previously detected objects can be moved or removed from the scene. This adds the necessity to find such objects again before completing the designed task, increasing the risk of failure due to missed relocalizations. However, in these settings, the nature of such human-object interactions is often overlooked, despite being governed by common habits and repetitive patterns. Our conjecture is that these cues can be exploited to recover the most likely objects’ positions in the scene, helping to address the problem of unknown relocalization in changing environments. To this end we propose FlowMaps, a model based on Flow Matching that is able to infer multimodal object locations over space and time. Our results present statistical evidence to support our hypotheses, opening the way to more complex applications of our approach. The code is publically available at this https URL
[LG-103] Federated Learning for Financial Forecasting
链接: https://arxiv.org/abs/2509.16393
作者: Manuel Noseda,Alberto De Luca,Lukas Von Briel,Nathan Lacour
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注:
Abstract:This paper studies Federated Learning (FL) for binary classification of volatile financial market trends. Using a shared Long Short-Term Memory (LSTM) classifier, we compare three scenarios: (i) a centralized model trained on the union of all data, (ii) a single-agent model trained on an individual data subset, and (iii) a privacy-preserving FL collaboration in which agents exchange only model updates, never raw data. We then extend the study with additional market features, deliberately introducing not independent and identically distributed data (non-IID) across agents, personalized FL and employing differential privacy. Our numerical experiments show that FL achieves accuracy and generalization on par with the centralized baseline, while significantly outperforming the single-agent model. The results show that collaborative, privacy-preserving learning provides collective tangible value in finance, even under realistic data heterogeneity and personalization requirements.
[LG-104] EMPEROR: Efficient Moment-Preserving Representation of Distributions
链接: https://arxiv.org/abs/2509.16379
作者: Xinran Liu,Shansita D. Sharma,Soheil Kolouri
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We introduce EMPEROR (Efficient Moment-Preserving Representation of Distributions), a mathematically rigorous and computationally efficient framework for representing high-dimensional probability measures arising in neural network representations. Unlike heuristic global pooling operations, EMPEROR encodes a feature distribution through its statistical moments. Our approach leverages the theory of sliced moments: features are projected onto multiple directions, lightweight univariate Gaussian mixture models (GMMs) are fit to each projection, and the resulting slice parameters are aggregated into a compact descriptor. We establish determinacy guarantees via Carleman’s condition and the Cramér-Wold theorem, ensuring that the GMM is uniquely determined by its sliced moments, and we derive finite-sample error bounds that scale optimally with the number of slices and samples. Empirically, EMPEROR captures richer distributional information than common pooling schemes across various data modalities, while remaining computationally efficient and broadly applicable.
[LG-105] Guided Sequence-Structure Generative Modeling for Iterative Antibody Optimization ICLR2025
链接: https://arxiv.org/abs/2509.16357
作者: Aniruddh Raghu,Sebastian Ober,Maxwell Kazman,Hunter Elliott
类目: Machine Learning (cs.LG)
*备注: GEM Workshop, ICLR 2025
Abstract:Therapeutic antibody candidates often require extensive engineering to improve key functional and developability properties before clinical development. This can be achieved through iterative design, where starting molecules are optimized over several rounds of in vitro experiments. While protein structure can provide a strong inductive bias, it is rarely used in iterative design due to the lack of structural data for continually evolving lead molecules over the course of optimization. In this work, we propose a strategy for iterative antibody optimization that leverages both sequence and structure as well as accumulating lab measurements of binding and developability. Building on prior work, we first train a sequence-structure diffusion generative model that operates on antibody-antigen complexes. We then outline an approach to use this model, together with carefully predicted antibody-antigen complexes, to optimize lead candidates throughout the iterative design process. Further, we describe a guided sampling approach that biases generation toward desirable properties by integrating models trained on experimental data from iterative design. We evaluate our approach in multiple in silico and in vitro experiments, demonstrating that it produces high-affinity binders at multiple stages of an active antibody optimization campaign.
[LG-106] Improving Deep Tabular Learning
链接: https://arxiv.org/abs/2509.16354
作者: Sivan Sarafian,Yehudit Aperstein
类目: Machine Learning (cs.LG)
*备注: 18 pages, 4 figures
Abstract:Tabular data remain a dominant form of real-world information but pose persistent challenges for deep learning due to heterogeneous feature types, lack of natural structure, and limited label-preserving augmentations. As a result, ensemble models based on decision trees continue to dominate benchmark leaderboards. In this work, we introduce RuleNet, a transformer-based architecture specifically designed for deep tabular learning. RuleNet incorporates learnable rule embeddings in a decoder, a piecewise linear quantile projection for numerical features, and feature masking ensembles for robustness and uncertainty estimation. Evaluated on eight benchmark datasets, RuleNet matches or surpasses state-of-the-art tree-based methods in most cases, while remaining computationally efficient, offering a practical neural alternative for tabular prediction tasks.
[LG-107] Auto-bidding under Return-on-Spend Constraints with Uncertainty Quantification
链接: https://arxiv.org/abs/2509.16324
作者: Jiale Han,Chun Gan,Chengcheng Zhang,Jie He,Zhangang Lin,Ching Law,Xiaowu Dai
类目: Machine Learning (cs.LG); Computer Science and Game Theory (cs.GT)
*备注:
Abstract:Auto-bidding systems are widely used in advertising to automatically determine bid values under constraints such as total budget and Return-on-Spend (RoS) targets. Existing works often assume that the value of an ad impression, such as the conversion rate, is known. This paper considers the more realistic scenario where the true value is unknown. We propose a novel method that uses conformal prediction to quantify the uncertainty of these values based on machine learning methods trained on historical bidding data with contextual features, without assuming the data are i.i.d. This approach is compatible with current industry systems that use machine learning to predict values. Building on prediction intervals, we introduce an adjusted value estimator derived from machine learning predictions, and show that it provides performance guarantees without requiring knowledge of the true value. We apply this method to enhance existing auto-bidding algorithms with budget and RoS constraints, and establish theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on both simulated and real-world industrial datasets demonstrate that our approach improves performance while maintaining computational efficiency.
[LG-108] ROOT: Rethinking Offline Optimization as Distributional Translation via Probabilistic Bridge
链接: https://arxiv.org/abs/2509.16300
作者: Manh Cuong Dao, TheHung Tran,Phi Le Nguyen,Thao Nguyen Truong,Trong Nghia Hoang
类目: Machine Learning (cs.LG)
*备注: The first two authors contributed equally
Abstract:This paper studies the black-box optimization task which aims to find the maxima of a black-box function using a static set of its observed input-output pairs. This is often achieved via learning and optimizing a surrogate function with that offline data. Alternatively, it can also be framed as an inverse modeling task that maps a desired performance to potential input candidates that achieve it. Both approaches are constrained by the limited amount of offline data. To mitigate this limitation, we introduce a new perspective that casts offline optimization as a distributional translation task. This is formulated as learning a probabilistic bridge transforming an implicit distribution of low-value inputs (i.e., offline data) into another distribution of high-value inputs (i.e., solution candidates). Such probabilistic bridge can be learned using low- and high-value inputs sampled from synthetic functions that resemble the target function. These synthetic functions are constructed as the mean posterior of multiple Gaussian processes fitted with different parameterizations on the offline data, alleviating the data bottleneck. The proposed approach is evaluated on an extensive benchmark comprising most recent methods, demonstrating significant improvement and establishing a new state-of-the-art performance.
[LG-109] st-Time Learning and Inference-Time Deliberation for Efficiency-First Offline Reinforcement Learning in Care Coordination and Population Health Management
链接: https://arxiv.org/abs/2509.16291
作者: Sanjay Basu,Sadiq Y. Patel,Parth Sheth,Bhairavi Muralidharan,Namrata Elamaran,Aakriti Kinra,Rajaie Batniji
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:
Abstract:Care coordination and population health management programs serve large Medicaid and safety-net populations and must be auditable, efficient, and adaptable. While clinical risk for outreach modalities is typically low, time and opportunity costs differ substantially across text, phone, video, and in-person visits. We propose a lightweight offline reinforcement learning (RL) approach that augments trained policies with (i) test-time learning via local neighborhood calibration, and (ii) inference-time deliberation via a small Q-ensemble that incorporates predictive uncertainty and time/effort cost. The method exposes transparent dials for neighborhood size and uncertainty/cost penalties and preserves an auditable training pipeline. Evaluated on a de-identified operational dataset, TTL+ITD achieves stable value estimates with predictable efficiency trade-offs and subgroup auditing.
[LG-110] Architectural change in neural networks using fuzzy vertex pooling
链接: https://arxiv.org/abs/2509.16287
作者: Shanookha Ali,Nitha Niralda,Sunil Mathew
类目: Machine Learning (cs.LG)
*备注:
Abstract:The process of pooling vertices involves the creation of a new vertex, which becomes adjacent to all the vertices that were originally adjacent to the endpoints of the vertices being pooled. After this, the endpoints of these vertices and all edges connected to them are removed. In this document, we introduce a formal framework for the concept of fuzzy vertex pooling (FVP) and provide an overview of its key properties with its applications to neural networks. The pooling model demonstrates remarkable efficiency in minimizing loss rapidly while maintaining competitive accuracy, even with fewer hidden layer neurons. However, this advantage diminishes over extended training periods or with larger datasets, where the model’s performance tends to degrade. This study highlights the limitations of pooling in later stages of deep learning training, rendering it less effective for prolonged or large-scale applications. Consequently, pooling is recommended as a strategy for early-stage training in advanced deep learning models to leverage its initial efficiency.
[LG-111] GraphMend: Code Transformations for Fixing Graph Breaks in PyTorch 2
链接: https://arxiv.org/abs/2509.16248
作者: Savini Kashmira,Jayanaka Dantanarayana,Thamirawaran Sathiyalogeswaran,Yichao Yuan,Nishil Talati,Krisztian Flautner,Lingjia Tang,Jason Mars
类目: Programming Languages (cs.PL); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:
Abstract:This paper presents GraphMend, a high-level compiler that eliminates FX graph breaks in PyTorch 2 programs. Although PyTorch 2 introduced TorchDynamo and TorchInductor to enable just-in-time graph compilation, unresolved dynamic control flow and unsupported Python constructs often fragment models into multiple FX graphs. These fragments force frequent fallbacks to eager mode, incur costly CPU-to-GPU synchronizations, and reduce optimization opportunities. GraphMend addresses this limitation by analyzing and transforming source code before execution. Built on the Jac compilation framework, GraphMend introduces two code transformations that remove graph breaks due to dynamic control flow and Python I/O functions. This design allows PyTorch’s compilation pipeline to capture larger, uninterrupted FX graphs without requiring manual refactoring by developers. Evaluation across eight Hugging Face models shows that GraphMend removes all fixable graph breaks due to dynamic control flow and Python I/O functions, driving the break count to 0 in 6 models and reducing it from 5 to 2 in another model. On NVIDIA RTX 3090 and A40 GPUs, GraphMend achieves up to 75% latency reductions and up to 8% higher end-to-end throughput. These results demonstrate that high-level code transformation is an effective complement to PyTorch’s dynamic JIT compilation pipeline, substantially improving both usability and performance.
[LG-112] Comparison of Deterministic and Probabilistic Machine Learning Algorithms for Precise Dimensional Control and Uncertainty Quantification in Additive Manufacturing
链接: https://arxiv.org/abs/2509.16233
作者: Dipayan Sanpui,Anirban Chandra,Henry Chan,Sukriti Manna,Subramanian KRS Sankaranarayanan
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:We present a probabilistic framework to accurately estimate dimensions of additively manufactured components. Using a dataset of 405 parts from nine production runs involving two machines, three polymer materials, and two-part configurations, we examine five key design features. To capture both design information and manufacturing variability, we employ models integrating continuous and categorical factors. For predicting Difference from Target (DFT) values, we test deterministic and probabilistic machine learning methods. Deterministic models, trained on 80% of the dataset, provide precise point estimates, with Support Vector Regression (SVR) achieving accuracy close to process repeatability. To address systematic deviations, we adopt Gaussian Process Regression (GPR) and Bayesian Neural Networks (BNNs). GPR delivers strong predictive performance and interpretability, while BNNs capture both aleatoric and epistemic uncertainties. We investigate two BNN approaches: one balancing accuracy and uncertainty capture, and another offering richer uncertainty decomposition but with lower dimensional accuracy. Our results underscore the importance of quantifying epistemic uncertainty for robust decision-making, risk assessment, and model improvement. We discuss trade-offs between GPR and BNNs in terms of predictive power, interpretability, and computational efficiency, noting that model choice depends on analytical needs. By combining deterministic precision with probabilistic uncertainty quantification, our study provides a rigorous foundation for uncertainty-aware predictive modeling in AM. This approach not only enhances dimensional accuracy but also supports reliable, risk-informed design strategies, thereby advancing data-driven manufacturing methodologies.
[LG-113] On the Detection of Internal Defects in Structured Media
链接: https://arxiv.org/abs/2509.16216
作者: Bryl Nico M. Ong,Aarush Borker,Neil Jerome A. Egarguin,Daniel Onofrei
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:A critical issue that affects engineers trying to assess the structural integrity of various infrastructures, such as metal rods or acoustic ducts, is the challenge of detecting internal fractures (defects). Traditionally, engineers depend on audible and visual aids to identify these fractures, as they do not physically dissect the object in question into multiple pieces to check for inconsistencies. This research introduces ideas towards the development of a robust strategy to image such defects using only a small set of minimal, non-invasive measurements. Assuming a one dimensional model (e.g. longitudinal waves in long and thin rods/acoustic ducts or transverse vibrations of strings), we make use of the continuous one-dimensional wave equation to model these physical phenomena and then employ specialized mathematical analysis tools (the Laplace transform and optimization) to introduce our defect imaging ideas. In particular, we will focus on the case of a long bar which is homogeneous throughout except in a small area where a defect in its Young’s modulus is present. We will first demonstrate how the problem is equivalent to a spring-mass vibrational system, and then show how our imaging strategy makes use of the Laplace domain analytic map between the characteristics of the respective defect and the measurement data. More explicitly, we will utilize MATLAB (a platform for numerical computations) to collect synthetic data (computational alternative to real world measurements) for several scenarios with one defect of arbitrary location and stiffness. Subsequently, we will use this data along with our analytically developed map (between defect characteristics and measurements) to construct a residual function which, once optimized, will reveal the location and magnitude of the stiffness defect. Subjects: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG) Cite as: arXiv:2509.16216 [cs.CE] (or arXiv:2509.16216v1 [cs.CE] for this version) https://doi.org/10.48550/arXiv.2509.16216 Focus to learn more arXiv-issued DOI via DataCite
[LG-114] Deep Reinforcement Learning in Factor Investment
链接: https://arxiv.org/abs/2509.16206
作者: Junlin Liu
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Deep reinforcement learning has shown promise in trade execution, yet its use in low-frequency factor portfolio construction remains under-explored. A key obstacle is the high-dimensional, unbalanced state space created by stocks that enter and exit the investable universe. We introduce Conditional Auto-encoded Factor-based Portfolio Optimisation (CAFPO), which compresses stock-level returns into a small set of latent factors conditioned on 94 firm-specific characteristics. The factors feed a DRL agent implemented with both PPO and DDPG to generate continuous long-short weights. On 20 years of U.S. equity data (2000–2020), CAFPO outperforms equal-weight, value-weight, Markowitz, vanilla DRL, and Fama–French-driven DRL, delivering a 24.6% compound return and a Sharpe ratio of 0.94 out of sample. SHAP analysis further reveals economically intuitive factor attributions. Our results demonstrate that factor-aware representation learning can make DRL practical for institutional, low-turnover portfolio management.
[LG-115] Functional effects models: Accounting for preference heterogeneity in panel data with machine learning
链接: https://arxiv.org/abs/2509.18047
作者: Nicolas Salvadé,Tim Hillel
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Econometrics (econ.EM); Methodology (stat.ME)
*备注:
Abstract:In this paper, we present a general specification for Functional Effects Models, which use Machine Learning (ML) methodologies to learn individual-specific preference parameters from socio-demographic characteristics, therefore accounting for inter-individual heterogeneity in panel choice data. We identify three specific advantages of the Functional Effects Model over traditional fixed, and random/mixed effects models: (i) by mapping individual-specific effects as a function of socio-demographic variables, we can account for these effects when forecasting choices of previously unobserved individuals (ii) the (approximate) maximum-likelihood estimation of functional effects avoids the incidental parameters problem of the fixed effects model, even when the number of observed choices per individual is small; and (iii) we do not rely on the strong distributional assumptions of the random effects model, which may not match reality. We learn functional intercept and functional slopes with powerful non-linear machine learning regressors for tabular data, namely gradient boosting decision trees and deep neural networks. We validate our proposed methodology on a synthetic experiment and three real-world panel case studies, demonstrating that the Functional Effects Model: (i) can identify the true values of individual-specific effects when the data generation process is known; (ii) outperforms both state-of-the-art ML choice modelling techniques that omit individual heterogeneity in terms of predictive performance, as well as traditional static panel choice models in terms of learning inter-individual heterogeneity. The results indicate that the FI-RUMBoost model, which combines the individual-specific constants of the Functional Effects Model with the complex, non-linear utilities of RUMBoost, performs marginally best on large-scale revealed preference panel data.
[LG-116] Kernel K-means clustering of distributional data
链接: https://arxiv.org/abs/2509.18037
作者: Amparo Baíllo,Jose R. Berrendero,Martín Sánchez-Signorini
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注:
Abstract:We consider the problem of clustering a sample of probability distributions from a random distribution on \mathbb R^p . Our proposed partitioning method makes use of a symmetric, positive-definite kernel k and its associated reproducing kernel Hilbert space (RKHS) \mathcal H . By mapping each distribution to its corresponding kernel mean embedding in \mathcal H , we obtain a sample in this RKHS where we carry out the K -means clustering procedure, which provides an unsupervised classification of the original sample. The procedure is simple and computationally feasible even for dimension p1 . The simulation studies provide insight into the choice of the kernel and its tuning parameter. The performance of the proposed clustering procedure is illustrated on a collection of Synthetic Aperture Radar (SAR) images.
[LG-117] Core-elements Subsampling for Alternating Least Squares
链接: https://arxiv.org/abs/2509.18024
作者: Dunyao Xue,Mengyu Li,Cheng Meng,Jingyi Zhang
类目: Methodology (stat.ME); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
*备注:
Abstract:In this paper, we propose a novel element-wise subset selection method for the alternating least squares (ALS) algorithm, focusing on low-rank matrix factorization involving matrices with missing values, as commonly encountered in recommender systems. While ALS is widely used for providing personalized recommendations based on user-item interaction data, its high computational cost, stemming from repeated regression operations, poses significant challenges for large-scale datasets. To enhance the efficiency of ALS, we propose a core-elements subsampling method that selects a representative subset of data and leverages sparse matrix operations to approximate ALS estimations efficiently. We establish theoretical guarantees for the approximation and convergence of the proposed approach, showing that it achieves similar accuracy with significantly reduced computational time compared to full-data ALS. Extensive simulations and real-world applications demonstrate the effectiveness of our method in various scenarios, emphasizing its potential in large-scale recommendation systems.
[LG-118] Fréchet Geodesic Boosting
链接: https://arxiv.org/abs/2509.18013
作者: Yidong Zhou,Su I Iao,Hans-Georg Müller
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 23 pages, 4 figures, 10 tables
Abstract:Gradient boosting has become a cornerstone of machine learning, enabling base learners such as decision trees to achieve exceptional predictive performance. While existing algorithms primarily handle scalar or Euclidean outputs, increasingly prevalent complex-structured data, such as distributions, networks, and manifold-valued outputs, present challenges for traditional methods. Such non-Euclidean data lack algebraic structures such as addition, subtraction, or scalar multiplication required by standard gradient boosting frameworks. To address these challenges, we introduce Fréchet geodesic boosting (FGBoost), a novel approach tailored for outputs residing in geodesic metric spaces. FGBoost leverages geodesics as proxies for residuals and constructs ensembles in a way that respects the intrinsic geometry of the output space. Through theoretical analysis, extensive simulations, and real-world applications, we demonstrate the strong performance and adaptability of FGBoost, showcasing its potential for modeling complex data.
[LG-119] Robust Online and Adaptive Decentralized Gaussian Processes ICASSP2026
链接: https://arxiv.org/abs/2509.18011
作者: Fernando Llorente,Daniel Waxman,Sanket Jantre,Nathan M. Urban,Susan E. Minkoff
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Signal Processing (eess.SP)
*备注: Submitted to Icassp 2026 Special Session on “Bridging Signal Processing and Machine Learning with Gaussian Processes.”
Abstract:Gaussian processes (GPs) offer a flexible, uncertainty-aware framework for modeling complex signals, but scale cubically with data, assume static targets, and are brittle to outliers, limiting their applicability in large-scale problems with dynamic and noisy environments. Recent work introduced decentralized random Fourier feature Gaussian processes (DRFGP), an online and distributed algorithm that casts GPs in an information-filter form, enabling exact sequential inference and fully distributed computation without reliance on a fusion center. In this paper, we extend DRFGP along two key directions: first, by introducing a robust-filtering update that downweights the impact of atypical observations; and second, by incorporating a dynamic adaptation mechanism that adapts to time-varying functions. The resulting algorithm retains the recursive information-filter structure while enhancing stability and accuracy. We demonstrate its effectiveness on a large-scale Earth system application, underscoring its potential for in-situ modeling.
[LG-120] Random functions as data compressors for machine learning of molecular processes
链接: https://arxiv.org/abs/2509.17937
作者: Jayashrita Debnath,Gerhard Hummer
类目: oft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning (ML) is rapidly transforming the way molecular dynamics simulations are performed and analyzed, from materials modeling to studies of protein folding and function. ML algorithms are often employed to learn low-dimensional representations of conformational landscapes and to cluster trajectories into relevant metastable states. Most of these algorithms require selecting a small number of features that describe the problem of interest. Although deep neural networks can tackle large numbers of input features, the training costs increase with input size, which makes the selection of a subset of features mandatory for most problems of practical interest. Here, we show that random nonlinear projections can be used to compress large feature spaces and make computations faster without substantial loss of information. We describe an efficient way to produce random projections and then exemplify the general procedure for protein folding. For our test cases NTL9 and the double-norleucin variant of the villin headpiece, we find that random compression retains the core static and dynamic information of the original high dimensional feature space and makes trajectory analysis more robust.
[LG-121] Predicting Chest Radiograph Findings from Electrocardiograms Using Interpretable Machine Learning
链接: https://arxiv.org/abs/2509.17674
作者: Julia Matejas,Olaf Żurawski,Nils Strodthoff,Juan Miguel Lopez Alcaraz
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: 19 pages, 3 figures, source code under this https URL
Abstract:Purpose: Chest X-rays are essential for diagnosing pulmonary conditions, but limited access in resource-constrained settings can delay timely diagnosis. Electrocardiograms (ECGs), in contrast, are widely available, non-invasive, and often acquired earlier in clinical workflows. This study aims to assess whether ECG features and patient demographics can predict chest radiograph findings using an interpretable machine learning approach. Methods: Using the MIMIC-IV database, Extreme Gradient Boosting (XGBoost) classifiers were trained to predict diverse chest radiograph findings from ECG-derived features and demographic variables. Recursive feature elimination was performed independently for each target to identify the most predictive features. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC) with bootstrapped 95% confidence intervals. Shapley Additive Explanations (SHAP) were applied to interpret feature contributions. Results: Models successfully predicted multiple chest radiograph findings with varying accuracy. Feature selection tailored predictors to each target, and including demographic variables consistently improved performance. SHAP analysis revealed clinically meaningful contributions from ECG features to radiographic predictions. Conclusion: ECG-derived features combined with patient demographics can serve as a proxy for certain chest radiograph findings, enabling early triage or pre-screening in settings where radiographic imaging is limited. Interpretable machine learning demonstrates potential to support radiology workflows and improve patient care. Comments: 19 pages, 3 figures, source code under this https URL Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG) Cite as: arXiv:2509.17674 [eess.SP] (or arXiv:2509.17674v1 [eess.SP] for this version) https://doi.org/10.48550/arXiv.2509.17674 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Juan Miguel Lopez Alcaraz [view email] [v1] Mon, 22 Sep 2025 12:18:50 UTC (2,445 KB)
[LG-122] RAVEN: RAnking and Validation of ExoplaNets
链接: https://arxiv.org/abs/2509.17645
作者: Andreas Hadjigeorghiou,David J. Armstrong,Kaiming Cui,Marina Lafarga Magro,Luis Agustín Nieto,Rodrigo F. Díaz,Lauren Doyle,Vedad Kunovac
类目: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
*备注: Submitted to MNRAS. Comments from the community are welcome
Abstract:We present RAVEN, a newly developed vetting and validation pipeline for TESS exoplanet candidates. The pipeline employs a Bayesian framework to derive the posterior probability of a candidate being a planet against a set of False Positive (FP) scenarios, through the use of a Gradient Boosted Decision Tree and a Gaussian Process classifier, trained on comprehensive synthetic training sets of simulated planets and 8 astrophysical FP scenarios injected into TESS lightcurves. These training sets allow large scale candidate vetting and performance verification against individual FP scenarios. A Non-Simulated FP training set consisting of real TESS candidates caused primarily by stellar variability and systematic noise is also included. The machine learning derived probabilities are combined with scenario specific prior probabilities, including the candidates’ positional probabilities, to compute the final posterior probabilities. Candidates with a planetary posterior probability greater than 99% against each FP scenario and whose implied planetary radius is less than 8 R_\oplus are considered to be statistically validated by the pipeline. In this first version, the pipeline has been developed for candidates with a lightcurve released from the TESS Science Processing Operations Centre, an orbital period between 0.5 and 16 days and a transit depth greater than 300ppm. The pipeline obtained area-under-curve (AUC) scores 97% on all FP scenarios and 99% on all but one. Testing on an independent external sample of 1361 pre-classified TOIs, the pipeline achieved an overall accuracy of 91%, demonstrating its effectiveness for automated ranking of TESS candidates. For a probability threshold of 0.9 the pipeline reached a precision of 97% with a recall score of 66% on these TOIs. The RAVEN pipeline is publicly released as a cloud-hosted app, making it easily accessible to the community.
[LG-123] Whitening Spherical Gaussian Mixtures in the Large-Dimensional Regime
链接: https://arxiv.org/abs/2509.17636
作者: Mohammed Racim Moussa Boudjemaa,Alper Kalle,Xiaoyi Mai,José Henrique de Morais Goulart,Cédric Févotte
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Whitening is a classical technique in unsupervised learning that can facilitate estimation tasks by standardizing data. An important application is the estimation of latent variable models via the decomposition of tensors built from high-order moments. In particular, whitening orthogonalizes the means of a spherical Gaussian mixture model (GMM), thereby making the corresponding moment tensor orthogonally decomposable, hence easier to decompose. However, in the large-dimensional regime (LDR) where data are high-dimensional and scarce, the standard whitening matrix built from the sample covariance becomes ineffective because the latter is spectrally distorted. Consequently, whitened means of a spherical GMM are no longer orthogonal. Using random matrix theory, we derive exact limits for their dot products, which are generally nonzero in the LDR. As our main contribution, we then construct a corrected whitening matrix that restores asymptotic orthogonality, allowing for performance gains in spherical GMM estimation.
[LG-124] FastNet: Improving the physical consistency of machine-learning weather prediction models through loss function design
链接: https://arxiv.org/abs/2509.17601
作者: Tom Dunstan,Oliver Strickson,Thusal Bennett,Jack Bowyer,Matthew Burnand,James Chappell,Alejandro Coca-Castro,Kirstine Ida Dale,Eric G. Daub,Noushin Eftekhari,Manvendra Janmaijaya,Jon Lillis,David Salvador-Jasin,Nathan Simpson,Ryan Sze-Yin Chan,Mohamad Elmasri,Lydia Allegranza France,Sam Madge,Levan Bokeria,Hannah Brown,Tom Dodds,Anna-Louise Ellis,David Llewellyn-Jones,Theo McCaie,Sophia Moreton,Tom Potter,James Robinson,Adam A. Scaife,Iain Stenson,David Walters,Karina Bett-Williams,Louisa van Zeeland,Peter Yatsyshin,J. Scott Hosking
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:
Abstract:Machine learning weather prediction (MLWP) models have demonstrated remarkable potential in delivering accurate forecasts at significantly reduced computational cost compared to traditional numerical weather prediction (NWP) systems. However, challenges remain in ensuring the physical consistency of MLWP outputs, particularly in deterministic settings. This study presents FastNet, a graph neural network (GNN)-based global prediction model, and investigates the impact of alternative loss function designs on improving the physical realism of its forecasts. We explore three key modifications to the standard mean squared error (MSE) loss: (1) a modified spherical harmonic (MSH) loss that penalises spectral amplitude errors to reduce blurring and enhance small-scale structure retention; (2) inclusion of horizontal gradient terms in the loss to suppress non-physical artefacts; and (3) an alternative wind representation that decouples speed and direction to better capture extreme wind events. Results show that while the MSH and gradient-based losses \textitalone may slightly degrade RMSE scores, when trained in combination the model exhibits very similar MSE performance to an MSE-trained model while at the same time significantly improving spectral fidelity and physical consistency. The alternative wind representation further improves wind speed accuracy and reduces directional bias. Collectively, these findings highlight the importance of loss function design as a mechanism for embedding domain knowledge into MLWP models and advancing their operational readiness.
[LG-125] Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality
链接: https://arxiv.org/abs/2509.17543
作者: Dominic Broadbent,Nick Whiteley,Robert Allison,Tom Lovett
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: 43 pages, 20 figures
Abstract:Existing distribution compression methods reduce dataset size by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which quantifies the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that across a variety of scenarios BDC can achieve comparable or superior performance to ambient-space compression at substantially lower cost.
[LG-126] Robust Mixture Models for Algorithmic Fairness Under Latent Heterogeneity
链接: https://arxiv.org/abs/2509.17411
作者: Siqi Li,Molei Liu,Ziye Tian,Chuan Hong,Nan Liu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Standard machine learning models optimized for average performance often fail on minority subgroups and lack robustness to distribution shifts. This challenge worsens when subgroups are latent and affected by complex interactions among continuous and discrete features. We introduce ROME (RObust Mixture Ensemble), a framework that learns latent group structure from data while optimizing for worst-group performance. ROME employs two approaches: an Expectation-Maximization algorithm for linear models and a neural Mixture-of-Experts for nonlinear settings. Through simulations and experiments on real-world datasets, we demonstrate that ROME significantly improves algorithmic fairness compared to standard methods while maintaining competitive average performance. Importantly, our method requires no predefined group labels, making it practical when sources of disparities are unknown or evolving.
[LG-127] Bias-variance Tradeoff in Tensor Estimation
链接: https://arxiv.org/abs/2509.17382
作者: Shivam Kumar,Haotian Xu,Carlos Misael Madrid Padilla,Yuehaw Khoo,Oscar Hernan Madrid Padilla,Daren Wang
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Methodology (stat.ME)
*备注:
Abstract:We study denoising of a third-order tensor when the ground-truth tensor is not necessarily Tucker low-rank. Specifically, we observe Y=X^\ast+Z\in \mathbbR^p_1 \times p_2 \times p_3, where X^\ast is the ground-truth tensor, and Z is the noise tensor. We propose a simple variant of the higher-order tensor SVD estimator \widetildeX . We show that uniformly over all user-specified Tucker ranks (r_1,r_2,r_3) , | \widetildeX - X^* |_ \mathrmF^2 = O \Big( \kappa^2 \Big\ r_1r_2r_3+\sum_k=1^3 p_k r_k \Big\ ; + ; \xi_(r_1,r_2,r_3)^2\Big) \quad \text with high probability. Here, the bias term \xi_(r_1,r_2,r_3) corresponds to the best achievable approximation error of X^\ast over the class of tensors with Tucker ranks (r_1,r_2,r_3) ; \kappa^2 quantifies the noise level; and the variance term \kappa^2 \r_1r_2r_3+\sum_k=1^3 p_k r_k\ scales with the effective number of free parameters in the estimator \widetildeX . Our analysis achieves a clean rank-adaptive bias–variance tradeoff: as we increase the ranks of estimator \widetildeX , the bias \xi(r_1,r_2,r_3) decreases and the variance increases. As a byproduct we also obtain a convenient bias-variance decomposition for the vanilla low-rank SVD matrix estimators.
[LG-128] Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization
链接: https://arxiv.org/abs/2509.17251
作者: Jingfeng Wu,Peter L. Bartlett,Jason D. Lee,Sham M. Kakade,Bin Yu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems – those with fast and continuously decaying covariance spectra – which includes all problems satisfying the standard capacity condition. Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG) Cite as: arXiv:2509.17251 [stat.ML] (or arXiv:2509.17251v1 [stat.ML] for this version) https://doi.org/10.48550/arXiv.2509.17251 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-129] AI-based Methods for Simulating Sampling and Predicting Protein Ensembles
链接: https://arxiv.org/abs/2509.17224
作者: Bowen Jing,Bonnie Berger,Tommi Jaakkola
类目: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Biological Physics (physics.bio-ph)
*备注:
Abstract:Advances in deep learning have opened an era of abundant and accurate predicted protein structures; however, similar progress in protein ensembles has remained elusive. This review highlights several recent research directions towards AI-based predictions of protein ensembles, including coarse-grained force fields, generative models, multiple sequence alignment perturbation methods, and modeling of ensemble descriptors. An emphasis is placed on realistic assessments of the technological maturity of current methods, the strengths and weaknesses of broad families of techniques, and promising machine learning frameworks at an early stage of development. We advocate for “closing the loop” between model training, simulation, and inference to overcome challenges in training data availability and to enable the next generation of models.
[LG-130] Self-Supervised Discovery of Neural Circuits in Spatially Patterned Neural Responses with Graph Neural Networks NEURIPS2025
链接: https://arxiv.org/abs/2509.17174
作者: Kijung Yoon
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: To appear in NeurIPS 2025
Abstract:Inferring synaptic connectivity from neural population activity is a fundamental challenge in computational neuroscience, complicated by partial observability and mismatches between inference models and true circuit dynamics. In this study, we propose a graph-based neural inference model that simultaneously predicts neural activity and infers latent connectivity by modeling neurons as interacting nodes in a graph. The architecture features two distinct modules: one for learning structural connectivity and another for predicting future spiking activity via a graph neural network (GNN). Our model accommodates unobserved neurons through auxiliary nodes, allowing for inference in partially observed circuits. We evaluate this approach using synthetic data from ring attractor networks and real spike recordings from head direction cells in mice. Across a wide range of conditions, including varying recurrent connectivity, external inputs, and incomplete observations, our model consistently outperforms standard baselines, resolving spurious correlations more effectively and recovering accurate weight profiles. When applied to real data, the inferred connectivity aligns with theoretical predictions of continuous attractor models. These results highlight the potential of GNN-based models to infer latent neural circuitry through self-supervised structure learning, while leveraging the spike prediction task to flexibly link connectivity and dynamics across both simulated and biological neural systems.
[LG-131] DeepEOSNet: Capturing the dependency on thermodynamic state in property prediction tasks
链接: https://arxiv.org/abs/2509.17018
作者: Jan Pavšek,Alexander Mitsos,Manuel Dahmen,Tai Xuan Tan,Jan G. Rittig
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:We propose a machine learning (ML) architecture to better capture the dependency of thermodynamic properties on the independent states. When predicting state-dependent thermodynamic properties, ML models need to account for both molecular structure and the thermodynamic state, described by independent variables, typically temperature, pressure, and composition. Modern molecular ML models typically include state information by adding it to molecular fingerprint vectors or by embedding explicit (semi-empirical) thermodynamic relations. Here, we propose to rather split the information processing on the molecular structure and the dependency on states into two separate network channels: a graph neural network and a multilayer perceptron, whose output is combined by a dot product. We refer to our approach as DeepEOSNet, as this idea is based on the DeepONet architecture [Lu et al. (2021), Nat. Mach. Intell.]: instead of operators, we learn state dependencies, with the possibility to predict equation of states (EOS). We investigate the predictive performance of DeepEOSNet by means of three case studies, which include the prediction of vapor pressure as a function of temperature, and mixture molar volume as a function of composition, temperature, and pressure. Our results show superior performance of DeepEOSNet for predicting vapor pressure and comparable performance for predicting mixture molar volume compared to state-of-research graph-based thermodynamic prediction models from our earlier works. In fact, we see large potential of DeepEOSNet in cases where data is sparse in the state domain and the output function is structurally similar across different molecules. The concept of DeepEOSNet can easily be transferred to other ML architectures in molecular context, and thus provides a viable option for property prediction.
[LG-132] Deep Learning Inductive Biases for fMRI Time Series Classification during Resting-state and Movie-watching
链接: https://arxiv.org/abs/2509.16973
作者: Behdad Khodabandehloo,Reza Rajimehr
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Deep learning has advanced fMRI analysis, yet it remains unclear which architectural inductive biases are most effective at capturing functional patterns in human brain activity. This issue is particularly important in small-sample settings, as most datasets fall into this category. We compare models with three major inductive biases in deep learning including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers for the task of biological sex classification. These models are evaluated within a unified pipeline using parcellated multivariate fMRI time series from the Human Connectome Project (HCP) 7-Tesla cohort, which includes four resting-state runs and four movie-watching task runs. We assess performance on Whole-brain, subcortex, and 12 functional networks. CNNs consistently achieved the highest discrimination for sex classification in both resting-state and movie-watching, while LSTM and Transformer models underperformed. Network-resolved analyses indicated that the Whole-brain, Default Mode, Cingulo-Opercular, Dorsal Attention, and Frontoparietal networks were the most discriminative. These results were largely similar between resting-state and movie-watching. Our findings indicate that, at this dataset size, discriminative information is carried by local spatial patterns and inter-regional dependencies, favoring convolutional inductive bias. Our study provides insights for selecting deep learning architectures for fMRI time series classification.
[LG-133] Quantum Adaptive Self-Attention for Financial Rebalancing: An Empirical Study on Automated Market Makers in Decentralized Finance
链接: https://arxiv.org/abs/2509.16955
作者: Chi-Sheng Chen,Aidan Hung-Wen Tsai
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG); Computational Finance (q-fin.CP)
*备注:
Abstract:We formulate automated market maker (AMM) \emphrebalancing as a binary detection problem and study a hybrid quantum–classical self-attention block, \textbfQuantum Adaptive Self-Attention (QASA). QASA constructs quantum queries/keys/values via variational quantum circuits (VQCs) and applies standard softmax attention over Pauli- Z expectation vectors, yielding a drop-in attention module for financial time-series decision making. Using daily data for \textbfBTCUSDC over \textbfJan-2024–Jan-2025 with a 70/15/15 time-series split, we compare QASA against classical ensembles, a transformer, and pure quantum baselines under Return, Sharpe, and Max Drawdown. The \textbfQASA-Sequence variant attains the \emphbest single-model risk-adjusted performance (\textbf13.99% return; \textbfSharpe 1.76), while hybrid models average \textbf11.2% return (vs.\ 9.8% classical; 4.4% pure quantum), indicating a favorable performance–stability–cost trade-off.
[LG-134] Differential Privacy for Euclidean Jordan Algebra with Applications to Private Symmetric Cone Programming NEURIPS2025
链接: https://arxiv.org/abs/2509.16915
作者: Zhao Song,Jianfei Xue,Lichen Zhang
类目: Optimization and Control (math.OC); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
*备注: NeurIPS 2025
Abstract:In this paper, we study differentially private mechanisms for functions whose outputs lie in a Euclidean Jordan algebra. Euclidean Jordan algebras capture many important mathematical structures and form the foundation of linear programming, second-order cone programming, and semidefinite programming. Our main contribution is a generic Gaussian mechanism for such functions, with sensitivity measured in \ell_2 , \ell_1 , and \ell_\infty norms. Notably, this framework includes the important case where the function outputs are symmetric matrices, and sensitivity is measured in the Frobenius, nuclear, or spectral norm. We further derive private algorithms for solving symmetric cone programs under various settings, using a combination of the multiplicative weights update method and our generic Gaussian mechanism. As an application, we present differentially private algorithms for semidefinite programming, resolving a major open question posed by [Hsu, Roth, Roughgarden, and Ullman, ICALP 2014].
[LG-135] DoubleGen: Debiased Generative Modeling of Counterfactuals
链接: https://arxiv.org/abs/2509.16842
作者: Alex Luedtke,Kenji Fukumizu
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
*备注: Keywords: generative modeling, counterfactual, doubly robust, debiased machine learning
Abstract:Generative models for counterfactual outcomes face two key sources of bias. Confounding bias arises when approaches fail to account for systematic differences between those who receive the intervention and those who do not. Misspecification bias arises when methods attempt to address confounding through estimation of an auxiliary model, but specify it incorrectly. We introduce DoubleGen, a doubly robust framework that modifies generative modeling training objectives to mitigate these biases. The new objectives rely on two auxiliaries – a propensity and outcome model – and successfully address confounding bias even if only one of them is correct. We provide finite-sample guarantees for this robustness property. We further establish conditions under which DoubleGen achieves oracle optimality – matching the convergence rates standard approaches would enjoy if interventional data were available – and minimax rate optimality. We illustrate DoubleGen with three examples: diffusion models, flow matching, and autoregressive language models.
[LG-136] A Study on Stabilizer Rényi Entropy Estimation using Machine Learning
链接: https://arxiv.org/abs/2509.16799
作者: Vincenzo Lipardi,Domenica Dibenedetto,Georgios Stamoulis,Mark H.M. Winands
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Nonstabilizerness is a fundamental resource for quantum advantage, as it quantifies the extent to which a quantum state diverges from those states that can be efficiently simulated on a classical computer, the stabilizer states. The stabilizer Rényi entropy (SRE) is one of the most investigated measures of nonstabilizerness because of its computational properties and suitability for experimental measurements on quantum processors. Because computing the SRE for arbitrary quantum states is a computationally hard problem, we propose a supervised machine-learning approach to estimate it. In this work, we frame SRE estimation as a regression task and train a Random Forest Regressor and a Support Vector Regressor (SVR) on a comprehensive dataset, including both unstructured random quantum circuits and structured circuits derived from the physics-motivated one-dimensional transverse Ising model (TIM). We compare the machine-learning models using two different quantum circuit representations: one based on classical shadows and the other on circuit-level features. Furthermore, we assess the generalization capabilities of the models on out-of-distribution instances. Experimental results show that an SVR trained on circuit-level features achieves the best overall performance. On the random circuits dataset, our approach converges to accurate SRE estimations, but struggles to generalize out of distribution. In contrast, it generalizes well on the structured TIM dataset, even to deeper and larger circuits. In line with previous work, our experiments suggest that machine learning offers a viable path for efficient nonstabilizerness estimation.
[LG-137] QASTAnet: A DNN-based Quality Metric for Spatial Audio
链接: https://arxiv.org/abs/2509.16715
作者: Adrien Llave,Emma Granier,Grégory Pallone
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:
Abstract:In the development of spatial audio technologies, reliable and shared methods for evaluating audio quality are essential. Listening tests are currently the standard but remain costly in terms of time and resources. Several models predicting subjective scores have been proposed, but they do not generalize well to real-world signals. In this paper, we propose QASTAnet (Quality Assessment for SpaTial Audio network), a new metric based on a deep neural network, specialized on spatial audio (ambisonics and binaural). As training data is scarce, we aim for the model to be trainable with a small amount of data. To do so, we propose to rely on expert modeling of the low-level auditory system and use a neurnal network to model the high-level cognitive function of the quality judgement. We compare its performance to two reference metrics on a wide range of content types (speech, music, ambiance, anechoic, reverberated) and focusing on codec artifacts. Results demonstrate that QASTAnet overcomes the aforementioned limitations of the existing methods. The strong correlation between the proposed metric prediction and subjective scores makes it a good candidate for comparing codecs in their development.
[LG-138] Increase Alpha: Performance and Risk of an AI-Driven Trading Framework
链接: https://arxiv.org/abs/2509.16707
作者: Sid Ghatak,Arman Khaledian,Navid Parvini,Nariman Khaledian
类目: Portfolio Management (q-fin.PM); Machine Learning (cs.LG)
*备注: To get access to the data, please contact this http URL @increasealpha.com
Abstract:There are inefficiencies in financial markets, with unexploited patterns in price, volume, and cross-sectional relationships. While many approaches use large-scale transformers, we take a domain-focused path: feed-forward and recurrent networks with curated features to capture subtle regularities in noisy financial data. This smaller-footprint design is computationally lean and reliable under low signal-to-noise, crucial for daily production at scale. At Increase Alpha, we built a deep-learning framework that maps over 800 U.S. equities into daily directional signals with minimal computational overhead. The purpose of this paper is twofold. First, we outline the general overview of the predictive model without disclosing its core underlying concepts. Second, we evaluate its real-time performance through transparent, industry standard metrics. Forecast accuracy is benchmarked against both naive baselines and macro indicators. The performance outcomes are summarized via cumulative returns, annualized Sharpe ratio, and maximum drawdown. The best portfolio combination using our signals provides a low-risk, continuous stream of returns with a Sharpe ratio of more than 2.5, maximum drawdown of around 3%, and a near-zero correlation with the S\P 500 market benchmark. We also compare the model’s performance through different market regimes, such as the recent volatile movements of the US equity market in the beginning of 2025. Our analysis showcases the robustness of the model and significantly stable performance during these volatile periods. Collectively, these findings show that market inefficiencies can be systematically harvested with modest computational overhead if the right variables are considered. This report will emphasize the potential of traditional deep learning frameworks for generating an AI-driven edge in the financial market. Comments: To get access to the data, please contact this http URL@increasealpha.com Subjects: Portfolio Management (q-fin.PM); Machine Learning (cs.LG) Cite as: arXiv:2509.16707 [q-fin.PM] (or arXiv:2509.16707v1 [q-fin.PM] for this version) https://doi.org/10.48550/arXiv.2509.16707 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-139] Knowledge Distillation for Variational Quantum Convolutional Neural Networks on Heterogeneous Data
链接: https://arxiv.org/abs/2509.16699
作者: Kai Yu,Binbin Cai,Song Lin
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Distributed quantum machine learning faces significant challenges due to heterogeneous client data and variations in local model structures, which hinder global model aggregation. To address these challenges, we propose a knowledge distillation framework for variational quantum convolutional neural networks on heterogeneous data. The framework features a quantum gate number estimation mechanism based on client data, which guides the construction of resource-adaptive VQCNN circuits. Particle swarm optimization is employed to efficiently generate personalized quantum models tailored to local data characteristics. During aggregation, a knowledge distillation strategy integrating both soft-label and hard-label supervision consolidates knowledge from heterogeneous clients using a public dataset, forming a global model while avoiding parameter exposure and privacy leakage. Theoretical analysis shows that proposed framework benefits from quantum high-dimensional representation, offering advantages over classical approaches, and minimizes communication by exchanging only model indices and test outputs. Extensive simulations on the PennyLane platform validate the effectiveness of the gate number estimation and distillation-based aggregation. Experimental results demonstrate that the aggregated global model achieves accuracy close to fully supervised centralized training. These results shown that proposed methods can effectively handle heterogeneity, reduce resource consumption, and maintain performance, highlighting its potential for scalable and privacy-preserving distributed quantum learning.
[LG-140] System-Level Uncertainty Quantification with Multiple Machine Learning Models: A Theoretical Framework
链接: https://arxiv.org/abs/2509.16663
作者: Xiaoping Du
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:ML models have errors when used for predictions. The errors are unknown but can be quantified by model uncertainty. When multiple ML models are trained using the same training points, their model uncertainties may be statistically dependent. In reality, model inputs are also random with input uncertainty. The effects of these types of uncertainty must be considered in decision-making and design. This study develops a theoretical framework that generates the joint distribution of multiple ML predictions given the joint distribution of model uncertainties and the joint distribution of model inputs. The strategy is to decouple the coupling between the two types of uncertainty and transform them as independent random variables. The framework lays a foundation for numerical algorithm development for various specific applications.
[LG-141] Conditional Multidimensional Scaling with Incomplete Conditioning Data
链接: https://arxiv.org/abs/2509.16627
作者: Anh Tuan Bui
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Conditional multidimensional scaling seeks for a low-dimensional configuration from pairwise dissimilarities, in the presence of other known features. By taking advantage of available data of the known features, conditional multidimensional scaling improves the estimation quality of the low-dimensional configuration and simplifies knowledge discovery tasks. However, existing conditional multidimensional scaling methods require full data of the known features, which may not be always attainable due to time, cost, and other constraints. This paper proposes a conditional multidimensional scaling method that can learn the low-dimensional configuration when there are missing values in the known features. The method can also impute the missing values, which provides additional insights of the problem. Computer codes of this method are maintained in the cml R package on CRAN.
[LG-142] Overfitting in Adaptive Robust Optimization
链接: https://arxiv.org/abs/2509.16451
作者: Karl Zhu,Dimitris Bertsimas
类目: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 4 pages, 1 figure, NeuroIPS 2025 ML x OR workshop submission
Abstract:Adaptive robust optimization (ARO) extends static robust optimization by allowing decisions to depend on the realized uncertainty - weakly dominating static solutions within the modeled uncertainty set. However, ARO makes previous constraints that were independent of uncertainty now dependent, making it vulnerable to additional infeasibilities when realizations fall outside the uncertainty set. This phenomenon of adaptive policies being brittle is analogous to overfitting in machine learning. To mitigate against this, we propose assigning constraint-specific uncertainty set sizes, with harder constraints given stronger probabilistic guarantees. Interpreted through the overfitting lens, this acts as regularization: tighter guarantees shrink adaptive coefficients to ensure stability, while looser ones preserve useful flexibility. This view motivates a principled approach to designing uncertainty sets that balances robustness and adaptivity.
[LG-143] Low-Rank Adaptation of Evolutionary Deep Neural Networks for Efficient Learning of Time-Dependent PDEs
链接: https://arxiv.org/abs/2509.16395
作者: Jiahao Zhang,Shiheng Zhang,Guang Lin
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 17 pages
Abstract:We study the Evolutionary Deep Neural Network (EDNN) framework for accelerating numerical solvers of time-dependent partial differential equations (PDEs). We introduce a Low-Rank Evolutionary Deep Neural Network (LR-EDNN), which constrains parameter evolution to a low-rank subspace, thereby reducing the effective dimensionality of training while preserving solution accuracy. The low-rank tangent subspace is defined layer-wise by the singular value decomposition (SVD) of the current network weights, and the resulting update is obtained by solving a well-posed, tractable linear system within this subspace. This design augments the underlying numerical solver with a parameter efficient EDNN component without requiring full fine-tuning of all network weights. We evaluate LR-EDNN on representative PDE problems and compare it against corresponding baselines. Across cases, LR-EDNN achieves comparable accuracy with substantially fewer trainable parameters and reduced computational cost. These results indicate that low-rank constraints on parameter velocities, rather than full-space updates, provide a practical path toward scalable, efficient, and reproducible scientific machine learning for PDEs.
[LG-144] Similarity-Guided Diffusion for Long-Gap Music Inpainting ICASSP2026
链接: https://arxiv.org/abs/2509.16342
作者: Sean Turland,Eloi Moliner,Vesa Välimäki
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*备注: 5 pages, 2 figures. Submitted to IEEE ICASSP 2026. Audio examples and supplementary material are available at: this https URL
Abstract:Music inpainting aims to reconstruct missing segments of a corrupted recording. While diffusion-based generative models improve reconstruction for medium-length gaps, they often struggle to preserve musical plausibility over multi-second gaps. We introduce Similarity-Guided Diffusion Posterior Sampling (SimDPS), a hybrid method that combines diffusion-based inference with similarity search. Candidate segments are first retrieved from a corpus based on contextual similarity, then incorporated into a modified likelihood that guides the diffusion process toward contextually consistent reconstructions. Subjective evaluation on piano music inpainting with 2-s gaps shows that the proposed SimDPS method enhances perceptual plausibility compared to unguided diffusion and frequently outperforms similarity search alone when moderately similar candidates are available. These results demonstrate the potential of a hybrid similarity approach for diffusion-based audio enhancement with long gaps.
[LG-145] F-DWGNet: A Directed Weighted Graph Neural Network with Tensor Fusion for Multi-Omics Cancer Subtype Classification
链接: https://arxiv.org/abs/2509.16301
作者: Tiantian Yang,Zhiqian Chen
类目: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*备注: 9 pages, 4 figures, 4 tables
Abstract:Integration and analysis of multi-omics data provide valuable insights for cancer subtype classification. However, such data are inherently heterogeneous, high-dimensional, and exhibit complex intra- and inter-modality dependencies. Recent advances in graph neural networks (GNNs) offer powerful tools for modeling such structure. Yet, most existing methods rely on prior knowledge or predefined similarity networks to construct graphs, which are often undirected or unweighted, failing to capture the directionality and strength of biological interactions. Interpretability at both the modality and feature levels also remains limited. To address these challenges, we propose TF-DWGNet, a novel Graph Neural Network framework that combines tree-based Directed Weighted graph construction with Tensor Fusion for multiclass cancer subtype classification. TF-DWGNet introduces two key innovations: a supervised tree-based approach for constructing directed, weighted graphs tailored to each omics modality, and a tensor fusion mechanism that captures unimodal, bimodal, and trimodal interactions using low-rank decomposition for efficiency. TF-DWGNet enables modality-specific representation learning, joint embedding fusion, and interpretable subtype prediction. Experiments on real-world cancer datasets show that TF-DWGNet consistently outperforms state-of-the-art baselines across multiple metrics and statistical tests. Moreover, it provides biologically meaningful insights by ranking influential features and modalities. These results highlight TF-DWGNet’s potential for effective and interpretable multi-omics integration in cancer research.
[LG-146] Vibrational Fingerprints of Strained Polymers: A Spectroscopic Pathway to Mechanical State Prediction
链接: https://arxiv.org/abs/2509.16266
作者: Julian Konrad,Janina Mittelhaus,David M. Wilkins,Bodo Fiedler,Robert Meißner
类目: Chemical Physics (physics.chem-ph); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
*备注:
Abstract:The vibrational response of polymer networks under load provides a sensitive probe of molecular deformation and a route to non-destructive diagnostics. Here we show that machine-learned force fields reproduce these spectroscopic fingerprints with quantum-level fidelity in realistic epoxy thermosets. Using MACE-OFF23 molecular dynamics, we capture the experimentally observed redshifts of para-phenylene stretching modes under tensile load, in contrast to the harmonic OPLS-AA model. These shifts correlate with molecular elongation and alignment, consistent with Badger’s rule, directly linking vibrational features to local stress. To capture IR intensities, we trained a symmetry-adapted dipole moment model on representative epoxy fragments, enabling validation of strain responses. Together, these approaches provide chemically accurate and computationally accessible predictions of strain-dependent vibrational spectra. Our results establish vibrational fingerprints as predictive markers of mechanical state in polymer networks, pointing to new strategies for stress mapping and structural-health diagnostics in advanced materials.
[LG-147] Motional representation; the ability to predict odor characters using molecular vibrations
链接: https://arxiv.org/abs/2509.16245
作者: Yuki Harada,Shuichi Maeda,Junwei Shen,Taku Misonou,Hirokazu Hori,Shinichiro Nakamura
类目: Chemical Physics (physics.chem-ph); Machine Learning (cs.LG)
*备注:
Abstract:The prediction of odor characters is still impossible based on the odorant molecular structure. We designed a CNN-based regressor for computed parameters in molecular vibrations (CNN_vib), in order to investigate the ability to predict odor characters of molecular vibrations. In this study, we explored following three approaches for the predictability; (i) CNN with molecular vibrational parameters, (ii) logistic regression based on vibrational spectra, and (iii) logistic regression with molecular fingerprint(FP). Our investigation demonstrates that both (i) and (ii) provide predictablity, and also that the vibrations as an explanatory variable (i and ii) and logistic regression with fingerprints (iii) show nearly identical tendencies. The predictabilities of (i) and (ii), depending on odor descriptors, are comparable to those of (iii). Our research shows that odor is predictable by odorant molecular vibration as well as their shapes alone. Our findings provide insight into the representation of molecular motional features beyond molecular structures.
[LG-148] Machine Learning for Quantum Noise Reduction
链接: https://arxiv.org/abs/2509.16242
作者: Karan Kendre
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: Code and data available at: this https URL
Abstract:Quantum noise fundamentally limits the utility of near-term quantum devices, making error mitigation essential for practical quantum computation. While traditional quantum error correction codes require substantial qubit overhead and complex syndrome decoding, we propose a machine learning approach that directly reconstructs clean quantum states from noisy density matrices without additional qubits. We formulate quantum noise reduction as a supervised learning problem using a convolutional neural network (CNN) autoencoder architecture with a novel fidelity-aware composite loss function. Our method is trained and evaluated on a comprehensive synthetic dataset of 10,000 density matrices derived from random 5-qubit quantum circuits, encompassing five noise types (depolarizing, amplitude damping, phase damping, bit-flip, and mixed noise) across four intensity levels (0.05-0.20). The CNN successfully reconstructs quantum states across all noise conditions, achieving an average fidelity improvement from 0.298 to 0.774 (\Delta = 0.476). Notably, the model demonstrates superior performance on complex mixed noise scenarios and higher noise intensities, with mixed noise showing the highest corrected fidelity (0.807) and improvement (0.567). The approach effectively preserves both diagonal elements (populations) and off-diagonal elements (quantum coherences), making it suitable for entanglement-dependent quantum algorithms. While phase damping presents fundamental information-theoretic limitations, our results suggest that CNN-based density matrix reconstruction offers a promising, resource-efficient alternative to traditional quantum error correction for NISQ-era devices. This data-driven approach could enable practical quantum advantage with fewer physical qubits than conventional error correction schemes require.
信息检索
[IR-0] A Generative Framework for Personalized Sticker Retrieval EMNLP2025
链接: https://arxiv.org/abs/2509.17749
作者: Changjiang Zhou,Ruqing Zhang,Jiafeng Guo,Yu-An Liu,Fan Zhang,Ganyuan Luo,Xueqi Cheng
类目: Information Retrieval (cs.IR)
*备注: Findings of EMNLP2025
Abstract:Formulating information retrieval as a variant of generative modeling, specifically using autoregressive models to generate relevant identifiers for a given query, has recently attracted considerable attention. However, its application to personalized sticker retrieval remains largely unexplored and presents unique challenges: existing relevance-based generative retrieval methods typically lack personalization, leading to a mismatch between diverse user expectations and the retrieved results. To address this gap, we propose PEARL, a novel generative framework for personalized sticker retrieval, and make two key contributions: (i) To encode user-specific sticker preferences, we design a representation learning model to learn discriminative user representations. It is trained on three prediction tasks that leverage personal information and click history; and (ii) To generate stickers aligned with a user’s query intent, we propose a novel intent-aware learning objective that prioritizes stickers associated with higher-ranked intents. Empirical results from both offline evaluations and online tests demonstrate that PEARL significantly outperforms state-of-the-art methods.
[IR-1] Human vs. Agent in Task-Oriented Conversations SIGIR
链接: https://arxiv.org/abs/2509.17619
作者: Zhefan Wang,Ning Geng,Zhiqiang Guo,Weizhi Ma,Min Zhang
类目: Information Retrieval (cs.IR)
*备注: SIGIR-AP 2025
Abstract:Task-oriented conversational systems are essential for efficiently addressing diverse user needs, yet their development requires substantial amounts of high-quality conversational data that is challenging and costly to obtain. While large language models (LLMs) have demonstrated potential in generating synthetic conversations, the extent to which these agent-generated interactions can effectively substitute real human conversations remains unclear. This work presents the first systematic comparison between LLM-simulated users and human users in personalized task-oriented conversations. We propose a comprehensive analytical framework encompassing three key aspects (conversation strategy, interaction style, and conversation evaluation) and ten distinct dimensions for evaluating user behaviors, and collect parallel conversational datasets from both human users and LLM agent users across four representative scenarios under identical conditions. Our analysis reveals significant behavioral differences between the two user types in problem-solving approaches, question broadness, user engagement, context dependency, feedback polarity and promise, language style, and hallucination awareness. We found consistency in the agent users and human users across the depth-first or breadth-first dimensions, as well as the usefulness dimensions. These findings provide critical insights for advancing LLM-based user simulation. Our multi-dimensional taxonomy constructed a generalizable framework for analyzing user behavior patterns, offering insights from LLM agent users and human users. By this work, we provide perspectives on rethinking how to use user simulation in conversational systems in the future.
[IR-2] LongEval at CLEF 2025: Longitudinal Evaluation of IR Systems on Web and Scientific Data
链接: https://arxiv.org/abs/2509.17469
作者: Matteo Cancellieri,Alaa El-Ebshihy,Tobias Fink,Maik Fröbe,Petra Galuščáková,Gabriela Gonzalez-Saez,Lorraine Goeuriot,David Iommi,Jüri Keller,Petr Knoth,Philippe Mulhem,Florina Piroi,David Pride,Philipp Schaer
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The LongEval lab focuses on the evaluation of information retrieval systems over time. Two datasets are provided that capture evolving search scenarios with changing documents, queries, and relevance assessments. Systems are assessed from a temporal perspective-that is, evaluating retrieval effectiveness as the data they operate on changes. In its third edition, LongEval featured two retrieval tasks: one in the area of ad-hoc web retrieval, and another focusing on scientific article retrieval. We present an overview of this year’s tasks and datasets, as well as the participating systems. A total of 19 teams submitted their approaches, which we evaluated using nDCG and a variety of measures that quantify changes in retrieval effectiveness over time.
[IR-3] WildClaims: Information Access Conversations in the Wild(Chat)
链接: https://arxiv.org/abs/2509.17442
作者: Hideaki Joko,Shakiba Amirshahi,Charles L. A. Clarke,Faegheh Hasibi
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The rapid advancement of Large Language Models (LLMs) has transformed conversational systems into practical tools used by millions. However, the nature and necessity of information retrieval in real-world conversations remain largely unexplored, as research has focused predominantly on traditional, explicit information access conversations. The central question is: What do real-world information access conversations look like? To this end, we first conduct an observational study on the WildChat dataset, large-scale user-ChatGPT conversations, finding that users’ access to information occurs implicitly as check-worthy factual assertions made by the system, even when the conversation’s primary intent is non-informational, such as creative writing. To enable the systematic study of this phenomenon, we release the WildClaims dataset, a novel resource consisting of 121,905 extracted factual claims from 7,587 utterances in 3,000 WildChat conversations, each annotated for check-worthiness. Our preliminary analysis of this resource reveals that conservatively 18% to 51% of conversations contain check-worthy assertions, depending on the methods employed, and less conservatively, as many as 76% may contain such assertions. This high prevalence underscores the importance of moving beyond the traditional understanding of explicit information access, to address the implicit information access that arises in real-world user-system conversations.
[IR-4] Simplified Longitudinal Retrieval Experiments: A Case Study on Query Expansion and Document Boosting
链接: https://arxiv.org/abs/2509.17440
作者: Jüri Keller,Maik Fröbe,Gijs Hendriksen,Daria Alexander,Martin Potthast,Philipp Schaer
类目: Information Retrieval (cs.IR)
*备注: Best of labs paper for LongEval at CLEF 2024
Abstract:The longitudinal evaluation of retrieval systems aims to capture how information needs and documents evolve over time. However, classical Cranfield-style retrieval evaluations only consist of a static set of queries and documents and thereby miss time as an evaluation dimension. Therefore, longitudinal evaluations need to complement retrieval toolkits with custom logic. This custom logic increases the complexity of research software, which might reduce the reproducibility and extensibility of experiments. Based on our submissions to the 2024 edition of LongEval, we propose a custom extension of ir_datasets for longitudinal retrieval experiments. This extension allows for declaratively, instead of imperatively, describing important aspects of longitudinal retrieval experiments, e.g., which queries, documents, and/or relevance feedback are available at which point in time. We reimplement our submissions to LongEval 2024 against our new ir_datasets extension, and find that the declarative access can reduce the complexity of the code.
[IR-5] MLLM -Driven Semantic Identifier Generation for Generative Cross-Modal Retrieval
链接: https://arxiv.org/abs/2509.17359
作者: Tianyuan Li,Lei Wang,Ahtamjan Ahmat,Yating Yang,Bo Ma,Rui Dong,Bangju Han
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Generative cross-modal retrieval, which treats retrieval as a generation task, has emerged as a promising direction with the rise of Multimodal Large Language Models (MLLMs). In this setting, the model responds to a text query by generating an identifier corresponding to the target image. However, existing methods typically rely on manually crafted string IDs, clustering-based labels, or atomic identifiers requiring vocabulary expansion, all of which face challenges in semantic alignment or this http URL address these limitations, we propose a vocabulary-efficient identifier generation framework that prompts MLLMs to generate Structured Semantic Identifiers from image-caption pairs. These identifiers are composed of concept-level tokens such as objects and actions, naturally aligning with the model’s generation space without modifying the tokenizer. Additionally, we introduce a Rationale-Guided Supervision Strategy, prompting the model to produce a one-sentence explanation alongside each identifier serves as an auxiliary supervision signal that improves semantic grounding and reduces hallucinations during training.
[IR-6] Identifying and Upweighting Power-Niche Users to Mitigate Popularity Bias in Recommendations
链接: https://arxiv.org/abs/2509.17265
作者: David Liu,Erik Weis,Moritz Laber,Tina Eliassi-Rad,Brennan Klein
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recommender systems have been shown to exhibit popularity bias by over-recommending popular items and under-recommending relevant niche items. We seek to understand interactions with niche items in benchmark recommendation datasets as a step toward mitigating popularity bias. We find that, compared to mainstream users, niche-preferring users exhibit a longer-tailed activity-level distribution, indicating the existence of users who both prefer niche items and exhibit high activity levels. We partition users along two axes: (1) activity level (“power” vs. “light”) and (2) item-popularity preference (“mainstream” vs. “niche”), and show that in several benchmark datasets, the number of power-niche users (high activity and niche preference) is statistically significantly larger than expected under a null configuration model. Motivated by this observation, we propose a framework for reweighting the Bayesian Personalized Ranking (BPR) loss that simultaneously reweights based on user activity level and item popularity. Our method introduces two interpretable parameters: one controlling the significance of user activity level, and the other of item popularity. Experiments on benchmark datasets show that upweighting power-niche users reduces popularity bias and can increase overall performance. In contrast to previous work that only considers user activity level or item popularity in isolation, our results suggest that considering their interaction leads to Pareto-dominant performance.
[IR-7] mporal-Aware User Behaviour Simulation with Large Language Models for Recommender Systems
链接: https://arxiv.org/abs/2509.16895
作者: Xinye Wanyan,Danula Hettiachchi,Chenglong Ma,Ziqi Xu,Jeffrey Chan
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Large Language Models (LLMs) demonstrate human-like capabilities in language understanding, reasoning, and generation, driving interest in using LLM-based agents to simulate human feedback in recommender systems. However, most existing approaches rely on static user profiling, neglecting the temporal and dynamic nature of user interests. This limitation stems from a disconnect between language modelling and behaviour modelling, which constrains the capacity of agents to represent sequential patterns. To address this challenge, we propose a Dynamic Temporal-aware Agent-based simulator for Recommender Systems, DyTA4Rec, which enables agents to model and utilise evolving user behaviour based on historical interactions. DyTA4Rec features a dynamic updater for real-time profile refinement, temporal-enhanced prompting for sequential context, and self-adaptive aggregation for coherent feedback. Experimental results at group and individual levels show that DyTA4Rec significantly improves the alignment between simulated and actual user behaviour by modelling dynamic characteristics and enhancing temporal awareness in LLM-based agents.
[IR-8] Learn to Rank Risky Investors: A Case Study of Predicting Retail Traders Behaviour and Profitability
链接: https://arxiv.org/abs/2509.16616
作者: Weixian Waylon Li,Tiejun Ma
类目: Computational Engineering, Finance, and Science (cs.CE); Information Retrieval (cs.IR)
*备注: Accepted by ACM Transactions on Information Systems (TOIS)
Abstract:Identifying risky traders with high profits in financial markets is crucial for market makers, such as trading exchanges, to ensure effective risk management through real-time decisions on regulation compliance and hedging. However, capturing the complex and dynamic behaviours of individual traders poses significant challenges. Traditional classification and anomaly detection methods often establish a fixed risk boundary, failing to account for this complexity and dynamism. To tackle this issue, we propose a profit-aware risk ranker (PA-RiskRanker) that reframes the problem of identifying risky traders as a ranking task using Learning-to-Rank (LETOR) algorithms. Our approach features a Profit-Aware binary cross entropy (PA-BCE) loss function and a transformer-based ranker enhanced with a self-cross-trader attention pipeline. These components effectively integrate profit and loss (PL) considerations into the training process while capturing intra- and inter-trader relationships. Our research critically examines the limitations of existing deep learning-based LETOR algorithms in trading risk management, which often overlook the importance of PL in financial scenarios. By prioritising PL, our method improves risky trader identification, achieving an 8.4% increase in F1 score compared to state-of-the-art (SOTA) ranking models like Rankformer. Additionally, it demonstrates a 10%-17% increase in average profit compared to all benchmark models.
[IR-9] Decoding TRON: A Comprehensive Framework for Large-Scale Blockchain Data Extraction and Exploration
链接: https://arxiv.org/abs/2509.16292
作者: Qian’ang Mao,Jiaxin Wang,Zhiqi Feng,Yi Zhang,Jiaqi Yan
类目: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
*备注: written in early 2024
Abstract:Cryptocurrencies and Web3 applications based on blockchain technology have flourished in the blockchain research field. Unlike Bitcoin and Ethereum, due to its unique architectural designs in consensus mechanisms, resource management, and throughput, TRON has developed a more distinctive ecosystem and application scenarios centered around stablecoins. Although it is popular in areas like stablecoin payments and settlement, research on analyzing on-chain data from the TRON blockchain is remarkably scarce. To fill this gap, this paper proposes a comprehensive data extraction and exploration framework for the TRON blockchain. An innovative high-performance ETL system aims to efficiently extract raw on-chain data from TRON, including blocks, transactions, smart contracts, and receipts, establishing a research dataset. An in-depth analysis of the extracted dataset reveals insights into TRON’s block generation, transaction trends, the dominance of exchanges, the resource delegation market, smart contract usage patterns, and the central role of the USDT stablecoin. The prominence of gambling applications and potential illicit activities related to USDT is emphasized. The paper discusses opportunities for future research leveraging this dataset, including analysis of delegate services, gambling scenarios, stablecoin activities, and illicit transaction detection. These contributions enhance blockchain data management capabilities and understanding of the rapidly evolving TRON ecosystem.