本篇博文主要内容为 2025-09-24 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。

说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。

友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。

目录

概览 (2025-09-24)

今日共更新656篇论文,其中:

  • 自然语言处理79篇(Computation and Language (cs.CL))
  • 人工智能206篇(Artificial Intelligence (cs.AI))
  • 计算机视觉149篇(Computer Vision and Pattern Recognition (cs.CV))
  • 机器学习212篇(Machine Learning (cs.LG))

自然语言处理

[NLP-0] DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models Understanding on Indian Culture EMNLP

【速读】: 该论文旨在解决当前生成式 AI 系统在理解和推理文化相关多模态信息时存在的显著局限性,尤其是在印度这样语言和文化多样性极高的地区。现有基准测试多聚焦于全球通用场景,缺乏对本土文化细节的深度覆盖,导致模型在低资源语言和非主流传统中的表现不佳。解决方案的关键在于构建 DRISHTIKON——首个专注于印度文化的多模态、多语言基准数据集,涵盖15种印度语言、所有州及联邦领土,包含超过64,000个对齐的文本-图像对,覆盖节日、服饰、饮食、艺术形式和历史遗产等丰富文化主题。该数据集为评估视觉-语言模型(VLMs)的文化理解能力提供了细粒度、地域多样化的测试平台,从而推动更具包容性和文化敏感性的多模态人工智能技术发展。

链接: https://arxiv.org/abs/2509.19274
作者: Arijit Maji,Raghvendra Kumar,Akash Ghosh,Anushka,Nemil Shah,Abhilekh Borah,Vanshika Shah,Nishant Mishra,Sriparna Saha
机构: Indian Institute of Technology Patna (印度理工学院巴特那分校); Banasthali Vidyapeeth University (Banasthali大学); Pandit Deendayal Energy University (德丹达尔能源大学); Manipal University Jaipur (曼尼帕尔大学贾伊普尔分校); Dwarkadas J. Sanghvi College of Engineering (杜拉克达斯·J·桑格维工程学院)
类目: Computation and Language (cs.CL); Multimedia (cs.MM)
备注: EMNLP MAINS 2025

点击查看摘要

Abstract:We introduce DRISHTIKON, a first-of-its-kind multimodal and multilingual benchmark centered exclusively on Indian culture, designed to evaluate the cultural understanding of generative AI systems. Unlike existing benchmarks with a generic or global scope, DRISHTIKON offers deep, fine-grained coverage across India’s diverse regions, spanning 15 languages, covering all states and union territories, and incorporating over 64,000 aligned text-image pairs. The dataset captures rich cultural themes including festivals, attire, cuisines, art forms, and historical heritage amongst many more. We evaluate a wide range of vision-language models (VLMs), including open-source small and large models, proprietary systems, reasoning-specialized VLMs, and Indic-focused models, across zero-shot and chain-of-thought settings. Our results expose key limitations in current models’ ability to reason over culturally grounded, multimodal inputs, particularly for low-resource languages and less-documented traditions. DRISHTIKON fills a vital gap in inclusive AI research, offering a robust testbed to advance culturally aware, multimodally competent language technologies.
zh

[NLP-1] WolBanking77: Wolof Banking Speech Intent Classification Dataset

【速读】: 该论文旨在解决低资源语言(如塞内加尔的沃洛夫语)在意图分类(Intent Classification)任务中因数据稀缺而导致模型性能不足的问题,尤其是在文盲率较高的地区,语音输入比文本输入更为普遍。解决方案的关键在于构建并发布首个面向沃洛夫语的意图分类数据集——WolBanking77,该数据集包含9,791条银行领域文本句子和4小时以上的语音语料,支持自然语言处理(NLP)与自动语音识别(ASR)双模态研究,并通过多种先进基线模型的实验验证了其有效性,为后续在低资源语言上的意图识别研究提供了高质量的数据基础和可复现的基准。

链接: https://arxiv.org/abs/2509.19271
作者: Abdou Karim Kandji,Frédéric Precioso,Cheikh Ba,Samba Ndiaye,Augustin Ndione
机构: University of Gaston Berger (UGB)(加斯顿·贝杰尔大学); Inria, Université Côte d’Azur (UniCA) (法国蔚蓝海岸大学); Cheikh Anta Diop University (谢赫·安塔·迪奥普大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Intent classification models have made a lot of progress in recent years. However, previous studies primarily focus on high-resource languages datasets, which results in a gap for low-resource languages and for regions with a high rate of illiterate people where languages are more spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90% of the population, with an illiteracy rate of 42% for the country. Wolof is actually spoken by more than 10 million people in West African region. To tackle such limitations, we release a Wolof Intent Classification Dataset (WolBanking77), for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including text and voice state-of-the-art models. The results are very promising on this current dataset. This paper also provides detailed analyses of the contents of the data. We report baseline f1-score and word error rate metrics respectively on NLP and ASR models trained on WolBanking77 dataset and also comparisons between models. We plan to share and conduct dataset maintenance, updates and to release open-source code.
zh

[NLP-2] SloPalSpeech: A 28000-Hour Slovak Speech Corpus from Parliamentary Data

【速读】: 该论文旨在解决低资源语言(如斯洛伐克语)自动语音识别(ASR)因训练数据稀缺而导致性能受限的问题。其关键解决方案是构建了一个大规模的斯洛伐克语ASR数据集SloPalSpeech,包含2,806小时来自议会会议的语音数据,并开发了一套鲁棒的处理流程,将长时录音精确对齐并分割为干净的30秒音频-文本配对样本,从而有效提升模型训练质量。通过在该数据集上微调多个OpenAI Whisper模型(small、medium、large-v3及large-v3-turbo),显著降低了标准斯洛伐克语基准测试(如Common Voice和FLEURS)上的词错误率(WER),例如Whisper-small模型的WER最高下降70%,接近大型Whisper-large-v3模型的基线性能。

链接: https://arxiv.org/abs/2509.19270
作者: Erik Božík,Marek Šuppa
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Automatic Speech Recognition (ASR) for low-resource languages like Slovak is hindered by the scarcity of training data. To address this, we introduce SloPalSpeech, a new, large-scale Slovak ASR dataset containing 2,806 hours of speech from parliamentary proceedings. We developed a robust processing pipeline to align and segment long-form recordings into clean, 30-second audio-transcript pairs suitable for model training. We use this dataset to fine-tune several OpenAI Whisper models (small, medium, large-v3, and large-v3-turbo), achieving significant Word Error Rate (WER) reductions on standard Slovak benchmarks like Common Voice and FLEURS. For instance, the fine-tuned Whisper-small model’s WER dropped by up to 70%, approaching the baseline performance of the much larger Whisper-large-v3 model. To foster future research in low-resource speech recognition, we publicly release the complete SloPalSpeech dataset, the fully segmented transcripts (60 million words), and all our fine-tuned models.
zh

[NLP-3] Extracting Conceptual Spaces from LLM s Using Prototype Embeddings

【速读】: 该论文旨在解决如何从大语言模型(Large Language Models, LLMs)中有效提取可解释的概念空间(conceptual spaces)这一难题。概念空间通过认知上有意义的维度(如感知特征)来表示实体与概念,是实现可解释人工智能(Explainable AI)的重要基础,但其学习过程长期面临挑战。论文的关键解决方案在于:首先利用原型描述(如“非常甜的食物”)对特征(如“甜度”)进行编码,从而生成对应的嵌入向量;随后通过微调LLM,使这些原型嵌入与概念空间的潜在维度对齐,从而实现对概念空间结构的有效提取。实证分析表明,该方法在捕捉和构建概念空间方面具有显著有效性。

链接: https://arxiv.org/abs/2509.19269
作者: Nitesh Kumar,Usashi Chatterjee,Steven Schockaert
机构: Cardiff NLP, School of Computer Science and Informatics (卡迪夫自然语言处理中心,计算机科学与信息学学院); Cardiff University (卡迪夫大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Conceptual spaces represent entities and concepts using cognitively meaningful dimensions, typically referring to perceptual features. Such representations are widely used in cognitive science and have the potential to serve as a cornerstone for explainable AI. Unfortunately, they have proven notoriously difficult to learn, although recent LLMs appear to capture the required perceptual features to a remarkable extent. Nonetheless, practical methods for extracting the corresponding conceptual spaces are currently still lacking. While various methods exist for extracting embeddings from LLMs, extracting conceptual spaces also requires us to encode the underlying features. In this paper, we propose a strategy in which features (e.g. sweetness) are encoded by embedding the description of a corresponding prototype (e.g. a very sweet food). To improve this strategy, we fine-tune the LLM to align the prototype embeddings with the corresponding conceptual space dimensions. Our empirical analysis finds this approach to be highly effective.
zh

[NLP-4] Cross-Cultural Transfer of Commonsense Reasoning in LLM s: Evidence from the Arab World EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的西方中心主义偏见问题,尤其是在阿拉伯世界等非西方文化语境下,其常识推理能力受限的问题。研究发现,仅需来自某一阿拉伯国家的12个文化特定示例,即可在多语言模型中平均提升其他阿拉伯国家的常识推理性能达10%;更关键的是,跨文化示例(如来自印尼和美国的演示)在多项选择题(MCQ)推理任务上可达到甚至超越本地文化对齐的效果,表明文化常识具有跨区域迁移潜力。因此,解决方案的关键在于利用轻量级对齐方法(如上下文学习和基于示范的强化学习DITTO),实现高效、低成本的跨文化常识推理适应,为低资源文化场景下的LLM适配提供了新路径。

链接: https://arxiv.org/abs/2509.19265
作者: Saeed Almheiri,Rania Hossam,Mena Attia,Chenxi Wang,Preslav Nakov,Timothy Baldwin,Fajri Koto
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: EMNLP 2025 - Findings

点击查看摘要

Abstract:Large language models (LLMs) often reflect Western-centric biases, limiting their effectiveness in diverse cultural contexts. Although some work has explored cultural alignment, the potential for cross-cultural transfer, using alignment in one culture to improve performance in others, remains underexplored. This paper investigates cross-cultural transfer of commonsense reasoning in the Arab world, where linguistic and historical similarities coexist with local cultural differences. Using a culturally grounded commonsense reasoning dataset covering 13 Arab countries, we evaluate lightweight alignment methods such as in-context learning and demonstration-based reinforcement (DITTO), alongside baselines like supervised fine-tuning and direct preference optimization. Our results show that merely 12 culture-specific examples from one country can improve performance in others by 10% on average, within multilingual models. In addition, we demonstrate that out-of-culture demonstrations from Indonesia and US contexts can match or surpass in-culture alignment for MCQ reasoning, highlighting cultural commonsense transferability beyond the Arab world. These findings demonstrate that efficient cross-cultural alignment is possible and offer a promising approach to adapt LLMs to low-resource cultural settings.
zh

[NLP-5] Reinforcement Learning on Pre-Training Data

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)训练中计算资源的指数级增长与高质量文本数据有限增长之间的失衡问题,这一瓶颈限制了传统基于监督学习的扩展策略。其核心解决方案是提出一种名为“预训练数据上的强化学习”(Reinforcement Learning on Pre-Training data, RLPT)的新训练范式,关键在于利用预训练数据本身直接构建奖励信号,而非依赖人工标注的反馈(如RLHF或RLVR),从而实现无需人类干预的强化学习优化。具体而言,RLPT采用“下一文本段推理目标”,通过奖励模型对后续文本片段的准确预测来引导策略探索更广泛上下文中的有意义轨迹,进而提升模型的泛化推理能力。实验表明,该方法在多个通用及数学推理基准上显著优于基线模型,并展现出良好的可扩展性,为未来借助更多计算资源持续提升LLM性能提供了有效路径。

链接: https://arxiv.org/abs/2509.19249
作者: Siheng Li,Kejiao Li,Zenan Xu,Guanhua Huang,Evander Yang,Kun Li,Haoyuan Wu,Jiajia Wu,Zihao Zheng,Chenchen Zhang,Kun Shi,Kyrierl Deng,Qi Yi,Ruibin Xiong,Tingqiang Xu,Yuhao Jiang,Jianfeng Yan,Yuyuan Zeng,Guanghui Xu,Jinbao Xue,Zhijiang Xu,Zheng Fang,Shuai Li,Qibin Liu,Xiaoxue Li,Zhuoyu Li,Yangyu Tao,Fei Gao,Cheng Jiang,Bo Chao Wang,Kai Liu,Jianchen Zhu,Wai Lam,Wayyt Wang,Bo Zhou,Di Wang
机构: Tencent(腾讯); The Chinese University of Hong Kong(香港中文大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Work in progress

点击查看摘要

Abstract:The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0 , 5.1 , 8.1 , 6.0 , 6.6 , and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
zh

[NLP-6] Finding My Voice: Generative Reconstruction of Disordered Speech for Automated Clinical Evaluation

【速读】: 该论文旨在解决儿童语音障碍(Speech Sound Disorder, SSD)患者在语音重建过程中难以同时保留说话者身份特征并有效抑制发音错误的问题。传统方法多基于健康成人的语音训练,无法适配儿童特有的音高(pitch)与语调(prosody)特性,导致重建语音失真或身份混淆。其解决方案的关键在于提出一种解耦的、基于风格的端到端语音重建框架 ChiReSSD,通过分离语音内容与说话者风格特征,在保持儿童原始语音身份的同时实现对错误发音的有效修正。实验表明,该方法在 STAR 数据集上显著提升了词汇准确率和说话者身份保真度,并能自动预测语音中的音位内容,其结果与人工标注具有高度相关性(Pearson 相关系数 0.63),具备减少临床语音转录工作量的潜力;同时在 TORGO 数据集上的泛化实验也验证了其在成人运动性构音障碍语音重建中的有效性。

链接: https://arxiv.org/abs/2509.19231
作者: Karen Rosero,Eunjung Yeo,David R. Mortensen,Cortney Van’t Slot,Rami R. Hallac,Carlos Busso
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:We present ChiReSSD, a speech reconstruction framework that preserves children speaker’s identity while suppressing mispronunciations. Unlike prior approaches trained on healthy adult speech, ChiReSSD adapts to the voices of children with speech sound disorders (SSD), with particular emphasis on pitch and prosody. We evaluate our method on the STAR dataset and report substantial improvements in lexical accuracy and speaker identity preservation. Furthermore, we automatically predict the phonetic content in the original and reconstructed pairs, where the proportion of corrected consonants is comparable to the percentage of correct consonants (PCC), a clinical speech assessment metric. Our experiments show Pearson correlation of 0.63 between automatic and human expert annotations, highlighting the potential to reduce the manual transcription burden. In addition, experiments on the TORGO dataset demonstrate effective generalization for reconstructing adult dysarthric speech. Our results indicate that disentangled, style-based TTS reconstruction can provide identity-preserving speech across diverse clinical populations.
zh

[NLP-7] CompLLM : Compression for Long Context QA

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时因自注意力机制的二次复杂度而导致的显著计算瓶颈问题。现有软压缩方法通常将整个上下文作为一个整体进行压缩,导致压缩复杂度仍为二次级且无法在具有重叠上下文的不同查询间复用计算。其解决方案的关键在于提出CompLLM,一种分段独立压缩的软压缩技术:将输入上下文划分为多个片段并分别压缩,从而实现三个核心优势——效率提升(压缩复杂度线性增长)、可扩展性(模型可从短序列训练迁移至10万token长上下文)以及可复用性(压缩片段可缓存并在不同查询中重复使用)。实验表明,该方法在2倍压缩率下可使长上下文场景下的首次生成时间(TTFT)提速最高达4倍,KV缓存减少50%,同时性能与未压缩上下文相当甚至更优。

链接: https://arxiv.org/abs/2509.19228
作者: Gabriele Berton,Jayakrishnan Unnikrishnan,Son Tran,Mubarak Shah
机构: Amazon; Center For Research in Computer Vision, University of Central Florida
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
zh

[NLP-8] Systematic Comparative Analysis of Large Pretrained Language Models on Contextualized Medication Event Extraction

【速读】: 该论文旨在解决从电子健康记录(Electronic Health Record, EHR)中提取与患者用药事件相关的上下文信息这一临床自然语言处理(Natural Language Processing, NLP)任务。其核心问题是如何利用预训练的注意力机制模型有效识别药物事件及其多维上下文类别,从而提升EHR结构化信息抽取的准确性。解决方案的关键在于对比多种预训练模型(包括Bert Base、BioBert、两种变体的Bio+Clinical Bert、RoBerta及Clinical Longformer)在CMED数据集上的性能表现,并通过细调(fine-tuning)和针对EHR文本的预处理方法优化模型输入兼容性,最终发现:基于临床语料预训练的模型在药物及事件检测上更具优势,而通用领域预训练的Bert Base在药物事件上下文分类任务中表现最优。

链接: https://arxiv.org/abs/2509.19224
作者: Tariq Abdul-Quddoos,Xishuang Dong,Lijun Qian
机构: Prairie View A&M University (普莱里维尤A&M大学); Texas A&M University System (德克萨斯农工大学系统)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Attention-based models have become the leading approach in modeling medical language for Natural Language Processing (NLP) in clinical notes. These models outperform traditional techniques by effectively capturing contextual rep- resentations of language. In this research a comparative analysis is done amongst pre- trained attention based models namely Bert Base, BioBert, two variations of Bio+Clinical Bert, RoBerta, and Clinical Long- former on task related to Electronic Health Record (EHR) information extraction. The tasks from Track 1 of Harvard Medical School’s 2022 National Clinical NLP Challenges (n2c2) are considered for this comparison, with the Contextualized Medication Event Dataset (CMED) given for these task. CMED is a dataset of unstructured EHRs and annotated notes that contain task relevant information about the EHRs. The goal of the challenge is to develop effective solutions for extracting contextual information related to patient medication events from EHRs using data driven methods. Each pre-trained model is fine-tuned and applied on CMED to perform medication extraction, medical event detection, and multi-dimensional medication event context classification. Pro- cessing methods are also detailed for breaking down EHRs for compatibility with the applied models. Performance analysis has been carried out using a script based on constructing medical terms from the evaluation portion of CMED with metrics including recall, precision, and F1-Score. The results demonstrate that models pre-trained on clinical data are more effective in detecting medication and medication events, but Bert Base, pre- trained on general domain data showed to be the most effective for classifying the context of events related to medications.
zh

[NLP-9] Steering Multimodal Large Language Models Decoding for Context-Aware Safety

【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在现实应用中面临的安全决策能力不足问题,特别是其在上下文感知安全判断上的局限性,表现为过度敏感(对无害查询不当拒绝)和敏感不足(未能识别视觉锚定的风险),从而导致安全对齐存在持续缺口。解决方案的关键在于提出一种轻量且与模型无关的解码框架——Safety-aware Contrastive Decoding (SafeCoDe),其核心机制包含两个阶段:一是通过对比真实图像与高斯噪声图像来突出对视觉上下文敏感的词元(token),实现局部敏感性建模;二是结合场景级推理与词元级调节策略,动态调整拒绝行为以适应预测的安全判定结果,从而在保障模型有用性的同时提升上下文感知的安全响应能力。

链接: https://arxiv.org/abs/2509.19212
作者: Zheyuan Liu,Zhangchen Xu,Guangyao Dou,Xiangchi Yuan,Zhaoxuan Tan,Radha Poovendran,Meng Jiang
机构: University of Notre Dame (圣母大学); University of Washington (华盛顿大学); Johns Hopkins University (约翰霍普金斯大学); Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: A lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are increasingly deployed in real-world applications, yet their ability to make context-aware safety decisions remains limited. Existing methods often fail to balance oversensitivity (unjustified refusals of benign queries) and undersensitivity (missed detection of visually grounded risks), leaving a persistent gap in safety alignment. To address this issue, we introduce Safety-aware Contrastive Decoding (SafeCoDe), a lightweight and model-agnostic decoding framework that dynamically adjusts token generation based on multimodal context. SafeCoDe operates in two stages: (1) a contrastive decoding mechanism that highlights tokens sensitive to visual context by contrasting real and Gaussian-noised images, and (2) a global-aware token modulation strategy that integrates scene-level reasoning with token-level adjustment to adapt refusals according to the predicted safety verdict. Extensive experiments across diverse MLLM architectures and safety benchmarks, covering undersensitivity, oversensitivity, and general safety evaluations, show that SafeCoDe consistently improves context-sensitive refusal behaviors while preserving model helpfulness.
zh

[NLP-10] Online Process Reward Leanring for Agent ic Reinforcement Learning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在作为自主代理(agentic reinforcement learning, agentic RL)进行长期交互任务时,因稀疏且难以验证的奖励信号导致的时间信用分配(temporal credit assignment)难题。现有方法尝试引入过程监督(process supervision),但存在标注偏倚、奖励劫持(reward hacking)、细粒度信号方差过大或状态重叠罕见时失效等问题。其解决方案的关键在于提出在线过程奖励学习(Online Process Reward Learning, OPRL),一种无需额外采样或显式步骤标签即可与标准策略梯度算法无缝集成的信用分配策略:OPRL交替优化一个隐式过程奖励模型(Implicit Process Reward Model, PRM),通过基于轨迹的直接偏好优化(trajectory-based DPO)将轨迹偏好转化为隐式步骤奖励;这些步骤奖励用于计算步骤级优势,并与来自结果奖励的回合级优势结合以更新策略,形成自增强循环。理论证明所学步骤奖励与轨迹偏好一致并具有基于势能的奖励 shaping 性质,从而提供有界梯度以稳定训练。

链接: https://arxiv.org/abs/2509.19199
作者: Xiaoqian Liu,Ke Wang,Yuchuan Wu,Fei Huang,Yongbin Li,Junge Zhang,Jianbin Jiao
机构: University of Chinese Academy of Sciences (中国科学院大学); Tongyi Lab (通义实验室); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)
类目: Computation and Language (cs.CL)
备注: preprint

点击查看摘要

Abstract:Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents that reason and act over long horizons in interactive environments. However, sparse and sometimes unverifiable rewards make temporal credit assignment extremely challenging. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation, reward hacking, high-variance from overly fine-grained signals or failtures when state overlap is rare. We therefore introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL that integrates seamlessly with standard on-policy algorithms without relying on additional rollouts or explicit step labels. In OPRL, we optimize an implicit process reward model (PRM) alternately with the agent’s policy to transform trajectory preferences into implicit step rewards through a trajectory-based DPO objective. These step rewards are then used to compute step-level advantages, which are combined with episode-level advantages from outcome rewards for policy update, creating a self-reinforcing loop. Theoretical findings guarantee that the learned step rewards are consistent with trajectory preferences and act as potential-based shaping rewards, providing bounded gradients to stabilize training. Empirically, we evaluate OPRL on three distinct agent benmarks, including WebShop and VisualSokoban, as well as open-ended social interactions with unverfiable rewards in SOTOPIA. Crucially, OPRL shows superior performance over frontier LLMs and strong RL baselines across domains, achieving state-of-the-art results with higher sample-efficiency and lower variance during training. Further analysis also demonstrates the efficient exploration by OPRL using fewer actions, underscoring its potential for agentic learning in real-world scenarios. Comments: preprint Subjects: Computation and Language (cs.CL) Cite as: arXiv:2509.19199 [cs.CL] (or arXiv:2509.19199v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19199 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Xiaoqian Liu [view email] [v1] Tue, 23 Sep 2025 16:15:42 UTC (2,139 KB) Full-text links: Access Paper: View a PDF of the paper titled Online Process Reward Leanring for Agentic Reinforcement Learning, by Xiaoqian Liu and 6 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.CL prev | next new | recent | 2025-09 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[NLP-11] Soft Tokens Hard Truths

【速读】: 该论文旨在解决连续推理路径(continuous Chain-of-Thought, CoT)在实际训练中难以优化的问题,即以往方法要么仅在推理阶段使用连续token,要么依赖从离散CoT蒸馏得到的标签,导致计算成本高且推理长度受限。其解决方案的关键在于提出一种基于强化学习(Reinforcement Learning, RL)的可扩展训练范式,无需蒸馏即可直接学习连续CoT;通过引入“软”token机制——即混合token并添加输入嵌入噪声以增强探索能力,实现高效、大规模(数百token)的连续CoT训练,同时保持与标准离散token模型兼容性,显著提升推理多样性与泛化能力。

链接: https://arxiv.org/abs/2509.19170
作者: Natasha Butt,Ariel Kwiatkowski,Ismail Labiad,Julia Kempe,Yann Ollivier
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The use of continuous instead of discrete tokens during the Chain-of-Thought (CoT) phase of reasoning LLMs has garnered attention recently, based on the intuition that a continuous mixture of discrete tokens could simulate a superposition of several reasoning paths simultaneously. Theoretical results have formally proven that continuous tokens have much greater expressivity and can solve specific problems more efficiently. However, practical use of continuous tokens has been limited by strong training difficulties: previous works either just use continuous tokens at inference time on a pre-trained discrete-token model, or must distill the continuous CoT from ground-truth discrete CoTs and face computational costs that limit the CoT to very few tokens. This is the first work introducing a scalable method to learn continuous CoTs via reinforcement learning (RL), without distilling from reference discrete CoTs. We use “soft” tokens: mixtures of tokens together with noise on the input embedding to provide RL exploration. Computational overhead is minimal, enabling us to learn continuous CoTs with hundreds of tokens. On math reasoning benchmarks with Llama and Qwen models up to 8B, training with continuous CoTs match discrete-token CoTs for pass@1 and surpass them for pass@32, showing greater CoT diversity. In systematic comparisons, the best-performing scenario is to train with continuous CoT tokens then use discrete tokens for inference, meaning the “soft” models can be deployed in a standard way. Finally, we show continuous CoT RL training better preserves the predictions of the base model on out-of-domain tasks, thus providing a softer touch to the base model. Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.19170 [cs.CL] (or arXiv:2509.19170v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2509.19170 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[NLP-12] Measuring AI “Slop” in Text

【速读】: 该论文旨在解决当前缺乏对“AI slop”(即低质量AI生成文本)的明确定义及其量化评估方法的问题。其解决方案的关键在于通过与自然语言处理(Natural Language Processing, NLP)、写作和哲学领域的专家访谈,构建了一个可解释的“slop”分类体系,并提出了一组用于评估文本质量的维度,如连贯性和相关性等。研究进一步通过跨度级标注发现,尽管二元判断存在一定程度主观性,但这些判断仍与潜在的语言质量维度显著相关,从而为检测和偏好任务提供了可操作的框架。

链接: https://arxiv.org/abs/2509.19163
作者: Chantal Shaib,Tuhin Chakrabarty,Diego Garcia-Olano,Byron C. Wallace
机构: Northeastern University (东北大学); Stony Brook University (石溪大学); Meta AI (Meta人工智能)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:AI “slop” is an increasingly popular term used to describe low-quality AI-generated text, but there is currently no agreed upon definition of this term nor a means to measure its occurrence. In this work, we develop a taxonomy of “slop” through interviews with experts in NLP, writing, and philosophy, and propose a set of interpretable dimensions for its assessment in text. Through span-level annotation, we find that binary “slop” judgments are (somewhat) subjective, but such determinations nonetheless correlate with latent dimensions such as coherence and relevance. Our framework can be used to evaluate AI-generated text in both detection and binary preference tasks, potentially offering new insights into the linguistic and stylistic factors that contribute to quality judgments.
zh

[NLP-13] Anecdoctoring: Automated Red-Teaming Across Language and Place EMNLP2025

【速读】: 该论文旨在解决生成式AI(Generative AI)滥用带来的虚假信息(disinformation)风险在全球范围内评估不足的问题,尤其针对当前红队测试(red-teaming)数据集高度集中于美国和英语环境的局限性。其解决方案的关键在于提出了一种名为“anecdoctoring”的新型红队测试方法,该方法通过自动跨语言、跨文化生成对抗性提示(adversarial prompts),首先从英文、西班牙语和印地语三个语言及美国和印度两个地理区域的事实核查网站收集虚假信息主张(misinformation claims),再将这些主张聚类为更广泛的叙事(narratives),并利用知识图谱(knowledge graphs)对聚类结果进行表征,进而增强攻击型大语言模型(attacker LLM)的能力。此方法在攻击成功率和可解释性方面优于传统少样本提示(few-shot prompting)策略,凸显了构建全球尺度且基于真实对抗滥用场景的虚假信息缓解机制的必要性。

链接: https://arxiv.org/abs/2509.19143
作者: Alejandro Cuevas,Saloni Dash,Bharat Kumar Nayak,Dan Vann,Madeleine I. G. Daepp
机构: Carnegie Mellon University (卡内基梅隆大学); University of Washington (华盛顿大学); Microsoft Research (微软研究院); Independent Researcher (独立研究员)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: To be published in EMNLP 2025

点击查看摘要

Abstract:Disinformation is among the top risks of generative artificial intelligence (AI) misuse. Global adoption of generative AI necessitates red-teaming evaluations (i.e., systematic adversarial probing) that are robust across diverse languages and cultures, but red-teaming datasets are commonly US- and English-centric. To address this gap, we propose “anecdoctoring”, a novel red-teaming approach that automatically generates adversarial prompts across languages and cultures. We collect misinformation claims from fact-checking websites in three languages (English, Spanish, and Hindi) and two geographies (US and India). We then cluster individual claims into broader narratives and characterize the resulting clusters with knowledge graphs, with which we augment an attacker LLM. Our method produces higher attack success rates and offers interpretability benefits relative to few-shot prompting. Results underscore the need for disinformation mitigations that scale globally and are grounded in real-world adversarial misuse.
zh

[NLP-14] Context-Aware Hierarchical Taxonomy Generation for Scientific Papers via LLM -Guided Multi-Aspect Clustering EMNLP2025

【速读】: 该论文旨在解决科学文献快速增长背景下,现有分类体系构建方法在一致性(coherence)和粒度精细度(granularity)方面的不足问题。当前基于无监督聚类或直接提示大语言模型(Large Language Models, LLMs)的方法往往难以生成结构清晰、语义连贯的层级分类体系。其解决方案的关键在于提出一种上下文感知的层次化分类体系生成框架,该框架融合LLM引导的多维度编码与动态聚类机制:首先利用LLM识别每篇论文的关键方面(如方法论、数据集、评估指标),生成面向各方面的摘要;随后对每个方面独立进行编码与聚类,从而构建出逻辑一致、可解释性强且粒度可控的层次化分类结构。

链接: https://arxiv.org/abs/2509.19125
作者: Kun Zhu,Lizi Liao,Yuxuan Gu,Lei Huang,Xiaocheng Feng,Bing Qin
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 Main

点击查看摘要

Abstract:The rapid growth of scientific literature demands efficient methods to organize and synthesize research findings. Existing taxonomy construction methods, leveraging unsupervised clustering or direct prompting of large language models (LLMs), often lack coherence and granularity. We propose a novel context-aware hierarchical taxonomy generation framework that integrates LLM-guided multi-aspect encoding with dynamic clustering. Our method leverages LLMs to identify key aspects of each paper (e.g., methodology, dataset, evaluation) and generates aspect-specific paper summaries, which are then encoded and clustered along each aspect to form a coherent hierarchy. In addition, we introduce a new evaluation benchmark of 156 expert-crafted taxonomies encompassing 11.6k papers, providing the first naturally annotated dataset for this task. Experimental results demonstrate that our method significantly outperforms prior approaches, achieving state-of-the-art performance in taxonomy coherence, granularity, and interpretability.
zh

[NLP-15] Human-Annotated NER Dataset for the Kyrgyz Language

【速读】: 该论文旨在解决吉尔吉斯语(Kyrgyz)领域缺乏高质量、人工标注的命名实体识别(Named Entity Recognition, NER)数据集的问题,从而推动该低资源语言的自然语言处理研究。解决方案的关键在于构建了首个吉尔吉斯语NER数据集KyrgyzNER,包含1,499篇新闻文章、10,900句和39,075个实体标注,覆盖27类命名实体;同时系统评估了基于条件随机场(CRF)的传统序列标注模型与多语言预训练Transformer模型(如multilingual RoBERTa)在该数据集上的表现,发现后者在精度与召回率之间取得了更优平衡,尤其适用于低资源语言场景。这一成果为未来针对吉尔吉斯语及其他类似语言的细粒度标注方案和模型优化提供了基础支持。

链接: https://arxiv.org/abs/2509.19109
作者: Timur Turatali,Anton Alekseev,Gulira Jumalieva,Gulnara Kabaeva,Sergey Nikolenko
机构: The Cramer Project; PDMI RAS, SPbU, KFU; KSTU n.a. I. Razzakov; Dep. of Computer Linguistics; Information Technology Institute; St. Petersburg Dep. of Steklov Math. Institute; St. Petersburg State University
类目: Computation and Language (cs.CL)
备注: Accepted to TurkLang-2025 conference, DOI and copyright will be added upon confirmation of acceptance to publication in IEEE Xplore

点击查看摘要

Abstract:We introduce KyrgyzNER, the first manually annotated named entity recognition dataset for the Kyrgyz language. Comprising 1,499 news articles from the this http URL news portal, the dataset contains 10,900 sentences and 39,075 entity mentions across 27 named entity classes. We show our annotation scheme, discuss the challenges encountered in the annotation process, and present the descriptive statistics. We also evaluate several named entity recognition models, including traditional sequence labeling approaches based on conditional random fields and state-of-the-art multilingual transformer-based models fine-tuned on our dataset. While all models show difficulties with rare entity categories, models such as the multilingual RoBERTa variant pretrained on a large corpus across many languages achieve a promising balance between precision and recall. These findings emphasize both the challenges and opportunities of using multilingual pretrained models for processing languages with limited resources. Although the multilingual RoBERTa model performed best, other multilingual models yielded comparable results. This suggests that future work exploring more granular annotation schemes may offer deeper insights for Kyrgyz language processing pipelines evaluation.
zh

[NLP-16] Are most sentences unique? An empirical examination of Chomskyan claims

【速读】: 该论文试图解决语言学中一个长期存在的主张,即绝大多数语言表达(linguistic utterances)都是唯一的,例如Pinker(1994)援引Chomsky的观点指出,人们说出或理解的句子几乎都是宇宙历史上首次出现的新组合。为验证这一假设,论文采用NLTK Python库对不同语料库(corpora)进行解析,通过统计精确字符串匹配的数量来量化重复句子的比例。其解决方案的关键在于利用大规模语料库和自动化文本处理工具,实证检验“唯一性”是否普遍成立,并发现尽管在多数语料库中完全独特的句子占多数,但重复句子并非可忽略的组成部分,且这种现象高度依赖于语料库的体裁(genre)。

链接: https://arxiv.org/abs/2509.19108
作者: Hiram Ring
机构: NTU Singapore
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:A repeated claim in linguistics is that the majority of linguistic utterances are unique. For example, Pinker (1994: 10), summarizing an argument by Noam Chomsky, states that “virtually every sentence that a person utters or understands is a brand-new combination of words, appearing for the first time in the history of the universe.” With the increased availability of large corpora, this is a claim that can be empirically investigated. The current paper addresses the question by using the NLTK Python library to parse corpora of different genres, providing counts of exact string matches in each. Results show that while completely unique sentences are often the majority of corpora, this is highly constrained by genre, and that duplicate sentences are not an insignificant part of any individual corpus.
zh

[NLP-17] Pathways of Thoughts: Multi-Directional Thinking for Long-form Personalized Question Answering

【速读】: 该论文旨在解决个性化问答(Personalized Question Answering, PQA)系统中因用户偏好难以从长且噪声大的上下文中隐式推断,以及生成既准确又符合用户预期和背景知识的回答所面临的挑战。其解决方案的关键在于提出路径思维(Pathways of Thoughts, PoT)方法——这是一种适用于任何大语言模型(Large Language Model, LLM)的推理阶段策略,无需任务特定微调。PoT 将 LLM 的推理过程建模为一个迭代决策过程,动态选择包括推理、修订、个性化和澄清在内的认知操作,从而探索多种推理轨迹并生成多样化的候选回答;随后根据推断出的用户偏好对候选结果进行聚合与重加权,最终输出融合多路径优势的个性化响应,显著提升了准确性和用户满意度。

链接: https://arxiv.org/abs/2509.19094
作者: Alireza Salemi,Cheng Li,Mingyang Zhang,Qiaozhu Mei,Zhuowan Li,Spurthi Amba Hombaiah,Weize Kong,Tao Chen,Hamed Zamani,Michael Bendersky
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校); Google DeepMind(谷歌深度大脑); University of Michigan(密歇根大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:Personalization is essential for adapting question answering (QA) systems to user-specific information needs, thereby improving both accuracy and user satisfaction. However, personalized QA remains relatively underexplored due to challenges such as inferring preferences from long, noisy, and implicit contexts, and generating responses that are simultaneously correct, contextually appropriate, and aligned with user expectations and background knowledge. To address these challenges, we propose Pathways of Thoughts (PoT), an inference-stage method that applies to any large language model (LLM) without requiring task-specific fine-tuning. The approach models the reasoning of an LLM as an iterative decision process, where the model dynamically selects among cognitive operations such as reasoning, revision, personalization, and clarification. This enables exploration of multiple reasoning trajectories, producing diverse candidate responses that capture different perspectives. PoT then aggregates and reweights these candidates according to inferred user preferences, yielding a final personalized response that benefits from the complementary strengths of diverse reasoning paths. Experiments on the LaMP-QA benchmark for personalized QA show that PoT consistently outperforms competitive baselines, achieving up to a 13.1% relative improvement. Human evaluation corroborates these results, with annotators preferring outputs from PoT in 66% of cases and reporting ties in only 15% of cases.
zh

[NLP-18] Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

【速读】: 该论文旨在解决当前医疗影像模型普遍存在的局限性问题,即多数模型任务单一、需依赖多个专用网络,导致泛化能力不足;同时,尽管大语言模型和多模态模型具备强大的推理与多任务处理能力,但在真实临床场景中仍难以实现精确的视觉定位、多模态融合以及链式思维推理。解决方案的关键在于提出一种名为Citrus-V的多模态医学基础模型,其核心创新在于将图像分析与文本推理深度融合,集成检测、分割与多模态链式思维推理能力,从而在统一框架内实现像素级病灶定位、结构化报告生成及类医生诊断推理,显著提升了从视觉感知到临床决策的端到端性能。

链接: https://arxiv.org/abs/2509.19090
作者: Guoxin Wang,Jun Zhao,Xinyi Liu,Yanbo Liu,Xuyang Cao,Chao Li,Zhuoyun Liu,Qintian Sun,Fangru Zhou,Haoqiang Xing,Zhenhong Yang
机构: JDH Algo, JD Health International Inc. (京东健康国际有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.
zh

[NLP-19] ColorBlindnessEval: Can Vision-Language Models Pass Color Blindness Tests? ICLR2025

【速读】: 该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在复杂视觉干扰场景下鲁棒性不足的问题,特别是针对色觉障碍者在识别数字时可能遇到的视觉混淆情境。其解决方案的关键在于构建了一个名为ColorBlindnessEval的新基准,该基准基于Ishihara色盲测试设计,包含500张模拟色盲测试的图像,其中嵌入了0至99的数字,且颜色组合多样、视觉干扰复杂。通过该数据集对9种VLM进行Yes/No和开放式提示评估,并与人类参与者对比性能,揭示了当前模型在对抗性视觉环境下普遍存在幻觉问题,从而为提升VLM在真实世界应用中的可靠性提供了可量化、可复现的评测工具和改进方向。

链接: https://arxiv.org/abs/2509.19070
作者: Zijian Ling,Han Zhang,Yazhuo Zhou,Jiahao Cui
机构: Apply U (Apply U)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: Accepted at the Open Science for Foundation Models (SCI-FM) Workshop at ICLR 2025

点击查看摘要

Abstract:This paper presents ColorBlindnessEval, a novel benchmark designed to evaluate the robustness of Vision-Language Models (VLMs) in visually adversarial scenarios inspired by the Ishihara color blindness test. Our dataset comprises 500 Ishihara-like images featuring numbers from 0 to 99 with varying color combinations, challenging VLMs to accurately recognize numerical information embedded in complex visual patterns. We assess 9 VLMs using Yes/No and open-ended prompts and compare their performance with human participants. Our experiments reveal limitations in the models’ ability to interpret numbers in adversarial contexts, highlighting prevalent hallucination issues. These findings underscore the need to improve the robustness of VLMs in complex visual environments. ColorBlindnessEval serves as a valuable tool for benchmarking and improving the reliability of VLMs in real-world applications where accuracy is critical.
zh

[NLP-20] Charting a Decade of Computational Linguistics in Italy: The CLiC-it Corpus

【速读】: 该论文旨在解决意大利计算语言学(Computational Linguistics, CL)与自然语言处理(Natural Language Processing, NLP)研究趋势缺乏系统性追踪与量化分析的问题。其解决方案的关键在于构建并分析“CLiC-it语料库”——即对2014至2024年间首届至第十届意大利计算语言学会议(CLiC-it)全部论文的元数据(如作者来源、性别、机构归属等)和内容主题进行系统整理与深入挖掘,从而揭示该领域在十年间从词汇语义资源向语言建模及多模态方向演进的趋势,并为国际与本土研究社区提供基于实证的数据驱动洞察,以支持未来研究方向的决策制定。

链接: https://arxiv.org/abs/2509.19033
作者: Chiara Alzetta,Serena Auriemma,Alessandro Bondielli,Luca Dini,Chiara Fazzone,Alessio Miaschi,Martina Miliani,Marta Sartor
机构: Istituto di Linguistica Computazionale “Antonio Zampolli”, CNR, Pisa −- ItaliaNLP Lab; CoLingLab, Department of Philology, Literature and Linguistics, University of Pisa; Department of Computer Science, University of Pisa
类目: Computation and Language (cs.CL)
备注: Submitted to IJCoL

点击查看摘要

Abstract:Over the past decade, Computational Linguistics (CL) and Natural Language Processing (NLP) have evolved rapidly, especially with the advent of Transformer-based Large Language Models (LLMs). This shift has transformed research goals and priorities, from Lexical and Semantic Resources to Language Modelling and Multimodality. In this study, we track the research trends of the Italian CL and NLP community through an analysis of the contributions to CLiC-it, arguably the leading Italian conference in the field. We compile the proceedings from the first 10 editions of the CLiC-it conference (from 2014 to 2024) into the CLiC-it Corpus, providing a comprehensive analysis of both its metadata, including author provenance, gender, affiliations, and more, as well as the content of the papers themselves, which address various topics. Our goal is to provide the Italian and international research communities with valuable insights into emerging trends and key developments over time, supporting informed decisions and future directions in the field.
zh

[NLP-21] Investigating Test-Time Scaling with Reranking for Machine Translation

【速读】: 该论文试图解决大规模语言模型在机器翻译(Machine Translation, MT)任务中因参数规模扩大带来的高计算成本问题。传统方法依赖于增加模型参数量来提升性能,但代价高昂;为此,论文提出通过测试时扩展(Test-Time Scaling, TTS)策略,在推理阶段分配更多计算资源以提升翻译质量。其解决方案的关键在于采用一种简单而实用的“最佳N个候选”(best-of-N)框架,在WMT24基准上系统评估不同语言对、模型尺寸(3B–72B)和计算预算(N最高达1024)下的TTS效果,结果表明:对于高资源语言,TTS能稳定提升翻译质量并经由人工评估验证;小模型借助大N可媲美甚至超越大模型;而在固定计算预算下,大模型通常更高效,且低资源场景中TTS可能因指标盲点导致性能下降。

链接: https://arxiv.org/abs/2509.19020
作者: Shaomu Tan,Ryosuke Mitani,Ritvik Choudhary,Toshiyuki Sekiya
机构: University of Amsterdam (阿姆斯特丹大学); Sony Group Corporation (索尼集团)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Scaling model parameters has become the de facto strategy for improving NLP systems, but it comes with substantial computational costs. Test-Time Scaling (TTS) offers an alternative by allocating more computation at inference: generating multiple candidates and selecting the best. While effective in tasks such as mathematical reasoning, TTS has not been systematically explored for machine translation (MT). In this paper, we present the first systematic study of TTS for MT, investigating a simple but practical best-of-N framework on WMT24 benchmarks. Our experiments cover six high-resource and one low-resource language pairs, five model sizes (3B-72B), and various TTS compute budget (N up to 1024). Our results show that a) For high-resource languages, TTS generally improves translation quality according to multiple neural MT evaluation metrics, and our human evaluation confirms these gains; b) Augmenting smaller models with large N can match or surpass larger models at N=1 with more compute cost; c) Under fixed compute budgets, larger models are typically more efficient, and TTS can degrade quality due to metric blind spots in low-resource cases.
zh

[NLP-22] VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLM s via Travel Video Itinerary Reconstruction

【速读】: 该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在长距离时空轨迹理解方面的不足,尤其是现有视频基准测试主要聚焦于室内场景或短距离户外活动,缺乏对跨越广阔地理空间和时间尺度的视频理解能力的评估。其解决方案的关键在于提出VIR-Bench这一新型基准,包含200段旅行视频,将行程重建(itinerary reconstruction)作为核心任务,系统性地评估和推动MLLMs在地理时空智能方面的能力提升。实验表明,即使是最先进的MLLMs在该任务上表现有限,而基于VIR-Bench洞察开发的原型旅行规划代理则显著提升了行程推荐效果,验证了该评估协议不仅能有效衡量模型性能,还能转化为实际应用中的性能增益。

链接: https://arxiv.org/abs/2509.19002
作者: Hao Wang,Eiki Murata,Lingfang Zhang,Ayako Sato,So Fukuda,Ziqi Yin,Wentao Hu,Keisuke Nakao,Yusuke Nakamura,Sebastian Zwirner,Yi-Chia Chen,Hiroyuki Otomo,Hiroki Ouchi,Daisuke Kawahara
机构: Waseda University (早稻田大学); CyberAgent, Inc.; AI Shift, Inc.; Nara Institute of Science and Technology (奈良科学技术大学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs’ geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.
zh

[NLP-23] DTW-Align: Bridging the Modality Gap in End-to-End Speech Translation with Dynamic Time Warping Alignment

【速读】: 该论文旨在解决端到端语音翻译(End-to-End Speech Translation, E2E-ST)中因语音与文本模态表示差异导致的“模态鸿沟”(modality gap)问题。现有方法通常依赖词或token级别的对齐,但此类方法需要预先获得语言特定的对齐工具,难以推广至所有语言;而基于最近邻相似性搜索的替代方案虽无需对齐工具,却无法实现精确对齐。本文的关键解决方案是将动态时间规整(Dynamic Time Warping, DTW)引入训练阶段,用于对齐语音和文本嵌入表示,从而在不依赖外部对齐工具的前提下实现更精准的跨模态对齐。实验表明,该方法不仅提升了对齐准确性、显著加快了训练速度,还在低资源场景下于6个语言方向中的5个取得优于先前方法的性能。

链接: https://arxiv.org/abs/2509.18987
作者: Abderrahmane Issam,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
机构: Maastricht University (马斯特里赫特大学)
类目: Computation and Language (cs.CL)
备注: Accepted at WMT2025

点击查看摘要

Abstract:End-to-End Speech Translation (E2E-ST) is the task of translating source speech directly into target text bypassing the intermediate transcription step. The representation discrepancy between the speech and text modalities has motivated research on what is known as bridging the modality gap. State-of-the-art methods addressed this by aligning speech and text representations on the word or token level. Unfortunately, this requires an alignment tool that is not available for all languages. Although this issue has been addressed by aligning speech and text embeddings using nearest-neighbor similarity search, it does not lead to accurate alignments. In this work, we adapt Dynamic Time Warping (DTW) for aligning speech and text embeddings during training. Our experiments demonstrate the effectiveness of our method in bridging the modality gap in E2E-ST. Compared to previous work, our method produces more accurate alignments and achieves comparable E2E-ST results while being significantly faster. Furthermore, our method outperforms previous work in low resource settings on 5 out of 6 language directions.
zh

[NLP-24] Extractive Fact Decomposition for Interpretable Natural Language Inference in one Forward Pass EMNLP2025

【速读】: 该论文旨在解决自然语言推理(Natural Language Inference, NLI)任务中模型可解释性与鲁棒性不足的问题,尤其是现有方法依赖资源密集型生成式大语言模型(Generative Large Language Models, LLMs)进行原子事实分解所带来的计算开销和复杂性。其解决方案的关键在于提出一种仅使用编码器架构(encoder-only architecture)的模型JEDI,该模型在推理阶段无需生成式模型即可联合执行提取式原子事实分解与可解释推理;同时,通过构建大规模合成推理链(synthetic rationales)数据集来支持训练,从而在分布内表现具有竞争力,并显著提升分布外及对抗场景下的鲁棒性。

链接: https://arxiv.org/abs/2509.18901
作者: Nicholas Popovič,Michael Färber
机构: TU Dresden (德累斯顿工业大学); ScaDS.AI Dresden/Leipzig (ScaDS.AI 德累斯顿/莱比锡)
类目: Computation and Language (cs.CL)
备注: EMNLP 2025

点击查看摘要

Abstract:Recent works in Natural Language Inference (NLI) and related tasks, such as automated fact-checking, employ atomic fact decomposition to enhance interpretability and robustness. For this, existing methods rely on resource-intensive generative large language models (LLMs) to perform decomposition. We propose JEDI, an encoder-only architecture that jointly performs extractive atomic fact decomposition and interpretable inference without requiring generative models during inference. To facilitate training, we produce a large corpus of synthetic rationales covering multiple NLI benchmarks. Experimental results demonstrate that JEDI achieves competitive accuracy in distribution and significantly improves robustness out of distribution and in adversarial settings over models based solely on extractive rationale supervision. Our findings show that interpretability and robust generalization in NLI can be realized using encoder-only architectures and synthetic rationales. Code and data available at this https URL
zh

[NLP-25] Diversity Boosts AI-Generated Text Detection

【速读】: 该论文旨在解决当前AI生成文本检测方法在面对高质量大语言模型(Large Language Models, LLMs)输出时准确率低、可解释性差的问题,尤其是在教育、合规、新闻和社交媒体等场景中,合成文本的高流畅性易掩盖虚假信息或欺骗行为。解决方案的关键在于提出一种名为DivEye的新检测框架,其核心创新是利用基于 surprisal(意外度)的统计特征捕捉文本中不可预测性的波动模式;研究发现人类写作在词汇与结构层面表现出比LLM输出更丰富的不可预测性变化,DivEye通过提取此类可解释的统计信号实现高精度检测,不仅在零样本场景下优于现有方法达33.2%,且对改写和对抗攻击具有鲁棒性,并能作为辅助信号提升现有检测器性能达18.7%。

链接: https://arxiv.org/abs/2509.18880
作者: Advik Raj Basani,Pin-Yu Chen
机构: Birla Institute of Technology and Science, Goa (比特拉理工学院与科学学院,果阿分校); IBM Research, USA (国际商业机器公司研究部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Project Webpage: this https URL

点击查看摘要

Abstract:Detecting AI-generated text is an increasing necessity to combat misuse of LLMs in education, business compliance, journalism, and social media, where synthetic fluency can mask misinformation or deception. While prior detectors often rely on token-level likelihoods or opaque black-box classifiers, these approaches struggle against high-quality generations and offer little interpretability. In this work, we propose DivEye, a novel detection framework that captures how unpredictability fluctuates across a text using surprisal-based features. Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines across multiple benchmarks. DivEye is robust to paraphrasing and adversarial attacks, generalizes well across domains and models, and improves the performance of existing detectors by up to 18.7% when used as an auxiliary signal. Beyond detection, DivEye provides interpretable insights into why a text is flagged, pointing to rhythmic unpredictability as a powerful and underexplored signal for LLM detection.
zh

[NLP-26] Multi-Hierarchical Feature Detection for Large Language Model Generated Text

【速读】: 该论文旨在解决多特征融合是否能显著提升AI文本检测性能的问题,特别是在现代大语言模型(Large Language Models, LLMs)生成文本背景下,验证融合语义、句法与统计特征的多层级特征集成方法是否值得其带来的计算开销。解决方案的关键在于提出MHFD(Multi-Hierarchical Feature Detection)框架,通过自适应融合DeBERTa-based语义分析、句法解析和统计概率特征,在多个基准数据集上实现89.7%的域内检测准确率和84.2%的跨域稳定性能,但实验表明其仅带来0.4–0.5%的微小提升,同时引入4.2倍的计算复杂度,暗示当前神经语言模型已高效捕获大部分可检测信号,多特征融合的边际收益有限。

链接: https://arxiv.org/abs/2509.18862
作者: Luyan Zhang,Xinyu Xie
机构: Northeastern University (东北大学)
类目: Computation and Language (cs.CL)
备注: 9 pages, 6 tables, empirical study on multi-feature AI text detection

点击查看摘要

Abstract:With the rapid advancement of large language model technology, there is growing interest in whether multi-feature approaches can significantly improve AI text detection beyond what single neural models achieve. While intuition suggests that combining semantic, syntactic, and statistical features should provide complementary signals, this assumption has not been rigorously tested with modern LLM-generated text. This paper provides a systematic empirical investigation of multi-hierarchical feature integration for AI text detection, specifically testing whether the computational overhead of combining multiple feature types is justified by performance gains. We implement MHFD (Multi-Hierarchical Feature Detection), integrating DeBERTa-based semantic analysis, syntactic parsing, and statistical probability features through adaptive fusion. Our investigation reveals important negative results: despite theoretical expectations, multi-feature integration provides minimal benefits (0.4-0.5% improvement) while incurring substantial computational costs (4.2x overhead), suggesting that modern neural language models may already capture most relevant detection signals efficiently. Experimental results on multiple benchmark datasets demonstrate that the MHFD method achieves 89.7% accuracy in in-domain detection and maintains 84.2% stable performance in cross-domain detection, showing modest improvements of 0.4-2.6% over existing methods.
zh

[NLP-27] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

【速读】: 该论文旨在解决当前工具增强型大语言模型(Tool-augmented Large Language Models, TALLMs)在多轮交互中因缺乏有效错误诊断与修复机制而导致的失败重复问题。现有自反思(self-reflection)方法依赖启发式提示或单向推理,无法实现对错误的根本性识别和修正,导致模型在遭遇失败后常反复犯错。其解决方案的关键在于提出“结构化反思”(structured reflection),将从错误到修复的路径显式化为可控制、可训练的动作:模型生成简明而精确的反思内容,基于前一步证据诊断失败原因,并提出正确且可执行的后续工具调用。通过结合DAPO与GSPO目标函数并设计面向工具使用的奖励机制,优化“反思-调用-最终执行”的分步策略,显著提升了多轮工具调用的成功率与错误恢复能力。

链接: https://arxiv.org/abs/2509.18847
作者: Junhao Su,Yuanliang Wan,Junwei Yang,Hengyu Shi,Tianyang Han,Junfeng Luo,Yurui Qiu
机构: Meituan(美团); Meituan(美团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 9pages

点击查看摘要

Abstract:Tool-augmented large language models (LLMs) are usually trained with supervised imitation or coarse-grained reinforcement learning that optimizes single tool calls. Current self-reflection practices rely on heuristic prompts or one-way reasoning: the model is urged to ‘think more’ instead of learning error diagnosis and repair. This is fragile in multi-turn interactions; after a failure the model often repeats the same mistake. We propose structured reflection, which turns the path from error to repair into an explicit, controllable, and trainable action. The agent produces a short yet precise reflection: it diagnoses the failure using evidence from the previous step and then proposes a correct, executable follow-up call. For training we combine DAPO and GSPO objectives with a reward scheme tailored to tool use, optimizing the stepwise strategy Reflect, then Call, then Final. To evaluate, we introduce Tool-Reflection-Bench, a lightweight benchmark that programmatically checks structural validity, executability, parameter correctness, and result consistency. Tasks are built as mini trajectories of erroneous call, reflection, and corrected call, with disjoint train and test splits. Experiments on BFCL v3 and Tool-Reflection-Bench show large gains in multi-turn tool-call success and error recovery, and a reduction of redundant calls. These results indicate that making reflection explicit and optimizing it directly improves the reliability of tool interaction and offers a reproducible path for agents to learn from failure.
zh

[NLP-28] Are Smaller Open-Weight LLM s Closing the Gap to Proprietary Models for Biomedical Question Answering?

【速读】: 该论文旨在解决开放权重大型语言模型(Large Language Models, LLMs)是否能够有效替代闭源大模型在生物医学问答任务中的性能问题。研究聚焦于BioASQ挑战赛Task 13B Phase B,对比了多个开源模型与GPT-4o、Claude 3.5 Sonnet等顶级闭源模型的表现。其解决方案的关键在于:结合嵌入距离检索相关片段、上下文学习(in-context learning)以及结构化输出技术以增强问答准确性,并采用集成策略(ensemble approaches)融合不同模型的输出结果,从而显著提升小规模开放权重模型在精确答案类问题上的表现,甚至在某些情况下超越闭源模型。

链接: https://arxiv.org/abs/2509.18843
作者: Damian Stachura,Joanna Konieczna,Artur Nowak
机构: Evidence Prime(证据优先); Krakow(克拉科夫); Poland(波兰)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: CLEF 2025 Working Notes, 9-12 September 2025, Madrid, Spain

点击查看摘要

Abstract:Open-weight versions of large language models (LLMs) are rapidly advancing, with state-of-the-art models like DeepSeek-V3 now performing comparably to proprietary LLMs. This progression raises the question of whether small open-weight LLMs are capable of effectively replacing larger closed-source models. We are particularly interested in the context of biomedical question-answering, a domain we explored by participating in Task 13B Phase B of the BioASQ challenge. In this work, we compare several open-weight models against top-performing systems such as GPT-4o, GPT-4.1, Claude 3.5 Sonnet, and Claude 3.7 Sonnet. To enhance question answering capabilities, we use various techniques including retrieving the most relevant snippets based on embedding distance, in-context learning, and structured outputs. For certain submissions, we utilize ensemble approaches to leverage the diverse outputs generated by different models for exact-answer questions. Our results demonstrate that open-weight LLMs are comparable to proprietary ones. In some instances, open-weight LLMs even surpassed their closed counterparts, particularly when ensembling strategies were applied. All code is publicly available at this https URL.
zh

[NLP-29] Pay More Attention To Audio: Mitigating Imbalance of Cross-Modal Attention in Large Audio Language Models ICASSP2026

【速读】: 该论文旨在解决大型音频-语言模型(Large Audio-Language Models, LALMs)中存在的音频-文本注意力失衡问题,即模型在多模态融合层中过度关注文本信息而忽视声学特征,从而影响音频推理任务的性能。解决方案的关键在于提出一种无需训练的方法——MATA(More Attention To Audio),其核心机制是在自注意力机制的原始注意力得分之后进行干预,仅针对中间层中的最后一个token动态增强对音频token的关注度,且不引入额外参数或计算开销,从而有效缓解注意力偏倚并提升模型的音频处理能力。

链接: https://arxiv.org/abs/2509.18816
作者: Junyu Wang,Ziyang Ma,Zhengding Luo,Tianrui Wang,Meng Ge,Xiaobao Wang,Longbiao Wang
机构: 未知
类目: ound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Large Audio-Language Models (LALMs) often suffer from audio-textual attention imbalance, prioritizing text over acoustic information, particularly in the multi-modal fusion layers of the Transformer architecture. This bias hinders their ability to fully utilize acoustic cues, causing suboptimal performance on audio reasoning tasks. To mitigate this, we propose \textbfMATA, a novel training-free method that dynamically pushes LALMs to pay \textbfMore \textbfAttention \textbfTo \textbfAudio tokens within the self-attention mechanism. Specifically, MATA intervenes post raw attention scoring, targeting only the last token in intermediate layers without introducing additional parameters or computational overhead. Experiments on the MMAU and MMAR benchmarks confirm MATA’s effectiveness, with consistent performance gains. Notably, on MMAR, MATA enables an open-source model to surpass the proprietary Gemini 2.0 Flash for the first time. Our work provides an efficient solution to mitigate attention bias and opens a new research direction for enhancing the audio-processing capabilities of multi-modal models.
zh

[NLP-30] MAPEX: A Multi-Agent Pipeline for Keyphrase Extraction

【速读】: 该论文旨在解决现有无监督提示方法在大语言模型(Large Language Models, LLMs)上进行关键词提取(Keyphrase Extraction)时存在的局限性,即这些方法通常采用统一的单阶段推理流程和固定提示策略,无法根据文档长度或LLM架构动态调整,从而限制了对LLM推理与生成能力的充分挖掘。解决方案的关键在于提出MAPEX框架,首次将多智能体协作机制引入关键词提取任务,通过专家招募、候选词提取、主题引导、知识增强和后处理等模块协同工作,并设计双路径策略:短文本采用知识驱动提取,长文本则依赖主题引导提取,实现对不同场景的自适应优化,显著提升了模型在多个基准数据集上的性能表现。

链接: https://arxiv.org/abs/2509.18813
作者: Liting Zhang,Shiwan Zhao,Aobo Kong,Qicheng Li
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Keyphrase extraction is a fundamental task in natural language processing. However, existing unsupervised prompt-based methods for Large Language Models (LLMs) often rely on single-stage inference pipelines with uniform prompting, regardless of document length or LLM backbone. Such one-size-fits-all designs hinder the full exploitation of LLMs’ reasoning and generation capabilities, especially given the complexity of keyphrase extraction across diverse scenarios. To address these challenges, we propose MAPEX, the first framework that introduces multi-agent collaboration into keyphrase extraction. MAPEX coordinates LLM-based agents through modules for expert recruitment, candidate extraction, topic guidance, knowledge augmentation, and post-processing. A dual-path strategy dynamically adapts to document length: knowledge-driven extraction for short texts and topic-guided extraction for long texts. Extensive experiments on six benchmark datasets across three different LLMs demonstrate its strong generalization and universality, outperforming the state-of-the-art unsupervised method by 2.44% and standard LLM baselines by 4.01% in F1@5 on average. Code is available at this https URL.
zh

[NLP-31] Beyond the Leaderboard: Understanding Performance Disparities in Large Language Models via Model Diffing EMNLP2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在微调过程中性能提升机制不明确的问题,即传统基准测试难以解释为何一个模型在性能上优于另一个。为实现对模型能力差异的细粒度解析,研究提出采用模型差分(model diffing)这一机制可解释性方法,结合交叉编码器(crosscoders)识别并分类两个模型间的潜在表征差异。其关键在于通过量化不同latent概念的变化,揭示SimPO增强版本相较于Gemma-2-9b-it在安全性(+32.8%)、多语言能力(+43.8%)和指令遵循能力(+151.7%)上的显著提升,同时发现其训练导致模型自我指涉(-44.1%)和幻觉管理(-68.5%)能力下降。该方法超越了排行榜指标,实现了对性能差距的机制级归因,提供了一种透明且目标导向的LLM比较框架。

链接: https://arxiv.org/abs/2509.18792
作者: Sabri Boughorbel,Fahim Dalvi,Nadir Durrani,Majd Hawasly
机构: Qatar Computing Research Institute (卡塔尔计算研究研究所); HBKU (哈马德·本·哈利法大学)
类目: Computation and Language (cs.CL)
备注: 12 pages, accepted to the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

点击查看摘要

Abstract:As fine-tuning becomes the dominant paradigm for improving large language models (LLMs), understanding what changes during this process is increasingly important. Traditional benchmarking often fails to explain why one model outperforms another. In this work, we use model diffing, a mechanistic interpretability approach, to analyze the specific capability differences between Gemma-2-9b-it and a SimPO-enhanced variant. Using crosscoders, we identify and categorize latent representations that differentiate the two models. We find that SimPO acquired latent concepts predominantly enhance safety mechanisms (+32.8%), multilingual capabilities (+43.8%), and instruction-following (+151.7%), while its additional training also reduces emphasis on model self-reference (-44.1%) and hallucination management (-68.5%). Our analysis shows that model diffing can yield fine-grained insights beyond leaderboard metrics, attributing performance gaps to concrete mechanistic capabilities. This approach offers a transparent and targeted framework for comparing LLMs.
zh

[NLP-32] AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在建筑、工程与施工(Architecture, Engineering, and Construction, AEC)这一专业且安全关键领域中的鲁棒性和可靠性尚未被充分评估的问题。为应对这一挑战,研究提出了AECBench——一个全面的基准测试框架,其核心创新在于构建了一个五级认知导向的评估体系(涵盖知识记忆、理解、推理、计算和应用),并设计了4800道来自真实AEC实践的多样化任务题集,其中包含开放式问题;同时引入LLM-as-a-Judge方法,利用专家制定的评分标准对长文本响应进行可扩展且一致的评估。该方案的关键在于将领域专业知识系统化地融入评测流程,从而精准量化当前LLMs在AEC场景下的优势与局限,为后续可靠集成提供科学依据。

链接: https://arxiv.org/abs/2509.18776
作者: Chen Liang,Zhaoqi Huang,Haofen Wang,Fu Chai,Chunying Yu,Huanhuan Wei,Zhengjie Liu,Yanpeng Li,Hongjun Wang,Ruifeng Luo,Xianzhong Zhao
机构: Tongji University (同济大学); Shanghai Qi Zhi Institute (上海奇智研究院); Arcplus Group East China Architectural Design & Research Institute Co., Ltd. (弧普集团华东建筑设计研究院有限公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building lifecycle. However, the robustness and reliability of LLMs in such a specialized and safety-critical domain remain to be evaluated. To address this challenge, this paper establishes AECBench, a comprehensive benchmark designed to quantify the strengths and limitations of current LLMs in the AEC domain. The benchmark defines 23 representative tasks within a five-level cognition-oriented evaluation framework encompassing Knowledge Memorization, Understanding, Reasoning, Calculation, and Application. These tasks were derived from authentic AEC practice, with scope ranging from codes retrieval to specialized documents generation. Subsequently, a 4,800-question dataset encompassing diverse formats, including open-ended questions, was crafted primarily by engineers and validated through a two-round expert review. Furthermore, an LLM-as-a-Judge approach was introduced to provide a scalable and consistent methodology for evaluating complex, long-form responses leveraging expert-derived rubrics. Through the evaluation of nine LLMs, a clear performance decline across five cognitive levels was revealed. Despite demonstrating proficiency in foundational tasks at the Knowledge Memorization and Understanding levels, the models showed significant performance deficits, particularly in interpreting knowledge from tables in building codes, executing complex reasoning and calculation, and generating domain-specific documents. Consequently, this study lays the groundwork for future research and development aimed at the robust and reliable integration of LLMs into safety-critical engineering practices.
zh

[NLP-33] Financial Risk Relation Identification through Dual-view Adaptation EMNLP2025

【速读】: 该论文旨在解决传统方法在识别企业间风险关联时存在的主观性强、劳动密集且难以扩展的问题,尤其是在面对由监管变化、地缘政治紧张等多重互联风险事件引发的跨企业涟漪效应时。解决方案的关键在于利用Form 10-K文件(权威且标准化的财务披露文档)作为数据源,结合自然语言处理(Natural Language Processing, NLP)技术,通过基于时间顺序和词汇模式的无监督微调策略,提取隐含且抽象的企业间风险联系;进而构建一个领域特定的金融编码器(domain-specific financial encoder),实现更深层次的上下文理解,并引入量化风险关系评分以提升透明度与可解释性分析能力。

链接: https://arxiv.org/abs/2509.18775
作者: Wei-Ning Chiu,Yu-Hsiang Wang,Andy Hsiao,Yu-Shiang Huang,Chuan-Ju Wang
机构: National Taiwan University (国立台湾大学); Academia Sinica (中央研究院)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 3 figures, EMNLP 2025 Main Conference

点击查看摘要

Abstract:A multitude of interconnected risk events – ranging from regulatory changes to geopolitical tensions – can trigger ripple effects across firms. Identifying inter-firm risk relations is thus crucial for applications like portfolio management and investment strategy. Traditionally, such assessments rely on expert judgment and manual analysis, which are, however, subjective, labor-intensive, and difficult to scale. To address this, we propose a systematic method for extracting inter-firm risk relations using Form 10-K filings – authoritative, standardized financial documents – as our data source. Leveraging recent advances in natural language processing, our approach captures implicit and abstract risk connections through unsupervised fine-tuning based on chronological and lexical patterns in the filings. This enables the development of a domain-specific financial encoder with a deeper contextual understanding and introduces a quantitative risk relation score for transparency, interpretable analysis. Extensive experiments demonstrate that our method outperforms strong baselines across multiple evaluation settings.
zh

[NLP-34] When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models

【速读】: 该论文旨在解决监督微调(Supervised Fine-Tuning, SFT)中数据长度对大语言模型(Large Language Models, LLMs)在短文本任务上性能影响的不确定性问题。传统观点认为,使用长上下文数据进行预训练会导致短文本任务性能下降,但本文发现,采用长上下文SFT反而能提升短文本任务表现,这一现象与直觉相反。其解决方案的关键在于:通过解耦分析多头注意力(Multi-Head Attention, MHA)和前馈网络(Feed-Forward Network, FFN)两个核心组件,揭示二者均能独立从长上下文SFT中获益;进一步发现长上下文SFT倾向于促进情境知识(contextual knowledge),而短上下文SFT更偏好参数知识(parametric knowledge),从而导致仅依赖长上下文SFT存在知识偏好偏差;最终提出混合训练策略以缓解该偏差,为LMM的微调提供可解释性的指导。

链接: https://arxiv.org/abs/2509.18762
作者: Yingming Zheng,Hanqi Li,Kai Yu,Lu Chen
机构: X-LANCE Lab, MoE Key Lab of Artificial Intelligence, AI Institute, School of Computer Science, Shanghai Jiao Tong University (上海交通大学计算机科学学院); Shanghai Innovation Institution (上海创新机构); Jiangsu Key Lab of Language Computing (江苏省语言计算重点实验室); Suzhou Laboratory (苏州实验室)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have achieved impressive performance across natural language processing (NLP) tasks. As real-world applications increasingly demand longer context windows, continued pretraining and supervised fine-tuning (SFT) on long-context data has become a common approach. While the effects of data length in continued pretraining have been extensively studied, their implications for SFT remain unclear. In this work, we systematically investigate how SFT data length influences LLM behavior on short-context tasks. Counterintuitively, we find that long-context SFT improves short-context performance, contrary to the commonly observed degradation from long-context pretraining. To uncover the underlying mechanisms of this phenomenon, we first decouple and analyze two key components, Multi-Head Attention (MHA) and Feed-Forward Network (FFN), and show that both independently benefit from long-context SFT. We further study their interaction and reveal a knowledge preference bias: long-context SFT promotes contextual knowledge, while short-context SFT favors parametric knowledge, making exclusive reliance on long-context SFT suboptimal. Finally, we demonstrate that hybrid training mitigates this bias, offering explainable guidance for fine-tuning LLMs.
zh

[NLP-35] False Friends Are Not Foes: Investigating Vocabulary Overlap in Multilingual Language Models

【速读】: 该论文试图解决的问题是:多语言语料库训练的子词分词器(subword tokenizer)产生的跨语言词汇重叠是否有助于跨语言迁移,还是反而引发语言间的干扰。为回答这一问题,作者设计了一项受控实验,通过系统性地调整双语自回归模型在多个语言对上的词汇重叠程度来评估其影响,并引入了一个新维度——共享词汇的语义相似性。解决方案的关键在于:控制变量以排除频率和分词粒度等混杂因素,发现无论何种形式的词汇重叠均能促进嵌入空间中跨语言语义关系的捕捉,且在XNLI和XQuAD任务上,随着词汇重叠程度增加,跨语言迁移性能提升显著,从而证明了共享词汇对多语言模型设计具有实质性优势。

链接: https://arxiv.org/abs/2509.18750
作者: Julie Kallini,Dan Jurafsky,Christopher Potts,Martijn Bartelds
机构: Stanford University (斯坦福大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Subword tokenizers trained on multilingual corpora naturally produce overlapping tokens across languages. Does token overlap facilitate cross-lingual transfer or instead introduce interference between languages? Prior work offers mixed evidence, partly due to varied setups and confounders, such as token frequency or subword segmentation granularity. To address this question, we devise a controlled experiment where we train bilingual autoregressive models on multiple language pairs under systematically varied vocabulary overlap settings. Crucially, we explore a new dimension to understanding how overlap affects transfer: the semantic similarity of tokens shared across languages. We first analyze our models’ hidden representations and find that overlap of any kind creates embedding spaces that capture cross-lingual semantic relationships, while this effect is much weaker in models with disjoint vocabularies. On XNLI and XQuAD, we find that models with overlap outperform models with disjoint vocabularies, and that transfer performance generally improves as overlap increases. Overall, our findings highlight the advantages of token overlap in multilingual models and show that substantial shared vocabulary remains a beneficial design choice for multilingual tokenizers.
zh

[NLP-36] Global-Recent Semantic Reasoning on Dynamic Text-Attributed Graphs with Large Language Models

【速读】: 该论文旨在解决动态文本属性图(Dynamic Text-Attribute Graphs, DyTAGs)中长期与近期时序语义建模不足以及大语言模型(Large Language Models, LLMs)在处理海量演化文本时效率低下的问题。其核心挑战在于现有方法(如图神经网络和LLMs)通常仅针对静态文本属性图设计,难以有效捕捉DyTAG中节点随时间演化的全局语义动态和交互文本的近期依赖关系。解决方案的关键在于提出一种名为DyGRASP的新方法,通过两个创新机制实现高效且精准的时序语义推理:一是基于滑动窗口的节点中心隐式推理机制以高效捕获近期时序语义;二是利用定制化提示词与类似循环神经网络(RNN-like)链式结构进行显式推理,以推断节点的长期语义演化。最终,通过更新与融合层将近期、全局时序语义与动态图结构信息有机结合,显著提升了任务性能,在目的地节点检索任务中Hit@10指标最高提升34%,并展现出对不同Temporal GNNs和LLMs的良好泛化能力。

链接: https://arxiv.org/abs/2509.18742
作者: Yunan Wang,Jianxin Li,Ziwei Zhang
机构: Beihang University (北京航空航天大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Dynamic Text-Attribute Graphs (DyTAGs), characterized by time-evolving graph interactions and associated text attributes, are prevalent in real-world applications. Existing methods, such as Graph Neural Networks (GNNs) and Large Language Models (LLMs), mostly focus on static TAGs. Extending these existing methods to DyTAGs is challenging as they largely neglect the recent-global temporal semantics: the recent semantic dependencies among interaction texts and the global semantic evolution of nodes over time. Furthermore, applying LLMs to the abundant and evolving text in DyTAGs faces efficiency issues. To tackle these challenges, we propose Dynamic Global-Recent Adaptive Semantic Processing (DyGRASP), a novel method that leverages LLMs and temporal GNNs to efficiently and effectively reason on DyTAGs. Specifically, we first design a node-centric implicit reasoning method together with a sliding window mechanism to efficiently capture recent temporal semantics. In addition, to capture global semantic dynamics of nodes, we leverage explicit reasoning with tailored prompts and an RNN-like chain structure to infer long-term semantics. Lastly, we intricately integrate the recent and global temporal semantics as well as the dynamic graph structural information using updating and merging layers. Extensive experiments on DyTAG benchmarks demonstrate DyGRASP’s superiority, achieving up to 34% improvement in Hit@10 for destination node retrieval task. Besides, DyGRASP exhibits strong generalization across different temporal GNNs and LLMs.
zh

[NLP-37] LOTUSDIS: A Thai far-field meeting corpus for robust conversational ASR

【速读】: 该论文旨在解决泰国语远场对话自动语音识别(ASR)在真实场景中性能下降的问题,特别是针对麦克风距离变化导致的混响、噪声和设备色散等影响。其解决方案的关键在于构建并公开了一个大规模、多样化的泰语会议语料库(LOTUSDIS),包含114小时自发对话数据,覆盖从0.12米到10米的多距离录音,并使用九种不同类型的单通道麦克风采集,从而保留了真实的远场声学特性。通过在此数据集上对Whisper模型进行微调,显著提升了远场ASR的鲁棒性,尤其在最远距离麦克风上WER从81.6降至49.5,验证了距离多样性训练数据对提升模型泛化能力的重要性。

链接: https://arxiv.org/abs/2509.18722
作者: Pattara Tipaksorn,Sumonmas Thatphithakkul,Vataya Chunwijitra,Kwanchiva Thangthai
机构: 未知
类目: Computation and Language (cs.CL); Sound (cs.SD)
备注:

点击查看摘要

Abstract:We present LOTUSDIS, a publicly available Thai meeting corpus designed to advance far-field conversational ASR. The dataset comprises 114 hours of spontaneous, unscripted dialogue collected in 15-20 minute sessions with three participants, where overlapping speech is frequent and natural. Speech was recorded simultaneously by nine independent single-channel devices spanning six microphone types at distances from 0.12 m to 10 m, preserving the authentic effects of reverberation, noise, and device coloration without relying on microphone arrays. We provide standard train, dev, test splits and release a reproducible baseline system. We benchmarked several Whisper variants under zero-shot and fine-tuned conditions. Off-the-shelf models showed strong degradation with distance, confirming a mismatch between pre-training data and Thai far-field speech. Fine-tuning on LOTUSDIS dramatically improved robustness: a Thai Whisper baseline reduced overall WER from 64.3 to 38.3 and far-field WER from 81.6 to 49.5, with especially large gains on the most distant microphones. These results underscore the importance of distance-diverse training data for robust ASR. The corpus is available under CC-BY-SA 4.0. We also release training and evaluation scripts as a baseline system to promote reproducible research in this field.
zh

[NLP-38] MemOrb: A Plug-and-Play Verbal-Reinforcement Memory Layer for E-Commerce Customer Service

链接: https://arxiv.org/abs/2509.18713
作者: Yizhe Huang,Yang Liu,Ruiyu Zhao,Xiaolong Zhong,Xingming Yue,Ling Jiang
机构: Xiaoduo AI; Fudan University (复旦大学); East China University of Science and Technology (华东理工大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-39] Agent ic AutoSurvey: Let LLM LLM s Survey LLMs

【速读】: 该论文旨在解决科学文献指数增长背景下,研究人员在快速演进领域中进行知识综合所面临的挑战。现有自动化文献综述方法存在合成质量低、覆盖度不足等问题。其解决方案的关键在于提出一种多智能体框架——Agentic AutoSurvey,通过四个专业化智能体(论文搜索专家、主题挖掘与聚类代理、学术综述撰写者和质量评估者)协同工作,实现高质量的文献综述自动生成。该架构不仅提升了综述内容的组织性、整合性和批判性分析能力(12维评估指标下得分8.18/10),还能在单个主题处理75–443篇论文的同时保证高引用覆盖率(≥80%),显著优于基线方法(AutoSurvey得分为4.77/10)。

链接: https://arxiv.org/abs/2509.18661
作者: Yixin Liu,Yonghui Wu,Denghui Zhang,Lichao Sun
机构: Lehigh University (莱赫igh大学); University of Florida (佛罗里达大学); Stevens Institute of Technology (史蒂文斯理工学院)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: 29 pages, 7 figures

点击查看摘要

Abstract:The exponential growth of scientific literature poses unprecedented challenges for researchers attempting to synthesize knowledge across rapidly evolving fields. We present \textbfAgentic AutoSurvey, a multi-agent framework for automated survey generation that addresses fundamental limitations in existing approaches. Our system employs four specialized agents (Paper Search Specialist, Topic Mining \ Clustering, Academic Survey Writer, and Quality Evaluator) working in concert to generate comprehensive literature surveys with superior synthesis quality. Through experiments on six representative LLM research topics from COLM 2024 categories, we demonstrate that our multi-agent approach achieves significant improvements over existing baselines, scoring 8.18/10 compared to AutoSurvey’s 4.77/10. The multi-agent architecture processes 75–443 papers per topic (847 total across six topics) while targeting high citation coverage (often \geq 80% on 75–100-paper sets; lower on very large sets such as RLHF) through specialized agent orchestration. Our 12-dimension evaluation captures organization, synthesis integration, and critical analysis beyond basic metrics. These findings demonstrate that multi-agent architectures represent a meaningful advancement for automated literature survey generation in rapidly evolving scientific domains.
zh

[NLP-40] Analyzing Uncertainty of LLM -as-a-Judge: Interval Evaluations with Conformal Prediction EMNLP2025

链接: https://arxiv.org/abs/2509.18658
作者: Huanxin Sheng,Xinyi Liu,Hangfeng He,Jieyu Zhao,Jian Kang
机构: University of Rochester (罗切斯特大学); University of Southern California (南加州大学); MBZUAI
类目: Computation and Language (cs.CL)
备注: To appear in EMNLP 2025. Our code and data are available at \url{ this https URL

点击查看摘要

[NLP-41] Consistency-Aware Parameter-Preserving Knowledge Editing Framework for Multi-Hop Question Answering ICASSP2026

【速读】: 该论文旨在解决参数保持型知识编辑(Parameter-Preserving Knowledge Editing, PPKE)在多跳问答(Multi-Hop Question Answering, MHQA)任务中因知识一致性不足而导致的知识污染、更新不稳定以及检索行为偏离预期编辑意图的问题。解决方案的关键在于提出一种一致性感知的框架 CAPE-KG(Consistency-Aware Parameter-Preserving Editing with Knowledge Graphs),通过确保知识图谱(Knowledge Graph, KG)的构建、更新与检索始终与 MHQA 任务需求对齐,从而在未编辑和已编辑知识之间维持连贯推理,显著提升 PPKE 在 MHQA 场景下的准确性和可靠性。

链接: https://arxiv.org/abs/2509.18655
作者: Lingwen Deng,Yifei Han,Long Zhang,Yue Du,Bin Li
机构: 未知
类目: Computation and Language (cs.CL)
备注: Submitted to ICASSP 2026

点击查看摘要

Abstract:Parameter-Preserving Knowledge Editing (PPKE) enables updating models with new or corrected information without retraining or parameter adjustment. Recent PPKE approaches based on knowledge graphs (KG) to extend knowledge editing (KE) capabilities to multi-hop question answering (MHQA). However, these methods often lack consistency, leading to knowledge contamination, unstable updates, and retrieval behaviors that fail to reflect the intended edits. Such inconsistencies undermine the reliability of PPKE in multi- hop reasoning. We present CAPE-KG, Consistency-Aware Parameter-Preserving Editing with Knowledge Graphs, a novel consistency-aware framework for PPKE on MHQA. CAPE-KG ensures KG construction, update, and retrieval are always aligned with the requirements of the MHQA task, maintaining coherent reasoning over both unedited and edited knowledge. Extensive experiments on the MQuAKE benchmark show accuracy improvements in PPKE performance for MHQA, demonstrating the effectiveness of addressing consistency in PPKE.
zh

[NLP-42] A Good Plan is Hard to Find: Aligning Models with Preferences is Misaligned with What Helps Users EMNLP2025

链接: https://arxiv.org/abs/2509.18632
作者: Nishant Balepur,Matthew Shu,Yoo Yeon Sung,Seraphina Goldfarb-Tarrant,Shi Feng,Fumeng Yang,Rachel Rudinger,Jordan Lee Boyd-Graber
机构: 未知
类目: Computation and Language (cs.CL)
备注: EMNLP 2025

点击查看摘要

[NLP-43] OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation

【速读】: 该论文旨在解决放射学报告生成(Radiology Report Generation, RRG)任务中现有方法对数据和计算资源高度依赖的问题,尤其是多阶段训练和庞大模型架构导致的高成本瓶颈。其核心解决方案在于提出一种名为OraPO(Oracle-educated GRPO)的单阶段强化学习(Reinforcement Learning, RL)框架,并结合基于FactScore的奖励机制(FactS)。OraPO通过引入轻量级“oracle步骤”,将罕见或困难病例中GRPO探索失败的经验转化为直接偏好监督信号,从而避免冗长的多阶段训练;FactS则通过提取原子级临床事实并验证其与真实标签的蕴含关系,提供密集且可解释的句级奖励,使模型学习更贴近诊断证据。二者协同构建了一个高效、紧凑且性能优越的RRG系统,在CheXpert Plus数据集上以小规模视觉语言模型(VLM)和有限硬件条件下实现了新的SOTA(F1=0.341),训练数据量仅为传统方法的1/100–1/1000。

链接: https://arxiv.org/abs/2509.18600
作者: Zhuoxiao Chen,Hongyang Yu,Ying Xu,Yadan Luo,Long Duong,Yuan-Fang Li
机构: Oracle Health & AI; The University of Queensland
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2–3 orders of magnitude less training data using a small base VLM on modest hardware.
zh

[NLP-44] UniECG: Understanding and Generating ECG in One Unified Model

【速读】: 该论文旨在解决当前统一模型(如GPT-5)在心电图(Electrocardiogram, ECG)信号理解与生成任务中表现不足的问题,即无法准确解读ECG信号以提供医学诊断,也无法根据文本条件正确生成ECG信号。解决方案的关键在于提出UniECG——首个能够同时执行基于证据的ECG解释(ECG-to-Text)和文本引导的ECG生成(Text-to-ECG)任务的统一模型;其核心创新是采用解耦的两阶段训练策略:首先学习ECG到文本的解释能力,随后通过潜在空间对齐注入文本到ECG的生成能力,使模型可根据用户输入自主选择执行解释或生成任务,从而显著扩展现有ECG模型的能力边界。

链接: https://arxiv.org/abs/2509.18588
作者: Jiarui Jin,Haoyu Wang,Xiang Lan,Jun Li,Gaofeng Cheng,Hongyan Li,Shenda Hong
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent unified models such as GPT-5 have achieved encouraging progress on vision-language tasks. However, these unified models typically fail to correctly understand ECG signals and provide accurate medical diagnoses, nor can they correctly generate ECG signals. To address these limitations, we propose UniECG, the first unified model for ECG capable of concurrently performing evidence-based ECG interpretation and text-conditioned ECG generation tasks. Through a decoupled two-stage training approach, the model first learns evidence-based interpretation skills (ECG-to-Text), and then injects ECG generation capabilities (Text-to-ECG) via latent space alignment. UniECG can autonomously choose to interpret or generate an ECG based on user input, significantly extending the capability boundaries of current ECG models. Our code and checkpoints will be made publicly available at this https URL upon acceptance.
zh

[NLP-45] sqLoRA: Towards Sensitivity and Quality Low-Rank Adaptation for Efficient Fine-Tuning ICASSP2026

【速读】: 该论文旨在解决大预训练模型在下游任务微调过程中存在的计算成本高、内存占用大以及现有参数高效微调方法忽视不同模型层敏感性和训练数据质量的问题。其解决方案的关键在于提出TsqLoRA方法,该方法融合了基于数据质量的采样机制与敏感性感知的低秩适配(Low-Rank Adaptation, LoRA),通过质量感知的数据选择策略筛选最具信息量的训练样本,并引入动态秩分配模块根据各层对参数更新的敏感度自适应调整每层的低秩矩阵秩,从而在保证或提升性能的同时显著提高微调效率。

链接: https://arxiv.org/abs/2509.18585
作者: Yu Chen,Yifei Han,Long Zhang,Yue Du,Bin Li
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 5 pages, 4 figures, published to ICASSP2026

点击查看摘要

Abstract:Fine-tuning large pre-trained models for downstream tasks has become a fundamental approach in natural language processing. Fully fine-tuning all model parameters is computationally expensive and memory-intensive, especially in resource-constrained environments. Existing parameter-efficient fine-tuning methods reduce the number of trainable parameters but typically overlook the varying sensitivity of different model layers and the importance of training data. In this work, we propose TsqLoRA, a novel method that integrates data-quality-driven selection with sensitivity-aware low-rank adaptation, consisted of two main components: a quality-aware sampling mechanism for selecting the most informative training data, and a dynamic rank allocation module that adjusts the rank of each layer based on its sensitivity to parameter updates. The experimental results demonstrate that TsqLoRA improves fine-tuning efficiency while maintaining or even improving performance on a variety of NLP tasks. Our code will be available at this https URL.
zh

[NLP-46] Prior-based Noisy Text Data Filtering: Fast and Strong Alternative For Perplexity

链接: https://arxiv.org/abs/2509.18577
作者: Yeongbin Seo,Gayoung Kim,Jaehyung Kim,Jinyoung Yeo
机构: Yonsei University (延世大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-47] CCQA: Generating Question from Solution Can Improve Inference-Time Reasoning in SLMs EMNLP2025

链接: https://arxiv.org/abs/2509.18536
作者: Jin Young Kim,Ji Won Yoon
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Published as a main conference paper at EMNLP 2025

点击查看摘要

[NLP-48] race Is In Sentences: Unbiased Lightweight ChatGPT -Generated Text Detector

【速读】: 该论文旨在解决当前生成式 AI (Generative AI) 文本检测方法在面对重写(paraphrasing)或简单提示词修改(PSP)时鲁棒性不足、受模型自身词级模式(CWP)和训练数据内容诱导的偏差影响、对改写文本性能下降,以及依赖大型模型或在线大语言模型(LLM)交互等问题。其解决方案的关键在于提出一种基于文本内部结构特征的轻量级检测框架:通过预训练语言模型编码句向量,并利用注意力机制建模句内关系,从而捕捉在词级变化下保持不变的结构性信息;同时引入对比学习以缓解自回归生成带来的嵌入偏差,并结合因果图与反事实方法剥离主题相关偏倚,确保检测结果聚焦于结构特征而非语义内容。

链接: https://arxiv.org/abs/2509.18535
作者: Mo Mu,Dianqiao Lei,Chang Li
机构: 未知
类目: Computation and Language (cs.CL); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:The widespread adoption of ChatGPT has raised concerns about its misuse, highlighting the need for robust detection of AI-generated text. Current word-level detectors are vulnerable to paraphrasing or simple prompts (PSP), suffer from biases induced by ChatGPT’s word-level patterns (CWP) and training data content, degrade on modified text, and often require large models or online LLM interaction. To tackle these issues, we introduce a novel task to detect both original and PSP-modified AI-generated texts, and propose a lightweight framework that classifies texts based on their internal structure, which remains invariant under word-level changes. Our approach encodes sentence embeddings from pre-trained language models and models their relationships via attention. We employ contrastive learning to mitigate embedding biases from autoregressive generation and incorporate a causal graph with counterfactual methods to isolate structural features from topic-related biases. Experiments on two curated datasets, including abstract comparisons and revised life FAQs, validate the effectiveness of our method.
zh

[NLP-49] A Rhythm-Aware Phrase Insertion for Classical Arabic Poetry Composition

【速读】: 该论文旨在解决如何在阿拉伯诗歌中插入短语以符合特定节奏的问题,这在古典阿拉伯诗歌创作中具有重要意义。解决方案的关键在于利用ByT5模型(一种基于字节级的多语言Transformer模型),通过设计一种针对完全音标化阿拉伯文字符的规则式音素到节拍转换机制来提取节奏特征,并采用条件去噪目标进行微调,使模型能够重建被掩码的词以匹配目标节奏。此外,研究还引入课程学习策略,在通用阿拉伯语数据集上预训练后再在诗歌数据集上微调,并探索了从英语到阿拉伯语的跨语言迁移能力,从而在保持语义连贯性的前提下实现高节奏契合度。

链接: https://arxiv.org/abs/2509.18514
作者: Mohamad Elzohbi,Richard Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Accepted for the Third Arabic Natural Language Processing Conference (ArabicNLP 2025)

点击查看摘要

Abstract:This paper presents a methodology for inserting phrases in Arabic poems to conform to a specific rhythm using ByT5, a byte-level multilingual transformer-based model. Our work discusses a rule-based grapheme-to-beat transformation tailored for extracting the rhythm from fully diacritized Arabic script. Our approach employs a conditional denoising objective to fine-tune ByT5, where the model reconstructs masked words to match a target rhythm. We adopt a curriculum learning strategy, pre-training on a general Arabic dataset before fine-tuning on poetic dataset, and explore cross-lingual transfer from English to Arabic. Experimental results demonstrate that our models achieve high rhythmic alignment while maintaining semantic coherence. The proposed model has the potential to be used in co-creative applications in the process of composing classical Arabic poems.
zh

[NLP-50] Actions Speak Louder than Prompts: A Large-Scale Study of LLM s for Graph Inference

链接: https://arxiv.org/abs/2509.18487
作者: Ben Finkelshtein,Silviu Cucerzan,Sujay Kumar Jauhar,Ryen White
机构: University of Oxford (牛津大学); Microsoft Research (微软研究院)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-51] LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

【速读】: 该论文旨在解决Transformer架构在长序列场景下因二次计算复杂度导致的延迟高、训练成本大的问题,尤其是在需要处理超长上下文(如22K tokens)的应用中。其解决方案的关键在于提出LAWCAT(Linear Attention with Convolution Across Time),通过将预训练Transformer的知识高效迁移至线性复杂度的注意力结构中:一方面引入因果一维卷积(causal Conv1D)增强局部依赖建模能力,另一方面采用归一化门控线性注意力(normalized gated linear attention)提升模型在不同上下文长度下的泛化性能;实验表明,仅用1K长度序列蒸馏Mistral-7B即可实现高达90%的passkey检索准确率(支持至22K token),且相比原模型所需预训练token不足0.1%,同时在8K以上序列长度下预填速度优于FlashAttention-2,为边缘部署提供了高效可行的长上下文线性模型路径。

链接: https://arxiv.org/abs/2509.18467
作者: Zeyu Liu,Souvik Kundu,Lianghao Jiang,Anni Li,Srikanth Ronanki,Sravan Bodapati,Gourav Datta,Peter A. Beerel
机构: University of Southern California (南加州大学); Intel Labs (英特尔实验室); Amazon AGI (亚马逊AGI); Case Western Reserve University (凯斯西储大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 17 pages, 8 figures

点击查看摘要

Abstract:Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for latency-sensitive long-context applications. While recent linear-complexity alternatives are increasingly powerful, effectively training them from scratch is still resource-intensive. To overcome these limitations, we propose LAWCAT (Linear Attention with Convolution Across Time), a novel linearization framework designed to efficiently transfer the capabilities of pre-trained transformers into a performant linear attention architecture. LAWCAT integrates causal Conv1D layers to enhance local dependency modeling and employs normalized gated linear attention to improve generalization across varying context lengths. Our comprehensive evaluations demonstrate that, distilling Mistral-7B with only 1K-length sequences yields over 90% passkey retrieval accuracy up to 22K tokens, significantly extending its effective context window. Similarly, Llama3.2-1B LAWCAT variant achieves competitive performance on S-NIAH 1\2\3 tasks (1K-8K context length) and BABILong benchmark (QA2\QA3, 0K-16K context length), requiring less than 0.1% pre-training tokens compared with pre-training models. Furthermore, LAWCAT exhibits faster prefill speeds than FlashAttention-2 for sequences exceeding 8K tokens. LAWCAT thus provides an efficient pathway to high-performance, long-context linear models suitable for edge deployment, reducing reliance on extensive long-sequence training data and computational resources.
zh

[NLP-52] CogniLoad: A Synthetic Natural Language Reasoning Benchmark With Tunable Length Intrinsic Difficulty and Distractor Density

【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)在长上下文推理任务中评估基准存在的混淆因素问题,例如内在任务复杂度、干扰项干扰以及任务长度等因素相互混杂,导致难以进行精确的失败分析。为此,作者提出了CogniLoad——一个基于认知负荷理论(Cognitive Load Theory, CLT)的新型合成基准,其关键在于通过独立可调的参数系统性地控制认知负荷的三个核心维度:内在负荷(intrinsic load,由参数 dd 控制)、外在负荷(extraneous load,由干扰项与信号比 ρ\rho 调节)和相关负荷(germane load,以任务长度 NN 作为操作代理),从而实现对LLM推理能力的诊断性剖析与可重复、可扩展的量化评估。

链接: https://arxiv.org/abs/2509.18458
作者: Daniel Kaiser,Arnoldo Frigessi,Ali Ramezani-Kebrya,Benjamin Ricaud
机构: Integreat - Norwegian Centre for knowledge-driven machine learning(知识驱动机器学习挪威中心); UiT - The Arctic University of Norway(北极大学特罗姆瑟分校); University of Oslo(奥斯陆大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 29 pages (main: 12 + supplemental material: 17), 6 figures, 4 tables, Code: this https URL , Data: this https URL

点击查看摘要

Abstract:Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT’s core dimensions: intrinsic difficulty ( d ) controls intrinsic load; distractor-to-signal ratio ( \rho ) regulates extraneous load; and task length ( N ) serves as an operational proxy for conditions demanding germane load. Evaluating 22 SotA reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
zh

[NLP-53] Developing an AI framework to automatically detect shared decision-making in patient-doctor conversations

链接: https://arxiv.org/abs/2509.18439
作者: Oscar J. Ponce-Ponte,David Toro-Tobon,Luis F. Figueroa,Michael Gionfriddo,Megan Branda,Victor M. Montori,Saturnino Luz,Juan P. Brito
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 53 pages, 1 figure, 4 tables, 5 supplementary figures, 13 supplementary tables

点击查看摘要

[NLP-54] Memory-QA: Answering Recall Questions Based on Multimodal Memories

【速读】: 该论文旨在解决**记忆问答(Memory-QA)**任务中的核心挑战,即如何从先前存储的多模态记忆中准确回答关于视觉内容的回忆性问题。具体难点包括:构建面向任务的记忆、有效利用记忆中的时间与位置信息,以及融合多个记忆以支持复杂问答。解决方案的关键在于提出一个名为Pensieve的综合处理流程,其核心创新包括:针对记忆特性的数据增强方法、结合时间与位置信息的多信号检索机制,以及多记忆问答的微调策略,从而显著提升模型在真实场景下的问答准确性(相比最先进方法最高提升14%)。

链接: https://arxiv.org/abs/2509.18436
作者: Hongda Jiang,Xinyuan Zhang,Siddhant Garg,Rishab Arora,Shiun-Zu Kuo,Jiayang Xu,Christopher Brossman,Yue Liu,Aaron Colak,Ahmed Aly,Anuj Kumar,Xin Luna Dong
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Databases (cs.DB)
备注:

点击查看摘要

Abstract:We introduce Memory-QA, a novel real-world task that involves answering recall questions about visual content from previously stored multimodal memories. This task poses unique challenges, including the creation of task-oriented memories, the effective utilization of temporal and location information within memories, and the ability to draw upon multiple memories to answer a recall question. To address these challenges, we propose a comprehensive pipeline, Pensieve, integrating memory-specific augmentation, time- and location-aware multi-signal retrieval, and multi-memory QA fine-tuning. We created a multimodal benchmark to illustrate various real challenges in this task, and show the superior performance of Pensieve over state-of-the-art solutions (up to 14% on QA accuracy).
zh

[NLP-55] Evaluating the Creativity of LLM s in Persian Literary Text Generation

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在非英语文学文本生成中的创造力评估缺乏标准化方法的问题,尤其聚焦于波斯语(Persian)文学传统中文化相关表达的生成能力。其解决方案的关键在于:构建一个涵盖20个多样化主题的用户生成波斯语文学文本数据集,并基于托兰斯创造性思维测试(Torrance Tests of Creative Thinking)的四个维度(原创性、流畅性、灵活性和详尽性)对模型输出进行系统评估;同时引入大语言模型作为评判者实现自动化评分,并通过组内相关系数(intraclass correlation coefficients)验证其与人工判断的一致性,从而降低评估成本并提升可扩展性;此外,还定量分析了模型对拟人、隐喻、夸张和对比四种核心修辞手法的理解与运用能力,为后续优化提供了实证依据。

链接: https://arxiv.org/abs/2509.18401
作者: Armin Tourajmehr,Mohammad Reza Modarres,Yadollah Yaghoobzadeh
机构: Tehran Institute for Advanced Studies (德黑兰高级研究所); Khatam University (卡塔姆大学); School of Electrical and Computer Engineering (电气与计算机工程学院); College of Engineering (工程学院); University of Tehran (德黑兰大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have demonstrated notable creative abilities in generating literary texts, including poetry and short stories. However, prior research has primarily centered on English, with limited exploration of non-English literary traditions and without standardized methods for assessing creativity. In this paper, we evaluate the capacity of LLMs to generate Persian literary text enriched with culturally relevant expressions. We build a dataset of user-generated Persian literary spanning 20 diverse topics and assess model outputs along four creativity dimensions-originality, fluency, flexibility, and elaboration-by adapting the Torrance Tests of Creative Thinking. To reduce evaluation costs, we adopt an LLM as a judge for automated scoring and validate its reliability against human judgments using intraclass correlation coefficients, observing strong agreement. In addition, we analyze the models’ ability to understand and employ four core literary devices: simile, metaphor, hyperbole, and antithesis. Our results highlight both the strengths and limitations of LLMs in Persian literary text generation, underscoring the need for further refinement.
zh

[NLP-56] NormGenesis: Multicultural Dialogue Generation via Exemplar-Guided Social Norm Modeling and Violation Recovery EMNLP2025

链接: https://arxiv.org/abs/2509.18395
作者: Minki Hong,Jangho Choi,Jihie Kim
机构: Dongguk University (东国大学)
类目: Computation and Language (cs.CL)
备注: 39 pages, 17 figures, EMNLP 2025 Main Conference

点击查看摘要

[NLP-57] Interactive Real-Time Speaker Diarization Correction with Human Feedback

【速读】: 该论文旨在解决自动语音处理系统在无人工干预下(即“开环”模式)易出现说话人归属错误的问题,从而影响语音识别与说话人分割的准确性。其核心解决方案是构建一个基于大语言模型(LLM)辅助的实时说话人辨识校正系统,关键在于通过流式自动语音识别(ASR)和说话人分割(speaker diarization)结合 LLM 生成简洁摘要,使用户能以简短语音反馈即时修正错误,且不影响交互流程;此外,创新性地引入“合并时拆分”(split-when-merged, SWM)技术识别并拆分被误归为单说话人的多说话人片段,并基于用户校正动态进行在线说话人注册(online speaker enrollment),有效预防未来类似错误发生,实验证明该方案显著降低说话人错误率(DER)9.92% 和说话人混淆错误 44.23%。

链接: https://arxiv.org/abs/2509.18377
作者: Xinlu He,Yiwen Guan,Badrivishal Paurana,Zilin Dai,Jacob Whitehill
机构: 未知
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Most automatic speech processing systems operate in “open loop” mode without user feedback about who said what; yet, human-in-the-loop workflows can potentially enable higher accuracy. We propose an LLM-assisted speaker diarization correction system that lets users fix speaker attribution errors in real time. The pipeline performs streaming ASR and diarization, uses an LLM to deliver concise summaries to the users, and accepts brief verbal feedback that is immediately incorporated without disrupting interactions. Moreover, we develop techniques to make the workflow more effective: First, a split-when-merged (SWM) technique detects and splits multi-speaker segments that the ASR erroneously attributes to just a single speaker. Second, online speaker enrollments are collected based on users’ diarization corrections, thus helping to prevent speaker diarization errors from occurring in the future. LLM-driven simulations on the AMI test set indicate that our system substantially reduces DER by 9.92% and speaker confusion error by 44.23%. We further analyze correction efficacy under different settings, including summary vs full transcript display, the number of online enrollments limitation, and correction frequency.
zh

[NLP-58] Speech Vecalign: an Embedding-based Method for Aligning Parallel Speech Documents EMNLP2025

链接: https://arxiv.org/abs/2509.18360
作者: Chutong Meng,Philipp Koehn
机构: George Mason University (乔治梅森大学); Johns Hopkins University (约翰霍普金斯大学)
类目: Computation and Language (cs.CL)
备注: Accepted by EMNLP 2025 (main)

点击查看摘要

[NLP-59] Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLM s via Substitute Speculative Decoding NEURIPS2025

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在内存受限的消费级GPU上部署时面临的挑战,即如何在不牺牲生成质量的前提下显著提升推理速度。现有方法如模型压缩会降低输出质量,而参数卸载虽能保持质量但存在推理延迟高的问题;尽管推测解码(Speculative Decoding)可加速卸载过程,但其性能受限于草稿模型与目标模型之间的对齐程度,且多数方案需额外训练,效率提升有限。本文提出SubSpec,一种无需训练、无损的插件式加速方法:其核心创新在于通过低比特量化替换层(low-bit quantized substitute layers)构建高度对齐的草稿模型,并共享剩余驻留GPU的层及KV缓存(KV-Cache),从而大幅减少内存开销并提升token接受长度,最终实现平均9.1倍(Qwen2.5 7B)至12.5倍(Qwen2.5 32B)的推理加速效果。

链接: https://arxiv.org/abs/2509.18344
作者: Pei-Shuo Wang,Jian-Jia Chen,Chun-Che Yang,Chi-Chih Chang,Ning-Chi Huang,Mohamed S. Abdelfattah,Kai-Chiang Wu
机构: 未知
类目: Computation and Language (cs.CL)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade quality, and offloading maintains quality but suffers from slow inference. Speculative decoding presents a promising avenue to accelerate parameter offloading, utilizing a fast draft model to propose multiple draft tokens, which are then verified by the target LLM in parallel with a single forward pass. This method reduces the time-consuming data transfers in forward passes that involve offloaded weight transfers. Existing methods often rely on pretrained weights of the same family, but require additional training to align with custom-trained models. Moreover, approaches that involve draft model training usually yield only modest speedups. This limitation arises from insufficient alignment with the target model, preventing higher token acceptance lengths. To address these challenges and achieve greater speedups, we propose SubSpec, a plug-and-play method to accelerate parameter offloading that is lossless and training-free. SubSpec constructs a highly aligned draft model by generating low-bit quantized substitute layers from offloaded target LLM portions. Additionally, our method shares the remaining GPU-resident layers and the KV-Cache, further reducing memory overhead and enhance alignment. SubSpec achieves a high average acceptance length, delivering 9.1x speedup for Qwen2.5 7B on MT-Bench (8GB VRAM limit) and an average of 12.5x speedup for Qwen2.5 32B on popular generation benchmarks (24GB VRAM limit).
zh

[NLP-60] Brittleness and Promise: Knowledge Graph Based Reward Modeling for Diagnostic Reasoning

链接: https://arxiv.org/abs/2509.18316
作者: Saksham Khatwani,He Cheng,Majid Afshar,Dmitriy Dligach,Yanjun Gao
机构: University of Colorado Boulder (科罗拉多大学博尔德分校); University of Colorado Anschutz (科罗拉多大学安舒茨医学校区); University of Wisconsin - Madison (威斯康星大学麦迪逊分校); Loyola University (洛约拉大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-61] Exploiting Tree Structure for Credit Assignment in RL Training of LLM s

【速读】: 该论文旨在解决强化学习在大语言模型(Large Language Models, LLMs)推理任务中因稀疏延迟奖励导致的token级信用分配(credit assignment)瓶颈问题。针对可验证奖励场景(verifiable-reward setting),即每个提示可生成多个响应且最终答案可验证的情形,现有方法如PPO虽能提供token级优势但训练复杂且易过拟合,而GRPO虽无需价值网络(critic-free)却忽略了分支结构信息并均匀分配序列回报。解决方案的关键在于提出Prefix-to-Tree(P2T)机制,将多响应集合转化为前缀树(prefix tree),通过聚合后代结果非参数化地估计每个前缀的状态值V(s)V(s);在此基础上设计TEMPO算法,利用树结构中的分支门控时序差分(branch-gated temporal-difference, TD)修正项,在非分支token处退化为GRPO,在分支token处精准分配token级信用,无需额外教师或学习的价值网络,从而在保持高效训练的同时显著提升性能。

链接: https://arxiv.org/abs/2509.18314
作者: Hieu Tran,Zonghai Yao,Hong Yu
机构: University of Massachusetts, Amherst (马萨诸塞大学阿默斯特分校); University of Massachusetts, Lowell (马萨诸塞大学洛厄尔分校)
类目: Computation and Language (cs.CL)
备注: 15 pages

点击查看摘要

Abstract:Reinforcement learning improves LLM reasoning, yet sparse delayed reward over long sequences makes token-level credit assignment the key bottleneck. We study the verifiable-reward setting, where the final answer is checkable and multiple responses can be drawn per prompt. Reasoning tasks in math and medical QA align with this setup, where only a few decision tokens significantly impact the outcome. PPO offers token-level advantages with a learned value model, but it is complex to train both the actor and critic models simultaneously, and it is not easily generalizable, as the token-level values from the critic model can make training prone to overfitting. GRPO is critic-free and supports verifiable rewards, but spreads a single sequence-level return across tokens and ignores branching. We introduce \textbfPrefix-to-Tree (P2T), a simple procedure that converts a group of responses into a prefix tree and computes \emphnonparametric prefix values (V(s)) by aggregating descendant outcomes. Built on P2T, we propose \textbfTEMPO (\emph\textbfTree-\textbfEstimated \textbfMean Prefix Value for \textbfPolicy \textbfOptimization), a critic-free algorithm that augments the group-relative outcome signal of GRPO with \emphbranch-gated temporal-difference corrections derived from the tree. At non-branch tokens, the temporal-difference (TD) term is zero, so TEMPO reduces to GRPO; at branching tokens, it supplies precise token-level credit without a learned value network or extra judges/teachers. On Qwen3-1.7B/4B, TEMPO outperforms PPO and GRPO on in-distribution (MATH, MedQA) and out-of-distribution (GSM-HARD, AMC23, MedMCQA, MMLU-Medical) benchmarks, and reaches higher validation accuracy with roughly the same wall-clock time.
zh

[NLP-62] Evaluating Large Language Models for Detecting Antisemitism EMNLP2025

链接: https://arxiv.org/abs/2509.18293
作者: Jay Patel,Hrudayangam Mehta,Jeremy Blackburn
机构: Binghamton University (宾汉姆顿大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Accepted to EMNLP 2025 Main Conference

点击查看摘要

[NLP-63] he Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks

【速读】: 该论文试图解决当前医疗领域大模型(Large Frontier Models)在主流基准测试中表现优异,但实际推理能力脆弱、存在捷径学习(shortcut learning)和不可靠决策的问题。其核心问题是:现有医学基准测试未能真实反映模型的临床可靠性,反而可能因奖励“应试技巧”而非真正的医学理解,导致高分掩盖了系统性缺陷。解决方案的关键在于引入由临床医生指导的评分体系(clinician-guided rubric evaluation),通过多维度评估模型在不同基准上的表现差异,揭示其推理逻辑的合理性与鲁棒性(robustness),从而推动医疗生成式 AI (Generative AI) 的发展从单纯追求排行榜分数转向对真实医疗需求的契合度和可信度的严格验证。

链接: https://arxiv.org/abs/2509.18234
作者: Yu Gu,Jingjing Fu,Xiaodong Liu,Jeya Maria Jose Valanarasu,Noel Codella,Reuben Tan,Qianchu Liu,Ying Jin,Sheng Zhang,Jinyu Wang,Rui Wang,Lei Song,Guanghui Qin,Naoto Usuyama,Cliff Wong,Cheng Hao,Hohin Lee,Praneeth Sanapathi,Sarah Hilado,Bian Jiang,Javier Alvarez-Valle,Mu Wei,Jianfeng Gao,Eric Horvitz,Matt Lungren,Hoifung Poon,Paul Vozila
机构: Microsoft Research (微软研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 35 pages

点击查看摘要

Abstract:Large frontier models like GPT-5 now achieve top scores on medical benchmarks. But our stress tests tell a different story. Leading systems often guess correctly even when key inputs like images are removed, flip answers under trivial prompt changes, and fabricate convincing yet flawed reasoning. These aren’t glitches; they expose how today’s benchmarks reward test-taking tricks over medical understanding. We evaluate six flagship models across six widely used benchmarks and find that high leaderboard scores hide brittleness and shortcut learning. Through clinician-guided rubric evaluation, we show that benchmarks vary widely in what they truly measure yet are treated interchangeably, masking failure modes. We caution that medical benchmark scores do not directly reflect real-world readiness. If we want AI to earn trust in healthcare, we must demand more than leaderboard wins and must hold systems accountable for robustness, sound reasoning, and alignment with real medical demands.
zh

[NLP-64] Conversational Orientation Reasoning : Egocentric-to-Allocentric Navigation with Multimodal Chain-of-Thought

【速读】: 该论文旨在解决对话式智能体在室内或复杂环境中进行空间导航时,如何将第一人称视角的语义表达(如“在我的右边”)准确转换为绝对方向(北/东/南/西)的问题,尤其是在缺乏GPS信号和详细地图的情况下。这一挑战在非英语环境及自动语音识别(ASR)转录场景中尤为突出。解决方案的关键在于提出一种多模态链式思维(Multimodal Chain-of-Thought, MCoT)框架,通过结构化的三步推理过程——提取空间关系、映射坐标至绝对方向、推断用户朝向——融合ASR转录语音与地标坐标信息,从而实现高精度的空间定向推理。该方法在资源受限模型上表现出优异性能,且对ASR噪声、多语言混用等复杂条件具有鲁棒性,验证了结构化MCoT在可解释性和效率上的优势。

链接: https://arxiv.org/abs/2509.18200
作者: Yu Ti Huang
机构: National Taiwan University (国立台湾大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
备注:

点击查看摘要

Abstract:Conversational agents must translate egocentric utterances (e.g., “on my right”) into allocentric orientations (N/E/S/W). This challenge is particularly critical in indoor or complex facilities where GPS signals are weak and detailed maps are unavailable. While chain-of-thought (CoT) prompting has advanced reasoning in language and vision tasks, its application to multimodal spatial orientation remains underexplored. We introduce Conversational Orientation Reasoning (COR), a new benchmark designed for Traditional Chinese conversational navigation projected from real-world environments, addressing egocentric-to-allocentric reasoning in non-English and ASR-transcribed scenarios. We propose a multimodal chain-of-thought (MCoT) framework, which integrates ASR-transcribed speech with landmark coordinates through a structured three-step reasoning process: (1) extracting spatial relations, (2) mapping coordinates to absolute directions, and (3) inferring user orientation. A curriculum learning strategy progressively builds these capabilities on Taiwan-LLM-13B-v2.0-Chat, a mid-sized model representative of resource-constrained settings. Experiments show that MCoT achieves 100% orientation accuracy on clean transcripts and 98.1% with ASR transcripts, substantially outperforming unimodal and non-structured baselines. Moreover, MCoT demonstrates robustness under noisy conversational conditions, including ASR recognition errors and multilingual code-switching. The model also maintains high accuracy in cross-domain evaluation and resilience to linguistic variation, domain shift, and referential ambiguity. These findings highlight the potential of structured MCoT spatial reasoning as a path toward interpretable and resource-efficient embodied navigation.
zh

[NLP-65] ERFC: Happy Customers with Emotion Recognition and Forecasting in Conversation in Call Centers

【速读】: 该论文旨在解决对话中情绪识别与预测的问题,尤其关注如何通过分析对话参与者的情绪动态来提升客户体验,例如在呼叫中心场景中帮助客服人员维持中性或积极情绪以缓解客户不满。解决方案的关键在于提出了一种新颖的多模态架构——情绪识别与预测架构(Emotion Recognition and Forecasting in Conversation, ERFC),该架构综合考虑了情绪的不同属性、上下文信息以及对话中说话者语句之间的相互依赖关系,从而实现对未来话语情绪的准确预测,为及时提供恰当应对策略提供依据,显著提升客户满意度。

链接: https://arxiv.org/abs/2509.18175
作者: Aditi Debsharma,Bhushan Jagyasi,Surajit Sen,Priyanka Pandey,Devicharith Dovari,Yuvaraj V.C,Rosalin Parida,Gopali Contractor
机构: Center for Advanced AI, Accenture(埃森哲)
类目: Computation and Language (cs.CL)
备注: 7 pages, 6 Figures, 4 Tables, 18 References

点击查看摘要

Abstract:Emotion Recognition in Conversation has been seen to be widely applicable in call center analytics, opinion mining, finance, retail, healthcare, and other industries. In a call center scenario, the role of the call center agent is not just confined to receiving calls but to also provide good customer experience by pacifying the frustration or anger of the customers. This can be achieved by maintaining neutral and positive emotion from the agent. As in any conversation, the emotion of one speaker is usually dependent on the emotion of other speaker. Hence the positive emotion of an agent, accompanied with the right resolution will help in enhancing customer experience. This can change an unhappy customer to a happy one. Imparting the right resolution at right time becomes easier if the agent has the insight of the emotion of future utterances. To predict the emotions of the future utterances we propose a novel architecture, Emotion Recognition and Forecasting in Conversation. Our proposed ERFC architecture considers multi modalities, different attributes of emotion, context and the interdependencies of the utterances of the speakers in the conversation. Our intensive experiments on the IEMOCAP dataset have shown the feasibility of the proposed ERFC. This approach can provide a tremendous business value for the applications like call center, where the happiness of customer is utmost important.
zh

[NLP-66] Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR

【速读】: 该论文旨在解决阿拉伯文文档光学字符识别(Optical Character Recognition, OCR)任务中因语言特性(如连笔书写、字体多样、元音符号及从右至左排版)导致的识别准确率低的问题。解决方案的关键在于提出一个专门针对阿拉伯文文档OCR优化的视觉-语言模型Baseer,其通过在大规模合成与真实文档数据集上采用仅解码器(decoder-only)的微调策略,对预训练多模态大语言模型(Multimodal Large Language Models, MLLMs)进行领域适配,同时保留通用视觉特征;此外,作者构建了高质量、专家验证的基准测试集Misraj-DocOCR,用于严格评估系统性能,实验表明Baseer显著优于现有开源与商业方案,在词错误率(Word Error Rate, WER)上达到0.25,建立了阿拉伯文文档OCR的新基准。

链接: https://arxiv.org/abs/2509.18174
作者: Khalil Hennara,Muhammad Hreden,Mohamed Motasim Hamed,Ahmad Bastati,Zeina Aldallal,Sara Chrouf,Safwan AlModhayan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Arabic document OCR remains a challenging task due to the language’s cursive script, diverse fonts, diacritics, and right-to-left orientation. While modern Multimodal Large Language Models (MLLMs) have advanced document understanding for high-resource languages, their performance on Arabic remains limited. In this work, we introduce Baseer, a vision-language model fine- tuned specifically for Arabic document OCR. Leveraging a large-scale dataset combining synthetic and real-world documents, Baseer is trained using a decoder-only fine-tuning strategy to adapt a pre-trained MLLM while preserving general visual features. We also present Misraj-DocOCR, a high-quality, expert-verified benchmark designed for rigorous evaluation of Arabic OCR systems. Our experiments show that Baseer significantly outperforms existing open-source and commercial solutions, achieving a WER of 0.25 and establishing a new state-of-the-art in the domain of Arabic document OCR. Our results highlight the benefits of domain-specific adaptation of general-purpose MLLMs and establish a strong baseline for high-accuracy OCR on morphologically rich languages like Arabic.
zh

[NLP-67] urnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route EMNLP2025

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在地理空间认知能力方面缺乏系统评估的问题,特别是其对导航路径的理解与逆向还原能力不足。现有研究受限于不可量化的指标、有限的评估数据集以及不清晰的研究框架。为应对这一挑战,作者提出一个大规模基准测试平台,包含来自全球12个大都市的36000条路线,并引入PathBuilder工具实现自然语言指令与导航路径之间的双向转换,从而建立地理空间信息与自然语言间的语义桥梁。关键创新在于构建了新的评估框架和量化指标,用于严谨评估11种前沿LLMs在路径逆向任务中的表现,揭示了当前模型在路径还原准确性、鲁棒性及错误自信度方面的显著局限。

链接: https://arxiv.org/abs/2509.18173
作者: Hongyi Luo,Qing Cheng,Daniel Matos,Hari Krishna Gadi,Yanfeng Zhang,Lu Liu,Yongliang Wang,Niclas Zeller,Daniel Cremers,Liqiu Meng
机构: Huawei Riemann Lab (华为瑞曼实验室); Technische Universität München (慕尼黑工业大学); Hochschule Karlsruhe (卡尔斯鲁厄应用技术大学); MCML
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注: Accepted to EMNLP 2025 (Main). This is the camera-ready/author version

点击查看摘要

Abstract:Humans can interpret geospatial information through natural language, while the geospatial cognition capabilities of Large Language Models (LLMs) remain underexplored. Prior research in this domain has been constrained by non-quantifiable metrics, limited evaluation datasets and unclear research hierarchies. Therefore, we propose a large-scale benchmark and conduct a comprehensive evaluation of the geospatial route cognition of LLMs. We create a large-scale evaluation dataset comprised of 36000 routes from 12 metropolises worldwide. Then, we introduce PathBuilder, a novel tool for converting natural language instructions into navigation routes, and vice versa, bridging the gap between geospatial information and natural language. Finally, we propose a new evaluation framework and metrics to rigorously assess 11 state-of-the-art (SOTA) LLMs on the task of route reversal. The benchmark reveals that LLMs exhibit limitation to reverse routes: most reverse routes neither return to the starting point nor are similar to the optimal route. Additionally, LLMs face challenges such as low robustness in route generation and high confidence for their incorrect answers. Code\ \ Data available here: \hrefthis https URLTurnBack.
zh

[NLP-68] PiMoE: Token-Level Routing for Integrating High-Precision Computation and Reasoning

链接: https://arxiv.org/abs/2509.18169
作者: Hengbo Xiao,Jingyuan Fan,Xin Tong,Jingzhao Zhang,Chao Lu,Guannan He
机构: Peking University (北京大学); Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-69] SIRAG : Towards Stable and Interpretable RAG with A Process-Supervised Multi-Agent Framework

【速读】: 该论文旨在解决检索增强生成(Retrieval-Augmented Generation, RAG)中检索器(retriever)与生成器(generator)之间协作不协调的问题,即检索器可能返回无关或冗余文档,而生成器未能充分利用检索到的证据。其解决方案的关键在于提出一种过程监督的多智能体框架,引入两个轻量级代理:决策者(Decision Maker)用于判断何时停止检索并开始生成答案,知识选择器(Knowledge Selector)用于过滤检索结果以保留最相关证据;并通过一个基于大语言模型作为裁判(LLM-as-a-Judge)的过程级奖励机制实现细粒度监督,结合树状 rollout 策略和近端策略优化(Proximal Policy Optimization, PPO)进行端到端训练,从而提升准确率、收敛稳定性及推理路径的可解释性。

链接: https://arxiv.org/abs/2509.18167
作者: Junlin Wang,Zehao Wu,Shaowei Lu,Yanlan Li,Xinghao Huang
机构: 未知
类目: Computation and Language (cs.CL)
备注: 5 pages,2 figures, IRAC under review

点击查看摘要

Abstract:Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to access external knowledge sources, but the effectiveness of RAG relies on the coordination between the retriever and the generator. Since these components are developed independently, their interaction is often suboptimal: the retriever may return irrelevant or redundant documents, while the generator may fail to fully leverage retrieved evidence. In this work, we propose a process-supervised multi-agent framework to bridge the gap between retriever and generator. The framework introduces two lightweight agents: a Decision Maker, which determines when to continue retrieval or stop for answer generation, and a Knowledge Selector, which filters retrieved documents to retain only the most useful evidence. To provide fine-grained supervision, we employ an LLM-as-a-Judge that evaluates each intermediate action with process-level rewards, ensuring more accurate credit assignment than relying solely on final answer correctness. We further adopt a tree-structured rollout strategy to explore diverse reasoning paths, and train both agents with Proximal Policy Optimization (PPO) in an end-to-end manner. Experiments on single-hop and multi-hop question answering benchmarks show that our approach achieves higher accuracy, more stable convergence, and produces more interpretable reasoning trajectories compared with standard RAG baselines. Importantly, the proposed framework is modular and plug-and-play, requiring no modification to the retriever or generator, making it practical for real-world RAG applications.
zh

[NLP-70] hinking in a Crowd: How Auxiliary Information Shapes LLM Reasoning

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在复杂知识密集型任务中,因外部辅助信息(如相关、无关或误导性内容)对其推理过程产生不可控影响的问题。研究发现,尽管模型具备显式分步思考能力(thinking mode),但这一机制并非增强鲁棒性的保障,反而会放大错误信息带来的负面影响——即“误导性信息”会导致性能显著下降,且这种下降随思考深度增加而加剧。解决方案的关键在于:不应仅追求模型“思考”,而应赋予其对输入信息进行批判性评估的能力,从而实现更可靠的推理决策。为此,作者构建了SciAux数据集,用于系统测试LLMs在不同类型辅助信息下的表现,为未来研究提供基准与方向。

链接: https://arxiv.org/abs/2509.18163
作者: Haodong Zhao,Chenyan Zhao,Yansi Li,Zhuosheng Zhang,Gongshen Liu
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computation and Language (cs.CL)
备注: Work in progress

点击查看摘要

Abstract:The capacity of Large Language Models (LLMs) to reason is fundamental to their application in complex, knowledge-intensive domains. In real-world scenarios, LLMs are often augmented with external information that can be helpful, irrelevant, or even misleading. This paper investigates the causal impact of such auxiliary information on the reasoning process of LLMs with explicit step-by-step thinking capabilities. We introduce SciAux, a new dataset derived from ScienceQA, to systematically test the robustness of the model against these types of information. Our findings reveal a critical vulnerability: the model’s deliberative “thinking mode” is a double-edged sword. While helpful context improves accuracy, misleading information causes a catastrophic drop in performance, which is amplified by the thinking process. Instead of conferring robustness, thinking reinforces the degree of error when provided with misinformation. This highlights that the challenge is not merely to make models “think”, but to endow them with the critical faculty to evaluate the information upon which their reasoning is based. The SciAux dataset is available at this https URL.
zh

[NLP-71] ZERA: Zero-init Instruction Evolving Refinement Agent - From Zero Instructions to Structured Prompts via Principle-based Optimization EMNLP2025

链接: https://arxiv.org/abs/2509.18158
作者: Seungyoun Yi,Minsoo Khang,Sungrae Park
机构: Upstage AI Research
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 9 pages, 4 figures. To appear in EMNLP 2025 Main Conference (Oral Presentation)

点击查看摘要

[NLP-72] Event Causality Identification with Synthetic Control

链接: https://arxiv.org/abs/2509.18156
作者: Haoyu Wang,Fengze Liu,Jiayao Zhang,Dan Roth,Kyle Richardson
机构: UPenn (宾夕法尼亚大学); Allen Institute for AI (艾伦人工智能研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[NLP-73] Safe-SAIL: Towards a Fine-grained Safety Landscape of Large Language Models via Sparse Autoencoder Interpretation Framework

链接: https://arxiv.org/abs/2509.18127
作者: Jiaqi Weng,Han Zheng,Hanyu Zhang,Qinqin He,Jialing Tao,Hui Xue,Zhixuan Chu,Xiting Wang
机构: Alibaba Group; The State Key Laboratory of Blockchain and Data Security, Zhejiang University; Renmin University of China; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

[NLP-74] GAUSS: Benchmarking Structured Mathematical Skills for Large Language Models

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在数学能力评估中缺乏细粒度、可解释性不足的问题。现有评估方法往往以整体分数衡量模型表现,难以揭示其在不同数学技能维度上的真实水平。为此,作者提出GAUSS(General Assessment of Underlying Structured Skills in Mathematics)基准,其核心在于将数学能力划分为十二个核心技能维度,归类为知识与理解、问题解决与沟通、元技能与创造力三大领域,并设计任务以隔离特定能力,从而构建出全面、精细且可解释的模型数学能力画像。这一多维技能导向的评估框架显著提升了对LLMs数学智能本质的理解与比较能力。

链接: https://arxiv.org/abs/2509.18122
作者: Yue Zhang,Jiaxin Zhang,Qiuyu Ren,Tahsin Saffat,Xiaoxuan Liu,Zitong Yang,Banghua Zhu,Yi Ma
机构: 未知
类目: Computation and Language (cs.CL)
备注: 120 pages (including appendix)

点击查看摘要

Abstract:We introduce \textbfGAUSS (\textbfGeneral \textbfAssessment of \textbfUnderlying \textbfStructured \textbfSkills in Mathematics), a benchmark that evaluates LLMs’ mathematical abilities across twelve core skill dimensions, grouped into three domains: knowledge and understanding, problem solving and communication, and meta-skills and creativity. By categorizing problems according to cognitive skills and designing tasks that isolate specific abilities, GAUSS constructs comprehensive, fine-grained, and interpretable profiles of models’ mathematical abilities. These profiles faithfully represent their underlying mathematical intelligence. To exemplify how to use the \textscGAUSS benchmark, we have derived the skill profile of \textscGPT-5-thinking, revealing its strengths and weaknesses as well as its differences relative to \textsco4-mini-high, thereby underscoring the value of multidimensional, skill-based evaluation.
zh

[NLP-75] Dynamic Prompt Fusion for Multi-Task and Cross-Domain Adaptation in LLM s

链接: https://arxiv.org/abs/2509.18113
作者: Xin Hu,Yue Kang,Guanzi Yao,Tianze Kang,Mengjie Wang,Heyao Liu
机构: 未知
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

[NLP-76] aching Audio Models to Reason : A Unified Framework for Source- and Layer-wise Distillation ICASSP2026

【速读】: 该论文旨在解决大音频语言模型在复杂推理任务中表现不足的问题,其根源在于音频与文本之间的模态差距(modality gap)以及缺乏结构化的中间监督信号。解决方案的关键在于提出一种统一的知识蒸馏框架,通过双维度策略实现从高容量文本教师模型到学生音频模型的推理能力迁移,同时保留音频模型的声学建模能力:一是源域蒸馏(source-wise distillation),利用文本和声学教师提供互补的模态特定监督;二是层间蒸馏(layer-wise distillation),将教师信号与学生模型的适当层对齐以提升迁移效率。该方法实现了符号推理与语音表征之间的有效衔接,显著提升了音频推理性能。

链接: https://arxiv.org/abs/2509.18579
作者: Runyan Yang,Yuke Si,Yingying Gao,Junlan Feng,Chao Deng,Shilei Zhang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages; submitted to ICASSP 2026

点击查看摘要

Abstract:While large audio language models excel at tasks like ASR and emotion recognition, they still struggle with complex reasoning due to the modality gap between audio and text as well as the lack of structured intermediate supervision. To address this, we propose a unified knowledge distillation framework to transfer reasoning capabilities from a high-capacity textual teacher model to a student audio models while preserving its acoustic competence. Our method introduces two key dimensions: source-wise distillation, which leverages both textual and acoustic teachers to provide complementary modality-specific supervision; and layer-wise distillation, which aligns teacher signals with appropriate student layers to improve transfer efficiency. This dual-dimensional strategy enables fine-grained control over the distillation process, effectively bridging the gap between symbolic reasoning and speech representations. Experimental results show significant improvements in audio reasoning performance, demonstrating the effectiveness of our framework as a reasoning transfer solution for audio modeling.
zh

[NLP-77] HarmoniFuse: A Component-Selective and Prompt-Adaptive Framework for Multi-Task Speech Language Modeling ICASSP2026

链接: https://arxiv.org/abs/2509.18570
作者: Yuke Si,Runyan Yang,Yingying Gao,Junlan Feng,Chao Deng,Shilei Zhang
机构: 未知
类目: Audio and Speech Processing (eess.AS); Computation and Language (cs.CL); Sound (cs.SD)
备注: 5 pages; submitted to ICASSP 2026

点击查看摘要

[NLP-78] No Verifiable Reward for Prosody: Toward Preference-Guided Prosody Learning in TTS ICASSP2026

链接: https://arxiv.org/abs/2509.18531
作者: Seungyoun Shin,Dongha Ahn,Jiwoo Kim,Sungwook Jeon
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
备注: submitted to ICASSP 2026

点击查看摘要

计算机视觉

[CV-0] CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching

【速读】:该论文旨在解决条件生成建模中模型需同时学习质量传输(mass transport)与条件注入(conditional injection)所带来的复杂性问题,这在扩散模型和基于流的方法中尤为显著。解决方案的关键在于提出条件感知重参数化(Condition-Aware Reparameterization, CAR),通过学习一个轻量级的偏移变换,对源分布(初始噪声)、目标分布(条件数据分布)或两者进行条件感知的重新定位,从而缩短模型需要学习的概率路径,降低建模难度并加速训练过程。实验表明,在ImageNet-256上使用CAR-Flow可使FID从2.07降至1.68,且额外参数增加不足0.6%。

链接: https://arxiv.org/abs/2509.19300
作者: Chen Chen,Pengsheng Guo,Liangchen Song,Jiasen Lu,Rui Qian,Xinze Wang,Tsu-Jui Fu,Wei Liu,Yinfei Yang,Alex Schwing
机构: Apple Inc.(苹果公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Conditional generative modeling aims to learn a conditional data distribution from samples containing data-condition pairs. For this, diffusion and flow-based methods have attained compelling results. These methods use a learned (flow) model to transport an initial standard Gaussian noise that ignores the condition to the conditional data distribution. The model is hence required to learn both mass transport and conditional injection. To ease the demand on the model, we propose Condition-Aware Reparameterization for Flow Matching (CAR-Flow) – a lightweight, learned shift that conditions the source, the target, or both distributions. By relocating these distributions, CAR-Flow shortens the probability path the model must learn, leading to faster training in practice. On low-dimensional synthetic data, we visualize and quantify the effects of CAR. On higher-dimensional natural image data (ImageNet-256), equipping SiT-XL/2 with CAR-Flow reduces FID from 2.07 to 1.68, while introducing less than 0.6% additional parameters.
zh

[CV-1] VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

【速读】:该论文旨在解决当前基于前馈式3D高斯溅射(3D Gaussian Splatting, 3DGS)方法在新视角合成中面临的三大问题:1)重建的3D模型对输入视图数量高度依赖;2)密度分布存在视角偏差;3)在源视图存在遮挡或低纹理时引入对齐误差。其核心解决方案是提出VolSplat,一种新的多视角前馈范式,将传统的像素对齐高斯预测机制替换为体素对齐高斯预测——通过直接从预测的3D体素网格中生成高斯,避免了依赖易出错的2D特征匹配,从而提升多视角一致性,并实现基于3D场景复杂度自适应控制高斯密度,最终获得更忠实的点云表示、更强的几何一致性以及更优的新视角渲染质量。

链接: https://arxiv.org/abs/2509.19297
作者: Weijie Wang,Yeqing Chen,Zeyu Zhang,Hengyu Liu,Haoxiao Wang,Zhiyuan Feng,Wenkang Qin,Zheng Zhu,Donny Y. Chen,Bohan Zhuang
机构: Zhejiang University (浙江大学); GigaAI; University of Electronic Science and Technology of China (电子科技大学); The Chinese University of Hong Kong (香港中文大学); Tsinghua University (清华大学); Monash University (莫纳什大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL , Code: this https URL

点击查看摘要

Abstract:Feed-forward 3D Gaussian Splatting (3DGS) has emerged as a highly effective solution for novel view synthesis. Existing methods predominantly rely on a pixel-aligned Gaussian prediction paradigm, where each 2D pixel is mapped to a 3D Gaussian. We rethink this widely adopted formulation and identify several inherent limitations: it renders the reconstructed 3D models heavily dependent on the number of input views, leads to view-biased density distributions, and introduces alignment errors, particularly when source views contain occlusions or low texture. To address these challenges, we introduce VolSplat, a new multi-view feed-forward paradigm that replaces pixel alignment with voxel-aligned Gaussians. By directly predicting Gaussians from a predicted 3D voxel grid, it overcomes pixel alignment’s reliance on error-prone 2D feature matching, ensuring robust multi-view consistency. Furthermore, it enables adaptive control over Gaussian density based on 3D scene complexity, yielding more faithful Gaussian point clouds, improved geometric consistency, and enhanced novel-view rendering quality. Experiments on widely used benchmarks including RealEstate10K and ScanNet demonstrate that VolSplat achieves state-of-the-art performance while producing more plausible and view-consistent Gaussian reconstructions. In addition to superior results, our approach establishes a more scalable framework for feed-forward 3D reconstruction with denser and more robust representations, paving the way for further research in wider communities. The video results, code and trained models are available on our project page: this https URL.
zh

[CV-2] Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

【速读】:该论文旨在解决当前基于学习的3D重建方法对真实世界多视角数据依赖性强、获取成本高且难以满足实时交互场景(如机器人导航与自主驾驶)的问题。其核心挑战在于如何从仅限于2D表示的视频扩散模型中提取隐式的3D知识,并将其转化为可直接用于3D场景生成和渲染的显式结构。解决方案的关键在于提出一种自蒸馏框架,通过在典型的RGB解码器基础上引入一个3D Gaussian Splatting (3DGS) 解码器,并以RGB解码器输出作为监督信号,使3DGS解码器能够在无需真实多视角训练数据的情况下,仅使用视频扩散模型生成的合成数据进行纯训练。该方法实现了从文本提示或单张图像到静态/动态3D场景的实时生成与渲染,显著提升了3D场景生成的灵活性与实用性。

链接: https://arxiv.org/abs/2509.19296
作者: Sherwin Bahmani,Tianchang Shen,Jiawei Ren,Jiahui Huang,Yifeng Jiang,Haithem Turki,Andrea Tagliasacchi,David B. Lindell,Zan Gojcic,Sanja Fidler,Huan Ling,Jun Gao,Xuanchi Ren
机构: NVIDIA(英伟达); University of Toronto (多伦多大学); Vector Institute (矢量研究所); Simon Fraser University (西蒙菲莎大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
备注: Project Page: this https URL

点击查看摘要

Abstract:The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
zh

[CV-3] OverLayBench: A Benchmark for Layout-to-Image Generation with Dense Overlaps NEURIPS2025

【速读】:该论文旨在解决当前布局到图像生成(layout-to-image generation)方法在处理边界框存在显著重叠时性能下降的问题,尤其针对两类挑战:一是大范围的重叠区域,二是语义差异较小的重叠实例。其解决方案的关键在于提出两个核心贡献:首先,设计了OverLayScore这一新指标,用于量化边界框重叠的复杂程度,从而揭示现有基准数据集对低复杂度场景的偏向性;其次,构建了OverLayBench这一新的基准测试集,具备高质量标注和不同OverLayScore水平的均衡分布,以更全面评估模型在复杂重叠场景下的表现。此外,还提出了CreatiLayout-AM模型,通过在精心筛选的非完整掩码(amodal mask)数据集上微调,提升模型在复杂重叠布局下的生成能力,为实现更鲁棒的布局到图像生成奠定了基础。

链接: https://arxiv.org/abs/2509.19282
作者: Bingnan Li,Chen-Yu Wang,Haiyang Xu,Xiang Zhang,Ethan Armand,Divyansh Srivastava,Xiaojun Shan,Zeyuan Chen,Jianwen Xie,Zhuowen Tu
机构: UC San Diego; Lambda, Inc
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025 DatasetBenchmark Track

点击查看摘要

Abstract:Despite steady progress in layout-to-image generation, current methods still struggle with layouts containing significant overlap between bounding boxes. We identify two primary challenges: (1) large overlapping regions and (2) overlapping instances with minimal semantic distinction. Through both qualitative examples and quantitative analysis, we demonstrate how these factors degrade generation quality. To systematically assess this issue, we introduce OverLayScore, a novel metric that quantifies the complexity of overlapping bounding boxes. Our analysis reveals that existing benchmarks are biased toward simpler cases with low OverLayScore values, limiting their effectiveness in evaluating model performance under more challenging conditions. To bridge this gap, we present OverLayBench, a new benchmark featuring high-quality annotations and a balanced distribution across different levels of OverLayScore. As an initial step toward improving performance on complex overlaps, we also propose CreatiLayout-AM, a model fine-tuned on a curated amodal mask dataset. Together, our contributions lay the groundwork for more robust layout-to-image generation under realistic and challenging scenarios. Project link: this https URL.
zh

[CV-4] Moving by Looking: Towards Vision-Driven Avatar Motion Generation

【速读】:该论文旨在解决当前人类运动生成方法中忽视感知与动作之间相互依赖关系的问题,即现有方法使用与人类感知方式差异显著的任务特定“感知”,导致生成的虚拟角色行为缺乏真实性。为实现更贴近人类的行为表现,作者提出CLOPS——首个仅依赖第一人称视角(egocentric vision)进行环境感知和导航的人类虚拟角色系统。其解决方案的关键在于将低层运动技能学习与高层视觉控制策略解耦:首先在大规模动作捕捉数据集上训练一个运动先验模型(motion prior model),随后采用Q-learning训练策略,将第一人称视觉输入映射为驱动该运动先验的高层控制指令,从而实现基于视觉输入的自然、人类相似的运动行为,如避障等。

链接: https://arxiv.org/abs/2509.19259
作者: Markos Diomataris,Berat Mert Albaba,Giorgio Becherini,Partha Ghosh,Omid Taheri,Michael J. Black
机构: Max Planck Institute for Intelligent Systems (马克斯·普朗克智能系统研究所); ETH Zürich (苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The way we perceive the world fundamentally shapes how we move, whether it is how we navigate in a room or how we interact with other humans. Current human motion generation methods, neglect this interdependency and use task-specific ``perception’’ that differs radically from that of humans. We argue that the generation of human-like avatar behavior requires human-like perception. Consequently, in this work we present CLOPS, the first human avatar that solely uses egocentric vision to perceive its surroundings and navigate. Using vision as the primary driver of motion however, gives rise to a significant challenge for training avatars: existing datasets have either isolated human motion, without the context of a scene, or lack scale. We overcome this challenge by decoupling the learning of low-level motion skills from learning of high-level control that maps visual input to motion. First, we train a motion prior model on a large motion capture dataset. Then, a policy is trained using Q-learning to map egocentric visual inputs to high-level control commands for the motion prior. Our experiments empirically demonstrate that egocentric vision can give rise to human-like motion characteristics in our avatars. For example, the avatars walk such that they avoid obstacles present in their visual field. These findings suggest that equipping avatars with human-like sensors, particularly egocentric vision, holds promise for training avatars that behave like humans.
zh

[CV-5] Graph-Radiomic Learning (GrRAiL) Descriptor to Characterize Imaging Heterogeneity in Confounding Tumor Pathologies

【速读】:该论文旨在解决实体瘤中常规影像学检查难以可靠区分恶性肿瘤与干扰性病理状态(如放射性坏死或假进展)的问题。传统放射组学方法多采用区域平均特征,忽略了病灶内部不同强度组成之间的复杂空间关系,导致判别能力受限。其解决方案的关键在于提出一种新型图结构放射组学学习(Graph-Radiomic Learning, GrRAiL)描述符,通过两个核心步骤实现:首先利用逐体素放射组学测量识别病灶内子区域簇,进而基于图论指标量化这些簇间的空间关联性,从而构建能够编码高阶空间结构的加权图模型,有效刻画肿瘤内部异质性(intralesional heterogeneity, ILH),提升对恶性病变与非恶性干扰性病理的鉴别能力。

链接: https://arxiv.org/abs/2509.19258
作者: Dheerendranath Battalapalli,Apoorva Safai,Maria Jaramillo,Hyemin Um,Gustavo Adalfo Pineda Ortiz,Ulas Bagci,Manmeet Singh Ahluwalia,Marwa Ismail,Pallavi Tiwari
机构: University of Wisconsin-Madison (威斯康星大学麦迪逊分校); Northwestern University (西北大学); Miami Cancer Institute (迈阿密癌症研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review: npj Digital Medicine

点击查看摘要

Abstract:A significant challenge in solid tumors is reliably distinguishing confounding pathologies from malignant neoplasms on routine imaging. While radiomics methods seek surrogate markers of lesion heterogeneity on CT/MRI, many aggregate features across the region of interest (ROI) and miss complex spatial relationships among varying intensity compositions. We present a new Graph-Radiomic Learning (GrRAiL) descriptor for characterizing intralesional heterogeneity (ILH) on clinical MRI scans. GrRAiL (1) identifies clusters of sub-regions using per-voxel radiomic measurements, then (2) computes graph-theoretic metrics to quantify spatial associations among clusters. The resulting weighted graphs encode higher-order spatial relationships within the ROI, aiming to reliably capture ILH and disambiguate confounding pathologies from malignancy. To assess efficacy and clinical feasibility, GrRAiL was evaluated in n=947 subjects spanning three use cases: differentiating tumor recurrence from radiation effects in glioblastoma (GBM; n=106) and brain metastasis (n=233), and stratifying pancreatic intraductal papillary mucinous neoplasms (IPMNs) into no+low vs high risk (n=608). In a multi-institutional setting, GrRAiL consistently outperformed state-of-the-art baselines - Graph Neural Networks (GNNs), textural radiomics, and intensity-graph analysis. In GBM, cross-validation (CV) and test accuracies for recurrence vs pseudo-progression were 89% and 78% with 10% test-accuracy gains over comparators. In brain metastasis, CV and test accuracies for recurrence vs radiation necrosis were 84% and 74% (13% improvement). For IPMN risk stratification, CV and test accuracies were 84% and 75%, showing 10% improvement.
zh

[CV-6] Adversarially-Refined VQ-GAN with Dense Motion Tokenization for Spatio-Temporal Heatmaps

【速读】:该论文旨在解决连续人体运动理解中的高维性与固有冗余问题,核心挑战在于如何高效压缩并准确表示复杂的人体运动动态。其解决方案的关键在于提出一种对抗精炼的矢量量化生成对抗网络(VQ-GAN)框架,结合密集运动标记化(dense motion tokenization),在压缩时空热图的同时保留细粒度的人体运动轨迹。该方法通过对抗精炼机制有效消除非对抗基线中常见的重建伪影(如运动模糊和时间错位),从而显著提升重建质量与时间稳定性。实验表明,该方法在CMU Panoptic数据集上相较dVAE基线提升9.31% SSIM,并降低37.1%的时间不稳定性;同时,密集标记化策略还揭示了2D运动可用128个标记最优表示,而3D运动则需1024个标记以实现忠实重建,为不同场景下的运动分析提供了可部署的压缩方案。

链接: https://arxiv.org/abs/2509.19252
作者: Gabriel Maldonado,Narges Rashvand,Armin Danesh Pazho,Ghazal Alinezhad Noghre,Vinit Katariya,Hamed Tabkhi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Continuous human motion understanding remains a core challenge in computer vision due to its high dimensionality and inherent redundancy. Efficient compression and representation are crucial for analyzing complex motion dynamics. In this work, we introduce an adversarially-refined VQ-GAN framework with dense motion tokenization for compressing spatio-temporal heatmaps while preserving the fine-grained traces of human motion. Our approach combines dense motion tokenization with adversarial refinement, which eliminates reconstruction artifacts like motion smearing and temporal misalignment observed in non-adversarial baselines. Our experiments on the CMU Panoptic dataset provide conclusive evidence of our method’s superiority, outperforming the dVAE baseline by 9.31% SSIM and reducing temporal instability by 37.1%. Furthermore, our dense tokenization strategy enables a novel analysis of motion complexity, revealing that 2D motion can be optimally represented with a compact 128-token vocabulary, while 3D motion’s complexity demands a much larger 1024-token codebook for faithful reconstruction. These results establish practical deployment feasibility across diverse motion analysis applications. The code base for this work is available at this https URL.
zh

[CV-7] ConViS-Bench: Estimating Video Similarity Through Semantic Concepts NEURIPS2025

【速读】:该论文试图解决视频相似性评估中缺乏细粒度、可解释性比较方法的问题,传统模型通常依赖全局相似性得分,难以捕捉人类对视频内容多维度的感知差异(如动作 vs. 场景)。其解决方案的关键在于提出一种名为Concept-based Video Similarity estimation (ConViS) 的新任务,通过预定义的关键语义概念集合,计算视频对在各概念上的可解释相似性分数,从而实现类人推理;同时构建了ConViS-Bench基准数据集,包含跨领域的标注视频对及其概念级相似性评分与文本描述,为语言驱动的视频理解研究提供标准化评估工具。

链接: https://arxiv.org/abs/2509.19245
作者: Benedetta Liberatori,Alessandro Conti,Lorenzo Vaquero,Yiming Wang,Elisa Ricci,Paolo Rota
机构: University of Trento (特伦托大学); Fondazione Bruno Kessler (FBK) (布鲁诺·凯斯勒基金会)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to NeurIPS 2025

点击查看摘要

Abstract:What does it mean for two videos to be similar? Videos may appear similar when judged by the actions they depict, yet entirely different if evaluated based on the locations where they were filmed. While humans naturally compare videos by taking different aspects into account, this ability has not been thoroughly studied and presents a challenge for models that often depend on broad global similarity scores. Large Multimodal Models (LMMs) with video understanding capabilities open new opportunities for leveraging natural language in comparative video tasks. We introduce Concept-based Video Similarity estimation (ConViS), a novel task that compares pairs of videos by computing interpretable similarity scores across a predefined set of key semantic concepts. ConViS allows for human-like reasoning about video similarity and enables new applications such as concept-conditioned video retrieval. To support this task, we also introduce ConViS-Bench, a new benchmark comprising carefully annotated video pairs spanning multiple domains. Each pair comes with concept-level similarity scores and textual descriptions of both differences and similarities. Additionally, we benchmark several state-of-the-art models on ConViS, providing insights into their alignment with human judgments. Our results reveal significant performance differences on ConViS, indicating that some concepts present greater challenges for estimating video similarity. We believe that ConViS-Bench will serve as a valuable resource for advancing research in language-driven video understanding.
zh

[CV-8] Lavida-O: Elastic Masked Diffusion Models for Unified Multimodal Understanding and Generation

【速读】:该论文旨在解决现有多模态扩散模型在图像理解与生成任务中能力受限的问题,特别是针对仅支持简单图像级理解、低分辨率图像生成以及缺乏高效迭代优化机制的局限性。其解决方案的关键在于提出 Lavida-O——一个统一的多模态掩码扩散模型(Multi-Modal Masked Diffusion Model, MDM),通过引入弹性混合Transformer架构(Elastic Mixture-of-Transformer)、通用文本条件控制(universal text conditioning)和分层采样策略(stratified sampling)等创新技术,实现了对象定位(object grounding)、图像编辑(image-editing)和高分辨率(1024px)图像合成等多项新能力,并首次利用模型自身的理解能力进行规划与迭代自省式优化,从而显著提升图像生成与编辑的质量与效率。

链接: https://arxiv.org/abs/2509.19244
作者: Shufan Li,Jiuxiang Gu,Kangning Liu,Zhe Lin,Zijun Wei,Aditya Grover,Jason Kuen
机构: Adobe(Adobe); UCLA(加州大学洛杉矶分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 32 pages, 15 figures

点击查看摘要

Abstract:We proposed Lavida-O, a unified multi-modal Masked Diffusion Model (MDM) capable of image understanding and generation tasks. Unlike existing multimodal diffsion language models such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O exhibits many new capabilities such as object grounding, image-editing, and high-resolution (1024px) image synthesis. It is also the first unified MDM that uses its understanding capabilities to improve image generation and editing results through planning and iterative self-reflection. To allow effective and efficient training and sampling, Lavida-O ntroduces many novel techniques such as Elastic Mixture-of-Transformer architecture, universal text conditioning, and stratified sampling. \ours~achieves state-of-the-art performance on a wide range of benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference.
zh

[CV-9] DevFD: Developmental Face Forgery Detection by Learning Shared and Orthogonal LoRA Subspaces NEURIPS2025

【速读】:该论文旨在解决生成式 AI(Generative AI)驱动的数字人脸伪造技术快速迭代带来的检测模型适应性不足问题,尤其是面对新伪造类型时模型易发生灾难性遗忘(catastrophic forgetting)且难以在有限计算资源和数据下实现高效持续学习。解决方案的关键在于将人脸伪造检测建模为持续学习(continual learning)任务,并提出一种基于发展型专家混合(Developmental Mixture of Experts, MoE)架构的方法:其中,Real-LoRA 用于稳定学习真实人脸特征,多个 Fake-LoRA 作为增量专家捕获不同伪造类型的信息;通过约束 Fake-LoRA 的学习方向与已学子空间正交,结合正交梯度融入损失函数,有效避免各任务训练过程中的梯度干扰,从而实现对新伪造类型的快速适应并保留历史知识。

链接: https://arxiv.org/abs/2509.19230
作者: Tianshuo Zhang,Li Gao,Siran Peng,Xiangyu Zhu,Zhen Lei
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); China Mobile Financial Technology Co., Ltd. (中国移动金融技术服务有限公司); CAIR, HKSIS, Chinese Academy of Sciences (中国科学院香港科学与创新研究院); School of Computer Science and Engineering, the Faculty of Innovation Engineering, M.U.S.T (澳门科技大学创新工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:The rise of realistic digital face generation and manipulation poses significant social risks. The primary challenge lies in the rapid and diverse evolution of generation techniques, which often outstrip the detection capabilities of existing models. To defend against the ever-evolving new types of forgery, we need to enable our model to quickly adapt to new domains with limited computation and data while avoiding forgetting previously learned forgery types. In this work, we posit that genuine facial samples are abundant and relatively stable in acquisition methods, while forgery faces continuously evolve with the iteration of manipulation techniques. Given the practical infeasibility of exhaustively collecting all forgery variants, we frame face forgery detection as a continual learning problem and allow the model to develop as new forgery types emerge. Specifically, we employ a Developmental Mixture of Experts (MoE) architecture that uses LoRA models as its individual experts. These experts are organized into two groups: a Real-LoRA to learn and refine knowledge of real faces, and multiple Fake-LoRAs to capture incremental information from different forgery types. To prevent catastrophic forgetting, we ensure that the learning direction of Fake-LoRAs is orthogonal to the established subspace. Moreover, we integrate orthogonal gradients into the orthogonal loss of Fake-LoRAs, preventing gradient interference throughout the training process of each task. Experimental results under both the datasets and manipulation types incremental protocols demonstrate the effectiveness of our method.
zh

[CV-10] MsFIN: Multi-scale Feature Interaction Network for Traffic Accident Anticipation

【速读】:该论文旨在解决从行车记录仪(dashcam)视角进行交通事故早期预测时面临的两大挑战:一是如何建模交通参与者之间在特征层面的交互关系(尤其在视野受限导致遮挡的情况下),二是如何捕捉事故前复杂且异步的多时间尺度行为线索。解决方案的关键在于提出一种多尺度特征交互网络(Multi-scale Feature Interaction Network, MsFIN),其核心创新包括三层结构:首先通过多尺度模块在短、中、长时程尺度上提取场景表征并利用Transformer架构实现全面的特征交互;其次通过时序特征处理模块在因果约束下捕获场景与物体特征的序列演化;最后在多尺度特征后融合阶段,将不同时间尺度下的场景与物体特征融合,生成综合的风险表示。实验表明,MsFIN在DAD和DADA数据集上显著优于单尺度特征提取的现有模型,在预测准确性和提前量方面均取得提升。

链接: https://arxiv.org/abs/2509.19227
作者: Tongshuai Wu,Chao Lu,Ze Song,Yunlong Lin,Sizhe Fan,Xuemei Chen
机构: Beijing Institute of Technology (北京理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:With the widespread deployment of dashcams and advancements in computer vision, developing accident prediction models from the dashcam perspective has become critical for proactive safety interventions. However, two key challenges persist: modeling feature-level interactions among traffic participants (often occluded in dashcam views) and capturing complex, asynchronous multi-temporal behavioral cues preceding accidents. To deal with these two challenges, a Multi-scale Feature Interaction Network (MsFIN) is proposed for early-stage accident anticipation from dashcam videos. MsFIN has three layers for multi-scale feature aggregation, temporal feature processing and multi-scale feature post fusion, respectively. For multi-scale feature aggregation, a Multi-scale Module is designed to extract scene representations at short-term, mid-term and long-term temporal scales. Meanwhile, the Transformer architecture is leveraged to facilitate comprehensive feature interactions. Temporal feature processing captures the sequential evolution of scene and object features under causal constraints. In the multi-scale feature post fusion stage, the network fuses scene and object features across multiple temporal scales to generate a comprehensive risk representation. Experiments on DAD and DADA datasets show that MsFIN significantly outperforms state-of-the-art models with single-scale feature extraction in both prediction correctness and earliness. Ablation studies validate the effectiveness of each module in MsFIN, highlighting how the network achieves superior performance through multi-scale feature fusion and contextual interaction modeling.
zh

[CV-11] HyKid: An Open MRI Dataset with Expert-Annotated Multi-Structure and Choroid Plexus in Pediatric Hydrocephalus

【速读】:该论文旨在解决儿童脑积水(hydrocephalus)影像评估中缺乏公开可用、专家标注数据集的问题,尤其是缺少脉络丛(choroid plexus)的分割标注。其解决方案的关键在于构建并发布HyKid数据集,该数据集包含48名儿科患者的3D MRI图像(1mm各向同性分辨率),通过切片到体素重建算法从常规低分辨率图像中恢复高精度图像,并由经验丰富的神经科医生提供脑组织(包括白质、灰质、侧脑室、外脑脊液及脉络丛)的手动修正分割结果。此外,利用检索增强生成(Retrieval-Augmented Generation, RAG)框架从临床放射学报告中提取结构化数据,发现脉络丛体积与总脑脊液(CSF)体积具有强相关性,可作为潜在生物标志物用于脑积水预测建模(AUC = 0.87)。该数据集为神经影像算法开发提供了高质量基准,并揭示了脉络丛在脑积水评估中的关键作用。

链接: https://arxiv.org/abs/2509.19218
作者: Yunzhi Xu,Yushuang Ding,Hu Sun,Hongxi Zhang,Li Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 7 figures

点击查看摘要

Abstract:Evaluation of hydrocephalus in children is challenging, and the related research is limited by a lack of publicly available, expert-annotated datasets, particularly those with segmentation of the choroid plexus. To address this, we present HyKid, an open-source dataset from 48 pediatric patients with hydrocephalus. 3D MRIs were provided with 1mm isotropic resolution, which was reconstructed from routine low-resolution images using a slice-to-volume algorithm. Manually corrected segmentations of brain tissues, including white matter, grey matter, lateral ventricle, external CSF, and the choroid plexus, were provided by an experienced neurologist. Additionally, structured data was extracted from clinical radiology reports using a Retrieval-Augmented Generation framework. The strong correlation between choroid plexus volume and total CSF volume provided a potential biomarker for hydrocephalus evaluation, achieving excellent performance in a predictive model (AUC = 0.87). The proposed HyKid dataset provided a high-quality benchmark for neuroimaging algorithms development, and it revealed the choroid plexus-related features in hydrocephalus assessments. Our datasets are publicly available at this https URL.
zh

[CV-12] Enabling Plant Phenotyping in Weedy Environments using Multi-Modal Imagery via Synthetic and Generated Training Data

【速读】:该论文旨在解决热成像(thermal imagery)中植物分割精度低的问题,尤其是在田间复杂环境下因植物与杂草对比度低和频繁遮挡导致的分割性能下降问题。解决方案的关键在于构建一个融合合成RGB图像、少量真实标注数据以及基于生成对抗网络(GAN)的跨模态对齐框架,通过CycleGAN-turbo实现RGB到热成像的域迁移,从而提升模型在真实热图像上的语义分割性能。实验表明,结合1,128张合成图像与仅5张人工标注的真实图像,可使杂草类别的分割准确率相对提升22%,作物类别提升17%,验证了该方法在多模态遥感影像中的有效性。

链接: https://arxiv.org/abs/2509.19208
作者: Earl Ranario,Ismael Mayanja,Heesup Yun,Brian N. Bailey,J. Mason Earles
机构: UC Davis (加州大学戴维斯分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate plant segmentation in thermal imagery remains a significant challenge for high throughput field phenotyping, particularly in outdoor environments where low contrast between plants and weeds and frequent occlusions hinder performance. To address this, we present a framework that leverages synthetic RGB imagery, a limited set of real annotations, and GAN-based cross-modality alignment to enhance semantic segmentation in thermal images. We trained models on 1,128 synthetic images containing complex mixtures of crop and weed plants in order to generate image segmentation masks for crop and weed plants. We additionally evaluated the benefit of integrating as few as five real, manually segmented field images within the training process using various sampling strategies. When combining all the synthetic images with a few labeled real images, we observed a maximum relative improvement of 22% for the weed class and 17% for the plant class compared to the full real-data baseline. Cross-modal alignment was enabled by translating RGB to thermal using CycleGAN-turbo, allowing robust template matching without calibration. Results demonstrated that combining synthetic data with limited manual annotations and cross-domain translation via generative models can significantly boost segmentation performance in complex field environments for multi-model imagery.
zh

[CV-13] Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

【速读】:该论文旨在解决对比视觉语言模型(Contrastive Vision-Language Models, VLMs)在理解长而密集的图像描述(long, dense captions)时表现不足的问题,核心假设是组合性(compositionality)——即对物体-属性绑定和物体间关系的推理能力——对于提升长句理解至关重要。解决方案的关键在于通过联合训练策略,利用高质量、结构化良好的长句描述数据,使模型同时增强组合理解和长句检索能力;实验表明,这种双向促进关系依赖于数据质量和模型架构设计,尤其强调避免使用低质量或结构混乱的文本数据,以及不依赖冻结位置嵌入等保守策略来维持通用对齐。最终,高质量长句训练可显著提升两类任务性能,为VLM泛化能力优化提供有效路径。

链接: https://arxiv.org/abs/2509.19207
作者: Israfel Salazar,Desmond Elliott,Yova Kementchedjhieva
机构: University of Copenhagen (哥本哈根大学); MBZUAI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Contrastive vision-language models (VLMs) have made significant progress in binding visual and textual information, but understanding long, dense captions remains an open challenge. We hypothesize that compositionality, the capacity to reason about object-attribute bindings and inter-object relationships, is key to understanding longer captions. In this paper, we investigate the interaction between compositionality and long-caption understanding, asking whether training for one property enhances the other. We train and evaluate a range of models that target each of these capabilities. Our results reveal a bidirectional relationship: compositional training improves performance on long-caption retrieval, and training on long captions promotes compositionality. However, these gains are sensitive to data quality and model design. We find that training on poorly structured captions, or with limited parameter updates, fails to support generalization. Likewise, strategies that aim at retaining general alignment, such as freezing positional embeddings, do not improve compositional understanding. Overall, we find that compositional understanding and long-caption understanding are intertwined capabilities that can be jointly learned through training on dense, grounded descriptions. Despite these challenges, we show that models trained on high-quality, long-caption data can achieve strong performance in both tasks, offering practical guidance for improving VLM generalization.
zh

[CV-14] Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions EMNLP2025

【速读】:该论文旨在解决对比训练的视觉-语言模型(Vision-Language Models, VLMs)在检索任务中存在的浅层语言理解能力不足、模态间隙(modality gap)显著以及依赖大规模网络收集数据导致计算成本高和隐私风险等问题。其解决方案的关键在于摒弃传统的双编码器架构,提出一种无需视觉编码器的单编码器检索流程,通过引入由视觉语言大模型(VLLM)生成的结构化图像描述文本,将原本的文本到图像检索范式迁移至文本到文本检索范式。这一转变不仅显著缩小了模态间隙、提升了组合性理解能力,还实现了仅需数小时校准即可在两个GPU上完成训练,并且在短句和长句查询场景下均表现出优越性能,同时具备更高的隐私友好性。

链接: https://arxiv.org/abs/2509.19203
作者: Ioanna Ntinou,Alexandros Xenos,Yassine Ouali,Adrian Bulat,Georgios Tzimiropoulos
机构: Queen Mary University of London, UK (伦敦玛丽女王大学); Samsung AI Centre, Cambridge, UK (三星人工智能中心); Technical University of Iași, Romania (雅西理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding, manifesting bag-of-words behaviour. These limitations are reinforced by their dual-encoder design, which induces a modality gap. Additionally, the reliance on vast web-collected data corpora for training makes the process computationally expensive and introduces significant privacy concerns. To address these limitations, in this work, we challenge the necessity of vision encoders for retrieval tasks by introducing a vision-free, single-encoder retrieval pipeline. Departing from the traditional text-to-image retrieval paradigm, we migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. We demonstrate that this paradigm shift has significant advantages, including a substantial reduction of the modality gap, improved compositionality, and better performance on short and long caption queries, all attainable with only a few hours of calibration on two GPUs. Additionally, substituting raw images with textual descriptions introduces a more privacy-friendly alternative for retrieval. To further assess generalisation and address some of the shortcomings of prior compositionality benchmarks, we release two benchmarks derived from Flickr30k and COCO, containing diverse compositional queries made of short captions, which we coin subFlickr and subCOCO. Our vision-free retriever matches and often surpasses traditional multimodal models. Importantly, our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks, with models as small as 0.3B parameters. Code is available at: this https URL
zh

[CV-15] Reading Images Like Texts: Sequential Image Understanding in Vision-Language Models

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视觉处理上依赖串行化图像输入、与人类视觉的并行特性不符,以及内部机制不透明导致难以深入理解与架构优化的问题。其解决方案的关键在于基于人类视觉的双流假说(dual-stream hypothesis),将VLM的视觉处理解构为对象识别(object recognition)与空间感知(spatial perception)两个独立模块进行研究:首先通过文本标记图(text token maps)揭示对象识别过程呈现从浅层属性识别到深层语义消歧的两阶段特征演化;其次理论推导并实证验证了位置表示中隐含的几何结构;进而提出一种无需指令依赖的令牌压缩算法和RoPE缩放技术,分别提升解码效率与空间推理能力。这一方法不仅深化了对VLM内部工作机制的理解,也为未来更高效、更强大的模型架构设计提供了明确原则。

链接: https://arxiv.org/abs/2509.19191
作者: Yueyan Li,Chenggong Zhao,Zeyuan Zang,Caixia Yuan,Xiaojie Wang
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Vision-Language Models (VLMs) have demonstrated remarkable performance across a variety of real-world tasks. However, existing VLMs typically process visual information by serializing images, a method that diverges significantly from the parallel nature of human vision. Moreover, their opaque internal mechanisms hinder both deeper understanding and architectural innovation. Inspired by the dual-stream hypothesis of human vision, which distinguishes the “what” and “where” pathways, we deconstruct the visual processing in VLMs into object recognition and spatial perception for separate study. For object recognition, we convert images into text token maps and find that the model’s perception of image content unfolds as a two-stage process from shallow to deep layers, beginning with attribute recognition and culminating in semantic disambiguation. For spatial perception, we theoretically derive and empirically verify the geometric structure underlying the positional representation in VLMs. Based on these findings, we introduce an instruction-agnostic token compression algorithm based on a plug-and-play visual decoder to improve decoding efficiency, and a RoPE scaling technique to enhance spatial reasoning. Through rigorous experiments, our work validates these analyses, offering a deeper understanding of VLM internals and providing clear principles for designing more capable future architectures.
zh

[CV-16] he 1st Solution for MOSEv2 Challenge 2025: Long-term and Concept-aware Video Segmentation via SeC

【速读】:该论文旨在解决复杂半监督视频对象分割(Semi-Supervised Video Object Segmentation, SSVO) 中的长期时序一致性与语义干扰抑制问题。其解决方案的关键在于对SeC框架(基于SAM-2的增强版本)中两种记忆机制的深入分析与利用:一是长期记忆(long-term memory),可有效维持遮挡和重新出现场景下的时间连续性;二是概念感知记忆(concept-aware memory),能提供语义先验以抑制背景干扰。这两种机制协同作用,显著提升了模型在MOSEv2挑战任务中的性能,最终在测试集上取得39.89%的JF分数,位居第一。

链接: https://arxiv.org/abs/2509.19183
作者: Mingqi Gao,Jingkun Chen,Yunqi Miao,Gengshen Wu,Zhijin Qin,Jungong Han
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This technical report explores the MOSEv2 track of the LSVOS Challenge, which targets complex semi-supervised video object segmentation. By analysing and adapting SeC, an enhanced SAM-2 framework, we conduct a detailed study of its long-term memory and concept-aware memory, showing that long-term memory preserves temporal continuity under occlusion and reappearance, while concept-aware memory supplies semantic priors that suppress distractors; together, these traits directly benefit several MOSEv2’s core challenges. Our solution achieves a JF score of 39.89% on the test set, ranking 1st in the MOSEv2 track of the LSVOS Challenge.
zh

[CV-17] YOLO-LAN: Precise Polyp Detection via Optimized Loss Augmentations and Negatives

【速读】:该论文旨在解决结直肠癌(Colorectal Cancer, CRC)早期筛查中因人工肠镜检查存在主观差异和漏诊风险的问题,提出一种基于YOLO架构的实时、高精度息肉检测方法。其解决方案的关键在于构建了YOLO-LAN检测流水线,采用M2IoU损失函数优化模型训练,并结合多样化的数据增强策略与负样本数据,以更真实地模拟临床场景;该方法在Kvasir-seg和BKAI-IGH NeoPolyp数据集上显著提升了mAP₅₀和mAP₅₀:₉₅指标,尤其在mAP₅₀:₉₅上的提升体现了对息肉定位精度的增强,具备良好的尺寸鲁棒性和空间定位准确性,从而满足AI辅助肠镜筛查的临床需求。

链接: https://arxiv.org/abs/2509.19166
作者: Siddharth Gupta,Jitin Singla
机构: IIT Roorkee (印度理工学院鲁尔基分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Colorectal cancer (CRC), a lethal disease, begins with the growth of abnormal mucosal cell proliferation called polyps in the inner wall of the colon. When left undetected, polyps can become malignant tumors. Colonoscopy is the standard procedure for detecting polyps, as it enables direct visualization and removal of suspicious lesions. Manual detection by colonoscopy can be inconsistent and is subject to oversight. Therefore, object detection based on deep learning offers a better solution for a more accurate and real-time diagnosis during colonoscopy. In this work, we propose YOLO-LAN, a YOLO-based polyp detection pipeline, trained using M2IoU loss, versatile data augmentations and negative data to replicate real clinical situations. Our pipeline outperformed existing methods for the Kvasir-seg and BKAI-IGH NeoPolyp datasets, achieving mAP _50 of 0.9619, mAP _50:95 of 0.8599 with YOLOv12 and mAP _50 of 0.9540, mAP _50:95 of 0.8487 with YOLOv8 on the Kvasir-seg dataset. The significant increase is achieved in mAP _50:95 score, showing the precision of polyp detection. We show robustness based on polyp size and precise location detection, making it clinically relevant in AI-assisted colorectal screening.
zh

[CV-18] RoSe: Robust Self-supervised Stereo Matching under Adverse Weather Conditions

【速读】:该论文旨在解决自监督立体匹配方法在恶劣天气条件(如夜间、雨天和雾天)下性能显著下降的问题。其核心挑战在于:一是恶劣天气引入噪声并降低可见度,导致基于卷积神经网络(CNN)的特征提取器难以处理反光和无纹理区域;二是这些退化区域会破坏像素对应关系,使得依赖光度一致性假设的监督信号失效。解决方案的关键在于两个方面:首先,将来自视觉基础模型(vision foundation model)的鲁棒先验注入CNN特征提取器,以增强恶劣天气下的特征表示能力;其次,引入场景对应先验(scene correspondence priors)构建更可靠的监督信号,而非单纯依赖光度一致性假设。具体而言,作者构建了包含真实天气退化的合成立体数据集,并提出一种鲁棒的自监督训练范式,包括“鲁棒自监督场景对应学习”和“恶劣天气蒸馏”两个步骤,通过对齐干净图像与恶劣天气图像的潜在场景结果来提升模型在恶劣条件下的视差估计性能。

链接: https://arxiv.org/abs/2509.19165
作者: Yun Wang,Junjie Hu,Junhui Hou,Chenghao Zhang,Renwei Yang,Dapeng Oliver Wu
机构: City University of Hong Kong (香港城市大学); Chinese University of Hong Kong, Shenzhen (香港中文大学(深圳)); Chinese Academy of Sciences (中国科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent self-supervised stereo matching methods have made significant progress, but their performance significantly degrades under adverse weather conditions such as night, rain, and fog. We identify two primary weaknesses contributing to this performance degradation. First, adverse weather introduces noise and reduces visibility, making CNN-based feature extractors struggle with degraded regions like reflective and textureless areas. Second, these degraded regions can disrupt accurate pixel correspondences, leading to ineffective supervision based on the photometric consistency assumption. To address these challenges, we propose injecting robust priors derived from the visual foundation model into the CNN-based feature extractor to improve feature representation under adverse weather conditions. We then introduce scene correspondence priors to construct robust supervisory signals rather than relying solely on the photometric consistency assumption. Specifically, we create synthetic stereo datasets with realistic weather degradations. These datasets feature clear and adverse image pairs that maintain the same semantic context and disparity, preserving the scene correspondence property. With this knowledge, we propose a robust self-supervised training paradigm, consisting of two key steps: robust self-supervised scene correspondence learning and adverse weather distillation. Both steps aim to align underlying scene results from clean and adverse image pairs, thus improving model disparity estimation under adverse weather effects. Extensive experiments demonstrate the effectiveness and versatility of our proposed solution, which outperforms existing state-of-the-art self-supervised methods. Codes are available at \textcolorbluethis https URL.
zh

[CV-19] NeuCODEX: Edge-Cloud Co-Inference with Spike-Driven Compression and Dynamic Early-Exit ICML

【速读】:该论文旨在解决边缘计算场景下脉冲神经网络(Spiking Neural Networks, SNNs)因固定高时间步开销导致的延迟与能耗问题,以及现有边缘-云协同推理系统中数据传输成本高、延迟大的挑战。其核心解决方案是提出NeuCODEX架构,通过联合优化空间冗余和时间冗余实现高效推理:一方面引入学习驱动的脉冲压缩模块以显著减少特征传输量,另一方面采用动态早退机制根据输出置信度自适应终止推理过程,从而在保证精度损失小于2%的前提下,将数据传输量降低至原始水平的1/2048,边缘能耗下降超90%,端到端延迟降低达3倍。

链接: https://arxiv.org/abs/2509.19156
作者: Maurf Hassan,Steven Davy,Muhammad Zawish,Owais Bin Zuber,Nouman Ashraf
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was accepted at ICMLA 2025. The official version will appear in IEEE Xplore

点击查看摘要

Abstract:Spiking Neural Networks (SNNs) offer significant potential for enabling energy-efficient intelligence at the edge. However, performing full SNN inference at the edge can be challenging due to the latency and energy constraints arising from fixed and high timestep overheads. Edge-cloud co-inference systems present a promising solution, but their deployment is often hindered by high latency and feature transmission costs. To address these issues, we introduce NeuCODEX, a neuromorphic co-inference architecture that jointly optimizes both spatial and temporal redundancy. NeuCODEX incorporates a learned spike-driven compression module to reduce data transmission and employs a dynamic early-exit mechanism to adaptively terminate inference based on output confidence. We evaluated NeuCODEX on both static images (CIFAR10 and Caltech) and neuromorphic event streams (CIFAR10-DVS and N-Caltech). To demonstrate practicality, we prototyped NeuCODEX on ResNet-18 and VGG-16 backbones in a real edge-to-cloud testbed. Our proposed system reduces data transfer by up to 2048x and edge energy consumption by over 90%, while reducing end-to-end latency by up to 3x compared to edge-only inference, all with a negligible accuracy drop of less than 2%. In doing so, NeuCODEX enables practical, high-performance SNN deployment in resource-constrained environments.
zh

[CV-20] KAMERA: Enhancing Aerial Surveys of Ice-associated Seals in Arctic Environments ICCV2025

【速读】:该论文旨在解决极地地区多相机、多光谱遥感影像在野生动物(海豹和北极熊)实时检测中的数据处理效率低、同步困难及空间定位不准确的问题。解决方案的关键在于构建一个名为KAMERA的综合系统,其核心包括:通过严格的硬件校准与时间同步机制实现多光谱图像的精确配准;利用生成式AI (Generative AI) 技术实现实时目标检测;并将所有采集影像与动物检测结果投影至世界平面坐标系中,从而提升调查区域面积估算精度并支持快速结果评估。该系统使数据处理时间减少高达80%,且所有软件、模型和电路设计均开源,便于科学界复用与扩展。

链接: https://arxiv.org/abs/2509.19129
作者: Adam Romlein,Benjamin X. Hou,Yuval Boss,Cynthia L. Christman,Stacie Koslovsky,Erin E. Moreland,Jason Parham,Anthony Hoogs
机构: NOAA NMFS AFSC MML (美国国家海洋和大气管理局渔业服务阿拉斯加渔业科学中心海洋哺乳动物实验室); CICOES, University of Washington (华盛顿大学气候与海洋生态系统研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV 2025)

点击查看摘要

Abstract:We introduce KAMERA: a comprehensive system for multi-camera, multi-spectral synchronization and real-time detection of seals and polar bears. Utilized in aerial surveys for ice-associated seals in the Bering, Chukchi, and Beaufort seas around Alaska, KAMERA provides up to an 80% reduction in dataset processing time over previous methods. Our rigorous calibration and hardware synchronization enable using multiple spectra for object detection. All collected data are annotated with metadata so they can be easily referenced later. All imagery and animal detections from a survey are mapped onto a world plane for accurate surveyed area estimates and quick assessment of survey results. We hope KAMERA will inspire other mapping and detection efforts in the scientific community, with all software, models, and schematics fully open-sourced.
zh

[CV-21] rack-On2: Enhancing Online Point Tracking with Memory

【速读】:该论文旨在解决长时点跟踪(long-term point tracking)问题,即在视频帧间存在显著外观变化、运动和遮挡情况下,仍能一致地识别和追踪特定点。其解决方案的关键在于提出了一种基于Transformer的在线跟踪模型Track-On2,该模型通过因果处理帧序列并借助记忆机制维持时间一致性,从而有效应对漂移和遮挡问题,且无需依赖未来帧信息;同时,该方法在训练阶段采用合成数据策略,并通过精细设计的内存使用方式与分层推理流程(粗粒度patch分类后精修),实现了性能与效率的双重提升。

链接: https://arxiv.org/abs/2509.19115
作者: Görkay Aydemir,Weidi Xie,Fatma Güney
机构: Koç University (科克大学); Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, i.e. tracking points frame-by-frame, making it suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. Through comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: this https URL
zh

[CV-22] FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

【速读】:该论文旨在解决通用机器人技能从端到端演示中学习时,导致的任务特定策略难以在训练分布之外泛化的问题。其核心解决方案是提出FunCanon框架,通过将长时程操作任务分解为由执行者(actor)、动词(verb)和对象(object)构成的动作片段(action chunks),使策略学习聚焦于动作本身而非孤立任务,从而实现行为的组合性与复用性;关键创新在于引入功能物体规范化(functional object canonicalization),利用大规模视觉语言模型提供的可操作性线索(affordance cues)将不同物体映射到共享的功能坐标系中,结合以对象为中心和动作为中心的扩散策略FuncDiffuser,在对齐数据上训练后能自然尊重物体的可操作性和位姿,显著提升策略的学习效率与跨类别泛化能力。

链接: https://arxiv.org/abs/2509.19102
作者: Hongli Xu,Lei Zhang,Xiaoyue Hu,Boyang Zhong,Kaixin Bai,Zoltán-Csaba Márton,Zhenshan Bing,Zhaopeng Chen,Alois Christian Knoll,Jianwei Zhang
机构: TAMS (Technical Aspects of Multimodal Systems); Technical University of Munich (慕尼黑工业大学); Agile Robots SE (敏捷机器人公司)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: project website: this https URL , 11 pages

点击查看摘要

Abstract:General-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object centric and action centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website this https URL.
zh

[CV-23] Investigating Traffic Accident Detection Using Multimodal Large Language Models

【速读】:该论文旨在解决交通事故检测中对大规模标注数据依赖性强、现实场景数据稀缺的问题,提出基于多模态大语言模型(Multimodal Large Language Models, MLLMs)的零样本检测方法,以实现无需额外微调即可从基础设施摄像头图像中自动识别并描述交通事故。其解决方案的关键在于:(1) 利用CARLA仿真平台构建的DeepAccident数据集,克服真实多样化事故数据不足的问题;(2) 通过引入YOLO目标检测、Deep SORT多目标跟踪与Segment Anything(SAM)实例分割等先进视觉分析技术,增强提示(prompt)信息,提升模型准确性与可解释性;(3) 在不进行参数微调的前提下,对比不同MLLMs(如Pixtral、Gemini 1.5/2.0、Gemma 3)的零样本性能,发现Pixtral在F1-score(0.71)和召回率(83%)上最优,而Gemini系列在优化提示后精度显著提升(达90%),但牺牲了召回率,Gemma 3则展现出最稳定的综合性能。这一集成框架为自动化交通监控系统提供了高适应性且低数据依赖的智能检测路径。

链接: https://arxiv.org/abs/2509.19096
作者: Ilhan Skender,Kailin Tong,Selim Solmaz,Daniel Watzenig
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注: Accepted for presentation at the 2025 IEEE International Automated Vehicle Validation Conference (IAVVC 2025). Final version to appear in IEEE Xplore

点击查看摘要

Abstract:Traffic safety remains a critical global concern, with timely and accurate accident detection essential for hazard reduction and rapid emergency response. Infrastructure-based vision sensors offer scalable and efficient solutions for continuous real-time monitoring, facilitating automated detection of acci- dents directly from captured images. This research investigates the zero-shot capabilities of multimodal large language models (MLLMs) for detecting and describing traffic accidents using images from infrastructure cameras, thus minimizing reliance on extensive labeled datasets. Main contributions include: (1) Evaluation of MLLMs using the simulated DeepAccident dataset from CARLA, explicitly addressing the scarcity of diverse, realistic, infrastructure-based accident data through controlled simulations; (2) Comparative performance analysis between Gemini 1.5 and 2.0, Gemma 3 and Pixtral models in acci- dent identification and descriptive capabilities without prior fine-tuning; and (3) Integration of advanced visual analytics, specifically YOLO for object detection, Deep SORT for multi- object tracking, and Segment Anything (SAM) for instance segmentation, into enhanced prompts to improve model accuracy and explainability. Key numerical results show Pixtral as the top performer with an F1-score of 0.71 and 83% recall, while Gemini models gained precision with enhanced prompts (e.g., Gemini 1.5 rose to 90%) but suffered notable F1 and recall losses. Gemma 3 offered the most balanced performance with minimal metric fluctuation. These findings demonstrate the substantial potential of integrating MLLMs with advanced visual analytics techniques, enhancing their applicability in real-world automated traffic monitoring systems.
zh

[CV-24] Zero-Shot Multi-Spectral Learning: Reimagining a Generalist Multimodal Gemini 2.5 Model for Remote Sensing Applications

【速读】:该论文旨在解决多光谱遥感图像(multi-spectral imagery)在通用多模态大模型(generalist multimodal models)中难以直接利用的问题。当前主流方法依赖于为多光谱输入专门训练的机器学习模型,存在训练成本高、部署复杂等问题,且无法与具备强大视觉理解能力的RGB-only预训练模型兼容。解决方案的关键在于提出一种无需训练(training-free)的零样本(Zero-Shot-only)方法:通过将多光谱数据映射至通用多模态模型所理解的视觉空间,并以指令形式注入领域特定信息(domain-specific information),从而实现对如Gemini2.5等模型的无缝适配。该方法显著提升了模型在土地覆盖和土地利用分类任务上的零样本性能,验证了其在地理空间专业场景下快速集成先进多模态推理能力的可行性。

链接: https://arxiv.org/abs/2509.19087
作者: Ganesh Mallya,Yotam Gigi,Dahun Kim,Maxim Neumann,Genady Beryozkin,Tomer Shekel,Anelia Angelova
机构: Google DeepMind(谷歌深思维); Google Research(谷歌研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-spectral imagery plays a crucial role in diverse Remote Sensing applications including land-use classification, environmental monitoring and urban planning. These images are widely adopted because their additional spectral bands correlate strongly with physical materials on the ground, such as ice, water, and vegetation. This allows for more accurate identification, and their public availability from missions, such as Sentinel-2 and Landsat, only adds to their value. Currently, the automatic analysis of such data is predominantly managed through machine learning models specifically trained for multi-spectral input, which are costly to train and support. Furthermore, although providing a lot of utility for Remote Sensing, such additional inputs cannot be used with powerful generalist large multimodal models, which are capable of solving many visual problems, but are not able to understand specialized multi-spectral signals. To address this, we propose a training-free approach which introduces new multi-spectral data in a Zero-Shot-only mode, as inputs to generalist multimodal models, trained on RGB-only inputs. Our approach leverages the multimodal models’ understanding of the visual space, and proposes to adapt to inputs to that space, and to inject domain-specific information as instructions into the model. We exemplify this idea with the Gemini2.5 model and observe strong Zero-Shot performance gains of the approach on popular Remote Sensing benchmarks for land cover and land use classification and demonstrate the easy adaptability of Gemini2.5 to new inputs. These results highlight the potential for geospatial professionals, working with non-standard specialized inputs, to easily leverage powerful multimodal models, such as Gemini2.5, to accelerate their work, benefiting from their rich reasoning and contextual capabilities, grounded in the specialized sensor data. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2509.19087 [cs.CV] (or arXiv:2509.19087v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2509.19087 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[CV-25] 3rd Place Report of LSVOS 2025 MeViS Track: Sa2VA-i: Improving Sa2VA Results with Consistent Training and Inference

【速读】:该论文旨在解决Sa2VA模型在视频指代表达分割(referring video object segmentation, RVOS)任务中未能充分发挥性能的问题。研究表明,训练与推理阶段流程不一致是限制其性能的关键因素。解决方案的核心在于提出改进版本Sa2VA-i,通过修正训练与推理之间的不一致性,显著提升了模型在多个视频基准数据集上的表现,实现了新的SOTA结果,且仅使用原版Sa2VA的检查点即可获得大幅提升。

链接: https://arxiv.org/abs/2509.19082
作者: Alexey Nekrasov,Ali Athar,Daan de Geus,Alexander Hermans,Bastian Leibe
机构: RWTH Aachen University (亚琛工业大学); Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Sa2VA is a recent model for language-guided dense grounding in images and video that achieves state-of-the-art results on multiple segmentation benchmarks and that has become widely popular. However, we found that Sa2VA does not perform according to its full potential for referring video object segmentation tasks. We identify inconsistencies between training and inference procedures as the key factor holding it back. To mitigate this issue, we propose an improved version of Sa2VA, Sa2VA-i, that rectifies these issues and improves the results. In fact, Sa2VA-i sets a new state of the art for multiple video benchmarks and achieves improvements of up to +11.6 JF on MeViS, +1.4 on Ref-YT-VOS, +3.3 on Ref-DAVIS and +4.1 on ReVOS using the same Sa2VA checkpoints. With our fixes, the Sa2VA-i-1B model even performs on par with the original Sa2VA-26B model on the MeViS benchmark. We hope that this work will show the importance of seemingly trivial implementation details and that it will provide valuable insights for the referring video segmentation field. We provide the code and updated models at this https URL
zh

[CV-26] WaveletGaussian: Wavelet-domain Diffusion for Sparse-view 3D Gaussian Object Reconstruction

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在稀疏视角(sparse-view)条件下重建性能显著下降的问题。现有方法通过扩散模型修复损坏的渲染结果,并将其作为伪真值用于后续优化,但此类方案计算开销大,尤其在扩散模型微调和修复步骤中效率低下。本文提出WaveletGaussian框架,其核心创新在于将扩散过程转移到小波域:仅对低频LL子带进行扩散建模,高频子带则由轻量级网络进行精修;同时设计了一种高效的在线随机掩码策略来构建训练对,替代传统但低效的留一法(leave-one-out)策略,从而显著降低训练时间并保持优异的渲染质量。

链接: https://arxiv.org/abs/2509.19073
作者: Hung Nguyen,Runfa Li,An Le,Truong Nguyen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Signal Processing (eess.SP)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has become a powerful representation for image-based object reconstruction, yet its performance drops sharply in sparse-view settings. Prior works address this limitation by employing diffusion models to repair corrupted renders, subsequently using them as pseudo ground truths for later optimization. While effective, such approaches incur heavy computation from the diffusion fine-tuning and repair steps. We present WaveletGaussian, a framework for more efficient sparse-view 3D Gaussian object reconstruction. Our key idea is to shift diffusion into the wavelet domain: diffusion is applied only to the low-resolution LL subband, while high-frequency subbands are refined with a lightweight network. We further propose an efficient online random masking strategy to curate training pairs for diffusion fine-tuning, replacing the commonly used, but inefficient, leave-one-out strategy. Experiments across two benchmark datasets, Mip-NeRF 360 and OmniObject3D, show WaveletGaussian achieves competitive rendering quality while substantially reducing training time.
zh

[CV-27] A DyL-Unet framework based on dynamic learning for Temporally Consistent Echocardiographic Segmentation

【速读】:该论文旨在解决超声心动图(echocardiography)中因图像形变和斑点噪声导致的帧间分割抖动问题,从而提升分割结果的时间稳定性,以保障心血管功能评估的准确性与临床可解释性。其解决方案的关键在于提出一种基于动态学习的时序一致性U-Net架构——DyL-UNet,该架构通过构建Echo-Dynamics Graph(EDG)提取视频动态信息,并引入Cardiac Phase-Dynamics Attention(CPDA)模块,在跳跃连接处融合EDG编码的动态特征与心脏相位线索,实现对分割过程的时序约束,从而在保持单帧高精度的同时显著增强时间连续性。

链接: https://arxiv.org/abs/2509.19052
作者: Jierui Qu,Jianchun Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate segmentation of cardiac anatomy in echocardiography is essential for cardiovascular diagnosis and treatment. Yet echocardiography is prone to deformation and speckle noise, causing frame-to-frame segmentation jitter. Even with high accuracy in single-frame segmentation, temporal instability can weaken functional estimates and impair clinical interpretability. To address these issues, we propose DyL-UNet, a dynamic learning-based temporal consistency U-Net segmentation architecture designed to achieve temporally stable and precise echocardiographic segmentation. The framework constructs an Echo-Dynamics Graph (EDG) through dynamic learning to extract dynamic information from videos. DyL-UNet incorporates multiple Swin-Transformer-based encoder-decoder branches for processing single-frame images. It further introduces Cardiac Phase-Dynamics Attention (CPDA) at the skip connections, which uses EDG-encoded dynamic features and cardiac-phase cues to enforce temporal consistency during segmentation. Extensive experiments on the CAMUS and EchoNet-Dynamic datasets demonstrate that DyL-UNet maintains segmentation accuracy comparable to existing methods while achieving superior temporal consistency, providing a reliable solution for automated clinical echocardiography.
zh

[CV-28] Latent Danger Zone: Distilling Unified Attention for Cross-Architecture Black-box Attacks

【速读】:该论文旨在解决黑盒对抗攻击中面临的两大挑战:一是现有方法对特定网络架构依赖性强,导致跨架构迁移能力有限;二是攻击过程通常需要大量查询,造成较高的查询成本。解决方案的关键在于提出一种基于潜在扩散模型(latent diffusion model)的框架 JAD,通过联合蒸馏卷积神经网络(CNN)和视觉Transformer(ViT)的注意力图,聚焦于不同模型共有的敏感图像区域,从而生成具有强迁移性的对抗扰动。该策略使 JAD 具备架构无关性(architecture-agnostic),显著提升攻击泛化能力和生成效率,同时减少对迭代查询的依赖。

链接: https://arxiv.org/abs/2509.19044
作者: Yang Li,Chenyu Wang,Tingrui Wang,Yongwei Wang,Haonan Li,Zhunga Liu,Quan Pan
机构: Northwestern Polytechnical University (西北工业大学); Zhejiang University (浙江大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Black-box adversarial attacks remain challenging due to limited access to model internals. Existing methods often depend on specific network architectures or require numerous queries, resulting in limited cross-architecture transferability and high query costs. To address these limitations, we propose JAD, a latent diffusion model framework for black-box adversarial attacks. JAD generates adversarial examples by leveraging a latent diffusion model guided by attention maps distilled from both a convolutional neural network (CNN) and a Vision Transformer (ViT) models. By focusing on image regions that are commonly sensitive across architectures, this approach crafts adversarial perturbations that transfer effectively between different model types. This joint attention distillation strategy enables JAD to be architecture-agnostic, achieving superior attack generalization across diverse models. Moreover, the generative nature of the diffusion framework yields high adversarial sample generation efficiency by reducing reliance on iterative queries. Experiments demonstrate that JAD offers improved attack generalization, generation efficiency, and cross-architecture transferability compared to existing methods, providing a promising and effective paradigm for black-box adversarial attacks.
zh

[CV-29] Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model

【速读】:该论文旨在解决食品图像语义分割中依赖大量像素级标注数据的问题,提出了一种弱监督学习方法。其关键在于利用Segment Anything Model (SAM) 的零样本能力和提示可调性,结合Vision Transformer (ViT) 中的类激活图(Class Activation Maps, CAMs)生成高质量的分割提示,从而在仅使用图像级别标签训练Swin Transformer的前提下,实现对食物图像的有效分割。通过引入图像预处理与单掩码/多掩码SAM生成策略,进一步提升了分割结果的质量,在FoodSeg103数据集上实现了平均每个图像生成2.4个掩码(不含背景)且mIoU达到0.54的性能表现。

链接: https://arxiv.org/abs/2509.19028
作者: Ioannis Sarafis,Alexandros Papadopoulos,Anastasios Delopoulos
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Submitted to the 20th International Workshop on Semantic and Social Media Adaptation Personalization

点击查看摘要

Abstract:In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
zh

[CV-30] Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards NEURIPS2025

【速读】:该论文旨在解决视觉-语言模型中链式推理(chain of thought reasoning)的细粒度结构化推理能力不足及中间推理步骤质量难以评估的问题。现有方法通常采用粗粒度的推理链,导致难以进行精细化推理建模和奖励机制设计。其解决方案的关键在于提出一种基于“推理步骤”(chain of step reasoning)的框架,包含细粒度的推理步骤数据、过程奖励模型(Process Reward Model, PRM)以及基于细粒度奖励的强化学习训练策略。该方法可精确评估每一步推理的质量,从而实现有效的强化学习优化与推理时缩放(inference-time scaling),显著提升模型在复杂视觉-语言任务上的性能表现。

链接: https://arxiv.org/abs/2509.19003
作者: Honghao Chen,Xingzhou Lou,Xiaokun Feng,Kaiqi Huang,Xinlong Wang
机构: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所); School of Artificial Intelligence, University of Chinese Academy of Sciences (中国科学院大学人工智能学院); Beijing Academy of Artificial Intelligence (北京人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by NeurIPS 2025

点击查看摘要

Abstract:Chain of thought reasoning has demonstrated remarkable success in large language models, yet its adaptation to vision-language reasoning remains an open challenge with unclear best practices. Existing attempts typically employ reasoning chains at a coarse-grained level, which struggles to perform fine-grained structured reasoning and, more importantly, are difficult to evaluate the reward and quality of intermediate reasoning. In this work, we delve into chain of step reasoning for vision-language models, enabling assessing reasoning step quality accurately and leading to effective reinforcement learning and inference-time scaling with fine-grained rewards. We present a simple, effective, and fully transparent framework, including the step-level reasoning data, process reward model (PRM), and reinforcement learning training. With the proposed approaches, our models set strong baselines with consistent improvements on challenging vision-language benchmarks. More importantly, we conduct a thorough empirical analysis and ablation study, unveiling the impact of each component and several intriguing properties of inference-time scaling. We believe this paper serves as a baseline for vision-language models and offers insights into more complex multimodal reasoning. Our dataset, PRM, and code will be available at this https URL.
zh

[CV-31] Category-Level Object Shape and Pose Estimation in Less Than a Millisecond

【速读】:该论文旨在解决机器人领域中物体形状(shape)与位姿(pose)估计这一基础性问题,该任务对抓取、场景理解及导航等下游应用至关重要。解决方案的关键在于提出了一种快速局部求解器,仅需类别级别的物体先验信息,并能提供全局最优性的高效验证证书。其核心创新是将形状和位姿联合估计建模为最大后验概率优化问题,其中物体形状由线性主动形状模型(Active Shape Model, ASM)表示,位姿以单位四元数形式参数化,从而将问题转化为带有特征向量非线性的特征值问题;通过自洽场迭代法(self-consistent field iteration)高效求解,每步仅需计算一个4×4矩阵并求其最小特征值-特征向量对,同时利用拉格朗日乘子的线性系统构建简洁的全局最优性判据。该方法单次迭代耗时约100微秒,支持快速误匹配剔除,在合成数据和多个真实场景(包括两个公开数据集和无人机跟踪任务)中验证了有效性。

链接: https://arxiv.org/abs/2509.18979
作者: Lorenzo Shaikewitz,Tim Nguyen,Luca Carlone
机构: Massachusetts Institute of Technology (麻省理工学院); Boston University (波士顿大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Object shape and pose estimation is a foundational robotics problem, supporting tasks from manipulation to scene understanding and navigation. We present a fast local solver for shape and pose estimation which requires only category-level object priors and admits an efficient certificate of global optimality. Given an RGB-D image of an object, we use a learned front-end to detect sparse, category-level semantic keypoints on the target object. We represent the target object’s unknown shape using a linear active shape model and pose a maximum a posteriori optimization problem to solve for position, orientation, and shape simultaneously. Expressed in unit quaternions, this problem admits first-order optimality conditions in the form of an eigenvalue problem with eigenvector nonlinearities. Our primary contribution is to solve this problem efficiently with self-consistent field iteration, which only requires computing a 4-by-4 matrix and finding its minimum eigenvalue-vector pair at each iterate. Solving a linear system for the corresponding Lagrange multipliers gives a simple global optimality certificate. One iteration of our solver runs in about 100 microseconds, enabling fast outlier rejection. We test our method on synthetic data and a variety of real-world settings, including two public datasets and a drone tracking scenario. Code is released at this https URL.
zh

[CV-32] Prompt-DAS: Annotation-Efficient Prompt Learning for Domain Adaptive Semantic Segmentation of Electron Microscopy Images MICCAI2025

【速读】:该论文旨在解决大规模电子显微镜(EM)图像中众多细胞器实例的域自适应分割(Domain Adaptive Segmentation, DAS)问题,以实现标注高效的学习。传统方法在跨域场景下性能受限,且依赖大量标注数据;而现有基于SAM(Segment Anything Model)的方法需为每个对象实例提供单独提示,限制了灵活性与实用性。解决方案的关键在于提出一个可提示的多任务框架Prompt-DAS,其核心创新包括:1)引入辅助的中心点检测任务,使模型能够在训练和测试阶段灵活使用全点、稀疏点或无提示的配置,从而支持无监督域适应(UDA)、弱监督域适应(WDA)及交互式分割;2)设计一种新颖的提示引导对比学习(prompt-guided contrastive learning),增强特征的判别能力,提升跨域分割精度。实验表明,该方法在多个挑战性基准上显著优于现有UDA、WDA及基于SAM的方法。

链接: https://arxiv.org/abs/2509.18973
作者: Jiabao Chen,Shan Xiong,Jialin Peng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI2025

点击查看摘要

Abstract:Domain adaptive segmentation (DAS) of numerous organelle instances from large-scale electron microscopy (EM) is a promising way to enable annotation-efficient learning. Inspired by SAM, we propose a promptable multitask framework, namely Prompt-DAS, which is flexible enough to utilize any number of point prompts during the adaptation training stage and testing stage. Thus, with varying prompt configurations, Prompt-DAS can perform unsupervised domain adaptation (UDA) and weakly supervised domain adaptation (WDA), as well as interactive segmentation during testing. Unlike the foundation model SAM, which necessitates a prompt for each individual object instance, Prompt-DAS is only trained on a small dataset and can utilize full points on all instances, sparse points on partial instances, or even no points at all, facilitated by the incorporation of an auxiliary center-point detection task. Moreover, a novel prompt-guided contrastive learning is proposed to enhance discriminative feature learning. Comprehensive experiments conducted on challenging benchmarks demonstrate the effectiveness of the proposed approach over existing UDA, WDA, and SAM-based approaches.
zh

[CV-33] Generative data augmentation for biliary tract detection on intraoperative images

【速读】:该论文旨在解决腹腔镜胆囊切除术中胆管损伤风险较高的问题,其核心在于提升术中胆道系统的可视化水平。解决方案的关键在于利用深度学习方法,特别是YOLO目标检测算法,从手术过程中获取的白光图像中定位胆道结构;同时,为增强训练数据的多样性与代表性,研究提出采用生成对抗网络(Generative Adversarial Network, GAN)生成部分合成训练样本,从而提高模型在真实临床场景中的鲁棒性与准确性。

链接: https://arxiv.org/abs/2509.18958
作者: Cristina Iacono,Mariarosaria Meola,Federica Conte,Laura Mecozzi,Umberto Bracale,Pietro Falco,Fanny Ficuciello
机构: C.R.E.A.T.E. Consorzio di Ricerca per l’Energia, l’Automazione e le Tecnologie dell’Elettromagnetismo, Università degli Studi di Napoli Federico II (C.R.E.A.T.E. 研究与能源、自动化和电磁技术 consortia, 那不勒斯费德里科二世大学); Department of Information Technology and Electrical Engineering, Università degli Studi di Napoli Federico II (信息科技与电气工程系, 那不勒斯费德里科二世大学); Department of Information Engineering, University of Padova (信息工程系, 帕多瓦大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Cholecystectomy is one of the most frequently performed procedures in gastrointestinal surgery, and the laparoscopic approach is the gold standard for symptomatic cholecystolithiasis and acute cholecystitis. In addition to the advantages of a significantly faster recovery and better cosmetic results, the laparoscopic approach bears a higher risk of bile duct injury, which has a significant impact on quality of life and survival. To avoid bile duct injury, it is essential to improve the intraoperative visualization of the bile duct. This work aims to address this problem by leveraging a deep-learning approach for the localization of the biliary tract from white-light images acquired during the surgical procedures. To this end, the construction and annotation of an image database to train the Yolo detection algorithm has been employed. Besides classical data augmentation techniques, the paper proposes Generative Adversarial Network (GAN) for the generation of a synthetic portion of the training dataset. Experimental results have been discussed along with ethical considerations.
zh

[CV-34] Seeing Through Reflections: Advancing 3D Scene Reconstruction in Mirror-Containing Environments with Gaussian Splatting

【速读】:该论文旨在解决镜面环境下的三维重建与新视角合成(Novel View Synthesis, NVS)问题,这类场景中反射表面会引入视点依赖的畸变和不一致性,导致现有方法如神经辐射场(Neural Radiance Fields, NeRF)和3D高斯泼溅(3D Gaussian Splatting, 3DGS)性能显著下降。其解决方案的关键在于摒弃传统仅将镜面视为对称映射的处理方式,转而将镜面反射作为互补视角来利用,从而补充缺失细节并提升几何精度;基于此思路,作者提出了ReflectiveGS方法,并构建了MirrorScene3D数据集以支持该方向的研究,实验表明该方法在SSIM、PSNR、LPIPS指标及训练速度上均优于现有方法,为镜面丰富环境中的三维重建设立了新基准。

链接: https://arxiv.org/abs/2509.18956
作者: Zijing Guo,Yunyang Zhao,Lin Wang
机构: Shanghai Jiao Tong University (上海交通大学); Ningbo Artificial Intelligence Institute (宁波人工智能研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Mirror-containing environments pose unique challenges for 3D reconstruction and novel view synthesis (NVS), as reflective surfaces introduce view-dependent distortions and inconsistencies. While cutting-edge methods such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) excel in typical scenes, their performance deteriorates in the presence of mirrors. Existing solutions mainly focus on handling mirror surfaces through symmetry mapping but often overlook the rich information carried by mirror reflections. These reflections offer complementary perspectives that can fill in absent details and significantly enhance reconstruction quality. To advance 3D reconstruction in mirror-rich environments, we present MirrorScene3D, a comprehensive dataset featuring diverse indoor scenes, 1256 high-quality images, and annotated mirror masks, providing a benchmark for evaluating reconstruction methods in reflective settings. Building on this, we propose ReflectiveGS, an extension of 3D Gaussian Splatting that utilizes mirror reflections as complementary viewpoints rather than simple symmetry artifacts, enhancing scene geometry and recovering absent details. Experiments on MirrorScene3D show that ReflectiveGaussian outperforms existing methods in SSIM, PSNR, LPIPS, and training speed, setting a new benchmark for 3D reconstruction in mirror-rich environments.
zh

[CV-35] owards Robust LiDAR Localization: Deep Learning-based Uncertainty Estimation

【速读】:该论文旨在解决LiDAR位姿估计中迭代最近点(Iterative Closest Point, ICP)算法在无特征环境和动态场景下易产生误差的问题,尤其是缺乏对ICP注册误差协方差的准确预测,从而影响状态估计的鲁棒性。现有方法要么依赖手工设计模型或简化假设,要么需要预构建地图或仅提供二值化可定位性判断,无法有效建模不确定性。解决方案的关键在于提出一种数据驱动框架,利用深度学习直接预测ICP匹配前的注册误差协方差,无需依赖参考地图即可获得可靠的6自由度(6-DoF)误差协方差估计,从而实现ICP与卡尔曼滤波(Kalman filtering)的无缝集成,显著提升定位精度与鲁棒性。

链接: https://arxiv.org/abs/2509.18954
作者: Minoo Dolatabadi,Fardin Ayar,Ehsan Javanmardi,Manabu Tsukada,Mahdi Javanmardi
机构: Amirkabir University of Technology (伊朗阿米尔卡比尔理工大学); The University of Tokyo (东京大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:LiDAR-based localization and SLAM often rely on iterative matching algorithms, particularly the Iterative Closest Point (ICP) algorithm, to align sensor data with pre-existing maps or previous scans. However, ICP is prone to errors in featureless environments and dynamic scenes, leading to inaccurate pose estimation. Accurately predicting the uncertainty associated with ICP is crucial for robust state estimation but remains challenging, as existing approaches often rely on handcrafted models or simplified assumptions. Moreover, a few deep learning-based methods for localizability estimation either depend on a pre-built map, which may not always be available, or provide a binary classification of localizable versus non-localizable, which fails to properly model uncertainty. In this work, we propose a data-driven framework that leverages deep learning to estimate the registration error covariance of ICP before matching, even in the absence of a reference map. By associating each LiDAR scan with a reliable 6-DoF error covariance estimate, our method enables seamless integration of ICP within Kalman filtering, enhancing localization accuracy and robustness. Extensive experiments on the KITTI dataset demonstrate the effectiveness of our approach, showing that it accurately predicts covariance and, when applied to localization using a pre-built map or SLAM, reduces localization errors and improves robustness.
zh

[CV-36] One-shot Embroidery Customization via Contrastive LoRA Modulation SIGGRAPH

【速读】:该论文旨在解决现有风格迁移方法在处理细粒度视觉特征(如刺绣艺术中复杂的针法图案与材质属性交互)时的局限性,这类特征对传统方法而言难以精准建模和迁移。其解决方案的关键在于提出一种基于对比学习的框架,通过单参考图像实现风格与内容特征的解耦:首先构建图像对定义目标风格,并利用预训练扩散模型的解耦表征设计相似性度量;随后引入两阶段对比LoRA调制技术,第一阶段迭代更新全LoRA及选定风格模块以初步分离风格与内容,第二阶段采用自知识蒸馏策略进一步增强解耦效果;最终形成仅依赖风格模块的推理流程,从而实现高精度的细粒度风格迁移,在刺绣定制、艺术风格迁移、草图着色和外观迁移等多个场景中均展现出优越性能与泛化能力。

链接: https://arxiv.org/abs/2509.18948
作者: Jun Ma,Qian He,Gaofeng He,Huang Chen,Chen Liu,Xiaogang Jin,Huamin Wang
机构: Zhejiang Sci-Tech University (浙江理工大学); Style3D Research (Style3D 研究院); State Key Lab of CAD&CG, Zhejiang University (浙江大学CAD&CG国家重点实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ACM Transactions on Graphics (TOG), SIGGRAPH Asia 2025

点击查看摘要

Abstract:Diffusion models have significantly advanced image manipulation techniques, and their ability to generate photorealistic images is beginning to transform retail workflows, particularly in presale visualization. Beyond artistic style transfer, the capability to perform fine-grained visual feature transfer is becoming increasingly important. Embroidery is a textile art form characterized by intricate interplay of diverse stitch patterns and material properties, which poses unique challenges for existing style transfer methods. To explore the customization for such fine-grained features, we propose a novel contrastive learning framework that disentangles fine-grained style and content features with a single reference image, building on the classic concept of image analogy. We first construct an image pair to define the target style, and then adopt a similarity metric based on the decoupled representations of pretrained diffusion models for style-content separation. Subsequently, we propose a two-stage contrastive LoRA modulation technique to capture fine-grained style features. In the first stage, we iteratively update the whole LoRA and the selected style blocks to initially separate style from content. In the second stage, we design a contrastive learning strategy to further decouple style and content through self-knowledge distillation. Finally, we build an inference pipeline to handle image or text inputs with only the style blocks. To evaluate our method on fine-grained style transfer, we build a benchmark for embroidery customization. Our approach surpasses prior methods on this task and further demonstrates strong generalization to three additional domains: artistic style transfer, sketch colorization, and appearance transfer.
zh

[CV-37] No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning ICTAI

【速读】:该论文旨在解决深度学习模型在图像分类任务中对大规模标注数据的高度依赖问题,尤其是在标注数据稀缺的实际场景下。其解决方案的关键在于提出了一种结合视觉语言模型(VLM)与预训练视觉模型的自学习循环框架,通过置信度驱动的伪标签策略,在无需任何标注训练数据的情况下,直接在测试数据上训练一个轻量级分类器。该方法利用VLM识别高置信度样本,并借助预训练视觉模型增强其视觉表征,进而迭代优化分类器,从而在无监督条件下融合语义与视觉线索,实现动态适应。值得注意的是,该方案不涉及VLM微调或大型语言模型的使用,仅依赖视觉模型以降低对语义表示的依赖,实验表明其在十个多样化数据集上显著优于基线零样本方法。

链接: https://arxiv.org/abs/2509.18938
作者: Matheus Vinícius Todescato,Joel Luís Carbonera
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: This paper was accepted at International Conference on Tools with Artificial Intelligence (ICTAI) 2025

点击查看摘要

Abstract:While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.
zh

[CV-38] SynapFlow: A Modular Framework Towards Large-Scale Analysis of Dendritic Spines

【速读】:该论文旨在解决在三维加时间(3D+time)双光子显微成像数据中,对树突棘(dendritic spine)进行大规模自动化检测、时空追踪与特征提取的挑战,这是研究学习与记忆神经机制的关键前提。其解决方案的核心在于构建一个模块化机器学习流水线,包含四个关键组件:基于Transformer的检测模块用于精准识别树突棘;结合空间特征的深度追踪模块提升跨层定位一致性;利用空间一致性进行时间追踪以关联不同时刻的3D树突棘;以及量化生物相关属性的特征提取单元。该方法在公开标注数据及作者新发布的两个互补标注数据集上得到验证,为树突棘动态分析提供了可扩展、端到端的基准工具。

链接: https://arxiv.org/abs/2509.18926
作者: Pamela Osuna-Vargas,Altug Kamacioglu,Dominik F. Aschauer,Petros E. Vlachos,Sercan Alipek,Jochen Triesch,Simon Rumpel,Matthias Kaschube
机构: Frankfurt Institute for Advanced Studies (法兰克福先进研究所); Institute of Computer Science, Goethe University Frankfurt (法兰克福歌德大学计算机科学研究所); Institute of Physiology, Focus Program Translational Neurosciences, University Medical Center Johannes Gutenberg University-Mainz (约翰内斯古登堡大学美因茨医学院生理学研究所,转化神经科学重点计划); Mechanical Engineering Department, Universität Siegen (锡根大学机械工程系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Dendritic spines are key structural components of excitatory synapses in the brain. Given the size of dendritic spines provides a proxy for synaptic efficacy, their detection and tracking across time is important for studies of the neural basis of learning and memory. Despite their relevance, large-scale analyses of the structural dynamics of dendritic spines in 3D+time microscopy data remain challenging and labor-intense. Here, we present a modular machine learning-based pipeline designed to automate the detection, time-tracking, and feature extraction of dendritic spines in volumes chronically recorded with two-photon microscopy. Our approach tackles the challenges posed by biological data by combining a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module to associate 3D spines across time by leveraging spatial consistency, and a feature extraction unit that quantifies biologically relevant spine properties. We validate our method on open-source labeled spine data, and on two complementary annotated datasets that we publish alongside this work: one for detection and depth-tracking, and one for time-tracking, which, to the best of our knowledge, is the first data of this kind. To encourage future research, we release our data, code, and pre-trained weights at this https URL, establishing a baseline for scalable, end-to-end analysis of dendritic spine dynamics.
zh

[CV-39] Audio-Driven Universal Gaussian Head Avatars SIGGRAPH

【速读】:该论文旨在解决音频驱动的通用逼真头像合成问题,即如何从原始音频输入中生成具有高保真度、精准唇同步及丰富表情细节(如眉毛运动、眼神变化和口腔内部结构)的三维头像。传统方法通常仅将音频特征映射为几何变形,忽略了音频依赖的外观变化,导致生成结果缺乏真实感。其解决方案的关键在于提出一种新型的通用头部头像先验(Universal Head Avatar Prior, UHAP),该先验模型通过跨身份多视角视频训练,并利用中性扫描数据进行监督,从而在潜在空间中同时编码几何与外观的表达变化;此外,引入一个单目编码器以轻量级回归动态表情变化,使后续微调阶段专注于捕捉个体全局外观和几何特征,最终实现高质量、可泛化的音频驱动头像生成。

链接: https://arxiv.org/abs/2509.18924
作者: Kartik Teotia,Helge Rhodin,Mohit Mendiratta,Hyeongwoo Kim,Marc Habermann,Christian Theobalt
机构: Max Planck Institute for Informatics (马克斯·普朗克信息研究所); Saarland Informatics Campus (萨尔兰信息学校区); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: (SIGGRAPH Asia 2025) Project page: this https URL

点击查看摘要

Abstract:We introduce the first method for audio-driven universal photorealistic avatar synthesis, combining a person-agnostic speech model with our novel Universal Head Avatar Prior (UHAP). UHAP is trained on cross-identity multi-view videos. In particular, our UHAP is supervised with neutral scan data, enabling it to capture the identity-specific details at high fidelity. In contrast to previous approaches, which predominantly map audio features to geometric deformations only while ignoring audio-dependent appearance variations, our universal speech model directly maps raw audio inputs into the UHAP latent expression space. This expression space inherently encodes, both, geometric and appearance variations. For efficient personalization to new subjects, we employ a monocular encoder, which enables lightweight regression of dynamic expression variations across video frames. By accounting for these expression-dependent changes, it enables the subsequent model fine-tuning stage to focus exclusively on capturing the subject’s global appearance and geometry. Decoding these audio-driven expression codes via UHAP generates highly realistic avatars with precise lip synchronization and nuanced expressive details, such as eyebrow movement, gaze shifts, and realistic mouth interior appearance as well as motion. Extensive evaluations demonstrate that our method is not only the first generalizable audio-driven avatar model that can account for detailed appearance modeling and rendering, but it also outperforms competing (geometry-only) methods across metrics measuring lip-sync accuracy, quantitative image quality, and perceptual realism.
zh

[CV-40] Advancing Metallic Surface Defect Detection via Anomaly-Guided Pretraining on a Large Industrial Dataset

【速读】:该论文旨在解决金属表面缺陷检测中因数据稀缺导致的预训练-微调范式(pretraining-finetuning paradigm)效果受限的问题。具体而言,现有方法在ImageNet等自然图像上预训练存在显著域差距(domain gap),而直接在工业数据上进行自监督预训练又难以区分细微缺陷与复杂背景噪声,导致特征学习无效。解决方案的关键在于提出一种异常引导的自监督预训练(Anomaly-Guided Self-Supervised Pretraining, AGSSP)新范式,其核心是通过异常先验(anomaly priors)显式指导表示学习:第一阶段利用异常图蒸馏知识以增强网络对缺陷显著特征的捕捉能力;第二阶段基于异常图生成伪缺陷框对检测器进行预训练,使其与定位任务对齐。该方法显著提升了模型性能,在多个指标上相较ImageNet基线提升达10%(mAP@0.5)和11.4%(mAP@0.5:0.95)。

链接: https://arxiv.org/abs/2509.18919
作者: Chuni Liu,Hongjie Li,Jiaqi Du,Yangyang Hou,Qian Sun,Lei Jin,Ke Xu
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The pretraining-finetuning paradigm is a crucial strategy in metallic surface defect detection for mitigating the challenges posed by data scarcity. However, its implementation presents a critical dilemma. Pretraining on natural image datasets such as ImageNet, faces a significant domain gap. Meanwhile, naive self-supervised pretraining on in-domain industrial data is often ineffective due to the inability of existing learning objectives to distinguish subtle defect patterns from complex background noise and textures. To resolve this, we introduce Anomaly-Guided Self-Supervised Pretraining (AGSSP), a novel paradigm that explicitly guides representation learning through anomaly priors. AGSSP employs a two-stage framework: (1) it first pretrains the model’s backbone by distilling knowledge from anomaly maps, encouraging the network to capture defect-salient features; (2) it then pretrains the detector using pseudo-defect boxes derived from these maps, aligning it with localization tasks. To enable this, we develop a knowledge-enhanced method to generate high-quality anomaly maps and collect a large-scale industrial dataset of 120,000 images. Additionally, we present two small-scale, pixel-level labeled metallic surface defect datasets for validation. Extensive experiments demonstrate that AGSSP consistently enhances performance across various settings, achieving up to a 10% improvement in mAP@0.5 and 11.4% in mAP@0.5:0.95 compared to ImageNet-based models. All code, pretrained models, and datasets are publicly available at this https URL.
zh

[CV-41] LiDAR Point Cloud Image-based Generation Using Denoising Diffusion Probabilistic Models

【速读】:该论文旨在解决自动驾驶车辆(AVs)在复杂环境条件下因真实LiDAR数据采集困难、噪声大及稀疏性问题而导致感知性能下降的挑战。其核心解决方案是采用改进的去噪扩散概率模型(denoising diffusion probabilistic model, DDPM),通过引入新颖的噪声调度策略和时间步嵌入技术,提升合成点云数据的质量与多样性,从而增强模型对稀疏或噪声LiDAR数据的鲁棒性,并改善计算机视觉任务中的感知性能。关键创新在于优化了去噪过程和时序感知能力,使生成的点云更符合实际场景的空间结构和几何关系。

链接: https://arxiv.org/abs/2509.18917
作者: Amirhesam Aghanouri,Cristina Olaverri-Monreal
机构: Johannes Kepler University Linz, Austria (约翰·开普勒林茨大学); Department Intelligent Transport Systems (智能交通系统系)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Autonomous vehicles (AVs) are expected to revolutionize transportation by improving efficiency and safety. Their success relies on 3D vision systems that effectively sense the environment and detect traffic agents. Among sensors AVs use to create a comprehensive view of surroundings, LiDAR provides high-resolution depth data enabling accurate object detection, safe navigation, and collision avoidance. However, collecting real-world LiDAR data is time-consuming and often affected by noise and sparsity due to adverse weather or sensor limitations. This work applies a denoising diffusion probabilistic model (DDPM), enhanced with novel noise scheduling and time-step embedding techniques to generate high-quality synthetic data for augmentation, thereby improving performance across a range of computer vision tasks, particularly in AV perception. These modifications impact the denoising process and the model’s temporal awareness, allowing it to produce more realistic point clouds based on the projection. The proposed method was extensively evaluated under various configurations using the IAMCV and KITTI-360 datasets, with four performance metrics compared against state-of-the-art (SOTA) methods. The results demonstrate the model’s superior performance over most existing baselines and its effectiveness in mitigating the effects of noisy and sparse LiDAR data, producing diverse point clouds with rich spatial relationships and structural detail.
zh

[CV-42] xAI-CV: An Overview of Explainable Artificial Intelligence in Computer Vision

【速读】:该论文旨在解决深度学习在图像分析任务中因模型决策过程缺乏可解释性而引发的可靠性问题(即“黑箱”问题),尤其是在关键应用场景下人类难以理解AI如何做出决策。其解决方案的关键在于系统梳理和评述四种代表性可解释人工智能(Explainable AI, xAI)方法:显著性图(Saliency Maps)、概念瓶颈模型(Concept Bottleneck Models, CBM)、基于原型的方法(Prototype-based methods)以及混合方法,深入分析它们的内在机制、优势与局限性,并总结相应的评估指标,从而为未来研究与应用提供全面指导。

链接: https://arxiv.org/abs/2509.18913
作者: Nguyen Van Tu,Pham Nguyen Hai Long,Vo Hoai Viet
机构: University of Science, Ho Chi Minh City, Vietnam (胡志明市科学大学); Vietnam National University, Ho Chi Minh City, Vietnam (胡志明市国家大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has become the de facto standard and dominant paradigm in image analysis tasks, achieving state-of-the-art performance. However, this approach often results in “black-box” models, whose decision-making processes are difficult to interpret, raising concerns about reliability in critical applications. To address this challenge and provide human a method to understand how AI model process and make decision, the field of xAI has emerged. This paper surveys four representative approaches in xAI for visual perception tasks: (i) Saliency Maps, (ii) Concept Bottleneck Models (CBM), (iii) Prototype-based methods, and (iv) Hybrid approaches. We analyze their underlying mechanisms, strengths and limitations, as well as evaluation metrics, thereby providing a comprehensive overview to guide future research and applications.
zh

[CV-43] Frequency-Domain Decomposition and Recomposition for Robust Audio-Visual Segmentation

【速读】:该论文旨在解决音频-视觉分割(Audio-Visual Segmentation, AVS)中忽视音视频模态在频域上固有差异的问题,即音频高频信号普遍存在干扰噪声,而视觉高频信号则富含结构细节,现有方法未充分考虑这一频率域矛盾,导致性能受限。解决方案的关键在于提出一种全新的频率感知音频-视觉分割(Frequency-Aware Audio-Visual Segmentation, FAVS)框架,其核心创新为两个模块:一是基于残差迭代的频域增强分解器(Frequency-Domain Enhanced Decomposer, FDED),用于分离模态特异性语义与结构特征;二是协同跨模态一致性模块(Synergistic Cross-Modal Consistency, SCMC),通过专家混合(mixture-of-experts)架构和动态专家路由机制,强化语义一致性并保留模态特异性特征。

链接: https://arxiv.org/abs/2509.18912
作者: Yunzhe Shen,Kai Peng,Leiye Liu,Wei Ji,Jingjing Li,Miao Zhang,Yongri Piao,Huchuan Lu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Audio-visual segmentation (AVS) plays a critical role in multimodal machine learning by effectively integrating audio and visual cues to precisely segment objects or regions within visual scenes. Recent AVS methods have demonstrated significant improvements. However, they overlook the inherent frequency-domain contradictions between audio and visual modalities–the pervasively interfering noise in audio high-frequency signals vs. the structurally rich details in visual high-frequency signals. Ignoring these differences can result in suboptimal performance. In this paper, we rethink the AVS task from a deeper perspective by reformulating AVS task as a frequency-domain decomposition and recomposition problem. To this end, we introduce a novel Frequency-Aware Audio-Visual Segmentation (FAVS) framework consisting of two key modules: Frequency-Domain Enhanced Decomposer (FDED) module and Synergistic Cross-Modal Consistency (SCMC) module. FDED module employs a residual-based iterative frequency decomposition to discriminate modality-specific semantics and structural features, and SCMC module leverages a mixture-of-experts architecture to reinforce semantic consistency and modality-specific feature preservation through dynamic expert routing. Extensive experiments demonstrate that our FAVS framework achieves state-of-the-art performance on three benchmark datasets, and abundant qualitative visualizations further verify the effectiveness of the proposed FDED and SCMC modules. The code will be released as open source upon acceptance of the paper.
zh

[CV-44] MoiréNet: A Compact Dual-Domain Network for Image Demoiréing

【速读】:该论文旨在解决数字图像去摩尔纹(demoiréing)问题,即由显示像素阵列与相机传感器网格之间的频谱混叠(spectral aliasing)所引发的各向异性、多尺度伪影。其解决方案的核心是提出MoiréNet框架,该框架基于U-Net结构并融合频域与空域特征;关键创新在于引入两个组件:方向差异卷积(Directional Difference Convolution)构建的方向频域-空域编码器(DFSE),用于识别摩尔纹方向;以及频域-空域自适应选择器(FSAS),实现特征自适应的精准抑制,从而在仅5.513M参数下达到优于现有方法的去噪效果,且显著降低计算资源消耗。

链接: https://arxiv.org/abs/2509.18910
作者: Shuwei Guo,Simin Luan,Yan Ke,Zeyd Boukhers,John See,Cong Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Moiré patterns arise from spectral aliasing between display pixel lattices and camera sensor grids, manifesting as anisotropic, multi-scale artifacts that pose significant challenges for digital image demoiréing. We propose MoiréNet, a convolutional neural U-Net-based framework that synergistically integrates frequency and spatial domain features for effective artifact removal. MoiréNet introduces two key components: a Directional Frequency-Spatial Encoder (DFSE) that discerns moiré orientation via directional difference convolution, and a Frequency-Spatial Adaptive Selector (FSAS) that enables precise, feature-adaptive suppression. Extensive experiments demonstrate that MoiréNet achieves state-of-the-art performance on public and actively used datasets while being highly parameter-efficient. With only 5.513M parameters, representing a 48% reduction compared to ESDNet-L, MoiréNet combines superior restoration quality with parameter efficiency, making it well-suited for resource-constrained applications including smartphone photography, industrial imaging, and augmented reality.
zh

[CV-45] DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

【速读】:该论文旨在解决运动模糊条件下三维场景重建与渲染的难题,特别是针对传统结构光恢复(Structure-from-Motion, SfM)方法因相机位姿估计误差导致初始点云精度不足的问题。其关键解决方案在于:首先,利用预训练的密集立体匹配模块(DUSt3R)直接从模糊图像中获取高精度初始点云,跳过SfM中相机位姿计算环节以避免误差累积;其次,引入事件流(event stream)作为细粒度动态信息源,通过解码事件流与模糊图像联合生成潜在清晰图像,为场景重建提供精细化监督信号,从而显著提升重建质量与渲染效率。

链接: https://arxiv.org/abs/2509.18898
作者: Pengteng Li,Yunfan Lu,Pinhao Song,Weiyu Guo,Huizai Yao,F. Richard Yu,Hui Xiong
机构: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); KU Leuven (鲁汶大学); Carleton University (卡尔顿大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method via event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Without calculating camera poses as an intermediate result, we avoid the cumulative errors transfer from inaccurate camera poses to the initial point clouds’ positions. Second, we introduce the event stream into the deblur pipeline for its high sensitivity to dynamic change. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to the SOTAs in deblur 3D-GS.
zh

[CV-46] RS3DBench: A Comprehensive Benchmark for 3D Spatial Perception in Remote Sensing

【速读】:该论文旨在解决当前遥感图像中3D视觉理解模型发展受限的问题,特别是现有数据集普遍存在深度信息不完整或深度图与遥感图像之间对齐精度不足的缺陷。解决方案的关键在于构建一个名为RS3DBench的新基准数据集,该数据集包含54,951对遥感图像与像素级对齐的深度图,并配有文本描述,覆盖广泛的地理场景,从而为训练和评估遥感图像中的3D视觉感知模型提供高质量的数据支撑。此外,研究还提出了一种基于稳定扩散(Stable Diffusion)的遥感深度估计模型,利用其多模态融合能力,在所提数据集上实现了最先进的性能表现。

链接: https://arxiv.org/abs/2509.18897
作者: Jiayu Wang,Ruizhi Wang,Jie Song,Haofei Zhang,Mingli Song,Zunlei Feng,Li Sun
机构: Zhejiang University (浙江大学); Hangzhou City University (杭州城市学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 26 pages, 4 figures

点击查看摘要

Abstract:In this paper, we introduce a novel benchmark designed to propel the advancement of general-purpose, large-scale 3D vision models for remote sensing imagery. While several datasets have been proposed within the realm of remote sensing, many existing collections either lack comprehensive depth information or fail to establish precise alignment between depth data and remote sensing images. To address this deficiency, we present a visual Benchmark for 3D understanding of Remotely Sensed images, dubbed RS3DBench. This dataset encompasses 54,951 pairs of remote sensing images and pixel-level aligned depth maps, accompanied by corresponding textual descriptions, spanning a broad array of geographical contexts. It serves as a tool for training and assessing 3D visual perception models within remote sensing image spatial understanding tasks. Furthermore, we introduce a remotely sensed depth estimation model derived from stable diffusion, harnessing its multimodal fusion capabilities, thereby delivering state-of-the-art performance on our dataset. Our endeavor seeks to make a profound contribution to the evolution of 3D visual perception models and the advancement of geographic artificial intelligence within the remote sensing domain. The dataset, models and code will be accessed on the this https URL.
zh

[CV-47] SmartWilds: Multimodal Wildlife Monitoring Dataset

【速读】:该论文旨在解决野生动物监测中单一模态数据局限性导致的环境感知不全面问题,特别是在濒危物种研究、保护生态学和栖息地管理中的应用瓶颈。解决方案的关键在于构建首个多模态野生动物监测数据集SmartWilds,通过同步采集无人机影像、相机陷阱照片与视频以及生物声学录音,在220英亩牧场内实现对多种物种(如戴维鹿、四川羚牛、普氏野马等)的多维度观测,验证了不同传感器模态在土地利用模式识别、物种检测、行为分析及栖息地监控中的互补优势,从而为基于生成式AI(Generative AI)的综合环境监测提供可复现的协议和开放数据支持。

链接: https://arxiv.org/abs/2509.18894
作者: Jenna Kline,Anirudh Potlapally,Bharath Pillai,Tanishka Wani,Rugved Katole,Vedant Patil,Penelope Covey,Hari Subramoni,Tanya Berger-Wolf,Christopher Stewart
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages

点击查看摘要

Abstract:We present the first release of SmartWilds, a multimodal wildlife monitoring dataset. SmartWilds is a synchronized collection of drone imagery, camera trap photographs and videos, and bioacoustic recordings collected during summer 2025 at The Wilds safari park in Ohio. This dataset supports multimodal AI research for comprehensive environmental monitoring, addressing critical needs in endangered species research, conservation ecology, and habitat management. Our pilot deployment captured four days of synchronized monitoring across three modalities in a 220-acre pasture containing Pere David’s deer, Sichuan takin, Przewalski’s horses, as well as species native to Ohio, including bald eagles, white-tailed deer, and coyotes. We provide a comparative analysis of sensor modality performance, demonstrating complementary strengths for landuse patterns, species detection, behavioral analysis, and habitat monitoring. This work establishes reproducible protocols for multimodal wildlife monitoring while contributing open datasets to advance conservation computer vision research. Future releases will include synchronized GPS tracking data from tagged individuals, citizen science data, and expanded temporal coverage across multiple seasons.
zh

[CV-48] Attack for Defense: Adversarial Agents for Point Prompt Optimization Empowering Segment Anything Model

【速读】:该论文旨在解决生成式 AI(Generative AI)中基于提示(prompt)的分割模型——Segment Anything Model (SAM) 的性能受限于启发式或人工设计提示的问题,这些问题限制了模型的可扩展性和泛化能力。解决方案的关键在于提出 Point Prompt Defender,一个基于对抗强化学习(adversarial reinforcement learning)的框架,采用“攻击-防御”范式自动优化点提示:通过构建一个任务无关的双空间图环境(将图像块表示为节点,边编码物理与语义距离),训练攻击者代理识别并激活使 SAM 分割性能下降的提示子集,同时训练防守者代理抑制这些干扰提示以恢复准确率;二者均使用深度 Q 网络(Deep Q-Networks)进行训练,奖励信号基于分割质量的变化。推理阶段仅部署防守者代理,即可对任意粗略提示集进行精炼,从而在不重新训练的情况下显著提升 SAM 在多种任务中的鲁棒性和泛化性能。

链接: https://arxiv.org/abs/2509.18891
作者: Xueyu Liu,Xiaoyi Zhang,Guangze Shi,Meilin Liu,Yexin Lai,Yongfei Wu,Mingqiang Wei
机构: Taiyuan University of Technology (太原理工大学); Nanjing University Of Aeronautics And Astronautics (南京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Prompt quality plays a critical role in the performance of the Segment Anything Model (SAM), yet existing approaches often rely on heuristic or manually crafted prompts, limiting scalability and generalization. In this paper, we propose Point Prompt Defender, an adversarial reinforcement learning framework that adopts an attack-for-defense paradigm to automatically optimize point prompts. We construct a task-agnostic point prompt environment by representing image patches as nodes in a dual-space graph, where edges encode both physical and semantic distances. Within this environment, an attacker agent learns to activate a subset of prompts that maximally degrade SAM’s segmentation performance, while a defender agent learns to suppress these disruptive prompts and restore accuracy. Both agents are trained using Deep Q-Networks with a reward signal based on segmentation quality variation. During inference, only the defender is deployed to refine arbitrary coarse prompt sets, enabling enhanced SAM segmentation performance across diverse tasks without retraining. Extensive experiments show that Point Prompt Defender effectively improves SAM’s robustness and generalization, establishing a flexible, interpretable, and plug-and-play framework for prompt-based segmentation.
zh

[CV-49] ViG-LRGC: Vision Graph Neural Networks with Learnable Reparameterized Graph Construction

【速读】:该论文旨在解决视觉图神经网络(Vision Graph Neural Networks, ViG)中图结构构建的局限性问题,即现有方法依赖非参数化、不可学习的统计手段(如k-NN、超图或相似度阈值法)来构造节点间的连接关系,难以自适应地为每个节点选择最优邻域,且需手动调参。其解决方案的关键在于提出可学习的重参数化图构造(Learnable Reparameterized Graph Construction, LRGC),该方法通过节点对之间的键-查询注意力机制计算边权重,并引入软阈值重参数化进行可微分的边选择,从而实现无需超参数的端到端训练,同时使每层的阈值可根据数据自适应调整,显著提升了模型表达能力和性能。

链接: https://arxiv.org/abs/2509.18840
作者: Ismael Elsharkawi,Hossam Sharara,Ahmed Rafea
机构: The American University in Cairo (美国大学开罗分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under Review

点击查看摘要

Abstract:Image Representation Learning is an important problem in Computer Vision. Traditionally, images were processed as grids, using Convolutional Neural Networks or as a sequence of visual tokens, using Vision Transformers. Recently, Vision Graph Neural Networks (ViG) have proposed the treatment of images as a graph of nodes; which provides a more intuitive image representation. The challenge is to construct a graph of nodes in each layer that best represents the relations between nodes and does not need a hyper-parameter search. ViG models in the literature depend on non-parameterized and non-learnable statistical methods that operate on the latent features of nodes to create a graph. This might not select the best neighborhood for each node. Starting from k-NN graph construction to HyperGraph Construction and Similarity-Thresholded graph construction, these methods lack the ability to provide a learnable hyper-parameter-free graph construction method. To overcome those challenges, we present the Learnable Reparameterized Graph Construction (LRGC) for Vision Graph Neural Networks. LRGC applies key-query attention between every pair of nodes; then uses soft-threshold reparameterization for edge selection, which allows the use of a differentiable mathematical model for training. Using learnable parameters to select the neighborhood removes the bias that is induced by any clustering or thresholding methods previously introduced in the literature. In addition, LRGC allows tuning the threshold in each layer to the training data since the thresholds are learnable through training and are not provided as hyper-parameters to the model. We demonstrate that the proposed ViG-LRGC approach outperforms state-of-the-art ViG models of similar sizes on the ImageNet-1k benchmark dataset.
zh

[CV-50] Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

【速读】:该论文旨在解决如何利用通用多模态大语言模型(Multimodal Large Language Models, MLLMs)和视觉语言模型(Vision Language Models, VLMs)完成基督教圣像学(Christian Iconography)的单标签图像分类任务,以评估这些模型是否能够替代传统监督分类器对宗教图像内容进行准确识别。其解决方案的关键在于:首先,通过构建包含Iconclass本体支持的三个数据集(ArtDL、ICONCLASS 和 Wikidata),在三种输入条件下测试模型性能——仅使用类别标签、结合Iconclass描述文本以及引入五样本少样本示例;其次,发现Gemini-2.5 Pro与GPT-4o在多数场景下优于微调后的ResNet50基线模型,且添加类描述可显著提升零样本分类准确性,表明提示工程(prompt engineering)是提升模型在文化资产领域理解能力的核心策略。

链接: https://arxiv.org/abs/2509.18839
作者: Gianmarco Spinaci(1 and 2),Lukas Klic(2),Giovanni Colavizza(1 and 3) ((1) Department of Classical Philology and Italian Studies, University of Bologna, Italy, (2) Villa i Tatti, The Harvard University Center for Italian Renaissance Studies, Florence, Italy, (3) Department of Communication, University of Copenhagen, Denmark)
机构: University of Bologna (博洛尼亚大学); Harvard University (哈佛大学); Villa i Tatti (维拉·伊·塔蒂); University of Copenhagen (哥本哈根大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret the Iconography, typically addressed by supervised classifiers, and evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? And (RQ2), how does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where Siglip reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.
zh

[CV-51] xt Slider: Efficient and Plug-and-Play Continuous Concept Control for Image/Video Synthesis via LoRA Adapters

【速读】:该论文旨在解决当前基于扩散模型的图像与视频生成中,概念控制方法存在的训练效率低、显存消耗大以及缺乏跨模型通用性的问题。现有方法如Concept Slider和Attribute Control需对每个扩散主干网络重新训练,且依赖大量可学习参数和计算资源,限制了其扩展性和灵活性。解决方案的关键在于提出Text Slider框架,通过识别预训练文本编码器中的低秩方向(low-rank directions),实现无需重训练即可对视觉概念进行连续、细粒度的控制,从而大幅降低训练时间(比Concept Slider快5倍,比Attribute Control快47倍)、显存占用(减少近2倍和4倍)及可训练参数数量,同时支持多概念组合与连续调节,保持输入空间结构不变。

链接: https://arxiv.org/abs/2509.18831
作者: Pin-Yen Chiu,I-Sheng Fang,Jun-Cheng Chen
机构: Research Center for Information Technology Innovation, Academia Sinica (中央研究院資訊科技創新研究中心)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Multimedia (cs.MM)
备注:

点击查看摘要

Abstract:Recent advances in diffusion models have significantly improved image and video synthesis. In addition, several concept control methods have been proposed to enable fine-grained, continuous, and flexible control over free-form text prompts. However, these methods not only require intensive training time and GPU memory usage to learn the sliders or embeddings but also need to be retrained for different diffusion backbones, limiting their scalability and adaptability. To address these limitations, we introduce Text Slider, a lightweight, efficient and plug-and-play framework that identifies low-rank directions within a pre-trained text encoder, enabling continuous control of visual concepts while significantly reducing training time, GPU memory consumption, and the number of trainable parameters. Furthermore, Text Slider supports multi-concept composition and continuous control, enabling fine-grained and flexible manipulation in both image and video synthesis. We show that Text Slider enables smooth and continuous modulation of specific attributes while preserving the original spatial layout and structure of the input. Text Slider achieves significantly better efficiency: 5 \times faster training than Concept Slider and 47 \times faster than Attribute Control, while reducing GPU memory usage by nearly 2 \times and 4 \times , respectively.
zh

[CV-52] DexSkin: High-Coverag e Conformable Robotic Skin for Learning Contact-Rich Manipulation

【速读】:该论文旨在解决机器人灵巧操作中 tactile sensing(触觉感知)能力的复现难题,尤其是如何在复杂曲面和大范围内实现敏感、局部化且可校准的触觉感知。解决方案的关键在于提出 DexSkin——一种软性、贴合性强的电容式电子皮肤,能够适应不同几何形状,并覆盖几乎整个夹爪手指表面,从而提供高密度的触觉信息;同时通过学习从示范中获取策略并支持跨传感器实例的模型迁移,显著提升了数据驱动方法在真实机器人上的适用性和泛化能力。

链接: https://arxiv.org/abs/2509.18830
作者: Suzannah Wistreich,Baiyu Shi,Stephen Tian,Samuel Clarke,Michael Nath,Chengyi Xu,Zhenan Bao,Jiajun Wu
机构: Stanford University (斯坦福大学); University of Alabama at Birmingham (阿拉巴马大学伯明翰分校)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted to CoRL 2025

点击查看摘要

Abstract:Human skin provides a rich tactile sensing stream, localizing intentional and unintentional contact events over a large and contoured region. Replicating these tactile sensing capabilities for dexterous robotic manipulation systems remains a longstanding challenge. In this work, we take a step towards this goal by introducing DexSkin. DexSkin is a soft, conformable capacitive electronic skin that enables sensitive, localized, and calibratable tactile sensing, and can be tailored to varying geometries. We demonstrate its efficacy for learning downstream robotic manipulation by sensorizing a pair of parallel jaw gripper fingers, providing tactile coverage across almost the entire finger surfaces. We empirically evaluate DexSkin’s capabilities in learning challenging manipulation tasks that require sensing coverage across the entire surface of the fingers, such as reorienting objects in hand and wrapping elastic bands around boxes, in a learning-from-demonstration framework. We then show that, critically for data-driven approaches, DexSkin can be calibrated to enable model transfer across sensor instances, and demonstrate its applicability to online reinforcement learning on real robots. Our results highlight DexSkin’s suitability and practicality for learning real-world, contact-rich manipulation. Please see our project webpage for videos and visualizations: this https URL.
zh

[CV-53] Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

【速读】:该论文旨在解决统一多模态模型在处理复杂交错的多模态输入时,由于扩散去噪(diffusion denoising)和自回归解码(autoregressive decoding)迭代过程带来的显著计算开销问题。解决方案的关键在于提出Hyper-Bagel加速框架,采用分而治之策略:通过推测解码(speculative decoding)加速下一token预测,并结合多阶段蒸馏(multi-stage distillation)优化扩散去噪过程;同时引入对抗蒸馏与人类反馈学习(human feedback learning),实现高保真、低延迟的多模态理解与生成任务加速,在保持原模型输出质量的前提下,实现超过2倍的多模态理解加速,以及16.67倍(文本到图像生成)和22倍(图像编辑)的生成加速。

链接: https://arxiv.org/abs/2509.18824
作者: Yanzuo Lu,Xin Xia,Manlin Zhang,Huafeng Kuang,Jianbin Zheng,Yuxi Ren,Xuefeng Xiao
机构: ByteDance Seed (字节跳动种子项目)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
zh

[CV-54] Surgical Video Understanding with Label Interpolation

【速读】:该论文旨在解决机器人辅助手术(Robot-assisted Surgery, RAS)中视觉数据理解的挑战,特别是由于手术场景具有复杂的时序动态性和多样化的器械交互行为,导致单一任务方法难以实现全面理解;同时,多任务学习(Multi-task Learning, MTL)因像素级分割标注数据稀缺而受限,尤其是长期标注(如阶段和步骤)覆盖所有帧,而短期标注(如器械分割和动作检测)仅存在于关键帧,造成显著的时间-空间不平衡问题。解决方案的关键在于提出一种结合光流驱动的分割标签插值与多任务学习的新框架:利用关键帧间的光流估计将标注标签传播至相邻未标注帧,从而丰富稀疏的空间监督信号,并在训练中平衡时空信息,提升手术场景理解的准确性与效率,进而增强RAS系统的实用性。

链接: https://arxiv.org/abs/2509.18802
作者: Garam Kim,Tae Kyeong Jeong,Juyoun Park
机构: Korea Institute of Science and Technology (韩国科学技术院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 10 figures

点击查看摘要

Abstract:Robot-assisted surgery (RAS) has become a critical paradigm in modern surgery, promoting patient recovery and reducing the burden on surgeons through minimally invasive approaches. To fully realize its potential, however, a precise understanding of the visual data generated during surgical procedures is essential. Previous studies have predominantly focused on single-task approaches, but real surgical scenes involve complex temporal dynamics and diverse instrument interactions that limit comprehensive understanding. Moreover, the effective application of multi-task learning (MTL) requires sufficient pixel-level segmentation data, which are difficult to obtain due to the high cost and expertise required for annotation. In particular, long-term annotations such as phases and steps are available for every frame, whereas short-term annotations such as surgical instrument segmentation and action detection are provided only for key frames, resulting in a significant temporal-spatial imbalance. To address these challenges, we propose a novel framework that combines optical flow-based segmentation label interpolation with multi-task learning. optical flow estimated from annotated key frames is used to propagate labels to adjacent unlabeled frames, thereby enriching sparse spatial supervision and balancing temporal and spatial information for training. This integration improves both the accuracy and efficiency of surgical scene understanding and, in turn, enhances the utility of RAS.
zh

[CV-55] A Kernel Space-based Multidimensional Sparse Model for Dynamic PET Image Denoising

【速读】:该论文旨在解决动态正电子发射断层成像(dynamic positron emission tomography, dynamic PET)中短时间帧因统计量有限而导致图像质量下降的问题。其解决方案的关键在于构建一种基于核空间的多维稀疏(kernel space-based multidimensional sparse, KMDS)模型,利用动态PET图像在帧间空间相关性和帧内结构一致性特征,建立先验约束;随后将参数估计过程由传统方法替换为神经网络,实现自适应参数优化,从而形成端到端的神经KMDS-Net模型,显著提升了去噪性能并支持高时空分辨率重建。

链接: https://arxiv.org/abs/2509.18801
作者: Kuang Xiaodong,Li Bingxuan,Li Yuan,Rao Fan,Ma Gege,Xie Qingguo,Mok Greta S P,Liu Huafeng,Zhu Wentao
机构: University of Macau (澳门大学); University of Science and Technology of China (中国科学技术大学); Zhejiang University (浙江大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Achieving high image quality for temporal frames in dynamic positron emission tomography (PET) is challenging due to the limited statistic especially for the short frames. Recent studies have shown that deep learning (DL) is useful in a wide range of medical image denoising tasks. In this paper, we propose a model-based neural network for dynamic PET image denoising. The inter-frame spatial correlation and intra-frame structural consistency in dynamic PET are used to establish the kernel space-based multidimensional sparse (KMDS) model. We then substitute the inherent forms of the parameter estimation with neural networks to enable adaptive parameters optimization, forming the end-to-end neural KMDS-Net. Extensive experimental results from simulated and real data demonstrate that the neural KMDS-Net exhibits strong denoising performance for dynamic PET, outperforming previous baseline methods. The proposed method may be used to effectively achieve high temporal and spatial resolution for dynamic PET. Our source code is available at this https URL.
zh

[CV-56] owards Application Aligned Synthetic Surgical Image Synthesis

【速读】:该论文旨在解决医学手术场景中标注数据稀缺的问题,这限制了深度学习系统在计算机辅助干预中的发展。传统扩散模型虽能生成逼真图像,但常因数据记忆现象产生不一致或缺乏多样性的样本,反而损害下游任务性能。其解决方案的关键在于提出一种名为Surgical Application-Aligned Diffusion (SAADi) 的新框架,通过构建“被下游模型偏好”与“非偏好”的合成图像对,并对扩散模型进行轻量级微调,显式地将图像生成过程与下游任务目标对齐,从而实现任务感知的对齐机制,有效缓解数据稀缺问题并提升手术视觉任务的性能。

链接: https://arxiv.org/abs/2509.18796
作者: Danush Kumar Venkatesh,Stefanie Speidel
机构: Department of Translational Surgical Oncology, NCT/UCC Dresden, a partnership between DKFZ, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden, HZDR, Germany; Department of Translational Surgical Oncology, NCT/UCC Dresden, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden Germany; The Centre for Tactile Internet with Human-in-the-Loop (CeTI), TUD Dresden
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The scarcity of annotated surgical data poses a significant challenge for developing deep learning systems in computer-assisted interventions. While diffusion models can synthesize realistic images, they often suffer from data memorization, resulting in inconsistent or non-diverse samples that may fail to improve, or even harm, downstream performance. We introduce \emphSurgical Application-Aligned Diffusion (SAADi), a new framework that aligns diffusion models with samples preferred by downstream models. Our method constructs pairs of \emphpreferred and \emphnon-preferred synthetic images and employs lightweight fine-tuning of diffusion models to align the image generation process with downstream objectives explicitly. Experiments on three surgical datasets demonstrate consistent gains of 7 – 9% in classification and 2 – 10% in segmentation tasks, with the considerable improvements observed for underrepresented classes. Iterative refinement of synthetic samples further boosts performance by 4 – 10% . Unlike baseline approaches, our method overcomes sample degradation and establishes task-aware alignment as a key principle for mitigating data scarcity and advancing surgical vision applications.
zh

[CV-57] Human-Interpretable Uncertainty Explanations for Point Cloud Registration

【速读】:该论文旨在解决点云配准(point cloud registration)问题中因传感器噪声、位姿估计误差以及遮挡导致的部分重叠所引发的不确定性,传统方法如ICP(Iterative Closest Point)在上述条件下性能显著下降。其解决方案的关键在于提出了一种新颖的高斯过程概念归因(Gaussian Process Concept Attribution, GP-CA)方法,该方法不仅能量化配准过程中的不确定性,还能通过归因分析将不确定性解释为已知误差源(如传感器噪声、位姿估计误差等),并借助主动学习机制在真实场景中发现新的不确定性来源。实验表明,GP-CA在运行效率、样本高效性(尤其结合主动学习)和精度方面均优于现有最先进方法,并在实际机器人实验中验证了其鲁棒性与可应用性。

链接: https://arxiv.org/abs/2509.18786
作者: Johannes A. Gaus,Loris Schneider,Yitian Shi,Jongseok Lee,Rania Rayyes,Rudolph Triebel
机构: University of Tübingen (图宾根大学); University of Stuttgart (斯图加特大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:In this paper, we address the point cloud registration problem, where well-known methods like ICP fail under uncertainty arising from sensor noise, pose-estimation errors, and partial overlap due to occlusion. We develop a novel approach, Gaussian Process Concept Attribution (GP-CA), which not only quantifies registration uncertainty but also explains it by attributing uncertainty to well-known sources of errors in registration problems. Our approach leverages active learning to discover new uncertainty sources in the wild by querying informative instances. We validate GP-CA on three publicly available datasets and in our real-world robot experiment. Extensive ablations substantiate our design choices. Our approach outperforms other state-of-the-art methods in terms of runtime, high sample-efficiency with active learning, and high accuracy. Our real-world experiment clearly demonstrates its applicability. Our video also demonstrates that GP-CA enables effective failure-recovery behaviors, yielding more robust robotic perception.
zh

[CV-58] Real-time Deer Detection and Warning in Connected Vehicles via Thermal Sensing and Deep Learning

【速读】:该论文旨在解决美国每年约210万起鹿车碰撞事故带来的交通安全与生态问题,此类事故造成近440人死亡、5.9万人受伤及100亿美元经济损失。其解决方案的关键在于构建一个融合热成像(thermal imaging)、深度学习和车联网通信(vehicle-to-everything, V2X)的实时检测与预警系统。该系统基于超过12,000张热成像鹿图像训练并验证,在复杂天气条件下仍保持88–92%的检测准确率(远超可见光摄像头<60%的性能),并通过CV2X通信实现高概率目标触发下的车辆间信息共享,最终实现从检测到驾驶员警报的端到端延迟低于100毫秒,为减少鹿车碰撞提供了可行的技术路径。

链接: https://arxiv.org/abs/2509.18779
作者: Hemanth Puppala,Wayne Sarasua,Srinivas Biyaguda,Farhad Farzinpour,Mashrur Chowdhury
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Preprint under review in TRR, 20 pages, 9 figures, 4 tables

点击查看摘要

Abstract:Deer-vehicle collisions represent a critical safety challenge in the United States, causing nearly 2.1 million incidents annually and resulting in approximately 440 fatalities, 59,000 injuries, and 10 billion USD in economic damages. These collisions also contribute significantly to declining deer populations. This paper presents a real-time detection and driver warning system that integrates thermal imaging, deep learning, and vehicle-to-everything communication to help mitigate deer-vehicle collisions. Our system was trained and validated on a custom dataset of over 12,000 thermal deer images collected in Mars Hill, North Carolina. Experimental evaluation demonstrates exceptional performance with 98.84 percent mean average precision, 95.44 percent precision, and 95.96 percent recall. The system was field tested during a follow-up visit to Mars Hill and readily sensed deer providing the driver with advanced warning. Field testing validates robust operation across diverse weather conditions, with thermal imaging maintaining between 88 and 92 percent detection accuracy in challenging scenarios where conventional visible light based cameras achieve less than 60 percent effectiveness. When a high probability threshold is reached sensor data sharing messages are broadcast to surrounding vehicles and roadside units via cellular vehicle to everything (CV2X) communication devices. Overall, our system achieves end to end latency consistently under 100 milliseconds from detection to driver alert. This research establishes a viable technological pathway for reducing deer-vehicle collisions through thermal imaging and connected vehicles.
zh

[CV-59] DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision

【速读】:该论文旨在解决自监督学习(Self-supervised Learning, SSL)在医学图像表征学习中因复杂架构、解剖先验依赖或高度调优的增强策略而导致的可扩展性和泛化能力不足的问题,尤其关注在胸部X光等解剖结构相似度高、病灶细微的模态下模型易陷入捷径学习(shortcut learning)的缺陷。其解决方案的关键在于提出DiSSECT框架,通过将多尺度向量量化(multi-scale vector quantization)引入SSL流程,构建离散的表征瓶颈(discrete representational bottleneck),从而强制模型学习重复性强且结构感知的特征,抑制视图特异性或低效模式,提升跨任务与跨域的表征迁移能力。

链接: https://arxiv.org/abs/2509.18765
作者: Azad Singh,Deepak Mishra
机构: Indian Institute of Technology Jodhpur (印度理工学院乔德普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT – Discrete Self-Supervision for Efficient Clinical Transferable Representations, a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.
zh

[CV-60] Bi-VLM: Pushing Ultra-Low Precision Post-Training Quantization Boundaries in Vision-Language Models

【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在硬件资源受限环境下因高计算需求和内存占用而难以部署的问题,尤其关注如何在超低比特权重精度(bitwidth ≤ 2 bits)下实现高效压缩与保持性能。其解决方案的关键在于提出Bi-VLM方法,通过基于高斯分位数的非均匀权重划分策略,将模型权重分为“异常值”(outlier,即显著权重)和多个“内点”(inlier,即非显著权重)子集,确保每类子集包含与其分位数比例一致的权重;进一步设计了一种感知显著性的混合量化算法,根据权重显著性指标和压缩目标对缩放矩阵与二值矩阵施加差异化约束,从而在极低精度下实现性能最优。

链接: https://arxiv.org/abs/2509.18763
作者: Xijun Wang,Junyun Huang,Rayyan Abdalla,Chengyuan Zhang,Ruiqi Xian,Dinesh Manocha
机构: University of Maryland, College Park, USA(马里兰大学学院市分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We address the critical gap between the computational demands of vision-language models and the possible ultra-low-bit weight precision (bitwidth \leq2 bits) we can use for higher efficiency. Our work is motivated by the substantial computational cost and memory requirements of VLMs, which restrict their applicability in hardware-constrained environments. We propose Bi-VLM, which separates model weights non-uniformly based on the Gaussian quantiles. Our formulation groups the model weights into outlier (salient) and multiple inlier (unsalient) subsets, ensuring that each subset contains a proportion of weights corresponding to its quantile in the distribution. We propose a saliency-aware hybrid quantization algorithm and use it to quantize weights by imposing different constraints on the scaler and binary matrices based on the saliency metric and compression objective. We have evaluated our approach on different VLMs. For the language model part of the VLM, our Bi-VLM outperforms the SOTA by 3%-47% on the visual question answering task in terms of four different benchmarks and three different models. For the overall VLM, our Bi-VLM outperforms the SOTA by 4%-45%. We also perform token pruning on the quantized models and observe that there is redundancy of image tokens 90% - 99% in the quantized models. This helps us to further prune the visual tokens to improve efficiency.
zh

[CV-61] FixingGS: Enhancing 3D Gaussian Splatting via Training-Free Score Distillation

【速读】:该论文旨在解决从稀疏视角重建3D场景时因视觉信息不足而导致的显著伪影问题,这些问题在现有3D高斯溅射(3D Gaussian Splatting, 3DGS)方法中普遍存在。为应对这一挑战,作者提出FixingGS,一种无需训练的方法,其核心在于利用预训练扩散模型(diffusion model)的能力来增强重建质量。关键创新是提出了一种蒸馏(distillation)策略,以生成更精确且跨视图一致的扩散先验(diffusion priors),从而实现有效的伪影去除与缺失内容修复;同时引入自适应渐进增强机制,在欠约束区域进一步优化重建结果。

链接: https://arxiv.org/abs/2509.18759
作者: Zhaorui Wang,Yi Gu,Deming Zhou,Renjing Xu
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recently, 3D Gaussian Splatting (3DGS) has demonstrated remarkable success in 3D reconstruction and novel view synthesis. However, reconstructing 3D scenes from sparse viewpoints remains highly challenging due to insufficient visual information, which results in noticeable artifacts persisting across the 3D representation. To address this limitation, recent methods have resorted to generative priors to remove artifacts and complete missing content in under-constrained areas. Despite their effectiveness, these approaches struggle to ensure multi-view consistency, resulting in blurred structures and implausible details. In this work, we propose FixingGS, a training-free method that fully exploits the capabilities of the existing diffusion model for sparse-view 3DGS reconstruction enhancement. At the core of FixingGS is our distillation approach, which delivers more accurate and cross-view coherent diffusion priors, thereby enabling effective artifact removal and inpainting. In addition, we propose an adaptive progressive enhancement scheme that further refines reconstructions in under-constrained regions. Extensive experiments demonstrate that FixingGS surpasses existing state-of-the-art methods with superior visual quality and reconstruction performance. Our code will be released publicly.
zh

[CV-62] COLT: Enhancing Video Large Language Models with Continual Tool Usage

【速读】:该论文旨在解决现有视频大语言模型(Video LLM)在面对持续演进的工具数据流时,难以有效学习新工具使用能力且易发生“灾难性遗忘”(catastrophic forgetting)的问题。解决方案的关键在于提出一种名为COLT(COntinuaL Tool usage)的框架,其核心创新是引入一个可学习的工具代码本(tool codebook)作为工具特异性记忆系统,并通过用户指令与代码本中工具特征之间的相似度动态选择相关工具,从而实现对连续工具流的自动获取和无遗忘学习。

链接: https://arxiv.org/abs/2509.18754
作者: Yuyang Liu,Xinyuan Shi,Bang Yang,Peilin Zhou,Jiahua Dong,Long Chen,Ian Reid,Xiaondan Liang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 16 pages

点击查看摘要

Abstract:The success of Large Language Models (LLMs) has significantly propelled the research of video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering ‘catastrophic forgetting’ of the past learned tools. Specifically, our COLT incorporates a learnable tool codebook as a tool-specific memory system. Then relevant tools are dynamically selected based on the similarity between user instruction and tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
zh

[CV-63] riFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing NEURIPS2025

【速读】:该论文旨在解决LiDAR点云在自动驾驶与机器人感知中因噪声、遮挡及对抗性扰动导致的脆弱性问题,尤其针对传统卷积神经网络(CNN)基线自动编码器在复杂现实场景下性能急剧下降的局限。其解决方案的关键在于提出TriFusion-AE,一种融合文本语义先验、多视角图像单目深度图与LiDAR点云的多模态交叉注意力自动编码器;通过语义(文本)、几何(深度)与空间结构(LiDAR)特征的对齐,学习具备鲁棒性的联合表示,从而显著提升在强对抗攻击和高噪声条件下的重建稳定性,且该框架具有模型无关性,可无缝集成至任意CNN基点云自动编码器中进行联合表征学习。

链接: https://arxiv.org/abs/2509.18743
作者: Susmit Neogi
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop

点击查看摘要

Abstract:LiDAR-based perception is central to autonomous driving and robotics, yet raw point clouds remain highly vulnerable to noise, occlusion, and adversarial corruptions. Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. In this work, we propose TriFusion-AE, a multimodal cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. By aligning semantic cues from text, geometric (depth) features from images, and spatial structure from LiDAR, TriFusion-AE learns representations that are resilient to stochastic noise and adversarial perturbations. Interestingly, while showing limited gains under mild perturbations, our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse. We evaluate on the nuScenes-mini dataset to reflect realistic low-data deployment scenarios. Our multimodal fusion framework is designed to be model-agnostic, enabling seamless integration with any CNN-based point cloud autoencoder for joint representation learning.
zh

[CV-64] HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection

【速读】:该论文旨在解决RGB-thermal显著目标检测(RGB-T SOD)中因模态特征融合不足和数据稀缺导致的边界不精确与目标不完整问题。解决方案的关键在于提出一种混合提示驱动的Segment Anything模型(HyPSAM),其核心包括两个创新模块:一是动态融合网络(DFNet),通过动态卷积和多分支解码实现自适应跨模态交互,生成高质量初始显著图作为视觉提示;二是即插即用细化网络(P2RNet),利用文本、掩码和框提示协同引导SAM进行精细化优化,从而提升检测精度与泛化能力。该方法在三个公开数据集上达到当前最优性能,并展现出良好的兼容性与可扩展性。

链接: https://arxiv.org/abs/2509.18738
作者: Ruichao Hou,Xingyuan Li,Tongwei Ren,Dongming Zhou,Gangshan Wu,Jinde Cao
机构: Nanjing University (南京大学); Yunnan University (云南大学); Southeast University (东南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:RGB-thermal salient object detection (RGB-T SOD) aims to identify prominent objects by integrating complementary information from RGB and thermal modalities. However, learning the precise boundaries and complete objects remains challenging due to the intrinsic insufficient feature fusion and the extrinsic limitations of data scarcity. In this paper, we propose a novel hybrid prompt-driven segment anything model (HyPSAM), which leverages the zero-shot generalization capabilities of the segment anything model (SAM) for RGB-T SOD. Specifically, we first propose a dynamic fusion network (DFNet) that generates high-quality initial saliency maps as visual prompts. DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction, overcoming the limitations of fixed-parameter kernels and enhancing multi-modal feature representation. Moreover, we propose a plug-and-play refinement network (P2RNet), which serves as a general optimization strategy to guide SAM in refining saliency maps by using hybrid prompts. The text prompt ensures reliable modality input, while the mask and box prompts enable precise salient object localization. Extensive experiments on three public datasets demonstrate that our method achieves state-of-the-art performance. Notably, HyPSAM has remarkable versatility, seamlessly integrating with different RGB-T SOD methods to achieve significant performance gains, thereby highlighting the potential of prompt engineering in this field. The code and results of our method are available at: this https URL.
zh

[CV-65] Knowledge Transfer from Interaction Learning ICCV2025

链接: https://arxiv.org/abs/2509.18733
作者: Yilin Gao,Kangyi Chen,Zhongxing Peng,Hengjie Lu,Shugong Xu
机构: Shanghai University (上海大学); Xi’an Jiaotong-Liverpool University (西安大略利物浦大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV2025

点击查看摘要

[CV-66] Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

链接: https://arxiv.org/abs/2509.18717
作者: Tong Zhang,Kuofeng Gao,Jiawang Bai,Leo Yu Zhang,Xin Yin,Zonghui Wang,Shouling Ji,Wenzhi Chen
机构: Zhejiang University (浙江大学); Tsinghua University (清华大学); Griffith University (格里菲斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注:

点击查看摘要

[CV-67] What Makes You Unique? Attribute Prompt Composition for Object Re-Identification

【速读】:该论文旨在解决目标重识别(Object Re-IDentification, ReID)模型在实际应用中面临的两个核心问题:一是单域模型容易过拟合特定域的特征,二是跨域模型依赖多样化的归一化策略,可能无意中抑制了身份相关的判别性特征。为此,作者提出 Attribute Prompt Composition (APC) 框架,其关键在于通过文本语义引导生成判别性强且泛化能力优的特征表示。具体而言,设计了一个 Attribute Prompt Generator (APG),包含 Semantic Attribute Dictionary (SAD) 和 Prompt Composition Module (PCM),其中 SAD 提供丰富的语义属性描述,PCM 自适应地从 SAD 中组合相关属性以生成属性感知特征;同时引入 Fast-Slow Training Strategy (FSTS),利用快速更新流(FUS)捕获 ReID 特定判别知识,慢速更新流(SUS)保留预训练视觉语言模型(Vision-Language Model, VLM)继承的通用表征能力,二者相互作用,在聚焦 ReID 相关特征的同时有效缓解过拟合问题。

链接: https://arxiv.org/abs/2509.18715
作者: Yingquan Wang,Pingping Zhang,Chong Sun,Dong Wang,Huchuan Lu
机构: Dalian University of Technology (大连理工大学); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TCSVT2025

点击查看摘要

Abstract:Object Re-IDentification (ReID) aims to recognize individuals across non-overlapping camera views. While recent advances have achieved remarkable progress, most existing models are constrained to either single-domain or cross-domain scenarios, limiting their real-world applicability. Single-domain models tend to overfit to domain-specific features, whereas cross-domain models often rely on diverse normalization strategies that may inadvertently suppress identity-specific discriminative cues. To address these limitations, we propose an Attribute Prompt Composition (APC) framework, which exploits textual semantics to jointly enhance discrimination and generalization. Specifically, we design an Attribute Prompt Generator (APG) consisting of a Semantic Attribute Dictionary (SAD) and a Prompt Composition Module (PCM). SAD is an over-complete attribute dictionary to provide rich semantic descriptions, while PCM adaptively composes relevant attributes from SAD to generate discriminative attribute-aware features. In addition, motivated by the strong generalization ability of Vision-Language Models (VLM), we propose a Fast-Slow Training Strategy (FSTS) to balance ReID-specific discrimination and generalizable representation learning. Specifically, FSTS adopts a Fast Update Stream (FUS) to rapidly acquire ReID-specific discriminative knowledge and a Slow Update Stream (SUS) to retain the generalizable knowledge inherited from the pre-trained VLM. Through a mutual interaction, the framework effectively focuses on ReID-relevant features while mitigating overfitting. Extensive experiments on both conventional and Domain Generalized (DG) ReID datasets demonstrate that our framework surpasses state-of-the-art methods, exhibiting superior performances in terms of both discrimination and generalization. The source code is available at this https URL.
zh

[CV-68] RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

【速读】:该论文旨在解决遥感视觉定位(Remote Sensing Visual Grounding, RSVG)任务中现有方法受限于封闭词汇集(closed-set vocabularies)的问题,从而难以在开放世界场景下应用;同时,现有基于通用基础模型的开放词汇方法过度依赖昂贵的高质量数据集和耗时的微调过程。解决方案的关键在于提出一种无需训练(training-free)的框架 RSVG-ZeroOV,其核心由三个阶段构成:(i) 利用视觉语言模型(Vision-Language Model, VLM)获取跨注意力图以捕捉文本查询与图像区域间的语义关联;(ii) 借助扩散模型(Diffusion Model, DM)的细粒度建模先验填补对象结构与形状信息的缺失,弥补VLM的不足;(iii) 引入一个简单但有效的注意力演化模块抑制无关激活,生成纯净的目标分割掩码。该方法充分利用冻结的基础模型能力,在不进行特定任务训练的前提下实现高效的零样本开放词汇RSVG。

链接: https://arxiv.org/abs/2509.18711
作者: Ke Li,Di Wang,Ting Wang,Fuyu Dong,Yiming Zhang,Luyao Zhang,Xiangyu Wang,Shaofeng Li,Quan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts to leverage generic foundation models for open-vocabulary RSVG, they overly rely on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose \textbfRSVG-ZeroOV, a training-free framework that aims to explore the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention\footnote[1]In this paper, although decoder-only VLMs use self-attention over all tokens, we refer to the image-text interaction part as cross-attention to distinguish it from pure visual self-attention.maps that capture semantic correlations between text queries and visual regions. (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.
zh

[CV-69] Overview of LifeCLEF Plant Identification task 2019: diving into data deficient tropical countries

链接: https://arxiv.org/abs/2509.18705
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 5 figures, CLEF 2019 Conference and Labs of the Evaluation Forum, September 09 to 12, 2019, Lugano, Switzerland

点击查看摘要

[CV-70] AGSwap: Overcoming Category Boundaries in Object Fusion via Adaptive Group Swapping

【速读】:该论文旨在解决文本到图像(Text-to-Image, T2I)生成中跨类别物体融合的挑战,即如何将不同概念的语义属性有效整合为一个视觉连贯、语义一致的图像对象。现有方法常因特征重叠和融合不佳导致结果出现偏倚、视觉混乱或语义不一致的问题,且缺乏系统性的评估基准。为此,作者提出Adaptive Group Swapping (AGSwap) 方法,其核心在于两个关键组件:(1) Group-wise Embedding Swapping,通过特征空间中的分组嵌入交换实现语义属性的融合;(2) Adaptive Group Updating,基于平衡评估分数动态优化融合过程,确保合成结果的一致性与合理性。此外,论文还构建了大规模、层级结构化的Cross-category Object Fusion (COF) 数据集,支持多样化的跨类别融合任务,显著推动该领域的研究进展。

链接: https://arxiv.org/abs/2509.18699
作者: Zedong Zhang,Ying Tai,Jianjun Qian,Jian Yang,Jun Li
机构: Nanjing University of Science and Technology (南京理工大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Fusing cross-category objects to a single coherent object has gained increasing attention in text-to-image (T2I) generation due to its broad applications in virtual reality, digital media, film, and gaming. However, existing methods often produce biased, visually chaotic, or semantically inconsistent results due to overlapping artifacts and poor integration. Moreover, progress in this field has been limited by the absence of a comprehensive benchmark dataset. To address these problems, we propose \textbfAdaptive Group Swapping (AGSwap), a simple yet highly effective approach comprising two key components: (1) Group-wise Embedding Swapping, which fuses semantic attributes from different concepts through feature manipulation, and (2) Adaptive Group Updating, a dynamic optimization mechanism guided by a balance evaluation score to ensure coherent synthesis. Additionally, we introduce \textbfCross-category Object Fusion (COF), a large-scale, hierarchically structured dataset built upon ImageNet-1K and WordNet. COF includes 95 superclasses, each with 10 subclasses, enabling 451,250 unique fusion pairs. Extensive experiments demonstrate that AGSwap outperforms state-of-the-art compositional T2I methods, including GPT-Image-1 using simple and complex prompts.
zh

[CV-71] Overview of PlantCLEF 2021: cross-domain plant identification

【速读】:该论文旨在解决数据贫瘠地区(如热带地区)植物自动识别准确率低的问题,其核心挑战在于这些区域缺乏足够数量的野外拍摄图像用于训练深度学习模型。解决方案的关键在于利用长期积累的标本馆(herbarium)数字化记录,构建跨域分类任务:以数十万张标本图像和少量野外照片作为训练数据,使模型能够学习标本与野外图像之间的特征映射关系。此外,训练数据还包含5个形态学与功能特性值,进一步增强了模型对物种特征的理解能力,从而提升在数据稀缺区域的识别性能。

链接: https://arxiv.org/abs/2509.18697
作者: Herve Goeau,Pierre Bonnet,Alexis Joly
机构: CIRAD, UMR AMAP, Montpellier, Occitanie, France; Inria, LIRMM, Univ Montpellier, CNRS, Montpellier, France
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 6 figures, CLEF 2021 Conference and Labs of the Evaluation Forum, September 21 to 24, 2021, Bucharest, Romania

点击查看摘要

Abstract:Automated plant identification has improved considerably thanks to recent advances in deep learning and the availability of training data with more and more field photos. However, this profusion of data concerns only a few tens of thousands of species, mainly located in North America and Western Europe, much less in the richest regions in terms of biodiversity such as tropical countries. On the other hand, for several centuries, botanists have systematically collected, catalogued and stored plant specimens in herbaria, especially in tropical regions, and recent efforts by the biodiversity informatics community have made it possible to put millions of digitised records online. The LifeCLEF 2021 plant identification challenge (or “PlantCLEF 2021”) was designed to assess the extent to which automated identification of flora in data-poor regions can be improved by using herbarium collections. It is based on a dataset of about 1,000 species mainly focused on the Guiana Shield of South America, a region known to have one of the highest plant diversities in the world. The challenge was evaluated as a cross-domain classification task where the training set consisted of several hundred thousand herbarium sheets and a few thousand photos to allow learning a correspondence between the two domains. In addition to the usual metadata (location, date, author, taxonomy), the training data also includes the values of 5 morphological and functional traits for each species. The test set consisted exclusively of photos taken in the field. This article presents the resources and evaluations of the assessment carried out, summarises the approaches and systems used by the participating research groups and provides an analysis of the main results.
zh

[CV-72] OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery

【速读】:该论文旨在解决遥感领域中开放集地表覆盖分析(open-set land-cover analysis)的关键挑战,即在无类别监督条件下实现细粒度的空间定位与语义开放分类,包括检测和分割未知对象,并通过多模态推理为其赋予可解释的语义标签。解决方案的核心在于提出OSDA框架,该框架采用三阶段集成设计:首先利用提示驱动微调的分割模型(如Segment Anything Model, SAM)进行精确的目标发现与掩码提取;其次借助两阶段微调的多模态大语言模型(Multimodal Large Language Model, MLLM)完成语义归属与上下文描述;最后通过LLM-as-judge机制与人工评分对MLLM输出进行评估。该方法实现了像素级精度与高层语义理解的融合,且无需标注数据,具备架构无关性和跨卫星影像的鲁棒性,为动态地表覆盖监测提供了可扩展、可解释的自动化解决方案。

链接: https://arxiv.org/abs/2509.18693
作者: Siyi Chen,Kai Wang,Weicong Pang,Ruiming Yang,Ziru Chen,Renjun Gao,Alexis Kai Hon Lau,Dasa Gu,Chenchen Zhang,Cheng Li
机构: Johns Hopkins University (约翰霍普金斯大学); The University of Hong Kong (香港大学); National University of Singapore (新加坡国立大学); The Hong Kong University of Science and Technology (香港科技大学); Macau University of Science and Technology (澳门科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project is available at this https URL

点击查看摘要

Abstract:Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. This involves not only detecting and segmenting novel objects without categorical supervision but also assigning them interpretable semantic labels through multimodal reasoning. In this study, we introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. The pipeline consists of: (1) precise discovery and mask extraction with a promptable fine-tuned segmentation model (SAM), (2) semantic attribution and contextual description via a two-phase fine-tuned multimodal large language model (MLLM), and (3) LLM-as-judge and manual scoring of the MLLMs evaluation. By combining pixel-level accuracy with high-level semantic understanding, OSDA addresses key challenges in open-world remote sensing interpretation. Designed to be architecture-agnostic and label-free, the framework supports robust evaluation across diverse satellite imagery without requiring manual annotation. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.
zh

[CV-73] Lightweight Vision Transformer with Window and Spatial Attention for Food Image Classification

【速读】:该论文旨在解决食品图像分类任务中Vision Transformer模型参数量大、计算复杂度高的问题,以适应生产线上对高效自动化质量控制的需求。其解决方案的关键在于提出一种轻量化算法,通过引入窗口多头注意力机制(Window Multi-Head Attention Mechanism, WMHAM)和空间注意力机制(Spatial Attention Mechanism, SAM),其中WMHAM通过高效的窗口划分策略同时捕获局部与全局上下文特征以降低计算开销,而SAM则自适应地强化关键空间区域,提升特征判别能力。实验表明,该方法在Food-101和Vireo Food-172数据集上分别达到95.24%和94.33%的准确率,同时显著减少模型参数和浮点运算次数(FLOPs),实现了计算效率与分类性能之间的良好平衡,适用于资源受限环境部署。

链接: https://arxiv.org/abs/2509.18692
作者: Xinle Gao,Linghui Ye,Zhiyong Xiao
机构: Jiangnan University (江南大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:With the rapid development of society and continuous advances in science and technology, the food industry increasingly demands higher production quality and efficiency. Food image classification plays a vital role in enabling automated quality control on production lines, supporting food safety supervision, and promoting intelligent agricultural production. However, this task faces challenges due to the large number of parameters and high computational complexity of Vision Transformer models. To address these issues, we propose a lightweight food image classification algorithm that integrates a Window Multi-Head Attention Mechanism (WMHAM) and a Spatial Attention Mechanism (SAM). The WMHAM reduces computational cost by capturing local and global contextual features through efficient window partitioning, while the SAM adaptively emphasizes key spatial regions to improve discriminative feature representation. Experiments conducted on the Food-101 and Vireo Food-172 datasets demonstrate that our model achieves accuracies of 95.24% and 94.33%, respectively, while significantly reducing parameters and FLOPs compared with baseline methods. These results confirm that the proposed approach achieves an effective balance between computational efficiency and classification performance, making it well-suited for deployment in resource-constrained environments.
zh

[CV-74] LEAF-Mamba: Local Emphatic and Adaptive Fusion State Space Model for RGB-D Salient Object Detection ACM-MM2025

链接: https://arxiv.org/abs/2509.18683
作者: Lanhu Wu,Zilin Gao,Hao Fei,Mong-Li Lee,Wynne Hsu
机构: Dalian University of Technology (大连理工大学); National University of Singapore (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
备注: Accepted to ACM MM 2025

点击查看摘要

[CV-75] Zero-shot Monocular Metric Depth for Endoscopic Images MICCAI2025

【速读】:该论文旨在解决内窥镜图像中深度估计(包括相对深度和度量深度)领域缺乏鲁棒基准测试和高质量数据集的问题,从而限制了模型在临床场景中的泛化能力与实际应用。其解决方案的关键在于:首先构建了一个针对真实未见过的内窥镜图像的全面基准测试平台,用于评估当前最先进深度估计模型的性能;其次提出并公开了一个新型合成数据集EndoSynth,其中包含带真实度量深度和分割掩膜的内窥镜手术器械图像,用以弥合合成数据与真实世界数据之间的差距。实验表明,使用该合成数据集对深度基础模型进行微调后,可在大多数未见的真实数据上显著提升精度,为后续研究提供了重要资源与技术路径。

链接: https://arxiv.org/abs/2509.18642
作者: Nicolas Toussaint,Emanuele Colleoni,Ricardo Sanchez-Matilla,Joshua Sutcliffe,Vanessa Thompson,Muhammad Asad,Imanol Luengo,Danail Stoyanov
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025 DEMI Workshop

点击查看摘要

Abstract:Monocular relative and metric depth estimation has seen a tremendous boost in the last few years due to the sharp advancements in foundation models and in particular transformer based networks. As we start to see applications to the domain of endoscopic images, there is still a lack of robust benchmarks and high-quality datasets in that area. This paper addresses these limitations by presenting a comprehensive benchmark of state-of-the-art (metric and relative) depth estimation models evaluated on real, unseen endoscopic images, providing critical insights into their generalisation and performance in clinical scenarios. Additionally, we introduce and publish a novel synthetic dataset (EndoSynth) of endoscopic surgical instruments paired with ground truth metric depth and segmentation masks, designed to bridge the gap between synthetic and real-world data. We demonstrate that fine-tuning depth foundation models using our synthetic dataset boosts accuracy on most unseen real data by a significant margin. By providing both a benchmark and a synthetic dataset, this work advances the field of depth estimation for endoscopic images and serves as an important resource for future research. Project page, EndoSynth dataset and trained weights are available at this https URL.
zh

[CV-76] Understanding-in-Generation: Reinforcing Generative Capability of Unified Model via Infusing Understanding into Generation

【速读】:该论文旨在解决当前统一模型在文本到图像生成(text-to-image generation)中因理解与生成过程分离而导致的推理引导能力受限问题,即现有基于思维链(Chain-of-Thought, CoT)的方法难以有效利用模型的理解能力来弥补其生成能力的不足。解决方案的关键在于提出一种新的推理框架——“生成中的理解”(Understanding-in-Generation, UiG),其核心思想是在生成过程中嵌入模型的强理解能力,通过将“图像编辑”作为桥梁,逐步将模型对文本的理解转化为具体的图像修改指令,从而实现理解驱动的生成优化。这一机制显著提升了统一模型在复杂提示(如TIIF基准长提示场景)下的生成性能,相较现有方法提升达3.92%。

链接: https://arxiv.org/abs/2509.18639
作者: Yuanhuiyi Lyu,Chi Kit Wong,Chenfei Liao,Lutao Jiang,Xu Zheng,Zexin Lu,Linfeng Zhang,Xuming Hu
机构: The Hong Kong University of Science and Technology (Guangzhou) (香港科技大学(广州)); Shanghai Jiao Tong University (上海交通大学); The Hong Kong University of Science and Technology (香港科技大学); Huawei Hong Kong Research Center (华为香港研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Recent works have made notable advancements in enhancing unified models for text-to-image generation through the Chain-of-Thought (CoT). However, these reasoning methods separate the processes of understanding and generation, which limits their ability to guide the reasoning of unified models in addressing the deficiencies of their generative capabilities. To this end, we propose a novel reasoning framework for unified models, Understanding-in-Generation (UiG), which harnesses the robust understanding capabilities of unified models to reinforce their performance in image generation. The core insight of our UiG is to integrate generative guidance by the strong understanding capabilities during the reasoning process, thereby mitigating the limitations of generative abilities. To achieve this, we introduce “Image Editing” as a bridge to infuse understanding into the generation process. Initially, we verify the generated image and incorporate the understanding of unified models into the editing instructions. Subsequently, we enhance the generated image step by step, gradually infusing the understanding into the generation process. Our UiG framework demonstrates a significant performance improvement in text-to-image generation over existing text-to-image reasoning methods, e.g., a 3.92% gain on the long prompt setting of the TIIF benchmark. The project code: this https URL
zh

[CV-77] Learning neuroimaging models from health system-scale data

链接: https://arxiv.org/abs/2509.18638
作者: Yiwei Lyu,Samir Harake,Asadur Chowdury,Soumyanil Banerjee,Rachel Gologorsky,Shixuan Liu,Anna-Katharina Meissner,Akshay Rao,Chenhui Zhao,Akhil Kondepudi,Cheng Jiang,Xinhai Hou,Rushikesh S. Joshi,Volker Neuschmelting,Ashok Srinivasan,Dawn Kleindorfer,Brian Athey,Vikas Gulani,Aditya Pandey,Honglak Lee,Todd Hollon
机构: University of Michigan Computer Science and Engineering (密歇根大学计算机科学与工程系); University of Michigan Neurosurgery (密歇根大学神经外科); University of Cologne Neurosurgery (科隆大学神经外科); University of Michigan Radiology (密歇根大学放射学); University of Michigan Neurology (密歇根大学神经病学); University of Michigan Computational Medicine and Bioinformatics (密歇根大学计算医学与生物信息学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-78] Prompt-Guided Dual Latent Steering for Inversion Problems

【速读】:该论文旨在解决扩散模型中图像退化重建时存在的语义漂移问题(semantic drift),即现有基于单个潜在向量(latent vector)的逆过程方法难以同时保证结构保真度与语义准确性,导致重建结果出现细节模糊或属性错误等问题。其解决方案的关键在于提出了一种无需训练的双潜在流引导机制(Prompt-Guided Dual Latent Steering, PDLS),该机制基于修正流(Rectified Flow)模型的稳定逆路径,将逆过程分解为两个互补路径:一个结构路径用于保持源图像完整性,另一个语义路径由提示词(prompt)引导以增强语义一致性;并通过将其建模为最优控制问题,利用线性二次调节器(Linear Quadratic Regulator, LQR)推导出闭式解,实现每一步生成轨迹的动态调控,从而在不进行逐图优化的前提下有效防止语义漂移并保留精细细节。

链接: https://arxiv.org/abs/2509.18619
作者: Yichen Wu,Xu Liu,Chenxuan Zhao,Xinyu Wu
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); University of Washington (华盛顿大学); University of Alberta (阿尔伯塔大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at DICTA 2025 (oral)

点击查看摘要

Abstract:Inverting corrupted images into the latent space of diffusion models is challenging. Current methods, which encode an image into a single latent vector, struggle to balance structural fidelity with semantic accuracy, leading to reconstructions with semantic drift, such as blurred details or incorrect attributes. To overcome this, we introduce Prompt-Guided Dual Latent Steering (PDLS), a novel, training-free framework built upon Rectified Flow models for their stable inversion paths. PDLS decomposes the inversion process into two complementary streams: a structural path to preserve source integrity and a semantic path guided by a prompt. We formulate this dual guidance as an optimal control problem and derive a closed-form solution via a Linear Quadratic Regulator (LQR). This controller dynamically steers the generative trajectory at each step, preventing semantic drift while ensuring the preservation of fine detail without costly, per-image optimization. Extensive experiments on FFHQ-1K and ImageNet-1K under various inversion tasks, including Gaussian deblurring, motion deblurring, super-resolution and freeform inpainting, demonstrate that PDLS produces reconstructions that are both more faithful to the original image and better aligned with the semantic information than single-latent baselines.
zh

[CV-79] MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

链接: https://arxiv.org/abs/2509.18613
作者: Yuzhi Wu,Li Xiao,Jun Liu,Guangfeng Jiang,XiangGen Xia
机构: University of Science and Technology of China (中国科学技术大学); University of Delaware (特拉华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-80] raining-Free Multi-Style Fusion Through Reference-Based Adaptive Modulation

【速读】:该论文旨在解决当前基于参考图像的扩散模型在风格融合方面的两大局限性:一是现有方法通常仅支持单一风格图像输入,难以实现多种美学特征的混合与扩展;二是缺乏一种机制来合理平衡多个风格的影响权重,导致生成结果不稳定或不可控。解决方案的关键在于提出自适应多风格融合(Adaptive Multi-Style Fusion, AMSF)框架,其核心创新包括:1)通过语义标记分解模块(semantic token decomposition module)对所有风格图像和文本提示进行编码,并将这些信息以自适应方式注入冻结扩散模型的每个交叉注意力层;2)引入相似度感知重加权模块,在去噪每一步动态调整各风格成分的注意力分配,从而实现无微调、无需外部适配器的可控且均衡的多风格融合效果。

链接: https://arxiv.org/abs/2509.18602
作者: Xu Liu,Yibo Lu,Xinxian Wang,Xinyu Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at ACPR 2025 (oral)

点击查看摘要

Abstract:We propose Adaptive Multi-Style Fusion (AMSF), a reference-based training-free framework that enables controllable fusion of multiple reference styles in diffusion models. Most of the existing reference-based methods are limited by (a) acceptance of only one style image, thus prohibiting hybrid aesthetics and scalability to more styles, and (b) lack of a principled mechanism to balance several stylistic influences. AMSF mitigates these challenges by encoding all style images and textual hints with a semantic token decomposition module that is adaptively injected into every cross-attention layer of an frozen diffusion model. A similarity-aware re-weighting module then recalibrates, at each denoising step, the attention allocated to every style component, yielding balanced and user-controllable blends without any fine-tuning or external adapters. Both qualitative and quantitative evaluations show that AMSF produces multi-style fusion results that consistently outperform the state-of-the-art approaches, while its fusion design scales seamlessly to two or more styles. These capabilities position AMSF as a practical step toward expressive multi-style generation in diffusion models.
zh

[CV-81] SSCM: A Spatial-Semantic Consistent Model for Multi-Contrast MRI Super-Resolution

【速读】:该论文旨在解决多对比度磁共振成像超分辨率(Multi-contrast Magnetic Resonance Imaging super-resolution, MC-MRI SR)中因目标图像与参考图像间结构差异和运动导致的空间-语义一致性难以保持的问题,从而实现高效、高保真度的图像重建。其解决方案的关键在于提出空间-语义一致模型(Spatial-Semantic Consistent Model, SSCM),通过三个核心模块协同优化:动态空间变形模块(Dynamic Spatial Warping Module)用于跨对比度图像间的空间对齐,语义感知令牌聚合块(Semantic-Aware Token Aggregation Block)保障长程语义一致性,以及空间-频率融合块(Spatial-Frequency Fusion Block)提升细节恢复能力,从而在参数量更少的前提下实现优于现有方法的空间与语义一致性重建效果。

链接: https://arxiv.org/abs/2509.18593
作者: Xiaoman Wu,Lubin Gan,Siying Wu,Jing Zhang,Yunwei Ou,Xiaoyan Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multi-contrast Magnetic Resonance Imaging super-resolution (MC-MRI SR) aims to enhance low-resolution (LR) contrasts leveraging high-resolution (HR) references, shortening acquisition time and improving imaging efficiency while preserving anatomical details. The main challenge lies in maintaining spatial-semantic consistency, ensuring anatomical structures remain well-aligned and coherent despite structural discrepancies and motion between the target and reference images. Conventional methods insufficiently model spatial-semantic consistency and underuse frequency-domain information, which leads to poor fine-grained alignment and inadequate recovery of high-frequency details. In this paper, we propose the Spatial-Semantic Consistent Model (SSCM), which integrates a Dynamic Spatial Warping Module for inter-contrast spatial alignment, a Semantic-Aware Token Aggregation Block for long-range semantic consistency, and a Spatial-Frequency Fusion Block for fine structure restoration. Experiments on public and private datasets show that SSCM achieves state-of-the-art performance with fewer parameters while ensuring spatially and semantically consistent reconstructions.
zh

[CV-82] VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation

【速读】:该论文旨在解决视觉语言导航(Vision-Language Navigation, VLN)在未见环境中的泛化能力差与计算效率低的问题,现有方法通常依赖于大量探索或固定导航策略,难以适应新场景。其解决方案的关键在于提出一个两阶段的神经符号导航框架VLN-Zero:第一阶段通过结构化提示引导视觉语言模型(Vision-Language Model, VLM)生成信息丰富且多样化的轨迹,构建紧凑的符号化场景图(scene graph);第二阶段利用神经符号规划器基于场景图和环境观测进行可执行路径推理,并结合缓存机制复用历史任务-目标轨迹以加速适应过程。该方法实现了高效探索、符号推理与缓存驱动执行的协同优化,显著提升了零样本迁移性能与决策效率。

链接: https://arxiv.org/abs/2509.18592
作者: Neel P. Bhatt,Yunhao Yang,Rohan Siva,Pranay Samineni,Daniel Milan,Zhangyang Wang,Ufuk Topcu
机构: University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
备注: Codebase, datasets, and videos for VLN-Zero are available at: this https URL

点击查看摘要

Abstract:Rapid adaptation in unseen environments is essential for scalable real-world autonomy, yet existing approaches rely on exhaustive exploration or rigid navigation policies that fail to generalize. We present VLN-Zero, a two-phase vision-language navigation framework that leverages vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. In the exploration phase, structured prompts guide VLM-based search toward informative and diverse trajectories, yielding compact scene graph representations. In the deployment phase, a neurosymbolic planner reasons over the scene graph and environmental observations to generate executable plans, while a cache-enabled execution module accelerates adaptation by reusing previously computed task-location trajectories. By combining rapid exploration, symbolic reasoning, and cache-enabled execution, the proposed framework overcomes the computational inefficiency and poor generalization of prior vision-language navigation methods, enabling robust and scalable decision-making in unseen environments. VLN-Zero achieves 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time with 55% fewer VLM calls on average compared to state-of-the-art models across diverse environments. Codebase, datasets, and videos for VLN-Zero are available at: this https URL.
zh

[CV-83] Enhancing Video Object Segmentation in TrackRAD Using XMem Memory Network

【速读】:该论文旨在解决MRI-guided放射治疗中肿瘤实时分割的难题,尤其是在 cine-MRI 序列中实现高精度、低延迟的肿瘤边界识别,以提升放疗的精准性和安全性。解决方案的关键在于采用基于记忆增强架构(memory-augmented architecture)的XMem模型,通过引入高效的内存机制来跟踪长时间序列中的肿瘤运动,在标注数据有限的情况下仍能保持良好的分割性能,并满足临床对实时性的要求。

链接: https://arxiv.org/abs/2509.18591
作者: Pengchao Deng,Shengqi Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:This paper presents an advanced tumor segmentation framework for real-time MRI-guided radiotherapy, designed for the TrackRAD2025 challenge. Our method leverages the XMem model, a memory-augmented architecture, to segment tumors across long cine-MRI sequences. The proposed system efficiently integrates memory mechanisms to track tumor motion in real-time, achieving high segmentation accuracy even under challenging conditions with limited annotated data. Unfortunately, the detailed experimental records have been lost, preventing us from reporting precise quantitative results at this stage. Nevertheless, From our preliminary impressions during development, the XMem-based framework demonstrated reasonable segmentation performance and satisfied the clinical real-time requirement. Our work contributes to improving the precision of tumor tracking during MRI-guided radiotherapy, which is crucial for enhancing the accuracy and safety of cancer treatments.
zh

[CV-84] he Photographer Eye: Teaching Multimodal Large Language Models to See and Critique like Photographers

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉美学理解上的不足问题,尤其是在真实场景中缺乏对摄影技术、图像前后处理及专业审美判断等深层次知识的掌握。现有方法往往局限于基础的常识性美学认知,难以应对需要专业知识支撑的复杂视觉分析任务。其解决方案的关键在于三个核心创新:首先,构建了一个名为PhotoCritique的新颖数据集,该数据集源自专业摄影师与爱好者之间的深度讨论,具备大规模、高专业性和多样性特征;其次,提出一种名为PhotoEye的新型模型,采用语言引导的多视角视觉融合机制,从多个维度理解图像美学;最后,设计了一个名为PhotoBench的专业级基准测试平台,用于系统评估模型在美学视觉理解方面的性能。通过这些创新,论文显著提升了MLLMs在美学理解任务中的表现。

链接: https://arxiv.org/abs/2509.18582
作者: Daiqing Qi,Handong Zhao,Jing Shi,Simon Jenni,Yifei Fan,Franck Dernoncourt,Scott Cohen,Sheng Li
机构: University of Virginia (弗吉尼亚大学); Adobe (Adobe)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component–a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise–including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetics understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by the large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we furthur propose a novel model, PhotoEye, featuring a languageguided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
zh

[CV-85] Live-E2T: Real-time Threat Monitoring in Video via Deduplicated Event Reasoning and Chain-of-Thought

【速读】:该论文旨在解决实时威胁监测中面临的两大核心挑战:一是如何在保证实时性能的前提下实现高精度的威胁行为识别,二是如何生成可解释的威胁事件评估报告。现有基于监督学习或生成式模型的方法难以兼顾实时性与决策可解释性。解决方案的关键在于提出Live-E2T框架,其核心创新包括三个协同机制:首先,将视频帧解构为结构化的“人-物-交互-地点”(Human-Object-Interaction-Place)语义元组,构建紧凑且语义聚焦的表示以避免传统特征压缩带来的信息损失;其次,设计高效的在线事件去重与更新机制,过滤时空冗余以保障系统实时响应能力;最后,通过Chain-of-Thought微调大语言模型(Large Language Model),赋予其对事件序列进行逻辑清晰、透明推理的能力,从而输出连贯的威胁评估文本报告。

链接: https://arxiv.org/abs/2509.18571
作者: Yuhan Wang,Cheng Liu,Zihan Zhao,Weichao Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Real-time threat monitoring identifies threatening behaviors in video streams and provides reasoning and assessment of threat events through explanatory text. However, prevailing methodologies, whether based on supervised learning or generative models, struggle to concurrently satisfy the demanding requirements of real-time performance and decision explainability. To bridge this gap, we introduce Live-E2T, a novel framework that unifies these two objectives through three synergistic mechanisms. First, we deconstruct video frames into structured Human-Object-Interaction-Place semantic tuples. This approach creates a compact, semantically focused representation, circumventing the information degradation common in conventional feature compression. Second, an efficient online event deduplication and updating mechanism is proposed to filter spatio-temporal redundancies, ensuring the system’s real time responsiveness. Finally, we fine-tune a Large Language Model using a Chain-of-Thought strategy, endow it with the capability for transparent and logical reasoning over event sequences to produce coherent threat assessment reports. Extensive experiments on benchmark datasets, including XD-Violence and UCF-Crime, demonstrate that Live-E2T significantly outperforms state-of-the-art methods in terms of threat detection accuracy, real-time efficiency, and the crucial dimension of explainability.
zh

[CV-86] Event-guided 3D Gaussian Splatting for Dynamic Human and Scene Reconstruction

【速读】:该论文旨在解决从单目视频中重建动态人体与静态场景的难题,尤其是在快速运动导致RGB帧出现运动模糊的情况下。其解决方案的关键在于提出了一种事件引导的人体-场景联合重建框架,通过3D高斯点绘(3D Gaussian Splatting)实现统一建模:使用一组可学习语义属性的3D高斯表示同时编码人体和场景,仅对人体相关的高斯进行形变以实现动画,而场景高斯保持静态;此外,引入事件引导损失函数,将连续渲染图像间的模拟亮度变化与事件流匹配,从而提升高速运动区域的局部重建保真度。该方法无需外部人体掩码,简化了对分离高斯集合的管理,并在ZJU-MoCap-Blur和MMHPSD-Blur两个基准数据集上实现了当前最优的重建性能。

链接: https://arxiv.org/abs/2509.18566
作者: Xiaoting Yin,Hao Shi,Kailun Yang,Jiajun Zhai,Shangwei Guo,Lin Wang,Kaiwei Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO); Image and Video Processing (eess.IV)
备注:

点击查看摘要

Abstract:Reconstructing dynamic humans together with static scenes from monocular videos remains difficult, especially under fast motion, where RGB frames suffer from motion blur. Event cameras exhibit distinct advantages, e.g., microsecond temporal resolution, making them a superior sensing choice for dynamic human reconstruction. Accordingly, we present a novel event-guided human-scene reconstruction framework that jointly models human and scene from a single monocular event camera via 3D Gaussian Splatting. Specifically, a unified set of 3D Gaussians carries a learnable semantic attribute; only Gaussians classified as human undergo deformation for animation, while scene Gaussians stay static. To combat blur, we propose an event-guided loss that matches simulated brightness changes between consecutive renderings with the event stream, improving local fidelity in fast-moving regions. Our approach removes the need for external human masks and simplifies managing separate Gaussian sets. On two benchmark datasets, ZJU-MoCap-Blur and MMHPSD-Blur, it delivers state-of-the-art human-scene reconstruction, with notable gains over strong baselines in PSNR/SSIM and reduced LPIPS, especially for high-speed subjects.
zh

[CV-87] HadaSmileNet: Hadamard fusion of handcrafted and deep-learning features for enhancing facial emotion recognition of genuine smiles ICDM

【速读】:该论文旨在解决真实情绪(genuine emotion)与摆拍情绪(posed emotion)识别中的模式识别难题,尤其聚焦于微笑面部情绪识别任务,以提升在社会科学、医疗健康和人机交互等场景下的数据挖掘效能。其解决方案的关键在于提出HadaSmileNet框架,通过参数无感的Hadamard乘法融合机制,直接整合基于Transformer的深度特征与生理学基础的D-Marker特征,从而在保持计算效率的同时实现更优的判别能力。该方法避免了多任务学习中复杂的辅助任务监督与损失函数平衡问题,在四个基准数据集上均取得新的SOTA性能,并实现了26%的参数减少和训练简化。

链接: https://arxiv.org/abs/2509.18550
作者: Mohammad Junayed Hasan,Nabeel Mohammed,Shafin Rahman,Philipp Koehn
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IEEE International Conference on Data Mining (ICDM) 2025. Final version to appear in the conference proceedings

点击查看摘要

Abstract:The distinction between genuine and posed emotions represents a fundamental pattern recognition challenge with significant implications for data mining applications in social sciences, healthcare, and human-computer interaction. While recent multi-task learning frameworks have shown promise in combining deep learning architectures with handcrafted D-Marker features for smile facial emotion recognition, these approaches exhibit computational inefficiencies due to auxiliary task supervision and complex loss balancing requirements. This paper introduces HadaSmileNet, a novel feature fusion framework that directly integrates transformer-based representations with physiologically grounded D-Markers through parameter-free multiplicative interactions. Through systematic evaluation of 15 fusion strategies, we demonstrate that Hadamard multiplicative fusion achieves optimal performance by enabling direct feature interactions while maintaining computational efficiency. The proposed approach establishes new state-of-the-art results for deep learning methods across four benchmark datasets: UvA-NEMO (88.7 percent, +0.8), MMI (99.7 percent), SPOS (98.5 percent, +0.7), and BBC (100 percent, +5.0). Comprehensive computational analysis reveals 26 percent parameter reduction and simplified training compared to multi-task alternatives, while feature visualization demonstrates enhanced discriminative power through direct domain knowledge integration. The framework’s efficiency and effectiveness make it particularly suitable for practical deployment in multimedia data mining applications that require real-time affective computing capabilities.
zh

[CV-88] SEGA: A Transferable Signed Ensemble Gaussian Black-Box Attack against No-Reference Image Quality Assessment Models

链接: https://arxiv.org/abs/2509.18546
作者: Yujia Liu,Dingquan Li,Tiejun Huang
机构: NERCVT, School of Computer Science, Peking University, China (北京大学计算机学院多媒体信息处理国家重点实验室); National Key Laboratory for Multimedia Information Processing, Peking University, China (北京大学多媒体信息处理国家重点实验室); School of Mathematical Sciences, Peking University, China (北京大学数学科学学院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-89] GeoRemover: Removing Objects and Their Causal Visual Artifacts NEURIPS2025

【速读】:该论文旨在解决智能图像编辑中目标物体移除问题,特别是如何同时消除目标物体及其因果视觉伪影(如阴影和反射),而现有基于图像外观的方法要么因严格遵循掩码对齐训练无法处理未被显式标注的因果效应,要么采用松散掩码对齐策略导致可控性差并可能过度擦除其他对象。解决方案的关键在于识别出物体几何存在与其视觉效应之间的因果关系被忽略是问题根源,并提出一种几何感知的两阶段框架:第一阶段通过严格掩码对齐监督从几何信息(如深度图)中直接移除物体,实现结构感知的编辑;第二阶段则基于更新后的几何信息条件生成逼真的RGB图像,使因果视觉效应作为三维几何变化的隐式结果自然恢复。为引导几何移除阶段的学习,还引入基于正负样本对的偏好驱动目标,促使模型在移除物体及其因果伪影的同时避免引入新的结构。

链接: https://arxiv.org/abs/2509.18538
作者: Zixin Zhu,Haoxiang Li,Xuelu Feng,He Wu,Chunming Qiao,Junsong Yuan
机构: University at Buffalo (纽约州立大学布法罗分校); Pixocial Technology (Pixocial科技)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted as Spotlight at NeurIPS 2025

点击查看摘要

Abstract:Towards intelligent image editing, object removal should eliminate both the target object and its causal visual artifacts, such as shadows and reflections. However, existing image appearance-based methods either follow strictly mask-aligned training and fail to remove these causal effects which are not explicitly masked, or adopt loosely mask-aligned strategies that lack controllability and may unintentionally over-erase other objects. We identify that these limitations stem from ignoring the causal relationship between an object’s geometry presence and its visual effects. To address this limitation, we propose a geometry-aware two-stage framework that decouples object removal into (1) geometry removal and (2) appearance rendering. In the first stage, we remove the object directly from the geometry (e.g., depth) using strictly mask-aligned supervision, enabling structure-aware editing with strong geometric constraints. In the second stage, we render a photorealistic RGB image conditioned on the updated geometry, where causal visual effects are considered implicitly as a result of the modified 3D geometry. To guide learning in the geometry removal stage, we introduce a preference-driven objective based on positive and negative sample pairs, encouraging the model to remove objects as well as their causal visual artifacts while avoiding new structural insertions. Extensive experiments demonstrate that our method achieves state-of-the-art performance in removing both objects and their associated artifacts on two popular benchmarks. The code is available at this https URL.
zh

[CV-90] Hyperbolic Coarse-to-Fine Few-Shot Class-Incremental Learning

链接: https://arxiv.org/abs/2509.18504
作者: Jiaxin Dai,Xiang Xiang
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:

点击查看摘要

[CV-91] Source-Free Domain Adaptive Semantic Segmentation of Remote Sensing Images with Diffusion-Guided Label Enrichment

链接: https://arxiv.org/abs/2509.18502
作者: Wenjie Liu,Hongmin Liu,Lixin Zhang,Bin Fan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-92] BridgeSplat: Bidirectionally Coupled CT and Non-Rigid Gaussian Splatting for Deformable Intraoperative Surgical Navigation MICCAI2025

【速读】:该论文旨在解决术中导航中手术视频与术前CT(计算机断层扫描)体积数据之间缺乏精确配准的问题,尤其在柔性器官变形场景下如何实现高精度的实时形变映射。解决方案的关键在于提出BridgeSplat方法,通过将3D高斯(3D Gaussians)绑定到CT网格(mesh)上,并利用光度监督联合优化高斯参数与网格形变,同时以每个高斯相对于其所属三角面片的参数化方式强制高斯与网格保持对齐,从而获得可回传至CT体数据的物理合理形变场。

链接: https://arxiv.org/abs/2509.18501
作者: Maximilian Fehrentz,Alexander Winkler,Thomas Heiliger,Nazim Haouchine,Christian Heiliger,Nassir Navab
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at MICCAI 2025

点击查看摘要

Abstract:We introduce BridgeSplat, a novel approach for deformable surgical navigation that couples intraoperative 3D reconstruction with preoperative CT data to bridge the gap between surgical video and volumetric patient data. Our method rigs 3D Gaussians to a CT mesh, enabling joint optimization of Gaussian parameters and mesh deformation through photometric supervision. By parametrizing each Gaussian relative to its parent mesh triangle, we enforce alignment between Gaussians and mesh and obtain deformations that can be propagated back to update the CT. We demonstrate BridgeSplat’s effectiveness on visceral pig surgeries and synthetic data of a human liver under simulation, showing sensible deformations of the preoperative CT on monocular RGB data. Code, data, and additional resources can be found at this https URL .
zh

[CV-93] Differentiable Light Transport with Gaussian Surfels via Adapted Radiosity for Efficient Relighting and Geometry Reconstruction

【速读】:该论文旨在解决基于辐射场(radiance fields)的重建与渲染方法在建模材料反射特性(reflective properties)和光照条件时存在的局限性,这些问题导致几何歧义(geometric ambiguities)以及难以实现灵活的再照明(relighting)。传统方法通常采用简化物理渲染模型以提升优化效率,但牺牲了准确性。其解决方案的关键在于:引入高斯面元(Gaussian surfels)作为基础表征单元,并构建一个基于球谐函数(spherical harmonics)系数空间的可微光传输(differentiable light transport)框架,该框架受经典辐射度理论(radiosity theory)启发,扩展了非二值可见性和半透明体的处理能力,设计了高效的光传输求解器,并推导出更优的反向传播梯度计算方式,从而实现视点无关的高效全局光照渲染(数百帧每秒),显著优于现有逆渲染或数据驱动基线方法,在稀疏数据集下仍能获得高质量的几何重建、视角合成与再照明效果。

链接: https://arxiv.org/abs/2509.18497
作者: Kaiwen Jiang,Jia-Mu Sun,Zilu Li,Dan Wang,Tzu-Mao Li,Ravi Ramamoorthi
机构: University of California, San Diego (加州大学圣地亚哥分校)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Radiance fields have gained tremendous success with applications ranging from novel view synthesis to geometry reconstruction, especially with the advent of Gaussian splatting. However, they sacrifice modeling of material reflective properties and lighting conditions, leading to significant geometric ambiguities and the inability to easily perform relighting. One way to address these limitations is to incorporate physically-based rendering, but it has been prohibitively expensive to include full global illumination within the inner loop of the optimization. Therefore, previous works adopt simplifications that make the whole optimization with global illumination effects efficient but less accurate. In this work, we adopt Gaussian surfels as the primitives and build an efficient framework for differentiable light transport, inspired from the classic radiosity theory. The whole framework operates in the coefficient space of spherical harmonics, enabling both diffuse and specular materials. We extend the classic radiosity into non-binary visibility and semi-opaque primitives, propose novel solvers to efficiently solve the light transport, and derive the backward pass for gradient optimizations, which is more efficient than auto-differentiation. During inference, we achieve view-independent rendering where light transport need not be recomputed under viewpoint changes, enabling hundreds of FPS for global illumination effects, including view-dependent reflections using a spherical harmonics representation. Through extensive qualitative and quantitative experiments, we demonstrate superior geometry reconstruction, view synthesis and relighting than previous inverse rendering baselines, or data-driven baselines given relatively sparse datasets with known or unknown lighting conditions.
zh

[CV-94] MK-UNet: Multi-kernel Lightweight CNN for Medical Image Segmentation ICCV2025

链接: https://arxiv.org/abs/2509.18493
作者: Md Mostafijur Rahman,Radu Marculescu
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, Accepted at ICCV 2025 Workshop CVAMD

点击查看摘要

[CV-95] Codebook-Based Adaptive Feature Compression With Semantic Enhancement for Edge-Cloud Systems

【速读】:该论文旨在解决边缘-云系统中图像编码在低比特率下性能下降的问题,即传统方法在压缩过程中要么保留冗余细节,要么学习过于集中的符号分布,导致分析精度显著降低。解决方案的关键在于提出一种基于码本的自适应特征压缩框架(CAFC-SE),通过在边缘端使用向量量化(Vector Quantization, VQ)将连续视觉特征映射为离散索引,并选择性地传输这些索引至云端;VQ操作将特征向量投影到最近的视觉原型上,从而在低比特率条件下更好地保留语义信息,提升压缩效率与后续分析任务的准确性。

链接: https://arxiv.org/abs/2509.18481
作者: Xinyu Wang,Zikun Zhou,Yingjian Li,Xin An,Hongpeng Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Coding images for machines with minimal bitrate and strong analysis performance is key to effective edge-cloud systems. Several approaches deploy an image codec and perform analysis on the reconstructed image. Other methods compress intermediate features using entropy models and subsequently perform analysis on the decoded features. Nevertheless, these methods both perform poorly under low-bitrate conditions, as they retain many redundant details or learn over-concentrated symbol distributions. In this paper, we propose a Codebook-based Adaptive Feature Compression framework with Semantic Enhancement, named CAFC-SE. It maps continuous visual features to discrete indices with a codebook at the edge via Vector Quantization (VQ) and selectively transmits them to the cloud. The VQ operation that projects feature vectors onto the nearest visual primitives enables us to preserve more informative visual patterns under low-bitrate conditions. Hence, CAFC-SE is less vulnerable to low-bitrate conditions. Extensive experiments demonstrate the superiority of our method in terms of rate and accuracy.
zh

[CV-96] MoCrop: Training Free Motion Guided Cropping for Efficient Video Action Recognition

链接: https://arxiv.org/abs/2509.18473
作者: Binhua Huang,Wendong Yao,Shaowu Chen,Guoxin Wang,Qingyuan Wang,Soumyabrata Dev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 2 figures

点击查看摘要

[CV-97] Zero-Shot Visual Deepfake Detection: Can AI Predict and Prevent Fake Content Before Its Created?

【速读】:该论文旨在解决深度伪造(deepfake)技术快速发展对数字安全、媒体真实性和公众信任造成的严重威胁,特别是针对现有检测方法难以应对新型或未见过的深度伪造变体的问题。其核心解决方案是提出零样本深度伪造检测(zero-shot deepfake detection),通过自监督学习、基于Transformer的零样本分类器、生成模型指纹识别和元学习等先进技术,使检测系统能够在未见过特定深度伪造类型的情况下仍具备适应能力。此外,论文还提出了一系列AI驱动的预防策略,包括对抗扰动以干扰生成过程、数字水印用于内容真实性验证、实时AI监控内容生成管道以及区块链支持的内容验证框架,从而从源头上遏制深度伪造的产生。这些方案共同构建了一个融合零样本检测与主动预防机制的综合防御体系,强调了跨学科协作在应对深度伪造攻击中的关键作用。

链接: https://arxiv.org/abs/2509.18461
作者: Ayan Sar,Sampurna Roy,Tanupriya Choudhury,Ajith Abraham
机构: University of Petroleum and Energy Studies (UPES)(印度石油与能源研究大学); Sai University (赛大学)
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Published in Foundations and Trends in Signal Processing (#1 in Signal Processing, #3 in Computer Science)

点击查看摘要

Abstract:Generative adversarial networks (GANs) and diffusion models have dramatically advanced deepfake technology, and its threats to digital security, media integrity, and public trust have increased rapidly. This research explored zero-shot deepfake detection, an emerging method even when the models have never seen a particular deepfake variation. In this work, we studied self-supervised learning, transformer-based zero-shot classifier, generative model fingerprinting, and meta-learning techniques that better adapt to the ever-evolving deepfake threat. In addition, we suggested AI-driven prevention strategies that mitigated the underlying generation pipeline of the deepfakes before they occurred. They consisted of adversarial perturbations for creating deepfake generators, digital watermarking for content authenticity verification, real-time AI monitoring for content creation pipelines, and blockchain-based content verification frameworks. Despite these advancements, zero-shot detection and prevention faced critical challenges such as adversarial attacks, scalability constraints, ethical dilemmas, and the absence of standardized evaluation benchmarks. These limitations were addressed by discussing future research directions on explainable AI for deepfake detection, multimodal fusion based on image, audio, and text analysis, quantum AI for enhanced security, and federated learning for privacy-preserving deepfake detection. This further highlighted the need for an integrated defense framework for digital authenticity that utilized zero-shot learning in combination with preventive deepfake mechanisms. Finally, we highlighted the important role of interdisciplinary collaboration between AI researchers, cybersecurity experts, and policymakers to create resilient defenses against the rising tide of deepfake attacks.
zh

[CV-98] An Analysis of Kalman Filter based Object Tracking Methods for Fast-Moving Tiny Objects

【速读】:该论文旨在解决快速移动微小目标(如回力球)在计算机视觉中难以精确跟踪的问题,其核心挑战在于目标运动轨迹不可预测、视觉特征微弱且尺寸小,导致现有跟踪算法性能显著下降。解决方案的关键在于系统评估五种基于卡尔曼滤波(Kalman filter)的先进跟踪方法(OCSORT、DeepOCSORT、ByteTrack、BoTSORT 和 StrongSORT),并通过自建包含10,000帧标注数据集(分辨率720p–1280p)进行实验验证,重点分析推理速度与每帧更新频率对跟踪精度和鲁棒性的影响。结果表明,尽管 DeepOCSORT 在平均轨迹误差(ADE)上表现最优(31.15像素),但所有方法仍存在显著漂移(空间误差达3–11 cm),揭示了当前通用跟踪框架在处理此类极端动态场景时的根本局限,凸显出开发专门针对高速微小目标跟踪的新型算法的必要性。

链接: https://arxiv.org/abs/2509.18451
作者: Prithvi Raj Singh,Raju Gottumukkala,Anthony Maida
机构: McNeese State University (麦克尼斯州立大学); University of Louisiana at Lafayette (路易斯安那大学拉法叶分校); Informatics Research Institute (信息研究所); Center for Advanced Computer Studies (高级计算机研究中⼼)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Unpredictable movement patterns and small visual mark make precise tracking of fast-moving tiny objects like a racquetball one of the challenging problems in computer vision. This challenge is particularly relevant for sport robotics applications, where lightweight and accurate tracking systems can improve robot perception and planning capabilities. While Kalman filter-based tracking methods have shown success in general object tracking scenarios, their performance degrades substantially when dealing with rapidly moving objects that exhibit irregular bouncing behavior. In this study, we evaluate the performance of five state-of-the-art Kalman filter-based tracking methods-OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT-using a custom dataset containing 10,000 annotated racquetball frames captured at 720p-1280p resolution. We focus our analysis on two critical performance factors: inference speed and update frequency per image, examining how these parameters affect tracking accuracy and reliability for fast-moving tiny objects. Our experimental evaluation across four distinct scenarios reveals that DeepOCSORT achieves the lowest tracking error with an average ADE of 31.15 pixels compared to ByteTrack’s 114.3 pixels, while ByteTrack demonstrates the fastest processing at 26.6ms average inference time versus DeepOCSORT’s 26.8ms. However, our results show that all Kalman filter-based trackers exhibit significant tracking drift with spatial errors ranging from 3-11cm (ADE values: 31-114 pixels), indicating fundamental limitations in handling the unpredictable motion patterns of fast-moving tiny objects like racquetballs. Our analysis demonstrates that current tracking approaches require substantial improvements, with error rates 3-4x higher than standard object tracking benchmarks, highlighting the need for specialized methodologies for fast-moving tiny object tracking applications.
zh

[CV-99] Latent Action Pretraining Through World Modeling

链接: https://arxiv.org/abs/2509.18428
作者: Bahey Tharwat,Yara Nasser,Ali Abouzeid,Ian Reid
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学); Alexandria University (亚历山大大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-100] CPT-4DMR: Continuous sPatial-Temporal Representation for 4D-MRI Reconstruction

链接: https://arxiv.org/abs/2509.18427
作者: Xinyang Wu,Muheng Li,Xia Li,Orso Pusterla,Sairos Safai,Philippe C. Cattin,Antony J. Lomax,Ye Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Medical Physics (physics.med-ph)
备注:

点击查看摘要

[CV-101] Losing the Plot: How VLM responses degrade on imperfect charts

链接: https://arxiv.org/abs/2509.18425
作者: Philip Wootaek Shin,Jack Sampson,Vijaykrishnan Narayanan,Andres Marquez,Mahantesh Halappanavar
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-102] Check Field Detection Agent (CFD-Agent ) using Multimodal Large Language and Vision Language Models

【速读】:该论文旨在解决金融场景中支票(check)字段自动检测的难题,尤其是传统基于目标检测模型的方法依赖大规模、多样化且精细标注的数据集,而此类数据因隐私和专有性问题难以获取。解决方案的关键在于提出一种无需训练(training-free)的框架,利用视觉语言模型(VLM)与多模态大语言模型(MLLM)的联合能力,实现零样本(zero-shot)的支票关键字段检测,从而显著降低在真实金融环境中部署的门槛,并具备良好的泛化性能。

链接: https://arxiv.org/abs/2509.18405
作者: Sourav Halder,Jinjun Tong,Xinyu Wu
机构: U.S. Bank (美国银行)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures, 2 tables

点击查看摘要

Abstract:Checks remain a foundational instrument in the financial ecosystem, facilitating substantial transaction volumes across institutions. However, their continued use also renders them a persistent target for fraud, underscoring the importance of robust check fraud detection mechanisms. At the core of such systems lies the accurate identification and localization of critical fields, such as the signature, magnetic ink character recognition (MICR) line, courtesy amount, legal amount, payee, and payer, which are essential for subsequent verification against reference checks belonging to the same customer. This field-level detection is traditionally dependent on object detection models trained on large, diverse, and meticulously labeled datasets, a resource that is scarce due to proprietary and privacy concerns. In this paper, we introduce a novel, training-free framework for automated check field detection, leveraging the power of a vision language model (VLM) in conjunction with a multimodal large language model (MLLM). Our approach enables zero-shot detection of check components, significantly lowering the barrier to deployment in real-world financial settings. Quantitative evaluation of our model on a hand-curated dataset of 110 checks spanning multiple formats and layouts demonstrates strong performance and generalization capability. Furthermore, this framework can serve as a bootstrap mechanism for generating high-quality labeled datasets, enabling the development of specialized real-time object detection models tailored to institutional needs.
zh

[CV-103] Does Embodiment Matter to Biomechanics and Function? A Comparative Analysis of Head-Mounted and Hand-Held Assistive Devices for Individuals with Blindness and Low Vision

链接: https://arxiv.org/abs/2509.18391
作者: Gaurav Seth,Hoa Pham,Giles Hamilton-Fletcher,Charles Leclercq,John-Ross Rizzo
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: 30 pages, 7 figures, 5 tables. Pre-print submitted to International Journal of Human-Computer Interaction. Also to appear as a late-breaking poster at ACRM. Limited AI (ChatGPT-4/5) used for language refinement and figure schematics under author supervision. One author (CL) is CEO of ARx Vision; others report no conflicts

点击查看摘要

[CV-104] Improving the color accuracy of lighting estimation models

链接: https://arxiv.org/abs/2509.18390
作者: Zitian Zhang,Joshua Urban Davis,Jeanne Phuong Anh Vu,Jiangtao Kuang,Jean-François Lalonde
机构: Université Laval (拉瓦尔大学); Meta (Meta)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL

点击查看摘要

[CV-105] MVP: Motion Vector Propagation for Zero-Shot Video Object Detection

【速读】:该论文旨在解决在视频中运行大规模开放词汇检测器(Open-vocabulary Detector)时计算成本高昂的问题。其核心挑战是在保持检测精度的同时显著降低对检测模型的调用频率。解决方案的关键在于提出一种无需训练的传播机制(MVP),仅在固定间隔的关键帧上运行OWLv2检测器,并利用压缩域中的运动矢量(Motion Vector, MV)将检测结果传播至中间帧。通过一个简单的3×3网格聚合策略实现平移和均匀缩放更新,辅以面积增长检查和可选的单类切换机制,从而在不依赖标签或微调的情况下维持良好的零样本泛化能力。该方法在ILSVRC2015-VID数据集上实现了与逐帧检测接近的性能(mAP@0.5=0.609),同时大幅减少检测次数,证明了压缩域运动矢量传播是一种高效且实用的视频级检测优化方案。

链接: https://arxiv.org/abs/2509.18388
作者: Binhua Huang,Ni Wang,Wendong Yao,Soumyabrata Dev
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 1 figure

点击查看摘要

Abstract:Running a large open-vocabulary (Open-vocab) detector on every video frame is accurate but expensive. We introduce a training-free pipeline that invokes OWLv2 only on fixed-interval keyframes and propagates detections to intermediate frames using compressed-domain motion vectors (MV). A simple 3x3 grid aggregation of motion vectors provides translation and uniform-scale updates, augmented with an area-growth check and an optional single-class switch. The method requires no labels, no fine-tuning, and uses the same prompt list for all open-vocabulary methods. On ILSVRC2015-VID (validation dataset), our approach (MVP) attains mAP@0.5=0.609 and mAP@[0.5:0.95]=0.316. At loose intersection-over-union (IoU) thresholds it remains close to framewise OWLv2-Large (0.747/0.721 at 0.2/0.3 versus 0.784/0.780), reflecting that coarse localization is largely preserved. Under the same keyframe schedule, MVP outperforms tracker-based propagation (MOSSE, KCF, CSRT) at mAP@0.5. A supervised reference (YOLOv12x) reaches 0.631 at mAP@0.5 but requires labeled training, whereas our method remains label-free and open-vocabulary. These results indicate that compressed-domain propagation is a practical way to reduce detector invocations while keeping strong zero-shot coverage in videos. Our code and models are available at this https URL.
zh

[CV-106] BlurBall: Joint Ball and Motion Blur Estimation for Table Tennis Ball Tracking

链接: https://arxiv.org/abs/2509.18387
作者: Thomas Gossard,Filip Radovic,Andreas Ziegler,Andrea Zell
机构: University of Tuebingen (图宾根大学); Sony AI (索尼人工智能)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-107] nyBEV: Cross Modal Knowledge Distillation for Efficient Multi Task Birds Eye View Perception and Planning

链接: https://arxiv.org/abs/2509.18372
作者: Reeshad Khan,John Gauch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-108] Align Where the Words Look: Cross-Attention-Guided Patch Alignment with Contrastive and Transport Regularization for Bengali Captioning

【速读】:该论文旨在解决低资源语言(以孟加拉语为例)中视觉-语言模型的对齐问题,即模型虽能生成语法流畅的文本,却常描述错误的对象,根源在于配对数据稀缺、翻译桥接破坏语义一致性以及英语中心预训练忽略目标语言语义。其解决方案的关键在于提出一个计算感知的孟加拉语图像描述管道,结合LaBSE验证的英-孟双语对与11万张双语提示合成图像,采用冻结的MaxViT提取稳定视觉特征,孟加拉语原生mBART-50解码,并引入轻量级跨模态桥梁;核心创新是三重损失目标:Patch-Alignment Loss (PAL) 利用解码器交叉注意力对齐真实与合成图像块描述符,InfoNCE强化全局真实与合成样本分离,Sinkhorn-based Optimal Transport (OT) 保证细粒度图像块对应平衡。此PAL+InfoNCE+OT协同机制显著提升定位准确性、减少虚假匹配,并在Flickr30k-1k和MSCOCO-1k基准上超越强对比基线,将真实与合成特征中心距离缩小41%。

链接: https://arxiv.org/abs/2509.18369
作者: Riad Ahmed Anonto,Sardar Md. Saffat Zabin,M. Saifur Rahman
机构: Bangladesh University of Engineering and Technology (BUET)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Grounding vision–language models in low-resource languages remains challenging, as they often produce fluent text about the wrong objects. This stems from scarce paired data, translation pivots that break alignment, and English-centric pretraining that ignores target-language semantics. We address this with a compute-aware Bengali captioning pipeline trained on LaBSE-verified EN–BN pairs and 110k bilingual-prompted synthetic images. A frozen MaxViT yields stable visual patches, a Bengali-native mBART-50 decodes, and a lightweight bridge links the modalities. Our core novelty is a tri-loss objective: Patch-Alignment Loss (PAL) aligns real and synthetic patch descriptors using decoder cross-attention, InfoNCE enforces global real–synthetic separation, and Sinkhorn-based OT ensures balanced fine-grained patch correspondence. This PAL+InfoNCE+OT synergy improves grounding, reduces spurious matches, and drives strong gains on Flickr30k-1k (BLEU-4 12.29, METEOR 27.98, BERTScore-F1 71.20) and MSCOCO-1k (BLEU-4 12.00, METEOR 28.14, BERTScore-F1 75.40), outperforming strong CE baselines and narrowing the real–synthetic centroid gap by 41%.
zh

[CV-109] A Single Image Is All You Need: Zero-Shot Anomaly Localization Without Training Data

【速读】:该论文旨在解决无监督图像异常检测中的零样本(zero-shot)问题,即在缺乏训练数据或参考样本的情况下,仅凭单张测试图像实现异常定位。解决方案的关键在于提出了一种名为Single Shot Decomposition Network (SSDnet) 的方法,其核心思想是利用卷积神经网络(Convolutional Neural Networks, CNNs)的归纳偏置(inductive bias),通过将输入图像直接作为网络的输入进行自重建来学习图像的深层先验(deep image prior)。为避免模型简单地学习恒等映射,作者引入了掩码(masking)、补丁打乱(patch shuffling)和小高斯噪声等策略,并采用基于内积相似性的感知损失(perceptual loss)以捕捉超越像素级保真的结构信息。该方法无需外部训练数据、标签或参考图像,在MVTec-AD和织物数据集上分别达到0.99 AUROC/0.60 AUPRC和0.98 AUROC/0.67 AUPRC,显著优于现有最先进方法。

链接: https://arxiv.org/abs/2509.18354
作者: Mehrdad Moradi,Shengzhe Chen,Hao Yan,Kamran Paynabar
机构: Georgia Tech (佐治亚理工学院); Arizona State University (亚利桑那州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 12 pages, 10 figures, 1 table. Preprint submitted to a CVF conference

点击查看摘要

Abstract:Anomaly detection in images is typically addressed by learning from collections of training data or relying on reference samples. In many real-world scenarios, however, such training data may be unavailable, and only the test image itself is provided. We address this zero-shot setting by proposing a single-image anomaly localization method that leverages the inductive bias of convolutional neural networks, inspired by Deep Image Prior (DIP). Our method is named Single Shot Decomposition Network (SSDnet). Our key assumption is that natural images often exhibit unified textures and patterns, and that anomalies manifest as localized deviations from these repetitive or stochastic patterns. To learn the deep image prior, we design a patch-based training framework where the input image is fed directly into the network for self-reconstruction, rather than mapping random noise to the image as done in DIP. To avoid the model simply learning an identity mapping, we apply masking, patch shuffling, and small Gaussian noise. In addition, we use a perceptual loss based on inner-product similarity to capture structure beyond pixel fidelity. Our approach needs no external training data, labels, or references, and remains robust in the presence of noise or missing pixels. SSDnet achieves 0.99 AUROC and 0.60 AUPRC on MVTec-AD and 0.98 AUROC and 0.67 AUPRC on the fabric dataset, outperforming state-of-the-art methods. The implementation code will be released at this https URL
zh

[CV-110] OrthoLoC: UAV 6-DoF Localization and Calibration Using Orthographic Geodata NEURIPS2025

链接: https://arxiv.org/abs/2509.18350
作者: Oussema Dhaouadi,Riccardo Marin,Johannes Meier,Jacques Kaiser,Daniel Cremers
机构: DeepScenario; TU Munich (慕尼黑工业大学); Munich Center of Machine Learning (慕尼黑机器学习中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted at NeurIPS 2025

点击查看摘要

[CV-111] Semantic-Aware Particle Filter for Reliable Vineyard Robot Localisation ICRA2026

链接: https://arxiv.org/abs/2509.18342
作者: Rajitha de Silva,Jonathan Cox,James R. Heselden,Marija Popovic,Cesar Cadena,Riccardo Polvara
机构: Lincoln Centre for Autonomous Systems (L-CAS), University of Lincoln (林肯大学); MAVLab, TU Delft (代尔夫特理工大学); Robotics Systems Lab, ETH Zurich (苏黎世联邦理工学院)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Sumbitted to ICRA 2026

点击查看摘要

[CV-112] Influence of Classification Task and Distribution Shift Type on OOD Detection in Fetal Ultrasound MICCAI2025

链接: https://arxiv.org/abs/2509.18326
作者: Chun Kit Wong,Anders N. Christensen,Cosmin I. Bercea,Julia A. Schnabel,Martin G. Tolsgaard,Aasa Feragen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025

点击查看摘要

[CV-113] Improving Handshape Representations for Sign Language Processing: A Graph Neural Network Approach

【速读】:该论文旨在解决手形(handshape)在手势语言计算识别中长期被忽视的问题,尤其是由于手形类别间细微差异和时间动态变化导致的识别准确率低。其关键解决方案是提出一种新型图神经网络,通过将时间动态与静态手形配置分离建模,并结合解剖学启发的图结构与对比学习(contrastive learning),有效提升了手形识别性能,首次建立了结构化手形识别基准,在37类手形上达到46%的准确率(基线方法为25%)。

链接: https://arxiv.org/abs/2509.18309
作者: Alessa Carbo,Eric Nalisnick
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Handshapes serve a fundamental phonological role in signed languages, with American Sign Language employing approximately 50 distinct shapes. However,computational approaches rarely model handshapes explicitly, limiting both recognition accuracy and linguistic this http URL introduce a novel graph neural network that separates temporal dynamics from static handshape configurations. Our approach combines anatomically-informed graph structures with contrastive learning to address key challenges in handshape recognition, including subtle interclass distinctions and temporal variations. We establish the first benchmark for structured handshape recognition in signing sequences, achieving 46% accuracy across 37 handshape classes (with baseline methods achieving 25%).
zh

[CV-114] Rethinking Pulmonary Embolism Segmentation: A Study of Current Approaches and Challenges with an Open Weight Model WACV2026

链接: https://arxiv.org/abs/2509.18308
作者: Yixin Zhang,Ryan Chamberlain,Lawrance Ngo,Kevin Kramer,Maciej A. Mazurowski
机构: Duke University (杜克大学); Minnesota Health Solutions (明尼苏达健康解决方案); CoRead (CoRead)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: submitted to WACV 2026 application track, model weights available at: this https URL

点击查看摘要

[CV-115] Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction MICCAI2025

【速读】:该论文旨在解决多模态医疗诊断中因模态缺失(missingness)和模态不平衡(modality imbalance)导致的模型性能下降问题,特别是在仅能获取单一模态数据的实际临床场景下。其解决方案的关键在于提出一种融合可学习模态标记(learnable modality tokens)与对比学习(contrastive learning)的新框架:一方面通过引入可学习模态标记增强对缺失模态的感知能力,实现更鲁棒的多模态融合;另一方面在传统单模态对比目标基础上扩展为融合后的多模态表示,从而提升模型在不完整数据下的表征能力和泛化性能。该方法在大规模临床数据集上验证了其优越性,尤其在仅有单一模态可用时表现突出,并成功适配于最新的CT基础模型,展现出良好的实用性与扩展性。

链接: https://arxiv.org/abs/2509.18284
作者: Yi Gu,Kuniaki Saito,Jiaxin Ma
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025

点击查看摘要

Abstract:As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at this https URL.
zh

[CV-116] nyEcoWeedNet: Edge Efficient Real-Time Aerial Agricultural Weed Detection

链接: https://arxiv.org/abs/2509.18193
作者: Omar H. Khater,Abdul Jabbar Siddiqui,Aiman El-Maleh,M. Shamim Hossain
机构: King Fahd University of Petroleum and Minerals (KFUPM); SDAIA-KFUPM Joint Research Center on Artificial Intelligence; Center for Intelligent Secure Systems; Department of Computer Engineering; Research Chair of Pervasive and Mobile Computing; Department of Software Engineering; College of Computer and Information Sciences; King Saud University
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-117] HazeFlow: Revisit Haze Physical Model as ODE and Non-Homogeneous Haze Generation for Real-World Dehazing

【速读】:该论文旨在解决真实场景下图像去雾(dehazing)中因缺乏成对训练数据导致的域差距(domain gap)问题,以及传统基于大气散射模型(Atmospheric Scattering Model, ASM)的方法在处理复杂多样雾霾模式时性能不足的问题。其解决方案的关键在于提出一种基于常微分方程(ODE)的新型框架 HazeFlow,将 ASM 重新建模为一个 ODE 系统,并借鉴修正流(Rectified Flow, RF)的思想学习最优轨迹,从而仅需一步推理即可实现从有雾图像到清晰图像的映射;同时引入基于马尔可夫链布朗运动(Markov Chain Brownian Motion, MCBM)的非均匀雾霾生成方法,以模拟更真实的雾霾分布,有效缓解真实配对数据稀缺问题,显著提升模型在多样化现实场景中的泛化能力。

链接: https://arxiv.org/abs/2509.18190
作者: Junseong Shin,Seungwoo Chung,Yunjeong Yang,Tae Hyun Kim
机构: Hanyang University (汉阳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Dehazing involves removing haze or fog from images to restore clarity and improve visibility by estimating atmospheric scattering effects. While deep learning methods show promise, the lack of paired real-world training data and the resulting domain gap hinder generalization to real-world scenarios. In this context, physics-grounded learning becomes crucial; however, traditional methods based on the Atmospheric Scattering Model (ASM) often fall short in handling real-world complexities and diverse haze patterns. To solve this problem, we propose HazeFlow, a novel ODE-based framework that reformulates ASM as an ordinary differential equation (ODE). Inspired by Rectified Flow (RF), HazeFlow learns an optimal ODE trajectory to map hazy images to clean ones, enhancing real-world dehazing performance with only a single inference step. Additionally, we introduce a non-homogeneous haze generation method using Markov Chain Brownian Motion (MCBM) to address the scarcity of paired real-world data. By simulating realistic haze patterns through MCBM, we enhance the adaptability of HazeFlow to diverse real-world scenarios. Through extensive experiments, we demonstrate that HazeFlow achieves state-of-the-art performance across various real-world dehazing benchmark datasets.
zh

[CV-118] Qianfan-VL: Domain-Enhanced Universal Vision-Language Models

【速读】:该论文旨在解决多模态大语言模型在特定领域任务中性能不足的问题,尤其是在文档理解、OCR(光学字符识别)和数学推理等专业场景下的能力瓶颈。其核心解决方案是提出了一种基于分阶段渐进式训练与高精度数据合成流水线的域增强策略(domain enhancement strategy),通过该策略显著提升了模型在特定领域的表现,同时保持了良好的通用能力。关键创新在于利用多阶段训练优化模型对复杂视觉-语言交互的理解,并结合高质量合成数据增强模型在真实业务场景中的泛化能力,最终在多个基准测试中达到或超越现有开源模型水平,如CCBench、ScienceQA、MMStar以及DocVQA等,尤其在OCR相关任务上表现突出。

链接: https://arxiv.org/abs/2509.18189
作者: Daxiang Dong,Mingming Zheng,Dong Xu,Bairong Zhuang,Wenyu Zhang,Chunhua Luo,Haoran Wang,Zijian Zhao,Jie Li,Yuxuan Li,Hanjun Zhong,Mengyue Liu,Jieting Chen,Shupeng Li,Lun Tian,Yaping Feng,Xin Li,Donggang Jiang,Yong Chen,Yehua Xu,Duohao Qin,Chen Feng,Dan Wang,Henghua Zhang,Jingjing Ha,Jinhui He,Yanfeng Zhai,Chengxin Zheng,Jiayi Mao,Jiacheng Chen,Ruchang Yao,Ziye Yuan,Jianmin Wu,Guangjun Xie,Dou Shen
机构: Baidu AI Cloud (百度AI云)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages

点击查看摘要

Abstract:We present Qianfan-VL, a series of multimodal large language models ranging from 3B to 70B parameters, achieving state-of-the-art performance through innovative domain enhancement techniques. Our approach employs multi-stage progressive training and high-precision data synthesis pipelines, which prove to be critical technologies for enhancing domain-specific capabilities while maintaining strong general performance. Qianfan-VL achieves comparable results to leading open-source models on general benchmarks, with state-of-the-art performance on benchmarks such as CCBench, SEEDBench IMG, ScienceQA, and MMStar. The domain enhancement strategy delivers significant advantages in OCR and document understanding, validated on both public benchmarks (OCRBench 873, DocVQA 94.75%) and in-house evaluations. Notably, Qianfan-VL-8B and 70B variants incorporate long chain-of-thought capabilities, demonstrating superior performance on mathematical reasoning (MathVista 78.6%) and logical inference tasks. All models are trained entirely on Baidu’s Kunlun P800 chips, validating the capability of large-scale AI infrastructure to train SOTA-level multimodal models with over 90% scaling efficiency on 5000 chips for a single task. This work establishes an effective methodology for developing domain-enhanced multimodal models suitable for diverse enterprise deployment scenarios.
zh

[CV-119] V-SenseDrive: A Privacy-Preserving Road Video and In-Vehicle Sensor Fusion Framework for Road Safety Driver Behaviour Modelling

链接: https://arxiv.org/abs/2509.18187
作者: Muhammad Naveed,Nazia Perwaiz,Sidra Sultana,Mohaira Ahmad,Muhammad Moazam Fraz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[CV-120] Visionerves: Automatic and Reproducible Hybrid AI for Peripheral Nervous System Recognition Applied to Endometriosis Cases ALT MICCAI2025

【速读】:该论文旨在解决子宫内膜异位症(endometriosis)相关慢性盆腔疼痛中周围神经成像困难的问题,特别是如何在不依赖人工感兴趣区域(ROI)选择的情况下实现对周围神经系统(peripheral nervous system)的精准识别与追踪。其解决方案的关键在于提出了一种名为Visionerves的新型混合人工智能框架,该框架融合了深度学习与符号空间推理技术:第一阶段通过深度学习模型自动分割解剖结构,第二阶段利用模糊空间关系编码解剖知识进行符号化空间推理,从而实现无需手动ROI干预的神经束追踪。此方法在lumbosacral plexus成像中显著优于传统纤维追踪技术,Dice评分提升最高达25%,空间误差降低至5 mm以内,为非侵入性诊断子宫内膜异位症相关神经病变提供了可重复、自动化的分析路径。

链接: https://arxiv.org/abs/2509.18185
作者: Giammarco La Barbera,Enzo Bonnot,Thomas Isla,Juan Pablo de la Plata,Joy-Rose Dunoyer de Segonzac,Jennifer Attali,Cécile Lozach,Alexandre Bellucci,Louis Marcellin,Laure Fournier,Sabine Sarnacki,Pietro Gori,Isabelle Bloch
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Computer-Aided Pelvic Imaging for Female Health (CAPI) - Workshop MICCAI 2025

点击查看摘要

Abstract:Endometriosis often leads to chronic pelvic pain and possible nerve involvement, yet imaging the peripheral nerves remains a challenge. We introduce Visionerves, a novel hybrid AI framework for peripheral nervous system recognition from multi-gradient DWI and morphological MRI data. Unlike conventional tractography, Visionerves encodes anatomical knowledge through fuzzy spatial relationships, removing the need for selection of manual ROIs. The pipeline comprises two phases: (A) automatic segmentation of anatomical structures using a deep learning model, and (B) tractography and nerve recognition by symbolic spatial reasoning. Applied to the lumbosacral plexus in 10 women with (confirmed or suspected) endometriosis, Visionerves demonstrated substantial improvements over standard tractography, with Dice score improvements of up to 25% and spatial errors reduced to less than 5 mm. This automatic and reproducible approach enables detailed nerve analysis and paves the way for non-invasive diagnosis of endometriosis-related neuropathy, as well as other conditions with nerve involvement.
zh

[CV-121] URNet: Uncertainty-aware Refinement Network for Event-based Stereo Depth Estimation

链接: https://arxiv.org/abs/2509.18184
作者: Yifeng Cheng,Alois Knoll,Hu Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is accepted by Visual Intelligence Journal

点击查看摘要

[CV-122] VLA-LPAF: Lightweight Perspective-Adaptive Fusion for Vision-Language-Action to Enable More Unconstrained Robotic Manipulation

【速读】:该论文旨在解决视觉-语言-动作(Visual-Language-Action, VLA)模型在不同视角下表现不一致的问题,即由于多视角观测数据在数量和视角上的差异导致的视觉特征异质性限制了VLA模型的泛化能力。解决方案的关键在于提出轻量级模块VLA-LPAF(Viewpoint-Adaptive Latent Feature Fusion),该模块仅使用单视角图像进行微调,并在潜在空间中融合其他多视角观测信息,从而有效且高效地缓解因视角不一致带来的性能差异。

链接: https://arxiv.org/abs/2509.18183
作者: Jinyue Bian,Zhaoxing Zhang,Zhengyu Liang,Shiwei Zheng,Shengtao Zhang,Rong Shen,Chen Yang,Anzhou Hou
机构: Li Auto Inc. (小鹏汽车)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Visual-Language-Action (VLA) models can follow text instructions according to visual observations of the surrounding environment. This ability to map multimodal inputs to actions is derived from the training of the VLA model on extensive standard demonstrations. These visual observations captured by third-personal global and in-wrist local cameras are inevitably varied in number and perspective across different environments, resulting in significant differences in the visual features. This perspective heterogeneity constrains the generality of VLA models. In light of this, we first propose the lightweight module VLA-LPAF to foster the perspective adaptivity of VLA models using only 2D data. VLA-LPAF is finetuned using images from a single view and fuses other multiview observations in the latent space, which effectively and efficiently bridge the gap caused by perspective inconsistency. We instantiate our VLA-LPAF framework with the VLA model RoboFlamingo to construct RoboFlamingo-LPAF. Experiments show that RoboFlamingo-LPAF averagely achieves around 8% task success rate improvement on CALVIN, 15% on LIBERO, and 30% on a customized simulation benchmark. We also demonstrate the developed viewadaptive characteristics of the proposed RoboFlamingo-LPAF through real-world tasks.
zh

[CV-123] AI-Derived Structural Building Intelligence for Urban Resilience: An Application in Saint Vincent and the Grenadines ICCV2025

【速读】:该论文旨在解决小岛屿发展中国家(SIDS)在气候脆弱地区缺乏详细建筑结构信息的问题,这限制了其在飓风、洪水和滑坡等灾害事件中的风险评估与城市韧性规划能力。解决方案的关键在于提出一种基于人工智能(AI)的自动化工作流,利用高分辨率卫星遥感影像自动推断屋顶属性(如屋顶坡度和材料),并通过对比地理空间基础模型结合浅层分类器与微调的深度学习模型,验证了前者在数据有限场景下的有效性;同时,通过引入邻近SIDS的额外训练数据提升了模型性能,最终实现了屋顶坡度和屋顶材料分类的F1分数分别为0.88和0.83,为SIDS提供了可复制、可扩展的AI与地球观测(Earth Observation, EO)融合的技术路径,助力其实现更高效的证据驱动型城市治理。

链接: https://arxiv.org/abs/2509.18182
作者: Isabelle Tingzon,Yoji Toriumi,Caroline Gevaert
机构: The World Bank Group (世界银行集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: Accepted at the 2nd Workshop on Computer Vision for Developing Countries (CV4DC) at ICCV 2025

点击查看摘要

Abstract:Detailed structural building information is used to estimate potential damage from hazard events like cyclones, floods, and landslides, making them critical for urban resilience planning and disaster risk reduction. However, such information is often unavailable in many small island developing states (SIDS) in climate-vulnerable regions like the Caribbean. To address this data gap, we present an AI-driven workflow to automatically infer rooftop attributes from high-resolution satellite imagery, with Saint Vincent and the Grenadines as our case study. Here, we compare the utility of geospatial foundation models combined with shallow classifiers against fine-tuned deep learning models for rooftop classification. Furthermore, we assess the impact of incorporating additional training data from neighboring SIDS to improve model performance. Our best models achieve F1 scores of 0.88 and 0.83 for roof pitch and roof material classification, respectively. Combined with local capacity building, our work aims to provide SIDS with novel capabilities to harness AI and Earth Observation (EO) data to enable more efficient, evidence-based urban governance.
zh

[CV-124] he Describe-Then-Generate Bottleneck: How VLM Descriptions Alter Image Generation Outcomes

链接: https://arxiv.org/abs/2509.18179
作者: Sai Varun Kodathala,Rakesh Vunnam
机构: Sports Vision, Inc. (Sports Vision, Inc.); Vizworld, Inc. (Vizworld, Inc.)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 7 Figures

点击查看摘要

[CV-125] A Framework for Generating Artificial Datasets to Validate Absolute and Relative Position Concepts

【速读】:该论文旨在解决当前人工智能(AI)模型在理解基础概念(如物体识别、绝对与相对位置、属性识别)方面存在认知不一致和能力不足的问题。解决方案的关键在于提出 Scrapbook 框架,该框架通过生成大量针对单一概念且语言形式多样的问题数据集,系统性地评估模型对基本语义元素的理解深度与一致性。实验表明,尽管主流模型在物体识别上表现良好,但在处理位置信息和带约束条件的问题时出现显著错误或回答偏差,而 Scrapbook 框架能有效暴露这些薄弱环节,从而为后续模型优化提供可量化的基准和方向。

链接: https://arxiv.org/abs/2509.18177
作者: George Corrêa de Araújo,Helena de Almeida Maia,Helio Pedrini
机构: Institute of Computing, University of Campinas (坎皮纳斯州立大学计算机学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: WIP

点击查看摘要

Abstract:In this paper, we present the Scrapbook framework, a novel methodology designed to generate extensive datasets for probing the learned concepts of artificial intelligence (AI) models. The framework focuses on fundamental concepts such as object recognition, absolute and relative positions, and attribute identification. By generating datasets with a large number of questions about individual concepts and a wide linguistic variation, the Scrapbook framework aims to validate the model’s understanding of these basic elements before tackling more complex tasks. Our experimental findings reveal that, while contemporary models demonstrate proficiency in recognizing and enumerating objects, they encounter challenges in comprehending positional information and addressing inquiries with additional constraints. Specifically, the MobileVLM-V2 model showed significant answer disagreements and plausible wrong answers, while other models exhibited a bias toward affirmative answers and struggled with questions involving geometric shapes and positional information, indicating areas for improvement in understanding and consistency. The proposed framework offers a valuable instrument for generating diverse and comprehensive datasets, which can be utilized to systematically assess and enhance the performance of AI models.
zh

[CV-126] A Deep Learning Approach for Spatio-Temporal Forecasting of InSAR Ground Deformation in Eastern Ireland

链接: https://arxiv.org/abs/2509.18176
作者: Wendong Yao,Saeed Azadnejad,Binhua Huang,Shane Donohue,Soumyabrata Dev
机构: ADAPT SFI Research Centre, School of Computer Science, University College Dublin (都柏林大学计算机科学学院); School of Civil Engineering, University College Dublin (都柏林大学土木工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper is submitted to IEEE Transactions on Geoscience and Remote Sensing

点击查看摘要

[CV-127] MAGIA: Sensing Per-Image Signals from Single-Round Averag ed Gradients for Label-Inference-Free Gradient Inversion

链接: https://arxiv.org/abs/2509.18170
作者: Zhanting Zhou,Jinbo Wang,Zeqin Wu,Fengli Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-128] Self Identity Mapping

链接: https://arxiv.org/abs/2509.18165
作者: Xiuding Cai,Yaoyao Zhu,Linjie Fu,Dong Miao,Yu Yao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Early accepted by Neural Networks 2025

点击查看摘要

[CV-129] PerceptronCARE: A Deep Learning-Based Intelligent Teleopthalmology Application for Diabetic Retinopathy Diagnosis

【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)在资源匮乏地区难以实现早期筛查和诊断的问题,从而降低视力丧失风险。其解决方案的关键在于开发了一种名为PerceptronCARE的深度学习驱动的远程眼科应用,通过集成多种卷积神经网络(如ResNet-18、EfficientNet-B0和SqueezeNet)实现高精度(85.4%)且计算高效的DR分级分类,并结合云端可扩展性、安全的数据管理及多用户架构,支持临床与远程医疗场景下的实时筛查,显著提升诊断效率并降低医疗成本。

链接: https://arxiv.org/abs/2509.18160
作者: Akwasi Asare,Isaac Baffour Senkyire,Emmanuel Freeman,Simon Hilary Ayinedenaba Aluze-Ele,Kelvin Kwao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Diabetic retinopathy is a leading cause of vision loss among adults and a major global health challenge, particularly in underserved regions. This study presents PerceptronCARE, a deep learning-based teleophthalmology application designed for automated diabetic retinopathy detection using retinal images. The system was developed and evaluated using multiple convolutional neural networks, including ResNet-18, EfficientNet-B0, and SqueezeNet, to determine the optimal balance between accuracy and computational efficiency. The final model classifies disease severity with an accuracy of 85.4%, enabling real-time screening in clinical and telemedicine settings. PerceptronCARE integrates cloud-based scalability, secure patient data management, and a multi-user framework, facilitating early diagnosis, improving doctor-patient interactions, and reducing healthcare costs. This study highlights the potential of AI-driven telemedicine solutions in expanding access to diabetic retinopathy screening, particularly in remote and resource-constrained environments.
zh

[CV-130] PolypSeg-GradCAM: Towards Explainable Computer-Aided Gastrointestinal Disease Detection Using U-Net Based Segmentation and Grad-CAM Visualization on the Kvasir Dataset

链接: https://arxiv.org/abs/2509.18159
作者: Akwasi Asare,Ulas Bagci
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

[CV-131] MiniCPM-V 4.5: Cooking Efficient MLLM s via Architecture Data and Training Recipe

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练与推理效率方面的瓶颈问题,以提升其可访问性与可扩展性。解决方案的关键在于三个核心改进:一是提出统一的3D-Resampler模型架构,实现对图像和视频的高密度编码;二是设计一种无需复杂数据工程的统一学习范式,同时支持文档知识理解和文本识别;三是采用混合强化学习策略,使模型在短程和长程推理任务中均具备高熟练度。这些创新共同实现了在参数规模较小(8B)的情况下,性能超越多个主流商用及开源大模型,且显著降低GPU内存占用与推理时间。

链接: https://arxiv.org/abs/2509.18154
作者: Tianyu Yu,Zefan Wang,Chongyi Wang,Fuwei Huang,Wenshuo Ma,Zhihui He,Tianchi Cai,Weize Chen,Yuxiang Huang,Yuanqian Zhao,Bokai Xu,Junbo Cui,Yingjing Xu,Liqing Ruan,Luoyuan Zhang,Hanyu Liu,Jingkun Tang,Hongyuan Liu,Qining Guo,Wenhao Hu,Bingxiang He,Jie Zhou,Jie Cai,Ji Qi,Zonghao Guo,Chi Chen,Guoyang Zeng,Yuxuan Li,Ganqu Cui,Ning Ding,Xu Han,Yuan Yao,Zhiyuan Liu,Maosong Sun
机构: MiniCPM-V Team, OpenBMB
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Project Website: this https URL

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and scalable. To address the challenges, we present MiniCPM-V 4.5, an 8B parameter model designed for high efficiency and strong performance. We introduce three core improvements in model architecture, data strategy and training method: a unified 3D-Resampler model architecture for highly compact encoding over images and videos, a unified learning paradigm for document knowledge and text recognition without heavy data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes. Comprehensive experimental results in OpenCompass evaluation show that MiniCPM-V 4.5 surpasses widely used proprietary models such as GPT-4o-latest, and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, the strong performance is achieved with remarkable efficiency. For example, on the widely adopted VideoMME benchmark, MiniCPM-V 4.5 achieves state-of-the-art performance among models under 30B size, using just 46.7% GPU memory cost and 8.7% inference time of Qwen2.5-VL 7B.
zh

[CV-132] KM-GPT : An Automated Pipeline for Reconstructing Individual Patient Data from Kaplan-Meier Plots

链接: https://arxiv.org/abs/2509.18141
作者: Yao Zhao,Haoyue Sun,Yantian Ding,Yanxun Xu
机构: Johns Hopkins University (约翰霍普金斯大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Applications (stat.AP); Machine Learning (stat.ML)
备注:

点击查看摘要

[CV-133] Prompt Optimization Meets Subspace Representation Learning for Few-shot Out-of-Distribution Detection

链接: https://arxiv.org/abs/2509.18111
作者: Faizul Rakib Sayem,Shahana Ibrahim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-134] Localized PCA-Net Neural Operators for Scalable Solution Reconstruction of Elliptic PDEs

链接: https://arxiv.org/abs/2509.18110
作者: Mrigank Dhingra,Romit Maulik,Adil Rasheed,Omer San
机构: University of Tennessee, Knoxville (田纳西大学诺克斯维尔分校); The Pennsylvania State University (宾夕法尼亚州立大学); Norwegian University of Science and Technology (挪威科技大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-135] Leverag ing Geometric Visual Illusions as Perceptual Inductive Biases for Vision Models

【速读】:该论文试图解决当前深度学习模型在图像分类任务中过度依赖大规模数据统计规律、缺乏基于人类感知机制的结构化先验知识的问题。其解决方案的关键在于引入经典几何视觉错觉(geometric visual illusions)作为辅助监督信号,构建一个参数化的合成错觉数据集,并通过多源学习策略将错觉识别任务与ImageNet分类目标相结合,从而引导模型学习更具感知一致性的特征表示。实验表明,这种基于感知心理学的归纳偏置(inductive bias)能够显著提升模型在复杂纹理和精细轮廓场景下的泛化能力,并增强卷积神经网络(CNN)与Transformer架构对结构信息的敏感性。

链接: https://arxiv.org/abs/2509.15156
作者: Haobo Yang,Minghao Guo,Dequan Yang,Wenyu Wang
机构: Mohamed bin Zayed University of Artificial Intelligence (MBZUAI); University of Oxford
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Contemporary deep learning models have achieved impressive performance in image classification by primarily leveraging statistical regularities within large datasets, but they rarely incorporate structured insights drawn directly from perceptual psychology. To explore the potential of perceptually motivated inductive biases, we propose integrating classic geometric visual illusions well-studied phenomena from human perception into standard image-classification training pipelines. Specifically, we introduce a synthetic, parametric geometric-illusion dataset and evaluate three multi-source learning strategies that combine illusion recognition tasks with ImageNet classification objectives. Our experiments reveal two key conceptual insights: (i) incorporating geometric illusions as auxiliary supervision systematically improves generalization, especially in visually challenging cases involving intricate contours and fine textures; and (ii) perceptually driven inductive biases, even when derived from synthetic stimuli traditionally considered unrelated to natural image recognition, can enhance the structural sensitivity of both CNN and transformer-based architectures. These results demonstrate a novel integration of perceptual science and machine learning and suggest new directions for embedding perceptual priors into vision model design.
zh

[CV-136] MOIS-SAM2: Exemplar-based Segment Anything Model 2 for multilesion interactive segmentation of neurobromas in whole-body MRI

【速读】:该论文旨在解决神经纤维瘤病1型(Neurofibromatosis Type 1, NF1)患者全身磁共振成像(Whole-body MRI, WB-MRI)中多发性神经纤维瘤(Neurofibromas, NFs)的高效、精准且可扩展的交互式分割问题。现有交互式分割方法难以在保持高病变级精度的同时,适应数百个病变的规模化处理需求。解决方案的关键在于提出MOIS-SAM2模型,该模型基于Transformer架构的可提示Segment Anything Model 2(SAM2)进行扩展,引入了基于样本的语义传播机制(exemplar-based semantic propagation),从而实现仅需少量用户交互即可完成对大量NFs的高精度分割,并在不同MRI场强、扫描仪厂商及低肿瘤负荷等域偏移场景下展现出良好的泛化能力。

链接: https://arxiv.org/abs/2509.19277
作者: Georgii Kolokolnikov,Marie-Lena Schmalhofer,Sophie Götz,Lennart Well,Said Farschtschi,Victor-Felix Mautner,Inka Ristow,Rene Werner
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Background and Objectives: Neurofibromatosis type 1 is a genetic disorder characterized by the development of numerous neurofibromas (NFs) throughout the body. Whole-body MRI (WB-MRI) is the clinical standard for detection and longitudinal surveillance of NF tumor growth. Existing interactive segmentation methods fail to combine high lesion-wise precision with scalability to hundreds of lesions. This study proposes a novel interactive segmentation model tailored to this challenge. Methods: We introduce MOIS-SAM2, a multi-object interactive segmentation model that extends the state-of-the-art, transformer-based, promptable Segment Anything Model 2 (SAM2) with exemplar-based semantic propagation. MOIS-SAM2 was trained and evaluated on 119 WB-MRI scans from 84 NF1 patients acquired using T2-weighted fat-suppressed sequences. The dataset was split at the patient level into a training set and four test sets (one in-domain and three reflecting different domain shift scenarios, e.g., MRI field strength variation, low tumor burden, differences in clinical site and scanner vendor). Results: On the in-domain test set, MOIS-SAM2 achieved a scan-wise DSC of 0.60 against expert manual annotations, outperforming baseline 3D nnU-Net (DSC: 0.54) and SAM2 (DSC: 0.35). Performance of the proposed model was maintained under MRI field strength shift (DSC: 0.53) and scanner vendor variation (DSC: 0.50), and improved in low tumor burden cases (DSC: 0.61). Lesion detection F1 scores ranged from 0.62 to 0.78 across test sets. Preliminary inter-reader variability analysis showed model-to-expert agreement (DSC: 0.62-0.68), comparable to inter-expert agreement (DSC: 0.57-0.69). Conclusions: The proposed MOIS-SAM2 enables efficient and scalable interactive segmentation of NFs in WB-MRI with minimal user input and strong generalization, supporting integration into clinical workflows. Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG) Cite as: arXiv:2509.19277 [eess.IV] (or arXiv:2509.19277v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2509.19277 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Rene Werner [view email] [v1] Tue, 23 Sep 2025 17:42:24 UTC (4,394 KB)
zh

[CV-137] Quantum Random Synthetic Skyrmion Texture Generation a Qiskit Simulation

链接: https://arxiv.org/abs/2509.18947
作者: Hillol Biswas
机构: 未知
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

[CV-138] Reconstruction of Optical Coherence Tomography Images from Wavelength-space Using Deep-learning

【速读】:该论文旨在解决传统傅里叶域光学相干断层扫描(Fourier-domain Optical Coherence Tomography, FD-OCT)系统在图像重建中依赖波数(k)域重采样所导致的计算复杂度高及硬件资源消耗大的问题,同时应对由低相干干涉测量引起的散斑噪声(speckle noise)对图像质量的影响。其解决方案的关键在于提出一种基于深度学习(Deep-Learning, DL)的端到端重建方法,通过两个级联的编码器-解码器结构网络实现:首先使用空间域卷积神经网络(Spatial Domain CNN, SD-CNN)从波长域直接重构出结构清晰且去噪的初始图像,再利用傅里叶域卷积神经网络(Fourier Domain CNN, FD-CNN)在频域进一步优化图像质量,从而在无需k域重采样的前提下显著提升图像分辨率与信噪比,并大幅降低计算复杂度。

链接: https://arxiv.org/abs/2509.18783
作者: Maryam Viqar,Erdem Sahin,Elena Stoykova,Violeta Madjarova
机构: 未知
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Conventional Fourier-domain Optical Coherence Tomography (FD-OCT) systems depend on resampling into wavenumber (k) domain to extract the depth profile. This either necessitates additional hardware resources or amplifies the existing computational complexity. Moreover, the OCT images also suffer from speckle noise, due to systemic reliance on low coherence interferometry. We propose a streamlined and computationally efficient approach based on Deep-Learning (DL) which enables reconstructing speckle-reduced OCT images directly from the wavelength domain. For reconstruction, two encoder-decoder styled networks namely Spatial Domain Convolution Neural Network (SD-CNN) and Fourier Domain CNN (FD-CNN) are used sequentially. The SD-CNN exploits the highly degraded images obtained by Fourier transforming the domain fringes to reconstruct the deteriorated morphological structures along with suppression of unwanted noise. The FD-CNN leverages this output to enhance the image quality further by optimization in Fourier domain (FD). We quantitatively and visually demonstrate the efficacy of the method in obtaining high-quality OCT images. Furthermore, we illustrate the computational complexity reduction by harnessing the power of DL models. We believe that this work lays the framework for further innovations in the realm of OCT image reconstruction.
zh

[CV-139] Efficient Breast and Ovarian Cancer Classification via ViT-Based Preprocessing and Transfer Learning

链接: https://arxiv.org/abs/2509.18553
作者: Richa Rawat,Faisal Ahmed
机构: University of Texas at Arlington (德克萨斯大学阿灵顿分校); Embry-Riddle Aeronautical University (Embry-Riddle 航空大学)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 10 pages, 3 figures

点击查看摘要

[CV-140] Dynamical Modeling of Behaviorally Relevant Spatiotemporal Patterns in Neural Imaging Data ICML

链接: https://arxiv.org/abs/2509.18507
作者: Mohammad Hosseini,Maryam M. Shanechi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Published at the 42nd International Conference on Machine Learning (ICML) 2025. Code available at: this https URL

点击查看摘要

[CV-141] Machine learning approach to single-shot multiparameter estimation for the non-linear Schrödinger equation

链接: https://arxiv.org/abs/2509.18479
作者: Louis Rossignol,Tangui Aladjidi,Myrann Baker-Rasooli,Quentin Glorieux
机构: Sorbonne University (索邦大学)
类目: Quantum Physics (quant-ph); Computer Vision and Pattern Recognition (cs.CV); Optics (physics.optics)
备注: 10 pages, 4 figures

点击查看摘要

[CV-142] Neural Network-Driven Direct CBCT-Based Dose Calculation for Head-and-Neck Proton Treatment Planning

【速读】:该论文旨在解决基于锥形束计算机断层成像(CBCT)图像进行质子治疗剂量计算的准确性问题,特别是在适应性放疗中因分次间解剖结构变化导致的传统CBCT图像质量限制及复杂校正流程的挑战。其解决方案的关键在于提出一种基于扩展长短期记忆(xLSTM)神经网络的深度学习方法(CBCT-NN),通过引入能量标记编码和视线视角序列建模,有效捕捉质子剂量沉积的空间依赖性,从而实现从CBCT图像直接生成高精度剂量分布,无需传统校正流程,在保持蒙特卡罗级准确度的同时,计算时间低于3分钟,满足临床自适应放疗的需求。

链接: https://arxiv.org/abs/2509.18378
作者: Muheng Li,Evangelia Choulilitsa,Lisa Fankhauser,Francesca Albertini,Antony Lomax,Ye Zhang
机构: Center for Proton Therapy, Paul Scherrer Institute (PSI), Villigen, Switzerland; Department of Physics, ETH Zürich, Zürich, Switzerland
类目: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Accurate dose calculation on cone beam computed tomography (CBCT) images is essential for modern proton treatment planning workflows, particularly when accounting for inter-fractional anatomical changes in adaptive treatment scenarios. Traditional CBCT-based dose calculation suffers from image quality limitations, requiring complex correction workflows. This study develops and validates a deep learning approach for direct proton dose calculation from CBCT images using extended Long Short-Term Memory (xLSTM) neural networks. A retrospective dataset of 40 head-and-neck cancer patients with paired planning CT and treatment CBCT images was used to train an xLSTM-based neural network (CBCT-NN). The architecture incorporates energy token encoding and beam’s-eye-view sequence modelling to capture spatial dependencies in proton dose deposition patterns. Training utilized 82,500 paired beam configurations with Monte Carlo-generated ground truth doses. Validation was performed on 5 independent patients using gamma analysis, mean percentage dose error assessment, and dose-volume histogram comparison. The CBCT-NN achieved gamma pass rates of 95.1 \pm 2.7% using 2mm/2% criteria. Mean percentage dose errors were 2.6 \pm 1.4% in high-dose regions ( 90% of max dose) and 5.9 \pm 1.9% globally. Dose-volume histogram analysis showed excellent preservation of target coverage metrics (Clinical Target Volume V95% difference: -0.6 \pm 1.1%) and organ-at-risk constraints (parotid mean dose difference: -0.5 \pm 1.5%). Computation time is under 3 minutes without sacrificing Monte Carlo-level accuracy. This study demonstrates the proof-of-principle of direct CBCT-based proton dose calculation using xLSTM neural networks. The approach eliminates traditional correction workflows while achieving comparable accuracy and computational efficiency suitable for adaptive protocols.
zh

人工智能

[AI-0] SOE: Sample-Efficient Robot Policy Self-Improvement via On-Manifold Exploration

【速读】:该论文旨在解决机器人策略在执行任务时因动作模式坍缩(action mode collapse)而导致探索能力不足的问题,从而限制了策略优化的效率与安全性。现有方法多依赖随机扰动来增强探索,但此类方式存在安全隐患且易引发不稳定行为。解决方案的关键在于提出一种名为“基于流形上的自提升探索”(Self-Improvement via On-Manifold Exploration, SOE)的框架,该框架通过学习任务相关因素的紧凑潜在表示,并将探索约束在有效动作流形(valid action manifold)内,确保探索的安全性、多样性与有效性;同时,SOE可作为插件模块无缝集成至任意策略模型中,在不损害基础策略性能的前提下显著提升探索能力,并借助结构化的潜在空间实现人类引导探索,从而提高效率与可控性。

链接: https://arxiv.org/abs/2509.19292
作者: Yang Jin,Jun Lv,Han Xue,Wendi Chen,Chuan Wen,Cewu Lu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Intelligent agents progress by continually refining their capabilities through actively exploring environments. Yet robot policies often lack sufficient exploration capability due to action mode collapse. Existing methods that encourage exploration typically rely on random perturbations, which are unsafe and induce unstable, erratic behaviors, thereby limiting their effectiveness. We propose Self-Improvement via On-Manifold Exploration (SOE), a framework that enhances policy exploration and improvement in robotic manipulation. SOE learns a compact latent representation of task-relevant factors and constrains exploration to the manifold of valid actions, ensuring safety, diversity, and effectiveness. It can be seamlessly integrated with arbitrary policy models as a plug-in module, augmenting exploration without degrading the base policy performance. Moreover, the structured latent space enables human-guided exploration, further improving efficiency and controllability. Extensive experiments in both simulation and real-world tasks demonstrate that SOE consistently outperforms prior methods, achieving higher task success rates, smoother and safer exploration, and superior sample efficiency. These results establish on-manifold exploration as a principled approach to sample-efficient policy self-improvement. Project website: this https URL
zh

[AI-1] Agent Init: Initializing LLM -based Multi-Agent Systems via Diversity and Expertise Orchestration for Effective and Efficient Collaboration EMNLP2025

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中初始阶段Agent团队结构设计不合理的问题,现有方法未能充分考虑后续协作需求,导致系统效率与效果受限。其解决方案的关键在于提出AgentInit框架,通过引入自然语言到格式化(Natural Language to Format)机制确保生成Agent的一致性与标准化,并结合基于帕累托最优(Pareto principles)的平衡团队选择策略,协同优化Agent团队多样性与任务相关性,从而提升协作效能与整体性能。实验表明,该方法在多种任务和框架下均显著优于当前最优初始化方法及预定义策略,且大幅降低Token消耗,具备良好的可迁移性和组件有效性。

链接: https://arxiv.org/abs/2509.19236
作者: Chunhao Tian,Yutong Wang,Xuebo Liu,Zhexuan Wang,Liang Ding,Miao Zhang,Min Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: EMNLP 2025 Findings

点击查看摘要

Abstract:Proper initialization is crucial for any system, particularly in multi-agent systems (MAS), where it plays a pivotal role in determining both the system’s efficiency and effectiveness. However, existing MAS initialization methods do not fully account for the collaborative needs of the generated agents in subsequent stages. Inspired by the principles of effective team composition, we propose AgentInit, which aims to optimize the structure of agent teams. Specifically, in addition to multi-round interactions and reflections between agents during agent generation, AgentInit incorporates a Natural Language to Format mechanism to ensure consistency and standardization. Balanced team selection strategies using Pareto principles are subsequently applied to jointly consider agent team diversity and task relevance to promote effective and efficient collaboration and enhance overall system performance. Experiments show that AgentInit consistently outperforms state-of-the-art initialization methods and pre-defined strategies across various frameworks and tasks, achieving an overall performance improvement of up to 1.2 and 1.6, respectively, while also significantly reducing token consumption. Further analysis confirms its strong transferability to similar tasks and verifies the effectiveness of its key components, demonstrating its capability and adaptability as a reliable MAS initialization method. Source code and models are available at this https URL.
zh

[AI-2] FedFusion: Federated Learning with Diversity- and Cluster-Aware Encoders for Robust Adaptation under Label Scarcity

【速读】:该论文旨在解决联邦学习(Federated Learning)在实际应用中面临的三大挑战:异构特征空间、严重的非独立同分布(non-IID)数据分布以及客户端间标签稀缺问题。其解决方案的关键在于提出FedFusion框架,通过三个核心机制实现:(1) 引入多样性感知的编码器(DivEn系列),使各客户端能够维护个性化模型以适应本地数据特性;(2) 利用置信度过滤的伪标签与域自适应迁移机制,由标注教师客户端指导无标签学习客户端;(3) 采用相似性加权分类器耦合策略(可选聚类平均),在保持全局一致性的同时缓解数据丰富客户端的主导效应,提升少数客户端性能。此外,该方法还设计了轻量级标签高效管道,结合自监督/半监督预训练与选择性微调,在不共享原始数据的前提下显著降低标注需求。实验表明,FedFusion在多种场景下均优于现有最优基线,在准确率、鲁棒性和公平性上取得提升,且通信与计算开销可控。

链接: https://arxiv.org/abs/2509.19220
作者: Ferdinand Kahenga,Antoine Bagula,Patrick Sello,Sajal K. Das
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated learning in practice must contend with heterogeneous feature spaces, severe non-IID data, and scarce labels across clients. We present FedFusion, a federated transfer-learning framework that unifies domain adaptation and frugal labelling with diversity-/cluster-aware encoders (DivEn, DivEn-mix, DivEn-c). Labelled teacher clients guide learner clients via confidence-filtered pseudo-labels and domain-adaptive transfer, while clients maintain personalised encoders tailored to local data. To preserve global coherence under heterogeneity, FedFusion employs similarity-weighted classifier coupling (with optional cluster-wise averaging), mitigating dominance by data-rich sites and improving minority-client performance. The frugal-labelling pipeline combines self-/semi-supervised pretext training with selective fine-tuning, reducing annotation demands without sharing raw data. Across tabular and imaging benchmarks under IID, non-IID, and label-scarce regimes, FedFusion consistently outperforms state-of-the-art baselines in accuracy, robustness, and fairness while maintaining comparable communication and computation budgets. These results show that harmonising personalisation, domain adaptation, and label efficiency is an effective recipe for robust federated learning under real-world constraints.
zh

[AI-3] YAC: Bridging Natural Language and Interactive Visual Exploration with Generative AI for Biomedical Data Discovery

【速读】:该论文旨在解决生物医学数据发现界面中如何有效融合自然语言输入与交互式可视化工具的问题。当前生成式 AI 虽能提升用户与系统交互的便捷性,但传统可视化组件在数据探索中的核心作用仍不可替代。解决方案的关键在于构建一个基于多智能体系统的原型系统 YAC(Yet Another Chatbot),通过生成结构化的声明式输出,将自然语言指令转化为可执行的数据操作逻辑,并据此渲染联动的交互式可视化图表及应用数据过滤;同时引入控件(widgets)机制,允许用户通过图形化界面调整结构化输出参数,从而实现自然语言与交互式可视化之间的双向映射与协同控制。

链接: https://arxiv.org/abs/2509.19182
作者: Devin Lange,Shanghua Gao,Pengwei Sui,Austen Money,Priya Misner,Marinka Zitnik,Nils Gehlenborg
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Incorporating natural language input has the potential to improve the capabilities of biomedical data discovery interfaces. However, user interface elements and visualizations are still powerful tools for interacting with data, even in the new world of generative AI. In our prototype system, YAC, Yet Another Chatbot, we bridge the gap between natural language and interactive visualizations by generating structured declarative output with a multi-agent system and interpreting that output to render linked interactive visualizations and apply data filters. Furthermore, we include widgets, which allow users to adjust the values of that structured output through user interface elements. We reflect on the capabilities and design of this system with an analysis of its technical dimensions and illustrate the capabilities through four usage scenarios.
zh

[AI-4] Generative Propaganda

【速读】:该论文旨在解决生成式AI(Generative AI)在现实世界中被用于操纵公众舆论的问题,尤其聚焦于台湾和印度这两个在线宣传高度活跃的地区。研究通过访谈防御者(如事实核查人员、记者、官员)和创作者(如网红、政治顾问、广告商),揭示了当前对“深度伪造”(deepfakes)的过度关注如何扭曲了对生成式AI滥用模式的理解。解决方案的关键在于提出一个分类框架,将生成式宣传区分为显性与隐性、推广型与贬损型四种类型,从而指出: deception(欺骗)并非主要动机或影响路径,而创作者更常采用显性使用以降低法律与声誉风险,并利用AI提升跨语言传播效率及规避人工与算法检测;因此,安全研究人员应更新威胁模型,区分深度伪造与其他明显但具策略性的应用,强化内部行为约束的社会机制,并在全球层面应对AI带来的效率增益。

链接: https://arxiv.org/abs/2509.19147
作者: Madeleine I. G. Daepp,Alejandro Cuevas,Robert Osazuwa Ness,Vickie Yu-Ping Wang,Bharat Kumar Nayak,Dibyendu Mishra,Ti-Chung Cheng,Shaily Desai,Joyojeet Pal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Social and Information Networks (cs.SI)
备注: Working Paper

点击查看摘要

Abstract:Generative propaganda is the use of generative artificial intelligence (AI) to shape public opinion. To characterize its use in real-world settings, we conducted interviews with defenders (e.g., factcheckers, journalists, officials) in Taiwan and creators (e.g., influencers, political consultants, advertisers) as well as defenders in India, centering two places characterized by high levels of online propaganda. The term “deepfakes”, we find, exerts outsized discursive power in shaping defenders’ expectations of misuse and, in turn, the interventions that are prioritized. To better characterize the space of generative propaganda, we develop a taxonomy that distinguishes between obvious versus hidden and promotional versus derogatory use. Deception was neither the main driver nor the main impact vector of AI’s use; instead, Indian creators sought to persuade rather than to deceive, often making AI’s use obvious in order to reduce legal and reputational risks, while Taiwan’s defenders saw deception as a subset of broader efforts to distort the prevalence of strategic narratives online. AI was useful and used, however, in producing efficiency gains in communicating across languages and modes, and in evading human and algorithmic detection. Security researchers should reconsider threat models to clearly differentiate deepfakes from promotional and obvious uses, to complement and bolster the social factors that constrain misuse by internal actors, and to counter efficiency gains globally.
zh

[AI-5] On the Soundness and Consistency of LLM Agents for Executing Test Cases Written in Natural Language

【速读】:该论文旨在解决自然语言(Natural Language, NL)测试用例在图形用户界面(Graphical User Interface, GUI)测试中固有的不严谨性(unsoundness)和执行一致性差的问题。NL测试用例由于指令模糊或大语言模型(Large Language Models, LLMs)代理行为的不可预测性,可能导致误报失败;且同一测试用例多次执行可能产生不一致结果,影响测试可靠性。解决方案的关键在于提出一种带有护栏机制(guardrail mechanisms)和专用代理(specialised agents)的执行算法,通过动态验证每一步测试执行的正确性来提升鲁棒性和一致性,并引入“弱不严谨性”(weak unsoundness)定义以适配工业级质量标准(如六西格玛Six Sigma),从而在实际应用中实现可接受的测试有效性与稳定性。

链接: https://arxiv.org/abs/2509.19136
作者: Sébastien Salva,Redha Taguelmimt
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The use of natural language (NL) test cases for validating graphical user interface (GUI) applications is emerging as a promising direction to manually written executable test scripts, which are costly to develop and difficult to maintain. Recent advances in large language models (LLMs) have opened the possibility of the direct execution of NL test cases by LLM agents. This paper investigates this direction, focusing on the impact on NL test case unsoundness and on test case execution consistency. NL test cases are inherently unsound, as they may yield false failures due to ambiguous instructions or unpredictable agent behaviour. Furthermore, repeated executions of the same NL test case may lead to inconsistent outcomes, undermining test reliability. To address these challenges, we propose an algorithm for executing NL test cases with guardrail mechanisms and specialised agents that dynamically verify the correct execution of each test step. We introduce measures to evaluate the capabilities of LLMs in test execution and one measure to quantify execution consistency. We propose a definition of weak unsoundness to characterise contexts in which NL test case execution remains acceptable, with respect to the industrial quality levels Six Sigma. Our experimental evaluation with eight publicly available LLMs, ranging from 3B to 70B parameters, demonstrates both the potential and current limitations of current LLM agents for GUI testing. Our experiments show that Meta Llama 3.1 70B demonstrates acceptable capabilities in NL test case execution with high execution consistency (above the level 3-sigma). We provide prototype tools, test suites, and results.
zh

[AI-6] GSTM-HMU: Generative Spatio-Temporal Modeling for Human Mobility Understanding

【速读】:该论文旨在解决人类移动轨迹数据中语义复杂性和时间动态性建模不足的问题,以更准确地捕捉个体行为意图与长期生活方式规律。其解决方案的关键在于提出一种生成式时空框架GSTM-HMU,包含四个核心创新:(1)时空概念编码器(STCE)将地理位置、兴趣点(POI)类别语义和周期性时间节奏融合为统一向量表示;(2)认知轨迹记忆(CTM)自适应过滤历史访问记录,强化近期及行为显著事件以更好体现用户意图;(3)生活方式概念库(LCB)引入结构化的偏好线索(如活动类型与生活模式),提升模型的可解释性与个性化能力;(4)任务导向的生成头将学习到的表征转化为多下游任务预测,实现端到端的泛化能力。

链接: https://arxiv.org/abs/2509.19135
作者: Wenying Luo,Zhiyuan Lin,Wenhao Xu,Minghao Liu,Zhi Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Human mobility traces, often recorded as sequences of check-ins, provide a unique window into both short-term visiting patterns and persistent lifestyle regularities. In this work we introduce GSTM-HMU, a generative spatio-temporal framework designed to advance mobility analysis by explicitly modeling the semantic and temporal complexity of human movement. The framework consists of four key innovations. First, a Spatio-Temporal Concept Encoder (STCE) integrates geographic location, POI category semantics, and periodic temporal rhythms into unified vector representations. Second, a Cognitive Trajectory Memory (CTM) adaptively filters historical visits, emphasizing recent and behaviorally salient events in order to capture user intent more effectively. Third, a Lifestyle Concept Bank (LCB) contributes structured human preference cues, such as activity types and lifestyle patterns, to enhance interpretability and personalization. Finally, task-oriented generative heads transform the learned representations into predictions for multiple downstream tasks. We conduct extensive experiments on four widely used real-world datasets, including Gowalla, WeePlace, Brightkite, and FourSquare, and evaluate performance on three benchmark tasks: next-location prediction, trajectory-user identification, and time estimation. The results demonstrate consistent and substantial improvements over strong baselines, confirming the effectiveness of GSTM-HMU in extracting semantic regularities from complex mobility data. Beyond raw performance gains, our findings also suggest that generative modeling provides a promising foundation for building more robust, interpretable, and generalizable systems for human mobility intelligence.
zh

[AI-7] Analysis on distribution and clustering of weight

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)中权重特征的表征与分析问题,以揭示不同模型间的差异性及同一模型家族内部的相似性。其核心解决方案在于提出两种向量表征方法:标准差向量(Standard-Deviation Vector)和聚类向量(Clustering Vector)。前者通过归一化投影矩阵的标准差来刻画模型权重的分布特性,后者则基于奇异值聚类(K-Means算法)提取权重矩阵的关联结构,从而反映权重间的相关性特征。实验表明,这两种向量能够有效区分不同模型,并清晰展现同一家族模型间的相似性;同时,在LoRA微调后,标准差向量受数据集影响显著,而聚类向量保持稳定,体现出对预训练模型结构的高度一致性。

链接: https://arxiv.org/abs/2509.19122
作者: Chunming Ye,Wenquan Tian,Yalan Gao,Songzhou Li
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 14page,16 figures

点击查看摘要

Abstract:The study on architecture and parameter characteristics remains the hot topic in the research of large language models. In this paper we concern with the characteristics of weight which are used to analyze the correlations and differences between models. Two kinds of vectors-standard deviation vector and clustering vector-are proposed to describe features of models. In the first case, the weights are assumed to follow normal distribution. The standard deviation values of projection matrices are normalized to form Standard-Deviation Vector, representing the distribution characteristics of models. In the second case, the singular values from each weight projection matrix are extracted and grouped by K-Means algorithm. The grouped data with the same type matrix are combined as Clustering Vector to represent the correlation characteristics of models’ weights. The study reveals that these two vectors can effectively distinguish between different models and clearly show the similarities among models of the same family. Moreover, after conducting LoRA fine-tuning with different datasets and models, it is found that the distribution of weights represented by standard deviation vector is directly influenced by the dataset, but the correlations between different weights represented by clustering vector remain unaffected and maintain a high consistency with the pre-trained model.
zh

[AI-8] FedFiTS: Fitness-Selected Slotted Client Scheduling for Trustworthy Federated Learning in Healthcare AI

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在医疗等敏感领域部署时面临的三大核心挑战:非独立同分布(non-IID)数据、客户端不可靠性以及对抗性操纵。其解决方案的关键在于提出FedFiTS框架,该框架通过融合基于适应度的客户端选举机制与分槽聚合策略,构建了一个三阶段参与机制——自由参与训练、自然选择和分槽团队协作,并引入动态客户端评分、自适应阈值设定及基于群体的调度机制,从而在收敛效率与鲁棒性之间实现平衡。该方法同时整合了信任感知聚合与公平导向的客户端选择,显著提升了模型在真实场景下的准确率、抗中毒攻击能力及跨域适用性。

链接: https://arxiv.org/abs/2509.19120
作者: Ferdinand Kahenga,Antoine Bagula,Sajal K. Das,Patrick Sello
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注:

点击查看摘要

Abstract:Federated Learning (FL) has emerged as a powerful paradigm for privacy-preserving model training, yet deployments in sensitive domains such as healthcare face persistent challenges from non-IID data, client unreliability, and adversarial manipulation. This paper introduces FedFiTS, a trust and fairness-aware selective FL framework that advances the FedFaSt line by combining fitness-based client election with slotted aggregation. FedFiTS implements a three-phase participation strategy-free-for-all training, natural selection, and slotted team participation-augmented with dynamic client scoring, adaptive thresholding, and cohort-based scheduling to balance convergence efficiency with robustness. A theoretical convergence analysis establishes bounds for both convex and non-convex objectives under standard assumptions, while a communication-complexity analysis shows reductions relative to FedAvg and other baselines. Experiments on diverse datasets-medical imaging (X-ray pneumonia), vision benchmarks (MNIST, FMNIST), and tabular agricultural data (Crop Recommendation)-demonstrate that FedFiTS consistently outperforms FedAvg, FedRand, and FedPow in accuracy, time-to-target, and resilience to poisoning attacks. By integrating trust-aware aggregation with fairness-oriented client selection, FedFiTS advances scalable and secure FL, making it well suited for real-world healthcare and cross-domain deployments.
zh

[AI-9] owards Practical Multi-label Causal Discovery in High-Dimensional Event Sequences via One-Shot Graph Aggregation NEURIPS2025

【速读】:该论文旨在解决高维稀疏事件序列中因果关系识别的难题,尤其是在医疗或车辆诊断等场景下,如何从大量前置事件(如症状或错误代码)推断出结果标签(如疾病或系统故障)的因果结构。其解决方案的关键在于提出CARGO方法,该方法利用两个预训练的因果Transformer作为领域特定的基础模型,首先对每条事件序列并行地一次性推断出局部因果图,随后通过自适应频率融合策略聚合这些局部图,以重构标签的全局马尔可夫边界(Markov boundaries)。这一两阶段架构在不进行全数据集条件独立性测试的前提下,实现了大规模概率推理的高效性与可扩展性。

链接: https://arxiv.org/abs/2509.19112
作者: Hugo Math,Rainer Lienhart
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at NeuRIPS2025 Workshop on Structured Probabilistic Inference and Generative Modeling

点击查看摘要

Abstract:Understanding causality in event sequences where outcome labels such as diseases or system failures arise from preceding events like symptoms or error codes is critical. Yet remains an unsolved challenge across domains like healthcare or vehicle diagnostics. We introduce CARGO, a scalable multi-label causal discovery method for sparse, high-dimensional event sequences comprising of thousands of unique event types. Using two pretrained causal Transformers as domain-specific foundation models for event sequences. CARGO infers in parallel, per sequence one-shot causal graphs and aggregates them using an adaptive frequency fusion to reconstruct the global Markov boundaries of labels. This two-stage approach enables efficient probabilistic reasoning at scale while bypassing the intractable cost of full-dataset conditional independence testing. Our results on a challenging real-world automotive fault prediction dataset with over 29,100 unique event types and 474 imbalanced labels demonstrate CARGO’s ability to perform structured reasoning.
zh

[AI-10] Algorithms for Adversarially Robust Deep Learning

【速读】:该论文旨在解决深度学习模型在安全关键应用中面临的鲁棒性问题,具体包括三个核心场景:计算机视觉中的对抗样本(adversarial examples)问题、领域泛化(domain generalization)问题以及大语言模型(LLMs)的“越狱”(jailbreaking)攻击问题。解决方案的关键在于提出具有理论保障和实践效果的新算法与训练范式:针对对抗样本,设计了新的认证算法和训练策略以提升模型鲁棒性;针对领域泛化,提出了能实现医疗影像、分子识别和图像分类任务上最优泛化性能的算法;针对LLM越狱攻击,则构建了前沿的攻击与防御机制,推动语言模型代理的可靠性发展。这些方法共同体现了从理论到实践的系统性鲁棒性增强路径。

链接: https://arxiv.org/abs/2509.19100
作者: Alexander Robey
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: PhD thesis

点击查看摘要

Abstract:Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.
zh

[AI-11] A Mega-Study of Digital Twins Reveals Strengths Weaknesses and Opportunities for Further Improvement

【速读】:该论文旨在解决数字孪生(Digital Twin)是否能够准确捕捉个体在调查和实验中的响应行为这一问题。其核心挑战在于评估基于大规模个体层面数据构建的数字孪生模型,在预测具体个体回答或群体均值与方差方面的有效性。解决方案的关键在于通过19项预注册研究,利用美国全国代表性样本及其由大型语言模型(LLM)驱动的数字孪生体进行系统对比,覆盖164个结果变量,发现当前数字孪生体仅能捕捉到相对差异(平均相关系数约为0.2),但无法可靠预测个体精确答案或准确估计样本均值和方差,且性能受教育程度、收入水平和意识形态倾向等因素影响。这表明,尽管数字孪生可增强对群体异质性的建模能力,但其个体级预测可靠性仍需严格验证。

链接: https://arxiv.org/abs/2509.19088
作者: Tiany Peng,George Gui,Daniel J. Merlau,Grace Jiarui Fan,Malek Ben Sliman,Melanie Brucks,Eric J. Johnson,Vicki Morwitz,Abdullah Althenayyan,Silvia Bellezza,Dante Donati,Hortense Fong,Elizabeth Friedman,Ariana Guevara,Mohamed Hussein,Kinshuk Jerath,Bruce Kogut,Kristen Lane,Hannah Li,Patryk Perkowski,Oded Netzer,Olivier Toubia
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Applications (stat.AP)
备注:

点击查看摘要

Abstract:Do “digital twins” capture individual responses in surveys and experiments? We run 19 pre-registered studies on a national U.S. panel and their LLM-powered digital twins (constructed based on previously-collected extensive individual-level data) and compare twin and human answers across 164 outcomes. The correlation between twin and human answers is modest (approximately 0.2 on average) and twin responses are less variable than human responses. While constructing digital twins based on rich individual-level data improves our ability to capture heterogeneity across participants and predict relative differences between them, it does not substantially improve our ability to predict the exact answers given by specific participants or enhance predictions of population means. Twin performance varies by domain and is higher among more educated, higher-income, and ideologically moderate participants. These results suggest current digital twins can capture some degree of relative differences but are unreliable for individual-level predictions and sample mean and variance estimation, underscoring the need for careful validation before use. Our data and code are publicly available for researchers and practitioners interested in optimizing digital twin pipelines.
zh

[AI-12] Graph Neural Networks with Similarity-Navigated Probabilistic Feature Copying

【速读】:该论文旨在解决图神经网络(Graph Neural Networks, GNNs)在实际应用中面临的三大核心问题:特征过平滑(feature oversmoothing)导致深层网络中节点表示趋于不可区分、难以有效处理异质关系(heterogeneous relationships)以及将整个特征向量视为不可分割单元限制了灵活性。其解决方案的关键在于提出 AxelGNN,一种受 Axelrod 文化传播模型启发的新架构:通过相似性门控的概率交互机制,自适应地促进或抑制节点间信息传递以实现收敛或发散;引入基于特质级别的复制机制,在细粒度层面进行特征聚合;并维持全局极化状态以保持多个表示簇中节点的区分度。该设计使模型具备双稳态收敛动力学,可在单一架构内自然适应同质图(homophilic)与异质图(heterophilic)的不同结构特性,并在节点分类和影响力估计任务上显著优于或匹配现有最先进方法。

链接: https://arxiv.org/abs/2509.19084
作者: Asela Hevapathige
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Graph Neural Networks (GNNs) have demonstrated remarkable success across various graph-based tasks. However, they face some fundamental limitations: feature oversmoothing can cause node representations to become indistinguishable in deeper networks, they struggle to effectively manage heterogeneous relationships where connected nodes differ significantly, and they process entire feature vectors as indivisible units, which limits flexibility. We seek to address these limitations. We propose AxelGNN, a novel GNN architecture inspired by Axelrod’s cultural dissemination model that addresses these limitations through a unified framework. AxelGNN incorporates similarity-gated probabilistic interactions that adaptively promote convergence or divergence based on node similarity, implements trait-level copying mechanisms for fine-grained feature aggregation at the segment level, and maintains global polarization to preserve node distinctiveness across multiple representation clusters. The model’s bistable convergence dynamics naturally handle both homophilic and heterophilic graphs within a single architecture. Extensive experiments on node classification and influence estimation benchmarks demonstrate that AxelGNN consistently outperforms or matches state-of-the-art GNN methods across diverse graph structures with varying homophily-heterophily characteristics.
zh

[AI-13] World4RL: Diffusion World Models for Policy Refinement with Reinforcement Learning for Robotic Manipulation

【速读】:该论文旨在解决机器人操作中策略初始化依赖专家数据稀缺与覆盖不足的问题,以及强化学习在真实机器人上训练成本高、安全性差,而在仿真环境中又面临“仿真到现实”差距(sim-to-real gap)的挑战。其解决方案的关键在于提出World4RL框架,该框架利用基于扩散模型的世界模型作为高保真模拟器,在想象的环境中对预训练策略进行端到端优化,从而无需在线真实交互即可实现策略精炼;其核心创新包括:1)在多任务数据集上预训练扩散世界模型以捕捉多样化动态;2)采用专为机器人操作设计的两热动作编码方案并结合扩散骨干网络提升建模精度,从而显著提高策略成功率。

链接: https://arxiv.org/abs/2509.19080
作者: Zhennan Jiang,Kai Liu,Yuxin Qin,Shuai Tian,Yupeng Zheng,Mingcai Zhou,Chao Yu,Haoran Li,Dongbin Zhao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Robotic manipulation policies are commonly initialized through imitation learning, but their performance is limited by the scarcity and narrow coverage of expert data. Reinforcement learning can refine polices to alleviate this limitation, yet real-robot training is costly and unsafe, while training in simulators suffers from the sim-to-real gap. Recent advances in generative models have demonstrated remarkable capabilities in real-world simulation, with diffusion models in particular excelling at generation. This raises the question of how diffusion model-based world models can be combined to enhance pre-trained policies in robotic manipulation. In this work, we propose World4RL, a framework that employs diffusion-based world models as high-fidelity simulators to refine pre-trained policies entirely in imagined environments for robotic manipulation. Unlike prior works that primarily employ world models for planning, our framework enables direct end-to-end policy optimization. World4RL is designed around two principles: pre-training a diffusion world model that captures diverse dynamics on multi-task datasets and refining policies entirely within a frozen world model to avoid online real-world interactions. We further design a two-hot action encoding scheme tailored for robotic manipulation and adopt diffusion backbones to improve modeling fidelity. Extensive simulation and real-world experiments demonstrate that World4RL provides high-fidelity environment modeling and enables consistent policy refinement, yielding significantly higher success rates compared to imitation learning and other baselines. More visualization results are available at this https URL.
zh

[AI-14] Code Driven Planning with Domain-Adaptive Critic

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在AI代理的序列决策任务中因通用知识与环境特定需求之间存在差距而导致计划不准确的问题,同时应对现有方法依赖频繁LLM查询以基于短期环境反馈迭代优化计划所带来的高昂查询成本。解决方案的关键在于提出Code Driven Planning with Domain-Adaptive Critic (CoPiC),其核心机制是:利用LLM生成多样化的高层规划程序(high-level planning programs)来迭代产生和改进候选计划,并引入一个经过训练的领域自适应评判器(domain-adaptive critic)对候选计划进行评估,从而选择最符合长期奖励的计划执行,显著减少对LLM的调用次数,同时提升计划质量。

链接: https://arxiv.org/abs/2509.19077
作者: Zikang Tian,Shaohui Peng,Du Huang,Jiaming Guo,Ruizhi Chen,Rui Zhang,Xishan Zhang,Yuxuan Guo,Zidong Du,Qi Guo,Ling Li,Yewen Pu,Xing Hu,Yunji Chen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have been widely adopted as task planners for AI agents in sequential decision-making problems, leveraging their extensive world knowledge. However, the gap between their general knowledge and environment-specific requirements often leads to inaccurate plans. To address this, existing approaches rely on frequent LLM queries to iteratively refine plans based on immediate environmental feedback, which incurs substantial query costs. However, this refinement is typically guided by short-term environmental feedback, limiting LLMs from developing plans aligned with long-term rewards. We propose Code Driven Planning with Domain-Adaptive Critic (CoPiC). Instead of relying on frequent queries, CoPiC employs LLMs to generate a diverse set of high-level planning programs, which iteratively produce and refine candidate plans. A trained domain-adaptive critic then evaluates these candidates and selects the one most aligned with long-term rewards for execution. Using high-level planning programs as planner and domain-adaptive critic as estimator, CoPiC improves planning while significantly reducing query costs. Results in ALFWorld, NetHack, and StarCraft II Unit Building show that CoPiC outperforms advanced LLM-based baselines, AdaPlanner and Reflexion, achieving an average (1) 23.33% improvement in success rate and (2) 91.27% reduction in query costs.
zh

[AI-15] Beyond Backpropagation: Exploring Innovative Algorithms for Energy-Efficient Deep Neural Network Training

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)因反向传播(Backpropagation, BP)算法带来的高计算与能耗问题,从而推动可持续人工智能的发展。其核心解决方案是提出并验证一种无需反向传播的训练方法——单向前向(Mono-Forward, MF)算法,该方法在保持甚至超越BP模型分类准确率的同时,显著降低能量消耗(最高达41%)和训练时间(最高达34%)。MF的关键优势在于其能够收敛至验证损失景观中更优的局部极小值,从而实现更好的泛化性能,并通过硬件级分析揭示了其计算轻量化的本质,挑战了“所有无BP方法均更节能”的既有认知,为未来高效、低碳的深度学习提供了可量化、可复现的技术路径。

链接: https://arxiv.org/abs/2509.19063
作者: Przemysław Spyra
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The rising computational and energy demands of deep neural networks (DNNs), driven largely by backpropagation (BP), challenge sustainable AI development. This paper rigorously investigates three BP-free training methods: the Forward-Forward (FF), Cascaded-Forward (CaFo), and Mono-Forward (MF) algorithms, tracing their progression from foundational concepts to a demonstrably superior solution. A robust comparative framework was established: each algorithm was implemented on its native architecture (MLPs for FF and MF, a CNN for CaFo) and benchmarked against an equivalent BP-trained model. Hyperparameters were optimized with Optuna, and consistent early stopping criteria were applied based on validation performance, ensuring all models were optimally tuned before comparison. Results show that MF not only competes with but consistently surpasses BP in classification accuracy on its native MLPs. Its superior generalization stems from converging to a more favorable minimum in the validation loss landscape, challenging the assumption that global optimization is required for state-of-the-art results. Measured at the hardware level using the NVIDIA Management Library (NVML) API, MF reduces energy consumption by up to 41% and shortens training time by up to 34%, translating to a measurably smaller carbon footprint as estimated by CodeCarbon. Beyond this primary result, we present a hardware-level analysis that explains the efficiency gains: exposing FF’s architectural inefficiencies, validating MF’s computationally lean design, and challenging the assumption that all BP-free methods are inherently more memory-efficient. By documenting the evolution from FF’s conceptual groundwork to MF’s synthesis of accuracy and sustainability, this work offers a clear, data-driven roadmap for future energy-efficient deep learning. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) MSC classes: 68T07 Cite as: arXiv:2509.19063 [cs.LG] (or arXiv:2509.19063v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.19063 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Przemysław Spyra [view email] [v1] Tue, 23 Sep 2025 14:27:44 UTC (1,759 KB)
zh

[AI-16] owards Causal Representation Learning with Observable Sources as Auxiliaries

【速读】:该论文旨在解决因果表示学习(causal representation learning)中潜在变量识别的可辨识性问题,即如何从观测数据中恢复生成这些数据的潜在因子。传统方法通常依赖于已知辅助变量(auxiliary variables)作为条件独立性的前提,但受限于这些辅助变量必须是外部于混合函数(mixing function)的假设,限制了实际应用范围。本文的关键创新在于提出将可观测源(observable sources)作为辅助变量纳入条件框架,从而扩展了辅助变量的适用范围;其核心解决方案是利用保体积编码器(volume-preserving encoders),在已知潜在因果图的情况下,能够将全部潜在变量识别至子空间变换和排列等价类,并进一步设计基于变量选择的策略以最大化潜在因子的可恢复性。

链接: https://arxiv.org/abs/2509.19058
作者: Kwonho Kim,Heejeong Nam,Inwoo Hwang,Sanghack Lee
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Causal representation learning seeks to recover latent factors that generate observational data through a mixing function. Needing assumptions on latent structures or relationships to achieve identifiability in general, prior works often build upon conditional independence given known auxiliary variables. However, prior frameworks limit the scope of auxiliary variables to be external to the mixing function. Yet, in some cases, system-driving latent factors can be easily observed or extracted from data, possibly facilitating identification. In this paper, we introduce a framework of observable sources being auxiliaries, serving as effective conditioning variables. Our main results show that one can identify entire latent variables up to subspace-wise transformations and permutations using volume-preserving encoders. Moreover, when multiple known auxiliary variables are available, we offer a variable-selection scheme to choose those that maximize recoverability of the latent factors given knowledge of the latent causal graph. Finally, we demonstrate the effectiveness of our framework through experiments on synthetic graph and image data, thereby extending the boundaries of current approaches.
zh

[AI-17] Landmarks Monuments and Beacons: Understanding Generative Calls to Action

【速读】:该论文旨在解决程序化生成内容(Procedural Content Generation, PCG)的算法评估难题,特别是针对复合型游戏 artefacts(如关卡、地图或任务结构)缺乏与人类体验对齐的量化指标问题。其解决方案的关键在于提出一套基于玩家视角的通用概念框架,包括“地标(Landmarks)”、“纪念碑(Monuments)”和“路标(Beacons)”,这些概念分别对应于内容的可感知性(perceivability)、唤起性(evocativeness)和行动召唤(Call to Action),并可通过当前研究与工业界已有的技术手段进行识别与评估。该框架为实现PCG内容的全自动分解及关键子组件的计算评估提供了可行路径,并促进人文学科与游戏技术研究之间的跨领域融合。

链接: https://arxiv.org/abs/2509.19030
作者: Victoire Hervé,Henrik Warpefelt,Christoph Salge
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Algorithmic evaluation of procedurally generated content struggles to find metrics that align with human experience, particularly for composite artefacts. Automatic decomposition as a possible solution requires concepts that meet a range of properties. To this end, drawing on Games Studies and Game AI research, we introduce the nested concepts of \textitLandmarks, \textitMonuments, and \textitBeacons. These concepts are based on the artefact’s perceivability, evocativeness, and Call to Action, all from a player-centric perspective. These terms are generic to games and usable across genres. We argue that these entities can be found and evaluated with techniques currently used in both research and industry, opening a path towards a fully automated decomposition of PCG, and evaluation of the salient sub-components. Although the work presented here emphasises mixed-initiative PCG and compositional PCG, we believe it applies beyond those domains. With this approach, we intend to create a connection between humanities and technical game research and allow for better computational PCG evaluation
zh

[AI-18] Reduced-Order Model-Guided Reinforcement Learning for Demonstration-Free Humanoid Locomotion

【速读】:该论文旨在解决人形机器人行走控制中依赖大量运动捕捉数据或复杂奖励函数设计的问题,传统强化学习方法往往难以生成稳定、自然且能量高效的步态。其解决方案的关键在于提出一种两阶段的强化学习框架——Reduced-Order Model-Guided Reinforcement Learning (ROM-GRL):第一阶段通过近端策略优化(Proximal Policy Optimization)训练一个4自由度(4-DOF)简化模型(Reduced-Order Model, ROM),生成能量高效的步态模板;第二阶段利用软演员-评论家(Soft Actor–Critic)算法结合对抗判别器,使全身体型策略的学习轨迹在五维步态特征分布上逼近ROM生成的动态一致轨迹,从而实现无需人类示范即可获得稳定、对称且低跟踪误差的自然行走行为。

链接: https://arxiv.org/abs/2509.19023
作者: Shuai Liu,Meng Cheng Lau
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 11 pages, 5 figures, 1 table, Computational Science Graduate Project

点击查看摘要

Abstract:We introduce Reduced-Order Model-Guided Reinforcement Learning (ROM-GRL), a two-stage reinforcement learning framework for humanoid walking that requires no motion capture data or elaborate reward shaping. In the first stage, a compact 4-DOF (four-degree-of-freedom) reduced-order model (ROM) is trained via Proximal Policy Optimization. This generates energy-efficient gait templates. In the second stage, those dynamically consistent trajectories guide a full-body policy trained with Soft Actor–Critic augmented by an adversarial discriminator, ensuring the student’s five-dimensional gait feature distribution matches the ROM’s demonstrations. Experiments at 1 meter-per-second and 4 meter-per-second show that ROM-GRL produces stable, symmetric gaits with substantially lower tracking error than a pure-reward baseline. By distilling lightweight ROM guidance into high-dimensional policies, ROM-GRL bridges the gap between reward-only and imitation-based locomotion methods, enabling versatile, naturalistic humanoid behaviors without any human demonstrations.
zh

[AI-19] Fully Learnable Neural Reward Machines

【速读】:该论文旨在解决非马尔可夫强化学习(Non-Markovian Reinforcement Learning, RL)任务中代理难以基于完整状态-动作轨迹进行最优决策的问题。传统方法常依赖符号形式化工具(如线性时序逻辑 LTL 或自动机)来表达时序扩展目标,但受限于预定义的符号接地(Symbol Grounding, SG)函数和对任务时序结构的先验知识。其解决方案的关键在于提出一种完全可学习的神经奖励机器(Fully Learnable Neural Reward Machine, FLNRM),能够端到端地同时学习SG函数与自动机结构,从而无需任何先验知识即可建模复杂时序目标。该方法在保持深度强化学习(DRL)易用性的同时,显著提升了模型的可解释性,并在性能上优于基于循环神经网络(RNN)的现有方法。

链接: https://arxiv.org/abs/2509.19017
作者: Hazem Dewidar,Elena Umili
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions – such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, because of the finite and compact nature of automata. Furthermore, we show that by integrating Fully Learnable Reward Machines (FLNRM) with DRL, our method outperforms previous approaches based on Recurrent Neural Networks (RNNs).
zh

[AI-20] Pure Vision Language Action (VLA) Models: A Comprehensive Survey

【速读】:该论文旨在解决当前机器人控制范式从传统基于策略(policy-based)方法向通用机器人(generalized robotics)演进过程中,如何系统性地整合视觉、语言与动作(Vision Language Action, VLA)模型以实现复杂动态环境中的自主决策与操作的问题。其解决方案的关键在于构建一个清晰的VLA方法分类体系,涵盖自回归(autoregression-based)、扩散(diffusion-based)、强化学习(reinforcement-based)、混合(hybrid)及专用(specialized)五大范式,并深入分析各类方法的动机、核心策略与实现机制,同时梳理基础数据集、基准测试和仿真平台,为未来可扩展、通用的VLA模型研究提供方向指引。

链接: https://arxiv.org/abs/2509.19012
作者: Dapeng Zhang,Jin Sun,Chenghui Hu,Xiaoyan Wu,Zhenlong Yuan,Rui Zhou,Fei Shen,Qingguo Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The emergence of Vision Language Action (VLA) models marks a paradigm shift from traditional policy-based control to generalized robotics, reframing Vision Language Models (VLMs) from passive sequence generators into active agents for manipulation and decision-making in complex, dynamic environments. This survey delves into advanced VLA methods, aiming to provide a clear taxonomy and a systematic, comprehensive review of existing research. It presents a comprehensive analysis of VLA applications across different scenarios and classifies VLA approaches into several paradigms: autoregression-based, diffusion-based, reinforcement-based, hybrid, and specialized methods; while examining their motivations, core strategies, and implementations in detail. In addition, foundational datasets, benchmarks, and simulation platforms are introduced. Building on the current VLA landscape, the review further proposes perspectives on key challenges and future directions to advance research in VLA models and generalizable robotics. By synthesizing insights from over three hundred recent studies, this survey maps the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose VLA methods.
zh

[AI-21] Remaining Time Prediction in Outbound Warehouse Processes: A Case Study (Short Paper)

【速读】:该论文旨在解决流程挖掘中的预测性过程监控(predictive process monitoring)问题,具体聚焦于对正在进行的流程实例剩余时间(remaining time)的准确预测。其关键解决方案是比较四种不同的剩余时间预测方法在真实航空物流企业的出库流程中的性能表现,发现深度学习模型虽精度最高,但传统浅层学习方法如集成提升技术(conventional boosting techniques)在保持较高预测准确性的同时,显著降低了计算资源消耗,从而为实际应用场景提供了更具性价比的替代方案。

链接: https://arxiv.org/abs/2509.18986
作者: Erik Penther,Michael Grohs,Jana-Rebecca Rehse
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Short paper at the ML4PM Workshop 2025, held in conjunction with the ICPM 2025 in Montevideo, Uruguay

点击查看摘要

Abstract:Predictive process monitoring is a sub-domain of process mining which aims to forecast the future of ongoing process executions. One common prediction target is the remaining time, meaning the time that will elapse until a process execution is completed. In this paper, we compare four different remaining time prediction approaches in a real-life outbound warehouse process of a logistics company in the aviation business. For this process, the company provided us with a novel and original event log with 169,523 traces, which we can make publicly available. Unsurprisingly, we find that deep learning models achieve the highest accuracy, but shallow methods like conventional boosting techniques achieve competitive accuracy and require significantly fewer computational resources.
zh

[AI-22] From latent factors to language: a user study on LLM -generated explanations for an inherently interpretable matrix-based recommender system

【速读】:该论文试图解决的问题是:如何利用大语言模型(Large Language Models, LLMs)从数学可解释的推荐模型中生成对用户有效的自然语言解释。现有可解释人工智能(Explainable AI)研究多依赖自动评估指标,难以准确反映用户的实际需求和感知;而本文采用以用户为中心的方法,通过326名参与者对五维质量指标(透明度、有效性、说服力、信任度与满意度)进行评估,系统比较不同输入信息下LLM生成的解释策略效果。其解决方案的关键在于:基于约束矩阵分解(constrained matrix factorization)构建具有显式用户类型表示和同观测评分量纲一致的预测分数的推荐模型,确保模型内部表示和输出可直接解释,并通过精心设计的LLM提示(prompt)将结构化模型输出转化为自然语言解释,从而实现可解释性与用户感知之间的有效衔接。

链接: https://arxiv.org/abs/2509.18980
作者: Maxime Manderlier,Fabian Lecron,Olivier Vu Thanh,Nicolas Gillis
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
备注:

点击查看摘要

Abstract:We investigate whether large language models (LLMs) can generate effective, user-facing explanations from a mathematically interpretable recommendation model. The model is based on constrained matrix factorization, where user types are explicitly represented and predicted item scores share the same scale as observed ratings, making the model’s internal representations and predicted scores directly interpretable. This structure is translated into natural language explanations using carefully designed LLM prompts. Many works in explainable AI rely on automatic evaluation metrics, which often fail to capture users’ actual needs and perceptions. In contrast, we adopt a user-centered approach: we conduct a study with 326 participants who assessed the quality of the explanations across five key dimensions-transparency, effectiveness, persuasion, trust, and satisfaction-as well as the recommendations this http URL evaluate how different explanation strategies are perceived, we generate multiple explanation types from the same underlying model, varying the input information provided to the LLM. Our analysis reveals that all explanation types are generally well received, with moderate statistical differences between strategies. User comments further underscore how participants react to each type of explanation, offering complementary insights beyond the quantitative results.
zh

[AI-23] LLM -based Agents Suffer from Hallucinations: A Survey of Taxonomy Methods and Directions

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)驱动的智能代理(LLM-based agents)中存在的幻觉(hallucination)问题,该问题会导致任务执行错误并削弱系统可靠性。解决方案的关键在于提出一个全新的分类体系,基于对代理完整工作流程的细致分析,识别不同阶段出现的幻觉类型,并系统梳理了导致幻觉产生的十八种触发因素;同时,论文综述了现有幻觉检测与缓解方法,并指出了未来研究的潜在方向,从而为构建更鲁棒和可靠的代理系统提供理论基础与实践指导。

链接: https://arxiv.org/abs/2509.18970
作者: Xixun Lin,Yucheng Ning,Jingwen Zhang,Yan Dong,Yilong Liu,Yongxuan Wu,Xiaohua Qi,Nan Sun,Yanmin Shang,Pengfei Cao,Lixin Zou,Xu Chen,Chuan Zhou,Jia Wu,Shirui Pan,Bin Wang,Yanan Cao,Kai Chen,Songlin Hu,Li Guo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Driven by the rapid advancements of Large Language Models (LLMs), LLM-based agents have emerged as powerful intelligent systems capable of human-like cognition, reasoning, and interaction. These agents are increasingly being deployed across diverse real-world applications, including student education, scientific research, and financial analysis. However, despite their remarkable potential, LLM-based agents remain vulnerable to hallucination issues, which can result in erroneous task execution and undermine the reliability of the overall system design. Addressing this critical challenge requires a deep understanding and a systematic consolidation of recent advances on LLM-based agents. To this end, we present the first comprehensive survey of hallucinations in LLM-based agents. By carefully analyzing the complete workflow of agents, we propose a new taxonomy that identifies different types of agent hallucinations occurring at different stages. Furthermore, we conduct an in-depth examination of eighteen triggering causes underlying the emergence of agent hallucinations. Through a detailed review of a large number of existing studies, we summarize approaches for hallucination mitigation and detection, and highlight promising directions for future research. We hope this survey will inspire further efforts toward addressing hallucinations in LLM-based agents, ultimately contributing to the development of more robust and reliable agent systems.
zh

[AI-24] Eva-VLA: Evaluating Vision-Language-Action Models Robustness Under Real-World Physical Variations

【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在机器人操作任务中对真实世界物理变化的鲁棒性问题,即当前VLA模型虽在实验室环境中表现优异,但在面对实际部署时的复杂物理扰动时性能显著下降。解决方案的关键在于提出Eva-VLA框架,其核心创新是将离散的物理变化转化为连续优化问题:首先,系统性地将现实世界的物理变化分解为三个关键维度——物体3D变换(影响空间推理)、光照变化(挑战视觉感知)和对抗补丁(干扰场景理解);其次,引入一种连续黑盒优化方法,通过参数化建模实现对最坏情况场景的高效探索,从而无需大量真实数据即可发现模型脆弱点。该框架揭示了现有VLA模型在多种扰动下失败率超过60%,最高达97.8%,为提升VLA模型在真实环境中的部署可靠性提供了可量化的评估与改进路径。

链接: https://arxiv.org/abs/2509.18953
作者: Hanqing Liu,Jiahuan Long,Junqi Wu,Jiacheng Hou,Huili Tang,Tingsong Jiang,Weien Zhou,Wen Yao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Vision-Language-Action (VLA) models have emerged as promising solutions for robotic manipulation, yet their robustness to real-world physical variations remains critically underexplored. To bridge this gap, we propose Eva-VLA, the first unified framework that systematically evaluates the robustness of VLA models by transforming discrete physical variations into continuous optimization problems. However, comprehensively assessing VLA robustness presents two key challenges: (1) how to systematically characterize diverse physical variations encountered in real-world deployments while maintaining evaluation reproducibility, and (2) how to discover worst-case scenarios without prohibitive real-world data collection costs efficiently. To address the first challenge, we decompose real-world variations into three critical domains: object 3D transformations that affect spatial reasoning, illumination variations that challenge visual perception, and adversarial patches that disrupt scene understanding. For the second challenge, we introduce a continuous black-box optimization framework that transforms discrete physical variations into parameter optimization, enabling systematic exploration of worst-case scenarios. Extensive experiments on state-of-the-art OpenVLA models across multiple benchmarks reveal alarming vulnerabilities: all variation types trigger failure rates exceeding 60%, with object transformations causing up to 97.8% failure in long-horizon tasks. Our findings expose critical gaps between controlled laboratory success and unpredictable deployment readiness, while the Eva-VLA framework provides a practical pathway for hardening VLA-based robotic manipulation models against real-world deployment challenges.
zh

[AI-25] owards Privacy-Aware Bayesian Networks: A Credal Approach ECAI2025

【速读】:该论文旨在解决在公开发布贝叶斯网络(Bayesian network, BN)模型时,如何在保障训练数据隐私的同时维持模型推理效用的问题。当前主流的隐私保护方法通过向学习到的参数中引入噪声来抵御追踪攻击(tracing attacks),但此类方法会显著损害模型的准确性与实用性。论文提出以可信网络(credal network, CN)作为解决方案,其关键在于利用CN对BN进行模糊化(obfuscation)而非加噪处理,从而在不破坏模型基本结构的前提下掩蔽原始BN信息,降低攻击者成功恢复个体数据的可能性;同时,CN仍能支持有意义的推理任务,实现隐私与效用之间的平衡。此外,研究还识别出需隐藏的关键学习信息,并通过调整CN超参数控制隐私强度,实验证明该方法具备良好的可调性和有效性。

链接: https://arxiv.org/abs/2509.18949
作者: Niccolò Rocchi,Fabio Stella,Cassio de Campos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted at ECAI2025 conference, 20 pages, 1 figure

点击查看摘要

Abstract:Bayesian networks (BN) are probabilistic graphical models that enable efficient knowledge representation and inference. These have proven effective across diverse domains, including healthcare, bioinformatics and economics. The structure and parameters of a BN can be obtained by domain experts or directly learned from available data. However, as privacy concerns escalate, it becomes increasingly critical for publicly released models to safeguard sensitive information in training data. Typically, released models do not prioritize privacy by design. In particular, tracing attacks from adversaries can combine the released BN with auxiliary data to determine whether specific individuals belong to the data from which the BN was learned. State-of-the-art protection tecniques involve introducing noise into the learned parameters. While this offers robust protection against tracing attacks, it significantly impacts the model’s utility, in terms of both the significance and accuracy of the resulting inferences. Hence, high privacy may be attained at the cost of releasing a possibly ineffective model. This paper introduces credal networks (CN) as a novel solution for balancing the model’s privacy and utility. After adapting the notion of tracing attacks, we demonstrate that a CN enables the masking of the learned BN, thereby reducing the probability of successful attacks. As CNs are obfuscated but not noisy versions of BNs, they can achieve meaningful inferences while safeguarding privacy. Moreover, we identify key learning information that must be concealed to prevent attackers from recovering the underlying BN. Finally, we conduct a set of numerical experiments to analyze how privacy gains can be modulated by tuning the CN hyperparameters. Our results confirm that CNs provide a principled, practical, and effective approach towards the development of privacy-aware probabilistic graphical models.
zh

[AI-26] Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在微调(Fine-tuning, FT)过程中面临的灾难性遗忘(catastrophic forgetting)和数据效率低下问题,这些问题限制了LLMs在实际应用中的持续适应能力。其解决方案的关键在于提出DEAL框架,该框架结合低秩适配(Low-Rank Adaptation, LoRA)与连续微调策略,并引入知识保留模块和自适应参数更新机制,从而在保持隐私保护场景下高效地提升任务性能与资源利用效率。

链接: https://arxiv.org/abs/2509.18942
作者: Xiao Han,Zimo Zhao,Wanyu Wang,Maolin Wang,Zitao Liu,Yi Chang,Xiangyu Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) have emphasized the critical role of fine-tuning (FT) techniques in adapting LLMs to specific tasks, especially when retraining from scratch is computationally infeasible. Fine-tuning enables LLMs to leverage task- or domain-specific data, producing models that more effectively meet the requirements of targeted applications. However, con- ventional FT approaches often suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability. To address these challenges, this paper proposes DEAL, a novel framework that integrates Low-Rank Adapta- tion (LoRA) with a continuous fine-tuning strategy. By incorporating knowledge retention and adaptive parameter update modules, the framework mitigates the lim- itations of existing FT methods while maintaining efficiency in privacy-preserving settings. Experiments on 15 diverse datasets show that DEAL consistently outper- forms baseline methods, yielding substantial gains in task accuracy and resource efficiency. These findings demonstrate the potential of our approach to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency.
zh

[AI-27] Accurate and Efficient Prediction of Wi-Fi Link Quality Based on Machine Learning

【速读】:该论文旨在解决无线通信中因环境不确定性导致的Wi-Fi链路质量难以维持稳定的问题,以提升工业环境中Wi-Fi系统的可靠性。其解决方案的关键在于提出一种基于指数移动平均线性组合的数据驱动预测模型,该模型设计用于低复杂度实现,适用于处理能力受限的硬件平台;同时,实验表明,无需依赖特定信道特征的通道无关模型在实际部署中展现出与通道相关模型相当的预测精度,从而支持设备制造商进行通用化训练和规模化应用。

链接: https://arxiv.org/abs/2509.18933
作者: Gabriele Formis,Gianluca Cena,Lukasz Wisniewski,Stefano Scanzio
机构: 未知
类目: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: accepted version in IEEE Transactions on Industrial Informatics, 12 pages, 2025

点击查看摘要

Abstract:Wireless communications are characterized by their unpredictability, posing challenges for maintaining consistent communication quality. This paper presents a comprehensive analysis of various prediction models, with a focus on achieving accurate and efficient Wi-Fi link quality forecasts using machine learning techniques. Specifically, the paper evaluates the performance of data-driven models based on the linear combination of exponential moving averages, which are designed for low-complexity implementations and are then suitable for hardware platforms with limited processing resources. Accuracy of the proposed approaches was assessed using experimental data from a real-world Wi-Fi testbed, considering both channel-dependent and channel-independent training data. Remarkably, channel-independent models, which allow for generalized training by equipment manufacturers, demonstrated competitive performance. Overall, this study provides insights into the practical deployment of machine learning-based prediction models for enhancing Wi-Fi dependability in industrial environments.
zh

[AI-28] ackling GNARLy Problems: Graph Neural Algorithmic Reasoning Reimagined through Reinforcement Learning

【速读】:该论文旨在解决神经算法推理(Neural Algorithmic Reasoning, NAR)中存在的若干关键局限性,包括无法在不依赖后处理的情况下构造有效解、难以对多个正确解进行推理、在组合优化的NP-hard问题上表现不佳,以及在缺乏已知高效算法的问题上不可适用。解决方案的关键在于将学习算法轨迹的问题重新建模为马尔可夫决策过程(Markov Decision Process, MDP),从而引入结构化的解构建机制,并利用模仿学习(imitation learning)和强化学习(reinforcement learning, RL)的强大能力。作者提出了GNARL框架,包含从NAR到RL的转化方法及适用于多种图问题的学习架构,在CLRS-30基准上实现了高图准确率,且在NP-hard问题上的性能达到或超越窄域NAR方法,甚至在无专家算法可用时仍具备适用性。

链接: https://arxiv.org/abs/2509.18930
作者: Alex Schutz,Victor-Alexandru Darvariu,Efimia Panagiotaki,Bruno Lacerda,Nick Hawes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Neural Algorithmic Reasoning (NAR) is a paradigm that trains neural networks to execute classic algorithms by supervised learning. Despite its successes, important limitations remain: inability to construct valid solutions without post-processing and to reason about multiple correct ones, poor performance on combinatorial NP-hard problems, and inapplicability to problems for which strong algorithms are not yet known. To address these limitations, we reframe the problem of learning algorithm trajectories as a Markov Decision Process, which imposes structure on the solution construction procedure and unlocks the powerful tools of imitation and reinforcement learning (RL). We propose the GNARL framework, encompassing the methodology to translate problem formulations from NAR to RL and a learning architecture suitable for a wide range of graph-based problems. We achieve very high graph accuracy results on several CLRS-30 problems, performance matching or exceeding much narrower NAR approaches for NP-hard problems and, remarkably, applicability even when lacking an expert algorithm.
zh

[AI-29] How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

【速读】:该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在视觉空间推理(Visual Spatial Reasoning, VSR)能力上的显著不足,尤其是其在三维空间表征与推理方面难以达到人类水平的问题。解决方案的关键在于系统性地梳理现有VLMs在输入模态、模型架构、训练策略和推理机制等方面的进展,并提出一个分层的能力分类体系——将空间智能划分为基础感知(basic perception)、空间理解(spatial understanding)和空间规划(spatial planning)三个层次;同时构建了SIBench基准测试平台,整合近20个开源数据集,覆盖23种任务场景,用于全面评估模型在不同空间认知层级的表现。实验结果揭示了当前主流VLMs在感知层面表现良好,但在理解和规划层面存在明显短板,特别是在数值估算、多视角推理、时序动态建模和空间想象等复杂任务中,从而为未来研究提供了清晰的路径指引和标准化评测工具。

链接: https://arxiv.org/abs/2509.18905
作者: Songsong Yu,Yuxin Chen,Hao Ju,Lianjie Jia,Fuxi Zhang,Shaofei Huang,Yuhan Wu,Rundi Cui,Binghao Ran,Zaibin Zhang,Zhedong Zheng,Zhipeng Zhang,Yifan Wang,Lin Song,Lijun Wang,Yanwei Li,Ying Shan,Huchuan Lu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: a comprehensive visual spatial reasoning evaluation tool, 25 pages, 16 figures

点击查看摘要

Abstract:Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning, as models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at this https URL.
zh

[AI-30] he AI Literacy Heptagon: A Structured Approach to AI Literacy in Higher Education

【速读】:该论文旨在解决当前高等教育(Higher Education, HE)中人工智能素养(AI Literacy, AIL)概念模糊、实践落地困难的问题,尤其关注如何厘清AIL与数据素养(Data Literacy)、媒介素养(Media Literacy)及计算素养(Computational Literacy)等相近概念的边界,并将理论洞见有效转化为教学实践。其解决方案的关键在于通过系统性文献综述识别出AIL的七个核心维度——技术、应用、批判性思维、伦理、社会、整合与法律,并将其整合为“AI素养七边形”(AI Literacy Heptagon)模型,从而为HE机构提供结构化、可操作的AIL发展框架,实现从理论到课程实施的有效衔接。

链接: https://arxiv.org/abs/2509.18900
作者: Veronika Hackl,Alexandra Mueller,Maximilian Sailer
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 4 figures

点击查看摘要

Abstract:The integrative literature review addresses the conceptualization and implementation of AI Literacy (AIL) in Higher Education (HE) by examining recent research literature. Through an analysis of publications (2021-2024), we explore (1) how AIL is defined and conceptualized in current research, particularly in HE, and how it can be delineated from related concepts such as Data Literacy, Media Literacy, and Computational Literacy; (2) how various definitions can be synthesized into a comprehensive working definition, and (3) how scientific insights can be effectively translated into educational practice. Our analysis identifies seven central dimensions of AIL: technical, applicational, critical thinking, ethical, social, integrational, and legal. These are synthesized in the AI Literacy Heptagon, deepening conceptual understanding and supporting the structured development of AIL in HE. The study aims to bridge the gap between theoretical AIL conceptualizations and the practical implementation in academic curricula.
zh

[AI-31] LongCat-Flash-Thinking Technical Report

【速读】:该论文旨在解决大规模生成式 AI(Generative AI)模型在复杂推理任务中效率低、训练成本高以及多领域能力难以协同优化的问题。其核心解决方案在于提出一种名为 LongCat-Flash-Thinking 的高效 5600 亿参数稀疏专家混合模型(Mixture-of-Experts, MoE),通过两阶段训练策略实现性能与效率的双重提升:第一阶段采用精心设计的长链式思维(Chain-of-Thought, CoT)冷启动训练,显著增强模型的推理潜力并赋予其形式化和代理式(agentic)推理能力;第二阶段引入领域并行训练方案(domain-parallel training scheme),解耦不同领域(如 STEM、代码、代理任务)的优化过程,并融合专家模型以获得近乎帕累托最优的整体表现。整个流程由动态异步回放调度系统(Dynamic ORchestration for Asynchronous rollout, DORA)驱动,在数万个加速器上实现了超过三倍于同步方法的训练速度提升,最终在 AIME-25 等基准测试中以 token 消耗降低 64.5% 的代价保持甚至优于现有开源模型的准确性,有效推动了高效推理系统与代理型人工智能的研究进展。

链接: https://arxiv.org/abs/2509.18883
作者: Meituan LongCat Team,Anchun Gui,Bei Li,Bingyang Tao,Bole Zhou,Borun Chen,Chao Zhang,Chao Zhang,Chengcheng Han,Chenhui Yang,Chi Zhang,Chong Peng,Chuyu Zhang,Cong Chen,Fengcun Li,Gang Xu,Guoyuan Lin,Hao Jiang,Hao Liang,Haomin Fu,Haoxiang Ma,Hong Liu,Hongyan Hao,Hongyin Tang,Hongyu Zang,Hongzhi Ni,Hui Su,Jiahao Liu,Jiahuan Li,Jialin Liu,Jianfei Zhang,Jianhao Xu,Jianing Wang,Jiaqi Sun,Jiaqi Zhang,Jiarong Shi,Jiawei Yang,Jingang Wang,Jinrui Ding,Jun Kuang,Jun Xu,Ke He,Kefeng Zhang,Keheng Wang,Keqing He,Li Wei,Liang Shi,Lin Qiu,Lingbin Kong,Lingchuan Liu,Linsen Guo,Longfei An,Mai Xia,Meng Zhou,Mengshen Zhu,Peng Pei,Pengcheng Jia,Qi Gu,Qi Guo,Qiong Huang,Quan Chen,Quanchi Weng,Rongxiang Weng,Ruichen Shao,Rumei Li,Shanglin Lei,Shuai Du,Shuaikang Liu,Shuang Zhou,Shuhao Hu,Siyu Xu,Songshan Gong,Tao Liang,Tianhao Hu,Wei He,Wei Shi,Wei Wang,Wei Wu,Wei Zhuo,Weifeng Tang,Wenjie Shi,Wenlong Zhu,Xi Su,Xiangcheng Liu,Xiangyu Xi,Xiangzhou Huang,Xiao Liu,Xiaochen Jiang,Xiaowei Shi,Xiaowen Shi,Xiaoyu Li,Xin Chen,Xinyue Zhao,Xuan Huang,Xuemiao Zhang,Xuezhi Cao,Xunliang Cai,Yajie Zhang,Yang Chen,Yang Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We present LongCat-Flash-Thinking, an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model. Its advanced capabilities are cultivated through a meticulously crafted training process, beginning with long Chain-of-Thought (CoT) data cold-start and culminating in large-scale Reinforcement Learning (RL). We first employ a well-designed cold-start training strategy, which significantly enhances the reasoning potential and equips the model with specialized skills in both formal and agentic reasoning. Then, a core innovation is our domain-parallel training scheme, which decouples optimization across distinct domains (e.g., STEM, Code, Agentic) and subsequently fuses the resulting expert models into a single, nearly Pareto-optimal model. This entire process is powered by our Dynamic ORchestration for Asynchronous rollout (DORA) system, a large-scale RL framework that delivers a greater than threefold training speedup over synchronous methods on tens of thousands of accelerators. As a result, LongCat-Flash-Thinking achieves state-of-the-art performance among open-source models on a suite of complex reasoning tasks. The model exhibits exceptional efficiency in agentic reasoning, reducing average token consumption by 64.5% (from 19, 653 to 6, 965) on AIME-25, without degrading task accuracy. We release LongCat-Flash-Thinking to promote further advances in reasoning systems and agentic AI research.
zh

[AI-32] When Ads Become Profiles: Large-Scale Audit of Algorithmic Biases and LLM Profiling Risks

【速读】:该论文旨在解决社交平台上广告定向算法的不透明性问题,特别是其可能引发的用户剥削与外部监管缺失风险,以及生成式 AI(Generative AI)从广告曝光流中逆向推断敏感用户属性所带来的隐私威胁。其解决方案的关键在于提出并实施一个多阶段审计框架:首先通过大规模实证分析(覆盖435,000次广告展示和891名澳大利亚Facebook用户)识别出算法偏见,如对社会经济弱势群体和政治倾向一致群体过度投放赌博与政治类广告;其次利用多模态大语言模型(Multimodal LLM)从广告流中重建用户人口统计特征,结果优于基于人口普查数据的基线模型,并达到或超过人类判断水平,首次实证证明广告流可作为公共AI推理的丰富数字足迹,凸显亟需内容层级的审计机制与治理措施。

链接: https://arxiv.org/abs/2509.18874
作者: Baiyu Chen,Benjamin Tag,Hao Xue,Daniel Angus,Flora Salim
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Automated ad targeting on social media is opaque, creating risks of exploitation and invisibility to external scrutiny. Users may be steered toward harmful content while independent auditing of these processes remains blocked. Large Language Models (LLMs) raise a new concern: the potential to reverse-engineer sensitive user attributes from exposure alone. We introduce a multi-stage auditing framework to investigate these risks. First, a large-scale audit of over 435,000 ad impressions delivered to 891 Australian Facebook users reveals algorithmic biases, including disproportionate Gambling and Politics ads shown to socioeconomically vulnerable and politically aligned groups. Second, a multimodal LLM can reconstruct users’ demographic profiles from ad streams, outperforming census-based baselines and matching or exceeding human performance. Our results provide the first empirical evidence that ad streams constitute rich digital footprints for public AI inference, highlighting urgent privacy risks and the need for content-level auditing and governance.
zh

[AI-33] Memory in Large Language Models : Mechanisms Evaluation and Evolution

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)记忆机制的定义不清、评估方法不统一以及治理框架缺失的问题,尤其在不同训练和推理设置下难以进行可比性比较。其核心解决方案是构建一个基于“写-读-抑制/更新”链路的系统性框架,通过四类记忆分类(参数化、上下文、外部、过程/情景记忆)与记忆四元组(位置、持久性、写入/访问路径、可控性)实现结构化描述;并提出三阶段实验协议(仅参数化、离线检索、在线检索)以解耦能力与信息可用性,从而建立分层评估体系(包括闭卷回忆、位置曲线、片段归属准确性、跨会话一致性等指标),同时引入可审计的更新与遗忘机制(DMM Gov),整合微调、模型编辑(如ROME、MEND)、检索增强生成(RAG)等技术形成闭环治理流程,最终提供可验证、可重复且具备时序治理能力的研究与部署坐标系。

链接: https://arxiv.org/abs/2509.18868
作者: Dianxing Zhang,Wendong Li,Kani Song,Jiaye Lu,Gang Li,Liuchun Yang,Sheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 50 pages, 1 figure, 8 tables This is a survey/framework paper on LLM memory mechanisms and evaluation

点击查看摘要

Abstract:Under a unified operational definition, we define LLM memory as a persistent state written during pretraining, finetuning, or inference that can later be addressed and that stably influences outputs. We propose a four-part taxonomy (parametric, contextual, external, procedural/episodic) and a memory quadruple (location, persistence, write/access path, controllability). We link mechanism, evaluation, and governance via the chain write - read - inhibit/update. To avoid distorted comparisons across heterogeneous setups, we adopt a three-setting protocol (parametric only, offline retrieval, online retrieval) that decouples capability from information availability on the same data and timeline. On this basis we build a layered evaluation: parametric (closed-book recall, edit differential, memorization/privacy), contextual (position curves and the mid-sequence drop), external (answer correctness vs snippet attribution/faithfulness), and procedural/episodic (cross-session consistency and timeline replay, E MARS+). The framework integrates temporal governance and leakage auditing (freshness hits, outdated answers, refusal slices) and uncertainty reporting via inter-rater agreement plus paired tests with multiple-comparison correction. For updating and forgetting, we present DMM Gov: coordinating DAPT/TAPT, PEFT, model editing (ROME, MEND, MEMIT, SERAC), and RAG to form an auditable loop covering admission thresholds, rollout, monitoring, rollback, and change audits, with specs for timeliness, conflict handling, and long-horizon consistency. Finally, we give four testable propositions: minimum identifiability; a minimal evaluation card; causally constrained editing with verifiable forgetting; and when retrieval with small-window replay outperforms ultra-long-context reading. This yields a reproducible, comparable, and governable coordinate system for research and deployment.
zh

[AI-34] Conf-Profile: A Confidence-Driven Reasoning Paradigm for Label-Free User Profiling

【速读】:该论文旨在解决用户画像(User Profiling)任务中因缺乏全面基准测试和高质量标注数据而导致的性能瓶颈问题,尤其在面对异构且噪声较大的真实用户数据时,大型语言模型(Large Language Models, LLMs)的可靠性难以保障。其核心解决方案是提出一种基于置信度驱动的推理框架 Conf-Profile,关键在于采用两阶段范式:第一阶段利用具备置信度提示的先进LLMs合成高质量标签;第二阶段通过置信度加权投票实现精度提升与置信度校准,并进一步引入置信度引导的无监督强化学习机制,以难度过滤、类真值投票和奖励加权优化推理能力,最终显著提升了模型在工业级用户画像任务上的表现(F1提升13.97)。

链接: https://arxiv.org/abs/2509.18864
作者: Yingxin Li,Jianbo Zhao,Xueyu Ren,Jie Tang,Wangjie You,Xu Chen,Kan Zhou,Chao Feng,Jiao Ran,Yuan Meng,Zhi Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:User profiling, as a core technique for user understanding, aims to infer structural attributes from user information. Large Language Models (LLMs) provide a promising avenue for user profiling, yet the progress is hindered by the lack of comprehensive benchmarks. To bridge this gap, we propose ProfileBench, an industrial benchmark derived from a real-world video platform, encompassing heterogeneous user data and a well-structured profiling taxonomy. However, the profiling task remains challenging due to the difficulty of collecting large-scale ground-truth labels, and the heterogeneous and noisy user information can compromise the reliability of LLMs. To approach label-free and reliable user profiling, we propose a Confidence-driven Profile reasoning framework Conf-Profile, featuring a two-stage paradigm. We first synthesize high-quality labels by leveraging advanced LLMs with confidence hints, followed by confidence-weighted voting for accuracy improvement and confidence calibration for a balanced distribution. The multiple profile results, rationales, and confidence scores are aggregated and distilled into a lightweight LLM. We further enhance the reasoning ability via confidence-guided unsupervised reinforcement learning, which exploits confidence for difficulty filtering, quasi-ground truth voting, and reward weighting. Experimental results demonstrate that Conf-Profile delivers substantial performance through the two-stage training, improving F1 by 13.97 on Qwen3-8B.
zh

[AI-35] NGRPO: Negative-enhanced Group Relative Policy Optimization

【速读】:该论文旨在解决Group Relative Policy Optimization (GRPO) 在处理同质化错误响应时的局限性问题,即当一组样本全部正确或全部错误时,GRPO 的优势函数会退化为零,导致梯度消失,从而丢失有价值的训练信号。解决方案的关键在于提出 Negative-enhanced Group Relative Policy Optimization (NGRPO),其核心机制包括:1)优势校准(Advantage Calibration),通过假设存在一个虚拟的最大奖励样本,调整组内奖励的均值和方差,使同质错误样本的优势不再为零;2)非对称裁剪(Asymmetric Clipping),对正样本放宽更新幅度、对负样本施加更严格的约束,以稳定由优势校准引入的探索压力。实验表明,NGRPO 在数学推理任务上显著优于 PPO、GRPO、DAPO 和 PSR-NSR 等基线方法,验证了其从同质错误中提取有效学习信号的能力。

链接: https://arxiv.org/abs/2509.18851
作者: Gongrui Nan,Siye Chen,Jing Huang,Mengyu Lu,Dexun Wang,Chunmei Xie,Weiqi Xiong,Xianzhou Zeng,Qixuan Zhou,Yadong Li,Xingzhong Xu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:RLVR has enhanced the reasoning capabilities of Large Language Models (LLMs) across various tasks. However, GRPO, a representative RLVR algorithm, suffers from a critical limitation: when all responses within a group are either entirely correct or entirely incorrect, the model fails to learn from these homogeneous responses. This is particularly problematic for homogeneously incorrect groups, where GRPO’s advantage function yields a value of zero, leading to null gradients and the loss of valuable learning signals. To overcome this issue, we propose NGRPO (Negative-enhanced Group Relative Policy Optimization), an algorithm designed to convert homogeneous errors into robust learning signals. First, NGRPO introduces Advantage Calibration. This mechanism hypothesizes the existence of a virtual maximum-reward sample during advantage calculation, thereby altering the mean and variance of rewards within a group and ensuring that the advantages for homogeneously incorrect samples are no longer zero. Second, NGRPO employs Asymmetric Clipping, which relaxes the update magnitude for positive samples while imposing stricter constraints on that of negative samples. This serves to stabilize the exploration pressure introduced by the advantage calibration. Our experiments on Qwen2.5-Math-7B demonstrate that NGRPO significantly outperforms baselines such as PPO, GRPO, DAPO, and PSR-NSR on mathematical benchmarks including MATH500, AMC23, and AIME2025. These results validate NGRPO’s ability to learn from homogeneous errors, leading to stable and substantial improvements in mathematical reasoning. Our code is available at this https URL.
zh

[AI-36] MAPO: Mixed Advantage Policy Optimization

【速读】:该论文旨在解决基于强化学习的基座模型(foundation models)在推理任务中因优势函数(advantage function)分配不合理而导致的性能瓶颈问题,特别是现有方法中存在的优势反转(advantage reversion)和优势镜像(advantage mirror)现象,这些现象会导致不同查询样本间的优势分配失衡。解决方案的关键在于提出一种简单但有效的策略——混合优势策略优化(Mixed Advantage Policy Optimization, MAPO),其核心创新是识别出轨迹具有不同确定性(certainty),并针对高确定性轨迹样本引入优势百分比偏差(advantage percent deviation);同时动态重加权优势函数,以自适应地匹配样本特异性特征,从而实现更合理的样本级优势分配。

链接: https://arxiv.org/abs/2509.18849
作者: Wenke Huang,Quan Zhang,Yiyang Fang,Jian Liang,Xuankun Rong,Huanjin Yao,Guancheng Wan,Ke Liang,Wenwen He,Mingjun Li,Leszek Rutkowski,Mang Ye,Bo Du,Dacheng Tao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in reinforcement learning for foundation models, such as Group Relative Policy Optimization (GRPO), have significantly improved the performance of foundation models on reasoning tasks. Notably, the advantage function serves as a central mechanism in GRPO for ranking the trajectory importance. However, existing explorations encounter both advantage reversion and advantage mirror problems, which hinder the reasonable advantage allocation across different query samples. In this work, we propose an easy but effective GRPO strategy, Mixed Advantage Policy Optimization (MAPO). We reveal that the trajectory appears with different certainty and propose the advantage percent deviation for samples with high-certainty trajectories. Furthermore, we dynamically reweight the advantage function for samples with varying trajectory certainty, thereby adaptively configuring the advantage function to account for sample-specific characteristics. Comparison with related state-of-the-art methods, along with ablation studies on different advantage variants, validates the effectiveness of our approach.
zh

[AI-37] Model selection meets clinical semantics: Optimizing ICD-10-CM prediction via LLM -as-Judge evaluation redundancy-aware sampling and section-aware fine-tuning

【速读】:该论文旨在解决国际疾病分类第十版临床修改版(ICD-10-CM)编码在临床文档记录、医疗账单和健康数据分析中因人工操作导致的劳动密集与错误频发问题。现有大语言模型(LLMs)虽具自动化潜力,但受限于基础模型选择不当、输入上下文构建不足及训练数据冗余等问题,难以实现高效准确的编码预测。其解决方案的关键在于提出一个模块化框架:通过“LLM-as-judge”评估协议结合Plackett-Luce聚合方法进行系统性模型遴选,以识别对ICD-10-CM定义理解最优的基础模型;采用基于嵌入的相似度度量与冗余感知采样策略剔除语义重复的出院小结,提升训练数据质量;并利用结构化病历文本设计上下文提示机制,在通用与分段建模范式下验证不同临床章节内容对预测性能的影响。实验证明,经微调后的优选模型在两个机构数据集上均优于基线模型,且增加临床信息模块显著提升编码准确性,从而为自动化医疗编码系统提供可扩展、适配医疗机构部署的实用方案。

链接: https://arxiv.org/abs/2509.18846
作者: Hong-Jie Dai,Zheng-Hao Li,An-Tai Lu,Bo-Tsz Shain,Ming-Ta Li,Tatheer Hussain Mir,Kuang-Te Wang,Min-I Su,Pei-Kang Liu,Ming-Ju Tsai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 28 Pages, 4 Figures, 2 Tables

点击查看摘要

Abstract:Accurate International Classification of Diseases (ICD) coding is critical for clinical documentation, billing, and healthcare analytics, yet it remains a labour-intensive and error-prone task. Although large language models (LLMs) show promise in automating ICD coding, their challenges in base model selection, input contextualization, and training data redundancy limit their effectiveness. We propose a modular framework for ICD-10 Clinical Modification (ICD-10-CM) code prediction that addresses these challenges through principled model selection, redundancy-aware data sampling, and structured input design. The framework integrates an LLM-as-judge evaluation protocol with Plackett-Luce aggregation to assess and rank open-source LLMs based on their intrinsic comprehension of ICD-10-CM code definitions. We introduced embedding-based similarity measures, a redundancy-aware sampling strategy to remove semantically duplicated discharge summaries. We leverage structured discharge summaries from Taiwanese hospitals to evaluate contextual effects and examine section-wise content inclusion under universal and section-specific modelling paradigms. Experiments across two institutional datasets demonstrate that the selected base model after fine-tuning consistently outperforms baseline LLMs in internal and external evaluations. Incorporating more clinical sections consistently improves prediction performance. This study uses open-source LLMs to establish a practical and principled approach to ICD-10-CM code prediction. The proposed framework provides a scalable, institution-ready solution for real-world deployment of automated medical coding systems by combining informed model selection, efficient data refinement, and context-aware prompting.
zh

[AI-38] Bounded PCTL Model Checking of Large Language Model Outputs ICTAI2025

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)文本生成过程中概率性行为的可验证性问题,特别是如何形式化地验证其生成路径满足特定的概率性质。传统方法难以捕捉LLM在每一步选择token时的随机性和局部最优策略,导致无法保证生成结果的一致性和可靠性。解决方案的关键在于提出一种基于模型检测(model checking)的框架LLMCHECKER,其核心创新是引入α-k-bounded文本生成机制:在每一步生成中仅考虑累积概率高于阈值α的top-k个候选token,从而显著缩小搜索空间并聚焦于高置信度路径。在此基础上,LLMCHECKER能够对LLM文本生成过程的形式化属性进行概率计算树逻辑(Probabilistic Computation Tree Logic, PCTL)验证,首次实现了对LLM生成一致性与可控性的形式化保障。

链接: https://arxiv.org/abs/2509.18836
作者: Dennis Gross,Helge Spieker,Arnaud Gotlieb
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: ICTAI 2025

点击查看摘要

Abstract:In this paper, we introduce LLMCHECKER, a model-checking-based verification method to verify the probabilistic computation tree logic (PCTL) properties of an LLM text generation process. We empirically show that only a limited number of tokens are typically chosen during text generation, which are not always the same. This insight drives the creation of \alpha - k -bounded text generation, narrowing the focus to the \alpha maximal cumulative probability on the top- k tokens at every step of the text generation process. Our verification method considers an initial string and the subsequent top- k tokens while accommodating diverse text quantification methods, such as evaluating text quality and biases. The threshold \alpha further reduces the selected tokens, only choosing those that exceed or meet it in cumulative probability. LLMCHECKER then allows us to formally verify the PCTL properties of \alpha - k -bounded LLMs. We demonstrate the applicability of our method in several LLMs, including Llama, Gemma, Mistral, Genstruct, and BERT. To our knowledge, this is the first time PCTL-based model checking has been used to check the consistency of the LLM text generation process.
zh

[AI-39] Detection of security smells in IaC scripts through semantics-aware code and language processing

【速读】:该论文旨在解决基础设施即代码(Infrastructure as Code, IaC)脚本中普遍存在的安全配置错误检测难题。现有方法主要依赖静态分析技术,通过统计代码特征或机器学习(Machine Learning, ML)分类器识别不安全配置,但存在语义理解不足的问题。其解决方案的关键在于引入语义增强的静态分析框架,通过联合利用自然语言与代码表示来提升检测精度:具体采用CodeBERT模型捕捉代码与文本间的语义关联,并结合LongFormer模型处理长IaC脚本以保留上下文信息,从而显著改善对Ansible和Puppet等工具中典型安全误配置的识别能力,实验表明该方法在精确率(Precision)和召回率(Recall)上均实现大幅提升。

链接: https://arxiv.org/abs/2509.18790
作者: Aicha War,Adnan A. Rawass,Abdoul K. Kabore,Jordan Samhi,Jacques Klein,Tegawende F. Bissyande
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code. In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and LongFormer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural language input and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection, raising precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from 0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively. Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE) Cite as: arXiv:2509.18790 [cs.CR] (or arXiv:2509.18790v1 [cs.CR] for this version) https://doi.org/10.48550/arXiv.2509.18790 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-40] he AGNTCY Agent Directory Service: Architecture and Implementation

【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MAS)中智能体能力、元数据及溯源信息的高效、可验证且多维发现问题。当前异构MAS环境下缺乏统一、可信的目录服务机制,导致智能体间协作与互操作性受限。解决方案的关键在于提出Agent Directory Service (ADS),其核心创新是基于Open Agentic Schema Framework (OASF)构建两级映射机制,通过Kademlia-based分布式哈希表(Distributed Hash Table, DHT)实现能力索引与内容位置的解耦,并结合内容寻址存储、分层分类体系与密码学签名技术,保障发现过程的效率、可验证性和扩展性。同时,ADS复用成熟的OCI/ORAS基础设施进行资源分发,集成Sigstore实现溯源可信,支持基于schema的新兴代理模态(如LLM提示代理、MCP服务器、A2A-enabled组件)的扩展,从而为下一代智能体注册与互操作提供标准化基础架构。

链接: https://arxiv.org/abs/2509.18787
作者: Luca Muscariello,Vijoy Pandey,Ramiz Polic
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The Agent Directory Service (ADS) is a distributed directory for the discovery of AI agent capabilities, metadata, and provenance. It leverages content-addressed storage, hierarchical taxonomies, and cryptographic signing to enable efficient, verifiable, and multi-dimensional discovery across heterogeneous Multi-Agent Systems (MAS). Built on the Open Agentic Schema Framework (OASF), ADS decouples capability indexing from content location through a two-level mapping realized over a Kademlia-based Distributed Hash Table (DHT). It reuses mature OCI / ORAS infrastructure for artifact distribution, integrates Sigstore for provenance, and supports schema-driven extensibility for emerging agent modalities (LLM prompt agents, MCP servers, A2A-enabled components). This paper formalizes the architectural model, describes storage and discovery layers, explains security and performance properties, and positions ADS within the broader landscape of emerging agent registry and interoperability initiatives.
zh

[AI-41] VGGT-DP: Generalizable Robot Control via Vision Foundation Models AAAI2026

【速读】:该论文旨在解决视觉模仿学习(Visual Imitation Learning)中因忽视视觉编码器结构与容量而导致的空间理解能力弱、泛化性能差的问题。现有方法多聚焦于策略设计,而未充分挖掘视觉编码器在空间感知中的潜力。解决方案的关键在于提出VGGT-DP框架,其核心创新包括:1)采用预训练3D感知模型提供的几何先验构建视觉编码器——视觉几何接地Transformer(VGGT),增强视觉表征的空间语义;2)引入 proprioception-guided visual learning strategy(本体感觉引导的视觉学习策略),将外部视觉输入与机器人内部状态对齐,提升闭环控制的鲁棒性;3)设计帧级token复用机制和随机token剪枝策略,在保证效率的同时减少过拟合,从而在MetaWorld复杂任务中显著优于DP和DP3等基线方法,尤其在高精度和长时程场景下表现突出。

链接: https://arxiv.org/abs/2509.18778
作者: Shijia Ge,Yinxin Zhang,Shuzhao Xie,Weixiang Zhang,Mingcai Zhou,Zhi Wang
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: submitted to AAAI 2026

点击查看摘要

Abstract:Visual imitation learning frameworks allow robots to learn manipulation skills from expert demonstrations. While existing approaches mainly focus on policy design, they often neglect the structure and capacity of visual encoders, limiting spatial understanding and generalization. Inspired by biological vision systems, which rely on both visual and proprioceptive cues for robust control, we propose VGGT-DP, a visuomotor policy framework that integrates geometric priors from a pretrained 3D perception model with proprioceptive feedback. We adopt the Visual Geometry Grounded Transformer (VGGT) as the visual encoder and introduce a proprioception-guided visual learning strategy to align perception with internal robot states, improving spatial grounding and closed-loop control. To reduce inference latency, we design a frame-wise token reuse mechanism that compacts multi-view tokens into an efficient spatial representation. We further apply random token pruning to enhance policy robustness and reduce overfitting. Experiments on challenging MetaWorld tasks show that VGGT-DP significantly outperforms strong baselines such as DP and DP3, particularly in precision-critical and long-horizon scenarios.
zh

[AI-42] Experience Scaling: Post-Deployment Evolution For Large Language Models

【速读】:该论文试图解决当前大语言模型(Large Language Models, LLMs)发展面临的瓶颈问题,即随着模型规模、训练数据和计算资源的持续扩大,其性能提升逐渐趋于饱和,主要受限于人类生成文本的有限性和知识更新滞后性。解决方案的关键在于提出“经验扩展”(experience scaling)框架,通过模型在部署后自主与环境交互并协作共享积累的经验,将原始交互数据提炼为紧凑且可复用的知识,并周期性地优化存储内容以保持其相关性和效率,从而实现模型能力的持续进化,突破静态人类标注数据的限制。

链接: https://arxiv.org/abs/2509.18771
作者: Xingkun Yin,Kaibin Huang,Dong In Kim,Hongyang Du
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Scaling model size, training data, and compute power have driven advances in large language models (LLMs), but these approaches are reaching saturation as human-generated text is exhausted and further gains diminish. We propose experience scaling, a framework for continuous post-deployment evolution for LLMs through autonomous interaction with the environment and collaborative sharing of accumulated experience. The framework captures raw interactions, distills them into compact, reusable knowledge, and periodically refines stored content to preserve relevance and efficiency. We validate the framework in simulated real-world scenarios involving generalization to previously unseen but related tasks, repetitive queries, and over-saturated knowledge stores. Across all settings, experience scaling improves accuracy, sustains performance over time, and maintains gains when applied to novel situations. These results demonstrate that structured post-deployment learning can extend LLM capabilities beyond the limits of static human-generated data, offering a scalable path for continued intelligence progress.
zh

[AI-43] Security smells in infrastructure as code: a taxonomy update beyond the seven sins

【速读】:该论文旨在解决基础设施即代码(Infrastructure as Code, IaC)脚本中安全异味(security smell)识别与治理不足的问题,这类问题可能导致云服务被持续利用以实施攻击。解决方案的关键在于:首先,通过扩展至涵盖七种主流IaC工具(Terraform、Ansible、Chef、Puppet、Pulumi、Saltstack和Vagrant)的多样化数据集,系统性地重新构建并大幅扩充了安全异味的分类体系;其次,引入大语言模型(Large Language Model, LLM)辅助自动化初步模式识别,同时通过人工验证与现有安全标准的交叉校验确保分类准确性,最终形成包含62类安全异味的综合性分类体系,并在7种IaC工具的linters中实现高精度(1.00精确率)的新检测规则,推动DevSecOps实践落地。

链接: https://arxiv.org/abs/2509.18761
作者: Aicha War,Serge L.B. Nikiema,Jordan Samhi,Jacques Klein,Tegawende F. Bissyande
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:Infrastructure as Code (IaC) has become essential for modern software management, yet security flaws in IaC scripts can have severe consequences, as exemplified by the recurring exploits of Cloud Web Services. Prior work has recognized the need to build a precise taxonomy of security smells in IaC scripts as a first step towards developing approaches to improve IaC security. This first effort led to the unveiling of seven sins, limited by the focus on a single IaC tool as well as by the extensive, and potentially biased, manual effort that was required. We propose, in our work, to revisit this taxonomy: first, we extend the study of IaC security smells to a more diverse dataset with scripts associated with seven popular IaC tools, including Terraform, Ansible, Chef, Puppet, Pulumi, Saltstack, and Vagrant; second, we bring in some automation for the analysis by relying on an LLM. While we leverage LLMs for initial pattern processing, all taxonomic decisions underwent systematic human validation and reconciliation with established security standards. Our study yields a comprehensive taxonomy of 62 security smell categories, significantly expanding beyond the previously known seven. We demonstrate actionability by implementing new security checking rules within linters for seven popular IaC tools, often achieving 1.00 precision score. Our evolution study of security smells in GitHub projects reveals that these issues persist for extended periods, likely due to inadequate detection and mitigation tools. This work provides IaC practitioners with insights for addressing common security smells and systematically adopting DevSecOps practices to build safer infrastructure code.
zh

[AI-44] MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning

【速读】:该论文旨在解决手持式夹爪在机器人操作策略学习中因仅依赖第一人称视角(egocentric view)导致的场景上下文信息不足问题,从而限制了任务泛化能力和跨机器人本体(cross-embodiment)迁移效果。解决方案的关键在于提出MV-UMI(Multi-View Universal Manipulation Interface)框架,通过融合第三人称视角与第一人称摄像头数据,增强对环境全局信息的感知能力,有效缓解人类示范与机器人部署之间的域偏移(domain shift),同时保持手持设备在跨本体适应性方面的优势。实验表明,该方法在需要广泛场景理解的子任务上性能提升约47%,显著扩展了基于手持夹爪系统可学习的操作任务范围。

链接: https://arxiv.org/abs/2509.18757
作者: Omar Rayyan,John Abanes,Mahmoud Hafez,Anthony Tzes,Fares Abu-Dakka
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: For project website and videos, see https this https URL

点击查看摘要

Abstract:Recent advances in imitation learning have shown great promise for developing robust robot manipulation policies from demonstrations. However, this promise is contingent on the availability of diverse, high-quality datasets, which are not only challenging and costly to collect but are often constrained to a specific robot embodiment. Portable handheld grippers have recently emerged as intuitive and scalable alternatives to traditional robotic teleoperation methods for data collection. However, their reliance solely on first-person view wrist-mounted cameras often creates limitations in capturing sufficient scene contexts. In this paper, we present MV-UMI (Multi-View Universal Manipulation Interface), a framework that integrates a third-person perspective with the egocentric camera to overcome this limitation. This integration mitigates domain shifts between human demonstration and robot deployment, preserving the cross-embodiment advantages of handheld data-collection devices. Our experimental results, including an ablation study, demonstrate that our MV-UMI framework improves performance in sub-tasks requiring broad scene understanding by approximately 47% across 3 tasks, confirming the effectiveness of our approach in expanding the range of feasible manipulation tasks that can be learned using handheld gripper systems, without compromising the cross-embodiment advantages inherent to such systems.
zh

[AI-45] A Generalized Bisimulation Metric of State Similarity between Markov Decision Processes: From Theoretical Propositions to Applications NEURIPS2025

【速读】:该论文旨在解决生成式 AI (Generative AI) 在多马尔可夫决策过程(MDP)场景下,如策略迁移(policy transfer)中应用传统双模拟度量(bisimulation metric, BSM)时面临的理论分析不足问题。其核心挑战在于:现有方法虽尝试将BSM推广至多个MDP之间,但缺乏对其数学性质的严格论证,限制了理论进展。解决方案的关键是提出一种广义双模拟度量(generalized bisimulation metric, GBSM),并从理论上严格证明其具备三个基本性质——对称性、跨MDP三角不等式以及相同状态空间下的距离上界。基于这些性质,作者进一步推导出在策略迁移、状态聚合和基于采样的估计任务中更紧致的理论边界,并提供了一个闭合形式的样本复杂度表达式,显著优于基于标准BSM的渐近结果。

链接: https://arxiv.org/abs/2509.18714
作者: Zhenyu Tao,Wei Xu,Xiaohu You
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: This paper is accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

Abstract:The bisimulation metric (BSM) is a powerful tool for computing state similarities within a Markov decision process (MDP), revealing that states closer in BSM have more similar optimal value functions. While BSM has been successfully utilized in reinforcement learning (RL) for tasks like state representation learning and policy exploration, its application to multiple-MDP scenarios, such as policy transfer, remains challenging. Prior work has attempted to generalize BSM to pairs of MDPs, but a lack of rigorous analysis of its mathematical properties has limited further theoretical progress. In this work, we formally establish a generalized bisimulation metric (GBSM) between pairs of MDPs, which is rigorously proven with the three fundamental properties: GBSM symmetry, inter-MDP triangle inequality, and the distance bound on identical state spaces. Leveraging these properties, we theoretically analyse policy transfer, state aggregation, and sampling-based estimation in MDPs, obtaining explicit bounds that are strictly tighter than those derived from the standard BSM. Additionally, GBSM provides a closed-form sample complexity for estimation, improving upon existing asymptotic results based on BSM. Numerical results validate our theoretical findings and demonstrate the effectiveness of GBSM in multi-MDP scenarios.
zh

[AI-46] Autonomous Data Agents : A New Opportunity for Smart Data

【速读】:该论文旨在解决当前数据处理过程中存在的劳动密集、重复性强且难以规模化的问题,尤其是在面对日益复杂和庞大的数据时,如何高效地将原始数据转化为可操作的知识。其核心挑战在于数据结构往往不适用于AI的直接利用,且缺乏对知识密度最大化潜力的探索。解决方案的关键在于提出“自主数据代理”(DataAgents),这是一种融合大语言模型(LLM)推理能力与任务分解、动作推理、工具调用及代码生成的新型系统。DataAgents能够自主理解数据任务描述,动态规划执行流程,并通过接地(grounding)机制将抽象动作映射为Python代码或工具调用,从而实现从数据收集、清洗、转换到增强、修复等全流程自动化。这一架构代表了向自主数据到知识系统范式的转变,显著提升了数据处理的智能化水平与可扩展性。

链接: https://arxiv.org/abs/2509.18710
作者: Yanjie Fu,Dongjie Wang,Wangyang Ying,Xiangliang Zhang,Huan Liu,Jian Pei
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As data continues to grow in scale and complexity, preparing, transforming, and analyzing it remains labor-intensive, repetitive, and difficult to scale. Since data contains knowledge and AI learns knowledge from it, the alignment between AI and data is essential. However, data is often not structured in ways that are optimal for AI utilization. Moreover, an important question arises: how much knowledge can we pack into data through intensive data operations? Autonomous data agents (DataAgents), which integrate LLM reasoning with task decomposition, action reasoning and grounding, and tool calling, can autonomously interpret data task descriptions, decompose tasks into subtasks, reason over actions, ground actions into python code or tool calling, and execute operations. Unlike traditional data management and engineering tools, DataAgents dynamically plan workflows, call powerful tools, and adapt to diverse data tasks at scale. This report argues that DataAgents represent a paradigm shift toward autonomous data-to-knowledge systems. DataAgents are capable of handling collection, integration, preprocessing, selection, transformation, reweighing, augmentation, reprogramming, repairs, and retrieval. Through these capabilities, DataAgents transform complex and unstructured data into coherent and actionable knowledge. We first examine why the convergence of agentic AI and data-to-knowledge systems has emerged as a critical trend. We then define the concept of DataAgents and discuss their architectural design, training strategies, as well as the new skills and capabilities they enable. Finally, we call for concerted efforts to advance action workflow optimization, establish open datasets and benchmark ecosystems, safeguard privacy, balance efficiency with scalability, and develop trustworthy DataAgent guardrails to prevent malicious actions.
zh

[AI-47] An overview of neural architectures for self-supervised audio representation learning from masked spectrograms

【速读】:该论文旨在解决当前音频表示学习领域中两个关键问题:一是如何有效利用无标签数据训练通用音频表征模型(audio foundation models),二是如何克服Transformer架构在处理长序列时计算复杂度高(二次方增长)的局限性。其解决方案的关键在于系统性地综述并比较三种主流神经序列建模架构——Transformer、Mamba和xLSTM,在基于掩码频谱图建模(masked spectrogram modeling)任务中的性能表现,并在十个多样化的下游音频分类任务上构建了一个统一且可复现的评估框架,从而为研究者提供决策依据,选择最适合特定应用场景的模型架构。

链接: https://arxiv.org/abs/2509.18691
作者: Sarthak Yadav,Sergios Theodoridis,Zheng-Hua Tan
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In recent years, self-supervised learning has amassed significant interest for training deep neural representations without labeled data. One such self-supervised learning approach is masked spectrogram modeling, where the objective is to learn semantically rich contextual representations by predicting removed or hidden portions of the input audio spectrogram. With the Transformer neural architecture at its core, masked spectrogram modeling has emerged as the prominent approach for learning general purpose audio representations, a.k.a. audio foundation models. Meanwhile, addressing the issues of the Transformer architecture, in particular the underlying Scaled Dot-product Attention operation, which scales quadratically with input sequence length, has led to renewed interest in recurrent sequence modeling approaches. Among them, Selective structured state space models (such as Mamba) and extended Long Short-Term Memory (xLSTM) are the two most promising approaches which have experienced widespread adoption. While the body of work on these two topics continues to grow, there is currently a lack of an adequate overview encompassing the intersection of these topics. In this paper, we present a comprehensive overview of the aforementioned research domains, covering masked spectrogram modeling and the previously mentioned neural sequence modeling architectures, Mamba and xLSTM. Further, we compare Transformers, Mamba and xLSTM based masked spectrogram models in a unified, reproducible framework on ten diverse downstream audio classification tasks, which will help interested readers to make informed decisions regarding suitability of the evaluated approaches to adjacent applications.
zh

[AI-48] Advances in Large Language Models for Medicine

【速读】:该论文旨在系统梳理大语言模型(Large Language Models, LLMs)在医疗领域的研究进展,解决当前医疗领域中LLMs应用不清晰、分类标准模糊及评估方法不统一的问题。其关键解决方案在于:首先,基于训练方法将医疗LLMs创新性地划分为三类,其次,将评价方法归纳为两类,从而为医疗LLMs的开发与评估提供结构化框架;最后,针对现有挑战提出改进策略并指明未来研究方向,推动医疗LLMs向更可靠、可解释和临床实用的方向发展。

链接: https://arxiv.org/abs/2509.18690
作者: Zhiyu Kan,Wensheng Gan,Zhenlian Qi,Philip S. Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Preprint. 5 figures, 4 tables

点击查看摘要

Abstract:Artificial intelligence (AI) technology has advanced rapidly in recent years, with large language models (LLMs) emerging as a significant breakthrough. LLMs are increasingly making an impact across various industries, with the medical field standing out as the most prominent application area. This paper systematically reviews the up-to-date research progress of LLMs in the medical field, providing an in-depth analysis of training techniques for large medical models, their adaptation in healthcare settings, related applications, as well as their strengths and limitations. Furthermore, it innovatively categorizes medical LLMs into three distinct types based on their training methodologies and classifies their evaluation approaches into two categories. Finally, the study proposes solutions to existing challenges and outlines future research directions based on identified issues in the field of medical LLMs. By systematically reviewing previous and advanced research findings, we aim to highlight the necessity of developing medical LLMs, provide a deeper understanding of their current state of development, and offer clear guidance for subsequent research.
zh

[AI-49] Implementation of airborne ML models with semantics preservation

链接: https://arxiv.org/abs/2509.18681
作者: Nicolas Valot,Louis Fabre,Benjamin Lesage,Ammar Mechouche,Claire Pagetti
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-50] NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment

【速读】:该论文旨在解决视障人群在定位和获取周围物体时面临的挑战,特别是现有辅助技术在精度与开放世界识别之间存在权衡的问题:已有系统要么需要预先扫描或仅支持固定类别物体,要么虽具备开放世界物体识别能力但缺乏指向目标的实时空间反馈。解决方案的关键在于提出名为 NaviSense 的移动辅助系统,该系统融合了对话式 AI、视觉-语言模型(vision-language models)、增强现实(AR)和 LiDAR 技术,实现了开放世界物体检测与实时音频-触觉引导的结合,使用户可通过自然语言指定目标物体,并在无需前期设置的情况下获得连续的空间导航反馈以精准抵达目标。

链接: https://arxiv.org/abs/2509.18672
作者: Ajay Narayanan Sridhar(1),Fuli Qiao(1),Nelson Daniel Troncoso Aldas(2),Yanpei Shi(3),Mehrdad Mahdavi(1),Laurent Itti(3),Vijaykrishnan Narayanan(1) ((1) The Pennsylvania State University, (2) Independent Researcher, (3) University of Southern California)
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:People with visual impairments often face significant challenges in locating and retrieving objects in their surroundings. Existing assistive technologies present a trade-off: systems that offer precise guidance typically require pre-scanning or support only fixed object categories, while those with open-world object recognition lack spatial feedback for reaching the object. To address this gap, we introduce ‘NaviSense’, a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR to support open-world object detection with real-time audio-haptic guidance. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target without needing prior setup. Designed with insights from a formative study and evaluated with 12 blind and low-vision participants, NaviSense significantly reduced object retrieval time and was preferred over existing tools, demonstrating the value of integrating open-world perception with precise, accessible guidance.
zh

[AI-51] ERAG : Token-Efficient Graph-Based Retrieval-Augmented Generation ICML

【速读】:该论文旨在解决图结构增强生成(Graph-based Retrieval-augmented Generation, RAG)系统在构建知识图谱过程中因大语言模型(Large Language Models, LLMs)Token消耗过高而导致成本难以大规模应用的问题。其解决方案的关键在于提出TERAG框架,通过在检索阶段引入个性化PageRank(Personalized PageRank, PPR)机制,在显著降低LLM输出Token使用量(仅需现有方法的3%-11%)的前提下,仍能保持至少80%的准确率,从而实现高效且高质量的图结构构建。

链接: https://arxiv.org/abs/2509.18667
作者: Qiao Xiao,Hong Ting Tsang,Jiaxin Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 2 figures, 4 tables. Submitted to the 2026 18th International Conference on Machine Learning and Computing (ICMLC 2026), under review

点击查看摘要

Abstract:Graph-based Retrieval-augmented generation (RAG) has become a widely studied approach for improving the reasoning, accuracy, and factuality of Large Language Models. However, many existing graph-based RAG systems overlook the high cost associated with LLM token usage during graph construction, hindering large-scale adoption. To address this, we propose TERAG, a simple yet effective framework designed to build informative graphs at a significantly lower cost. Inspired by HippoRAG, we incorporate Personalized PageRank (PPR) during the retrieval phase, and we achieve at least 80% of the accuracy of widely used graph-based RAG methods while consuming only 3%-11% of the output tokens.
zh

[AI-52] SPiDR: A Simple Approach for Zero-Shot Safety in Sim-to-Real Transfer

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在真实世界应用中因“仿真到现实(sim-to-real)差距”导致的安全性问题。现有方法如鲁棒安全强化学习虽具理论保障,但难以与主流可扩展训练流程兼容;而广泛使用的领域随机化(Domain Randomization, DR)虽具备良好兼容性,却常引发实际中的不安全行为。解决方案的关键在于提出SPiDR(Sim-to-real via Pessimistic Domain Randomization),其通过悲观域随机化机制将仿真与现实之间的不确定性显式建模进安全约束,从而在保持与现有训练流程高度兼容的同时,提供可证明的安全转移保障。

链接: https://arxiv.org/abs/2509.18648
作者: Yarden As,Chengrui Qu,Benjamin Unger,Dongho Kang,Max van der Hart,Laixi Shi,Stelian Coros,Adam Wierman,Andreas Krause
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Safety remains a major concern for deploying reinforcement learning (RL) in real-world applications. Simulators provide safe, scalable training environments, but the inevitable sim-to-real gap introduces additional safety concerns, as policies must satisfy constraints in real-world conditions that differ from simulation. To address this challenge, robust safe RL techniques offer principled methods, but are often incompatible with standard scalable training pipelines. In contrast, domain randomization, a simple and popular sim-to-real technique, stands out as a promising alternative, although it often results in unsafe behaviors in practice. We present SPiDR, short for Sim-to-real via Pessimistic Domain Randomization – a scalable algorithm with provable guarantees for safe sim-to-real transfer. SPiDR uses domain randomization to incorporate the uncertainty about the sim-to-real gap into the safety constraints, making it versatile and highly compatible with existing training pipelines. Through extensive experiments on sim-to-sim benchmarks and two distinct real-world robotic platforms, we demonstrate that SPiDR effectively ensures safety despite the sim-to-real gap while maintaining strong performance.
zh

[AI-53] Do You Need Proprioceptive States in Visuomotor Policies?

【速读】:该论文旨在解决基于模仿学习的视觉-运动策略(visuomotor policies)在机器人操作中因依赖本体感觉状态(proprioceptive state)而导致的过拟合问题,进而造成空间泛化能力差的缺陷。其解决方案的关键在于提出“无状态策略”(State-free Policy),即完全移除本体感觉输入,仅基于视觉观测(由双广角腕部摄像头提供)预测动作,并在相对末端执行器动作空间中构建策略,从而显著提升策略在不同高度和水平位置上的空间泛化性能。实验证明,该方法在真实世界任务中平均成功率大幅提升,且具备更强的数据效率和跨机器人本体适应性。

链接: https://arxiv.org/abs/2509.18644
作者: Juntu Zhao,Wenbo Lu,Di Zhang,Yufeng Liu,Yushen Liang,Tianluo Zhang,Yifeng Cao,Junyuan Xie,Yingdong Hu,Shengjie Wang,Junliang Guo,Dequan Wang,Yang Gao
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Project page: this https URL

点击查看摘要

Abstract:Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where both visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. On the contrary, we propose the State-free Policy, removing the proprioceptive state input and predicting actions only conditioned on visual observations. The State-free Policy is built in the relative end-effector action space, and should ensure the full task-relevant visual observations, here provided by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, they also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.
zh

[AI-54] Adaptive Learning in Spatial Agent -Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents NEURIPS2025

【速读】:该论文旨在解决气候风险评估中复杂的空间异质性灾害与适应性经济系统之间相互作用的建模难题。其核心挑战在于如何量化直接气候冲击与供应链等间接传导路径带来的系统性风险,并评估适应策略的有效性。解决方案的关键在于构建一个融合地理空间代理模型(geospatial agent-based model)与CLIMADA气候影响评估工具的新框架,通过进化学习机制使企业代理在预算分配、定价、工资和风险适应等方面实现基于适应度的选择与突变,从而模拟长期适应过程。该方法揭示了即使未直接受洪水影响的企业也会因供应链中断而面临价格上升等系统性风险,为金融机构和企业提供了一种可量化直接与级联气候风险并优化适应成本的开放源代码工具。

链接: https://arxiv.org/abs/2509.18633
作者: Yara Mohajerani
机构: 未知
类目: Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
备注: Submitted and accepted to Tackling Climate Change with Machine Learning workshop at NeurIPS 2025. 5 pages, 1 figure. Source code and documentation available at this https URL

点击查看摘要

Abstract:Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.
zh

[AI-55] Generalizable Domain Adaptation for Sim-and-Real Policy Co-Training

链接: https://arxiv.org/abs/2509.18631
作者: Shuo Cheng,Liqian Ma,Zhenyang Chen,Ajay Mandlekar,Caelan Garrett,Danfei Xu
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-56] HyperAdapt: Simple High-Rank Adaptation

【速读】:该论文旨在解决基础模型(Foundation Models)在特定应用场景中微调时面临的高内存与计算开销问题。现有参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法虽能减少训练参数量,但仍存在优化空间。其解决方案的关键在于提出HyperAdapt方法,通过引入行和列方向的对角缩放机制(即使用对角矩阵对预训练权重矩阵进行变换),在仅需 n+mn + m 个可训练参数的前提下实现高秩更新(high-rank update),从而显著降低参数量并保持模型性能。理论分析表明该方法具有可控的更新秩上界,实验验证其在GLUE、算术推理和常识推理等任务上达到或接近全参数微调及当前最优PEFT方法的性能表现。

链接: https://arxiv.org/abs/2509.18629
作者: Abel Gurung,Joseph Campbell
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Foundation models excel across diverse tasks, but adapting them to specialized applications often requires fine-tuning, an approach that is memory and compute-intensive. Parameter-efficient fine-tuning (PEFT) methods mitigate this by updating only a small subset of weights. In this paper, we introduce HyperAdapt, a parameter-efficient fine-tuning method that significantly reduces the number of trainable parameters compared to state-of-the-art methods like LoRA. Specifically, HyperAdapt adapts a pre-trained weight matrix by applying row- and column-wise scaling through diagonal matrices, thereby inducing a high-rank update while requiring only n+m trainable parameters for an n \times m matrix. Theoretically, we establish an upper bound on the rank of HyperAdapt’s updates, and empirically, we confirm that it consistently induces high-rank transformations across model layers. Experiments on GLUE, arithmetic reasoning, and commonsense reasoning benchmarks with models up to 14B parameters demonstrate that HyperAdapt matches or nearly matches the performance of full fine-tuning and state-of-the-art PEFT methods while using orders of magnitude fewer trainable parameters.
zh

[AI-57] he Case for Negative Data: From Crash Reports to Counterfactuals for Reason able Driving

【速读】:该论文旨在解决学习型自动驾驶系统在安全性能边界附近决策能力不足的问题,因其训练数据多为无事故场景,缺乏对危险情境的充分指导。解决方案的关键在于:首先将非结构化的第三方视角车祸报告标准化为以车辆自身为中心的语言,并将其与正常驾驶日志统一转化为场景-动作表示;其次,在决策时通过检索该统一索引中的相关先例来评估当前动作;进一步引入代理式反事实扩展机制,生成可能的替代动作、检索其对应先例并跨结果进行推理,从而提升决策精度。实验表明,该方法显著改善了模型校准性,contextually preferred actions的召回率从24%提升至53%,且反事实变体在高风险区域进一步优化了决策锐度。

链接: https://arxiv.org/abs/2509.18626
作者: Jay Patrikar,Apoorva Sharma,Sushant Veer,Boyi Li,Sebastian Scherer,Marco Pavone
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: 8 pages, 5 figures

点击查看摘要

Abstract:Learning-based autonomous driving systems are trained mostly on incident-free data, offering little guidance near safety-performance boundaries. Real crash reports contain precisely the contrastive evidence needed, but they are hard to use: narratives are unstructured, third-person, and poorly grounded to sensor views. We address these challenges by normalizing crash narratives to ego-centric language and converting both logs and crashes into a unified scene-action representation suitable for retrieval. At decision time, our system adjudicates proposed actions by retrieving relevant precedents from this unified index; an agentic counterfactual extension proposes plausible alternatives, retrieves for each, and reasons across outcomes before deciding. On a nuScenes benchmark, precedent retrieval substantially improves calibration, with recall on contextually preferred actions rising from 24% to 53%. The counterfactual variant preserves these gains while sharpening decisions near risk.
zh

[AI-58] Flow marching for a generative PDE foundation model

【速读】:该论文旨在解决现有偏微分方程(PDE)基础模型多依赖确定性Transformer架构,缺乏生成灵活性的问题,从而限制了其在科学与工程领域中对不确定性建模和多样态模拟的应用。解决方案的关键在于提出“流推进”(Flow Marching)算法,该算法通过分析物理动力系统中的误差累积机制,将神经算子学习与流匹配(flow matching)相结合;同时引入物理预训练变分自编码器(P2VAE)实现物理状态的紧凑隐空间嵌入,并设计高效的流推进Transformer(FMT),融合扩散驱动力(diffusion-forcing)与隐式时间金字塔结构,在保持高生成质量的同时实现高达15倍于完整视频扩散模型的计算效率,从而支持大规模预训练并提升长期滚动预测的稳定性与不确定性感知能力。

链接: https://arxiv.org/abs/2509.18611
作者: Zituo Chen,Sili Deng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Pretraining on large-scale collections of PDE-governed spatiotemporal trajectories has recently shown promise for building generalizable models of dynamical systems. Yet most existing PDE foundation models rely on deterministic Transformer architectures, which lack generative flexibility for many science and engineering applications. We propose Flow Marching, an algorithm that bridges neural operator learning with flow matching motivated by an analysis of error accumulation in physical dynamical systems, and we build a generative PDE foundation model on top of it. By jointly sampling the noise level and the physical time step between adjacent states, the model learns a unified velocity field that transports a noisy current state toward its clean successor, reducing long-term rollout drift while enabling uncertainty-aware ensemble generations. Alongside this core algorithm, we introduce a Physics-Pretrained Variational Autoencoder (P2VAE) to embed physical states into a compact latent space, and an efficient Flow Marching Transformer (FMT) that combines a diffusion-forcing scheme with latent temporal pyramids, achieving up to 15x greater computational efficiency than full-length video diffusion models and thereby enabling large-scale pretraining at substantially reduced cost. We curate a corpus of ~2.5M trajectories across 12 distinct PDE families and train suites of P2VAEs and FMTs at multiple scales. On downstream evaluation, we benchmark on unseen Kolmogorov turbulence with few-shot adaptation, demonstrate long-term rollout stability over deterministic counterparts, and present uncertainty-stratified ensemble results, highlighting the importance of generative PDE foundation models for real-world applications.
zh

[AI-59] End-to-End Crop Row Navigation via LiDAR-Based Deep Reinforcement Learning

【速读】:该论文旨在解决农业环境中树冠下(under-canopy)的可靠导航问题,其核心挑战包括全球导航卫星系统(GNSS)信号不可靠、作物行间杂乱以及光照条件变化等因素。解决方案的关键在于提出一种端到端的学习型导航系统,通过深度强化学习策略直接将原始3D激光雷达(LiDAR)数据映射为控制指令,并且整个策略在仿真环境中训练完成,无需依赖标注数据集或人工设计的控制接口。该方法采用体素(voxel)降采样策略,将LiDAR输入尺寸减少95.83%,显著提升了策略学习效率,验证结果表明其在直线种植区实现100%成功率,且性能随行弯曲程度增加而逐渐下降,显示出良好的泛化能力。

链接: https://arxiv.org/abs/2509.18608
作者: Ana Luiza Mineiro,Francisco Affonso,Marcelo Becker
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted to the 22nd International Conference on Advanced Robotics (ICAR 2025). 7 pages

点击查看摘要

Abstract:Reliable navigation in under-canopy agricultural environments remains a challenge due to GNSS unreliability, cluttered rows, and variable lighting. To address these limitations, we present an end-to-end learning-based navigation system that maps raw 3D LiDAR data directly to control commands using a deep reinforcement learning policy trained entirely in simulation. Our method includes a voxel-based downsampling strategy that reduces LiDAR input size by 95.83%, enabling efficient policy learning without relying on labeled datasets or manually designed control interfaces. The policy was validated in simulation, achieving a 100% success rate in straight-row plantations and showing a gradual decline in performance as row curvature increased, tested across varying sinusoidal frequencies and amplitudes.
zh

[AI-60] LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA

【速读】:该论文旨在解决具身智能(embodied intelligence)中多模态语义学习的关键挑战,特别是异构数据的有效融合与资源受限环境下的计算效率问题。其解决方案的核心在于提出轻量级的LCMF级联注意力框架,通过在Mamba模块中引入多层级跨模态参数共享机制,结合Cross-Attention与选择性参数共享的状态空间模型(Selective parameter-sharing State Space Models, SSMs)的优势,实现了异构模态间的高效融合与语义互补对齐。该设计在保持高性能的同时显著降低计算复杂度,验证了其在视觉问答(VQA)任务中达到74.29%准确率,并在视频问答(EQA)任务中达到与大型语言模型代理(LLM Agents)相当的中等水平性能,同时FLOPs减少4.35倍,适用于人机交互(HRI)场景中的资源受限部署。

链接: https://arxiv.org/abs/2509.18576
作者: Zeyi Kang(1),Liang He(2),Yanxin Zhang(3),Zuheng Ming(4),Kaixing Zhao(5) ((1) Northwestern Polytechnical University, (2) University Sorbonne Paris Nord)
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal semantic learning plays a critical role in embodied intelligence, especially when robots perceive their surroundings, understand human instructions, and make intelligent decisions. However, the field faces technical challenges such as effective fusion of heterogeneous data and computational efficiency in resource-constrained environments. To address these challenges, this study proposes the lightweight LCMF cascaded attention framework, introducing a multi-level cross-modal parameter sharing mechanism into the Mamba module. By integrating the advantages of Cross-Attention and Selective parameter-sharing State Space Models (SSMs), the framework achieves efficient fusion of heterogeneous modalities and semantic complementary alignment. Experimental results show that LCMF surpasses existing multimodal baselines with an accuracy of 74.29% in VQA tasks and achieves competitive mid-tier performance within the distribution cluster of Large Language Model Agents (LLM Agents) in EQA video tasks. Its lightweight design achieves a 4.35-fold reduction in FLOPs relative to the average of comparable baselines while using only 166.51M parameters (image-text) and 219M parameters (video-text), providing an efficient solution for Human-Robot Interaction (HRI) applications in resource-constrained scenarios with strong multimodal decision generalization capabilities.
zh

[AI-61] he Ranking Blind Spot: Decision Hijacking in LLM -based Text Ranking EMNLP2025

链接: https://arxiv.org/abs/2509.18575
作者: Yaoyao Qian,Yifan Zeng,Yuchao Jiang,Chelsi Jain,Huazheng Wang
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025

点击查看摘要

[AI-62] Interaction Topological Transformer for Multiscale Learning in Porous Materials

【速读】:该论文旨在解决多孔材料中结构-性能关系的预测建模难题,特别是由于局部化学环境与全局孔道网络拓扑之间的多尺度相互作用,以及标签数据稀疏且分布不均导致的跨材料族泛化能力不足的问题。解决方案的关键在于提出一种名为交互拓扑变换器(Interaction Topological Transformer, ITT)的统一数据高效框架,其核心创新是引入新型交互拓扑(interaction topology)以捕获从结构、元素、原子到成对元素等多个尺度和层次的信息;ITT通过提取具有尺度感知能力的特征,并利用内置的Transformer架构实现跨尺度联合推理,从而有效整合组成信息与关联结构,在两阶段训练策略(先自监督预训练0.6百万个无标签结构,再监督微调)下实现了吸附、传输和稳定性等性质的高精度、可迁移预测。

链接: https://arxiv.org/abs/2509.18573
作者: Dong Chen,Jian Liu,Chun-Long Chen,Guo-Wei Wei
机构: 未知
类目: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI)
备注: 4 figures, 2 tables

点击查看摘要

Abstract:Porous materials exhibit vast structural diversity and support critical applications in gas storage, separations, and catalysis. However, predictive modeling remains challenging due to the multiscale nature of structure-property relationships, where performance is governed by both local chemical environments and global pore-network topology. These complexities, combined with sparse and unevenly distributed labeled data, hinder generalization across material families. We propose the Interaction Topological Transformer (ITT), a unified data-efficient framework that leverages novel interaction topology to capture materials information across multiple scales and multiple levels, including structural, elemental, atomic, and pairwise-elemental organization. ITT extracts scale-aware features that reflect both compositional and relational structure within complex porous frameworks, and integrates them through a built-in Transformer architecture that supports joint reasoning across scales. Trained using a two-stage strategy, i.e., self-supervised pretraining on 0.6 million unlabeled structures followed by supervised fine-tuning, ITT achieves state-of-the-art, accurate, and transferable predictions for adsorption, transport, and stability properties. This framework provides a principled and scalable path for learning-guided discovery in structurally and chemically diverse porous materials.
zh

[AI-63] Explore the Reinforcement Learning for the LLM based ASR and TTS system

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在自动语音识别(Automatic Speech Recognition, ASR)和语音合成(Text-to-Speech, TTS)系统中应用受限的问题,尤其是由于音频模型训练复杂性高导致RL方法尚未被充分探索。解决方案的关键在于提出一个轻量级的强化学习框架,专为处理音频输入与输出的大语言模型(Large Language Models, LLMs)设计,并基于该框架在ASR任务中测试不同规则奖励函数在Group Relative Policy Optimization (GRPO)中的效果,在TTS任务中对比GRPO与可微分奖励优化(Differentiable Reward Optimization, DiffRO)的效果,并进一步融合二者以提升性能。实验表明,该方法即使在数据有限和优化步数较少的情况下,仍能显著增强ASR与TTS系统的性能。

链接: https://arxiv.org/abs/2509.18569
作者: Changfeng Gao,Yabin Li,Keyu An,Zhifu Gao,Zhihao Du,Han Zhao,Xiangang Li
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In recent years, large language models (LLMs) have played an important role in automatic speech recognition (ASR) and text-to-speech (TTS) systems. While reinforcement learning (RL) has significantly enhanced LLM performance in text-based tasks, its application to ASR and TTS remains underexplored due to the complexity of training audio-based models. In this study, we propose a lightweight RL framework tailored for audio-based LLMs that can process audio inputs and generate audio outputs. Based on this framework, we evaluate the effectiveness of reinforcement learning on both ASR and TTS tasks. For the ASR task, we experiment with different rule-based reward functions within the Group Relative Policy Optimization (GRPO) framework and investigate the impact of RL data construction. For the TTS task, we compare GRPO with Differentiable Reward Optimization (DiffRO) and further combine the two approaches to achieve improved performance. Our experiments demonstrate that RL can significantly enhance the performance of both ASR and TTS systems, even with limited training data and a small number of optimization steps.
zh

[AI-64] Solving Math Word Problems Using Estimation Verification and Equation Generation ICML

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在求解数学应用题(Math Word Problems, MWPs)时表现不佳的问题,尤其是其在复杂推理和数学运算能力上的局限性。解决方案的关键在于引入一种两阶段验证与迭代修正机制:首先,LLM被提示将问题分解并生成方程,随后借助外部符号方程求解器获得初步答案;接着,LLM再次被要求以估算为目标重新求解同一问题,通过比较估算结果与初解来验证答案正确性;若验证失败,则启动迭代修正过程直至得到准确解。该方法显著提升了LLMs在数值型和代数型MWPs上的性能,并首次在三角函数类MWPs上取得良好效果。

链接: https://arxiv.org/abs/2509.18565
作者: Mitchell Piehl,Dillon Wilson,Ananya Kalita,Jugal Kalita
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to IEEE ICMLA 2025

点击查看摘要

Abstract:Large Language Models (LLMs) excel at various tasks, including problem-solving and question-answering. However, LLMs often find Math Word Problems (MWPs) challenging because solving them requires a range of reasoning and mathematical abilities with which LLMs seem to struggle. Recent efforts have helped LLMs solve more complex MWPs with improved prompts. This study proposes a novel method that initially prompts an LLM to create equations from a decomposition of the question, followed by using an external symbolic equation solver to produce an answer. To ensure the accuracy of the obtained answer, inspired by an established recommendation of math teachers, the LLM is instructed to solve the MWP a second time, but this time with the objective of estimating the correct answer instead of solving it exactly. The estimation is then compared to the generated answer to verify. If verification fails, an iterative rectification process is employed to ensure the correct answer is eventually found. This approach achieves new state-of-the-art results on datasets used by prior published research on numeric and algebraic MWPs, improving the previous best results by nearly two percent on average. In addition, the approach obtains satisfactory results on trigonometric MWPs, a task not previously attempted to the authors’ best knowledge. This study also introduces two new datasets, SVAMPClean and Trig300, to further advance the testing of LLMs’ reasoning abilities.
zh

[AI-65] CPCLDETECTOR: Knowledge Enhancement and Alignment Selection for Chinese Patronizing and Condescending Language Detection ICASSP2025

【速读】:该论文旨在解决中文语境下隐性歧视性有害言论——即“中国式奉承与轻蔑语言”(Chinese Patronizing and Condescending Language, CPCL)在视频平台上的识别难题。现有数据集缺乏用户评论,而评论是理解视频内容的关键信息源,导致模型对CPCL视频的检测效果受限。为弥补这一缺陷,研究构建了一个包含10.3万条评论条目的新数据集PCLMMPLUS,并提出CPCLDetector模型,其核心创新在于引入对齐选择机制(alignment selection)和知识增强型评论内容模块(knowledge-enhanced comment content modules),从而显著提升对CPCL视频的识别准确率,在PCLMM和扩展数据集PCLMMPLUS上均优于当前最优方法(SOTA),有效支持内容治理并保护弱势群体。

链接: https://arxiv.org/abs/2509.18562
作者: Jiaxun Yang,Yifei Han,Long Zhang,Liu Yujie,Bin Li,Bo Gao,Yangfan He,Kejia Zhan
机构: 未知
类目: Multimedia (cs.MM); Artificial Intelligence (cs.AI)
备注: Submitted to ICASSP 2025

点击查看摘要

Abstract:Chinese Patronizing and Condescending Language (CPCL) is an implicitly discriminatory toxic speech targeting vulnerable groups on Chinese video platforms. The existing dataset lacks user comments, which are a direct reflection of video content. This undermines the model’s understanding of video content and results in the failure to detect some CPLC videos. To make up for this loss, this research reconstructs a new dataset PCLMMPLUS that includes 103k comment entries and expands the dataset size. We also propose the CPCLDetector model with alignment selection and knowledge-enhanced comment content modules. Extensive experiments show the proposed CPCLDetector outperforms the SOTA on PCLMM and achieves higher performance on PCLMMPLUS . CPLC videos are detected more accurately, supporting content governance and protecting vulnerable groups. Code and dataset are available at this https URL.
zh

[AI-66] LLM Z: Contextual Prompt Whitelist Principles for Agent ic LLM s ICML

【速读】:该论文旨在解决agentic LLM(智能代理大语言模型)因具备对数据源和API工具的特权访问权限,以及其非确定性行为(仅定义最终目标,路径由LLM自主选择)所带来的操作安全与信息安全风险问题。传统防御机制主要依赖恶意意图检测以阻止提示注入(prompt injection)等越狱攻击,但存在局限性。本文提出解决方案 LLMZ+,其关键在于采用提示白名单(prompt whitelisting)机制,通过限定交互内容必须符合预定义的上下文语境与使用场景,确保外部用户与LLM之间的所有通信均在安全边界内进行,从而实现对常见越狱攻击的强韧性防护,同时保持合法业务通信的无中断运行。实验证明,该方法可将误报率(false positive rate)和漏报率(false negative rate)降至零。

链接: https://arxiv.org/abs/2509.18557
作者: Tom Pawelek,Raj Patel,Charlotte Crowell,Noorbakhsh Amiri,Sudip Mittal,Shahram Rahimi,Andy Perkins
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 7 pages, 5 figures, to be published and presented at ICMLA 2025

点击查看摘要

Abstract:Compared to traditional models, agentic AI represents a highly valuable target for potential attackers as they possess privileged access to data sources and API tools, which are traditionally not incorporated into classical agents. Unlike a typical software application residing in a Demilitarized Zone (DMZ), agentic LLMs consciously rely on nondeterministic behavior of the AI (only defining a final goal, leaving the path selection to LLM). This characteristic introduces substantial security risk to both operational security and information security. Most common existing defense mechanism rely on detection of malicious intent and preventing it from reaching the LLM agent, thus protecting against jailbreak attacks such as prompt injection. In this paper, we present an alternative approach, LLMZ+, which moves beyond traditional detection-based approaches by implementing prompt whitelisting. Through this method, only contextually appropriate and safe messages are permitted to interact with the agentic LLM. By leveraging the specificity of context, LLMZ+ guarantees that all exchanges between external users and the LLM conform to predefined use cases and operational boundaries. Our approach streamlines the security framework, enhances its long-term resilience, and reduces the resources required for sustaining LLM information security. Our empirical evaluation demonstrates that LLMZ+ provides strong resilience against the most common jailbreak prompts. At the same time, legitimate business communications are not disrupted, and authorized traffic flows seamlessly between users and the agentic LLM. We measure the effectiveness of approach using false positive and false negative rates, both of which can be reduced to 0 in our experimental setting.
zh

[AI-67] Global Minimizers of Sigmoid Contrastive Loss NEURIPS2025

【速读】:该论文旨在解决对比预训练(contrastive pretraining)中表示学习的优化问题,特别是如何通过调整可训练的逆温度(inverse temperature)和偏置(bias)来提升模型在跨模态检索任务中的表现。其解决方案的关键在于提出了一种新的理论框架——(\mathsfm, \mathsfb_\mathsfrel)-Constellations(一种与球面码相关的组合结构),用于刻画在Sigmoid损失下能够使损失函数趋于零的样本配置。该理论不仅解释了SigLIP系列模型在检索任务中成功的原因,还揭示了模态间隙(modality gap)的来源,并指明了生成高质量表示所需的最低维度。此外,作者进一步提出了显式引入相对偏置的损失重参数化方法,在合成数据实验中显著改善了训练动态。

链接: https://arxiv.org/abs/2509.18552
作者: Kiril Bangachev,Guy Bresler,Iliyas Noman,Yury Polyanskiy
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Author names listed in alphabetical order. NeurIPS 2025

点击查看摘要

Abstract:The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call (\mathsfm, \mathsfb_\mathsfrel) -Constellations. (\mathsfm, \mathsfb_\mathsfrel) -Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin \mathsfm and relative bias \mathsfb_\mathsfrel . We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
zh

[AI-68] Symphony-MoE: Harmonizing Disparate Pre-trained Models into a Coherent Mixture-of-Experts

【速读】:该论文旨在解决当前基于upcycling策略构建混合专家(Mixture-of-Experts, MoE)模型时存在的专家多样性不足问题。现有方法通常从单一预训练密集模型中复制前馈网络(Feed-Forward Network, FFN)层作为专家,导致所有专家结构相似、缺乏多样性,限制了MoE模型的性能潜力。为克服这一局限,论文提出Symphony-MoE框架,其核心创新在于通过两阶段设计实现来自异构预训练模型(如Llama2-Chat与Code Llama)的专家有效融合:第一阶段在无训练条件下建立专家间的协调性,采用分层融合策略构建共享骨干,并利用基于激活的功能对齐缓解参数空间不一致;第二阶段仅需轻量级路由器训练即可统一分配专家权重,从而显著提升多领域任务表现和分布外泛化能力。

链接: https://arxiv.org/abs/2509.18542
作者: Qi Wang,Hanyang Peng,Yue Yu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Mixture-of-Experts (MoE) models enable scalable performance by activating large parameter sets sparsely, minimizing computational overhead. To circumvent the prohibitive cost of training MoEs from scratch, recent work employs upcycling, reusing a single pre-trained dense model by replicating its feed-forward network (FFN) layers into experts. However, this limits expert diversity, as all experts originate from a single pre-trained dense model. This paper addresses this limitation by constructing powerful MoE models using experts sourced from multiple identically-architected but disparate pre-trained models (e.g., Llama2-Chat and Code Llama). A key challenge lies in the fact that these source models occupy disparate, dissonant regions of the parameter space, making direct upcycling prone to severe performance degradation. To overcome this, we propose Symphony-MoE, a novel two-stage framework designed to harmonize these models into a single, coherent expert mixture. First, we establish this harmony in a training-free manner: we construct a shared backbone via a layer-aware fusion strategy and, crucially, alleviate parameter misalignment among experts using activation-based functional alignment. Subsequently, a single lightweight stage of router training coordinates the entire architecture. Experiments demonstrate that our method successfully integrates experts from heterogeneous sources, achieving an MoE model that significantly surpasses baselines in multi-domain tasks and out-of-distribution generalization.
zh

[AI-69] FERA: Foil Fencing Referee Assistant Using Pose-Based Multi-Label Move Recognition and Rule Reasoning

【速读】:该论文旨在解决击剑(foil fencing)比赛中裁判判罚中存在的主观性、人为错误、偏见以及训练环境中裁判资源有限等问题。解决方案的关键在于提出了一种名为FERA(Fencing Referee Assistant)的AI裁判原型系统,其核心创新包括:基于姿态的多标签动作识别与规则推理相结合的方法;通过提取视频中的2D关节位置并计算101维运动学特征集,利用Transformer模型实现对击剑动作和剑刃状态的多标签分类;同时采用蒸馏后的语言模型编码优先权(right-of-way)规则,自动生成每轮交锋的判罚决策及其解释。该方法在少量标注数据下实现了优于TCN、BiLSTM和普通Transformer等基线模型的性能(平均宏F1分数为0.549),验证了自动化裁判辅助在击剑领域的可行性与潜力。

链接: https://arxiv.org/abs/2509.18527
作者: Ziwen Chen,Zhong Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The sport of fencing, like many other sports, faces challenges in refereeing: subjective calls, human errors, bias, and limited availability in practice environments. We present FERA (Fencing Referee Assistant), a prototype AI referee for foil fencing which integrates pose-based multi-label action recognition and rule-based reasoning. FERA extracts 2D joint positions from video, normalizes them, computes a 101-dimensional kinematic feature set, and applies a Transformer for multi-label move and blade classification. To determine priority and scoring, FERA applies a distilled language model with encoded right-of-way rules, producing both a decision and an explanation for each exchange. With limited hand-labeled data, a 5-fold cross-validation achieves an average macro-F1 score of 0.549, outperforming multiple baselines, including a Temporal Convolutional Network (TCN), BiLSTM, and a vanilla Transformer. While not ready for deployment, these results demonstrate a promising path towards automated referee assistance in foil fencing and new opportunities for AI applications, such as coaching in the field of fencing.
zh

[AI-70] Automatic coherence-driven inference on arguments

【速读】:该论文试图解决法律、行政与法理学领域中普遍存在的不一致性问题,这类不一致性往往阻碍了立法分析、政策制定和法律推理的严谨性。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)从论证中准确提取命题,并将其编译为自然的数据结构,从而通过组合优化实现基于一致性的推理(Coherence-Driven Inference, CDI)。该方法构建了一种神经符号架构,天然地分离了不同关注点,使对论证一致性的有意义判断成为可能,进而支持更可靠的法律与政策分析。

链接: https://arxiv.org/abs/2509.18523
作者: Steve Huntsman
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Workshop on Data Mining and AI for Law ( this https URL )

点击查看摘要

Abstract:Inconsistencies are ubiquitous in law, administration, and jurisprudence. Though a cure is too much to hope for, we propose a technological remedy. Large language models (LLMs) can accurately extract propositions from arguments and compile them into natural data structures that enable coherence-driven inference (CDI) via combinatorial optimization. This neurosymbolic architecture naturally separates concerns and enables meaningful judgments about the coherence of arguments that can inform legislative and policy analysis and legal reasoning.
zh

[AI-71] APRIL: Active Partial Rollouts in Reinforcement Learning to tame long-tail generation

【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)训练中因长尾分布的rollout响应长度导致的计算效率低下问题,尤其在大规模预训练语言模型(Large-Scale Pre-trained Language Models, LLMs)的RL训练场景下,rollout生成占总运行时间超过90%,且少数超长响应会阻塞整个batch,造成GPU资源闲置。解决方案的关键在于提出主动部分rollout机制(Active Partial Rollouts in Reinforcement Learning, APRIL):在rollout阶段超额分配请求,一旦达到目标响应数量即终止,并将未完成的rollout回收用于后续步骤继续生成,从而避免丢弃任何rollout并显著降低GPU空闲时间。该方法在不依赖特定框架或硬件的前提下提升了rollout吞吐量最多达44%,加速收敛并提升最终任务准确率最高达8%。

链接: https://arxiv.org/abs/2509.18521
作者: Yuzhen Zhou,Jiajun Li,Yusheng Su,Gowtham Ramesh,Zilin Zhu,Xiang Long,Chenyang Zhao,Jin Pan,Xiaodong Yu,Ze Wang,Kangrui Du,Jialian Wu,Ximeng Sun,Jiang Liu,Qiaolin Yu,Hao Chen,Zicheng Liu,Emad Barsoum
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Reinforcement learning (RL) has become a cornerstone in advancing large-scale pre-trained language models (LLMs). Successive generations, including GPT-o series, DeepSeek-R1, Kimi-K1.5, Grok 4, and GLM-4.5, have relied on large-scale RL training to enhance reasoning and coding capabilities. To meet the community’s growing RL needs, numerous RL frameworks have been proposed. Most of these frameworks primarily rely on inference engines for rollout generation and training engines for policy updates. However, RL training remains computationally expensive, with rollout generation accounting for more than 90% of total runtime. In addition, its efficiency is often constrained by the long-tail distribution of rollout response lengths, where a few lengthy responses stall entire batches, leaving GPUs idle and underutilized. As model and rollout sizes continue to grow, this bottleneck increasingly limits scalability. To address this challenge, we propose Active Partial Rollouts in Reinforcement Learning (APRIL), which mitigates long-tail inefficiency. In the rollout phase, APRIL over-provisions rollout requests, terminates once the target number of responses is reached, and recycles incomplete responses for continuation in future steps. This strategy ensures that no rollouts are discarded while substantially reducing GPU idle time. Experiments show that APRIL improves rollout throughput by at most 44% across commonly used RL algorithms (GRPO, DAPO, GSPO), accelerates convergence, and achieves at most 8% higher final accuracy across tasks. Moreover, APRIL is both framework and hardware agnostic, already integrated into the slime RL framework, and deployable on NVIDIA and AMD GPUs alike. Taken together, this work unifies system-level and algorithmic considerations in proposing APRIL, with the aim of advancing RL training efficiency and inspiring further optimizations in RL systems.
zh

[AI-72] Coherence-driven inference for cybersecurity

【速读】:该论文旨在解决网络安全领域中红蓝对抗(red and blue team operations)场景下,如何实现基于自然语言数据的自动连贯性驱动推理(coherence-driven inference, CDI),以提升决策效率与自动化水平。其解决方案的关键在于利用大语言模型(large language models, LLMs)对自然语言数据进行加权图构建,从而实现自动化的CDI机制,为网络安全决策提供结构化推理支持,并展现出在短期内增强人工决策、中期推动自主蓝队操作的潜力。

链接: https://arxiv.org/abs/2509.18520
作者: Steve Huntsman
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: LLM4Sec - Workshop on the use of Large Language Models for Cybersecurity ( this https URL )

点击查看摘要

Abstract:Large language models (LLMs) can compile weighted graphs on natural language data to enable automatic coherence-driven inference (CDI) relevant to red and blue team operations in cybersecurity. This represents an early application of automatic CDI that holds near- to medium-term promise for decision-making in cybersecurity and eventually also for autonomous blue team operations.
zh

[AI-73] PrioriTouch: Adapting to User Contact Preferences for Whole-Arm Physical Human-Robot Interaction

【速读】:该论文旨在解决物理人机交互(Physical Human-Robot Interaction, pHRI)中多接触场景下的控制目标冲突问题,特别是在照护任务中因人体不同部位对力的偏好不一致而导致的不可调和约束。其核心挑战在于如何在多个同时接触点上进行优先级排序,以实现个性化、安全且高效的交互。解决方案的关键在于提出PrioriTouch框架,该框架结合了新颖的“学习排序”(learning-to-rank)方法与分层操作空间控制(hierarchical operational space control),通过仿真内回放(simulation-in-the-loop rollouts)实现数据高效且安全的探索,并将用户个体舒适阈值纳入控制优先级决策机制,从而在保证任务性能的同时提升交互的安全性和舒适性。

链接: https://arxiv.org/abs/2509.18447
作者: Rishabh Madan,Jiawei Lin,Mahika Goel,Angchen Xie,Xiaoyu Liang,Marcus Lee,Justin Guo,Pranav N. Thakkar,Rohan Banerjee,Jose Barreiros,Kate Tsui,Tom Silver,Tapomayukh Bhattacharjee
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Conference on Robot Learning (CoRL)

点击查看摘要

Abstract:Physical human-robot interaction (pHRI) requires robots to adapt to individual contact preferences, such as where and how much force is applied. Identifying preferences is difficult for a single contact; with whole-arm interaction involving multiple simultaneous contacts between the robot and human, the challenge is greater because different body parts can impose incompatible force requirements. In caregiving tasks, where contact is frequent and varied, such conflicts are unavoidable. With multiple preferences across multiple contacts, no single solution can satisfy all objectives–trade-offs are inherent, making prioritization essential. We present PrioriTouch, a framework for ranking and executing control objectives across multiple contacts. PrioriTouch can prioritize from a general collection of controllers, making it applicable not only to caregiving scenarios such as bed bathing and dressing but also to broader multi-contact settings. Our method combines a novel learning-to-rank approach with hierarchical operational space control, leveraging simulation-in-the-loop rollouts for data-efficient and safe exploration. We conduct a user study on physical assistance preferences, derive personalized comfort thresholds, and incorporate them into PrioriTouch. We evaluate PrioriTouch through extensive simulation and real-world experiments, demonstrating its ability to adapt to user contact preferences, maintain task performance, and enhance safety and comfort. Website: this https URL.
zh

[AI-74] Scattering Transformer: A Training-Free Transformer Architecture for Heart Murmur Detection

【速读】:该论文旨在解决心脏杂音(heart murmur)自动检测中因标注数据有限而导致的监督学习方法性能受限的问题,同时应对通用音频基础模型在计算资源上的高消耗问题。其解决方案的关键在于提出一种无需训练的轻量级Transformer架构——Scattering Transformer,该方法利用标准的小波散射网络(wavelet scattering networks)引入类Transformer的上下文依赖关系,且不依赖反向传播机制,从而在资源受限环境下实现了与当前最先进方法相当的性能,验证了其在临床部署中的可行性与潜力。

链接: https://arxiv.org/abs/2509.18424
作者: Rami Zewail
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:In an attempt to address the need for skilled clinicians in heart sound interpretation, recent research efforts on automating cardiac auscultation have explored deep learning approaches. The majority of these approaches have been based on supervised learning that is always challenged in occasions where training data is limited. More recently, there has been a growing interest in potentials of pre-trained self-supervised audio foundation models for biomedical end tasks. Despite exhibiting promising results, these foundational models are typically computationally intensive. Within the context of automatic cardiac auscultation, this study explores a lightweight alternative to these general-purpose audio foundation models by introducing the Scattering Transformer, a novel, training-free transformer architecture for heart murmur detection. The proposed method leverages standard wavelet scattering networks by introducing contextual dependencies in a transformer-like architecture without any backpropagation. We evaluate our approach on the public CirCor DigiScope dataset, directly comparing it against leading general-purpose foundational models. The Scattering Transformer achieves a Weighted Accuracy(WAR) of 0.786 and an Unweighted Average Recall(UAR) of 0.697, demonstrating performance highly competitive with contemporary state of the art methods. This study establishes the Scattering Transformer as a viable and promising alternative in resource-constrained setups.
zh

[AI-75] Instruction-Following Evaluation in Function Calling for Large Language Models

【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在函数调用(Function Calling)任务中对格式指令遵循不足的问题。现有基准如BFCL、tau²-Bench和ACEBench仅评估参数值的正确性,而未检验模型是否严格遵守嵌入在参数描述中的格式要求(如双引号包裹、ISO日期格式等),这限制了AI代理在真实场景下的可靠性。其解决方案的关键在于提出IFEval-FC基准,通过将可验证的格式规范直接编码至JSON schema中(例如明确指定某字段值不得包含标点符号),并设计750个测试用例,每个包含一个带格式约束的函数及对应的用户查询,实现完全算法化的自动化评估,从而客观、可复现且可扩展地衡量模型对格式指令的精确遵循能力。

链接: https://arxiv.org/abs/2509.18420
作者: Nikolai Skripko
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Function calling is a core capability of large language models, essential for AI agents. Existing benchmarks such as the Berkeley Function Calling Leaderboard (BFCL), tau^2-Bench (arXiv:2506.07982), and ACEBench (arXiv:2501.12851) evaluate argument correctness but do not test adherence to format instructions embedded in parameter descriptions, such as enclosing values in double quotes or using ISO date formats. We introduce IFEval-FC, a benchmark inspired by IFEval (arXiv:2311.07911) that assesses precise instruction following in function calling. IFEval-FC encodes verifiable formats directly within JSON schema descriptions, for example specifying that a value must not contain punctuation. It includes 750 test cases, each consisting of a function with an embedded format for one of its input parameters and a corresponding user query. Evaluation is fully algorithmic, ensuring objectivity, reproducibility, and scalability. Our results show that even state-of-the-art proprietary models, including GPT-5 and Claude 4.1 Opus, frequently fail to follow basic formatting rules, highlighting a practical limitation for real-world agent systems. The complete codebase and data are publicly available at this https URL. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2509.18420 [cs.AI] (or arXiv:2509.18420v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.18420 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-76] Context Lineage Assurance for Non-Human Identities in Critical Multi-Agent Systems

链接: https://arxiv.org/abs/2509.18415
作者: Sumana Malkapuram,Sameera Gangavarapu,Kailashnath Reddy Kavalakuntla,Ananya Gangavarapu
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-77] Assistive Decision-Making for Right of Way Navigation at Uncontrolled Intersections

【速读】:该论文旨在解决无控制交叉口(uncontrolled intersections)中因路权规则模糊、遮挡及驾驶员行为不可预测而导致的交通事故问题。其解决方案的关键在于将路权推理建模为部分可观测马尔可夫决策过程(Partially Observable Markov Decision Process, POMDP),并在此框架下比较四种决策方法:确定性有限状态机(Finite State Machine, FSM)与三种概率规划器(QMDP、POMCP 和 DESPOT)。实验表明,概率规划器在部分可观测条件下显著优于规则基线,其中 POMCP 最注重安全性,DESPOT 在效率与计算可行性之间取得平衡,从而验证了不确定性感知规划在驾驶辅助系统中的重要性。

链接: https://arxiv.org/abs/2509.18407
作者: Navya Tiwari,Joseph Vazhaeparampil,Victoria Preston
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 6 pages, 5 figures. Accepted as a poster at Northeast Robotics Colloquium (NERC 2025). Extended abstract

点击查看摘要

Abstract:Uncontrolled intersections account for a significant fraction of roadway crashes due to ambiguous right-of-way rules, occlusions, and unpredictable driver behavior. While autonomous vehicle research has explored uncertainty-aware decision making, few systems exist to retrofit human-operated vehicles with assistive navigation support. We present a driver-assist framework for right-of-way reasoning at uncontrolled intersections, formulated as a Partially Observable Markov Decision Process (POMDP). Using a custom simulation testbed with stochastic traffic agents, pedestrians, occlusions, and adversarial scenarios, we evaluate four decision-making approaches: a deterministic finite state machine (FSM), and three probabilistic planners: QMDP, POMCP, and DESPOT. Results show that probabilistic planners outperform the rule-based baseline, achieving up to 97.5 percent collision-free navigation under partial observability, with POMCP prioritizing safety and DESPOT balancing efficiency and runtime feasibility. Our findings highlight the importance of uncertainty-aware planning for driver assistance and motivate future integration of sensor fusion and environment perception modules for real-time deployment in realistic traffic environments.
zh

[AI-78] ATLAS: Benchmarking and Adapting LLM s for Global Trade via Harmonized Tariff Code Classification

链接: https://arxiv.org/abs/2509.18400
作者: Pritish Yuvraj,Siva Devarakonda
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-79] An Artificial Intelligence Value at Risk Approach: Metrics and Models

链接: https://arxiv.org/abs/2509.18394
作者: Luis Enriquez Alvarez
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Risk Management (q-fin.RM)
备注:

点击查看摘要

[AI-80] Graph Enhanced Trajectory Anomaly Detection

【速读】:该论文旨在解决轨迹异常检测(Trajectory Anomaly Detection)中因忽略道路网络结构和语义信息而导致的检测精度不足问题。现有方法通常将轨迹视为位置序列或高阶抽象(如停留点),并在欧几里得空间中分析,未能考虑实际移动路径受道路拓扑约束与连接性的影响,从而难以识别在路网约束环境下发生的细微异常行为。其解决方案的关键在于提出Graph Enhanced Trajectory Anomaly Detection (GETAD)框架,该框架通过图注意力网络(Graph Attention Network)学习融合道路拓扑结构、路段语义特征及历史通行模式的路网感知嵌入(road-aware embeddings),并引入基于图结构的位置编码以反映道路布局的空间特性;同时采用Transformer解码器建模时间序列移动行为,并设计多目标损失函数(结合自回归预测与监督链接预测)确保表征的真实性与结构性一致性;此外,创新性地提出置信加权负对数似然(Confidence Weighted Negative Log Likelihood, CW NLL)作为异常评分函数,强化对高置信度偏离行为的敏感性,显著提升了复杂路网环境下的异常检测鲁棒性与准确性。

链接: https://arxiv.org/abs/2509.18386
作者: Jonathan Kabala Mbuya,Dieter Pfoser,Antonios Anastasopoulos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Trajectory anomaly detection is essential for identifying unusual and unexpected movement patterns in applications ranging from intelligent transportation systems to urban safety and fraud prevention. Existing methods only consider limited aspects of the trajectory nature and its movement space by treating trajectories as sequences of sampled locations, with sampling determined by positioning technology, e.g., GPS, or by high-level abstractions such as staypoints. Trajectories are analyzed in Euclidean space, neglecting the constraints and connectivity information of the underlying movement network, e.g., road or transit networks. The proposed Graph Enhanced Trajectory Anomaly Detection (GETAD) framework tightly integrates road network topology, segment semantics, and historical travel patterns to model trajectory data. GETAD uses a Graph Attention Network to learn road-aware embeddings that capture both physical attributes and transition behavior, and augments these with graph-based positional encodings that reflect the spatial layout of the road network. A Transformer-based decoder models sequential movement, while a multiobjective loss function combining autoregressive prediction and supervised link prediction ensures realistic and structurally coherent representations. To improve the robustness of anomaly detection, we introduce Confidence Weighted Negative Log Likelihood (CW NLL), an anomaly scoring function that emphasizes high-confidence deviations. Experiments on real-world and synthetic datasets demonstrate that GETAD achieves consistent improvements over existing methods, particularly in detecting subtle anomalies in road-constrained environments. These results highlight the benefits of incorporating graph structure and contextual semantics into trajectory modeling, enabling more precise and context-aware anomaly detection. Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) Cite as: arXiv:2509.18386 [cs.LG] (or arXiv:2509.18386v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.18386 Focus to learn more arXiv-issued DOI via DataCite Submission history From: Jonathan Mbuya [view email] [v1] Mon, 22 Sep 2025 20:15:15 UTC (4,804 KB) Full-text links: Access Paper: View a PDF of the paper titled Graph Enhanced Trajectory Anomaly Detection, by Jonathan Kabala Mbuya and 2 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2025-09 Change to browse by: cs cs.AI References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack
zh

[AI-81] Gödel Test: Can Large Language Models Solve Easy Conjectures?

链接: https://arxiv.org/abs/2509.18383
作者: Moran Feldman,Amin Karbasi
机构: 未知
类目: Artificial Intelligence (cs.AI); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-82] Evaluating the Safety and Skill Reasoning of Large Reasoning Models Under Compute Constraints

链接: https://arxiv.org/abs/2509.18382
作者: Adarsha Balaji,Le Chen,Rajeev Thakur,Franck Cappello,Sandeep Madireddy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-83] Multi-Worker Selection based Distributed Swarm Learning for Edge IoT with Non-i.i.d. Data

【速读】:该论文旨在解决分布式群体学习(Distributed Swarm Learning, DSL)在多接入边缘计算环境中因非独立同分布(non-i.i.d.)数据导致的模型训练性能下降与收敛行为不稳定的问题。现有方法在面对数据异质性时缺乏理论指导,难以有效保障全局模型的准确性与鲁棒性。解决方案的关键在于提出一种新的多工作者选择机制——M-DSL算法,其核心创新包括:引入一个全新的non-i.i.d.程度度量指标,用于量化本地数据集间的统计差异,并据此动态筛选对全局模型更新贡献显著的多个工作节点;同时提供了严格的收敛性理论分析,证明该机制能有效提升DSL在异构数据下的训练效率与精度。实验结果表明,M-DSL在多种非i.i.d.数据设置下均优于基准方法,显著增强了网络智能水平。

链接: https://arxiv.org/abs/2509.18367
作者: Zhuoyu Yao,Yue Wang,Songyang Zhang,Yingshu Li,Zhipeng Cai,Zhi Tian
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in distributed swarm learning (DSL) offer a promising paradigm for edge Internet of Things. Such advancements enhance data privacy, communication efficiency, energy saving, and model scalability. However, the presence of non-independent and identically distributed (non-i.i.d.) data pose a significant challenge for multi-access edge computing, degrading learning performance and diverging training behavior of vanilla DSL. Further, there still lacks theoretical guidance on how data heterogeneity affects model training accuracy, which requires thorough investigation. To fill the gap, this paper first study the data heterogeneity by measuring the impact of non-i.i.d. datasets under the DSL framework. This then motivates a new multi-worker selection design for DSL, termed M-DSL algorithm, which works effectively with distributed heterogeneous data. A new non-i.i.d. degree metric is introduced and defined in this work to formulate the statistical difference among local datasets, which builds a connection between the measure of data heterogeneity and the evaluation of DSL performance. In this way, our M-DSL guides effective selection of multiple works who make prominent contributions for global model updates. We also provide theoretical analysis on the convergence behavior of our M-DSL, followed by extensive experiments on different heterogeneous datasets and non-i.i.d. data settings. Numerical results verify performance improvement and network intelligence enhancement provided by our M-DSL beyond the benchmarks.
zh

[AI-84] FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在推理阶段因自回归生成的顺序特性而导致的吞吐量瓶颈问题。现有方法中,多标记预测(Multi-Token Prediction, MTP)虽在训练效率和性能上表现优异,但其在推理加速方面的潜力尚未被充分挖掘。解决方案的关键在于提出FastMTP,通过将MTP训练过程与推理时的推测解码(speculative decoding)模式对齐,优化了多步草稿质量:具体而言,采用位置共享权重的单头MTP结构在自蒸馏数据上进行微调,使模型能够捕捉连续未来标记间的依赖关系,并在多轮递归草稿步骤中保持高接受率;同时引入基于语言感知的动态词汇压缩机制,进一步降低草稿阶段的计算开销。实验表明,FastMTP在七个基准测试上相较标准逐token预测实现平均2.03倍加速,且输出质量无损,优于原始MTP达82%,且仅需轻量级训练并可无缝集成至现有推理框架。

链接: https://arxiv.org/abs/2509.18362
作者: Yuxuan Cai,Xiaozhuan Liang,Xinghua Wang,Jin Ma,Haijin Liang,Jinwen Luo,Xinyu Zuo,Lisheng Duan,Yuyang Yin,Xi Chen
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As large language models (LLMs) become increasingly powerful, the sequential nature of autoregressive generation creates a fundamental throughput bottleneck that limits the practical deployment. While Multi-Token Prediction (MTP) has demonstrated remarkable benefits for model training efficiency and performance, its inherent potential for inference acceleration remains largely unexplored. This paper introduces FastMTP, a simple yet effective method that improves multi-step draft quality by aligning MTP training with its inference pattern, significantly enhancing speculative decoding performance. Our approach fine-tunes a single MTP head with position-shared weights on self-distilled data, enabling it to capture dependencies among consecutive future tokens and maintain high acceptance rates across multiple recursive draft steps. By integrating language-aware dynamic vocabulary compression into the MTP head, we further reduce computational overhead in the drafting process. Experimental results across seven diverse benchmarks demonstrate that FastMTP achieves an average of 2.03x speedup compared to standard next token prediction with lossless output quality, outperforming vanilla MTP by 82%. FastMTP requires only lightweight training and seamlessly integrates with existing inference frameworks, offering a practical and rapidly deployable solution for accelerating LLM inference.
zh

[AI-85] Reading Between the Lines: Scalable User Feedback via Implicit Sentiment in Developer Prompts

【速读】:该论文旨在解决大规模评估开发者对对话式人工智能助手(conversational AI assistants)满意度的难题。传统用户研究虽能提供丰富洞察,但难以扩展;而基于日志或产品内评分的大规模定量信号则往往过于浅层或稀疏,可靠性不足。解决方案的关键在于利用对开发者提示(prompts)进行情感分析(sentiment analysis),从中识别隐含的用户满意度信号。研究表明,该方法可在约8%的交互中捕捉到有效信号,显著高于显式反馈(超过13倍),且即使使用现成的情感分析工具也具备合理准确性,从而为构建可扩展的开发者体验理解体系提供了新路径。

链接: https://arxiv.org/abs/2509.18361
作者: Daye Nam,Malgorzata Salawa,Satish Chandra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:

点击查看摘要

Abstract:Evaluating developer satisfaction with conversational AI assistants at scale is critical but challenging. User studies provide rich insights, but are unscalable, while large-scale quantitative signals from logs or in-product ratings are often too shallow or sparse to be reliable. To address this gap, we propose and evaluate a new approach: using sentiment analysis of developer prompts to identify implicit signals of user satisfaction. With an analysis of industrial usage logs of 372 professional developers, we show that this approach can identify a signal in ~8% of all interactions, a rate more than 13 times higher than explicit user feedback, with reasonable accuracy even with an off-the-shelf sentiment analysis approach. This new practical approach to complement existing feedback channels would open up new directions for building a more comprehensive understanding of the developer experience at scale.
zh

[AI-86] Chiplet-Based RISC-V SoC with Modular AI Acceleration

【速读】:该论文旨在解决边缘人工智能(Edge AI)设备在实现高性能、高能效与低成本的同时保持架构灵活性的难题。当前单片系统级芯片(SoC)设计因先进制程节点(如360 mm²)下制造良率低(低于16%)而难以平衡上述需求。解决方案的关键在于提出一种基于芯粒(chiplet)的RISC-V SoC架构,其核心创新包括:面向跨芯粒的自适应动态电压频率调节(DVFS);支持流控单元和压缩感知传输的AI感知通用芯粒互连Express(UCIe)协议扩展;异构芯粒间的分布式密码安全机制;以及由传感器驱动的智能负载迁移策略。该架构集成7nm RISC-V CPU芯粒、双5nm AI加速器(每颗15 TOPS INT8)、16GB HBM3内存堆栈及专用电源管理控制器,在MobileNetV2、ResNet-50等标准基准测试中实现了约14.7%延迟降低、17.3%吞吐量提升和16.2%功耗下降,整体能效提升达40.1%,且满足亚毫秒级实时性要求,验证了模块化芯粒设计可在接近单片集成密度的同时实现成本效益、可扩展性和可升级性。

链接: https://arxiv.org/abs/2509.18355
作者: P. Ramkumar,S. S. Bharadwaj
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
备注: 3 pages, 3 figures and 2 tables

点击查看摘要

Abstract:Achieving high performance, energy efficiency, and cost-effectiveness while maintaining architectural flexibility is a critical challenge in the development and deployment of edge AI devices. Monolithic SoC designs struggle with this complex balance mainly due to low manufacturing yields (below 16%) at advanced 360 mm^2 process nodes. This paper presents a novel chiplet-based RISC-V SoC architecture that addresses these limitations through modular AI acceleration and intelligent system level optimization. Our proposed design integrates 4 different key innovations in a 30mm x 30mm silicon interposer: adaptive cross-chiplet Dynamic Voltage and Frequency Scaling (DVFS); AI-aware Universal Chiplet Interconnect Express (UCIe) protocol extensions featuring streaming flow control units and compression-aware transfers; distributed cryptographic security across heterogeneous chiplets; and intelligent sensor-driven load migration. The proposed architecture integrates a 7nm RISC-V CPU chiplet with dual 5nm AI accelerators (15 TOPS INT8 each), 16GB HBM3 memory stacks, and dedicated power management controllers. Experimental results across industry standard benchmarks like MobileNetV2, ResNet-50 and real-time video processing demonstrate significant performance improvements. The AI-optimized configuration achieves ~14.7% latency reduction, 17.3% throughput improvement, and 16.2% power reduction compared to previous basic chiplet implementations. These improvements collectively translate to a 40.1% efficiency gain corresponding to ~3.5 mJ per MobileNetV2 inference (860 mW/244 images/s), while maintaining sub-5ms real-time capability across all experimented workloads. These performance upgrades demonstrate that modular chiplet designs can achieve near-monolithic computational density while enabling cost efficiency, scalability and upgradeability, crucial for next-generation edge AI device applications.
zh

[AI-87] PEEK: Guiding and Minimal Image Representations for Zero-Shot Generalization of Robot Manipulation Policies

链接: https://arxiv.org/abs/2509.18282
作者: Jesse Zhang,Marius Memmel,Kevin Kim,Dieter Fox,Jesse Thomason,Fabio Ramos,Erdem Bıyık,Abhishek Gupta,Anqi Li
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 11 pages

点击查看摘要

[AI-88] Perceptions of AI Across Sectors: A Comparative Review of Public Attitudes

【速读】:该论文试图解决的问题是:如何系统理解不同领域中公众对人工智能(Artificial Intelligence, AI)态度的差异及其影响因素,从而为负责任的人工智能治理提供依据。其解决方案的关键在于通过领域中介的比较分析方法,整合251项关于公众AI态度的研究,识别出个体、情境与技术因素的共性与特性,并揭示机构信任、公平感知和伦理关切等跨域变量的作用机制,进而提出更具针对性和情境敏感性的AI治理策略。

链接: https://arxiv.org/abs/2509.18233
作者: Filip Bialy,Mark Elliot,Robert Meckin
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper offers a domain-mediated comparative review of 251 studies on public attitudes toward AI, published between 2011 and 2025. Drawing on a systematic literature review, we analyse how different factors including perceived benefits and concerns (or risks) shape public acceptance of - or resistance to - artificial intelligence across domains and use-cases, including healthcare, education, security, public administration, generative AI, and autonomous vehicles. The analysis highlights recurring patterns in individual, contextual, and technical factors influencing perception, while also tracing variations in institutional trust, perceived fairness, and ethical concerns. We show that the public perception in AI is shaped not only by technical design or performance but also by sector-specific considerations as well as imaginaries, cultural narratives, and historical legacies. This comparative approach offers a foundation for developing more tailored and context-sensitive strategies for responsible AI governance.
zh

[AI-89] Enhanced Interpretable Knowledge Tracing for Students Performance Prediction with Human understandable Feature Space

链接: https://arxiv.org/abs/2509.18231
作者: Sein Minn,Roger Nkambou
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: International Conference on Artificial Intelligence in Education

点击查看摘要

[AI-90] owards General Computer Control with Hierarchical Agents and Multi-Level Action Spaces

【速读】:该论文旨在解决通过软件控制桌面应用程序这一基础但尚未充分解决的问题,尤其针对现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在推理延迟高、长程稀疏奖励任务下样本效率低以及无法部署到设备端等局限性。其解决方案的关键在于提出一种轻量级分层强化学习框架 ComputerAgent,该框架将操作系统控制建模为两级选项过程(管理器与子策略),采用三模态状态编码器(截图、任务ID、数值状态)以应对视觉与上下文多样性,引入元动作(meta-actions)结合早停机制减少无效交互,并使用紧凑的视觉主干网络和小型策略网络实现设备端推理(仅15M参数)。该方法在135个真实桌面任务上实现了92.1%的简单任务成功率和58.8%的复杂任务成功率,性能媲美甚至超越200B参数MLLM基线,同时模型规模缩小超四个数量级、推理时间减半,验证了分层强化学习在计算机控制自动化中的实用性与可扩展性。

链接: https://arxiv.org/abs/2509.18230
作者: Zihan Dong,Xinyu Fan,Zixiang Tang,Yunqing Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Controlling desktop applications via software remains a fundamental yet under-served problem. Existing multi-modal large language models (MLLMs) ingest screenshots and task instructions to generate keystrokes and mouse events, but they suffer from prohibitive inference latency, poor sample efficiency on long-horizon sparse-reward tasks, and infeasible on-device deployment. We introduce a lightweight hierarchical reinforcement learning framework, ComputerAgent, that formulates OS control as a two-level option process (manager and subpolicy), employs a triple-modal state encoder (screenshot, task ID, numeric state) to handle visual and contextual diversity, integrates meta-actions with an early-stop mechanism to reduce wasted interactions, and uses a compact vision backbone plus small policy networks for on-device inference (15M parameters). On a suite of 135 real-world desktop tasks, ComputerAgent attains 92.1% success on simple tasks (8 steps) and 58.8% on hard tasks (=8 steps), matching or exceeding 200B-parameter MLLM baselines on simple scenarios while reducing model size by over four orders of magnitude and halving inference time. These results demonstrate that hierarchical RL offers a practical, scalable alternative to monolithic MLLM-based automation for computer control.
zh

[AI-91] An N-Plus-1 GPT Agency for Critical Solution of Mechanical Engineering Analysis Problems

【速读】:该论文旨在解决生成式 AI(Generative AI),特别是 GPT 在机械工程分析问题中存在不可靠性的问题——即同一问题在不同实例中可能产生正确或错误的解,成功概率仅为 85%,这限制了其在教育和工程实践中的直接部署。解决方案的关键在于提出一种“N-Plus-1”GPT代理架构(Agency),通过并行运行 N 个独立的 Agent Solve 实例获取多个候选解,再由 Agent Compare 对这些解进行汇总与比较,基于 Condorcet’s Jury Theorem 推断出最可能正确的解;该方法不仅提升了整体解的可靠性,还具备识别不同数学模型或求解路径的能力,从而增强透明度与教学价值。

链接: https://arxiv.org/abs/2509.18229
作者: Anthony Patera,Rohan Abeyaratne
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Generative AI, and specifically GPT, can produce a remarkable solution to a mechanical engineering analysis problem - but also, on occasion, a flawed solution. For example, an elementary mechanics problem is solved flawlessly in one GPT instance and incorrectly in a subsequent GPT instance, with a success probability of only 85%. This unreliability renders “out-of-the-box” GPT unsuitable for deployment in education or engineering practice. We introduce an “N-Plus-1” GPT Agency for Initial (Low-Cost) Analysis of mechanical engineering Problem Statements. Agency first launches N instantiations of Agent Solve to yield N independent Proposed Problem Solution Realizations; Agency then invokes Agent Compare to summarize and compare the N Proposed Problem Solution Realizations and to provide a Recommended Problem Solution. We argue from Condorcet’s Jury Theorem that, for a Problem Statement characterized by per-Solve success probability greater than 1/2 (and N sufficiently large), the Predominant (Agent Compare) Proposed Problem Solution will, with high probability, correspond to a Correct Proposed Problem Solution. Furthermore, Agent Compare can also incorporate aspects of Secondary (Agent Compare) Proposed Problem Solutions, in particular when the latter represent alternative Problem Statement interpretations - different Mathematical Models - or alternative Mathematical Solution Procedures. Comparisons to Grok Heavy, a commercial multi-agent model, show similarities in design and performance, but also important differences in emphasis: our Agency focuses on transparency and pedagogical value.
zh

[AI-92] From “What to Eat?” to Perfect Recipe: ChefMinds Chain-of-Exploration for Ambiguous User Intent in Recipe Recommendation ICASSP2026

链接: https://arxiv.org/abs/2509.18226
作者: Yu Fu,Linyue Cai,Ruoyu Wu,Yong Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 5 pages, 3 figures, submitted to icassp 2026

点击查看摘要

[AI-93] Multimodal Health Risk Prediction System for Chronic Diseases via Vision-Language Fusion and Large Language Models

【速读】:该论文旨在解决慢性疾病日益增长的全球负担下,如何利用多模态异构临床数据(如医学影像、自由文本记录、可穿戴设备流数据等)构建统一的AI框架以主动预测个体健康风险的问题。其解决方案的关键在于提出VL-RiskFormer——一种分层堆叠的视觉-语言多模态Transformer架构,嵌入大型语言模型(Large Language Model, LLM)推理头于顶层,并包含四项核心创新:(i) 基于动量更新编码器和去偏InfoNCE损失函数的跨模态对比预训练与细粒度对齐;(ii) 通过自适应时间间隔位置编码实现不规则就诊序列融合的时序融合模块;(iii) 利用ICD-10疾病本体图适配器注入疾病编码并借助图注意力机制推断共病模式。在MIMIC-IV纵向队列上,该模型实现了平均AUROC为0.90、预期校准误差仅为2.7%的优异性能。

链接: https://arxiv.org/abs/2509.18221
作者: Dingxin Lu,Shurui Wu,Xinyi Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:With the rising global burden of chronic diseases and the multimodal and heterogeneous clinical data (medical imaging, free-text recordings, wearable sensor streams, etc.), there is an urgent need for a unified multimodal AI framework that can proactively predict individual health risks. We propose VL-RiskFormer, a hierarchical stacked visual-language multimodal Transformer with a large language model (LLM) inference head embedded in its top layer. The system builds on the dual-stream architecture of existing visual-linguistic models (e.g., PaLM-E, LLaVA) with four key innovations: (i) pre-training with cross-modal comparison and fine-grained alignment of radiological images, fundus maps, and wearable device photos with corresponding clinical narratives using momentum update encoders and debiased InfoNCE losses; (ii) a time fusion block that integrates irregular visit sequences into the causal Transformer decoder through adaptive time interval position coding; (iii) a disease ontology map adapter that injects ICD-10 codes into visual and textual channels in layers and infers comorbid patterns with the help of a graph attention mechanism. On the MIMIC-IV longitudinal cohort, VL-RiskFormer achieved an average AUROC of 0.90 with an expected calibration error of 2.7 percent.
zh

[AI-94] Similarity Field Theory: A Mathematical Framework for Intelligence

【速读】:该论文旨在解决如何从相似性关系的结构演化角度,为智能系统提供一个统一的数学基础与形式化定义。其核心问题是:在动态系统中,实体之间的相似性如何作为结构基础,并通过演化过程体现智能行为?解决方案的关键在于提出相似性场理论(Similarity Field Theory),该理论以映射 $ S: U \times U \to [0,1] $ 描述实体间相似度,允许非对称性和非传递性;并通过引入概念(concepts)作为超水平集(superlevel sets)来刻画知识结构,以及定义生成算子 $ G $ 的智能性——即若生成的新实体仍属于原概念的纤维,则称该算子对该概念具有智能性。此框架不仅形式化了智能的生成机制,还通过两个定理(不对称性阻止互包含、稳定性需锚点或水平集约束)确保系统演化的可解释性和结构性限制,从而为理解大语言模型等复杂系统的社会认知能力提供了理论工具与实验探针。

链接: https://arxiv.org/abs/2509.18218
作者: Kei-Sing Ng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field S: U \times U \to [0,1] over a universe of entities U , satisfying reflexivity S(E,E)=1 and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence Z_p = (X_p, S^§) indexed by p=0,1,2,\ldots ; (3) concepts K as entities that induce fibers F_\alpha(K) = E \in U \mid S(E,K) \ge \alpha , i.e., superlevel sets of the unary map S_K(E) := S(E,K) ; and (4) a generative operator G that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator G is intelligent with respect to a concept K if, given a system containing entities belonging to the fiber of K , it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability requires either an anchor coordinate or eventual confinement within a level set of f . These results ensure that the evolution of similarity fields is both constrained and interpretable, culminating in an exploration of how the framework allows us to interpret large language models and use them as experimental probes into societal cognition.
zh

[AI-95] nDNA – the Semantic Helix of Artificial Cognition

【速读】:该论文旨在解决当前AI基础模型在行为评估之外的“内在认知身份”问题,即如何刻画模型内部语义结构的稳定性和演化规律。传统基准测试仅能衡量输出行为,而忽视了模型潜在空间中意义流动的几何特性。解决方案的关键在于提出神经DNA(Neural DNA, nDNA),这是一种基于潜空间几何结构的语义基因型表征,由三个核心维度构成:谱曲率(spectral curvature)揭示概念流在层间的弯曲程度;热力学长度(thermodynamic length)量化语义迁移所需的能量代价;信念向量场(belief vector field)刻画引导模型信念方向的语义张力场。nDNA作为坐标无关的神经指纹,可追踪预训练、微调、对齐、剪枝、蒸馏等过程中的语义继承与漂移,从而实现对人工认知演化的建模与治理。

链接: https://arxiv.org/abs/2509.18216
作者: Amitava Das
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:As AI foundation models grow in capability, a deeper question emerges: What shapes their internal cognitive identity – beyond fluency and output? Benchmarks measure behavior, but the soul of a model resides in its latent geometry. In this work, we propose Neural DNA (nDNA) as a semantic-genotypic representation that captures this latent identity through the intrinsic geometry of belief. At its core, nDNA is synthesized from three principled and indispensable dimensions of latent geometry: spectral curvature, which reveals the curvature of conceptual flow across layers; thermodynamic length, which quantifies the semantic effort required to traverse representational transitions through layers; and belief vector field, which delineates the semantic torsion fields that guide a model’s belief directional orientations. Like biological DNA, it encodes ancestry, mutation, and semantic inheritance, found in finetuning and alignment scars, cultural imprints, and architectural drift. In naming it, we open a new field: Neural Genomics, where models are not just tools, but digital semantic organisms with traceable inner cognition. Modeling statement. We read AI foundation models as semantic fluid–dynamics: meaning is transported through layers like fluid in a shaped conduit; nDNA is the physics-grade readout of that flow – a geometry-first measure of how meaning is bent, paid for, and pushed – yielding a stable, coordinate-free neural DNA fingerprint tied to on-input behavior; with this fingerprint we cross into biology: tracing lineages across pretraining, fine-tuning, alignment, pruning, distillation, and merges; measuring inheritance between checkpoints; detecting drift as traits shift under new data or objectives; and, ultimately, studying the evolution of artificial cognition to compare models, diagnose risks, and govern change over time. Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2509.18216 [cs.AI] (or arXiv:2509.18216v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.18216 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-96] Change in Quantitative Bipolar Argumentation: Sufficient Necessary and Counterfactual Explanations

【速读】:该论文旨在解决定量双极论证框架(Quantitative Bipolar Argumentation Frameworks, QBAFs)中推理变化的解释问题,即当QBAF被更新后,原有结论发生变化时,如何系统性地追踪并解释这种变化。其核心解决方案是引入“强度不一致”(strength inconsistency)的概念,用于刻画语义在特定主题论点(topic arguments)上建立的论点强度偏序关系的变化,并将这些不一致的成因归因于具体论点,从而提供充分、必要及反事实等类型的解释。关键在于通过形式化方法识别出导致强度不一致的具体论点,进而实现对推理变化的可解释性分析,且证明了强度不一致解释的存在性与更新引发强度不一致的等价性。

链接: https://arxiv.org/abs/2509.18215
作者: Timotheus Kampik,Kristijonas Čyras,José Ruiz Alarcón
机构: 未知
类目: Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO); Multiagent Systems (cs.MA)
备注: The publisher’s version contains a notation glitch in Example 3, 5th line, first sub-script G should be G’. This has always been G’ in authors’ version. Thanks to J. Lanser for pointing this out

点击查看摘要

Abstract:This paper presents a formal approach to explaining change of inference in Quantitative Bipolar Argumentation Frameworks (QBAFs). When drawing conclusions from a QBAF and updating the QBAF to then again draw conclusions (and so on), our approach traces changes – which we call strength inconsistencies – in the partial order over argument strengths that a semantics establishes on some arguments of interest, called topic arguments. We trace the causes of strength inconsistencies to specific arguments, which then serve as explanations. We identify sufficient, necessary, and counterfactual explanations for strength inconsistencies and show that strength inconsistency explanations exist if and only if an update leads to strength inconsistency. We define a heuristic-based approach to facilitate the search for strength inconsistency explanations, for which we also provide an implementation.
zh

[AI-97] Variational Task Vector Composition

链接: https://arxiv.org/abs/2509.18208
作者: Boyuan Zhang,Yingjun Du,Xiantong Zhen,Ling Shao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-98] MMCD: Multi-Modal Collaborative Decision-Making for Connected Autonomy with Knowledge Distillation

链接: https://arxiv.org/abs/2509.18198
作者: Rui Liu,Zikang Wang,Peng Gao,Yu Shen,Pratap Tokekar,Ming Lin
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Robotics (cs.RO)
备注:

点击查看摘要

[AI-99] MNV-17: A High-Quality Performative Mandarin Dataset for Nonverbal Vocalization Recognition in Speech ICASSP2026

链接: https://arxiv.org/abs/2509.18196
作者: Jialong Mai,Jinxin Ji,Xiaofen Xing,Chen Yang,Weidong Chen,Jingyuan Xing,Xiangmin Xu
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Submitted to ICASSP 2026

点击查看摘要

[AI-100] An Outcome-Based Educational Recommender System

链接: https://arxiv.org/abs/2509.18186
作者: Nursultan Askarbekuly,Timur Fayzrakhmanov,Sladjan Babarogić,Ivan Luković
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-101] Synthesizing Attitudes Predicting Actions (SAPA): Behavioral Theory-Guided LLM s for Ridesourcing Mode Choice Modeling

链接: https://arxiv.org/abs/2509.18181
作者: Mustafa Sameen,Xiaojian Zhang,Xilei Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-102] Large Language Models and Operations Research: A Structured Survey

链接: https://arxiv.org/abs/2509.18180
作者: Yang Wang,Kai Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-103] Foam-Agent : An End-to-End Composable Multi-Agent Framework for Automating CFD Simulation in OpenFOAM

链接: https://arxiv.org/abs/2509.18178
作者: Ling Yue,Nithin Somasekharan,Tingwen Zhang,Yadi Cao,Shaowu Pan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-104] HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics

【速读】:该论文旨在解决长文档语义解析(semantic parsing)中因成对组合和内存需求随文本长度呈二次增长(O(N²))而导致的计算效率与可扩展性难题。其核心解决方案是提出分层段图记忆机制(Hierarchical Segment-Graph Memory, HSGM),关键在于将输入文本分解为M个有意义的片段(segment),在每个片段上构建局部语义图(Local Semantic Graph),并通过提取紧凑的摘要节点(summary node)形成全局图记忆(Global Graph Memory)。该框架支持增量更新,仅新到达的片段触发局部图构建与摘要节点整合,并通过分层查询处理机制——先基于摘要节点进行Top-K检索定位相关片段,再在局部图内执行细粒度推理——从而将最坏情况复杂度从O(N²)降低至O(Nk + (N/k)²),其中段大小k ≪ N。理论层面,作者还推导了节点摘要和稀疏阈值引入的Frobenius范数误差边界;实验表明,HSGM在三个基准任务上实现了2–4倍推理加速、峰值内存减少60%,且保持≥95%基线准确率,显著提升了超长文本的语义建模能力,适用于实时与资源受限的自然语言处理场景。

链接: https://arxiv.org/abs/2509.18168
作者: Dong Liu,Yanxuan Yu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Semantic parsing of long documents remains challenging due to quadratic growth in pairwise composition and memory requirements. We introduce \textbfHierarchical Segment-Graph Memory (HSGM), a novel framework that decomposes an input of length N into M meaningful segments, constructs \emphLocal Semantic Graphs on each segment, and extracts compact \emphsummary nodes to form a \emphGlobal Graph Memory. HSGM supports \emphincremental updates – only newly arrived segments incur local graph construction and summary-node integration – while \emphHierarchical Query Processing locates relevant segments via top- K retrieval over summary nodes and then performs fine-grained reasoning within their local graphs. Theoretically, HSGM reduces worst-case complexity from O(N^2) to O!\left(N,k + (N/k)^2\right) , with segment size k \ll N , and we derive Frobenius-norm bounds on the approximation error introduced by node summarization and sparsification thresholds. Empirically, on three benchmarks – long-document AMR parsing, segment-level semantic role labeling (OntoNotes), and legal event extraction – HSGM achieves \emph2–4 \times inference speedup, \emph 60% reduction in peak memory, and \emph \ge 95% of baseline accuracy. Our approach unlocks scalable, accurate semantic modeling for ultra-long texts, enabling real-time and resource-constrained NLP applications. Subjects: Artificial Intelligence (cs.AI) Cite as: arXiv:2509.18168 [cs.AI] (or arXiv:2509.18168v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2509.18168 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh

[AI-105] Developing Training Procedures for Piecewise-linear Spline Activation Functions in Neural Networks

链接: https://arxiv.org/abs/2509.18161
作者: William H Patty
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-106] WLFM: A Well-Logs Foundation Model for Multi-Task and Cross-Well Geological Interpretation

链接: https://arxiv.org/abs/2509.18152
作者: Zhenyu Qi,Qing Yu,Jichen Wang,Yun-Bo Zhao,Zerui Li,Wenjun Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-107] HyperNAS: Enhancing Architecture Representation for NAS Predictor via Hypernetwork

【速读】:该论文旨在解决神经架构搜索(Neural Architecture Search, NAS)中因性能评估耗时过长而导致的效率瓶颈问题。现有方法依赖代理数据集上的代理模型进行架构性能预测,但其泛化能力不足,难以捕捉不同架构间的复杂关系。解决方案的关键在于提出HyperNAS,其核心创新包括:(1)全局编码方案(global encoding scheme),用于捕获架构的宏观结构信息;(2)共享超网络(shared hypernetwork),作为辅助任务以增强对跨架构模式的学习;并引入动态自适应多任务损失函数以提升训练稳定性与个性化帕累托前沿探索能力。实验表明,HyperNAS在多个搜索空间(包括ViT)中显著优于现有方法,尤其在少样本场景下表现突出,例如在CIFAR-10上达到97.60% top-1准确率,且样本量减少至少5倍。

链接: https://arxiv.org/abs/2509.18151
作者: Jindi Lv,Yuhao Zhou,Yuxin Tian,Qing Ye,Wentao Feng,Jiancheng Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Time-intensive performance evaluations significantly impede progress in Neural Architecture Search (NAS). To address this, neural predictors leverage surrogate models trained on proxy datasets, allowing for direct performance predictions for new architectures. However, these predictors often exhibit poor generalization due to their limited ability to capture intricate relationships among various architectures. In this paper, we propose HyperNAS, a novel neural predictor paradigm for enhancing architecture representation learning. HyperNAS consists of two primary components: a global encoding scheme and a shared hypernetwork. The global encoding scheme is devised to capture the comprehensive macro-structure information, while the shared hypernetwork serves as an auxiliary task to enhance the investigation of inter-architecture patterns. To ensure training stability, we further develop a dynamic adaptive multi-task loss to facilitate personalized exploration on the Pareto front. Extensive experiments across five representative search spaces, including ViTs, demonstrate the advantages of HyperNAS, particularly in few-shot scenarios. For instance, HyperNAS strikes new state-of-the-art results, with 97.60% top-1 accuracy on CIFAR-10 and 82.4% top-1 accuracy on ImageNet, using at least 5.0 \times fewer samples.
zh

[AI-108] Sparse Training Scheme for Multimodal LLM

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在训练过程中存在的效率低下问题,主要表现为因多模态数据引入的超长输入序列导致计算资源浪费,以及层间计算利用不足。其解决方案的关键在于提出一种基于稀疏表示的训练高效框架——稀疏训练方案(Sparse Training Scheme, STS),该方案包含两个核心组件:视觉令牌压缩器(Visual Token Compressor),通过压缩视觉令牌降低信息负载;以及层动态跳过机制(Layer Dynamic Skipper),在前向和反向传播中动态跳过不必要的语言模型层以减少计算开销。此方法适用于多种MLLM架构,并在多个基准测试中验证了其有效性与效率。

链接: https://arxiv.org/abs/2509.18150
作者: Kean Shi,Liang Chen,Haozhe Zhao,Baobao Chang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient due to the significantly longer input sequences introduced by multimodal data and the low utilization of inter-layer computations. To address this challenge, we shift the focus to the training process itself and propose a novel training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). This scheme consists of two key components: the Visual Token Compressor, which reduces the information load by compressing visual tokens, and the Layer Dynamic Skipper, which mitigates the computational overhead by dynamically skipping unnecessary layers in the language model during both forward and backward passes. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.
zh

[AI-109] ConceptFlow: Hierarchical and Fine-grained Concept-Based Explanation for Convolutional Neural Networks

链接: https://arxiv.org/abs/2509.18147
作者: Xinyu Mu,Hui Dou,Furao Shen,Jian Zhao
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-110] Early Prediction of Multi-Label Care Escalation Triggers in the Intensive Care Unit Using Electronic Health Records

链接: https://arxiv.org/abs/2509.18145
作者: Syed Ahmad Chan Bukhari,Amritpal Singh,Shifath Hossain,Iram Wajahat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 7 pages, 3 Figure

点击查看摘要

[AI-111] AdaSTI: Conditional Diffusion Models with Adaptive Dependency Modeling for Spatio-Temporal Imputation

链接: https://arxiv.org/abs/2509.18144
作者: Yubo Yang,Yichen Zhu,Bo Jiang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 9 pages

点击查看摘要

[AI-112] Weight Mapping Properties of a Dual Tree Single Clock Adiabatic Capacitive Neuron

【速读】:该论文旨在解决在全定制模拟集成电路(Analog IC)设计中,如何高效地将软件训练得到的人工神经元(Artificial Neuron, AN)权重映射到自适应电容神经元(Adiabatic Capacitive Neuron, ACN)的物理电容值这一关键问题。现有研究尚未充分探索该映射过程中的隐藏复杂性、挑战及其对集成电路(IC)设计精度和实现的影响。论文提出了一种最优的AN到ACN映射方法,其核心在于通过优化权重量化策略,在保证功能等效性的前提下显著减小芯片面积并提升分类准确率,从而支持实际部署。作者利用TensorFlow和Larq框架训练三种不同ANN网络,并将其权重映射至DTSC ACN电容域,实现了100%的功能等效性验证;同时引入与IC实际考量(如版图空间占用和比较器决策效能)相关的新型量化指标,系统评估了权重量化对ACN性能的影响。

链接: https://arxiv.org/abs/2509.18143
作者: Mike Smart,Sachin Maheshwari,Himadri Singh Raghav,Alexander Serb
机构: 未知
类目: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注: 11 pages, 10 figures, 6 tables. This work has been submitted to the IEEE for possible publication

点击查看摘要

Abstract:Dual Tree Single Clock (DTSC) Adiabatic Capacitive Neuron (ACN) circuits offer the potential for highly energy-efficient Artificial Neural Network (ANN) computation in full custom analog IC designs. The efficient mapping of Artificial Neuron (AN) abstract weights, extracted from the software-trained ANNs, onto physical ACN capacitance values has, however, yet to be fully researched. In this paper, we explore the unexpected hidden complexities, challenges and properties of the mapping, as well as, the ramifications for IC designers in terms accuracy, design and implementation. We propose an optimal, AN to ACN methodology, that promotes smaller chip sizes and improved overall classification accuracy, necessary for successful practical deployment. Using TensorFlow and Larq software frameworks, we train three different ANN networks and map their weights into the energy-efficient DTSC ACN capacitance value domain to demonstrate 100% functional equivalency. Finally, we delve into the impact of weight quantization on ACN performance using novel metrics related to practical IC considerations, such as IC floor space and comparator decision-making efficacy.
zh

[AI-113] A Machine Learning Framework for Pathway-Driven Therapeutic Target Discovery in Metabolic Disorders

链接: https://arxiv.org/abs/2509.18140
作者: Iram Wajahat,Amritpal Singh,Fazel Keshtkar,Syed Ahmad Chan Bukhari
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 6 figures

点击查看摘要

[AI-114] LoRALib: A Standardized Benchmark for Evaluating LoRA-MoE Methods

链接: https://arxiv.org/abs/2509.18137
作者: Shaoheng Wang,Yao Lu,Yuqi Li,Yaxin Gao,Jiaqi Nie,Shanqing Yu,Yingli Tian,Qi Xuan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-115] From Parameters to Performance: A Data-Driven Study on LLM Structure and Development EMNLP2025

【速读】:该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在规模和能力上迅速增长,但关于结构配置(structural configurations)如何系统性影响模型性能的研究仍十分匮乏,缺乏数据驱动的实证分析。为填补这一空白,论文提出了一种大规模数据集,涵盖多种开源LLM结构及其在多个基准测试中的表现,并基于此开展系统性的数据挖掘分析,以验证并量化结构配置与性能之间的关系。其解决方案的关键在于构建一个全面、多样化的LLM结构-性能数据集,并结合机制可解释性技术对分析结果进行交叉验证,从而为未来模型的定向优化与应用提供数据驱动的指导。

链接: https://arxiv.org/abs/2509.18136
作者: Suqing Wang,Zuchao Li,Luohe Shi,Bo Du,Hai Zhao,Yun Li,Qianren Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by EMNLP 2025

点击查看摘要

Abstract:Large language models (LLMs) have achieved remarkable success across various domains, driving significant technological advancements and innovations. Despite the rapid growth in model scale and capability, systematic, data-driven research on how structural configurations affect performance remains scarce. To address this gap, we present a large-scale dataset encompassing diverse open-source LLM structures and their performance across multiple benchmarks. Leveraging this dataset, we conduct a systematic, data mining-driven analysis to validate and quantify the relationship between structural configurations and performance. Our study begins with a review of the historical development of LLMs and an exploration of potential future trends. We then analyze how various structural choices impact performance across benchmarks and further corroborate our findings using mechanistic interpretability techniques. By providing data-driven insights into LLM optimization, our work aims to guide the targeted development and application of future models. We will release our dataset at this https URL
zh

[AI-116] SDGF: Fusing Static and Multi-Scale Dynamic Correlations for Multivariate Time Series Forecasting

链接: https://arxiv.org/abs/2509.18135
作者: Shaoxun Wang,Xingjun Zhang,Qianyang Li,Jiawei Cao,Zhendong Tan
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-117] Self-Evolving LLM s via Continual Instruction Tuning

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工业场景中持续学习时面临的灾难性遗忘问题,即在不断适应新任务的过程中导致对旧任务性能下降的问题。其核心解决方案是提出MoE-CL框架,关键在于采用双专家设计:一是为每个任务配置独立的LoRA(Low-Rank Adaptation)专家以保持任务特异性知识,避免参数干扰;二是引入一个共享LoRA专家实现跨任务知识迁移。为防止共享路径传递无关噪声,进一步集成基于GAN的任务感知判别器(task-aware discriminator),通过对抗学习机制引导共享专家仅保留与当前任务对齐的信息,从而在保留任务特定细节的同时获得泛化表征,实现稳定的知识留存与高效迁移。

链接: https://arxiv.org/abs/2509.18133
作者: Le Huang,Jiazheng Kang,Cheng Hou,Zhe Zhao,Zhenxiang Yan,Chuan Shi,Ting Bai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In real-world industrial settings, large language models (LLMs) must learn continually to keep pace with diverse and evolving tasks, requiring self-evolution to refine knowledge under dynamic data distributions. However, existing continual learning (CL) approaches, such as replay and parameter isolation, often suffer from catastrophic forgetting: training on new tasks degrades performance on earlier ones by overfitting to the new distribution and weakening this http URL propose MoE-CL, a parameter-efficient adversarial mixture-of-experts framework for industrial-scale, self-evolving continual instruction tuning of LLMs. MoE-CL uses a dual-expert design: (1) a dedicated LoRA expert per task to preserve task-specific knowledge via parameter independence, mitigating forgetting; and (2) a shared LoRA expert to enable cross-task transfer. To prevent transferring task-irrelevant noise through the shared pathway, we integrate a task-aware discriminator within a GAN. The discriminator encourages the shared expert to pass only task-aligned information during sequential training. Through adversarial learning, the shared expert acquires generalized representations that mimic the discriminator, while dedicated experts retain task-specific details, balancing knowledge retention and cross-task generalization and thereby supporting this http URL experiments on the public MTL5 benchmark and an industrial Tencent3 benchmark validate the effectiveness of MoE-CL for continual instruction tuning. In real-world A/B testing for content compliance review on the Tencent Video platform, MoE-CL reduced manual review costs by 15.3%. These results demonstrate that MoE-CL is practical for large-scale industrial deployment where continual adaptation and stable transfer are critical.
zh

[AI-118] Position Paper: Integrating Explainability and Uncertainty Estimation in Medical AI IJCNN2025

链接: https://arxiv.org/abs/2509.18132
作者: Xiuyi Fan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at the International Joint Conference on Neural Networks, IJCNN 2025

点击查看摘要

[AI-119] wo ways to knowledge?

链接: https://arxiv.org/abs/2509.18131
作者: Jean-Michel Tucny,Abhisek Ganguly,Santosh Ansumali,Sauro Succi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-120] Research on Metro Transportation Flow Prediction Based on the STL-GRU Combined Model

链接: https://arxiv.org/abs/2509.18130
作者: Zijie Zhou,Huichen Ma
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-121] Anomaly Detection in Electric Vehicle Charging Stations Using Federated Learning

【速读】:该论文旨在解决电动汽车充电站(EVCS)在物联网(IoT)环境下面临的网络安全威胁问题,特别是传统集中式入侵检测系统(IDS)因涉及敏感数据上传而引发的隐私担忧。为应对这一挑战,研究提出采用联邦学习(Federated Learning, FL)框架来实现分布式、隐私保护的异常检测。其解决方案的关键在于通过引入FedAvgM优化算法,在系统异构性和非独立同分布(non-IID)数据条件下显著提升模型收敛性和检测准确性,从而在保障隐私的同时维持高鲁棒性的安全防护能力。

链接: https://arxiv.org/abs/2509.18126
作者: Bishal K C,Amr Hilal,Pawan Thapa
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Federated Learning (FL) is a decentralized training framework widely used in IoT ecosystems that preserves privacy by keeping raw data local, making it ideal for IoT-enabled cyber-physical systems with sensing and communication like Smart Grids (SGs), Connected and Automated Vehicles (CAV), and Electric Vehicle Charging Stations (EVCS). With the rapid expansion of electric vehicle infrastructure, securing these IoT-based charging stations against cyber threats has become critical. Centralized Intrusion Detection Systems (IDS) raise privacy concerns due to sensitive network and user data, making FL a promising alternative. However, current FL-based IDS evaluations overlook practical challenges such as system heterogeneity and non-IID data. To address these challenges, we conducted experiments to evaluate the performance of federated learning for anomaly detection in EV charging stations under system and data heterogeneity. We used FedAvg and FedAvgM, widely studied optimization approaches, to analyze their effectiveness in anomaly detection. Under IID settings, FedAvg achieves superior performance to centralized models using the same neural network. However, performance degrades with non-IID data and system heterogeneity. FedAvgM consistently outperforms FedAvg in heterogeneous settings, showing better convergence and higher anomaly detection accuracy. Our results demonstrate that FL can handle heterogeneity in IoT-based EVCS without significant performance loss, with FedAvgM as a promising solution for robust, privacy-preserving EVCS security.
zh

[AI-122] NurseSchedRL: Attention-Guided Reinforcement Learning for Nurse-Patient Assignment

【速读】:该论文旨在解决医疗系统中护士资源分配效率低下的问题,尤其是在面对护士技能异质性、患者病情严重程度差异、工作人员疲劳累积以及护理连续性要求等多重动态约束时,传统优化和启发式调度方法难以有效应对。其解决方案的关键在于提出 NurseSchedRL,一个基于强化学习(Reinforcement Learning, RL)的护士-患者分配框架,通过结构化状态编码、约束动作掩码(constrained action masking)以及注意力机制对技能、疲劳和地理上下文进行表征,结合近端策略优化(Proximal Policy Optimization, PPO)算法,在确保可行性的前提下动态适应患者入院和护士可用性的变化,从而实现更高效、更契合临床需求的排班决策。

链接: https://arxiv.org/abs/2509.18125
作者: Harsha Koduri
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Healthcare systems face increasing pressure to allocate limited nursing resources efficiently while accounting for skill heterogeneity, patient acuity, staff fatigue, and continuity of care. Traditional optimization and heuristic scheduling methods struggle to capture these dynamic, multi-constraint environments. I propose NurseSchedRL, a reinforcement learning framework for nurse-patient assignment that integrates structured state encoding, constrained action masking, and attention-based representations of skills, fatigue, and geographical context. NurseSchedRL uses Proximal Policy Optimization (PPO) with feasibility masks to ensure assignments respect real-world constraints, while dynamically adapting to patient arrivals and varying nurse availability. In simulation with realistic nurse and patient data, NurseSchedRL achieves improved scheduling efficiency, better alignment of skills to patient needs, and reduced fatigue compared to baseline heuristic and unconstrained RL approaches. These results highlight the potential of reinforcement learning for decision support in complex, high-stakes healthcare workforce management.
zh

[AI-123] SPADE: A Large Language Model Framework for Soil Moisture Pattern Recognition and Anomaly Detection in Precision Agriculture

链接: https://arxiv.org/abs/2509.18123
作者: Yeonju Lee,Rui Qi Chen,Joseph Oboamah,Po Nien Su,Wei-zhen Liang,Yeyin Shi,Lu Gan,Yongsheng Chen,Xin Qiao,Jing Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

[AI-124] A Coopetitive-Compatible Data Generation Framework for Cross-silo Federated Learning

【速读】:该论文旨在解决跨孤岛联邦学习(Cross-silo Federated Learning, CFL)中因组织间经济竞争导致的协作意愿低下的问题,即在数据统计异构性(statistical heterogeneity)与组织间市场竞争并存的情况下,如何设计机制以激励组织参与联合训练并提升系统整体社会福利。解决方案的关键在于提出一种协同竞争兼容的数据生成框架 CoCoGen,其核心是结合生成式 AI(Generative AI, GenAI)与势博弈(potential game)理论,将每轮训练建模为加权势博弈,并通过 GenAI 生成优化策略以最大化社会福利,从而在保证数据隐私的前提下平衡组织间的利益冲突与性能提升。

链接: https://arxiv.org/abs/2509.18120
作者: Thanh Linh Nguyen,Quoc-Viet Pham
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Distributed, Parallel, and Cluster Computing (cs.DC); Computer Science and Game Theory (cs.GT)
备注: Accepted in IEEE GLOBECOM 2025

点击查看摘要

Abstract:Cross-silo federated learning (CFL) enables organizations (e.g., hospitals or banks) to collaboratively train artificial intelligence (AI) models while preserving data privacy by keeping data local. While prior work has primarily addressed statistical heterogeneity across organizations, a critical challenge arises from economic competition, where organizations may act as market rivals, making them hesitant to participate in joint training due to potential utility loss (i.e., reduced net benefit). Furthermore, the combined effects of statistical heterogeneity and inter-organizational competition on organizational behavior and system-wide social welfare remain underexplored. In this paper, we propose CoCoGen, a coopetitive-compatible data generation framework, leveraging generative AI (GenAI) and potential game theory to model, analyze, and optimize collaborative learning under heterogeneous and competitive settings. Specifically, CoCoGen characterizes competition and statistical heterogeneity through learning performance and utility-based formulations and models each training round as a weighted potential game. We then derive GenAI-based data generation strategies that maximize social welfare. Experimental results on the Fashion-MNIST dataset reveal how varying heterogeneity and competition levels affect organizational behavior and demonstrate that CoCoGen consistently outperforms baseline methods.
zh

[AI-125] MobileRL: Online Agent ic Reinforcement Learning for Mobile GUI Agents

【速读】:该论文旨在解决移动图形用户界面(GUI)智能体在强化学习(RL)训练中面临的两大挑战:一是任务难度呈现重尾分布,导致模型难以稳定训练;二是大规模环境采样效率低下,限制了样本利用率和性能提升。其解决方案的核心是提出一种在线代理强化学习框架MOBILERL,其中关键创新在于Difficulty-Adaptive GRPO(ADAGRPO)算法,通过设计难度自适应正向回放(difficulty-adaptive positive replay)和失败课程过滤(failure curriculum filtering)机制来动态调整策略以适配不同难度任务,同时引入最短路径奖励调整策略(shortest path reward adjustment)优化多轮任务中的奖励结构,从而显著提升训练稳定性、样本效率及跨应用的泛化性能。

链接: https://arxiv.org/abs/2509.18119
作者: Yifan Xu,Xiao Liu,Xinghan Liu,Jiaqi Fu,Hanchen Zhang,Bohao Jing,Shudan Zhang,Yuting Wang,Wenyi Zhao,Yuxiao Dong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present an online agentic reinforcement learning framework MOBILERL to enhance GUI agents in mobile environments. Its core component is the Difficulty-Adaptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce the shortest path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks. Those strategies jointly stabilize RL training, improve sample efficiency, and generate strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resultant MOBILERL-9B model achieves state-of-the-art results in terms of success rates on both AndroidWorld (75.8%) and AndroidLab (46.8%). The MOBILERL framework is adopted in the AutoGLM products, and also open-sourced at this https URL.
zh

[AI-126] Amortized Latent Steering: Low-Cost Alternative to Test-Time Optimization

链接: https://arxiv.org/abs/2509.18116
作者: Nathan Egbuna,Saatvik Gaur,Sunishchal Dev,Ashwinee Panda,Maheep Chaudhary
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-127] Solve it with EASE

链接: https://arxiv.org/abs/2509.18108
作者: Adam Viktorin,Tomas Kadavy,Jozef Kovac,Michal Pluhacek,Roman Senkerik
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: EASE framework landing paper

点击查看摘要

[AI-128] BULL-ODE: Bullwhip Learning with Neural ODEs and Universal Differential Equations under Stochastic Demand

链接: https://arxiv.org/abs/2509.18105
作者: Nachiket N. Naik,Prathamesh Dinesh Joshi,Raj Abhijit Dandekar,Rajat Dandekar,Sreedath Panat
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-129] Data Valuation and Selection in a Federated Model Marketplace

【速读】:该论文旨在解决联邦学习(Federated Learning, FL)场景下异构数据源中的有效数据估值与选择问题,这是构建可信数据市场(data marketplace)的关键挑战。其核心解决方案在于提出一个基于Wasserstein距离的估计器框架,该估计器能够预测模型在未见过的数据组合上的性能,并揭示数据异质性与FL聚合算法之间的兼容性;同时,为保障隐私,设计了一种无需访问原始数据即可分布式近似Wasserstein距离的方法;此外,利用神经尺度定律(neural scaling law)实现无需全量训练即可可靠外推模型性能,从而高效筛选高价值数据组合。

链接: https://arxiv.org/abs/2509.18104
作者: Wenqian Li,Youjia Yang,Ruoxi Jia,Yan Pang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:In the era of Artificial Intelligence (AI), marketplaces have become essential platforms for facilitating the exchange of data products to foster data sharing. Model transactions provide economic solutions in data marketplaces that enhance data reusability and ensure the traceability of data ownership. To establish trustworthy data marketplaces, Federated Learning (FL) has emerged as a promising paradigm to enable collaborative learning across siloed datasets while safeguarding data privacy. However, effective data valuation and selection from heterogeneous sources in the FL setup remain key challenges. This paper introduces a comprehensive framework centered on a Wasserstein-based estimator tailored for FL. The estimator not only predicts model performance across unseen data combinations but also reveals the compatibility between data heterogeneity and FL aggregation algorithms. To ensure privacy, we propose a distributed method to approximate Wasserstein distance without requiring access to raw data. Furthermore, we demonstrate that model performance can be reliably extrapolated under the neural scaling law, enabling effective data selection without full-scale training. Extensive experiments across diverse scenarios, such as label skew, mislabeled, and unlabeled sources, show that our approach consistently identifies high-performing data combinations, paving the way for more reliable FL-based model marketplaces.
zh

[AI-130] A Cost-Benefit Analysis of On-Premise Large Language Model Deployment: Breaking Even with Commercial LLM Services

【速读】:该论文旨在解决组织在选择使用商业大语言模型(Large Language Models, LLMs)服务还是本地部署开源模型时面临的经济决策难题。其解决方案的关键在于构建一个成本效益分析框架,综合考虑硬件投入、运维开销及最新开源模型(如Qwen、Llama、Mistral等)的性能基准,与主流云服务商订阅费用进行对比,从而估算出在不同使用量和性能需求下本地部署的盈亏平衡点,为组织制定LLM战略提供可操作的决策依据。

链接: https://arxiv.org/abs/2509.18101
作者: Guanzhong Pan,Haibo Wang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are becoming increasingly widespread. Organizations that want to use AI for productivity now face an important decision. They can subscribe to commercial LLM services or deploy models on their own infrastructure. Cloud services from providers such as OpenAI, Anthropic, and Google are attractive because they provide easy access to state-of-the-art models and are easy to scale. However, concerns about data privacy, the difficulty of switching service providers, and long-term operating costs have driven interest in local deployment of open-source models. This paper presents a cost-benefit analysis framework to help organizations determine when on-premise LLM deployment becomes economically viable compared to commercial subscription services. We consider the hardware requirements, operational expenses, and performance benchmarks of the latest open-source models, including Qwen, Llama, Mistral, and etc. Then we compare the total cost of deploying these models locally with the major cloud providers subscription fee. Our findings provide an estimated breakeven point based on usage levels and performance needs. These results give organizations a practical framework for planning their LLM strategies.
zh

[AI-131] Audio-Based Pedestrian Detection in the Presence of Vehicular Noise

链接: https://arxiv.org/abs/2509.19295
作者: Yonghyun Kim,Chaeyeon Han,Akash Sarode,Noah Posner,Subhrajit Guhathakurta,Alexander Lerch
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD)
备注: Accepted to the 10th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), 2025

点击查看摘要

[AI-132] raining Flow Matching Models with Reliable Labels via Self-Purification

链接: https://arxiv.org/abs/2509.19091
作者: Hyeongju Kim,Yechan Yu,June Young Yi,Juheon Lee
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 5 pages, 3 figures, preprint

点击查看摘要

[AI-133] Complexity of Activity Patterns in a Bio-Inspired Hopfield-Type Network in Different Topologies

链接: https://arxiv.org/abs/2509.18758
作者: Marco Cafiso,Paolo Paradisi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO); Biological Physics (physics.bio-ph)
备注:

点击查看摘要

[AI-134] BRAID: Input-Driven Nonlinear Dynamical Modeling of Neural-Behavioral Data RAID ICLR

链接: https://arxiv.org/abs/2509.18627
作者: Parsa Vahidi,Omid G. Sani,Maryam M. Shanechi
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Published at the International Conference on Learning Representations (ICLR) 2025. Code is available at GitHub this https URL

点击查看摘要

[AI-135] FlexSED: Towards Open-Vocabulary Sound Event Detection

链接: https://arxiv.org/abs/2509.18606
作者: Jiarui Hai,Helin Wang,Weizhe Guo,Mounya Elhilali
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

[AI-136] SynSonic: Augmenting Sound Event Detection through Text-to-Audio Diffusion ControlNet and Effective Sample Filtering

【速读】:该论文旨在解决声事件检测(Sound Event Detection, SED)中因时序标注数据稀缺而导致模型性能受限的问题。现有增强方法如SpecAugment和Mix-up受制于已有样本的多样性,而生成式模型虽具潜力,却因缺乏精确的时间标注及不可靠过滤引入噪声而难以直接应用。解决方案的关键在于提出SynSonic,一种专为SED设计的数据增强方法:其利用文本到音频扩散模型,并通过能量包络ControlNet引导生成具有时间一致性的声事件;同时采用双分类器联合评分过滤策略确保生成样本质量,从而有效提升模型在多音符声事件检测中的定位精度与类别区分能力。

链接: https://arxiv.org/abs/2509.18603
作者: Jiarui Hai,Mounya Elhilali
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注:

点击查看摘要

Abstract:Data synthesis and augmentation are essential for Sound Event Detection (SED) due to the scarcity of temporally labeled data. While augmentation methods like SpecAugment and Mix-up can enhance model performance, they remain constrained by the diversity of existing samples. Recent generative models offer new opportunities, yet their direct application to SED is challenging due to the lack of precise temporal annotations and the risk of introducing noise through unreliable filtering. To address these challenges and enable generative-based augmentation for SED, we propose SynSonic, a data augmentation method tailored for this task. SynSonic leverages text-to-audio diffusion models guided by an energy-envelope ControlNet to generate temporally coherent sound events. A joint score filtering strategy with dual classifiers ensures sample quality, and we explore its practical integration into training pipelines. Experimental results show that SynSonic improves Polyphonic Sound Detection Scores (PSDS1 and PSDS2), enhancing both temporal localization and sound class discrimination.
zh

[AI-137] SoundCompass: Navigating Target Sound Extraction With Effective Directional Clue Integration In Complex Acoustic Scenes ICASSP2026

【速读】:该论文旨在解决目标声音提取(Target Sound Extraction, TSE)中方向信息利用不足的问题,即传统方法依赖手工设计特征或离散编码的到达方向(Direction of Arrival, DoA)特征,导致精细空间信息丢失且适应性受限。其解决方案的关键在于提出SoundCompass框架,核心创新为引入谱对偶交互(Spectral Pairwise INteraction, SPIN)模块,用于在复谱图域中捕捉多通道信号间的跨通道空间相关性,从而保留完整的空间信息;同时将DoA以球谐函数(Spherical Harmonics, SH)编码形式与SPIN输出融合,并通过重叠频带分割策略增强模型鲁棒性,进一步结合迭代精炼机制(Chain-of-Inference, CoI)实现多阶段方向与声事件激活的递归融合,显著提升TSE性能。

链接: https://arxiv.org/abs/2509.18561
作者: Dayun Choi,Jung-Woo Choi
机构: 未知
类目: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: 5 pages, 4 figures, submitted to ICASSP 2026

点击查看摘要

Abstract:Recent advances in target sound extraction (TSE) utilize directional clues derived from direction of arrival (DoA), which represent an inherent spatial property of sound available in any acoustic scene. However, previous DoA-based methods rely on hand-crafted features or discrete encodings, which lose fine-grained spatial information and limit adaptability. We propose SoundCompass, an effective directional clue integration framework centered on a Spectral Pairwise INteraction (SPIN) module that captures cross-channel spatial correlations in the complex spectrogram domain to preserve full spatial information in multichannel signals. The input feature expressed in terms of spatial correlations is fused with a DoA clue represented as spherical harmonics (SH) encoding. The fusion is carried out across overlapping frequency subbands, inheriting the benefits reported in the previous band-split architectures. We also incorporate the iterative refinement strategy, chain-of-inference (CoI), in the TSE framework, which recursively fuses DoA with sound event activation estimated from the previous inference stage. Experiments demonstrate that SoundCompass, combining SPIN, SH embedding, and CoI, robustly extracts target sources across diverse signal classes and spatial configurations.
zh

[AI-138] Automatic Classification of Magnetic Chirality of Solar Filaments from H-Alpha Observations

链接: https://arxiv.org/abs/2509.18214
作者: Alexis Chalmers,Azim Ahmadzadeh
机构: 未知
类目: olar and Stellar Astrophysics (astro-ph.SR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

[AI-139] Augmenting Limited and Biased RCTs through Pseudo-Sample Matching-Based Observational Data Fusion Method CIKM2025

链接: https://arxiv.org/abs/2509.18148
作者: Kairong Han,Weidong Huang,Taiyang Zhou,Peng Zhen,Kun Kuang
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注: Accepted by CIKM 2025

点击查看摘要

机器学习

[LG-0] Residual Off-Policy RL for Finetuning Behavior Cloning Policies

链接: https://arxiv.org/abs/2509.19301
作者: Lars Ankile,Zhenyu Jiang,Rocky Duan,Guanya Shi,Pieter Abbeel,Anusha Nagabandi
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in behavior cloning (BC) have enabled impressive visuomotor control policies. However, these approaches are limited by the quality of human demonstrations, the manual effort required for data collection, and the diminishing returns from increasing offline data. In comparison, reinforcement learning (RL) trains an agent through autonomous interaction with the environment and has shown remarkable success in various domains. Still, training RL policies directly on real-world robots remains challenging due to sample inefficiency, safety concerns, and the difficulty of learning from sparse rewards for long-horizon tasks, especially for high-degree-of-freedom (DoF) systems. We present a recipe that combines the benefits of BC and RL through a residual learning framework. Our approach leverages BC policies as black-box bases and learns lightweight per-step residual corrections via sample-efficient off-policy RL. We demonstrate that our method requires only sparse binary reward signals and can effectively improve manipulation policies on high-degree-of-freedom (DoF) systems in both simulation and the real world. In particular, we demonstrate, to the best of our knowledge, the first successful real-world RL training on a humanoid robot with dexterous hands. Our results demonstrate state-of-the-art performance in various vision-based tasks, pointing towards a practical pathway for deploying RL in the real world. Project website: this https URL

[LG-1] What Characterizes Effective Reasoning ? Revisiting Length Review and Structure of CoT

链接: https://arxiv.org/abs/2509.19284
作者: Yunzhen Feng,Julia Kempe,Cheng Zhang,Parag Jain,Anthony Hartshorn
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large reasoning models (LRMs) spend substantial test-time compute on long chain-of-thought (CoT) traces, but what characterizes an effective CoT remains unclear. While prior work reports gains from lengthening CoTs and increasing review (revisiting earlier steps) via appended wait tokens, recent studies suggest that shorter thinking can outperform longer traces. We therefore conduct a systematic evaluation across ten LRMs on math and scientific reasoning. Contrary to the “longer-is-better” narrative, we find that both naive CoT lengthening and increased review are associated with lower accuracy. As CoT unfolds step by step, token-level metrics can conflate verbosity with process quality. We introduce a graph view of CoT to extract structure and identify a single statistic-the Failed-Step Fraction (FSF), the fraction of steps in abandoned branches-that consistently outpredicts length and review ratio for correctness across models. To probe causality, we design two interventions. First, we rank candidate CoTs by each metric at test time, where FSF yields the largest pass@1 gains; second, we edit CoTs to remove failed branches, which significantly improves accuracy, indicating that failed branches bias subsequent reasoning. Taken together, these results characterize effective CoTs as those that fail less and support structure-aware test-time scaling over indiscriminately generating long CoT. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.19284 [cs.LG] (or arXiv:2509.19284v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.19284 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-2] Linear Regression under Missing or Corrupted Coordinates

链接: https://arxiv.org/abs/2509.19242
作者: Ilias Diakonikolas,Jelena Diakonikolas,Daniel M. Kane,Jasper C.H. Lee,Thanasis Pittas
类目: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:We study multivariate linear regression under Gaussian covariates in two settings, where data may be erased or corrupted by an adversary under a coordinate-wise budget. In the incomplete data setting, an adversary may inspect the dataset and delete entries in up to an \eta -fraction of samples per coordinate; a strong form of the Missing Not At Random model. In the corrupted data setting, the adversary instead replaces values arbitrarily, and the corruption locations are unknown to the learner. Despite substantial work on missing data, linear regression under such adversarial missingness remains poorly understood, even information-theoretically. Unlike the clean setting, where estimation error vanishes with more samples, here the optimal error remains a positive function of the problem parameters. Our main contribution is to characterize this error up to constant factors across essentially the entire parameter range. Specifically, we establish novel information-theoretic lower bounds on the achievable error that match the error of (computationally efficient) algorithms. A key implication is that, perhaps surprisingly, the optimal error in the missing data setting matches that in the corruption setting-so knowing the corruption locations offers no general advantage.

[LG-3] Stability and Generalization of Adversarial Diffusion Training

链接: https://arxiv.org/abs/2509.19234
作者: Hesam Hosseini,Ying Cao,Ali H. Sayed
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:

点击查看摘要

Abstract:Algorithmic stability is an established tool for analyzing generalization. While adversarial training enhances model robustness, it often suffers from robust overfitting and an enlarged generalization gap. Although recent work has established the convergence of adversarial training in decentralized networks, its generalization properties remain unexplored. This work presents a stability-based generalization analysis of adversarial training under the diffusion strategy for convex losses. We derive a bound showing that the generalization error grows with both the adversarial perturbation strength and the number of training steps, a finding consistent with single-agent case but novel for decentralized settings. Numerical experiments on logistic regression validate these theoretical predictions.

[LG-4] Study Design and Demystification of Physics Informed Neural Networks for Power Flow Simulation ECML KDD

链接: https://arxiv.org/abs/2509.19233
作者: Milad Leyli-abadi,Antoine Marot,Jérôme Picault
类目: Machine Learning (cs.LG)
*备注: Accepted at ECML PKDD ML4SPS 2025 workshop

点击查看摘要

Abstract:In the context of the energy transition, with increasing integration of renewable sources and cross-border electricity exchanges, power grids are encountering greater uncertainty and operational risk. Maintaining grid stability under varying conditions is a complex task, and power flow simulators are commonly used to support operators by evaluating potential actions before implementation. However, traditional physical solvers, while accurate, are often too slow for near real-time use. Machine learning models have emerged as fast surrogates, and to improve their adherence to physical laws (e.g., Kirchhoff’s laws), they are often trained with embedded constraints which are also known as physics-informed or hybrid models. This paper presents an ablation study to demystify hybridization strategies, ranging from incorporating physical constraints as regularization terms or unsupervised losses, and exploring model architectures from simple multilayer perceptrons to advanced graph-based networks enabling the direct optimization of physics equations. Using our custom benchmarking pipeline for hybrid models called LIPS, we evaluate these models across four dimensions: accuracy, physical compliance, industrial readiness, and out-of-distribution generalization. The results highlight how integrating physical knowledge impacts performance across these criteria. All the implementations are reproducible and provided in the corresponding Github page.

[LG-5] Video Killed the Energy Budget: Characterizing the Latency and Power Regimes of Open Text-to-Video Models NEURIPS2025

链接: https://arxiv.org/abs/2509.19222
作者: Julien Delavande,Regis Pierrard,Sasha Luccioni
类目: Machine Learning (cs.LG)
*备注: 10 pages. Accepted as an oral presentation at the NeurIPS 2025 NextVid Workshop (San Diego, December 6, 2025)

点击查看摘要

Abstract:Recent advances in text-to-video (T2V) generation have enabled the creation of high-fidelity, temporally coherent clips from natural language prompts. Yet these systems come with significant computational costs, and their energy demands remain poorly understood. In this paper, we present a systematic study of the latency and energy consumption of state-of-the-art open-source T2V models. We first develop a compute-bound analytical model that predicts scaling laws with respect to spatial resolution, temporal length, and denoising steps. We then validate these predictions through fine-grained experiments on WAN2.1-T2V, showing quadratic growth with spatial and temporal dimensions, and linear scaling with the number of denoising steps. Finally, we extend our analysis to six diverse T2V models, comparing their runtime and energy profiles under default settings. Our results provide both a benchmark reference and practical insights for designing and deploying more sustainable generative video systems.

[LG-6] PPG-Distill: Efficient Photoplethysmography Signals Analysis via Foundation Model Distillation ALT NEURIPS2025

链接: https://arxiv.org/abs/2509.19215
作者: Juntong Ni,Saurabh Kataria,Shengpu Tang,Carl Yang,Xiao Hu,Wei Jin
类目: Machine Learning (cs.LG)
*备注: Accepted at NeurIPS 2025 Workshop on Learning from Time Series for Health

点击查看摘要

Abstract:Photoplethysmography (PPG) is widely used in wearable health monitoring, yet large PPG foundation models remain difficult to deploy on resource-limited devices. We present PPG-Distill, a knowledge distillation framework that transfers both global and local knowledge through prediction-, feature-, and patch-level distillation. PPG-Distill incorporates morphology distillation to preserve local waveform patterns and rhythm distillation to capture inter-patch temporal structures. On heart rate estimation and atrial fibrillation detection, PPG-Distill improves student performance by up to 21.8% while achieving 7X faster inference and reducing memory usage by 19X, enabling efficient PPG analysis on wearables

[LG-7] AlloyInter: Visualising Alloy Mixture Interpolations in t-SNE Representations

链接: https://arxiv.org/abs/2509.19202
作者: Benedikt Kantz,Peter Waldert,Stefan Lengauer,Tobias Schreck
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 6 pages, 5 figures, Submitted to the IEEE SciVis 2025 contest

点击查看摘要

Abstract:This entry description proposes AlloyInter, a novel system to enable joint exploration of input mixtures and output parameters space in the context of the SciVis Contest 2025. We propose an interpolation approach, guided by eXplainable Artificial Intelligence (XAI) based on a learned model ensemble that allows users to discover input mixture ratios by specifying output parameter goals that can be iteratively adjusted and improved towards a goal. We strengthen the capabilities of our system by building upon prior research within the robustness of XAI, as well as combining well-established techniques like manifold learning with interpolation approaches.

[LG-8] A Validation Strategy for Deep Learning Models: Evaluating and Enhancing Robustness

链接: https://arxiv.org/abs/2509.19197
作者: Abdul-Rauf Nuhu,Parham Kebria,Vahid Hemmati,Benjamin Lartey,Mahmoud Nabil Mahmoud,Abdollah Homaifar,Edward Tunstel
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Data-driven models, especially deep learning classifiers often demonstrate great success on clean datasets. Yet, they remain vulnerable to common data distortions such as adversarial and common corruption perturbations. These perturbations can significantly degrade performance, thereby challenging the overall reliability of the models. Traditional robustness validation typically relies on perturbed test datasets to assess and improve model performance. In our framework, however, we propose a validation approach that extracts “weak robust” samples directly from the training dataset via local robustness analysis. These samples, being the most susceptible to perturbations, serve as an early and sensitive indicator of the model’s vulnerabilities. By evaluating models on these challenging training instances, we gain a more nuanced understanding of its robustness, which informs targeted performance enhancement. We demonstrate the effectiveness of our approach on models trained with CIFAR-10, CIFAR-100, and ImageNet, highlighting how robustness validation guided by weak robust samples can drive meaningful improvements in model reliability under adversarial and common corruption scenarios.

[LG-9] Unveiling the Role of Learning Rate Schedules via Functional Scaling Laws NEURIPS2025

链接: https://arxiv.org/abs/2509.19189
作者: Binghui Li,Fengling Chen,Zixun Huang,Lean Wang,Lei Wu
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 52 pages, accepted by NeurIPS 2025 as a spotlight paper

点击查看摘要

Abstract:Scaling laws have played a cornerstone role in guiding the training of large language models (LLMs). However, most existing works on scaling laws primarily focus on the final-step loss, overlooking the loss dynamics during the training process and, crucially, the impact of learning rate schedule (LRS). In this paper, we aim to bridge this gap by studying a teacher-student kernel regression setup trained via online stochastic gradient descent (SGD). Leveraging a novel intrinsic time viewpoint and stochastic differential equation (SDE) modeling of SGD, we introduce the Functional Scaling Law (FSL), which characterizes the evolution of population risk during the training process for general LRSs. Remarkably, the impact of the LRSs is captured through an explicit convolution-type functional term, making their effects fully tractable. To illustrate the utility of FSL, we analyze three widely used LRSs – constant, exponential decay, and warmup-stable-decay (WSD) – under both data-limited and compute-limited regimes. We provide theoretical justification for widely adopted empirical practices in LLMs pre-training such as (i) higher-capacity models are more data- and compute-efficient; (ii) learning rate decay can improve training efficiency; (iii) WSD-like schedules can outperform direct-decay schedules. Lastly, we explore the practical relevance of FSL as a surrogate model for fitting, predicting and optimizing the loss curves in LLM pre-training, with experiments conducted across model sizes ranging from 0.1B to 1B parameters. We hope our FSL framework can deepen the understanding of LLM pre-training dynamics and provide insights for improving large-scale model training.

[LG-10] Circuit Complexity From Physical Constraints: Scaling Limitations of Attention

链接: https://arxiv.org/abs/2509.19161
作者: Benjamin Prada,Ankur Mali
类目: Computational Complexity (cs.CC); Machine Learning (cs.LG)
*备注: 8 pages

点击查看摘要

Abstract:We argue that the standard circuit complexity measures derived from NC, AC, TC provide limited practical information and are now insufficient to further differentiate model expressivity. To address these new limitations, we define a novel notion of local uniformity and a family of circuit complexity classes RC(\cdot) that capture the fundamental constraints of scaling physical circuits. Through the lens of RC(\cdot) , we show that attention mechanisms with \omega(n^3/2) runtime cannot scale to accommodate the entropy of increasingly complex datasets. Our results simultaneously provide a methodology for defining meaningful bounds on transformer expressivity and naturally expose the restricted viability of attention.

[LG-11] Efficient Reinforcement Learning by Reducing Forgetting with Elephant Activation Functions

链接: https://arxiv.org/abs/2509.19159
作者: Qingfeng Lan,Gautham Vasan,A. Rupam Mahmood
类目: Machine Learning (cs.LG)
*备注: Code release: this https URL

点击查看摘要

Abstract:Catastrophic forgetting has remained a significant challenge for efficient reinforcement learning for decades (Ring 1994, Rivest and Precup 2003). While recent works have proposed effective methods to mitigate this issue, they mainly focus on the algorithmic side. Meanwhile, we do not fully understand what architectural properties of neural networks lead to catastrophic forgetting. This study aims to fill this gap by studying the role of activation functions in the training dynamics of neural networks and their impact on catastrophic forgetting in reinforcement learning setup. Our study reveals that, besides sparse representations, the gradient sparsity of activation functions also plays an important role in reducing forgetting. Based on this insight, we propose a new class of activation functions, elephant activation functions, that can generate both sparse outputs and sparse gradients. We show that by simply replacing classical activation functions with elephant activation functions in the neural networks of value-based algorithms, we can significantly improve the resilience of neural networks to catastrophic forgetting, thus making reinforcement learning more sample-efficient and memory-efficient.

[LG-12] PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generatio

链接: https://arxiv.org/abs/2509.19128
作者: Alexandre Piché,Ehsan Kamaloo,Rafael Pardinas,Dzmitry Bahdanau
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Reinforcement Learning (RL) is increasingly utilized to enhance the reasoning capabilities of Large Language Models (LLMs). However, effectively scaling these RL methods presents significant challenges, primarily due to the difficulty in maintaining high AI accelerator utilization without generating stale, off-policy data that harms common RL algorithms. This paper introduces PipelineRL, an approach designed to achieve a superior trade-off between hardware efficiency and data on-policyness for LLM training. PipelineRL employs concurrent asynchronous data generation and model training, distinguished by the novel in-flight weight updates. This mechanism allows the LLM generation engine to receive updated model weights with minimal interruption during the generation of token sequences, thereby maximizing both the accelerator utilization and the freshness of training data. Experiments conducted on long-form reasoning tasks using 128 H100 GPUs demonstrate that PipelineRL achieves approximately \sim 2x faster learning compared to conventional RL baselines while maintaining highly on-policy training data. A scalable and modular open-source implementation of PipelineRL is also released as a key contribution.

[LG-13] LLM -based Vulnerability Discovery through the Lens of Code Metrics

链接: https://arxiv.org/abs/2509.19117
作者: Felix Weissberg,Lukas Pirch,Erik Imgrund,Jonas Möller,Thorsten Eisenhofer,Konrad Rieck
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Software Engineering (cs.SE)
*备注:

点击查看摘要

Abstract:Large language models (LLMs) excel in many tasks of software engineering, yet progress in leveraging them for vulnerability discovery has stalled in recent years. To understand this phenomenon, we investigate LLMs through the lens of classic code metrics. Surprisingly, we find that a classifier trained solely on these metrics performs on par with state-of-the-art LLMs for vulnerability discovery. A root-cause analysis reveals a strong correlation and a causal effect between LLMs and code metrics: When the value of a metric is changed, LLM predictions tend to shift by a corresponding magnitude. This dependency suggests that LLMs operate at a similarly shallow level as code metrics, limiting their ability to grasp complex patterns and fully realize their potential in vulnerability discovery. Based on these findings, we derive recommendations on how research should more effectively address this challenge.

[LG-14] A Fast Initialization Method for Neural Network Controllers: A Case Study of Image-based Visual Servoing Control for the multicopter Interception

链接: https://arxiv.org/abs/2509.19110
作者: Chenxu Ke,Congling Tian,Kaichen Xu,Ye Li,Lingcong Bao
类目: ystems and Control (eess.SY); Machine Learning (cs.LG); Robotics (cs.RO)
*备注:

点击查看摘要

Abstract:Reinforcement learning-based controller design methods often require substantial data in the initial training phase. Moreover, the training process tends to exhibit strong randomness and slow convergence. It often requires considerable time or high computational resources. Another class of learning-based method incorporates Lyapunov stability theory to obtain a control policy with stability guarantees. However, these methods generally require an initially stable neural network control policy at the beginning of training. Evidently, a stable neural network controller can not only serve as an initial policy for reinforcement learning, allowing the training to focus on improving controller performance, but also act as an initial state for learning-based Lyapunov control methods. Although stable controllers can be designed using traditional control theory, designers still need to have a great deal of control design knowledge to address increasingly complicated control problems. The proposed neural network rapid initialization method in this paper achieves the initial training of the neural network control policy by constructing datasets that conform to the stability conditions based on the system model. Furthermore, using the image-based visual servoing control for multicopter interception as a case study, simulations and experiments were conducted to validate the effectiveness and practical performance of the proposed method. In the experiment, the trained control policy attains a final interception velocity of 15 m/s.

[LG-15] DRO-REBEL: Distributionally Robust Relative-Reward Regression for Fast and Efficient LLM Alignment

链接: https://arxiv.org/abs/2509.19104
作者: Sharan Sahu,Martin T. Wells
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: 70 pages, 9 figures, 3 tables

点击查看摘要

Abstract:Reinforcement learning with human feedback (RLHF) has become crucial for aligning Large Language Models (LLMs) with human intent. However, existing offline RLHF approaches suffer from overoptimization, where models overfit to reward misspecification and drift from preferred behaviors observed during training. We introduce DRO-REBEL, a unified family of robust REBEL updates with type- p Wasserstein, KL, and \chi^2 ambiguity sets. Using Fenchel duality, each update reduces to a simple relative-reward regression, preserving scalability and avoiding PPO-style clipping or auxiliary value networks. Under standard linear-reward and log-linear policy classes with a data-coverage condition, we establish O(n^-1/4) estimation bounds with tighter constants than prior DRO-DPO approaches, and recover the minimax-optimal O(n^-1/2) rate via a localized Rademacher complexity analysis. The same analysis closes the gap for Wasserstein-DPO and KL-DPO, showing both also attain optimal parametric rates. We derive practical SGD algorithms for all three divergences: gradient regularization (Wasserstein), importance weighting (KL), and a fast 1-D dual solve ( \chi^2 ). Experiments on Emotion Alignment, the large-scale ArmoRM multi-objective benchmark, and HH-Alignment demonstrate strong worst-case robustness across unseen preference mixtures, model sizes, and data scales, with \chi^2 -REBEL showing consistently strong empirical performance. A controlled radius–coverage study validates a no-free-lunch trade-off: radii shrinking faster than empirical divergence concentration rates achieve minimax-optimal parametric rates but forfeit coverage, while coverage-guaranteeing radii incur O(n^-1/4) rates.

[LG-16] Asymptotically Optimal Problem-Dependent Bandit Policies for Transfer Learning

链接: https://arxiv.org/abs/2509.19098
作者: Adrien Prevost,Timothee Mathieu,Odalric-Ambrym Maillard
类目: Machine Learning (cs.LG); Statistics Theory (math.ST)
*备注:

点击查看摘要

Abstract:We study the non-contextual multi-armed bandit problem in a transfer learning setting: before any pulls, the learner is given N’_k i.i.d. samples from each source distribution nu’_k, and the true target distributions nu_k lie within a known distance bound d_k(nu_k, nu’_k) = L_k. In this framework, we first derive a problem-dependent asymptotic lower bound on cumulative regret that extends the classical Lai-Robbins result to incorporate the transfer parameters (d_k, L_k, N’_k). We then propose KL-UCB-Transfer, a simple index policy that matches this new bound in the Gaussian case. Finally, we validate our approach via simulations, showing that KL-UCB-Transfer significantly outperforms the no-prior baseline when source and target distributions are sufficiently close.

[LG-17] Diffusion Bridge Variational Inference for Deep Gaussian Processes

链接: https://arxiv.org/abs/2509.19078
作者: Jian Xu,Qibin Zhao,John Paisley,Delu Zeng
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep Gaussian processes (DGPs) enable expressive hierarchical Bayesian modeling but pose substantial challenges for posterior inference, especially over inducing variables. Denoising diffusion variational inference (DDVI) addresses this by modeling the posterior as a time-reversed diffusion from a simple Gaussian prior. However, DDVI’s fixed unconditional starting distribution remains far from the complex true posterior, resulting in inefficient inference trajectories and slow convergence. In this work, we propose Diffusion Bridge Variational Inference (DBVI), a principled extension of DDVI that initiates the reverse diffusion from a learnable, data-dependent initial distribution. This initialization is parameterized via an amortized neural network and progressively adapted using gradients from the ELBO objective, reducing the posterior gap and improving sample efficiency. To enable scalable amortization, we design the network to operate on the inducing inputs, which serve as structured, low-dimensional summaries of the dataset and naturally align with the inducing variables’ shape. DBVI retains the mathematical elegance of DDVI, including Girsanov-based ELBOs and reverse-time SDEs,while reinterpreting the prior via a Doob-bridged diffusion process. We derive a tractable training objective under this formulation and implement DBVI for scalable inference in large-scale DGPs. Across regression, classification, and image reconstruction tasks, DBVI consistently outperforms DDVI and other variational baselines in predictive accuracy, convergence speed, and posterior quality.

[LG-18] Improving Credit Card Fraud Detection through Transformer-Enhanced GAN Oversampling

链接: https://arxiv.org/abs/2509.19032
作者: Kashaf Ul Emaan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Detection of credit card fraud is an acute issue of financial security because transaction datasets are highly lopsided, with fraud cases being only a drop in the ocean. Balancing datasets using the most popular methods of traditional oversampling such as the Synthetic Minority Oversampling Technique (SMOTE) generally create simplistic synthetic samples that are not readily applicable to complex fraud patterns. Recent industry advances that include Conditional Tabular Generative Adversarial Networks (CTGAN) and Tabular Variational Autoencoders (TVAE) have demonstrated increased efficiency in tabular synthesis, yet all these models still exhibit issues with high-dimensional dependence modelling. Now we will present our hybrid approach where we use a Generative Adversarial Network (GAN) with a Transformer encoder block to produce realistic fraudulent transactions samples. The GAN architecture allows training realistic generators adversarial, and the Transformer allows the model to learn rich feature interactions by self-attention. Such a hybrid strategy overcomes the limitations of SMOTE, CTGAN, and TVAE by producing a variety of high-quality synthetic minority classes samples. We test our algorithm on the publicly-available Credit Card Fraud Detection dataset and compare it to conventional and generative resampling strategies with a variety of classifiers, such as Logistic Regression (LR), Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM). Findings indicate that our Transformer-based GAN shows substantial gains in Recall, F1-score and Area Under the Receiver Operating Characteristic Curve (AUC), which indicates that it is effective in overcoming the severe class imbalance inherent in the task of fraud detection.

[LG-19] OmniBridge: Unified Multimodal Understanding Generation and Retrieval via Latent Space Alignment

链接: https://arxiv.org/abs/2509.19018
作者: Teng Xiao,Zuchao Li,Lefei Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Recent advances in multimodal large language models (LLMs) have led to significant progress in understanding, generation, and retrieval tasks. However, current solutions often treat these tasks in isolation or require training LLMs from scratch, resulting in high computational costs and limited generalization across modalities. In this work, we present OmniBridge, a unified and modular multimodal framework that supports vision-language understanding, generation, and retrieval within a unified architecture. OmniBridge adopts a language-centric design that reuses pretrained LLMs and introduces a lightweight bidirectional latent alignment module. To address the challenge of task interference, we propose a two-stage decoupled training strategy: supervised fine-tuning and latent space alignment for aligning LLM behavior with multimodal reasoning, and semantic-guided diffusion training to align cross-modal latent spaces via learnable query embeddings. Extensive experiments across a wide range of benchmarks demonstrate that OmniBridge achieves competitive or state-of-the-art performance in all three tasks. Moreover, our results highlight the effectiveness of latent space alignment for unifying multimodal modeling under a shared representation space. Code and models are released at this https URL.

[LG-20] heoretical Foundations of Representation Learning using Unlabeled Data: Statistics and Optimization

链接: https://arxiv.org/abs/2509.18997
作者: Pascal Esser,Maximilian Fleissner,Debarghya Ghoshdastidar
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Representation learning from unlabeled data has been extensively studied in statistics, data science and signal processing with a rich literature on techniques for dimension reduction, compression, multi-dimensional scaling among others. However, current deep learning models use new principles for unsupervised representation learning that cannot be easily analyzed using classical theories. For example, visual foundation models have found tremendous success using self-supervision or denoising/masked autoencoders, which effectively learn representations from massive amounts of unlabeled data. However, it remains difficult to characterize the representations learned by these models and to explain why they perform well for diverse prediction tasks or show emergent behavior. To answer these questions, one needs to combine mathematical tools from statistics and optimization. This paper provides an overview of recent theoretical advances in representation learning from unlabeled data and mentions our contributions in this direction.

[LG-21] CR-Net: Scaling Parameter-Efficient Training with Cross-Layer Low-Rank Structure

链接: https://arxiv.org/abs/2509.18993
作者: Boao Kong,Junzhu Liang,Yuxi Liu,Renjia Deng,Kun Yuan
类目: Machine Learning (cs.LG)
*备注: 32 pages

点击查看摘要

Abstract:Low-rank architectures have become increasingly important for efficient large language model (LLM) pre-training, providing substantial reductions in both parameter complexity and memory/computational demands. Despite these advantages, current low-rank methods face three critical shortcomings: (1) compromised model performance, (2) considerable computational overhead, and (3) limited activation memory savings. To address these limitations, we propose Cross-layer Low-Rank residual Network (CR-Net), an innovative parameter-efficient framework inspired by our discovery that inter-layer activation residuals possess low-rank properties. CR-Net implements this insight through a dual-path architecture that efficiently reconstructs layer activations by combining previous-layer outputs with their low-rank differences, thereby maintaining high-rank information with minimal parameters. We further develop a specialized activation recomputation strategy tailored for CR-Net that dramatically reduces memory requirements. Extensive pre-training experiments across model scales from 60M to 7B parameters demonstrate that CR-Net consistently outperforms state-of-the-art low-rank frameworks while requiring fewer computational resources and less memory.

[LG-22] Learning From Simulators: A Theory of Simulation-Grounded Learning

链接: https://arxiv.org/abs/2509.18990
作者: Carson Dudley,Marisa Eisenberg
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:

点击查看摘要

Abstract:Simulation-Grounded Neural Networks (SGNNs) are predictive models trained entirely on synthetic data from mechanistic simulations. They have achieved state-of-the-art performance in domains where real-world labels are limited or unobserved, but lack a formal underpinning. We present the foundational theory of simulation-grounded learning. We show that SGNNs implement amortized Bayesian inference under a simulation prior and converge to the Bayes-optimal predictor. We derive generalization bounds under model misspecification and prove that SGNNs can learn unobservable scientific quantities that empirical methods provably cannot. We also formalize a novel form of mechanistic interpretability uniquely enabled by SGNNs: by attributing predictions to the simulated mechanisms that generated them, SGNNs yield posterior-consistent, scientifically grounded explanations. We provide numerical experiments to validate all theoretical predictions. SGNNs recover latent parameters, remain robust under mismatch, and outperform classical tools: in a model selection task, SGNNs achieve half the error of AIC in distinguishing mechanistic dynamics. These results establish SGNNs as a principled and practical framework for scientific prediction in data-limited regimes. Subjects: Machine Learning (cs.LG); Dynamical Systems (math.DS) Cite as: arXiv:2509.18990 [cs.LG] (or arXiv:2509.18990v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.18990 Focus to learn more arXiv-issued DOI via DataCite (pending registration)

[LG-23] Otters: An Energy-Efficient SpikingTransformer via Optical Time-to-First-Spike Encoding

链接: https://arxiv.org/abs/2509.18968
作者: Zhanglu Yan,Jiayi Mao,Qianhui Liu,Fanfan Li,Gang Pan,Tao Luo,Bowen Zhu,Weng-Fai Wong
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Spiking neural networks (SNNs) promise high energy efficiency, particularly with time-to-first-spike (TTFS) encoding, which maximizes sparsity by emitting at most one spike per neuron. However, such energy advantage is often unrealized because inference requires evaluating a temporal decay function and subsequent multiplication with the synaptic weights. This paper challenges this costly approach by repurposing a physical hardware `bug’, namely, the natural signal decay in optoelectronic devices, as the core computation of TTFS. We fabricated a custom indium oxide optoelectronic synapse, showing how its natural physical decay directly implements the required temporal function. By treating the device’s analog output as the fused product of the synaptic weight and temporal decay, optoelectronic synaptic TTFS (named Otters) eliminates these expensive digital operations. To use the Otters paradigm in complex architectures like the transformer, which are challenging to train directly due to the sparsity issue, we introduce a novel quantized neural network-to-SNN conversion algorithm. This complete hardware-software co-design enables our model to achieve state-of-the-art accuracy across seven GLUE benchmark datasets and demonstrates a 1.77 \times improvement in energy efficiency over previous leading SNNs, based on a comprehensive analysis of compute, data movement, and memory access costs using energy measurements from a commercial 22nm process. Our work thus establishes a new paradigm for energy-efficient SNNs, translating fundamental device physics directly into powerful computational primitives. All codes and data are open source.

[LG-24] Central Limit Theorems for Asynchronous Averag ed Q-Learning

链接: https://arxiv.org/abs/2509.18964
作者: Xingtu Liu
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:

点击查看摘要

Abstract:This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We present a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.

[LG-25] Lift What You Can: Green Online Learning with Heterogeneous Ensembles

链接: https://arxiv.org/abs/2509.18962
作者: Kirsten Köbschall,Sebastian Buschjäger,Raphael Fischer,Lisa Hartung,Stefan Kramer
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Ensemble methods for stream mining necessitate managing multiple models and updating them as data distributions evolve. Considering the calls for more sustainability, established methods are however not sufficiently considerate of ensemble members’ computational expenses and instead overly focus on predictive capabilities. To address these challenges and enable green online learning, we propose heterogeneous online ensembles (HEROS). For every training step, HEROS chooses a subset of models from a pool of models initialized with diverse hyperparameter choices under resource constraints to train. We introduce a Markov decision process to theoretically capture the trade-offs between predictive performance and sustainability constraints. Based on this framework, we present different policies for choosing which models to train on incoming data. Most notably, we propose the novel \zeta -policy, which focuses on training near-optimal models at reduced costs. Using a stochastic model, we theoretically prove that our \zeta -policy achieves near optimal performance while using fewer resources compared to the best performing policy. In our experiments across 11 benchmark datasets, we find empiric evidence that our \zeta -policy is a strong contribution to the state-of-the-art, demonstrating highly accurate performance, in some cases even outperforming competitors, and simultaneously being much more resource-friendly.

[LG-26] Integrating Stacked Intelligent Metasurfaces and Power Control for Dynamic Edge Inference via Over-The-Air Neural Networks ICASSP2026

链接: https://arxiv.org/abs/2509.18906
作者: Kyriakos Stylianopoulos,George C. Alexandropoulos
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: Submitted to IEEE ICASSP 2026

点击查看摘要

Abstract:This paper introduces a novel framework for Edge Inference (EI) that bypasses the conventional practice of treating the wireless channel as noise. We utilize Stacked Intelligent Metasurfaces (SIMs) to control wireless propagation, enabling the channel itself to perform over-the-air computation. This eliminates the need for symbol estimation at the receiver, significantly reducing computational and communication overhead. Our approach models the transmitter-channel-receiver system as an end-to-end Deep Neural Network (DNN) where the response of the SIM elements are trainable parameters. To address channel variability, we incorporate a dedicated DNN module responsible for dynamically adjusting transmission power leveraging user location information. Our performance evaluations showcase that the proposed metasurfaces-integrated DNN framework with deep SIM architectures are capable of balancing classification accuracy and power consumption under diverse scenarios, offering significant energy efficiency improvements.

[LG-27] Enhancing the Effectiveness and Durability of Backdoor Attacks in Federated Learning through Maximizing Task Distinction

链接: https://arxiv.org/abs/2509.18904
作者: Zhaoxin Wang,Handing Wang,Cong Tian,Yaochu Jin
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated learning allows multiple participants to collaboratively train a central model without sharing their private data. However, this distributed nature also exposes new attack surfaces. In particular, backdoor attacks allow attackers to implant malicious behaviors into the global model while maintaining high accuracy on benign inputs. Existing attacks usually rely on fixed patterns or adversarial perturbations as triggers, which tightly couple the main and backdoor tasks. This coupling makes them vulnerable to dilution by honest updates and limits their persistence under federated defenses. In this work, we propose an approach to decouple the backdoor task from the main task by dynamically optimizing the backdoor trigger within a min-max framework. The inner layer maximizes the performance gap between poisoned and benign samples, ensuring that the contributions of benign users have minimal impact on the backdoor. The outer process injects the adaptive triggers into the local model. We evaluate our method on both computer vision and natural language tasks, and compare it with six backdoor attack methods under six defense algorithms. Experimental results show that our method achieves good attack performance and can be easily integrated into existing backdoor attack techniques.

[LG-28] Exploring Heterophily in Graph-level Tasks CEC NEURIPS2025

链接: https://arxiv.org/abs/2509.18893
作者: Qinhan Hou,Yilun Zheng,Xichun Zhang,Sitao Luan,Jing Tang
类目: Machine Learning (cs.LG)
*备注: Accectped by NeurIPS 2025 Workshop, New Perspectives in Advancing Graph Machine Learning (NPGML)

点击查看摘要

Abstract:While heterophily has been widely studied in node-level tasks, its impact on graph-level tasks remains unclear. We present the first analysis of heterophily in graph-level learning, combining theoretical insights with empirical validation. We first introduce a taxonomy of graph-level labeling schemes, and focus on motif-based tasks within local structure labeling, which is a popular labeling scheme. Using energy-based gradient flow analysis, we reveal a key insight: unlike frequency-dominated regimes in node-level tasks, motif detection requires mixed-frequency dynamics to remain flexible across multiple spectral components. Our theory shows that motif objectives are inherently misaligned with global frequency dominance, demanding distinct architectural considerations. Experiments on synthetic datasets with controlled heterophily and real-world molecular property prediction support our findings, showing that frequency-adaptive model outperform frequency-dominated models. This work establishes a new theoretical understanding of heterophily in graph-level learning and offers guidance for designing effective GNN architectures.

[LG-29] Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

链接: https://arxiv.org/abs/2509.18886
作者: Marcin Chrapek,Marcin Copik,Etienne Mettaz,Torsten Hoefler
类目: Performance (cs.PF); Hardware Architecture (cs.AR); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened security requirements slow adoption in privacy-sensitive sectors such as healthcare and finance. We investigate methods to address this gap and propose Trusted Execution Environments (TEEs) as a solution for securing end-to-end LLM inference. We validate their practicality by evaluating these compute-intensive workloads entirely within CPU and GPU TEEs. On the CPU side, we conduct an in-depth study running full Llama2 inference pipelines (7B, 13B, 70B) inside Intel’s TDX and SGX, accelerated by Advanced Matrix Extensions (AMX). We derive 12 insights, including that across various data types, batch sizes, and input lengths, CPU TEEs impose under 10% throughput and 20% latency overheads, further reduced by AMX. We run LLM inference on NVIDIA H100 Confidential Compute GPUs, contextualizing our CPU findings and observing throughput penalties of 4-8% that diminish as batch and input sizes grow. By comparing performance, cost, and security trade-offs, we show how CPU TEEs can be more cost-effective or secure than their GPU counterparts. To our knowledge, our work is the first to comprehensively demonstrate the performance and practicality of modern TEEs across both CPUs and GPUs for enabling confidential LLMs (cLLMs).

[LG-30] Bi-VLA: Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation

链接: https://arxiv.org/abs/2509.18865
作者: Masato Kobayashi,Thanpimon Buamanee
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We propose Bilateral Control-Based Imitation Learning via Vision-Language Fusion for Action Generation (Bi-VLA), a novel framework that extends bilateral control-based imitation learning to handle more than one task within a single model. Conventional bilateral control methods exploit joint angle, velocity, torque, and vision for precise manipulation but require task-specific models, limiting their generality. Bi-VLA overcomes this limitation by utilizing robot joint angle, velocity, and torque data from leader-follower bilateral control with visual features and natural language instructions through SigLIP and FiLM-based fusion. We validated Bi-VLA on two task types: one requiring supplementary language cues and another distinguishable solely by vision. Real-robot experiments showed that Bi-VLA successfully interprets vision-language combinations and improves task success rates compared to conventional bilateral control-based imitation learning. Our Bi-VLA addresses the single-task limitation of prior bilateral approaches and provides empirical evidence that combining vision and language significantly enhances versatility. Experimental results validate the effectiveness of Bi-VLA in real-world tasks. For additional material, please visit the website: this https URL

[LG-31] Shared-Weights Extender and Gradient Voting for Neural Network Expansion

链接: https://arxiv.org/abs/2509.18842
作者: Nikolas Chatzis,Ioannis Kordonis,Manos Theodosis,Petros Maragos
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures

点击查看摘要

Abstract:Expanding neural networks during training is a promising way to augment capacity without retraining larger models from scratch. However, newly added neurons often fail to adjust to a trained network and become inactive, providing no contribution to capacity growth. We propose the Shared-Weights Extender (SWE), a novel method explicitly designed to prevent inactivity of new neurons by coupling them with existing ones for smooth integration. In parallel, we introduce the Steepest Voting Distributor (SVoD), a gradient-based method for allocating neurons across layers during deep network expansion. Our extensive benchmarking on four datasets shows that our method can effectively suppress neuron inactivity and achieve better performance compared to other expanding methods and baselines.

[LG-32] Graph-based Clustering Revisited: A Relaxation of Kernel k-Means Perspective

链接: https://arxiv.org/abs/2509.18826
作者: Wenlong Lyu,Yuheng Jia,Hui Liu,Junhui Hou
类目: Machine Learning (cs.LG)
*备注: 39 pages, 20 figures

点击查看摘要

Abstract:The well-known graph-based clustering methods, including spectral clustering, symmetric non-negative matrix factorization, and doubly stochastic normalization, can be viewed as relaxations of the kernel k -means approach. However, we posit that these methods excessively relax their inherent low-rank, nonnegative, doubly stochastic, and orthonormal constraints to ensure numerical feasibility, potentially limiting their clustering efficacy. In this paper, guided by our theoretical analyses, we propose \textbfLow-\textbfRank \textbfDoubly stochastic clustering (\textbfLoRD), a model that only relaxes the orthonormal constraint to derive a probabilistic clustering results. Furthermore, we theoretically establish the equivalence between orthogonality and block diagonality under the doubly stochastic constraint. By integrating \textbfBlock diagonal regularization into LoRD, expressed as the maximization of the Frobenius norm, we propose \textbfB-LoRD, which further enhances the clustering performance. To ensure numerical solvability, we transform the non-convex doubly stochastic constraint into a linear convex constraint through the introduction of a class probability parameter. We further theoretically demonstrate the gradient Lipschitz continuity of our LoRD and B-LoRD enables the proposal of a globally convergent projected gradient descent algorithm for their optimization. Extensive experiments validate the effectiveness of our approaches. The code is publicly available at this https URL.

[LG-33] raining-Free Data Assimilation with GenCast

链接: https://arxiv.org/abs/2509.18811
作者: Thomas Savary,François Rozet,Gilles Louppe
类目: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
*备注:

点击查看摘要

Abstract:Data assimilation is widely used in many disciplines such as meteorology, oceanography, and robotics to estimate the state of a dynamical system from noisy observations. In this work, we propose a lightweight and general method to perform data assimilation using diffusion models pre-trained for emulating dynamical systems. Our method builds on particle filters, a class of data assimilation algorithms, and does not require any further training. As a guiding example throughout this work, we illustrate our methodology on GenCast, a diffusion-based model that generates global ensemble weather forecasts.

[LG-34] Probabilistic Machine Learning for Uncertainty-Aware Diagnosis of Industrial Systems

链接: https://arxiv.org/abs/2509.18810
作者: Arman Mohammadi,Mattias Krysander,Daniel Jung,Erik Frisk
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Deep neural networks has been increasingly applied in fault diagnostics, where it uses historical data to capture systems behavior, bypassing the need for high-fidelity physical models. However, despite their competence in prediction tasks, these models often struggle with the evaluation of their confidence. This matter is particularly important in consistency-based diagnosis where decision logic is highly sensitive to false alarms. To address this challenge, this work presents a diagnostic framework that uses ensemble probabilistic machine learning to improve diagnostic characteristics of data driven consistency based diagnosis by quantifying and automating the prediction uncertainty. The proposed method is evaluated across several case studies using both ablation and comparative analyses, showing consistent improvements across a range of diagnostic metrics. Subjects: Machine Learning (cs.LG) Cite as: arXiv:2509.18810 [cs.LG] (or arXiv:2509.18810v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2509.18810 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Arman Mohammadi [view email] [v1] Tue, 23 Sep 2025 08:59:20 UTC (5,930 KB) Full-text links: Access Paper: View a PDF of the paper titled Probabilistic Machine Learning for Uncertainty-Aware Diagnosis of Industrial Systems, by Arman Mohammadi and 3 other authorsView PDFHTML (experimental)TeX SourceOther Formats view license Current browse context: cs.LG prev | next new | recent | 2025-09 Change to browse by: cs References Citations NASA ADSGoogle Scholar Semantic Scholar export BibTeX citation Loading… BibTeX formatted citation loading… Data provided by: Bookmark checked=“checked”> Bibliographic Tools Bibliographic and Citation Tools Bibliographic Explorer Toggle Bibliographic Explorer (What is the Explorer?) Connected Papers Toggle Connected Papers (What is Connected Papers?) Litmaps Toggle Litmaps (What is Litmaps?) scite.ai Toggle scite Smart Citations (What are Smart Citations?) Code, Data, Media Code, Data and Media Associated with this Article alphaXiv Toggle alphaXiv (What is alphaXiv?) Links to Code Toggle CatalyzeX Code Finder for Papers (What is CatalyzeX?) DagsHub Toggle DagsHub (What is DagsHub?) GotitPub Toggle Gotit.pub (What is GotitPub?) Huggingface Toggle Hugging Face (What is Huggingface?) Links to Code Toggle Papers with Code (What is Papers with Code?) ScienceCast Toggle ScienceCast (What is ScienceCast?) Demos Demos Replicate Toggle Replicate (What is Replicate?) Spaces Toggle Hugging Face Spaces (What is Spaces?) Spaces Toggle TXYZ.AI (What is TXYZ.AI?) Related Papers Recommenders and Search Tools Link to Influence Flower Influence Flower (What are Influence Flowers?) Core recommender toggle CORE Recommender (What is CORE?) IArxiv recommender toggle IArxiv Recommender (What is IArxiv?) Author Venue Institution Topic About arXivLabs arXivLabs: experimental projects with community collaborators arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website. Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them. Have an idea for a project that will add value for arXiv’s community? Learn more about arXivLabs. Which authors of this paper are endorsers? | Disable MathJax (What is MathJax?) mathjaxToggle(); About Help contact arXivClick here to contact arXiv Contact subscribe to arXiv mailingsClick here to subscribe Subscribe Copyright Privacy Policy Web Accessibility Assistance arXiv Operational Status Get status notifications via email or slack

[LG-35] Diagonal Linear Networks and the Lasso Regularization Path

链接: https://arxiv.org/abs/2509.18766
作者: Raphaël Berthier
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注: 29 pages, 1 figure

点击查看摘要

Abstract:Diagonal linear networks are neural networks with linear activation and diagonal weight matrices. Their theoretical interest is that their implicit regularization can be rigorously analyzed: from a small initialization, the training of diagonal linear networks converges to the linear predictor with minimal 1-norm among minimizers of the training loss. In this paper, we deepen this analysis showing that the full training trajectory of diagonal linear networks is closely related to the lasso regularization path. In this connection, the training time plays the role of an inverse regularization parameter. Both rigorous results and simulations are provided to illustrate this conclusion. Under a monotonicity assumption on the lasso regularization path, the connection is exact while in the general case, we show an approximate connection.

[LG-36] MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model

链接: https://arxiv.org/abs/2509.18751
作者: Samuel Yoon,Jongwon Kim,Juyoung Ha,Young Myoung Ko
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-37] heory of periodic convolutional neural network

链接: https://arxiv.org/abs/2509.18744
作者: Yuqing Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a novel convolutional neural network architecture, termed the \emphperiodic CNN, which incorporates periodic boundary conditions into the convolutional layers. Our main theoretical contribution is a rigorous approximation theorem: periodic CNNs can approximate ridge functions depending on d-1 linear variables in a d -dimensional input space, while such approximation is impossible in lower-dimensional ridge settings ( d-2 or fewer variables). This result establishes a sharp characterization of the expressive power of periodic CNNs. Beyond the theory, our findings suggest that periodic CNNs are particularly well-suited for problems where data naturally admits a ridge-like structure of high intrinsic dimension, such as image analysis on wrapped domains, physics-informed learning, and materials science. The work thus both expands the mathematical foundation of CNN approximation theory and highlights a class of architectures with surprising and practically relevant approximation capabilities.

[LG-38] LLM -Enhanced Self-Evolving Reinforcement Learning for Multi-Step E-Commerce Payment Fraud Risk Detection ACL2025

链接: https://arxiv.org/abs/2509.18719
作者: Bo Qu,Zhurong Wang,Daisuke Yagi,Zhen Xu,Yang Zhao,Yinan Shan,Frank Zahradnik
类目: Machine Learning (cs.LG)
*备注: 12 pages, 12 figures, ACL 2025 industry track

点击查看摘要

Abstract:This paper presents a novel approach to e-commerce payment fraud detection by integrating reinforcement learning (RL) with Large Language Models (LLMs). By framing transaction risk as a multi-step Markov Decision Process (MDP), RL optimizes risk detection across multiple payment stages. Crafting effective reward functions, essential for RL model success, typically requires significant human expertise due to the complexity and variability in design. LLMs, with their advanced reasoning and coding capabilities, are well-suited to refine these functions, offering improvements over traditional methods. Our approach leverages LLMs to iteratively enhance reward functions, achieving better fraud detection accuracy and demonstrating zero-shot capability. Experiments with real-world data confirm the effectiveness, robustness, and resilience of our LLM-enhanced RL framework through long-term evaluations, underscoring the potential of LLMs in advancing industrial RL applications.

[LG-39] owards Rational Pesticide Design with Graph Machine Learning Models for Ecotoxicology

链接: https://arxiv.org/abs/2509.18703
作者: Jakub Adamczyk
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This research focuses on rational pesticide design, using graph machine learning to accelerate the development of safer, eco-friendly agrochemicals, inspired by in silico methods in drug discovery. With an emphasis on ecotoxicology, the initial contributions include the creation of ApisTox, the largest curated dataset on pesticide toxicity to honey bees. We conducted a broad evaluation of machine learning (ML) models for molecular graph classification, including molecular fingerprints, graph kernels, GNNs, and pretrained transformers. The results show that methods successful in medicinal chemistry often fail to generalize to agrochemicals, underscoring the need for domain-specific models and benchmarks. Future work will focus on developing a comprehensive benchmarking suite and designing ML models tailored to the unique challenges of pesticide discovery.

[LG-40] Query-Centric Diffusion Policy for Generalizable Robotic Assembly

链接: https://arxiv.org/abs/2509.18686
作者: Ziyi Xu,Haohong Lin,Shiqi Liu,Ding Zhao
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 8 pages, 7 figures

点击查看摘要

Abstract:The robotic assembly task poses a key challenge in building generalist robots due to the intrinsic complexity of part interactions and the sensitivity to noise perturbations in contact-rich settings. The assembly agent is typically designed in a hierarchical manner: high-level multi-part reasoning and low-level precise control. However, implementing such a hierarchical policy is challenging in practice due to the mismatch between high-level skill queries and low-level execution. To address this, we propose the Query-centric Diffusion Policy (QDP), a hierarchical framework that bridges high-level planning and low-level control by utilizing queries comprising objects, contact points, and skill information. QDP introduces a query-centric mechanism that identifies task-relevant components and uses them to guide low-level policies, leveraging point cloud observations to improve the policy’s robustness. We conduct comprehensive experiments on the FurnitureBench in both simulation and real-world settings, demonstrating improved performance in skill precision and long-horizon success rate. In the challenging insertion and screwing tasks, QDP improves the skill-wise success rate by over 50% compared to baselines without structured queries.

[LG-41] Online Learning for Optimizing AoI-Energy Tradeoff under Unknown Channel Statistics

链接: https://arxiv.org/abs/2509.18654
作者: Mohamed A. Abd-Elmagid,Ming Shi,Eylem Ekici,Ness B. Shroff
类目: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We consider a real-time monitoring system where a source node (with energy limitations) aims to keep the information status at a destination node as fresh as possible by scheduling status update transmissions over a set of channels. The freshness of information at the destination node is measured in terms of the Age of Information (AoI) metric. In this setting, a natural tradeoff exists between the transmission cost (or equivalently, energy consumption) of the source and the achievable AoI performance at the destination. This tradeoff has been optimized in the existing literature under the assumption of having a complete knowledge of the channel statistics. In this work, we develop online learning-based algorithms with finite-time guarantees that optimize this tradeoff in the practical scenario where the channel statistics are unknown to the scheduler. In particular, when the channel statistics are known, the optimal scheduling policy is first proven to have a threshold-based structure with respect to the value of AoI (i.e., it is optimal to drop updates when the AoI value is below some threshold). This key insight was then utilized to develop the proposed learning algorithms that surprisingly achieve an order-optimal regret (i.e., O(1) ) with respect to the time horizon length.

[LG-42] Subspace Clustering of Subspaces: Unifying Canonical Correlation Analysis and Subspace Clustering

链接: https://arxiv.org/abs/2509.18653
作者: Paris A. Karakasis,Nicholas D. Sidiropoulos
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注: 13 pages, Submitted to IEEE Transactions on Signal Processing

点击查看摘要

Abstract:We introduce a novel framework for clustering a collection of tall matrices based on their column spaces, a problem we term Subspace Clustering of Subspaces (SCoS). Unlike traditional subspace clustering methods that assume vectorized data, our formulation directly models each data sample as a matrix and clusters them according to their underlying subspaces. We establish conceptual links to Subspace Clustering and Generalized Canonical Correlation Analysis (GCCA), and clarify key differences that arise in this more general setting. Our approach is based on a Block Term Decomposition (BTD) of a third-order tensor constructed from the input matrices, enabling joint estimation of cluster memberships and partially shared subspaces. We provide the first identifiability results for this formulation and propose scalable optimization algorithms tailored to large datasets. Experiments on real-world hyperspectral imaging datasets demonstrate that our method achieves superior clustering accuracy and robustness, especially under high noise and interference, compared to existing subspace clustering techniques. These results highlight the potential of the proposed framework in challenging high-dimensional applications where structure exists beyond individual data vectors.

[LG-43] Reflect before Act: Proactive Error Correction in Language Models

链接: https://arxiv.org/abs/2509.18607
作者: Qiuhai Zeng,Sarvesh Rajkumar,Di Wang,Narendra Gyanchandani,Wenbo Yan
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in interactive decision-making tasks, but existing methods often struggle with error accumulation and lack robust self-correction mechanisms. We introduce “Reflect before Act” (REBACT), a novel approach that enhances LLM-based decision-making by introducing a critical reflect step prior to taking the next action. This approach allows for immediate error correction, ensuring smooth action path and adaptibity to environment feedback. We evaluate REBACT on three diverse interactive environments: ALFWorld, WebShop, and TextCraft. Our results demonstrate that REBACT significantly outperforms strong baselines, improving success rates by up to 24% on WebShop (achieving 61%), 6.72% on ALFWorld (achieving 98.51%), and 0.5% on TextCraft (achieving 99.5%) using Claude3.5-sonnet as the underlying LLM. Further analysis reveals that REBACT’s performance improvements are achieved with only a few modification steps, demonstrating its computational efficiency.

[LG-44] DS-Diffusion: Data Style-Guided Diffusion Model for Time-Series Generation

链接: https://arxiv.org/abs/2509.18584
作者: Mingchun Sun,Rongqiang Zhao,Jie Liu
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion models are the mainstream approach for time series generation tasks. However, existing diffusion models for time series generation require retraining the entire framework to introduce specific conditional guidance. There also exists a certain degree of distributional bias between the generated data and the real data, which leads to potential model biases in downstream tasks. Additionally, the complexity of diffusion models and the latent spaces leads to an uninterpretable inference process. To address these issues, we propose the data style-guided diffusion model (DS-Diffusion). In the DS-Diffusion, a diffusion framework based on style-guided kernels is developed to avoid retraining for specific conditions. The time-information based hierarchical denoising mechanism (THD) is developed to reduce the distributional bias between the generated data and the real data. Furthermore, the generated samples can clearly indicate the data style from which they originate. We conduct comprehensive evaluations using multiple public datasets to validate our approach. Experimental results show that, compared to the state-of-the-art model such as ImagenTime, the predictive score and the discriminative score decrease by 5.56% and 61.55%, respectively. The distributional bias between the generated data and the real data is further reduced, the inference process is also more interpretable. Moreover, by eliminating the need to retrain the diffusion model, the flexibility and adaptability of the model to specific conditions are also enhanced.

[LG-45] Explainable Graph Neural Networks: Understanding Brain Connectivity and Biomarkers in Dementia

链接: https://arxiv.org/abs/2509.18568
作者: Niharika Tewari,Nguyen Linh Dan Le,Mujie Liu,Jing Ren,Ziqi Xu,Tabinda Sarwar,Veeky Baths,Feng Xia
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-46] Reverse-Complement Consistency for DNA Language Models

链接: https://arxiv.org/abs/2509.18529
作者: Mingqian Ma
类目: Machine Learning (cs.LG); Genomics (q-bio.GN)
*备注:

点击查看摘要

Abstract:A fundamental property of DNA is that the reverse complement (RC) of a sequence often carries identical biological meaning. However, state-of-the-art DNA language models frequently fail to capture this symmetry, producing inconsistent predictions for a sequence and its RC counterpart, which undermines their reliability. In this work, we introduce Reverse-Complement Consistency Regularization (RCCR), a simple and model-agnostic fine-tuning objective that directly penalizes the divergence between a model’s prediction on a sequence and the aligned prediction on its reverse complement. We evaluate RCCR across three diverse backbones (Nucleotide Transformer, HyenaDNA, DNABERT-2) on a wide range of genomic tasks, including sequence classification, scalar regression, and profile prediction. Our experiments show that RCCR substantially improves RC robustness by dramatically reducing prediction flips and errors, all while maintaining or improving task accuracy compared to baselines such as RC data augmentation and test-time averaging. By integrating a key biological prior directly into the learning process, RCCR produces a single, intrinsically robust, and computationally efficient model fine-tuning recipe for diverse biology tasks.

[LG-47] Hybrid Data can Enhance the Utility of Synthetic Data for Training Anti-Money Laundering Models

链接: https://arxiv.org/abs/2509.18499
作者: Rachel Chung,Pratyush Nidhi Sharma,Mikko Siponen,Rohit Vadodaria,Luke Smith
类目: Machine Learning (cs.LG)
*备注: Presented at the Association of Certified Fraud Examiners (ACFE) Research Institute Annual Meeting, Las Vegas, NV, (2024)

点击查看摘要

Abstract:Money laundering is a critical global issue for financial institutions. Automated Anti-money laundering (AML) models, like Graph Neural Networks (GNN), can be trained to identify illicit transactions in real time. A major issue for developing such models is the lack of access to training data due to privacy and confidentiality concerns. Synthetically generated data that mimics the statistical properties of real data but preserves privacy and confidentiality has been proposed as a solution. However, training AML models on purely synthetic datasets presents its own set of challenges. This article proposes the use of hybrid datasets to augment the utility of synthetic datasets by incorporating publicly available, easily accessible, and real-world features. These additions demonstrate that hybrid datasets not only preserve privacy but also improve model utility, offering a practical pathway for financial institutions to enhance AML systems.

[LG-48] Physics-informed time series analysis with Kolmogorov-Arnold Networks under Ehrenfest constraints

链接: https://arxiv.org/abs/2509.18483
作者: Abhijit Sen,Illya V. Lukin,Kurt Jacobs,Lev Kaplan,Andrii G. Sotnikov,Denys I. Bondar
类目: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*备注:

点击查看摘要

Abstract:The prediction of quantum dynamical responses lies at the heart of modern physics. Yet, modeling these time-dependent behaviors remains a formidable challenge because quantum systems evolve in high-dimensional Hilbert spaces, often rendering traditional numerical methods computationally prohibitive. While large language models have achieved remarkable success in sequential prediction, quantum dynamics presents a fundamentally different challenge: forecasting the entire temporal evolution of quantum systems rather than merely the next element in a sequence. Existing neural architectures such as recurrent and convolutional networks often require vast training datasets and suffer from spurious oscillations that compromise physical interpretability. In this work, we introduce a fundamentally new approach: Kolmogorov Arnold Networks (KANs) augmented with physics-informed loss functions that enforce the Ehrenfest theorems. Our method achieves superior accuracy with significantly less training data: it requires only 5.4 percent of the samples (200) compared to Temporal Convolution Networks (3,700). We further introduce the Chain of KANs, a novel architecture that embeds temporal causality directly into the model design, making it particularly well-suited for time series modeling. Our results demonstrate that physics-informed KANs offer a compelling advantage over conventional black-box models, maintaining both mathematical rigor and physical consistency while dramatically reducing data requirements.

[LG-49] SimpleFold: Folding Proteins is Simpler than You Think

链接: https://arxiv.org/abs/2509.18480
作者: Yuyang Wang,Jiarui Lu,Navdeep Jaitly,Josh Susskind,Miguel Angel Bautista
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 28 pages, 11 figures, 13 tables

点击查看摘要

[LG-50] Individualized non-uniform quantization for vector search

链接: https://arxiv.org/abs/2509.18471
作者: Mariano Tepper,Ted Willke
类目: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*备注:

点击查看摘要

[LG-51] Discrete-time diffusion-like models for speech synthesis

链接: https://arxiv.org/abs/2509.18470
作者: Xiaozhou Tan,Minghui Zhao,Mattias Cross,Anton Ragni
类目: Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Diffusion models have attracted a lot of attention in recent years. These models view speech generation as a continuous-time process. For efficient training, this process is typically restricted to additive Gaussian noising, which is limiting. For inference, the time is typically discretized, leading to the mismatch between continuous training and discrete sampling conditions. Recently proposed discrete-time processes, on the other hand, usually do not have these limitations, may require substantially fewer inference steps, and are fully consistent between training/inference conditions. This paper explores some diffusion-like discrete-time processes and proposes some new variants. These include processes applying additive Gaussian noise, multiplicative Gaussian noise, blurring noise and a mixture of blurring and Gaussian noises. The experimental results suggest that discrete-time processes offer comparable subjective and objective speech quality to their widely popular continuous counterpart, with more efficient and consistent training and inference schemas.

[LG-52] Probabilistic Geometric Principal Component Analysis with application to neural data ICLR

链接: https://arxiv.org/abs/2509.18469
作者: Han-Lin Hsieh,Maryam M. Shanechi
类目: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
*备注: Published at the International Conference on Learning Representations (ICLR) 2025. Code is available at GitHub this https URL

点击查看摘要

Abstract:Dimensionality reduction is critical across various domains of science including neuroscience. Probabilistic Principal Component Analysis (PPCA) is a prominent dimensionality reduction method that provides a probabilistic approach unlike the deterministic approach of PCA and serves as a connection between PCA and Factor Analysis (FA). Despite their power, PPCA and its extensions are mainly based on linear models and can only describe the data in a Euclidean coordinate system. However, in many neuroscience applications, data may be distributed around a nonlinear geometry (i.e., manifold) rather than lying in the Euclidean space. We develop Probabilistic Geometric Principal Component Analysis (PGPCA) for such datasets as a new dimensionality reduction algorithm that can explicitly incorporate knowledge about a given nonlinear manifold that is first fitted from these data. Further, we show how in addition to the Euclidean coordinate system, a geometric coordinate system can be derived for the manifold to capture the deviations of data from the manifold and noise. We also derive a data-driven EM algorithm for learning the PGPCA model parameters. As such, PGPCA generalizes PPCA to better describe data distributions by incorporating a nonlinear manifold geometry. In simulations and brain data analyses, we show that PGPCA can effectively model the data distribution around various given manifolds and outperforms PPCA for such data. Moreover, PGPCA provides the capability to test whether the new geometric coordinate system better describes the data than the Euclidean one. Finally, PGPCA can perform dimensionality reduction and learn the data distribution both around and on the manifold. These capabilities make PGPCA valuable for enhancing the efficacy of dimensionality reduction for analysis of high-dimensional data that exhibit noise and are distributed around a nonlinear manifold.

[LG-53] Robotic Skill Diversification via Active Mutation of Reward Functions in Reinforcement Learning During a Liquid Pouring Task

链接: https://arxiv.org/abs/2509.18463
作者: Jannick van Buuren,Roberto Giglio,Loris Roveda,Luka Peternel
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper explores how deliberate mutations of reward function in reinforcement learning can produce diversified skill variations in robotic manipulation tasks, examined with a liquid pouring use case. To this end, we developed a new reward function mutation framework that is based on applying Gaussian noise to the weights of the different terms in the reward function. Inspired by the cost-benefit tradeoff model from human motor control, we designed the reward function with the following key terms: accuracy, time, and effort. The study was performed in a simulation environment created in NVIDIA Isaac Sim, and the setup included Franka Emika Panda robotic arm holding a glass with a liquid that needed to be poured into a container. The reinforcement learning algorithm was based on Proximal Policy Optimization. We systematically explored how different configurations of mutated weights in the rewards function would affect the learned policy. The resulting policies exhibit a wide range of behaviours: from variations in execution of the originally intended pouring task to novel skills useful for unexpected tasks, such as container rim cleaning, liquid mixing, and watering. This approach offers promising directions for robotic systems to perform diversified learning of specific tasks, while also potentially deriving meaningful skills for future tasks.

[LG-54] GluMind: Multimodal Parallel Attention and Knowledge Retention for Robust Cross-Population Blood Glucose Forecasting

链接: https://arxiv.org/abs/2509.18457
作者: Ebrahim Farahmand,Reza Rahimi Azghan,Nooshin Taheri Chatrudi,Velarie Yaa Ansu-Baidoo,Eric Kim,Gautham Krishna Gudur,Mohit Malu,Owen Krueger,Edison Thomaz,Giulia Pedrielli,Pavan Turaga,Hassan Ghasemzadeh
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-55] Fast Linear Solvers via AI-Tuned Markov Chain Monte Carlo-based Matrix Inversion

链接: https://arxiv.org/abs/2509.18452
作者: Anton Lebedev,Won Kyung Lee,Soumyadip Ghosh,Olha I. Yaman,Vassilis Kalantzis,Yingdong Lu,Tomasz Nowicki,Shashanka Ubaru,Lior Horesh,Vassil Alexandrov
类目: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
*备注: 8 pages, 3 figures, 1 algorithm, 1 table of experiment cases

点击查看摘要

Abstract:Large, sparse linear systems are pervasive in modern science and engineering, and Krylov subspace solvers are an established means of solving them. Yet convergence can be slow for ill-conditioned matrices, so practical deployments usually require preconditioners. Markov chain Monte Carlo (MCMC)-based matrix inversion can generate such preconditioners and accelerate Krylov iterations, but its effectiveness depends on parameters whose optima vary across matrices; manual or grid search is costly. We present an AI-driven framework recommending MCMC parameters for a given linear system. A graph neural surrogate predicts preconditioning speed from A and MCMC parameters. A Bayesian acquisition function then chooses the parameter sets most likely to minimise iterations. On a previously unseen ill-conditioned system, the framework achieves better preconditioning with 50% of the search budget of conventional methods, yielding about a 10% reduction in iterations to convergence. These results suggest a route for incorporating MCMC-based preconditioners into large-scale systems.

[LG-56] Large-Scale Longitudinal Study of Large Language Models During the 2024 US Election Season

链接: https://arxiv.org/abs/2509.18446
作者: Sarah H. Cen,Andrew Ilyas,Hedi Driss,Charlotte Park,Aspen Hopkins,Chara Podimata,Aleksander Mądry
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注: 100 pages, 69 figures

点击查看摘要

[LG-57] MeshODENet: A Graph-Informed Neural Ordinary Differential Equation Neural Network for Simulating Mesh-Based Physical Systems

链接: https://arxiv.org/abs/2509.18445
作者: Kangzheng Liu,Leixin Ma
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*备注: 9 pages, 7 figures

点击查看摘要

Abstract:The simulation of complex physical systems using a discretized mesh is a cornerstone of applied mechanics, but traditional numerical solvers are often computationally prohibitive for many-query tasks. While Graph Neural Networks (GNNs) have emerged as powerful surrogate models for mesh-based data, their standard autoregressive application for long-term prediction is often plagued by error accumulation and instability. To address this, we introduce MeshODENet, a general framework that synergizes the spatial reasoning of GNNs with the continuous-time modeling of Neural Ordinary Differential Equations. We demonstrate the framework’s effectiveness and versatility on a series of challenging structural mechanics problems, including one- and two-dimensional elastic bodies undergoing large, non-linear deformations. The results demonstrate that our approach significantly outperforms baseline models in long-term predictive accuracy and stability, while achieving substantial computational speed-ups over traditional solvers. This work presents a powerful and generalizable approach for developing data-driven surrogates to accelerate the analysis and modeling of complex structural systems.

[LG-58] Diffusion Policies with Offline and Inverse Reinforcement Learning for Promoting Physical Activity in Older Adults Using Wearable Sensors ICML

链接: https://arxiv.org/abs/2509.18433
作者: Chang Liu,Ladda Thiamwong,Yanjie Fu,Rui Xie
类目: Machine Learning (cs.LG)
*备注: Accepted at ICMLA 2025. 8 pages, 6 figures

点击查看摘要

Abstract:Utilizing offline reinforcement learning (RL) with real-world clinical data is getting increasing attention in AI for healthcare. However, implementation poses significant challenges. Defining direct rewards is difficult, and inverse RL (IRL) struggles to infer accurate reward functions from expert behavior in complex environments. Offline RL also encounters challenges in aligning learned policies with observed human behavior in healthcare applications. To address challenges in applying offline RL to physical activity promotion for older adults at high risk of falls, based on wearable sensor activity monitoring, we introduce Kolmogorov-Arnold Networks and Diffusion Policies for Offline Inverse Reinforcement Learning (KANDI). By leveraging the flexible function approximation in Kolmogorov-Arnold Networks, we estimate reward functions by learning free-living environment behavior from low-fall-risk older adults (experts), while diffusion-based policies within an Actor-Critic framework provide a generative approach for action refinement and efficiency in offline RL. We evaluate KANDI using wearable activity monitoring data in a two-arm clinical trial from our Physio-feedback Exercise Program (PEER) study, emphasizing its practical application in a fall-risk intervention program to promote physical activity among older adults. Additionally, KANDI outperforms state-of-the-art methods on the D4RL benchmark. These results underscore KANDI’s potential to address key challenges in offline RL for healthcare applications, offering an effective solution for activity promotion intervention strategies in healthcare.

[LG-59] VoxGuard: Evaluating User and Attribute Privacy in Speech via Membership Inference Attacks

链接: https://arxiv.org/abs/2509.18413
作者: Efthymios Tsaprazlis,Thanathai Lertpetchpun,Tiantian Feng,Sai Praneeth Karimireddy,Shrikanth Narayanan
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-60] Identifying birdsong syllables without labelled data

链接: https://arxiv.org/abs/2509.18412
作者: Mélisande Teng,Julien Boussard,David Rolnick,Hugo Larochelle
类目: ound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*备注:

点击查看摘要

Abstract:Identifying sequences of syllables within birdsongs is key to tackling a wide array of challenges, including bird individual identification and better understanding of animal communication and sensory-motor learning. Recently, machine learning approaches have demonstrated great potential to alleviate the need for experts to label long audio recordings by hand. However, they still typically rely on the availability of labelled data for model training, restricting applicability to a few species and datasets. In this work, we build the first fully unsupervised algorithm to decompose birdsong recordings into sequences of syllables. We first detect syllable events, then cluster them to extract templates --syllable representations-- before performing matching pursuit to decompose the recording as a sequence of syllables. We evaluate our automatic annotations against human labels on a dataset of Bengalese finch songs and find that our unsupervised method achieves high performance. We also demonstrate that our approach can distinguish individual birds within a species through their unique vocal signatures, for both Bengalese finches and another species, the great tit.

[LG-61] Explicit Path CGR: Maintaining Sequence Fidelity in Geometric Representations CIKM2025

链接: https://arxiv.org/abs/2509.18408
作者: Sarwan Ali
类目: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: Accepted to CIKM 2025 as Short paper

点击查看摘要

Abstract:We present a novel information-preserving Chaos Game Representation (CGR) method, also called Reverse-CGR (R-CGR), for biological sequence analysis that addresses the fundamental limitation of traditional CGR approaches - the loss of sequence information during geometric mapping. Our method introduces complete sequence recovery through explicit path encoding combined with rational arithmetic precision control, enabling perfect sequence reconstruction from stored geometric traces. Unlike purely geometric approaches, our reversibility is achieved through comprehensive path storage that maintains both positional and character information at each step. We demonstrate the effectiveness of R-CGR on biological sequence classification tasks, achieving competitive performance compared to traditional sequence-based methods while providing interpretable geometric visualizations. The approach generates feature-rich images suitable for deep learning while maintaining complete sequence information through explicit encoding, opening new avenues for interpretable bioinformatics analysis where both accuracy and sequence recovery are essential.

[LG-62] Development of Deep Learning Optimizers: Approaches Concepts and Update Rules

链接: https://arxiv.org/abs/2509.18396
作者: Doğay Altınel
类目: Machine Learning (cs.LG)
*备注: 24 pages

点击查看摘要

Abstract:Deep learning optimizers are optimization algorithms that enable deep neural networks to learn. The effectiveness of learning is highly dependent on the optimizer employed in the training process. Alongside the rapid advancement of deep learning, a wide range of optimizers with different approaches have been developed. This study aims to provide a review of various optimizers that have been proposed and received attention in the literature. From Stochastic gradient descent to the most recent ones such as Momentum, AdamW, Sophia, and Muon in chronological order, optimizers are examined individually, and their distinctive features are highlighted in the study. The update rule of each optimizer is presented in detail, with an explanation of the associated concepts and variables. The techniques applied by these optimizers, their contributions to the optimization process, and their default hyperparameter settings are also discussed. In addition, insights are offered into the open challenges encountered in the optimization of deep learning models. Thus, a comprehensive resource is provided both for understanding the current state of optimizers and for identifying potential areas of future development.

[LG-63] owards Provable Emergence of In-Context Reinforcement Learning NEURIPS2025

链接: https://arxiv.org/abs/2509.18389
作者: Jiuqi Wang,Rohan Chandra,Shangtong Zhang
类目: Machine Learning (cs.LG)
*备注: NeurIPS 2025, 28 pages

点击查看摘要

Abstract:Typically, a modern reinforcement learning (RL) agent solves a task by updating its neural network parameters to adapt its policy to the task. Recently, it has been observed that some RL agents can solve a wide range of new out-of-distribution tasks without parameter updates after pretraining on some task distribution. When evaluated in a new task, instead of making parameter updates, the pretrained agent conditions its policy on additional input called the context, e.g., the agent’s interaction history in the new task. The agent’s performance increases as the information in the context increases, with the agent’s parameters fixed. This phenomenon is typically called in-context RL (ICRL). The pretrained parameters of the agent network enable the remarkable ICRL phenomenon. However, many ICRL works perform the pretraining with standard RL algorithms. This raises the central question this paper aims to address: Why can the RL pretraining algorithm generate network parameters that enable ICRL? We hypothesize that the parameters capable of ICRL are minimizers of the pretraining loss. This work provides initial support for this hypothesis through a case study. In particular, we prove that when a Transformer is pretrained for policy evaluation, one of the global minimizers of the pretraining loss can enable in-context temporal difference learning.

[LG-64] GnnXemplar: Exemplars to Explanations - Natural Language Rules for Global GNN Interpretability NEURIPS2025

链接: https://arxiv.org/abs/2509.18376
作者: Burouj Armgaan,Eshan Jain,Harsh Pandey,Mahesh Chandran,Sayan Ranu
类目: Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注: 31 pages, 20 figures, NeurIPS 2025 (Oral)

点击查看摘要

Abstract:Graph Neural Networks (GNNs) are widely used for node classification, yet their opaque decision-making limits trust and adoption. While local explanations offer insights into individual predictions, global explanation methods, those that characterize an entire class, remain underdeveloped. Existing global explainers rely on motif discovery in small graphs, an approach that breaks down in large, real-world settings where subgraph repetition is rare, node attributes are high-dimensional, and predictions arise from complex structure-attribute interactions. We propose GnnXemplar, a novel global explainer inspired from Exemplar Theory from cognitive science. GnnXemplar identifies representative nodes in the GNN embedding space, exemplars, and explains predictions using natural language rules derived from their neighborhoods. Exemplar selection is framed as a coverage maximization problem over reverse k-nearest neighbors, for which we provide an efficient greedy approximation. To derive interpretable rules, we employ a self-refining prompt strategy using large language models (LLMs). Experiments across diverse benchmarks show that GnnXemplar significantly outperforms existing methods in fidelity, scalability, and human interpretability, as validated by a user study with 60 participants.

[LG-65] MolPILE - large-scale diverse dataset for molecular representation learning

链接: https://arxiv.org/abs/2509.18353
作者: Jakub Adamczyk,Jakub Poziemski,Franciszek Job,Mateusz Król,Maciej Makowski
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-66] SBVR: Summation of BitVector Representation for Efficient LLM Quantization

链接: https://arxiv.org/abs/2509.18172
作者: Wonjun Bang,Jongseok Park,Hongseung Yu,Kyungmin Bin,Kyunghan Lee
类目: Machine Learning (cs.LG)
*备注: 9 pages, 4 figures

点击查看摘要

[LG-67] FedIA: A Plug-and-Play Importance-Aware Gradient Pruning Aggregation Method for Domain-Robust Federated Graph Learning on Node Classification

链接: https://arxiv.org/abs/2509.18171
作者: Zhanting Zhou,KaHou Tam,Zeqin Wu,Pengzhao Sun,Jinbo Wang,Fengli Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Federated Graph Learning (FGL) under domain skew – as observed on platforms such as \emphTwitch Gamers and multilingual \emphWikipedia networks – drives client models toward incompatible representations, rendering naive aggregation both unstable and ineffective. We find that the culprit is not the weighting scheme but the \emphnoisy gradient signal: empirical analysis of baseline methods suggests that a vast majority of gradient dimensions can be dominated by domain-specific variance. We therefore shift focus from “aggregation-first” to a \emphprojection-first strategy that denoises client updates \emphbefore they are combined. The proposed FedIA framework realises this \underlineImportance-\underlineAware idea through a two-stage, plug-and-play pipeline: (i) a server-side top- \rho mask keeps only the most informative about 5% of coordinates, and (ii) a lightweight influence-regularised momentum weight suppresses outlier clients. FedIA adds \emphno extra uplink traffic and only negligible server memory, making it readily deployable. On both homogeneous (Twitch Gamers) and heterogeneous (Wikipedia) graphs, it yields smoother, more stable convergence and higher final accuracy than nine strong baselines. A convergence sketch further shows that dynamic projection maintains the optimal \mathcalO(\sigma^2/\sqrtT) rate.

[LG-68] MobiGPT : A Foundation Model for Mobile Wireless Networks

链接: https://arxiv.org/abs/2509.18166
作者: Xiaoqian Qi,Haoye Chai,Yong Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:With the rapid development of mobile communication technologies, future mobile networks will offer vast services and resources for commuting, production, daily life, and entertainment. Accurate and efficient forecasting of mobile data (e.g., cell traffic, user behavior, channel quality) helps operators monitor network state changes, orchestrate wireless resources, and schedule infrastructure and users, thereby improving supply efficiency and service quality. However, current forecasting paradigms rely on customized designs with tailored models for exclusive data types. Such approaches increase complexity and deployment costs under large-scale, heterogeneous networks involving base stations, users, and channels. In this paper, we design a foundation model for mobile data forecasting, MobiGPT, with a unified structure capable of forecasting three data types: base station traffic, user app usage, and channel quality. We propose a soft-prompt learning method to help the model understand features of different data types, and introduce a temporal masking mechanism to guide the model through three forecasting tasks: short-term prediction, long-term prediction, and distribution generation, supporting diverse optimization scenarios. Evaluations on real-world datasets with over 100,000 samples show that MobiGPT achieves accurate multi-type forecasting. Compared to existing models, it improves forecasting accuracy by 27.37%, 20.08%, and 7.27%, reflecting strong generalization. Moreover, MobiGPT exhibits superior zero/few-shot performance in unseen scenarios, with over 21.51% improvement, validating its strong transferability as a foundation model.

[LG-69] DSFT: Inspiring Diffusion Large Language Models to Comprehend Mathematical and Logical Patterns

链接: https://arxiv.org/abs/2509.18164
作者: Ranfei Chen,Ming Chen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Diffusion large language models (dLLMs) have emerged as a new architecture following auto regressive models. Their denoising process offers a powerful generative advantage, but they present significant challenges in learning and understanding numerically sensitive mathematical and order-sensitive logical tasks. Current training methods, including pre-training, fine-tuning, and reinforcement learning, focus primarily on improving general knowledge retention and reasoning abilities, but lack a comprehensive understanding of mathematical and logical patterns. We propose DSFT, a simple yet effective Diffusion SFT strategy, by adjusting the masking strategy and loss function, guiding models to understand mathematical and logical patterns. This strategy can be flexibly combined with pre-training, reinforcement learning, and other training methods. Validated on models such as LLaDA and Dream series, we prove that DSFT on small-scale data can achieve improvements of 5-10% and approximately 2% on mathematical and logical problems, respectively. This inspiring masking approach offers insights for future learning of specific patterns, which can be easily and efficiently combined with other training methods and applied to various dLLMs. Our code is publicly available at this https URL

[LG-70] A Simple and Reproducible Hybrid Solver for a Truck-Drone VRP with Recharge

链接: https://arxiv.org/abs/2509.18162
作者: Meraryslan Meraliyev(1),Cemil Turan(1),Shirali Kadyrov(2) ((1) SDU University (2) New Uzbekistan University)
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study last-mile delivery with one truck and one drone under explicit battery management: the drone flies at twice the truck speed; each sortie must satisfy an endurance budget; after every delivery the drone recharges on the truck before the next launch. We introduce a hybrid reinforcement learning (RL) solver that couples an ALNS-based truck tour (with 2/3-opt and Or-opt) with a small pointer/attention policy that schedules drone sorties. The policy decodes launch–serve–rendezvous triplets with hard feasibility masks for endurance and post-delivery recharge; a fast, exact timeline simulator enforces launch/recovery handling and computes the true makespan used by masked greedy/beam decoding. On Euclidean instances with N=50 , E=0.7 , and R=0.1 , the method achieves an average makespan of \textbf5.203 \pm 0.093, versus \textbf5.349 \pm 0.038 for ALNS and \textbf5.208 \pm 0.124 for NN – i.e., \textbf2.73% better than ALNS on average and within \textbf0.10% of NN. Per-seed, the RL scheduler never underperforms ALNS on the same instance and ties or beats NN on two of three seeds. A decomposition of the makespan shows the expected truck–wait trade-off across heuristics; the learned scheduler balances both to minimize the total completion time. We provide a config-first implementation with plotting and significance-test utilities to support replication.

[LG-71] Learning Progression-Guided AI Evaluation of Scientific Models To Support Diverse Multi-Modal Understanding in NGSS Classroom

链接: https://arxiv.org/abs/2509.18157
作者: Leonora Kaldaras,Tingting Li,Prudence Djagba,Kevin Haudek,Joseph Krajcik
类目: Computers and Society (cs.CY); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Learning Progressions (LPs) can help adjust instruction to individual learners needs if the LPs reflect diverse ways of thinking about a construct being measured, and if the LP-aligned assessments meaningfully measure this diversity. The process of doing science is inherently multi-modal with scientists utilizing drawings, writing and other modalities to explain phenomena. Thus, fostering deep science understanding requires supporting students in using multiple modalities when explaining phenomena. We build on a validated NGSS-aligned multi-modal LP reflecting diverse ways of modeling and explaining electrostatic phenomena and associated assessments. We focus on students modeling, an essential practice for building a deep science understanding. Supporting culturally and linguistically diverse students in building modeling skills provides them with an alternative mode of communicating their understanding, essential for equitable science assessment. Machine learning (ML) has been used to score open-ended modeling tasks (e.g., drawings), and short text-based constructed scientific explanations, both of which are time- consuming to score. We use ML to evaluate LP-aligned scientific models and the accompanying short text-based explanations reflecting multi-modal understanding of electrical interactions in high school Physical Science. We show how LP guides the design of personalized ML-driven feedback grounded in the diversity of student thinking on both assessment modes.

[LG-72] A deep reinforcement learning platform for antibiotic discovery

链接: https://arxiv.org/abs/2509.18153
作者: Hanqun Cao,Marcelo D. T. Torres,Jingjie Zhang,Zijun Gao,Fang Wu,Chunbin Gu,Jure Leskovec,Yejin Choi,Cesar de la Fuente-Nunez,Guangyong Chen,Pheng-Ann Heng
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注: 42 pages, 16 figures

点击查看摘要

[LG-73] nsor Train Completion from Fiberwise Observations Along a Single Mode

链接: https://arxiv.org/abs/2509.18149
作者: Shakir Showkat Sofi,Lieven De Lathauwer
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Signal Processing (eess.SP); Optimization and Control (math.OC); Computation (stat.CO); Machine Learning (stat.ML)
*备注: Submitted to Numerical Algorithms (28 pages)

点击查看摘要

[LG-74] Comparative Analysis of FOLD-SE vs. FOLD-R in Binary Classification and XGBoost in Multi-Category Classification

链接: https://arxiv.org/abs/2509.18139
作者: Akshay Murthy,Shawn Sebastian,Manil Shangle,Huaduo Wang,Sopam Dasgupta,Gopal Gupta
类目: Machine Learning (cs.LG)
*备注: 7 pages

点击查看摘要

Abstract:Recently, the demand for Machine Learning (ML) models that can balance accuracy, efficiency, and interpreability has grown significantly. Traditionally, there has been a tradeoff between accuracy and explainability in predictive models, with models such as Neural Networks achieving high accuracy on complex datasets while sacrificing internal transparency. As such, new rule-based algorithms such as FOLD-SE have been developed that provide tangible justification for predictions in the form of interpretable rule sets. The primary objective of this study was to compare FOLD-SE and FOLD-R++, both rule-based classifiers, in binary classification and evaluate how FOLD-SE performs against XGBoost, a widely used ensemble classifier, when applied to multi-category classification. We hypothesized that because FOLD-SE can generate a condensed rule set in a more explainable manner, it would lose upwards of an average of 3 percent in accuracy and F1 score when compared with XGBoost and FOLD-R++ in multiclass and binary classification, respectively. The research used data collections for classification, with accuracy, F1 scores, and processing time as the primary performance measures. Outcomes show that FOLD-SE is superior to FOLD-R++ in terms of binary classification by offering fewer rules but losing a minor percentage of accuracy and efficiency in processing time; in tasks that involve multi-category classifications, FOLD-SE is more precise and far more efficient compared to XGBoost, in addition to generating a comprehensible rule set. The results point out that FOLD-SE is a better choice for both binary tasks and classifications with multiple categories. Therefore, these results demonstrate that rule-based approaches like FOLD-SE can bridge the gap between explainability and performance, highlighting their potential as viable alternatives to black-box models in diverse classification tasks.

[LG-75] Rank-Induced PL Mirror Descent: A Rank-Faithful Second-Order Algorithm for Sleeping Experts

链接: https://arxiv.org/abs/2509.18138
作者: Tiantian Zhang
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We introduce a new algorithm, \emphRank-Induced Plackett–Luce Mirror Descent (RIPLM), which leverages the structural equivalence between the \emphrank benchmark and the \emphdistributional benchmark established in \citetBergamOzcanHsu2022. Unlike prior approaches that operate on expert identities, RIPLM updates directly in the \emphrank-induced Plackett–Luce (PL) parameterization. This ensures that the algorithm’s played distributions remain within the class of rank-induced distributions at every round, preserving the equivalence with the rank benchmark. To our knowledge, RIPLM is the first algorithm that is both (i) \emphrank-faithful and (ii) \emphvariance-adaptive in the sleeping experts setting.

[LG-76] A Weighted Gradient Tracking Privacy-Preserving Method for Distributed Optimization

链接: https://arxiv.org/abs/2509.18134
作者: Furan Xie,Bing Liu,Li Chai
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:

点击查看摘要

[LG-77] Accounting for Uncertainty in Machine Learning Surrogates: A Gauss-Hermite Quadrature Approach to Reliability Analysis

链接: https://arxiv.org/abs/2509.18128
作者: Amirreza Tootchi,Xiaoping Du
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Machine learning surrogates are increasingly employed to replace expensive computational models for physics-based reliability analysis. However, their use introduces epistemic uncertainty from model approximation errors, which couples with aleatory uncertainty in model inputs, potentially compromising the accuracy of reliability predictions. This study proposes a Gauss-Hermite quadrature approach to decouple these nested uncertainties and enable more accurate reliability analysis. The method evaluates conditional failure probabilities under aleatory uncertainty using First and Second Order Reliability Methods and then integrates these probabilities across realizations of epistemic uncertainty. Three examples demonstrate that the proposed approach maintains computational efficiency while yielding more trustworthy predictions than traditional methods that ignore model uncertainty.

[LG-78] Prediction of Coffee Ratings Based On Influential Attributes Using SelectKBest and Optimal Hyperparameters

链接: https://arxiv.org/abs/2509.18124
作者: Edmund Agyemang,Lawrence Agbota,Vincent Agbenyeavu,Peggy Akabuah,Bismark Bimpong,Christopher Attafuah
类目: Machine Learning (cs.LG); Applications (stat.AP)
*备注: 13 pages, 6 figures and 4 tables

点击查看摘要

[LG-79] Energy-convergence trade off for the training of neural networks on bio-inspired hardware

链接: https://arxiv.org/abs/2509.18121
作者: Nikhil Garg,Paul Uriarte Vicandi,Yanming Zhang,Alexandre Baigol,Donato Francesco Falcone,Saketh Ram Mamidala,Bert Jan Offrein,Laura Bégon-Lours
类目: Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-80] Decentor-V: Lightweight ML Training on Low-Power RISC-V Edge Devices

链接: https://arxiv.org/abs/2509.18118
作者: Marcelo Ribeiro,Diogo Costa,Gonçalo Moreira,Sandro Pinto,Tiago Gomes
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR)
*备注:

点击查看摘要

[LG-81] Robust and continuous machine learning of usage habits to adapt digital interfaces to user needs

链接: https://arxiv.org/abs/2509.18117
作者: Eric Petit,Denis Chêne
类目: Machine Learning (cs.LG)
*备注: soumis {à} la conf{é}rence IHM 2025

点击查看摘要

Abstract:The paper presents a machine learning approach to design digital interfaces that can dynamically adapt to different users and usage strategies. The algorithm uses Bayesian statistics to model users’ browsing behavior, focusing on their habits rather than group preferences. It is distinguished by its online incremental learning, allowing reliable predictions even with little data and in the case of a changing environment. This inference method generates a task model, providing a graphical representation of navigation with the usage statistics of the current user. The algorithm learns new tasks while preserving prior knowledge. The theoretical framework is described, and simulations show the effectiveness of the approach in stationary and non-stationary environments. In conclusion, this research paves the way for adaptive systems that improve the user experience by helping them to better navigate and act on their interface.

[LG-82] owards Scalable and Structured Spatiotemporal Forecasting

链接: https://arxiv.org/abs/2509.18115
作者: Hongyi Chen,Xiucheng Li,Xinyang Chen,Jing Li,Kehai Chen,Liqiang Nie
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-83] A Study of Skews Imbalances and Pathological Conditions in LLM Inference Deployment on GPU Clusters detectable from DPU

链接: https://arxiv.org/abs/2509.18114
作者: Javed I. Khan an Henry Uwabor Moye
类目: Machine Learning (cs.LG)
*备注: 12 pages, Technical Report 2025-07-01, Internetworking and Media Communications Research Laboratories, Department of Computer Science, Kent State University

点击查看摘要

Abstract:Autoregressive inference in large transformer-based language models (LLMs) presents significant challenges for runtime efficiency, particularly during the decode phase where load imbalance across GPU shards can cause throughput degradation and latency spikes. A DPU-assisted framework leveraged by BlueField-3 Data Processing Units can enable real-time detection and mitigation of load imbalance in multi-node tensor-parallel inference. By offloading monitoring tasks to the DPU and analyzing GPU telemetry and inter-node communication patterns, the resulting system can provide actionable feedback to inference controllers and schedulers. The goal of this study is three-fold i) identify the reported skews/imbalances/pathological conditions that arise in muti-GPU execution of a) LLM tensor computing (both during training and inference), b) identify their impact on computational performance, and c) make a critical assessment if those can be tracked for potential mitigation from a DPU network.

[LG-84] Large language models surpass domain-specific architectures for antepartum electronic fetal monitoring analysis

链接: https://arxiv.org/abs/2509.18112
作者: Sheng Wong,Ravi Shankar,Beth Albert,Gabriel Davis Jones
类目: Machine Learning (cs.LG)
*备注: Preparing for journal

点击查看摘要

[LG-85] Machine Learning-Based Classification of Vessel Types in Straits Using AIS Tracks

链接: https://arxiv.org/abs/2509.18109
作者: Jonatan Katz Nielsen
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-86] AdaMixT: Adaptive Weighted Mixture of Multi-Scale Expert Transformers for Time Series Forecasting

链接: https://arxiv.org/abs/2509.18107
作者: Huanyao Zhang,Jiaye Lin,Wentao Zhang,Haitao Yuan,Guoliang Li
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Multivariate time series forecasting involves predicting future values based on historical observations. However, existing approaches primarily rely on predefined single-scale patches or lack effective mechanisms for multi-scale feature fusion. These limitations hinder them from fully capturing the complex patterns inherent in time series, leading to constrained performance and insufficient generalizability. To address these challenges, we propose a novel architecture named Adaptive Weighted Mixture of Multi-Scale Expert Transformers (AdaMixT). Specifically, AdaMixT introduces various patches and leverages both General Pre-trained Models (GPM) and Domain-specific Models (DSM) for multi-scale feature extraction. To accommodate the heterogeneity of temporal features, AdaMixT incorporates a gating network that dynamically allocates weights among different experts, enabling more accurate predictions through adaptive multi-scale fusion. Comprehensive experiments on eight widely used benchmarks, including Weather, Traffic, Electricity, ILI, and four ETT datasets, consistently demonstrate the effectiveness of AdaMixT in real-world scenarios.

[LG-87] Model-Based Transfer Learning for Real-Time Damage Assessment of Bridge Networks

链接: https://arxiv.org/abs/2509.18106
作者: Elisa Tomassini,Enrique García-Macías,Filippo Ubertini
类目: Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-88] Machine Learnability as a Measure of Order in Aperiodic Sequences

链接: https://arxiv.org/abs/2509.18103
作者: Jennifer Dodgson,Michael Joedhitya,Adith Ramdas,Surender Suresh Kumar,Adarsh Singh Chauhan,Akira Rafhael,Wang Mingshu,Nordine Lotfi
类目: Machine Learning (cs.LG); Number Theory (math.NT)
*备注:

点击查看摘要

[LG-89] A Gradient Flow Approach to Solving Inverse Problems with Latent Diffusion Models NEURIPS2025

链接: https://arxiv.org/abs/2509.19276
作者: Tim Y. J. Wang,O. Deniz Akyildiz
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
*备注: Accepted at the 2nd Workshop on Frontiers in Probabilistic Inference: Sampling Meets Learning, 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

点击查看摘要

[LG-90] Discovering strategies for coastal resilience with AI-based prediction and optimization

链接: https://arxiv.org/abs/2509.19263
作者: Jared Markowitz,Alexander New,Jennifer Sleeman,Chace Ashcraft,Jay Brett,Gary Collins,Stella In,Nathaniel Winstead
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-91] Recovering Wasserstein Distance Matrices from Few Measurements

链接: https://arxiv.org/abs/2509.19250
作者: Muhammad Rana,Abiy Tasissa,HanQin Cai,Yakov Gavriyelov,Keaton Hamm
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper proposes two algorithms for estimating square Wasserstein distance matrices from a small number of entries. These matrices are used to compute manifold learning embeddings like multidimensional scaling (MDS) or Isomap, but contrary to Euclidean distance matrices, are extremely costly to compute. We analyze matrix completion from upper triangular samples and Nyström completion in which \mathcalO(d\log(d)) columns of the distance matrices are computed where d is the desired embedding dimension, prove stability of MDS under Nyström completion, and show that it can outperform matrix completion for a fixed budget of sample distances. Finally, we show that classification of the OrganCMNIST dataset from the MedMNIST benchmark is stable on data embedded from the Nyström estimation of the distance matrix even when only 10% of the columns are computed.

[LG-92] Neighbor Embeddings Using Unbalanced Optimal Transport Metrics

链接: https://arxiv.org/abs/2509.19226
作者: Muhammad Rana,Keaton Hamm
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-93] CayleyPy Growth: Efficient growth computations and hundreds of new conjectures on Cayley graphs (Brief version)

链接: https://arxiv.org/abs/2509.19162
作者: A. Chervov,D. Fedoriaka,E. Konstantinova,A. Naumov,I. Kiselev,A. Sheveleva,I. Koltsov,S. Lytkin,A. Smolensky,A. Soibelman,F. Levkovich-Maslyuk,R. Grimov,D. Volovich,A. Isakov,A. Kostin,M. Litvinov,N. Vilkin-Krom,A. Bidzhiev,A. Krasnyi,M. Evseev,E. Geraseva,L. Grunwald,S. Galkin,E. Koldunov,S. Diner,A. Chevychelov,E. Kudasheva,A. Sychev,A. Kravchenko,Z. Kogan,A. Natyrova,L. Shishina,L. Cheldieva,V. Zamkovoy,D. Kovalenko,O. Papulov,S. Kudashev,D. Shiltsov,R. Turtayev,O. Nikitina,D. Mamayeva,S. Nikolenko,M. Obozov,A. Titarenko,A. Dolgorukova,A. Aparnev,O. Debeaupuis,S. Alami C.,H. Isambert
类目: Combinatorics (math.CO); Machine Learning (cs.LG); Group Theory (math.GR)
*备注: 46 pages, 30 figures

点击查看摘要

[LG-94] Quantum Annealing for Minimum Bisection Problem: A Machine Learning-based Approach for Penalty Parameter Tuning

链接: https://arxiv.org/abs/2509.19005
作者: Renáta Rusnáková,Martin Chovanec,Juraj Gazda
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:The Minimum Bisection Problem is a well-known NP-hard problem in combinatorial optimization, with practical applications in areas such as parallel computing, network design, and machine learning. In this paper, we examine the potential of using D-Wave Systems’ quantum annealing solvers to solve the Minimum Bisection Problem, which we formulate as a Quadratic Unconstrained Binary Optimization model. A key challenge in this formulation lies in choosing an appropriate penalty parameter, as it plays a crucial role in ensuring both the quality of the solution and the satisfaction of the problem’s constraints. To address this, we introduce a novel machine learning-based approach for adaptive tuning of the penalty parameter. Specifically, we use a Gradient Boosting Regressor model trained to predict suitable penalty parameter values based on structural properties of the input graph, the number of nodes and the graph’s density. This method enables the penalty parameter to be adjusted dynamically for each specific problem instance, improving the solver’s ability to balance the competing goals of minimizing the cut size and maintaining equally sized partitions. We test our approach on a large dataset of randomly generated Erdős-Rényi graphs with up to 4,000 nodes, and we compare the results with classical partitioning algorithms, Metis and Kernighan-Lin. Experimental findings demonstrate that our adaptive tuning strategy significantly improves the performance of the quantum annealing hybrid solver and consistently outperforms the classical methods used, indicating its potential as an alternative for the graph partitioning problem.

[LG-95] Bayesian Calibration and Model Assessment of Cell Migration Dynamics with Surrogate Model Integration

链接: https://arxiv.org/abs/2509.18998
作者: Christina Schenk,Jacobo Ayensa Jiménez,Ignacio Romero
类目: Analysis of PDEs (math.AP); Machine Learning (cs.LG); Cell Behavior (q-bio.CB); Quantitative Methods (q-bio.QM)
*备注: 31 pages, 13 figures, 1 table

点击查看摘要

[LG-96] On the Convergence of Policy Mirror Descent with Temporal Difference Evaluation

链接: https://arxiv.org/abs/2509.18822
作者: Jiacai Liu,Wenye Li,Ke Wei
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-97] Consistency of Selection Strategies for Fraud Detection

链接: https://arxiv.org/abs/2509.18739
作者: Christos Revelas,Otilia Boldea,Bas J.M. Werker
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:This paper studies how insurers can chose which claims to investigate for fraud. Given a prediction model, typically only claims with the highest predicted propability of being fraudulent are investigated. We argue that this can lead to inconsistent learning and propose a randomized alternative. More generally, we draw a parallel with the multi-arm bandit literature and argue that, in the presence of selection, the obtained observations are not iid. Hence, dependence on past observations should be accounted for when updating parameter estimates. We formalize selection in a binary regression framework and show that model updating and maximum-likelihood estimation can be implemented as if claims were investigated at random. Then, we define consistency of selection strategies and conjecture sufficient conditions for consistency. Our simulations suggest that the often-used selection strategy can be inconsistent while the proposed randomized alternative is consistent. Finally, we compare our randomized selection strategy with Thompson sampling, a standard multi-arm bandit heuristic. Our simulations suggest that the latter can be inefficient in learning low fraud probabilities.

[LG-98] Learning When to Restart: Nonstationary Newsvendor from Uncensored to Censored Demand

链接: https://arxiv.org/abs/2509.18709
作者: Xin Chen,Jiameng Lyu,Shilin Yuan,Yuan Zhou
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:We study nonstationary newsvendor problems under nonparametric demand models and general distributional measures of nonstationarity, addressing the practical challenges of unknown degree of nonstationarity and demand censoring. We propose a novel distributional-detection-and-restart framework for learning in nonstationary environments, and instantiate it through two efficient algorithms for the uncensored and censored demand settings. The algorithms are fully adaptive, requiring no prior knowledge of the degree and type of nonstationarity, and offer a flexible yet powerful approach to handling both abrupt and gradual changes in nonstationary environments. We establish a comprehensive optimality theory for our algorithms by deriving matching regret upper and lower bounds under both general and refined structural conditions with nontrivial proof techniques that are of independent interest. Numerical experiments using real-world datasets, including nurse staffing data for emergency departments and COVID-19 test demand data, showcase the algorithms’ superior and robust empirical performance. While motivated by the newsvendor problem, the distributional-detection-and-restart framework applies broadly to a wide class of nonstationary stochastic optimization problems. Managerially, our framework provides a practical, easy-to-deploy, and theoretically grounded solution for decision-making under nonstationarity.

[LG-99] Scalable bayesian shadow tomography for quantum property estimation with set transformers

链接: https://arxiv.org/abs/2509.18674
作者: Hyunho Cha,Wonjung Kim,Jungwoo Lee
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 33 pages, 9 figures

点击查看摘要

Abstract:A scalable Bayesian machine learning framework is introduced for estimating scalar properties of an unknown quantum state from measurement data, which bypasses full density matrix reconstruction. This work is the first to integrate the classical shadows protocol with a permutation-invariant set transformer architecture, enabling the approach to predict and correct bias in existing estimators to approximate the true Bayesian posterior mean. Measurement outcomes are encoded as fixed-dimensional feature vectors, and the network outputs a residual correction to a baseline estimator. Scalability to large quantum systems is ensured by the polynomial dependence of input size on system size and number of measurements. On Greenberger-Horne-Zeilinger state fidelity and second-order Rényi entropy estimation tasks – using random Pauli and random Clifford measurements – this Bayesian estimator always achieves lower mean squared error than classical shadows alone, with more than a 99% reduction in the few copy regime.

[LG-100] Re-uploading quantum data: A universal function approximator for quantum inputs

链接: https://arxiv.org/abs/2509.18530
作者: Hyunho Cha,Daniel K. Park,Jungwoo Lee
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注: 24 pages, 11 figures

点击查看摘要

Abstract:Quantum data re-uploading has proved powerful for classical inputs, where repeatedly encoding features into a small circuit yields universal function approximation. Extending this idea to quantum inputs remains underexplored, as the information contained in a quantum state is not directly accessible in classical form. We propose and analyze a quantum data re-uploading architecture in which a qubit interacts sequentially with fresh copies of an arbitrary input state. The circuit can approximate any bounded continuous function using only one ancilla qubit and single-qubit measurements. By alternating entangling unitaries with mid-circuit resets of the input register, the architecture realizes a discrete cascade of completely positive and trace-preserving maps, analogous to collision models in open quantum system dynamics. Our framework provides a qubit-efficient and expressive approach to designing quantum machine learning models that operate directly on quantum data.

[LG-101] Estimating Heterogeneous Causal Effect on Networks via Orthogonal Learning

链接: https://arxiv.org/abs/2509.18484
作者: Yuanchen Wu,Yubai Yuan
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-102] End-Cut Preference in Survival Trees

链接: https://arxiv.org/abs/2509.18477
作者: Xiaogang Su
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注: 24 pages, 2 figures

点击查看摘要

[LG-103] Zero-Shot Transferable Solution Method for Parametric Optimal Control Problems

链接: https://arxiv.org/abs/2509.18404
作者: Xingjian Li,Kelvin Kan,Deepanshu Verma,Krishna Kumar,Stanley Osher,Ján Drgoňa
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注: 8 pages, 6 figures, 3 tables

点击查看摘要

[LG-104] Measurement Score-Based MRI Reconstruction with Automatic Coil Sensitivity Estimation

链接: https://arxiv.org/abs/2509.18402
作者: Tingjun Liu,Chicago Y. Park,Yuyang Hu,Hongyu An,Ulugbek S. Kamilov
类目: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
*备注: 7 pages, 2 figures. Equal contribution: Tingjun Liu and Chicago Y. Park

点击查看摘要

[LG-105] Statistical Insight into Meta-Learning via Predictor Subspace Characterization and Quantification of Task Diversity

链接: https://arxiv.org/abs/2509.18349
作者: Saptati Datta,Nicolas W. Hengartner,Yulia Pimonova,Natalie E. Klein,Nicholas Lubbers
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:

点击查看摘要

Abstract:Meta-learning has emerged as a powerful paradigm for leveraging information across related tasks to improve predictive performance on new tasks. In this paper, we propose a statistical framework for analyzing meta-learning through the lens of predictor subspace characterization and quantification of task diversity. Specifically, we model the shared structure across tasks using a latent subspace and introduce a measure of diversity that captures heterogeneity across task-specific predictors. We provide both simulation-based and theoretical evidence indicating that achieving the desired prediction accuracy in meta-learning depends on the proportion of predictor variance aligned with the shared subspace, as well as on the accuracy of subspace estimation.

[LG-106] On Multi-entity Multivariate Quickest Change Point Detection

链接: https://arxiv.org/abs/2509.18310
作者: Bahar Kor,Bipin Gaikwad,Abani Patra,Eric L. Miller
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Applications (stat.AP); Methodology (stat.ME)
*备注:

点击查看摘要

[LG-107] Joint Cooperative and Non-Cooperative Localization in WSNs with Distributed Scaled Proximal ADMM Algorithms

链接: https://arxiv.org/abs/2509.18213
作者: Qiaojia Zhu,Xiaojing Shen,Haiqi Liu,Pramod K. Varshney
类目: Optimization and Control (math.OC); Machine Learning (cs.LG)
*备注:

点击查看摘要

[LG-108] Surrogate Modelling of Proton Dose with Monte Carlo Dropout Uncertainty Quantification

链接: https://arxiv.org/abs/2509.18155
作者: Aaron Pim,Tristan Pryer
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an); Applications (stat.AP)
*备注: 21 pages, 23 figures

点击查看摘要

[LG-109] Pareto-optimal Tradeoffs Between Communication and Computation with Flexible Gradient Tracking

链接: https://arxiv.org/abs/2509.18129
作者: Yan Huang,Jinming Xu,Li Chai,Jiming Chen,Karl H. Johansson
类目: Optimization and Control (math.OC); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 25 pages

点击查看摘要

信息检索

[IR-0] A Knowledge Graph and a Tripartite Evaluation Framework Make Retrieval-Augmented Generation Scalable and Transparent

链接: https://arxiv.org/abs/2509.19209
作者: Olalekan K. Akindele,Bhupesh Kumar Mishra,Kenneth Y. Wertheim
类目: Information Retrieval (cs.IR)
*备注: 25 Pages

点击查看摘要

Abstract:Large Language Models (LLMs) have significantly enhanced conversational Artificial Intelligence(AI) chatbots; however, domain-specific accuracy and the avoidance of factual inconsistencies remain pressing challenges, particularly for large datasets. Designing an effective chatbot with appropriate methods and evaluating its effectiveness is among the challenges in this domain. This study presents a Retrieval Augmented Generation (RAG) chatbot that harnesses a knowledge graph and vector search retrieval to deliver precise, context-rich responses in an exemplary use case from over high-volume engineering project-related emails, thereby minimising the need for document chunking. A central innovation of this work is the introduction of RAG Evaluation (RAG-Eval), a novel chain-of-thought LLM-based tripartite evaluation framework specifically developed to assess RAG applications. This framework operates in parallel with the chatbot, jointly assessing the user’s query, the retrieved document, and the generated response, enabling a holistic evaluation across multiple quality metrics like query relevance, factual accuracy, coverage, coherence and fluency. The resulting scoring system is provided directly to users as a confidence score (1 to 100%), enabling quick identification of possible misaligned or incomplete answers. This proposed approach promotes transparency and rapid verification by incorporating metadata email IDs, timestamps into responses. Experimental comparisons against BERTScore and G-EVAL for summarisation evaluation tasks confirm its effectiveness, and empirical analysis also shows RAG-Eval reliably detects factual gaps and query mismatches, thereby fostering trust in high demand, data centric environments. These findings highlight a scalable path for developing accurate, user-verifiable chatbots that bridge the gap between high-level conversational fluency and factual accuracy.

[IR-1] RELATE: Relation Extraction in Biomedical Abstracts with LLM s and Ontology Constraints

链接: https://arxiv.org/abs/2509.19057
作者: Olawumi Olasunkanmi,Mathew Satursky,Hong Yi,Chris Bizon,Harlin Lee,Stanley Ahalt
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:Biomedical knowledge graphs (KGs) are vital for drug discovery and clinical decision support but remain incomplete. Large language models (LLMs) excel at extracting biomedical relations, yet their outputs lack standardization and alignment with ontologies, limiting KG integration. We introduce RELATE, a three-stage pipeline that maps LLM-extracted relations to standardized ontology predicates using ChemProt and the Biolink Model. The pipeline includes: (1) ontology preprocessing with predicate embeddings, (2) similarity-based retrieval enhanced with SapBERT, and (3) LLM-based reranking with explicit negation handling. This approach transforms relation extraction from free-text outputs to structured, ontology-constrained representations. On the ChemProt benchmark, RELATE achieves 52% exact match and 94% accuracy@10, and in 2,400 HEAL Project abstracts, it effectively rejects irrelevant associations (0.4%) and identifies negated assertions. RELATE captures nuanced biomedical relationships while ensuring quality for KG augmentation. By combining vector search with contextual LLM reasoning, RELATE provides a scalable, semantically accurate framework for converting unstructured biomedical literature into standardized KGs.

[IR-2] Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation

链接: https://arxiv.org/abs/2509.18807
作者: Christian Ganhör,Marta Moscati,Anna Hausberger,Shah Nawaz,Markus Schedl
类目: Information Retrieval (cs.IR)
*备注: Accepted by ACM Transactions on Recommender Systems (TORS)

点击查看摘要

Abstract:Traditional recommender systems rely on collaborative filtering, using past user-item interactions to help users discover new items in a vast collection. In cold start, i.e., when interaction histories of users or items are not available, content-based recommender systems use side information instead. Hybrid recommender systems (HRSs) often employ multimodal learning to combine collaborative and side information, which we jointly refer to as modalities. Though HRSs can provide recommendations when some modalities are missing, their quality degrades. In this work, we utilize single-branch neural networks equipped with weight sharing, modality sampling, and contrastive loss to provide accurate recommendations even in missing modality scenarios by narrowing the modality gap. We compare these networks with multi-branch alternatives and conduct extensive experiments on three datasets. Six accuracy-based and four beyond-accuracy-based metrics help assess the recommendation quality for the different training paradigms and their hyperparameters in warm-start and missing modality scenarios. We quantitatively and qualitatively study the effects of these different aspects on bridging the modality gap. Our results show that single-branch networks achieve competitive performance in warm-start scenarios and are significantly better in missing modality settings. Moreover, our approach leads to closer proximity of an item’s modalities in the embedding space. Our full experimental setup is available at this https URL.

[IR-3] Robust Denoising Neural Reranker for Recommender Systems

链接: https://arxiv.org/abs/2509.18736
作者: Wenyu Mao,Shuchang Liu,Hailan Yang,Xiaobei Wang,Xiaoyu Yang,Xu Gao,Xiang Li,Lantao Hu,Han Li,Kun Gai,An Zhang,Xiang Wang
类目: Information Retrieval (cs.IR)
*备注:

点击查看摘要

Abstract:For multi-stage recommenders in industry, a user request would first trigger a simple and efficient retriever module that selects and ranks a list of relevant items, then calls a slower but more sophisticated deep reranking model that refines the item arrangement before exposure to the user. The latter model typically reranks the item list conditioned on the user’s history content and the initial ranking from retrievers. Although this two-stage retrieval-ranking framework demonstrates practical effectiveness, the significance of retriever scores from the previous stage has been limitedly explored, which is informative. In this work, we first theoretically analyze the limitations of using retriever scores as the rerankers’ input directly and argue that the reranking task is essentially a noise reduction problem from the retriever scores. Following this notion, we derive an adversarial framework, DNR, that associates the denoising reranker with a carefully designed noise generation module. We extend the conventional score error minimization term with three augmented objectives, including: 1) a denoising objective that aims to denoise the noisy retriever scores to align with the user feedback; 2) an adversarial retriever score generation objective that improves the exploration in the retriever score space; and 3) a distribution regularization term that aims to align the distribution of generated noisy retriever scores with the real ones. Extensive experiments are conducted on three public datasets, together with analytical support, validating the effectiveness of the proposed DNR.

[IR-4] BloomIntent: Automating Search Evaluation with LLM -Generated Fine-Grained User Intents

链接: https://arxiv.org/abs/2509.18641
作者: Yoonseo Choi,Eunhye Kim,Hyunwoo Kim,Donghyun Park,Honggu Lee,Jinyoung Kim,Juho Kim
类目: Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
*备注: Accepted to UIST 2025; 34 pages (including 18 pages of Appendix)

点击查看摘要

Abstract:If 100 people issue the same search query, they may have 100 different goals. While existing work on user-centric AI evaluation highlights the importance of aligning systems with fine-grained user intents, current search evaluation methods struggle to represent and assess this diversity. We introduce BloomIntent, a user-centric search evaluation method that uses user intents as the evaluation unit. BloomIntent first generates a set of plausible, fine-grained search intents grounded on taxonomies of user attributes and information-seeking intent types. Then, BloomIntent provides an automated evaluation of search results against each intent powered by large language models. To support practical analysis, BloomIntent clusters semantically similar intents and summarizes evaluation outcomes in a structured interface. With three technical evaluations, we showed that BloomIntent generated fine-grained, evaluable, and realistic intents and produced scalable assessments of intent-level satisfaction that achieved 72% agreement with expert evaluators. In a case study (N=4), we showed that BloomIntent supported search specialists in identifying intents for ambiguous queries, uncovering underserved user needs, and discovering actionable insights for improving search experiences. By shifting from query-level to intent-level evaluation, BloomIntent reimagines how search systems can be assessed – not only for performance but for their ability to serve a multitude of user goals.

[IR-5] Scalable Evaluation for Audio Identification via Synthetic Latent Fingerprint Generation ICASSP

链接: https://arxiv.org/abs/2509.18620
作者: Aditya Bhattacharjee,Marco Pasini,Emmanouil Benetos
类目: ound (cs.SD); Information Retrieval (cs.IR); Audio and Speech Processing (eess.AS)
*备注: Under review for International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, 2026

点击查看摘要

Abstract:The evaluation of audio fingerprinting at a realistic scale is limited by the scarcity of large public music databases. We present an audio-free approach that synthesises latent fingerprints which approximate the distribution of real fingerprints. Our method trains a Rectified Flow model on embeddings extracted by pre-trained neural audio fingerprinting systems. The synthetic fingerprints generated using our system act as realistic distractors and enable the simulation of retrieval performance at a large scale without requiring additional audio. We assess the fidelity of synthetic fingerprints by comparing the distributions to real data. We further benchmark the retrieval performances across multiple state-of-the-art audio fingerprinting frameworks by augmenting real reference databases with synthetic distractors, and show that the scaling trends obtained with synthetic distractors closely track those obtained with real distractors. Finally, we scale the synthetic distractor database to model retrieval performance for very large databases, providing a practical metric of system scalability that does not depend on access to audio corpora.

[IR-6] Understand your Users An Ensemble Learning Framework for Natural Noise Filtering in Recommender Systems

链接: https://arxiv.org/abs/2509.18560
作者: Clarita Hawat,Wissam Al Jurdi,Jacques Bou Abdo,Jacques Demerjian,Abdallah Makhoul
类目: Information Retrieval (cs.IR)
*备注: 32 pages

点击查看摘要

Abstract:The exponential growth of web content is a major key to the success for Recommender Systems. This paper addresses the challenge of defining noise, which is inherently related to variability in human preferences and behaviors. In classifying changes in user tendencies, we distinguish three kinds of phenomena: external factors that directly influence users’ sentiment, serendipity causing unexpected preference, and incidental interaction perceived as noise. To overcome these problems, we present a new framework that identifies noisy ratings. In this context, the proposed framework is modular, consisting of three layers: known natural noise algorithms for item classification, an Ensemble learning model for refined evaluation of the items and signature-based noise identification. We further advocate the metrics that quantitatively assess serendipity and group validation, offering higher robustness in recommendation accuracy. Our approach aims to provide a cleaner training dataset that would inherently improve user satisfaction and engagement with Recommender Systems.

附件下载

点击下载今日全部论文列表