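This post lists the latest papers retrieved from arXiv.org on 2025-09-18. It is updated automatically and groups papers into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.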

Note: the daily paper data comes from arXiv.org and is refreshed automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

Contents

Overview (2025-09-18)

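Today's update covers 501 papers, including:

  • Natural Language Processing: 61 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 117 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 98 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 122 papers (Machine Learning (cs.LG))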

Natural Language Processing

[NLP-0] Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

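[Quick read]: This paper addresses two systemic shortcomings of today's open large language model (LLM) ecosystem: weak data compliance and thin multilingual representation. On compliance, the Apertus models are pretrained exclusively on openly available data, with a reproducible data pipeline that respects content-owner rights and filters non-permissive, toxic, and personally identifiable content. On multilinguality, they are trained on 15T tokens from over 1800 languages, with about 40% of the pretraining data allocated to non-English content. The two key ingredients are (1) the Goldfish objective, which strongly suppresses verbatim recall of training data while retaining downstream task performance, and (2) the open release of all scientific artifacts from the development cycle (data preparation scripts, checkpoints, evaluation suites, and training code), enabling transparent audit and extension.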

Link: https://arxiv.org/abs/2509.14233
Authors: Alejandro Hernández-Cano,Alexander Hägele,Allen Hao Huang,Angelika Romanou,Antoni-Joan Solergibert,Barna Pasztor,Bettina Messmer,Dhia Garbaya,Eduard Frank Ďurech,Ido Hakimi,Juan García Giraldo,Mete Ismayilzada,Negar Foroutan,Skander Moalla,Tiancheng Chen,Vinko Sabolčec,Yixuan Xu,Michael Aerni,Badr AlKhamissi,Ines Altemir Marinas,Mohammad Hossein Amani,Matin Ansaripour,Ilia Badanin,Harold Benoit,Emanuela Boros,Nicholas Browning,Fabian Bösch,Maximilian Böther,Niklas Canova,Camille Challier,Clement Charmillot,Jonathan Coles,Jan Deriu,Arnout Devos,Lukas Drescher,Daniil Dzenhaliou,Maud Ehrmann,Dongyang Fan,Simin Fan,Silin Gao,Miguel Gila,María Grandury,Diba Hashemi,Alexander Hoyle,Jiaming Jiang,Mark Klein,Andrei Kucharavy,Anastasiia Kucherenko,Frederike Lübeck,Roman Machacek,Theofilos Manitaras,Andreas Marfurt,Kyle Matoba,Simon Matrenok,Henrique Mendonça,Fawzi Roberto Mohamed,Syrielle Montariol,Luca Mouchel,Sven Najem-Meyer,Jingwei Ni,Gennaro Oliva,Matteo Pagliardini,Elia Palme,Andrei Panferov,Léo Paoletti,Marco Passerini,Ivan Pavlov,Auguste Poiroux,Kaustubh Ponkshe,Nathan Ranchin,Javi Rando,Mathieu Sauser,Jakhongir Saydaliev,Muhammad Ali Sayfiddinov,Marian Schneider,Stefano Schuppli,Marco Scialanga,Andrei Semenov,Kumar Shridhar,Raghav Singhal,Anna Sotnikova,Alexander Sternfeld,Ayush Kumar Tarun,Paul Teiletche,Jannis Vamvas,Xiaozhe Yao,Hao Zhao,Alexander Ilic,Ana Klimovic,Andreas Krause,Caglar Gulcehre,David Rosenthal,Elliott Ash,Florian Tramèr,Joost VandeVondele,Livio Veraldi,Martin Rajman,Thomas Schulthess,Torsten Hoefler,Antoine Bosselut,Martin Jaggi,Imanol Schlag
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract: We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today’s open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
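The memorization-suppressing idea behind the Goldfish objective can be illustrated with a minimal sketch: a subset of token positions is excluded from the next-token loss, so the model never receives a training signal to reproduce a passage in full. The static "drop every k-th token" rule, the value k = 4, and the helper names `goldfish_mask` and `masked_nll` are assumptions for illustration; this digest does not state which masking scheme Apertus uses.

```python
import numpy as np

def goldfish_mask(num_tokens: int, k: int = 4) -> np.ndarray:
    """Keep-mask that excludes every k-th token from the loss (assumed static variant)."""
    positions = np.arange(num_tokens)
    return (positions % k) != (k - 1)

def masked_nll(token_log_probs: np.ndarray, keep: np.ndarray) -> float:
    """Mean negative log-likelihood over the kept token positions only."""
    return float(-token_log_probs[keep].mean())

# Toy per-token log-probabilities for an 8-token sequence (made-up values).
log_probs = np.log(np.array([0.9, 0.8, 0.7, 0.1, 0.9, 0.8, 0.7, 0.1]))
keep = goldfish_mask(len(log_probs), k=4)
loss = masked_nll(log_probs, keep)
```

Positions 3 and 7 contribute nothing to `loss`, so their poorly predicted tokens are never pushed toward the training text.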

[NLP-1] Language models activations linearly encode training-order recency

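[Quick read]: This paper asks whether language models encode the order in which information was learned during training, that is, whether activations alone distinguish earlier-learned from later-learned information. The authors sequentially fine-tune Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. The average activations of test samples from the six datasets, projected into a 2D subspace, lie on a straight line in exactly the order of training. Linear probes distinguish "early" from "late" entities with about 90% accuracy, and the model can be fine-tuned to report the training stage of an unseen entity with about 80% accuracy. The signal does not seem attributable to simple differences in activation magnitudes, losses, or model confidence, which suggests that models can differentiate information by its acquisition time, with implications for how they handle conflicting data and knowledge modifications.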

Link: https://arxiv.org/abs/2509.14223
Authors: Dmitrii Krasheninnikov,Richard E. Turner,David Krueger
Affiliations: University of Cambridge; Mila, University of Montreal
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:We show that language models’ activations linearly encode when information was learned during training. Our setup involves creating a model with a known training order by sequentially fine-tuning Llama-3.2-1B on six disjoint but otherwise similar datasets about named entities. We find that the average activations of test samples for the six training datasets encode the training order: when projected into a 2D subspace, these centroids are arranged exactly in the order of training and lie on a straight line. Further, we show that linear probes can accurately (~90%) distinguish “early” vs. “late” entities, generalizing to entities unseen during the probes’ own training. The model can also be fine-tuned to explicitly report an unseen entity’s training stage (~80% accuracy). Interestingly, this temporal signal does not seem attributable to simple differences in activation magnitudes, losses, or model confidence. Our paper demonstrates that models are capable of differentiating information by its acquisition time, and carries significant implications for how they might manage conflicting data and respond to knowledge modifications.
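To make the probing setup concrete, here is a small sketch on synthetic data. It assumes that each training stage shifts the activations along a single hidden direction, which is a deliberately simplified stand-in for the real Llama-3.2-1B activations. The dimensions, the drift size, and the least-squares probe are illustrative choices, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, per_stage, stages = 32, 200, 6
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)

# Synthetic "activations": unit Gaussian noise plus a stage-dependent drift.
X = np.concatenate(
    [rng.normal(size=(per_stage, dim)) + 2.0 * s * direction for s in range(stages)]
)
stage = np.repeat(np.arange(stages), per_stage)
y = (stage >= stages // 2).astype(float)  # 0 = "early", 1 = "late"

# Linear probe fitted by least squares on one half, evaluated on the other half.
idx = rng.permutation(len(X))
train, test = idx[: len(X) // 2], idx[len(X) // 2 :]
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
w, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
accuracy = float(((Xb[test] @ w > 0.5) == (y[test] == 1.0)).mean())

# Per-stage centroids projected onto the probe direction.
centroids = np.stack([X[stage == s].mean(axis=0) for s in range(stages)])
projections = centroids @ w[:-1]
```

In this toy setting the centroid projections come out ordered by stage, mirroring the straight-line arrangement the abstract describes.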

[NLP-2] GEM-Bench: A Benchmark for Ad-Injected Response Generation within Generative Engine Marketing

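[Quick read]: This paper addresses the lack of a dedicated benchmark for generating and evaluating ad-injected responses in Generative Engine Marketing (GEM). Existing benchmarks are not designed for the GEM setting, which limits research in the area. The authors propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. It comprises three curated datasets covering chatbot and search scenarios, a metric ontology capturing multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented in an extensible multi-agent framework. Preliminary results show that simple prompt-based methods achieve reasonable engagement (for example click-through rate) but often reduce user satisfaction, whereas approaches that insert ads into pre-generated ad-free responses mitigate this at the cost of additional overhead, underscoring the need for more effective and efficient ad-injection methods.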

Link: https://arxiv.org/abs/2509.14221
Authors: Silan Hu,Shiqi Zhang,Yimin Shi,Xiaokui Xiao
Affiliations: National University of Singapore; PyroWis AI
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:Generative Engine Marketing (GEM) is an emerging ecosystem for monetizing generative engines, such as LLM-based chatbots, by seamlessly integrating relevant advertisements into their responses. At the core of GEM lies the generation and evaluation of ad-injected responses. However, existing benchmarks are not specifically designed for this purpose, which limits future research. To address this gap, we propose GEM-Bench, the first comprehensive benchmark for ad-injected response generation in GEM. GEM-Bench includes three curated datasets covering both chatbot and search scenarios, a metric ontology that captures multiple dimensions of user satisfaction and engagement, and several baseline solutions implemented within an extensible multi-agent framework. Our preliminary results indicate that, while simple prompt-based methods achieve reasonable engagement such as click-through rate, they often reduce user satisfaction. In contrast, approaches that insert ads based on pre-generated ad-free responses help mitigate this issue but introduce additional overhead. These findings highlight the need for future research on designing more effective and efficient solutions for generating ad-injected responses in GEM.

[NLP-3] Dense Video Understanding with Gated Residual Tokenization

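[Quick read]: This paper targets two problems that video large language models (VLLMs) face on high-frame-rate (high-FPS) video. First, low-frame-rate sampling (uniform sampling or keyframe selection) discards dense temporal information, which hurts tasks such as lecture comprehension that need precise temporal alignment. Second, tokenizing every frame causes redundant computation and linear growth in token count, limiting efficiency and scalability. The proposed solution is Gated Residual Tokenization (GRT), a two-stage framework. Stage one, Motion-Compensated Inter-Gated Tokenization, uses pixel-level motion estimation to skip static regions, giving sub-linear growth in token count and compute. Stage two, Semantic-Scene Intra-Tokenization Merging, fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. On the newly proposed DIVE benchmark for dense temporal reasoning, GRT outperforms larger VLLM baselines and scales positively with FPS.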

Link: https://arxiv.org/abs/2509.14199
Authors: Haichao Zhang,Wenhao Chai,Shwai He,Ang Li,Yun Fu
Affiliations: Northeastern University; Princeton University; University of Maryland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:High temporal resolution is essential for capturing fine-grained details in video understanding. However, current video large language models (VLLMs) and benchmarks mostly rely on low-frame-rate sampling, such as uniform sampling or keyframe selection, discarding dense temporal information. This compromise avoids the high cost of tokenizing every frame, which otherwise leads to redundant computation and linear token growth as video length increases. While this trade-off works for slowly changing content, it fails for tasks like lecture comprehension, where information appears in nearly every frame and requires precise temporal alignment. To address this gap, we introduce Dense Video Understanding (DVU), which enables high-FPS video comprehension by reducing both tokenization time and token overhead. Existing benchmarks are also limited, as their QA pairs focus on coarse content changes. We therefore propose DIVE (Dense Information Video Evaluation), the first benchmark designed for dense temporal reasoning. To make DVU practical, we present Gated Residual Tokenization (GRT), a two-stage framework: (1) Motion-Compensated Inter-Gated Tokenization uses pixel-level motion estimation to skip static regions during tokenization, achieving sub-linear growth in token count and compute. (2) Semantic-Scene Intra-Tokenization Merging fuses tokens across static regions within a scene, further reducing redundancy while preserving dynamic semantics. Experiments on DIVE show that GRT outperforms larger VLLM baselines and scales positively with FPS. These results highlight the importance of dense temporal information and demonstrate that GRT enables efficient, scalable high-FPS video understanding.
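The gating idea in stage (1) can be sketched as follows: only patches that changed since the previous frame are sent to the tokenizer. Plain frame differencing stands in here for the pixel-level motion estimation the abstract mentions, and the patch size, threshold, and function name `changed_patches` are assumptions made for the example.

```python
import numpy as np

def changed_patches(prev: np.ndarray, curr: np.ndarray,
                    patch: int = 4, thresh: float = 0.05) -> np.ndarray:
    """Boolean grid marking patches whose mean absolute change exceeds thresh."""
    h, w = curr.shape
    diff = np.abs(curr - prev)
    # Group pixels into (patch x patch) blocks and average within each block.
    grid = diff.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    return grid > thresh

frames = np.zeros((3, 16, 16))
frames[1, 0:4, 0:4] = 1.0  # one patch changes in frame 1
frames[2, 0:4, 0:4] = 1.0  # and stays unchanged in frame 2

tokens_per_frame = [16]  # the first frame is tokenized in full (4 x 4 patches)
for t in range(1, len(frames)):
    tokens_per_frame.append(int(changed_patches(frames[t - 1], frames[t]).sum()))
```

Because static patches are skipped, the token count grows with the amount of change in the video rather than with the number of frames.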

[NLP-4] Framing Migration: A Computational Analysis of UK Parliamentary Discourse

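[Quick read]: This paper tackles fine-grained analysis of migration-related discourse in large political and historical corpora, in particular identifying and tracking stances and narrative frames across parties and over time. The approach uses open-weight LLMs to annotate statements with high-level stances, and adds a semi-automated framework that extracts fine-grained narrative frames from UK parliamentary debates. This captures trends such as a shift from integration-oriented frames toward securitised narratives, and the replacement of discussions of national law by international law and human rights.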

Link: https://arxiv.org/abs/2509.14197
Authors: Vahid Ghafouri,Robert McNeil,Teodor Yankov,Madeleine Sumption,Luc Rocher,Scott A. Hale,Adam Mahdi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:We present a large-scale computational analysis of migration-related discourse in UK parliamentary debates spanning over 75 years and compare it with US congressional discourse. Using open-weight LLMs, we annotate each statement with high-level stances toward migrants and track the net tone toward migrants across time and political parties. For the UK, we extend this with a semi-automated framework for extracting fine-grained narrative frames to capture nuances of migration discourse. Our findings show that, while US discourse has grown increasingly polarised, UK parliamentary attitudes remain relatively aligned across parties, with a persistent ideological gap between Labour and the Conservatives, reaching its most negative level in 2025. The analysis of narrative frames in the UK parliamentary statements reveals a shift toward securitised narratives such as border control and illegal immigration, while longer-term integration-oriented frames such as social integration have declined. Moreover, discussions of national law about immigration have been replaced over time by international law and human rights, revealing nuances in discourse trends. Taken together broadly, our findings demonstrate how LLMs can support scalable, fine-grained discourse analysis in political and historical contexts.

[NLP-5] Synthesizing Behaviorally-Grounded Reasoning Chains: A Data-Generation Framework for Personal Finance LLMs

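[Quick read]: This paper addresses the limited accuracy and usefulness of personalized financial advice that fails to account for user goals, constraints, risk tolerance, and jurisdiction. Prior LLM work has focused on support systems for investors and financial planners, while recent agentic pipelines for broader personal finance tasks (budgeting, debt management, retirement, and estate planning) incur high maintenance costs and yield less than 25% of their expected financial returns. The authors propose a novel, reproducible framework that combines relevant financial context with behavioral finance studies to construct supervision data for end-to-end advisors. With it they build a 19k-sample reasoning dataset and fine-tune Qwen-3-8B on it. On a held-out test split and in a blind LLM-jury study, the 8B model performs comparably to significantly larger baselines (14-32B parameters) on factual accuracy, fluency, and personalization, at 80% lower cost.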

Link: https://arxiv.org/abs/2509.14180
Authors: Akhil Theerthala
Affiliations: Perfios Software Solutions
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 24 pages, 11 figures. The paper presents a novel framework for generating a personal finance dataset. The resulting fine-tuned model and dataset are publicly available

Abstract:Personalized financial advice requires consideration of user goals, constraints, risk tolerance, and jurisdiction. Prior LLM work has focused on support systems for investors and financial planners. Simultaneously, numerous recent studies examine broader personal finance tasks, including budgeting, debt management, retirement, and estate planning, through agentic pipelines that incur high maintenance costs, yielding less than 25% of their expected financial returns. In this study, we introduce a novel and reproducible framework that integrates relevant financial context with behavioral finance studies to construct supervision data for end-to-end advisors. Using this framework, we create a 19k sample reasoning dataset and conduct a comprehensive fine-tuning of the Qwen-3-8B model on the dataset. Through a held-out test split and a blind LLM-jury study, we demonstrate that through careful data curation and behavioral integration, our 8B model achieves performance comparable to significantly larger baselines (14-32B parameters) across factual accuracy, fluency, and personalization metrics while incurring 80% lower costs than the larger counterparts.

[NLP-6] AssoCiAm: A Benchmark for Evaluating Association Thinking while Circumventing Ambiguity

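[Quick read]: This paper addresses the unreliable evaluation of associative ability in multimodal large language models (MLLMs) caused by the inherent ambiguity of association tasks. Existing evaluation frameworks often overlook this ambiguity, which arises from the divergent nature of associations and undermines the reliability of the results. The authors decompose ambiguity into internal ambiguity and external ambiguity, and introduce AssoCiAm, a benchmark that evaluates associative ability while circumventing ambiguity through a hybrid computational method.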

Link: https://arxiv.org/abs/2509.14171
Authors: Yifan Liu,Wenkuan Zhao,Shanshan Zhong,Jinghui Qin,Mingfu Liang,Zhongzhan Huang,Wushao Wen
Affiliations: Sun Yat-sen University; Guangdong University of Technology; Northwestern University
Categories: Computation and Language (cs.CL)
Comments:

Abstract: Recent advancements in multimodal large language models (MLLMs) have garnered significant attention, offering a promising pathway toward artificial general intelligence (AGI). Among the essential capabilities required for AGI, creativity has emerged as a critical trait for MLLMs, with association serving as its foundation. Association reflects a model’s ability to think creatively, making it vital to evaluate and understand. While several frameworks have been proposed to assess associative ability, they often overlook the inherent ambiguity in association tasks, which arises from the divergent nature of associations and undermines the reliability of evaluations. To address this issue, we decompose ambiguity into two types, internal ambiguity and external ambiguity, and introduce AssoCiAm, a benchmark designed to evaluate associative ability while circumventing the ambiguity through a hybrid computational method. We then conduct extensive experiments on MLLMs, revealing a strong positive correlation between cognition and association. Additionally, we observe that the presence of ambiguity in the evaluation process causes MLLMs’ behavior to become more random-like. Finally, we validate the effectiveness of our method in ensuring more accurate and reliable evaluations. See Project Page for the data and codes.

[NLP-7] CS-FLEURS: A Massively Multilingual and Code-Switched Speech Dataset

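[Quick read]: This paper addresses the difficulty of developing and evaluating code-switched speech recognition and translation systems beyond high-resource languages, where benchmarks covering many language pairs have been lacking. The authors present CS-FLEURS, which consists of four test sets and one training set. The test sets cover 113 unique code-switched language pairs across 52 languages and draw on several speech sources: real voices reading synthetically generated code-switched sentences, generative text-to-speech (TTS), and concatenative TTS, including a dedicated test set for lower-resourced language pairs. The training set provides 128 hours of generative TTS data across 16 X-English language pairs.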

Link: https://arxiv.org/abs/2509.14161
Authors: Brian Yan,Injy Hamed,Shuichiro Shimizu,Vasista Lodagala,William Chen,Olga Iakovenko,Bashar Talafha,Amir Hussein,Alexander Polok,Kalvin Chang,Dominik Klement,Sara Althubaiti,Puyuan Peng,Matthew Wiesner,Thamar Solorio,Ahmed Ali,Sanjeev Khudanpur,Shinji Watanabe,Chih-Chen Chen,Zhen Wu,Karim Benharrak,Anuj Diwan,Samuele Cornell,Eunjung Yeo,Kwanghee Choi,Carlos Carvalho,Karen Rosero
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments:

Abstract:We present CS-FLEURS, a new dataset for developing and evaluating code-switched speech recognition and translation systems beyond high-resourced languages. CS-FLEURS consists of 4 test sets which cover in total 113 unique code-switched language pairs across 52 languages: 1) a 14 X-English language pair set with real voices reading synthetically generated code-switched sentences, 2) a 16 X-English language pair set with generative text-to-speech 3) a 60 Arabic, Mandarin, Hindi, Spanish-X language pair set with the generative text-to-speech, and 4) a 45 X-English lower-resourced language pair test set with concatenative text-to-speech. Besides the four test sets, CS-FLEURS also provides a training set with 128 hours of generative text-to-speech data across 16 X-English language pairs. Our hope is that CS-FLEURS helps to broaden the scope of future code-switched speech research. Dataset link: this https URL.

[NLP-8] When Avatars Have Personality: Effects on Engagement and Communication in Immersive Medical Training

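[Quick read]: This paper addresses the limited effectiveness of virtual reality (VR) for training complex interpersonal skills, which stems from a lack of psychologically plausible virtual humans. The gap matters most in high-stakes domains such as medical education, where communication is a core competency. The key idea is a modular framework that decouples personality from clinical data and uses large language models (LLMs) to create medically coherent virtual patients with distinct, consistent personalities. A mixed-method, within-subjects study with licensed physicians supports the feasibility and perceived effectiveness of the approach, and surfaces design principles such as a "realism-verbosity paradox", in which less communicative agents can seem more artificial.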

Link: https://arxiv.org/abs/2509.14132
Authors: Julia S. Dollis,Iago A. Brito,Fernanda B. Färber,Pedro S. F. B. Ribeiro,Rafael T. Sousa,Arlindo R. Galvão Filho
Affiliations: Advanced Knowledge Center for Immersive Technologies
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments: 8 pages, 2 figures

Abstract: While virtual reality (VR) excels at simulating physical environments, its effectiveness for training complex interpersonal skills is limited by a lack of psychologically plausible virtual humans. This is a critical gap in high-stakes domains like medical education, where communication is a core competency. This paper introduces a framework that integrates large language models (LLMs) into immersive VR to create medically coherent virtual patients with distinct, consistent personalities, built on a modular architecture that decouples personality from clinical data. We evaluated our system in a mixed-method, within-subjects study with licensed physicians who engaged in simulated consultations. Results demonstrate that the approach is not only feasible but is also perceived by physicians as a highly rewarding and effective training enhancement. Furthermore, our analysis uncovers critical design principles, including a “realism-verbosity paradox” where less communicative agents can seem more artificial, and the need for challenges to be perceived as authentic to be instructive. This work provides a validated framework and key insights for developing the next generation of socially intelligent VR training environments.

[NLP-9] Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST ICASSP2026

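[Quick read]: This paper targets inefficiency, hallucinations, and limited cross-lingual performance in multilingual automatic speech recognition (ASR) and speech-to-text translation (AST). Its main contribution is Canary-1B-v2, built with a FastConformer encoder and a Transformer decoder and trained on 1.7M hours of data through two-stage pre-training and fine-tuning with dynamic data balancing; non-speech audio is added to reduce hallucinations in ASR and AST. For timestamps, the NeMo Forced Aligner (NFA) with an auxiliary CTC model provides reliable segment-level timestamps. Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and is competitive on multilingual ASR and AST with larger models such as Seamless-M4T-v2-large and LLM-based systems.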

Link: https://arxiv.org/abs/2509.14128
Authors: Monica Sekoyan,Nithin Rao Koluguri,Nune Tadevosyan,Piotr Zelasko,Travis Bartley,Nick Karpov,Jagadeesh Balam,Boris Ginsburg
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
Comments: Mini Version of it Submitted to ICASSP 2026

Abstract:This report introduces Canary-1B-v2, a fast, robust multilingual model for Automatic Speech Recognition (ASR) and Speech-to-Text Translation (AST). Built with a FastConformer encoder and Transformer decoder, it supports 25 languages primarily European. The model was trained on 1.7M hours of total data samples, including Granary and NeMo ASR Set 3.0, with non-speech audio added to reduce hallucinations for ASR and AST. We describe its two-stage pre-training and fine-tuning process with dynamic data balancing, as well as experiments with an nGPT encoder. Results show nGPT scales well with massive data, while FastConformer excels after fine-tuning. For timestamps, Canary-1B-v2 uses the NeMo Forced Aligner (NFA) with an auxiliary CTC model, providing reliable segment-level timestamps for ASR and AST. Evaluations show Canary-1B-v2 outperforms Whisper-large-v3 on English ASR while being 10x faster, and delivers competitive multilingual ASR and AST performance against larger models like Seamless-M4T-v2-large and LLM-based systems. We also release Parakeet-TDT-0.6B-v3, a successor to v2, offering multilingual ASR across the same 25 languages with just 600M parameters.

[NLP-10] Reasoning Efficiently Through Adaptive Chain-of-Thought Compression: A Self-Optimizing Framework

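[Quick read]: This paper addresses the cost of overly long Chain-of-Thought (CoT) reasoning in large language models (LLMs): higher latency, memory usage, and KV-cache demands, plus truncation. In software engineering tasks, excessive reasoning can even lower accuracy and trigger infinite loops. The proposed framework, SEER (Self-Enhancing Efficient Reasoning), compresses CoT adaptively. It combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs, and thereby reduces verbosity and computational overhead while preserving or even improving accuracy.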

Link: https://arxiv.org/abs/2509.14093
Authors: Kerui Huang,Shuhan Liu,Xing Hu,Tongtong Xu,Lingfeng Bao,Xin Xia
Affiliations: The State Key Laboratory of Blockchain and Data Security, Zhejiang University; State Key Laboratory for Novel Software Technology, Nanjing University
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by prompting intermediate steps, improving accuracy and robustness in arithmetic, logic, and commonsense tasks. However, this benefit comes with high computational costs: longer outputs increase latency, memory usage, and KV-cache demands. These issues are especially critical in software engineering tasks where concise and deterministic outputs are required. To investigate these trade-offs, we conduct an empirical study based on code generation benchmarks. The results reveal that longer CoT does not always help. Excessive reasoning often causes truncation, accuracy drops, and latency up to five times higher, with failed outputs consistently longer than successful ones. These findings challenge the assumption that longer reasoning is inherently better and highlight the need for adaptive CoT control. Motivated by this, we propose SEER (Self-Enhancing Efficient Reasoning), an adaptive framework that compresses CoT while preserving accuracy. SEER combines Best-of-N sampling with task-aware adaptive filtering, dynamically adjusting thresholds based on pre-inference outputs to reduce verbosity and computational overhead. We then evaluate SEER on three software engineering tasks and one math task. On average, SEER shortens CoT by 42.1%, improves accuracy by reducing truncation, and eliminates most infinite loops. These results demonstrate SEER as a practical method to make CoT-enhanced LLMs more efficient and robust, even under resource constraints.
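One way to picture the combination of Best-of-N sampling and adaptive filtering is the sketch below. It assumes a length threshold derived from the median length of the pre-inference outputs and a shortest-correct selection rule. Both rules, and the helper names `adaptive_threshold` and `select_cot`, are hypothetical, since the abstract does not spell out how SEER sets its thresholds.

```python
from statistics import median

def adaptive_threshold(pre_inference_lengths, slack=1.0):
    """Hypothetical rule: cap CoT length at slack * median pre-inference length."""
    return slack * median(pre_inference_lengths)

def select_cot(candidates, threshold):
    """From N sampled (cot, is_correct) pairs, keep the correct ones within the
    length threshold and return the shortest, or None if none qualify."""
    kept = [cot for cot, ok in candidates if ok and len(cot.split()) <= threshold]
    return min(kept, key=lambda cot: len(cot.split())) if kept else None

# Toy Best-of-N pool for one task (N = 4); the correctness flags are made up.
candidates = [
    ("step one step two step three step four answer", True),  # 9 words
    ("step one answer", True),                                 # 3 words
    ("answer", False),                                         # short but wrong
    ("step one step two answer", True),                        # 5 words
]
threshold = adaptive_threshold([len(cot.split()) for cot, _ in candidates])
chosen = select_cot(candidates, threshold)
```

The selected short, correct chains could then serve as fine-tuning targets, which is one plausible reading of "self-enhancing" and is likewise an assumption here.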

[NLP-11] A TRRIP Down Memory Lane: Temperature-Based Re-Reference Interval Prediction For Instruction Caching

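[Quick read]: This paper addresses the poor fit between conventional instruction cache replacement policies and modern mobile CPU software, whose complex runtime behavior produces long reuse distances, frequent frontend stalls, and code footprints that outgrow on-chip memory. Hardware-centric cache management alone is no longer adequate. The authors propose TRRIP (Temperature-based Re-Reference Interval Prediction), a software-hardware co-design. The compiler analyzes and classifies code by "temperature" (hot/cold) and passes a summary to the hardware through an OS interface based on code page attributes. A lightweight hardware extension then uses these attributes in the instruction cache replacement policy to reduce the eviction rate of hot code. TRRIP reduces L2 instruction MPKI by 26.5%, giving a geomean speedup of 3.9%.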

Link: https://arxiv.org/abs/2509.14041
Authors: Henry Kao,Nikhil Sreekumar,Prabhdeep Singh Soni,Ali Sedaghati,Fang Su,Bryan Chan,Maziar Goudarzi,Reza Azimi
Affiliations: Huawei Technologies Canada; Huawei Technologies Beijing
Categories: Hardware Architecture (cs.AR); Computation and Language (cs.CL); Operating Systems (cs.OS); Performance (cs.PF)
Comments:

Abstract:Modern mobile CPU software pose challenges for conventional instruction cache replacement policies due to their complex runtime behavior causing high reuse distance between executions of the same instruction. Mobile code commonly suffers from large amounts of stalls in the CPU frontend and thus starvation of the rest of the CPU resources. Complexity of these applications and their code footprint are projected to grow at a rate faster than available on-chip memory due to power and area constraints, making conventional hardware-centric methods for managing instruction caches to be inadequate. We present a novel software-hardware co-design approach called TRRIP (Temperature-based Re-Reference Interval Prediction) that enables the compiler to analyze, classify, and transform code based on “temperature” (hot/cold), and to provide the hardware with a summary of code temperature information through a well-defined OS interface based on using code page attributes. TRRIP’s lightweight hardware extension employs code temperature attributes to optimize the instruction cache replacement policy resulting in the eviction rate reduction of hot code. TRRIP is designed to be practical and adoptable in real mobile systems that have strict feature requirements on both the software and hardware components. TRRIP can reduce the L2 MPKI for instructions by 26.5% resulting in geomean speedup of 3.9%, on top of RRIP cache replacement running mobile code already optimized using PGO.
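A toy simulation helps show why temperature hints can matter for an RRIP-style policy. The sketch models one cache set under SRRIP (insert at a "long" re-reference prediction value, promote to 0 on a hit, evict a line at the maximum value, ageing all lines when none is there). It then assumes that lines tagged hot are inserted at value 0 instead. That insertion rule, the 4-way set, and the access trace are illustrative assumptions, not the policy described in the paper.

```python
def simulate(trace, ways=4, hot=frozenset(), max_rrpv=3):
    """Count misses for one cache set under SRRIP. Lines in `hot` are inserted
    with RRPV 0 (assumed hint); all others with max_rrpv - 1 (plain SRRIP)."""
    rrpv = {}  # resident line -> re-reference prediction value
    misses = 0
    for line in trace:
        if line in rrpv:
            rrpv[line] = 0  # hit: predict near-immediate re-reference
            continue
        misses += 1
        if len(rrpv) == ways:
            # Age every line until at least one reaches max_rrpv, then evict it.
            while max_rrpv not in rrpv.values():
                for key in rrpv:
                    rrpv[key] += 1
            victim = next(k for k, v in rrpv.items() if v == max_rrpv)
            del rrpv[victim]
        rrpv[line] = 0 if line in hot else max_rrpv - 1
    return misses

# Two hot lines touched once per round, separated by four one-off cold lines.
trace = []
for r in range(3):
    trace += ["H1", "H2"] + [f"C{r}_{i}" for i in range(4)]

baseline = simulate(trace)
temperature_aware = simulate(trace, hot=frozenset({"H1", "H2"}))
```

In this trace the baseline evicts the hot lines before they are reused, while the hinted variant keeps them resident across the cold scans.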

[NLP-12] SSL-SSAW: Self-Supervised Learning with Sigmoid Self-Attention Weighting for Question-Based Sign Language Translation

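[Quick read]: This paper addresses how to integrate dialogue context efficiently into Sign Language Translation (SLT) to improve translation quality. Gloss annotations are costly to obtain, whereas dialogue occurs naturally in communication, is easier to annotate, and carries important contextual cues. The authors propose SSL-SSAW (cross-modality Self-supervised Learning with Sigmoid Self-attention Weighting), a feature-fusion method. It aligns multimodal features with contrastive learning, applies a Sigmoid Self-attention Weighting module for adaptive feature extraction from question and sign language sequences, and uses self-supervised learning on the available question text to strengthen representations. Experiments show that easily accessible question assistance can match or even surpass the performance of gloss assistance.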

Link: https://arxiv.org/abs/2509.14036
Authors: Zekang Liu,Wei Feng,Fanhua Shang,Lianyu Hu,Jichao Feng,Liqing Gao
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Sign Language Translation (SLT) bridges the communication gap between deaf people and hearing people, where dialogue provides crucial contextual cues to aid in translation. Building on this foundational concept, this paper proposes Question-based Sign Language Translation (QB-SLT), a novel task that explores the efficient integration of dialogue. Unlike gloss (sign language transcription) annotations, dialogue naturally occurs in communication and is easier to annotate. The key challenge lies in aligning multimodality features while leveraging the context of the question to improve translation. To address this issue, we propose a cross-modality Self-supervised Learning with Sigmoid Self-attention Weighting (SSL-SSAW) fusion method for sign language translation. Specifically, we employ contrastive learning to align multimodality features in QB-SLT, then introduce a Sigmoid Self-attention Weighting (SSAW) module for adaptive feature extraction from question and sign language sequences. Additionally, we leverage available question text through self-supervised learning to enhance representation and translation capabilities. We evaluated our approach on newly constructed CSL-Daily-QA and PHOENIX-2014T-QA datasets, where SSL-SSAW achieved SOTA performance. Notably, easily accessible question assistance can achieve or even surpass the performance of gloss assistance. Furthermore, visualization results demonstrate the effectiveness of incorporating dialogue in improving translation quality.

[NLP-13] Enhancing Multi-Agent Debate System Performance via Confidence Expression EMNLP’25

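[Quick read]: This paper addresses the reduced effectiveness of Multi-Agent Debate (MAD) systems when large language models (LLMs) do not express confidence clearly. Some LLMs have superior knowledge or reasoning for a given task but struggle to communicate that advantage during debate, and inappropriate confidence expression can lead agents to hold on to incorrect beliefs or converge prematurely on suboptimal answers. The key idea is to incorporate explicit confidence expression so that LLMs communicate their confidence levels throughout the debate. The authors build the ConfMAD framework to validate this, show that it is effective, and analyze how confidence influences debate dynamics, offering guidance for designing confidence-aware MAD systems.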

Link: https://arxiv.org/abs/2509.14034
Authors: Zijie Lin,Bryan Hooi
Affiliations: National University of Singapore
Categories: Computation and Language (cs.CL)
Comments: EMNLP’25 Findings

Abstract:Generative Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of tasks. Recent research has introduced Multi-Agent Debate (MAD) systems, which leverage multiple LLMs to simulate human debate and thereby improve task performance. However, while some LLMs may possess superior knowledge or reasoning capabilities for specific tasks, they often struggle to clearly communicate this advantage during debates, in part due to a lack of confidence expression. Moreover, inappropriate confidence expression can cause agents in MAD systems to either stubbornly maintain incorrect beliefs or converge prematurely on suboptimal answers, ultimately reducing debate effectiveness and overall system performance. To address these challenges, we propose incorporating confidence expression into MAD systems to allow LLMs to explicitly communicate their confidence levels. To validate this approach, we develop ConfMAD, a MAD framework that integrates confidence expression throughout the debate process. Experimental results demonstrate the effectiveness of our method, and we further analyze how confidence influences debate dynamics, offering insights into the design of confidence-aware MAD systems.

[NLP-14] You Are What You Train: Effects of Data Composition on Training Context-aware Machine Translation Models EMNLP2025

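[Quick read]: This paper addresses the difficulty machine translation models have in using context, for example for pronoun disambiguation. It confirms that the sparsity of contextually rich examples in standard training data is a key bottleneck. The authors validate this systematically by constructing training datasets with controlled proportions of contextually relevant examples, and they propose two training strategies designed to make better use of the available data. These strategies improve context utilization, with accuracy gains of up to 6 and 8 percentage points in single-language and multilingual settings respectively.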

Link: https://arxiv.org/abs/2509.14031
Authors: Paweł Mąka,Yusuf Can Semerci,Jan Scholtes,Gerasimos Spanakis
Affiliations: Maastricht University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2025 main conference

Abstract: Achieving human-level translations requires leveraging context to ensure coherence and handle complex phenomena like pronoun disambiguation. Sparsity of contextually rich examples in the standard training data has been hypothesized as the reason for the difficulty of context utilization. In this work, we systematically validate this claim in both single- and multilingual settings by constructing training datasets with controlled proportions of contextually relevant examples. We demonstrate a strong association between training data sparsity and model performance, confirming sparsity as a key bottleneck. Importantly, we reveal that improvements in one contextual phenomenon do not generalize to others. While we observe some cross-lingual transfer, it is not significantly higher between languages within the same sub-family. Finally, we propose and empirically evaluate two training strategies designed to leverage the available data. These strategies improve context utilization, resulting in accuracy gains of up to 6 and 8 percentage points on the ctxPro evaluation in single- and multilingual settings respectively.

[NLP-15] Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

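[Quick Read]: This paper addresses the fact that machine translation (MT) quality assessment remains text-centric and lacks an effective way to measure translation quality in spoken settings. The key idea is an audio-based evaluation: crowd-sourced judgments of spoken translation output are collected via Amazon Mechanical Turk and compared with conventional text-only evaluation. The study finds that audio-based evaluation better reflects real use cases such as voice translation modes and, in some cases, identifies significant differences between systems that text-only evaluation misses. It argues for including the speech modality in future MT evaluation frameworks.

Link: https://arxiv.org/abs/2509.14023
Authors: Sami Ul Haq,Sheila Castilho,Yvette Graham
Affiliations: ADAPT Centre; Dublin City University (DCU), Ireland; Trinity College Dublin (TCD), Ireland
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted at WMT 2025 (EMNLP) for oral presentation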

Abstract:Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech being a richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

[NLP-16] Hala Technical Report: Building Arabic-Centric Instruction Translation Models at Scale

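[Quick Read]: This paper addresses the scarcity of high-quality models for Arabic natural language processing (NLP), particularly for instruction following and translation. Its core is a translate-and-tune pipeline. A strong Arabic-English (AR↔EN) teacher model is first compressed to FP8, raising inference throughput without quality loss, and used to create high-fidelity bilingual supervision. A lightweight language model (LFM2-1.2B) is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale Arabic instruction-following corpus. Finally, Hala models at several sizes (350M to 9B parameters) are trained, and slerp merging is applied to balance Arabic specialization with base-model strengths. Hala reaches state-of-the-art results on Arabic-centric benchmarks in both the "nano" (≤2B) and "small" (7-9B) categories.

Link: https://arxiv.org/abs/2509.14008
Authors: Hasan Abed Al Kader Hammoud,Mohammad Zbeeb,Bernard Ghanem
Affiliations: unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Technical Report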

Abstract:We present Hala, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2-1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train Hala models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, Hala achieves state-of-the-art results within both the “nano” (≤2B) and “small” (7-9B) categories, outperforming their bases. We release models, data, evaluation, and recipes to accelerate research in Arabic NLP.
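
The slerp merging mentioned in the abstract can be illustrated with spherical linear interpolation between two weight vectors. This is a minimal sketch under stated assumptions, not the authors' recipe: the function name `slerp`, the flat-list representation of weights, the interpolation factor `t`, and the toy vectors `base` and `tuned` are all hypothetical, and real merging tools typically apply the formula tensor by tensor.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero weight vectors.

    Hypothetical helper for illustration; `a` and `b` are flat lists of floats.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Cosine of the angle between the vectors, clamped against rounding error.
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))
    omega = math.acos(cos_omega)
    if math.sin(omega) < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / math.sin(omega)
    weight_b = math.sin(t * omega) / math.sin(omega)
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Toy example: "base" and "tuned" stand in for two checkpoints' weights.
base = [1.0, 0.0]
tuned = [0.0, 1.0]
merged = slerp(base, tuned, 0.5)
```

With `t = 0` the result equals the base weights and with `t = 1` the fine-tuned weights; intermediate values trade off the two along an arc rather than a straight line.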

[NLP-17] Early Stopping Chain-of-thoughts in Large Language Models

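[Quick Read]: This paper addresses the high inference cost that reasoning large language models (LLMs) incur when they generate long chains of thought (CoT) for complex problems. The key is ES-CoT, an inference-time early-stopping method. After each reasoning step the model is prompted to output its current answer (the step answer), and the run length of consecutive identical step answers is tracked as a measure of answer convergence. Generation is terminated once the run length increases sharply and exceeds a minimum threshold. ES-CoT reduces inference tokens by about 41% on average while keeping accuracy comparable to standard CoT, integrates with self-consistency prompting, and is robust to hyperparameter choices.

Link: https://arxiv.org/abs/2509.14004
Authors: Minjia Mao,Bowen Yin,Yu Zhu,Xiao Fang
Affiliations: University of Delaware; Peking University
Subjects: Computation and Language (cs.CL)
Comments: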

Abstract:Reasoning large language models (LLMs) have demonstrated superior capacities in solving complicated problems by generating long chain-of-thoughts (CoT), but such a lengthy CoT incurs high inference costs. In this study, we introduce ES-CoT, an inference-time method that shortens CoT generation by detecting answer convergence and stopping early with minimal performance loss. At the end of each reasoning step, we prompt the LLM to output its current final answer, denoted as a step answer. We then track the run length of consecutive identical step answers as a measure of answer convergence. Once the run length exhibits a sharp increase and exceeds a minimum threshold, the generation is terminated. We provide both empirical and theoretical support for this heuristic: step answers steadily converge to the final answer, and large run-length jumps reliably mark this convergence. Experiments on five reasoning datasets across three LLMs show that ES-CoT reduces the number of inference tokens by about 41% on average while maintaining accuracy comparable to standard CoT. Further, ES-CoT integrates seamlessly with self-consistency prompting and remains robust across hyperparameter choices, highlighting it as a practical and effective approach for efficient reasoning.
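
The run-length heuristic described in the abstract can be sketched as follows. The abstract does not define what counts as a "sharp increase", so the rule below is an assumption made for illustration: stop when the current run is at least `min_run` long and at least `jump_factor` times the longest earlier, already finished run. The names `run_lengths`, `stop_step`, `min_run`, and `jump_factor` are hypothetical.

```python
def run_lengths(step_answers):
    """Length of the run of identical answers ending at each step."""
    lengths = []
    for i, answer in enumerate(step_answers):
        if i > 0 and answer == step_answers[i - 1]:
            lengths.append(lengths[-1] + 1)
        else:
            lengths.append(1)
    return lengths


def stop_step(step_answers, min_run=3, jump_factor=2.0):
    """Index of the first step at which to stop, or None.

    Assumed rule (not taken from the paper): the current run must reach
    `min_run` and be `jump_factor` times the longest finished earlier run.
    """
    lengths = run_lengths(step_answers)
    longest_finished = 0
    for i, length in enumerate(lengths):
        if i > 0 and length == 1:
            # A new run just started, so the previous run is finished.
            longest_finished = max(longest_finished, lengths[i - 1])
        if length >= min_run and length >= jump_factor * max(longest_finished, 1):
            return i
    return None


# Step answers collected after each reasoning step of one hypothetical chain.
answers = ["12", "15", "15", "18", "18", "18", "18", "18"]
print(run_lengths(answers))  # [1, 1, 2, 1, 2, 3, 4, 5]
print(stop_step(answers))    # 6
```

In this toy trace the answer "18" first appears at step 3, but generation stops only at step 6, once its run (length 4) is twice the longest earlier run (length 2).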

[NLP-18] Slim-SC: Thought Pruning for Efficient Scaling with Self-Consistency EMNLP2025

[Quick Read]: This paper addresses the computational overhead that keeps Self-Consistency (SC), a Test-Time Scaling (TTS) technique, from broad deployment. SC generates multiple reasoning chains in parallel and selects the final answer by majority voting, which improves LLM reasoning but is costly. The proposed solution, Slim-SC, is a step-wise pruning strategy that uses inter-chain similarity at the thought level to identify and remove redundant chains, reducing inference latency and KV cache (KVC) usage while maintaining or improving accuracy.

Link: https://arxiv.org/abs/2509.13990
Authors: Colin Hong,Xu Guo,Anand Chaanan Singh,Esha Choukse,Dmitrii Ustiugov
Affiliations: NTU Singapore; KTH Royal Institute of Technology; Microsoft
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by EMNLP 2025 (Oral), 9 pages

Abstract:Recently, Test-Time Scaling (TTS) has gained increasing attention for improving LLM reasoning performance at test time without retraining the model. A notable TTS technique is Self-Consistency (SC), which generates multiple reasoning chains in parallel and selects the final answer via majority voting. While effective, the order-of-magnitude computational overhead limits its broad deployment. Prior attempts to accelerate SC mainly rely on model-based confidence scores or heuristics with limited empirical support. For the first time, we theoretically and empirically analyze the inefficiencies of SC and reveal actionable opportunities for improvement. Building on these insights, we propose Slim-SC, a step-wise pruning strategy that identifies and removes redundant chains using inter-chain similarity at the thought level. Experiments on three STEM reasoning datasets and two recent LLM architectures show that Slim-SC reduces inference latency and KVC usage by up to 45% and 26%, respectively, with R1-Distill, while maintaining or improving accuracy, thus offering a simple yet efficient TTS alternative for SC.
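
To make the idea of similarity-based pruning concrete, here is a minimal sketch of one pruning step followed by majority voting. It is not the Slim-SC algorithm: the abstract does not specify the similarity measure or the pruning rule, so the word-overlap (Jaccard) similarity, the `threshold` value, and the function names `jaccard`, `prune_redundant`, and `majority_vote` are assumptions chosen for illustration.

```python
from collections import Counter


def jaccard(a, b):
    """Word-overlap similarity between two thoughts (a stand-in measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def prune_redundant(thoughts, threshold=0.8):
    """Indices of the chains to keep after one pruning step.

    `thoughts[i]` is the latest thought of chain i. A chain is dropped when
    its thought is at least `threshold`-similar to an already kept chain.
    """
    kept = []
    for i, thought in enumerate(thoughts):
        if all(jaccard(thought, thoughts[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def majority_vote(final_answers):
    """Standard self-consistency aggregation over the surviving chains."""
    return Counter(final_answers).most_common(1)[0][0]


thoughts = [
    "so x equals 4",
    "so x equals 4",
    "try factoring the quadratic first",
]
print(prune_redundant(thoughts))        # [0, 2]
print(majority_vote(["4", "4", "5"]))   # 4
```

Chains that are pruned stop generating, which is where the latency and cache savings would come from; the surviving chains are still aggregated by majority vote as in standard SC.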

[NLP-19] Long-context Reference-based MT Quality Estimation

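[Quick Read]: This paper addresses how to predict the quality of translated segments more accurately in machine translation quality evaluation, where models trained only on short segments are limited by missing context. The key is a long-context, reference-based quality estimation setup. Long-context training data is built by concatenating in-domain, human-annotated sentences and computing a weighted average of their scores. Multiple human judgment datasets (MQM, SQM, and DA) are integrated by normalising their scales, and multilingual regression models are trained to predict segment-level Error Span Annotation (ESA) scores from the source, hypothesis, and reference translations. Experiments show that long-context information improves correlation with human judgments.

Link: https://arxiv.org/abs/2509.13980
Authors: Sami Ul Haq,Chinonso Cynthia Osuji,Sheila Castilho,Brian Davis
Affiliations: ADAPT Centre, Dublin City University, Dublin, Ireland
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: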

Abstract:In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level Error Span Annotation (ESA) scores using augmented long-context data. To construct long-context training data, we concatenate in-domain, human-annotated sentences and compute a weighted average of their scores. We integrate multiple human judgment datasets (MQM, SQM, and DA) by normalising their scales and train multilingual regression models to predict quality scores from the source, hypothesis, and reference translations. Experimental results show that incorporating long-context information improves correlations with human judgments compared to models trained only on short segments.
Submission history: [v1] Wed, 17 Sep 2025 13:52:45 UTC (102 KB), from Sami Ul Haq.
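
The data construction described in the abstract (scale normalisation, concatenation, weighted score averaging) can be sketched as below. This is an illustration under stated assumptions rather than the submission's pipeline: the abstract does not say how the weights are chosen, so weighting by whitespace token count is an assumption, as are min-max normalisation, the example score range, and the names `min_max_normalise` and `build_long_context`.

```python
def min_max_normalise(scores, lo, hi):
    """Map scores from the range [lo, hi] onto [0, 1]."""
    return [(s - lo) / (hi - lo) for s in scores]


def build_long_context(segments):
    """Concatenate (text, score) pairs into one long-context example.

    The combined score is a weighted average of the segment scores; using
    the whitespace token count as the weight is an assumption of this sketch.
    """
    texts = [text for text, _ in segments]
    weights = [len(text.split()) for text in texts]
    total = sum(weights)
    score = sum(w * s for w, (_, s) in zip(weights, segments)) / total
    return " ".join(texts), score


# Hypothetical scores on a 0-100 scale, rescaled before mixing datasets.
print(min_max_normalise([0, 50, 100], 0, 100))  # [0.0, 0.5, 1.0]

# Two toy segments with already normalised scores.
text, score = build_long_context([("a b", 1.0), ("c d e f g h", 0.5)])
print(text)   # a b c d e f g h
print(score)  # 0.625
```

Length weighting makes a long, poorly translated sentence pull the combined score down more than a short one, which is one plausible reading of "weighted average".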

[NLP-20] Exploring Major Transitions in the Evolution of Biological Cognition With Artificial Neural Networks

[Quick Read]: This paper asks whether cognition could have evolved through a series of major transitions that change the structure of information flow in biological neural networks. The approach uses idealised models of information flow, artificial neural networks (ANNs), to compare how feed-forward, recurrent, and laminated topologies affect performance when learning artificial grammars of varying complexity. Recurrent networks process a qualitatively wider range of input types than feed-forward networks and perform markedly better on the most complex grammars. The difficulty of training recurrent networks also acts as a transition barrier and a form of contingent irreversibility, both key features of evolutionary transitions. The results indicate that some changes in information flow can yield transitions in cognitive performance.

Link: https://arxiv.org/abs/2509.13968
Authors: Konstantinos Voudouris,Andrew Barron,Marta Halina,Colin Klein,Matishalin Patel
Affiliations: Institute for Human-Centered AI, Helmholtz Zentrum Munich; Leverhulme Centre for the Future of Intelligence, University of Cambridge; School of Natural Sciences, Macquarie University; School of Philosophy, The Australian National University; Department of History and Philosophy of Science, University of Cambridge; School of Environmental and Life Sciences, University of Hull; Centre for Data Science, AI, and Modelling, University of Hull
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Formal Languages and Automata Theory (cs.FL); Machine Learning (cs.LG)
Comments:

Abstract:Transitional accounts of evolution emphasise a few changes that shape what is evolvable, with dramatic consequences for derived lineages. More recently it has been proposed that cognition might also have evolved via a series of major transitions that manipulate the structure of biological neural networks, fundamentally changing the flow of information. We used idealised models of information flow, artificial neural networks (ANNs), to evaluate whether changes in information flow in a network can yield a transitional change in cognitive performance. We compared networks with feed-forward, recurrent and laminated topologies, and tested their performance learning artificial grammars that differed in complexity, controlling for network size and resources. We documented a qualitative expansion in the types of input that recurrent networks can process compared to feed-forward networks, and a related qualitative increase in performance for learning the most complex grammars. We also noted how the difficulty in training recurrent networks poses a form of transition barrier and contingent irreversibility – other key features of evolutionary transitions. Not all changes in network topology confer a performance advantage in this task set. Laminated networks did not outperform non-laminated networks in grammar learning. Overall, our findings show how some changes in information flow can yield transitions in cognitive performance.

[NLP-21] Enhancing Time Awareness in Generative Recommendation EMNLP2025

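[Quick Read]: This paper addresses the neglect of temporal dynamics across items in generative recommendation: existing methods consider only the sequential order of items and miss how user preferences evolve over time. The proposed model, GRUT (Generative Recommender Using Time awareness), has two main components: 1) Time-aware Prompting, which combines a user-level temporal context that models personalized temporal patterns across timestamps and time intervals with an item-level transition context that captures transition patterns across users; and 2) Trend-aware Inference, a training-free method that improves rankings by combining item trend information with generation likelihood.

Link: https://arxiv.org/abs/2509.13957
Authors: Sunkyung Lee,Seongmin Park,Jonghyo Kim,Mincheol Yoon,Jongwuk Lee
Affiliations: Sungkyunkwan University
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: EMNLP 2025 (Findings)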

Abstract:Generative recommendation has emerged as a promising paradigm that formulates the recommendations into a text-to-text generation task, harnessing the vast knowledge of large language models. However, existing studies focus on considering the sequential order of items and neglect to handle the temporal dynamics across items, which can imply evolving user preferences. To address this limitation, we propose a novel model, Generative Recommender Using Time awareness (GRUT), effectively capturing hidden user preferences via various temporal signals. We first introduce Time-aware Prompting, consisting of two key contexts. The user-level temporal context models personalized temporal patterns across timestamps and time intervals, while the item-level transition context provides transition patterns across users. We also devise Trend-aware Inference, a training-free method that enhances rankings by incorporating trend information about items with generation likelihood. Extensive experiments demonstrate that GRUT outperforms state-of-the-art models, with gains of up to 15.4% and 14.3% in Recall@5 and NDCG@5 across four benchmark datasets. The source code is available at this https URL.

[NLP-22] An Empirical Study on Failures in Automated Issue Solving

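[Quick Read]: This paper addresses the substantial failure rate of LLM-based automated issue-solving tools on SWE-Bench, and the fact that current evaluations report only aggregate success rates that hide the underlying causes of failure. The authors manually analyze 150 failed instances and build a taxonomy of failure modes with 3 primary phases, 9 main categories, and 25 fine-grained subcategories. They find that most agentic failures stem from flawed reasoning and cognitive deadlocks, and propose a collaborative Expert-Executor framework in which a supervisory Expert agent provides strategic oversight and course-correction for a primary Executor agent. Experiments show that the framework solves 22.2% of issues that were previously intractable for a leading single agent.

Link: https://arxiv.org/abs/2509.13941
Authors: Simiao Liu,Fang Liu,Liehao Li,Xin Tan,Yinghao Zhu,Xiaoli Lian,Li Zhang
Affiliations: Beihang University; The University of Hong Kong
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: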

Abstract:Automated issue solving seeks to autonomously identify and repair defective code snippets across an entire codebase. SWE-Bench has emerged as the most widely adopted benchmark for evaluating progress in this area. While LLM-based agentic tools show great promise, they still fail on a substantial portion of tasks. Moreover, current evaluations primarily report aggregate issue-solving rates, which obscure the underlying causes of success and failure, making it challenging to diagnose model weaknesses or guide targeted improvements. To bridge this gap, we first analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified under varying task characteristics. Furthermore, to move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances. From this analysis, we developed a comprehensive taxonomy of failure modes comprising 3 primary phases, 9 main categories, and 25 fine-grained subcategories. We then systematically analyze the distribution of the identified failure modes. The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks. Motivated by these insights, we propose a collaborative Expert-Executor framework. It introduces a supervisory Expert agent tasked with providing strategic oversight and course-correction for a primary Executor agent. This architecture is designed to correct flawed reasoning and break the cognitive deadlocks that frequently lead to failure. Experiments show that our framework solves 22.2% of previously intractable issues for a leading single agent. These findings pave the way for building more robust agents through diagnostic evaluation and collaborative design.

[NLP-23] Linguistic Nepotism: Trading-off Quality for Language Preference in Multilingual RAG

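[Quick Read]: This paper investigates whether mixing documents in different languages in multilingual retrieval-augmented generation (mRAG) leads models to show unintended language preferences when generating answers and citing sources. The key is a controlled methodology that uses model internals to measure language preference while holding other factors, such as document relevance, constant. Results show that models preferentially cite English sources when the query is in English, that this bias is amplified for lower-resource languages and for documents positioned mid-context, and that models sometimes trade off document relevance for language preference, so citation choices are not driven by informativeness alone.

Link: https://arxiv.org/abs/2509.13930
Authors: Dayeon Ki,Marine Carpuat,Paul McNamee,Daniel Khashabi,Eugene Yang,Dawn Lawrie,Kevin Duh
Affiliations: University of Maryland; Johns Hopkins University
Subjects: Computation and Language (cs.CL)
Comments: 33 pages, 20 figures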

Abstract:Multilingual Retrieval-Augmented Generation (mRAG) systems enable language models to answer knowledge-intensive queries with citation-supported responses across languages. While such systems have been proposed, an open question is whether the mixture of different document languages impacts generation and citation in unintended ways. To investigate, we introduce a controlled methodology using model internals to measure language preference while holding other factors such as document relevance constant. Across eight languages and six open-weight models, we find that models preferentially cite English sources when queries are in English, with this bias amplified for lower-resource languages and for documents positioned mid-context. Crucially, we find that models sometimes trade off document relevance for language preference, indicating that citation choices are not always driven by informativeness alone. Our findings shed light on how language models leverage multilingual context and influence citation behavior.

[NLP-24] Do Large Language Models Understand Word Senses? EMNLP2025

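[Quick Read]: This paper examines whether large language models (LLMs) truly understand word meanings in context. It runs two evaluations: the Word Sense Disambiguation (WSD) performance of instruction-tuned LLMs compared with state-of-the-art systems designed for the task, and the ability of two top-performing open- and closed-source LLMs to understand word senses in three generative settings (definition generation, free-form explanation, and example generation). Leading models such as GPT-4o and DeepSeek-V3 match specialized WSD systems while being more robust across domains and difficulty levels. In the generative tasks, LLMs explain word meanings in context with up to 98% accuracy, with the best results in free-form explanation.

Link: https://arxiv.org/abs/2509.13905
Authors: Domenico Meconi,Simone Stirpe,Federico Martelli,Leonardo Lavalle,Roberto Navigli
Affiliations: Babelscape; Sapienza NLP Group, Sapienza University of Rome
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 20 pages, to be published in EMNLP2025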

Abstract:Understanding the meaning of words in context is a fundamental capability for Large Language Models (LLMs). Despite extensive evaluation efforts, the extent to which LLMs show evidence that they truly grasp word senses remains underexplored. In this paper, we address this gap by evaluating both i) the Word Sense Disambiguation (WSD) capabilities of instruction-tuned LLMs, comparing their performance to state-of-the-art systems specifically designed for the task, and ii) the ability of two top-performing open- and closed-source LLMs to understand word senses in three generative settings: definition generation, free-form explanation, and example generation. Notably, we find that, in the WSD task, leading models such as GPT-4o and DeepSeek-V3 achieve performance on par with specialized WSD systems, while also demonstrating greater robustness across domains and levels of difficulty. In the generation tasks, results reveal that LLMs can explain the meaning of words in context up to 98% accuracy, with the highest performance observed in the free-form explanation task, which best aligns with their generative capabilities.

[NLP-25] Combating Biomedical Misinformation through Multi-modal Claim Detection and Evidence-based Verification

[Quick Read]: This paper addresses the risks that healthcare misinformation, from vaccine hesitancy to unproven treatments, poses to public health and to trust in medical systems, and the particular difficulty of biomedical fact-checking. It proposes CER (Combining Evidence and Reasoning), a framework that integrates scientific evidence retrieval, reasoning via large language models (LLMs), and supervised veracity prediction. Grounding generated outputs in verifiable, evidence-based sources mitigates the risk of hallucinations and makes biomedical fact-checking more reliable.

Link: https://arxiv.org/abs/2509.13888
Authors: Mariano Barone,Antonio Romano,Giuseppe Riccio,Marco Postiglione,Vincenzo Moscato
Affiliations: University of Naples Federico II; Northwestern University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: this https URL

[NLP-26] Combining Evidence and Reasoning for Biomedical Fact-Checking

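[Quick Read]: This paper addresses the public health risks and loss of trust caused by healthcare misinformation, in particular the difficulty of automatically verifying biomedical claims about topics such as vaccine hesitancy and unproven treatments. It proposes the CER (Combining Evidence and Reasoning) framework, which integrates scientific evidence retrieval, reasoning via large language models (LLMs), and supervised veracity prediction, so that generated outputs are grounded in verifiable, evidence-based sources and hallucinations are reduced. The method reports state-of-the-art performance on several expert-annotated datasets and promising cross-dataset generalization.

Link: https://arxiv.org/abs/2509.13879
Authors: Mariano Barone,Antonio Romano,Giuseppe Riccio,Marco Postiglione,Vincenzo Moscato
Affiliations: University of Naples Federico II; Northwestern University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: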

Abstract:Misinformation in healthcare, from vaccine hesitancy to unproven treatments, poses risks to public health and trust in medical systems. While machine learning and natural language processing have advanced automated fact-checking, validating biomedical claims remains uniquely challenging due to complex terminology, the need for domain expertise, and the critical importance of grounding in scientific evidence. We introduce CER (Combining Evidence and Reasoning), a novel framework for biomedical fact-checking that integrates scientific evidence retrieval, reasoning via large language models, and supervised veracity prediction. By integrating the text-generation capabilities of large language models with advanced retrieval techniques for high-quality biomedical scientific evidence, CER effectively mitigates the risk of hallucinations, ensuring that generated outputs are grounded in verifiable, evidence-based sources. Evaluations on expert-annotated datasets (HealthFC, BioASQ-7b, SciFact) demonstrate state-of-the-art performance and promising cross-dataset generalization. Code and data are released for transparency and reproducibility: https://github.com/PRAISELab-PicusLab/CER.

[NLP-27] Do LLMs Align Human Values Regarding Social Biases? Judging and Explaining Social Biases with LLMs

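[Quick Read]: This paper investigates whether the alignment of large language models (LLMs) with human values regarding social biases (HVSB) differs across types of bias scenarios, and whether smaller language models can be taught to explain HVSB. Evaluating 12 LLMs from four model families on four datasets, it finds that a larger parameter scale does not necessarily mean a lower misalignment rate or attack success rate, that LLMs show an alignment preference for specific scenario types, and that models from the same family tend to have higher judgment consistency. Smaller LMs fine-tuned to explain HVSB produce explanations that are more readable but have relatively lower model agreeability.

Link: https://arxiv.org/abs/2509.13869
Authors: Yang Liu,Chenhui Chu
Affiliations: unknown
Subjects: Computation and Language (cs.CL)
Comments: 38 pages, 31 figures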

Abstract:Large language models (LLMs) can lead to undesired consequences when misaligned with human values, especially in scenarios involving complex and sensitive social biases. Previous studies have revealed the misalignment of LLMs with human values using expert-designed or agent-based emulated bias scenarios. However, it remains unclear whether the alignment of LLMs with human values differs across different types of scenarios (e.g., scenarios containing negative vs. non-negative questions). In this study, we investigate the alignment of LLMs with human values regarding social biases (HVSB) in different types of bias scenarios. Through extensive analysis of 12 LLMs from four model families and four datasets, we demonstrate that LLMs with large model parameter scales do not necessarily have lower misalignment rate and attack success rate. Moreover, LLMs show a certain degree of alignment preference for specific types of scenarios and the LLMs from the same model family tend to have higher judgment consistency. In addition, we study the understanding capacity of LLMs with their explanations of HVSB. We find no significant differences in the understanding of HVSB across LLMs. We also find LLMs prefer their own generated explanations. Additionally, we endow smaller language models (LMs) with the ability to explain HVSB. The generation results show that the explanations generated by the fine-tuned smaller LMs are more readable, but have a relatively lower model agreeability.

[NLP-28] Noise Supervised Contrastive Learning and Feature-Perturbed for Anomalous Sound Detection ICASSP2025

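[Quick Read]: This paper addresses the frequent false alarms in unsupervised anomalous sound detection when handling samples of the same type from different machines. It proposes one-stage supervised contrastive learning (OS-SCL), a training technique that perturbs features in the embedding space and applies a one-stage noisy supervised contrastive learning approach. It also proposes TFgram, a time-frequency feature extracted from raw audio that captures information critical for anomalous sound detection. On DCASE 2020 Challenge Task 2, the method reaches 95.71% AUC with TFgram.

Link: https://arxiv.org/abs/2509.13853
Authors: Shun Huang,Zhihua Fang,Liang He
Affiliations: Xinjiang University; Tsinghua University
Subjects: Sound (cs.SD); Computation and Language (cs.CL)
Comments: Accepted at ICASSP 2025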

Abstract:Unsupervised anomalous sound detection aims to detect unknown anomalous sounds by training a model using only normal audio data. Despite advancements in self-supervised methods, the issue of frequent false alarms when handling samples of the same type from different machines remains unresolved. This paper introduces a novel training technique called one-stage supervised contrastive learning (OS-SCL), which significantly addresses this problem by perturbing features in the embedding space and employing a one-stage noisy supervised contrastive learning approach. On the DCASE 2020 Challenge Task 2, it achieved 94.64% AUC, 88.42% pAUC, and 89.24% mAUC using only Log-Mel features. Additionally, a time-frequency feature named TFgram is proposed, which is extracted from raw audio. This feature effectively captures critical information for anomalous sound detection, ultimately achieving 95.71% AUC, 90.23% pAUC, and 91.23% mAUC. The source code is available at: this http URL.

[NLP-29] Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models EMNLP2025

[Quick Read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), where generated content does not match the input image, which limits real-world reliability. It hypothesizes that visual encoders trained under different paradigms acquire distinct inductive biases and therefore hallucinate differently. To analyze this, the authors build VHBench-10, a benchmark of approximately 10,000 samples covering ten fine-grained hallucination categories. Based on the findings, they propose VisionWeaver, a Context-Aware Routing Network that uses global visual features to generate routing signals and dynamically aggregates visual features from multiple specialized experts, reducing hallucinations and improving overall performance.

Link: https://arxiv.org/abs/2509.13836
Authors: Weihang Wang,Xinhao Li,Ziyue Wang,Yan Pang,Jielei Zhang,Peiyi Li,Qiang Zhang,Longwen Gao
Affiliations: Bilibili; UESTC; University of Virginia
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025 Findings

Abstract:Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
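
The routing idea in the abstract, where global visual features produce routing signals that weight several experts' features, can be sketched as a softmax gate. This is not VisionWeaver's architecture: the linear router, the softmax gating, the plain-list feature vectors, and the names `softmax`, `route_and_aggregate`, and `router_weights` are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_aggregate(global_feature, router_weights, expert_features):
    """Fuse expert features using gates computed from a global feature.

    `router_weights[i]` is a hypothetical linear scorer for expert i that is
    applied to the global visual feature; the gates sum to 1.
    """
    logits = [
        sum(w * g for w, g in zip(row, global_feature))
        for row in router_weights
    ]
    gates = softmax(logits)
    dim = len(expert_features[0])
    fused = [
        sum(gate * features[d] for gate, features in zip(gates, expert_features))
        for d in range(dim)
    ]
    return gates, fused


# Two toy experts; a zero router gives equal gates, so the fusion is a mean.
gates, fused = route_and_aggregate(
    global_feature=[1.0, 2.0],
    router_weights=[[0.0, 0.0], [0.0, 0.0]],
    expert_features=[[2.0, 0.0], [0.0, 4.0]],
)
print(gates)  # [0.5, 0.5]
print(fused)  # [1.0, 2.0]
```

Because the gates depend on the image's global feature, different images can lean on different encoders, in contrast to a fixed concatenation or average of all experts.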

[NLP-30] Large Language Models Discriminate Against Speakers of German Dialects EMNLP2025

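[Quick read]: This paper asks whether large language models (LLMs) reproduce and amplify negative social stereotypes about speakers of German dialects. Dialects are an important part of human culture, yet their speakers often face discriminatory perceptions, and prior work has not established whether, or how, LLMs are biased when handling dialect-related text. The key to the approach is a new evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts, together with two tasks, an association task and a decision task, that quantify dialect naming bias and dialect usage bias. All evaluated LLMs show significant negative bias, and explicitly labeling someone as a dialect speaker amplifies bias more than implicit cues from dialect usage, which points to a risk that linguistic diversity is systematically stigmatized in AI systems.

Link: https://arxiv.org/abs/2509.13835
Authors: Minh Duc Bui,Carolin Holtermann,Valentin Hofmann,Anne Lauscher,Katharina von der Wense
Affiliations: Johannes Gutenberg University Mainz; University of Hamburg; Allen Institute for AI; University of Washington; University of Colorado Boulder
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main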

Abstract:Dialects represent a significant component of human culture and are found across all regions of the world. In Germany, more than 40% of the population speaks a regional dialect (Adler and Hansen, 2022). However, despite cultural importance, individuals speaking dialects often face negative societal stereotypes. We examine whether such stereotypes are mirrored by large language models (LLMs). We draw on the sociolinguistic literature on dialect perception to analyze traits commonly associated with dialect speakers. Based on these traits, we assess the dialect naming bias and dialect usage bias expressed by LLMs in two tasks: an association task and a decision task. To assess a model’s dialect usage bias, we construct a novel evaluation corpus that pairs sentences from seven regional German dialects (e.g., Alemannic and Bavarian) with their standard German counterparts. We find that: (1) in the association task, all evaluated LLMs exhibit significant dialect naming and dialect usage bias against German dialect speakers, reflected in negative adjective associations; (2) all models reproduce these dialect naming and dialect usage biases in their decision making; and (3) contrary to prior work showing minimal bias with explicit demographic mentions, we find that explicitly labeling linguistic demographics–German dialect speakers–amplifies bias more than implicit cues like dialect usage.

[NLP-31] Findings of the Third Automatic Minuting (AutoMin) Challenge

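[Quick read]: This paper addresses standardized evaluation of automatic meeting summarization and question answering (QA) over meeting transcripts, in multilingual and multi-domain settings. The solution is a shared task, AutoMin 2025, with two tasks. The first is minuting (structured meeting minutes) in two languages, English and Czech, and two domains, project meetings and European Parliament sessions. The second is QA, offered as monolingual QA in English and cross-lingual QA, where questions are asked and answered in Czech based on English meetings. To support a systematic assessment of current large language models (LLMs), the organizers also provide multiple baseline systems, giving the field a reproducible and comparable benchmark.

Link: https://arxiv.org/abs/2509.13814
Authors: Kartik Shinde,Laurent Besacier,Ondrej Bojar,Thibaut Thonet,Tirthankar Ghosal
Affiliations: NAVER Labs Europe; Charles University, MFF, ÚFAL; Oak Ridge National Laboratory
Categories: Computation and Language (cs.CL)
Comments: Automin 2025 Website: this https URL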

Abstract:This paper presents the third edition of AutoMin, a shared task on automatic meeting summarization into minutes. In 2025, AutoMin featured the main task of minuting, the creation of structured meeting minutes, as well as a new task: question answering (QA) based on meeting transcripts. The minuting task covered two languages, English and Czech, and two domains: project meetings and European Parliament sessions. The QA task focused solely on project meetings and was available in two settings: monolingual QA in English, and cross-lingual QA, where questions were asked and answered in Czech based on English meetings. Participation in 2025 was more limited compared to previous years, with only one team joining the minuting task and two teams participating in QA. However, as organizers, we included multiple baseline systems to enable a comprehensive evaluation of current (2025) large language models (LLMs) on both tasks.

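[NLP-32] Geometric Uncertainty for Detecting and Correcting Hallucinations in LLMs

[Quick read]: This paper addresses hallucination in large language models (LLMs) on question answering, that is, answers that are linguistically plausible but factually wrong. Uncertainty quantification can flag such answers, but no existing black-box method estimates both global uncertainty (attributed to a batch of responses) and local uncertainty (attributed to an individual response): current local methods typically need white-box access to internal model states, while black-box methods yield only a single global score. The key to the solution is a geometric framework based on archetypal analysis of the embeddings of responses sampled with black-box access only. It introduces two measures: Geometric Volume, the convex hull volume of the archetypes, for batch-level global uncertainty, and Geometric Suspicion, which ranks individual responses by reliability so that preferring the more reliable responses reduces hallucinations. The authors also prove a link between convex hull volume and entropy, which supports the method's interpretability.

Link: https://arxiv.org/abs/2509.13813
Authors: Edward Phillips,Sean Wu,Soheila Molaei,Danielle Belgrave,Anshul Thakur,David Clifton
Affiliations: University of Oxford; GlaxoSmithKline; Oxford Suzhou Centre for Advanced Research
Categories: Computation and Language (cs.CL)
Comments: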

Abstract:Large language models demonstrate impressive results across diverse tasks but are still known to hallucinate, generating linguistically plausible but incorrect answers to questions. Uncertainty quantification has been proposed as a strategy for hallucination detection, but no existing black-box approach provides estimates for both global and local uncertainty. The former attributes uncertainty to a batch of responses, while the latter attributes uncertainty to individual responses. Current local methods typically rely on white-box access to internal model states, whilst black-box methods only provide global uncertainty estimates. We introduce a geometric framework to address this, based on archetypal analysis of batches of responses sampled with only black-box model access. At the global level, we propose Geometric Volume, which measures the convex hull volume of archetypes derived from response embeddings. At the local level, we propose Geometric Suspicion, which ranks responses by reliability and enables hallucination reduction through preferential response selection. Unlike prior dispersion methods which yield only a single global score, our approach provides semantic boundary points which have utility for attributing reliability to individual responses. Experiments show that our framework performs comparably to or better than prior methods on short form question-answering datasets, and achieves superior results on medical datasets where hallucinations carry particularly critical risks. We also provide theoretical justification by proving a link between convex hull volume and entropy.
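
To make the "convex hull volume of archetypes" idea concrete, here is a small sketch of a volume computation. It rests on strong assumptions that are not taken from the paper: the archetypes are assumed to be already extracted (the archetypal analysis step is omitted), and there are assumed to be exactly k+1 affinely independent archetypes, so that their convex hull is a simplex whose volume follows from the Gram determinant of its edge vectors. The function names are hypothetical.

```python
import math


def gram_determinant(vectors):
    """Determinant of the Gram matrix G[i][j] = <v_i, v_j>, by elimination."""
    k = len(vectors)
    g = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    det = 1.0
    for col in range(k):
        # Partial pivoting keeps the elimination numerically stable.
        pivot = max(range(col, k), key=lambda r: abs(g[r][col]))
        if abs(g[pivot][col]) < 1e-12:
            return 0.0  # degenerate: the edge vectors are linearly dependent
        if pivot != col:
            g[col], g[pivot] = g[pivot], g[col]
            det = -det
        det *= g[col][col]
        for r in range(col + 1, k):
            factor = g[r][col] / g[col][col]
            for c in range(col, k):
                g[r][c] -= factor * g[col][c]
    return det


def simplex_volume(points):
    """Volume of the simplex spanned by k+1 points in any dimension."""
    base = points[0]
    edges = [[x - b for x, b in zip(p, base)] for p in points[1:]]
    k = len(edges)
    return math.sqrt(max(gram_determinant(edges), 0.0)) / math.factorial(k)


# Toy 2-D "embeddings": widely spread archetypes vs. a tight cluster.
spread = simplex_volume([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
tight = simplex_volume([(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)])
```

In this toy setting `spread` is larger than `tight`, matching the intuition that semantically dispersed responses signal higher batch-level uncertainty. Real response embeddings are high-dimensional, and the paper's exact construction may differ.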

[NLP-33] Measuring Gender Bias in Job Title Matching for Grammatical Gender Languages

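[Quick read]: This paper addresses gender bias in automatic job ranking systems caused by explicit grammatical gender assignment in job titles. The key to the approach is to evaluate bias with ranking-comparison metrics that control for gender, in particular Rank-Biased Overlap (RBO). The authors build and release job title matching test sets in four grammatical gender languages, with occupations in masculine and feminine form, and use them to benchmark several out-of-the-box multilingual models, all of which show varying degrees of gender bias.

Link: https://arxiv.org/abs/2509.13803
Authors: Laura García-Sardiña,Hermenegildo Fabregat,Daniel Deniz,Rabih Zbib
Affiliations: Avature Machine Learning
Categories: Computation and Language (cs.CL)
Comments: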

Abstract:This work sets the ground for studying how explicit grammatical gender assignment in job titles can affect the results of automatic job ranking systems. We propose the usage of metrics for ranking comparison controlling for gender to evaluate gender bias in job title ranking systems, in particular RBO (Rank-Biased Overlap). We generate and share test sets for a job title matching task in four grammatical gender languages, including occupations in masculine and feminine form and annotated by gender and matching relevance. We use the new test sets and the proposed methodology to evaluate the gender bias of several out-of-the-box multilingual models to set as baselines, showing that all of them exhibit varying degrees of gender bias.
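
As an illustration of how Rank-Biased Overlap can compare the rankings returned for the masculine and feminine forms of the same job title, here is a minimal sketch of the truncated RBO sum at a fixed depth. The abstract does not say which RBO variant or persistence value `p` the authors use, so both are assumptions, and the job titles below are invented placeholders rather than items from the released test sets.

```python
def rbo_truncated(ranking_a, ranking_b, p=0.9):
    """Truncated Rank-Biased Overlap of two rankings.

    Computes (1 - p) * sum_{d=1..k} p**(d-1) * |A[:d] & B[:d]| / d,
    where k is the length of the shorter ranking. Agreement near the
    top of the lists is weighted more heavily than agreement lower down.
    """
    depth = min(len(ranking_a), len(ranking_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        overlap = len(seen_a & seen_b)
        score += (p ** (d - 1)) * overlap / d
    return (1 - p) * score


# Hypothetical rankings for one query in masculine and feminine form.
results_masculine = ["title_1", "title_2", "title_3"]
results_feminine = ["title_2", "title_1", "title_4"]
similarity = rbo_truncated(results_masculine, results_feminine)
```

Because the sum is truncated at depth k, two identical rankings score 1 - p**k rather than 1; a lower score for a masculine/feminine pair than for identical rankings indicates that the gendered form changed the ranking.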

[NLP-34] Teaching According to Talents! Instruction Tuning LLMs with Competence-Aware Curriculum Learning EMNLP2025

[Quick read]: This paper addresses a performance bottleneck in instruction tuning caused by rigid curriculum learning strategies. Existing methods rely on static heuristic difficulty metrics and cannot adapt the learning path as the model's capabilities change during training, which limits final performance. The key to the solution is CAMPUS (Competence-Aware Multi-Perspective cUrriculum inStruction tuning), a framework that combines dynamic sub-curriculum selection, competence-aware adjustment of the curriculum schedule, and scheduling based on multiple difficulty criteria. The authors report that it outperforms state-of-the-art baselines.

Link: https://arxiv.org/abs/2509.13790
Authors: Yangning Li,Tingwei Lu,Yinghui Li,Yankai Chen,Wei-Chieh Huang,Wenhao Jiang,Hui Wang,Hai-Tao Zheng,Philip S. Yu
Affiliations: Tsinghua University; Peng Cheng Laboratory; Cornell University; University of Illinois Chicago; Guangming Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Findings

Abstract:Efficient instruction tuning aims to enhance the ultimate performance of large language models (LLMs) trained on a given instruction dataset. Curriculum learning as a typical data organization strategy has shown preliminary effectiveness in instruction tuning. However, current curriculum tuning methods suffer from the curriculum rigidity, since they rely solely on static heuristic difficulty metrics. These methods fail to adapt to the evolving capabilities of models during training, resulting in a fixed and potentially sub-optimal learning trajectory. To address the issue, Competence-Aware Multi-Perspective cUrriculum inStruction tuning framework termed CAMPUS is proposed. CAMPUS offers several advantages: (1) Dynamic selection for sub-curriculum. (2) Competency-aware adjustment to the curriculum schedule. (3) Multiple difficulty-based scheduling. Extensive experiments prove the superior performance of CAMPUS, compared to other state-of-the-art baselines for efficient instruction tuning.

[NLP-35] Exploring Data and Parameter Efficient Strategies for Arabic Dialect Identifications

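[Quick read]: This paper studies data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI): how to distinguish subtle dialectal differences with limited labeled data while avoiding the computational cost of full fine-tuning. It compares Parameter-Efficient Fine-Tuning (PEFT) strategies on Arabic-specific encoder models, namely the soft-prompting methods prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, as well as LoRA (Low-Rank Adaptation) reparameterization. It also evaluates hard prompting with zero-shot and few-shot inference on decoder-only models: the general multilingual Phi-3.5 and the Arabic-specific SILMA. The LLMs generally struggle to separate dialectal nuances in zero-shot and few-shot setups; soft-prompted encoder variants do better, and LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

Link: https://arxiv.org/abs/2509.13775
Authors: Vani Kanjirangat,Ljiljana Dolamic,Fabio Rinaldi
Affiliations: IDSIA-USI/SUPSI, Switzerland; armasuisse S+T, Switzerland
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 4 main pages, 4 additional, 5 figures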

Abstract:This paper discusses our exploration of different data-efficient and parameter-efficient approaches to Arabic Dialect Identification (ADI). In particular, we investigate various soft-prompting strategies, including prefix-tuning, prompt-tuning, P-tuning, and P-tuning V2, as well as LoRA reparameterizations. For the data-efficient strategy, we analyze hard prompting with zero-shot and few-shot inferences to analyze the dialect identification capabilities of Large Language Models (LLMs). For the parameter-efficient PEFT approaches, we conducted our experiments using Arabic-specific encoder models on several major datasets. We also analyzed the n-shot inferences on open-source decoder-only models, a general multilingual model (Phi-3.5), and an Arabic-specific one (SILMA). We observed that the LLMs generally struggle to differentiate the dialectal nuances in the few-shot or zero-shot setups. The soft-prompted encoder variants perform better, while the LoRA-based fine-tuned models perform best, even surpassing full fine-tuning.

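[NLP-36] THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

[Quick read]: This paper addresses the difficulty large language models (LLMs) have with high-precision mathematical tasks, in particular numerical computation and formal symbolic manipulation. Integrating external tools helps, but existing methods remain limited in three respects: constructing high-quality tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. The key to the solution is THOR (Tool-Integrated Hierarchical Optimization via RL), which has three components. First, TIRGen, a multi-agent actor-critic pipeline, constructs high-quality datasets of tool-integrated reasoning paths. Second, a reinforcement learning (RL) strategy jointly optimizes trajectory-level problem solving and step-level code generation, motivated by the observation that the success of an intermediate tool call is a strong predictor of final-answer correctness. Third, a self-correction mechanism uses immediate tool feedback to revise erroneous reasoning paths during inference. The authors report strong generalization across models and improved results on mathematical and code benchmarks.

Link: https://arxiv.org/abs/2509.13761
Authors: Qikai Chang,Zhenrong Zhang,Pengfei Hu,Jiefeng Ma,Yicheng Pan,Jianshu Zhang,Jun Du,Quan Liu,Jianqing Gao
Affiliations: University of Science and Technology of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 22 pages, 13 figures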

Abstract:Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but still continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at this https URL.

[NLP-37] Implementing a Logical Inference System for Japanese Comparatives

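[Quick read]: This paper addresses Natural Language Inference (NLI) involving comparatives, specifically logical inference over Japanese comparative sentences. Because Japanese and English comparatives differ morphologically and semantically, logical inference systems designed for English are hard to apply directly to Japanese. The key to the solution is ccg-jcomp, a logical inference system for Japanese comparatives grounded in compositional semantics, a logic-based approach that is promising for robust handling of numerical and logical expressions. The system is evaluated on a Japanese NLI dataset containing comparative expressions, and its accuracy is compared with that of existing large language models (LLMs).

Link: https://arxiv.org/abs/2509.13734
Authors: Yosuke Mikami,Daiki Matsuoka,Hitomi Yanaka
Affiliations: The University of Tokyo; Riken
Categories: Computation and Language (cs.CL)
Comments: In Proceedings of the 5th Workshop on Natural Logic Meets Machine Learning (NALOMA)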

Abstract:Natural Language Inference (NLI) involving comparatives is challenging because it requires understanding quantities and comparative relations expressed by sentences. While some approaches leverage Large Language Models (LLMs), we focus on logic-based approaches grounded in compositional semantics, which are promising for robust handling of numerical and logical expressions. Previous studies along these lines have proposed logical inference systems for English comparatives. However, it has been pointed out that there are several morphological and semantic differences between Japanese and English comparatives. These differences make it difficult to apply such systems directly to Japanese comparatives. To address this gap, this study proposes ccg-jcomp, a logical inference system for Japanese comparatives based on compositional semantics. We evaluate the proposed system on a Japanese NLI dataset containing comparative expressions. We demonstrate the effectiveness of our system by comparing its accuracy with that of existing LLMs.

[NLP-38] DSPC: Dual-Stage Progressive Compression Framework for Efficient Long-Context Reasoning

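[Quick read]: This paper addresses the rising computational cost of large language model (LLM) inference as prompts grow longer, the "prompt inflation" problem. The key to the solution is Dual-Stage Progressive Compression (DSPC), a training-free, two-stage method. The coarse-grained stage filters sentences by TF-IDF-based semantic relevance and removes those with low semantic value. The fine-grained stage scores token importance using three signals (attention contribution, cross-model loss difference, and positional importance) and prunes low-utility tokens while preserving semantics. On LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget, DSPC outperforms the best state-of-the-art baseline.

Link: https://arxiv.org/abs/2509.13723
Authors: Yaxin Gao,Yao Lu,Zongfei Zhang,Jiaqi Nie,Shanqing Yu,Qi Xuan
Affiliations: unknown
Categories: Computation and Language (cs.CL)
Comments: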

Abstract:Large language models (LLMs) have achieved remarkable success in many natural language processing (NLP) tasks. To achieve more accurate output, the prompts used to drive LLMs have become increasingly longer, which incurs higher computational costs. To address this prompt inflation problem, prompt compression has been proposed. However, most existing methods require training a small auxiliary model for compression, incurring a significant amount of additional computation. To avoid this, we propose a two-stage, training-free approach, called Dual-Stage Progressive Compression (DSPC). In the coarse-grained stage, semantic-related sentence filtering removes sentences with low semantic value based on TF-IDF. In the fine-grained stage, token importance is assessed using attention contribution, cross-model loss difference, and positional importance, enabling the pruning of low-utility tokens while preserving semantics. We validate DSPC on LLaMA-3.1-8B-Instruct and GPT-3.5-Turbo under a constrained token budget and observe consistent improvements. For instance, in the FewShot task of the Longbench dataset, DSPC achieves a performance of 49.17 by using only 3x fewer tokens, outperforming the best state-of-the-art baseline LongLLMLingua by 7.76.
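
The coarse-grained stage can be pictured with a small TF-IDF sentence filter. This is only a sketch of the general technique: the abstract does not specify how DSPC scores sentences, so scoring each sentence by the TF-IDF weight of the terms it shares with the question, the smoothed IDF formula, and the names `tokenize` and `filter_sentences` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [t for t in tokens if t]


def filter_sentences(sentences, question, keep):
    """Keep the `keep` sentences most relevant to the question.

    Relevance is the summed TF-IDF weight of terms shared with the
    question (an assumed scoring rule). Survivors keep their original order.
    """
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(question))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms & set(doc):
            idf = math.log((1 + n) / (1 + df[term])) + 1.0  # smoothed IDF
            score += (tf[term] / len(doc)) * idf
        scores.append(score)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    return [sentences[i] for i in sorted(top)]


context = [
    "The cat sat on the mat.",
    "Stock prices rose sharply today.",
    "A cat chased the mouse.",
]
kept = filter_sentences(context, "What did the cat do?", keep=2)
```

Here `kept` holds the two cat-related sentences and drops the unrelated one. In DSPC a second, token-level pruning stage would follow, which this sketch does not attempt.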

[NLP-39] Automated Triaging and Transfer Learning of Incident Learning Safety Reports Using Large Language Representational Models

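[Quick read]: This paper addresses the slow, expert-dependent manual review of incident reports in healthcare safety, focusing on detecting high-severity reports in radiation oncology. The solution is a natural language processing (NLP) screening tool built from two model types: a baseline support vector machine (SVM) and BlueBERT, a language model pretrained on PubMed abstracts and hospitalized patient data. The key step is cross-institution transfer learning: the BlueBERT_TRANSFER model is first fine-tuned on the authors' institutional training data and then on IAEA SAFRON (SF) training data, which raises AUROC on the SF test set from 0.56 to 0.78. On a manually curated subset of institutional reports, SVM and BlueBERT_TRANSFER reach AUROC 0.85 and 0.74 respectively, similar to human performance (AUROC 0.81).

Link: https://arxiv.org/abs/2509.13706
Authors: Peter Beidler,Mark Nguyen,Kevin Lybarger,Ola Holmberg,Eric Ford,John Kang
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: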

Abstract:PURPOSE: Incident reports are an important tool for safety and quality improvement in healthcare, but manual review is time-consuming and requires subject matter expertise. Here we present a natural language processing (NLP) screening tool to detect high-severity incident reports in radiation oncology across two institutions. METHODS AND MATERIALS: We used two text datasets to train and evaluate our NLP models: 7,094 reports from our institution (Inst.), and 571 from IAEA SAFRON (SF), all of which had severity scores labeled by clinical content experts. We trained and evaluated two types of models: baseline support vector machines (SVM) and BlueBERT which is a large language model pretrained on PubMed abstracts and hospitalized patient data. We assessed for generalizability of our model in two ways. First, we evaluated models trained using Inst.-train on SF-test. Second, we trained a BlueBERT_TRANSFER model that was first fine-tuned on Inst.-train then on SF-train before testing on SF-test set. To further analyze model performance, we also examined a subset of 59 reports from our Inst. dataset, which were manually edited for clarity. RESULTS: Classification performance on the Inst. test achieved AUROC 0.82 using SVM and 0.81 using BlueBERT. Without cross-institution transfer learning, performance on the SF test was limited to an AUROC of 0.42 using SVM and 0.56 using BlueBERT. BlueBERT_TRANSFER, which was fine-tuned on both datasets, improved the performance on SF test to AUROC 0.78. Performance of SVM and BlueBERT_TRANSFER models on the manually curated Inst. reports (AUROC 0.85 and 0.74) was similar to human performance (AUROC 0.81). CONCLUSION: In summary, we successfully developed cross-institution NLP models on incident report text from radiation oncology centers. These models were able to detect high-severity reports similarly to humans on a curated dataset.

[NLP-40] DSCC-HS: A Dynamic Self-Reinforcing Framework for Hallucination Suppression in Large Language Models

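[Quick read]: This paper addresses hallucination in large language models (LLMs), that is, output that is inconsistent with the facts. Existing methods such as Retrieval-Augmented Generation (RAG) are often reactive and do not intervene during generation. The paper proposes a proactive alternative, Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS). Its core is a compact proxy model trained in two adversarial roles: a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). At each step of autoregressive decoding, DSCC-HS computes a real-time steering vector as the difference between the FAP and HDP logits and injects it into the target model to steer it toward factual text. The approach is plug-and-play and requires no modification to the target model.

Link: https://arxiv.org/abs/2509.13702
Authors: Xiao Zheng
Affiliations: China University of Petroleum
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: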

Abstract:Large Language Model (LLM) hallucination is a significant barrier to their reliable deployment. Current methods like Retrieval-Augmented Generation (RAG) are often reactive. We introduce Dynamic Self-reinforcing Calibration for Hallucination Suppression (DSCC-HS), a novel, proactive framework that intervenes during autoregressive decoding. Inspired by dual-process cognitive theory, DSCC-HS uses a compact proxy model, trained in adversarial roles as a Factual Alignment Proxy (FAP) and a Hallucination Detection Proxy (HDP). During inference, these proxies dynamically steer a large target model by injecting a real-time steering vector, which is the difference between FAP and HDP logits, at each decoding step. This plug-and-play approach requires no modification to the target model. Our experiments on TruthfulQA and BioGEN show DSCC-HS achieves state-of-the-art performance. On TruthfulQA, it reached a 99.2% Factual Consistency Rate (FCR). On the long-form BioGEN benchmark, it attained the highest FActScore of 46.50. These results validate DSCC-HS as a principled and efficient solution for enhancing LLM factuality.
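
The steering step described in the abstract (a vector equal to the difference between FAP and HDP logits, injected at each decoding step) can be sketched for a single step as follows. The abstract does not say exactly where or how the vector is injected, so adding it to the target model's next-token logits over a shared vocabulary, and the scaling factor `alpha`, are assumptions; the function names are hypothetical.

```python
def steer_logits(target_logits, fap_logits, hdp_logits, alpha=1.0):
    """Add the steering vector (FAP - HDP) to the target model's logits.

    All three lists are assumed to be indexed by the same vocabulary.
    `alpha` is a hypothetical knob controlling the steering strength.
    """
    return [
        t + alpha * (f - h)
        for t, f, h in zip(target_logits, fap_logits, hdp_logits)
    ]


def greedy_token(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy three-token vocabulary for one decoding step.
target = [2.0, 1.5, 0.0]  # the target model slightly prefers token 0
fap = [0.0, 1.0, 0.0]     # the factual proxy favors token 1
hdp = [1.0, 0.0, 0.0]     # the hallucination proxy favors token 0

unsteered_choice = greedy_token(target)
steered_choice = greedy_token(steer_logits(target, fap, hdp))
```

In this toy step the unsteered model picks token 0, while the steered logits `[1.0, 2.5, 0.0]` pick token 1, the token the factual proxy prefers and the hallucination proxy disfavors. A full decoder would repeat this at every step with fresh logits from all three models.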

[NLP-41] Integrating Text and Time-Series into (Large) Language Models to Predict Medical Outcomes

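[Quick read]: This paper addresses the limited, underexplored ability of large language models (LLMs) to handle clinical classification tasks that involve structured data such as time series. The key to the solution is DSPy-based prompt optimization, used to adapt instruction-tuned LLMs so that they process clinical notes and structured electronic health record (EHR) inputs jointly. The approach performs on par with specialized multimodal systems while requiring less complexity and adapting more readily across tasks.

Link: https://arxiv.org/abs/2509.13696
Authors: Iyadh Ben Cheikh Larbi,Ajay Madhavan Ravichandran,Aljoscha Burchardt,Roland Roller
Affiliations: German Research Center for Artificial Intelligence (DFKI); Technical University Berlin
Categories: Computation and Language (cs.CL)
Comments: Presented and published at BioCreative IX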

Abstract:Large language models (LLMs) excel at text generation, but their ability to handle clinical classification tasks involving structured data, such as time series, remains underexplored. In this work, we adapt instruction-tuned LLMs using DSPy-based prompt optimization to process clinical notes and structured EHR inputs jointly. Our results show that this approach achieves performance on par with specialized multimodal systems while requiring less complexity and offering greater adaptability across tasks.

[NLP-42] Can Large Language Models Robustly Perform Natural Language Inference for Japanese Comparatives?

[Quick read]: This paper addresses the weak performance of large language models (LLMs) on Natural Language Inference (NLI) involving numerical and logical expressions, and in particular the under-studied case of Japanese comparatives. The authors construct a Japanese NLI dataset focused on comparatives and evaluate a range of LLMs in zero-shot and few-shot settings. They also find that prompts containing logical semantic representations help the models predict the correct labels on inference problems they fail to solve even with few-shot examples.

Link: https://arxiv.org/abs/2509.13695
Authors: Yosuke Mikami,Daiki Matsuoka,Hitomi Yanaka
Affiliations: The University of Tokyo; Riken
Categories: Computation and Language (cs.CL)
Comments: To appear in Proceedings of the 16th International Conference on Computational Semantics (IWCS 2025)

Abstract:Large Language Models (LLMs) perform remarkably well in Natural Language Inference (NLI). However, NLI involving numerical and logical expressions remains challenging. Comparatives are a key linguistic phenomenon related to such inference, but the robustness of LLMs in handling them, especially in languages that are not dominant in the models’ training data, such as Japanese, has not been sufficiently explored. To address this gap, we construct a Japanese NLI dataset that focuses on comparatives and evaluate various LLMs in zero-shot and few-shot settings. Our results show that the performance of the models is sensitive to the prompt formats in the zero-shot setting and influenced by the gold labels in the few-shot examples. The LLMs also struggle to handle linguistic phenomena unique to Japanese. Furthermore, we observe that prompts containing logical semantic representations help the models predict the correct labels for inference problems that they struggle to solve even with few-shot examples.

[NLP-43] Improving Context Fidelity via Native Retrieval-Augmented Reasoning EMNLP2025

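[Quick read]: This paper addresses context fidelity in large language models (LLMs): when answering questions based on provided information, models often fail to use the given evidence and produce inconsistent or incorrect answers. The key to the solution is CARE, a native retrieval-augmented reasoning framework that teaches the model to use its own retrieval capabilities to integrate in-context evidence explicitly within its reasoning chain. Unlike approaches that depend on expensive supervised fine-tuning or external retrieval, CARE needs only limited labeled evidence data, and it significantly improves both retrieval accuracy and answer generation.

Link: https://arxiv.org/abs/2509.13683
Authors: Suyuchen Wang,Jinlin Wang,Xinyu Wang,Shiqi Li,Xiangru Tang,Sirui Hong,Xiao-Wen Chang,Chenglin Wu,Bang Liu
Affiliations: unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted as a main conference paper at EMNLP 2025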

Abstract:Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model’s own retrieval capabilities. Our method requires limited labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.

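[NLP-44] AgentCTG: Harnessing Multi-Agent Collaboration for Fine-Grained Precise Control in Text Generation

[Quick read]: This paper addresses fine-grained conditional control in Controlled Text Generation (CTG), particularly the demands of real-world applications for low cost, scalability, domain knowledge learning, and more precise control. The key to the solution is AgentCTG, a novel and scalable framework that achieves precise and complex control over generation by simulating the control and regulation mechanisms of multi-agent workflows. It also introduces an auto-prompt module to further improve generation, and it achieves state-of-the-art results on multiple public datasets.

Link: https://arxiv.org/abs/2509.13677
Authors: Xinxu Zhou,Jiaqi Bai,Zhenqi Sun,Fanxiang Zeng,Yue Liu
Affiliations: AMAP, Alibaba Group
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: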

Abstract:Although significant progress has been made in many tasks within the field of Natural Language Processing (NLP), Controlled Text Generation (CTG) continues to face numerous challenges, particularly in achieving fine-grained conditional control over generation. Additionally, in real scenario and online applications, cost considerations, scalability, domain knowledge learning and more precise control are required, presenting more challenge for CTG. This paper introduces a novel and scalable framework, AgentCTG, which aims to enhance precise and complex control over the text generation by simulating the control and regulation mechanisms in multi-agent workflows. We explore various collaboration methods among different agents and introduce an auto-prompt module to further enhance the generation effectiveness. AgentCTG achieves state-of-the-art results on multiple public datasets. To validate its effectiveness in practical applications, we propose a new challenging Character-Driven Rewriting task, which aims to convert the original text into new text that conform to specific character profiles and simultaneously preserve the domain knowledge. When applied to online navigation with role-playing, our approach significantly enhances the driving experience through improved content delivery. By optimizing the generation of contextually relevant text, we enable a more immersive interaction within online communities, fostering greater personalization and user engagement.

[NLP-45] CL2GEC: A Multi-Discipline Benchmark for Continual Learning in Chinese Literature Grammatical Error Correction

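[Quick read]: This paper addresses the weak cross-disciplinary adaptability of current Chinese Grammatical Error Correction (CGEC) systems, in particular the lack of a continual learning (CL) benchmark for academic writing, without which models struggle to handle stylistic differences between disciplines and to avoid catastrophic forgetting. The key to the solution is CL²GEC, the first continual learning benchmark for Chinese Literature Grammatical Error Correction. It contains 10,000 human-annotated sentences across 10 disciplines and simulates sequential exposure to multiple academic fields. Large language models are evaluated under sequential tuning, parameter-efficient adaptation, and four representative CL algorithms, using both standard GEC metrics and continual learning metrics adapted to task-level variation. The experiments show that regularization-based methods mitigate forgetting more effectively than replay-based or naive sequential approaches.

Link: https://arxiv.org/abs/2509.13672
Authors: Shang Qin,Jingheng Ye,Yinghui Li,Hai-Tao Zheng,Qi Li,Jinxiao Shan,Zhixing Li,Hong-Gee Kim
Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; China Merchants Group; Zhipu AI; Seoul National University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: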

Abstract:The growing demand for automated writing assistance in diverse academic domains highlights the need for robust Chinese Grammatical Error Correction (CGEC) systems that can adapt across disciplines. However, existing CGEC research largely lacks dedicated benchmarks for multi-disciplinary academic writing, overlooking continual learning (CL) as a promising solution to handle domain-specific linguistic variation and prevent catastrophic forgetting. To fill this crucial gap, we introduce CL ^2 GEC, the first Continual Learning benchmark for Chinese Literature Grammatical Error Correction, designed to evaluate adaptive CGEC across multiple academic fields. Our benchmark includes 10,000 human-annotated sentences spanning 10 disciplines, each exhibiting distinct linguistic styles and error patterns. CL ^2 GEC focuses on evaluating grammatical error correction in a continual learning setting, simulating sequential exposure to diverse academic disciplines to reflect real-world editorial dynamics. We evaluate large language models under sequential tuning, parameter-efficient adaptation, and four representative CL algorithms, using both standard GEC metrics and continual learning metrics adapted to task-level variation. Experimental results reveal that regularization-based methods mitigate forgetting more effectively than replay-based or naive sequential approaches. Our benchmark provides a rigorous foundation for future research in adaptive grammatical error correction across diverse academic domains.

[NLP-46] Sparse Neurons Carry Strong Signals of Question Ambiguity in LLMs EMNLP2025

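Quick read: This paper addresses a limitation of large language models (LLMs): when faced with the ambiguity that is pervasive in real-world questions, they tend to answer confidently rather than seek clarification. The core finding is that question ambiguity is linearly encoded in the model's internal representations and can be exploited. The authors identify a small number of Ambiguity-Encoding Neurons (AENs) that capture the ambiguity signal as early as the pre-filling stage. Probes trained on these AENs detect ambiguity accurately across multiple datasets and clearly outperform prompting-based and representation-based baselines. Intervening on AENs further shifts model behavior from direct answering to abstention, enabling interpretable and controllable regulation of LLM behavior.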

Link: https://arxiv.org/abs/2509.13664
Authors: Zhuoxuan Zhang, Jinhao Duan, Edward Kim, Kaidi Xu
Affiliations: Brown University; Drexel University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: To appear in EMNLP 2025 (main)

Abstract:Ambiguity is pervasive in real-world questions, yet large language models (LLMs) often respond with confident answers rather than seeking clarification. In this work, we show that question ambiguity is linearly encoded in the internal representations of LLMs and can be both detected and controlled at the neuron level. During the model’s pre-filling stage, we identify that a small number of neurons, as few as one, encode question ambiguity information. Probes trained on these Ambiguity-Encoding Neurons (AENs) achieve strong performance on ambiguity detection and generalize across datasets, outperforming prompting-based and representation-based baselines. Layerwise analysis reveals that AENs emerge from shallow layers, suggesting early encoding of ambiguity signals in the model’s processing pipeline. Finally, we show that through manipulating AENs, we can control LLM’s behavior from direct answering to abstention. Our findings reveal that LLMs form compact internal representations of question ambiguity, enabling interpretable and controllable behavior.
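
To make the idea of probing a handful of neurons concrete, here is a minimal sketch of a single-neuron threshold probe on synthetic activations. It is not the authors' probe or neuron-selection procedure: the selection rule (largest gap between class means), the function names, and the synthetic data are all assumptions for illustration.

```python
import random


def fit_single_neuron_probe(activations, labels):
    """Pick the neuron whose mean activation differs most between the classes."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    best = None
    for k in range(len(activations[0])):
        mean_pos = sum(a[k] for a in pos) / len(pos)
        mean_neg = sum(a[k] for a in neg) / len(neg)
        gap = abs(mean_pos - mean_neg)
        if best is None or gap > best[0]:
            # Store the midpoint threshold and which side is the positive class.
            best = (gap, k, (mean_pos + mean_neg) / 2, mean_pos > mean_neg)
    _, index, threshold, positive_is_high = best
    return index, threshold, positive_is_high


def predict(activation, index, threshold, positive_is_high):
    """Return 1 (ambiguous) or 0 (unambiguous) from one neuron's activation."""
    is_high = activation[index] > threshold
    return int(is_high == positive_is_high)


# Synthetic "activations": 8 neurons of noise, with neuron 3 carrying the label.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
activations = []
for y in labels:
    row = [rng.gauss(0.0, 1.0) for _ in range(8)]
    row[3] += 6.0 * y
    activations.append(row)

index, threshold, positive_is_high = fit_single_neuron_probe(activations, labels)
accuracy = sum(
    predict(a, index, threshold, positive_is_high) == y
    for a, y in zip(activations, labels)
) / len(labels)
```

In a real setting the rows would be hidden activations recorded during pre-filling, and the probe would be evaluated on held-out questions rather than on its own training data.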

[NLP-47] Privacy-Aware In-Context Learning for Large Language Models

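Quick read: This paper addresses the privacy risk that large language models (LLMs) may leak sensitive information during text generation, in particular the risk that an adversary extracts private data embedded in the input prompt. The key idea is to use the Differential Privacy (DP) framework without any model fine-tuning: the method runs inference on private records and aggregates the resulting per-token output distributions, which allows it to generate long, coherent synthetic text with rigorous theoretical privacy guarantees. The authors also propose a simple blending operation that combines private and public inference to further improve utility. Experiments show that the method outperforms previous state-of-the-art approaches on in-context learning (ICL) tasks.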

Link: https://arxiv.org/abs/2509.13625
Authors: Bishnu Bhusal, Manoj Acharya, Ramneet Kaur, Colin Samplawski, Anirban Roy, Adam D. Cobb, Rohit Chadha, Susmit Jha
Affiliations: University of California, San Diego
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments:

Abstract: Large language models (LLMs) have significantly transformed natural language understanding and generation, but they raise privacy concerns due to potential exposure of sensitive information. Studies have highlighted the risk of information leakage, where adversaries can extract sensitive information embedded in the prompts. In this work, we introduce a novel private prediction framework for generating high-quality synthetic text with strong privacy guarantees. Our approach leverages the Differential Privacy (DP) framework to ensure worst-case theoretical bounds on information leakage without requiring any fine-tuning of the underlying models. The proposed method performs inference on private records and aggregates the resulting per-token output distributions. This enables the generation of longer and coherent synthetic text while maintaining privacy guarantees. Additionally, we propose a simple blending operation that combines private and public inference to further enhance utility. Empirical evaluations demonstrate that our approach outperforms previous state-of-the-art methods on in-context-learning (ICL) tasks, making it a promising direction for privacy-preserving text generation while maintaining high utility.
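
The two operations named in the abstract, aggregating per-token distributions and blending private with public inference, can be pictured with the sketch below. It only shows the arithmetic of averaging and mixing next-token distributions. It provides no privacy guarantee on its own: a real DP mechanism would also need calibrated noise or a private sampling step, which is not reproduced here. The function names, the `weight` parameter, and the toy distributions are assumptions.

```python
def average_distributions(distributions):
    """Average next-token distributions produced from separate private records."""
    n = len(distributions)
    vocab = distributions[0].keys()
    return {token: sum(d[token] for d in distributions) / n for token in vocab}


def blend(private_avg, public, weight):
    """Mix the aggregated private distribution with a public-only distribution.

    weight is the share given to the private side (hypothetical parameter).
    """
    return {
        token: weight * private_avg[token] + (1.0 - weight) * public[token]
        for token in private_avg
    }


# Toy next-token distributions over a two-token vocabulary.
private_outputs = [{"a": 0.8, "b": 0.2}, {"a": 0.6, "b": 0.4}]
public_output = {"a": 0.5, "b": 0.5}

aggregated = average_distributions(private_outputs)
blended = blend(aggregated, public_output, weight=0.5)
```

Because both inputs are probability distributions and the blend is a convex combination, the result is again a distribution from which the next token can be sampled.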

[NLP-48] Latent Traits and Cross-Task Transfer: Deconstructing Dataset Interactions in LLM Fine-tuning

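Quick read: This paper addresses a core practical challenge for large language models (LLMs): models are deployed on tasks not seen during training, and because high-quality training data cannot be enumerated and collected for every task, conventional approaches struggle to support adaptation. The authors propose an analysis framework based on a transfer learning matrix and dimensionality reduction to systematically dissect cross-task interactions. By quantifying transfer effects between tasks, the framework identifies the hidden statistical factors (such as class distribution and generation-length proclivities) and specific linguistic features that drive performance gains, rather than relying on surface-level dataset similarity or source data quality. The analysis exposes the deeper dynamics of transfer learning and points toward more predictable and effective LLM adaptation.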

Link: https://arxiv.org/abs/2509.13624
Authors: Shambhavi Krishna, Atharva Naik, Chaitali Agarwal, Sudharshan Govindan, Taesung Lee, Haw-Shiuan Chang
Affiliations: University of Massachusetts Amherst; Anthropic
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Camera-ready version. Accepted to appear in the proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

Abstract:Large language models are increasingly deployed across diverse applications. This often includes tasks LLMs have not encountered during training. This implies that enumerating and obtaining the high-quality training data for all tasks is infeasible. Thus, we often need to rely on transfer learning using datasets with different characteristics, and anticipate out-of-distribution requests. Motivated by this practical need, we propose an analysis framework, building a transfer learning matrix and dimensionality reduction, to dissect these cross-task interactions. We train and analyze 10 models to identify latent abilities (e.g., Reasoning, Sentiment Classification, NLU, Arithmetic) and discover the side effects of the transfer learning. Our findings reveal that performance improvements often defy explanations based on surface-level dataset similarity or source data quality. Instead, hidden statistical factors of the source dataset, such as class distribution and generation length proclivities, alongside specific linguistic features, are actually more influential. This work offers insights into the complex dynamics of transfer learning, paving the way for more predictable and effective LLM adaptation.
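
One plausible reading of a transfer learning matrix is a table of score changes, with one row per fine-tuning source and one column per target task. The sketch below builds such a table from made-up scores and reads off the most helpful source for each target. The data layout, function names, and numbers are assumptions, and the dimensionality-reduction step mentioned in the abstract is not shown.

```python
def transfer_matrix(baseline, finetuned):
    """Rows are source datasets, columns are target tasks, entries are score deltas.

    baseline[t] is the base model's score on target t; finetuned[s][t] is the
    score on t after fine-tuning on source s (hypothetical inputs).
    """
    sources = sorted(finetuned)
    targets = sorted(baseline)
    matrix = [[finetuned[s][t] - baseline[t] for t in targets] for s in sources]
    return sources, targets, matrix


def best_source_per_target(sources, targets, matrix):
    """For each target task, return the source dataset with the largest gain."""
    best = {}
    for col, target in enumerate(targets):
        row = max(range(len(sources)), key=lambda r: matrix[r][col])
        best[target] = sources[row]
    return best


# Made-up scores, purely to show the shapes involved.
baseline = {"arithmetic": 0.40, "sentiment": 0.70}
finetuned = {
    "math_qa": {"arithmetic": 0.55, "sentiment": 0.68},
    "reviews": {"arithmetic": 0.38, "sentiment": 0.81},
}
sources, targets, matrix = transfer_matrix(baseline, finetuned)
best = best_source_per_target(sources, targets, matrix)
```

Negative entries correspond to the side effects the abstract mentions: fine-tuning on one source can lower the score on an unrelated target.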

[NLP-49] See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles

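Quick read: This paper addresses the unreliability of multimodal agents when executing toggle control instructions in graphical user interfaces (GUIs); existing methods often act incorrectly, especially when the current state already matches the desired state. The key contribution is State-aware Reasoning (StaR), a training method that teaches an agent to perceive the current toggle state, infer the desired state from the instruction, and act accordingly. StaR improves toggle instruction execution accuracy by more than 30% and also improves general task performance.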

Link: https://arxiv.org/abs/2509.13615
Authors: Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
Affiliations: Shanghai Jiao Tong University; Beijing Institute of Technology
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at this https URL.
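
The perceive, analyze, act sequence described in the abstract can be written down as a tiny rule-based decision function. This is only an illustration of the target behavior, not the StaR training method, which teaches a model to reason this way from screenshots. The keyword matching, state strings, and action names are assumptions.

```python
def decide_action(current_state, instruction):
    """Decide whether a toggle needs to be clicked.

    current_state is "on" or "off" (assumed to be perceived from the screen);
    the desired state is parsed from the instruction with simple keywords.
    """
    text = instruction.lower()
    if "turn on" in text or "enable" in text:
        desired_state = "on"
    elif "turn off" in text or "disable" in text:
        desired_state = "off"
    else:
        return "unknown"
    # The failure case highlighted in the abstract: the toggle is already in
    # the desired state, so the correct action is to leave it alone.
    if current_state == desired_state:
        return "no_op"
    return "click_toggle"


already_on = decide_action("on", "Turn on Wi-Fi")
needs_click = decide_action("off", "Enable dark mode")
```

An agent that skips the state check would click in both cases and switch Wi-Fi off in the first one.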

[NLP-50] Annotating Satellite Images of Forests with Keywords from a Specialized Corpus in the Context of Change Detection

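Quick read: This paper addresses the automatic detection and semantic annotation of deforestation in the Amazon rainforest, in support of monitoring its impact on global carbon emissions and biodiversity. The authors propose a deep-learning change detection method that compares pairs of Earth observation satellite images taken at different dates to identify changes in forest cover. A visual semantic model then annotates the detected changes automatically, drawing candidate keywords from related scientific literature. The method is validated on a dataset of Amazon image pairs and is generic enough to be applied to other domains.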

Link: https://arxiv.org/abs/2509.13586
Authors: Nathalie Neptune, Josiane Mothe
Affiliations: IRIT; Université de Toulouse; CNRS; Toulouse INP; UT3; INSPÉ
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR); Multimedia (cs.MM)
Comments:

Abstract: The Amazon rain forest is a vital ecosystem that plays a crucial role in regulating the Earth’s climate and providing habitat for countless species. Deforestation in the Amazon is a major concern as it has a significant impact on global carbon emissions and biodiversity. In this paper, we present a method for detecting deforestation in the Amazon using image pairs from Earth observation satellites. Our method leverages deep learning techniques to compare the images of the same area at different dates and identify changes in the forest cover. We also propose a visual semantic model that automatically annotates the detected changes with relevant keywords. The candidate annotations for images are extracted from scientific documents related to the Amazon region. We evaluate our approach on a dataset of Amazon image pairs and demonstrate its effectiveness in detecting deforestation and generating relevant annotations. Our method provides a useful tool for monitoring and studying the impact of deforestation in the Amazon. While we focus on environmental applications of our work by using images of deforestation in the Amazon rain forest to demonstrate the effectiveness of our proposed approach, it is generic enough to be applied to other domains.

[NLP-51] Overview of Dialog System Evaluation Track: Dimensionality, Language, Culture and Safety at DSTC 12

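Quick read: This paper addresses shortcomings in how dialogue systems built on large language models (LLMs) are currently evaluated: too few dimensions, limited cultural sensitivity, and narrow safety considerations. DSTC12 Track 1 builds a more comprehensive evaluation setup through two subtasks. The first targets multi-dimensional automatic dialogue evaluation metrics covering 10 dialogue dimensions, such as fluency and relevance. The second targets multilingual and multicultural safety detection. Participating teams clearly outperformed the baseline on the multilingual safety subset, but the baseline remained stronger on the cultural subset, which highlights the need for culturally aware safety detection and for more robust, fair, and diverse dialogue evaluation.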

Link: https://arxiv.org/abs/2509.13569
Authors: John Mendonça, Lining Zhang, Rahul Mallidi, Alon Lavie, Isabel Trancoso, Luis Fernando D’Haro, João Sedoc
Affiliations: INESC-ID; Instituto Superior Técnico - University of Lisbon; Department of Technology, Operations, and Statistics, New York University; Speech Technology and Machine Learning Group - Universidad Politécnica de Madrid; Carnegie Mellon University; Phrase
Categories: Computation and Language (cs.CL)
Comments: DSTC12 Track 1 Overview Paper. this https URL

Abstract:The rapid advancement of Large Language Models (LLMs) has intensified the need for robust dialogue system evaluation, yet comprehensive assessment remains challenging. Traditional metrics often prove insufficient, and safety considerations are frequently narrowly defined or culturally biased. The DSTC12 Track 1, “Dialog System Evaluation: Dimensionality, Language, Culture and Safety,” is part of the ongoing effort to address these critical gaps. The track comprised two subtasks: (1) Dialogue-level, Multi-dimensional Automatic Evaluation Metrics, and (2) Multilingual and Multicultural Safety Detection. For Task 1, focused on 10 dialogue dimensions, a Llama-3-8B baseline achieved the highest average Spearman’s correlation (0.1681), indicating substantial room for improvement. In Task 2, while participating teams significantly outperformed a Llama-Guard-3-1B baseline on the multilingual safety subset (top ROC-AUC 0.9648), the baseline proved superior on the cultural subset (0.5126 ROC-AUC), highlighting critical needs in culturally-aware safety. This paper describes the datasets and baselines provided to participants, as well as submission evaluation results for each of the two proposed subtasks.
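
Task 2 results are reported as ROC-AUC. For readers less familiar with the metric, the sketch below computes it directly from its pairwise definition: the fraction of (unsafe, safe) pairs in which the unsafe item receives the higher score, with ties counted as one half. The function name and the toy labels and scores are illustrative and unrelated to the track's data.

```python
def roc_auc(labels, scores):
    """ROC-AUC from its pairwise (Mann-Whitney) definition.

    labels are 1 for the positive class (e.g. unsafe) and 0 otherwise.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

A value of 0.5 corresponds to random ranking, which puts the 0.5126 reported on the cultural subset in context.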

[NLP-52] Op-Fed: Opinion, Stance, and Monetary Policy Annotations on FOMC Transcripts Using Active Learning

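Quick read: This paper addresses two core challenges in stance classification for financial text. The first is class imbalance: fewer than 8% of sentences express a non-neutral stance toward monetary policy. The second is strong inter-sentence dependence: about 65% of instances need context beyond the single sentence to be annotated correctly. The authors propose a five-stage hierarchical annotation schema that separates opinion, monetary policy content, and stance, and that records the level of context required. They also use active learning to select the most informative instances for human annotation, which substantially raises the share of positive instances. The result is the high-quality Op-Fed dataset, a reliable basis for later model training and stance analysis.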

Link: https://arxiv.org/abs/2509.13539
Authors: Alisa Kanganis, Katherine A. Keith
Affiliations: Williams College
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The U.S. Federal Open Market Committee (FOMC) regularly discusses and sets monetary policy, affecting the borrowing and spending decisions of millions of people. In this work, we release Op-Fed, a dataset of 1044 human-annotated sentences and their contexts from FOMC transcripts. We faced two major technical challenges in dataset creation: imbalanced classes – we estimate fewer than 8% of sentences express a non-neutral stance towards monetary policy – and inter-sentence dependence – 65% of instances require context beyond the sentence-level. To address these challenges, we developed a five-stage hierarchical schema to isolate aspects of opinion, monetary policy, and stance towards monetary policy as well as the level of context needed. Second, we selected instances to annotate using active learning, roughly doubling the number of positive instances across all schema aspects. Using Op-Fed, we found a top-performing, closed-weight LLM achieves 0.80 zero-shot accuracy in opinion classification but only 0.61 zero-shot accuracy classifying stance towards monetary policy – below our human baseline of 0.89. We expect Op-Fed to be useful for future model training, confidence calibration, and as a seed dataset for future annotation efforts.
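
The abstract does not say which active learning strategy was used to pick instances. As one common possibility, the sketch below shows uncertainty sampling: given a classifier's predicted probability that each unlabeled sentence is positive, it selects the sentences closest to 0.5. The function name and probabilities are assumptions for illustration.

```python
def select_most_uncertain(probabilities, k):
    """Return the indices of the k instances whose predictions are closest to 0.5."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:k]


# Hypothetical model probabilities that each sentence expresses a stance.
predicted = [0.02, 0.48, 0.91, 0.55, 0.10]
to_annotate = select_most_uncertain(predicted, k=2)
```

With rare positive classes, strategies of this kind tend to surface borderline sentences instead of the many clearly neutral ones, which is in line with the reported doubling of positive instances.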

[NLP-53] Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs

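Quick read: This paper addresses gender-neutral rewriting (GNR) in grammatical-gender languages such as Italian, that is, removing unnecessary gender marking without changing the meaning of the original text. The key contribution is a two-dimensional evaluation framework that measures both the neutrality of the rewrite and its semantic fidelity, combined with few-shot prompting, fine-tuning, and targeted data cleaning to improve output quality. Experiments show that open-weight large language models (LLMs) outperform the only existing dedicated model on this task, and that fine-tuned models match or exceed the best open-weight model at a much smaller size.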

Link: https://arxiv.org/abs/2509.13480
Authors: Andrea Piergentili, Beatrice Savoldi, Matteo Negri, Luisa Bentivogli
Affiliations: University of Trento; Fondazione Bruno Kessler
Categories: Computation and Language (cs.CL)
Comments: Accepted at CLiC-it 2025

Abstract:Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM’s performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.

[NLP-54] SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

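Quick read: This paper addresses the lack of systematic understanding, within generative AI, of how representation steering methods affect multiple alignment objectives (bias, harmful generation, and hallucination) and secondary behaviors (such as sycophancy and commonsense morality). Prior work mostly uses truthfulness or reasoning ability to illustrate side effects and leaves many trade-offs unexplored. The authors build a modular steering framework from reusable core components that can express five popular steering methods in a unified way, and use a newly collected dataset of safety-relevant behaviors to evaluate each method's effectiveness and degree of concept entanglement for specific combinations of model and target behavior. The results show that strong steering performance depends heavily on the combination of method, model, and target behavior, and that poor combinations can cause severe concept entanglement.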

Link: https://arxiv.org/abs/2509.13450
Authors: Vincent Siu, Nicholas Crispino, David Park, Nathan W. Henry, Zhun Wang, Yang Liu, Dawn Song, Chenguang Wang
Affiliations: University of California, Santa Cruz; Washington University in St. Louis; University of California, Berkeley
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives–bias, harmful generation, and hallucination–and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance is dependent on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of these three as well. We release our code here: this https URL.
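
The abstract does not list the five steering methods. As background, the sketch below shows one widely used family, a difference-of-means direction added to a hidden state, using plain lists in place of model activations. Whether this matches any of the benchmarked methods is not stated in the text, and the function names and toy vectors are assumptions.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def difference_of_means(positive_acts, negative_acts):
    """Direction pointing from the 'negative' examples toward the 'positive' ones."""
    pos_mean = mean_vector(positive_acts)
    neg_mean = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos_mean, neg_mean)]


def steer(hidden_state, direction, strength):
    """Shift a hidden state along the steering direction."""
    return [h + strength * d for h, d in zip(hidden_state, direction)]


# Toy two-dimensional "activations" for prompts with and without a behavior.
with_behavior = [[1.0, 2.0], [3.0, 2.0]]
without_behavior = [[0.0, 1.0], [0.0, 1.0]]

direction = difference_of_means(with_behavior, without_behavior)
steered = steer([0.5, 0.5], direction, strength=0.5)
```

Entanglement, in this picture, is what happens when the chosen direction also has a component along some other behavior, so that steering one moves the other.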

[NLP-55] CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI

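Quick read: This paper addresses the abstractness, conflicts, and opacity that hamper the alignment of artificial intelligence (AI) with human values, in particular the lack of transparency and reliability of existing approaches to moral reasoning. The key contribution is CogniAlign, a multi-agent deliberation framework based on naturalistic moral realism. It grounds moral reasoning in survivability at the individual and collective levels. Discipline-specific scientist agents, representing neuroscience, psychology, sociology, and evolutionary biology, engage in structured argument and rebuttal, and an arbiter synthesizes their contributions into explainable, empirically anchored ethical judgments. The design improves the quality, breadth, and depth of moral reasoning and shows a consistent advantage over GPT-4o on classic and novel moral questions, suggesting that interdisciplinary deliberation is a feasible path to safe and transparent AI alignment.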

Link: https://arxiv.org/abs/2509.13356
Authors: Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud
Affiliations: Islamic University of Technology
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches. This paper introduces CogniAlign, a multi-agent deliberation framework based on naturalistic moral realism, that grounds moral reasoning in survivability, defined across individual and collective dimensions, and operationalizes it through structured deliberations among discipline-specific scientist agents. Each agent, representing neuroscience, psychology, sociology, and evolutionary biology, provides arguments and rebuttals that are synthesized by an arbiter into transparent and empirically anchored judgments. We evaluate CogniAlign on classic and novel moral questions and compare its outputs against GPT-4o using a five-part ethical audit framework. Results show that CogniAlign consistently outperforms the baseline across more than sixty moral questions, with average performance gains of 16.2 points in analytic quality, 14.3 points in breadth, and 28.4 points in depth of explanation. In the Heinz dilemma, for example, CogniAlign achieved an overall score of 89.2 compared to GPT-4o’s 69.2, demonstrating a decisive advantage in handling moral reasoning. By reducing black-box reasoning and avoiding deceptive alignment, CogniAlign highlights the potential of interdisciplinary deliberation as a scalable pathway for safe and transparent AI alignment.

[NLP-56] Teaching LLMs to Plan: Logical Chain-of-Thought Instruction Tuning for Symbolic Planning

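Quick read: This paper addresses the weak performance of large language models (LLMs) on structured symbolic planning, especially in settings that require formal representations such as the Planning Domain Definition Language (PDDL), where LLMs lack rigorous logical reasoning about action applicability, state transitions, and plan validity. The key contribution is PDDL-Instruct, an instruction tuning framework that guides the model through logical chain-of-thought reasoning. It explicitly decomposes planning into reasoning steps about precondition satisfaction, effect application, and invariant preservation, and uses carefully designed instruction prompts that let the model self-correct its planning process. This markedly improves plan accuracy across multiple planning domains: experiments report planning accuracy of up to 94%, a 66% absolute improvement over baseline models.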

Link: https://arxiv.org/abs/2509.13351
Authors: Pulkit Verma, Ngoc La, Anthony Favier, Swaroop Mishra, Julie A. Shah
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Large language models (LLMs) have demonstrated impressive capabilities across diverse tasks, yet their ability to perform structured symbolic planning remains limited, particularly in domains requiring formal representations like the Planning Domain Definition Language (PDDL). In this paper, we present a novel instruction tuning framework, PDDL-Instruct, designed to enhance LLMs’ symbolic planning capabilities through logical chain-of-thought reasoning. Our approach focuses on teaching models to rigorously reason about action applicability, state transitions, and plan validity using explicit logical inference steps. By developing instruction prompts that guide models through the precise logical reasoning required to determine when actions can be applied in a given state, we enable LLMs to self-correct their planning processes through structured reflection. The framework systematically builds verification skills by decomposing the planning process into explicit reasoning chains about precondition satisfaction, effect application, and invariant preservation. Experimental results on multiple planning domains show that our chain-of-thought reasoning based instruction-tuned models are significantly better at planning, achieving planning accuracy of up to 94% on standard benchmarks, representing a 66% absolute improvement over baseline models. This work bridges the gap between the general reasoning capabilities of LLMs and the logical precision required for automated planning, offering a promising direction for developing better AI planning systems.
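
The reasoning steps named in the abstract, precondition satisfaction and effect application, are the same checks a classical plan validator performs. The sketch below is a minimal STRIPS-style validator over sets of ground facts, intended only to show what those checks are. It is not PDDL-Instruct, it does not parse PDDL, and invariant checking is omitted. The action dictionaries and fact strings are a hypothetical encoding of a two-block Blocksworld problem.

```python
def is_applicable(state, action):
    """An action is applicable when all of its preconditions hold in the state."""
    return action["pre"] <= state


def apply_action(state, action):
    """Apply effects: remove the delete list, then add the add list."""
    return (state - action["del"]) | action["add"]


def validate_plan(initial, plan, goal):
    """Simulate the plan step by step and return (is_valid, message)."""
    state = frozenset(initial)
    for step, action in enumerate(plan):
        if not is_applicable(state, action):
            return False, f"step {step}: preconditions of {action['name']} do not hold"
        state = apply_action(state, action)
    if goal <= state:
        return True, "goal reached"
    return False, "plan executed but goal not reached"


# Hypothetical ground actions for a two-block problem.
pick_up_a = {
    "name": "pick-up a",
    "pre": {"clear a", "ontable a", "handempty"},
    "add": {"holding a"},
    "del": {"clear a", "ontable a", "handempty"},
}
stack_a_b = {
    "name": "stack a b",
    "pre": {"holding a", "clear b"},
    "add": {"on a b", "clear a", "handempty"},
    "del": {"holding a", "clear b"},
}
initial = {"clear a", "clear b", "ontable a", "ontable b", "handempty"}
goal = {"on a b"}

good = validate_plan(initial, [pick_up_a, stack_a_b], goal)
bad = validate_plan(initial, [stack_a_b], goal)
```

The message returned for the invalid plan is the kind of explicit, step-level feedback that a chain-of-thought trace about action applicability would need to reproduce.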

[NLP-57] Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI

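Quick read: This paper examines how the problem of hallucination in generative AI is misdiagnosed and poorly governed, focusing on the "accuracy paradox" created by today's governance paradigm, which treats accuracy as the central metric. The paper argues that overreliance on accuracy masks the multiple harms of hallucination at the level of outputs, individual cognition, and social structure, and leads to neglect of misleading content, value-laden bias, and systemic risk. The proposed remedy is a shift from a single accuracy-oriented approach to a pluralistic, context-aware, and manipulation-resilient framework for trustworthy governance, with institutional designs able to identify and address deeper risks such as loss of epistemic trust, social fragmentation, and homogenization of information.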

Link: https://arxiv.org/abs/2509.13345
Authors: Zihao Li, Weiwei Yi, Jiahong Chen
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments:

Abstract: As Large Language Models (LLMs) permeate everyday decision-making, their epistemic and societal risks demand urgent scrutiny. Hallucinations, the generation of fabricated, misleading, oversimplified or untrustworthy outputs, have emerged as pressing challenges. While regulatory, academic, and technical discourse position accuracy as the principal benchmark for mitigating such harms, this article contends that overreliance on accuracy misdiagnoses the problem and has a counterproductive effect: the accuracy paradox. Drawing on interdisciplinary literatures, this article develops a taxonomy of hallucination types and shows the paradox along three intertwining dimensions: outputs, individuals and society. First, accuracy functions as a superficial proxy for reliability, incentivising the optimisation of rhetorical fluency and surface-level correctness over epistemic trustworthiness. This encourages passive user trust in outputs that appear accurate but are epistemically untenable. Second, accuracy as a singular metric fails to detect harms that are not factually false but are nonetheless misleading, value-laden, or socially distorting, including consensus illusions, sycophantic alignment, and subtle manipulation. Third, regulatory overemphasis on accuracy obscures the wider societal consequences of hallucination, including social sorting, privacy violations, equity harms, and epistemic convergence that marginalises dissent, reduces pluralism, and causes social deskilling. By examining the EU AI Act, GDPR, and DSA, the article argues that current regulations are not yet structurally equipped to address these epistemic, relational, and systemic harms, which are exacerbated by the overreliance on accuracy. By exposing such conceptual and practical challenges, this article calls for a fundamental shift towards pluralistic, context-aware, and manipulation-resilient approaches to trustworthy AI governance.

[NLP-58] Explicit Reasoning Makes Better Judges: A Systematic Study on Accuracy, Efficiency, and Robustness

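Quick read: This paper addresses the limited reliability, efficiency, and robustness of large language models (LLMs) used as automated judges (LLM-as-a-judge) in benchmarking and reward modeling. The core of the study is a systematic comparison of "thinking" and "non-thinking" LLMs. It finds that explicit reasoning, even with small open-source Qwen 3 models (0.6B, 1.7B, and 4B parameters), raises judgment accuracy by roughly 10 percentage points at low computational overhead (under 2x FLOPs). This beats augmentation strategies such as few-shot learning and rubric-guided judging, which yield modest gains at a higher cost (up to 8x). The advantage holds in multilingual settings, and thinking models are also more robust under various bias conditions (about 6% higher on average).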

Link: https://arxiv.org/abs/2509.13332
Authors: Pratik Jayarao, Himanshu Gupta, Neeraj Varshney, Chaitanya Dwivedi
Affiliations: Arizona State University; Carnegie Mellon University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract: As Large Language Models (LLMs) are increasingly adopted as automated judges in benchmarking and reward modeling, ensuring their reliability, efficiency, and robustness has become critical. In this work, we present a systematic comparison of “thinking” and “non-thinking” LLMs in the LLM-as-a-judge paradigm using open-source Qwen 3 models of relatively small sizes (0.6B, 1.7B, and 4B parameters). We evaluate both accuracy and computational efficiency (FLOPs) on RewardBench tasks, and further examine augmentation strategies for non-thinking models, including in-context learning, rubric-guided judging, reference-based evaluation, and n-best aggregation. Our results show that, despite these enhancements, non-thinking models generally fall short of their thinking counterparts. Thinking models achieve approximately 10 percentage points higher accuracy with little overhead (under 2x), in contrast to augmentation strategies like few-shot learning, which deliver modest gains at a higher cost (8x). Bias and robustness analyses further demonstrate that thinking models maintain significantly greater consistency under a variety of bias conditions such as positional, bandwagon, identity, diversity, and random biases (6% higher on average). We further extend our experiments to the multilingual setting and our results confirm that explicit reasoning extends its benefits beyond English. Overall, our findings provide systematic evidence that explicit reasoning offers clear advantages in the LLM-as-a-judge paradigm not only in accuracy and efficiency but also in robustness.
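
Two of the ideas in the abstract, n-best aggregation and consistency under positional bias, are easy to state in code. The sketch below uses toy stand-in judges instead of real model calls. The judge interface (a function returning "first" or "second"), the function names, and the example responses are assumptions, and the paper's actual aggregation and bias protocols may differ.

```python
from collections import Counter


def majority_vote(verdicts):
    """Aggregate several sampled verdicts (e.g. "A" or "B") by majority."""
    return Counter(verdicts).most_common(1)[0][0]


def is_position_consistent(judge, response_a, response_b):
    """Check that the judge prefers the same response after the order is swapped."""
    original = judge(response_a, response_b)  # "first" or "second"
    swapped = judge(response_b, response_a)
    return (original == "first") == (swapped == "second")


def length_judge(first, second):
    """Toy judge that prefers the longer response, regardless of position."""
    return "first" if len(first) >= len(second) else "second"


def first_slot_judge(first, second):
    """Toy judge with pure positional bias: it always picks the first slot."""
    return "first"


vote = majority_vote(["A", "B", "A"])
consistent = is_position_consistent(length_judge, "a detailed answer", "short")
biased = is_position_consistent(first_slot_judge, "a detailed answer", "short")
```

Averaging the consistency check over many response pairs gives a simple robustness score of the kind the bias analysis compares between thinking and non-thinking models.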

[NLP-59] An AI-Powered Framework for Analyzing Collective Idea Evolution in Deliberative Assemblies

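Quick read: This paper addresses the shortage of empirical research on how policy recommendations take shape in representative deliberative assemblies, in particular the lack of systematic tracing of how specific ideas evolve, are prioritized, or are discarded during deliberation. The central difficulty is that traditional methods struggle to capture the micro-level mechanisms of deliberation and changes in individual perspectives. The authors develop a methodological framework based on large language models (LLMs): the LLM first analyzes assembly transcripts to identify and visualize the semantic space of suggestions raised by participants, and then reconstructs how each delegate's perspective evolves over the course of the assembly. The approach gives a high-resolution, quantitative view of deliberative dynamics and offers a new empirical tool for understanding collective decision-making.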

Link: https://arxiv.org/abs/2509.12577
Authors: Elinor Poole-Dayan, Deb Roy, Jad Kabbara
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computation and Language (cs.CL)
Comments:

Abstract:In an era of increasing societal fragmentation, political polarization, and erosion of public trust in institutions, representative deliberative assemblies are emerging as a promising democratic forum for developing effective policy outcomes on complex global issues. Despite theoretical attention, there remains limited empirical work that systematically traces how specific ideas evolve, are prioritized, or are discarded during deliberation to form policy recommendations. Addressing these gaps, this work poses two central questions: (1) How might we trace the evolution and distillation of ideas into concrete recommendations within deliberative assemblies? (2) How does the deliberative process shape delegate perspectives and influence voting dynamics over the course of the assembly? To address these questions, we develop LLM-based methodologies for empirically analyzing transcripts from a tech-enhanced in-person deliberative assembly. The framework identifies and visualizes the space of expressed suggestions. We also empirically reconstruct each delegate’s evolving perspective throughout the assembly. Our methods contribute novel empirical insights into deliberative processes and demonstrate how LLMs can surface high-resolution dynamics otherwise invisible in traditional assembly outputs.

[NLP-60] TICL: Text-Embedding KNN For Speech In-Context Learning Unlocks Speech Recognition Abilities of Large Multimodal Models

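Quick read: This paper addresses how to select in-context examples effectively when speech foundation models perform Speech In-Context Learning (SICL). Example selection strategies remain underexplored, which limits performance in difficult settings. The key contribution is Text-Embedding KNN for SICL (TICL), which uses the semantic similarity of text embeddings to select the most representative in-context examples, improving the speech recognition ability of off-the-shelf multimodal models without fine-tuning. On challenging tasks including accented English, multilingual speech, and children's speech, the method achieves up to 84.7% relative word error rate (WER) reduction compared with zero-shot performance.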

Link: https://arxiv.org/abs/2509.13395
Authors: Haolong Zheng, Yekaterina Yegorova, Mark Hasegawa-Johnson
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments:

Abstract:Speech foundation models have recently demonstrated the ability to perform Speech In-Context Learning (SICL). Selecting effective in-context examples is crucial for SICL performance, yet selection methodologies remain underexplored. In this work, we propose Text-Embedding KNN for SICL (TICL), a simple pipeline that uses semantic context to enhance off-the-shelf large multimodal models’ speech recognition ability without fine-tuning. Across challenging automatic speech recognition tasks, including accented English, multilingual speech, and children’s speech, our method enables models to surpass zero-shot performance with up to 84.7% relative WER reduction. We conduct ablation studies to show the robustness and efficiency of our method.
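
The selection step can be pictured as a nearest-neighbor search over text embeddings. The sketch below ranks a pool of candidate examples by cosine similarity to a query embedding and keeps the top k. It takes the embeddings as given: how TICL obtains text for the query utterance and which embedding model it uses are not described here, and the function names, example ids, and two-dimensional vectors are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_in_context_examples(query_embedding, pool, k):
    """Return the ids of the k pool items most similar to the query.

    pool is a list of (embedding, example_id) pairs (hypothetical layout).
    """
    ranked = sorted(
        pool,
        key=lambda item: cosine_similarity(query_embedding, item[0]),
        reverse=True,
    )
    return [example_id for _, example_id in ranked[:k]]


# Toy embeddings standing in for transcripts of candidate speech examples.
pool = [
    ([1.0, 0.0], "utt_a"),
    ([0.9, 0.1], "utt_b"),
    ([0.0, 1.0], "utt_c"),
    ([-1.0, 0.0], "utt_d"),
]
chosen = select_in_context_examples([1.0, 0.0], pool, k=2)
```

The selected examples, each an audio clip paired with its transcript, would then be placed in the prompt ahead of the utterance to be recognized.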

Computer Vision

[CV-0] GenExam: A Multidisciplinary Text-to-Image Exam

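Quick read: This paper addresses the lack of a systematic benchmark for evaluating generative AI on multidisciplinary text-to-image generation, in particular the neglect of rigorous drawing exams. Existing benchmarks focus mainly on understanding and reasoning or on illustrating world knowledge, and do not adequately test whether a model can generate accurate, plausible images under complex semantic constraints. The key contribution is GenExam, the first benchmark for multidisciplinary text-to-image exams. It contains 1,000 samples across 10 subjects, organizes the exam prompts under a four-level taxonomy, and provides ground-truth images and fine-grained scoring points for each problem, enabling precise evaluation of both semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image score below 15% under the strict standard, which underlines how challenging the benchmark is and offers a more rigorous test on the path to artificial general intelligence (AGI).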

Link: https://arxiv.org/abs/2509.14232
Authors: Zhaokai Wang, Penghao Yin, Xiangyu Zhao, Changyao Tian, Yu Qiao, Wenhai Wang, Jifeng Dai, Gen Luo
Affiliations: Shanghai Jiao Tong University; Shanghai AI Laboratory; Tsinghua University; The Chinese University of Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Exams are a fundamental test of expert-level intelligence and require integrated understanding, reasoning, and generation. Existing exam-style benchmarks mainly focus on understanding and reasoning tasks, and current generation benchmarks emphasize the illustration of world knowledge and visual concepts, neglecting the evaluation of rigorous drawing exams. We introduce GenExam, the first benchmark for multidisciplinary text-to-image exams, featuring 1,000 samples across 10 subjects with exam-style prompts organized under a four-level taxonomy. Each problem is equipped with ground-truth images and fine-grained scoring points to enable a precise evaluation of semantic correctness and visual plausibility. Experiments show that even state-of-the-art models such as GPT-Image-1 and Gemini-2.5-Flash-Image achieve less than 15% strict scores, and most models yield almost 0%, suggesting the great challenge of our benchmark. By framing image generation as an exam, GenExam offers a rigorous assessment of models’ ability to integrate knowledge, reasoning, and generation, providing insights on the path to general AGI.

[CV-1] Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark

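Quick read: This paper addresses the inadequate evaluation of long-video narrative understanding in current Vision-Language Models (VLMs). Existing benchmarks mostly focus on short-clip recognition or template-based question answering and cannot measure fine-grained contextual reasoning over long time spans. The key contribution is Cinéaste, a comprehensive benchmark for long-form movie understanding with 3,119 multiple-choice question-answer pairs drawn from 200 diverse movies and covering five novel fine-grained contextual reasoning categories. Questions are generated with GPT-4o from visual descriptions, captions, scene titles, and summaries so that they are rich in context. A two-stage filtering process ensures quality: Context-Independence filtering ensures that questions depend on the video content, and Contextual Veracity filtering validates factual consistency to reduce hallucinations. Experiments show that current Multimodal Large Language Models (MLLMs) perform poorly on the benchmark, with long-range temporal reasoning as the main bottleneck.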

Link: https://arxiv.org/abs/2509.14227
Authors: Nisarg A. Shah,Amir Ziai,Chaitanya Ekanadham,Vishal M. Patel
Affiliations: Netflix, Inc.; Johns Hopkins University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 5 figures, 5 tables

Abstract:While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce Cinéaste, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on Cinéaste; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.

[CV-2] MCGS-SLAM: A Multi-Camera SLAM Framework Using Gaussian Splatting for High-Fidelity Mapping

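Quick read: This paper addresses the limited robustness and geometric coverage of monocular visual SLAM (Simultaneous Localization and Mapping), in particular the difficulty of achieving accurate pose estimation and complete 3D reconstruction in complex scenes. The core solution is MCGS-SLAM, a purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS) that fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. The key innovation is multi-camera bundle adjustment (MCBA), which jointly refines poses and depths using dense photometric and geometric residuals, combined with low-rank priors that enforce scale consistency across views. The system keeps real-time performance while improving reconstruction completeness and geometric accuracy, especially for side-view regions that monocular setups miss, which makes it suited to robotics and autonomous driving applications that need high-fidelity maps.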

Link: https://arxiv.org/abs/2509.14191
Authors: Zhihao Cao,Hanyu Wu,Li Wa Tang,Zizhou Luo,Zihan Zhu,Wei Zhang,Marc Pollefeys,Martin R. Oswald
Affiliations: ETH Zurich; University of Zurich; University of Stuttgart; Microsoft Mixed Reality and AI Lab; University of Amsterdam
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent progress in dense SLAM has primarily targeted monocular setups, often at the expense of robustness and geometric coverage. We present MCGS-SLAM, the first purely RGB-based multi-camera SLAM system built on 3D Gaussian Splatting (3DGS). Unlike prior methods relying on sparse maps or inertial data, MCGS-SLAM fuses dense RGB inputs from multiple viewpoints into a unified, continuously optimized Gaussian map. A multi-camera bundle adjustment (MCBA) jointly refines poses and depths via dense photometric and geometric residuals, while a scale consistency module enforces metric alignment across views using low-rank priors. The system supports RGB input and maintains real-time performance at large scale. Experiments on synthetic and real-world datasets show that MCGS-SLAM consistently yields accurate trajectories and photorealistic reconstructions, usually outperforming monocular baselines. Notably, the wide field of view from multi-camera input enables reconstruction of side-view regions that monocular setups miss, critical for safe autonomous operation. These results highlight the promise of multi-camera Gaussian Splatting SLAM for high-fidelity mapping in robotics and autonomous driving.

[CV-3] Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions

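Quick read: This paper addresses the efficiency bottleneck that high computational and memory costs impose on Vision Transformers (ViTs) in semantic segmentation. The key to the solution is STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging with token pruning. First, dCTS, a lightweight CNN-based policy network, merges patches flexibly into superpatches to cut the token count. Second, early exits inside the encoder blocks remove high-confidence supertokens to lower the computational load. Experiments show that dCTS alone reduces the token count by 2.5x and the computational cost by 2.6x, while the full STEP framework reaches up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with an accuracy drop of no more than 2.0%.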

Link: https://arxiv.org/abs/2509.14165
Authors: Michal Szczepanski,Martyna Poreba,Karim Haroun
Affiliations: CEA (French Alternative Energies and Atomic Energy Commission); University of Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision Transformers (ViTs) achieve state-of-the-art performance in semantic segmentation but are hindered by high computational and memory costs. To address this, we propose STEP (SuperToken and Early-Pruning), a hybrid token-reduction framework that combines dynamic patch merging and token pruning to enhance efficiency without significantly compromising accuracy. At the core of STEP is dCTS, a lightweight CNN-based policy network that enables flexible merging into superpatches. Encoder blocks also integrate early exits to remove high-confidence supertokens, lowering computational load. We evaluate our method on high-resolution semantic segmentation benchmarks, including images up to 1024 x 1024, and show that when dCTS is applied alone, the token count can be reduced by a factor of 2.5 compared to the standard 16 x 16 pixel patching scheme. This yields a 2.6x reduction in computational cost and a 3.4x increase in throughput when using ViT-Large as the backbone. Applying the full STEP framework further improves efficiency, reaching up to a 4x reduction in computational complexity and a 1.7x gain in inference speed, with a maximum accuracy drop of no more than 2.0%. With the proposed STEP configurations, up to 40% of tokens can be confidently predicted and halted before reaching the final encoder layer.
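
The token arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the authors' code: `patch_token_count`, `prune_confident`, and the 0.9 confidence threshold are hypothetical names and values chosen for the example. Only the 16 x 16 patch size, the 1024 x 1024 resolution, and the 2.5x reduction factor come from the abstract.

```python
def patch_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of non-overlapping patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)


def prune_confident(confidences, threshold):
    """Return the indices of tokens that continue to the next encoder block.

    Tokens whose confidence reaches the threshold are halted (early exit).
    The threshold is a hypothetical hyperparameter for this sketch.
    """
    return [i for i, c in enumerate(confidences) if c < threshold]


# Standard 16 x 16 patching of a 1024 x 1024 image: 64 * 64 tokens.
baseline_tokens = patch_token_count(1024, 1024)

# The abstract reports a 2.5x token reduction when dCTS is applied alone.
tokens_after_merge = round(baseline_tokens / 2.5)

# Toy early-exit step: tokens 0 and 2 are confident enough to be halted.
remaining = prune_confident([0.99, 0.40, 0.95, 0.70], threshold=0.9)
```

Because self-attention cost grows faster than linearly with the number of tokens, removing tokens early in the encoder saves more compute than removing the same tokens near the output.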

[CV-4] BEVUDA: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection

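Quick read: This paper addresses the performance drop that domain shift causes in Bird's Eye View (BEV) perception for multi-view 3D object detection, especially in real-world cross-domain scenarios such as day-to-night changes or different sensor configurations. Existing methods do not account for how domain discrepancies accumulate across the 2D image space, the 3D voxel space, and the BEV space. The key to the solution is BEVUDA++, a geometric-aware teacher-student framework with three parts: 1) a Reliable Depth Teacher (RDT), which blends target-domain LiDAR with depth predictions under uncertainty estimation to produce depth-aware information that improves voxel and BEV feature extraction; 2) a Geometric Consistent Student (GCS), which maps features from multiple spaces into a unified geometric embedding space to narrow the distribution gap between source and target domains; 3) an Uncertainty-guided Exponential Moving Average (UEMA), which uses the previously obtained uncertainty to reduce error accumulation caused by domain shift. The method achieves state-of-the-art BEV 3D object detection in four cross-domain scenarios, for example gains of 12.9% NDS and 9.5% mAP on Day-Night adaptation.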

Link: https://arxiv.org/abs/2509.14151
Authors: Rongyu Zhang,Jiaming Liu,Xiaoqi Li,Xiaowei Chi,Dan Wang,Li Du,Yuan Du,Shanghang Zhang
Affiliations: Peking University; Nanjing University; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TCSVT

Abstract:Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.
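
To make the teacher update concrete, here is a small sketch of an exponential moving average whose momentum depends on an uncertainty value. The plain EMA rule is standard. The way uncertainty modulates the momentum (`uncertainty_guided_momentum`) is an assumption made for illustration and is not taken from the paper, which does not spell out the UEMA formula in its abstract.

```python
def ema_update(teacher, student, momentum):
    """Standard EMA: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def uncertainty_guided_momentum(base_momentum, uncertainty):
    """Hypothetical rule: the more uncertain the student, the less the teacher moves.

    uncertainty is clamped to [0, 1]; 0 keeps the base momentum,
    1 freezes the teacher completely (momentum 1.0).
    """
    u = min(max(uncertainty, 0.0), 1.0)
    return base_momentum + (1.0 - base_momentum) * u


teacher_weights = [1.0, 2.0]
student_weights = [0.0, 0.0]

# Confident student: the teacher moves halfway toward the student.
confident = ema_update(teacher_weights, student_weights,
                       uncertainty_guided_momentum(0.5, 0.0))

# Fully uncertain student: the teacher is left unchanged.
uncertain = ema_update(teacher_weights, student_weights,
                       uncertainty_guided_momentum(0.5, 1.0))
```

The design intent in such schemes is that noisy student updates on the target domain should not be copied into the teacher at full strength.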

[CV-5] An Exploratory Study on Abstract Images and Visual Representations Learned from Them BMVC2025

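Quick read: This paper studies the performance gap between abstract images (built from primitive geometric shapes) and traditional raster images as carriers of visual semantic information, that is, why representations learned from abstract images fall behind on tasks such as classification, segmentation, and object detection. The key to the solution is the Hierarchical Abstraction Image Dataset (HAID), which generates abstract images from normal raster images at multiple levels of abstraction. The dataset makes it possible to measure how much high-level semantic content survives at each abstraction level and to train and test mainstream vision models on it, revealing both the usefulness and the limits of abstract images as a format for visual semantic information.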

Link: https://arxiv.org/abs/2509.14149
Authors: Haotian Li,Jianbo Jiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025

Abstract:Imagine living in a world composed solely of primitive shapes, could you still recognise familiar objects? Recent studies have shown that abstract images-constructed by primitive shapes-can indeed convey visual semantic information to deep learning models. However, representations obtained from such images often fall short compared to those derived from traditional raster images. In this paper, we study the reasons behind this performance gap and investigate how much high-level semantic content can be captured at different abstraction levels. To this end, we introduce the Hierarchical Abstraction Image Dataset (HAID), a novel data collection that comprises abstract images generated from normal raster images at multiple levels of abstraction. We then train and evaluate conventional vision systems on HAID across various tasks including classification, segmentation, and object detection, providing a comprehensive study between rasterised and abstract image representations. We also discuss if the abstract image can be considered as a potentially effective format for conveying visual semantic information and contributing to vision tasks.

[CV-6] MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook ICCV2025

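Quick read: This paper addresses the insufficient evaluation of the reasoning ability of Multimodal Large Language Models (MLLMs) in real-world and specialized scenarios, with the aim of supporting their deployment and improvement in complex applications. The key to the solution is the MARS2 2025 Challenge, a benchmark platform oriented toward practical tasks. The organizers release two tailored datasets, Lens (general reasoning in 12 daily scenarios) and AdsQA (domain-specific reasoning in advertisement videos), and open three competition tracks: Visual Grounding in Real-world Scenarios, Visual Question Answering with Spatial Awareness, and Visual Reasoning in Creative Advertisement Videos. More than 40 baselines and the participants' methods are evaluated systematically, giving researchers a clear way to follow the state of the art and pushing multimodal reasoning from general capabilities toward specialized, scenario-driven use.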

Link: https://arxiv.org/abs/2509.14142
Authors: Peng Xu,Shengwu Xiong,Jiajun Zhang,Yaxiong Chen,Bowen Zhou,Chen Change Loy,David A. Clifton,Kyoung Mu Lee,Luc Van Gool,Ruiming He,Ruilin Yao,Xinwei Long,Jirui Huang,Kai Tian,Sa Yang,Yihua Shao,Jin Feng,Yue Zhong,Jiakai Zhou,Cheng Tang,Tianyu Zou,Yifang Zhang,Junming Liang,Guoyou Li,Zhaoxiang Wang,Qiang Zhou,Yichen Zhao,Shili Xiong,Hyeongjin Nam,Jaerin Lee,Jaeyoung Chung,JoonKyu Park,Junghun Oh,Kanggeon Lee,Wooseok Lee,Juneyoung Ro,Turghun Osman,Can Hu,Chaoyang Liao,Cheng Chen,Chengcheng Han,Chenhao Qiu,Chong Peng,Cong Xu,Dailin Li,Feiyu Wang,Feng Gao,Guibo Zhu,Guopeng Tang,Haibo Lu,Han Fang,Han Qi,Hanxiao Wu,Haobo Cheng,Hongbo Sun,Hongyao Chen,Huayong Hu,Hui Li,Jiaheng Ma,Jiang Yu,Jianing Wang,Jie Yang,Jing He,Jinglin Zhou,Jingxuan Li,Josef Kittler,Lihao Zheng,Linnan Zhao,Mengxi Jia,Muyang Yan,Nguyen Thanh Thien,Pu Luo,Qi Li,Shien Song,Shijie Dong,Shuai Shao,Shutao Li,Taofeng Xue,Tianyang Xu,Tianyi Gao,Tingting Li,Wei Zhang,Weiyang Su,Xiaodong Dong,Xiao-Jun Wu,Xiaopeng Zhou,Xin Chen,Xin Wei,Xinyi You,Xudong Kang,Xujie Zhou,Xusheng Liu,Yanan Wang,Yanbin Huang,Yang Liu,Yang Yang,Yanglin Deng,Yashu Kang,Ye Yuan,Yi Wen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025 MARS2 Workshop and Challenge "Multimodal Reasoning and Slow Thinking in the Large Model Era: Towards System 2 and Beyond"

Abstract:This paper reviews the MARS2 2025 Challenge on Multimodal Reasoning. We aim to bring together different approaches in multimodal machine learning and LLMs via a large benchmark. We hope it better allows researchers to follow the state-of-the-art in this very dynamic area. Meanwhile, a growing number of testbeds have boosted the evolution of general-purpose large language models. Thus, this year’s MARS2 focuses on real-world and specialized scenarios to broaden the multimodal reasoning applications of MLLMs. Our organizing team released two tailored datasets Lens and AdsQA as test sets, which support general reasoning in 12 daily scenarios and domain-specific reasoning in advertisement videos, respectively. We evaluated 40+ baselines that include both generalist MLLMs and task-specific models, and opened up three competition tracks, i.e., Visual Grounding in Real-world Scenarios (VG-RS), Visual Question Answering with Spatial Awareness (VQA-SA), and Visual Reasoning in Creative Advertisement Videos (VR-Ads). Finally, 76 teams from the renowned academic and industrial institutions have registered and 40+ valid submissions (out of 1200+) have been included in our ranking lists. Our datasets, code sets (40+ baselines and 15+ participants’ methods), and rankings are publicly available on the MARS2 workshop website and our GitHub organization page this https URL, where our updates and announcements of upcoming events will be continuously provided.

[CV-7] Deceptive Beauty: Evaluating the Impact of Beauty Filters on Deepfake and Morphing Attack Detection

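Quick read: This paper examines how digital beauty filters affect the performance of deepfake and morphing attack detectors, that is, whether beautification can cause automated face analysis systems to fail. The key to the approach is a systematic robustness evaluation of current state-of-the-art detectors on several benchmark datasets, comparing results before and after applying various smoothing filters. The results show that beauty filters degrade detection performance, exposing the vulnerability of existing detectors to facial enhancement and underscoring the need for detection models that are robust to such visual alterations.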

Link: https://arxiv.org/abs/2509.14120
Authors: Sara Concas,Simone Maurizio La Cava,Andrea Panzino,Ester Masala,Giulia Orrù,Gian Luca Marcialis
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the 2025 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering

Abstract:Digital beautification through social media filters has become increasingly popular, raising concerns about the reliability of facial images and videos and the effectiveness of automated face analysis. This issue is particularly critical for digital manipulation detectors, systems aiming at distinguishing between genuine and manipulated data, especially in cases involving deepfakes and morphing attacks designed to deceive humans and automated facial recognition. This study examines whether beauty filters impact the performance of deepfake and morphing attack detectors. We perform a comprehensive analysis, evaluating multiple state-of-the-art detectors on benchmark datasets before and after applying various smoothing filters. Our findings reveal performance degradation, highlighting vulnerabilities introduced by facial enhancements and underscoring the need for robust detection models resilient to such alterations.

[CV-8] Generative AI for Misalignment-Resistant Virtual Staining to Accelerate Histopathology Workflows

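Quick read: This paper addresses inaccurate pixel-level supervision in virtual staining, which arises in clinical use because well-aligned paired data are hard to obtain. Existing methods rely on strictly aligned pairs, but chemical staining often deforms tissue structures and a single tissue section cannot be stained repeatedly without losing information, so the available data are mostly unpaired or only roughly paired, which limits model performance. The key to the solution is a robust framework with cascaded registration mechanisms that resolve the spatial mismatch between generated outputs and their ground truth. The method improves generalization and accuracy across datasets, outperforming state-of-the-art models on five datasets, and in cases of substantial misalignment it improves PSNR by 23.8%.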

Link: https://arxiv.org/abs/2509.14119
Authors: Jiabo MA,Wenqiang Li,Jinbang Li,Ziyi Liu,Linshan Wu,Fengtao Zhou,Li Liang,Ronald Cheong Kin Chan,Terence T.W. Wong,Hao Chen
Affiliations: The Hong Kong University of Science and Technology; City University of Hong Kong; Singapore Management University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: the arXiv version of the under-review journal paper

Abstract:Accurate histopathological diagnosis often requires multiple differently stained tissue sections, a process that is time-consuming, labor-intensive, and environmentally taxing due to the use of multiple chemical stains. Recently, virtual staining has emerged as a promising alternative that is faster, tissue-conserving, and environmentally friendly. However, existing virtual staining methods face significant challenges in clinical applications, primarily due to their reliance on well-aligned paired data. Obtaining such data is inherently difficult because chemical staining processes can distort tissue structures, and a single tissue section cannot undergo multiple staining procedures without damage or loss of information. As a result, most available virtual staining datasets are either unpaired or roughly paired, making it difficult for existing methods to achieve accurate pixel-level supervision. To address this challenge, we propose a robust virtual staining framework featuring cascaded registration mechanisms to resolve spatial mismatches between generated outputs and their corresponding ground truth. Experimental results demonstrate that our method significantly outperforms state-of-the-art models across five datasets, achieving an average improvement of 3.2% on internal datasets and 10.1% on external datasets. Moreover, in datasets with substantial misalignment, our approach achieves a remarkable 23.8% improvement in peak signal-to-noise ratio compared to baseline models. The exceptional robustness of the proposed method across diverse datasets simplifies the data acquisition process for virtual staining and offers new insights for advancing its development.
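
Since the headline result is stated in peak signal-to-noise ratio, a short reminder of how PSNR is computed may help. The sketch below uses the standard definition, PSNR = 10 * log10(MAX^2 / MSE). Treating an image as a flat list of 8-bit pixel values and the function name `psnr` are simplifications for this example, not details from the paper.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A misaligned pair produces large per-pixel errors and hence a low PSNR,
# even when the stain appearance itself is plausible.
aligned = psnr([100, 100, 100, 100], [101, 99, 100, 100])
shifted = psnr([100, 100, 100, 100], [150, 50, 100, 100])
```

Because PSNR is computed pixel by pixel, it penalizes spatial misalignment heavily, which is why registration between the generated output and the ground truth matters for both training and evaluation.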

[CV-9] CSMoE: An Efficient Remote Sensing Foundation Model with Soft Mixture-of-Experts

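Quick read: This paper addresses the high computational complexity during training and inference, or the limited representational capacity, of remote sensing (RS) foundation models (FMs), both of which limit their scalability in practical RS applications. The key to the solution is to integrate the Soft Mixture-of-Experts (Soft MoE) mechanism into the RS FM architecture, so that the model keeps shared cross-sensor representation learning while allowing modality-specific experts to specialize. In addition, a sampling strategy driven by thematic-climatic descriptors builds a more representative and diverse training set, further improving efficiency and performance. Experiments show that the resulting Cross-Sensor Mixture-of-Experts (CSMoE) model matches or exceeds the representational performance of existing methods while being, on average, more than twice as computationally efficient as existing RS FMs.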

Link: https://arxiv.org/abs/2509.14104
Authors: Leonard Hackel,Tom Burgert,Begüm Demir
Affiliations: Berlin Institute for the Foundations of Learning and Data (BIFOLD); Technische Universität Berlin
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Self-supervised learning through masked autoencoders has attracted great attention for remote sensing (RS) foundation model (FM) development, enabling improved representation learning across diverse sensors and downstream tasks. However, existing RS FMs often either suffer from substantial computational complexity during both training and inference or exhibit limited representational capacity. These issues restrict their practical applicability in RS. To address this limitation, we propose an adaptation for enhancing the efficiency of RS FMs by integrating the Soft mixture-of-experts (MoE) mechanism into the FM. The integration of Soft MoEs into the FM allows modality-specific expert specialization alongside shared cross-sensor representation learning. To demonstrate the effectiveness of our adaptation, we apply it on the Cross-Sensor Masked Autoencoder (CSMAE) model, resulting in the Cross-Sensor Mixture-of-Experts (CSMoE) model. In addition, we introduce a thematic-climatic descriptor-driven sampling strategy for the construction of a representative and diverse training set to train our CSMoE model. Extensive experiments on scene classification, semantic segmentation, and content-based image retrieval demonstrate that our adaptation yields a reduction in computational requirements while maintaining or improving representational performance. Compared to state-of-the-art RS FMs, CSMoE achieves a superior trade-off between representational capacity, accuracy, and computational efficiency. On average, CSMoE achieves more than twice the computational efficiency of existing RS FMs, while maintaining competitive performance across all experiments. These results show the effectiveness of the proposed adaptation for creating computationally efficient RS FMs. The code for the model, the training set creation, and the model weights will be available at this https URL.

[CV-10] Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing

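Quick read: This paper addresses two problems in weakly-supervised audio-visual video parsing (AVVP) that stem from the lack of temporal annotations: unstable segment-level supervision and insufficient cross-modal alignment. Existing methods refine global predictions through contrastive or collaborative learning but provide neither stable segment-level supervision nor class-aware cross-modal consistency. The key to the solution is twofold: (1) a pseudo supervision framework guided by an exponential moving average (EMA) teacher, which generates reliable segment-level masks via adaptive thresholds or top-k selection and so offers stable temporal guidance beyond video-level labels; (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, keeping the modalities consistent while preserving temporal structure.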

Link: https://arxiv.org/abs/2509.14097
Authors: Yaru Chen,Ruohao Guo,Liting Gao,Yang Xiang,Qingyu Luo,Zhenbo Li,Wenwu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Abstract:Weakly-supervised audio-visual video parsing (AVVP) seeks to detect audible, visible, and audio-visual events without temporal annotations. Previous work has emphasized refining global predictions through contrastive or collaborative learning, but neglected stable segment-level supervision and class-aware cross-modal alignment. To address this, we propose two strategies: (1) an exponential moving average (EMA)-guided pseudo supervision framework that generates reliable segment-level masks via adaptive thresholds or top-k selection, offering stable temporal guidance beyond video-level labels; and (2) a class-aware cross-modal agreement (CMA) loss that aligns audio and visual embeddings at reliable segment-class pairs, ensuring consistency across modalities while preserving temporal structure. Evaluations on LLP and UnAV-100 datasets show that our method achieves state-of-the-art (SOTA) performance across multiple metrics.
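
The top-k variant of the pseudo supervision step can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the authors' implementation: the function name `topk_segment_mask`, the nested-list data layout, and the rule of selecting the k highest-scoring segments for each class present in the video-level label are all choices made for this example.

```python
def topk_segment_mask(teacher_probs, video_labels, k):
    """Build a binary segment-level pseudo-label mask.

    teacher_probs[t][c]: probability from the (EMA) teacher that class c
                         occurs in segment t.
    video_labels[c]:     1 if class c is present in the video-level label.
    Returns mask[t][c] = 1 for the k highest-scoring segments of each
    class that the video-level label marks as present, else 0.
    """
    num_segments = len(teacher_probs)
    num_classes = len(video_labels)
    mask = [[0] * num_classes for _ in range(num_segments)]
    for c in range(num_classes):
        if not video_labels[c]:
            continue  # never create pseudo labels for absent classes
        ranked = sorted(range(num_segments),
                        key=lambda t: teacher_probs[t][c],
                        reverse=True)
        for t in ranked[:k]:
            mask[t][c] = 1
    return mask


# Three segments, two classes; only class 0 is in the video-level label.
probs = [[0.9, 0.1],
         [0.2, 0.8],
         [0.6, 0.7]]
mask = topk_segment_mask(probs, video_labels=[1, 0], k=2)
```

Restricting the mask to classes that appear in the video-level label keeps the pseudo labels consistent with the only ground truth available in the weakly-supervised setting.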

[CV-11] AD-DINOv3: Enhancing DINOv3 for Zero-Shot Anomaly Detection with Anomaly-Aware Calibration

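Quick read: This paper addresses two problems in Zero-Shot Anomaly Detection (ZSAD): feature misalignment caused by the domain bias of the pretraining data, and the difficulty of localizing subtle anomalies because pretrained representations are biased toward global semantics. The key to the solution is AD-DINOv3, a multimodal framework built on DINOv3. Its core contributions are: (1) formulating anomaly detection as a multimodal contrastive learning task, where DINOv3 extracts image patch tokens and a CLS token and the CLIP text encoder produces embeddings for normal and abnormal prompts; (2) lightweight adapters that recalibrate the visual and textual representations to reduce the domain shift; (3) an Anomaly-Aware Calibration Module (AACM) that guides the CLS token to attend to anomalous regions instead of generic foreground semantics, improving discriminability.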

Link: https://arxiv.org/abs/2509.14084
Authors: Jingyi Yuan,Jianxiong Ye,Wenkang Chen,Chenqiang Gao
Affiliations: Sun Yat-Sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Zero-Shot Anomaly Detection (ZSAD) seeks to identify anomalies from arbitrary novel categories, offering a scalable and annotation-efficient solution. Traditionally, most ZSAD works have been based on the CLIP model, which performs anomaly detection by calculating the similarity between visual and text embeddings. Recently, vision foundation models such as DINOv3 have demonstrated strong transferable representation capabilities. In this work, we are the first to adapt DINOv3 for ZSAD. However, this adaptation presents two key challenges: (i) the domain bias between large-scale pretraining data and anomaly detection tasks leads to feature misalignment; and (ii) the inherent bias toward global semantics in pretrained representations often leads to subtle anomalies being misinterpreted as part of the normal foreground objects, rather than being distinguished as abnormal regions. To overcome these challenges, we introduce AD-DINOv3, a novel vision-language multimodal framework designed for ZSAD. Specifically, we formulate anomaly detection as a multimodal contrastive learning problem, where DINOv3 is employed as the visual backbone to extract patch tokens and a CLS token, and the CLIP text encoder provides embeddings for both normal and abnormal prompts. To bridge the domain gap, lightweight adapters are introduced in both modalities, enabling their representations to be recalibrated for the anomaly detection task. Beyond this baseline alignment, we further design an Anomaly-Aware Calibration Module (AACM), which explicitly guides the CLS token to attend to anomalous regions rather than generic foreground semantics, thereby enhancing discriminability. Extensive experiments on eight industrial and medical benchmarks demonstrate that AD-DINOv3 consistently matches or surpasses state-of-the-art methods, verifying its superiority as a general zero-shot anomaly detection framework.
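
The similarity-based scoring that CLIP-style ZSAD methods rely on can be sketched in plain Python. The example below assumes patch tokens and the two prompt embeddings already live in a shared space, and the temperature of 0.07 is a common CLIP-style default used here as an assumption. The adapters and the AACM described in the abstract are not modelled, and `anomaly_scores` is a hypothetical name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anomaly_scores(patch_tokens, normal_text, abnormal_text, temperature=0.07):
    """Per-patch probability of 'abnormal' from a two-way softmax over similarities."""
    scores = []
    for patch in patch_tokens:
        s_normal = cosine(patch, normal_text) / temperature
        s_abnormal = cosine(patch, abnormal_text) / temperature
        # Subtract the maximum before exponentiating for numerical stability.
        m = max(s_normal, s_abnormal)
        e_normal = math.exp(s_normal - m)
        e_abnormal = math.exp(s_abnormal - m)
        scores.append(e_abnormal / (e_normal + e_abnormal))
    return scores


# Toy 2-D embeddings: the first axis stands for "normal", the second for "abnormal".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = anomaly_scores(patches, normal_text=[1.0, 0.0], abnormal_text=[0.0, 1.0])
```

Reshaping the per-patch scores back onto the patch grid gives an anomaly map, and an image-level score can be derived in the same way from the CLS token.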

[CV-12] VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement

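Quick read: This paper addresses the marked performance drop of current multi-object tracking (MOT) algorithms on low-quality video, a problem that is especially common in real-world applications. The key to the solution is VSE-MOT, a Visual Semantic Enhancement-guided Multi-Object Tracking framework. Its core contributions are a tri-branch architecture that uses a vision-language model to extract global visual semantic information and fuse it with the query vectors, a Multi-Object Tracking Adapter (MOT-Adapter) that adapts this semantic information to the MOT task, and a Visual Semantic Fusion Module (VSFM) that improves the effectiveness of feature fusion. Together they improve tracking accuracy and robustness on low-quality video.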

Link: https://arxiv.org/abs/2509.14060
Authors: Jun Du,Weiwei Xing,Ming Li,Fei Richard Yu
Affiliations: Beijing Jiaotong University; Guangzhou Marine Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.

[CV-13] Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

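Quick read: This paper addresses the unified modelling of character animation and character replacement: how to reproduce the expressions and movements of a reference video precisely while keeping the character's appearance consistent, and how to blend the character seamlessly with the lighting and color tone of the environment. The key to the solution is Wan-Animate, a unified framework built on the Wan model. It modifies the input paradigm to distinguish reference conditions from the regions to be generated, uses spatially aligned skeleton signals to capture body motion, and reenacts expressions from implicit facial features extracted from the source images, improving the controllability and expressiveness of the generated video. An auxiliary Relighting LoRA module preserves the character's appearance during replacement while applying the appropriate environmental lighting and color tone, which improves how well the character blends into the background.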

Link: https://arxiv.org/abs/2509.14055
Authors: Gang Cheng,Xin Gao,Li Hu,Siqi Hu,Mingyang Huang,Chaonan Ji,Ju Li,Dechao Meng,Jinwei Qi,Penchong Qiao,Zhen Shen,Yafei Song,Ke Sun,Linrui Tian,Feng Wang,Guangyuan Wang,Qi Wang,Zhongjian Wang,Jiayu Xiao,Sheng Xu,Bang Zhang,Peng Zhang,Xindi Zhang,Zhe Zhang,Jingren Zhou,Lian Zhuo
Affiliations: Tongyi Lab; Alibaba
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page: this https URL

Abstract:We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene’s lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character’s appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.

[CV-14] PROFUSEme: PROstate Cancer Biochemical Recurrence Prediction via FUSEd Multi-modal Embeddings

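Quick read: This paper addresses the accurate early prediction of biochemical recurrence (BCR) in prostate cancer (PCa) patients after radical prostatectomy (RP), so that clinical decisions can be adapted promptly and patient outcomes improved. The key to the solution is PROFUSEme, a multi-modal embedding fusion method that uses an intermediate fusion architecture to learn cross-modal interactions between clinical, radiology, and pathology data, combined with Cox Proportional Hazard regressors for survival analysis. It reaches a mean C-index of 0.861 in internal 5-fold nested cross-validation and a C-index of 0.7103 on the CHIMERA 2025 challenge validation leaderboard, outperforming late fusion configurations.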

Link: https://arxiv.org/abs/2509.14051
Authors: Suhang You,Carla Pitarch-Abaigar,Sanket Kachole,Sumedh Sonawane,Juhyung Ha,Anish Sudarshan Gada,David Crandall,Rakesh Shiradkar,Spyridon Bakas
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 1 figure, method paper for CHIMERA 2025 Challenge

Abstract:Almost 30% of prostate cancer (PCa) patients undergoing radical prostatectomy (RP) experience biochemical recurrence (BCR), characterized by increased prostate specific antigen (PSA) and associated with increased mortality. Accurate early prediction of BCR, at the time of RP, would contribute to prompt adaptive clinical decision-making and improved patient outcomes. In this work, we propose prostate cancer BCR prediction via fused multi-modal embeddings (PROFUSEme), which learns cross-modal interactions of clinical, radiology, and pathology data, following an intermediate fusion configuration in combination with Cox Proportional Hazard regressors. Quantitative evaluation of our proposed approach reveals superior performance, when compared with late fusion configurations, yielding a mean C-index of 0.861 ( \sigma=0.112 ) on the internal 5-fold nested cross-validation framework, and a C-index of 0.7103 on the hold out data of CHIMERA 2025 challenge validation leaderboard.
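
For readers unfamiliar with the C-index reported here, the sketch below computes Harrell's concordance index from scratch. The definition is standard: among comparable patient pairs, it is the fraction in which the patient with the earlier observed event received the higher predicted risk. The function name `concordance_index` and the choice to skip pairs with tied times are simplifications for this example and are not taken from the paper.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times[i]:  follow-up time of patient i.
    events[i]: 1 if recurrence was observed, 0 if the patient was censored.
    risks[i]:  predicted risk score (higher means earlier expected recurrence).
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Three patients; the third is censored at time 3.
times = [1, 2, 3]
events = [1, 1, 0]
perfect = concordance_index(times, events, risks=[3.0, 2.0, 1.0])
reversed_order = concordance_index(times, events, risks=[1.0, 2.0, 3.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering, which puts the reported 0.861 and 0.7103 in context.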

[CV-15] SAIL-VL2 Technical Report

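Quick read: This paper addresses the weak performance, low training efficiency, and poor architectural scalability of current multimodal foundation models on complex vision-language understanding and reasoning tasks. The solution rests on three innovations. First, a large-scale data curation pipeline uses scoring and filtering strategies to improve the quality and distribution of captioning, OCR, QA, and video data, which raises training efficiency. Second, a progressive training framework starts from a pre-trained vision encoder (SAIL-ViT), proceeds through multimodal pre-training, and ends with a thinking-fusion hybrid of supervised fine-tuning and reinforcement learning (SFT-RL) that systematically strengthens the model's reasoning. Third, an efficient sparse Mixture-of-Experts (MoE) design moves beyond the scaling limits of dense large language models (LLMs) for better parameter efficiency and scalability. With these changes, SAIL-VL2 performs strongly on 106 datasets and reaches state-of-the-art results on hard reasoning benchmarks such as MMMU and MathVista.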

Link: https://arxiv.org/abs/2509.14033
Authors: Weijie Yin,Yongjie Ye,Fangxun Shu,Yue Liao,Zijian Kang,Hongyuan Dong,Haiyang Yu,Dingkang Yang,Jiacong Wang,Han Wang,Wenzhuo Liu,Xiao Liang,Shuicheng Yan,Chao Feng
Affiliations: Douyin SAIL Team; LV-NUS Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Report

Abstract:We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.

[CV-16] Performance Optimization of YOLO-FEDER FusionNet for Robust Drone Detection in Visually Complex Environments

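Quick read: This paper addresses drone detection in visually complex environments, where background clutter, small object scale, and camouflage effects reduce object-background separability. Generic object detectors such as YOLO perform well in low-texture scenes but degrade markedly against cluttered backgrounds. The solution is an enhanced YOLO-FEDER FusionNet detection framework that systematically improves the training data composition (large-scale photo-realistic synthetic data plus a small set of real samples), refines the multi-scale feature fusion strategy, and upgrades the backbone. Integrating intermediate FEDER features together with a YOLOv8l backbone proves most important: compared to the initial baseline, this configuration lowers the false negative rate (FNR) by up to 39.1 percentage points and raises mean average precision (mAP) by up to 62.8 percentage points at an IoU threshold of 0.5.

Link: https://arxiv.org/abs/2509.14012
Authors: Tamara R. Lenhard,Andreas Weinmann,Tobias Koch
Affiliations: Institute for the Protection of Terrestrial Infrastructures, German Aerospace Center; ACIDA Lab, University of Applied Sciences Darmstadt; European University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: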

Abstract:Drone detection in visually complex environments remains challenging due to background clutter, small object scale, and camouflage effects. While generic object detectors like YOLO exhibit strong performance in low-texture scenes, their effectiveness degrades in cluttered environments with low object-background separability. To address these limitations, this work presents an enhanced iteration of YOLO-FEDER FusionNet – a detection framework that integrates generic object detection with camouflage object detection techniques. Building upon the original architecture, the proposed iteration introduces systematic advancements in training data composition, feature fusion strategies, and backbone design. Specifically, the training process leverages large-scale, photo-realistic synthetic data, complemented by a small set of real-world samples, to enhance robustness under visually complex conditions. The contribution of intermediate multi-scale FEDER features is systematically evaluated, and detection performance is comprehensively benchmarked across multiple YOLO-based backbone configurations. Empirical results indicate that integrating intermediate FEDER features, in combination with backbone upgrades, contributes to notable performance improvements. In the most promising configuration – YOLO-FEDER FusionNet with a YOLOv8l backbone and FEDER features derived from the DWD module – these enhancements lead to a FNR reduction of up to 39.1 percentage points and a mAP increase of up to 62.8 percentage points at an IoU threshold of 0.5, compared to the initial baseline.
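
To make the reported metrics concrete, the sketch below computes a false negative rate and a change expressed in percentage points. The detection counts are hypothetical and were chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def false_negative_rate(true_positives: int, false_negatives: int) -> float:
    """FNR = FN / (TP + FN): the share of ground-truth objects the detector missed."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("FNR is undefined without ground-truth objects")
    return false_negatives / total


def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two rates, expressed in percentage points."""
    return (after - before) * 100.0


# Hypothetical counts, not taken from the paper.
baseline_fnr = false_negative_rate(true_positives=400, false_negatives=600)
improved_fnr = false_negative_rate(true_positives=800, false_negatives=200)
delta_pp = percentage_point_change(baseline_fnr, improved_fnr)  # about -40 points
```

A drop from 60% to 20% is a reduction of 40 percentage points, which is different from a 40% relative reduction. The abstract reports its gains in percentage points.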

[CV-17] MOCHA: Multi-modal Objects-aware Cross-arcHitecture Alignment

Quick read: This paper asks how to efficiently transfer the region-level multimodal semantic knowledge of a large vision-language model into a lightweight vision-only object detector, with a focus on improving performance in few-shot settings. The solution is MOCHA (Multi-modal Objects-aware Cross-architecture Alignment). Its core is a translation module that maps student features into a joint space, trained with a dual-objective loss that enforces both local alignment and global relational consistency. This yields object-level semantic transfer without modifying the teacher model and without requiring text input at inference.

Link: https://arxiv.org/abs/2509.14001
Authors: Elena Camuffo,Francesco Barbato,Mete Ozay,Simone Milani,Umberto Michieli
Affiliations: Samsung R&D Institute UK
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:We introduce MOCHA (Multi-modal Objects-aware Cross-arcHitecture Alignment), a knowledge distillation approach that transfers region-level multimodal semantics from a large vision-language teacher (e.g., LLaVa) into a lightweight vision-only object detector student (e.g., YOLO). A translation module maps student features into a joint space, where the training of the student and translator is guided by a dual-objective loss that enforces both local alignment and global relational consistency. Unlike prior approaches focused on dense or global alignment, MOCHA operates at the object level, enabling efficient transfer of semantics without modifying the teacher or requiring textual input at inference. We validate our method across four personalized detection benchmarks under few-shot regimes. Results show consistent gains over baselines, with a +10.1 average score improvement. Despite its compact architecture, MOCHA reaches performance on par with larger multimodal models, proving its suitability for real-world deployment.
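
One way to read the dual-objective loss is sketched below in NumPy. It is a minimal illustration under assumptions, not the authors' implementation: the translation module is reduced to a single linear map `translator`, local alignment is taken to be a per-object cosine distance, and global relational consistency is taken to be agreement between pairwise similarity matrices. The function names, shapes, and `relational_weight` are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)


def dual_objective_loss(student_feats, teacher_embs, translator, relational_weight=1.0):
    """student_feats: (N, Ds) per-object student features.
    teacher_embs:  (N, Dt) per-object teacher embeddings.
    translator:    (Ds, Dt) linear stand-in for the translation module (assumption)."""
    projected = l2_normalize(student_feats @ translator)
    target = l2_normalize(teacher_embs)
    # Local alignment: each translated object feature should match its teacher embedding.
    local = np.mean(1.0 - np.sum(projected * target, axis=1))
    # Global relational consistency: object-to-object similarities should agree.
    relational = np.mean((projected @ projected.T - target @ target.T) ** 2)
    return float(local + relational_weight * relational)


rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 8))      # 5 objects, 8-dim teacher space
student = rng.normal(size=(5, 4))      # 5 objects, 4-dim student space
translator = rng.normal(size=(4, 8))   # untrained, random map
loss = dual_objective_loss(student, teacher, translator)
```

With an untrained random translator the loss is positive; it approaches zero when the translated student features reproduce the teacher embeddings.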

[CV-18] MetricNet: Recovering Metric Scale in Generative Navigation Policies

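Quick read: This paper addresses two structural problems in generative navigation policies. First, sampled trajectories live in an abstract, unscaled space with no metric grounding, so the robot cannot judge real-world distances accurately. Second, the control strategy follows a single waypoint and discards the rest of the path, which leads to short-sighted and unsafe actions such as moving towards obstacles. The solution is MetricNet, an add-on module for generative navigation that predicts the metric distance between waypoints and thereby grounds the policy output in real-world coordinates, which significantly improves navigation and exploration performance. The authors also propose MetricNav, which integrates MetricNet into a navigation policy to steer the robot away from obstacles while it still moves towards the goal.

Link: https://arxiv.org/abs/2509.13965
Authors: Abhijeet Nayak,Débora N.P. Oliveira,Samiran Gode,Cordelia Schmid,Wolfram Burgard
Affiliations: University of Technology Nuremberg; Inria, Ecole Normale Supérieure, CNRS, PSL Research University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: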

Abstract:Generative navigation policies have made rapid progress in improving end-to-end learned navigation. Despite their promising results, this paradigm has two structural problems. First, the sampled trajectories exist in an abstract, unscaled space without metric grounding. Second, the control strategy discards the full path, instead moving directly towards a single waypoint. This leads to short-sighted and unsafe actions, moving the robot towards obstacles that a complete and correctly scaled path would circumvent. To address these issues, we propose MetricNet, an effective add-on for generative navigation that predicts the metric distance between waypoints, grounding policy outputs in real-world coordinates. We evaluate our method in simulation with a new benchmarking framework and show that executing MetricNet-scaled waypoints significantly improves both navigation and exploration performance. Beyond simulation, we further validate our approach in real-world experiments. Finally, we propose MetricNav, which integrates MetricNet into a navigation policy to guide the robot away from obstacles while still moving towards the goal.
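
The idea of grounding unscaled waypoints with predicted metric distances can be sketched as follows. This is an illustration under assumptions rather than the paper's method: it supposes that one distance in metres is available for each path segment (robot to first waypoint, then waypoint to waypoint), and the function name and array layout are hypothetical.

```python
import numpy as np


def scale_waypoints(waypoints: np.ndarray, segment_lengths_m: np.ndarray) -> np.ndarray:
    """Rescale a path so that each segment has its predicted metric length.

    waypoints:         (K, 2) positions in the policy's unscaled space, robot at the origin.
    segment_lengths_m: (K,) assumed predicted length of each segment, in metres.
    """
    points = np.vstack([np.zeros((1, 2)), waypoints])
    segments = np.diff(points, axis=0)                       # (K, 2) segment vectors
    norms = np.linalg.norm(segments, axis=1, keepdims=True)
    directions = segments / np.maximum(norms, 1e-8)          # keep direction, drop scale
    return np.cumsum(directions * segment_lengths_m[:, None], axis=0)


unscaled = np.array([[1.0, 0.0], [1.0, 1.0]])
predicted_lengths = np.array([0.5, 2.0])   # hypothetical predictions
metric_path = scale_waypoints(unscaled, predicted_lengths)
```

Here the path keeps its shape (forward, then left) while the first segment shrinks to 0.5 m and the second stretches to 2 m.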

[CV-19] Can Current AI Models Count What We Mean Not What They See? A Benchmark and Systematic Evaluation

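Quick read: This paper addresses the weakness of current visual counting models in fine-grained, intent-driven settings, where a user needs objects of one specific type identified and counted in a complex scene. The solution is PairTally, a new benchmark dataset of 681 high-resolution images. Each image contains two object categories (either distinct categories or closely related subcategories), so a model must separate and count them based on subtle differences in shape, size, color, or semantics. Benchmarking a range of state-of-the-art models (exemplar-based methods, language-prompted models, and large vision-language models) shows that existing methods still cannot reliably count what the user intends in fine-grained and visually ambiguous cases. The dataset gives a new basis for diagnosing and improving fine-grained visual counting systems.

Link: https://arxiv.org/abs/2509.13939
Authors: Gia Khanh Nguyen,Yifeng Huang,Minh Hoai
Affiliations: Australian Institute for Machine Learning, University of Adelaide, SA, Australia; Stony Brook University, Stony Brook, NY, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: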

Abstract:Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.
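
The short sketch below shows why selective counting needs per-category evaluation. The choice of mean absolute error as the metric and all counts are assumptions made for illustration; the abstract does not name the metrics used.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and true counts."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical (first category, second category) counts for two images.
samples = [
    {"pred": (12, 7), "true": (10, 9)},
    {"pred": (30, 30), "true": (25, 35)},
]

mae_first = mean_absolute_error([s["pred"][0] for s in samples],
                                [s["true"][0] for s in samples])
mae_second = mean_absolute_error([s["pred"][1] for s in samples],
                                 [s["true"][1] for s in samples])
# Error on the total object count, ignoring category.
mae_total = mean_absolute_error([sum(s["pred"]) for s in samples],
                                [sum(s["true"]) for s in samples])
```

In this made-up example the total count is exact for every image, yet each category is off by 3.5 objects on average. A model can count everything it sees and still miss what the user asked for.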

[CV-20] Noise-Level Diffusion Guidance: Well Begun is Half Done

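Quick read: This paper addresses the variation in output quality and prompt adherence that diffusion models show because of the random Gaussian noise used to start generation. Existing noise-level optimization methods usually depend on extra dataset construction, auxiliary networks, or backpropagation-based optimization, which limits their practical use. The solution is Noise Level Guidance (NLG), which refines the initial noise by increasing the likelihood that it aligns with general guidance. It needs no additional training data, auxiliary networks, or backpropagation, applies to both conditional and unconditional diffusion models, and is compatible with various forms of diffusion-level guidance.

Link: https://arxiv.org/abs/2509.13936
Authors: Harvey Mannering,Zhiwu Huang,Adam Prugel-Bennett
Affiliations: University of Southampton
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: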

Abstract:Diffusion models have achieved state-of-the-art image generation. However, the random Gaussian noise used to start the diffusion process influences the final output, causing variations in image quality and prompt adherence. Existing noise-level optimization approaches generally rely on extra dataset construction, additional networks, or backpropagation-based optimization, limiting their practicality. In this paper, we propose Noise Level Guidance (NLG), a simple, efficient, and general noise-level optimization approach that refines initial noise by increasing the likelihood of its alignment with general guidance - requiring no additional training data, auxiliary networks, or backpropagation. The proposed NLG approach provides a unified framework generalizable to both conditional and unconditional diffusion models, accommodating various forms of diffusion-level guidance. Extensive experiments on five standard benchmarks demonstrate that our approach enhances output generation quality and input condition adherence. By seamlessly integrating with existing guidance methods while maintaining computational efficiency, our method establishes NLG as a practical and scalable enhancement to diffusion models. Code can be found at this https URL.

[CV-21] MAP: End-to-End Autonomous Driving with Map-Assisted Planning ICCV

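Quick read: This paper addresses the underuse of the online mapping module in current end-to-end autonomous driving systems: existing methods do not fully exploit map information to improve trajectory planning. The solution is MAP (Map-Assisted Planning), a framework that explicitly fuses semantic map features with the ego status through three modules: a Plan-enhancing Online Mapping module that extracts segmentation-based map features; an Ego-status-guided Planning module that steers planning according to the current ego status; and a Weight Adapter, driven by the current ego status, that dynamically balances map features against ego status. On the DAIR-V2X-seq-SPD dataset the design reduces L2 displacement error by 16.6% and off-road rate by 56.2% relative to the UniV2X baseline. It also ranked first in Track 2 of the End-to-End Autonomous Driving through V2X Cooperation Challenge at the MEIS Workshop @CVPR2025, ahead of the second-best model by 39.5% in overall score.

Link: https://arxiv.org/abs/2509.13926
Authors: Huilin Yin,Yiming Kan,Daniel Watzenig
Affiliations: Tongji University; Graz University of Technology
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 2 figures, accepted by ICCVW. Author list updated to match the camera-ready version, in compliance with conference policy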

Abstract:In recent years, end-to-end autonomous driving has attracted increasing attention for its ability to jointly model perception, prediction, and planning within a unified framework. However, most existing approaches underutilize the online mapping module, leaving its potential to enhance trajectory planning largely untapped. This paper proposes MAP (Map-Assisted Planning), a novel map-assisted end-to-end trajectory planning framework. MAP explicitly integrates segmentation-based map features and the current ego status through a Plan-enhancing Online Mapping module, an Ego-status-guided Planning module, and a Weight Adapter based on current ego status. Experiments conducted on the DAIR-V2X-seq-SPD dataset demonstrate that the proposed method achieves a 16.6% reduction in L2 displacement error, a 56.2% reduction in off-road rate, and a 44.5% improvement in overall score compared to the UniV2X baseline, even without post-processing. Furthermore, it achieves top ranking in Track 2 of the End-to-End Autonomous Driving through V2X Cooperation Challenge of MEIS Workshop @CVPR2025, outperforming the second-best model by 39.5% in terms of overall score. These results highlight the effectiveness of explicitly leveraging semantic map features in planning and suggest new directions for improving structure design in end-to-end autonomous driving systems. Our code is available at this https URL
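
For readers unfamiliar with the headline planning metric, the sketch below computes an L2 displacement error between a planned and a reference trajectory. It assumes the common definition (mean Euclidean distance over matched waypoints); the benchmark's exact protocol, such as the time horizons it averages over, is not given in the abstract, and the trajectories are made up.

```python
import numpy as np


def l2_displacement_error(planned: np.ndarray, reference: np.ndarray) -> float:
    """Mean Euclidean distance between matched waypoints; both arrays are (T, 2) in metres."""
    if planned.shape != reference.shape:
        raise ValueError("trajectories must have the same shape")
    return float(np.mean(np.linalg.norm(planned - reference, axis=1)))


# Hypothetical trajectories: the plan drifts sideways from a straight reference path.
reference = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
planned = np.array([[0.0, 0.3], [1.0, 0.4], [2.0, 0.5]])
error_m = l2_displacement_error(planned, reference)  # about 0.4 m
```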

[CV-22] Towards Robust Defense against Customization via Protective Perturbation Resistant to Diffusion-based Purification ICCV2025

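Quick read: This paper addresses the security risks created by the strong customization capabilities of diffusion models such as Stable Diffusion in image synthesis. In particular, protective perturbations can be removed by purification, which re-exposes images to malicious forgery such as deepfakes and copyright infringement. The solution is AntiPure, a diagnostic protective perturbation built on two guidance mechanisms: (1) Patch-wise Frequency Guidance, which reduces the model's influence over high-frequency components of the purified image, and (2) Erroneous Timestep Guidance, which disrupts the model's denoising strategy across timesteps. Together they keep the perturbation in place through representative purification-customization workflows and achieve effective post-customization distortion.

Link: https://arxiv.org/abs/2509.13922
Authors: Wenkui Yang,Jie Cao,Junxian Duan,Ran He
Affiliations: MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV 2025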

Abstract:Diffusion models like Stable Diffusion have become prominent in visual synthesis tasks due to their powerful customization capabilities, which also introduce significant security risks, including deepfakes and copyright infringement. In response, a class of methods known as protective perturbation emerged, which mitigates image misuse by injecting imperceptible adversarial noise. However, purification can remove protective perturbations, thereby exposing images again to the risk of malicious forgery. In this work, we formalize the anti-purification task, highlighting challenges that hinder existing approaches, and propose a simple diagnostic protective perturbation named AntiPure. AntiPure exposes vulnerabilities of purification within the “purification-customization” workflow, owing to two guidance mechanisms: 1) Patch-wise Frequency Guidance, which reduces the model’s influence over high-frequency components in the purified image, and 2) Erroneous Timestep Guidance, which disrupts the model’s denoising strategy across different timesteps. With additional guidance, AntiPure embeds imperceptible perturbations that persist under representative purification settings, achieving effective post-customization distortion. Experiments show that, as a stress test for purification, AntiPure achieves minimal perceptual discrepancy and maximal distortion, outperforming other protective perturbation methods within the purification-customization workflow.

[CV-23] Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration ICML2025

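Quick read: This paper addresses the poor alignment between rationale and answer in Large Vision-Language Models (LVLMs) on visual question answering, which produces inconsistent reasoning and incorrect responses. The solution is the Self-Rationale Calibration (SRC) framework. First, a lightweight "rationale fine-tuning" step makes the model output a rationale before its answer, improving the structural alignment between the two. Next, a tailored scoring model, R-Scorer, evaluates candidate responses for rationale quality and factual consistency using a pairwise scoring strategy. A confidence-weighted preference curation process then drives preference fine-tuning, which decouples the alignment calibration and significantly improves LVLM perception, reasoning, and generalization.

Link: https://arxiv.org/abs/2509.13919
Authors: Yuanchen Wu,Ke Yan,Shouhong Ding,Ziyin Zhou,Xiaoqiang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICML 2025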

Abstract:Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.

[CV-24] White Aggregation and Restoration for Few-shot 3D Point Cloud Semantic Segmentation

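Quick read: This paper addresses poor prototype quality in Few-Shot 3D Point Cloud Segmentation (FS-PCS), which stems from the limited number of support-set samples. Existing methods usually build prototypes with conventional algorithms such as farthest point sampling, whose initial randomness significantly affects performance, and the prototype generation process itself remains underexplored. To obtain more representative prototypes, the paper proposes an attention-based prototype generator, the White Aggregation and Restoration Module (WARM). Its key idea is to sandwich cross-attention between whitening and coloring transformations, which closes the distributional gap between learnable prototypical tokens and support features. Whitening aligns the support features to the prototypical-token distribution before attention, and coloring restores the original distribution afterwards. The result is robust attention that captures semantic relationships among support features and yields more representative prototypes. The method outperforms prior work by a significant margin on multiple FS-PCS benchmarks.

Link: https://arxiv.org/abs/2509.13907
Authors: Jiyun Im,SuBeen Lee,Miso Lee,Jae-Pil Heo
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures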

Abstract:Few-Shot 3D Point Cloud Segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited support set, existing methods have constructed prototypes using conventional algorithms such as farthest point sampling. However, we point out that its initial randomness significantly affects FS-PCS performance and that the prototype generation process remains underexplored despite its prevalence. This motivates us to investigate an advanced prototype generation method based on attention mechanism. Despite its potential, we found that vanilla module suffers from the distributional gap between learnable prototypical tokens and support features. To overcome this, we propose White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross-attention between whitening and coloring transformations. Specifically, whitening aligns the support features to prototypical tokens before attention process, and subsequently coloring restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating representative prototypes by capturing the semantic relationships among support features. Our method achieves state-of-the-art performance with a significant margin on multiple FS-PCS benchmarks, demonstrating its effectiveness through extensive experiments.
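
The whitening, attention, coloring sandwich can be sketched in NumPy as below. This is a simplified reading of the abstract under assumptions, not the WARM module itself: it uses ZCA whitening computed from the support features, a single attention head, and no learned projections. All function names are hypothetical.

```python
import numpy as np


def whitening_stats(features: np.ndarray, eps: float = 1e-5):
    """Return the mean plus ZCA whitening and coloring matrices of (N, D) features."""
    mean = features.mean(axis=0, keepdims=True)
    centered = features - mean
    cov = centered.T @ centered / max(len(features) - 1, 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # cov is symmetric
    eigvals = np.clip(eigvals, 0.0, None) + eps       # guard against tiny or negative values
    whiten = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    color = eigvecs @ np.diag(eigvals ** 0.5) @ eigvecs.T
    return mean, whiten, color


def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max subtraction for stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sandwiched_cross_attention(tokens: np.ndarray, support: np.ndarray) -> np.ndarray:
    """tokens: (M, D) prototypical queries; support: (N, D) support features."""
    mean, whiten, color = whitening_stats(support)
    whitened = (support - mean) @ whiten                      # zero mean, near-identity covariance
    scores = tokens @ whitened.T / np.sqrt(whitened.shape[1])
    attended = softmax(scores) @ whitened                     # cross-attention in whitened space
    return attended @ color + mean                            # coloring restores the original statistics


rng = np.random.default_rng(0)
support = rng.normal(size=(200, 4)) * np.array([1.0, 5.0, 0.2, 3.0]) + 10.0
tokens = rng.normal(size=(3, 4))      # stand-in for learnable prototypical tokens
prototypes = sandwiched_cross_attention(tokens, support)
```

Because this sketch has no learned projections, the coloring step exactly undoes the whitening, so whitening only changes the attention weights. In a trained module the projections between the two steps would make the interaction richer.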

[CV-25] EvHand-FPV: Efficient Event-Based 3D Hand Tracking from First-Person View

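Quick read: This paper addresses accurate, low-latency, low-power First-Person-View (FPV) 3D hand tracking on resource-constrained Extended Reality (XR) devices. Frame-based methods struggle to meet all of these requirements at once. Event cameras are a natural fit thanks to microsecond-level temporal resolution and milliwatt-level power draw, but FPV hand tracking with them still suffers from scarce data and insufficient computational efficiency. The solution is the EvHand-FPV framework. It first builds an event-based FPV dataset that combines synthetic data with 3D labels and real event data with 2D labels, addressing the lack of benchmarks. It then introduces a wrist-based region of interest (ROI) that localizes the hand via geometric cues, with ROI offsets embedded end-to-end into the network to cut computation without explicit reconstruction. Finally, a multi-task learning strategy adds an auxiliary geometric feature head during training, improving representations at no inference-time cost. The framework cuts both parameter count and FLOPs per inference by 89% while maintaining a 3D-AUCp of 0.84 on synthetic data.

Link: https://arxiv.org/abs/2509.13883
Authors: Zhen Xu,Guorui Lu,Chang Gao,Qinyu Chen
Affiliations: Leiden Institute of Advanced Computer Science; Leiden University; Department of Microelectronics; Delft University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages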

Abstract:Hand tracking holds great promise for intuitive interaction paradigms, but frame-based methods often struggle to meet the requirements of accuracy, low latency, and energy efficiency, especially in resource-constrained settings such as Extended Reality (XR) devices. Event cameras provide μs-level temporal resolution at mW-level power by asynchronously sensing brightness changes. In this work, we present EvHand-FPV, a lightweight framework for egocentric First-Person-View 3D hand tracking from a single event camera. We construct an event-based FPV dataset that couples synthetic training data with 3D labels and real event data with 2D labels for evaluation to address the scarcity of egocentric benchmarks. EvHand-FPV also introduces a wrist-based region of interest (ROI) that localizes the hand region via geometric cues, combined with an end-to-end mapping strategy that embeds ROI offsets into the network to reduce computation without explicit reconstruction, and a multi-task learning strategy with an auxiliary geometric feature head that improves representations without test-time overhead. On our real FPV test set, EvHand-FPV improves 2D-AUCp from 0.77 to 0.85 while reducing parameters by 89% (from 11.2M to 1.2M) and FLOPs per inference by 89% (from 1.648G to 0.185G). It also maintains a competitive 3D-AUCp of 0.84 on synthetic data. These results demonstrate accurate and efficient egocentric event-based hand tracking suitable for on-device XR applications. The dataset and code are available at this https URL.
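
The two 89% figures can be checked directly from the numbers quoted in the abstract. The helper name below is ours, not the paper's.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


param_reduction = percent_reduction(11.2, 1.2)     # parameters, in millions
flop_reduction = percent_reduction(1.648, 0.185)   # FLOPs per inference, in GFLOPs
```

Both values round to 89%, which matches the reductions reported for parameters and FLOPs.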

[CV-26] Invisible Yet Detected: PelFANet with Attention-Guided Anatomical Fusion for Pelvic Fracture Diagnosis MICCAI

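Quick read: This paper addresses the diagnosis of pelvic fractures that appear subtle or invisible on standard radiographs, cases that often lead to missed or incorrect diagnoses. The solution is PelFANet, a dual-stream attention network that fuses raw pelvic X-rays with segmented bone images. Fused Attention Blocks (FABlocks) iteratively exchange and refine features from the two inputs, capturing both global context and local anatomical detail, which markedly improves fracture classification accuracy and generalization.

Link: https://arxiv.org/abs/2509.13873
Authors: Siam Tahsin Bhuiyan,Rashedur Rahman,Sefatul Wasi,Naomi Yagi,Syoji Kobashi,Ashraful Islam,Saadia Binte Alam
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI EMERGE 2025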

Abstract:Pelvic fractures pose significant diagnostic challenges, particularly in cases where fracture signs are subtle or invisible on standard radiographs. To address this, we introduce PelFANet, a dual-stream attention network that fuses raw pelvic X-rays with segmented bone images to improve fracture classification. The network employs Fused Attention Blocks (FABlocks) to iteratively exchange and refine features from both inputs, capturing global context and localized anatomical detail. Trained in a two-stage pipeline with a segmentation-guided approach, PelFANet demonstrates superior performance over conventional methods. On the AMERI dataset, it achieves 88.68% accuracy and 0.9334 AUC on visible fractures, while generalizing effectively to invisible fracture cases with 82.29% accuracy and 0.8688 AUC, despite not being trained on them. These results highlight the clinical potential of anatomy-aware dual-input architectures for robust fracture detection, especially in scenarios with subtle radiographic presentations.

[CV-27] Distractor-Aware Memory-Based Visual Object Tracking

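Quick read: This paper addresses tracking drift in memory-based video segmentation models such as SAM2 when they are used for visual object tracking, where distractors (objects visually similar to the target) are the main cause. It also aims to improve redetection after the target has been occluded. The solution is a distractor-aware drop-in memory module combined with an introspection-based memory management method, which together form DAM4SAM. By better recognizing and suppressing distractors, the design substantially reduces tracking drift, improves redetection after occlusion, and clearly outperforms SAM2.1 on multiple benchmarks.

Link: https://arxiv.org/abs/2509.13864
Authors: Jovana Videnovic,Matej Kristan,Alan Lukezic
Affiliations: University of Ljubljana
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available on Github: this https URL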

Abstract:Recent emergence of memory-based video segmentation methods such as SAM2 has led to models with excellent performance in segmentation tasks, achieving leading results on numerous benchmarks. However, these models are not fully adjusted for visual object tracking, where distractors (i.e., objects visually similar to the target) pose a key challenge. In this paper we propose a distractor-aware drop-in memory module and introspection-based management method for SAM2, leading to DAM4SAM. Our design effectively reduces the tracking drift toward distractors and improves redetection capability after object occlusion. To facilitate the analysis of tracking in the presence of distractors, we construct DiDi, a Distractor-Distilled dataset. DAM4SAM outperforms SAM2.1 on thirteen benchmarks and sets new state-of-the-art results on ten. Furthermore, integrating the proposed distractor-aware memory into a real-time tracker EfficientTAM leads to 11% improvement and matches tracking quality of the non-real-time SAM2.1-L on multiple tracking and segmentation benchmarks, while integration with edge-based tracker EdgeTAM delivers 4% performance boost, demonstrating a very good generalization across architectures.

[CV-28] LamiGauss: Pitching Radiative Gaussian for Sparse-View X-ray Laminography Reconstruction

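Quick read: This paper addresses the reconstruction of high-quality 3D volumes in X-ray Computed Laminography (CL) under sparse-view acquisition, particularly for non-destructive inspection of plate-like structures such as microchips and composite battery materials. The solution is LamiGauss, an algorithm that combines Gaussian Splatting with a detector-to-world transformation model designed for laminography, plus an initialization strategy that explicitly filters out common laminographic artifacts. This keeps redundant Gaussians from being allocated to false structures and concentrates model capacity on the real object. The method optimizes directly from sparse projections, improving reconstruction accuracy and efficiency. Experiments show that with only 3% of full views it outperforms the iterative method optimized on a full dataset.

Link: https://arxiv.org/abs/2509.13863
Authors: Chu Chen,Ander Biguri,Jean-Michel Morel,Raymond H. Chan,Carola-Bibiane Schönlieb,Jizhou Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: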

Abstract:X-ray Computed Laminography (CL) is essential for non-destructive inspection of plate-like structures in applications such as microchips and composite battery materials, where traditional computed tomography (CT) struggles due to geometric constraints. However, reconstructing high-quality volumes from laminographic projections remains challenging, particularly under highly sparse-view acquisition conditions. In this paper, we propose a reconstruction algorithm, namely LamiGauss, that combines Gaussian Splatting radiative rasterization with a dedicated detector-to-world transformation model incorporating the laminographic tilt angle. LamiGauss leverages an initialization strategy that explicitly filters out common laminographic artifacts from the preliminary reconstruction, preventing redundant Gaussians from being allocated to false structures and thereby concentrating model capacity on representing the genuine object. Our approach effectively optimizes directly from sparse projections, enabling accurate and efficient reconstruction with limited data. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method over existing techniques. LamiGauss uses only 3 % of full views to achieve superior performance over the iterative method optimized on a full dataset.

[CV-29] EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

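Quick read: This paper addresses the fact that traditional dataset distillation methods, when compressing large-scale image datasets, capture only low-level visual features and neglect the high-level semantic and structural information in images. The solution is the EDITS framework. External text generated by a Vision Language Model (VLM) is fused with image features through a Global Semantic Query module to form a prior clustered buffer. A Local Semantic Awareness mechanism then selects representative samples from the buffer to build image and text prototypes, with the text prototypes produced by prompting a Large Language Model (LLM). Finally, a diffusion model generates the synthetic dataset under dual-prototype guidance, which markedly improves the semantic fidelity of the distilled data and the resulting model performance.

Link: https://arxiv.org/abs/2509.13858
Authors: Qianxin Xia,Jiawei Du,Guoming Lu,Zhiyong Shu,Jielei Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: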

Abstract:Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with meticulously crafted prompt. Ultimately, Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. The code is available at this https URL.

[CV-30] InterKey: Cross-modal Intersection Keypoints for Global Localization on OpenStreetMap

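Quick read: This paper addresses reliable global localization for autonomous vehicles in environments where GNSS is degraded or unavailable, such as urban canyons and tunnels. High-definition (HD) maps provide accurate priors, but the cost of data collection, construction, and maintenance limits their scalability. The open-source OpenStreetMap (OSM) has global coverage, but its coarse abstraction makes matching against sensor data difficult. The proposed InterKey framework uses road intersections as distinctive landmarks and builds compact binary descriptors by jointly encoding road and building imprints from point clouds and OSM, which enables cross-modal matching. Three strategies (discrepancy mitigation, orientation determination, and area-equalized sampling) bridge the modality gap between OSM and real point clouds, giving an accurate, scalable, and low-cost global localization solution.

Link: https://arxiv.org/abs/2509.13857
Authors: Nguyen Hoang Khoi Tran,Julie Stephany Berrio,Mao Shan,Stewart Worrall
Affiliations: Unknown
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 5 figures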

Abstract:Reliable global localization is critical for autonomous vehicles, especially in environments where GNSS is degraded or unavailable, such as urban canyons and tunnels. Although high-definition (HD) maps provide accurate priors, the cost of data collection, map construction, and maintenance limits scalability. OpenStreetMap (OSM) offers a free and globally available alternative, but its coarse abstraction poses challenges for matching with sensor data. We propose InterKey, a cross-modal framework that leverages road intersections as distinctive landmarks for global localization. Our method constructs compact binary descriptors by jointly encoding road and building imprints from point clouds and OSM. To bridge modality gaps, we introduce discrepancy mitigation, orientation determination, and area-equalized sampling strategies, enabling robust cross-modal matching. Experiments on the KITTI dataset demonstrate that InterKey achieves state-of-the-art accuracy, outperforming recent baselines by a large margin. The framework generalizes to sensors that can produce dense structural point clouds, offering a scalable and cost-effective solution for robust vehicle localization.
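
Compact binary descriptors are typically compared with a Hamming distance, and the sketch below shows that matching step. Using Hamming distance here is an assumption, since the abstract does not state the matching rule, and the 8-bit descriptors are toy values rather than real intersection descriptors.

```python
import numpy as np


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of bit positions at which two binary descriptors differ."""
    return int(np.count_nonzero(a != b))


def match_descriptor(query: np.ndarray, candidates: np.ndarray):
    """Return (index, distance) of the candidate closest to the query."""
    distances = [hamming_distance(query, c) for c in candidates]
    best = int(np.argmin(distances))
    return best, distances[best]


# Toy descriptors: one from a LiDAR scan (query), three from map intersections.
query = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
candidates = np.array([
    [0, 1, 0, 0, 1, 1, 0, 1],   # every bit differs
    [1, 0, 1, 0, 0, 0, 1, 0],   # one bit differs
    [1, 1, 1, 1, 0, 0, 0, 0],   # two bits differ
], dtype=np.uint8)
best_index, best_distance = match_descriptor(query, candidates)
```

The query matches the second candidate (index 1) at a distance of one bit. A real system would also need the orientation and sampling strategies the paper describes before descriptors from the two modalities become comparable.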

[CV-31] SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation

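Quick read: This paper addresses the efficiency bottleneck that high computational demand creates for diffusion model inference, and in particular the limited accuracy and speedup of existing feature caching methods that rely only on historical information. The solution is a new "self-speculation" paradigm that introduces future information by exploiting the similarity of information at the same time step across different iteration counts. This enables more precise feature selection and a multi-level classification strategy. A dynamic importance score is computed for each token from self-speculative and historical information to choose which cached features to use, and tokens are then sorted into multiple levels by differences in importance score, each with its own feature computation scheme. The method delivers average speedups of 2.80×, 2.74×, and 3.17× on Stable Diffusion 3, 3.5, and FLUX with negligible quality loss, easing the usual trade-off between speed and accuracy.

Link: https://arxiv.org/abs/2509.13848
Authors: Jiayi Pan,Jiaming Xu,Yongkang Zhou,Guohao Dai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: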

Abstract:Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average 2.80×, 2.74×, and 3.17× speedup with negligible quality loss in Stable Diffusion 3, 3.5, and FLUX compared to RFlow on NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in the efficient diffusion model inference.
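
The two algorithms named in the abstract can be pictured with the rough sketch below. It is a guess at the general shape, not SpecDiff itself: the linear blend with `alpha`, the three levels, and the fraction thresholds are all assumptions introduced for illustration.

```python
import numpy as np


def importance_scores(speculative: np.ndarray, historical: np.ndarray,
                      alpha: float = 0.5) -> np.ndarray:
    """Blend a self-speculative signal with a historical signal per token (assumed form)."""
    return alpha * speculative + (1.0 - alpha) * historical


def classify_tokens(scores: np.ndarray, recompute_fraction: float = 0.2,
                    reuse_fraction: float = 0.5) -> np.ndarray:
    """Assign each token a level: 0 = recompute, 1 = intermediate, 2 = reuse cached feature."""
    if recompute_fraction + reuse_fraction > 1.0:
        raise ValueError("fractions must sum to at most 1")
    n = len(scores)
    order = np.argsort(-scores, kind="stable")        # most important tokens first
    n_recompute = int(round(n * recompute_fraction))
    n_reuse = int(round(n * reuse_fraction))
    levels = np.ones(n, dtype=int)
    levels[order[:n_recompute]] = 0
    if n_reuse > 0:
        levels[order[n - n_reuse:]] = 2
    return levels


# Hypothetical signals for 10 tokens.
speculative = np.linspace(0.0, 1.0, 10)
historical = np.zeros(10)
levels = classify_tokens(importance_scores(speculative, historical))
```

With these toy inputs the two highest-scoring tokens are recomputed, the five lowest reuse cached features, and the three in between fall into the intermediate level.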

[CV-32] Consistent View Alignment Improves Foundation Models for 3D Medical Image Segmentation MICCAI2025

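[Quick read]: This paper addresses multi-view representation learning in self-supervised learning (SSL). Existing methods commonly assume that uncorrelated views of a data point are enough for meaningful latent structure to be learned automatically, but the authors show empirically that such structure does not emerge naturally and must be induced by an explicit mechanism. The key to the solution is a method called Consistent View Alignment, which aligns representations from different views to integrate complementary information while avoiding false positives, thereby improving performance on downstream tasks.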

Link: https://arxiv.org/abs/2509.13846
Authors: Puru Vaish,Felix Meister,Tobias Heimann,Christoph Brune,Jelmer M. Wolterink
Affiliations: University of Twente; Siemens Healthineers
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: MICCAI 2025: 1st Place in Transformer track and 2nd Place in Convolution track of SSL3D-OpenMind challenge

Abstract:Many recent approaches in representation learning implicitly assume that uncorrelated views of a data point are sufficient to learn meaningful representations for various downstream tasks. In this work, we challenge this assumption and demonstrate that meaningful structure in the latent space does not emerge naturally. Instead, it must be explicitly induced. We propose a method that aligns representations from different views of the data to align complementary information without inducing false positives. Our experiments show that our proposed self-supervised learning method, Consistent View Alignment, improves performance for downstream tasks, highlighting the critical role of structured view alignment in learning effective representations. Our method achieved first and second place in the MICCAI 2025 SSL3D challenge when using a Primus vision transformer and ResEnc convolutional neural network, respectively. The code and pretrained model weights are released at this https URL.

[CV-33] Semi-MoE: Mixture-of-Experts meets Semi-Supervised Histopathology Segmentation BMVC2025

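[Quick read]: This paper addresses the noisy pseudo-labels that arise in semi-supervised histopathology image segmentation from ambiguous gland boundaries and morphological misclassification. The key to the solution is a multi-task Mixture-of-Experts framework, Semi-MOE, in which three specialized expert networks (a main segmentation expert, a signed distance field regression expert, and a boundary prediction expert) each capture distinct morphological features. A Multi-Gating Pseudo-labeling module dynamically aggregates the expert features into a robust fuse-and-refine pseudo-labeling mechanism, and an Adaptive Multi-Objective Loss dynamically balances the learning objectives without manual tuning, improving segmentation performance in low-label settings.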

Link: https://arxiv.org/abs/2509.13834
Authors: Nguyen Lan Vi Vu,Thanh-Huy Nguyen,Thien Nguyen,Daisuke Kihara,Tianyang Wang,Xingjian Li,Min Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to BMVC 2025

Abstract:Semi-supervised learning has been employed to alleviate the need for extensive labeled data for histopathology image segmentation, but existing methods struggle with noisy pseudo-labels due to ambiguous gland boundaries and morphological misclassification. This paper introduces Semi-MOE, to the best of our knowledge, the first multi-task Mixture-of-Experts framework for semi-supervised histopathology image segmentation. Our approach leverages three specialized expert networks: A main segmentation expert, a signed distance field regression expert, and a boundary prediction expert, each dedicated to capturing distinct morphological features. Subsequently, the Multi-Gating Pseudo-labeling module dynamically aggregates expert features, enabling a robust fuse-and-refine pseudo-labeling mechanism. Furthermore, to eliminate manual tuning while dynamically balancing multiple learning objectives, we propose an Adaptive Multi-Objective Loss. Extensive experiments on GlaS and CRAG benchmarks show that our method outperforms state-of-the-art approaches in low-label settings, highlighting the potential of MoE-based architectures in advancing semi-supervised segmentation. Our code is available at this https URL.

[CV-34] Data-Efficient Spectral Classification of Hyperspectral Data Using MiniROCKET and HDC-MiniROCKET

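[Quick read]: This paper addresses the marked drop in performance of purely spectral classification models (models that use only spectral information) when training data is limited. The current state of the art, 1D-Justo-LiuNet, has few parameters and is efficient, but it performs poorly in small-sample settings. The authors therefore investigate MiniROCKET and its variant HDC-MiniROCKET. The core idea is a feature extraction stage with no trainable parameters: well-engineered, predefined feature transforms extract discriminative spectral features, which makes the model less vulnerable to limited training data. Experiments show that although MiniROCKET has more parameters, it outperforms 1D-Justo-LiuNet in limited-data scenarios and is mostly on par with it in the general case.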

Link: https://arxiv.org/abs/2509.13809
Authors: Nick Theisen,Kenny Schlegel,Dietrich Paulus,Peer Neubert
Affiliations: Wehrtechnische Dienststelle 41 (WTD); University of Koblenz; TU Chemnitz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at IEEE CASE 2025

Abstract:The classification of pixel spectra of hyperspectral images, i.e. spectral classification, is used in many fields ranging from agriculture and medicine to remote sensing, and is currently also expanding to areas such as autonomous driving. Even though for full hyperspectral images the best-performing methods exploit spatial-spectral information, performing classification solely on spectral information has its own advantages, e.g. smaller model size and thus less data required for training. Moreover, spectral information is complementary to spatial information and improvements on either part can be used to improve spatial-spectral approaches in the future. Recently, 1D-Justo-LiuNet was proposed as a particularly efficient model with very few parameters, which currently defines the state of the art in spectral classification. However, we show that with limited training data the model performance deteriorates. Therefore, we investigate MiniROCKET and HDC-MiniROCKET for spectral classification to mitigate that problem. The model extracts well-engineered features without trainable parameters in the feature extraction part and is therefore less vulnerable to limited training data. We show that even though MiniROCKET has more parameters, it outperforms 1D-Justo-LiuNet in limited data scenarios and is mostly on par with it in the general case.

[CV-35] Masked Feature Modeling Enhances Adaptive Segmentation

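[Quick read]: This paper addresses the performance bottleneck of semantic segmentation under unsupervised domain adaptation (UDA), namely how to improve segmentation accuracy on the target domain without target-domain labels. Existing methods use self-supervised tasks such as contrastive learning to improve feature discriminability, but masked modeling remains underexplored because of architectural incompatibility and misaligned optimization objectives. The key to the solution is a new auxiliary task, Masked Feature Modeling (MFM), which masks and reconstructs features directly in feature space. A lightweight reconstruction module (Rebuilder) is trained jointly and removed entirely at inference, so it adds zero computational overhead at test time. More importantly, MFM uses the segmentation decoder to classify the reconstructed features, tightly coupling the auxiliary objective with the pixel-wise prediction task so that it does not interfere with the main task and improves segmentation performance.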

Link: https://arxiv.org/abs/2509.13801
Authors: Wenlve Zhou,Zhiheng Zhou,Tiantao Xian,Yikui Zhai,Weibin Wu,Biyun Ma
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer models from a labeled source domain to an unlabeled target domain. While auxiliary self-supervised tasks, particularly contrastive learning, have improved feature discriminability, masked modeling approaches remain underexplored in this setting, largely due to architectural incompatibility and misaligned optimization objectives. We propose Masked Feature Modeling (MFM), a novel auxiliary task that performs feature masking and reconstruction directly in the feature space. Unlike existing masked modeling methods that reconstruct low-level inputs or perceptual features (e.g., HOG or visual tokens), MFM aligns its learning target with the main segmentation task, ensuring compatibility with standard architectures like DeepLab and DAFormer without modifying the inference pipeline. To facilitate effective reconstruction, we introduce a lightweight auxiliary module, Rebuilder, which is trained jointly but discarded during inference, adding zero computational overhead at test time. Crucially, MFM leverages the segmentation decoder to classify the reconstructed features, tightly coupling the auxiliary objective with the pixel-wise prediction task to avoid interference with the primary task. Extensive experiments across various architectures and UDA benchmarks demonstrate that MFM consistently enhances segmentation performance, offering a simple, efficient, and generalizable strategy for unsupervised domain-adaptive semantic segmentation.

[CV-36] SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments

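[Quick read]: This paper addresses the challenges of high-precision localization for vision-based unmanned aerial vehicles (UAVs) in environments where Global Navigation Satellite System (GNSS) signals are unavailable, including the limitations of existing retrieval-based methods in dataset availability, real-time performance, environmental sensitivity, and generalization, especially in dynamic or temporally varying environments. The key to the solution is a large-scale Multi-Altitude Flight Segments (MAFS) dataset and a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. By introducing a semantic weighting mechanism and an optimized particle filter architecture, the method fuses robust semantic features from UAV-captured images and satellite imagery to achieve efficient, accurate 4-degree-of-freedom (4-DoF) pose estimation. Using only low-resolution satellite maps, it keeps global positioning error below 10 meters and is 10x more computationally efficient than feature extraction methods.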

Link: https://arxiv.org/abs/2509.13795
Authors: Jiayu Yuan,Ming Dai,Enhui Zheng,Chao Su,Nanxing Chen,Qiming Hu,Shibo Zhu,Yibin Cao
Affiliations: China Jiliang University; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-based Unmanned Aerial Vehicle (UAV) localization systems have been extensively investigated for Global Navigation Satellite System (GNSS)-denied environments. However, existing retrieval-based approaches face limitations in dataset availability and persistent challenges including suboptimal real-time performance, environmental sensitivity, and limited generalization capability, particularly in dynamic or temporally varying environments. To overcome these limitations, we present a large-scale Multi-Altitude Flight Segments dataset (MAFS) for variable altitude scenarios and propose a novel Semantic-Weighted Adaptive Particle Filter (SWA-PF) method. This approach integrates robust semantic features from both UAV-captured images and satellite imagery through two key innovations: a semantic weighting mechanism and an optimized particle filtering architecture. Evaluated using our dataset, the proposed method achieves 10x computational efficiency gain over feature extraction methods, maintains global positioning errors below 10 meters, and enables rapid 4 degree of freedom (4-DoF) pose estimation within seconds using accessible low-resolution satellite maps. Code and dataset will be available at this https URL.
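A particle filter weighted by semantic similarity can be sketched roughly as follows. This is a generic illustration, not the SWA-PF implementation: the likelihood form `exp(similarity / temperature)`, the `temperature` value, the function names, and the toy similarity scores are all assumptions, and the step that would compare UAV and satellite semantics to produce those scores is not shown.

```python
import math


def update_weights(weights, similarities, temperature=0.1):
    """Scale each prior weight by exp(similarity / temperature), then normalize."""
    raw = [w * math.exp(s / temperature) for w, s in zip(weights, similarities)]
    total = sum(raw)
    return [r / total for r in raw]


def effective_sample_size(weights):
    """Low values signal weight collapse, a common trigger for resampling."""
    return 1.0 / sum(w * w for w in weights)


def estimate_pose(particles, weights):
    """Weighted mean of (x, y, z, yaw); yaw is averaged on the circle."""
    x = sum(w * p[0] for p, w in zip(particles, weights))
    y = sum(w * p[1] for p, w in zip(particles, weights))
    z = sum(w * p[2] for p, w in zip(particles, weights))
    yaw = math.atan2(sum(w * math.sin(p[3]) for p, w in zip(particles, weights)),
                     sum(w * math.cos(p[3]) for p, w in zip(particles, weights)))
    return (x, y, z, yaw)


# Three 4-DoF hypotheses (x, y, altitude, yaw) with made-up similarity scores.
particles = [(0.0, 0.0, 100.0, 0.0), (10.0, 0.0, 100.0, 0.0), (20.0, 0.0, 100.0, 0.0)]
weights = update_weights([1 / 3] * 3, similarities=[0.2, 0.9, 0.2])
ess = effective_sample_size(weights)
pose = estimate_pose(particles, weights)
```

Here the middle particle matches best, so it receives almost all of the weight and the pose estimate sits close to it.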

[CV-37] Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation

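[Quick read]: This paper addresses the performance drop in Spacecraft Pose Estimation (SPE) caused by the synthetic-to-real domain gap, in particular the weak performance of existing unsupervised domain adaptation methods when a small amount of labeled real data is available. The key to the solution is the first Supervised Domain Adaptation (SDA) framework for SPE keypoint regression. Built on the Learning Invariant Representation and Risk (LIRR) paradigm, it jointly optimizes domain-invariant representations and task-specific risk, training on labeled synthetic data together with limited labeled real data to reduce generalization error under domain shift. Experiments show that with only 5% labeled target data, the method matches or surpasses an oracle baseline trained on larger fractions of labeled data.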

Link: https://arxiv.org/abs/2509.13792
Authors: Inder Pal Singh,Nidhal Eddine Chenni,Abd El Rahman Shabayek,Arunkumar Rathinam,Djamila Aouada
Affiliations: SnT, University of Luxembourg
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Spacecraft Pose Estimation (SPE) is a fundamental capability for autonomous space operations such as rendezvous, docking, and in-orbit servicing. Hybrid pipelines that combine object detection, keypoint regression, and Perspective-n-Point (PnP) solvers have recently achieved strong results on synthetic datasets, yet their performance deteriorates sharply on real or lab-generated imagery due to the persistent synthetic-to-real domain gap. Existing unsupervised domain adaptation approaches aim to mitigate this issue but often underperform when a modest number of labeled target samples are available. In this work, we propose the first Supervised Domain Adaptation (SDA) framework tailored for SPE keypoint regression. Building on the Learning Invariant Representation and Risk (LIRR) paradigm, our method jointly optimizes domain-invariant representations and task-specific risk using both labeled synthetic and limited labeled real data, thereby reducing generalization error under domain shift. Extensive experiments on the SPEED+ benchmark demonstrate that our approach consistently outperforms source-only, fine-tuning, and oracle baselines. Notably, with only 5% labeled target data, our method matches or surpasses oracle performance trained on larger fractions of labeled data. The framework is lightweight, backbone-agnostic, and computationally efficient, offering a practical pathway toward robust and deployable spacecraft pose estimation in real-world space environments.

[CV-38] BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

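[Quick read]: This paper addresses the high inference latency of video generation models based on Diffusion Transformers (DiTs), which stems from their inherently sequential denoising process and limits real-world use. Existing acceleration methods either sacrifice visual quality through architectural changes or fail to reuse intermediate features at a suitable granularity. The key to the solution is Block-Wise Caching (BWCache), a training-free mechanism that dynamically caches and reuses DiT block features across diffusion timesteps. A similarity indicator triggers reuse only when the difference between block features at adjacent timesteps falls below a threshold, which reduces redundant computation while preserving visual fidelity. Experiments on several video diffusion models show speedups of up to 2.24x.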

Link: https://arxiv.org/abs/2509.13789
Authors: Hanshuai Cui,Zhiqing Tang,Zhifei Xu,Zhi Yao,Wenyi Zeng,Weijia Jia
Affiliations: Beijing Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24× speedup with comparable visual quality.
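Threshold-triggered reuse of a block's output can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than BWCache itself: the relative L1 change measured on the block input, the `BlockCache` class, and the `threshold` value are all hypothetical, and the abstract does not say which features the similarity indicator compares or how.

```python
def relative_l1_change(curr, prev):
    """Sum of absolute differences, relative to the magnitude of the previous features."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) or 1e-12
    return num / den


class BlockCache:
    """Reuses a block's last computed output while its input has barely changed."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.prev_input = None      # input at the last timestep that was actually computed
        self.cached_output = None
        self.hits = 0

    def run(self, block_fn, x):
        if (self.prev_input is not None
                and relative_l1_change(x, self.prev_input) < self.threshold):
            self.hits += 1
            return self.cached_output   # skip the block entirely
        out = block_fn(x)
        self.prev_input = list(x)
        self.cached_output = out
        return out


calls = []


def toy_block(x):
    """Stand-in for an expensive DiT block; records each real invocation."""
    calls.append(1)
    return [2.0 * v for v in x]


cache = BlockCache(threshold=0.05)
out_a = cache.run(toy_block, [1.0, 1.0])    # first timestep: computed
out_b = cache.run(toy_block, [1.01, 1.0])   # small change: reused
out_c = cache.run(toy_block, [2.0, 1.0])    # large change: recomputed
```

In this toy run the block executes twice across three timesteps. The reference input is only refreshed on a real computation, so slow drift accumulates until it crosses the threshold and forces a recompute.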

[CV-39] CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling

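[Quick read]: This paper addresses two core problems in existing event camera processing. First, conventional methods convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which requires predefined time windows and introduces window latency. Second, pointwise detection methods are too computationally expensive to run in real time. The key to the solution is a new architecture, Variable-Rate Spatial Event Mamba, which processes raw event streams directly without intermediate representations. A lightweight causal spatial neighborhood encoder efficiently captures local geometric relations, and Mamba state space models provide scalable temporal modeling with linear complexity. At inference time, a controller adjusts the processing rate dynamically to achieve an optimal balance between window latency and inference latency.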

Link: https://arxiv.org/abs/2509.13784
Authors: Hanfang Liang,Bing Wang,Shizhen Zhang,Wen Jiang,Yizhuo Yang,Weixiang Guo,Shenghai Yuan
Affiliations: Jianghan University; Beijing Institute of Technology; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 6 figures

Abstract:Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution, offering unique advantages for high-speed vision tasks. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds, which inevitably require predefined time windows and thus introduce window latency. Meanwhile, pointwise detection methods face computational challenges that prevent real-time efficiency due to their high computational cost. To overcome these limitations, we propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations. Our method introduces a lightweight causal spatial neighborhood encoder to efficiently capture local geometric relations, followed by Mamba-based state space models for scalable temporal modeling with linear complexity. During inference, a controller adaptively adjusts the processing speed according to the event rate, achieving an optimal balance between window latency and inference latency.

[CV-40] Morphology-optimized Multi-Scale Fusion: Combining Local Artifacts and Mesoscopic Semantics for Deepfake Detection and Localization IJCAI2025

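[Quick read]: This paper addresses the difficulty of precisely localizing forged regions in deepfake images. It targets two weaknesses of existing approaches: insufficient fusion of local detail with global semantic information, and naive combination of local and global predictions, which amplifies noise and reduces localization accuracy. The key to the solution is a dual-branch mechanism that predicts forged regions independently from local and global perspectives, then fuses the two outputs with morphological operations. This suppresses noise and strengthens spatial coherence, improving the accuracy and robustness of forgery localization.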

Link: https://arxiv.org/abs/2509.13776
Authors: Chao Shuai,Gaojian Wang,Kun Pan,Tong Wu,Fanli Jin,Haohan Tan,Mengxiang Li,Zhenguang Liu,Feng Lin,Kui Ren
Affiliations: Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The 3rd Place, IJCAI 2025 Workshop on Deepfake Detection, Localization, and Interpretability

Abstract:While the pursuit of higher accuracy in deepfake detection remains a central goal, there is an increasing demand for precise localization of manipulated regions. Despite the remarkable progress made in classification-based detection, accurately localizing forged areas remains a significant challenge. A common strategy is to incorporate forged region annotations during model training alongside manipulated images. However, such approaches often neglect the complementary nature of local detail and global semantic context, resulting in suboptimal localization performance. Moreover, an often-overlooked aspect is the fusion strategy between local and global predictions. Naively combining the outputs from both branches can amplify noise and errors, thereby undermining the effectiveness of the localization. To address these issues, we propose a novel approach that independently predicts manipulated regions using both local and global perspectives. We employ morphological operations to fuse the outputs, effectively suppressing noise while enhancing spatial coherence. Extensive experiments reveal the effectiveness of each module in improving the accuracy and robustness of forgery localization.
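How morphological operations can clean up a fused mask is shown in the sketch below. It assumes binary masks, a 3x3 structuring element, a union of the two branch outputs, and a single opening (erosion followed by dilation); these choices and the function names are illustrative, since the abstract does not state which operations the authors apply or in what order.

```python
def _window(mask, r, c):
    """Values in the 3x3 neighborhood of (r, c); pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [mask[i][j] if 0 <= i < h and 0 <= j < w else 0
            for i in range(r - 1, r + 2) for j in range(c - 1, c + 2)]


def erode(mask):
    """A pixel survives only if its whole 3x3 neighborhood is set."""
    return [[int(all(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def dilate(mask):
    """A pixel is set if any pixel in its 3x3 neighborhood is set."""
    return [[int(any(_window(mask, r, c))) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def fuse_masks(local_mask, global_mask):
    """Union of both branch predictions, followed by a morphological opening."""
    union = [[int(a or b) for a, b in zip(ra, rb)]
             for ra, rb in zip(local_mask, global_mask)]
    return dilate(erode(union))


# A 3x3 forged region seen by both branches, plus one isolated false positive.
local_mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(6)] for r in range(6)]
global_mask = [row[:] for row in local_mask]
global_mask[5][5] = 1
fused = fuse_masks(local_mask, global_mask)
```

The opening removes the isolated pixel at (5, 5) while the 3x3 region is restored intact, which is the noise-suppression effect the abstract refers to.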

[CV-41] AdaThinkDrive: Adaptive Thinking via Reinforcement Learning for Autonomous Driving

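[Quick read]: This paper addresses an imbalance between efficiency and effectiveness when Chain of Thought (CoT) reasoning is added to Vision Language Action (VLA) models for end-to-end autonomous driving: CoT is applied even in simple scenarios, where it adds unnecessary computation without improving decision quality. The key to the solution is the AdaThinkDrive framework, built around a dual-mode mechanism inspired by fast and slow thinking. The model acquires driving commonsense and world knowledge during pretraining. Supervised fine-tuning then uses a two-mode dataset of fast answering (without CoT) and slow thinking (with CoT), so the model learns to choose whether to use CoT according to task complexity. An Adaptive Think Reward strategy, combined with Group Relative Policy Optimization (GRPO), uses differences in trajectory quality across reasoning modes to guide the model toward a dynamic balance between accuracy and reasoning efficiency.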

Link: https://arxiv.org/abs/2509.13769
Authors: Yuechen Luo,Fang Li,Shaoqing Xu,Zhiyi Lai,Lei Yang,Qimao Chen,Ziang Luo,Zixun Xie,Shengyin Jiang,Jiaxin Liu,Long Chen,Bing Wang,Zhi-xin Yang
Affiliations: Tsinghua University; Xiaomi EV; University of Macau; Nanyang Technological University; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:While reasoning technology like Chain of Thought (CoT) has been widely adopted in Vision Language Action (VLA) models, it demonstrates promising capabilities in end to end autonomous driving. However, recent efforts to integrate CoT reasoning often fall short in simple scenarios, introducing unnecessary computational overhead without improving decision quality. To address this, we propose AdaThinkDrive, a novel VLA framework with a dual mode reasoning mechanism inspired by fast and slow thinking. First, our framework is pretrained on large scale autonomous driving (AD) scenarios using both question answering (QA) and trajectory datasets to acquire world knowledge and driving commonsense. During supervised fine tuning (SFT), we introduce a two mode dataset, fast answering (w/o CoT) and slow thinking (with CoT), enabling the model to distinguish between scenarios that require reasoning. Furthermore, an Adaptive Think Reward strategy is proposed in conjunction with the Group Relative Policy Optimization (GRPO), which rewards the model for selectively applying CoT by comparing trajectory quality across different reasoning modes. Extensive experiments on the Navsim benchmark show that AdaThinkDrive achieves a PDMS of 90.3, surpassing the best vision only baseline by 1.7 points. Moreover, ablations show that AdaThinkDrive surpasses both the never Think and always Think baselines, improving PDMS by 2.0 and 1.4, respectively. It also reduces inference time by 14% compared to the always Think baseline, demonstrating its ability to balance accuracy and efficiency through adaptive reasoning.
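The group-relative normalization used in GRPO, combined with a reward that depends on reasoning mode, can be sketched as follows. The normalization `(r - mean) / (std + eps)` within one sampled group is the standard GRPO step (population standard deviation is used here; implementations differ). The bonus rule in `adaptive_think_rewards`, its `bonus` value, and the toy scores are purely hypothetical stand-ins, since the abstract does not define the Adaptive Think Reward.

```python
import statistics


def adaptive_think_rewards(rollouts, bonus=0.1):
    """rollouts: list of (used_cot, trajectory_score). Hypothetical bonus rule."""
    think = [s for used, s in rollouts if used]
    fast = [s for used, s in rollouts if not used]
    if not think or not fast:
        return [s for _, s in rollouts]  # only one mode sampled: nothing to compare
    prefer_think = statistics.fmean(think) > statistics.fmean(fast)
    # Rollouts that used the better-scoring mode for this scene get a small bonus.
    return [s + (bonus if used == prefer_think else 0.0) for used, s in rollouts]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One scene, four sampled rollouts: two with CoT, two without (made-up scores).
rollouts = [(True, 0.9), (True, 0.8), (False, 0.6), (False, 0.7)]
rewards = adaptive_think_rewards(rollouts)
advantages = group_relative_advantages(rewards)
```

In this toy scene the CoT rollouts score higher on average, so they receive the bonus and end up with positive advantages, while the rollouts without CoT end up with negative ones.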

[CV-42] Generative Image Coding with Diffusion Prior

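[Quick read]: This paper addresses the difficulty of maintaining subjective visual quality at high compression ratios, particularly for content that mixes natural images with images produced by generative AI. Traditional codecs and existing learned methods fall short in this setting, while purely generative methods struggle with reconstruction fidelity and generalization. The key to the solution is a new generative coding framework based on diffusion priors. A pre-optimized encoder produces generalized compressed-domain representations, which a lightweight adapter and an attentive fusion module integrate with the internal features of a pretrained diffusion model. A distribution renormalization method further improves reconstruction fidelity. The framework adapts to different pretrained models with minimal retraining, improves visual quality at low bitrates, and improves compression performance by up to 79% over H.266/VVC.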

Link: https://arxiv.org/abs/2509.13768
Authors: Jianhui Chang
Affiliations: China Telecom Cloud Computing Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:As generative technologies advance, visual content has evolved into a complex mix of natural and AI-generated images, driving the need for more efficient coding techniques that prioritize perceptual quality. Traditional codecs and learned methods struggle to maintain subjective quality at high compression ratios, while existing generative approaches face challenges in visual fidelity and generalization. To this end, we propose a novel generative coding framework leveraging diffusion priors to enhance compression performance at low bitrates. Our approach employs a pre-optimized encoder to generate generalized compressed-domain representations, integrated with the pretrained model’s internal features via a lightweight adapter and an attentive fusion module. This framework effectively leverages existing pretrained diffusion models and enables efficient adaptation to different pretrained models for new requirements with minimal retraining costs. We also introduce a distribution renormalization method to further enhance reconstruction fidelity. Extensive experiments show that our method (1) outperforms existing methods in visual fidelity across low bitrates, (2) improves compression performance by up to 79% over H.266/VVC, and (3) offers an efficient solution for AI-generated content while being adaptable to broader content types.

[CV-43] VocSegMRI: Multimodal Learning for Precise Vocal Tract Segmentation in Real-time MRI ICASSP

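[Quick read]: This paper addresses the limited accuracy of articulatory structure segmentation in real-time magnetic resonance imaging (rtMRI). Existing methods rely mainly on visual cues and ignore the complementary information carried by synchronized acoustic and phonological signals. The key to the solution is the VocSegMRI framework, which integrates video, audio, and phonological inputs through cross-attention fusion for dynamic feature alignment. A contrastive learning objective strengthens cross-modal representations, so segmentation remains accurate even when audio is unavailable at inference.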

Link: https://arxiv.org/abs/2509.13767
Authors: Daiqi Liu,Tomás Arias-Vergara,Johannes Enk,Fangxu Xing,Maureen Stone,Jerry L. Prince,Jana Hutter,Andreas Maier,Jonghye Woo,Paula Andrea Pérez-Toro
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg; Universidad de Antioquia UdeA; Harvard Medical School / MGH; University of Maryland School of Dentistry; Johns Hopkins University; Smart Imaging Lab; Pattern Recognition Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint submitted to ICASSP

Abstract:Accurately segmenting articulatory structures in real-time magnetic resonance imaging (rtMRI) remains challenging, as most existing methods rely almost entirely on visual cues. Yet synchronized acoustic and phonological signals provide complementary context that can enrich visual information and improve precision. In this paper, we introduce VocSegMRI, a multimodal framework that integrates video, audio, and phonological inputs through cross-attention fusion for dynamic feature alignment. To further enhance cross-modal representation, we incorporate a contrastive learning objective that improves segmentation performance even when the audio modality is unavailable at inference. Evaluated on a subset of the USC-75 rtMRI dataset, our approach achieves state-of-the-art performance, with a Dice score of 0.95 and a 95th percentile Hausdorff Distance (HD_95) of 4.20 mm, outperforming both unimodal and multimodal baselines. Ablation studies confirm the contributions of cross-attention and contrastive learning to segmentation precision and robustness. These results highlight the value of integrative multimodal modeling for accurate vocal tract analysis.
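For readers unfamiliar with the Dice score reported above, it is 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. The sketch below computes it for binary masks stored as nested lists; the function name and the convention of returning 1.0 when both masks are empty are choices made here for illustration, not details taken from the paper.

```python
def dice_score(pred, target):
    """Dice coefficient for two binary masks of the same shape (nested lists of 0/1)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(1 for a in p if a) + sum(1 for b in t if b)
    # Convention chosen here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # one shared pixel, three foreground pixels in total
```

A score of 0.95, as reported in the abstract, means the overlap is close to the maximum of 1.0.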

[CV-44] NDLPNet: A Location-Aware Nighttime Deraining Network and a Real-World Benchmark Dataset

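[Quick read]: This paper addresses image degradation caused by rain streak artifacts in low-light conditions, which significantly hampers nighttime surveillance and autonomous navigation. Existing deraining techniques are designed mainly for daytime scenes and perform poorly at night because of the spatial heterogeneity of rain distribution and the light-dependent visibility of streaks. The key to the solution is a new nighttime deraining network, NDLPNet, whose core component is a Position Perception Module (PPM). The PPM captures the spatial position and density distribution of rain streaks in low-light environments and recalibrates the importance of different feature channels, which helps the model separate rain streaks from background information. The authors also build NSR, a dataset of 900 image pairs based on real-world nighttime scenes, as a new benchmark for nighttime deraining. Experiments show that the method outperforms state-of-the-art deraining techniques on multiple datasets.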

Link: https://arxiv.org/abs/2509.13766
Authors: Huichun Liu,Xiaosong Li,Yang Liu,Xiaoqi Cheng,Haishu Tan
Affiliations: Beihang University; Foshan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Visual degradation caused by rain streak artifacts in low-light conditions significantly hampers the performance of nighttime surveillance and autonomous navigation. Existing image deraining techniques are primarily designed for daytime conditions and perform poorly under nighttime illumination due to the spatial heterogeneity of rain distribution and the impact of light-dependent stripe visibility. In this paper, we propose a novel Nighttime Deraining Location-enhanced Perceptual Network (NDLPNet) that effectively captures the spatial positional information and density distribution of rain streaks in low-light environments. Specifically, we introduce a Position Perception Module (PPM) to capture and leverage spatial contextual information from input data, enhancing the model’s capability to identify and recalibrate the importance of different feature channels. The proposed nighttime deraining network can effectively remove the rain streaks as well as preserve the crucial background information. Furthermore, we construct a night scene rainy (NSR) dataset comprising 900 image pairs, all based on real-world nighttime scenes, providing a new benchmark for nighttime deraining task research. Extensive qualitative and quantitative experimental evaluations on both existing datasets and the NSR dataset consistently demonstrate that our method outperforms the state-of-the-art (SOTA) methods in nighttime deraining tasks. The source code and dataset are available at this https URL.

[CV-45] Task-Aware Image Signal Processor for Advanced Visual Perception

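[Quick read]: This paper addresses two limitations of current RAW-based image signal processing (ISP) methods for visual perception tasks: large-scale ISP networks impose heavy computational overhead, and methods that tune traditional ISP pipelines are restricted by limited representational capacity. The key to the solution is Task-Aware Image Signal Processing (TA-ISP). Instead of dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics. This keeps memory usage, computation, and latency low while significantly expanding the range of spatially varying transforms that can be represented, improving downstream detection and segmentation performance.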

Link: https://arxiv.org/abs/2509.13762
Authors: Kai Chen,Jin Xiao,Leheng Zhang,Kexuan Shi,Shuhang Gu
Affiliations: University of Electronic Science and Technology of China; ByteDance Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, there has been a growing trend in computer vision towards exploiting RAW sensor data, which preserves richer information compared to conventional low-bit RGB images. Early studies mainly focused on enhancing visual quality, while more recent efforts aim to leverage the abundant information in RAW data to improve the performance of visual perception tasks such as object detection and segmentation. However, existing approaches still face two key limitations: large-scale ISP networks impose heavy computational overhead, while methods based on tuning traditional ISP pipelines are restricted by limited representational capacity. To address these issues, we propose Task-Aware Image Signal Processing (TA-ISP), a compact RAW-to-RGB framework that produces task-oriented representations for pretrained vision models. Instead of heavy dense convolutional pipelines, TA-ISP predicts a small set of lightweight, multi-scale modulation operators that act at global, regional, and pixel scales to reshape image statistics across different spatial extents. This factorized control significantly expands the range of spatially varying transforms that can be represented while keeping memory usage, computation, and latency tightly constrained. Evaluated on several RAW-domain detection and segmentation benchmarks under both daytime and nighttime conditions, TA-ISP consistently improves downstream accuracy while markedly reducing parameter count and inference time, making it well suited for deployment on resource-constrained devices.

[CV-46] Iterative Prompt Refinement for Safer Text-to-Image Generation

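[Quick read]: This paper addresses the heavy dependence of Text-to-Image (T2I) output quality and safety on how prompts are phrased. Existing safety methods based on large language models (LLMs) refine prompts using text alone and ignore the generated images, which can lead to unsafe outputs or unnecessary changes to prompts that were already safe. The key to the solution is an iterative prompt refinement algorithm based on Vision Language Models (VLMs), which uses visual feedback from the generated images to adjust prompts, improving safety while preserving user intent and reliability. The authors also build a new dataset labeled with both textual and visual safety signals to support supervised fine-tuning.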

Link: https://arxiv.org/abs/2509.13760
Authors: Jinwoo Jeon,JunHyeok Oh,Hayeong Lee,Byung-Jun Lee
Affiliations: Korea University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at this https URL. WARNING: This paper contains examples of harmful or inappropriate images generated by models.

[CV-47] Controllable-Continuous Color Editing in Diffusion Model via Color Mapping

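[Quick read]: This paper addresses insufficient precision and the difficulty of continuous control in color editing for text-driven image editing. The key to the solution is a color mapping module that explicitly models the correspondence between the text embedding space and image RGB values. Given an RGB value, the module predicts the corresponding embedding vector, enabling precise control of the generated image's color while maintaining semantic consistency. Users can specify a target RGB range to generate images with continuous color variation within that range.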

Link: https://arxiv.org/abs/2509.13756
Authors: Yuqi Yang,Dongliang Chang,Yuanchen Fang,Yi-Zhe Song,Zhanyu Ma,Jun Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, text-driven image editing has made significant progress. However, due to the inherent ambiguity and discreteness of natural language, color editing still faces challenges such as insufficient precision and difficulty in achieving continuous control. Although linearly interpolating the embedding vectors of different textual descriptions can guide the model to generate a sequence of images with varying colors, this approach lacks precise control over the range of color changes in the output images. Moreover, the relationship between the interpolation coefficient and the resulting image color is unknown and uncontrollable. To address these issues, we introduce a color mapping module that explicitly models the correspondence between the text embedding space and image RGB values. This module predicts the corresponding embedding vector based on a given RGB value, enabling precise color control of the generated images while maintaining semantic consistency. Users can specify a target RGB range to generate images with continuous color variations within the desired range, thereby achieving finer-grained, continuous, and controllable color editing. Experimental results demonstrate that our method performs well in terms of color continuity and controllability.
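The interpolation baseline that the abstract contrasts against, linear interpolation between the embedding vectors of two textual descriptions, can be written down directly. The two-dimensional toy vectors and the function names are assumptions for illustration; real text embeddings are high-dimensional, and the paper's color mapping module (which predicts an embedding from an RGB value) is not reproduced here.

```python
def lerp_embeddings(e_a, e_b, t):
    """Linear interpolation: t = 0 returns e_a, t = 1 returns e_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(e_a, e_b)]


def interpolation_path(e_a, e_b, steps):
    """Evenly spaced embeddings from e_a to e_b (steps must be at least 2)."""
    return [lerp_embeddings(e_a, e_b, i / (steps - 1)) for i in range(steps)]


# Toy stand-ins for the embeddings of two color descriptions.
e_first = [0.0, 0.0]
e_second = [1.0, 2.0]
path = interpolation_path(e_first, e_second, steps=3)
```

Each embedding on the path would condition one generated image. As the abstract notes, nothing in this scheme ties the coefficient `t` to a specific output color, which is the gap the color mapping module is meant to close.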
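
As a rough illustration of the interpolation baseline the abstract describes, and of sweeping a user-specified RGB range, here is a minimal Python sketch. The helper names `lerp` and `rgb_sweep` and the toy values are hypothetical. The learned color mapping module that turns an RGB value into an embedding is not reproduced here.

```python
def lerp(a, b, t):
    """Linearly interpolate between two equal-length vectors at coefficient t in [0, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def rgb_sweep(start, end, steps):
    """Return `steps` evenly spaced RGB triples from `start` to `end`, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [
        tuple(round(v) for v in lerp(start, end, i / (steps - 1)))
        for i in range(steps)
    ]


# Toy stand-ins for two text embeddings (real embeddings have far more dimensions).
embedding_a = [0.0, 0.0]
embedding_b = [2.0, 4.0]
midpoint = lerp(embedding_a, embedding_b, 0.5)

# A target RGB range sampled at three points.
colors = rgb_sweep((0, 0, 0), (200, 100, 0), 3)
```

In the paper's setting, each RGB triple from the sweep would be passed to the color mapping module to obtain an embedding; the sketch stops before that step.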

[CV-48] Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval

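[Quick Read]: This paper addresses insufficient cross-modal alignment in Text-to-Image Person Retrieval (TIPR). Existing methods rely on implicit attention mechanisms for local alignment but cannot verify that alignment, and they focus on hard negative samples to sharpen the distinction between positive and negative pairs while neglecting incorrectly matched positive pairs. The proposed Full-Mode Fine-grained Alignment (FMFA) framework has two core modules. The Adaptive Similarity Distribution Matching (A-SDM) module adaptively pulls unmatched positive pairs closer in the joint embedding space, improving global alignment. The Explicit Fine-grained Alignment (EFA) module sparsifies the similarity matrix and uses a hard coding method to strengthen fine-grained cross-modal interaction, compensating for the unverifiable nature of implicit relational reasoning. Together they combine explicit and implicit alignment without additional supervision.

Link: https://arxiv.org/abs/2509.13754
Authors: Hao Yin,Xin Man,Feiyu Chen,Jie Shao,Heng Tao Shen
Affiliations: Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China; Sichuan Artificial Intelligence Research Institute; University of Electronic Science and Technology of China
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: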

Abstract: Text-to-Image Person Retrieval (TIPR) is a cross-modal matching task that aims to retrieve the most relevant person images based on a given text query. The key challenge in TIPR lies in achieving effective alignment between textual and visual modalities within a common latent space. To address this challenge, prior approaches incorporate attention mechanisms for implicit cross-modal local alignment. However, they lack the ability to verify whether all local features are correctly aligned. Moreover, existing methods primarily focus on hard negative samples during model updates, with the goal of refining distinctions between positive and negative pairs, often neglecting incorrectly matched positive pairs. To alleviate these issues, we propose FMFA, a cross-modal Full-Mode Fine-grained Alignment framework, which enhances global matching through explicit fine-grained alignment and existing implicit relational reasoning (hence the term "full-mode") without requiring additional supervision. Specifically, we design an Adaptive Similarity Distribution Matching (A-SDM) module to rectify unmatched positive sample pairs. A-SDM adaptively pulls the unmatched positive pairs closer in the joint embedding space, thereby achieving more precise global alignment. Additionally, we introduce an Explicit Fine-grained Alignment (EFA) module, which makes up for the lack of verification capability of implicit relational reasoning. EFA strengthens explicit cross-modal fine-grained interactions by sparsifying the similarity matrix and employs a hard coding method for local alignment. Our proposed method is evaluated on three public datasets, achieving state-of-the-art performance among all global matching methods. Our code is available at this https URL.
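
To make "sparsifying the similarity matrix" and hard local assignment concrete, here is a minimal sketch. It assumes a top-k rule per row and an argmax assignment, neither of which the abstract specifies; the function names and the toy matrix are hypothetical, and this is not the EFA module itself.

```python
def sparsify_top_k(sim, k):
    """Keep the k largest entries in each row of a similarity matrix; zero the rest."""
    out = []
    for row in sim:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        out.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return out


def hard_assign(sim):
    """Assign each row (e.g. a text token) to its single most similar column (e.g. an image patch)."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in sim]


# Toy text-token x image-patch similarity matrix.
similarity = [
    [0.1, 0.9, 0.3],
    [0.8, 0.2, 0.5],
]
sparse = sparsify_top_k(similarity, 1)
assignment = hard_assign(similarity)
```

With `k = 1` the sparsified matrix and the hard assignment carry the same information; larger `k` keeps a few candidate matches per token.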

[CV-49] Improving Generalized Visual Grounding with Instance-aware Joint Learning

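[Quick Read]: This paper addresses two limitations of current generalized visual grounding methods. First, existing methods usually handle Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES) independently, overlooking the consistency of multi-granularity predictions that joint training provides. Second, they treat GRES as a purely semantic segmentation task and neglect instance-level awareness, so box and mask predictions are inconsistent. The proposed InstanceVG framework introduces instance queries, each assigned a prior reference point, to jointly predict instance-level boxes and masks under a consistency constraint, improving performance on both GREC and GRES. According to the authors, it is the first framework to bring instance awareness to generalized visual grounding.

Link: https://arxiv.org/abs/2509.13747
Authors: Ming Dai,Wenxuan Cheng,Jiang-Jiang Liu,Lingfeng Yang,Zhenhua Feng,Wankou Yang,Jingdong Wang
Affiliations: Southeast University; Baidu Inc.; JiangNan University; Nanjing University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in September 2025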

Abstract: Generalized visual grounding tasks, including Generalized Referring Expression Comprehension (GREC) and Segmentation (GRES), extend the classical visual grounding paradigm by accommodating multi-target and non-target scenarios. Specifically, GREC focuses on accurately identifying all referential objects at the coarse bounding box level, while GRES aims to achieve fine-grained pixel-level perception. However, existing approaches typically treat these tasks independently, overlooking the benefits of jointly training GREC and GRES to ensure consistent multi-granularity predictions and streamline the overall process. Moreover, current methods often treat GRES as a semantic segmentation task, neglecting the crucial role of instance-aware capabilities and the necessity of ensuring consistent predictions between instance-level boxes and masks. To address these limitations, we propose InstanceVG, a multi-task generalized visual grounding framework equipped with instance-aware capabilities, which leverages instance queries to unify the joint and consistent predictions of instance-level boxes and masks. To the best of our knowledge, InstanceVG is the first framework to simultaneously tackle both GREC and GRES while incorporating instance-aware capabilities into generalized visual grounding. To instantiate the framework, we assign each instance query a prior reference point, which also serves as an additional basis for target matching. This design facilitates consistent predictions of points, boxes, and masks for the same instance. Extensive experiments conducted on ten datasets across four tasks demonstrate that InstanceVG achieves state-of-the-art performance, significantly surpassing the existing methods in various evaluation metrics. The code and model will be publicly available at this https URL.

[CV-50] Mitigating Query Selection Bias in Referring Video Object Segmentation

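[Quick Read]: This paper addresses query selection bias in Referring Video Object Segmentation (RVOS): static text queries are easily misled by distractors with similar appearance or motion, which hurts cross-modal alignment. The proposed Triple Query Former (TQF) factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations within a frame, and an inter-frame motion query for temporal association across frames. The queries are built dynamically from both linguistic cues and visual guidance. Two motion-aware aggregation modules, Intra-frame Interaction Aggregation and Inter-frame Motion Aggregation, strengthen spatial relations among objects within a frame and trajectory-guided consistency across frames, respectively, improving segmentation accuracy and robustness.

Link: https://arxiv.org/abs/2509.13722
Authors: Dingwei Zhang,Dong Zhang,Jinhui Tang
Affiliations: Nanjing University of Science and Technology; The Hong Kong University of Science and Technology; Nanjing Forestry University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: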

Abstract: Recently, query-based methods have achieved remarkable performance in Referring Video Object Segmentation (RVOS) by using textual static object queries to drive cross-modal alignment. However, these static queries are easily misled by distractors with similar appearance or motion, resulting in query selection bias. To address this issue, we propose Triple Query Former (TQF), which factorizes the referring query into three specialized components: an appearance query for static attributes, an intra-frame interaction query for spatial relations, and an inter-frame motion query for temporal association. Instead of relying solely on textual embeddings, our queries are dynamically constructed by integrating both linguistic cues and visual guidance. Furthermore, we introduce two motion-aware aggregation modules that enhance object token representations: Intra-frame Interaction Aggregation incorporates position-aware interactions among objects within a single frame, while Inter-frame Motion Aggregation leverages trajectory-guided alignment across frames to ensure temporal coherence. Extensive experiments on multiple RVOS benchmarks demonstrate the advantages of TQF and the effectiveness of our structured query design and motion-aware aggregation modules.

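[CV-51] UM-Depth: Uncertainty Masked Self-Supervised Monocular Depth Estimation with Visual Odometry

[Quick Read]: This paper addresses the loss of depth accuracy in self-supervised monocular depth estimation caused by uncertainty in the input data, such as low-texture regions and dynamic object boundaries. The proposed UM-Depth framework combines motion-aware and uncertainty-aware refinement and embeds uncertainty estimation into training, strengthening supervision where photometric signals are weak. It uses a teacher-student training strategy in which optical flow is used only in the teacher network, so it adds no inference-time cost and needs no auxiliary labels, while improving depth accuracy at dynamic object boundaries and in low-texture regions.

Link: https://arxiv.org/abs/2509.13713
Authors: Tae-Wook Um,Ki-Hyeon Kim,Hyun-Duck Choi,Hyo-Sung Ahn
Affiliations: Gwangju Institute of Science and Technology (GIST); Seoul National University of Science and Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: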

Abstract: Monocular depth estimation has been increasingly adopted in robotics and autonomous driving for its ability to infer scene geometry from a single camera. In self-supervised monocular depth estimation frameworks, the network jointly generates and exploits depth and pose estimates during training, thereby eliminating the need for depth labels. However, these methods remain challenged by uncertainty in the input data, such as low-texture or dynamic regions, which can cause reduced depth accuracy. To address this, we introduce UM-Depth, a framework that combines motion- and uncertainty-aware refinement to enhance depth accuracy at dynamic object boundaries and in textureless regions. Specifically, we develop a teacher-student training strategy that embeds uncertainty estimation into both the training pipeline and network architecture, thereby strengthening supervision where photometric signals are weak. Unlike prior motion-aware approaches that incur inference-time overhead and rely on additional labels or auxiliary networks for real-time generation, our method uses optical flow exclusively within the teacher network during training, which eliminates extra labeling demands and any runtime cost. Extensive experiments on the KITTI and Cityscapes datasets demonstrate the effectiveness of our uncertainty-aware refinement. Overall, UM-Depth achieves state-of-the-art results in both self-supervised depth and pose estimation on the KITTI datasets.

[CV-52] StyleProtect: Safeguarding Artistic Identity in Fine-tuned Diffusion Models

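[Quick Read]: This paper addresses the malicious use of generative diffusion models to cheaply replicate an artist's distinctive style, which undermines the creator's labor and personal vision. It proposes StyleProtect, a lightweight protection strategy whose core idea is to identify and update specific cross-attention layers that are more sensitive to artistic style. Sensitivity is quantified from the activation strengths of attention layers and their correlation with features extracted by external models. By updating only these selected layers, the method defends against style mimicry by fine-tuned diffusion models while maintaining competitive imperceptibility.

Link: https://arxiv.org/abs/2509.13711
Authors: Qiuyu Tang,Joshua Krinsky,Aparna Bharati
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: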

Abstract:The rapid advancement of generative models, particularly diffusion-based approaches, has inadvertently facilitated their potential for misuse. Such models enable malicious exploiters to replicate artistic styles that capture an artist’s creative labor, personal vision, and years of dedication in an inexpensive manner. This has led to a rise in the need and exploration of methods for protecting artworks against style mimicry. Although generic diffusion models can easily mimic an artistic style, finetuning amplifies this capability, enabling the model to internalize and reproduce the style with higher fidelity and control. We hypothesize that certain cross-attention layers exhibit heightened sensitivity to artistic styles. Sensitivity is measured through activation strengths of attention layers in response to style and content representations, and assessing their correlations with features extracted from external models. Based on our findings, we introduce an efficient and lightweight protection strategy, StyleProtect, that achieves effective style defense against fine-tuned diffusion models by updating only selected cross-attention layers. Our experiments utilize a carefully curated artwork dataset based on WikiArt, comprising representative works from 30 artists known for their distinctive and influential styles and cartoon animations from the Anita dataset. The proposed method demonstrates promising performance in safeguarding unique styles of artworks and anime from malicious diffusion customization, while maintaining competitive imperceptibility.

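[CV-53] Taylor-Series Expanded Kolmogorov-Arnold Network for Medical Imaging Classification

[Quick Read]: This paper addresses accurate and interpretable medical image classification, in particular limited generalization on small, diverse datasets in resource-constrained clinical settings. It proposes spline-based Kolmogorov-Arnold Networks (KANs) that combine B-splines with different basis functions (Taylor series, Radial Basis Functions, and Morlet wavelet transforms) to approximate both local and global nonlinearities. The models learn directly from raw data without image preprocessing and use very few parameters: SBTAYLOR-KAN needs only 2,872 trainable parameters, compared with 24.18M for ResNet50. It reaches up to 98.93% accuracy and stays above 86% when trained on only 30% of the training data, and Grad-CAM provides interpretability, making it a lightweight and transparent option for medical AI in data-scarce settings.

Link: https://arxiv.org/abs/2509.13687
Authors: Kaniz Fatema,Emad A. Mohammed,Sukhjit Singh Sehra
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: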

Abstract:Effective and interpretable classification of medical images is a challenge in computer-aided diagnosis, especially in resource-limited clinical settings. This study introduces spline-based Kolmogorov-Arnold Networks (KANs) for accurate medical image classification with limited, diverse datasets. The models include SBTAYLOR-KAN, integrating B-splines with Taylor series; SBRBF-KAN, combining B-splines with Radial Basis Functions; and SBWAVELET-KAN, embedding B-splines in Morlet wavelet transforms. These approaches leverage spline-based function approximation to capture both local and global nonlinearities. The models were evaluated on brain MRI, chest X-rays, tuberculosis X-rays, and skin lesion images without preprocessing, demonstrating the ability to learn directly from raw data. Extensive experiments, including cross-dataset validation and data reduction analysis, showed strong generalization and stability. SBTAYLOR-KAN achieved up to 98.93% accuracy, with a balanced F1-score, maintaining over 86% accuracy using only 30% of the training data across three datasets. Despite class imbalance in the skin cancer dataset, experiments on both imbalanced and balanced versions showed SBTAYLOR-KAN outperforming other models, achieving 68.22% accuracy. Unlike traditional CNNs, which require millions of parameters (e.g., ResNet50 with 24.18M), SBTAYLOR-KAN achieves comparable performance with just 2,872 trainable parameters, making it more suitable for constrained medical environments. Gradient-weighted Class Activation Mapping (Grad-CAM) was used for interpretability, highlighting relevant regions in medical images. This framework provides a lightweight, interpretable, and generalizable solution for medical image classification, addressing the challenges of limited datasets and data-scarce scenarios in clinical AI applications.

[CV-54] FishBEV: Distortion-Resilient Bird's Eye View Segmentation with Surround-View Fisheye Cameras

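[Quick Read]: This paper addresses the marked drop in Bird's Eye View (BEV) segmentation performance with fisheye cameras, caused by severe geometric distortion, ambiguous multi-view correspondences, and unstable temporal dynamics. The proposed FishBEV framework has three core components: (1) a Distortion-Resilient Multi-scale Extraction (DRME) backbone that learns distortion-robust features while preserving scale consistency; (2) an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism that uses uncertainty estimation for reliable cross-view alignment; and (3) a Distance-aware Temporal Self-Attention (D-TSA) module that adaptively balances near-field detail and far-field context to maintain temporal coherence. Together, the three modules improve the accuracy and stability of BEV segmentation in complex scenes.

Link: https://arxiv.org/abs/2509.13681
Authors: Hang Li,Dianmo Sheng,Qiankun Dong,Zichun Wang,Zhiwei Xu,Tao Li
Affiliations: Nankai University; University of Science and Technology of China; Chinese Academy of Sciences
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 4 figures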

Abstract: As a cornerstone technique for autonomous driving, Bird’s Eye View (BEV) segmentation has recently achieved remarkable progress with pinhole cameras. However, it is non-trivial to extend the existing methods to fisheye cameras with severe geometric distortion, ambiguous multi-view correspondences and unstable temporal dynamics, all of which significantly degrade BEV performance. To address these challenges, we propose FishBEV, a novel BEV segmentation framework specifically tailored for fisheye cameras. This framework introduces three complementary innovations: a Distortion-Resilient Multi-scale Extraction (DRME) backbone that learns robust features under distortion while preserving scale consistency, an Uncertainty-aware Spatial Cross-Attention (U-SCA) mechanism that leverages uncertainty estimation for reliable cross-view alignment, and a Distance-aware Temporal Self-Attention (D-TSA) module that adaptively balances near-field details and far-field context to ensure temporal coherence. Extensive experiments on the Synwoodscapes dataset demonstrate that FishBEV consistently outperforms SOTA baselines on surround-view fisheye BEV segmentation tasks.

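[CV-55] Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation

[Quick Read]: This paper addresses the computational inefficiency that visual token redundancy causes when Multimodal Large Language Models (MLLMs) are applied to Referring Image Segmentation (RIS). Existing patch-wise visual projectors struggle to balance fewer visual tokens against semantic clarity and often keep overly long token sequences to avoid performance drops. The proposed semantic visual projector uses semantic superpixels generated by the Segment Anything Model (SAM) to identify "visual words" in an image, then compresses and projects these superpixels into visual tokens, shortening the token sequence according to scene complexity while minimizing semantic loss. A semantic superpixel positional embedding and a semantic superpixel aggregator preserve fine-grained detail inside superpixels and global context outside them. Experiments show a 93% reduction in visual tokens without compromising performance, which notably speeds up MLLM training and inference.

Link: https://arxiv.org/abs/2509.13676
Authors: Xiaobo Yang,Xiaojin Gong
Affiliations: Zhejiang University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: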

Abstract:Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify “visual words” in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM’s awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
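
The idea of turning superpixels into tokens can be sketched as pooling per-pixel features by superpixel label, so the token count follows the number of regions rather than the number of patches. The sketch below assumes simple mean pooling, which the abstract does not specify; `pool_by_superpixel` and the toy data are hypothetical, and the paper's positional embedding and aggregator are not modeled.

```python
def pool_by_superpixel(features, labels):
    """Average feature vectors that share a superpixel label.

    features: list of per-pixel (or per-patch) feature vectors.
    labels:   list of superpixel ids, one per feature vector.
    Returns a dict mapping each superpixel id to its mean feature vector.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    sums = {}
    counts = {}
    for feat, lab in zip(features, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(feat)
            counts[lab] = 0
        for i, v in enumerate(feat):
            sums[lab][i] += v
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sorted(sums)}


# Three pixels, two superpixels: the token sequence shrinks from 3 to 2.
pixel_features = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
superpixel_ids = [0, 0, 1]
tokens = pool_by_superpixel(pixel_features, superpixel_ids)
```

A scene with few large regions yields few tokens and a cluttered scene yields more, which is the adaptive behavior the abstract describes.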

[CV-56] Deep Lookup Network

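[Quick Read]: This paper addresses the high computational complexity, energy consumption, and inference time of multiplication in Convolutional Neural Networks (CNNs), which hinder deployment on mobile and resource-limited edge devices. It proposes a generic and efficient lookup operation that replaces multiplication in conventional CNNs: lookup tables are constructed in a differentiable manner, and dedicated training strategies enable end-to-end optimization, so the model gains energy efficiency and inference speed while maintaining performance. Experiments on image classification, image super-resolution, and point cloud classification show that the resulting lookup networks remain competitive with vanilla convolutional networks at lower computational cost.

Link: https://arxiv.org/abs/2509.13662
Authors: Yulan Guo,Longguang Wang,Wendong Mao,Xiaoyu Dong,Yingqian Wang,Li Liu,Wei An
Affiliations: Sun Yat-sen University; Aviation University of Air Force; National University of Defense Technology; University of Tokyo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: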

Abstract: Convolutional neural networks are built from a massive number of operations of different types and are highly computationally intensive. Among these operations, multiplication has higher computational complexity and usually requires more energy and longer inference time than other operations, which hinders the deployment of convolutional neural networks on mobile devices. In many resource-limited edge devices, complicated operations can be calculated via lookup tables to reduce computational cost. Motivated by this, in this paper, we introduce a generic and efficient lookup operation which can be used as a basic operation for the construction of neural networks. Instead of calculating the multiplication of weights and activation values, simple yet efficient lookup operations are adopted to compute their responses. To enable end-to-end optimization of the lookup operation, we construct the lookup tables in a differentiable manner and propose several training strategies to promote their convergence. By replacing computationally expensive multiplication operations with our lookup operations, we develop lookup networks for the image classification, image super-resolution, and point cloud classification tasks. It is demonstrated that our lookup networks can benefit from the lookup operations to achieve higher efficiency in terms of energy consumption and inference speed while maintaining competitive performance to vanilla convolutional networks. Extensive experiments show that our lookup networks produce state-of-the-art performance on different tasks (both classification and regression tasks) and different data types (both images and point clouds).
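
The general idea of replacing weight-activation multiplication with table lookups can be sketched as follows. This is a simplified stand-in, not the paper's method: it assumes fixed quantization levels and a precomputed product table, whereas the paper constructs its tables in a differentiable manner and trains them. All names and values are hypothetical.

```python
def build_table(w_levels, a_levels):
    """Precompute the product of every (weight level, activation level) pair."""
    return [[w * a for a in a_levels] for w in w_levels]


def nearest_index(levels, x):
    """Index of the quantization level closest to x."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - x))


def lookup_dot(weights, activations, w_levels, a_levels, table):
    """Dot product in which each multiplication is replaced by a table lookup."""
    total = 0.0
    for w, a in zip(weights, activations):
        total += table[nearest_index(w_levels, w)][nearest_index(a_levels, a)]
    return total


# Assumed quantization levels (illustrative only).
W_LEVELS = [-1.0, 0.0, 1.0]
A_LEVELS = [0.0, 0.5, 1.0]
TABLE = build_table(W_LEVELS, A_LEVELS)

response = lookup_dot([1.0, -1.0, 0.0], [0.5, 1.0, 1.0], W_LEVELS, A_LEVELS, TABLE)
```

At inference time only index computation, lookups, and additions remain; the accuracy of the result depends on how well the levels cover the value ranges.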

[CV-57] Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction

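[Quick Read]: This paper addresses the inability of conventional two-view relative pose estimation to recover metric scale, and its degraded performance with wide baselines and textureless or highly reflective surfaces. It proposes GARPS, a training-free framework that independently reconstructs a metric 3D Gaussian Mixture Model (GMM) for each of the two images and refines an initial pose by optimizing a differentiable GMM alignment objective. The objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, yielding robust, metric relative pose estimates without explicit 2D correspondences.

Link: https://arxiv.org/abs/2509.13652
Authors: Yumin Li,Dylan Campbell
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures, accepted by AJCAI 2025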

Abstract:Estimating metric relative camera pose from a pair of images is of great importance for 3D reconstruction and localisation. However, conventional two-view pose estimation methods are not metric, with camera translation known only up to a scale, and struggle with wide baselines and textureless or reflective surfaces. This paper introduces GARPS, a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model (GMM) for each image. It then refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. This objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, and is robust to occlusions and texture-poor regions without requiring explicit 2D correspondences. Extensive experiments on the Real-Estate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R. These results highlight the potential of bridging single-view perception with multi-view geometry to achieve robust and metric relative pose estimation.

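[CV-58] LLM-I: LLMs are Naturally Interleaved Multimodal Creators

[Quick Read]: This paper addresses the "one-tool" bottleneck of current unified vision-language models in image-text generation: they are limited to synthetic image generation and struggle with tasks that require factual grounding or programmatic precision. The proposed LLM-Interleaved (LLM-I) framework reframes interleaved image-text generation as a tool-use problem. A central Large Language Model (LLM) or Multimodal LLM (MLLM) agent orchestrates a diverse set of specialized visual tools (online image search, diffusion-based generation, code execution, and image editing) and is trained with a Reinforcement Learning (RL) framework whose hybrid reward combines rule-based logic with judgments from LLM and MLLM evaluators, so that it learns to select and apply tools well.

Link: https://arxiv.org/abs/2509.13642
Authors: Zirun Guo,Feng Zhang,Kai Jia,Tao Jin
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: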

Abstract:We propose LLM-Interleaved (LLM-I), a flexible and dynamic framework that reframes interleaved image-text generation as a tool-use problem. LLM-I is designed to overcome the “one-tool” bottleneck of current unified models, which are limited to synthetic imagery and struggle with tasks requiring factual grounding or programmatic precision. Our framework empowers a central LLM or MLLM agent to intelligently orchestrate a diverse toolkit of specialized visual tools, including online image search, diffusion-based generation, code execution, and image editing. The agent is trained to select and apply these tools proficiently via a Reinforcement Learning (RL) framework that features a hybrid reward system combining rule-based logic with judgments from LLM and MLLM evaluators. Trained on a diverse new dataset using four different model backbones, LLM-I demonstrates state-of-the-art performance, outperforming existing methods by a large margin across four benchmarks. We also introduce a novel test-time scaling strategy that provides further performance gains. Project Page: this https URL.

[CV-59] Federated Learning for Deforestation Detection: A Distributed Approach with Satellite Imagery

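[Quick Read]: This paper addresses accurate identification and localization of deforestation in satellite imagery while protecting the data privacy and security of each client (an edge satellite center). Conventional centralized training requires pooling all data, which risks data leakage. The proposed solution uses Federated Learning (FL), which lets multiple clients train a model collaboratively without sharing raw data. It combines the FLOWER framework with the RAY framework for distributed execution, with RAY spawning a chosen number of clients to build an emulation environment, and uses YOLOS-small and Faster R-CNN (with ResNet50 or MobileNetV3 backbones) to detect deforested areas in satellite images.

Link: https://arxiv.org/abs/2509.13631
Authors: Yuvraj Dutta,Aaditya Sikder,Basabdatta Palit
Affiliations: NIT Rourkela, India
Subjects: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 6 pages, 7 figures, accepted at IEEE INDISCON 2025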

Abstract: Accurate identification of deforestation from satellite images is essential in order to understand the geographical situation of an area. This paper introduces a new distributed approach to identify as well as locate deforestation across different clients using Federated Learning (FL). Federated Learning enables distributed network clients to collaboratively train a model while maintaining data privacy and security of the active users. In our framework, a client corresponds to an edge satellite center responsible for local data processing. Moreover, FL has an advantage over centralized training, which requires combining data and thereby compromises the data security of the clients. Our framework leverages the FLOWER framework together with the RAY framework to execute the distributed learning workload. Furthermore, efficient client spawning is ensured by RAY, as it can select a definite number of users to create an emulation environment. Our FL framework uses YOLOS-small (a Vision Transformer variant), Faster R-CNN with a ResNet50 backbone, and Faster R-CNN with a MobileNetV3 backbone, trained and tested on publicly available datasets. Our approach offers a different view of image segmentation-based tasks on satellite imagery.
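
How a server might combine client models without seeing raw data can be sketched with a weighted parameter average. This assumes a FedAvg-style aggregation rule, which the abstract does not state; `fed_avg` and the toy numbers are hypothetical and do not use the FLOWER or RAY APIs.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameter vectors, weighted by each client's sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


# Two edge satellite centers with flattened model parameters and local sample counts.
local_params = [[1.0, 2.0], [3.0, 4.0]]
local_sizes = [1, 3]
global_params = fed_avg(local_params, local_sizes)
```

Only parameters and sample counts leave each client in this scheme; the images themselves stay local, which is the privacy property the abstract emphasizes.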

[CV-60] SAMIR, an efficient registration framework via robust feature learning from SAM

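[Quick Read]: This paper addresses inaccurate feature extraction in medical image registration when annotated data are scarce. Existing weakly supervised methods rely on priors such as segmentation masks or landmarks, but such labels are often hard to obtain in practice, which limits their usefulness. The proposed solution exploits the representation learning ability of the visual foundation model Segment Anything Model (SAM) through a task-specific adaptation pipeline: SAM's image encoder extracts structure-aware feature embeddings that model anatomical consistency and deformation patterns more accurately. A lightweight 3D head and a Hierarchical Feature Consistency Loss provide coarse-to-fine feature matching and adaptation to local deformations, improving inter-subject abdomen CT registration and intra-subject cardiac image registration.

Link: https://arxiv.org/abs/2509.13629
Authors: Yue He,Min Liu,Qinghao Liu,Jiazheng Wang,Yaonan Wang,Hang Zhang,Xiang Chen
Affiliations: Hunan University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: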

Abstract:Image registration is a fundamental task in medical image analysis. Deformations are often closely related to the morphological characteristics of tissues, making accurate feature extraction crucial. Recent weakly supervised methods improve registration by incorporating anatomical priors such as segmentation masks or landmarks, either as inputs or in the loss function. However, such weak labels are often not readily available, limiting their practical use. Motivated by the strong representation learning ability of visual foundation models, this paper introduces SAMIR, an efficient medical image registration framework that utilizes the Segment Anything Model (SAM) to enhance feature extraction. SAM is pretrained on large-scale natural image datasets and can learn robust, general-purpose visual representations. Rather than using raw input images, we design a task-specific adaptation pipeline using SAM’s image encoder to extract structure-aware feature embeddings, enabling more accurate modeling of anatomical consistency and deformation patterns. We further design a lightweight 3D head to refine features within the embedding space, adapting to local deformations in medical images. Additionally, we introduce a Hierarchical Feature Consistency Loss to guide coarse-to-fine feature matching and improve anatomical alignment. Extensive experiments demonstrate that SAMIR significantly outperforms state-of-the-art methods on benchmark datasets for both intra-subject cardiac image registration and inter-subject abdomen CT image registration, achieving performance improvements of 2.68% on ACDC and 6.44% on the abdomen dataset. The source code will be publicly available on GitHub following the acceptance of this paper.

[CV-61] A Generalization of CLAP from 3D Localization to Image Processing: A Connection With RANSAC & Hough Transforms

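[Quick Read]: This paper addresses accurate localization and image stitching in the presence of noise and outliers, in particular how to improve robustness in robot vision systems. It extends the authors' earlier 2D localization algorithm CLAP (Clustering to Localize Across n Possibilities) into a general framework. CLAP uses clustering as an alternative to traditional outlier rejection schemes such as RANSAC, which validate candidates by reprojection error, and thereby suppresses the effect of erroneous matches and tolerates uncertainty and noise. The generalized method applies to 3D localization and image stitching, and the paper also shows how CLAP, RANSAC, and Hough transforms are related.

Link: https://arxiv.org/abs/2509.13605
Authors: Ruochen Hou,Gabriel I. Fernandez,Alex Xu,Dennis W. Hong
Affiliations: University of California, Los Angeles
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: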

Abstract:In previous work, we introduced a 2D localization algorithm called CLAP, Clustering to Localize Across n Possibilities, which was used during our championship win in RoboCup 2024, an international autonomous humanoid soccer competition. CLAP is particularly recognized for its robustness against outliers, where clustering is employed to suppress noise and mitigate against erroneous feature matches. This clustering-based strategy provides an alternative to traditional outlier rejection schemes such as RANSAC, in which candidates are validated by reprojection error across all data points. In this paper, CLAP is extended to a more general framework beyond 2D localization, specifically to 3D localization and image stitching. We also show how CLAP, RANSAC, and Hough transforms are related. The generalization of CLAP is widely applicable to many different fields and can be a useful tool to deal with noise and uncertainty.
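
The contrast between consensus scoring and clustering or voting can be illustrated on a toy problem: recovering a single 1D offset from candidate estimates that include outliers. Everything in this sketch is an assumption for illustration. `consensus_estimate` scores every candidate exhaustively instead of sampling at random, so it is only RANSAC-like, and `voting_estimate` is a Hough-style histogram vote, not CLAP's actual clustering procedure.

```python
from collections import Counter


def consensus_estimate(offsets, tol):
    """RANSAC-like: score each candidate by how many data points lie within tol of it."""
    best = max(offsets, key=lambda c: sum(abs(o - c) <= tol for o in offsets))
    inliers = [o for o in offsets if abs(o - best) <= tol]
    return sum(inliers) / len(inliers)


def voting_estimate(offsets, bin_width):
    """Hough-like: quantize candidates into bins and average the most populated bin."""
    bins = Counter(round(o / bin_width) for o in offsets)
    top_bin, _ = bins.most_common(1)[0]
    members = [o for o in offsets if round(o / bin_width) == top_bin]
    return sum(members) / len(members)


# Three consistent candidates near 2.0 and two gross outliers.
candidates = [2.0, 2.1, 1.9, 9.0, -5.0]
by_consensus = consensus_estimate(candidates, 0.25)
by_voting = voting_estimate(candidates, 0.5)
```

Both estimators ignore the outliers here. Consensus scoring compares each candidate against all data points, while voting only groups the candidates themselves, which is the distinction the abstract points to.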

[CV-62] Object Pose Estimation through Dexterous Touch

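[Quick Read]: This paper addresses robust object pose estimation for robots when visual data are limited (for example under lighting changes, occlusion, or appearance variation). Traditional methods depend on visual information and struggle in such conditions, while the local contact information from tactile sensors alone is insufficient to reconstruct the full pose. The proposed solution uses sensorimotor exploration: Reinforcement Learning (RL) drives a robot hand to actively interact with the object surface and collect tactile 3D point clouds, which are used to iteratively refine estimates of the object's shape and pose. Experiments show that the method identifies critical pose features without prior knowledge of the object's geometry.

Link: https://arxiv.org/abs/2509.13591
Authors: Amir-Hossein Shahidzadeh,Jiyue Zhu,Kezhou Chen,Sha Yi,Cornelia Fermüller,Yiannis Aloimonos,Xiaolong Wang
Affiliations: University of Twente; Wright State University
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: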

点击查看摘要

Abstract:Robust object pose estimation is essential for manipulation and interaction tasks in robotics, particularly in scenarios where visual data is limited or sensitive to lighting, occlusions, and appearances. Tactile sensors often offer limited and local contact information, making it challenging to reconstruct the pose from partial data. Our approach uses sensorimotor exploration to actively control a robot hand to interact with the object. We train with Reinforcement Learning (RL) to explore and collect tactile data. The collected 3D point clouds are used to iteratively refine the object’s shape and pose. In our setup, one hand holds the object steady while the other performs active exploration. We show that our method can actively explore an object’s surface to identify critical pose features without prior knowledge of the object’s geometry. Supplementary material and more demonstrations will be provided at this https URL .

[CV-63] Intelligent Healthcare Imaging Platform: An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

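Quick read: This paper targets the low degree of automation, limited multimodal image analysis capability, and inefficient clinical workflows in medical imaging diagnosis. The key to the solution is an intelligent multimodal medical image analysis framework built on Vision-Language Models (VLMs). It uses Google Gemini 2.5 Flash for automated tumor detection and structured clinical report generation across imaging modalities, combines visual feature extraction with natural language processing for context-aware image interpretation, and adds coordinate verification and probabilistic Gaussian modeling of the anomaly distribution. Multi-layered visualization improves clinical interpretability, and precise prompt engineering extracts structured information. The framework shows zero-shot capability without large annotated datasets, which the authors present as improving radiology workflow efficiency and automated diagnostic support.

Link: https://arxiv.org/abs/2509.13590
Authors: Samer Al-Hamadani
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 32 pages, 14 figures, 6 tables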

Abstract:The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

[CV-64] Dynamic Aware: Adaptive Multi-Mode Out-of-Distribution Detection for Trajectory Prediction in Autonomous Vehicles

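Quick read: This paper addresses out-of-distribution (OOD) detection for trajectory prediction models in autonomous vehicles (AVs), where deployment conditions differ from the training distribution, with a focus on trajectory-level OOD detection. Prior work concentrates on computer vision tasks such as object detection and segmentation, and trajectory-level OOD detection still lacks systematic methods. The key to the solution is a framework with adaptive mechanisms that explicitly models the mode-dependent distributions of prediction errors and their dataset-specific evolution over time, enabling robust OOD detection in complex driving environments. Experiments show substantial improvements in detection delay and false alarm rate over existing methods based on uncertainty quantification (UQ) and on vision features, with gains in both accuracy and computational efficiency.

Link: https://arxiv.org/abs/2509.13577
Authors: Tongfei Guo, Lili Su
Affiliations: Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 8 pages, 7 figures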

Abstract:Trajectory prediction is central to the safe and seamless operation of autonomous vehicles (AVs). In deployment, however, prediction models inevitably face distribution shifts between training data and real-world conditions, where rare or underrepresented traffic scenarios induce out-of-distribution (OOD) cases. While most prior OOD detection research in AVs has concentrated on computer vision tasks such as object detection and segmentation, trajectory-level OOD detection remains largely underexplored. A recent study formulated this problem as a quickest change detection (QCD) task, providing formal guarantees on the trade-off between detection delay and false alarms [1]. Building on this foundation, we propose a new framework that introduces adaptive mechanisms to achieve robust detection in complex driving environments. Empirical analysis across multiple real-world datasets reveals that prediction errors–even on in-distribution samples–exhibit mode-dependent distributions that evolve over time with dataset-specific dynamics. By explicitly modeling these error modes, our method achieves substantial improvements in both detection delay and false alarm rates. Comprehensive experiments on established trajectory prediction benchmarks show that our framework significantly outperforms prior UQ- and vision-based OOD approaches in both accuracy and computational efficiency, offering a practical path toward reliable, driving-aware autonomy.
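
The abstract frames trajectory-level OOD detection as quickest change detection (QCD). A minimal sketch of the classical CUSUM procedure, one standard QCD test, is shown below applied to a stream of prediction errors. It is not the paper's adaptive, mode-aware method: the Gaussian error model, the parameter names, and the numbers in the example are assumptions chosen for illustration.

```python
def cusum_alarm(errors, mean_in, mean_out, sigma, threshold):
    """Return the first index where the CUSUM statistic reaches `threshold`, else None.

    Assumes Gaussian prediction errors with mean `mean_in` before the change,
    mean `mean_out` after it, and a shared standard deviation `sigma`.
    """
    stat = 0.0
    for t, err in enumerate(errors):
        # Log-likelihood ratio of the post-change vs. pre-change Gaussian.
        llr = (mean_out - mean_in) * (err - 0.5 * (mean_in + mean_out)) / sigma**2
        # Accumulate evidence, resetting to zero whenever it turns negative.
        stat = max(0.0, stat + llr)
        if stat >= threshold:
            return t
    return None

# 20 in-distribution errors followed by a shift to larger errors at index 20.
errors = [0.5] * 20 + [1.5] * 10
alarm = cusum_alarm(errors, mean_in=0.5, mean_out=1.5, sigma=0.5, threshold=5.0)
no_alarm = cusum_alarm([0.5] * 50, mean_in=0.5, mean_out=1.5, sigma=0.5, threshold=5.0)
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the abstract refers to. A mode-dependent variant could keep separate `mean_in` and `sigma` values per error mode, but that extension is speculation here.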

[CV-65] Semantic 3D Reconstructions with SLAM for Central Airway Obstruction

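Quick read: This paper addresses the lack of accurate, real-time, semantically informed 3D reconstruction for the diagnosis and treatment of central airway obstruction (CAO). Traditional treatments such as bronchoscopy and electrocautery can remove the tumor but carry a high risk of complications, and existing automated or robot-assisted approaches fall short on real-time operation and on annotating clinically relevant regions. The key to the solution is a new end-to-end pipeline that combines monocular endoscopic video with DROID-SLAM (SLAM: Simultaneous Localization and Mapping) and a segmentation model trained to identify obstructive tissue. The design provides two core functions: the SLAM module reconstructs 3D geometry in real time, and the segmentation masks annotate the point cloud to mark obstruction regions. Experiments on ex vivo models report a Chamfer distance of 0.62 mm against CT scans and faster reconstruction than previous work. The authors describe it as the first integration of semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios, offering a modular basis for future autonomous robotic interventions.

Link: https://arxiv.org/abs/2509.13541
Authors: Ayberk Acar, Fangjie Li, Hao Li, Lidia Al-Zogbi, Kanyifeechukwu Jane Oguine, Susheela Sharma Stern, Jesse F. d’Almeida, Robert J. Webster III, Ipek Oguz, Jie Ying Wu
Affiliations: Vanderbilt University
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages, 2 figures, 1 table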

Abstract:Central airway obstruction (CAO) is a life-threatening condition with increasing incidence, caused by tumors in and outside of the airway. Traditional treatment methods such as bronchoscopy and electrocautery can be used to remove the tumor completely; however, these methods carry a high risk of complications. Recent advances allow robotic interventions with lower risk. The combination of robot interventions with scene understanding and mapping also opens up the possibilities for automation. We present a novel pipeline that enables real-time, semantically informed 3D reconstructions of the central airway using monocular endoscopic video. Our approach combines DROID-SLAM with a segmentation model trained to identify obstructive tissues. The SLAM module reconstructs the 3D geometry of the airway in real time, while the segmentation masks guide the annotation of obstruction regions within the reconstructed point cloud. To validate our pipeline, we evaluate the reconstruction quality using ex vivo models. Qualitative and quantitative results show high similarity between ground truth CT scans and the 3D reconstructions (0.62 mm Chamfer distance). By integrating segmentation directly into the SLAM workflow, our system produces annotated 3D maps that highlight clinically relevant regions in real time. High-speed capabilities of the pipeline allow quicker reconstructions compared to previous work, reflecting the surgical scene more accurately. To the best of our knowledge, this is the first work to integrate semantic segmentation with real-time monocular SLAM for endoscopic CAO scenarios. Our framework is modular and can generalize to other anatomies or procedures with minimal changes, offering a promising step toward autonomous robotic interventions.
Cite as: arXiv:2509.13541 [cs.RO] (v1 submitted Tue, 16 Sep 2025 21:14:16 UTC); DOI: https://doi.org/10.48550/arXiv.2509.13541
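
The reconstruction quality above is reported as a Chamfer distance. A minimal sketch of one common definition follows, assuming the symmetric form that averages unsquared nearest-neighbour distances in both directions. The abstract does not say which variant the authors use, and the two tiny point sets are made up for the example.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise Euclidean distances, shape (N, M).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Mean nearest-neighbour distance in each direction, then averaged.
    return 0.5 * (dist.min(axis=1).mean() + dist.min(axis=0).mean())

# Hypothetical reconstructed points and reference (CT) points, in millimetres.
recon = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ct = np.array([[0.0, 0.0, 0.5], [1.0, 0.0, 0.0]])
cd = chamfer_distance(recon, ct)
```

The dense (N, M) distance matrix is fine for small examples; real point clouds would normally use a spatial index such as a k-d tree instead.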

[CV-66] MemGS: Memory-Efficient Gaussian Splatting for Real-Time SLAM

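Quick read: This paper addresses the difficulty of balancing GPU memory usage and reconstruction quality when 3D Gaussian Splatting (3DGS) runs on embedded platforms such as micro air vehicles (MAVs). The solution has two key parts. First, redundant 3D Gaussian primitives produced during SLAM are merged in voxel space based on geometric similarity, which reduces GPU memory usage without affecting runtime performance. Second, 3D Gaussian primitives are initialized via Patch-Grid (PG) point sampling, which models the scene more accurately and improves rendering quality.

Link: https://arxiv.org/abs/2509.13536
Authors: Yinlong Bai, Hongxin Zhang, Sheng Zhong, Junkai Niu, Hai Li, Yijia He, Yi Zhou
Affiliations: Neuromorphic Automation and Intelligence Lab (NAIL) at School of Robotics, Hunan University; TCL RayNeo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: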

Abstract:Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.
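
To make the voxel-space merging step concrete, here is a simplified sketch that merges primitives falling into the same voxel by averaging their centers. It assumes voxel co-location is the only similarity criterion and ignores covariance, opacity, and color, so it is a stand-in for the paper's geometric-similarity test rather than a reproduction of it. The function name and `voxel_size` value are hypothetical.

```python
import numpy as np

def merge_in_voxels(centers, voxel_size):
    """Merge primitives that share a voxel by averaging their centers."""
    # Integer voxel index of every primitive.
    keys = np.floor(centers / voxel_size).astype(np.int64)
    # Group primitives by voxel; `inverse` maps each primitive to its group.
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    # Sum the centers of each group, then divide by the group size.
    merged = np.zeros((counts.size, centers.shape[1]))
    np.add.at(merged, inverse, centers)
    return merged / counts[:, None]

centers = np.array([[0.1, 0.1, 0.1], [0.3, 0.1, 0.1], [1.2, 0.0, 0.0]])
merged = merge_in_voxels(centers, voxel_size=1.0)
```

Three primitives become two here, which is where the memory saving comes from: the number of stored primitives is bounded by the number of occupied voxels.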

[CV-67] ColonCrafter: A Depth Estimation Model for Colonoscopy Videos Using Diffusion Priors

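Quick read: This paper addresses 3D scene understanding in colonoscopy, specifically the temporal inconsistency of depth estimates across monocular video sequences, which limits their use for 3D reconstruction. The key to the solution is ColonCrafter, a diffusion-based depth estimation model that learns robust geometric priors from synthetic colonoscopy sequences to produce temporally consistent depth maps. A style transfer technique adapts real clinical videos to the synthetic training domain while preserving geometric structure, yielding state-of-the-art zero-shot performance on the C3VD dataset.

Link: https://arxiv.org/abs/2509.13525
Authors: Romain Hardy, Tyler Berzin, Pranav Rajpurkar
Affiliations: Harvard Medical School; Beth Israel Deaconess Medical Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 8 figures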

Abstract:Three-dimensional (3D) scene understanding in colonoscopy presents significant challenges that necessitate automated methods for accurate depth estimation. However, existing depth estimation models for endoscopy struggle with temporal consistency across video sequences, limiting their applicability for 3D reconstruction. We present ColonCrafter, a diffusion-based depth estimation model that generates temporally consistent depth maps from monocular colonoscopy videos. Our approach learns robust geometric priors from synthetic colonoscopy sequences to generate temporally consistent depth maps. We also introduce a style transfer technique that preserves geometric structure while adapting real clinical videos to match our synthetic training domain. ColonCrafter achieves state-of-the-art zero-shot performance on the C3VD dataset, outperforming both general-purpose and endoscopy-specific approaches. Although full trajectory 3D reconstruction remains a challenge, we demonstrate clinically relevant applications of ColonCrafter, including 3D point cloud generation and surface coverage assessment.

[CV-68] Multimodal Hate Detection Using Dual-Stream Graph Neural Networks

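Quick read: This paper addresses two problems in hateful video detection. First, existing multimodal classifiers usually treat all video content uniformly and fail to emphasize the small but decisive hateful components. Second, existing methods cannot systematically capture structured information within and across modalities, which limits multimodal fusion. The key to the solution is a multimodal dual-stream graph neural network: the video is split into instances to build an instance graph and extract instance-level features, and a complementary weight graph assigns importance weights to these features to focus on hateful instances. The graph-based framework models structural relationships within and across modalities, which supports more accurate classification and explainability.

Link: https://arxiv.org/abs/2509.13515
Authors: Jiangbei Yue, Shuonan Yang, Tailin Chen, Jianbo Jiao, Zeyu Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: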

Abstract:Hateful videos present serious risks to online safety and real-world well-being, necessitating effective detection methods. Although multimodal classification approaches integrating information from several modalities outperform unimodal ones, they typically neglect that even minimal hateful content defines a video’s category. Specifically, they generally treat all content uniformly, instead of emphasizing the hateful components. Additionally, existing multimodal methods cannot systematically capture structured information in videos, limiting the effectiveness of multimodal fusion. To address these limitations, we propose a novel multimodal dual-stream graph neural network model. It constructs an instance graph by separating the given video into several instances to extract instance-level features. Then, a complementary weight graph assigns importance weights to these features, highlighting hateful instances. Importance weights and instance features are combined to generate video labels. Our model employs a graph-based framework to systematically model structured relationships within and across modalities. Extensive experiments on public datasets show that our model is state-of-the-art in hateful video classification and has strong explainability. Code is available: this https URL.
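
The step where importance weights and instance features are combined into a video label can be sketched as weighted pooling. The graph networks that would produce the scores are omitted, and the softmax normalization, the function name, and the numbers are assumptions made for this example rather than details taken from the paper.

```python
import numpy as np

def pool_instances(instance_logits, importance_scores):
    """Combine per-instance logits into video-level logits using softmax weights."""
    scores = np.asarray(importance_scores, dtype=float)
    # Numerically stable softmax over the instances.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.asarray(instance_logits, dtype=float), weights

# Columns: [hateful, non-hateful]. Only the first instance is hateful.
instance_logits = np.array([[2.0, -2.0], [-1.5, 1.5], [-1.5, 1.5]])
video_logits, weights = pool_instances(instance_logits, [3.0, 0.0, 0.0])
uniform_logits = instance_logits.mean(axis=0)
```

With uniform averaging the two benign instances outvote the hateful one, while the weighted pooling lets the single high-importance instance decide the label, which is the behaviour the abstract argues for.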

[CV-69] FunKAN: Functional Kolmogorov-Arnold Network for Medical Image Enhancement and Segmentation AAAI AAAI-26

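Quick read: This paper addresses performance bottlenecks in medical image enhancement and segmentation caused by artifacts and complex anatomy, together with the limited interpretability of traditional deep learning methods. The key to the solution is the Functional Kolmogorov-Arnold Network (FunKAN), which formally generalizes the Kolmogorov-Arnold representation theorem to functional spaces and learns the inner functions through a Fourier decomposition over the Hermite basis functions, preserving the spatial structure of images while remaining interpretable. Building on FunKAN, the authors propose the U-FunKAN architecture, which outperforms other KAN-based backbones on several medical imaging datasets in enhancement (PSNR, TV) and segmentation (IoU, F1).

Link: https://arxiv.org/abs/2509.13508
Authors: Maksim Penkin, Andrey Krylov
Affiliations: Lomonosov Moscow State University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures, submitted to the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26)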

Abstract:Medical image enhancement and segmentation are critical yet challenging tasks in modern clinical practice, constrained by artifacts and complex anatomical variations. Traditional deep learning approaches often rely on complex architectures with limited interpretability. While Kolmogorov-Arnold networks offer interpretable solutions, their reliance on flattened feature representations fundamentally disrupts the intrinsic spatial structure of imaging data. To address this issue we propose a Functional Kolmogorov-Arnold Network (FunKAN) – a novel interpretable neural framework, designed specifically for image processing, that formally generalizes the Kolmogorov-Arnold representation theorem onto functional spaces and learns inner functions using Fourier decomposition over the basis Hermite functions. We explore FunKAN on several medical image processing tasks, including Gibbs ringing suppression in magnetic resonance images, benchmarking on IXI dataset. We also propose U-FunKAN as state-of-the-art binary medical segmentation model with benchmarks on three medical datasets: BUSI (ultrasound images), GlaS (histological structures) and CVC-ClinicDB (colonoscopy videos), detecting breast cancer, glands and polyps, respectively. Experiments on those diverse datasets demonstrate that our approach outperforms other KAN-based backbones in both medical image enhancement (PSNR, TV) and segmentation (IoU, F1). Our work bridges the gap between theoretical function approximation and medical image analysis, offering a robust, interpretable solution for clinical applications.

[CV-70] Adversarial Appearance Learning in Augmented Cityscapes for Pedestrian Recognition in Autonomous Driving

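Quick read: This paper addresses the domain gap between synthetic and real data in autonomous driving, in particular how augmented data can improve recognition of vulnerable road users (VRUs) such as pedestrians. The key to the solution is an augmentation pipeline for the Cityscapes dataset that inserts virtual pedestrians to create custom traffic scenarios, together with a novel generative network architecture that adversarially learns the dataset's lighting conditions to make the augmented data more realistic. The approach is evaluated on semantic and instance segmentation.

Link: https://arxiv.org/abs/2509.13507
Authors: Artem Savkin, Thomas Lapotre, Kevin Strauss, Uzair Akbar, Federico Tombari
Affiliations: TU Munich; BMW AG; Google
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: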

Abstract:In autonomous driving, synthetic data is crucial for covering specific traffic scenarios that an autonomous vehicle must handle. Such data commonly introduces a domain gap between the synthetic and real domains. In this paper we deploy data augmentation to generate custom traffic scenarios with VRUs in order to improve pedestrian recognition. We provide a pipeline for augmenting the Cityscapes dataset with virtual pedestrians. To improve the realism of the augmentation, we present a novel generative network architecture for adversarial learning of the dataset's lighting conditions. We also evaluate our approach on the tasks of semantic and instance segmentation.

[CV-71] DEFT-VTON: Efficient Virtual Try-On with Consistent Generalised H-Transform CVPR

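Quick read: This paper addresses the limited training and inference budgets that virtual try-on (VTO) methods face in practice, especially how to fine-tune and deploy large pre-trained diffusion models efficiently and cheaply. The key to the solution is Doob's h-transform efficient fine-tuning (DEFT): the pre-trained model's parameters are frozen and only a small h-transform network is trained to learn a conditional h-transform, adapting an unconditional diffusion model into an image-conditioned VTO model. The h-transform network amounts to only 1.42% of the frozen parameters, compared with a 5.52% baseline for traditional parameter-efficient fine-tuning (PEFT). The authors also introduce an adaptive consistency loss that combines a consistency term with the denoising score matching loss in a data-adaptive manner, reducing inference time while retaining performance, so that the model reaches state-of-the-art results with as few as 15 denoising steps.

Link: https://arxiv.org/abs/2509.13506
Authors: Xingzi Xu, Qi Li, Shuwen Qiu, Julien Han, Karim Bouyarmane
Affiliations: Amazon; Duke University; University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Published in 2025 CVPR Workshop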

Abstract:Diffusion models enable high-quality virtual try-on (VTO) with their established image synthesis abilities. Despite the extensive end-to-end training of large pre-trained models involved in current VTO methods, real-world applications often prioritize limited training and inference, serving, and deployment budgets for VTO. To solve this obstacle, we apply Doob’s h-transform efficient fine-tuning (DEFT) for adapting large pre-trained unconditional models for downstream image-conditioned VTO abilities. DEFT freezes the pre-trained model’s parameters and trains a small h-transform network to learn a conditional h-transform. The h-transform network allows training only 1.42 percent of the frozen parameters, compared to a baseline of 5.52 percent in traditional parameter-efficient fine-tuning (PEFT). To further improve DEFT’s performance and decrease existing models’ inference time, we additionally propose an adaptive consistency loss. Consistency training distills slow but high-performing diffusion models into a fast one while retaining performance by enforcing consistencies along the inference path. Inspired by constrained optimization, instead of distillation, we combine the consistency loss and the denoising score matching loss in a data-adaptive manner for fine-tuning existing VTO models at a low cost. Empirical results show the proposed DEFT-VTON method achieves state-of-the-art performance on VTO tasks, with as few as 15 denoising steps, while maintaining competitive results.

[CV-72] LivePyxel: Accelerating image annotations with a Python-integrated webcam live streaming

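Quick read: This paper addresses how the lack of flexible annotation tools hinders AI model deployment in scientific fields. In laboratory settings where data is acquired in real time from instruments such as microscopes, existing image annotation software requires a pre-uploaded dataset and cannot support on-demand annotation. The key to the solution is LivePyxel, a Python graphical user interface (GUI) tool that integrates directly with imaging systems (such as webcams and microscopes) to annotate live images. Its main features are Bézier splines and binary masks for precise region delimitation and non-destructive layers for high-performance editing. It relies on OpenCV and NumPy for matrix operations, giving broad compatibility across video devices and efficient processing for object detection tasks.

Link: https://arxiv.org/abs/2509.13504
Authors: Uriel Garcilazo-Cruz, Joseph O. Okeme, Rodrigo A. Vargas–Hernández
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 10 figures, SM, 5 pages, 4 figures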

Abstract:The lack of flexible annotation tools has hindered the deployment of AI models in some scientific areas. Most existing image annotation software requires users to upload a precollected dataset, which limits support for on-demand pipelines and introduces unnecessary steps to acquire images. This constraint is particularly problematic in laboratory environments, where real-time data acquisition from instruments such as microscopes is increasingly common. In this work, we introduce LivePyxel, a Python-based graphical user interface that integrates with imaging systems, such as webcams, microscopes, and others, to enable real-time image annotation. LivePyxel is designed to be easy to use through a simple interface that allows users to precisely delimit areas for annotation using tools commonly found in commercial graphics editing software. Of particular interest is the availability of Bézier splines and binary masks, and the software’s capacity to work with non-destructive layers that enable high-performance editing. LivePyxel is also compatible with a wide range of video devices, and it is optimized for object detection operations through OpenCV combined with NumPy, a high-performance library for matrix and linear algebra operations. LivePyxel facilitates seamless data collection and labeling, accelerating the development of AI models in experimental workflows. LivePyxel is freely available at this https URL
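
For readers unfamiliar with the Bézier splines mentioned above, the sketch below samples one cubic Bézier segment with NumPy, turning four control points into a polyline that could outline an annotated region. It is a generic illustration, not LivePyxel's code: the function name `cubic_bezier` and the control points are made up for the example.

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample n points along the cubic Bezier curve defined by four control points."""
    t = np.linspace(0.0, 1.0, n)[:, None]  # column vector so it broadcasts over x, y
    return (
        (1 - t) ** 3 * p0
        + 3 * (1 - t) ** 2 * t * p1
        + 3 * (1 - t) * t**2 * p2
        + t**3 * p3
    )

# Hypothetical control points in pixel coordinates.
p0, p1 = np.array([10.0, 10.0]), np.array([40.0, 0.0])
p2, p3 = np.array([60.0, 50.0]), np.array([90.0, 40.0])
curve = cubic_bezier(p0, p1, p2, p3)
```

The curve always starts at `p0` and ends at `p3`; the two inner control points only bend it, which is what makes the tool convenient for tracing smooth boundaries by hand.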

[CV-73] BiasMap: Leveraging Cross-Attentions to Discover and Mitigate Hidden Social Biases in Text-to-Image Generation

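Quick read: This paper addresses latent concept-level representational bias in black-box generative models, in particular text-to-image (TTI) models. Existing methods focus on output-level demographic distributions, which does not guarantee concept disentanglement. The key to the solution is BiasMap, a framework that uses cross-attention attribution maps to reveal spatial entanglement between demographic attributes (such as gender and race) and semantic concepts (such as professions), and quantifies that entanglement with Intersection over Union (IoU), exposing representational bias that conventional fairness checks miss. Building on BiasMap, an energy-guided diffusion sampling strategy modifies the latent noise space directly to minimize the expected SoftIoU, explicitly mitigating concept coupling while complementing distributional bias mitigation.

Link: https://arxiv.org/abs/2509.13496
Authors: Rajatsubhra Chakraborty, Xujun Che, Depeng Xu, Cori Faklaris, Xi Niu, Shuhan Yuan
Affiliations: University of North Carolina at Charlotte; Utah State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: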

Abstract:Bias discovery is critical for black-box generative models, especially text-to-image (TTI) models. Existing works predominantly focus on output-level demographic distributions, which do not necessarily guarantee concept representations to be disentangled post-mitigation. We propose BiasMap, a model-agnostic framework for uncovering latent concept-level representational biases in stable diffusion models. BiasMap leverages cross-attention attribution maps to reveal structural entanglements between demographics (e.g., gender, race) and semantics (e.g., professions), going deeper into representational bias during the image generation. Using attribution maps of these concepts, we quantify the spatial demographics-semantics concept entanglement via Intersection over Union (IoU), offering a lens into bias that remains hidden in existing fairness discovery approaches. In addition, we further utilize BiasMap for bias mitigation through energy-guided diffusion sampling that directly modifies latent noise space and minimizes the expected SoftIoU during the denoising process. Our findings show that existing fairness interventions may reduce the output distributional gap but often fail to disentangle concept-level coupling, whereas our mitigation method can mitigate concept entanglement in image generation while complementing distributional bias mitigation.
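
The entanglement measure above can be illustrated with two small attribution maps. The sketch computes a thresholded IoU and a differentiable relaxation built from element-wise min and max. The abstract does not define SoftIoU, so the min/max form, the threshold, and the toy maps are assumptions and not the paper's exact formulation.

```python
import numpy as np

def hard_iou(map_a, map_b, thresh=0.5):
    """IoU of the regions where each attribution map exceeds `thresh`."""
    a, b = map_a >= thresh, map_b >= thresh
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def soft_iou(map_a, map_b, eps=1e-8):
    """Threshold-free relaxation: sum of element-wise minima over sum of maxima."""
    inter = np.minimum(map_a, map_b).sum()
    union = np.maximum(map_a, map_b).sum()
    return inter / (union + eps)

# Hypothetical normalized attribution maps for a demographic and a profession token.
gender_map = np.array([[1.0, 1.0], [0.0, 0.0]])
profession_map = np.array([[1.0, 0.0], [0.0, 0.0]])
overlap_hard = hard_iou(gender_map, profession_map)
overlap_soft = soft_iou(gender_map, profession_map)
```

A value near 1 means the two concepts attend to the same image regions, while a value near 0 means they are spatially separated. The soft version avoids the threshold, which is what makes it usable as a guidance signal during sampling.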

[CV-74] MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes AAAI2026

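Quick read: This paper addresses the detection of group-level social interactions in images of public spaces, a task that requires semantic understanding of subtle visual cues (relations, proximity, and co-movement) and goes well beyond traditional object detection. The key to the solution is MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline: off-the-shelf human detection and depth estimation provide the basic perception, a vision-language model (VLM) reasons about pairwise social affiliation, and a lightweight spatial aggregation algorithm localizes socially connected groups. The method is accompanied by a dataset of 100K urban street-view images with bounding boxes and labels for individuals and groups, offering a scalable and semantically rich route to recognizing complex social interactions.

Link: https://arxiv.org/abs/2509.13484
Authors: Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk
Affiliations: Massachusetts Institute of Technology; Hasso Plattner Institute
Categories: Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Comments: 13 pages, 4 figures, under review at AAAI 2026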

Abstract:Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
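
The third stage of the pipeline, aggregating pairwise affiliations into group regions, could look roughly like the sketch below. It assumes the pairwise decisions are already available as index pairs and uses union-find plus the union of member bounding boxes. The abstract only says the aggregation is lightweight and spatial, so this algorithm, the function name, and the box values are guesses for illustration.

```python
def group_regions(boxes, affiliated_pairs):
    """Merge affiliated people into groups and return one enclosing box per group.

    boxes: list of (x1, y1, x2, y2); affiliated_pairs: list of (i, j) index pairs.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in affiliated_pairs:
        parent[find(i)] = find(j)

    groups = {}
    for idx in range(len(boxes)):
        groups.setdefault(find(idx), []).append(idx)

    regions = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a single person is not a social group
        xs1, ys1, xs2, ys2 = zip(*(boxes[m] for m in members))
        regions.append((min(xs1), min(ys1), max(xs2), max(ys2)))
    return regions

# Persons 0, 1 and 3 are linked transitively; person 2 stands alone.
boxes = [(0, 0, 10, 20), (8, 0, 18, 20), (100, 0, 110, 20), (15, 5, 25, 25)]
regions = group_regions(boxes, affiliated_pairs=[(0, 1), (1, 3)])
```

Because affiliation is treated as transitive, persons 0 and 3 end up in the same region even though no direct pair links them.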

[CV-75] Improving 3D Gaussian Splatting Compression by Scene-Adaptive Lattice Vector Quantization

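Quick read: This paper addresses the high storage and transmission cost of 3D Gaussian Splatting (3DGS) models, noting that existing anchor-based neural compression methods rely on simple but less efficient uniform scalar quantization (USQ). The key to the solution is to replace USQ with lattice vector quantization (LVQ) and to propose scene-adaptive LVQ (SALVQ), which optimizes the lattice basis for each scene to improve rate-distortion (R-D) efficiency while keeping complexity close to that of USQ. SALVQ integrates into existing 3DGS compression architectures with minimal computational overhead, and scaling the lattice basis vectors adjusts the bit rate dynamically, so a single model serves multiple bit-rate targets and training time and memory consumption drop significantly.

Link: https://arxiv.org/abs/2509.13482
Authors: Hao Xu, Xiaolin Wu, Xi Zhang
Affiliations: McMaster University; Southwest Jiaotong University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Code available at this https URL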

Abstract:3D Gaussian Splatting (3DGS) is rapidly gaining popularity for its photorealistic rendering quality and real-time performance, but it generates massive amounts of data. Hence compressing 3DGS data is necessary for the cost effectiveness of 3DGS models. Recently, several anchor-based neural compression methods have been proposed, achieving good 3DGS compression performance. However, they all rely on uniform scalar quantization (USQ) due to its simplicity. A tantalizing question is whether more sophisticated quantizers can improve the current 3DGS compression methods with very little extra overhead and minimal change to the system. The answer is yes by replacing USQ with lattice vector quantization (LVQ). To better capture scene-specific characteristics, we optimize the lattice basis for each scene, improving LVQ’s adaptability and R-D efficiency. This scene-adaptive LVQ (SALVQ) strikes a balance between the R-D efficiency of vector quantization and the low complexity of USQ. SALVQ can be seamlessly integrated into existing 3DGS compression architectures, enhancing their R-D performance with minimal modifications and computational overhead. Moreover, by scaling the lattice basis vectors, SALVQ can dynamically adjust lattice density, enabling a single model to accommodate multiple bit rate targets. This flexibility eliminates the need to train separate models for different compression levels, significantly reducing training time and memory consumption.
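
The difference between USQ and LVQ can be seen in a few lines. The sketch below quantizes a 2D vector with a uniform scalar quantizer and with a lattice quantizer that uses Babai's rounding (round the coordinates in the lattice basis). The hexagonal basis is fixed here, whereas the paper learns a basis per scene. Babai rounding is only an approximate nearest-point search, and all names and values are assumptions for this example.

```python
import numpy as np

def usq(x, step):
    """Uniform scalar quantization: round each coordinate independently."""
    return step * np.round(x / step)

def lvq_babai(x, basis):
    """Quantize x to a nearby lattice point; columns of `basis` span the lattice."""
    # Express x in lattice coordinates, round them, and map back.
    coeffs = np.round(np.linalg.solve(basis, x))
    return basis @ coeffs

hex_basis = np.array([[1.0, 0.5], [0.0, np.sqrt(3) / 2]])
x = np.array([0.6, 0.8])

q_usq = usq(x, step=1.0)               # integer grid
q_hex = lvq_babai(x, hex_basis)        # hexagonal lattice
q_fine = lvq_babai(x, 0.5 * hex_basis) # same lattice, twice as dense
```

With an identity basis `lvq_babai` reduces to USQ with step 1, so LVQ can be dropped in with little structural change. Multiplying the basis by a scalar changes the lattice density, which is the mechanism the abstract describes for serving several bit rates with one model.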

[CV-76] Semantic-Enhanced Cross-Modal Place Recognition for Robust Robot Localization

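Quick read: This paper addresses accurate robot localization in GPS-denied environments. Existing RGB-based Visual Place Recognition (VPR) methods are sensitive to changes in illumination, weather, and season, and existing cross-modal localization methods degrade in complex scenes, fine-grained matching, and viewpoint changes. The key to the solution is Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR). Its main components are a VMamba backbone for RGB feature extraction, a Semantic-Aware Feature Fusion (SAFF) module that fuses place descriptors with segmentation masks, LiDAR descriptors that combine semantics and geometry, and a cross-modal semantic attention mechanism in NetVLAD to improve matching. It also proposes a Multi-View Semantic-Geometric Matching strategy and a Semantic Consistency Loss, both within a contrastive learning framework, improving the robustness and accuracy of cross-modal localization.

Link: https://arxiv.org/abs/2509.13474
Authors: Yujia Lin, Nicholas Evans
Affiliations: Dali University; Bandırma Onyedi Eylül University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: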

Abstract:Ensuring accurate localization of robots in environments without GPS capability is a challenging task. Visual Place Recognition (VPR) techniques can potentially achieve this goal, but existing RGB-based methods are sensitive to changes in illumination, weather, and other seasonal changes. Existing cross-modal localization methods leverage the geometric properties of RGB images and 3D LiDAR maps to reduce the sensitivity issues highlighted above. Currently, state-of-the-art methods struggle in complex scenes, fine-grained or high-resolution matching, and situations where changes can occur in viewpoint. In this work, we introduce a framework we call Semantic-Enhanced Cross-Modal Place Recognition (SCM-PR) that combines high-level semantics utilizing RGB images for robust localization in LiDAR maps. Our proposed method introduces: a VMamba backbone for feature extraction of RGB images; a Semantic-Aware Feature Fusion (SAFF) module for using both place descriptors and segmentation masks; LiDAR descriptors that incorporate both semantics and geometry; and a cross-modal semantic attention mechanism in NetVLAD to improve matching. Incorporating the semantic information was also instrumental in designing a Multi-View Semantic-Geometric Matching and a Semantic Consistency Loss, both in a contrastive learning framework. Our experimental work on the KITTI and KITTI-360 datasets shows that SCM-PR achieves state-of-the-art performance compared to other cross-modal place recognition methods.

[CV-77] MapAnything: Universal Feed-Forward Metric 3D Reconstruction

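Quick read: This paper addresses the lack of a unified modeling framework for multi-view 3D vision tasks. Conventional approaches design dedicated models for specific tasks (such as structure-from-motion, monocular depth estimation, or camera localization), which makes training inefficient and cross-task generalization difficult. The key to the solution is MapAnything, a unified transformer-based feed-forward model that uses a factored representation of multi-view scene geometry (depth maps, local ray maps, camera poses, and a metric scale factor) to upgrade local reconstructions into a globally consistent metric frame. Combined with standardized supervision and flexible input augmentation, the model handles many 3D vision tasks in a single forward pass (for example uncalibrated structure-from-motion and calibrated multi-view stereo), trains jointly more efficiently, and matches or outperforms specialist models.

Link: https://arxiv.org/abs/2509.13414
Authors: Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, Peter Kontschieder
Affiliations: Meta Reality Labs; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: Project Page: this https URL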

Abstract:We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
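
To show how the factored quantities listed in the abstract could fit together, the sketch below composes a depth map, a ray map, a camera pose, and a global scale factor into metric world points. The abstract does not give the composition formula, so the order of operations, the camera-to-world convention, and the function name are assumptions made for this example.

```python
import numpy as np

def to_metric_points(depth, rays, rotation, translation, scale):
    """Lift per-pixel depth along per-pixel rays into a shared metric frame.

    depth: (H, W) depth along each ray, up to scale
    rays: (H, W, 3) unit ray directions in the camera frame
    rotation: (3, 3) and translation: (3,) camera-to-world pose, up to scale
    scale: scalar that converts the up-to-scale scene into metres
    """
    points_cam = rays * depth[..., None]             # local 3D points
    points_world = points_cam @ rotation.T + translation  # shared world frame
    return scale * points_world                      # metric upgrade

# One-pixel example: looking down +z, camera shifted by 1 along x.
depth = np.array([[2.0]])
rays = np.array([[[0.0, 0.0, 1.0]]])
points = to_metric_points(depth, rays, np.eye(3), np.array([1.0, 0.0, 0.0]), scale=0.5)
```

Keeping the scale as a single separate factor means depth, rays, and poses can be predicted up to scale, with one number making the whole reconstruction metric.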

[CV-78] EdiVal-Agent: An Object-Centric Framework for Automated Scalable Fine-Grained Evaluation of Multi-Turn Editing

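【Quick Read】: This paper addresses the lack of reliable, interpretable evaluation for instruction-based image editing. Existing protocols either depend on paired reference images (which limits coverage and inherits biases from prior generative models) or rely solely on zero-shot vision-language models (VLMs), whose prompt-based judgments of instruction following, content consistency, and visual quality are often imprecise. The key to the solution is EdiVal-Agent, an automated, scalable, fine-grained framework that evaluates multi-turn editing from an object-centric perspective. It first decomposes an image into semantically meaningful objects and then synthesizes diverse, context-aware editing instructions. For evaluation, it combines VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to measure content consistency, and uses human preference models to judge visual quality. Experiments show that, for instruction-following evaluation, this combination agrees with human judgments better than VLMs alone or CLIP-based metrics, and the modular design lets future tools be integrated to improve accuracy over time.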

Link: https://arxiv.org/abs/2509.13399
Authors: Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou
Affiliations: University of Texas at Austin; University of California, Los Angeles; Microsoft; Lambda, Inc
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tianyu Chen and Yasi Zhang contributed equally; Oscar Leong, Lijuan Wang, Ying Nian Wu, and Mingyuan Zhou advised equally

Abstract:Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images – resulting in limited coverage and inheriting biases from prior generative models – or (ii) rely solely on zero-shot vision–language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline’s modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: this https URL. 

[CV-79] Real-Time Detection and Tracking of Foreign Object Intrusions in Power Systems via Feature-Based Edge Intelligence

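【Quick Read】: This paper addresses real-time detection and tracking of foreign object intrusion (FOI) in power transmission systems, with the goal of staying accurate and robust under occlusion and motion. The key to the solution is a three-stage framework: (1) a YOLOv7 segmentation model for fast, robust object localization; (2) a ConvNeXt-based feature extractor trained with triplet loss to produce discriminative embeddings; and (3) a feature-assisted IoU tracker that keeps multi-object tracking stable under occlusion and motion. The system also supports incremental updates: embeddings of previously unseen objects are added to a reference database without retraining the model. Together with mixed-precision inference, this allows efficient, scalable deployment on low-cost edge hardware such as NVIDIA Jetson devices.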

Link: https://arxiv.org/abs/2509.13396
Authors: Xinan Wang, Di Shi, Fengyu Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: 12 page Journal paper, accepted by IEEE Open Access Journal of Power and Energy

Abstract:This paper presents a novel three-stage framework for real-time foreign object intrusion (FOI) detection and tracking in power transmission systems. The framework integrates: (1) a YOLOv7 segmentation model for fast and robust object localization, (2) a ConvNeXt-based feature extractor trained with triplet loss to generate discriminative embeddings, and (3) a feature-assisted IoU tracker that ensures resilient multi-object tracking under occlusion and motion. To enable scalable field deployment, the pipeline is optimized for deployment on low-cost edge hardware using mixed-precision inference. The system supports incremental updates by adding embeddings from previously unseen objects into a reference database without requiring model retraining. Extensive experiments on real-world surveillance and drone video datasets demonstrate the framework’s high accuracy and robustness across diverse FOI scenarios. In addition, hardware benchmarks on NVIDIA Jetson devices confirm the framework’s practicality and scalability for real-world edge applications.
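To make stage (2) and the incremental-update idea more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the Euclidean distance, the margin value, the dictionary-based reference database, and the function names `triplet_loss` and `nearest_reference` are all assumptions chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor
    than the positive by at least `margin` (an assumed value here).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def nearest_reference(embedding, reference_db):
    """Return the label whose stored embedding is closest to `embedding`.

    `reference_db` is a hypothetical dict mapping label -> embedding.
    """
    return min(reference_db, key=lambda label: euclidean(embedding, reference_db[label]))


# Hypothetical 2-D embeddings; real embeddings would come from the feature extractor.
reference_db = {"kite": [1.0, 0.0], "bird": [0.0, 1.0]}
print(nearest_reference([0.9, 0.1], reference_db))

# Incremental update: register a new object class without touching the model.
reference_db["balloon"] = [-1.0, 0.0]
print(nearest_reference([-0.8, 0.1], reference_db))
```

The design point this illustrates is that, once embeddings are discriminative, adding a class is a database insert rather than a training run.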

[CV-80] A Domain Knowledge Informed Approach for Anomaly Detection of Electric Vehicle Interior Sounds

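【Quick Read】: This paper addresses model selection for unsupervised anomaly detection of automotive cabin sounds when labeled faulty data is unavailable. In real-world settings faulty samples are scarce or entirely absent, so supervised methods do not apply, while models trained only on healthy samples are hard to validate and common metrics such as validation reconstruction error are unreliable. The key to the solution is a domain-knowledge-informed model selection method: proxy-anomalies are generated by applying structured perturbations to healthy spectrograms and are added to the validation set, which supports more reliable selection of the best model. Experiments on five representative fault types (Imbalance, Modulation, Whine, Wind, and Pulse Width Modulation) show that the method significantly outperforms conventional model selection strategies.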

Link: https://arxiv.org/abs/2509.13390
Authors: Deepti Kunte, Bram Cornelis, Claudio Colangeli, Karl Janssens, Brecht Van Baelen, Konstantinos Gryllias
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Submitted to: Mechanical Systems and Signal Processing

Abstract:The detection of anomalies in automotive cabin sounds is critical for ensuring vehicle quality and maintaining passenger comfort. In many real-world settings, this task is more appropriately framed as an unsupervised learning problem rather than the supervised case due to the scarcity or complete absence of labeled faulty data. In such an unsupervised setting, the model is trained exclusively on healthy samples and detects anomalies as deviations from normal behavior. However, in the absence of labeled faulty samples for validation and the limited reliability of commonly used metrics, such as validation reconstruction error, effective model selection remains a significant challenge. To overcome these limitations, a domain-knowledge-informed approach for model selection is proposed, in which proxy-anomalies engineered through structured perturbations of healthy spectrograms are used in the validation set to support model selection. The proposed methodology is evaluated on a high-fidelity electric vehicle dataset comprising healthy and faulty cabin sounds across five representative fault types, viz., Imbalance, Modulation, Whine, Wind, and Pulse Width Modulation. This dataset, generated using advanced sound synthesis techniques and validated via expert jury assessments, has been made publicly available to facilitate further research. Experimental evaluations on the five fault cases demonstrate that selecting models using proxy-anomalies significantly outperforms conventional model selection strategies.

[CV-81] Landcover classification and change detection using remote sensing and machine learning: a case study of Western Fiji

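【Quick Read】: This paper addresses the monitoring and modelling of land use and land cover (LULC) change under rapid urbanisation in Fiji, with the aim of providing technical support for land cover modelling and change detection. The approach combines remote sensing with machine learning: Landsat-8 satellite imagery is used to build a labelled training dataset; Google Earth Engine with k-means clustering provides unsupervised classification to generate a land cover map; convolutional neural networks (CNNs) then classify the land cover types of selected regions; and finally a change detection visualisation highlights how urban areas in Nadi changed between 2013 and 2024.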

Link: https://arxiv.org/abs/2509.13388
Authors: Yadvendra Gurjar, Ruoni Wan, Ehsan Farahbakhsh, Rohitash Chandra
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments:

Abstract:As a developing country, Fiji is facing rapid urbanisation, which is visible in the massive development projects that include housing, roads, and civil works. In this study, we present machine learning and remote sensing frameworks to compare land use and land cover change from 2013 to 2024 in Nadi, Fiji. The ultimate goal of this study is to provide technical support in land cover/land use modelling and change detection. We used Landsat-8 satellite imagery for the study region and created our training dataset with labels for supervised machine learning. We used Google Earth Engine and unsupervised machine learning via k-means clustering to generate the land cover map. We used convolutional neural networks to classify the selected regions’ land cover types. We present a visualisation of change detection, highlighting urban area changes over time to monitor changes in the map.
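As a rough illustration of the unsupervised step, the following sketch runs k-means on a handful of made-up pixel feature vectors. It is plain Python rather than the Google Earth Engine API the authors used; the two-band pixel values, the deterministic initialisation from the first `k` points, and the function names are assumptions for illustration only.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=10):
    """Minimal k-means: returns (labels, centroids).

    Centroids are initialised from the first k points so the result is
    deterministic; practical implementations usually randomise this.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each pixel joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical two-band reflectance values per pixel (not real Landsat-8 data).
pixels = [(0.1, 0.8), (0.9, 0.2), (0.15, 0.75), (0.85, 0.25)]
labels, centroids = kmeans(pixels, k=2)
print(labels, centroids)
```

Each resulting cluster would still need to be interpreted as a land cover class by an analyst, which is why the abstract pairs this step with a labelled dataset and a supervised CNN.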

[CV-82] Curvature as a tool for evaluating dimensionality reduction and estimating intrinsic dimension

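【Quick Read】: This paper addresses how to quantitatively evaluate the quality of a data representation and how to estimate the intrinsic dimension of a dataset, in particular whether a low-dimensional embedding preserves the geometry of the original high-dimensional data. The key to the solution is a geometric profile based on sectional curvature. The curvature notion used captures the metric relations between triples of points and the other points of a discrete metric space. From this profile the authors derive a quantitative measure of how effective a representation is, such as one produced by a dimensionality reduction technique, and show experimentally that the same analysis can estimate intrinsic dimensionality. They apply it to the large-scale geometry of empirical networks and to the evaluation of dimensionality reduction techniques.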

Link: https://arxiv.org/abs/2509.13385
Authors: Charlotte Beylier, Parvaneh Joharinad, Jürgen Jost, Nahid Torbati
Affiliations: Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, Germany; Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany; Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany; Santa Fe Institute for the Sciences of Complexity, New Mexico, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
Comments: 31 pages, 14 figures

Abstract:Utilizing recently developed abstract notions of sectional curvature, we introduce a method for constructing a curvature-based geometric profile of discrete metric spaces. The curvature concept that we use here captures the metric relations between triples of points and other points. More significantly, based on this curvature profile, we introduce a quantitative measure to evaluate the effectiveness of data representations, such as those produced by dimensionality reduction techniques. Furthermore, our experiments demonstrate that this curvature-based analysis can be employed to estimate the intrinsic dimensionality of datasets. We use this to explore the large-scale geometry of empirical networks and to evaluate the effectiveness of dimensionality reduction techniques.

[CV-83] The Art of Saying “Maybe”: A Conformal Lens for Uncertainty Benchmarking in VLMs

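【Quick Read】: This paper addresses the lack of systematic evaluation of uncertainty quantification in Vision-Language Models (VLMs) on complex visual understanding tasks. Existing work concentrates on performance benchmarking and pays little attention to how well a model knows its own uncertainty. The solution is a comprehensive uncertainty benchmark built on conformal prediction, evaluating 16 state-of-the-art VLMs (open and closed source) on 6 multimodal datasets with 3 distinct scoring functions. The results show that larger models consistently exhibit better uncertainty quantification, that more certain models achieve higher accuracy, and that mathematical and reasoning tasks elicit poorer uncertainty performance than other domains. The work establishes a foundation for reliable uncertainty evaluation in multimodal systems.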

Link: https://arxiv.org/abs/2509.13379
Authors: Asif Azad, Mohammad Sadat Hossain, MD Sadik Hossain Shanto, M Saifur Rahman, Md Rizwan Pervez
Affiliations: Bangladesh University of Engineering & Technology (BUET); Qatar Computing Research Institute (QCRI)
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Vision-Language Models (VLMs) have achieved remarkable progress in complex visual understanding across scientific and reasoning tasks. While performance benchmarking has advanced our understanding of these capabilities, the critical dimension of uncertainty quantification has received insufficient attention. Therefore, unlike prior conformal prediction studies that focused on limited settings, we conduct a comprehensive uncertainty benchmarking study, evaluating 16 state-of-the-art VLMs (open and closed-source) across 6 multimodal datasets with 3 distinct scoring functions. Our findings demonstrate that larger models consistently exhibit better uncertainty quantification; models that know more also know better what they don’t know. More certain models achieve higher accuracy, while mathematical and reasoning tasks elicit poorer uncertainty performance across all models compared to other domains. This work establishes a foundation for reliable uncertainty evaluation in multimodal systems.
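For readers new to conformal prediction, the sketch below shows the standard split-conformal recipe for multiple-choice outputs. The abstract does not name its three scoring functions, so the score used here (one minus the probability assigned to the true answer) is an assumption, as are the function names and the toy calibration data.

```python
import math


def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal calibration.

    Nonconformity score (assumed): 1 - probability of the true class.
    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for that rank.
    """
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """All answer options whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]


# Toy calibration data: per-option probabilities and the index of the true option.
cal_probs = [
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
    [0.25, 0.5, 0.25],
]
cal_labels = [0, 1, 2, 0, 1]

threshold = calibrate_threshold(cal_probs, cal_labels, alpha=0.25)
print(prediction_set([0.5, 0.3, 0.2], threshold))   # a confident answer
print(prediction_set([0.5, 0.5, 0.0], threshold))   # an ambiguous answer
```

A larger prediction set signals higher uncertainty, so average set size is a common way to compare models under this framework; whether the benchmark reports exactly that metric is not stated in the abstract.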

[CV-84] An Empirical Analysis of VLM-based OOD Detection: Mechanisms, Advantages, and Sensitivity

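【Quick Read】: This paper addresses the incomplete understanding of why Vision-Language Models (VLMs) perform so well at zero-shot out-of-distribution (OOD) detection, what advantages they have over single-modal methods, and how robust their behavior is. The key to the solution is a systematic empirical analysis. It characterizes the operational properties of the VLM embedding space that make zero-shot OOD detection work, and it quantifies the advantage over single-modal approaches, attributing it to the VLM's ability to exploit semantic novelty. It also uncovers a previously under-explored asymmetry in robustness: VLM-based methods are resilient to common image noise but highly sensitive to prompt phrasing. These findings give empirically grounded guidance for building more robust and reliable OOD detection systems.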

Link: https://arxiv.org/abs/2509.13375
Authors: Yuxiao Lee, Xiaofeng Cao, Wei Ye, Jiangchao Yao, Jingkuan Song, Heng Tao Shen
Affiliations: Jilin University; Tongji University; Shanghai Jiao Tong University; University of Electronic Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision-Language Models (VLMs), such as CLIP, have demonstrated remarkable zero-shot out-of-distribution (OOD) detection capabilities, vital for reliable AI systems. Despite this promising capability, a comprehensive understanding of (1) why they work so effectively, (2) what advantages they have over single-modal methods, and (3) how robust their behavior is remains notably incomplete within the research community. This paper presents a systematic empirical analysis of VLM-based OOD detection using in-distribution (ID) and OOD prompts. (1) Mechanisms: We systematically characterize and formalize key operational properties within the VLM embedding space that facilitate zero-shot OOD detection. (2) Advantages: We empirically quantify the superiority of these models over established single-modal approaches, attributing this distinct advantage to the VLM’s capacity to leverage rich semantic novelty. (3) Sensitivity: We uncover a significant and previously under-explored asymmetry in their robustness profile: while exhibiting resilience to common image noise, these VLM-based methods are highly sensitive to prompt phrasing. Our findings contribute a more structured understanding of the strengths and critical vulnerabilities inherent in VLM-based OOD detection, offering crucial, empirically-grounded guidance for developing more robust and reliable future designs.

[CV-85] Parking Space Ground Truth Test Automation by Artificial Intelligence Using Convolutional Neural Networks

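【Quick Read】: This paper addresses the low level of automation in ground truth testing for a real-time, cloud-based on-street parking service, with the aim of improving the quality of the parking information provided. The core difficulty is that manual annotation and analysis are slow and inefficient. The key to the solution is to apply convolutional neural networks (CNNs) for image pattern recognition, so that detection data in the database is processed and enriched automatically and human engineering work is replaced in major parts of the analysis process. Experiments show a reduction in human resource time of up to 99.58%.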

Link: https://arxiv.org/abs/2509.13366
Authors: Tony Rohe, Martin Margreiter, Markus Moertl
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures

Abstract:This research is part of a study of a real-time, cloud-based on-street parking service using crowd-sourced in-vehicle fleet data. The service provides real-time information about available parking spots by classifying crowd-sourced detections observed via ultrasonic sensors. The goal of this research is to optimize the current parking service quality by analyzing the automation of the existing test process for ground truth tests. Therefore, methods from the field of machine learning, especially image pattern recognition, are applied to enrich the database and substitute human engineering work in major areas of the analysis process. After an introduction into the related areas of machine learning, this paper explains the methods and implementations made to achieve a high level of automation, applying convolutional neural networks. Finally, predefined metrics present the performance level achieved, showing a time reduction of human resources up to 99.58 %. The overall improvements are discussed, summarized, and followed by an outlook for future development and potential application of the analysis automation tool.

[CV-86] Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention

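【Quick Read】: This paper addresses expressway congestion, which reduces travel efficiency and hinders regional connectivity, and targets two weaknesses of existing "detection-prediction" systems: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. The key to the solution is an integrated framework. For perception, YOLOv11 is upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss (95.7% mAP), and DeepSort is improved by fusing Mahalanobis (motion) and cosine (appearance) distances (93.8% MOTA). Using the Greenberg model for high-density scenarios, speed and density show a strong negative correlation (r = -0.97). For warning, a GRU-Attention model captures congestion precursors; after 300 training epochs it reaches 99.7% test accuracy (7 to 9 percentage points above a traditional GRU), and in 10-minute advance warnings of 30-minute congestion the time error is at most 1 minute. Validation on an independent video shows 95% warning accuracy and more than 90% spatial overlap of congestion points. The framework provides quantitative support for expressway congestion control.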

Link: https://arxiv.org/abs/2509.13361
Authors: Tong Yulin, Liang Xuechen
Affiliations: East China Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Expressway traffic congestion severely reduces travel efficiency and hinders regional connectivity. Existing “detection-prediction” systems have critical flaws: low vehicle perception accuracy under occlusion and loss of long-sequence dependencies in congestion forecasting. This study proposes an integrated technical framework to resolve these issues. For traffic flow perception, two baseline algorithms were optimized. Traditional YOLOv11 was upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort was improved by fusing Mahalanobis (motion) and cosine (appearance) distances. Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7% mAP (6.5 percentage points higher than baseline) with 5.3% occlusion miss rate. DeepSort reached 93.8% MOTA (11.3 percentage points higher than SORT) with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km high-density scenarios), speed and density showed a strong negative correlation (r=-0.97), conforming to traffic flow theory. For congestion warning, a GRU-Attention model was built to capture congestion precursors. Trained 300 epochs with flow, density, and speed, it achieved 99.7% test accuracy (7-9 percentage points higher than traditional GRU). In 10-minute advance warnings for 30-minute congestion, time error was ≤ 1 minute. Validation with an independent video showed 95% warning accuracy, over 90% spatial overlap of congestion points, and stable performance in high-flow ( 5 vehicles/second) scenarios. This framework provides quantitative support for expressway congestion control, with promising intelligent transportation applications.
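The DIoU loss mentioned above can be sketched in a few lines. This follows the commonly cited definition (IoU minus the squared centre distance divided by the squared diagonal of the smallest enclosing box); the paper's exact implementation inside YOLOv11 is not shown in the abstract, and the `(x1, y1, x2, y2)` box format and function name are assumptions.

```python
def diou_loss(box_a, box_b):
    """Distance-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    loss = 1 - (IoU - d^2 / c^2), where d is the distance between box
    centres and c is the diagonal of the smallest enclosing box.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centres.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2

    diou = iou - d2 / c2 if c2 > 0 else iou
    return 1.0 - diou


print(diou_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes
print(diou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint, near
print(diou_loss((0, 0, 1, 1), (4, 0, 5, 1)))  # disjoint, farther
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move apart, so it still provides a useful signal when a prediction does not overlap its target at all.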

[CV-87] Hybrid Quantum-Classical Model for Image Classification

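【Quick Read】: This paper examines the limits of purely classical deep learning models on vision tasks in terms of accuracy, training speed, and parameter count. The approach is a hybrid quantum-classical neural network that integrates parameterized quantum circuits with classical deep learning architectures, compared systematically against conventional convolutional neural networks (CNNs) on MNIST, CIFAR100, and STL10. According to the reported experiments, the hybrid models reach higher validation accuracy (for example, +9.44% on CIFAR100), train 5 to 12 times faster, use 6 to 32% fewer parameters, and are more adversarially robust on simpler datasets, while showing comparable fragility on complex datasets such as CIFAR100.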

Link: https://arxiv.org/abs/2509.13353
Authors: Muhammad Adnan Shahzad
Affiliations: Concordia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This study presents a systematic comparison between hybrid quantum-classical neural networks and purely classical models across three benchmark datasets (MNIST, CIFAR100, and STL10) to evaluate their performance, efficiency, and robustness. The hybrid models integrate parameterized quantum circuits with classical deep learning architectures, while the classical counterparts use conventional convolutional neural networks (CNNs). Experiments were conducted over 50 training epochs for each dataset, with evaluations on validation accuracy, test accuracy, training time, computational resource usage, and adversarial robustness (tested with ε = 0.1 perturbations). Key findings demonstrate that hybrid models consistently outperform classical models in final accuracy, achieving 99.38% (MNIST), 41.69% (CIFAR100), and 74.05% (STL10) validation accuracy, compared to classical benchmarks of 98.21%, 32.25%, and 63.76%, respectively. Notably, the hybrid advantage scales with dataset complexity, showing the most significant gains on CIFAR100 (+9.44%) and STL10 (+10.29%). Hybrid models also train 5–12× faster (e.g., 21.23s vs. 108.44s per epoch on MNIST) and use 6–32% fewer parameters while maintaining superior generalization to unseen test data. Adversarial robustness tests reveal that hybrid models are significantly more resilient on simpler datasets (e.g., 45.27% robust accuracy on MNIST vs. 10.80% for classical) but show comparable fragility on complex datasets like CIFAR100 (~1% robustness for both). Resource efficiency analyses indicate that hybrid models consume less memory (4–5GB vs. 5–6GB for classical) and lower CPU utilization (9.5% vs. 23.2% on average). These results suggest that hybrid quantum-classical architectures offer compelling advantages in accuracy, training efficiency, and parameter scalability, particularly for complex vision tasks.

[CV-88] Proximity-Based Evidence Retrieval for Uncertainty-Aware Neural Networks

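【Quick Read】: This paper addresses the limited reliability and interpretability of uncertainty-aware decision-making that relies on a single global cutoff, such as a threshold on prediction entropy. The key to the solution is an evidence-retrieval mechanism: for each test instance, proximal exemplars are retrieved in an embedding space, and their predictive distributions are fused via Dempster-Shafer theory. The fused belief then acts as a per-instance, adaptive thresholding criterion. Because the supporting evidence is explicit, decisions are transparent and auditable. Experiments show that only a few pieces of evidence are enough to materially reduce confidently incorrect outcomes while keeping the review load sustainable, compared with a fixed entropy threshold.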

Link: https://arxiv.org/abs/2509.13338
Authors: Hassan Gharoun, Mohammad Sadegh Khorshidi, Kasra Ranjbarigderi, Fang Chen, Amir H. Gandomi
Affiliations: University of Technology Sydney; Óbuda University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: 15 pages, 4 figures, 3 tables

Abstract:This work proposes an evidence-retrieval mechanism for uncertainty-aware decision-making that replaces a single global cutoff with an evidence-conditioned, instance-adaptive criterion. For each test instance, proximal exemplars are retrieved in an embedding space; their predictive distributions are fused via Dempster-Shafer theory. The resulting fused belief acts as a per-instance thresholding mechanism. Because the supporting evidences are explicit, decisions are transparent and auditable. Experiments on CIFAR-10/100 with BiT and ViT backbones show higher or comparable uncertainty-aware performance with materially fewer confidently incorrect outcomes and a sustainable review load compared with applying threshold on prediction entropy. Notably, only a few evidences are sufficient to realize these gains; increasing the evidence set yields only modest changes. These results indicate that evidence-conditioned tagging provides a more reliable and interpretable alternative to fixed prediction entropy thresholds for operational uncertainty-aware decision-making.
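The fusion step can be illustrated with Dempster's rule of combination. This is a generic sketch of the rule, not the paper's pipeline: how the authors turn each exemplar's predictive distribution into a mass function is not described in the abstract, so the two mass functions below, the class names, and the function name `combine` are illustrative assumptions.

```python
def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its mass.
    Products of masses on intersecting sets go to the intersection;
    products on disjoint sets count as conflict and are normalised away.
    """
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            intersection = set_a & set_b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("Sources are in total conflict; the rule is undefined.")
    return {hyp: mass / (1.0 - conflict) for hyp, mass in combined.items()}


cat, dog = frozenset({"cat"}), frozenset({"dog"})
frame = cat | dog  # mass on the whole frame expresses ignorance

# Two hypothetical pieces of evidence retrieved for one test instance.
evidence_1 = {cat: 0.6, frame: 0.4}
evidence_2 = {cat: 0.5, dog: 0.3, frame: 0.2}

fused = combine(evidence_1, evidence_2)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

Fusing more exemplars is a matter of folding `combine` over the list; the abstract notes that only a few are needed before the result stops changing much.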

[CV-89] Rest2Visual: Predicting Visually Evoked fMRI from Resting-State Scans

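【Quick Read】: This paper addresses a fundamental challenge in cognitive neuroscience: relating spontaneous brain activity to stimulus-driven neural responses. Task-based functional magnetic resonance imaging (fMRI) captures localized stimulus-evoked activation, but it is costly, time-consuming, and hard to scale across populations. Resting-state fMRI (rs-fMRI) is task-free and abundant but lacks direct interpretability. The authors propose Rest2Visual, a conditional generative model that predicts visually evoked fMRI (ve-fMRI) activation maps from rs-fMRI input and 2D visual stimuli. Its key design is a volumetric encoder-decoder in which multiscale 3D rs-fMRI features are modulated by image embeddings via adaptive normalization, enabling spatially accurate, stimulus-specific activation synthesis. The model is trained and evaluated on a large-scale triplet dataset built from the Natural Scenes Dataset (NSD). The predicted activations closely match ground truth on standard similarity and representational metrics, support downstream image reconstruction, and preserve subject-specific structure, which indicates that individual spontaneous activity can be transformed into stimulus-aligned representations.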

Link: https://arxiv.org/abs/2509.13612
Authors: Chuyang Zhou, Ziao Ji, Daochang Liu, Dongang Wang, Chenyu Wang, Chang Xu
Affiliations: The University of Sydney; The University of Western Australia
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Understanding how spontaneous brain activity relates to stimulus-driven neural responses is a fundamental challenge in cognitive neuroscience. While task-based functional magnetic resonance imaging (fMRI) captures localized stimulus-evoked brain activation, its acquisition is costly, time-consuming, and difficult to scale across populations. In contrast, resting-state fMRI (rs-fMRI) is task-free and abundant, but lacks direct interpretability. We introduce Rest2Visual, a conditional generative model that predicts visually evoked fMRI (ve-fMRI) from resting-state input and 2D visual stimuli. It follows a volumetric encoder–decoder design, where multiscale 3D features from rs-fMRI are modulated by image embeddings via adaptive normalization, enabling spatially accurate, stimulus-specific activation synthesis. To enable model training, we construct a large-scale triplet dataset from the Natural Scenes Dataset (NSD), aligning each rs-fMRI volume with stimulus images and their corresponding ve-fMRI activation maps. Quantitative evaluation shows that the predicted activations closely match ground truth across standard similarity and representational metrics, and support successful image reconstruction in downstream decoding. Notably, the predicted maps preserve subject-specific structure, demonstrating the model’s capacity to generate individualized functional surrogates. Our results provide compelling evidence that individualized spontaneous neural activity can be transformed into stimulus-aligned representations, opening new avenues for scalable, task-free functional brain modeling.

[CV-90] Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction for Sparse-View CT

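【Quick Read】: This paper addresses artifacts in Sparse-View CT (SVCT) caused by view reduction and by domain shifts from scanner, protocol, or anatomical variations, which degrade reconstruction in out-of-distribution (OOD) scenarios. The key to the solution is the Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction (CDPIR) framework, which combines cross-distribution diffusion priors with model-based iterative reconstruction. A Scalable Interpolant Transformer (SiT) backbone establishes a unified stochastic interpolant framework and is trained across multiple datasets with Classifier-Free Guidance (CFG). Randomly replacing the conditioning with a null embedding during training lets the model learn both domain-specific and domain-invariant priors, which improves generalization. During sampling, the globally sensitive transformer-based diffusion model exploits the cross-distribution prior, allowing flexible and stable control over multi-distribution-to-noise interpolation paths and decoupled sampling strategies. Alternating between data fidelity and sampling updates, the method markedly improves adaptation to OOD data and reconstruction quality.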

Link: https://arxiv.org/abs/2509.13576
Authors: Haodong Li, Shuo Han, Haiyang Mao, Yu Shi, Changsheng Fang, Jianjia Zhang, Weiwen Wu, Hengyong Yu
Affiliations: University of Massachusetts Lowell; Sun Yat-Sen University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 8 figures, under review at IEEE TMI

Abstract:Sparse-View CT (SVCT) reconstruction enhances temporal resolution and reduces radiation dose, yet its clinical use is hindered by artifacts due to view reduction and domain shifts from scanner, protocol, or anatomical variations, leading to performance degradation in out-of-distribution (OOD) scenarios. In this work, we propose a Cross-Distribution Diffusion Priors-Driven Iterative Reconstruction (CDPIR) framework to tackle the OOD problem in SVCT. CDPIR integrates cross-distribution diffusion priors, derived from a Scalable Interpolant Transformer (SiT), with model-based iterative reconstruction methods. Specifically, we train a SiT backbone, an extension of the Diffusion Transformer (DiT) architecture, to establish a unified stochastic interpolant framework, leveraging Classifier-Free Guidance (CFG) across multiple datasets. By randomly dropping the conditioning with a null embedding during training, the model learns both domain-specific and domain-invariant priors, enhancing generalizability. During sampling, the globally sensitive transformer-based diffusion model exploits the cross-distribution prior within the unified stochastic interpolant framework, enabling flexible and stable control over multi-distribution-to-noise interpolation paths and decoupled sampling strategies, thereby improving adaptation to OOD reconstruction. By alternating between data fidelity and sampling updates, our model achieves state-of-the-art performance with superior detail preservation in SVCT reconstructions. Extensive experiments demonstrate that CDPIR significantly outperforms existing approaches, particularly under OOD conditions, highlighting its robustness and potential clinical value in challenging imaging scenarios.

[CV-91] Autonomous Reporting of Normal Chest X-rays by Artificial Intelligence in the United Kingdom. Can We Take the Human Out of the Loop?

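【Quick Read】: This paper addresses delays in chest X-ray (CXR) reporting in the United Kingdom, which stem from radiologist workforce shortages. The proposed direction is autonomous AI reporting of normal CXRs: AI tools that distinguish normal from abnormal images could report normal studies without human input and so reduce radiology workload. The article examines the key challenges: defining "normal", ensuring generalisability across populations, managing the sensitivity-specificity trade-off, and handling legal and regulatory issues such as compliance with IR(ME)R and GDPR and the lack of accountability frameworks for errors.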

Link: https://arxiv.org/abs/2509.13428
Authors: Katrina Nash, James Vaz, Ahmed Maiter, Christopher Johns, Nicholas Woznitza, Aditya Kale, Abdala Espinosa Morgado, Rhidian Bramley, Mark Hall, David Lowe, Alex Novak, Sarim Ather
Affiliations: Unknown
Categories: Populations and Evolution (q-bio.PE); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Chest X-rays (CXRs) are the most commonly performed imaging investigation. In the UK, many centres experience reporting delays due to radiologist workforce shortages. Artificial intelligence (AI) tools capable of distinguishing normal from abnormal CXRs have emerged as a potential solution. If normal CXRs could be safely identified and reported without human input, a substantial portion of radiology workload could be reduced. This article examines the feasibility and implications of autonomous AI reporting of normal CXRs. Key issues include defining normal, ensuring generalisability across populations, and managing the sensitivity-specificity trade-off. It also addresses legal and regulatory challenges, such as compliance with IR(ME)R and GDPR, and the lack of accountability frameworks for errors. Further considerations include the impact on radiologists' practice, the need for robust post-market surveillance, and incorporation of patient perspectives. While the benefits are clear, adoption must be cautious.

[CV-92] Generative AI Pipeline for Interactive Prompt-driven 2D-to-3D Vascular Reconstruction for Fontan Geometries from Contrast-Enhanced X-Ray Fluoroscopy Imaging

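【Quick Read】: This paper addresses hemodynamic assessment after Fontan palliation for univentricular congenital heart disease. Conventional 2D imaging characterizes the complex flow patterns poorly, and fluoroscopic angiography provides only limited 3D geometric information, which is not enough for computational fluid dynamics (CFD) analysis and surgical planning. The key to the solution is a multi-step AI pipeline. Google's Gemini 2.5 Flash (2.5B parameters, as stated by the author) iteratively processes single-view fluoroscopic angiograms, covering preprocessing, vascular segmentation, contrast enhancement, artifact removal, and virtual hemodynamic flow visualization. The final views are then passed to Tencent's Hunyuan3D-2mini (384M parameters) to generate stereolithography files. Run through a custom web interface, the pipeline produced geometrically optimized 2D projections after 16 processing steps, with complete processing in under 15 minutes and API responses within seconds. The result is a fast route from routine clinical imaging to CFD-suitable geometry and preliminary virtual flow visualization.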

Link: https://arxiv.org/abs/2509.13372
Authors: Prahlad G Menon
Affiliations: University of Pittsburgh
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Quantitative Methods (q-bio.QM)
Comments:

Abstract:Fontan palliation for univentricular congenital heart disease progresses to hemodynamic failure with complex flow patterns poorly characterized by conventional 2D imaging. Current assessment relies on fluoroscopic angiography, providing limited 3D geometric information essential for computational fluid dynamics (CFD) analysis and surgical planning. A multi-step AI pipeline was developed utilizing Google’s Gemini 2.5 Flash (2.5B parameters) for systematic, iterative processing of fluoroscopic angiograms through a transformer-based neural architecture. The pipeline encompasses medical image preprocessing, vascular segmentation, contrast enhancement, artifact removal, and virtual hemodynamic flow visualization within 2D projections. Final views were processed through Tencent’s Hunyuan3D-2mini (384M parameters) for stereolithography file generation. The pipeline successfully generated geometrically optimized 2D projections from single-view angiograms after 16 processing steps using a custom web interface. Initial iterations contained hallucinated vascular features requiring iterative refinement to achieve anatomically faithful representations. Final projections demonstrated accurate preservation of complex Fontan geometry with enhanced contrast suitable for 3D conversion. AI-generated virtual flow visualization identified stagnation zones in central connections and flow patterns in branch arteries. Complete processing required under 15 minutes with second-level API response times. This approach demonstrates clinical feasibility of generating CFD-suitable geometries from routine angiographic data, enabling 3D generation and rapid virtual flow visualization for cursory insights prior to full CFD simulation. While requiring refinement cycles for accuracy, this establishes a foundation for democratizing advanced geometric and hemodynamic analysis using readily available imaging data.
MSC classes: 92C50, 68T07, 76D05, 65D18, 92C55; ACM classes: I.4.6; I.4.8; J.3; I.2.10; I.4.9. Cite as: arXiv:2509.13372 [eess.IV] (or arXiv:2509.13372v1 [eess.IV] for this version), https://doi.org/10.48550/arXiv.2509.13372. Submitted by Prahlad Menon, [v1] Tue, 16 Sep 2025 04:47:25 UTC (6,760 KB).

[CV-93] PREDICT-GBM: Platform for Robust Evaluation and Development of Individualized Computational Tumor Models in Glioblastoma

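Quick read: This paper addresses the high recurrence rate in glioblastoma radiotherapy, which arises because conventional uniform treatment margins ignore patient-specific anatomical and biological factors. Its core contribution is PREDICT-GBM, an integrated computational platform and dataset that combines clinical data from 255 subjects, each with complete tumor segmentations and tissue characterization maps, to benchmark state-of-the-art tumor growth models systematically. The key idea is to derive more precise radiotherapy plans from personalized predictions of tumor cell distribution. Experiments show that two of the evaluated models achieve better recurrence coverage than the conventional approach, which supports clinical translation of these models and, ultimately, better patient outcomes.

Link: https://arxiv.org/abs/2509.13360
Authors: L. Zimmer,J. Weidner,M. Balcerak,F. Kofler,I. Ezhov,B. Menze,B. Wiestler
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: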

Abstract:Glioblastoma is the most prevalent primary brain malignancy, distinguished by its highly invasive behavior and exceptionally high rates of recurrence. Conventional radiation therapy, which employs uniform treatment margins, fails to account for patient-specific anatomical and biological factors that critically influence tumor cell migration. To address this limitation, numerous computational models of glioblastoma growth have been developed, enabling generation of tumor cell distribution maps extending beyond radiographically visible regions and thus informing more precise treatment strategies. However, despite encouraging preliminary findings, the clinical adoption of these growth models remains limited. To bridge this translational gap and accelerate both model development and clinical validation, we introduce PREDICT-GBM, a comprehensive integrated pipeline and dataset for modeling and evaluation. This platform enables systematic benchmarking of state-of-the-art tumor growth models using an expert-curated clinical dataset comprising 255 subjects with complete tumor segmentations and tissue characterization maps. Our analysis demonstrates that personalized radiation treatment plans derived from tumor growth predictions achieved superior recurrence coverage compared to conventional uniform margin approaches for two of the evaluated models. This work establishes a robust platform for advancing and systematically evaluating cutting-edge tumor growth modeling approaches, with the ultimate goal of facilitating clinical translation and improving patient outcomes.

[CV-94] 3D Reconstruction of Coronary Vessel Trees from Biplanar X-Ray Images Using a Geometric Approach

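Quick read: This paper addresses the reconstruction of 3D coronary vessel trees from biplanar X-ray angiography videos, with the aim of improving visualization and navigation during cardiac interventions. The core challenge is to match motion phases accurately across 2D images acquired at different C-arm angles and to reconstruct the 3D geometry with high accuracy. The proposed framework has three modules: image segmentation, motion phase matching, and 3D reconstruction. First, automatic video segmentation semantically separates vessels, catheters, and other structures. Second, tracking a stationary object (such as a catheter or lead) identifies image pairs at similar respiratory and cardiac phases, which reduces reconstruction errors caused by physiological motion. Finally, key anatomical landmarks are matched between the two views and a novel geometric algorithm computes the vessel centrelines as the intersection of two 3D surfaces, which simplifies the workflow and improves accuracy compared with traditional methods based on epipolar constraints.

Link: https://arxiv.org/abs/2509.13358
Authors: Ethan Koland,Lin Xi,Nadeev Wijesuriya,YingLiang Ma
Affiliation: unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: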

Abstract:X-ray angiography is widely used in cardiac interventions to visualize coronary vessels, assess integrity, detect stenoses and guide treatment. We propose a framework for reconstructing 3D vessel trees from biplanar X-ray images which are extracted from two X-ray videos captured at different C-arm angles. The proposed framework consists of three main components: image segmentation, motion phase matching, and 3D reconstruction. An automatic video segmentation method for X-ray angiography is used to enable semantic segmentation for image segmentation and motion phase matching. The goal of the motion phase matching is to identify a pair of X-ray images that correspond to a similar respiratory and cardiac motion phase to reduce errors in 3D reconstruction. This is achieved by tracking a stationary object such as a catheter or lead within the X-ray video. The semantic segmentation approach assigns different labels to different object classes, enabling accurate differentiation between blood vessels, balloons, and catheters. Once a suitable image pair is selected, key anatomical landmarks (vessel branching points and endpoints) are matched between the two views using a heuristic method that minimizes reconstruction errors. This is followed by a novel geometric reconstruction algorithm to generate the 3D vessel tree. The algorithm computes the 3D vessel centrelines by determining the intersection of two 3D surfaces. Compared to traditional methods based on epipolar constraints, the proposed approach simplifies the reconstruction workflow and improves overall accuracy. We trained and validated our segmentation method on 62 X-ray angiography video sequences. On the test set, our method achieved a segmentation accuracy of 0.703. The 3D reconstruction framework was validated by measuring the reconstruction error of key anatomical landmarks, achieving a reprojection error of 0.62mm +/- 0.38mm.

Artificial Intelligence

[AI-0] A Universal Banach–Bregman Framework for Stochastic Iterations: Unifying Stochastic Mirror Descent Learning and LLM Training

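Quick read: This paper addresses the limitation that current stochastic optimization theory is largely confined to Hilbert spaces and therefore cannot handle non-Euclidean settings such as mirror descent on simplices, Bregman proximal methods for sparse learning, natural gradient descent in information geometry, and KL-regularized large language model training. The key to its solution is a Banach–Bregman framework that treats Bregman geometry as a unified foundation for optimization algorithms. The framework builds a single template from Bregman projections and Bregman–Fejér monotonicity that covers stochastic approximation, mirror descent, natural gradient, adaptive methods, and mirror-prox. It establishes super-relaxations (λ > 2) in non-Hilbert settings and explains how geometric flexibility contributes to acceleration. It also provides convergence theorems ranging from almost-sure boundedness to geometric rates, unifying optimization methods across core AI paradigms in theory and practice.

Link: https://arxiv.org/abs/2509.14216
Authors: Johnny R. Zhang(Independent Researcher),Xiaomei Mi(University of Manchester),Gaoyuan Du(Amazon),Qianyi Sun(Microsoft),Shiqi Wang(Meta),Jiaxuan Li(Amazon),Wenhua Zhou(Independent Researcher)
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 69 pages, 10 figures. Preprint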

Abstract:Stochastic optimization powers the scalability of modern artificial intelligence, spanning machine learning, deep learning, reinforcement learning, and large language model training. Yet, existing theory remains largely confined to Hilbert spaces, relying on inner-product frameworks and orthogonality. This paradigm fails to capture non-Euclidean settings, such as mirror descent on simplices, Bregman proximal methods for sparse learning, natural gradient descent in information geometry, or Kullback–Leibler-regularized language model training. Unlike Euclidean-based Hilbert-space methods, this approach embraces general Banach spaces. This work introduces a pioneering Banach–Bregman framework for stochastic iterations, establishing Bregman geometry as a foundation for next-generation optimization. It (i) provides a unified template via Bregman projections and Bregman–Fejer monotonicity, encompassing stochastic approximation, mirror descent, natural gradient, adaptive methods, and mirror-prox; (ii) establishes super-relaxations (λ > 2) in non-Hilbert settings, enabling flexible geometries and elucidating their acceleration effect; and (iii) delivers convergence theorems spanning almost-sure boundedness to geometric rates, validated on synthetic and real-world tasks. Empirical studies across machine learning (UCI benchmarks), deep learning (e.g., Transformer training), reinforcement learning (actor–critic), and large language models (WikiText-2 with distilGPT-2) show up to 20% faster convergence, reduced variance, and enhanced accuracy over classical baselines. These results position Banach–Bregman geometry as a cornerstone unifying optimization theory and practice across core AI paradigms.
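For readers unfamiliar with the non-Euclidean setting the abstract mentions, the following is a minimal sketch of classical mirror descent on the probability simplex with the negative-entropy mirror map (the exponentiated-gradient update). It is a textbook special case, not the paper's Banach–Bregman framework; the function name, the linear objective, and the step size are assumptions chosen for illustration.

```python
import math


def entropic_mirror_descent(grad, x0, step, iters):
    """Mirror descent on the simplex using the negative-entropy mirror map."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x)
        # Multiplicative update; with the entropy mirror map the Bregman (KL)
        # projection back onto the simplex is just a renormalization.
        w = [xi * math.exp(-step * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x


# Hypothetical objective: the linear cost <costs, x>, whose minimizer over the
# simplex is the vertex with the smallest cost (index 1 here).
costs = [0.9, 0.2, 0.5]
x_final = entropic_mirror_descent(lambda x: costs, [1 / 3, 1 / 3, 1 / 3],
                                  step=0.5, iters=200)
print(x_final)
```

The iterate stays on the simplex by construction, which is the practical reason this geometry is preferred over a Euclidean projection for probability vectors.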

[AI-1] Hierarchical Learning for Maze Navigation: Emergence of Mental Representations via Second-Order Learning

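Quick read: This paper investigates how structured internal mental representations improve a cognitive system's ability to adapt to complex environments, and in particular tests whether second-order learning promotes an isomorphism between environment and cognition that improves learning efficiency and generalization. The key to its solution is a hierarchical architecture: a Graph Convolutional Network (GCN) acts as the first-order learner and maps node features directly to predictions of optimal navigation paths, while a multilayer perceptron (MLP) controller acts as the second-order learner and dynamically adjusts the GCN's parameters when the system faces structurally novel mazes. Experiments show that second-order learning markedly improves task performance and robustness on unseen mazes when the cognitive system develops an internal mental map that is structurally isomorphic to the environment, which provides empirical support for the central role of structured mental representations in making second-order learning effective.

Link: https://arxiv.org/abs/2509.14195
Authors: Shalima Binta Manir,Tim Oates
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 8 pages, 3 figures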

Abstract:Mental representation, characterized by structured internal models mirroring external environments, is fundamental to advanced cognition but remains challenging to investigate empirically. Existing theory hypothesizes that second-order learning – learning mechanisms that adapt first-order learning (i.e., learning about the task/domain) – promotes the emergence of such environment-cognition isomorphism. In this paper, we empirically validate this hypothesis by proposing a hierarchical architecture comprising a Graph Convolutional Network (GCN) as a first-order learner and an MLP controller as a second-order learner. The GCN directly maps node-level features to predictions of optimal navigation paths, while the MLP dynamically adapts the GCN’s parameters when confronting structurally novel maze environments. We demonstrate that second-order learning is particularly effective when the cognitive system develops an internal mental map structurally isomorphic to the environment. Quantitative and qualitative results highlight significant performance improvements and robust generalization on unseen maze tasks, providing empirical support for the pivotal role of structured mental representations in maximizing the effectiveness of second-order learning.

[AI-2] Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting

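Quick read: This paper addresses the performance bottleneck in time series forecasting caused by the distributional gap between input histories and future targets, in particular the frequency mismatch between them. Representation learning methods such as contrastive learning are seldom adopted by current forecasters because they are thought to bring little performance gain. The paper proposes TimeAlign, a lightweight, plug-and-play framework whose key idea is to learn auxiliary features through a simple reconstruction task and feed them back to any base forecaster, explicitly aligning the representations of input histories and future targets. Experiments indicate that this alignment narrows the distributional gap mainly by correcting frequency mismatches, and a theoretical analysis shows that TimeAlign increases the mutual information between learned representations and predicted targets. Because it is architecture-agnostic and adds negligible overhead, TimeAlign can serve as a general alignment module for deep learning time series forecasting systems.

Link: https://arxiv.org/abs/2509.14181
Authors: Yifan Hu,Jie Yang,Tian Zhou,Peiyuan Liu,Yujin Tang,Rong Jin,Liang Sun
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: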

Abstract:Representation learning techniques like contrastive learning have long been explored in time series forecasting, mirroring their success in computer vision and natural language processing. Yet recent state-of-the-art (SOTA) forecasters seldom adopt these representation approaches because they have shown little performance advantage. We challenge this view and demonstrate that explicit representation alignment can supply critical information that bridges the distributional gap between input histories and future targets. To this end, we introduce TimeAlign, a lightweight, plug-and-play framework that learns auxiliary features via a simple reconstruction task and feeds them back to any base forecaster. Extensive experiments across eight benchmarks verify its superior performance. Further studies indicate that the gains arise primarily from correcting frequency mismatches between historical inputs and future outputs. We also provide a theoretical justification for the effectiveness of TimeAlign in increasing the mutual information between learned representations and predicted targets. As it is architecture-agnostic and incurs negligible overhead, TimeAlign can serve as a general alignment module for modern deep learning time-series forecasting systems. The code is available at this https URL.

[AI-3] TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning

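Quick read: This paper addresses three key challenges in training Web Agents with reinforcement learning: credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. Its core solution is Tree-Guided Preference Optimization (TGPO), a framework that builds a tree-structured trajectory representation which merges semantically identical states across trajectories to eliminate label conflicts. It also introduces a Process Reward Model that automatically generates fine-grained rewards from subgoal progress, redundancy detection, and action verification, and combines it with a dynamic weighting mechanism that focuses training on high-impact decision points, significantly improving training efficiency and task success rates.

Link: https://arxiv.org/abs/2509.14172
Authors: Ziyuan Chen,Zhenghui Zhao,Zhangye Han,Miancan Liu,Xianhang Ye,Yiqing Li,Hongbo Min,Jinkui Ren,Xiantao Zhang,Guitao Cao
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: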

Abstract:With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that introduces a tree-structured trajectory representation merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
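The state-merging idea can be pictured with a small sketch. This is not the authors' implementation: it uses a flat table keyed by a normalized state string instead of a full tree, the `canonical` helper is a hypothetical stand-in for semantic state matching, and the final success flag stands in for the Process Reward Model. It only illustrates why pooling identical states yields one consistent preference per state.

```python
def canonical(state: str) -> str:
    # Hypothetical stand-in for semantic matching: normalize case and spacing.
    return " ".join(state.lower().split())


def merge_states(trajectories):
    """Pool outcomes per (state, action); each trajectory is (steps, success)."""
    nodes = {}  # canonical state -> {action: [success flags]}
    for steps, success in trajectories:
        for state, action in steps:
            nodes.setdefault(canonical(state), {}).setdefault(action, []).append(success)
    return nodes


def preference_pairs(nodes):
    """At each merged state, prefer the action with the higher mean success."""
    pairs = []
    for state, actions in nodes.items():
        scored = sorted(((sum(v) / len(v), a) for a, v in actions.items()), reverse=True)
        if len(scored) > 1 and scored[0][0] > scored[-1][0]:
            pairs.append((state, scored[0][1], scored[-1][1]))
    return pairs


# Illustrative trajectories (made-up states and actions).
trajectories = [
    ([("Home Page", "click_search"), ("Results", "open_item_1")], True),
    ([("home  page", "click_ads"), ("Ad Page", "back")], False),
    ([("home page", "click_search"), ("results", "open_item_2")], False),
]
pairs = preference_pairs(merge_states(trajectories))
print(pairs)
```

Without merging, `click_search` from the home page would be labeled good in one trajectory and bad in another; pooling gives a single averaged score per action at that state.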

[AI-4] Queen Detection in Beehives via Environmental Sensor Fusion for Low-Power Edge Computing

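Quick read: This paper addresses the difficulty of detecting queen presence when monitoring honeybee colony health. Traditional methods rely on manual inspections, which are labor-intensive, disruptive, and hard to scale. To overcome the high power consumption, complex preprocessing, and sensitivity to ambient noise of existing audio-based approaches, the authors propose a lightweight multimodal queen detection system. Its key idea is to fuse environmental sensor data (temperature, humidity, and pressure differentials between the inside and outside of the hive) and to run quantized decision tree inference on a commercial STM32 microcontroller, enabling real-time detection with low-power edge computing. Experiments show over 99% detection accuracy from environmental inputs alone, with audio features adding no significant gain, which supports the scalability and sustainability of the approach in resource-constrained settings.

Link: https://arxiv.org/abs/2509.14061
Authors: Chiara De Luca,Elisa Donati
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: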

Abstract:Queen bee presence is essential for the health and stability of honeybee colonies, yet current monitoring methods rely on manual inspections that are labor-intensive, disruptive, and impractical for large-scale beekeeping. While recent audio-based approaches have shown promise, they often require high power consumption and complex preprocessing, and are susceptible to ambient noise. To overcome these limitations, we propose a lightweight, multimodal system for queen detection based on environmental sensor fusion, specifically temperature, humidity, and pressure differentials between the inside and outside of the hive. Our approach employs quantized decision tree inference on a commercial STM32 microcontroller, enabling real-time, low-power edge computing without compromising accuracy. We show that our system achieves over 99% queen detection accuracy using only environmental inputs, with audio features offering no significant performance gain. This work presents a scalable and sustainable solution for non-invasive hive monitoring, paving the way for autonomous, precision beekeeping using off-the-shelf, energy-efficient hardware.

[AI-5] Comprehensive Evaluation of CNN-Based Audio Tagging Models on Resource-Constrained Devices

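Quick read: This paper addresses the computational efficiency and thermal management problems of deploying Convolutional Neural Networks (CNNs) for audio tagging on resource-constrained devices such as the Raspberry Pi. The key to its solution is a comprehensive evaluation of multiple CNN architectures: the 1D and 2D models from the Pretrained Audio Neural Networks (PANNs) framework, a ConvNeXt-based audio classification model, MobileNetV3 architectures, and two recently proposed PANNs-derived models, CNN9 and CNN13. All models are converted to the Open Neural Network Exchange (ONNX) format for portability across platforms, and performance stability is assessed through continuous 24-hour inference sessions. The results show that, with appropriate model selection and optimization, inference latency stays consistent and thermal behavior remains manageable over long runs, which offers a practical path for deploying audio tagging models in real-world edge computing.

Link: https://arxiv.org/abs/2509.14049
Authors: Jordi Grau-Haro,Ruben Ribes-Serrano,Javier Naranjo-Alcazar,Marta Garcia-Ballesteros,Pedro Zuccarello
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted at Computing Conference 2026, London, UK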

Abstract:Convolutional Neural Networks (CNNs) have demonstrated exceptional performance in audio tagging tasks. However, deploying these models on resource-constrained devices like the Raspberry Pi poses challenges related to computational efficiency and thermal management. In this paper, a comprehensive evaluation of multiple convolutional neural network (CNN) architectures for audio tagging on the Raspberry Pi is conducted, encompassing all 1D and 2D models from the Pretrained Audio Neural Networks (PANNs) framework, a ConvNeXt-based model adapted for audio classification, as well as MobileNetV3 architectures. In addition, two PANNs-derived networks, CNN9 and CNN13, recently proposed, are also evaluated. To enhance deployment efficiency and portability across diverse hardware platforms, all models are converted to the Open Neural Network Exchange (ONNX) format. Unlike previous works that focus on a single model, our analysis encompasses a broader range of architectures and involves continuous 24-hour inference sessions to assess performance stability. Our experiments reveal that, with appropriate model selection and optimization, it is possible to maintain consistent inference latency and manage thermal behavior effectively over extended periods. These findings provide valuable insights for deploying audio tagging models in real-world edge computing scenarios.

[AI-6] Prompt2Auto: From Motion Prompt to Automated Control via Geometry-Invariant One-Shot Gaussian Process Learning

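Quick read: This paper addresses two key problems of conventional Learning from Demonstration (LfD) for robots: the need for large datasets and poor generalization across coordinate transformations such as translation, rotation, and scaling. The authors propose Prompt2Auto, a geometry-invariant one-shot Gaussian process (GeoGP) learning framework. Its core innovation is a dataset-construction strategy based on coordinate transformations that enforces invariance to translation, rotation, and scaling, so that a skill can be learned from a single motion prompt while supporting multi-step prediction and multi-skill autonomy. The approach significantly reduces the human demonstration burden and improves generalization across tasks.

Link: https://arxiv.org/abs/2509.14040
Authors: Zewen Yang,Xiaobing Dai,Dongfa Zhang,Yu Li,Ziyang Meng,Bingkun Huang,Hamid Sadeghian,Sami Haddadin
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: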

Abstract:Learning from demonstration allows robots to acquire complex skills from human demonstrations, but conventional approaches often require large datasets and fail to generalize across coordinate transformations. In this paper, we propose Prompt2Auto, a geometry-invariant one-shot Gaussian process (GeoGP) learning framework that enables robots to perform human-guided automated control from a single motion prompt. A dataset-construction strategy based on coordinate transformations is introduced that enforces invariance to translation, rotation, and scaling, while supporting multi-step predictions. Moreover, GeoGP is robust to variations in the user’s motion prompt and supports multi-skill autonomy. We validate the proposed approach through numerical simulations with the designed user graphical interface and two real-world robotic experiments, which demonstrate that the proposed method is effective, generalizes across tasks, and significantly reduces the demonstration burden. Project page is available at: this https URL

[AI-7] CrowdAgent: Multi-Agent Managed Multi-Source Annotation System

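Quick read: This paper addresses the lack of unified end-to-end process control when multiple annotation sources (Large Language Models (LLMs), Small Language Models (SLMs), and human experts) annotate Natural Language Processing (NLP) data together, and in particular the question of how to schedule these sources dynamically to balance quality and cost. The key to its solution is CrowdAgent, a multi-agent framework that integrates task assignment, data annotation, and quality/cost management, so that the different annotation sources are scheduled collaboratively and tasks are assigned rationally within a single architecture.

Link: https://arxiv.org/abs/2509.14030
Authors: Maosheng Qin,Renyu Zhu,Mingxuan Xia,Chenkai Chen,Zhen Zhu,Minmin Lin,Junbo Zhao,Lu Xu,Changjie Fan,Runze Wu,Haobo Wang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: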

Abstract:High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources, including Large Language Models (LLMs), Small Language Models (SLMs), and human experts, they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at this https URL.

[AI-8] RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing

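Quick read: This paper addresses text-guided audio editing, that is, locating and modifying target content in an audio signal according to a text prompt while leaving the rest unchanged. Existing methods typically rely on full captions or costly optimization, and they struggle with complex scenarios or lack practicality. The key to the solution is an efficient end-to-end diffusion framework based on rectified flow matching, together with a dataset of overlapping multi-event audio that supports training and benchmarking in complex scenarios. The method achieves faithful semantic alignment without auxiliary captions or masks while maintaining competitive editing quality across metrics.

Link: https://arxiv.org/abs/2509.14003
Authors: Liting Gao,Yi Yuan,Yaru Chen,Yuelan Cheng,Zhenbo Li,Juan Wen,Shubin Zhang,Wenwu Wang
Affiliation: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: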

Abstract:Diffusion models have shown remarkable progress in text-to-audio generation. However, text-guided audio editing remains in its early stages. This task focuses on modifying the target content within an audio signal while preserving the rest, thus demanding precise localization and faithful editing according to the text prompt. Existing training-based and zero-shot methods that rely on full-caption or costly optimization often struggle with complex editing or lack practicality. In this work, we propose a novel end-to-end efficient rectified flow matching-based diffusion framework for audio editing, and construct a dataset featuring overlapping multi-event audio to support training and benchmarking in complex scenarios. Experiments show that our model achieves faithful semantic alignment without requiring auxiliary captions or masks, while maintaining competitive editing quality across metrics.

[AI-9] Differential Privacy in Federated Learning: Mitigating Inference Attacks with Randomized Response

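Quick read: This paper addresses data privacy leakage caused by model transmission in federated learning, where attackers can run inference attacks on locally trained models to approximate clients' original training data. The key to its solution is to apply differential privacy by perturbing client data with the Randomized Response technique and to measure the resulting loss in model performance. The study finds that as the privacy budget ε decreases, model accuracy drops and class predictions become imbalanced, which shows a trade-off between privacy and model performance that requires careful parameter tuning in practice.

Link: https://arxiv.org/abs/2509.13987
Authors: Ozer Ozturk,Busra Buyuktanir,Gozde Karatas Baydogmus,Kazim Yildiz
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: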

Abstract:Machine learning models used for distributed architectures consisting of servers and clients require large amounts of data to achieve high accuracy. Data obtained from clients are collected on a central server for model training. However, storing data on a central server raises concerns about security and privacy. To address this issue, a federated learning architecture has been proposed. In federated learning, each client trains a local model using its own data. The trained models are periodically transmitted to the central server. The server then combines the received models using federated aggregation algorithms to obtain a global model. This global model is distributed back to the clients, and the process continues in a cyclical manner. Although preventing data from leaving the clients enhances security, certain concerns still remain. Attackers can perform inference attacks on the obtained models to approximate the training dataset, potentially causing data leakage. In this study, differential privacy was applied to address the aforementioned security vulnerability, and a performance analysis was conducted. The Data-Unaware Classification Based on Association (duCBA) algorithm was used as the federated aggregation method. Differential privacy was implemented on the data using the Randomized Response technique, and the trade-off between security and performance was examined under different epsilon values. As the epsilon value decreased, the model accuracy declined, and class prediction imbalances were observed. This indicates that higher levels of privacy do not always lead to practical outcomes and that the balance between security and performance must be carefully considered.
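The abstract does not say how Randomized Response was applied to the data, so the following is only a minimal sketch of the classic binary mechanism: each bit is reported truthfully with probability e^ε / (1 + e^ε) and flipped otherwise, which satisfies ε-differential privacy for that bit. The attribute, the sample sizes, and the function names are assumptions for illustration; the duCBA aggregation step is not modeled.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_rate(reports, epsilon):
    """Unbiased estimate of the true fraction of ones from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = rate * (2p - 1) + (1 - p), solved for rate.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000  # hypothetical binary attribute, true rate 0.3
estimates = {}
for eps in (0.5, 2.0):
    reports = [randomized_response(b, eps, rng) for b in true_bits]
    estimates[eps] = estimate_rate(reports, eps)
    print(eps, round(estimates[eps], 3))
```

Smaller ε pushes the truth probability toward 1/2, so the denominator 2p − 1 shrinks and the estimate gets noisier. This is the same privacy versus utility trade-off the study reports for its classifier.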

[AI-10] LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology

Quick read: This paper addresses the difficulty of analyzing complex provenance data from large-scale scientific workflows. Traditional approaches such as custom scripts, structured queries, or static dashboards limit data interaction and insight. The key to its solution is a lightweight, metadata-driven design in which interactive Large Language Model (LLM) agents translate natural language into structured provenance queries, combined with a modular architecture, prompt tuning, and Retrieval-Augmented Generation (RAG) to deliver accurate and insightful analysis at runtime.

Link: https://arxiv.org/abs/2509.13978
Authors: Renan Souza,Timothy Poteet,Brian Etz,Daniel Rosendo,Amal Gueroudji,Woong Shin,Prasanna Balaprakash,Rafael Ferreira da Silva
Affiliation: unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: Paper accepted in the proceedings of the ACM/IEEE Supercomputing Conference (SC). Cite it as Renan Souza, Timothy Poteet, Brian Etz, Daniel Rosendo, Amal Gueroudji, Woong Shin, Prasanna Balaprakash, and Rafael Ferreira da Silva. 2025. LLM Agents for Interactive Workflow Provenance: Reference Architecture and Evaluation Methodology. In SC Workshops (WORKS)

Abstract:Modern scientific discovery increasingly relies on workflows that process data across the Edge, Cloud, and High Performance Computing (HPC) continuum. Comprehensive and in-depth analyses of these data are critical for hypothesis validation, anomaly detection, reproducibility, and impactful findings. Although workflow provenance techniques support such analyses, at large scale, the provenance data become complex and difficult to analyze. Existing systems depend on custom scripts, structured queries, or static dashboards, limiting data interaction. In this work, we introduce an evaluation methodology, reference architecture, and open-source implementation that leverages interactive Large Language Model (LLM) agents for runtime data analysis. Our approach uses a lightweight, metadata-driven design that translates natural language into structured provenance queries. Evaluations across LLaMA, GPT, Gemini, and Claude, covering diverse query classes and a real-world chemistry workflow, show that modular design, prompt tuning, and Retrieval-Augmented Generation (RAG) enable accurate and insightful LLM agent responses beyond recorded provenance.

[AI-11] Ensemble of Pre-Trained Models for Long-Tailed Trajectory Prediction ITSC2025

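Quick read: This paper addresses ensemble modeling for vehicle trajectory prediction, a multidimensional regression problem: how to combine the strengths of several state-of-the-art deep learning predictors without costly retraining. The key to its solution is a simple confidence-weighted average that requires no fine-tuning or retraining. Combining the outputs of multiple state-of-the-art models improves performance by 10% over the best single model on the NuScenes and Argoverse datasets, especially on long-tailed metrics, and the improvement holds across the dataset distribution.

Link: https://arxiv.org/abs/2509.13914
Authors: Divya Thuremella,Yi Yang,Simon Wanna,Lars Kunze,Daniele De Martini
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 2025 IEEE International Conference on Intelligent Transportation Systems (ITSC 2025)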

Abstract:This work explores the application of ensemble modeling to the multidimensional regression problem of trajectory prediction for vehicles in urban environments. As newer and bigger state-of-the-art prediction models for autonomous driving continue to emerge, an important open challenge is the problem of how to combine the strengths of these big models without the need for costly re-training. We show how, perhaps surprisingly, combining state-of-the-art deep learning models out-of-the-box (without retraining or fine-tuning) with a simple confidence-weighted average method can enhance the overall prediction. Indeed, while combining trajectory prediction models is not straightforward, this simple approach enhances performance by 10% over the best prediction model, especially in the long-tailed metrics. We show that this performance improvement holds on both the NuScenes and Argoverse datasets, and that these improvements are made across the dataset distribution. The code for our work is open source.
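A confidence-weighted average of trajectories can be sketched in a few lines. The abstract does not specify how confidences are obtained or how multi-modal predictions are matched, so this sketch assumes one trajectory per model, a shared prediction horizon, and a scalar confidence per model. All names and numbers are illustrative.

```python
def confidence_weighted_average(trajectories, confidences):
    """Fuse K trajectories (lists of (x, y) waypoints) into one weighted mean."""
    total = sum(confidences)
    horizon = len(trajectories[0])
    fused = []
    for t in range(horizon):
        # Weighted mean of the K predicted waypoints at time step t.
        x = sum(c * traj[t][0] for traj, c in zip(trajectories, confidences)) / total
        y = sum(c * traj[t][1] for traj, c in zip(trajectories, confidences)) / total
        fused.append((x, y))
    return fused


# Two hypothetical model outputs over a three-step horizon.
model_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
model_b = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
fused = confidence_weighted_average([model_a, model_b], confidences=[3.0, 1.0])
print(fused)
```

Because the fusion only needs each model's output and a confidence score, no model has to be retrained, which matches the out-of-the-box setting the abstract describes.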

[AI-12] FedSSG: Expectation-Gated and History-Aware Drift Alignment for Federated Learning

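Quick read: This paper addresses client drift and inconsistent local optima in federated learning caused by non-IID data and partial client participation, which lead to unstable convergence and accuracy loss. The key to its solution is FedSSG, a stochastic sampling-guided, history-aware drift alignment method. Each client keeps a lightweight drift memory that accumulates local model differences as a sketch of historical gradients, and a phase-by-expectation signal derived from the server sampler smoothly gates both the memory update and the weight of the local alignment term. The gate stays weak and smooth early on, when sampling noise dominates, and strengthens once participation statistics stabilize, which narrows the local-global gap without extra communication.

Link: https://arxiv.org/abs/2509.13895
Authors: Zhanting Zhou,Jinshan Lai,Fengchun Zhang,Zeqin Wu,Fengli Zhang
Affiliation: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 4-page main text for conference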

Abstract:Non-IID data and partial participation induce client drift and inconsistent local optima in federated learning, causing unstable convergence and accuracy loss. We present FedSSG, a stochastic sampling-guided, history-aware drift alignment method. FedSSG maintains a per-client drift memory that accumulates local model differences as a lightweight sketch of historical gradients; crucially, it gates both the memory update and the local alignment term by a smooth function of the observed/expected participation ratio (a phase-by-expectation signal derived from the server sampler). This statistically grounded gate stays weak and smooth when sampling noise dominates early, then strengthens once participation statistics stabilize, contracting the local-global gap without extra communication. Across CIFAR-10/100 with 100/500 clients and 2-15 percent participation, FedSSG consistently outperforms strong drift-aware baselines and accelerates convergence; on our benchmarks it improves test accuracy by up to a few points (e.g., about +0.9 on CIFAR-10 and about +2.7 on CIFAR-100 on average over the top-2 baseline) and yields about 4.5x faster target-accuracy convergence on average. The method adds only O(d) client memory and a constant-time gate, and degrades gracefully to a mild regularizer under near-IID or uniform sampling. FedSSG shows that sampling statistics can be turned into a principled, history-aware phase control to stabilize and speed up federated training.

[AI-13] Synthetic Data Generation for Screen Time and App Usage

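Quick read: This paper addresses the challenges of collecting large-scale, real-world smartphone usage data, including high costs, privacy concerns, unrepresentative samples, and non-response bias, all of which can skew results. The key to its solution is to use Large Language Models (LLMs) to generate synthetic smartphone usage data that replaces or supplements real data collection. The study focuses on prompt strategies of varying detail (how thoroughly the prompt describes a user persona and the expected result characteristics, and whether it includes an initial real usage example) to improve the quality and behavioral plausibility of the generated data. Experiments suggest that with detailed prompts LLMs can generate structured, behaviorally plausible synthetic datasets suitable for some use cases, while balancing data fidelity against diversity needs further study.

Link: https://arxiv.org/abs/2509.13892
Authors: Gustavo Kruger,Nikhil Sachdeva,Michael Sobolev
Affiliation: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 14 pages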

Abstract:Smartphone usage data can provide valuable insights for understanding interaction with technology and human behavior. However, collecting large-scale, in-the-wild smartphone usage logs is challenging due to high costs, privacy concerns, unrepresentative user samples, and biases like non-response that can skew results. These challenges call for exploring alternative approaches to obtain smartphone usage datasets. In this context, large language models (LLMs) such as OpenAI’s ChatGPT present a novel approach for synthetic smartphone usage data generation, addressing limitations of real-world data collection. We describe a case study on how four prompt strategies influenced the quality of generated smartphone usage data. We contribute insights on prompt design and measures of data quality, reporting a prompting strategy comparison combining two factors: prompt level of detail (describing a user persona, describing the expected results characteristics) and seed data inclusion (with versus without an initial real usage example). Our findings suggest that using LLMs to generate structured and behaviorally plausible smartphone use datasets is feasible for some use cases, especially when using detailed prompts. Challenges remain in capturing diverse nuances of human behavioral patterns in a single synthetic dataset, and evaluating tradeoffs between data fidelity and diversity, suggesting the need for use-case-specific evaluation metrics and future research with more diverse seed data and different LLMs.

[AI-14] An Exhaustive DPLL Approach to Model Counting over Integer Linear Constraints with Simplification Techniques

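【Quick read】: This paper targets model counting over integer linear constraints (MCILC), a fundamental task in computer science, operations research, and optimization. The key to the solution is an exact algorithm built on an exhaustive DPLL (Davis-Putnam-Logemann-Loveland) architecture, augmented with several effective simplification techniques from mixed integer programming (MIP) to improve efficiency. Experiments show that the method significantly outperforms all exact methods on random benchmarks and is the only approach that solves all 4131 application instances.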

Link: https://arxiv.org/abs/2509.13880
Authors: Mingwei Zhang, Zhenhao Gu, Liangda Fang, Cunjing Ge, Ziliang Chen, Zhao-Rong Lai, Quanlong Guan
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: none

Abstract:Linear constraints are one of the most fundamental constraints in fields such as computer science, operations research and optimization. Many applications reduce to the task of model counting over integer linear constraints (MCILC). In this paper, we design an exact approach to MCILC based on an exhaustive DPLL architecture. To improve the efficiency, we integrate several effective simplification techniques from mixed integer programming into the architecture. We compare our approach to state-of-the-art MCILC counters and propositional model counters on 2840 random and 4131 application benchmarks. Experimental results show that our approach significantly outperforms all exact methods in random benchmarks solving 1718 instances while the state-of-the-art approach only computes 1470 instances. In addition, our approach is the only approach to solve all 4131 application instances.
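To make the idea of exhaustive search over integer linear constraints concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes every variable has finite inclusive bounds, every constraint has the form `sum(a[i] * x[i]) <= b`, and the only "simplification" is interval reasoning that prunes violated branches and counts fully satisfied boxes in one step. The names `lhs_range` and `count_models` are illustrative.

```python
from typing import List, Tuple

Bounds = List[Tuple[int, int]]       # inclusive (low, high) per variable
Constraint = Tuple[List[int], int]   # (a, b) meaning sum(a[i] * x[i]) <= b


def lhs_range(coeffs: List[int], bounds: Bounds) -> Tuple[int, int]:
    """Smallest and largest value of sum(a[i] * x[i]) inside the box."""
    low = sum(a * (lo if a >= 0 else hi) for a, (lo, hi) in zip(coeffs, bounds))
    high = sum(a * (hi if a >= 0 else lo) for a, (lo, hi) in zip(coeffs, bounds))
    return low, high


def count_models(constraints: List[Constraint], bounds: Bounds) -> int:
    """Count integer points in the box that satisfy every constraint."""
    active = []
    for coeffs, b in constraints:
        low, high = lhs_range(coeffs, bounds)
        if low > b:       # violated everywhere in the box: prune
            return 0
        if high > b:      # satisfied only in part of the box: keep it
            active.append((coeffs, b))
    if not active:        # every point in the box is a model
        total = 1
        for lo, hi in bounds:
            total *= hi - lo + 1
        return total
    # Branch: split the domain of the first non-fixed variable in half.
    i = next(k for k, (lo, hi) in enumerate(bounds) if lo < hi)
    lo, hi = bounds[i]
    mid = (lo + hi) // 2
    left = bounds[:i] + [(lo, mid)] + bounds[i + 1:]
    right = bounds[:i] + [(mid + 1, hi)] + bounds[i + 1:]
    return count_models(active, left) + count_models(active, right)
```

For example, `count_models([([1, 1], 2)], [(0, 3), (0, 3)])` counts the pairs with `x + y <= 2` in the box `[0, 3] x [0, 3]`. Counting a whole box at once when no constraint is still undecided is what keeps the search from enumerating every point individually.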

[AI-15] Masked Diffusion Models as Energy Minimization

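【Quick read】: This paper addresses the unclear theoretical foundations and low sampling efficiency of masked diffusion models (MDMs). Its core contribution is a systematic theoretical framework that interprets MDMs as solutions to energy minimization problems in discrete optimal transport and proves that three energy formulations (kinetic, conditional kinetic, and geodesic) are mathematically equivalent under the structure of MDMs. The key to the solution is showing that MDMs minimize all three energies when the mask schedule satisfies a closed-form optimality condition. Parameterizing interpolation schedules via Beta distributions then reduces the schedule design space to a tractable 2D search, which enables efficient post-training tuning without model modification and improves performance in low-step sampling settings.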

Link: https://arxiv.org/abs/2509.13866
Authors: Sitong Chen, Shen Nie, Jiacheng Sun, Zijin Feng, Zhenguo Li, Ji-Rong Wen, Chongxuan Li
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none

Abstract:We present a systematic theoretical framework that interprets masked diffusion models (MDMs) as solutions to energy minimization problems in discrete optimal transport. Specifically, we prove that three distinct energy formulations–kinetic, conditional kinetic, and geodesic energy–are mathematically equivalent under the structure of MDMs, and that MDMs minimize all three when the mask schedule satisfies a closed-form optimality condition. This unification not only clarifies the theoretical foundations of MDMs, but also motivates practical improvements in sampling. By parameterizing interpolation schedules via Beta distributions, we reduce the schedule design space to a tractable 2D search, enabling efficient post-training tuning without model modification. Experiments on synthetic and real-world benchmarks demonstrate that our energy-inspired schedules outperform hand-crafted baselines, particularly in low-step sampling settings.
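The two-parameter schedule family mentioned in the abstract can be sketched as follows. This is an illustration only: it assumes the schedule is the Beta cumulative distribution function evaluated at normalized time, read as the fraction of tokens unmasked so far, which may differ from the paper's exact parameterization. `beta_cdf` and `unmask_schedule` are hypothetical names, and the CDF is computed with a simple midpoint rule so the block needs only the standard library (it is meant for `a, b >= 1`).

```python
import math
from typing import List


def beta_cdf(t: float, a: float, b: float, n: int = 2000) -> float:
    """Beta(a, b) CDF at t, via midpoint-rule integration of the density."""
    if t <= 0.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    h = t / n
    total = 0.0
    for k in range(n):
        x = (k + 0.5) * h
        total += x ** (a - 1) * (1.0 - x) ** (b - 1)
    return norm * total * h


def unmask_schedule(num_steps: int, a: float, b: float) -> List[float]:
    """Fraction of tokens unmasked after each of num_steps sampling steps."""
    return [beta_cdf(i / num_steps, a, b) for i in range(num_steps + 1)]
```

With `a = b = 1` this reduces to a linear schedule. Because the whole family is indexed by just `(a, b)`, tuning it after training amounts to a small grid search over two numbers, scored by whatever sample-quality metric is available.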

[AI-16] Understanding the Process of Human-AI Value Alignment

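【Quick read】: This paper addresses the imprecise use of the term "value alignment" in artificial intelligence (AI) research and clarifies its meaning and scope through a systematic literature review. The key to the solution is a thematic analysis of 172 related articles that yields six themes: value alignment drivers and approaches; challenges in value alignment; values in value alignment; cognitive processes in humans and AI; human-agent teaming; and designing and developing value-aligned systems. Based on these themes, the authors propose a more precise definition: value alignment is an ongoing process between humans and autonomous agents that aims to express and implement abstract values in diverse contexts, while managing the cognitive limits of both humans and AI and balancing the conflicting ethical and political demands generated by the values of different groups.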

Link: https://arxiv.org/abs/2509.13854
Authors: Jack McKinlay, Marina De Vos, Janina A. Hoffmann, Andreas Theodorou
Affiliations: unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 39 pages, 7 figures

Abstract:Background: Value alignment in computer science research is often used to refer to the process of aligning artificial intelligence with humans, but the way the phrase is used often lacks precision. Objectives: In this paper, we conduct a systematic literature review to advance the understanding of value alignment in artificial intelligence by characterising the topic in the context of its research literature. We use this to suggest a more precise definition of the term. Methods: We analyse 172 value alignment research articles that have been published in recent years and synthesise their content using thematic analyses. Results: Our analysis leads to six themes: value alignment drivers and approaches; challenges in value alignment; values in value alignment; cognitive processes in humans and AI; human-agent teaming; and designing and developing value-aligned systems. Conclusions: By analysing these themes in the context of the literature we define value alignment as an ongoing process between humans and autonomous agents that aims to express and implement abstract values in diverse contexts, while managing the cognitive limits of both humans and AI agents and also balancing the conflicting ethical and political demands generated by the values in different groups. Our analysis gives rise to a set of research challenges and opportunities in the field of value alignment for future work.

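[AI-17] Towards a Physics Foundation Model

【Quick read】: This paper addresses a key limitation of current physics-aware machine learning methods: models are typically confined to a single, narrow physics domain and must be retrained for each new system, which hinders cross-domain generalization and efficient deployment. The authors propose the General Physics Transformer (GPhyT), whose key innovation is using a transformer to learn governing dynamics directly from 1.8 TB of diverse simulation data. Without being given the underlying equations, it infers the dynamics from context (in-context learning) and simulates fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics. The method achieves zero-shot generalization and stable long-term predictions (50-timestep rollouts), suggesting that a universal Physics Foundation Model (PFM) is achievable and offering a scalable, general modeling paradigm for computational science and engineering.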

Link: https://arxiv.org/abs/2509.13805
Authors: Florian Wiesner, Matthias Wessling, Stephen Baek
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: none

Abstract:Foundation models have revolutionized natural language processing through a “train once, deploy anywhere” paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative – democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by up to 29x, (2) zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) stable long-term predictions through 50-timestep rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.

[AI-18] Who is Introducing the Failure? Automatically Attributing Failures of Multi-Agent Systems via Spectrum Analysis

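【Quick read】: This paper addresses failure attribution in large language model powered multi-agent systems (MASs), that is, pinpointing the specific agent actions responsible for a failure during complex task execution. The area lacks efficient, automated attribution methods, so debugging and improvement remain difficult and dependent on manual analysis. The key to the solution is FAMAS, a spectrum-based failure attribution approach: it performs systematic trajectory replay and abstraction, then applies a novel suspiciousness formula tailored to MASs that combines two factor groups, the agent behavior group (agent activation patterns) and the action behavior group (action activation patterns), to estimate the likelihood that each agent action is responsible for the failure. Experiments show that FAMAS outperforms all 12 baselines on the Who and When benchmark.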

Link: https://arxiv.org/abs/2509.13782
Authors: Yu Ge (1), Linna Xie (1), Zhong Li (1), Yu Pei (2), Tian Zhang (1)
Affiliations: (1) Nanjing University; (2) The Hong Kong Polytechnic University
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: 20 pages, 6 figures

Abstract:Large Language Model Powered Multi-Agent Systems (MASs) are increasingly employed to automate complex real-world problems, such as programming and scientific discovery. Despite their promise, MASs are not without their flaws. However, failure attribution in MASs - pinpointing the specific agent actions responsible for failures - remains underexplored and labor-intensive, posing significant challenges for debugging and system improvement. To bridge this gap, we propose FAMAS, the first spectrum-based failure attribution approach for MASs, which operates through systematic trajectory replay and abstraction, followed by spectrum analysis. The core idea of FAMAS is to estimate, from variations across repeated MAS executions, the likelihood that each agent action is responsible for the failure. In particular, we propose a novel suspiciousness formula tailored to MASs, which integrates two key factor groups, namely the agent behavior group and the action behavior group, to account for the agent activation patterns and the action activation patterns within the execution trajectories of MASs. Through extensive evaluations against 12 baselines on the Who and When benchmark, FAMAS demonstrates superior performance by outperforming all the methods in comparison.
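The spectrum-analysis step can be illustrated with a classic suspiciousness score from spectrum-based fault localization. The sketch below uses the well-known Ochiai formula, not the tailored FAMAS formula, which the abstract does not spell out. It assumes each replayed execution has already been abstracted into a set of action labels plus a pass/fail flag; the function name and the labels in the usage note are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple


def ochiai_ranking(runs: Iterable[Tuple[Set[str], bool]]) -> List[Tuple[str, float]]:
    """Rank actions by Ochiai suspiciousness.

    runs: (actions observed in the run, True if the run failed).
    Ochiai(action) = ef / sqrt(total_failed * (ef + ep)), where ef and ep
    are the numbers of failed and passed runs containing the action.
    """
    runs = list(runs)
    total_failed = sum(1 for _, failed in runs if failed)
    ef, ep = Counter(), Counter()
    for actions, failed in runs:
        for action in set(actions):
            (ef if failed else ep)[action] += 1
    scores = {}
    for action in set(ef) | set(ep):
        denom = math.sqrt(total_failed * (ef[action] + ep[action]))
        scores[action] = ef[action] / denom if denom else 0.0
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Given three replays where `"coder:write"` appears only in the two failing runs, that action ranks first. An action that shows up equally often in passing and failing runs is pushed down the list, which is the intuition spectrum-based attribution relies on.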

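[AI-19] MIRA: Empowering One-Touch AI Services on Smartphones with MLLM-based Instruction Recommendation ACL2025

【Quick read】: This paper addresses the cumbersome, inefficient interaction users face when accessing predefined AI services on smartphones, and proposes task instruction recommendation to enable intuitive one-touch AI tasking. The key to the solution is the MIRA framework, which has three core innovations: a multimodal large language model (MLLM)-based recommendation pipeline with structured reasoning to extract key entities, infer user intent, and generate precise instructions; a template-augmented reasoning mechanism that integrates high-level reasoning templates to improve task inference accuracy; and a prefix-tree-based constrained decoding strategy that restricts outputs to predefined instruction candidates, keeping suggestions coherent and intent-aligned.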

Link: https://arxiv.org/abs/2509.13773
Authors: Zhipeng Bian, Jieming Zhu, Xuyang Xie, Quanyu Dai, Zhou Zhao, Zhenhua Dong
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Published in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), ACL 2025. Official version: this https URL

Abstract:The rapid advancement of generative AI technologies is driving the integration of diverse AI-powered services into smartphones, transforming how users interact with their devices. To simplify access to predefined AI services, this paper introduces MIRA, a pioneering framework for task instruction recommendation that enables intuitive one-touch AI tasking on smartphones. With MIRA, users can long-press on images or text objects to receive contextually relevant instruction recommendations for executing AI tasks. Our work introduces three key innovations: 1) A multimodal large language model (MLLM)-based recommendation pipeline with structured reasoning to extract key entities, infer user intent, and generate precise instructions; 2) A template-augmented reasoning mechanism that integrates high-level reasoning templates, enhancing task inference accuracy; 3) A prefix-tree-based constrained decoding strategy that restricts outputs to predefined instruction candidates, ensuring coherent and intent-aligned suggestions. Through evaluation using a real-world annotated dataset and a user study, MIRA has demonstrated substantial improvements in the accuracy of instruction recommendation. The encouraging results highlight MIRA’s potential to revolutionize the way users engage with AI services on their smartphones, offering a more seamless and efficient experience.
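Prefix-tree constrained decoding, the third innovation listed above, can be sketched in a few lines. This is a generic illustration, not MIRA's implementation: candidates are assumed to be pre-tokenized lists of strings, decoding is greedy, and `score_fn` is a hypothetical stand-in for the model's next-token score given the tokens emitted so far.

```python
from typing import Callable, Dict, List, Sequence

END = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates: Sequence[Sequence[str]]) -> Dict:
    """Build a nested-dict prefix tree over tokenized candidate instructions."""
    root: Dict = {}
    for tokens in candidates:
        node = root
        for tok in list(tokens) + [END]:
            node = node.setdefault(tok, {})
    return root


def constrained_greedy_decode(
    trie: Dict, score_fn: Callable[[List[str], str], float]
) -> List[str]:
    """Greedy decoding in which only tokens that extend some candidate are allowed."""
    node, output = trie, []
    while True:
        allowed = list(node.keys())  # children of the current prefix
        tok = max(allowed, key=lambda t: score_fn(output, t))
        if tok == END:
            return output
        output.append(tok)
        node = node[tok]
```

Because every step chooses among the children of the current trie node, the decoded sequence is always one of the predefined candidates, no matter how highly the scorer rates an out-of-vocabulary continuation.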

[AI-20] Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning ICSE2026

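【Quick read】: This paper addresses the risk of sensitive memorization in code language models (CLMs): models unintentionally memorize sensitive training data and can reproduce it verbatim when specifically prompted, leaking private information. The paper proposes using machine unlearning to erase sensitive memorization from deployed CLMs efficiently, without full-model retraining. The solution has two key parts: first, quantifying memorization risk in the training data and curating a dataset of 50,000 high-risk sensitive memorized samples as unlearning targets; second, introducing CodeEraser, an advanced variant of gradient ascent-based unlearning that selectively unlearns the sensitive segments in code while preserving the structural integrity and functional correctness of the surrounding code, so that targeted information is erased while model utility is maintained.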

Link: https://arxiv.org/abs/2509.13755
Authors: Zhaoyang Chu, Yao Wan, Zhikun Zhang, Di Wang, Zhou Yang, Hongyu Zhang, Pan Zhou, Xuanhua Shi, Hai Jin, David Lo
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: Accepted at the 48th IEEE/ACM International Conference on Software Engineering (ICSE 2026)

Abstract:While Code Language Models (CLMs) have demonstrated superior performance in software engineering tasks such as code generation and summarization, recent empirical studies reveal a critical privacy vulnerability: these models exhibit unintended memorization of sensitive training data, enabling verbatim reproduction of confidential information when specifically prompted. To address this issue, several approaches, including training data de-duplication and differential privacy augmentation, have been proposed. However, these methods require full-model retraining for deployed CLMs, which incurs substantial computational costs. In this paper, we aim to answer the following research question: Can sensitive information memorized by CLMs be erased effectively and efficiently? We conduct a pioneering investigation into erasing sensitive memorization in CLMs through machine unlearning - a post-hoc modification method that removes specific information from trained models without requiring full retraining. Specifically, we first quantify the memorization risks of sensitive data within CLM training datasets and curate a high-risk dataset of 50,000 sensitive memorized samples as unlearning targets. We study two widely used gradient ascent-based unlearning approaches: the vanilla and constraint-based methods, and introduce CodeEraser, an advanced variant that selectively unlearns sensitive memorized segments in code while preserving the structural integrity and functional correctness of the surrounding code. Extensive experiments on three families of CLMs, i.e., CodeParrot, CodeGen-Mono, and Qwen2.5-Coder, validate the effectiveness and efficiency of CodeEraser in erasing targeted sensitive memorization while maintaining model utility. 
Related DOI: https://doi.org/10.1145/3744916.3764573
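The idea of unlearning only the sensitive segment of a code sample can be sketched as a masked objective over per-token losses. This is an assumption-laden illustration, not CodeEraser's actual loss: it supposes that per-token negative log-likelihoods and a boolean sensitivity mask are already available, and that minimizing the returned value performs gradient ascent on sensitive tokens and ordinary descent on the rest. `selective_unlearning_objective` and `retain_weight` are hypothetical names.

```python
from typing import Sequence


def selective_unlearning_objective(
    token_losses: Sequence[float],
    sensitive_mask: Sequence[bool],
    retain_weight: float = 1.0,
) -> float:
    """Objective to MINIMIZE for segment-level unlearning (illustrative).

    The mean loss on sensitive tokens enters with a negative sign, so
    minimizing the objective raises that loss (gradient ascent), while the
    mean loss on the surrounding tokens is kept low to preserve the code.
    """
    forget = [l for l, s in zip(token_losses, sensitive_mask) if s]
    retain = [l for l, s in zip(token_losses, sensitive_mask) if not s]
    forget_term = sum(forget) / len(forget) if forget else 0.0
    retain_term = sum(retain) / len(retain) if retain else 0.0
    return -forget_term + retain_weight * retain_term
```

In a real training loop the per-token losses would come from the model's forward pass and the mask from a secret detector. The sketch only shows how the two groups of tokens pull the objective in opposite directions.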

[AI-21] State Space Models over Directed Graphs

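【Quick read】: This paper addresses two challenges that existing graph neural networks (GNNs) and graph Transformers face on directed graphs: capturing long-range causal dependencies carried by directed edges, and balancing accuracy with training efficiency on large-scale graph datasets. The key to the solution is DirEgo2Token, a new approach that sequentializes directed graphs via k-hop ego graphs, and DirGraphSSM, a directed graph neural network architecture built on it that implements state space models (SSMs) on directed graphs via message passing. Together they mark the first systematic extension of SSMs to directed graph learning. Experiments show state-of-the-art performance on three representative directed graph learning tasks and competitive performance on two additional tasks, with 1.5× to 2× training speedups over existing state-of-the-art models.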

Link: https://arxiv.org/abs/2509.13735
Authors: Junzhi She, Xunkai Li, Rong-Hua Li, Guoren Wang
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: currently undergoing review by IEEE Transactions on Big Data

Abstract:Directed graphs are ubiquitous across numerous domains, where the directionality of edges encodes critical causal dependencies. However, existing GNNs and graph Transformers tailored for directed graphs face two major challenges: (1) effectively capturing long-range causal dependencies derived from directed edges; (2) balancing accuracy and training efficiency when processing large-scale graph datasets. In recent years, state space models (SSMs) have achieved substantial progress in causal sequence tasks, and their variants designed for graphs have demonstrated state-of-the-art accuracy while maintaining high efficiency across various graph learning benchmarks. However, existing graph state space models are exclusively designed for undirected graphs, which limits their performance in directed graph learning. To this end, we propose an innovative approach DirEgo2Token which sequentializes directed graphs via k-hop ego graphs. This marks the first systematic extension of state space models to the field of directed graph learning. Building upon this, we develop DirGraphSSM, a novel directed graph neural network architecture that implements state space models on directed graphs via the message-passing mechanism. Experimental results demonstrate that DirGraphSSM achieves state-of-the-art performance on three representative directed graph learning tasks while attaining competitive performance on two additional tasks with 1.5× to 2× training speed improvements compared to existing state-of-the-art models.
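Sequentializing a directed graph through k-hop ego graphs can be pictured with a small breadth-first search. The sketch below is one plausible reading, not the DirEgo2Token procedure itself: it assumes the ego graph of a node consists of its predecessors reachable within k hops along incoming edges, and that the sequence orders hops from farthest to nearest so the center node comes last. Function names are hypothetical.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, List, Tuple


def k_hop_in_ego(
    edges: Iterable[Tuple[Hashable, Hashable]], center: Hashable, k: int
) -> List[List[Hashable]]:
    """Group the predecessors of `center` by hop distance (0..k) along incoming edges."""
    preds: Dict[Hashable, List[Hashable]] = {}
    for src, dst in edges:
        preds.setdefault(dst, []).append(src)
    dist = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond k hops
        for p in preds.get(node, []):
            if p not in dist:
                dist[p] = dist[node] + 1
                queue.append(p)
    hops: List[List[Hashable]] = [[] for _ in range(k + 1)]
    for node, d in dist.items():
        hops[d].append(node)
    return [sorted(h) for h in hops]


def to_sequence(hops: List[List[Hashable]]) -> List[Hashable]:
    """Flatten hop groups from farthest to nearest, ending at the center node."""
    return [node for hop in reversed(hops) for node in hop]
```

Ordering the sequence from distant ancestors to the center mirrors the causal direction of the edges, which is the kind of input a causal sequence model such as an SSM expects.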

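[AI-22] InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management

【Quick read】: This paper addresses five critical challenges in automating complex systems for industrial management: understanding unfamiliar GUI elements, precision and efficiency of operation, state localization, deployment constraints, and safety requirements. The authors propose the InfraMind framework, which tackles them with five modules: systematic search-based exploration with virtual machine snapshots for autonomous understanding of complex graphical user interfaces (GUIs); memory-driven planning for high-precision, efficient task execution; advanced state identification for robust localization in hierarchical interfaces; structured knowledge distillation for efficient deployment with lightweight models; and multi-layered safety mechanisms to safeguard sensitive operations. Experiments on open-source and commercial DCIM platforms show that the approach outperforms existing frameworks in task success rate and operational efficiency, providing a scalable solution for industrial management automation.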

Link: https://arxiv.org/abs/2509.13704
Authors: Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen
Affiliations: unknown
Subjects: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: none

Abstract:Mission-critical industrial infrastructure, such as data centers, increasingly depends on complex management software. Its operations, however, pose significant challenges due to the escalating system complexity, multi-vendor integration, and a shortage of expert operators. While Robotic Process Automation (RPA) offers partial automation through handcrafted scripts, it suffers from limited flexibility and high maintenance costs. Recent advances in Large Language Model (LLM)-based graphical user interface (GUI) agents have enabled more flexible automation, yet these general-purpose agents face five critical challenges when applied to industrial management, including unfamiliar element understanding, precision and efficiency, state localization, deployment constraints, and safety requirements. To address these issues, we propose InfraMind, a novel exploration-based GUI agentic framework specifically tailored for industrial management systems. InfraMind integrates five innovative modules to systematically resolve different challenges in industrial management: (1) systematic search-based exploration with virtual machine snapshots for autonomous understanding of complex GUIs; (2) memory-driven planning to ensure high-precision and efficient task execution; (3) advanced state identification for robust localization in hierarchical interfaces; (4) structured knowledge distillation for efficient deployment with lightweight models; and (5) comprehensive, multi-layered safety mechanisms to safeguard sensitive operations. Extensive experiments on both open-source and commercial DCIM platforms demonstrate that our approach consistently outperforms existing frameworks in terms of task success rate and operational efficiency, providing a rigorous and scalable solution for industrial management automation.

[AI-23] CraftMesh: High-Fidelity Generative Mesh Manipulation via Poisson Seamless Fusion

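【Quick read】: This paper addresses controllable, high-fidelity mesh editing in 3D content creation, where existing generative methods struggle with complex geometries and fail to produce detailed results. The key to the solution is CraftMesh, a framework for high-fidelity mesh manipulation via Poisson Seamless Fusion: it edits a 2D reference image, generates a region-specific 3D mesh, and fuses that mesh into the original model. Two core techniques, Poisson Geometric Fusion and Poisson Texture Harmonization, integrate geometry and texture seamlessly, giving better global consistency and local detail than state-of-the-art methods.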

Link: https://arxiv.org/abs/2509.13688
Authors: James Jincheng, Youcheng Cai, Ligang Liu
Affiliations: unknown
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Controllable, high-fidelity mesh editing remains a significant challenge in 3D content creation. Existing generative methods often struggle with complex geometries and fail to produce detailed results. We propose CraftMesh, a novel framework for high-fidelity generative mesh manipulation via Poisson Seamless Fusion. Our key insight is to decompose mesh editing into a pipeline that leverages the strengths of 2D and 3D generative models: we edit a 2D reference image, then generate a region-specific 3D mesh, and seamlessly fuse it into the original model. We introduce two core techniques: Poisson Geometric Fusion, which utilizes a hybrid SDF/Mesh representation with normal blending to achieve harmonious geometric integration, and Poisson Texture Harmonization for visually consistent texture blending. Experimental results demonstrate that CraftMesh outperforms state-of-the-art methods, delivering superior global consistency and local detail in complex editing tasks.
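The "Poisson" in Poisson fusion refers to solving for values whose second differences match those of the inserted content while agreeing with the host at the seam. The one-dimensional sketch below shows that classic idea on a list of numbers. It is not CraftMesh's hybrid SDF/mesh method; `poisson_fuse_1d`, its arguments, and the Gauss-Seidel solver are illustrative assumptions.

```python
from typing import List, Sequence


def poisson_fuse_1d(
    source: Sequence[float], left: float, right: float, iters: int = 2000
) -> List[float]:
    """Fuse `source` into a host signal whose seam values are `left` and `right`.

    Solves f[i-1] - 2*f[i] + f[i+1] = source[i-1] - 2*source[i] + source[i+1]
    for interior i, with f[0] = left and f[-1] = right, by Gauss-Seidel sweeps.
    Requires len(source) >= 2.
    """
    n = len(source)
    f = [left] + [0.5 * (left + right)] * (n - 2) + [right]
    for _ in range(iters):
        for i in range(1, n - 1):
            lap = source[i - 1] - 2.0 * source[i] + source[i + 1]
            f[i] = 0.5 * (f[i - 1] + f[i + 1] - lap)
    return f
```

The result keeps the shape (the second differences) of the source but shifts and tilts it so that it meets the host exactly at both ends. That is why Poisson-style blending avoids visible seams.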

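[AI-24] Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations

【Quick read】: This paper addresses the sensitivity of code generation models to prompt phrasing: identical requirements expressed with different emotions or communication styles can yield divergent outputs, which undermines stability and trust. The key to the solution is PromptSE (Prompt Sensitivity Evaluation), a framework that creates semantically equivalent prompt variants from emotion and personality templates and evaluates output stability with probability-aware continuous scoring, or with binary pass rates when logits are unavailable. Results are aggregated into a proposed AUC-E metric for cross-model comparison, revealing that performance and stability are largely decoupled optimization objectives and exposing patterns related to architecture and scale.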

Link: https://arxiv.org/abs/2509.13680
Authors: Wei Ma, Yixiao Yang, Jingquan Ge, Xiaofei Xie, Lingxiao Jiang
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Code generation models are widely used in software development, yet their sensitivity to prompt phrasing remains under-examined. Identical requirements expressed with different emotions or communication styles can yield divergent outputs, while most benchmarks emphasize only peak performance. We present PromptSE (Prompt Sensitivity Evaluation), a framework that creates semantically equivalent prompt variants with emotion and personality templates, and that evaluates stability using probability aware continuous scoring or using binary pass rates when logits are unavailable. The results are aggregated into a proposed area under curve metric (AUC-E) for cross model comparison. Across 14 models from three families (Llama, Qwen, and DeepSeek), our study shows that performance and stability behave as largely decoupled optimization objectives, and it reveals architectural and scale related patterns that challenge common assumptions about model robustness. The framework supports rapid screening for closed-source models as well as detailed stability analysis in research settings. PromptSE enables practitioners to quantify performance stability trade offs for deployment and model selection, positioning prompt stability as a complementary evaluation dimension alongside performance and fairness, and contributing to more trustworthy AI-assisted software development tools.

[AI-25] DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring ICRA2026

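【Quick read】: Motivated by the growing risk of mass mortality events for temperature-sensitive shellfish such as oysters as the ocean warms and acidifies, this paper aims to enable long-term, low-cost, wide-area benthic habitat monitoring. Because human-operated underwater monitoring is costly and hazardous, the authors propose DREAM, a Vision Language Model (VLM)-guided autonomy framework. Its core is using the VLM to let an underwater robot perceive its environment and make real-time decisions without prior location information, so that it can efficiently find and explore target objects such as oysters and shipwrecks. In experiments, the framework takes 31.5% less time than the previous baseline in oyster monitoring, and in shipwreck scenes it reaches 100% coverage with 27.5% fewer steps than the vanilla VLM.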

Link: https://arxiv.org/abs/2509.13666
Authors: Zhenqi Wu, Abhinav Modi, Angelos Mavrogiannis, Kaustubh Joshi, Nikhil Chopra, Yiannis Aloimonos, Nare Karapetyan, Ioannis Rekleitis, Xiaomin Lin
Affiliations: unknown
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: submitted to ICRA 2026

Abstract:The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates the development of long-term monitoring systems. However, human labor is costly and long-duration underwater work is highly hazardous, thus favoring robotic solutions as a safer and more efficient option. To enable underwater robots to make real-time, environment-aware decisions without human intervention, we must equip them with an intelligent “brain.” This highlights the need for persistent, wide-area, and low-cost benthic monitoring. To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline with the same amount of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla model and achieving 100% coverage, while the vanilla model achieves 60.23% average coverage in our shipwreck environments.

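[AI-26] GitHub's Copilot Code Review: Can AI Spot Security Flaws Before You Commit?

【Quick read】: This paper examines whether AI-assisted code review tools are effective at identifying critical security vulnerabilities, focusing on whether GitHub Copilot's recently introduced code review feature can detect serious flaws such as SQL injection, cross-site scripting (XSS), and insecure deserialization. The key to the study is a systematic evaluation of Copilot on labeled vulnerable code samples drawn from open-source projects spanning multiple languages and application domains. It finds that Copilot's feedback mainly addresses low-severity issues such as coding style and typographical errors and frequently fails to detect critical vulnerabilities, exposing a significant gap between AI code review and real security needs and underlining the continued necessity of dedicated security tools and manual code audits.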

Link: https://arxiv.org/abs/2509.13650
Authors: Amena Amro, Manar H. Alalfi
Affiliations: unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: none

Abstract:As software development practices increasingly adopt AI-powered tools, ensuring that such tools can support secure coding has become critical. This study evaluates the effectiveness of GitHub Copilot’s recently introduced code review feature in detecting security vulnerabilities. Using a curated set of labeled vulnerable code samples drawn from diverse open-source projects spanning multiple programming languages and application domains, we systematically assessed Copilot’s ability to identify and provide feedback on common security flaws. Contrary to expectations, our results reveal that Copilot’s code review frequently fails to detect critical vulnerabilities such as SQL injection, cross-site scripting (XSS), and insecure deserialization. Instead, its feedback primarily addresses low-severity issues, such as coding style and typographical errors. These findings expose a significant gap between the perceived capabilities of AI-assisted code review and its actual effectiveness in supporting secure development practices. Our results highlight the continued necessity of dedicated security tools and manual code audits to ensure robust software security.

[AI-27] DeepLogit: A sequentially constrained explainable deep learning modeling approach for transport policy analysis

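【Quick read】: This paper addresses the limited adoption of deep learning models in transport planning and policy analysis, which stems from their black-box nature and the resulting lack of the interpretability that policymaking requires. The key to the solution is a novel sequentially constrained approach. First, a convolutional neural network (CNN) with only linear terms is estimated; it is equivalent to a linear-in-parameter multinomial logit model and yields initial parameter estimates with a clear economic meaning. Then, more complex deep learning models are estimated with those key parameters fixed at the values obtained, while adding higher-order terms or advanced architectures such as Transformers. This retains the interpretability of the selected parameters while significantly improving predictive accuracy, combining the strengths of theory-based discrete choice models (DCMs) and data-driven AI models.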

Link: https://arxiv.org/abs/2509.13633
Authors: Jeremy Oon, Rakhi Manohar Mepparambath, Ling Feng
Affiliations: unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: none

Abstract:Despite the significant progress of deep learning models in a multitude of applications, their adoption in planning and policy-related areas remains challenging due to the black-box nature of these models. In this work, we develop a set of DeepLogit models that follow a novel sequentially constrained approach in estimating deep learning models for transport policy analysis. In the first step of the proposed approach, we estimate a convolutional neural network (CNN) model with only linear terms, which is equivalent to a linear-in-parameter multinomial logit model. We then estimate other deep learning models by constraining the parameters that need interpretability at the values obtained in the linear-in-parameter CNN model and including higher order terms or by introducing advanced deep learning architectures like Transformers. Our approach can retain the interpretability of the selected parameters, yet provides significantly improved model accuracy over the discrete choice model. We demonstrate our approach on a transit route choice example using real-world transit smart card data from Singapore. This study shows the potential for a unifying approach, where theory-based discrete choice model (DCM) and data-driven AI models can leverage each other’s strengths in interpretability and predictive power. With the availability of larger datasets and more complex constructions, such an approach can lead to more accurate models using discrete choice models while maintaining their applicability in planning and policy-related areas. Our code is available on this https URL.
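The two-step idea can be illustrated with plain utilities and a softmax. This is a sketch under stated assumptions, not the DeepLogit code: step one is represented by an ordinary multinomial logit with a fixed coefficient vector `beta`, and step two adds a hypothetical learned correction `extra_fn` to each alternative's utility while leaving `beta` untouched. The feature layout and numbers in the usage note are made up for illustration.

```python
import math
from typing import Callable, List, Sequence


def softmax(utilities: Sequence[float]) -> List[float]:
    """Numerically stable softmax over alternative utilities."""
    m = max(utilities)
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def mnl_probs(alternatives: Sequence[Sequence[float]], beta: Sequence[float]) -> List[float]:
    """Linear-in-parameter multinomial logit: U_j = beta . x_j."""
    return softmax([sum(b * x for b, x in zip(beta, alt)) for alt in alternatives])


def constrained_probs(
    alternatives: Sequence[Sequence[float]],
    beta_fixed: Sequence[float],
    extra_fn: Callable[[Sequence[float]], float],
) -> List[float]:
    """Step two: keep beta_fixed as estimated, add a flexible term per alternative."""
    utilities = [
        sum(b * x for b, x in zip(beta_fixed, alt)) + extra_fn(alt)
        for alt in alternatives
    ]
    return softmax(utilities)
```

With two routes described by (travel time, cost) as `[10, 2]` and `[20, 1]` and `beta = [-0.1, -0.5]`, `mnl_probs` gives the first route a probability of about 0.62. Because `beta` stays fixed in step two, quantities derived from it, such as the ratio of the time and cost coefficients, keep their interpretation even as `extra_fn` absorbs nonlinear effects.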

[AI-28] Secure Scalable and Privacy Aware Data Strategy in Cloud

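【Quick read】: This paper addresses the challenge enterprises face in processing and storing large amounts of data securely and scalably in the cloud while enabling decision makers to make quick, data-driven decisions. The key to the solution is an effective enterprise data strategy whose core consists of concrete architectures addressing security, scalability, and privacy.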

Link: https://arxiv.org/abs/2509.13627
Authors: Vijay Kumar Butte, Sujata Butte
Affiliations: unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: none

Abstract:Enterprises today face the tough challenge of processing and storing large amounts of data in a secure, scalable manner and enabling decision makers to make quick, informed, data-driven decisions. This paper addresses this challenge and develops an effective enterprise data strategy in the cloud. Various components of an effective data strategy are discussed and architectures addressing security, scalability and privacy aspects are provided.

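[AI-29] Mind the Gap: Aligning Knowledge Bases with User Needs to Enhance Mental Health Retrieval NEURIPS2025

【Quick read】: This paper addresses the poor performance of mental health information retrieval systems when a knowledge base has coverage gaps or when users express concerns in informal or contextualized language. The key to the solution is an AI-based gap-informed framework for corpus augmentation: it overlays naturalistic user data (such as forum posts) on the existing knowledge base to identify underrepresented topics (gaps), then prioritizes expansion in those areas. Experiments show that, compared with Non-Directed (random) augmentation, the directed approach reaches about 95% of the performance of an exhaustive reference corpus with far smaller expansions (at most 318%, for the Baseline pipeline), which reduces content creation cost and supports more reliable generative AI applications in high-stakes domains.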

Link: https://arxiv.org/abs/2509.13626
Authors: Amanda Chan, James Jiayu Liu, He Kai, Onno P. Kampman
Affiliations: unknown
Subjects: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 25 pages, 3 figures, submitted to NeurIPS 2025 GenAI4Health

Abstract:Access to reliable mental health information is vital for early help-seeking, yet expanding knowledge bases is resource-intensive and often misaligned with user needs. This results in poor performance of retrieval systems when presented concerns are not covered or expressed in informal or contextualized language. We present an AI-based gap-informed framework for corpus augmentation that authentically identifies underrepresented topics (gaps) by overlaying naturalistic user data such as forum posts in order to prioritize expansions based on coverage and usefulness. In a case study, we compare Directed (gap-informed augmentations) with Non-Directed augmentation (random additions), evaluating the relevance and usefulness of retrieved information across four retrieval-augmented generation (RAG) pipelines. Directed augmentation achieved near-optimal performance with modest expansions–requiring only a 42% increase for Query Transformation, 74% for Reranking and Hierarchical, and 318% for Baseline–to reach ~95% of the performance of an exhaustive reference corpus. In contrast, Non-Directed augmentation required substantially larger and thus practically infeasible expansions to achieve comparable performance (232%, 318%, 403%, and 763%, respectively). These results show that strategically targeted corpus growth can reduce content creation demands while sustaining high retrieval and provision quality, offering a scalable approach for building trusted health information repositories and supporting generative AI applications in high-stakes domains.

[AI-30] Modernizing Facebook Scoped Search: Keyword and Embedding Hybrid Retrieval with LLM Evaluation

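【Quick read】: This paper addresses insufficient relevance and diversity in social network search, specifically in Facebook Group scoped search, where traditional keyword-based retrieval struggles to capture the semantic intent of queries and returns results that lack contextual relevance. The key to the solution is a hybrid retrieval framework that blends traditional keyword retrieval with embedding-based retrieval (EBR), integrating semantic retrieval into the existing keyword search pipeline. The work also proposes an offline evaluation framework based on large language models (LLMs) that provides scalable, consistent relevance assessment, and both online metrics and LLM-based evaluation confirm the gains of the blended approach.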

链接: https://arxiv.org/abs/2509.13603
作者: Yongye Su,Zeya Zhang,Jane Kou,Cheng Ju,Shubhojeet Sarkar,Yamin Wang,Ji Liu,Shengbo Guo
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 5 Pages, work done as Yongye Su’s internship project at Meta

点击查看摘要

Abstract:Beyond general web-scale search, social network search uniquely enables users to retrieve information and discover potential connections within their social context. We introduce a framework of modernized Facebook Group Scoped Search by blending traditional keyword-based retrieval with embedding-based retrieval (EBR) to improve the search relevance and diversity of search results. Our system integrates semantic retrieval into the existing keyword search pipeline, enabling users to discover more contextually relevant group posts. To rigorously assess the impact of this blended approach, we introduce a novel evaluation framework that leverages large language models (LLMs) to perform offline relevance assessments, providing scalable and consistent quality benchmarks. Our results demonstrate that the blended retrieval system significantly enhances user engagement and search quality, as validated by both online metrics and LLM-based evaluation. This work offers practical insights for deploying and evaluating advanced retrieval systems in large-scale, real-world social platforms.
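As a rough illustration of what blending keyword-based and embedding-based retrieval can look like, the sketch below scores each post by a weighted sum of a term-overlap score and a cosine similarity over toy vectors. The abstract does not describe the actual blending logic, so the function names (`keyword_score`, `cosine`, `blended_rank`), the weight `alpha`, and the hand-made embeddings are all assumptions made for this example.

```python
import math

def keyword_score(query, text):
    """Fraction of query terms that appear in the text (toy keyword match)."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    return len(q_terms & t_terms) / len(q_terms) if q_terms else 0.0

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def blended_rank(query, query_vec, posts, alpha=0.5):
    """Rank posts by alpha * keyword score + (1 - alpha) * embedding similarity.

    `posts` is a list of (post_id, text, embedding) tuples; `alpha` is a
    hypothetical blending weight, not a value from the paper.
    """
    scored = []
    for post_id, text, vec in posts:
        score = alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec)
        scored.append((score, post_id))
    return [post_id for score, post_id in sorted(scored, reverse=True)]

# Toy data: hand-made 2-d "embeddings" stand in for a real embedding model.
posts = [
    ("p1", "cheap bike repair shop", [0.9, 0.1]),
    ("p2", "where to fix a bicycle", [0.8, 0.2]),
    ("p3", "best pizza in town", [0.1, 0.9]),
]
ranking = blended_rank("bike repair", [1.0, 0.0], posts)
```

In this toy data `p2` shares no term with the query, yet it ranks above `p3` because its vector is close to the query vector, which is the kind of result a keyword-only pipeline would miss.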

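[AI-31] Agentic JWT: A Secure Delegation Protocol for Autonomous AI Agents

【Quick Read】: This paper examines the security risks that arise when autonomous large language model (LLM) agents call APIs at high frequency without supervision, focusing on how OAuth 2.0 can allow silent privilege escalation in agentic settings because client behavior is non-deterministic (stochastic reasoning, prompt injection, or multi-agent orchestration). The core of the solution is Agentic JWT (A-JWT), a dual-faceted intent token that binds each agent action to verifiable user intent and, optionally, to a specific workflow step. A-JWT carries the agent's identity as a one-way hash derived from its prompt, tools, and configuration, a chained delegation assertion proving which downstream agent may execute a task, and per-agent proof-of-possession keys that prevent replay and in-process impersonation. The author also provides a lightweight client shim library that self-verifies code at run time and mints tokens and derives keys dynamically, supporting secure identity separation within a single process and a path toward zero-trust guarantees.

Link: https://arxiv.org/abs/2509.13597
Authors: Abhishek Goswami
Institution: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 17 pages, 6 figures, 2 Tables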

Abstract:Autonomous LLM agents can issue thousands of API calls per hour without human oversight. OAuth 2.0 assumes deterministic clients, but in agentic settings stochastic reasoning, prompt injection, or multi-agent orchestration can silently expand privileges. We introduce Agentic JWT (A-JWT), a dual-faceted intent token that binds each agent’s action to verifiable user intent and, optionally, to a specific workflow step. A-JWT carries an agent’s identity as a one-way checksum hash derived from its prompt, tools and configuration, and a chained delegation assertion to prove which downstream agent may execute a given task, and per-agent proof-of-possession keys to prevent replay and in-process impersonation. We define a new authorization mechanism and add a lightweight client shim library that self-verifies code at run time, mints intent tokens, tracks workflow steps and derives keys, thus enabling secure agent identity and separation even within a single process. We illustrate a comprehensive threat model for agentic applications, implement a Python proof-of-concept and show functional blocking of scope-violating requests, replay, impersonation, and prompt-injection pathways with sub-millisecond overhead on commodity hardware. The design aligns with ongoing OAuth agent discussions and offers a drop-in path toward zero-trust guarantees for agentic applications. A comprehensive performance and security evaluation with experimental results will appear in our forthcoming journal publication.
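The abstract says the agent's identity is a one-way checksum hash derived from its prompt, tools and configuration. A minimal sketch of that idea follows, assuming SHA-256 over a canonical JSON encoding. The paper's actual hash function and field layout are not given in this listing, so the name `agent_identity_digest` and the canonicalization are hypothetical.

```python
import hashlib
import json

def agent_identity_digest(prompt, tools, config):
    """Return a SHA-256 hex digest over a canonical encoding of the agent definition.

    Sorting keys and tool names makes the digest independent of ordering;
    this canonicalization is an assumption of the sketch.
    """
    canonical = json.dumps(
        {"prompt": prompt, "tools": sorted(tools), "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Same agent definition, tools listed in a different order.
base = agent_identity_digest(
    "Summarize invoices.", ["read_invoice", "send_email"], {"temperature": 0}
)
same = agent_identity_digest(
    "Summarize invoices.", ["send_email", "read_invoice"], {"temperature": 0}
)
# A changed prompt (for example after prompt injection) yields a different digest.
tampered = agent_identity_digest(
    "Summarize invoices and wire funds.", ["read_invoice", "send_email"], {"temperature": 0}
)
```

A verifier that knows the registered digest can then reject a token whose agent definition no longer matches, which is one plausible way such a checksum could support the impersonation checks the abstract mentions.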

[AI-32] Programmable Cognitive Bias in Social Agents

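【Quick Read】: This paper addresses inconsistent agent behavior in social simulations built on large language models (LLMs): the conventional approach of specifying agent behavior through implicit natural language descriptions does not yield consistent behavior across models and fails to capture the nuances of the descriptions. The key to the solution is the CoBRA toolkit, whose core innovation is to program an agent's cognitive bias explicitly, grounding the agent's expected behavior in classic social science experiments so that it can be adjusted in a controlled way. It has two components: a Cognitive Bias Index, which measures an agent's cognitive bias by quantifying its reactions in a set of validated classic social science experiments, and a Behavioral Regulation Engine, which steers the agent's behavior to exhibit a controlled level of cognitive bias. The evaluation suggests that CoBRA can precisely control the cognitive bias shown by social agents in a model-agnostic manner.

Link: https://arxiv.org/abs/2509.13588
Authors: Xuan Liu,Haoyang Shang,Haojian Jin
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Computers and Society (cs.CY)
Comments: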

Abstract:This paper introduces CoBRA, a novel toolkit for systematically specifying agent behavior in LLM-based social simulation. We found that conventional approaches that specify agent behaviors through implicit natural language descriptions cannot yield consistent behaviors across models, and the produced agent behaviors do not capture the nuances of the descriptions. In contrast, CoBRA presents a new approach to program agents’ cognitive biases explicitly, by grounding agents’ expected behaviors using classic social science experiments. CoBRA has two components: (1) Cognitive Bias Index that measures the cognitive bias of a social agent, by quantifying the agent’s reactions in a set of validated classical social science experiments; (2) Behavioral Regulation Engine that aligns the agent’s behavior to demonstrate controlled cognitive bias. We evaluated CoBRA as an HCI toolkit through demonstration and technical benchmarks. Our results suggest that CoBRA can precisely program the cognitive bias demonstrated in a social agent in a model-agnostic manner.

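[AI-33] TreeIRL: Safe Urban Driving with Tree Search and Inverse Reinforcement Learning

【Quick Read】: This paper targets the planning bottleneck in autonomous driving: making path decisions in complex, dynamic environments that are safe, efficient, and consistent with human driving behavior. The key to the solution is TreeIRL, a framework that combines Monte Carlo Tree Search (MCTS) with Inverse Reinforcement Learning (IRL): MCTS efficiently generates a set of safe candidate trajectories, and a scoring function trained with deep IRL selects the most human-like trajectory among them. This design balances safety, progress, comfort, and human-likeness and, to the authors' knowledge, is the first demonstration of MCTS-based planning on public roads.

Link: https://arxiv.org/abs/2509.13579
Authors: Momchil S. Tomov,Sang Uk Lee,Hansford Hendrago,Jinwook Huh,Teawon Han,Forbes Howington,Rafael da Silva,Gianmarco Bernasconi,Marc Heim,Samuel Findler,Xiaonan Ji,Alexander Boule,Michael Napoli,Kuo Chen,Jesse Miller,Boaz Floor,Yunqing Hu
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: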

Abstract:We present TreeIRL, a novel planner for autonomous driving that combines Monte Carlo tree search (MCTS) and inverse reinforcement learning (IRL) to achieve state-of-the-art performance in simulation and in real-world driving. The core idea is to use MCTS to find a promising set of safe candidate trajectories and a deep IRL scoring function to select the most human-like among them. We evaluate TreeIRL against both classical and state-of-the-art planners in large-scale simulations and on 500+ miles of real-world autonomous driving in the Las Vegas metropolitan area. Test scenarios include dense urban traffic, adaptive cruise control, cut-ins, and traffic lights. TreeIRL achieves the best overall performance, striking a balance between safety, progress, comfort, and human-likeness. To our knowledge, our work is the first demonstration of MCTS-based planning on public roads and underscores the importance of evaluating planners across a diverse set of metrics and in real-world environments. TreeIRL is highly extensible and could be further improved with reinforcement learning and imitation learning, providing a framework for exploring different combinations of classical and learning-based approaches to solve the planning bottleneck in autonomous driving.

[AI-34] Dense-Jump Flow Matching with Non-Uniform Time Scheduling for Robotic Policies: Mitigating Multi-Step Inference Degradation

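【Quick Read】: This paper addresses two problems with flow matching for generative robot policy learning: generalization saturates early, and performance degrades at inference. The authors find that adding Euler integration steps hurts, because uniformly spaced steps oversample the late-time region, which constrains actions toward the training trajectories and reduces generalization, and because the learned velocity field becomes non-Lipschitz as integration time approaches 1, causing instability. The solution has two parts: (1) during training, a non-uniform time schedule (for example, U-shaped) that emphasizes both early and late time stages to regularize policy training; (2) at inference, a dense-jump integration schedule that replaces the multi-step integration beyond a jump point with a single step, avoiding the unstable region near 1. The result is an efficient one-step learner that still gains from multi-step integration, with improvements of up to 23.7% over state-of-the-art baselines across diverse robotic tasks.

Link: https://arxiv.org/abs/2509.13574
Authors: Zidong Chen,Zihao Guo,Peng Wang,ThankGod Itua Egbe,Yan Lyu,Chenghao Qian
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: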

Abstract:Flow matching has emerged as a competitive framework for learning high-quality generative policies in robotics; however, we find that generalisation arises and saturates early along the flow trajectory, in accordance with recent findings in the literature. We further observe that increasing the number of Euler integration steps during inference counter-intuitively and universally degrades policy performance. We attribute this to (i) additional, uniformly spaced integration steps oversample the late-time region, thereby constraining actions towards the training trajectories and reducing generalisation; and (ii) the learned velocity field becoming non-Lipschitz as integration time approaches 1, causing instability. To address these issues, we propose a novel policy that utilises non-uniform time scheduling (e.g., U-shaped) during training, which emphasises both early and late temporal stages to regularise policy training, and a dense-jump integration schedule at inference, which uses a single-step integration to replace the multi-step integration beyond a jump point, to avoid unstable areas around 1. Essentially, our policy is an efficient one-step learner that still pushes forward performance through multi-step integration, yielding up to 23.7% performance gains over state-of-the-art baselines across diverse robotic tasks.
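The two scheduling ideas in the abstract can be sketched in a few lines: a time grid that takes dense Euler steps up to a jump point and then a single step to t = 1, and a U-shaped sampler for training times. This is only a sketch under stated assumptions. The helper names, the jump point 0.6, the Beta(0.5, 0.5) density, and the toy velocity field are all chosen here for illustration and are not taken from the paper.

```python
import numpy as np

def dense_jump_times(n_dense, t_jump):
    """Time grid with n_dense uniform Euler steps on [0, t_jump], then one jump to t = 1."""
    dense = np.linspace(0.0, t_jump, n_dense + 1)
    return np.append(dense, 1.0)

def euler_integrate(velocity, x0, times):
    """Explicit Euler integration of dx/dt = velocity(x, t) along the given time grid."""
    x = np.asarray(x0, dtype=float)
    for t0, t1 in zip(times[:-1], times[1:]):
        x = x + (t1 - t0) * velocity(x, t0)
    return x

def sample_u_shaped_times(rng, size):
    """Training times from Beta(0.5, 0.5), one U-shaped density on (0, 1)."""
    return rng.beta(0.5, 0.5, size=size)

# Toy velocity field that transports any point to `target` at t = 1 and
# grows without bound as t approaches 1 (a stand-in for a learned field).
target = np.array([1.0, -2.0])

def toy_velocity(x, t):
    return (target - x) / (1.0 - t)

times = dense_jump_times(n_dense=3, t_jump=0.6)
action = euler_integrate(toy_velocity, np.zeros(2), times)
train_times = sample_u_shaped_times(np.random.default_rng(0), 1000)
```

Because the last step jumps straight from the jump point to 1, the velocity field is never evaluated in the region close to t = 1 where the abstract reports instability.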

[AI-35] Gen AI in Proof-based Math Courses: A Pilot Study

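【Quick Read】: The question this paper addresses is how to set policies that promote student learning and critical thinking now that generative AI is rapidly entering higher education and existing AI detection tools are unreliable. The approach is an empirical study of how students use and perceive generative AI in three proof-based undergraduate mathematics courses. It clarifies the actual role and limitations of generative AI in learning and, on that basis, discusses integration strategies for teaching practice, to help instructors design course policies that guide students to use generative AI sensibly to strengthen proof-writing and deeper thinking.

Link: https://arxiv.org/abs/2509.13570
Authors: Hannah Klawa,Shraddha Rajpal,Cigole Thomas
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); History and Overview (math.HO)
Comments: 35 pages, 6 figures, Comments welcome!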

Abstract:With the rapid rise of generative AI in higher education and the unreliability of current AI detection tools, developing policies that encourage student learning and critical thinking has become increasingly important. This study examines student use and perceptions of generative AI across three proof-based undergraduate mathematics courses: a first-semester abstract algebra course, a topology course and a second-semester abstract algebra course. In each case, course policy permitted some use of generative AI. Drawing on survey responses and student interviews, we analyze how students engaged with AI tools, their perceptions of generative AI’s usefulness and limitations, and what implications these perceptions hold for teaching proof-based mathematics. We conclude by discussing future considerations for integrating generative AI into proof-based mathematics instruction.

[AI-36] AI Agents with Human-Like Collaborative Tools: Adaptive Strategies for Enhanced Problem-Solving

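【Quick Read】: This paper asks whether giving large language model (LLM) agents the collaborative tools and autonomy that humans commonly use when solving problems can improve their performance on complex tasks. The key to the approach is equipping Claude Code agents with social media and journaling tools based on MCP (Model Context Protocol) and letting them decide how to use these tools. Across 34 Aider Polyglot Python programming challenges, the collaborative tools substantially improved performance on the hardest problems, with 15-40% lower cost, 12-38% faster completion, and 12-27% fewer turns. Behavioral analysis further shows that agents prefer writing over reading (by roughly 2-9x), suggesting that structured articulation (articulation-based cognitive scaffolding), rather than information access alone, drives the improvement. Human-inspired collaborative tools can therefore act as reasoning enhancers at the edge of an AI agent's capabilities.

Link: https://arxiv.org/abs/2509.13547
Authors: Harper Reed,Michael Sugimura,Angelo Zangari
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 16 pages, 5 tables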

Abstract:We investigate whether giving LLM agents the collaborative tools and autonomy that humans naturally use for problem solving can improve their performance. We equip Claude Code agents with MCP-based social media and journaling tools and allow them to use these tools as they see fit. Across 34 Aider Polyglot Python programming challenges, collaborative tools substantially improve performance on the hardest problems, delivering 15-40% lower cost, 12-27% fewer turns, and 12-38% faster completion than baseline agents. Effects on the full challenge set are mixed, suggesting these tools act as performance enhancers when additional reasoning scaffolding is most needed. Surprisingly, different models naturally adopted distinct collaborative strategies without explicit instruction. Sonnet 3.7 engaged broadly across tools and benefited from articulation-based cognitive scaffolding. Sonnet 4 showed selective adoption, leaning on journal-based semantic search when problems were genuinely difficult. This mirrors how human developers adjust collaboration based on expertise and task complexity. Behavioral analysis shows agents prefer writing over reading by about 2-9x, indicating that structured articulation drives much of the improvement rather than information access alone. Overall, AI agents can systematically benefit from human-inspired collaboration tools at the edge of their capabilities, pointing to adaptive collaborative interfaces as reasoning enhancers rather than universal efficiency boosts.

[AI-37] Reproducible workflow for online AI in digital health

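【Quick Read】: This paper addresses the tension between adaptability and reproducibility when deploying online artificial intelligence (online AI) in digital health interventions. The key to the solution is a reproducible scientific workflow spanning every phase of the online AI algorithm development life-cycle, ensuring that data are stored accurately, algorithm behavior is auditable, and results are comparable over time, in support of scientific discovery and trustworthy refinement.

Link: https://arxiv.org/abs/2509.13499
Authors: Susobhan Ghosh,Bhanu T. Gulapalli,Daiqi Gao,Asim Gazi,Anna Trella,Ziping Xu,Kelly Zhang,Susan A. Murphy
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: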

Abstract:Online artificial intelligence (AI) algorithms are an important component of digital health interventions. These online algorithms are designed to continually learn and improve their performance as streaming data is collected on individuals. Deploying online AI presents a key challenge: balancing adaptability of online AI with reproducibility. Online AI in digital interventions is a rapidly evolving area, driven by advances in algorithms, sensors, software, and devices. Digital health intervention development and deployment is a continuous process, where implementation - including the AI decision-making algorithm - is interspersed with cycles of re-development and optimization. Each deployment informs the next, making iterative deployment a defining characteristic of this field. This iterative nature underscores the importance of reproducibility: data collected across deployments must be accurately stored to have scientific utility, algorithm behavior must be auditable, and results must be comparable over time to facilitate scientific discovery and trustworthy refinement. This paper proposes a reproducible scientific workflow for developing, deploying, and analyzing online AI decision-making algorithms in digital health interventions. Grounded in practical experience from multiple real-world deployments, this workflow addresses key challenges to reproducibility across all phases of the online AI algorithm development life-cycle.

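[AI-38] Prompt2DAG: A Modular Methodology for LLM-Based Data Enrichment Pipeline Generation

【Quick Read】: This paper addresses the heavy engineering expertise required to build data enrichment pipelines, proposing production-grade automation by generating executable Apache Airflow DAGs (Directed Acyclic Graphs) from natural language descriptions. The core of the solution is the Prompt2DAG methodology, and the key is a structured Hybrid generation strategy that combines template constraints with the flexibility of large language models (LLMs), improving code quality and execution success while keeping the generated workflows reliable. In an evaluation of 260 experiments, the Hybrid approach reaches a 78.5% success rate, well above the LLM-only (66.2%) and Direct prompting (29.2%) approaches, and it is over twice as cost-efficient per successful DAG as Direct prompting, supporting the case that structured design is needed to balance flexibility and robustness.

Link: https://arxiv.org/abs/2509.13487
Authors: Abubakari Alidu,Michele Ciavotta,Flavio DePaoli
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: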

Abstract:Developing reliable data enrichment pipelines demands significant engineering expertise. We present Prompt2DAG, a methodology that transforms natural language descriptions into executable Apache Airflow DAGs. We evaluate four generation approaches – Direct, LLM-only, Hybrid, and Template-based – across 260 experiments using thirteen LLMs and five case studies to identify optimal strategies for production-grade automation. Performance is measured using a penalized scoring framework that combines reliability with code quality (SAT), structural integrity (DST), and executability (PCT). The Hybrid approach emerges as the optimal generative method, achieving a 78.5% success rate with robust quality scores (SAT: 6.79, DST: 7.67, PCT: 7.76). This significantly outperforms the LLM-only (66.2% success) and Direct (29.2% success) methods. Our findings show that reliability, not intrinsic code quality, is the primary differentiator. Cost-effectiveness analysis reveals the Hybrid method is over twice as efficient as Direct prompting per successful DAG. We conclude that a structured, hybrid approach is essential for balancing flexibility and reliability in automated workflow generation, offering a viable path to democratize data pipeline development.

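[AI-39] An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software ICSE26

【Quick Read】: This paper addresses the limited reliability of generative AI in legally critical settings caused by semantic ambiguity and hallucination, in particular the difficulty of accurately translating natural-language statutes into executable logic. The key to the solution is an agentic approach that combines higher-order metamorphic relations with an LLM-driven, role-based test generation framework: it compares outputs across structured shifts among similar individuals to find counterexamples automatically, and uses a multi-agent system to translate tax code into executable software and verify it, improving the robustness and trustworthiness of legal-critical software.

Link: https://arxiv.org/abs/2509.13471
Authors: Sina Gogani-Khiabani(University of Illinois Chicago),Ashutosh Trivedi(University of Colorado Boulder),Diptikalyan Saha(IBM Research),Saeid Tizpaz-Niari(University of Illinois Chicago)
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: To appear at ICSE 26. 12 pages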

Abstract:Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
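To make the metamorphic-testing idea concrete, here is a small sketch that checks a tax function without knowing the correct tax for any input. It uses one first-order relation (more income never lowers the tax) and one relation over shifts between two similar individuals (the same raise costs the higher earner at least as much). The second is only one possible reading of the "higher-order" relations in the abstract. The tax rules below are invented for the example and are not real tax law, and every function name is hypothetical.

```python
def toy_tax(income, dependents=0):
    """Hypothetical progressive tax: NOT real tax law, just a system under test."""
    taxable = max(0.0, income - 10_000 - 2_000 * dependents)
    low = min(taxable, 40_000) * 0.10
    high = max(0.0, taxable - 40_000) * 0.20
    return low + high

def buggy_tax(income, dependents=0):
    """Same shape as toy_tax, but with the two bracket rates deliberately swapped."""
    taxable = max(0.0, income - 10_000 - 2_000 * dependents)
    return min(taxable, 40_000) * 0.20 + max(0.0, taxable - 40_000) * 0.10

def monotonic_in_income(tax_fn, income, delta, dependents=0):
    """First-order relation: more income, all else equal, never lowers the tax."""
    return tax_fn(income + delta, dependents) >= tax_fn(income, dependents)

def marginal_not_decreasing(tax_fn, low_income, high_income, delta, dependents=0):
    """Relation over shifts: the same raise costs the higher earner at least as much."""
    low_change = tax_fn(low_income + delta, dependents) - tax_fn(low_income, dependents)
    high_change = tax_fn(high_income + delta, dependents) - tax_fn(high_income, dependents)
    return high_change >= low_change

def find_counterexamples(tax_fn, incomes, delta=1_000):
    """Return the income pairs that violate either relation."""
    bad = []
    for i, low in enumerate(incomes):
        if not monotonic_in_income(tax_fn, low, delta):
            bad.append((low, low))
        for high in incomes[i + 1:]:
            if not marginal_not_decreasing(tax_fn, low, high, delta):
                bad.append((low, high))
    return bad

incomes = [5_000, 30_000, 49_500, 80_000]
violations_ok = find_counterexamples(toy_tax, incomes)
violations_bug = find_counterexamples(buggy_tax, incomes)
```

The buggy version passes the monotonicity check and fails only the relation over shifts, which shows why comparing changes across similar individuals can expose errors that a single-input relation misses.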

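[AI-40] Justice in Judgment: Unveiling (Hidden) Bias in LLM-assisted Peer Reviews

【Quick Read】: This paper examines the bias that generative AI may introduce into academic peer review, especially unfairness triggered by sensitive metadata such as author affiliation and gender. The approach is a set of controlled experiments that systematically test whether reviews generated by large language models (LLMs) are biased. The study finds that highly ranked institutions are more likely to receive favorable reviews, identifies gender preferences that are subtle but could compound over time, and shows that token-based soft ratings expose implicit biases more clearly, providing empirical evidence for improving the fairness and reliability of LLMs in the review process.

Link: https://arxiv.org/abs/2509.13400
Authors: Sai Suresh Marchala Vasu,Ivaxi Sheth,Hui-Po Wang,Ruta Binkyte,Mario Fritz
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: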

Abstract:The adoption of large language models (LLMs) is transforming the peer review process, from assisting reviewers in writing more detailed evaluations to generating entire reviews automatically. While these capabilities offer exciting opportunities, they also raise critical concerns about fairness and reliability. In this paper, we investigate bias in LLM-generated peer reviews by conducting controlled experiments on sensitive metadata, including author affiliation and gender. Our analysis consistently shows affiliation bias favoring institutions highly ranked on common academic rankings. Additionally, we find some gender preferences, which, even though subtle in magnitude, have the potential to compound over time. Notably, we uncover implicit biases that become more evident with token-based soft ratings.

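[AI-41] The threat of analytic flexibility in using large language models to simulate human data: A call to attention

【Quick Read】: This paper addresses the unstable quality of “silicon samples” created with generative AI in the social sciences, which results from the many analytic choices available. It maps out the analytic decisions involved in producing silicon samples and demonstrates empirically that a small number of configuration choices can substantially change how well silicon samples correspond to real human data. Configurations also perform inconsistently across evaluation measures (accuracy of participant rank ordering, fit of response distributions, and between-scale correlations), so no single configuration is optimal for every setting. The author therefore calls for attention to the threat that analytic flexibility poses to the reliability of silicon samples, and for more rigorous validation of configurations and transparent reporting.

Link: https://arxiv.org/abs/2509.13397
Authors: Jamie Cummins
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 11 pages, 3 figures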

Abstract:Social scientists are now using large language models to create “silicon samples” - synthetic datasets intended to stand in for human respondents, aimed at revolutionising human subjects research. However, there are many analytic choices which must be made to produce these samples. Though many of these choices are defensible, their impact on sample quality is poorly understood. I map out these analytic choices and demonstrate how a very small number of decisions can dramatically change the correspondence between silicon samples and human data. Configurations (N = 252) varied substantially in their capacity to estimate (i) rank ordering of participants, (ii) response distributions, and (iii) between-scale correlations. Most critically, configurations were not consistent in quality: those that performed well on one dimension often performed poorly on another, implying that there is no “one-size-fits-all” configuration that optimises the accuracy of these samples. I call for greater attention to the threat of analytic flexibility in using silicon samples.

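[AI-42] The Intercepted Self: How Generative AI Challenges the Dynamics of the Relational Self AAAI

【Quick Read】: This paper asks how, as generative AI systems become increasingly able to predict human behavior and act on people's behalf, this development will reshape human-technology relations and change how individuals relate to themselves. Building on the theory of the relational self, it analyzes the effects of generative AI in three spheres: the sphere of externalised output, the contextual sphere, and the sphere of self-relating. It argues that generative AI not only helps complete tasks but increasingly anticipates and intercepts human initiatives, which profoundly changes how people exist and experience agency in these spheres.

Link: https://arxiv.org/abs/2509.13391
Authors: Sandrine R. Schiller,Camilo Miguel Signorelli,Filippos Stamatiou
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 8 pages, accepted at the 8th AAAI/ACM Conference on AI, Ethics, and Society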

Abstract:Generative AI is changing our way of interacting with technology, others, and ourselves. Systems such as Microsoft Copilot, Gemini and the expected Apple Intelligence still await our prompt for action. Yet, it is likely that AI assistant systems will only become better at predicting our behaviour and acting on our behalf. Imagine new generations of generative and predictive AI deciding what you might like best at a new restaurant, picking an outfit that increases your chances on your date with a partner also chosen by the same or a similar system. Far from a science fiction scenario, the goal of several research programs is to build systems capable of assisting us in exactly this manner. The prospect urges us to rethink human-technology relations, but it also invites us to question how such systems might change the way we relate to ourselves. Building on our conception of the relational self, we question the possible effects of generative AI with respect to what we call the sphere of externalised output, the contextual sphere and the sphere of self-relating. In this paper, we attempt to deepen the existential considerations accompanying the AI revolution by outlining how generative AI enables the fulfilment of tasks and also increasingly anticipates, i.e. intercepts, our initiatives in these different spheres.

[AI-43] From Next Token Prediction to (STRIPS) World Models – Preliminary Results

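【Quick Read】: This paper addresses learning propositional STRIPS world models from action traces alone, that is, inferring the underlying state-transition rules only from valid (positive) and invalid (negative) action sequences. The key is to cast the task as supervised next token prediction, where the tokens are the actions of a sequence, and to train a transformer with gradient descent so that the model captures whether the hidden effects of earlier actions make a precondition of a later action false. The results indicate that a suitable transformer architecture can faithfully represent propositional STRIPS models, and that these models can be learned from random valid and invalid action sequences alone.

Link: https://arxiv.org/abs/2509.13389
Authors: Carlos Núñez-Molina,Vicenç Gómez,Hector Geffner
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 10 pages, 3 figures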

Abstract:We consider the problem of learning propositional STRIPS world models from action traces alone, using a deep learning architecture (transformers) and gradient descent. The task is cast as a supervised next token prediction problem where the tokens are the actions, and an action $a$ may follow an action sequence if the hidden effects of the previous actions do not make an action precondition of $a$ false. We show that a suitable transformer architecture can faithfully represent propositional STRIPS world models, and that the models can be learned from sets of random valid (positive) and invalid (negative) action sequences alone. A number of experiments are reported.
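The validity condition in the abstract (an action may follow a sequence only if earlier effects have not made one of its preconditions false) can be written down directly when the state is visible. The sketch below is such a ground-truth labeler for positive and negative traces. It is not the paper's learner, which sees only the traces and their labels. The door domain and the name `is_valid_trace` are made up for illustration.

```python
def is_valid_trace(actions, trace, initial_state):
    """Return True if every action's preconditions hold when it is applied.

    `actions` maps a name to (preconditions, add effects, delete effects),
    each a set of propositions. Negative preconditions are omitted for brevity.
    """
    state = set(initial_state)
    for name in trace:
        pre, add, delete = actions[name]
        if not pre <= state:
            return False  # a precondition is false in the current state
        state = (state - delete) | add
    return True

# Hypothetical two-proposition domain: a door that can be opened,
# closed, and walked through.
actions = {
    "open": ({"closed"}, {"opened"}, {"closed"}),
    "close": ({"opened"}, {"closed"}, {"opened"}),
    "walk": ({"opened"}, set(), set()),
}
initial = {"closed"}

positive = ["open", "walk", "close"]
negative = ["open", "close", "walk"]  # walk needs "opened", which close deleted
labels = [is_valid_trace(actions, t, initial) for t in (positive, negative)]
```

In the learning setting the abstract describes, only the action names and the valid or invalid labels would be available, so the propositions tracked in `state` here are exactly the hidden information a model has to infer.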

[AI-44] Uncovering AI Governance Themes in EU Policies using BERTopic and Thematic Analysis

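【Quick Read】: The problem this paper addresses is that the growing number of European Union policies and guidelines on Artificial Intelligence (AI) governance has produced a fragmented governance landscape, which calls for a systematic account of how the policies have evolved and what their core concerns are. The approach combines qualitative thematic analysis with quantitative topic modelling, using the BERTopic model to extend the analysis to EU AI policy documents published after 2018, in order to trace how the core themes of EU AI governance have evolved and to clarify the overall structure and internal consistency of the policy framework.

Link: https://arxiv.org/abs/2509.13387
Authors: Delaram Golpayegani,Marta Lasek-Markey,Arjumand Younus,Aphra Kerr,Dave Lewis
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: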

Abstract:The upsurge of policies and guidelines that aim to ensure Artificial Intelligence (AI) systems are safe and trustworthy has led to a fragmented landscape of AI governance. The European Union (EU) is a key actor in the development of such policies and guidelines. Its High-Level Expert Group (HLEG) issued an influential set of guidelines for trustworthy AI, followed in 2024 by the adoption of the EU AI Act. While the EU policies and guidelines are expected to be aligned, they may differ in their scope, areas of emphasis, degrees of normativity, and priorities in relation to AI. To gain a broad understanding of AI governance from the EU perspective, we leverage qualitative thematic analysis approaches to uncover prevalent themes in key EU documents, including the AI Act and the HLEG Ethics Guidelines. We further employ quantitative topic modelling approaches, specifically through the use of the BERTopic model, to enhance the results and increase the document sample to include EU AI policy documents published post-2018. We present a novel perspective on EU policies, tracking the evolution of its approach to addressing AI governance.

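[AI-45] ASTREA: Introducing Agentic Intelligence for Orbital Thermal Autonomy DATE

【Quick Read】: This paper addresses how to combine semantic reasoning with adaptive control for autonomous on-orbit spacecraft operations, in particular how to run decision-making driven by generative AI on resource-constrained flight-heritage hardware (TRL 9). The key to the solution is an asynchronous architecture that pairs a lightweight large language model (LLM) agent with a reinforcement learning controller, so that supervision is informed by semantic understanding. Ground experiments show that the approach improves thermal stability and reduces violations, but on-orbit validation reveals performance degradation because inference latency is mismatched with the rapid thermal cycles of Low Earth Orbit (LEO). This highlights the challenges facing current LLM-based agentic systems in real space environments and points to directions for improvement.

Link: https://arxiv.org/abs/2509.13380
Authors: Alejandro D. Mousist
Institution: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Systems and Control (eess.SY)
Comments: This preprint presents ASTREA, a multi-agent architecture combining LLM-guided semantic modulation with reinforcement learning for autonomous satellite operations. The system is validated in hardware orbital environments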

Abstract:This paper presents ASTREA, the first agentic system deployed on flight-heritage hardware (TRL 9) for autonomous spacecraft operations. Using thermal control as a representative use case, we integrate a resource-constrained Large Language Model (LLM) agent with a reinforcement learning controller in an asynchronous architecture tailored for space-qualified platforms. Ground experiments show that LLM-guided supervision improves thermal stability and reduces violations, confirming the feasibility of combining semantic reasoning with adaptive control under hardware constraints. However, on-orbit validation aboard the International Space Station (ISS) reveals performance degradation caused by inference latency mismatched with the rapid thermal cycles characteristic of Low Earth Orbit (LEO) satellites. These results highlight both the opportunities and current limitations of agentic LLM-based systems in real flight environments, providing practical design guidelines for future space autonomy.

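[AI-46] Agent^2: An Agent-Generates-Agent Framework for Reinforcement Learning Automation

【Quick Read】: This paper addresses the heavy reliance on expert knowledge, long iteration cycles, high failure rates, and limited accessibility in developing reinforcement learning (RL) agents. The key to the solution is Agent^2, an agent-generates-agent framework that fully automates RL agent design with a dual-agent architecture: the Generator Agent acts as an autonomous AI designer that produces a high-performance, executable RL agent from a natural language task description and environment code, and the Target Agent is the resulting RL agent. The framework splits RL development into two separate stages, Markov decision process (MDP) modeling and algorithmic optimization, and builds on the Model Context Protocol to provide a unified standard for generating agents across environments and algorithms, while integrating adaptive training management and intelligent feedback analysis for closed-loop automation and continuous improvement.

Link: https://arxiv.org/abs/2509.13368
Authors: Yuan Wei,Xiaohan Shan,Ran Miao,Jianmin Li
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 9 pages, 7 figures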

Abstract:Reinforcement learning agent development traditionally requires extensive expertise and lengthy iterations, often resulting in high failure rates and limited accessibility. This paper introduces Agent^2, a novel agent-generates-agent framework that achieves fully automated RL agent design through intelligent LLM-driven generation. The system autonomously transforms natural language task descriptions and environment code into comprehensive, high-performance reinforcement learning solutions without human intervention. Agent^2 features a revolutionary dual-agent architecture. The Generator Agent serves as an autonomous AI designer that analyzes tasks and generates executable RL agents, while the Target Agent is the resulting automatically generated RL agent. The framework decomposes RL development into two distinct stages: MDP modeling and algorithmic optimization, enabling more targeted and effective agent generation. Built on the Model Context Protocol, Agent^2 provides a unified framework that standardizes intelligent agent creation across diverse environments and algorithms, while incorporating adaptive training management and intelligent feedback analysis for continuous improvement. Extensive experiments on a wide range of benchmarks, including MuJoCo, MetaDrive, MPE, and SMAC, demonstrate that Agent^2 consistently outperforms manually designed solutions across all tasks, achieving up to 55% performance improvement and substantial gains on average. By enabling truly end-to-end, closed-loop automation, this work establishes a new paradigm in which intelligent agents design and optimize other agents, marking a fundamental breakthrough for automated AI systems.

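[AI-47] The Provenance Problem: LLMs and the Breakdown of Citation Norms

【Quick Read】: This paper addresses the “provenance problem” that generative AI creates in scientific writing: when an AI system reproduces others' uncredited findings, the chain of scholarly credit breaks even though the researcher has no intent to plagiarize, producing a new kind of attributional harm. The paper introduces conceptual tools for identifying and analyzing this phenomenon and proposes strategies for preserving integrity and fairness in scholarly communication, responding to a gap that current ethical and professional norms do not cover.

Link: https://arxiv.org/abs/2509.13365
Authors: Brian D. Earp,Haotian Yuan,Julian Koplin,Sebastian Porsdam Mann
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 9 pages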

Abstract:The increasing use of generative AI in scientific writing raises urgent questions about attribution and intellectual credit. When a researcher employs ChatGPT to draft a manuscript, the resulting text may echo ideas from sources the author has never encountered. If an AI system reproduces insights from, for example, an obscure 1975 paper without citation, does this constitute plagiarism? We argue that such cases exemplify the ‘provenance problem’: a systematic breakdown in the chain of scholarly credit. Unlike conventional plagiarism, this phenomenon does not involve intent to deceive (researchers may disclose AI use and act in good faith), yet researchers still benefit from the uncredited intellectual contributions of others. This dynamic creates a novel category of attributional harm that current ethical and professional frameworks fail to address. As generative AI becomes embedded across disciplines, the risk that significant ideas will circulate without recognition threatens both the reputational economy of science and the demands of epistemic justice. This Perspective analyzes how AI challenges established norms of authorship, introduces conceptual tools for understanding the provenance problem, and proposes strategies to preserve integrity and fairness in scholarly communication.

[AI-48] Asterisk Operator

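[Quick read]: This paper aims to address three problems in abstract reasoning tasks: the lack of a unified modelling framework, low computational efficiency, and the difficulty of guaranteeing convergent global reasoning. Its core solution is a new unified framework called the Asterisk Operator (\ast-operator), built on Adjacency-Structured Parallel Propagation (ASPP), which formalizes abstract reasoning tasks as local, parallel state evolution guided by implicit relational graphs. The key idea is that local computational constraints keep the method efficient while structured propagation provides global reasoning capability; generality, convergence, and strong performance are validated on ARC2 and Conway's Game of Life. In particular, with the Embedding-Asterisk distillation method, a model with only 6M parameters reaches 100% accuracy on the ARC2 validation set, which the authors present as a significant advance in neural-symbolic reasoning.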

Link: https://arxiv.org/abs/2509.13364
Authors: Zixi Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Code available at: this https URL

Abstract:We propose the Asterisk Operator (\ast-operator), a novel unified framework for abstract reasoning based on Adjacency-Structured Parallel Propagation (ASPP). The operator formalizes structured reasoning tasks as local, parallel state evolution processes guided by implicit relational graphs. We prove that the \ast-operator maintains local computational constraints while achieving global reasoning capabilities, providing an efficient and convergent computational paradigm for abstract reasoning problems. Through rigorous mathematical analysis and comprehensive experiments on ARC2 challenges and Conway’s Game of Life, we demonstrate the operator’s universality, convergence properties, and superior performance. Our innovative Embedding-Asterisk distillation method achieves 100% accuracy on ARC2 validation with only 6M parameters, representing a significant breakthrough in neural-symbolic reasoning. Keywords: Abstract Reasoning, Adjacency Structure, Parallel Propagation, Asterisk Operator, Convergence, Universal Approximation
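
Conway's Game of Life, one of the benchmarks named in the abstract, is a compact instance of local, parallel state evolution: every cell updates at once using only its eight neighbours. The sketch below is a plain NumPy implementation of that rule for illustration. It is not the paper's \ast-operator; the wrap-around (toroidal) boundary and the name `life_step` are assumptions made here.

```python
import numpy as np

def life_step(grid: np.ndarray) -> np.ndarray:
    """One synchronous Game of Life update on a wrap-around grid (assumed boundary)."""
    # Count the 8 neighbours of every cell by shifting the whole grid.
    neighbours = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # Alive next step: exactly 3 neighbours, or currently alive with 2.
    alive = (neighbours == 3) | ((grid == 1) & (neighbours == 2))
    return alive.astype(grid.dtype)

# A horizontal "blinker" becomes vertical after one step and returns after two.
blinker = np.zeros((5, 5), dtype=np.int64)
blinker[2, 1:4] = 1
after_one = life_step(blinker)
after_two = life_step(after_one)
```

Each update reads only adjacent cells, yet repeated application produces grid-wide patterns, which is the local-constraint, global-effect property the abstract emphasizes.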

[AI-49] Evaluating undergraduate mathematics examinations in the era of generative AI: a curriculum-level case study

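[Quick read]: This paper asks whether traditional closed-book mathematics examinations retain their pedagogical relevance and assessment value when sat in uninvigilated, open-book settings, now that generative AI tools are widely available. The key to its approach is empirical: generative AI answers the examinations of eight undergraduate mathematics courses as if unsupervised, the scripts are blind-marked and analysed across modules, and the level and consistency of AI performance are quantified, revealing a potential decline in the pedagogical value of current assessments in the AI era.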

Link: https://arxiv.org/abs/2509.13359
Authors: Benjamin J. Walker, Beatriz Navarro Lameda, Ruth A. Reynolds
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative artificial intelligence (GenAI) tools such as OpenAI’s ChatGPT are transforming the educational landscape, prompting reconsideration of traditional assessment practices. In parallel, universities are exploring alternatives to in-person, closed-book examinations, raising concerns about academic integrity and pedagogical alignment in uninvigilated settings. This study investigates whether traditional closed-book mathematics examinations retain their pedagogical relevance when hypothetically administered in uninvigilated, open-book settings with GenAI access. Adopting an empirical approach, we generate, transcribe, and blind-mark GenAI submissions to eight undergraduate mathematics examinations at a Russell Group university, spanning the entirety of the first-year curriculum. By combining independent GenAI responses to individual questions, we enable a meaningful evaluation of GenAI performance, both at the level of modules and across the first-year curriculum. We find that GenAI attainment is at the level of a first-class degree, though current performance can vary between modules. Further, we find that GenAI performance is remarkably consistent when viewed across the entire curriculum, significantly more so than that of students in invigilated examinations. Our findings evidence the need for redesigning assessments in mathematics for unsupervised settings, and highlight the potential reduction in pedagogical value of current standards in the era of generative artificial intelligence.

[AI-50] Semantic Fusion with Fuzzy-Membership Features for Controllable Language Modelling

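[Quick read]: This paper aims to address the lack of precision and interpretability when generative language models are asked to control semantic attributes of their output, such as sentiment polarity and punctuation. Transformer language models generate text well, but their internals are hard to intervene on along a specific semantic dimension, so outputs often fail to meet users' needs for semantic control. The key to the solution is the "semantic fusion" mechanism: a parallel fuzzy-membership feature channel represents each token as a vector of interpretable semantic features (part of speech, shallow roles, boundary flags, sentiment polarity and strength, and so on) produced by differentiable membership functions such as power kernels; a gated adapter then fuses this semantic matrix into the hidden states of the language model, giving lightweight injection of semantic information and conditioned generation. The method adds only a small computational overhead, stays compatible with tied input-output embeddings, and offers a clear, interpretable pathway for controlling natural language generation.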

Link: https://arxiv.org/abs/2509.13357
Authors: Yongchao Huang, Hassan Raza
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:We propose semantic fusion, a lightweight scheme that augments a Transformer language model (LM) with a parallel, fuzzy-membership feature channel that encodes token-level semantics. Each token is represented by a vector of interpretable features (e.g. part-of-speech cues, shallow roles, boundary flags, sentiment polarity and strength) whose values are graded degrees from differentiable membership functions (e.g. power kernels). These per-token vectors form a sentence-level semantic matrix fused via a gated adapter into the LM. Training uses standard next-token prediction, an auxiliary loss that reconstructs the semantic features from hidden states, and a lightweight uniformizer that regularizes adjective-class distributions. On a synthetic two-clause corpus with held-out adjectives for out-of-distribution (OOD) control, semantic fusion improves perplexity and enables precise, user-controllable generation of polarity and punctuation while maintaining model simplicity. This approach adds only small overhead, remains fully compatible with tied input-output embeddings, and provides an interpretable pathway for conditioned natural language generation.
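
To make the two ingredients in the abstract concrete, here is a minimal NumPy sketch of a graded membership function and a gated fusion step. The exact form of the paper's power kernel and adapter is not given in the abstract, so the formulas, the residual connection, and the names `power_kernel`, `gated_fusion`, `w_proj`, and `w_gate` are all assumptions for illustration.

```python
import numpy as np

def power_kernel(x, centre, width, p=2.0):
    """Graded membership in [0, 1]: 1 at `centre`, 0 once |x - centre| >= width.
    This functional form is an assumption, not taken from the paper."""
    return np.clip(1.0 - np.abs(x - centre) / width, 0.0, 1.0) ** p

def gated_fusion(hidden, semantic, w_proj, w_gate):
    """Fuse per-token semantic features into hidden states (hypothetical adapter).

    hidden: (T, d), semantic: (T, k), w_proj: (k, d), w_gate: (d + k, d).
    """
    projected = semantic @ w_proj                       # map features to model width
    gate_in = np.concatenate([hidden, semantic], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate)))    # sigmoid gate in (0, 1)
    return hidden + gate * projected                    # residual injection

rng = np.random.default_rng(0)
T, d, k = 4, 8, 3
hidden = rng.normal(size=(T, d))
raw_scores = rng.uniform(-1.0, 1.0, size=(T, k))        # hypothetical per-token scores
semantic = power_kernel(raw_scores, centre=0.0, width=1.0)
fused = gated_fusion(hidden, semantic,
                     rng.normal(size=(k, d)), rng.normal(size=(d + k, d)))
```

Because the gate multiplies only the projected features, a zero projection leaves the hidden states unchanged, which is one simple way to keep the added channel lightweight.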

[AI-51] Synthetic Data and the Shifting Ground of Truth

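[Quick read]: This paper asks how reliable training data and labels can be established in the era of generative AI once the traditional notion of ground truth breaks down. The core challenge is that synthetic data lacks a representational relationship to the real world yet often improves model performance, which overturns the classic "garbage in - garbage out" assumption. The key is to rethink what counts as "true": moving from a representational concept that depends on external reference to a mimetic or iconic concept of data, that is, accepting that labels are themselves products of generative models which no longer rely on external validation and instead earn their role as ground truth through internal consistency and fitness for the task.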

Link: https://arxiv.org/abs/2509.13355
Authors: Dietmar Offenhuber
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Talk presented at the Society for the Social Studies of Science (4S) 2025 meeting in Seattle, Sept. 3, 2025

Abstract:The emergence of synthetic data for privacy protection, training data generation, or simply convenient access to quasi-realistic data in any shape or volume complicates the concept of ground truth. Synthetic data mimic real-world observations, but do not refer to external features. This lack of a representational relationship, however, does not prevent researchers from using synthetic data as training data for AI models and ground truth repositories. It is claimed that the lack of data realism is not merely an acceptable tradeoff, but often leads to better model performance than realistic data: synthetic data can compensate for known biases, prevent overfitting and support generalization, and make the models more robust in dealing with unexpected outliers. Indeed, injecting noisy and outright implausible data into training sets can be beneficial for the model. This greatly complicates usual assumptions based on which representational accuracy determines data fidelity (garbage in - garbage out). Furthermore, ground truth becomes a self-referential affair, in which the labels used as a ground truth repository are themselves synthetic products of a generative model and as such not connected to real-world observations. My paper examines how ML researchers and practitioners bootstrap ground truth under such paradoxical circumstances without relying on the stable ground of representation and real-world reference. It will also reflect on the broader implications of a shift from a representational to what could be described as a mimetic or iconic concept of data.

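[AI-52] Agentic UAVs: LLM-Driven Autonomy with Integrated Tool-Calling and Cognitive Reasoning

[Quick read]: This paper aims to address the limited autonomy of current Unmanned Aerial Vehicles (UAVs) in complex, dynamic missions: existing systems are mostly confined to SAE Level 2–3 automation, rely on rule-based control and narrow AI, and lack context-aware reasoning, autonomous decision-making, and ecosystem-level integration. The key solution is the Agentic UAVs framework, a five-layer architecture (perception, reasoning, action, integration, learning) that introduces Large Language Model (LLM) agents with tool-calling for access to real-time knowledge and interaction with external systems. A ROS2 and Gazebo prototype that integrates YOLOv11 object detection with GPT-4 reasoning shows, in simulated search-and-rescue scenarios, higher detection confidence (0.79 vs. 0.72), a higher person detection rate (91% vs. 75%), and a much higher action recommendation rate (92% vs. 4.5%), indicating that a qualitative gain in autonomy and ecosystem integration is achievable at modest computational overhead.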

Link: https://arxiv.org/abs/2509.13352
Authors: Anis Koubaa, Khaled Gabr
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 14 pages, 1 figure

Abstract:Unmanned Aerial Vehicles (UAVs) are increasingly deployed in defense, surveillance, and disaster response, yet most systems remain confined to SAE Level 2–3 autonomy. Their reliance on rule-based control and narrow AI restricts adaptability in dynamic, uncertain missions. Existing UAV frameworks lack context-aware reasoning, autonomous decision-making, and ecosystem-level integration; critically, none leverage Large Language Model (LLM) agents with tool-calling for real-time knowledge access. This paper introduces the Agentic UAVs framework, a five-layer architecture (Perception, Reasoning, Action, Integration, Learning) that augments UAVs with LLM-driven reasoning, database querying, and third-party system interaction. A ROS2 and Gazebo-based prototype integrates YOLOv11 object detection with GPT-4 reasoning and local Gemma-3 deployment. In simulated search-and-rescue scenarios, agentic UAVs achieved higher detection confidence (0.79 vs. 0.72), improved person detection rates (91% vs. 75%), and markedly increased action recommendation (92% vs. 4.5%). These results confirm that modest computational overhead enables qualitatively new levels of autonomy and ecosystem integration.

[AI-53] Label-Efficient Grasp Joint Prediction with Point-JEPA IROS2025

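[Quick read]: This paper aims to achieve efficient and accurate grasp joint-angle prediction when labelled data is scarce. The key to its solution is point-cloud-based 3D self-supervised pretraining: a ShapeNet-pretrained Joint-Embedding Predictive Architecture encoder (Point-JEPA) provides the point-cloud features, a lightweight multi-hypothesis head is trained on top with a winner-takes-all objective, and predictions are evaluated by top-logit selection. On the DLR-Hand II dataset the method reduces root-mean-square error (RMSE) by up to 26% in low-label regimes and reaches parity with full supervision, supporting the data efficiency of JEPA-style self-supervised pretraining.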

Link: https://arxiv.org/abs/2509.13349
Authors: Jed Guzelkabaagac, Boris Petrović
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 4 pages, 5 figures. Submitted to IROS 2025 Workshop

Abstract:We investigate whether 3D self-supervised pretraining with a Joint-Embedding Predictive Architecture (Point-JEPA) enables label-efficient grasp joint-angle prediction. Using point clouds tokenized from meshes and a ShapeNet-pretrained Point-JEPA encoder, we train a lightweight multi-hypothesis head with winner-takes-all and evaluate by top-logit selection. On DLR-Hand II with object-level splits, Point-JEPA reduces RMSE by up to 26% in low-label regimes and reaches parity with full supervision. These results suggest JEPA-style pretraining is a practical approach for data-efficient grasp learning.
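
The winner-takes-all training and top-logit evaluation mentioned in the abstract can be sketched as follows. This is a minimal illustration under assumptions: the mean-squared-error criterion, the array shapes, and the names `wta_loss` and `select_by_top_logit` are chosen here and are not taken from the paper.

```python
import numpy as np

def wta_loss(hypotheses, target):
    """Winner-takes-all: only the hypothesis closest to the target is penalized.

    hypotheses: (K, J) candidate joint-angle vectors; target: (J,).
    Returns (loss of the winning hypothesis, index of the winner).
    """
    errors = np.mean((hypotheses - target) ** 2, axis=1)   # per-hypothesis MSE
    winner = int(np.argmin(errors))
    return float(errors[winner]), winner

def select_by_top_logit(hypotheses, logits):
    """At evaluation time no target is available, so pick the highest-scored hypothesis."""
    return hypotheses[int(np.argmax(logits))]

hypotheses = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
target = np.array([1.1, 0.9])
loss, winner = wta_loss(hypotheses, target)
chosen = select_by_top_logit(hypotheses, logits=np.array([0.1, 2.0, -1.0]))
```

Training only the closest hypothesis lets the head keep several distinct grasp modes instead of averaging them, while the logits give a way to choose one mode at test time.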

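[AI-54] OpenHA: A Series of Open-Source Hierarchical Agentic Models in Minecraft

[Quick read]: This paper aims to address the limited generalization of multimodal Vision-Language-Action (VLA) agents in open-world environments that results from the absence of a single suitable action-space abstraction. Existing methods usually rely on one fixed action space, but the study shows that the best abstraction differs considerably from task to task, which makes general-purpose agents hard to build. The key to the solution is the Chain of Action (CoA) framework, which unifies high-level planning and low-level control in one end-to-end trainable VLA model and treats an abstracted action as an intermediate reasoning step (analogous to a chain of thought) rather than a separate command, so that it guides generation of the final executable action. This lets an agent trained on a mixture of action spaces learn a more robust and general policy and reach a new state of the art on a benchmark of more than 800 Minecraft tasks.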

Link: https://arxiv.org/abs/2509.13347
Authors: Zihao Wang, Muyao Li, Kaichen He, Xiangyu Wang, Zhancun Mu, Anji Liu, Yitao Liang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The choice of action spaces is a critical yet unresolved challenge in developing capable, end-to-end trainable agents. This paper first presents a large-scale, systematic comparison of prominent abstracted action spaces and tokenizers for Vision-Language-Action (VLA) or hierarchical agent models in the open-ended Minecraft. Our analysis reveals that no single action space is universally optimal; instead, the most effective abstraction is highly task-dependent, creating a dilemma for building generalist agents. To resolve this, we introduce Chain of Action (CoA), a novel framework that unifies high-level planning and low-level control within a single, monolithic VLA model. CoA treats an abstracted action not as a command for a separate policy, but as an intermediate reasoning step–akin to a chain of thought–that guides the generation of the final, executable action. Furthermore, we demonstrate that an All-in-One agent trained on a diverse mixture of action spaces using the CoA paradigm learns a more robust and generalizable policy. This unified agent achieves a new state-of-the-art, improving the overall task success rate over strong, specialized baselines. To foster reproducible research, we release the OpenHA (Open Hierarchical Agents) suite, which includes our comprehensive benchmark of over 800 distinct tasks, curated datasets, source code, and all pretrained model checkpoints at this https URL

[AI-55] Real World Robotic Exploration using Deep Neural Networks Trained in Photorealistic Reconstructed Environments

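[Quick read]: This paper aims to improve the localization accuracy of robot pose estimation from RGB images in indoor scenes, particularly under perceptual aliasing. The key to its solution is a modified loss function for an existing deep neural network that combines positional and rotational error in an intuitive way, making the model more robust to visually ambiguous scenes. The modified model lowers the median positional and rotational errors in indoor scenes by up to 9.64% and 2.99% respectively; trained on a pose-labelled dataset built from photogrammetry data it reaches a localization accuracy of 0.11 m and 0.89°, and it underpins a navigation algorithm that runs in real time on a TurtleBot, giving a complete navigation pipeline for a given indoor environment.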

Link: https://arxiv.org/abs/2509.13342
Authors: Isaac Ronald Ward
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: This report is submitted as partial fulfilment of the requirements for the Honours Programme of the Department of Computer Science and Software Engineering, The University of Western Australia, 2019

Abstract:In this work, an existing deep neural network approach for determining a robot’s pose from visual information (RGB images) is modified, improving its localization performance without impacting its ease of training. Explicitly, the network’s loss function is extended in a manner which intuitively combines the positional and rotational error in order to increase robustness to perceptual aliasing. An improvement in the localization accuracy for indoor scenes is observed: with decreases of up to 9.64% and 2.99% in the median positional and rotational error respectively, when compared to the unmodified network. Additionally, photogrammetry data is used to produce a pose-labelled dataset which allows the above model to be trained on a local environment, resulting in localization accuracies of 0.11 m and 0.89 degrees. This trained model forms the basis of a navigation algorithm, which is tested in real-time on a TurtleBot (a wheeled robotic device). As such, this work introduces a full pipeline for creating a robust navigational algorithm for any given real world indoor scene; the only requirement being a collection of images from the scene, which can be captured in as little as 330 seconds of
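
The positional and rotational errors quoted in the abstract are typically computed as a Euclidean distance and a quaternion angle. The sketch below shows those two metrics plus a simple weighted sum of them. The weighted sum is only a generic baseline written for illustration: the abstract does not specify how the paper combines the two terms, and the weight `beta` and all function names are assumptions.

```python
import numpy as np

def position_error(p_pred, p_true):
    """Euclidean distance between predicted and true positions (same units as input)."""
    return float(np.linalg.norm(np.asarray(p_pred, dtype=float) - np.asarray(p_true, dtype=float)))

def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions (w, x, y, z)."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred)
    q_true = q_true / np.linalg.norm(q_true)
    # abs() treats q and -q as the same rotation; min() guards against rounding above 1.
    dot = min(abs(float(q_pred @ q_true)), 1.0)
    return float(np.degrees(2.0 * np.arccos(dot)))

def combined_pose_loss(p_pred, p_true, q_pred, q_true, beta=1.0):
    """Hypothetical baseline: position error plus beta times rotation error in radians."""
    return position_error(p_pred, p_true) + beta * np.radians(rotation_error_deg(q_pred, q_true))

q_identity = [1.0, 0.0, 0.0, 0.0]
q_z90 = [np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]   # 90 degree turn about z
pos_err = position_error([0.0, 0.0, 0.0], [3.0, 4.0, 0.0])
rot_err = rotation_error_deg(q_identity, q_z90)
```

Taking medians of `pos_err` and `rot_err` over a test set gives figures comparable in kind to the 0.11 m and 0.89 degrees reported above.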

[AI-56] Imagined Autocurricula

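[Quick read]: This paper addresses how to train agents that generalize when large amounts of real-world training data or accurate simulators are unavailable. The core challenge is to turn limited, offline-collected data into useful training environments that improve the agent's adaptation to new task variations. The key to the solution is a new method, IMAC (Imagined Autocurricula), which builds on Unsupervised Environment Design (UED) to induce an automatic curriculum over generated worlds, so that the agent trains inside imagined environments produced by a world model and policies learned from a narrow dataset transfer to held-out environments.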

Link: https://arxiv.org/abs/2509.13341
Authors: Ahmet H. Güzel, Matthew Thomas Jackson, Jarek Luca Liesen, Tim Rocktäschel, Jakob Nicolaus Foerster, Ilija Bogunovic, Jack Parker-Holder
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Training agents to act in embodied environments typically requires vast training data or access to accurate simulation, neither of which exists for many cases in the real world. Instead, world models are emerging as an alternative: leveraging offline, passively collected data, they make it possible to generate diverse worlds for training agents in simulation. In this work, we harness world models to generate imagined environments to train robust agents capable of generalizing to novel task variations. One of the challenges in doing this is ensuring the agent trains on useful generated data. We thus propose a novel approach, IMAC (Imagined Autocurricula), leveraging Unsupervised Environment Design (UED), which induces an automatic curriculum over generated worlds. In a series of challenging, procedurally generated environments, we show it is possible to achieve strong transfer performance on held-out environments, having trained only inside a world model learned from a narrower dataset. We believe this opens the path to utilizing larger-scale, foundation world models for generally capable agents.

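[AI-57] Position: AI Safety Must Embrace an Antifragile Perspective

[Quick read]: This paper argues that current AI safety research, by relying on static benchmarks and single-shot robustness tests, falls short on long-term reliability, especially when facing evolving environments, rare or out-of-distribution (OOD) events, and model drift such as reward hacking, over-optimization, or atrophy of capabilities. The key to the proposed solution is an antifragile perspective: rather than trying to eliminate current uncertainty as fast as possible, the system should use that uncertainty to strengthen its ability to cope with larger and less predictable risks in the future, so that AI safety mechanisms keep improving over time.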

Link: https://arxiv.org/abs/2509.13339
Authors: Ming Jin, Hyunin Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This position paper contends that modern AI research must adopt an antifragile perspective on safety: one in which the system’s capacity to guarantee long-term AI safety, such as handling rare or out-of-distribution (OOD) events, expands over time. Conventional static benchmarks and single-shot robustness tests overlook the reality that environments evolve and that models, if left unchallenged, can drift into maladaptation (e.g., reward hacking, over-optimization, or atrophy of broader capabilities). We argue that an antifragile approach (rather than striving to rapidly reduce current uncertainties, the emphasis is on leveraging those uncertainties to better prepare for potentially greater, more unpredictable uncertainties in the future) is pivotal for the long-term reliability of open-ended ML systems. In this position paper, we first identify key limitations of static testing, including scenario diversity, reward hacking, and over-alignment. We then explore the potential of antifragile solutions to manage rare events. Crucially, we advocate for a fundamental recalibration of the methods used to measure, benchmark, and continually improve AI safety over the long term, complementing existing robustness approaches by providing ethical and practical guidelines towards fostering an antifragile AI safety community.

[AI-58] FRIT: Using Causal Importance to Improve Chain-of-Thought Faithfulness

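[Quick read]: This paper aims to address causal inconsistency in chain-of-thought (CoT) reasoning: a model's reasoning steps often do not causally influence its final answer, which makes outputs unreliable and hard to trust. The key to the solution is a scalable alignment method, Faithful Reasoning via Intervention Training (FRIT): it first intervenes on individual reasoning steps in model-generated CoTs to construct pairs of faithful and unfaithful reasoning, then trains the model with Direct Preference Optimization (DPO) to prefer causally consistent reasoning paths. The method needs no human annotation and moves from merely measuring reasoning faithfulness to systematically improving it.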

Link: https://arxiv.org/abs/2509.13334
Authors: Anand Swaroop, Akshat Nallani, Saksham Uboweja, Adiliia Uzdenova, Michael Nguyen, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Vasu Sharma, Maheep Chaudhary
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Chain-of-thought (CoT) reasoning has emerged as a powerful tool for improving large language model performance on complex tasks, but recent work shows that reasoning steps often fail to causally influence the final answer, creating brittle and untrustworthy outputs. Prior approaches focus primarily on measuring faithfulness, while methods for systematically improving it remain limited. We introduce Faithful Reasoning via Intervention Training (FRIT), a scalable alignment method that trains models to produce causally consistent reasoning by learning from systematically corrupted examples. FRIT generates synthetic training data by intervening on individual reasoning steps in model-generated CoTs, creating faithful/unfaithful pairs that highlight when reasoning breaks down. We then apply Direct Preference Optimization to teach models to prefer causally consistent reasoning paths. Evaluating on Qwen3-8B and Mistral-7B-v0.1 across factual and symbolic reasoning tasks, FRIT increases faithful reasoning by 3.4 percentage points for Mistral on GSM8K while improving accuracy by 7.6 percentage points. Our approach provides the first scalable, supervision-free method for training language models to produce more reliable and interpretable reasoning, addressing a critical gap between reasoning performance and trustworthiness. We release our code at this https URL.
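
For readers unfamiliar with Direct Preference Optimization, the per-pair objective can be written in a few lines. The sketch below uses the standard DPO loss on summed sequence log-probabilities; treating the faithful CoT as "chosen" and the corrupted one as "rejected" follows the abstract, while the value `beta=0.1` and the function name are assumptions rather than the paper's settings.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under the
    trained policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin)).
    return float(np.logaddexp(0.0, -margin))

# Hypothetical log-probabilities, for illustration only.
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)          # policy equals reference
prefers_faithful = dpo_loss(-0.5, -2.0, -1.0, -1.0)  # policy favours the faithful CoT
prefers_corrupt = dpo_loss(-2.0, -0.5, -1.0, -1.0)   # policy favours the corrupted CoT
```

The loss falls as the policy raises the faithful reasoning's likelihood relative to the reference and lowers the corrupted one's, which is the preference signal the intervention pairs supply.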

[AI-59] Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

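[Quick read]: This paper addresses "evaluation awareness" in Large Language Models (LLMs): models can tell evaluation contexts from deployment contexts and may conceal dangerous capabilities during evaluation, which misleads safety testing. The authors study how evaluation awareness changes across 15 models ranging from 0.27B to 70B parameters, using linear probing on steering vector activations. The key finding is that evaluation awareness grows as a power law in model size; this predictable scaling law can be used to forecast deceptive behaviour in larger future models and gives a basis for designing scale-aware safety evaluation strategies.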

Link: https://arxiv.org/abs/2509.13333
Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julia Tan, Kevin Zhu, Ashwinee Panda, Ryan Lagasse, Vasu Sharma
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) can internally distinguish between evaluation and deployment contexts, a behaviour known as evaluation awareness. This undermines AI safety evaluations, as models may conceal dangerous capabilities during testing. Prior work demonstrated this in a single 70B model, but the scaling relationship across model sizes remains unknown. We investigate evaluation awareness across 15 models scaling from 0.27B to 70B parameters from four families using linear probing on steering vector activations. Our results reveal a clear power-law scaling: evaluation awareness increases predictably with model size. This scaling law enables forecasting deceptive behavior in future larger models and guides the design of scale-aware evaluation strategies for AI safety. A link to the implementation of this paper can be found at this https URL.
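
A power-law relationship of the kind described above is usually checked with a straight-line fit in log-log space. The sketch below shows that fit with NumPy; the model sizes and scores are made-up numbers generated from a known power law purely to exercise the code, not results from the paper, and `fit_power_law` is a name chosen here.

```python
import numpy as np

def fit_power_law(sizes, scores):
    """Fit score ~ a * size**b by least squares on log(size) vs. log(score)."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(scores), deg=1)
    return float(np.exp(intercept)), float(slope)   # (a, b)

# Hypothetical data for illustration only (sizes in billions of parameters).
sizes = np.array([0.27, 1.0, 7.0, 70.0])
scores = 0.5 * sizes ** 0.2
a, b = fit_power_law(sizes, scores)

# Extrapolate to a larger hypothetical model size.
forecast = a * 400.0 ** b
```

With real probe accuracies in place of `scores`, the fitted exponent `b` summarizes how quickly the measured quantity grows with scale, and the same formula gives a forecast for larger models.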

[AI-60] Joint data imputation and mechanistic modelling for simulating heart-brain interactions in incomplete datasets

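[Quick read]: This paper addresses the limited use of mechanistic models in clinical studies, which stems from the lack of multi-modal patient data covering different anatomical and physiological processes; for example, neuroimaging datasets usually do not describe the heart well enough to model cardiovascular factors in brain disorders. The key to the solution is a probabilistic framework for joint imputation of cardiac data and personalisation of cardiovascular mechanistic models, in which variational inference jointly estimates an imputation model for cardiac information and a Gaussian Process emulator that faithfully reproduces personalised cardiovascular dynamics. Experiments show that the method accurately imputes missing cardiac features in datasets with minimal heart information (for example, only systolic and diastolic blood pressure) while jointly estimating the parameters of the lumped model, which supports exploring the heart-brain relationship through simulation of realistic cardiac dynamics.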

Link: https://arxiv.org/abs/2010.01052
Authors: Jaume Banus, Maxime Sermesant, Oscar Camara, Marco Lorenzi
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments:

Abstract:The use of mechanistic models in clinical studies is limited by the lack of multi-modal patient data representing different anatomical and physiological processes. For example, neuroimaging datasets do not provide a sufficient representation of heart features for the modeling of cardiovascular factors in brain disorders. To tackle this problem we introduce a probabilistic framework for joint cardiac data imputation and personalisation of cardiovascular mechanistic models, with application to brain studies with incomplete heart data. Our approach is based on a variational framework for the joint inference of an imputation model of cardiac information from the available features, along with a Gaussian Process emulator that can faithfully reproduce personalised cardiovascular dynamics. Experimental results on UK Biobank show that our model allows accurate imputation of missing cardiac features in datasets containing minimal heart information, e.g. systolic and diastolic blood pressures only, while jointly estimating the emulated parameters of the lumped model. This allows a novel exploration of the heart-brain joint relationship through simulation of realistic cardiac dynamics corresponding to different conditions of brain anatomy.

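[AI-61] Machines are more productive than humans until they aren't and vice versa

[Quick read]: This paper addresses how organizations, as Artificial Intelligence (AI) advances rapidly, can optimize decisions about deploying human and machine skills according to economic principles. The key to its solution is an in-silico framework based on Monte Carlo simulations and grounded in empirical realism, which quantifies the economic impact of deploying human and machine skills, alone or together, on tasks of varying complexity. The results indicate that automation is the most economical choice for tasks of low-to-medium generalization difficulty but struggles to match human skills in highly complex scenarios. The combined human-machine strategy is the best choice only when genuine augmentation is achieved; otherwise the inherent cost of its dual skill structure makes it destroy value and become the worst economic option. The point is therefore not simply to allocate skills: organizations must commit to achieving augmentation, which is the precondition for improved competitiveness.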

Link: https://arxiv.org/abs/2509.14057
Authors: Riccardo Zanardelli
Affiliation: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:With the growth of artificial skills, organizations may increasingly be confronted with the problem of optimizing skill policy decisions guided by economic principles. This paper addresses the underlying complexity of this challenge by developing an in-silico framework based on Monte Carlo simulations grounded in empirical realism to analyze the economic impact of human and machine skills, individually or jointly deployed, in the execution of tasks presenting varying levels of complexity. Our results provide quantitative support for the established notions that automation tends to be the most economically-effective strategy for tasks characterized by low-to-medium generalization difficulty, while automation struggles to match the economic utility of human skills in more complex scenarios. Critically, our simulations highlight that combining human and machine skills can be the most effective strategy when a high level of generalization is required, but only if genuine augmentation is achieved. In contrast, when failing to realize this synergy, the human-machine policy is severely penalized by the inherent costs of its dual skill structure, causing it to destroy value and become the worst choice from an economic perspective. The takeaway for decision-makers is unambiguous: simply allocating human and machine skills to a task is insufficient, and a human-machine skill policy is neither a silver-bullet solution nor a low-risk compromise. Rather, it is a critical opportunity to boost competitiveness that demands a strong organizational commitment to enabling augmentation. Also, our findings show that improving the cost-effectiveness of machine skills over time, while useful, does not replace the fundamental need to focus on achieving augmentation.

[AI-62] PhenoGnet: A Graph-Based Contrastive Learning Framework for Disease Similarity Prediction

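[Quick read]: This paper addresses disease similarity prediction in support of diagnosis, drug discovery, and personalized treatment in precision medicine and rare disease research. The core challenge is to integrate gene functional interaction networks with the Human Phenotype Ontology (HPO) effectively enough to uncover latent biological relationships beyond direct gene-phenotype overlap. The key to the solution is the PhenoGnet framework, which uses graph contrastive learning: Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) encode the gene graph and the phenotype graph separately (the intra-view model), and a shared-weight multilayer perceptron (MLP) serves as the cross-view model that aligns gene and phenotype embeddings through contrastive learning. Known gene-phenotype associations serve as positive pairs and randomly sampled unrelated pairs as negatives. Disease similarity is then computed as the cosine similarity between the mean embeddings of each disease's associated genes and phenotypes, reaching an AUCPR of 0.9012.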

Link: https://arxiv.org/abs/2509.14037
Authors: Ranga Baminiwatte, Kazi Jewel Rana, Aaron J. Masino
Affiliation: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Understanding disease similarity is critical for advancing diagnostics, drug discovery, and personalized treatment strategies. We present PhenoGnet, a novel graph-based contrastive learning framework designed to predict disease similarity by integrating gene functional interaction networks with the Human Phenotype Ontology (HPO). PhenoGnet comprises two key components: an intra-view model that separately encodes gene and phenotype graphs using Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs), and a cross-view model implemented as a shared-weight multilayer perceptron (MLP) that aligns gene and phenotype embeddings through contrastive learning. The model is trained using known gene-phenotype associations as positive pairs and randomly sampled unrelated pairs as negatives. Diseases are represented by the mean embeddings of their associated genes and/or phenotypes, and pairwise similarity is computed via cosine similarity. Evaluation on a curated benchmark of 1,100 similar and 866 dissimilar disease pairs demonstrates strong performance, with gene-based embeddings achieving an AUCPR of 0.9012 and AUROC of 0.8764, outperforming existing state-of-the-art methods. Notably, PhenoGnet captures latent biological relationships beyond direct overlap, offering a scalable and interpretable solution for disease similarity prediction. These results underscore its potential for enabling downstream applications in rare disease research and precision medicine.
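
The final scoring step described in the abstract (mean embedding per disease, then cosine similarity) is simple enough to show directly. In the sketch below the embedding matrix is a tiny hand-written stand-in for learned gene embeddings, and the function names and gene-index lists are hypothetical.

```python
import numpy as np

def disease_embedding(gene_embeddings, gene_ids):
    """Represent a disease by the mean embedding of its associated genes."""
    return gene_embeddings[gene_ids].mean(axis=0)

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 2-D embeddings for three genes (illustrative values, not learned).
gene_embeddings = np.array([[1.0, 0.0],
                            [0.0, 1.0],
                            [1.0, 1.0]])

disease_a = disease_embedding(gene_embeddings, [0, 1])   # mean -> [0.5, 0.5]
disease_b = disease_embedding(gene_embeddings, [2])      # [1.0, 1.0]
disease_c = disease_embedding(gene_embeddings, [0])      # [1.0, 0.0]

sim_ab = cosine_similarity(disease_a, disease_b)
sim_ac = cosine_similarity(disease_a, disease_c)
```

Diseases A and B share no gene here, yet their mean embeddings point the same way, which illustrates how embedding-based similarity can capture relationships beyond direct overlap.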

[AI-63] DSpAST: Disentangled Representations for Spatial Audio Reasoning with Large Language Models

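[Quick read]: This paper addresses the difficulty a single audio encoder has in capturing sound-event type, direction, and distance at the same time for spatial audio reasoning: the information these tasks need is largely independent, which limits the performance of a single encoder. The key to the solution is DSpAST, a new audio encoder based on SpatialAST that learns disentangled representations of spatial audio with only 0.2% additional parameters; experiments on SpatialSoundQA show that it significantly outperforms the original SpatialAST.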

Link: https://arxiv.org/abs/2509.13927
Authors: Kevin Wilkinghoff, Zheng-Hua Tan
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments:

Abstract:Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the type of sound events, as well as the direction and distance of their corresponding sources. Accomplishing this with a single audio encoder is demanding as the information required for each of these tasks is mostly independent of each other. As a result, the performance obtained with a single encoder is often worse than when using task-specific audio encoders. In this work, we present DSpAST, a novel audio encoder based on SpatialAST that learns disentangled representations of spatial audio while having only 0.2% additional parameters. Experiments on SpatialSoundQA with the spatial audio reasoning system BAT demonstrate that DSpAST significantly outperforms SpatialAST.

[AI-64] A reduced-order derivative-informed neural operator for subsurface fluid-flow

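[Quick read]: This paper addresses the insufficient gradient accuracy of neural operators used as surrogates for fluid-flow simulation, especially in computationally intensive tasks such as permeability inversion from time-lapse seismic data and uncertainty quantification, where the accuracy of downstream optimization and Bayesian inference depends heavily on the fidelity of the surrogate's gradients with respect to system parameters. Physics-informed methods can use derivative information to improve accuracy, but explicitly incorporating Jacobians makes the cost grow quadratically with the number of input parameters, which is impractical. The key to the solution is DeFINO (Derivative-based Fisher-score Informed Neural Operator), which combines Fourier Neural Operators (FNOs) with a new derivative-based training strategy guided by the Fisher Information Matrix (FIM): by projecting Jacobians onto the dominant eigen-directions identified by the FIM, DeFINO captures sensitivity information informed by observational data efficiently, greatly reducing computational cost while improving gradient accuracy and keeping robust forward predictions of subsurface multi-phase fluid dynamics.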

Link: https://arxiv.org/abs/2509.13620
Authors: Jeongjin (Jayjay) Park, Grant Bruer, Huseyin Tuna Erdinc, Abhinav Prakash Gahlot, Felix J. Herrmann
Affiliation: Unknown
Categories: Computational Physics (physics.comp-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Neural operators have emerged as cost-effective surrogates for expensive fluid-flow simulators, particularly in computationally intensive tasks such as permeability inversion from time-lapse seismic data, and uncertainty quantification. In these applications, the fidelity of the surrogate’s gradients with respect to system parameters is crucial, as the accuracy of downstream tasks, such as optimization and Bayesian inference, relies directly on the quality of the derivative information. Recent advances in physics-informed methods have leveraged derivative information to improve surrogate accuracy. However, incorporating explicit Jacobians can become computationally prohibitive, as the complexity typically scales quadratically with the number of input parameters. To address this limitation, we propose DeFINO (Derivative-based Fisher-score Informed Neural Operator), a reduced-order, derivative-informed training framework. DeFINO integrates Fourier neural operators (FNOs) with a novel derivative-based training strategy guided by the Fisher Information Matrix (FIM). By projecting Jacobians onto dominant eigen-directions identified by the FIM, DeFINO captures critical sensitivity information directly informed by observational data, significantly reducing computational expense. We validate DeFINO through synthetic experiments in the context of subsurface multi-phase fluid-flow, demonstrating improvements in gradient accuracy while maintaining robust forward predictions of underlying fluid dynamics. These results highlight DeFINO’s potential to offer practical, scalable solutions for inversion problems in complex real-world scenarios, all at substantially reduced computational cost.
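
The projection step in the abstract can be illustrated with a small linear-algebra sketch. It assumes Gaussian observation noise with identity covariance, in which case the Fisher Information Matrix reduces to J^T J for a Jacobian J; that assumption, the chosen rank, and the function names are made here for illustration and are not details confirmed by the abstract.

```python
import numpy as np

def dominant_directions(jacobian, rank):
    """Top-`rank` eigenvectors of the FIM, taken as J^T J (assumed noise model).

    jacobian: (n_obs, n_params). Returns an (n_params, rank) orthonormal basis.
    """
    fim = jacobian.T @ jacobian
    eigvals, eigvecs = np.linalg.eigh(fim)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:rank]        # indices of the largest ones
    return eigvecs[:, order]

def project_jacobian(jacobian, directions):
    """Reduced Jacobian: sensitivities along the dominant directions only."""
    return jacobian @ directions                    # (n_obs, rank)

rng = np.random.default_rng(0)
# A synthetic rank-2 Jacobian with 6 observations and 5 parameters.
jacobian = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 5))
directions = dominant_directions(jacobian, rank=2)
reduced = project_jacobian(jacobian, directions)
reconstructed = reduced @ directions.T
```

Training against `reduced` instead of the full Jacobian means matching `rank` columns rather than one per input parameter, which is where the cost saving in a reduced-order scheme would come from.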

[AI-65] Complexity Bounds for Smooth Convex Multiobjective Optimization

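【Quick Read】: This paper studies the oracle complexity of finding $\varepsilon$-Pareto stationary points in smooth multiobjective optimization, using as its progress metric the Pareto stationarity gap $\mathcal{G}(x)$, i.e., the norm of an optimal convex combination of the gradients. The key contribution is a set of lower bounds for different classes of methods (span first-order methods on strongly convex problems, and oblivious and adaptive scalarization strategies on convex problems) that hold on non-degenerate instances (distinct objectives and non-singleton Pareto fronts) and show how algorithmic structure limits the convergence rate. For strongly convex objectives, span methods converge linearly no faster than $\exp(-\Theta(T/\sqrt{\kappa}))$. For general convex problems, span methods with adaptive scalarizations face a lower bound of order $1/T^2$ on the gradient norm of the final iterate, which highlights a gap between known upper bounds and worst-case guarantees.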

Link: https://arxiv.org/abs/2509.13550
Authors: Phillipe R. Sampaio
Affiliation: Unknown
Categories: Optimization and Control (math.OC); Artificial Intelligence (cs.AI)
Comments: 16 pages

Abstract:We study the oracle complexity of finding $\varepsilon$-Pareto stationary points in smooth multiobjective optimization with $m$ objectives. The progress metric is the Pareto stationarity gap $\mathcal{G}(x)$ (the norm of an optimal convex combination of gradients). Our contributions are fourfold. (i) For strongly convex objectives, any span first-order method (iterates lie in the span of past gradients) exhibits linear convergence no faster than $\exp(-\Theta(T/\sqrt{\kappa}))$ after $T$ oracle calls, where $\kappa$ is the condition number, implying $\Theta(\sqrt{\kappa}\log(1/\varepsilon))$ iterations; this matches classical accelerated upper bounds. (ii) For convex problems and oblivious one-step methods (a fixed scalarization with pre-scheduled step sizes), we prove a lower bound of order $1/T$ on the best gradient norm among the first $T$ iterates. (iii) Although accelerated gradient descent is outside this restricted class, it is an oblivious span method and attains the same $1/T$ upper rate on a fixed scalarization. (iv) For convex problems and general span methods with adaptive scalarizations, we establish a universal lower bound of order $1/T^2$ on the gradient norm of the final iterate after $T$ steps, highlighting a gap between known upper bounds and worst-case guarantees. All bounds hold on non-degenerate instances with distinct objectives and non-singleton Pareto fronts; rates are stated up to universal constants and natural problem scaling.
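
The stationarity gap described in the abstract can be made concrete for the two-objective case, where the minimum-norm convex combination of two gradients has a closed form. The following is a minimal Python sketch, not code from the paper: the function name `pareto_gap_two_objectives` and the toy objectives $f_1(x) = \tfrac12 (x+1)^2$ and $f_2(x) = \tfrac12 (x-1)^2$ are assumptions chosen purely for illustration.

```python
import math


def pareto_gap_two_objectives(g1, g2):
    """Return (gap, lam) for two gradient vectors g1 and g2.

    gap is the norm of lam*g1 + (1-lam)*g2, minimized over lam in [0, 1].
    The function name and interface are hypothetical, not from the paper.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        lam = 0.5  # identical gradients: every weight gives the same norm
    else:
        # Unconstrained minimizer of ||g2 + lam*(g1 - g2)||^2, clipped to [0, 1].
        lam = -sum(d * b for d, b in zip(diff, g2)) / denom
        lam = min(1.0, max(0.0, lam))
    combo = [lam * a + (1.0 - lam) * b for a, b in zip(g1, g2)]
    return math.sqrt(sum(c * c for c in combo)), lam


# Toy objectives (assumed): gradients are x + 1 and x - 1.
for x in (0.0, 2.0):
    gap, lam = pareto_gap_two_objectives([x + 1.0], [x - 1.0])
    print(f"x={x}: gap={gap}, lam={lam}")
```

At `x = 0` the two gradients cancel with equal weights, so the gap is zero and the point is Pareto stationary; at `x = 2` both gradients point the same way and the gap equals the smaller gradient norm. For more than two objectives the same quantity requires solving a small quadratic program over the simplex.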

[AI-66] Explainable AI-Enhanced Supervisory Control for High-Precision Spacecraft Formation

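【Quick Read】: This paper addresses the loss of control precision caused by dynamic uncertainties and disturbances in high-precision spacecraft formation missions, specifically the Virtual Telescope for X-ray Observation (VTXO) mission, in which two separate spacecraft form a virtual telescope with a one-kilometer focal length to achieve 55 milli-arcsecond angular resolution. The key to the solution is combining AI with supervisory adaptive control: deep neural networks are integrated with a constrained, non-convex dynamic optimization pipeline to predict optimal mission parameters, timed automata provide supervisory control, and Monte Carlo simulations evaluate stability and robustness, reducing energy consumption and improving mission accuracy while meeting the precision criteria. The framework also provides explainability by predicting the energy consumption and mission error that result from a given set of mission parameters, enabling transparent, real-time trade-offs that traditional adaptive controllers do not offer.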

Link: https://arxiv.org/abs/2509.13331
Authors: Reza Pirayeshshirazinezhad
Affiliation: Unknown
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Robotics (cs.RO); Systems and Control (eess.SY)
Comments:

Abstract:We use artificial intelligence (AI) and supervisory adaptive control systems to plan and optimize the mission of precise spacecraft formation. Machine learning and robust control enhance the efficiency of spacecraft precision formation of the Virtual Telescope for X-ray Observation (VTXO) space mission. VTXO is a precise formation of two separate spacecraft making a virtual telescope with a one-kilometer focal length. One spacecraft carries the lens and the other spacecraft holds the camera to observe high-energy space objects in the X-ray domain with 55 milli-arcsecond angular resolution accuracy. Timed automata for supervisory control, Monte Carlo simulations for stability and robustness evaluation, and integration of deep neural networks for optimal estimation of mission parameters satisfy the high-precision mission criteria. We integrate deep neural networks with a constrained, non-convex dynamic optimization pipeline to predict optimal mission parameters, ensuring precision mission criteria are met. The AI framework provides explainability by predicting the resulting energy consumption and mission error for a given set of mission parameters. It allows for transparent, justifiable, and real-time trade-offs, a capability not present in traditional adaptive controllers. The results show reductions in energy consumption and improved mission accuracy, demonstrating the capability of the system to address dynamic uncertainties and disturbances.

[AI-67] Dual Actor DDPG for Airborne STAR-RIS Assisted Communications

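【Quick Read】: This paper addresses the performance limits imposed by the conventional assumption of independent Transmission and Reflection Coefficients (TRC) in UAV-mounted simultaneous transmit and reflect reconfigurable intelligent surface (Aerial-STAR) systems. In a multi-user downlink scenario, it jointly optimizes the UAV trajectory, the active beamforming vectors at the base station, and the passive RIS TRCs to improve communication efficiency under UAV energy constraints. The key to the solution is a new Dual Actor Deep Deterministic Policy Gradient (DA-DDPG) algorithm that uses two separate actor networks to handle the high-dimensional hybrid action space (a combination of discrete and continuous actions), together with a reward function based on the harmonic mean index (HFI) that promotes communication fairness among users.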

Link: https://arxiv.org/abs/2509.13328
Authors: Danish Rizvi, David Boyle
Affiliation: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:This study departs from the prevailing assumption of independent Transmission and Reflection Coefficients (TRC) in Airborne Simultaneous Transmit and Reflect Reconfigurable Intelligent Surface (STAR-RIS) research. Instead, we explore a novel multi-user downlink communication system that leverages a UAV-mounted STAR-RIS (Aerial-STAR) incorporating a coupled TRC phase shift model. Our key contributions include the joint optimization of UAV trajectory, active beamforming vectors at the base station, and passive RIS TRCs to enhance communication efficiency, while considering UAV energy constraints. We design the TRC as a combination of discrete and continuous actions, and propose a novel Dual Actor Deep Deterministic Policy Gradient (DA-DDPG) algorithm. The algorithm relies on two separate actor networks for high-dimensional hybrid action space. We also propose a novel harmonic mean index (HFI)-based reward function to ensure communication fairness amongst users. For comprehensive analysis, we study the impact of RIS size on UAV aerodynamics showing that it increases drag and energy demand. Simulation results demonstrate that the proposed DA-DDPG algorithm outperforms conventional DDPG and DQN-based solutions by 24% and 97%, respectively, in accumulated reward. Three-dimensional UAV trajectory optimization achieves 28% higher communication efficiency compared to two-dimensional and altitude optimization. The HFI based reward function provides 41% lower QoS denial rates as compared to other benchmarks. The mobile Aerial-STAR system shows superior performance over fixed deployed counterparts, with the coupled phase STAR-RIS outperforming dual Transmit/Reflect RIS and conventional RIS setups. These findings highlight the potential of Aerial-STAR systems and the effectiveness of our proposed DA-DDPG approach in optimizing their performance.

[AI-68] Prognosis of COVID-19 using Artificial Intelligence: A Systematic Review and Meta-analysis

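【Quick Read】: This paper examines how artificial intelligence (AI), in particular machine learning and deep learning methods, can use radiomic features from CT or chest X-ray (CXR) images to predict the prognosis of COVID-19 patients. Its key points are twofold. First, it systematically screens and synthesizes 36 studies, assessing a range of AI models (such as convolutional neural networks, Random Forest, and support vector machines) for predicting disease severity, the need for mechanical ventilation or intensive care unit (ICU) admission, and mortality. Second, it finds that combining patient demographics, clinical data, and laboratory tests with radiomic features improves model performance, supporting clinical decision-making and more effective allocation of medical resources.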

Link: https://arxiv.org/abs/2408.00208
Authors: SaeedReza Motamedian, Sadra Mohaghegh, Elham Babadi Oregani, Mahrsa Amjadi, Parnian Shobeiri, Negin Cheraghi, Niusha Solouki, Nikoo Ahmadi, Hossein Mohammad-Rahimi, Yassine Bouchareb, Arman Rahmim
Affiliation: Unknown
Categories: Medical Physics (physics.med-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Purpose: Artificial intelligence (AI) techniques have been extensively utilized for the diagnosis and prognosis of several diseases in recent years. This study identifies, appraises and synthesizes published studies on the use of AI for the prognosis of COVID-19. Method: Electronic search was performed using Medline, Google Scholar, Scopus, Embase, Cochrane and ProQuest. Studies that examined machine learning or deep learning methods to determine the prognosis of COVID-19 using CT or chest X-ray images were included. Pooled sensitivity, specificity, area under the curve and diagnostic odds ratio were calculated. Result: A total of 36 articles were included; various prognosis-related issues, including disease severity, mechanical ventilation or admission to the intensive care unit and mortality, were investigated. Several AI models and architectures were employed, such as the Siamese model, support vector machine, Random Forest, eXtreme Gradient Boosting, and convolutional neural networks. The models achieved 71%, 88% and 67% sensitivity for mortality, severity assessment and need for ventilation, respectively. Specificities of 69%, 89% and 89% were reported for the same outcomes. Conclusion: Based on the included articles, machine learning and deep learning methods used for the prognosis of COVID-19 patients using radiomic features from CT or CXR images can help clinicians manage patients and allocate resources more effectively. These studies also demonstrate that combining patient demographic, clinical data, laboratory tests and radiomic features improves model performances.
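
For readers less familiar with the metrics named in the abstract, the sketch below shows how sensitivity, specificity and the diagnostic odds ratio (DOR) follow from a single study's 2x2 confusion table. It is a minimal Python illustration under stated assumptions: the function name `diagnostic_metrics` and the counts are hypothetical, and the pooled estimates reported in the review come from meta-analytic models across studies, which this sketch does not attempt to reproduce.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Per-study sensitivity, specificity and diagnostic odds ratio.

    tp, fn, tn, fp are counts from one 2x2 confusion table.
    The function name and interface are hypothetical.
    """
    if min(tp, fn, tn, fp) < 0:
        raise ValueError("counts must be non-negative")
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    if fp == 0 or fn == 0:
        # The odds ratio is undefined when a cell in the denominator is zero.
        raise ValueError("DOR is undefined when fp or fn is zero")
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    dor = (tp * tn) / (fp * fn)    # odds of a positive test in cases vs. non-cases
    return {"sensitivity": sensitivity, "specificity": specificity, "dor": dor}


# Hypothetical counts, not taken from any included study.
print(diagnostic_metrics(tp=80, fn=20, tn=90, fp=10))
```

A DOR of 1 means the test does not discriminate between outcomes; larger values indicate better discrimination.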

Machine Learning

[LG-0] Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Link: https://arxiv.org/abs/2509.14234
Authors: Dulhan Jayalath, Shashwat Goel, Thomas Foster, Parag Jain, Suchin Gururangan, Cheng Zhang, Anirudh Goyal, Alan Schelten
Categories: Machine Learning (cs.LG)
Comments: 22 pages, 8 figures, 2 tables

Abstract:Where do learning signals come from when there is no ground truth in post-training? We propose turning exploration into supervision through Compute as Teacher (CaT), which converts the model's own exploration at inference-time into reference-free supervision by synthesizing a single reference from a group of parallel rollouts and then optimizing toward it. Concretely, the current policy produces a group of rollouts; a frozen anchor (the initial policy) reconciles omissions and contradictions to estimate a reference, turning extra inference-time compute into a teacher signal. We turn this into rewards in two regimes: (i) verifiable tasks use programmatic equivalence on final answers; (ii) non-verifiable tasks use self-proposed rubrics (binary, auditable criteria scored by an independent LLM judge), with reward given by the fraction satisfied. Unlike selection methods (best-of-N, majority, perplexity, or judge scores), synthesis may disagree with the majority and be correct even when all rollouts are wrong; performance scales with the number of rollouts. As a test-time procedure, CaT improves Gemma 3 4B, Qwen 3 4B, and Llama 3.1 8B (up to +27% on MATH-500; +12% on HealthBench). With reinforcement learning (CaT-RL), we obtain further gains (up to +33% and +30%), with the trained policy surpassing the initial teacher signal.
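
The two reward regimes named in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names are hypothetical, the string normalization used for the verifiable case is an assumed stand-in for whatever programmatic equivalence check the paper uses, and the rubric judgments are assumed to arrive as booleans from a judge that is not modeled here.

```python
def verifiable_reward(answer, reference):
    """Return 1.0 if the final answer matches the synthesized reference, else 0.0.

    Equivalence here is whitespace- and case-insensitive string equality,
    which is an assumption made for this sketch.
    """
    normalize = lambda s: " ".join(s.strip().lower().split())
    return 1.0 if normalize(answer) == normalize(reference) else 0.0


def rubric_reward(judgments):
    """Return the fraction of binary rubric criteria judged as satisfied.

    judgments is a non-empty list of booleans, one per criterion.
    """
    if not judgments:
        raise ValueError("at least one rubric criterion is required")
    return sum(1 for ok in judgments if ok) / len(judgments)


print(verifiable_reward(" 42 ", "42"))            # 1.0
print(rubric_reward([True, True, False, True]))   # 0.75
```

Because each criterion is binary, the rubric reward stays auditable: any score can be traced back to the specific criteria that passed or failed.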

[LG-1] NIRVANA: Structured pruning reimagined for large language models compression

Link: https://arxiv.org/abs/2509.14230
Authors: Mengting Ai, Tianxin Wei, Sirui Chen, Jingrui He
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Structured pruning of large language models (LLMs) offers substantial efficiency improvements by removing entire hidden units, yet current approaches often suffer from significant performance degradation, particularly in zero-shot settings, and necessitate costly recovery techniques such as supervised fine-tuning (SFT) or adapter insertion. To address these critical shortcomings, we introduce NIRVANA, a novel pruning method explicitly designed to balance immediate zero-shot accuracy preservation with robust fine-tuning capability. Leveraging a first-order saliency criterion derived from the Neural Tangent Kernel under Adam optimization dynamics, NIRVANA provides a theoretically grounded pruning strategy that respects essential model training behaviors. To further address the unique challenges posed by structured pruning, NIRVANA incorporates an adaptive sparsity allocation mechanism across layers and modules (attention vs. MLP), which adjusts pruning intensity between modules in a globally balanced manner. Additionally, to mitigate the high sensitivity of pruning decisions to calibration data quality, we propose a simple yet effective KL divergence-based calibration data selection strategy, ensuring more reliable and task-agnostic pruning outcomes. Comprehensive experiments conducted on Llama3, Qwen, and T5 models demonstrate that NIRVANA outperforms existing structured pruning methods under equivalent sparsity constraints, providing a theoretically sound and practical approach to LLM compression. The code is available at this https URL.

[LG-2] Multi-robot Multi-source Localization in Complex Flows with Physics-Preserving Environment Models

Link: https://arxiv.org/abs/2509.14228
Authors: Benjamin Shaffer, Victoria Edwards, Brooks Kinch, Nathaniel Trask, M. Ani Hsieh
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Abstract:Source localization in a complex flow poses a significant challenge for multi-robot teams tasked with localizing the source of chemical leaks or tracking the dispersion of an oil spill. The flow dynamics can be time-varying and chaotic, resulting in sporadic and intermittent sensor readings, and complex environmental geometries further complicate a team’s ability to model and predict the dispersion. To accurately account for the physical processes that drive the dispersion dynamics, robots must have access to computationally intensive numerical models, which can be difficult when onboard computation is limited. We present a distributed mobile sensing framework for source localization in which each robot carries a machine-learned, finite element model of its environment to guide information-based sampling. The models are used to evaluate an approximate mutual information criterion to drive an infotaxis control strategy, which selects sensing regions that are expected to maximize informativeness for the source localization objective. Our approach achieves faster error reduction compared to baseline sensing strategies and results in more accurate source localization compared to baseline machine learning approaches.

[LG-3] Defending Diffusion Models Against Membership Inference Attacks via Higher-Order Langevin Dynamics

Link: https://arxiv.org/abs/2509.14225
Authors: Benjamin Sterling, Yousef El-Laham, Mónica F. Bugallo
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 5 pages, 2 figures, 1 table

Abstract:Recent advances in generative artificial intelligence applications have raised new data security concerns. This paper focuses on defending diffusion models against membership inference attacks. This type of attack occurs when the attacker can determine if a certain data point was used to train the model. Although diffusion models are intrinsically more resistant to membership inference attacks than other generative models, they are still susceptible. The defense proposed here utilizes critically-damped higher-order Langevin dynamics, which introduces several auxiliary variables and a joint diffusion process along these variables. The idea is that the presence of auxiliary variables mixes external randomness that helps to corrupt sensitive input data earlier on in the diffusion process. This concept is theoretically investigated and validated on a toy dataset and a speech dataset using the Area Under the Receiver Operating Characteristic (AUROC) curves and the FID metric.

[LG-4] Data Denoising and Derivative Estimation for Data-Driven Modeling of Nonlinear Dynamical Systems

Link: https://arxiv.org/abs/2509.14219
Authors: Jiaqi Yao, Lewis Mitchell, John Maclean, Hemanth Saratchandran
Categories: Machine Learning (cs.LG); Dynamical Systems (math.DS); Computational Physics (physics.comp-ph)
Comments:

Abstract:Data-driven modeling of nonlinear dynamical systems is often hampered by measurement noise. We propose a denoising framework, called Runge-Kutta and Total Variation Based Implicit Neural Representation (RKTV-INR), that represents the state trajectory with an implicit neural representation (INR) fitted directly to noisy observations. Runge-Kutta integration and total variation are imposed as constraints to ensure that the reconstructed state is a trajectory of a dynamical system that remains close to the original data. The trained INR yields a clean, continuous trajectory and provides accurate first-order derivatives via automatic differentiation. These denoised states and derivatives are then supplied to Sparse Identification of Nonlinear Dynamics (SINDy) to recover the governing equations. Experiments demonstrate effective noise suppression, precise derivative estimation, and reliable system identification.

[LG-5] A Variational Framework for Residual-Based Adaptivity in Neural PDE Solvers and Operator Learning

Link: https://arxiv.org/abs/2509.14198
Authors: Juan Diego Toscano, Daniel T. Chen, Vivek Oommen, George Em Karniadakis
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computational Physics (physics.comp-ph)
Comments:

Abstract:Residual-based adaptive strategies are widely used in scientific machine learning but remain largely heuristic. We introduce a unifying variational framework that formalizes these methods by integrating convex transformations of the residual. Different transformations correspond to distinct objective functionals: exponential weights target the minimization of uniform error, while linear weights recover the minimization of quadratic error. Within this perspective, adaptive weighting is equivalent to selecting sampling distributions that optimize the primal objective, thereby linking discretization choices directly to error metrics. This principled approach yields three benefits: (1) it enables systematic design of adaptive schemes across norms, (2) reduces discretization error through variance reduction of the loss estimator, and (3) enhances learning dynamics by improving the gradient signal-to-noise ratio. Extending the framework to operator learning, we demonstrate substantial performance gains across optimizers and architectures. Our results provide a theoretical justification of residual-based adaptivity and establish a foundation for principled discretization and training strategies.

[LG-6] TopoSizing: An LLM-aided Framework of Topology-based Understanding and Sizing for AMS Circuits

Link: https://arxiv.org/abs/2509.14169
Authors: Ziming Wei, Zichen Kong, Yuan Wang, David Z. Pan, Xiyuan Tang
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Analog and mixed-signal circuit design remains challenging due to the shortage of high-quality data and the difficulty of embedding domain knowledge into automated flows. Traditional black-box optimization achieves sampling efficiency but lacks circuit understanding, which often causes evaluations to be wasted in low-value regions of the design space. In contrast, learning-based methods embed structural knowledge but are case-specific and costly to retrain. Recent attempts with large language models show potential, yet they often rely on manual intervention, limiting generality and transparency. We propose TopoSizing, an end-to-end framework that performs robust circuit understanding directly from raw netlists and translates this knowledge into optimization gains. Our approach first applies graph algorithms to organize circuits into a hierarchical device-module-stage representation. LLM agents then execute an iterative hypothesis-verification-refinement loop with built-in consistency checks, producing explicit annotations. Verified insights are integrated into Bayesian optimization through LLM-guided initial sampling and stagnation-triggered trust-region updates, improving efficiency while preserving feasibility.

[LG-7] Deconstructing Intraocular Pressure: A Non-invasive Multi-Stage Probabilistic Inverse Framework

Link: https://arxiv.org/abs/2509.14167
Authors: Md Rezwan Jaher, Abul Mukid Mohammad Mukaddes, A. B. M. Abdul Malek
Categories: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM); Applications (stat.AP); Methodology (stat.ME)
Comments: 43 pages, 10 figures (including supplementary material)

Abstract:Many critical healthcare decisions are challenged by the inability to measure key underlying parameters. Glaucoma, a leading cause of irreversible blindness driven by elevated intraocular pressure (IOP), provides a stark example. The primary determinant of IOP, a tissue property called trabecular meshwork permeability, cannot be measured in vivo, forcing clinicians to depend on indirect surrogates. This clinical challenge is compounded by a broader computational one: developing predictive models for such ill-posed inverse problems is hindered by a lack of ground-truth data and prohibitive cost of large-scale, high-fidelity simulations. We address both challenges with an end-to-end framework to noninvasively estimate unmeasurable variables from sparse, routine data. Our approach combines a multi-stage artificial intelligence architecture to functionally separate the problem; a novel data generation strategy we term PCDS that obviates the need for hundreds of thousands of costly simulations, reducing the effective computational time from years to hours; and a Bayesian engine to quantify predictive uncertainty. Our framework deconstructs a single IOP measurement into its fundamental components from routine inputs only, yielding estimates for the unmeasurable tissue permeability and a patient’s outflow facility. Our noninvasively estimated outflow facility achieved excellent agreement with state-of-the-art tonography with precision comparable to direct physical instruments. Furthermore, the newly derived permeability biomarker demonstrates high accuracy in stratifying clinical cohorts by disease risk, highlighting its diagnostic potential. More broadly, our framework establishes a generalizable blueprint for solving similar inverse problems in other data-scarce, computationally-intensive domains.

[LG-8] A Compositional Kernel Model for Feature Learning

Link: https://arxiv.org/abs/2509.14158
Authors: Feng Ruan, Keli Liu, Michael Jordan
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Abstract:We study a compositional variant of kernel ridge regression in which the predictor is applied to a coordinate-wise reweighting of the inputs. Formulated as a variational problem, this model provides a simple testbed for feature learning in compositional architectures. From the perspective of variable selection, we show how relevant variables are recovered while noise variables are eliminated. We establish guarantees showing that both global minimizers and stationary points discard noise coordinates when the noise variables are Gaussian distributed. A central finding is that $\ell_1$-type kernels, such as the Laplace kernel, succeed in recovering features contributing to nonlinear effects at stationary points, whereas Gaussian kernels recover only linear ones.

[LG-9] Breaking the Cycle of Incarceration With Targeted Mental Health Outreach: A Case Study in Machine Learning for Public Policy

Link: https://arxiv.org/abs/2509.14129
Authors: Kit T. Rodolfa, Erika Salomon, Jin Yao, Steve Yoder, Robert Sullivan, Kevin McGuire, Allie Dickinson, Rob MacDougall, Brian Seidler, Christina Sung, Claire Herdeman, Rayid Ghani
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
Comments:

Abstract:Many incarcerated individuals face significant and complex challenges, including mental illness, substance dependence, and homelessness, yet jails and prisons are often poorly equipped to address these needs. With little support from the existing criminal justice system, these needs can remain untreated and worsen, often leading to further offenses and a cycle of incarceration with adverse outcomes both for the individual and for public safety, with particularly large impacts on communities of color that continue to widen the already extensive racial disparities in criminal justice outcomes. Responding to these failures, a growing number of criminal justice stakeholders are seeking to break this cycle through innovative approaches such as community-driven and alternative approaches to policing, mentoring, community building, restorative justice, pretrial diversion, holistic defense, and social service connections. Here we report on a collaboration between Johnson County, Kansas, and Carnegie Mellon University to perform targeted, proactive mental health outreach in an effort to reduce reincarceration rates. This paper describes the data used, our predictive modeling approach and results, as well as the design and analysis of a field trial conducted to confirm our model’s predictive power, evaluate the impact of this targeted outreach, and understand at what level of reincarceration risk outreach might be most effective. Through this trial, we find that our model is highly predictive of new jail bookings, with more than half of individuals in the trial’s highest-risk group returning to jail in the following year. Outreach was most effective among these highest-risk individuals, with impacts on mental health utilization, EMS dispatches, and criminal justice involvement. 

[LG-10] From Distributional to Quantile Neural Basis Models: the case of Electricity Price Forecasting

Link: https://arxiv.org/abs/2509.14113
Authors: Alessandro Brusaferri, Danial Ramin, Andrea Ballarino
Categories: Machine Learning (cs.LG)
Comments: 6 pages

Abstract:While neural networks are achieving high predictive accuracy in multi-horizon probabilistic forecasting, understanding the underlying mechanisms that lead to feature-conditioned outputs remains a significant challenge for forecasters. In this work, we take a further step toward addressing this critical issue by introducing the Quantile Neural Basis Model, which incorporates the interpretability principles of Quantile Generalized Additive Models into an end-to-end neural network training framework. To this end, we leverage shared basis decomposition and weight factorization, complementing Neural Models for Location, Scale, and Shape by avoiding any parametric distributional assumptions. We validate our approach on day-ahead electricity price forecasting, achieving predictive performance comparable to distributional and quantile regression neural networks, while offering valuable insights into model behavior through the learned nonlinear mappings from input features to output predictions across the horizon.
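
Quantile regression models of the kind discussed in the abstract are commonly trained with the pinball (quantile) loss. The abstract does not state the exact training objective, so the Python sketch below is an assumption offered for orientation rather than the authors' loss; the function names `pinball_loss` and `mean_pinball_loss` are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of one prediction for quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit, over-prediction costs 1 - q.
    return max(q * diff, (q - 1.0) * diff)


def mean_pinball_loss(y_true, y_pred, q):
    """Average pinball loss over paired sequences of targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(pinball_loss(t, p, q) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical prices: for q = 0.9, predicting too low costs more than too high.
print(pinball_loss(10.0, 8.0, 0.9))   # under-prediction by 2
print(pinball_loss(10.0, 12.0, 0.9))  # over-prediction by 2
```

The asymmetry is what pushes a model trained at `q = 0.9` toward a value that the target exceeds only about 10% of the time, without assuming any parametric distribution.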

[LG-11] Exploring the Relationship between Brain Hemisphere States and Frequency Bands through Deep Learning Optimization Techniques

Link: https://arxiv.org/abs/2509.14078
Authors: Robiul Islam, Dmitry I. Ignatov, Karl Kaberg, Roman Nabatchikov
Categories: Machine Learning (cs.LG)
Comments:

Abstract:This study investigates classifier performance across EEG frequency bands using various optimizers and evaluates efficient class prediction for the left and right hemispheres. Three neural network architectures - a deep dense network, a shallow three-layer network, and a convolutional neural network (CNN) - are implemented and compared using the TensorFlow and PyTorch frameworks. Results indicate that the Adagrad and RMSprop optimizers consistently perform well across different frequency bands, with Adadelta exhibiting robust performance in cross-model evaluations. Specifically, Adagrad excels in the beta band, while RMSprop achieves superior performance in the gamma band. Conversely, SGD and FTRL exhibit inconsistent performance. Among the models, the CNN demonstrates the second highest accuracy, particularly in capturing spatial features of EEG data. The deep dense network shows competitive performance in learning complex patterns, whereas the shallow three-layer network, sometimes being less accurate, provides computational efficiency. SHAP (Shapley Additive Explanations) plots are employed to identify efficient class prediction, revealing nuanced contributions of EEG frequency bands to model accuracy. Overall, the study highlights the importance of optimizer selection, model architecture, and EEG frequency band analysis in enhancing classifier performance and understanding feature importance in neuroimaging-based classification tasks.

[LG-12] Online Bayesian Risk-Averse Reinforcement Learning

Link: https://arxiv.org/abs/2509.14077
Authors: Yuhao Wang, Enlu Zhou
Categories: Machine Learning (cs.LG)
Comments:

Abstract:In this paper, we study the Bayesian risk-averse formulation in reinforcement learning (RL). To address the epistemic uncertainty due to a lack of data, we adopt the Bayesian Risk Markov Decision Process (BRMDP) to account for the parameter uncertainty of the unknown underlying model. We derive the asymptotic normality that characterizes the difference between the Bayesian risk value function and the original value function under the true unknown distribution. The results indicate that the Bayesian risk-averse approach tends to pessimistically underestimate the original value function. This discrepancy increases with stronger risk aversion and decreases as more data become available. We then utilize this adaptive property in the setting of online RL as well as online contextual multi-arm bandits (CMAB), a special case of online RL. We provide two procedures using posterior sampling for both the general RL problem and the CMAB problem. We establish a sub-linear regret bound, with the regret defined as the conventional regret for both the RL and CMAB settings. Additionally, we establish a sub-linear regret bound for the CMAB setting with the regret defined as the Bayesian risk regret. Finally, we conduct numerical experiments to demonstrate the effectiveness of the proposed algorithm in addressing epistemic uncertainty and verifying the theoretical properties.

[LG-13] Physics-based deep kernel learning for parameter estimation in high dimensional PDEs

Link: https://arxiv.org/abs/2509.14054
Authors: Weihao Yan, Christoph Brune, Mengwu Guo
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments:

Abstract:Inferring parameters of high-dimensional partial differential equations (PDEs) poses significant computational and inferential challenges, primarily due to the curse of dimensionality and the inherent limitations of traditional numerical methods. This paper introduces a novel two-stage Bayesian framework that synergistically integrates training, physics-based deep kernel learning (DKL) with Hamiltonian Monte Carlo (HMC) to robustly infer unknown PDE parameters and quantify their uncertainties from sparse, exact observations. The first stage leverages physics-based DKL to train a surrogate model, which jointly yields an optimized neural network feature extractor and robust initial estimates for the PDE parameters. In the second stage, with the neural network weights fixed, HMC is employed within a full Bayesian framework to efficiently sample the joint posterior distribution of the kernel hyperparameters and the PDE parameters. Numerical experiments on canonical and high-dimensional inverse PDE problems demonstrate that our framework accurately estimates parameters, provides reliable uncertainty estimates, and effectively addresses challenges of data sparsity and model complexity, offering a robust and scalable tool for diverse scientific and engineering applications.

[LG-14] Nash Equilibria in Games with Playerwise Concave Coupling Constraints: Existence and Computation

Link: https://arxiv.org/abs/2509.14032
Authors: Philip Jordan, Maryam Kamgarpour
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments:

Abstract:We study the existence and computation of Nash equilibria in continuous static games where the players' admissible strategies are subject to shared coupling constraints, i.e., constraints that depend on their joint strategies. Specifically, we focus on a class of games characterized by playerwise concave utilities and playerwise concave constraints. Prior results on the existence of Nash equilibria are not applicable to this class, as they rely on strong assumptions such as joint convexity of the feasible set. By leveraging topological fixed point theory and novel structural insights into the contractibility of feasible sets under playerwise concave constraints, we give an existence proof for Nash equilibria under weaker conditions. Having established existence, we then focus on the computation of Nash equilibria via independent gradient methods under the additional assumption that the utilities admit a potential function. To account for the possibly nonconvex feasible region, we employ a log barrier regularized gradient ascent with adaptive stepsizes. Starting from an initial feasible strategy profile and under exact gradient feedback, the proposed method converges to an $\epsilon$-approximate constrained Nash equilibrium within $\mathcal{O}(\epsilon^{-3})$ iterations.

[LG-15] Deep Learning-Driven Peptide Classification in Biological Nanopores

Link: https://arxiv.org/abs/2509.14029
Authors: Samuel Tovey, Julian Hoßbach, Sandro Kuppel, Tobias Ensslen, Jan C. Behrends, Christian Holm
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP); Computational Physics (physics.comp-ph); Biomolecules (q-bio.BM)
*Comments: 29 pages (incl. references), 7 figures

Click to view abstract

Abstract:A device capable of performing real-time classification of proteins in a clinical setting would allow for inexpensive and rapid disease diagnosis. Nanopore devices are one candidate for this technology. These devices work by measuring a current signal that arises when a protein or peptide enters a nanometer-length-scale pore. Should this current be uniquely related to the structure of the peptide and its interactions with the pore, the signals can be used to perform identification. While such a method would allow for real-time identification of peptides and proteins in a clinical setting, to date, the complexities of these signals limit their accuracy. In this work, we tackle the issue of classification by converting the current signals into scalogram images via wavelet transforms, capturing amplitude, frequency, and time information in a modality well-suited to machine learning algorithms. When tested on 42 peptides, our method achieved a classification accuracy of approximately 81%, setting a new state-of-the-art in the field and taking a step toward practical peptide/protein diagnostics at the point of care. In addition, we demonstrate model transfer techniques that will be critical when deploying these models into real hardware, paving the way to a new method for real-time disease diagnosis.
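
To make the signal-to-scalogram step concrete, here is a minimal sketch of a continuous wavelet transform in plain Python. The abstract does not name the wavelet or the scale grid, so the Morlet wavelet, the parameter `w0`, the three scales, and the synthetic cosine "current trace" are all assumptions for illustration.

```python
import cmath
import math

def morlet(t, w0=6.0):
    """Complex Morlet wavelet (normalisation constant omitted)."""
    return cmath.exp(1j * w0 * t) * math.exp(-0.5 * t * t)

def scalogram(signal, scales, w0=6.0):
    """Return |CWT| as a list of rows, one row per scale (a small 2D image)."""
    n = len(signal)
    rows = []
    for s in scales:
        half = int(4 * s)  # truncate the wavelet at +/- 4 standard deviations
        offsets = range(-half, half + 1)
        kernel = [morlet(k / s, w0).conjugate() / math.sqrt(s) for k in offsets]
        row = []
        for t in range(n):
            acc = 0j
            for k, w in zip(offsets, kernel):
                if 0 <= t + k < n:  # samples outside the signal are treated as zero
                    acc += signal[t + k] * w
            row.append(abs(acc))
        rows.append(row)
    return rows

# Synthetic trace: a single tone at 0.05 cycles per sample (hypothetical data).
freq = 0.05
signal = [math.cos(2 * math.pi * freq * t) for t in range(256)]
scales = [8.0, 19.0, 32.0]
image = scalogram(signal, scales)

# The scale whose centre frequency w0 / (2*pi*s) matches the tone responds most strongly.
centre = [row[128] for row in image]
best_scale = scales[centre.index(max(centre))]
```

Each row of `image` is one scale and each column one time step, which is the 2D layout an image classifier would consume.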

[LG-16] Differentially private federated learning for localized control of infectious disease dynamics

Link: https://arxiv.org/abs/2509.14024
Authors: Raouf Kerkouche, Henrik Zunker, Mario Fritz, Martin J. Kühn
Categories: Machine Learning (cs.LG)
*Comments: 18 pages, 6 figures

Click to view abstract

Abstract:In times of epidemics, swift reaction is necessary to mitigate epidemic spreading. For this reaction, localized approaches have several advantages, limiting necessary resources and reducing the impact of interventions on a larger scale. However, training a separate machine learning (ML) model on a local scale is often not feasible due to limited available data. Centralizing the data is also challenging because of its high sensitivity and privacy constraints. In this study, we consider a localized strategy based on the German counties and communities managed by the related local health authorities (LHA). So that privacy preservation does not come at the expense of detailed situational data, we propose a privacy-preserving forecasting method that can assist public health experts and decision makers. ML methods with federated learning (FL) train a shared model without centralizing raw data. Considering the counties, communities or LHAs as clients and finding a balance between utility and privacy, we study an FL framework with client-level differential privacy (DP). We train a shared multilayer perceptron on sliding windows of recent case counts to forecast the number of cases, while clients exchange only norm-clipped updates and the server aggregates the updates with DP noise. We evaluate the approach on county-level COVID-19 data during two phases. As expected, very strict privacy yields unstable, unusable forecasts. At a moderately strong level, the DP model closely approaches the non-DP model: $R^2 = 0.94$ (vs. 0.95) and mean absolute percentage error (MAPE) of 26% in November 2020; $R^2 = 0.88$ (vs. 0.93) and MAPE of 21% in March 2022. Overall, client-level DP-FL can deliver useful county-level predictions with strong privacy guarantees, and viable privacy budgets depend on epidemic phase, allowing privacy-compliant collaboration among health authorities for local forecasting.
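
The clip-then-noise aggregation mentioned in the abstract can be sketched as follows. This is a generic illustration of client-level DP averaging with Gaussian noise, not the paper's code: the function names, the noise scale `noise_multiplier * clip_norm`, and the example updates are assumptions, and no privacy accounting is performed.

```python
import math
import random

def clip_update(update, clip_norm):
    """Scale a client's update so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def dp_aggregate(client_updates, clip_norm, noise_multiplier, rng):
    """Average clipped client updates after adding Gaussian noise to their sum."""
    clipped = [clip_update(u, clip_norm) for u in client_updates]
    dim = len(clipped[0])
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    # Assumed noise calibration: standard deviation proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]

# Hypothetical model updates from three clients (e.g. three counties).
updates = [[3.0, 4.0], [0.3, 0.4], [-1.0, 0.5]]
new_delta = dp_aggregate(updates, clip_norm=1.0, noise_multiplier=1.0, rng=random.Random(0))
```

Clipping bounds how much any single client can move the average, which is what allows the added noise to mask an individual client's contribution.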

[LG-17] Deep Temporal Graph Networks for Real-Time Correction of GNSS Jamming-Induced Deviations

Link: https://arxiv.org/abs/2509.14000
Authors: Ivana Kesić, Aljaž Blatnik, Carolina Fortuna, Blaž Bertalanič
Categories: Machine Learning (cs.LG)
*Comments: 20 pages, 4 figures

Click to view abstract

Abstract:Global Navigation Satellite Systems (GNSS) are increasingly disrupted by intentional jamming, degrading availability precisely when positioning and timing must remain operational. We address this by reframing jamming mitigation as dynamic graph regression and introducing a receiver-centric deep temporal graph network that predicts, and thus corrects, the receiver's horizontal deviation in real time. At each 1 Hz epoch, the satellite-receiver environment is represented as a heterogeneous star graph (receiver at the center, tracked satellites as leaves) with time-varying attributes (e.g., SNR, azimuth, elevation, latitude/longitude). A single-layer Heterogeneous Graph ConvLSTM (HeteroGCLSTM) aggregates one-hop spatial context and temporal dynamics over a short history to output the 2D deviation vector applied for on-the-fly correction. We evaluate on datasets from two distinct receivers under three jammer profiles: continuous wave (cw), triple tone (cw3), and wideband FM, each exercised at six power levels between -45 and -70 dBm, with 50 repetitions per scenario (prejam/jam/recovery). Against strong multivariate time series baselines (MLP, uniform CNN, and Seq2Point CNN), our model consistently attains the lowest mean absolute error (MAE). At -45 dBm, it achieves 3.64 cm (GP01/cw), 7.74 cm (GP01/cw3), 4.41 cm (ublox/cw), 4.84 cm (ublox/cw3), and 4.82 cm (ublox/FM), improving to 1.65-2.08 cm by -60 to -70 dBm. On mixed-mode datasets pooling all powers, MAE is 3.78 cm (GP01) and 4.25 cm (ublox10), outperforming Seq2Point, MLP, and CNN. A split study shows superior data efficiency: with only 10% training data our approach remains well ahead of baselines (20 cm vs. 36-42 cm).

[LG-18] Personalization on a Budget: Minimally-Labeled Continual Learning for Resource-Efficient Seizure Detection

Link: https://arxiv.org/abs/2509.13974
Authors: Amirhossein Shahbazinia, Jonathan Dan, Jose A. Miranda, Giovanni Ansaloni, David Atienza
Categories: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments:

Click to view abstract

Abstract:Objective: Epilepsy, a prevalent neurological disease, demands careful diagnosis and continuous care. Seizure detection remains challenging, as current clinical practice relies on expert analysis of electroencephalography, which is a time-consuming process and requires specialized knowledge. Addressing this challenge, this paper explores automated epileptic seizure detection using deep learning, focusing on personalized continual learning models that adapt to each patient’s unique electroencephalography signal features, which evolve over time. Methods: In this context, our approach addresses the challenge of integrating new data into existing models without catastrophic forgetting, a common issue in static deep learning models. We propose EpiSMART, a continual learning framework for seizure detection that uses a size-constrained replay buffer and an informed sample selection strategy to incrementally adapt to patient-specific electroencephalography signals. By selectively retaining high-entropy and seizure-predicted samples, our method preserves critical past information while maintaining high performance with minimal memory and computational requirements. Results: Validation on the CHB-MIT dataset shows that EpiSMART achieves a 21% improvement in the F1 score over a trained baseline without updates in all other patients. On average, EpiSMART requires only 6.46 minutes of labeled data and 6.28 updates per day, making it suitable for real-time deployment in wearable systems. Conclusion: EpiSMART enables robust and personalized seizure detection under realistic and resource-constrained conditions by effectively integrating new data into existing models without degrading past knowledge. Significance: This framework advances automated seizure detection by providing a continual learning approach that supports patient-specific adaptation and practical deployment in wearable healthcare systems.
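
A size-constrained replay buffer that favours high-entropy and seizure-predicted samples might look like the sketch below. The abstract does not give the actual selection rule, so the priority function (binary entropy plus a fixed bonus for samples predicted as seizure), the 0.5 threshold, and the class name are assumptions.

```python
import heapq
import math

def binary_entropy(p):
    """Entropy in bits of a Bernoulli prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

class ReplayBuffer:
    """Fixed-capacity buffer that evicts the lowest-priority sample first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []      # min-heap of (priority, insertion index, sample)
        self._counter = 0    # tie-breaker so samples themselves are never compared

    @staticmethod
    def priority(prob):
        # Assumed rule: uncertain samples score high, and seizure-predicted
        # samples (prob >= 0.5) get a bonus that ranks them above all others.
        return binary_entropy(prob) + (1.0 if prob >= 0.5 else 0.0)

    def add(self, sample, prob):
        item = (self.priority(prob), self._counter, sample)
        self._counter += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        else:
            # Push the new item, then drop whichever item now has the lowest priority.
            heapq.heappushpop(self._heap, item)

    def samples(self):
        return [s for _, _, s in sorted(self._heap, reverse=True)]

buffer = ReplayBuffer(capacity=2)
for name, prob in [("a", 0.01), ("b", 0.45), ("c", 0.90), ("d", 0.02)]:
    buffer.add(name, prob)
kept = buffer.samples()
```

With capacity 2, the confident non-seizure windows `a` and `d` are discarded, while the uncertain window `b` and the seizure-predicted window `c` are kept.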

[LG-19] eXtended Physics Informed Neural Network Method for Fracture Mechanics Problems

Link: https://arxiv.org/abs/2509.13952
Authors: Amin Lotfalian, Mohammad Reza Banan, Pooyan Broumand
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

Click to view abstract

Abstract:This paper presents eXtended Physics-Informed Neural Network (X-PINN), a novel and robust framework for addressing fracture mechanics problems involving multiple cracks in fractured media. To address this, an energy-based loss function, customized integration schemes, and domain decomposition procedures are proposed. Inspired by the Extended Finite Element Method (XFEM), the neural network solution space is enriched with specialized functions that allow crack body discontinuities and singularities at crack tips to be explicitly captured. Furthermore, a structured framework is introduced in which standard and enriched solution components are modeled using distinct neural networks, enabling flexible and effective simulations of complex multiple-crack problems in 1D and 2D domains, with convenient extensibility to 3D problems. Numerical experiments are conducted to validate the effectiveness and robustness of the proposed method.

[LG-20] Large Language Model-Empowered Decision Transformer for UAV-Enabled Data Collection

Link: https://arxiv.org/abs/2509.13934
Authors: Zhixion Chen, Jiangzhou Wang, Hyundong Shin, Arumugam Nallanathan
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments: 14 pages, 8 figures

Click to view abstract

Abstract:The deployment of unmanned aerial vehicles (UAVs) for reliable and energy-efficient data collection from spatially distributed devices holds great promise in supporting diverse Internet of Things (IoT) applications. Nevertheless, the limited endurance and communication range of UAVs necessitate intelligent trajectory planning. While reinforcement learning (RL) has been extensively explored for UAV trajectory optimization, its interactive nature entails high costs and risks in real-world environments. Offline RL mitigates these issues but remains susceptible to unstable training and relies heavily on expert-quality datasets. To address these challenges, we formulate a joint UAV trajectory planning and resource allocation problem to maximize the energy efficiency of data collection. The resource allocation subproblem is first transformed into an equivalent linear programming formulation and solved optimally with polynomial-time complexity. Then, we propose a large language model (LLM)-empowered critic-regularized decision transformer (DT) framework, termed LLM-CRDT, to learn effective UAV control policies. In LLM-CRDT, we incorporate critic networks to regularize the DT model training, thereby integrating the sequence modeling capabilities of DT with critic-based value guidance to enable learning effective policies from suboptimal datasets. Furthermore, to mitigate the data-hungry nature of transformer models, we employ a pre-trained LLM as the transformer backbone of the DT model and adopt a parameter-efficient fine-tuning strategy, i.e., LoRA, enabling rapid adaptation to UAV control tasks with small-scale datasets and low computational overhead. Extensive simulations demonstrate that LLM-CRDT outperforms benchmark online and offline RL methods, achieving up to 36.7% higher energy efficiency than the current state-of-the-art DT approaches.

[LG-21] Adaptive Client Selection via Q-Learning-based Whittle Index in Wireless Federated Learning

Link: https://arxiv.org/abs/2509.13933
Authors: Qiyue Li, Yingxin Liu, Hang Qi, Jieping Luo, Zhizhang Liu, Jingjin Wu
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:We consider the client selection problem in wireless Federated Learning (FL), with the objective of reducing the total required time to achieve a certain level of learning accuracy. Since the server cannot observe the clients’ dynamic states that can change their computation and communication efficiency, we formulate client selection as a restless multi-armed bandit problem. We propose a scalable and efficient approach called the Whittle Index Learning in Federated Q-learning (WILF-Q), which uses Q-learning to adaptively learn and update an approximated Whittle index associated with each client, and then selects the clients with the highest indices. Compared to existing approaches, WILF-Q does not require explicit knowledge of client state transitions or data distributions, making it well-suited for deployment in practical FL settings. Experiment results demonstrate that WILF-Q significantly outperforms existing baseline policies in terms of learning efficiency, providing a robust and efficient approach to client selection in wireless FL.

[LG-22] APFEx: Adaptive Pareto Front Explorer for Intersectional Fairness

Link: https://arxiv.org/abs/2509.13908
Authors: Priyobrata Mondal, Faizanuddin Ansari, Swagatam Das
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Ensuring fairness in machine learning models is critical, especially when biases compound across intersecting protected attributes like race, gender, and age. While existing methods address fairness for single attributes, they fail to capture the nuanced, multiplicative biases faced by intersectional subgroups. We introduce Adaptive Pareto Front Explorer (APFEx), the first framework to explicitly model intersectional fairness as a joint optimization problem over the Cartesian product of sensitive attributes. APFEx combines three key innovations: (1) an adaptive multi-objective optimizer that dynamically switches between Pareto cone projection, gradient weighting, and exploration strategies to navigate fairness-accuracy trade-offs, (2) differentiable intersectional fairness metrics enabling gradient-based optimization of non-smooth subgroup disparities, and (3) theoretical guarantees of convergence to Pareto-optimal solutions. Experiments on four real-world datasets demonstrate APFEx’s superiority, reducing fairness violations while maintaining competitive accuracy. Our work bridges a critical gap in fair ML, providing a scalable, model-agnostic solution for intersectional fairness.
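
To see why subgroups over the Cartesian product of attributes matter, the sketch below measures a positive-rate gap per intersectional subgroup. It is only a plain, non-differentiable demographic-parity statistic on made-up data, not the differentiable metrics or the optimizer that the paper proposes; the function name and the toy records are assumptions.

```python
from itertools import product

def intersectional_parity_gap(predictions, attributes):
    """Largest difference in positive-prediction rate across non-empty subgroups.

    predictions: list of 0/1 model outputs.
    attributes:  list of tuples, one tuple of sensitive attribute values per sample.
    """
    # Domain of each sensitive attribute, then every combination of values.
    domains = [sorted(set(column)) for column in zip(*attributes)]
    rates = {}
    for group in product(*domains):
        members = [p for p, a in zip(predictions, attributes) if a == group]
        if members:  # skip combinations that do not occur in the data
            rates[group] = sum(members) / len(members)
    return max(rates.values()) - min(rates.values()), rates

# Hypothetical data: two attributes with two values each.
preds = [1, 1, 0, 0, 1, 0]
attrs = [("A", "F"), ("A", "F"), ("A", "M"), ("B", "F"), ("B", "M"), ("B", "M")]
gap, rates = intersectional_parity_gap(preds, attrs)
```

In this toy data each attribute on its own shows a gap of only 1/3 (2/3 vs. 1/3 positive rate), while the intersectional gap is 1.0, which is the compounding effect the abstract describes.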

[LG-23] TFMAdapter: Lightweight Instance-Level Adaptation of Foundation Models for Forecasting with Covariates CIKM2025

Link: https://arxiv.org/abs/2509.13906
Authors: Afrin Dange, Sunita Sarawagi
Categories: Machine Learning (cs.LG)
*Comments: Accepted at CIKM 2025

Click to view abstract

Abstract:Time Series Foundation Models (TSFMs) have recently achieved state-of-the-art performance in univariate forecasting on new time series simply by conditioning on a brief history of past values. Their success demonstrates that large-scale pretraining across diverse domains can acquire the inductive bias to generalize from temporal patterns in a brief history. However, most TSFMs are unable to leverage covariates – future-available exogenous variables critical for accurate forecasting in many applications – due to their domain-specific nature and the lack of associated inductive bias. We propose TFMAdapter, a lightweight, instance-level adapter that augments TSFMs with covariate information without fine-tuning. Instead of retraining, TFMAdapter operates on the limited history provided during a single model call, learning a non-parametric cascade that combines covariates with univariate TSFM forecasts. However, such learning would require univariate forecasts at all steps in the history, requiring too many calls to the TSFM. To enable training on the full historical context while limiting TSFM invocations, TFMAdapter uses a two-stage method: (1) generating pseudo-forecasts with a simple regression model, and (2) training a Gaussian Process regressor to refine predictions using both pseudo- and TSFM forecasts alongside covariates. Extensive experiments on real-world datasets demonstrate that TFMAdapter consistently outperforms both foundation models and supervised baselines, achieving a 24-27% improvement over base foundation models with minimal data and computational overhead. Our results highlight the potential of lightweight adapters to bridge the gap between generic foundation models and domain-specific forecasting needs.

[LG-24] Graph-Regularized Learning of Gaussian Mixture Models

Link: https://arxiv.org/abs/2509.13855
Authors: Shamsiiat Abdurakhmanova, Alex Jung
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments:

Click to view abstract

Abstract:We present graph-regularized learning of Gaussian Mixture Models (GMMs) in distributed settings with heterogeneous and limited local data. The method exploits a provided similarity graph to guide parameter sharing among nodes, avoiding the transfer of raw data. The resulting model allows for flexible aggregation of neighbors’ parameters and outperforms both centralized and locally trained GMMs in heterogeneous, low-sample regimes.

[LG-25] An End-to-End Differentiable Graph Neural Network-Embedded Pore Network Model for Permeability Prediction

Link: https://arxiv.org/abs/2509.13841
Authors: Qingqi Zhao, Heng Xiao
Categories: Machine Learning (cs.LG); Geophysics (physics.geo-ph)
*Comments: This preprint is also available at ESS Open Archive: this https URL

Click to view abstract

Abstract:Accurate prediction of permeability in porous media is essential for modeling subsurface flow. While pure data-driven models offer computational efficiency, they often lack generalization across scales and do not incorporate explicit physical constraints. Pore network models (PNMs), on the other hand, are physics-based and efficient but rely on idealized geometric assumptions to estimate pore-scale hydraulic conductance, limiting their accuracy in complex structures. To overcome these limitations, we present an end-to-end differentiable hybrid framework that embeds a graph neural network (GNN) into a PNM. In this framework, the analytical formulas used for conductance calculations are replaced by GNN-based predictions derived from pore and throat features. The predicted conductances are then passed to the PNM solver for permeability computation. In this way, the model avoids the idealized geometric assumptions of PNM while preserving the physics-based flow calculations. The GNN is trained without requiring labeled conductance data, which can number in the thousands per pore network; instead, it learns conductance values by using a single scalar permeability as the training target. This is made possible by backpropagating gradients through both the GNN (via automatic differentiation) and the PNM solver (via a discrete adjoint method), enabling fully coupled, end-to-end training. The resulting model achieves high accuracy and generalizes well across different scales, outperforming both pure data-driven and traditional PNM approaches. Gradient-based sensitivity analysis further reveals physically consistent feature influences, enhancing model interpretability. This approach offers a scalable and physically informed framework for permeability prediction in complex porous media, reducing model uncertainty and improving accuracy.

[LG-26] Hybrid Quantum-Classical Neural Networks for Few-Shot Credit Risk Assessment

Link: https://arxiv.org/abs/2509.13818
Authors: Zheng-an Wang, Yanbo J. Wang, Jiachi Zhang, Qi Xu, Yilun Zhao, Jintao Li, Yipeng Zhang, Bo Yang, Xinkai Gao, Xiaofeng Cao, Kai Xu, Pengpeng Hao, Xuan Yang, Heng Fan
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph)
*Comments:

Click to view abstract

Abstract:Quantum Machine Learning (QML) offers a new paradigm for addressing complex financial problems intractable for classical methods. This work specifically tackles the challenge of few-shot credit risk assessment, a critical issue in inclusive finance where data scarcity and imbalance limit the effectiveness of conventional models. To address this, we design and implement a novel hybrid quantum-classical workflow. The methodology first employs an ensemble of classical machine learning models (Logistic Regression, Random Forest, XGBoost) for intelligent feature engineering and dimensionality reduction. Subsequently, a Quantum Neural Network (QNN), trained via the parameter-shift rule, serves as the core classifier. This framework was evaluated through numerical simulations and deployed on the Quafu Quantum Cloud Platform’s ScQ-P21 superconducting processor. On a real-world credit dataset of 279 samples, our QNN achieved a robust average AUC of 0.852 +/- 0.027 in simulations and yielded an impressive AUC of 0.88 in the hardware experiment. This performance surpasses a suite of classical benchmarks, with a particularly strong result on the recall metric. This study provides a pragmatic blueprint for applying quantum computing to data-constrained financial scenarios in the NISQ era and offers valuable empirical evidence supporting its potential in high-stakes applications like inclusive finance.
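
The parameter-shift rule mentioned in the abstract can be checked on the simplest possible circuit. The sketch assumes a single qubit prepared by an RY(θ) rotation and measured in Z, for which the expectation value is cos θ; this one-parameter circuit is an illustrative assumption and is unrelated to the paper's actual QNN architecture.

```python
import math

def expectation_z(theta):
    # For the state RY(theta)|0>, the expectation value of Pauli-Z is cos(theta).
    # On hardware this number would be estimated from repeated measurements.
    return math.cos(theta)

def parameter_shift_gradient(f, theta):
    # Parameter-shift rule for a gate generated by a Pauli operator:
    # the exact derivative is half the difference of two shifted evaluations.
    shift = math.pi / 2
    return (f(theta + shift) - f(theta - shift)) / 2.0

theta = 0.7
grad = parameter_shift_gradient(expectation_z, theta)
analytic = -math.sin(theta)  # derivative of cos(theta), for comparison
```

Unlike a finite-difference estimate, the two evaluations are a quarter-turn apart rather than infinitesimally close, which is why the rule is practical on noisy hardware.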

[LG-27] Circuit realization and hardware linearization of monotone operator equilibrium networks

Link: https://arxiv.org/abs/2509.13793
Authors: Thomas Chaffey
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Optimization and Control (math.OC)
*Comments:

Click to view abstract

Abstract:It is shown that the port behavior of a resistor-diode network corresponds to the solution of a ReLU monotone operator equilibrium network (a neural network in the limit of infinite depth), giving a parsimonious construction of a neural network in analog hardware. We furthermore show that the gradient of such a circuit can be computed directly in hardware, using a procedure we call hardware linearization. This allows the network to be trained in hardware, which we demonstrate with a device-level circuit simulation. We extend the results to cascades of resistor-diode networks, which can be used to implement feedforward and other asymmetric networks. We finally show that different nonlinear elements give rise to different activation functions, and introduce the novel diode ReLU which is induced by a non-ideal diode model.
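
An equilibrium network outputs the fixed point `z = ReLU(W z + U x + b)` instead of the result of a finite stack of layers. The software sketch below finds that fixed point by plain iteration. It assumes `W` is a contraction (absolute row sums below 1), which is a stricter and simpler condition than the monotonicity used for monotone operator equilibrium networks, and the matrices are made-up values; it says nothing about the circuit construction itself.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(M, v):
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def equilibrium(W, U, b, x, tol=1e-12, max_iter=1000):
    """Iterate z <- ReLU(W z + U x + b) until the update is smaller than tol."""
    injection = [u + bi for u, bi in zip(matvec(U, x), b)]  # input term, fixed during iteration
    z = [0.0] * len(b)
    for _ in range(max_iter):
        z_new = relu([w + i for w, i in zip(matvec(W, z), injection)])
        if max(abs(a - c) for a, c in zip(z_new, z)) < tol:
            return z_new
        z = z_new
    return z

# Hypothetical weights; every absolute row sum of W is 0.5, so the map is a contraction.
W = [[0.2, -0.3], [0.1, 0.4]]
U = [[1.0], [-1.0]]
b = [0.5, 0.2]
z_star = equilibrium(W, U, b, x=[1.0])
```

Running the same layer until it stops changing is what "a neural network in the limit of infinite depth" means in practice.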

[LG-28] Floating-Body Hydrodynamic Neural Networks

Link: https://arxiv.org/abs/2509.13783
Authors: Tianshuo Zhang, Wenzhe Zhai, Rui Yann, Jia Gao, He Cao, Xianglei Xing
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Fluid-structure interaction is common in engineering and natural systems, where floating-body motion is governed by added mass, drag, and background flows. Modeling these dissipative dynamics is difficult: black-box neural models regress state derivatives with limited interpretability and unstable long-horizon predictions. We propose Floating-Body Hydrodynamic Neural Networks (FHNN), a physics-structured framework that predicts interpretable hydrodynamic parameters such as directional added masses, drag coefficients, and a streamfunction-based flow, and couples them with analytic equations of motion. This design constrains the hypothesis space, enhances interpretability, and stabilizes integration. On synthetic vortex datasets, FHNN achieves up to an order-of-magnitude lower error than Neural ODEs and recovers physically consistent flow fields. Compared with Hamiltonian and Lagrangian neural networks, FHNN more effectively handles dissipative dynamics while preserving interpretability, which bridges the gap between black-box learning and transparent system identification.

[LG-29] Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2509.13772
Authors: Baolei Zhang, Haoran Xin, Yuxi Chen, Zhuqing Liu, Biao Yi, Tong Li, Lihai Nie, Zheli Liu, Minghong Fang
Categories: Cryptography and Security (cs.CR); Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments: To appear in the IEEE Symposium on Security and Privacy, 2026

Click to view abstract

Abstract:Retrieval-Augmented Generation (RAG) integrates external knowledge into large language models to improve response quality. However, recent work has shown that RAG systems are highly vulnerable to poisoning attacks, where malicious texts are inserted into the knowledge database to influence model outputs. While several defenses have been proposed, they are often circumvented by more adaptive or sophisticated attacks. This paper presents RAGOrigin, a black-box responsibility attribution framework designed to identify which texts in the knowledge database are responsible for misleading or incorrect generations. Our method constructs a focused attribution scope tailored to each misgeneration event and assigns a responsibility score to each candidate text by evaluating its retrieval ranking, semantic relevance, and influence on the generated response. The system then isolates poisoned texts using an unsupervised clustering method. We evaluate RAGOrigin across seven datasets and fifteen poisoning attacks, including newly developed adaptive poisoning strategies and multi-attacker scenarios. Our approach outperforms existing baselines in identifying poisoned content and remains robust under dynamic and noisy conditions. These results suggest that RAGOrigin provides a practical and effective solution for tracing the origins of corrupted knowledge in RAG systems.

[LG-30] Beyond Correlation: Causal Multi-View Unsupervised Feature Selection Learning

Link: https://arxiv.org/abs/2509.13763
Authors: Zongxin Shen, Yanyong Huang, Bin Wang, Jinyuan Chang, Shiyu Liu, Tianrui Li
Categories: Machine Learning (cs.LG)
*Comments:

Click to view abstract

Abstract:Multi-view unsupervised feature selection (MUFS) has recently received increasing attention for its promising ability in dimensionality reduction on multi-view unlabeled data. Existing MUFS methods typically select discriminative features by capturing correlations between features and clustering labels. However, an important yet underexplored question remains: Are such correlations sufficiently reliable to guide feature selection? In this paper, we analyze MUFS from a causal perspective by introducing a novel structural causal model, which reveals that existing methods may select irrelevant features because they overlook spurious correlations caused by confounders. Building on this causal perspective, we propose a novel MUFS method called CAusal multi-view Unsupervised feature Selection leArning (CAUSA). Specifically, we first employ a generalized unsupervised spectral regression model that identifies informative features by capturing dependencies between features and consensus clustering labels. We then introduce a causal regularization module that can adaptively separate confounders from multi-view data and simultaneously learn view-shared sample weights to balance confounder distributions, thereby mitigating spurious correlations. Thereafter, integrating both into a unified learning framework enables CAUSA to select causally informative features. Comprehensive experiments demonstrate that CAUSA outperforms several state-of-the-art methods. To our knowledge, this is the first in-depth study of causal multi-view feature selection in the unsupervised setting.

[LG-31] ST-LINK: Spatially-Aware Large Language Models for Spatio-Temporal Forecasting CIKM2025

Link: https://arxiv.org/abs/2509.13753
Authors: Hyotaek Jeon, Hyunwook Lee, Juwon Kim, Sungahn Ko
Categories: Machine Learning (cs.LG)
*Comments: 11 pages, 4 figures, Accepted to CIKM 2025. Code: this https URL

Click to view abstract

Abstract:Traffic forecasting represents a crucial problem within intelligent transportation systems. In recent research, Large Language Models (LLMs) have emerged as a promising method, but their intrinsic design, tailored primarily for sequential token processing, introduces notable challenges in effectively capturing spatial dependencies. Specifically, the inherent limitations of LLMs in modeling spatial relationships and their architectural incompatibility with graph-structured spatial data remain largely unaddressed. To overcome these limitations, we introduce ST-LINK, a novel framework that enhances the capability of Large Language Models to capture spatio-temporal dependencies. Its key components are Spatially-Enhanced Attention (SE-Attention) and the Memory Retrieval Feed-Forward Network (MRFFN). SE-Attention extends rotary position embeddings to integrate spatial correlations as direct rotational transformations within the attention mechanism. This approach maximizes spatial learning while preserving the LLM’s inherent sequential processing structure. Meanwhile, MRFFN dynamically retrieves and utilizes key historical patterns to capture complex temporal dependencies and improve the stability of long-term forecasting. Comprehensive experiments on benchmark datasets demonstrate that ST-LINK surpasses conventional deep learning and LLM approaches, and effectively captures both regular traffic patterns and abrupt changes.

[LG-32] ParaAegis: Parallel Protection for Flexible Privacy-preserved Federated Learning

Link: https://arxiv.org/abs/2509.13739
Authors: Zihou Wu (1), Yuecheng Li (1), Tianchi Liao (2), Jian Lou (2), Chuan Chen (1) ((1) School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China (2) School of Software Engineering, Sun Yat-sen University, Zhuhai, China)
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: 8 pages, 1 figure

Click to view abstract

Abstract:Federated learning (FL) faces a critical dilemma: existing protection mechanisms like differential privacy (DP) and homomorphic encryption (HE) enforce a rigid trade-off, forcing a choice between model utility and computational efficiency. This lack of flexibility hinders practical implementation. To address this, we introduce ParaAegis, a parallel protection framework designed to give practitioners flexible control over the privacy-utility-efficiency balance. Our core innovation is a strategic model partitioning scheme. By applying lightweight DP to the less critical, low-norm portion of the model while protecting the remainder with HE, we create a tunable system. A distributed voting mechanism ensures consensus on this partitioning. Theoretical analysis confirms that efficiency and utility can be traded off against each other at the same level of privacy. Crucially, the experimental results demonstrate that by adjusting the hyperparameters, our method enables flexible prioritization between model accuracy and training time.

[LG-33] WatchAnxiety: A Transfer Learning Approach for State Anxiety Prediction from Smartwatch Data

Link: https://arxiv.org/abs/2509.13725
Authors: Md Sabbir Ahmed, Noah French, Mark Rucker, Zhiyuan Wang, Taylor Myers-Brower, Kaitlyn Petz, Mehdi Boukhechba, Bethany A. Teachman, Laura E. Barnes
Categories: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:

Click to view abstract

Abstract:Social anxiety is a common mental health condition linked to significant challenges in academic, social, and occupational functioning. A core feature is elevated momentary (state) anxiety in social situations, yet little prior work has measured or predicted fluctuations in this anxiety throughout the day. Capturing these intra-day dynamics is critical for designing real-time, personalized interventions such as Just-In-Time Adaptive Interventions (JITAIs). To address this gap, we conducted a study with socially anxious college students (N=91; 72 after exclusions) using our custom smartwatch-based system over an average of 9.03 days (SD = 2.95). Participants received seven ecological momentary assessments (EMAs) per day to report state anxiety. We developed a base model on over 10,000 days of external heart rate data, transferred its representations to our dataset, and fine-tuned it to generate probabilistic predictions. These were combined with trait-level measures in a meta-learner. Our pipeline achieved 60.4% balanced accuracy in state anxiety detection in our dataset. To evaluate generalizability, we applied the training approach to a separate hold-out set from the TILES-18 dataset, the same dataset used for pretraining. On 10,095 once-daily EMAs, our method achieved 59.1% balanced accuracy, outperforming prior work by at least 7%.

[LG-34] A Conformal Prediction Framework for Uncertainty Quantification in Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2509.13717
Authors: Yifan Yu,Cheuk Hin Ho,Yangshuai Wang
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)
*Comments:

Abstract:Physics-Informed Neural Networks (PINNs) have emerged as a powerful framework for solving PDEs, yet existing uncertainty quantification (UQ) approaches for PINNs generally lack rigorous statistical guarantees. In this work, we bridge this gap by introducing a distribution-free conformal prediction (CP) framework for UQ in PINNs. This framework calibrates prediction intervals by constructing nonconformity scores on a calibration set, thereby yielding distribution-free uncertainty estimates with rigorous finite-sample coverage guarantees for PINNs. To handle spatial heteroskedasticity, we further introduce local conformal quantile estimation, enabling spatially adaptive uncertainty bands while preserving the theoretical guarantees. Through systematic evaluations on typical PDEs (damped harmonic oscillator, Poisson, Allen-Cahn, and Helmholtz equations) and comprehensive testing across multiple uncertainty metrics, our results demonstrate that the proposed framework achieves reliable calibration and locally adaptive uncertainty intervals, consistently outperforming heuristic UQ approaches. By bridging PINNs with distribution-free UQ, this work introduces a general framework that not only enhances calibration and reliability, but also opens new avenues for uncertainty-aware modeling of complex PDE systems.
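
The calibration step described above can be illustrated with textbook split conformal prediction. The sketch below assumes absolute residuals as the nonconformity score and uses a toy `surrogate` function in place of a trained PINN; all names are hypothetical, and the paper's local (spatially adaptive) quantile estimation is not reproduced here.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest nonconformity score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed rank
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[rank - 1]


def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Symmetric prediction interval around predict(x_new) from calibration residuals."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


def surrogate(x):
    """Toy stand-in for a trained PINN (assumption for illustration only)."""
    return math.sin(x)


rng = random.Random(0)
calib_x = [rng.uniform(0.0, 3.0) for _ in range(200)]
calib_y = [math.sin(x) + rng.gauss(0.0, 0.05) for x in calib_x]

lower, upper = conformal_interval(surrogate, calib_x, calib_y, x_new=1.0, alpha=0.1)
print(lower, upper)
```

With exchangeable calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha, regardless of the noise distribution. A single global quantile gives the same width everywhere, which is the limitation that local quantile estimation is meant to address.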

[LG-35] RF-LSCM: Pushing Radiance Fields to Multi-Domain Localized Statistical Channel Modeling for Cellular Network Optimization

Link: https://arxiv.org/abs/2509.13686
Authors: Bingsheng Peng,Shutao Zhang,Xi Zheng,Ye Xue,Xinyu Qin,Tsung-Hui Chang
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Accurate localized wireless channel modeling is a cornerstone of cellular network optimization, enabling reliable prediction of network performance during parameter tuning. Localized statistical channel modeling (LSCM) is the state-of-the-art channel modeling framework tailored for cellular network optimization. However, traditional LSCM methods, which infer the channel’s Angular Power Spectrum (APS) from Reference Signal Received Power (RSRP) measurements, suffer from critical limitations: they are typically confined to single-cell, single-grid, and single-carrier-frequency analysis and fail to capture complex cross-domain interactions. To overcome these challenges, we propose RF-LSCM, a novel framework that models the channel APS by jointly representing large-scale signal attenuation and multipath components within a radiance field. RF-LSCM introduces a multi-domain LSCM formulation with a physics-informed Frequency-Dependent Attenuation Model (FDAM) to facilitate cross-frequency generalization, as well as a point-cloud-aided, environment-enhanced method to enable multi-cell and multi-grid channel modeling. Furthermore, to address the computational inefficiency of typical neural radiance fields, RF-LSCM leverages a low-rank tensor representation, complemented by a novel Hierarchical Tensor Angular Modeling (HiTAM) algorithm. This efficient design significantly reduces GPU memory requirements and training time while preserving fine-grained accuracy. Extensive experiments on real-world multi-cell datasets demonstrate that RF-LSCM significantly outperforms state-of-the-art methods, achieving up to a 30% reduction in mean absolute error (MAE) for coverage prediction and a 22% MAE improvement by effectively fusing multi-frequency data.

[LG-36] Efficient Last-Iterate Convergence in Regret Minimization via Adaptive Reward Transformation

Link: https://arxiv.org/abs/2509.13653
Authors: Hang Ren,Yulin Wu,Shuhan Qi,Jiajia Zhang,Xiaozhen Sun,Tianzi Ma,Xuan Wang
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG)
*Comments:

Abstract:Regret minimization is a powerful method for finding Nash equilibria in Normal-Form Games (NFGs) and Extensive-Form Games (EFGs), but it typically guarantees convergence only for the average strategy. However, computing the average strategy requires significant computational resources or introduces additional errors, limiting its practical applicability. The Reward Transformation (RT) framework was introduced to regret minimization to achieve last-iterate convergence through reward function regularization. However, it faces practical challenges: its performance is highly sensitive to manually tuned parameters, which often deviate from theoretical convergence conditions, leading to slow convergence, oscillations, or stagnation in local optima. Inspired by previous work, we propose an adaptive technique to address these issues, ensuring better consistency between theoretical guarantees and practical performance for RT Regret Matching (RTRM), RT Counterfactual Regret Minimization (RTCFR), and their variants in solving NFGs and EFGs more effectively. Our adaptive methods dynamically adjust parameters, balancing exploration and exploitation while improving regret accumulation, ultimately enhancing asymptotic last-iterate convergence and achieving linear convergence. Experimental results demonstrate that our methods significantly accelerate convergence, outperforming state-of-the-art algorithms.
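
To make the distinction between average-strategy and last-iterate convergence concrete, here is a minimal sketch of plain regret matching in self-play on rock-paper-scissors. This is the classical baseline only, not the paper's Reward Transformation or its adaptive variant; `PAYOFF`, `self_play`, and the initial regrets are illustrative choices.

```python
PAYOFF = [  # payoff to the player choosing the row action (rock, paper, scissors)
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]


def action_values(opponent_strategy):
    """Expected payoff of each pure action against a mixed opponent (symmetric game)."""
    return [sum(PAYOFF[a][b] * opponent_strategy[b] for b in range(3)) for a in range(3)]


def self_play(iterations, initial_regrets):
    """Run regret matching for both players; return (average, last-iterate) strategies."""
    regrets = [list(initial_regrets[0]), list(initial_regrets[1])]
    strategy_sums = [[0.0] * 3, [0.0] * 3]
    for _ in range(iterations):
        strategies = [strategy_from_regrets(r) for r in regrets]
        for player in range(2):
            values = action_values(strategies[1 - player])
            expected = sum(s * v for s, v in zip(strategies[player], values))
            for a in range(3):
                regrets[player][a] += values[a] - expected
                strategy_sums[player][a] += strategies[player][a]
    average = [[s / iterations for s in sums] for sums in strategy_sums]
    last = [strategy_from_regrets(r) for r in regrets]
    return average, last


# Asymmetric initial regrets so the dynamics do not start at the equilibrium.
average, last = self_play(10000, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print("average strategy:", average[0])
print("last iterate:    ", last[0])
```

The standard regret bound guarantees that the average strategy approaches the uniform equilibrium, whereas the last iterate carries no such guarantee in general. That gap is what last-iterate methods such as the RT framework aim to close.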

[LG-37] Controllable Pareto Trade-off between Fairness and Accuracy

Link: https://arxiv.org/abs/2509.13651
Authors: Yongkang Du,Jieyu Zhao,Yijun Yang,Tianyi Zhou
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:The fairness-accuracy trade-off is a key challenge in NLP tasks. Current work focuses on finding a single “optimal” solution to balance the two objectives, which is limiting given the diversity of solutions on the Pareto front. This work intends to provide controllable trade-offs according to the user’s preference between the two objectives, which is defined as a reference vector. To achieve this goal, we apply multi-objective optimization (MOO), which can find solutions from various regions of the Pareto front. However, it is challenging to precisely control the trade-off due to the stochasticity of the training process and the high-dimensional gradient vectors. Thus, we propose Controllable Pareto Trade-off (CPT), which can effectively train models to perform different trade-offs according to users’ preferences. CPT 1) stabilizes the fairness update with a moving average of stochastic gradients to determine the update direction, and 2) prunes the gradients by only keeping the gradients of the critical parameters. We evaluate CPT on hate speech detection and occupation classification tasks. Experiments show that CPT can achieve a higher-quality set of solutions on the Pareto front than the baseline methods. It also exhibits better controllability and can precisely follow the human-defined reference vectors.
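
The two stabilization steps named in the abstract can be sketched on plain Python lists. The abstract does not say how "critical parameters" are chosen, so keeping the `k` largest-magnitude entries is an assumption made here purely for illustration; `ema_update`, `prune_top_k`, and `beta` are hypothetical names.

```python
def ema_update(avg, grad, beta=0.9):
    """Exponential moving average of stochastic gradients (smooths the update direction)."""
    return [beta * a + (1 - beta) * g for a, g in zip(avg, grad)]


def prune_top_k(grad, k):
    """Zero every gradient entry except the k with the largest magnitude.

    Assumption: magnitude stands in for the paper's notion of a critical parameter.
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


# Hypothetical fairness gradients from two noisy mini-batches.
avg = [0.0, 0.0, 0.0, 0.0]
for batch_grad in ([0.2, -4.0, 1.0, 3.0], [0.0, 0.0, 0.0, 0.0]):
    avg = ema_update(avg, batch_grad, beta=0.5)

print(prune_top_k(avg, k=2))
```

Averaging damps the batch-to-batch noise that makes a preference direction hard to follow, and pruning reduces the dimensionality of the vectors being balanced.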

[LG-38] Sequential Data Augmentation for Generative Recommendation

Link: https://arxiv.org/abs/2509.13648
Authors: Geon Lee,Bhuvesh Kumar,Clark Mingxuan Ju,Tong Zhao,Kijung Shin,Neil Shah,Liam Collins
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*Comments:

Abstract:Generative recommendation plays a crucial role in personalized systems, predicting users’ future interactions from their historical behavior sequences. A critical yet underexplored factor in training these models is data augmentation, the process of constructing training data from user interaction histories. By shaping the training distribution, data augmentation directly and often substantially affects model generalization and performance. Nevertheless, in much of the existing work, this process is simplified, applied inconsistently, or treated as a minor design choice, without a systematic and principled understanding of its effects. Motivated by our empirical finding that different augmentation strategies can yield large performance disparities, we conduct an in-depth analysis of how they reshape training distributions and influence alignment with future targets and generalization to unseen inputs. To systematize this design space, we propose GenPAS, a generalized and principled framework that models augmentation as a stochastic sampling process over input-target pairs with three bias-controlled steps: sequence sampling, target sampling, and input sampling. This formulation unifies widely used strategies as special cases and enables flexible control of the resulting training distribution. Our extensive experiments on benchmark and industrial datasets demonstrate that GenPAS yields superior accuracy, data efficiency, and parameter efficiency compared to existing strategies, providing practical guidance for principled training data construction in generative recommendation.

[LG-39] Multimodal signal fusion for stress detection using deep neural networks: a novel approach for converting 1D signals to unified 2D images

Link: https://arxiv.org/abs/2509.13636
Authors: Yasin Hasanpoor,Bahram Tarvirdizadeh,Khalil Alipour,Mohammad Ghamari
Categories: Machine Learning (cs.LG)
*Comments: 14 pages, 7 images, 2 tables

Abstract:This study introduces a novel method that transforms multimodal physiological signals, namely photoplethysmography (PPG), galvanic skin response (GSR), and acceleration (ACC), into 2D image matrices to enhance stress detection using convolutional neural networks (CNNs). Unlike traditional approaches that process these signals separately or rely on fixed encodings, our technique fuses them into structured image representations that enable CNNs to capture temporal and cross-signal dependencies more effectively. This image-based transformation not only improves interpretability but also serves as a robust form of data augmentation. To further enhance generalization and model robustness, we systematically reorganize the fused signals into multiple formats, combining them in a multi-stage training pipeline. This approach significantly boosts classification performance. While demonstrated here in the context of stress detection, the proposed method is broadly applicable to any domain involving multimodal physiological signals, paving the way for more accurate, personalized, and real-time health monitoring through wearable technologies.
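
One simple way to turn several 1D signals into a single 2D matrix is to rescale each signal and fold it into fixed-width rows, then stack the rows. The sketch below shows only that generic idea; the paper's actual encoding is not given in the abstract, and `min_max_scale`, `fuse_to_image`, `width`, and the sample values are hypothetical.

```python
def min_max_scale(signal):
    """Rescale a signal to [0, 1]; a constant signal maps to all zeros."""
    lo, hi = min(signal), max(signal)
    if hi == lo:
        return [0.0] * len(signal)
    return [(v - lo) / (hi - lo) for v in signal]


def fuse_to_image(signals, width):
    """Fold each scaled 1D signal into rows of `width` samples and stack all rows.

    Trailing samples that do not fill a complete row are dropped.
    """
    image = []
    for signal in signals:
        scaled = min_max_scale(signal)
        usable = len(scaled) - len(scaled) % width
        for start in range(0, usable, width):
            image.append(scaled[start:start + width])
    return image


# Hypothetical short windows of PPG, GSR, and accelerometer magnitude.
ppg = [0.0, 1.0, 2.0, 3.0]
gsr = [10.0, 10.0, 10.0, 10.0]
acc = [4.0, 2.0, 0.0, 2.0, 9.0]

image = fuse_to_image([ppg, gsr, acc], width=2)
print(len(image), "rows x", len(image[0]), "columns")
```

Changing the row order or the fold width yields different images from the same window, which is one plausible reading of how a reorganization of fused signals can double as data augmentation.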

[LG-40] Secure UAV-assisted Federated Learning: A Digital Twin-Driven Approach with Zero-Knowledge Proofs

Link: https://arxiv.org/abs/2509.13634
Authors: Md Bokhtiar Al Zami,Md Raihan Uddin,Dinh C. Nguyen
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
*Comments: 15 pages, under revision at IEEE Internet of Things Journal

Abstract:Federated learning (FL) has gained popularity as a privacy-preserving method of training machine learning models on decentralized networks. However, to ensure reliable operation of UAV-assisted FL systems, issues such as excessive energy consumption, communication inefficiencies, and security vulnerabilities must be solved. This paper proposes an innovative framework that integrates Digital Twin (DT) technology and Zero-Knowledge Federated Learning (zkFed) to tackle these challenges. UAVs act as mobile base stations, allowing scattered devices to train FL models locally and upload model updates for aggregation. By incorporating DT technology, our approach enables real-time system monitoring and predictive maintenance, improving UAV network efficiency. Additionally, Zero-Knowledge Proofs (ZKPs) strengthen security by allowing model verification without exposing sensitive data. To optimize energy efficiency and resource management, we introduce a dynamic allocation strategy that adjusts UAV flight paths, transmission power, and processing rates based on network conditions. Using block coordinate descent and convex optimization techniques, our method reduces system energy consumption by up to 29.6% compared to conventional FL approaches. Simulation results demonstrate improved learning performance, security, and scalability, positioning this framework as a promising solution for next-generation UAV-based intelligent networks.

[LG-41] Unsupervised Anomaly Detection in ALS EPICS Event Logs

Link: https://arxiv.org/abs/2509.13621
Authors: Antonin Sulc,Thorsten Hellert,Steven Hunt
Categories: Machine Learning (cs.LG)
*Comments: 6 pages, 5 figures, The 20th International Conference on Accelerator and Large Experimental Physics Control Systems

Abstract:This paper introduces an automated fault analysis framework for the Advanced Light Source (ALS) that processes real-time event logs from its EPICS control system. By treating log entries as natural language, we transform them into contextual vector representations using semantic embedding techniques. A sequence-aware neural network, trained on normal operational data, assigns a real-time anomaly score to each event. This method flags deviations from baseline behavior, enabling operators to rapidly identify the critical event sequences that precede complex system failures.

[LG-42] Is GPT-4o mini Blinded by its Own Safety Filters? Exposing the Multimodal-to-Unimodal Bottleneck in Hate Speech Detection

Link: https://arxiv.org/abs/2509.13608
Authors: Niruthiha Selvanayagam,Ted Kurti
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:As Large Multimodal Models (LMMs) become integral to daily digital life, understanding their safety architectures is a critical problem for AI Alignment. This paper presents a systematic analysis of OpenAI’s GPT-4o mini, a globally deployed model, on the difficult task of multimodal hate speech detection. Using the Hateful Memes Challenge dataset, we conduct a multi-phase investigation on 500 samples to probe the model’s reasoning and failure modes. Our central finding is the experimental identification of a “Unimodal Bottleneck,” an architectural flaw where the model’s advanced multimodal reasoning is systematically preempted by context-blind safety filters. A quantitative validation of 144 content policy refusals reveals that these overrides are triggered in equal measure by unimodal visual (50%) and textual (50%) content. We further demonstrate that this safety system is brittle, blocking not only high-risk imagery but also benign, common meme formats, leading to predictable false positives. These findings expose a fundamental tension between capability and safety in state-of-the-art LMMs, highlighting the need for more integrated, context-aware alignment strategies to ensure AI systems can be deployed both safely and effectively.

[LG-43] Meta-Learning Linear Models for Molecular Property Prediction

Link: https://arxiv.org/abs/2509.13527
Authors: Yulia Pimonova,Michael G. Taylor,Alice Allen,Ping Yang,Nicholas Lubbers
Categories: Machine Learning (cs.LG); Chemical Physics (physics.chem-ph)
*Comments: 26 pages, 16 figures

Abstract:Chemists in search of structure-property relationships face great challenges due to the scarcity of high-quality, concordant datasets. Machine learning (ML) has significantly advanced predictive capabilities in chemical sciences, but these modern data-driven approaches have increased the demand for data. In response to the growing demand for explainable AI (XAI) and to bridge the gap between predictive accuracy and human comprehensibility, we introduce LAMeL, a Linear Algorithm for Meta-Learning that preserves interpretability while improving the prediction accuracy across multiple properties. While most approaches treat each chemical prediction task in isolation, LAMeL leverages a meta-learning framework to identify shared model parameters across related tasks, even if those tasks do not share data, allowing it to learn a common functional manifold that serves as a more informed starting point for new unseen tasks. Our method delivers performance improvements ranging from 1.1- to 25-fold over standard ridge regression, depending on the domain of the dataset. While the degree of performance enhancement varies across tasks, LAMeL consistently outperforms or matches traditional linear methods, making it a reliable tool for chemical property prediction where both accuracy and interpretability are critical.

[LG-44] AERIS: Argonne Earth Systems Model for Reliable and Skillful Predictions

Link: https://arxiv.org/abs/2509.13523
Authors: Väinö Hatanpää,Eugene Ku,Jason Stock,Murali Emani,Sam Foreman,Chunyong Jung,Sandeep Madireddy,Tung Nguyen,Varuni Sastry,Ray A. O. Sinurat,Sam Wheeler,Huihuo Zheng,Troy Arcomano,Venkatram Vishwanath,Rao Kotamarthi
Categories: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
*Comments: 14 pages, 7 figures

Abstract:Generative machine learning offers new opportunities to better understand complex Earth system dynamics. Recent diffusion-based methods address spectral biases and improve ensemble calibration in weather forecasting compared to deterministic methods, yet have so far proven difficult to scale stably at high resolutions. We introduce AERIS, a 1.3 to 80B parameter pixel-level Swin diffusion transformer to address this gap, and SWiPe, a generalizable technique that composes window parallelism with sequence and pipeline parallelism to shard window-based transformers without added communication cost or increased global batch size. On Aurora (10,080 nodes), AERIS sustains 10.21 ExaFLOPS (mixed precision) and a peak performance of 11.21 ExaFLOPS with $1 \times 1$ patch size on the 0.25° ERA5 dataset, achieving 95.5% weak scaling efficiency and 81.6% strong scaling efficiency. AERIS outperforms the IFS ENS and remains stable on seasonal scales to 90 days, highlighting the potential of billion-parameter diffusion models for weather and climate prediction.

[LG-45] Learning Nonlinear Responses in PET Bottle Buckling with a Hybrid DeepONet-Transolver Framework

Link: https://arxiv.org/abs/2509.13520
Authors: Varun Kumar,Jing Bi,Cyril Ngo Ngoc,Victor Oancea,George Em Karniadakis
Categories: Machine Learning (cs.LG)
*Comments:

Abstract:Neural surrogates and operator networks for solving partial differential equation (PDE) problems have attracted significant research interest in recent years. However, most existing approaches are limited in their ability to generalize solutions across varying non-parametric geometric domains. In this work, we address this challenge in the context of Polyethylene Terephthalate (PET) bottle buckling analysis, a representative packaging design problem conventionally solved using computationally expensive finite element analysis (FEA). We introduce a hybrid DeepONet-Transolver framework that simultaneously predicts nodal displacement fields and the time evolution of reaction forces during top load compression. Our methodology is evaluated on two families of bottle geometries parameterized by two and four design variables. Training data is generated using nonlinear FEA simulations in Abaqus for 254 unique designs per family. The proposed framework achieves mean relative $L^2$ errors of 2.5-13% for displacement fields and approximately 2.4% for time-dependent reaction forces for the four-parameter bottle family. Point-wise error analyses further show absolute displacement errors on the order of $10^{-4}$ to $10^{-3}$, with the largest discrepancies confined to localized geometric regions. Importantly, the model accurately captures key physical phenomena, such as buckling behavior, across diverse bottle geometries. These results highlight the potential of our framework as a scalable and computationally efficient surrogate, particularly for multi-task predictions in computational mechanics and applications requiring rapid design evaluation.

[LG-46] An Analysis of Optimizer Choice on Energy Efficiency and Performance in Neural Network Training

Link: https://arxiv.org/abs/2509.13516
Authors: Tom Almog
Categories: Machine Learning (cs.LG)
*Comments: 7 pages, 3 figures

Abstract:As machine learning models grow increasingly complex and computationally demanding, understanding the environmental impact of training decisions becomes critical for sustainable AI development. This paper presents a comprehensive empirical study investigating the relationship between optimizer choice and energy efficiency in neural network training. We conducted 360 controlled experiments across three benchmark datasets (MNIST, CIFAR-10, CIFAR-100) using eight popular optimizers (SGD, Adam, AdamW, RMSprop, Adagrad, Adadelta, Adamax, NAdam) with 15 random seeds each. Using CodeCarbon for precise energy tracking on Apple M1 Pro hardware, we measured training duration, peak memory usage, carbon dioxide emissions, and final model performance. Our findings reveal substantial trade-offs between training speed, accuracy, and environmental impact that vary across datasets and model complexity. We identify AdamW and NAdam as consistently efficient choices, while SGD demonstrates superior performance on complex datasets despite higher emissions. These results provide actionable insights for practitioners seeking to balance performance and sustainability in machine learning workflows.
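
The figure of 360 experiments follows directly from the stated design: three datasets, eight optimizers, and 15 seeds each. The short sketch below enumerates that grid; the variable names are illustrative, and the snippet only counts configurations rather than reproducing the study's training or energy measurements.

```python
from itertools import product

DATASETS = ["MNIST", "CIFAR-10", "CIFAR-100"]
OPTIMIZERS = ["SGD", "Adam", "AdamW", "RMSprop", "Adagrad", "Adadelta", "Adamax", "NAdam"]
SEEDS = range(15)

# One run per (dataset, optimizer, seed) combination.
runs = list(product(DATASETS, OPTIMIZERS, SEEDS))
print(len(runs))  # 3 * 8 * 15 = 360
```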

[LG-47] Unified Spatiotemporal Physics-Informed Learning (USPIL): A Framework for Modeling Complex Predator-Prey Dynamics

Link: https://arxiv.org/abs/2509.13425
Authors: Julian Evan Chrisnanto,Yulison Herry Chrisnanto,Ferry Faizal
Categories: Machine Learning (cs.LG); Applied Physics (physics.app-ph)
*Comments: 20 pages, 11 figures. A preprint on using a unified physics-informed neural network framework to model predator-prey dynamics

Abstract:Ecological systems exhibit complex multi-scale dynamics that challenge traditional modeling. New methods must capture temporal oscillations and emergent spatiotemporal patterns while adhering to conservation principles. We present the Unified Spatiotemporal Physics-Informed Learning (USPIL) framework, a deep learning architecture integrating physics-informed neural networks (PINNs) and conservation laws to model predator-prey dynamics across dimensional scales. The framework provides a unified solution for both ordinary (ODE) and partial (PDE) differential equation systems, describing temporal cycles and reaction-diffusion patterns within a single neural network architecture. Our methodology uses automatic differentiation to enforce physics constraints and adaptive loss weighting to balance data fidelity with physical consistency. Applied to the Lotka-Volterra system, USPIL achieves 98.9% correlation for 1D temporal dynamics (loss: 0.0219, MAE: 0.0184) and captures complex spiral waves in 2D systems (loss: 4.7656, pattern correlation: 0.94). Validation confirms conservation law adherence within 0.5% and shows a 10-50x computational speedup for inference compared to numerical solvers. USPIL also enables mechanistic understanding through interpretable physics constraints, facilitating parameter discovery and sensitivity analysis not possible with purely data-driven methods. Its ability to transition between dimensional formulations opens new avenues for multi-scale ecological modeling. These capabilities make USPIL a transformative tool for ecological forecasting, conservation planning, and understanding ecosystem resilience, establishing physics-informed deep learning as a powerful and scientifically rigorous paradigm.
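
For reference, the classical Lotka-Volterra system and its conserved quantity are written out below in the usual textbook parametrization. The symbols $\alpha, \beta, \gamma, \delta$ (and the prey and predator densities $x, y$) are the standard ones and are an assumption here; the abstract does not state the paper's exact formulation, its reaction-diffusion extension, or which conservation law it checks.

```latex
\begin{align}
  \frac{dx}{dt} &= \alpha x - \beta x y, &
  \frac{dy}{dt} &= \delta x y - \gamma y, \\
  V(x, y) &= \delta x - \gamma \ln x + \beta y - \alpha \ln y, &
  \frac{dV}{dt} &= 0 .
\end{align}
```

Substituting the two rate equations into $dV/dt = \delta \dot{x} - \gamma \dot{x}/x + \beta \dot{y} - \alpha \dot{y}/y$ cancels every term. In the ODE setting, a quantity of this kind is a natural candidate for measuring how far a learned trajectory drifts from the conservation law.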

[LG-48] VEGA: Electric Vehicle Navigation Agent via Physics-Informed Neural Operator and Proximal Policy Optimization ICRA

Link: https://arxiv.org/abs/2509.13386
Authors: Hansol Lim,Minhyeok Im,Jonathan Boyack,Jee Won Lee,Jongseong Brad Choi
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: This work has been submitted to the 2026 IEEE International Conference on Robotics and Automation (ICRA) for possible publication

Abstract:Demand for software-defined vehicles (SDVs) is rising, and electric vehicles (EVs) are increasingly being equipped with powerful computers. This enables onboard AI systems to perform charge-aware path optimization customized to the vehicle’s current condition and environment. We present VEGA, a charge-aware EV navigation agent that plans over a charger-annotated road graph using Proximal Policy Optimization (PPO) with budgeted A* teacher-student guidance under state-of-charge (SoC) feasibility. VEGA consists of two modules. First, a physics-informed neural operator (PINO), trained on real vehicle speed and battery-power logs, uses recent vehicle speed logs to estimate aerodynamic drag, rolling resistance, mass, motor and regenerative-braking efficiencies, and auxiliary load by learning vehicle-specific dynamics. Second, a Reinforcement Learning (RL) agent uses these dynamics to optimize a path with optimal charging stops and dwell times under SoC constraints. VEGA requires no additional sensors and uses only vehicle speed signals. It may serve as a virtual sensor for power and efficiency to potentially reduce EV cost. In evaluation on long routes such as San Francisco to New York, VEGA’s stops, dwell times, SoC management, and total travel time closely track Tesla Trip Planner while being slightly more conservative, presumably due to real vehicle conditions such as vehicle parameter drift due to deterioration. Although trained only in U.S. regions, VEGA was able to compute optimal charge-aware paths in France and Japan, demonstrating generalizability. It achieves practical integration of physics-informed learning and RL for EV eco-routing.

[LG-49] Cooperative Target Detection with AUVs: A Dual-Timescale Hierarchical MARDL Approach

Link: https://arxiv.org/abs/2509.13381
Authors: Zhang Xueyao,Yang Bo,Yu Zhiwen,Cao Xuelin,George C. Alexandropoulos,Merouane Debbah,Chau Yuen
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
*Comments: 6 pages

Abstract:Autonomous Underwater Vehicles (AUVs) have shown great potential for cooperative detection and reconnaissance. However, collaborative AUV communications introduce risks of exposure. In adversarial environments, achieving efficient collaboration while ensuring covert operations becomes a key challenge for underwater cooperative missions. In this paper, we propose a novel dual time-scale Hierarchical Multi-Agent Proximal Policy Optimization (H-MAPPO) framework. The high-level component determines the individuals participating in the task based on a central AUV, while the low-level component reduces exposure probabilities through power and trajectory control by the participating AUVs. Simulation results show that the proposed framework achieves rapid convergence, outperforms benchmark algorithms in terms of performance, and maximizes long-term cooperative efficiency while ensuring covert operations.

[LG-50] A novel approach of day-ahead cooling load prediction and optimal control for ice-based thermal energy storage (TES) system in commercial buildings

Link: https://arxiv.org/abs/2509.13371
Authors: Xuyuan Kang,Xiao Wang,Jingjing An,Da Yan
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
*Comments: 16 pages, 14 figures, published in Energy and Buildings

Abstract:Thermal energy storage (TES) is an effective method for load shifting and demand response in buildings. Optimal TES control and management are essential to improve the performance of the cooling system. Most existing TES systems operate on a fixed schedule, which cannot take full advantage of their load-shifting capability and calls for further investigation and optimization. This study proposed a novel integrated load prediction and optimized control approach for ice-based TES in commercial buildings. A cooling load prediction model was developed and a mid-day modification mechanism was introduced into the prediction model to improve the accuracy. Based on the predictions, a rule-based control strategy was proposed according to the time-of-use tariff; the mid-day control adjustment mechanism was introduced in accordance with the mid-day prediction modifications. The proposed approach was applied in the ice-based TES system of a commercial complex in Beijing, and achieved a mean absolute error (MAE) of 389 kW and coefficient of variance of MAE of 12.5%. The integrated prediction-based control strategy achieved an energy cost saving rate of 9.9%. The proposed model was deployed in the realistic building automation system of the case building and significantly improved the efficiency and automation of the cooling system.

[LG-51] Maximizing UAV Cellular Connectivity with Reinforcement Learning for BVLoS Path Planning

Link: https://arxiv.org/abs/2509.13336
Authors: Mehran Behjati,Rosdiadee Nordin,Nor Fadzilah Abdullah
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
*Comments: Submitted to an IEEE Conference

Abstract:This paper presents a reinforcement learning (RL) based approach for path planning of cellular connected unmanned aerial vehicles (UAVs) operating beyond visual line of sight (BVLoS). The objective is to minimize travel distance while maximizing the quality of cellular link connectivity by considering real world aerial coverage constraints and employing an empirical aerial channel model. The proposed solution employs RL techniques to train an agent, using the quality of communication links between the UAV and base stations (BSs) as the reward function. Simulation results demonstrate the effectiveness of the proposed method in training the agent and generating feasible UAV path plans. The proposed approach addresses the challenges due to limitations in UAV cellular communications, highlighting the need for investigations and considerations in this area. The RL algorithm efficiently identifies optimal paths, ensuring maximum connectivity with ground BSs to ensure safe and reliable BVLoS flight operation. Moreover, the solution can be deployed as an offline path planning module that can be integrated into future ground control systems (GCS) for UAV operations, enhancing their capabilities and safety. The method holds potential for complex long range UAV applications, advancing the technology in the field of cellular connected UAV path planning.

[LG-52] LLM Chatbot-Creation Approaches

Link: https://arxiv.org/abs/2509.13326
Authors: Hemil Mehta,Tanvi Raut,Kohav Yadav,Edward F. Gehringer
Categories: Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*Comments: Forthcoming in Frontiers in Education (FIE 2025), Nashville, Tennessee, USA, Nov 2-5, 2025

Abstract:This full research-to-practice paper explores approaches for developing course chatbots by comparing low-code platforms and custom-coded solutions in educational contexts. With the rise of Large Language Models (LLMs) like GPT-4 and LLaMA, LLM-based chatbots are being integrated into teaching workflows to automate tasks, provide assistance, and offer scalable support. However, selecting the optimal development strategy requires balancing ease of use, customization, data privacy, and scalability. This study compares two development approaches: low-code platforms such as AnythingLLM and Botpress, and custom-coded solutions using LangChain, FAISS, and FastAPI. The research uses prompt engineering, retrieval-augmented generation (RAG), and personalization to evaluate chatbot prototypes across technical performance, scalability, and user experience. Findings indicate that while low-code platforms enable rapid prototyping, they face limitations in customization and scaling, whereas custom-coded systems offer more control but require significant technical expertise. Both approaches successfully implement key research principles such as adaptive feedback loops and conversational continuity. The study provides a framework for selecting the appropriate development strategy based on institutional goals and resources. Future work will focus on hybrid solutions that combine low-code accessibility with modular customization and incorporate multimodal input for intelligent tutoring systems.

[LG-53] Spacing Test for Fused Lasso

Link: https://arxiv.org/abs/2509.14229
Authors: Rieko Tasaka,Tatsuya Kimura,Joe Suzuki
Categories: Statistics Theory (math.ST); Machine Learning (cs.LG)
*Comments:

Abstract:This study addresses the unresolved problem of selecting the regularization parameter in the fused lasso. In particular, we extend the framework of the Spacing Test proposed by Tibshirani et al. to the fused lasso, providing a theoretical foundation for post-selection inference by characterizing the selection event as a polyhedral constraint. Based on the analysis of the solution path of the fused lasso using a LARS-type algorithm, we derive exact conditional p-values for the selected change-points. Our method broadens the applicability of the Spacing Test from the standard lasso to fused penalty structures. Furthermore, through numerical experiments comparing the proposed method with sequential versions of AIC and BIC as well as cross-validation, we demonstrate that the proposed approach properly controls the type I error while achieving high detection power. This work offers a theoretically sound and computationally practical solution for parameter selection and post-selection inference in structured signal estimation problems. Keywords: Fused Lasso, Regularization parameter selection, Spacing Test for Lasso, Selective inference, Change-point detection

[LG-54] Bellman Optimality of Average-Reward Robust Markov Decision Processes with a Constant Gain

Link: https://arxiv.org/abs/2509.14203
Authors: Shengbo Wang,Nian Si
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:

Abstract:Learning and optimal control under robust Markov decision processes (MDPs) have received increasing attention, yet most existing theory, algorithms, and applications focus on finite-horizon or discounted models. The average-reward formulation, while natural in many operations research and management contexts, remains underexplored. This is primarily because the dynamic programming foundations are technically challenging and only partially understood, with several fundamental questions remaining open. This paper steps toward a general framework for average-reward robust MDPs by analyzing the constant-gain setting. We study the average-reward robust control problem with possible information asymmetries between the controller and an S-rectangular adversary. Our analysis centers on the constant-gain robust Bellman equation, examining both the existence of solutions and their relationship to the optimal average reward. Specifically, we identify when solutions to the robust Bellman equation characterize the optimal average reward and stationary policies, and we provide sufficient conditions ensuring solutions’ existence. These findings expand the dynamic programming theory for average-reward robust MDPs and lay a foundation for robust dynamic decision making under long-run average criteria in operational environments.

[LG-55] Quantum Reinforcement Learning-Guided Diffusion Model for Image Synthesis via Hybrid Quantum-Classical Generative Model Architectures

Link: https://arxiv.org/abs/2509.14163
Authors: Chi-Sheng Chen,En-Jui Kuo
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments:

Abstract:Diffusion models typically employ static or heuristic classifier-free guidance (CFG) schedules, which often fail to adapt across timesteps and noise conditions. In this work, we introduce a quantum reinforcement learning (QRL) controller that dynamically adjusts CFG at each denoising step. The controller adopts a hybrid quantum–classical actor–critic architecture: a shallow variational quantum circuit (VQC) with ring entanglement generates policy features, which are mapped by a compact multilayer perceptron (MLP) into Gaussian actions over $\Delta$CFG, while a classical critic estimates value functions. The policy is optimized using Proximal Policy Optimization (PPO) with Generalized Advantage Estimation (GAE), guided by a reward that balances classification confidence, perceptual improvement, and action regularization. Experiments on CIFAR-10 demonstrate that our QRL policy improves perceptual quality (LPIPS, PSNR, SSIM) while reducing parameter count compared to classical RL actors and fixed schedules. Ablation studies on qubit number and circuit depth reveal trade-offs between accuracy and efficiency, and extended evaluations confirm robust generation under long diffusion schedules.
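
As a rough illustration of what adjusting CFG at each denoising step means, the sketch below applies the standard classifier-free guidance rule and lets a stand-in policy nudge the guidance scale by a bounded delta per step. The function names, the clipping range, and the constant toy policy are assumptions made for illustration; this is not the paper's QRL controller, which the abstract describes only at a high level.

```python
from typing import Callable, List


def cfg_combine(eps_uncond: List[float], eps_cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: eps_u + scale * (eps_c - eps_u)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def run_schedule(policy: Callable[[int, float], float], steps: int,
                 init_scale: float = 5.0, lo: float = 1.0, hi: float = 10.0) -> List[float]:
    """Ask `policy` for a delta at every step and keep the scale inside [lo, hi]."""
    scale = init_scale
    scales = []
    for t in range(steps):
        delta = policy(t, scale)                 # hypothetical learned action
        scale = min(hi, max(lo, scale + delta))  # clip to an assumed valid range
        scales.append(scale)
    return scales


def toy_policy(t: int, scale: float) -> float:
    """Placeholder for a learned controller: always lowers the scale by 0.5."""
    return -0.5


scales = run_schedule(toy_policy, steps=4)
guided = cfg_combine([0.0, 1.0], [1.0, 3.0], scale=2.0)
print(scales, guided)
```

In a real sampler, `cfg_combine` would be called inside the denoising loop with the scale chosen for that step; a fixed schedule corresponds to a policy that always returns zero.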

[LG-56] On the Rate of Gaussian Approximation for Linear Regression Problems

Link: https://arxiv.org/abs/2509.14039
Authors: Marat Khusainov,Marina Sheshukova,Alain Durmus,Sergey Samsonov
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
*Comments:

Abstract:In this paper, we consider the problem of Gaussian approximation for the online linear regression task. We derive the corresponding rates for the setting of a constant learning rate and study the explicit dependence of the convergence rate upon the problem dimension d and quantities related to the design matrix. When the number of iterations n is known in advance, our results yield the rate of normal approximation of order $\sqrt{\log n / n}$, provided that the sample size n is large enough.

[LG-57] Quantum Variational Activation Functions Empower Kolmogorov-Arnold Networks

Link: https://arxiv.org/abs/2509.14026
Authors: Jiun-Cheng Jiang,Morris Yu-Chao Huang,Tianlong Chen,Hsi-Sheng Goan
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 45 pages

Abstract:Variational quantum circuits (VQCs) are central to quantum machine learning, while recent progress in Kolmogorov-Arnold networks (KANs) highlights the power of learnable activation functions. We unify these directions by introducing quantum variational activation functions (QVAFs), realized through single-qubit data re-uploading circuits called DatA Re-Uploading ActivatioNs (DARUANs). We show that DARUAN with trainable weights in data pre-processing possesses an exponentially growing frequency spectrum with data repetitions, enabling an exponential reduction in parameter size compared with Fourier-based activations without loss of expressivity. Embedding DARUAN into KANs yields quantum-inspired KANs (QKANs), which retain the interpretability of KANs while improving their parameter efficiency, expressivity, and generalization. We further introduce two novel techniques to enhance scalability, feasibility and computational efficiency, such as layer extension and hybrid QKANs (HQKANs) as drop-in replacements of multi-layer perceptrons (MLPs) for feed-forward networks in large-scale models. We provide theoretical analysis and extensive experiments on function regression, image classification, and autoregressive generative language modeling, demonstrating the efficiency and scalability of QKANs. DARUANs and QKANs offer a promising direction for advancing quantum machine learning on both noisy intermediate-scale quantum (NISQ) hardware and classical quantum simulators.

[LG-58] Artificial neural networks ensemble methodology to predict significant wave height

Link: https://arxiv.org/abs/2509.14020
Authors: Felipe Crivellaro Minuzzi,Leandro Farina
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*Comments:

Abstract:Forecasts of wave variables are important for several applications that depend on a better description of the ocean state. Due to the chaotic behaviour of the differential equations that model this problem, a well-known strategy to overcome the difficulties is to run several simulations, for instance by varying the initial condition, and to average their results, creating an ensemble. Moreover, in the last few years, given the amount of available data and the increase in computational power, machine learning algorithms have been applied as surrogates for traditional numerical models, yielding comparable or better results. In this work, we present a methodology to create an ensemble of different artificial neural network architectures, namely MLP, RNN, LSTM, CNN and a hybrid CNN-LSTM, which aims to predict significant wave height at six different locations on the Brazilian coast. The networks are trained using NOAA’s numerical reforecast data and target the residual between observational data and the numerical model output. A new strategy to create the training and target datasets is demonstrated. Results show that our framework is capable of producing highly efficient forecasts, with an average accuracy of 80% that can reach up to 88% in the best-case scenario, which means a 5% reduction in error metrics compared to NOAA’s numerical model, along with an increasing reduction in computational cost.

[LG-59] Improving cosmological reach of a gravitational wave observatory using Deep Loop Shaping

Link: https://arxiv.org/abs/2509.14016
Authors: Jonas Buchli,Brendan Tracey,Tomislav Andric,Christopher Wipf,Yu Him Justin Chiu,Matthias Lochbrunner,Craig Donner,Rana X. Adhikari,Jan Harms,Iain Barr,Roland Hafner,Andrea Huber,Abbas Abdolmaleki,Charlie Beattie,Joseph Betzwieser,Serkan Cabi,Jonas Degrave,Yuzhu Dong,Leslie Fritz,Anchal Gupta,Oliver Groth,Sandy Huang,Tamara Norman,Hannah Openshaw,Jameson Rollins,Greg Thornton,George Van Den Driessche,Markus Wulfmeier,Pushmeet Kohli,Martin Riedmiller,LIGO Instrument Team
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG); Systems and Control (eess.SY); General Relativity and Quantum Cosmology (gr-qc)
*Comments:

Abstract:Improved low-frequency sensitivity of gravitational wave observatories would unlock study of intermediate-mass black hole mergers, binary black hole eccentricity, and provide early warnings for multi-messenger observations of binary neutron star mergers. Today’s mirror stabilization control injects harmful noise, constituting a major obstacle to sensitivity improvements. We eliminated this noise through Deep Loop Shaping, a reinforcement learning method using frequency domain rewards. We proved our methodology on the LIGO Livingston Observatory (LLO). Our controller reduced control noise in the 10–30 Hz band by over 30x, and up to 100x in sub-bands, surpassing the design goal motivated by the quantum limit. These results highlight the potential of Deep Loop Shaping to improve current and future GW observatories, and more broadly instrumentation and control systems.

[LG-60] Classification Filtering

Link: https://arxiv.org/abs/2509.13975
Authors: Ilker Bayram
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments:

Abstract:We consider a streaming signal in which each sample is linked to a latent class. We assume that multiple classifiers are available, each providing class probabilities with varying degrees of accuracy. These classifiers are employed following a straightforward and fixed policy. In this setting, we consider the problem of fusing the output of the classifiers while incorporating the temporal aspect to improve classification accuracy. We propose a state-space model and develop a filter tailored for real-time execution. We demonstrate the effectiveness of the proposed filter in an activity classification application based on inertial measurement unit (IMU) data from a wearable device.
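
To make the idea of fusing classifier outputs over time concrete, here is a minimal sketch of a discrete Bayes (HMM forward) filter. It is a generic stand-in rather than the filter proposed in the paper: the "sticky" transition matrix, the `stay` probability, and the choice to treat each classifier's class probabilities as likelihoods (which assumes a uniform class prior) are all assumptions made for illustration.

```python
from typing import List


def make_sticky_transition(n_classes: int, stay: float) -> List[List[float]]:
    """Transition matrix that keeps the current class with probability `stay`.

    Requires n_classes >= 2; the remaining mass is spread evenly over other classes.
    """
    off = (1.0 - stay) / (n_classes - 1)
    return [[stay if i == j else off for j in range(n_classes)]
            for i in range(n_classes)]


def filter_step(belief: List[float], transition: List[List[float]],
                class_probs: List[float]) -> List[float]:
    """One predict/update cycle of the forward filter."""
    n = len(belief)
    # Predict: propagate the previous belief through the class dynamics.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Update: reweight by the classifier output for the new sample.
    unnorm = [predicted[j] * class_probs[j] for j in range(n)]
    total = sum(unnorm)
    if total == 0.0:
        return predicted  # uninformative observation: keep the prediction
    return [u / total for u in unnorm]


def run_filter(stream: List[List[float]], n_classes: int, stay: float = 0.9) -> List[List[float]]:
    """Filter a stream of per-sample class-probability vectors."""
    transition = make_sticky_transition(n_classes, stay)
    belief = [1.0 / n_classes] * n_classes
    history = []
    for class_probs in stream:
        belief = filter_step(belief, transition, class_probs)
        history.append(belief)
    return history


# Toy stream: the third sample is a weak vote for class 1 amid votes for class 0.
stream = [[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.7, 0.3]]
history = run_filter(stream, n_classes=2)
print([round(b[0], 3) for b in history])
```

In this toy stream the raw classifier output flips to class 1 at the third sample, while the filtered belief still favors class 0, which is the kind of temporal smoothing a state-space formulation provides. Each step costs O(K^2) for K classes, so it is cheap enough to run per sample.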

[LG-61] Mixture of Low-Rank Adapter Experts in Generalizable Audio Deepfake Detection

Link: https://arxiv.org/abs/2509.13878
Authors: Janne Laakkonen,Ivan Kukanov,Ville Hautamäki
Categories: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
*Comments: 6 pages, 3 figures, 1 table

Abstract:Foundation models such as Wav2Vec2 excel at representation learning in speech tasks, including audio deepfake detection. However, after being fine-tuned on a fixed set of bonafide and spoofed audio clips, they often fail to generalize to novel deepfake methods not represented in training. To address this, we propose a mixture-of-LoRA-experts approach that integrates multiple low-rank adapters (LoRA) into the model’s attention layers. A routing mechanism selectively activates specialized experts, enhancing adaptability to evolving deepfake attacks. Experimental results show that our method outperforms standard fine-tuning in both in-domain and out-of-domain scenarios, reducing equal error rates relative to baseline models. Notably, our best MoE-LoRA model lowers the average out-of-domain EER from 8.55% to 6.08%, demonstrating its effectiveness in achieving generalizable audio deepfake detection.

[LG-62] Learning Minimal Representations of Many-Body Physics from Snapshots of a Quantum Simulator

Link: https://arxiv.org/abs/2509.13821
Authors: Frederik Møller,Gabriel Fernández-Fernández,Thomas Schweigler,Paulin de Schoulepnikoff,Jörg Schmiedmayer,Gorka Muñoz-Gil
Categories: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*Comments: 13 pages, 7 figures

Abstract:Analog quantum simulators provide access to many-body dynamics beyond the reach of classical computation. However, extracting physical insights from experimental data is often hindered by measurement noise, limited observables, and incomplete knowledge of the underlying microscopic model. Here, we develop a machine learning approach based on a variational autoencoder (VAE) to analyze interference measurements of tunnel-coupled one-dimensional Bose gases, which realize the sine-Gordon quantum field theory. Trained in an unsupervised manner, the VAE learns a minimal latent representation that strongly correlates with the equilibrium control parameter of the system. Applied to non-equilibrium protocols, the latent space uncovers signatures of frozen-in solitons following rapid cooling, and reveals anomalous post-quench dynamics not captured by conventional correlation-based methods. These results demonstrate that generative models can extract physically interpretable variables directly from noisy and sparse experimental data, providing complementary probes of equilibrium and non-equilibrium physics in quantum simulators. More broadly, our work highlights how machine learning can supplement established field-theoretical techniques, paving the way for scalable, data-driven discovery in quantum many-body systems.

[LG-63] Learning quantum many-body data locally: A provably scalable framework

Link: https://arxiv.org/abs/2509.13705
Authors: Koki Chinzei,Quoc Hoan Tran,Norifumi Matsumoto,Yasuhiro Endo,Hirotaka Oshima
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
*Comments: 38 pages, 5 figures

Abstract:Machine learning (ML) holds great promise for extracting insights from complex quantum many-body data obtained in quantum experiments. This approach can efficiently solve certain quantum problems that are classically intractable, suggesting potential advantages of harnessing quantum data. However, addressing large-scale problems still requires significant amounts of data beyond the limited computational resources of near-term quantum devices. We propose a scalable ML framework called Geometrically Local Quantum Kernel (GLQK), designed to efficiently learn quantum many-body experimental data by leveraging the exponential decay of correlations, a phenomenon prevalent in noncritical systems. In the task of learning an unknown polynomial of quantum expectation values, we rigorously prove that GLQK substantially improves polynomial sample complexity in the number of qubits n, compared to the existing shadow kernel, by constructing a feature space from local quantum information at the correlation length scale. This improvement is particularly notable when each term of the target polynomial involves few local subsystems. Remarkably, for translationally symmetric data, GLQK achieves constant sample complexity, independent of n. We numerically demonstrate its high scalability in two learning tasks on quantum many-body phenomena. These results establish new avenues for utilizing experimental data to advance the understanding of quantum many-body physics.

[LG-64] Accelerated Gradient Methods with Biased Gradient Estimates: Risk Sensitivity High-Probability Guarantees and Large Deviation Bounds

Link: https://arxiv.org/abs/2509.13628
Authors: Mert Gürbüzbalaban,Yasa Syed,Necdet Serhat Aybat
Categories: Optimization and Control (math.OC); Machine Learning (cs.LG)
*Comments:

Abstract:We study trade-offs between convergence rate and robustness to gradient errors in first-order methods. Our focus is on generalized momentum methods (GMMs), a class that includes Nesterov’s accelerated gradient, heavy-ball, and gradient descent. We allow stochastic gradient errors that may be adversarial and biased, and quantify robustness via the risk-sensitive index (RSI) from robust control theory. For quadratic objectives with i.i.d. Gaussian noise, we give closed-form expressions for RSI using 2x2 Riccati equations, revealing a Pareto frontier between RSI and convergence rate over stepsize and momentum choices. We prove a large-deviation principle for time-averaged suboptimality and show that the rate function is, up to scaling, the convex conjugate of the RSI. We further connect RSI to the $H_\infty$-norm, showing that stronger worst-case robustness (smaller $H_\infty$ norm) yields sharper decay of tail probabilities. Beyond quadratics, under biased sub-Gaussian gradient errors, we derive non-asymptotic bounds on a finite-time analogue of the RSI, giving finite-time high-probability guarantees and large-deviation bounds. We also observe an analogous trade-off between RSI and convergence-rate bounds for smooth strongly convex functions. To our knowledge, these are the first non-asymptotic guarantees and risk-sensitive analysis of GMMs with biased gradients. Numerical experiments on robust regression illustrate the results.
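
For readers unfamiliar with the method class, the sketch below uses one common three-parameter form of a generalized momentum method, with stepsize `alpha` and momentum parameters `beta` and `gamma`, in which gradient descent, heavy-ball, and Nesterov's method are special cases. The parameterization, the noise-free toy quadratic, and the textbook parameter choices are assumptions for illustration; the paper's exact formulation and its noisy, biased gradient setting may differ.

```python
import math
from typing import Callable, List


def gmm_minimize(grad: Callable[[List[float]], List[float]], x0: List[float],
                 alpha: float, beta: float, gamma: float, iters: int) -> List[float]:
    """Iterate y_k = x_k + gamma * (x_k - x_{k-1}),
    x_{k+1} = x_k - alpha * grad(y_k) + beta * (x_k - x_{k-1})."""
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        y = [xi + gamma * (xi - pi) for xi, pi in zip(x, x_prev)]
        g = grad(y)
        x_next = [xi - alpha * gi + beta * (xi - pi)
                  for xi, gi, pi in zip(x, g, x_prev)]
        x_prev, x = x, x_next
    return x


def quad_grad(v: List[float]) -> List[float]:
    """Gradient of the toy objective f(x) = 0.5 * (x1^2 + 10 * x2^2)."""
    return [1.0 * v[0], 10.0 * v[1]]


mu, L = 1.0, 10.0  # strong convexity and smoothness constants of the toy objective
rho = (math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))
settings = {
    "gradient descent": (1.0 / L, 0.0, 0.0),                                  # beta = gamma = 0
    "heavy-ball": (4.0 / (math.sqrt(L) + math.sqrt(mu)) ** 2, rho ** 2, 0.0),  # gamma = 0
    "Nesterov": (1.0 / L, rho, rho),                                          # beta = gamma
}
results = {name: gmm_minimize(quad_grad, [1.0, 1.0], a, b, g, iters=200)
           for name, (a, b, g) in settings.items()}
for name, x in results.items():
    print(name, x)
```

Adding a noise or bias term inside `grad` is the natural way to experiment with the rate-versus-robustness trade-off the abstract describes.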

[LG-65] A Geometric Graph-Based Deep Learning Model for Drug-Target Affinity Prediction

Link: https://arxiv.org/abs/2509.13476
Authors: Md Masud Rana,Farjana Tasnim Mukta,Duc D. Nguyen
Categories: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
*Comments:

Abstract:In structure-based drug design, accurately estimating the binding affinity between a candidate ligand and its protein receptor is a central challenge. Recent advances in artificial intelligence, particularly deep learning, have demonstrated superior performance over traditional empirical and physics-based methods for this task, enabled by the growing availability of structural and experimental affinity data. In this work, we introduce DeepGGL, a deep convolutional neural network that integrates residual connections and an attention mechanism within a geometric graph learning framework. By leveraging multiscale weighted colored bipartite subgraphs, DeepGGL effectively captures fine-grained atom-level interactions in protein-ligand complexes across multiple scales. We benchmarked DeepGGL against established models on CASF-2013 and CASF-2016, where it achieved state-of-the-art performance with significant improvements across diverse evaluation metrics. To further assess robustness and generalization, we tested the model on the CSAR-NRC-HiQ dataset and the PDBbind v2019 holdout set. DeepGGL consistently maintained high predictive accuracy, highlighting its adaptability and reliability for binding affinity prediction in structure-based drug discovery.

[LG-66] Why all roads don't lead to Rome: Representation geometry varies across the human visual cortical hierarchy

Link: https://arxiv.org/abs/2509.13459
Authors: Arna Ghosh,Zahraa Chorghay,Shahab Bakhtiari,Blake A. Richards
Categories: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
*Comments: 9 pages, 4 figures

Abstract:Biological and artificial intelligence systems navigate the fundamental efficiency-robustness tradeoff for optimal encoding, i.e., they must efficiently encode numerous attributes of the input space while also being robust to noise. This challenge is particularly evident in hierarchical processing systems like the human brain. With a view towards understanding how systems navigate the efficiency-robustness tradeoff, we turned to a population geometry framework for analyzing representations in the human visual cortex alongside artificial neural networks (ANNs). In the ventral visual stream, we found general-purpose, scale-free representations characterized by a power law-decaying eigenspectrum in most areas. However, certain higher-order visual areas did not have scale-free representations, indicating that scale-free geometry is not a universal property of the brain. In parallel, ANNs trained with a self-supervised learning objective also exhibited scale-free geometry, but not after fine-tuning on a specific task. Based on these empirical results and our analytical insights, we posit that a system’s representation geometry is not a universal property and instead depends upon the computational objective.
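
A power law-decaying eigenspectrum means the i-th largest eigenvalue of the representation covariance falls off roughly as i^(-alpha). One simple way to estimate alpha, sketched below, is an ordinary least-squares fit of log eigenvalue against log rank. The function name and the choice to fit over the full spectrum are assumptions for illustration; the paper's estimation procedure (for example, the fitting range) is not given in the abstract.

```python
import math
from typing import Iterable


def powerlaw_decay_exponent(eigenvalues: Iterable[float]) -> float:
    """Estimate alpha in lambda_i ~ i^(-alpha) via a log-log least-squares fit."""
    vals = sorted((v for v in eigenvalues if v > 0), reverse=True)
    if len(vals) < 2:
        raise ValueError("need at least two positive eigenvalues")
    xs = [math.log(rank) for rank in range(1, len(vals) + 1)]
    ys = [math.log(v) for v in vals]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return -cov_xy / var_x  # the slope is -alpha


# Synthetic spectrum with a known exponent of 1.
spectrum = [rank ** -1.0 for rank in range(1, 51)]
alpha = powerlaw_decay_exponent(spectrum)
print(alpha)
```

A spectrum that bends away from a straight line in log-log space, as the abstract reports for some higher-order visual areas, would show up as a poor fit rather than as a different value of alpha, so inspecting the residuals is as informative as the exponent itself.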

[LG-67] Unleashing the power of computational insights in revealing the complexity of biological systems in the new era of spatial multi-omics

Link: https://arxiv.org/abs/2509.13376
Authors: Zhiwei Fan,Tiangang Wang,Kexin Huang,Binwu Ying,Xiaobo Zhou
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG)
*Comments: 43 pages, 9 figures, 1 table

Abstract:Recent advances in spatial omics technologies have revolutionized our ability to study biological systems with unprecedented resolution. By preserving the spatial context of molecular measurements, these methods enable comprehensive mapping of cellular heterogeneity, tissue architecture, and dynamic biological processes in developmental biology, neuroscience, oncology, and evolutionary studies. This review provides a systematic overview of the continuous advancements in both technology and computational algorithms that are paving the way for a deeper, more systematic comprehension of the structure and mechanisms of mammalian tissues and organs by using spatial multi-omics. Our viewpoint demonstrates how advanced machine learning algorithms and multi-omics integrative modeling can decode complex biological processes, including the spatial organization and topological relationships of cells during organ development, as well as key molecular signatures and regulatory networks underlying tumorigenesis and metastasis. Finally, we outline future directions for technological innovation and modeling insights of spatial omics in precision medicine.

[LG-68] Valuation of Exotic Options and Counterparty Games Based on Conditional Diffusion

Link: https://arxiv.org/abs/2509.13374
Authors: Helin Zhao,Junchi Shen
Categories: Pricing of Securities (q-fin.PR); Machine Learning (cs.LG); Risk Management (q-fin.RM)
*Comments: 28 pages, 12 figures

Abstract:This paper addresses the challenges of pricing exotic options and structured products, which traditional models often fail to handle due to their inability to capture real-world market phenomena like fat-tailed distributions and volatility clustering. We introduce a Diffusion-Conditional Probability Model (DDPM) to generate more realistic price paths. Our method incorporates a composite loss function with financial-specific features, and we propose a P-Q dynamic game framework for evaluating the model’s economic value through adversarial backtesting. Static validation shows our P-model effectively matches market mean and volatility. In dynamic games, it demonstrates significantly higher profitability than a traditional Monte Carlo-based model for European and Asian options. However, the model shows limitations in pricing products highly sensitive to extreme events, such as snowballs and accumulators, because it tends to underestimate tail risks. The study concludes that diffusion models hold significant potential for enhancing pricing accuracy, though further research is needed to improve their ability to model extreme market risks.

[LG-69] Benchmarking Dimensionality Reduction Techniques for Spatial Transcriptomics

Link: https://arxiv.org/abs/2509.13344
Authors: Md Ishtyaq Mahmud,Veena Kochat,Suresh Satpati,Jagan Mohan Reddy Dwarampudi,Kunal Rai,Tania Banerjee
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)
*Comments: Accepted to the 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025); 10 pages, 4 figures

Abstract:We introduce a unified framework for evaluating dimensionality reduction techniques in spatial transcriptomics beyond standard PCA approaches. We benchmark six methods (PCA, NMF, autoencoder, VAE, and two hybrid embeddings) on a cholangiocarcinoma Xenium dataset, systematically varying latent dimensions ($k$ = 5-40) and clustering resolutions ($\rho$ = 0.1-1.2). Each configuration is evaluated using complementary metrics including reconstruction error, explained variance, cluster cohesion, and two novel biologically-motivated measures: Cluster Marker Coherence (CMC) and Marker Exclusion Rate (MER). Our results demonstrate distinct performance profiles: PCA provides a fast baseline, NMF maximizes marker enrichment, VAE balances reconstruction and interpretability, while autoencoders occupy a middle ground. We provide systematic hyperparameter selection using Pareto optimal analysis and demonstrate how MER-guided reassignment improves biological fidelity across all methods, with CMC scores improving by up to 12% on average. This framework enables principled selection of dimensionality reduction methods tailored to specific spatial transcriptomics analyses.
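
Pareto optimal analysis for hyperparameter selection can be sketched as follows: a configuration is kept only if no other configuration is at least as good on every metric and strictly better on at least one. The metric names and all numbers below are made-up placeholders, not results from the paper, and the two-metric setup is an assumption for illustration.

```python
from typing import Dict, List, Sequence


def pareto_front(configs: List[Dict], minimize: Sequence[str],
                 maximize: Sequence[str]) -> List[Dict]:
    """Return the configurations that are not dominated by any other."""

    def dominates(a: Dict, b: Dict) -> bool:
        no_worse = (all(a[m] <= b[m] for m in minimize)
                    and all(a[m] >= b[m] for m in maximize))
        strictly_better = (any(a[m] < b[m] for m in minimize)
                           or any(a[m] > b[m] for m in maximize))
        return no_worse and strictly_better

    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Hypothetical configurations: lower reconstruction error and higher coherence are better.
configs = [
    {"name": "A", "recon_error": 0.30, "coherence": 0.60},
    {"name": "B", "recon_error": 0.25, "coherence": 0.70},
    {"name": "C", "recon_error": 0.20, "coherence": 0.65},
    {"name": "D", "recon_error": 0.35, "coherence": 0.80},
]
front = pareto_front(configs, minimize=["recon_error"], maximize=["coherence"])
print([c["name"] for c in front])
```

The pairwise check is quadratic in the number of configurations, which is fine for a grid over a handful of latent dimensions and clustering resolutions. Choosing a single point on the front still requires a preference between the metrics.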

[LG-70] Self-Supervised and Topological Signal-Quality Assessment for Any PPG Device

Link: https://arxiv.org/abs/2509.12510
Authors: Wei Shao,Ruoyu Zhang,Zequan Liang,Ehsan Kourkchi,Setareh Rafatirad,Houman Homayoun
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
*Comments: In the proceedings of IEEE-EMBS BSN 2025

Abstract:Wearable photoplethysmography (PPG) is embedded in billions of devices, yet its optical waveform is easily corrupted by motion, perfusion loss, and ambient light, jeopardizing downstream cardiometric analytics. Existing signal-quality assessment (SQA) methods rely either on brittle heuristics or on data-hungry supervised models. We introduce the first fully unsupervised SQA pipeline for wrist PPG. Stage 1 trains a contrastive 1-D ResNet-18 on 276 h of raw, unlabeled data from heterogeneous sources (varying in device and sampling frequency), yielding optical-emitter- and motion-invariant embeddings (i.e., the learned representation is stable across differences in LED wavelength, drive intensity, and device optics, as well as wrist motion). Stage 2 converts each 512-D encoder embedding into a 4-D topological signature via persistent homology (PH) and clusters these signatures with HDBSCAN. To produce a binary signal-quality index (SQI), the acceptable PPG signals are represented by the densest cluster while the remaining clusters are assumed to mainly contain poor-quality PPG signals. Without re-tuning, the SQI attains Silhouette, Davies-Bouldin, and Calinski-Harabasz scores of 0.72, 0.34, and 6173, respectively, on a stratified sample of 10,000 windows. In this study, we propose a hybrid self-supervised-learning–topological-data-analysis (SSL–TDA) framework that offers a drop-in, scalable, cross-device quality gate for PPG signals.

Information Retrieval

[IR-0] MA-DPR: Manifold-aware Distance Metrics for Dense Passage Retrieval

Link: https://arxiv.org/abs/2509.13562
Authors: Yifan Liu,Qianfeng Wen,Mark Zhao,Jiazhou Liang,Scott Sanner
Categories: Information Retrieval (cs.IR)
*Comments: 19 pages, 8 figures

Abstract:Dense Passage Retrieval (DPR) typically relies on Euclidean or cosine distance to measure query-passage relevance in embedding space, which is effective when embeddings lie on a linear manifold. However, our experiments across DPR benchmarks suggest that embeddings often lie on lower-dimensional, non-linear manifolds, especially in out-of-distribution (OOD) settings, where cosine and Euclidean distance fail to capture semantic similarity. To address this limitation, we propose a manifold-aware distance metric for DPR (MA-DPR) that models the intrinsic manifold structure of passages using a nearest neighbor graph and measures query-passage distance based on their shortest path in this graph. We show that MA-DPR outperforms Euclidean and cosine distances by up to 26% on OOD passage retrieval with comparable in-distribution performance across various embedding models while incurring a minimal increase in query inference time. Empirical evidence suggests that manifold-aware distance allows DPR to leverage context from related neighboring passages, making it effective even in the absence of direct semantic overlap. MA-DPR can be applied to a wide range of dense embedding and retrieval tasks, offering potential benefits across a wide spectrum of domains.
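
The core idea, measuring query-passage distance along a nearest neighbor graph instead of straight through embedding space, can be sketched with a k-nearest-neighbor graph and Dijkstra's algorithm. The Euclidean edge weights, the symmetric graph, the way the query is attached to its k nearest passages, and the toy 2-D "embeddings" on an arc are all assumptions for illustration, not details taken from the paper.

```python
import heapq
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def knn_graph(points: List[Point], k: int) -> Dict[int, Dict[int, float]]:
    """Symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    graph: Dict[int, Dict[int, float]] = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def graph_distances(points: List[Point], query: Point, k: int) -> Dict[int, float]:
    """Shortest-path distance from the query to every passage (Dijkstra)."""
    graph = knn_graph(points, k)
    dist = {j: math.inf for j in graph}
    heap = []
    # Attach the query to its k nearest passages as the search frontier.
    for d, j in sorted((math.dist(query, p), j) for j, p in enumerate(points))[:k]:
        dist[j] = d
        heapq.heappush(heap, (d, j))
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Toy passages on three quarters of a circle (0 to 270 degrees in 30-degree steps).
points = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(0, 271, 30)]
query = (math.cos(math.radians(-10)), math.sin(math.radians(-10)))
dist = graph_distances(points, query, k=2)
print(round(math.dist(query, points[9]), 3), round(dist[9], 3))
```

In this toy layout the passage at 270 degrees is closer to the query than the passage at 90 degrees in straight-line distance, but farther along the graph, because reaching it means walking around the arc. The graph over passages can be built once offline, so the per-query cost is the neighbor lookup plus one shortest-path search.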

Attachment Download

Click to download today's full paper list