This blog post presents the latest paper list retrieved from Arxiv.org on 2025-09-10, updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. If you would like to receive it by email on a schedule, please leave your email address in the comments.

Note: paper data is retrieved from Arxiv.org daily and updated automatically at around 12:00 each day.

Friendly reminder: if you would like to receive the daily paper data by email, please leave your email address in the comments.

Contents

Overview (2025-09-10)

467 papers were updated today, including:

  • Natural Language Processing: 57 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 149 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 90 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 128 papers (Machine Learning (cs.LG))

Natural Language Processing

[NLP-0] Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

[Quick Read]: This paper tackles the difficulty of activating and training parallel thinking in large language models (LLMs) for complex real-world reasoning tasks. Existing methods rely mainly on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. The paper proposes Parallel-R1, the first reinforcement learning (RL) framework for this setting, whose key idea is a progressive curriculum: SFT on prompt-generated trajectories from easier tasks first instills the parallel-thinking ability, after which training switches to RL to explore and generalize the skill on harder problems. Experiments show that this recipe not only raises accuracy on math benchmarks (MATH, AMC23, AIME), but also reveals a behavioral shift: early in training the model uses parallel thinking as an exploration strategy, while later it uses the same capability for multi-perspective verification. Most notably, introducing parallel thinking as a mid-training exploration scaffold substantially lifts final performance (e.g., a 42.9% improvement over the baseline on AIME25).

Link: https://arxiv.org/abs/2509.07980
Authors: Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Xinyu Yang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: Project website: this https URL

Abstract:Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose Parallel-R1, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at this https URL.

[NLP-1] Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

[Quick Read]: This paper addresses the shortcomings of current open-source tool-using visual models on complex reasoning: monotonous reasoning patterns and a limited number of interaction turns, which make them inadequate for hard visual search tasks that require trial-and-error exploration. The solution rests on three components that enable deep multi-turn reasoning: first, the Visual Probe Dataset, a large collection of challenging visual search problems for training exploratory reasoning; second, an iterative data-collection pipeline that gathers cold-start trajectories exhibiting diverse reasoning patterns such as depth-first search, trial-and-error, and goal maintenance; third, an over-turn masking strategy that avoids penalizing responses exceeding the maximum turn limit during reinforcement learning, balancing training efficiency against test-time scalability. Although training allows at most six interaction turns, the model naturally scales to tens of turns at inference time, with accuracy improving as the number of turns grows.

Link: https://arxiv.org/abs/2509.07969
Authors: Xin Lai, Junyi Li, Wei Li, Tao Liu, Tianjian Li, Hengshuang Zhao
Affiliations: ByteDance; The University of Hong Kong
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Code, datasets, models are available at this https URL . Project Page: this https URL

Abstract:Recent advances in large multimodal models have leveraged image-based tools with reinforcement learning to tackle visual problems. However, existing open-source approaches often exhibit monotonous reasoning patterns and allow only a limited number of interaction turns, making them inadequate for difficult tasks that require trial-and-error exploration. In this work, we address this limitation by scaling up tool-based interactions and introduce Mini-o3, a system that executes deep, multi-turn reasoning – spanning tens of steps – and achieves state-of-the-art performance on challenging visual search tasks. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. First, we construct the Visual Probe Dataset, a collection of thousands of challenging visual search problems designed for exploratory reasoning. Second, we develop an iterative data collection pipeline to obtain cold-start trajectories that exhibit diverse reasoning patterns, including depth-first search, trial-and-error, and goal maintenance. Third, we propose an over-turn masking strategy that prevents penalization of over-turn responses (those that hit the maximum number of turns) during reinforcement learning, thereby balancing training-time efficiency with test-time scalability. Despite training with an upper bound of only six interaction turns, our model generates trajectories that naturally scale to tens of turns at inference time, with accuracy improving as the number of turns increases. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths, effectively solving challenging visual search problems.
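The over-turn masking idea is concrete enough to sketch in code. Below is a minimal, hypothetical illustration (all names are ours, not the paper's implementation): rollouts that hit the turn cap are simply excluded from a policy-gradient loss rather than penalized as failures.

```python
import torch

def masked_policy_loss(logps, advantages, hit_turn_cap):
    """Policy-gradient loss with over-turn masking: rollouts that were
    cut off at the maximum number of interaction turns are excluded
    from the loss instead of being scored as failures."""
    keep = ~hit_turn_cap
    if keep.sum() == 0:               # every rollout hit the cap
        return logps.sum() * 0.0      # contribute nothing this step
    return -(logps[keep] * advantages[keep]).mean()

# Toy rollout batch: the third trajectory hit the 6-turn training cap.
logps = torch.tensor([-12.3, -9.8, -15.1])
advantages = torch.tensor([0.7, -0.4, 0.9])
hit_cap = torch.tensor([False, False, True])
print(masked_policy_loss(logps, advantages, hit_cap))
```

The effect is that long-running exploration is never discouraged during training, which is consistent with the reported test-time scaling to tens of turns.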

[NLP-2] SimpleQA Verified: A Reliable Factuality Benchmark to Measure Parametric Knowledge

[Quick Read]: This paper addresses key weaknesses in current short-form factuality evaluation of large language models (LLMs), including noisy and incorrect labels, topical bias, and question redundancy. The core of the solution is SimpleQA Verified, a 1,000-prompt benchmark built through a rigorous multi-stage filtering process of de-duplication, topic balancing, and source reconciliation, together with an improved autorater prompt. The result is a more reliable and challenging evaluation set that gives the research community a higher-fidelity tool for measuring parametric model factuality and tracking genuine progress in reducing hallucinations.

Link: https://arxiv.org/abs/2509.07968
Authors: Lukas Haas, Gal Yona, Giovanni D'Antonio, Sasha Goldshtein, Dipanjan Das
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We introduce SimpleQA Verified, a 1,000-prompt benchmark for evaluating Large Language Model (LLM) short-form factuality based on OpenAI’s SimpleQA. It addresses critical limitations in OpenAI’s benchmark, including noisy and incorrect labels, topical biases, and question redundancy. SimpleQA Verified was created through a rigorous multi-stage filtering process involving de-duplication, topic balancing, and source reconciliation to produce a more reliable and challenging evaluation set, alongside improvements in the autorater prompt. On this new benchmark, Gemini 2.5 Pro achieves a state-of-the-art F1-score of 55.6, outperforming other frontier models, including GPT-5. This work provides the research community with a higher-fidelity tool to track genuine progress in parametric model factuality and to mitigate hallucinations. The benchmark dataset, evaluation code, and leaderboard are available at: this https URL.
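Two of the filtering stages can be pictured in miniature. A toy sketch, assuming exact-match normalization for de-duplication and a fixed per-topic cap (the actual pipeline also performs source reconciliation, which is not shown):

```python
from collections import defaultdict

def dedup_and_balance(items, per_topic_cap):
    """Two filtering stages in miniature: drop duplicate questions
    (here: exact match after whitespace/case normalisation), then cap
    each topic so no subject dominates the benchmark."""
    seen, kept, counts = set(), [], defaultdict(int)
    for question, topic in items:
        key = " ".join(question.lower().split())
        if key in seen or counts[topic] >= per_topic_cap:
            continue
        seen.add(key)
        counts[topic] += 1
        kept.append((question, topic))
    return kept

data = [("Who wrote Hamlet?", "literature"),
        ("who wrote  Hamlet?", "literature"),      # near-duplicate
        ("What is the boiling point of water?", "science")]
print(dedup_and_balance(data, per_topic_cap=1))
```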

[NLP-3] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images

[Quick Read]: This paper targets the limited scale, diversity, and reasoning depth of existing benchmarks for evaluating visual reasoning of vision-language models (VLMs) over structured data such as tables, especially rendered table images. The key is Visual-TableQA, a large-scale, open-domain multimodal dataset of 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all built by a modular, scalable, fully autonomous generation pipeline. Multiple reasoning LLMs collaborate across distinct roles of generation, validation, and inspiration, with cross-model prompting and LLM-jury filtering promoting diversity and creativity, which effectively improves models' generalization on complex visual table reasoning tasks.

Link: https://arxiv.org/abs/2509.07966
Authors: Boammani Aser Lompo, Marc Haraoui
Affiliations: École de Technologie Supérieure
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Work in Progress

Abstract:Visual reasoning over structured data such as tables is a critical capability for modern vision-language models (VLMs), yet current benchmarks remain limited in scale, diversity, or reasoning depth, especially when it comes to rendered table images. Addressing this gap, we introduce Visual-TableQA, a large-scale, open-domain multimodal dataset specifically designed to evaluate and enhance visual reasoning over complex tabular data. Our generation pipeline is modular, scalable, and fully autonomous, involving multiple reasoning LLMs collaborating across distinct roles: generation, validation, and inspiration. Visual-TableQA comprises 2.5k richly structured LaTeX-rendered tables and 6k reasoning-intensive QA pairs, all produced at a cost of under USD 100. To promote diversity and creativity, our pipeline performs multi-model collaborative data generation via cross-model prompting (‘inspiration’) and LLM-jury filtering. Stronger models seed layouts and topics that weaker models elaborate, collectively distilling diverse reasoning patterns and visual structures into the dataset. Empirical results show that models fine-tuned on Visual-TableQA generalize robustly to external benchmarks, outperforming several proprietary models despite the dataset’s synthetic nature. The full pipeline and resources are publicly available at this https URL.

[NLP-4] GENUINE: Graph Enhanced Multi-level Uncertainty Estimation for Large Language Models EMNLP2025

[Quick Read]: This paper addresses the lack of reliable uncertainty estimation for large language models (LLMs) in high-stakes applications. Existing methods usually rely only on token-level probability measures and ignore semantic dependencies, so they fail to capture the structure of the generated text. The key is GENUINE, a graph-enhanced multi-level uncertainty estimation framework that models semantic and structural relationships through dependency parse trees and hierarchical graph pooling, combined with supervised learning to improve confidence assessment. Experiments across NLP tasks show up to 29% higher AUROC than semantic-entropy-based approaches and over 15% lower calibration error.

Link: https://arxiv.org/abs/2509.07925
Authors: Tuo Wang, Adithya Kulkarni, Tyler Cody, Peter A. Beling, Yujun Yan, Dawei Zhou
Affiliations: Virginia Polytechnic Institute and State University; Ball State University; Dartmouth College
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by EMNLP 2025

Abstract:Uncertainty estimation is essential for enhancing the reliability of Large Language Models (LLMs), particularly in high-stakes applications. Existing methods often overlook semantic dependencies, relying on token-level probability measures that fail to capture structural relationships within the generated text. We propose GENUINE: Graph ENhanced mUlti-level uncertaINty Estimation for Large Language Models, a structure-aware framework that leverages dependency parse trees and hierarchical graph pooling to refine uncertainty quantification. By incorporating supervised learning, GENUINE effectively models semantic and structural relationships, improving confidence assessments. Extensive experiments across NLP tasks show that GENUINE achieves up to 29% higher AUROC than semantic entropy-based approaches and reduces calibration errors by over 15%, demonstrating the effectiveness of graph-based uncertainty modeling. The code is available at this https URL.
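To make the graph-over-parse idea concrete, here is a toy sketch: build an adjacency matrix from dependency edges, do one round of neighborhood averaging, and mean-pool into a graph-level feature. The simple mean pooling is our stand-in for the paper's hierarchical graph pooling; the parse and embeddings are fabricated.

```python
import torch

# A dependency parse as (token, head_index) pairs; -1 marks the root.
# In practice these come from a parser such as spaCy or Stanza.
parse = [("The", 1), ("model", 2), ("answers", -1), ("confidently", 2)]

# Toy token embeddings; in GENUINE these would come from the LLM.
emb = torch.randn(len(parse), 8)

# Adjacency from dependency edges (undirected, plus self-loops).
A = torch.eye(len(parse))
for i, (_, head) in enumerate(parse):
    if head >= 0:
        A[i, head] = A[head, i] = 1.0

# One round of neighbourhood averaging, then graph-level mean pooling:
# a crude stand-in for the paper's hierarchical graph pooling.
deg = A.sum(dim=1, keepdim=True)
node_repr = (A @ emb) / deg           # message-passing step
graph_repr = node_repr.mean(dim=0)    # structure-aware feature vector

# graph_repr would feed a supervised confidence head.
print(graph_repr.shape)  # torch.Size([8])
```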

[NLP-5] Uncovering Scaling Laws for Large Language Models via Inverse Problems EMNLP

[Quick Read]: The problem addressed is that, given the high cost of training large language models (LLMs), brute-force trial-and-error improvement is no longer feasible, so a more efficient, predictable way to guide model building is needed. The key idea is to draw on the success of inverse problems in uncovering fundamental scientific laws and bring inverse-problem methods into LLM research to efficiently discover and validate scaling laws, reaching target performance with significantly better cost-effectiveness.

Link: https://arxiv.org/abs/2509.07909
Authors: Arun Verma, Zhaoxuan Wu, Zijian Zhou, Xiaoqiang Lin, Zhiliang Chen, Rachael Hwee Ling Sim, Rui Qiao, Jingtan Wang, Nhung Bui, Xinyuan Niu, Wenyang Hu, Gregory Kang Ruey Lau, Zi-Yu Khoo, Zitong Zhao, Xinyi Xu, Apivich Hemachandra, See-Kiong Ng, Bryan Kian Hsiang Low
Affiliations: Singapore-MIT Alliance for Research and Technology; Dept. of Computer Science, National University of Singapore; Agency for Science, Technology and Research; Institute of Data Science, National University of Singapore; SAP; CNRS@CREATE; AI Singapore
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at EMNLP Findings 2025

Abstract:Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computations. However, due to the high costs of training such models, brute-force trial-and-error approaches to improve LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also efficiently uncover scaling laws that guide the building of LLMs to achieve the desirable performance with significantly better cost-effectiveness.

[NLP-6] Biased Tales: Cultural and Topic Bias in Generating Children's Stories

[Quick Read]: This paper examines the cultural and gender biases that generative AI can introduce when writing children's bedtime stories, biases that can shape children's values. The key contribution is Biased Tales, a comprehensive dataset for systematically analyzing how bias affects protagonists' attributes and plot elements in LLM-generated stories, revealing and quantifying how gender and cultural background skew narrative content and providing an empirical basis for making creative AI use more equitable and diverse.

Link: https://arxiv.org/abs/2509.07908
Authors: Donya Rooein, Vilém Zouhar, Debora Nozza, Dirk Hovy
Affiliations: Bocconi University; ETH Zurich
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Stories play a pivotal role in human communication, shaping beliefs and morals, particularly in children. As parents increasingly rely on large language models (LLMs) to craft bedtime stories, the presence of cultural and gender stereotypes in these narratives raises significant concerns. To address this issue, we present Biased Tales, a comprehensive dataset designed to analyze how biases influence protagonists’ attributes and story elements in LLM-generated stories. Our analysis uncovers striking disparities. When the protagonist is described as a girl (as compared to a boy), appearance-related attributes increase by 55.26%. Stories featuring non-Western children disproportionately emphasize cultural heritage, tradition, and family themes far more than those for Western children. Our findings highlight the role of sociocultural bias in making creative AI use more equitable and diverse.

[NLP-7] From Detection to Mitigation: Addressing Gender Bias in Chinese Texts via Efficient Tuning and Voting-Based Rebalancing NLPCC2025

[Quick Read]: This paper addresses sentence-level gender bias detection and mitigation in Chinese, aiming to improve fairness and controllability in natural language generation (NLG). The key elements are: first, efficient fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA) for accurate bias detection and classification; second, a more balanced training set augmented with heterogeneous samples from multiple sources to improve generalization; third, majority voting over multiple expert models to boost detection and classification performance; and finally, a multi-temperature sampling mechanism that captures the variety of bias expression styles, improving detection and mitigation of biased generation.

Link: https://arxiv.org/abs/2509.07889
Authors: Chengyan Wu, Yiqiang Cai, Yufei Cheng, Yun Xue
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: NLPCC 2025

Abstract:This paper presents our team’s solution to Shared Task 7 of NLPCC-2025, which focuses on sentence-level gender bias detection and mitigation in Chinese. The task aims to promote fairness and controllability in natural language generation by automatically detecting, classifying, and mitigating gender bias. To address this challenge, we adopt a fine-tuning approach based on large language models (LLMs), efficiently adapt to the bias detection task via Low-Rank Adaptation (LoRA). In terms of data processing, we construct a more balanced training set to alleviate class imbalance and introduce heterogeneous samples from multiple sources to enhance model generalization. For the detection and classification sub-tasks, we employ a majority voting strategy that integrates outputs from multiple expert models to boost performance. Additionally, to improve bias generation detection and mitigation, we design a multi-temperature sampling mechanism to capture potential variations in bias expression styles. Experimental results demonstrate the effectiveness of our approach in bias detection, classification, and mitigation. Our method ultimately achieves an average score of 47.90%, ranking fourth in the shared task.
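The voting and sampling pieces are straightforward to sketch. A hedged illustration, where the stub classifier and the temperature grid are ours and only stand in for the LoRA-tuned expert models:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among expert-model outputs."""
    return Counter(predictions).most_common(1)[0][0]

def multi_temperature_label(generate, prompt, temps=(0.2, 0.7, 1.0)):
    """Sample one label per temperature to cover varied bias-expression
    styles, then reduce by majority vote. `generate` is any callable
    (prompt, temperature) -> label; the one below is a stub."""
    return majority_vote([generate(prompt, t) for t in temps])

# Stub standing in for a LoRA-tuned LLM classifier.
def stub_generate(prompt, temperature):
    return "biased" if "只有男人" in prompt else "unbiased"

print(multi_temperature_label(stub_generate, "只有男人才适合当工程师。"))
# -> biased
```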

[NLP-8] Are Humans as Brittle as Large Language Models?

[Quick Read]: This paper asks whether the sensitivity of large language models (LLMs) to prompt changes in text classification (prompt brittleness) is unique to LLMs, or instead mirrors the natural variability of human annotators facing the same instruction perturbations. The key is a systematic comparison of how human annotators and LLMs respond to identical prompt modifications (substituted label sets, changed label formats, typographical errors, and reversed label order) on the same text classification tasks, revealing whether prompt brittleness should be read as a reflection of human annotation variance rather than purely a model defect.

Link: https://arxiv.org/abs/2509.07869
Authors: Jiahui Li, Sean Papay, Roman Klinger
Affiliations: University of Bamberg
Subjects: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments:

Abstract:The output of large language models (LLM) is unstable, due to both non-determinism of the decoding process as well as to prompt brittleness. While the intrinsic non-determinism of LLM generation may mimic existing uncertainty in human annotations through distributional shifts in outputs, it is largely assumed, yet unexplored, that the prompt brittleness effect is unique to LLMs. This raises the question: do human annotators show similar sensitivity to instruction changes? If so, should prompt brittleness in LLMs be considered problematic? One may alternatively hypothesize that prompt brittleness correctly reflects human annotation variances. To fill this research gap, we systematically compare the effects of prompt modifications on LLMs and identical instruction modifications for human annotators, focusing on the question of whether humans are similarly sensitive to prompt perturbations. To study this, we prompt both humans and LLMs for a set of text classification tasks conditioned on prompt variations. Our findings indicate that both humans and LLMs exhibit increased brittleness in response to specific types of prompt modifications, particularly those involving the substitution of alternative label sets or label formats. However, the distribution of human judgments is less affected by typographical errors and reversed label order than that of LLMs.
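A small sketch of the kinds of instruction perturbations the study compares; the perturbation names and the template below are ours, for illustration only:

```python
import random

def perturb_prompt(template, labels, kind):
    """Build instruction variants of the kinds compared in the study.
    `template` must contain a {labels} slot."""
    if kind == "reversed_labels":
        opts = list(reversed(labels))
    elif kind == "alt_format":          # numbered instead of plain labels
        opts = [f"({i}) {l}" for i, l in enumerate(labels, 1)]
    elif kind == "typo":                # swap two characters in one label
        opts, i = labels[:], random.randrange(len(labels))
        l = list(opts[i]); l[0], l[1] = l[1], l[0]; opts[i] = "".join(l)
    else:
        opts = labels
    return template.format(labels=", ".join(opts))

tmpl = "Classify the sentiment of the text. Options: {labels}."
for k in ["reversed_labels", "alt_format", "typo"]:
    print(perturb_prompt(tmpl, ["positive", "negative", "neutral"], k))
```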

[NLP-9] Small Open Models Achieve Near Parity with Large Models in Low Resource Literary Translation at a Fraction of the Cost

[Quick Read]: This paper tackles the weak performance of small open models on English-Romanian literary translation, where low-resource languages such as Romanian lack large, high-quality parallel corpora. The key is the TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework with two core parts: a synthetic-data pipeline that generates 15K high-quality Romanian references from an existing English fable dataset (DS-TF1-EN-3M), and two-stage fine-tuning of a 12B-parameter open-weight model, first instruction tuning to capture literary narrative style, then adapter compression for efficient deployment. The approach markedly improves the fluency and accuracy of open models on literary translation while remaining cost-effective and reproducible, offering a practical route for translating culturally significant content in low-resource languages.

Link: https://arxiv.org/abs/2509.07829
Authors: Mihai Nadas, Laura Diosan, Andreea Tomescu, Andrei Piscoran
Affiliations: Babeş-Bolyai University; KlusAI Labs
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 8 figures, includes datasets and models released on Hugging Face

Abstract:Literary translation has recently gained attention as a distinct and complex task in machine translation research. However, the translation by small open models remains an open problem. We contribute to this ongoing research by introducing TINYFABULIST TRANSLATION FRAMEWORK (TF2), a unified framework for dataset creation, fine tuning, and evaluation in English-Romanian literary translations, centred on the creation and open release of both a compact, fine tuned language model (TF2-12B) and large scale synthetic parallel datasets (DS-TF2-EN-RO-3M and DS-TF2-EN-RO-15K). Building on DS-TF1-EN-3M (TF1), the largest collection of synthetic English fables to date, we address the need for rich, high quality literary datasets in low resource languages such as Romanian. Our pipeline first generates 15k high quality Romanian references from the TF1 pool using a high performing LLM. We then apply a two stage fine tuning process to a 12B parameter open weight model: (i) instruction tuning to capture genre specific narrative style, and (ii) adapter compression for efficient deployment. Evaluation combines corpus level BLEU and a five dimension LLM based rubric (accuracy, fluency, coherence, style, cultural adaptation) to provide a nuanced assessment of translation quality. Results show that our fine tuned model achieves fluency and adequacy competitive with top performing large proprietary models, while being open, accessible, and significantly more cost effective. Alongside the fine tuned model and both datasets, we publicly release all scripts and evaluation prompts. TF2 thus provides an end-to-end, reproducible pipeline for research on cost efficient translation, cross lingual narrative generation, and the broad adoption of open models for culturally significant literary content in low resource settings.
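The two-sided evaluation (corpus-level BLEU plus a five-dimension LLM-judge rubric) can be sketched as follows. The Romanian strings and rubric values are made up, and sacrebleu is one common choice for corpus BLEU, not necessarily the authors' tooling:

```python
import sacrebleu  # pip install sacrebleu

hyps = ["Vulpea isteata a pacalit corbul."]
refs = [["Vulpea cea isteata l-a pacalit pe corb."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)

# Five-dimension LLM-judge rubric (1-5); these values are made up.
rubric = {"accuracy": 4, "fluency": 5, "coherence": 4,
          "style": 4, "cultural_adaptation": 3}
rubric_mean = sum(rubric.values()) / len(rubric)

print(f"BLEU = {bleu.score:.1f}, rubric mean = {rubric_mean:.2f}")
```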

[NLP-10] Dual Knowledge-Enhanced Two-Stage Reasoner for Multimodal Dialog Systems

[Quick Read]: This paper addresses two core problems in textual response generation for multimodal task-oriented dialog systems: the neglect of unstructured review knowledge and the underutilization of large language models (LLMs). The key is DK2R, a dual knowledge-enhanced two-stage reasoning framework in which an LLM dynamically assesses the utility of structured attribute knowledge versus unstructured review knowledge, and separately distills intention-oriented key clues through dedicated reasoning; these clues then serve as auxiliary signals that improve LLM-based textual response generation.

Link: https://arxiv.org/abs/2509.07817
Authors: Xiaolin Chen, Xuemeng Song, Haokun Wen, Weili Guan, Xiangyu Zhao, Liqiang Nie
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
Comments:

Abstract:Textual response generation is pivotal for multimodal task-oriented dialog systems, which aims to generate proper textual responses based on the multimodal context. While existing efforts have demonstrated remarkable progress, there still exist the following limitations: 1) neglect of unstructured review knowledge and 2) underutilization of large language models (LLMs). Inspired by this, we aim to fully utilize dual knowledge (i.e., structured attribute and unstructured review knowledge) with LLMs to promote textual response generation in multimodal task-oriented dialog systems. However, this task is non-trivial due to two key challenges: 1) dynamic knowledge type selection and 2) intention-response decoupling. To address these challenges, we propose a novel dual knowledge-enhanced two-stage reasoner by adapting LLMs for multimodal dialog systems (named DK2R). To be specific, DK2R first extracts both structured attribute and unstructured review knowledge from external knowledge base given the dialog context. Thereafter, DK2R uses an LLM to evaluate each knowledge type’s utility by analyzing LLM-generated provisional probe responses. Moreover, DK2R separately summarizes the intention-oriented key clues via dedicated reasoning, which are further used as auxiliary signals to enhance LLM-based textual response generation. Extensive experiments conducted on a public dataset verify the superiority of DK2R. We have released the codes and parameters.

[NLP-11] SciNLP: A Domain-Specific Benchmark for Full-Text Scientific Entity and Relation Extraction in NLP EMNLP2025

[Quick Read]: This paper addresses the difficulty of structured information extraction from scientific literature, in particular the absence of full-text entity and relation annotations for the Natural Language Processing (NLP) domain; existing datasets cover only specific sections because of domain complexity and annotation cost. The key contribution is SciNLP, the first full-text entity and relation extraction benchmark for NLP, with 60 manually annotated full papers covering 7,072 entities and 1,826 relations. Comparative experiments and model evaluations confirm that the dataset improves mainstream supervised models, and models trained on it enable automatic construction of a fine-grained knowledge graph (KG) whose rich topology (average node degree 3.2) benefits downstream applications.

Link: https://arxiv.org/abs/2509.07801
Authors: Decheng Duan, Yingyi Zhang, Jitong Peng, Chengzhi Zhang
Affiliations: Nanjing University of Science and Technology; Soochow University
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR)
Comments: EMNLP 2025 Main

Abstract:Structured information extraction from scientific literature is crucial for capturing core concepts and emerging trends in specialized fields. While existing datasets aid model development, most focus on specific publication sections due to domain complexity and the high cost of annotating scientific texts. To address this limitation, we introduce SciNLP - a specialized benchmark for full-text entity and relation extraction in the Natural Language Processing (NLP) domain. The dataset comprises 60 manually annotated full-text NLP publications, covering 7,072 entities and 1,826 relations. Compared to existing research, SciNLP is the first dataset providing full-text annotations of entities and their relationships in the NLP domain. To validate the effectiveness of SciNLP, we conducted comparative experiments with similar datasets and evaluated the performance of state-of-the-art supervised models on this dataset. Results reveal varying extraction capabilities of existing models across academic texts of different lengths. Cross-comparisons with existing datasets show that SciNLP achieves significant performance improvements on certain baseline models. Using models trained on SciNLP, we implemented automatic construction of a fine-grained knowledge graph for the NLP domain. Our KG has an average node degree of 3.2 per entity, indicating rich semantic topological information that enhances downstream applications. The dataset is publicly available at this https URL.

[NLP-12] Are LLMs Enough for Hyperpartisan, Fake, Polarized, and Harmful Content Detection? Evaluating In-Context Learning vs. Fine-Tuning

[Quick Read]: This paper addresses the detection of harmful online content such as fake news, polarizing material, and political bias, where the community lacks a systematic comparison of large language models (LLMs) across models, adaptation methods, and languages. The key is a large-scale comparison of adaptation paradigms, parameter-efficient fine-tuning versus a variety of in-context learning strategies (zero-shot prompts, codebooks, few-shot examples, and chain-of-thought), across 10 datasets in 5 languages (English, Spanish, Portuguese, Arabic, and Bulgarian). The study finds that fine-tuning generally beats pure in-context learning: even smaller fine-tuned models outperform much larger models used in-context on task-specific settings, underscoring the practical value of fine-tuning.

Link: https://arxiv.org/abs/2509.07768
Authors: Michele Joshua Maggini, Dhia Merzougui, Rabiraj Bandyopadhyay, Gaël Dias, Fabrice Maurel, Pablo Gamallo
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The spread of fake news, polarizing, politically biased, and harmful content on online platforms has been a serious concern. With large language models becoming a promising approach, however, no study has properly benchmarked their performance across different models, usage methods, and languages. This study presents a comprehensive overview of different Large Language Models adaptation paradigms for the detection of hyperpartisan and fake news, harmful tweets, and political bias. Our experiments spanned 10 datasets and 5 different languages (English, Spanish, Portuguese, Arabic and Bulgarian), covering both binary and multiclass classification scenarios. We tested different strategies ranging from parameter efficient Fine-Tuning of language models to a variety of different In-Context Learning strategies and prompts. These included zero-shot prompts, codebooks, few-shot (with both randomly-selected and diversely-selected examples using Determinantal Point Processes), and Chain-of-Thought. We discovered that In-Context Learning often underperforms when compared to Fine-Tuning a model. This main finding highlights the importance of Fine-Tuning even smaller models on task-specific settings even when compared to the largest models evaluated in an In-Context Learning setup - in our case LlaMA3.1-8b-Instruct, Mistral-Nemo-Instruct-2407 and Qwen2.5-7B-Instruct.

[NLP-13] Factuality Beyond Coherence: Evaluating LLM Watermarking Methods for Medical Texts EMNLP2025

[Quick Read]: This paper addresses the provenance and accountability risks raised by fluent generative AI in medicine, where existing watermarking techniques may damage medical factual accuracy in the low-entropy settings their reweighting strategies exploit. The key is a medical-domain evaluation workflow that jointly assesses factual accuracy and coherence, introducing the Factuality-Weighted Score (FWS), a composite metric that prioritizes factual accuracy over mere coherence. This gives a fuller measure of how watermarking affects the integrity of medical content and a principled basis for deploying watermarks in medical settings.

Link: https://arxiv.org/abs/2509.07755
Authors: Rochana Prih Hastuti, Rian Adam Rajagede, Mansour Al Ghanim, Mengxin Zheng, Qian Lou
Affiliations: University of Central Florida
Subjects: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Accepted at EMNLP 2025 Findings

Abstract:As large language models (LLMs) adapted to sensitive domains such as medicine, their fluency raises safety risks, particularly regarding provenance and accountability. Watermarking embeds detectable patterns to mitigate these risks, yet its reliability in medical contexts remains untested. Existing benchmarks focus on detection-quality tradeoffs, overlooking factual risks under low-entropy settings often exploited by watermarking’s reweighting strategy. We propose a medical-focused evaluation workflow that jointly assesses factual accuracy and coherence. Using GPT-Judger and further human validation, we introduce the Factuality-Weighted Score (FWS), a composite metric prioritizing factual accuracy beyond coherence to guide watermarking deployment in medical domains. Our evaluation shows current watermarking methods substantially compromise medical factuality, with entropy shifts degrading medical entity representation. These findings underscore the need for domain-aware watermarking approaches that preserve the integrity of medical content.
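The abstract names the Factuality-Weighted Score (FWS) without giving its formula, so the weighted form below is purely our assumption of what a factuality-first composite could look like, not the authors' definition:

```python
def factuality_weighted_score(factuality, coherence, w_fact=0.75):
    """Hypothetical factuality-first composite. The weighting is our
    assumption for illustration, NOT the paper's definition of FWS."""
    assert 0.0 <= factuality <= 1.0 and 0.0 <= coherence <= 1.0
    return w_fact * factuality + (1.0 - w_fact) * coherence

# A watermarked output judged factually weaker but still fluent:
print(factuality_weighted_score(factuality=0.62, coherence=0.91))  # 0.6925
```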

[NLP-14] M-BRe: Discovering Training Samples for Relation Extraction from Unlabeled Texts with Large Language Models EMNLP2025

[Quick Read]: This paper addresses the high annotation cost of training data for relation extraction (RE), in particular the challenge of automatically mining high-quality training samples from unlabeled text. The key is the M-BRe framework, whose three modules (Relation Grouping, Relation Extraction, and Label Decision) combine the strengths of multi-class and binary classification: it avoids the difficulty LLMs have in fully capturing every relation's semantics in a multi-class setting, while also avoiding the heavy computational overhead of running a separate binary classifier per relation, enabling efficient discovery of high-quality training samples.

Link: https://arxiv.org/abs/2509.07730
Authors: Zexuan Li, Hongliang Dai, Piji Li
Affiliations: Nanjing University of Aeronautics and Astronautics; MIIT Key Laboratory of Pattern Analysis and Machine Intelligence; Key Laboratory of Brain-Machine Intelligence Technology, Ministry of Education
Subjects: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025 Main Conference

Abstract:For Relation Extraction (RE), the manual annotation of training data may be prohibitively expensive, since the sentences that contain the target relations in texts can be very scarce and difficult to find. It is therefore beneficial to develop an efficient method that can automatically extract training instances from unlabeled texts for training RE models. Recently, large language models (LLMs) have been adopted in various natural language processing tasks, with RE also benefiting from their advances. However, when leveraging LLMs for RE with predefined relation categories, two key challenges arise. First, in a multi-class classification setting, LLMs often struggle to comprehensively capture the semantics of every relation, leading to suboptimal results. Second, although employing binary classification for each relation individually can mitigate this issue, it introduces significant computational overhead, resulting in impractical time complexity for real-world applications. Therefore, this paper proposes a framework called M-BRe to extract training instances from unlabeled texts for RE. It utilizes three modules to combine the advantages of both of the above classification approaches: Relation Grouping, Relation Extraction, and Label Decision. Extensive experiments confirm its superior capability in discovering high-quality training samples from unlabeled texts for RE.

[NLP-15] MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

[Quick Read]: This paper addresses the low question-answering accuracy in multi-page document understanding caused by ignoring logical connections. Converting documents to plain text for large language models (LLMs) discards key multimodal information such as figures; large vision-language models (LVLMs) can handle multimodal content but their limited input length blocks cross-page reasoning; and retrieval-augmented generation (RAG) selects pages by semantic similarity alone, missing the logical structure linking pages to the query. The key of MoLoRAG is a page graph that captures contextual relationships between pages, traversed by a lightweight vision-language model (VLM), so retrieval exploits both semantic and logical relevance. On four DocQA datasets it improves accuracy by 9.68% on average over direct LVLM inference and retrieval precision by 7.44% over baselines.

Link: https://arxiv.org/abs/2509.07666
Authors: Xixi Wu, Yanchao Tan, Nan Hou, Ruiyang Zhang, Hong Cheng
Affiliations: The Chinese University of Hong Kong; Fuzhou University; University of Macau
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: EMNLP Main 2025

Abstract:Document Understanding is a foundational AI capability with broad applications, and Document Question Answering (DocQA) is a key evaluation task. Traditional methods convert the document into text for processing by Large Language Models (LLMs), but this process strips away critical multi-modal information like figures. While Large Vision-Language Models (LVLMs) address this limitation, their constrained input size makes multi-page document comprehension infeasible. Retrieval-augmented generation (RAG) methods mitigate this by selecting relevant pages, but they rely solely on semantic relevance, ignoring logical connections between pages and the query, which is essential for reasoning. To this end, we propose MoLoRAG, a logic-aware retrieval framework for multi-modal, multi-page document understanding. By constructing a page graph that captures contextual relationships between pages, a lightweight VLM performs graph traversal to retrieve relevant pages, including those with logical connections often overlooked. This approach combines semantic and logical relevance to deliver more accurate retrieval. After retrieval, the top-K pages are fed into arbitrary LVLMs for question answering. To enhance flexibility, MoLoRAG offers two variants: a training-free solution for easy deployment and a fine-tuned version to improve logical relevance checking. Experiments on four DocQA datasets demonstrate average improvements of 9.68% in accuracy over LVLM direct inference and 7.44% in retrieval precision over baselines. Codes and datasets are released at this https URL.
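The retrieval step can be caricatured in a few lines: rank pages by embedding similarity, then pull in graph neighbors of the best hit so that logically connected pages survive. This is our toy stand-in for MoLoRAG's VLM-guided graph traversal, with fabricated vectors and edges:

```python
import numpy as np

def retrieve_pages(query_vec, page_vecs, edges, k=3):
    """Rank pages by cosine similarity to the query, then pull in graph
    neighbours of the top hit so that logically connected pages are not
    dropped. A toy stand-in for the VLM-guided graph traversal."""
    sims = page_vecs @ query_vec / (
        np.linalg.norm(page_vecs, axis=1) * np.linalg.norm(query_vec))
    ranked = list(np.argsort(-sims))
    top = ranked[0]
    neighbours = [b for a, b in edges if a == top] + \
                 [a for a, b in edges if b == top]
    picked = []
    for p in [top] + neighbours + ranked[1:]:
        if p not in picked:
            picked.append(int(p))
    return picked[:k]

pages = np.random.randn(6, 16)    # one embedding per document page
query = np.random.randn(16)
graph = [(0, 3), (1, 2), (2, 5)]  # contextual links between pages
print(retrieve_pages(query, pages, graph))
```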

[NLP-16] MaLei at MultiClinSUM: Summarisation of Clinical Documents using Perspective-Aware Iterative Self-Prompting with LLMs

[Quick Read]: This paper addresses inefficient patient-clinician communication caused by lengthy, jargon-heavy clinical case reports, in support of shared decision-making. The core is a perspective-aware iterative self-prompting (PA-ISP) technique: large language models (LLMs) generate and iteratively refine task-specific prompts, guided by example-based few-shot learning, while lexical and embedding-space metrics (ROUGE and BERTScore) steer fine-tuning across epochs. On 3,396 multi-specialty clinical case reports the approach produces summaries with high semantic fidelity (BERTScore 85.46 F1), demonstrating how PA-ISP can be deployed for clinical report summarization.

Link: https://arxiv.org/abs/2509.07622
Authors: Libo Ren, Yee Man Ng, Lifeng Han
Affiliations: University of Manchester; Modul University Vienna; Leiden Institute of Advanced Computer Science; Leiden University; Leiden University Medical Center
Subjects: Computation and Language (cs.CL)
Comments: system paper at CLEF 2025

Abstract:Efficient communication between patients and clinicians plays an important role in shared decision-making. However, clinical reports are often lengthy and filled with clinical jargon, making it difficult for domain experts to identify important aspects in the document efficiently. This paper presents the methodology we applied in the MultiClinSUM shared task for summarising clinical case documents. We used an Iterative Self-Prompting technique on large language models (LLMs) by asking LLMs to generate task-specific prompts and refine them via example-based few-shot learning. Furthermore, we used lexical and embedding space metrics, ROUGE and BERT-score, to guide the model fine-tuning with epochs. Our submission using perspective-aware ISP on GPT-4 and GPT-4o achieved ROUGE scores (46.53, 24.68, 30.77) and BERTscores (87.84, 83.25, 85.46) for (P, R, F1) from the official evaluation on 3,396 clinical case reports from various specialties extracted from open journals. The high BERTscore indicates that the model produced semantically equivalent output summaries compared to the references, even though the overlap at the exact lexicon level is lower, as reflected in the lower ROUGE scores. This work sheds some light on how perspective-aware ISP (PA-ISP) can be deployed for clinical report summarisation and support better communication between patients and clinicians.
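The ISP loop itself is simple to sketch. A hypothetical skeleton (the `llm` callable and prompt wording are ours; a real run would wrap GPT-4/GPT-4o API calls):

```python
def iterative_self_prompting(llm, document, rounds=3):
    """Skeleton of the ISP loop: the model drafts its own task prompt,
    summarises with it, then revises the prompt from its own critique."""
    prompt = llm("Write a prompt for summarising a clinical case report "
                 "for patients, covering diagnosis and treatment.")
    summary = ""
    for _ in range(rounds):
        summary = llm(f"{prompt}\n\nCase report:\n{document}")
        critique = llm(f"Critique this summary for missing clinical "
                       f"aspects:\n{summary}")
        prompt = llm(f"Rewrite the prompt to address the critique.\n"
                     f"Prompt: {prompt}\nCritique: {critique}")
    return summary

# Any text-in/text-out callable works; an echo stub keeps this runnable.
print(iterative_self_prompting(lambda text: text[:80],
                               "A 54-year-old patient presented with ..."))
```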

[NLP-17] BALI: Enhancing Biomedical Language Representations through Knowledge Graph and Language Model Alignment SIGIR2025

[Quick Read]: This paper addresses the limited ability of current biomedical language models to understand complex domain-specific concept structures and the factual information encoded in biomedical knowledge graphs (KGs). The key is BALI (Biomedical Knowledge Graph and Language Model Alignment), a joint pre-training method that learns a dedicated KG encoder while aligning the representation spaces of the language model and the KG, thereby injecting external knowledge into the LM: mentions of biomedical concepts in a text sequence are linked to the Unified Medical Language System (UMLS) KG, and local subgraphs serve as cross-modal positive samples for those mentions. Experiments show the method improves leading biomedical LMs (e.g., PubMedBERT, BioLinkBERT) on a range of language understanding tasks and in entity representation quality, even with minimal pre-training on a small alignment corpus drawn from PubMed abstracts.

Link: https://arxiv.org/abs/2509.07588
Authors: Andrey Sakhovskiy, Elena Tutubalina
Affiliations: Sber AI, Skoltech (Moscow, Russia); AIRI, Sber AI, ISP RAS Research Center for Trusted AI (Moscow, Russia)
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 9 pages, 1 figure, published in "The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2025)"

Abstract:In recent years, there has been substantial progress in using pretrained Language Models (LMs) on a range of tasks aimed at improving the understanding of biomedical texts. Nonetheless, existing biomedical LLMs show limited comprehension of complex, domain-specific concept structures and the factual information encoded in biomedical Knowledge Graphs (KGs). In this work, we propose BALI (Biomedical Knowledge Graph and Language Model Alignment), a novel joint LM and KG pre-training method that augments an LM with external knowledge by the simultaneous learning of a dedicated KG encoder and aligning the representations of both the LM and the graph. For a given textual sequence, we link biomedical concept mentions to the Unified Medical Language System (UMLS) KG and utilize local KG subgraphs as cross-modal positive samples for these mentions. Our empirical findings indicate that implementing our method on several leading biomedical LMs, such as PubMedBERT and BioLinkBERT, improves their performance on a range of language understanding tasks and the quality of entity representations, even with minimal pre-training on a small alignment dataset sourced from PubMed scientific abstracts.
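Aligning LM and KG representation spaces is commonly done contrastively. As a hedged sketch of what such an alignment objective could look like (the paper's exact loss may differ), here is a symmetric InfoNCE over mention/subgraph embedding pairs:

```python
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, graph_emb, tau=0.07):
    """Symmetric InfoNCE over (mention, KG-subgraph) pairs: row i of
    each matrix is one positive pair; all other rows act as in-batch
    negatives. A generic contrastive alignment objective, not
    necessarily BALI's exact formulation."""
    t = F.normalize(text_emb, dim=-1)
    g = F.normalize(graph_emb, dim=-1)
    logits = t @ g.T / tau
    targets = torch.arange(t.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Toy batch: 8 mention embeddings paired with 8 subgraph embeddings.
loss = alignment_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```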

[NLP-18] Avoiding Knowledge Edit Skipping in Multi-hop Question Answering with Guided Decomposition EMNLP

[Quick Read]: This paper addresses the failure of knowledge editing (KE) in large language models (LLMs) on multi-hop question answering, whose core challenge is "edit skipping": the model skips the relevant edited fact during inference. Existing retrieval-augmented generation (RAG)-based methods edit simple knowledge well but struggle with multi-hop reasoning, because the granularity at which LLMs solve problems mismatches the granularity of the facts in the edited memory. The key is IRAKE, an iterative retrieval-augmented knowledge-editing method with guided decomposition, which uses guidance from both single edited facts and entire edited cases to decompose complex questions step by step, mitigating edit skipping and clearly improving KE for multi-hop question answering.

Link: https://arxiv.org/abs/2509.07555
Authors: Yi Liu, Xiangrong Zhu, Xiangyu Liu, Wei Wei, Wei Hu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted in EMNLP Findings 2025

Abstract:In a rapidly evolving world where information updates swiftly, knowledge in large language models (LLMs) becomes outdated quickly. Retraining LLMs is not a cost-effective option, making knowledge editing (KE) without modifying parameters particularly necessary. We find that although existing retrieval-augmented generation (RAG)-based KE methods excel at editing simple knowledge, they struggle with KE in multi-hop question answering due to the issue of “edit skipping”, which refers to skipping the relevant edited fact in inference. In addition to the diversity of natural language expressions of knowledge, edit skipping also arises from the mismatch between the granularity of LLMs in problem-solving and the facts in the edited memory. To address this issue, we propose a novel Iterative Retrieval-Augmented Knowledge Editing method with guided decomposition (IRAKE) through the guidance from single edited facts and entire edited cases. Experimental results demonstrate that IRAKE mitigates the failure of editing caused by edit skipping and outperforms state-of-the-art methods for KE in multi-hop question answering.

[NLP-19] VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

[Quick Read]: This paper addresses the over-execution risk of operating system (OS) agents built on idealized assumptions, which lack reliable decision-making in untrustworthy real-world conditions. The key is a query-driven human-agent-GUI interaction framework that lets an OS agent act autonomously under normal conditions and proactively ask humans for help in untrustworthy scenarios. On top of this framework, VeriOS-Agent is trained with a two-stage learning paradigm that decouples and exploits meta-knowledge, improving the average step-wise success rate in untrustworthy scenarios by 20.64% over the state of the art without hurting normal performance.

Link: https://arxiv.org/abs/2509.07553
Authors: Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
Affiliations: Shanghai Jiao Tong University; OPPO
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent’s rationality, generalizability, and scalability. The codes, datasets and models are available at this https URL.

[NLP-20] Competitive Audio-Language Models with Data-Efficient Single-Stage Training on Public Data

[Quick Read]: This paper addresses the under-explored integration of audio into large language models (LLMs), despite audio's centrality to human communication, and asks how to build strong audio-language models (ALMs) efficiently. The key: combining instruction-tuned LLMs with Whisper encoders and using less than 30K hours of public audio (about 5K unique hours), Falcon3-Audio-7B reaches 64.14 on the MMAU benchmark, matching the best reported open-weight models (on par with R1-AQA) while standing out in data and parameter efficiency, single-stage training, and transparency. Notably, extensive ablations find that common complexities such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors are unnecessary for strong performance, even against models trained on over 500K hours of data, and the smallest 1B model remains competitive with larger open models.

Link: https://arxiv.org/abs/2509.07526
Authors: Gokul Karthik Kumar, Rishabh Saraf, Ludovick Lepauloux, Abdul Muneer, Billel Mokeddem, Hakim Hacid
Affiliations: Technology Innovation Institute
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at ASRU 2025

Abstract:Large language models (LLMs) have transformed NLP, yet their integration with audio remains underexplored – despite audio’s centrality to human communication. We introduce Falcon3-Audio, a family of Audio-Language Models (ALMs) built on instruction-tuned LLMs and Whisper encoders. Using a remarkably small amount of public audio data – less than 30K hours (5K unique) – Falcon3-Audio-7B matches the best reported performance among open-weight models on the MMAU benchmark, with a score of 64.14, matching R1-AQA, while distinguishing itself through superior data and parameter efficiency, single-stage training, and transparency. Notably, our smallest 1B model remains competitive with larger open models ranging from 2B to 13B parameters. Through extensive ablations, we find that common complexities – such as curriculum learning, multiple audio encoders, and intricate cross-attention connectors – are not required for strong performance, even compared to models trained on over 500K hours of data.

[NLP-21] ALLabel: Three-stage Active Learning for LLM-based Entity Recognition using Demonstration Retrieval

[Quick Read]: This paper addresses the high annotation cost of large-scale, high-accuracy entity recognition in scientific domains such as chemistry and materials science. Mainstream approaches rely on fine-tuning, whose annotation cost is prohibitive under a fixed budget. The key is ALLabel, a three-stage active-learning pipeline that systematically selects the most informative and representative samples to build a demonstration corpus for LLM in-context learning. Experiments show that annotating only 5%-10% of the data with ALLabel matches the performance of annotating the entire dataset, sharply cutting annotation cost while preserving accuracy.

Link: https://arxiv.org/abs/2509.07512
Authors: Zihan Chen, Lei Shi, Weize Wu, Qiji Zhou, Yue Zhang
Affiliations: Beihang University; Westlake University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Many contemporary data-driven research efforts in the natural sciences, such as chemistry and materials science, require large-scale, high-performance entity recognition from scientific datasets. Large language models (LLMs) have increasingly been adopted to solve the entity recognition task, with the same trend being observed on all-spectrum NLP tasks. The prevailing entity recognition LLMs rely on fine-tuned technology, yet the fine-tuning process often incurs significant cost. To achieve a best performance-cost trade-off, we propose ALLabel, a three-stage framework designed to select the most informative and representative samples in preparing the demonstrations for LLM modeling. The annotated examples are used to construct a ground-truth retrieval corpus for LLM in-context learning. By sequentially employing three distinct active learning strategies, ALLabel consistently outperforms all baselines under the same annotation budget across three specialized domain datasets. Experimental results also demonstrate that selectively annotating only 5%-10% of the dataset with ALLabel can achieve performance comparable to the method annotating the entire dataset. Further analyses and ablation studies verify the effectiveness and generalizability of our proposal.
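A generic stand-in for this kind of informativeness-plus-representativeness selection (the scoring rule and greedy loop below are ours, not ALLabel's three specific staged strategies):

```python
import numpy as np

def select_demonstrations(embs, uncert, budget=5, alpha=0.5):
    """Greedily pick samples maximising a blend of model uncertainty
    (informativeness) and distance to already-selected points
    (representativeness/diversity)."""
    chosen = []
    for _ in range(budget):
        best, best_score = None, -np.inf
        for i in range(len(embs)):
            if i in chosen:
                continue
            div = (min(np.linalg.norm(embs[i] - embs[j]) for j in chosen)
                   if chosen else 1.0)
            score = alpha * uncert[i] + (1 - alpha) * div
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen

embs = np.random.randn(20, 32)   # candidate sentence embeddings
uncert = np.random.rand(20)      # per-sample uncertainty scores
print(select_demonstrations(embs, uncert))
```

The selected indices would then be annotated and used as the retrieval corpus for in-context demonstrations.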

[NLP-22] Astra: A Multi-Agent System for GPU Kernel Performance Optimization

[Quick Read]: This paper addresses the heavy manual effort in GPU kernel optimization, where efficient CUDA kernels for large language model (LLM) training and serving are costly to develop and hard to automate; compiler-based systems ease some of the burden but still demand substantial manual design and engineering. The key is Astra, the first LLM-based multi-agent system for optimizing existing CUDA code. Unlike prior work that generates CUDA from PyTorch modules, Astra starts from CUDA implementations actually deployed in the SGLang serving framework, and specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and fast. With zero-shot prompting using OpenAI o4-mini, Astra achieves an average 1.32x speedup, and a case study shows the agents autonomously apply loop transformations, memory-access optimizations, CUDA intrinsics, and fast math operations.

Link: https://arxiv.org/abs/2509.07506
Authors: Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, Alex Aiken
Affiliations: Stanford University; Shanghai Jiao Tong University; Nanjing University
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:GPU kernel optimization has long been a central challenge at the intersection of high-performance computing and machine learning. Efficient kernels are crucial for accelerating large language model (LLM) training and serving, yet attaining high performance typically requires extensive manual tuning. Compiler-based systems reduce some of this burden, but still demand substantial manual design and engineering effort. Recently, researchers have explored using LLMs for GPU kernel generation, though prior work has largely focused on translating high-level PyTorch modules into CUDA code. In this work, we introduce Astra, the first LLM-based multi-agent system for GPU kernel optimization. Unlike previous approaches, Astra starts from existing CUDA implementations extracted from SGLang, a widely deployed framework for serving LLMs, rather than treating PyTorch modules as the specification. Within Astra, specialized LLM agents collaborate through iterative code generation, testing, profiling, and planning to produce kernels that are both correct and high-performance. On kernels from SGLang, Astra achieves an average speedup of 1.32x using zero-shot prompting with OpenAI o4-mini. A detailed case study further demonstrates that LLMs can autonomously apply loop transformations, optimize memory access patterns, exploit CUDA intrinsics, and leverage fast math operations to yield substantial performance gains. Our work highlights multi-agent LLM systems as a promising new paradigm for GPU kernel optimization.

[NLP-23] HALT-RAG: A Task-Adaptable Framework for Hallucination Detection with Calibrated NLI Ensembles and Abstention

[Quick Read]: This paper addresses hallucinations in retrieval-augmented generation (RAG), i.e., outputs that contradict or are unsupported by the source text. The key is HALT-RAG, a post-hoc verification system with three parts: (1) a universal feature set from an ensemble of two frozen, off-the-shelf natural language inference (NLI) models plus lightweight lexical signals; (2) a simple, well-calibrated, task-adapted meta-classifier trained on those features; and (3) a strict 5-fold out-of-fold (OOF) training protocol that prevents data leakage and yields unbiased estimates. The method achieves strong F1 scores on the HaluEval benchmark, and its calibrated probabilities support a practical abstention mechanism that balances model performance with safety requirements.

Link: https://arxiv.org/abs/2509.07475
Authors: Saumya Goswami, Siddharth Kurra
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting content that contradicts or is unsupported by a given source text is a critical challenge for the safe deployment of generative language models. We introduce HALT-RAG, a post-hoc verification system designed to identify hallucinations in the outputs of Retrieval-Augmented Generation (RAG) pipelines. Our flexible and task-adaptable framework uses a universal feature set derived from an ensemble of two frozen, off-the-shelf Natural Language Inference (NLI) models and lightweight lexical signals. These features are used to train a simple, calibrated, and task-adapted meta-classifier. Using a rigorous 5-fold out-of-fold (OOF) training protocol to prevent data leakage and produce unbiased estimates, we evaluate our system on the HaluEval benchmark. By pairing our universal feature set with a lightweight, task-adapted classifier and a precision-constrained decision policy, HALT-RAG achieves strong OOF F1-scores of 0.7756, 0.9786, and 0.7391 on the summarization, QA, and dialogue tasks, respectively. The system’s well-calibrated probabilities enable a practical abstention mechanism, providing a reliable tool for balancing model performance with safety requirements.
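The meta-classifier-with-abstention design reduces to a few lines. A toy sketch with synthetic NLI features and illustrative thresholds (in the paper the features come from two frozen NLI models, the classifier is calibrated, and the decision policy is precision-constrained):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature rows: [entail_nli1, contra_nli1, entail_nli2, contra_nli2,
#                lexical_overlap]. Values here are synthetic.
X = np.array([[0.9, 0.05, 0.85, 0.10, 0.7],   # supported
              [0.1, 0.80, 0.15, 0.70, 0.3],   # hallucinated
              [0.8, 0.10, 0.75, 0.20, 0.6],
              [0.2, 0.70, 0.25, 0.60, 0.2]])
y = np.array([0, 1, 0, 1])                    # 1 = hallucination

meta = LogisticRegression().fit(X, y)

def verdict(features, lo=0.35, hi=0.65):
    """Abstain when the probability is uncertain; the thresholds are
    illustrative, not the paper's precision-constrained policy."""
    p = meta.predict_proba([features])[0, 1]
    if lo < p < hi:
        return "abstain", p
    return ("hallucination" if p >= hi else "supported"), p

print(verdict([0.5, 0.4, 0.5, 0.45, 0.5]))
```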

[NLP-24] From Scarcity to Efficiency: Investigating the Effects of Data Augmentation on African Machine Translation

[Quick Read]: This paper addresses the weak machine translation (MT) performance of low-resource African languages. The key is applying two data augmentation techniques, sentence concatenation with back-translation and switch-out, to increase the diversity and volume of training data. Across six African languages the experiments show clear gains, with BLEU improving by at least 25%, confirming that these methods strengthen the robustness of translation systems for low-resource languages.

Link: https://arxiv.org/abs/2509.07471
Authors: Mardiyyah Oduwole, Oluwatosin Olajide, Jamiu Suleiman, Faith Hunja, Busayo Awobade, Fatimo Adebanjo, Comfort Akanni, Chinonyelum Igwe, Peace Ododo, Promise Omoigui, Steven Kolawole, Abraham Owodunni
Affiliations: ML Collective
Subjects: Computation and Language (cs.CL)
Comments: 8 pages, 3 tables. Exploratory work on Data Augmentation for African Machine Translation

Abstract:The linguistic diversity across the African continent presents different challenges and opportunities for machine translation. This study explores the effects of data augmentation techniques in improving translation systems in low-resource African languages. We focus on two data augmentation techniques: sentence concatenation with back translation and switch-out, applying them across six African languages. Our experiments show significant improvements in machine translation performance, with a minimum increase of 25% in BLEU score across all six languages. We provide a comprehensive analysis and highlight the potential of these techniques to improve machine translation systems for low-resource languages, contributing to the development of more robust translation systems for under-resourced languages.
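Both techniques are easy to sketch. A minimal illustration, with a toy vocabulary and sentence pairs standing in for real parallel data:

```python
import random

def switch_out(tokens, vocab, tau=0.2):
    """Replace each token with a random vocabulary word with
    probability tau, in the spirit of switch-out regularisation."""
    return [random.choice(vocab) if random.random() < tau else t
            for t in tokens]

def concat_pairs(src_sents, tgt_sents):
    """Concatenate adjacent parallel sentences into longer training
    pairs; back-translated pairs would be mixed in the same way."""
    return [(" ".join(src_sents[i:i + 2]), " ".join(tgt_sents[i:i + 2]))
            for i in range(0, len(src_sents) - 1, 2)]

vocab = ["maji", "chakula", "nyumba", "shule"]   # toy Swahili words
print(switch_out("habari ya leo".split(), vocab))
print(concat_pairs(["Habari.", "Karibu nyumbani."],
                   ["Hello.", "Welcome home."]))
```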

[NLP-25] Understanding Stigmatizing Language Lexicons: A Comparative Analysis in Clinical Contexts

[Quick Read]: The problem addressed is that healthcare lacks a universally accepted, standardized lexicon defining which words, terms, or phrases count as stigmatizing language, which allows healthcare inequities to persist. The key is a systematic search for existing stigmatizing-language lexicons followed by a comparative analysis of their semantic consistency and of the distribution of positive, negative, and neutral terms, underscoring the need for a standardized lexicon and highlighting the challenges of defining stigmatizing language in clinical text.

Link: https://arxiv.org/abs/2509.07462
Authors: Yiliang Zhou, Di Hu, Tianchu Lyu, Jasmine Dhillon, Alexandra L. Beck, Gelareh Sadigh, Kai Zheng
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Stigmatizing language results in healthcare inequities, yet there is no universally accepted or standardized lexicon defining which words, terms, or phrases constitute stigmatizing language in healthcare. We conducted a systematic search of the literature to identify existing stigmatizing language lexicons and then analyzed them comparatively to examine: 1) similarities and discrepancies between these lexicons, and 2) the distribution of positive, negative, or neutral terms based on an established sentiment dataset. Our search identified four lexicons. The analysis results revealed moderate semantic similarity among them, and that most stigmatizing terms are related to judgmental expressions by clinicians to describe perceived negative behaviors. Sentiment analysis showed a predominant proportion of negatively classified terms, though variations exist across lexicons. Our findings underscore the need for a standardized lexicon and highlight challenges in defining stigmatizing language in clinical texts.

[NLP-26] AIxcellent Vibes at GermEval 2025 Shared Task on Candy Speech Detection: Improving Model Performance by Span-Level Training

[Quick Read]: This paper addresses the automated detection of positive, supportive online communication (candy speech) on social platforms, a problem that remains underexplored and so limits systematic analysis of its social impact. The key is a multilingual XLM-RoBERTa-Large model trained at the span level, which performs best on a German YouTube comment corpus, ranking first in both the binary classification (F1: 0.8906) and fine-grained span detection (strict F1: 0.6307) subtasks. The study suggests that multilingual capability, span-level annotation, and an emoji-aware tokenizer are the core factors behind the detection accuracy.

Link: https://arxiv.org/abs/2509.07459
Authors: Christian Rene Thelen, Patrick Gustav Blaneck, Tobias Bornheim, Niklas Grieger, Stephan Bialonski
Affiliations: FH Aachen University of Applied Sciences; RWTH Aachen University; ORDIX AG; Utrecht University
Subjects: Computation and Language (cs.CL)
Comments: 6 pages, 1 figure, 2 tables

Abstract:Positive, supportive online communication in social media (candy speech) has the potential to foster civility, yet automated detection of such language remains underexplored, limiting systematic analysis of its impact. We investigate how candy speech can be reliably detected in a 46k-comment German YouTube corpus by monolingual and multilingual language models, including GBERT, Qwen3 Embedding, and XLM-RoBERTa. We find that a multilingual XLM-RoBERTa-Large model trained to detect candy speech at the span level outperforms other approaches, ranking first in both the binary detection (positive F1: 0.8906) and categorized span-based detection (strict F1: 0.6307) subtasks at the GermEval 2025 Shared Task on Candy Speech Detection. We speculate that span-based training, multilingual capabilities, and emoji-aware tokenizers improved detection performance. Our results demonstrate the effectiveness of multilingual models in identifying positive, supportive language.
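Strict span-based scoring means a prediction counts only on an exact (start, end, label) match. A small sketch (the category names below are invented):

```python
def strict_span_f1(gold, pred):
    """Strict span-level F1: a predicted span counts only if its
    (start, end, label) triple exactly matches a gold span."""
    gold, pred = set(gold), set(pred)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    precision, recall = tp / len(pred), tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [(0, 14, "positive feedback"), (20, 28, "compliment")]
pred = [(0, 14, "positive feedback"), (19, 28, "compliment")]
print(strict_span_f1(gold, pred))  # 0.5: second span is off by one char
```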

[NLP-27] GLEAM: Learning to Match and Explain in Cross-View Geo-Localization

[Quick Read]: This paper addresses two gaps in cross-view geo-localization (CVGL): existing methods are restricted to a single view or modality, and they lack interpretability, predicting only whether two images correspond without explaining why, which limits trust and broader application. The key lies in two components: GLEAM-C, a foundational model that unifies multiple views and modalities (UAV imagery, street maps, panoramas, and ground photos) by aligning them exclusively with satellite imagery, training efficiently while matching the accuracy of prior modality-specific models; and GLEAM-X, a new task that brings multimodal large language models (MLLMs) into explainable reasoning, pairing cross-view correspondence prediction with natural-language explanations, supported by a bilingual benchmark whose test set is refined through detailed human revision for systematic evaluation of explainable cross-view reasoning. Together they move geo-localization from merely accurate matching toward transparent, trustworthy matching.

Link: https://arxiv.org/abs/2509.07450
Authors: Xudong Lu, Zhi Zheng, Yi Wan, Yongxiang Yao, Annan Wang, Renrui Zhang, Panwang Xia, Qiong Wu, Qingyun Li, Weifeng Lin, Xiangyu Zhao, Xue Yang, Hongsheng Li
Affiliations: The Chinese University of Hong Kong; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; Harbin Institute of Technology
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: 18 pages

Abstract:Cross-View Geo-Localization (CVGL) focuses on identifying correspondences between images captured from distinct perspectives of the same geographical location. However, existing CVGL approaches are typically restricted to a single view or modality, and their direct visual matching strategy lacks interpretability: they merely predict whether two images correspond, without explaining the rationale behind the match. In this paper, we present GLEAM-C, a foundational CVGL model that unifies multiple views and modalities-including UAV imagery, street maps, panoramic views, and ground photographs-by aligning them exclusively with satellite imagery. Our framework enhances training efficiency through optimized implementation while achieving accuracy comparable to prior modality-specific CVGL models through a two-phase training strategy. Moreover, to address the lack of interpretability in traditional CVGL methods, we leverage the reasoning capabilities of multimodal large language models (MLLMs) to propose a new task, GLEAM-X, which combines cross-view correspondence prediction with explainable reasoning. To support this task, we construct a bilingual benchmark using GPT-4o and Doubao-1.5-Thinking-Vision-Pro to generate training and testing data. The test set is further refined through detailed human revision, enabling systematic evaluation of explainable cross-view reasoning and advancing transparency and scalability in geo-localization. Together, GLEAM-C and GLEAM-X form a comprehensive CVGL pipeline that integrates multi-modal, multi-view alignment with interpretable correspondence analysis, unifying accurate cross-view matching with explainable reasoning and advancing Geo-Localization by enabling models to better Explain And Match. Code and datasets used in this work will be made publicly accessible at this https URL.
zh

[NLP-28] Language Self-Play For Data-Free Training

【速读】: 该论文试图解决大语言模型(Large Language Models, LLMs)持续进步所面临的瓶颈问题,即对海量新增训练数据的依赖。为克服这一限制,作者提出了一种基于强化学习的解决方案,其关键在于引入一种博弈论框架下的自对弈机制(Language Self-Play, LSP),通过让模型在竞争性游戏中与自身对抗来提升能力,从而无需额外数据即可实现性能增强。实验表明,预训练模型仅通过自对弈即可显著提升复杂任务的表现,且效果优于依赖数据驱动的基线方法。

链接: https://arxiv.org/abs/2509.07414
作者: Jakub Grudzien Kuba,Mengting Gu,Qi Ma,Yuandong Tian,Vijai Mohan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
备注:

点击查看摘要

Abstract:Large language models (LLMs) have advanced rapidly in recent years, driven by scale, abundant high-quality training data, and reinforcement learning. Yet this progress faces a fundamental bottleneck: the need for ever more data from which models can continue to learn. In this work, we propose a reinforcement learning approach that removes this dependency by enabling models to improve without additional data. Our method leverages a game-theoretic framework of self-play, where a model’s capabilities are cast as performance in a competitive game and stronger policies emerge by having the model play against itself - a process we call Language Self-Play (LSP). Experiments with Llama-3.2-3B-Instruct on instruction-following benchmarks show that pretrained models can not only enhance their performance on challenging tasks through self-play alone, but can also do so more effectively than data-driven baselines.
zh
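为便于理解自对弈训练的基本回路,下面给出一个极简的 Python 示意(纯属假设性草图,并非论文实现;propose_task、solve_task 等名称均为本文虚构的玩具替身,真实系统中两个角色由同一个 LLM 扮演,策略更新采用强化学习):

```python
import random

# Toy stand-ins: in the paper a single LLM plays both the proposer and the
# solver; here plain functions illustrate the shape of the self-play loop.
def propose_task(rng):
    """Proposer role: sample a challenge (here, a toy arithmetic prompt)."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return f"{a}+{b}", a + b

def solve_task(prompt, skill, rng):
    """Solver role: answers correctly with probability `skill`."""
    a, b = map(int, prompt.split("+"))
    return a + b if rng.random() < skill else a + b + rng.choice([-1, 1])

def self_play_round(skill, rng):
    prompt, gold = propose_task(rng)
    answer = solve_task(prompt, skill, rng)
    return 1.0 if answer == gold else 0.0   # game reward, no external data

rng = random.Random(0)
skill = 0.5
for step in range(200):
    reward = self_play_round(skill, rng)
    # Placeholder policy update: nudge solver skill toward observed reward.
    skill += 0.01 * (reward - skill)
print(f"estimated solver skill after self-play: {skill:.2f}")
```

要点在于:奖励完全来自模型与自身的博弈结果,训练过程不消耗任何新增外部数据。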

[NLP-29] LongEmotion: Measuring Emotional Intelligence of Large Language Models in Long-Context Interaction

【速读】: 该论文旨在解决当前情感智能(Emotional Intelligence, EI)评估基准在长上下文场景下对现实交互复杂性、多样性及噪声敏感性的忽视问题,尤其针对长时间、多变且含噪的对话情境中模型表现不足的局限。其解决方案的关键在于提出一个名为LongEmotion的新基准,涵盖情感分类、检测、问答、对话、摘要与表达等多样化任务,平均输入长度达8,777 tokens,并引入两种创新方法:一是检索增强生成(Retrieval-Augmented Generation, RAG),利用对话上下文和大语言模型自身作为检索源,无需依赖外部知识库;二是协作式情感建模(Collaborative Emotional Modeling, CoEM),将任务分解为五个阶段,融合检索增强与有限知识注入。实验表明,这两种方法显著提升了多数长上下文情感任务的表现,推动大语言模型向更贴近实际应用的情感智能方向发展。

链接: https://arxiv.org/abs/2509.07403
作者: Weichu Liu,Jing Xiong,Yuxuan Hu,Zixuan Li,Minghuan Tan,Ningning Mao,Chenyang Zhao,Zhongwei Wan,Chaofan Tao,Wendong Xu,Hui Shen,Chengming Li,Lingpeng Kong,Ngai Wong
机构: University of Science and Technology of China (中国科学技术大学); Alibaba Group (阿里巴巴集团); Tsinghua University (清华大学); Peking University (北京大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); Nanjing University (南京大学); Shanghai Jiao Tong University (上海交通大学); Zhejiang University (浙江大学); Alibaba Cloud (阿里云)
类目: Computation and Language (cs.CL)
备注: Technical Report

点击查看摘要

Abstract:Large language models (LLMs) make significant progress in Emotional Intelligence (EI) and long-context understanding. However, existing benchmarks tend to overlook certain aspects of EI in long-context scenarios, especially under realistic, practical settings where interactions are lengthy, diverse, and often noisy. To move towards such realistic settings, we present LongEmotion, a benchmark specifically designed for long-context EI tasks. It covers a diverse set of tasks, including Emotion Classification, Emotion Detection, Emotion QA, Emotion Conversation, Emotion Summary, and Emotion Expression. On average, the input length for these tasks reaches 8,777 tokens, with long-form generation required for Emotion Expression. To enhance performance under realistic constraints, we incorporate Retrieval-Augmented Generation (RAG) and Collaborative Emotional Modeling (CoEM), and compare them with standard prompt-based methods. Unlike conventional approaches, our RAG method leverages both the conversation context and the large language model itself as retrieval sources, avoiding reliance on external knowledge bases. The CoEM method further improves performance by decomposing the task into five stages, integrating both retrieval augmentation and limited knowledge injection. Experimental results show that both RAG and CoEM consistently enhance EI-related performance across most long-context tasks, advancing LLMs toward more practical and real-world EI applications. Furthermore, we conducted a comparative case study experiment on the GPT series to demonstrate the differences among various models in terms of EI. Code is available on GitHub at this https URL, and the project page can be found at this https URL.
zh
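下面用纯 Python 勾勒"以对话上下文自身作为检索源"的 RAG 思路(仅为示意性草图,非论文实现;此处用词汇重合度代替真实系统中由嵌入或 LLM 完成的检索打分,overlap_score、build_rag_prompt 均为本文虚构的简化组件):

```python
from collections import Counter

def overlap_score(query, chunk):
    """Toy retrieval score: lexical overlap between question and chunk."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    return sum((q & c).values())

def build_rag_prompt(conversation, question, k=2):
    # Split the long conversation into utterance-level chunks and keep the
    # k chunks with the highest overlap with the question -- no external KB.
    chunks = [u for u in conversation if u.strip()]
    ranked = sorted(chunks, key=lambda c: overlap_score(question, c), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Relevant context:\n{context}\n\nQuestion: {question}\nAnswer:"

conversation = [
    "User: I have been feeling anxious about my exams.",
    "Assistant: That sounds stressful. What worries you most?",
    "User: Mostly that I will disappoint my parents.",
]
print(build_rag_prompt(conversation, "What is the user anxious about?"))
```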

[NLP-30] The Role of Exploration Modules in Small Language Models for Knowledge Graph Question Answering ACL2025

【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在基于知识图谱(Knowledge Graphs, KGs)的问答任务中因自身推理和遍历能力有限而导致性能受限的问题。其关键解决方案是引入轻量级且高效的探索模块(exploration modules),用于替代SLMs完成知识图谱的遍历任务,从而显著提升模型在KG-based QA上的表现,同时保持整体架构的简洁性与可扩展性。

链接: https://arxiv.org/abs/2509.07399
作者: Yi-Jie Cheng,Oscar Chew,Yun-Nung Chen
机构: ASUS; National Taiwan University
类目: Computation and Language (cs.CL)
备注: Extended from ACL 2025 SRW

点击查看摘要

Abstract:Integrating knowledge graphs (KGs) into the reasoning processes of large language models (LLMs) has emerged as a promising approach to mitigate hallucination. However, existing work in this area often relies on proprietary or extremely large models, limiting accessibility and scalability. In this study, we investigate the capabilities of existing integration methods for small language models (SLMs) in KG-based question answering and observe that their performance is often constrained by their limited ability to traverse and reason over knowledge graphs. To address this limitation, we propose leveraging simple and efficient exploration modules to handle knowledge graph traversal in place of the language model itself. Experiment results demonstrate that these lightweight modules effectively improve the performance of small language models on knowledge graph question answering tasks. Source code: this https URL.
zh
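下面以一个玩具知识图谱演示"轻量探索模块代替语言模型执行图遍历"的思想(示意性草图,非论文实现;此处用关键词重合度代替真实的边打分策略,图结构与数据均为虚构):

```python
from collections import deque

KG = {  # toy knowledge graph: head -> [(relation, tail), ...]
    "Paris": [("capital_of", "France"), ("located_in", "Europe")],
    "France": [("currency", "Euro"), ("part_of", "EU")],
    "Euro": [("used_in", "Eurozone")],
}

def explore(start, question_keywords, max_hops=2):
    """Keyword-guided BFS: score each edge by overlap with the question and
    return the best triples, sparing the SLM from doing the traversal."""
    found, frontier, seen = [], deque([(start, 0)]), {start}
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_hops:
            continue
        for rel, tail in KG.get(node, []):
            score = sum(kw in rel or kw in tail.lower() for kw in question_keywords)
            found.append((score, (node, rel, tail)))
            if tail not in seen:
                seen.add(tail)
                frontier.append((tail, depth + 1))
    found.sort(key=lambda x: -x[0])
    return [t for _, t in found[:3]]  # top triples handed to the SLM as context

print(explore("Paris", ["currency"]))
```

探索模块只负责挑出候选三元组,小模型仅需在这些三元组之上作答,从而绕开其图遍历能力的短板。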

[NLP-31] Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents

【速读】: 该论文试图解决的问题是:现有对大语言模型(Large Language Models, LLM)语言能力的评估主要集中在词汇、形态规则、句法泛化、语用推理和跨语言迁移等方面,但缺乏对其是否能通过模式识别与交互反馈习得语言这一人类语言习得核心机制的考察。解决方案的关键在于提出了一种新颖的实验框架,其中LLM代理被置于与仅理解新构造语言(Tinkatongue)的对话机器人进行交互的情境中,从而测试其在无先验知识条件下通过互动学习并使用该语言的能力。实验结果表明,尽管LLM代理未能在100轮响应内建立有效对话,但其表现出与人类语言学习策略相似的策略选择,这为未来评估基准的设计和基于交互反馈优化模型学习效率提供了新方向。

链接: https://arxiv.org/abs/2509.07389
作者: Sankalp Tattwadarshi Swain,Anshika Krishnatray,Dhruv Kumar,Jagat Sesh Challa
机构: BITS Pilani(比尔拉理工学院,皮拉尼)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Under review

点击查看摘要

Abstract:Existing evaluation studies on linguistic competence of large language models (LLM agents) have focused primarily on vocabulary learning, morphological rule induction, syntactic generalization, pragmatic inference, and cross-linguistic transfer. However, none assess whether LLM agents can acquire a language through pattern recognition and interactive feedback, a central feature of human language acquisition. We propose a novel experimental framework in which an LLM agent is evaluated on its ability to acquire and use a newly constructed language (Tinkatongue) in conversation with a bot that understands only Tinkatongue. Our findings show that LLM agents fail to establish a conversation within 100 responses, yet they adopt distinct strategies that mirror human approaches to language learning. The results suggest a new direction for evaluation benchmarks and open pathways to model designs that learn more effectively from interactive feedback.
zh

[NLP-32] PersonaFuse: A Personality Activation-Driven Framework for Enhancing Human-LLM Interactions

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在真实社交场景中情感感知与社会适应能力不足的问题,特别是其难以根据不同的社会情境动态调整沟通风格和情绪表达。解决方案的关键在于提出 PersonaFuse 框架,该框架基于特质激活理论(Trait Activation Theory)与大五人格模型(Big Five personality model),采用专家混合(Mixture-of-Expert)架构结合个性化适配器(persona adapters)与动态路由网络(dynamic routing network),实现上下文相关的特质表达,从而显著提升模型的社会情感智能,同时保持通用推理能力和安全性。

链接: https://arxiv.org/abs/2509.07370
作者: Yixuan Tang,Yi Yang,Ahmed Abbasi
机构: The Hong Kong University of Science and Technology (香港科技大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Recent advancements in Large Language Models (LLMs) demonstrate remarkable capabilities across various fields. These developments have led to more direct communication between humans and LLMs in various situations, such as social companionship and psychological support. However, LLMs often exhibit limitations in emotional perception and social competence during real-world conversations. These limitations partly originate from their inability to adapt their communication style and emotional expression to different social and task contexts. In this work, we introduce PersonaFuse, a novel LLM post-training framework that enables LLMs to adapt and express different personalities for varying situations. Inspired by Trait Activation Theory and the Big Five personality model, PersonaFuse employs a Mixture-of-Expert architecture that combines persona adapters with a dynamic routing network, enabling contextual trait expression. Experimental results show that PersonaFuse substantially outperforms baseline models across multiple dimensions of social-emotional intelligence. Importantly, these gains are achieved without sacrificing general reasoning ability or model safety, which remain common limitations of direct prompting and supervised fine-tuning approaches. PersonaFuse also delivers consistent improvements in downstream human-centered applications, such as mental health counseling and review-based customer service. Finally, human preference evaluations against leading LLMs, including GPT-4o and DeepSeek, demonstrate that PersonaFuse achieves competitive response quality despite its comparatively smaller model size. These findings demonstrate that PersonaFuse~offers a theoretically grounded and practical approach for developing social-emotional enhanced LLMs, marking a significant advancement toward more human-centric AI systems.
zh
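下面的 numpy 草图示意"动态路由网络 + 人格适配器"的专家混合结构(假设性实现,非论文代码;实际中适配器通常是插入 Transformer 各层的低秩模块,维度与专家数均为本文假设的玩具值):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_personas = 16, 5  # hidden size; one adapter per Big-Five-style persona

W_router = rng.normal(size=(n_personas, d)) * 0.1
adapters = rng.normal(size=(n_personas, d, d)) * 0.05  # low-rank in practice

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def persona_fuse_layer(h):
    """Route a context representation h to a mixture of persona adapters."""
    gates = softmax(W_router @ h)              # dynamic routing network
    delta = sum(g * (A @ h) for g, A in zip(gates, adapters))
    return h + delta, gates                    # residual trait expression

h = rng.normal(size=d)
out, gates = persona_fuse_layer(h)
print("gate weights per persona:", np.round(gates, 3))
```

路由门控随上下文变化,使同一模型在不同社交情境下表达不同的人格组合。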

[NLP-33] Mitigating Attention Localization in Small Scale: Self-Attention Refinement via One-step Belief Propagation EMNLP2025

【速读】: 该论文旨在解决基于Transformer的自注意力机制在深层模型中出现的定位问题(localization),即注意力集中在有限的token子集上,导致无法有效捕捉长距离依赖关系。解决方案的关键在于提出Self-Attention One-step Belief Propagation (SAOBP) 框架,通过信念传播过程注入多跳(multi-hop)关系以优化注意力分布;同时引入全局token依赖(Global Token Dependency, GTD)来量化和解释这些多跳连接对注意力图的相对贡献,从而在保持任务相关GTD水平的同时缓解熵塌陷(entropy collapse),提升模型性能,尤其在小规模模型中表现出显著优势。

链接: https://arxiv.org/abs/2509.07324
作者: Nakyung Lee,Yeongoon Kim,Minhae Oh,Suhwan Kim,Jin Woo Koo,Hyewon Jo,Jungwoo Lee
机构: Seoul National University (首尔国立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted at EMNLP 2025

点击查看摘要

Abstract:Transformer-based self-attention mechanism serves as the core of modern language models, yet it often suffers from localization, where attentions collapse onto a limited subset of tokens and fail to capture long-range dependencies. To address this issue, we propose Self-Attention One-step Belief Propagation (SAOBP), a refinement framework that injects multi-hop relationships through a belief propagation process. To interpret and quantify these interactions, we introduce Global Token Dependency (GTD) that captures the relative contribution of multihop connections within the attention graph. Empirical results indicate that SAOBP helps prevent entropy collapse in deeper layers and adaptively maintains GTD at task-appropriate levels, thereby supporting improvements in model performance. Importantly, we observe competitive gains in small-scale models, highlighting its potential for improving inference quality in resource-constrained scenarios.
zh
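"一步信念传播细化注意力"可以粗略理解为:把注意力矩阵与其二阶组合混合,从而注入多跳 token 关系。下面是一个玩具化的 numpy 示意(纯属假设性草图;GTD 此处用本文自拟的简化代理量,与论文的精确定义不必一致):

```python
import numpy as np

def row_normalize(M):
    return M / M.sum(axis=1, keepdims=True)

def saobp_refine(A, alpha=0.3):
    """One-step propagation: blend the attention map A with its two-hop
    composition A @ A to inject multi-hop token relationships."""
    return row_normalize((1 - alpha) * A + alpha * (A @ A))

def global_token_dependency_proxy(A, alpha=0.3):
    """Toy proxy (not the paper's exact GTD): average attention mass
    redistributed by the multi-hop term."""
    return float(np.abs(saobp_refine(A, alpha) - A).sum() / A.shape[0])

rng = np.random.default_rng(0)
A = row_normalize(rng.random((6, 6)))   # stand-in self-attention matrix
print("refined rows sum to 1:", np.allclose(saobp_refine(A).sum(axis=1), 1))
print("GTD proxy:", round(global_token_dependency_proxy(A), 4))
```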

[NLP-34] Does This Look Familiar to You? Knowledge Analysis via Model Internal Representations

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine Tuning, SFT)阶段中训练数据选择缺乏系统性方法的问题。现有方法通常依赖于提示工程(prompt engineering),不仅对输入变化敏感,还增加了额外的设计成本,且多局限于多选题类任务。论文提出的解决方案是KAMIR(Knowledge Analysis via Model Internal Representations),其关键在于通过分析模型各层隐藏状态(hidden states)与最终隐藏状态之间的相似性来评估数据质量,从而识别出模型不熟悉的数据样本用于训练。该方法无需复杂提示设计,适用于机器阅读理解、摘要生成等多种任务,并在小样本条件下即可提升模型的泛化性能。

链接: https://arxiv.org/abs/2509.07311
作者: Sihyun Park
机构: Dongguk University (东国大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Recent advances in large language models (LLMs) have been driven by pretraining, supervised fine tuning (SFT), and alignment tuning. Among these, SFT plays a crucial role in transforming a model's general knowledge into structured responses tailored to specific tasks. However, there is no clearly established methodology for effective training data selection. Simply increasing the volume of data does not guarantee performance improvements, while preprocessing, sampling, and validation require substantial time and cost. To address this issue, a variety of data selection methods have been proposed. Among them, knowledge based selection approaches identify suitable training data by analyzing the model's responses. Nevertheless, these methods typically rely on prompt engineering, making them sensitive to variations and incurring additional costs for prompt design. In this study, we propose Knowledge Analysis via Model Internal Representations (KAMIR), a novel approach that overcomes these limitations by analyzing data based on the model's internal representations. KAMIR computes similarities between the hidden states of each layer (block) and the final hidden states for a given input to assess the data. Unlike prior methods that were largely limited to multiple-choice tasks, KAMIR can be applied to a wide range of tasks such as machine reading comprehension and summarization. Moreover, it selects data useful for training based on the model's familiarity with the input, even with a small dataset and a simple classifier architecture. Experiments across diverse task datasets demonstrate that training with less familiar data leads to better generalization performance.
zh
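KAMIR 的核心打分是"各层隐藏状态与末层隐藏状态的相似度"。下面的 numpy 草图演示这一计算(示意性实现;均值池化等细节为本文假设,真实实现需从模型前向传播中取出各层 hidden states):

```python
import numpy as np

def kamir_familiarity(hidden_states):
    """hidden_states: list of per-layer arrays, each (seq_len, d).
    Familiarity = mean cosine similarity between each layer's mean-pooled
    state and the final layer's mean-pooled state."""
    pooled = [h.mean(axis=0) for h in hidden_states]
    final = pooled[-1]
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return float(np.mean([cos(p, final) for p in pooled[:-1]]))

rng = np.random.default_rng(0)
layers = [rng.normal(size=(12, 32)) for _ in range(8)]  # toy 8-layer states
print("familiarity:", round(kamir_familiarity(layers), 4))
# Selection rule suggested by the paper's finding: prefer *less* familiar
# examples (lower score) for SFT, since they generalize better.
```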

[NLP-35] Instance-level Performance Prediction for Long-form Generation Tasks

【速读】: 该论文旨在解决长文本生成任务中实例级性能预测的问题,即在缺乏模型内部结构信息(黑盒)的情况下,准确预测每个生成实例在多维、细粒度质量指标上的连续评分,并量化预测的不确定性。其解决方案的关键在于提出了一种任务、模型和指标无关的建模框架,仅依赖输入和输出即可预测连续评分,并通过预测区间(prediction intervals)来表征不确定性;实验表明,该方法在11个长文本生成任务上,使用最少仅16个训练样本即可实现有效预测,为实际应用提供了可落地的基线方案。

链接: https://arxiv.org/abs/2509.07309
作者: Chi-Yang Hsu,Alexander Braylan,Yiheng Su,Omar Alonso,Matthew Lease
机构: The University of Texas at Austin (得克萨斯大学奥斯汀分校); Amazon (亚马逊)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We motivate and share a new benchmark for instance-level performance prediction of long-form generation tasks having multi-faceted, fine-grained quality metrics. Our task-, model- and metric-agnostic formulation predicts continuous evaluation metric scores given only black-box model inputs and outputs. Beyond predicting point estimates of metric scores, the benchmark also requires inferring prediction intervals to quantify uncertainty around point estimates. Evaluation spans 11 long-form datasets/tasks with multiple LLMs, baselines, and metrics per task. We show that scores can be effectively predicted across long-form generation tasks using as few as 16 training examples. Overall, we introduce a novel and useful task, a valuable benchmark to drive progress, and baselines ready for practical adoption today.
zh

[NLP-36] Basis Vector Metric: A Method for Robust Open-Ended State Change Detection

【速读】: 该论文旨在解决如何利用语言嵌入(language embeddings)有效判断图像中物体状态变化的问题,特别是在细粒度语义区分场景下(如名词与形容词组合的属性变化)。其解决方案的关键在于提出一种基于基向量(Basis Vectors Method, BVM)的新方法,通过构建每个名词类别的特征基向量来捕捉其状态变化模式,并在MIT-States数据集上验证其性能。实验表明,BVM在单独分类每个名词的状态时优于余弦相似度、点积、产品量化、二进制索引、朴素贝叶斯及自定义神经网络等对比方法;而在区分形容词方面虽未显著超越MIT-States原论文提出的逻辑回归模型,但发现了潜在改进方向,提示可通过优化方法论进一步提升准确性。

链接: https://arxiv.org/abs/2509.07308
作者: David Oprea,Sam Powers
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 24 pages

点击查看摘要

Abstract:We test a new method, which we will abbreviate using the acronym BVM (Basis Vectors Method), in its ability to judge the state changes in images through using language embeddings. We used the MIT-States dataset, containing about 53,000 images, to gather all of our data, which has 225 nouns and 115 adjectives, with each noun having about 9 different adjectives, forming approximately 1000 noun-adjective pairs. For our first experiment, we test our method’s ability to determine the state of each noun class separately against other metrics for comparison. These metrics are cosine similarity, dot product, product quantization, binary index, Naive Bayes, and a custom neural network. Among these metrics, we found that our proposed BVM performs the best in classifying the states for each noun. We then perform a second experiment where we try using BVM to determine if it can differentiate adjectives from one another for each adjective separately. We compared the abilities of BVM to differentiate adjectives against the proposed method the MIT-States paper suggests: using a logistic regression model. In the end, we did not find conclusive evidence that our BVM metric could perform better than the logistic regression model at discerning adjectives. Yet, we were able to find evidence for possible improvements to our method; this leads to the chance of increasing our method’s accuracy through certain changes in our methodologies.
zh

[NLP-37] Causal Attention with Lookahead Keys

【速读】: 该论文旨在解决标准因果注意力(causal attention)中每个标记的查询(query)、键(key)和值(value)静态且仅编码前序上下文的问题,这限制了模型在生成当前 token 时对后续信息的利用能力。解决方案的关键在于提出一种名为 CASTLE(CAuSal aTtention with Lookahead kEys)的注意力机制,其核心创新是持续更新每个 token 的键为“前瞻键”(lookahead keys),这些键虽属于较早位置,却融合了相对于该位置更晚出现的 token 信息,同时严格保持自回归性质。尽管该机制看似顺序执行,作者通过数学等价性推导避免了显式存储每个位置的前瞻键,从而实现了高效的并行训练,在语言建模基准测试中显著优于标准因果注意力,表现为更低的验证困惑度及下游任务性能提升。

链接: https://arxiv.org/abs/2509.07301
作者: Zhuoqing Song,Peng Sun,Huizhuo Yuan,Quanquan Gu
机构: ByteDance(字节跳动)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:In standard causal attention, each token’s query, key, and value (QKV) are static and encode only preceding context. We introduce CAuSal aTtention with Lookahead kEys (CASTLE), an attention mechanism that continually updates each token’s keys as the context unfolds. We term these updated keys lookahead keys because they belong to earlier positions yet integrate information from tokens that appear later relative to those positions, while strictly preserving the autoregressive property. Although the mechanism appears sequential, we derive a mathematical equivalence that avoids explicitly materializing lookahead keys at each position and enables efficient parallel training. On language modeling benchmarks, CASTLE consistently outperforms standard causal attention across model scales, reducing validation perplexity and improving performance on a range of downstream tasks.
zh

[NLP-38] ALICE: An Interpretable Neural Architecture for Generalization in Substitution Ciphers

【速读】: 该论文旨在解决神经网络在组合复杂域中的泛化能力问题,具体以替代密码破解(substitution cipher decryption)为任务场景,要求模型从26!种可能的字符映射中无显式访问密钥的情况下进行解密。其关键解决方案是提出ALICE架构——一种仅含编码器的Transformer模型,通过引入基于Gumbel-Sinkhorn方法的双射解码头(bijective decoding head),显式建模置换关系并实现可解释的密钥提取;同时利用早期退出分析揭示了模型预测逐步细化的过程:早期层依赖频率启发式策略,中间层构建词结构,最终层修正单字符错误,这种分阶段推理机制与人类解密策略高度一致,显著提升了模型在极小训练样本(约1500个未见过的密文)下的泛化性能和解释性。

链接: https://arxiv.org/abs/2509.07282
作者: Jeff Shen,Lindsay Smith
机构: Princeton University (普林斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注: Preprint. Project page at this https URL

点击查看摘要

Abstract:We present cryptogram solving as an ideal testbed for studying neural network generalization in combinatorially complex domains. In this task, models must decrypt text encoded with substitution ciphers, choosing from 26! possible mappings without explicit access to the cipher. We develop ALICE (an Architecture for Learning Interpretable Cryptogram dEcipherment): a simple encoder-only Transformer that sets a new state-of-the-art for both accuracy and speed on this decryption problem. Surprisingly, ALICE generalizes to unseen ciphers after training on only ~1500 unique ciphers, a minute fraction (3.7 × 10^-24) of the possible cipher space. To enhance interpretability, we introduce a novel bijective decoding head that explicitly models permutations via the Gumbel-Sinkhorn method, enabling direct extraction of learned cipher mappings. Through early exit analysis, we reveal how ALICE progressively refines its predictions in a way that appears to mirror common human strategies for this task: early layers employ frequency-based heuristics, middle layers form word structures, and final layers correct individual characters. Our architectural innovations and analysis methods extend beyond cryptograms to any domain with bijective mappings and combinatorial structure, offering new insights into neural network generalization and interpretability.
zh
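Gumbel-Sinkhorn 的作用是把一个得分矩阵软化为近似双随机矩阵,从而可微地建模置换(即密钥映射)。下面是该方法的标准 numpy 草图(示意性实现;温度、迭代次数等超参均为假设值):

```python
import numpy as np

def sinkhorn(logits, n_iters=30, tau=0.5):
    """Normalize a score matrix toward a doubly-stochastic matrix by
    alternating row and column normalization in log space."""
    S = logits / tau
    for _ in range(n_iters):
        S = S - np.logaddexp.reduce(S, axis=1, keepdims=True)  # rows
        S = S - np.logaddexp.reduce(S, axis=0, keepdims=True)  # cols
    return np.exp(S)

def gumbel_sinkhorn(logits, rng, tau=0.5):
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel noise
    return sinkhorn(logits + g, tau=tau)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 5))          # e.g. cipher-letter affinity scores
P = gumbel_sinkhorn(logits, rng)
mapping = P.argmax(axis=1)                # hard permutation read-out
print("row sums:", np.round(P.sum(axis=1), 3), "| mapping:", mapping)
```

对 P 逐行取 argmax 即可直接读出模型学到的字符映射,这正是该解码头可解释性的来源。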

[NLP-39] LLM Analysis of 150 years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

【速读】: 该论文旨在解决德国议会辩论中关于移民议题的(反)团结态度((anti-)solidarity)自动化标注难题,传统方法依赖大量人工标注,难以覆盖大规模文本数据。其解决方案的关键在于系统评估多种大语言模型(Large Language Models, LLMs)在标注此类政治话语中的有效性,包括模型规模、提示策略(prompting)、微调(fine-tuning)以及历史与当代数据差异的影响,并通过数千条人工参考标注进行验证。研究发现,LLMs能够在保证较高准确性的前提下显著提升标注效率,从而支持对德国战后至近年移民政策辩论中(反)团结趋势的深度社会科学研究,揭示出战后初期高程度的移民团结性及2015年以来反团结趋势的明显上升。

链接: https://arxiv.org/abs/2509.07274
作者: Aida Kostikova,Ole Pütz,Steffen Eger,Olga Sabelfeld,Benjamin Paassen
机构: Bielefeld University (比勒费尔德大学); University of Technology Nuremberg (纽伦堡工业大学)
类目: Computation and Language (cs.CL); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Migration has been a core topic in German political debate, from millions of expellees post World War II over labor migration to refugee movements in the recent past. Studying political speech regarding such wide-ranging phenomena in depth traditionally required extensive manual annotations, limiting the scope of analysis to small subsets of the data. Large language models (LLMs) have the potential to partially automate even complex annotation tasks. We provide an extensive evaluation of multiple LLMs in annotating (anti-)solidarity subtypes in German parliamentary debates compared to a large set of thousands of human reference annotations (gathered over a year). We evaluate the influence of model size, prompting differences, fine-tuning, historical versus contemporary data; and we investigate systematic errors. Beyond methodological evaluation, we also interpret the resulting annotations from a social science lens, gaining deeper insight into (anti-)solidarity trends towards migrants in the German post-World War II period and recent past. Our data reveals a high degree of migrant-directed solidarity in the postwar period, as well as a strong trend towards anti-solidarity in the German parliament since 2015, motivating further research. These findings highlight the promise of LLMs for political text analysis and the importance of migration debates in Germany, where demographic decline and labor shortages coexist with rising polarization.
zh

[NLP-40] Benchmarking Information Retrieval Models on Complex Retrieval Tasks

【速读】: 该论文旨在解决当前检索模型在复杂检索任务中能力不足的问题,尤其是面对包含多部分、约束或自然语言要求的查询时,现有模型表现有限。其关键解决方案是构建了一个多样且贴近现实的复杂检索任务基准集,并对一系列前沿检索模型进行系统性评估;同时探索了生成式 AI(Generative AI)驱动的查询扩展与重写对检索质量的影响。实验结果表明,即使最优模型在所有任务上的平均 nDCG@10 仅为 0.346,R@100 仅 0.587,说明当前检索模型在处理复杂查询方面仍存在显著瓶颈,且 LLM 增强策略对最强模型反而产生负面效果,凸显了下一代检索模型研发的紧迫性与挑战。

链接: https://arxiv.org/abs/2509.07253
作者: Julian Killingback,Hamed Zamani
机构: University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large language models (LLMs) are incredible and versatile tools for text-based tasks that have enabled countless, previously unimaginable, applications. Retrieval models, in contrast, have not yet seen such capable general-purpose models emerge. To achieve this goal, retrieval models must be able to perform complex retrieval tasks, where queries contain multiple parts, constraints, or requirements in natural language. These tasks represent a natural progression from the simple, single-aspect queries that are used in the vast majority of existing, commonly used evaluation sets. Complex queries naturally arise as people expect search systems to handle more specific and often ambitious information requests, as is demonstrated by how people use LLM-based information systems. Despite the growing desire for retrieval models to expand their capabilities in complex retrieval tasks, there exist limited resources to assess the ability of retrieval models on a comprehensive set of diverse complex tasks. The few resources that do exist feature a limited scope and often lack realistic settings making it hard to know the true capabilities of retrieval models on complex real-world retrieval tasks. To address this shortcoming and spur innovation in next-generation retrieval models, we construct a diverse and realistic set of complex retrieval tasks and benchmark a representative set of state-of-the-art retrieval models. Additionally, we explore the impact of LLM-based query expansion and rewriting on retrieval quality. Our results show that even the best models struggle to produce high-quality retrieval results with the highest average nDCG@10 of only 0.346 and R@100 of only 0.587 across all tasks. Although LLM augmentation can help weaker models, the strongest model has decreased performance across all metrics with all rewriting techniques.
zh
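文中报告的 nDCG@10 与 R@100 可按二值相关性的标准定义计算,下面是一个自包含的 Python 示意(示例查询与文档 ID 均为虚构):

```python
import math

def ndcg_at_k(ranked_ids, relevant, k=10):
    """Binary-relevance nDCG@k for one query."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranked_ids[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def recall_at_k(ranked_ids, relevant, k=100):
    hits = sum(doc in relevant for doc in ranked_ids[:k])
    return hits / len(relevant) if relevant else 0.0

ranked = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d2"}
print("nDCG@10:", round(ndcg_at_k(ranked, relevant), 4))
print("R@100:", recall_at_k(ranked, relevant))
```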

[NLP-41] Neurocognitive Modeling for Text Generation: Deep Learning Architecture for EEG Data

【速读】: 该论文旨在解决基于脑电图(Electroencephalography, EEG)的文本生成技术中数据需求量大、计算资源消耗高且性能受限的问题。其解决方案的关键在于提出一种结合Gemma 2B大语言模型(Large Language Model, LLM)与分类器-LLM架构的新型方法,并引入循环神经网络(Recurrent Neural Network, RNN)编码器,从而显著降低对数据和算力的需求,同时实现接近前沿方法的性能表现,整体性能提升达10%。该架构展示了在数据有限条件下高效迁移学习的可能性,为EEG解码与生成式AI融合提供了新路径,推动了脑机接口在辅助技术中的应用潜力。

链接: https://arxiv.org/abs/2509.07202
作者: Khushiyant
机构: 未知
类目: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
备注: 15 pages, 10 figures, 5 tables

点击查看摘要

Abstract:Text generating capabilities have undergone a substantial transformation with the introduction of large language models (LLMs). Electroencephalography (EEG)-based text production is still difficult, though, because it requires a lot of data and processing power. This paper introduces a new method that combines the use of the Gemma 2B LLM with a classifier-LLM architecture to incorporate a Recurrent Neural Network (RNN) encoder. Our approach drastically lowers the amount of data and compute power needed while achieving performance close to that of cutting-edge methods. Notably, compared to current methodologies, our methodology delivers an overall performance improvement of 10%. The suggested architecture demonstrates the possibility of effective transfer learning for EEG-based text production, remaining strong and functional even in the face of data limits. This work highlights the potential of integrating LLMs with EEG decoding to improve assistive technologies and improve independence and communication for those with severe motor limitations. Our method pushes the limits of present capabilities and opens new paths for research and application in brain-computer interfaces by efficiently using the strengths of pre-trained language models. This makes EEG-based text production more accessible and efficient.
zh

[NLP-42] Rule-Based Moral Principles for Explaining Uncertainty in Natural Language Generation

【速读】: 该论文试图解决在高风险场景中,大型语言模型(Large Language Models, LLMs)生成文本时如何有效解释不确定性的问题,这不仅涉及技术挑战,也关乎伦理责任。现有基于概率的方法往往缺乏透明度且与用户对可解释性的期望不一致。解决方案的关键在于提出一个基于规则的道德原则框架,融合道德心理学和德性伦理学思想,定义了“谨慎”(precaution)、“尊重”(deference)和“责任”(responsibility)等规则,用以指导在认知不确定性(epistemic uncertainty)或随机不确定性(aleatoric uncertainty)下的响应策略;这些规则被编码于轻量级Prolog推理引擎中,根据不确定性水平(低、中、高)触发相应系统行为,并附带自然语言说明,从而提升可信度与可解释性。

链接: https://arxiv.org/abs/2509.07190
作者: Zahra Atf,Peter R Lewis
机构: Ontario Tech University (安大略理工大学)
类目: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注: This paper was accepted for presentation at the 35th IEEE International Conference on Collaborative Advances in Software and Computing. Conference website: this https URL

点击查看摘要

Abstract:Large language models (LLMs) are increasingly used in high-stakes settings, where explaining uncertainty is both technical and ethical. Probabilistic methods are often opaque and misaligned with expectations of transparency. We propose a framework based on rule-based moral principles for handling uncertainty in LLM-generated text. Using insights from moral psychology and virtue ethics, we define rules such as precaution, deference, and responsibility to guide responses under epistemic or aleatoric uncertainty. These rules are encoded in a lightweight Prolog engine, where uncertainty levels (low, medium, high) trigger aligned system actions with plain-language rationales. Scenario-based simulations benchmark rule coverage, fairness, and trust calibration. Use cases in clinical and legal domains illustrate how moral reasoning can improve trust and interpretability. Our approach offers a transparent, lightweight alternative to probabilistic models for socially responsible natural language generation.
zh

[NLP-43] DischargeSim: A Simulation Benchmark for Educational Doctor-Patient Communication at Discharge EMNLP

【速读】: 该论文试图解决当前大型语言模型(Large Language Models, LLMs)在临床场景中对出院教育(discharge education)支持能力评估不足的问题,即现有基准主要聚焦于就诊期间的诊断推理,而忽视了模型在患者离院后提供个性化健康教育的能力。其解决方案的关键在于提出DischargeSim这一新型基准,通过模拟医患多轮对话(DoctorAgents与PatientAgents),在涵盖六类临床相关主题的交互中,从对话质量、个性化文档生成(如自由文本摘要和AHRQ检查清单)以及患者理解度(通过多项选择题测试)三个维度系统评估LLMs的出院教育表现,从而为后就诊阶段的AI辅助医疗教育提供可量化、可比较的评测框架。

链接: https://arxiv.org/abs/2509.07188
作者: Zonghai Yao,Michael Sun,Won Seok Jang,Sunjae Kwon,Soie Kwon,Hong Yu
机构: Center for Healthcare Organization and Implementation Research, VA Bedford Health Care; Manning College of Information and Computer Sciences, UMass Amherst, MA, USA; Miner School of Computer and Information Sciences, UMass Lowell, MA, USA; Department of Internal Medicine, Chung-Ang University, Seoul, Republic of Korea; Department of Medicine, UMass Chan Medical School, Worcester, MA, USA
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Equal contribution for the first two authors. To appear in the proceedings of the Main Conference on Empirical Methods in Natural Language Processing (EMNLP) 2025

点击查看摘要

Abstract:Discharge communication is a critical yet underexplored component of patient care, where the goal shifts from diagnosis to education. While recent large language model (LLM) benchmarks emphasize in-visit diagnostic reasoning, they fail to evaluate models’ ability to support patients after the visit. We introduce DischargeSim, a novel benchmark that evaluates LLMs on their ability to act as personalized discharge educators. DischargeSim simulates post-visit, multi-turn conversations between LLM-driven DoctorAgents and PatientAgents with diverse psychosocial profiles (e.g., health literacy, education, emotion). Interactions are structured across six clinically grounded discharge topics and assessed along three axes: (1) dialogue quality via automatic and LLM-as-judge evaluation, (2) personalized document generation including free-text summaries and structured AHRQ checklists, and (3) patient comprehension through a downstream multiple-choice exam. Experiments across 18 LLMs reveal significant gaps in discharge education capability, with performance varying widely across patient profiles. Notably, model size does not always yield better education outcomes, highlighting trade-offs in strategy use and content prioritization. DischargeSim offers a first step toward benchmarking LLMs in post-visit clinical education and promoting equitable, personalized patient support.
zh

[NLP-44] Towards EnergyGPT: A Large Language Model Specialized for the Energy Sector

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在能源等专业领域中因通用性过强而导致的领域适配性不足问题,即模型缺乏深度技术理解和精准的领域知识。解决方案的关键在于构建一个针对能源领域的专用语言模型——EnergyGPT,通过在高质量、精心筛选的能源文本语料上对LLaMA 3.1-8B模型进行监督微调(Supervised Fine-Tuning),从而显著提升其在能源相关语言理解与生成任务中的表现,且无需依赖大规模计算基础设施即可实现性能优化。

链接: https://arxiv.org/abs/2509.07177
作者: Amal Chebbi,Babajide Kolade
机构: Fitila Labs (Fitila 实验室)
类目: Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Large Language Models have demonstrated impressive capabilities across various domains. However, their general-purpose nature often limits their effectiveness in specialized fields such as energy, where deep technical expertise and precise domain knowledge are essential. In this paper, we introduce EnergyGPT, a domain-specialized language model tailored for the energy sector, developed by fine-tuning LLaMA 3.1-8B model using Supervised Fine-Tuning on a high-quality, curated corpus of energy-related texts. We present a complete development pipeline, including data collection and curation, model fine-tuning, benchmark design and LLM-judge choice, evaluation and deployment. Through this work, we demonstrate that our training strategy enables improvements in domain relevance and performance without the need for large-scale infrastructure. By evaluating the performance of the model using domain-specific question-answering benchmarks, our results demonstrate that EnergyGPT outperforms the base model in most of the energy-related language understanding and generation tasks.
zh

[NLP-45] That's So FETCH: Fashioning Ensemble Techniques for LLM Classification in Civil Legal Intake and Referral

【速读】: 该论文旨在解决法律援助系统中申请人与正确法律资源匹配不当的问题,这可能导致申请人错过关键期限、遭受身体虐待、失去住房或监护权等严重后果。解决方案的关键在于提出并评估了FETCH分类器用于法律问题分类,并采用两种方法提升准确率:一是混合大语言模型(Large Language Model, LLM)与机器学习(Machine Learning, ML)的集成分类方法;二是通过自动生成后续追问问题来丰富初始问题描述,从而提高分类精度。实验基于419个真实非营利律师推荐服务的查询数据集,最终实现hits@2为97.37%的高准确率,优于当前最先进的GPT-5模型,同时显著降低了引导用户获取合适法律资源的成本。

链接: https://arxiv.org/abs/2509.07170
作者: Quinten Steenhuis
机构: Suffolk University Law School (萨福克大学法学院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: Submission to JURIX 2025

点击查看摘要

Abstract:Each year millions of people seek help for their legal problems by calling a legal aid program hotline, walking into a legal aid office, or using a lawyer referral service. The first step to match them to the right help is to identify the legal problem the applicant is experiencing. Misdirection has consequences. Applicants may miss a deadline, experience physical abuse, lose housing or lose custody of children while waiting to connect to the right legal help. We introduce and evaluate the FETCH classifier for legal issue classification and describe two methods for improving accuracy: a hybrid LLM/ML ensemble classification method, and the automatic generation of follow-up questions to enrich the initial problem narrative. We employ a novel data set of 419 real-world queries to a nonprofit lawyer referral service. Ultimately, we show classification accuracy (hits@2) of 97.37% using a mix of inexpensive models, exceeding the performance of the current state-of-the-art GPT-5 model. Our approach shows promise in significantly reducing the cost of guiding users of the legal system to the right resource for their problem while achieving high accuracy.
zh
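下面示意 hits@2 的计算,以及"廉价 ML 模型先筛选、LLM 再定夺"这类混合集成的骨架(玩具草图,非论文实现;ml、llm 均为本文虚构的占位组件):

```python
def hits_at_k(predictions, gold, k=2):
    """predictions: per-query ranked label lists; gold: true labels."""
    return sum(g in p[:k] for p, g in zip(predictions, gold)) / len(gold)

def hybrid_classify(narrative, ml_top_k, llm_pick):
    """Sketch of the ensemble idea: a cheap ML model shortlists candidate
    legal-issue labels, then an LLM picks among the shortlist."""
    shortlist = ml_top_k(narrative)           # e.g. top-3 from TF-IDF + LR
    choice = llm_pick(narrative, shortlist)   # constrained LLM decision
    return [choice] + [l for l in shortlist if l != choice]

# Toy stand-ins for the two components.
ml = lambda text: ["housing", "family", "consumer"]
llm = lambda text, cands: "family" if "custody" in text else cands[0]
ranked = hybrid_classify("I may lose custody of my children", ml, llm)
print(ranked, "| hits@2:", hits_at_k([ranked], ["family"], k=2))
```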

[NLP-46] Beyond Sequential Reranking: Reranker-Guided Search Improves Reasoning Intensive Retrieval

【速读】: 该论文旨在解决传统“检索-重排序”(retrieve-and-rerank)流程中的两个关键问题:一是受限于初始检索阶段 top-k 文档的质量,二是基于大语言模型(LLM)的重排序器因计算开销大而难以处理大量文档。其解决方案的核心在于提出一种名为 Reranker-Guided-Search (RGS) 的新方法,该方法通过直接依据重排序器偏好进行文档检索,而非沿用传统的顺序重排序机制;具体而言,RGS 在近似最近邻算法生成的邻近图(proximity graph)上执行贪心搜索,根据文档相似性策略性地优先选择高潜力文档进行重排序,在固定重排序预算(如100篇文档)下显著提升检索准确率。

链接: https://arxiv.org/abs/2509.07163
作者: Haike Xu,Tong Chen
机构: Massachusetts Institute of Technology (麻省理工学院); University of Washington (华盛顿大学)
类目: Information Retrieval (cs.IR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:The widely used retrieve-and-rerank pipeline faces two critical limitations: it is constrained by the initial retrieval quality of the top-k documents, and the growing computational demands of LLM-based rerankers restrict the number of documents that can be effectively processed. We introduce Reranker-Guided-Search (RGS), a novel approach that bypasses these limitations by directly retrieving documents according to reranker preferences rather than following the traditional sequential reranking method. Our method uses a greedy search on proximity graphs generated by approximate nearest neighbor algorithms, strategically prioritizing promising documents for reranking based on document similarity. Experimental results demonstrate substantial performance improvements across multiple benchmarks: 3.5 points on BRIGHT, 2.9 on FollowIR, and 5.1 on M-BEIR, all within a constrained reranker budget of 100 documents. Our analysis suggests that, given a fixed pair of embedding and reranker models, strategically selecting documents to rerank can significantly improve retrieval accuracy under limited reranker budget.
zh
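下面用 heapq 勾勒"按重排序器偏好引导邻近图贪心搜索"的骨架(示意性草图,非论文实现;图结构、相似度与重排分数均为玩具数据,优先级的具体组合方式为本文假设):

```python
import heapq

def reranker_guided_search(graph, embed_sim, rerank, seeds, budget=10):
    """Best-first traversal of an ANN proximity graph: expand neighbors of
    the documents the reranker liked, instead of reranking a fixed top-k."""
    frontier = [(-embed_sim(d), d) for d in seeds]
    heapq.heapify(frontier)
    scored, seen = {}, set(seeds)
    while frontier and len(scored) < budget:
        _, doc = heapq.heappop(frontier)
        scored[doc] = rerank(doc)                  # spend one reranker call
        for nbr in graph.get(doc, []):
            if nbr not in seen:
                seen.add(nbr)
                # Prioritize neighbors of documents the reranker liked.
                heapq.heappush(frontier, (-(scored[doc] + embed_sim(nbr)), nbr))
    return sorted(scored, key=scored.get, reverse=True)

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}
sim = {"a": .9, "b": .5, "c": .8, "d": .6, "e": .7}.get      # embedding score
true_rel = {"a": .2, "b": .1, "c": .9, "d": .4, "e": .95}.get  # reranker score
print(reranker_guided_search(graph, sim, true_rel, seeds=["a"], budget=4))
```

在这个玩具例子里,重排分数高的节点 c 的邻居 e 被优先展开,即便它在初始嵌入排序中并不靠前。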

[NLP-47] Measuring Uncertainty in Transformer Circuits with Effective Information Consistency

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)中Transformer电路(Transformer Circuits, TCs)在运行时缺乏统一、单次遍历的量化指标来判断其行为是否一致且可信的问题。现有方法虽能识别出功能子图,但无法有效评估这些电路在执行特定算法时的协同性和可靠性。解决方案的关键在于引入有效信息一致性评分(Effective-Information Consistency Score, EICS),该评分融合了两个核心要素:(i) 基于局部雅可比矩阵(local Jacobians)和激活值计算的归一化层析不一致性(normalized sheaf inconsistency),以及(ii) 从同一前向状态推导出的电路级因果涌现(circuit-level causal emergence)的高斯代理有效信息(Gaussian EI proxy)。EICS具有白盒、单次遍历特性,且输出为无量纲数值,从而实现了对TC行为一致性的定量评估。

链接: https://arxiv.org/abs/2509.07149
作者: Anatoly A. Krasnovsky
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Theory (cs.IT)
备注:

点击查看摘要

Abstract:Mechanistic interpretability has identified functional subgraphs within large language models (LLMs), known as Transformer Circuits (TCs), that appear to implement specific algorithms. Yet we lack a formal, single-pass way to quantify when an active circuit is behaving coherently and thus likely trustworthy. Building on prior systems-theoretic proposals, we specialize a sheaf/cohomology and causal emergence perspective to TCs and introduce the Effective-Information Consistency Score (EICS). EICS combines (i) a normalized sheaf inconsistency computed from local Jacobians and activations, with (ii) a Gaussian EI proxy for circuit-level causal emergence derived from the same forward state. The construction is white-box, single-pass, and makes units explicit so that the score is dimensionless. We further provide practical guidance on score interpretation, computational overhead (with fast and exact modes), and a toy sanity-check analysis. Empirical validation on LLM tasks is deferred.
zh

[NLP-48] Toward Purpose-oriented Topic Model Evaluation enabled by Large Language Models

【速读】: 该论文旨在解决传统自动评估指标(如主题一致性与多样性)在动态演化主题模型评估中存在局限性的问题,这些指标往往仅捕捉狭窄的统计模式,难以解释实际应用中的语义失效。其解决方案的关键在于提出一种面向目的的评估框架,利用大型语言模型(Large Language Models, LLMs)构建九个跨四个核心维度的主题质量指标:词汇有效性、主题内语义合理性、主题间结构合理性以及文档-主题对齐合理性。该框架通过对抗性和采样协议验证,并在新闻、学术文献和社交媒体等多种数据集及多种主题建模方法上应用,结果表明LLM-based指标能够提供可解释、鲁棒且任务相关的评估,揭示出传统指标常忽略的主题冗余与语义漂移等关键缺陷。

链接: https://arxiv.org/abs/2509.07142
作者: Zhiyin Tan,Jennifer D’Souza
机构: L3S Research Center (L3S 研究中心); TIB (德国国家信息基础设施)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
备注: Accepted for publication in International Journal on Digital Libraries (IJDL)

点击查看摘要

Abstract:This study presents a framework for automated evaluation of dynamically evolving topic models using Large Language Models (LLMs). Topic modeling is essential for organizing and retrieving scholarly content in digital library systems, helping users navigate complex and evolving knowledge domains. However, widely used automated metrics, such as coherence and diversity, often capture only narrow statistical patterns and fail to explain semantic failures in practice. We introduce a purpose-oriented evaluation framework that employs nine LLM-based metrics spanning four key dimensions of topic quality: lexical validity, intra-topic semantic soundness, inter-topic structural soundness, and document-topic alignment soundness. The framework is validated through adversarial and sampling-based protocols, and is applied across datasets spanning news articles, scholarly publications, and social media posts, as well as multiple topic modeling methods and open-source LLMs. Our analysis shows that LLM-based metrics provide interpretable, robust, and task-relevant assessments, uncovering critical weaknesses in topic models such as redundancy and semantic drift, which are often missed by traditional metrics. These results support the development of scalable, fine-grained evaluation tools for maintaining topic relevance in dynamic datasets. All code and data supporting this work are accessible at this https URL.
zh

[NLP-49] The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties INTERSPEECH2025

【速读】: 该论文旨在解决多语言自动语音识别(ASR)技术在不同语言、口音和方言之间性能分布不均的问题,以推动ASR模型的公平性和包容性。其解决方案的关键在于构建了一个包含200多种语言、口音和方言的新测试套件,并发起Interspeech 2025 ML-SUPERB 2.0挑战赛,通过基于DynaBench的在线评估服务器支持灵活的模型设计与架构创新,从而有效激励社区开发更具泛化能力的多语言语音模型。

链接: https://arxiv.org/abs/2509.07139
作者: William Chen,Chutong Meng,Jiatong Shi,Martijn Bartelds,Shih-Heng Wang,Hsiu-Hsuan Wang,Rafael Mosquera,Sara Hincapie,Dan Jurafsky,Antonis Anastasopoulos,Hung-yi Lee,Karen Livescu,Shinji Watanabe
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Interspeech 2025

点击查看摘要

Abstract:Recent improvements in multilingual ASR have not been equally distributed across languages and language varieties. To advance state-of-the-art (SOTA) ASR models, we present the Interspeech 2025 ML-SUPERB 2.0 Challenge. We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The challenge also introduces an online evaluation server based on DynaBench, allowing for flexibility in model design and architecture for participants. The challenge received 5 submissions from 3 teams, all of which outperformed our baselines. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18% when compared to the best baseline on a general multilingual test set. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy, showing the importance of community challenges in making speech technologies more inclusive.
zh

[NLP-50] MedBench-IT: A Comprehensive Benchmark for Evaluating Large Language Models on Italian Medical Entrance Examinations

【速读】: 该论文旨在解决非英语语言在专业领域(特别是医学教育)中大型语言模型(Large Language Models, LLMs)评估基准稀缺的问题。针对意大利语医学入学考试这一特定场景,研究者构建了首个综合性评测基准MedBench-IT,包含17,410道专家编写的选择题,覆盖生物学、化学、逻辑学、通识文化、数学和物理学六个科目及三个难度层级。解决方案的关键在于:一方面提供高质量、结构化的多学科数据集以支持本地化LLM能力评估;另一方面通过系统性测试(如可复现性分析、排序偏差检验与推理提示效果评估)揭示模型性能与题目可读性之间的统计显著但微弱的负相关关系,从而为EdTech开发者和NLP研究者提供标准化评估方法与实用洞见。

链接: https://arxiv.org/abs/2509.07135
作者: Ruggero Marino Lazzaroni,Alessandro Angioi,Michelangelo Puliga,Davide Sanna,Roberto Marras
机构: University of Graz (格拉茨大学); OnePix Academy
类目: Computation and Language (cs.CL)
备注: Accepted as an oral presentation at CLiC-it 2025

点击查看摘要

Abstract:Large language models (LLMs) show increasing potential in education, yet benchmarks for non-English languages in specialized domains remain scarce. We introduce MedBench-IT, the first comprehensive benchmark for evaluating LLMs on Italian medical university entrance examinations. Sourced from Edizioni Simone, a leading preparatory materials publisher, MedBench-IT comprises 17,410 expert-written multiple-choice questions across six subjects (Biology, Chemistry, Logic, General Culture, Mathematics, Physics) and three difficulty levels. We evaluated diverse models including proprietary LLMs (GPT-4o, Claude series) and resource-efficient open-source alternatives (30B parameters) focusing on practical deployability. Beyond accuracy, we conducted rigorous reproducibility tests (88.86% response consistency, varying by subject), ordering bias analysis (minimal impact), and reasoning prompt evaluation. We also examined correlations between question readability and model performance, finding a statistically significant but small inverse relationship. MedBench-IT provides a crucial resource for the Italian NLP community, EdTech developers, and practitioners, offering insights into current capabilities and standardized evaluation methodology for this critical domain.
zh

[NLP-51] Neuro-Symbolic Frameworks: Conceptual Characterization and Empirical Comparative Analysis

【速读】: 该论文旨在解决当前神经符号(Neurosymbolic, NeSy)框架在实际应用中面临的挑战,即开发者面临的学习曲线陡峭、缺乏用户友好的工具与库以及缺乏统一的框架支持。其解决方案的关键在于系统性地分析现有NeSy框架的技术特性,包括符号表示语言、与神经模型的集成方式及底层算法,并通过展示三个通用框架——DeepProbLog、Scallop和DomiKnowS——来揭示各框架在表达能力上的差异与优势。论文强调,当前多数研究聚焦于算法开发而非提供可声明式问题规范的通用框架,因此提出以结构化评估为基础,推动社区重新思考NeSy建模方法,从而提升复杂问题求解的可靠性与数据效率。

链接: https://arxiv.org/abs/2509.07122
作者: Sania Sinha,Tanawan Premsri,Danial Kamali,Parisa Kordjamshidi
机构: Michigan State University (密歇根州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Symbolic Computation (cs.SC)
备注:

点击查看摘要

Abstract:Neurosymbolic (NeSy) frameworks combine neural representations and learning with symbolic representations and reasoning. Combining the reasoning capacities, explainability, and interpretability of symbolic processing with the flexibility and power of neural computing allows us to solve complex problems with more reliability while being data-efficient. However, this recently growing topic poses a challenge to developers with its learning curve, lack of user-friendly tools, libraries, and unifying frameworks. In this paper, we characterize the technical facets of existing NeSy frameworks, such as the symbolic representation language, integration with neural models, and the underlying algorithms. A majority of the NeSy research focuses on algorithms instead of providing generic frameworks for declarative problem specification to leverage problem solving. To highlight the key aspects of Neurosymbolic modeling, we showcase three generic NeSy frameworks - DeepProbLog, Scallop, and DomiKnowS. We identify the challenges within each facet that lay the foundation for identifying the expressivity of each framework in solving a variety of problems. Building on this foundation, we aim to spark transformative action and encourage the community to rethink this problem in novel ways.
zh

[NLP-52] Instruction Agent : Enhancing Agent with Expert Demonstration

【速读】: 该论文旨在解决当前图形用户界面(GUI)代理在处理涉及新颖UI元素、长周期动作序列以及个性化操作轨迹的复杂任务时表现不佳的问题。解决方案的关键在于提出Instruction Agent,该代理通过利用专家示范来提取分步指令,并严格遵循用户意图的执行轨迹,从而避免执行过程中的错误;同时引入验证器(verifier)和回溯器(backtracker)模块,以理解每一步操作后的结果并应对执行过程中意外中断(如弹窗出现),显著提升了任务成功率,在OSWorld基准上实现了60%的成功率,超越了所有顶级现有代理。

链接: https://arxiv.org/abs/2509.07098
作者: Yinheng Li,Hailey Hultquist,Justin Wagle,Kazuhito Koishida
机构: Microsoft(微软)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Graphical user interface (GUI) agents have advanced rapidly but still struggle with complex tasks involving novel UI elements, long-horizon actions, and personalized trajectories. In this work, we introduce Instruction Agent, a GUI agent that leverages expert demonstrations to solve such tasks, enabling completion of otherwise difficult workflows. Given a single demonstration, the agent extracts step-by-step instructions and executes them by strictly following the trajectory intended by the user, which avoids making mistakes during execution. The agent leverages the verifier and backtracker modules further to improve robustness. Both modules are critical to understand the current outcome from each action and handle unexpected interruptions(such as pop-up windows) during execution. Our experiments show that Instruction Agent achieves a 60% success rate on a set of tasks in OSWorld that all top-ranked agents failed to complete. The Instruction Agent offers a practical and extensible framework, bridging the gap between current GUI agents and reliable real-world GUI task automation.
zh

[NLP-53] From Eigenmodes to Proofs: Integrating Graph Spectral Operators with Symbolic Interpretable Reasoning

【速读】: 该论文旨在解决当前神经符号推理系统在可解释性、可扩展性和鲁棒性之间的权衡问题,尤其是如何将符号逻辑的严谨性与谱学习(spectral learning)的高效性相结合。其核心解决方案是提出Spectral NSR框架,关键在于将逻辑规则嵌入为频谱模板(spectral templates),并在图谱域(graph spectral domain)中直接执行推理,利用知识图谱的拉普拉斯特征结构设计频率选择性滤波器(frequency-selective filters),从而实现符号推理的可解释性与谱方法的可扩展性统一。此架构通过多项增强技术如动态图学习、混合频谱专家、证明引导训练及不确定性量化等,显著提升了模型的准确性、推理速度、对抗鲁棒性与领域迁移能力。

链接: https://arxiv.org/abs/2509.07017
作者: Andrew Kiruluta,Priscilla Burity
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:We introduce Spectral NSR, a fully spectral neuro-symbolic reasoning framework that embeds logical rules as spectral templates and performs inference directly in the graph spectral domain. By leveraging graph signal processing (GSP) and frequency-selective filters grounded in the Laplacian eigenstructure of knowledge graphs, the architecture unifies the interpretability of symbolic reasoning with the scalability and adaptability of spectral learning. Beyond the core formulation, we incorporate a comprehensive set of extensions, including dynamic graph and basis learning, rational and diffusion filters for sharper spectral selectivity, mixture-of-spectral-experts for modular specialization, proof-guided training with spectral curricula, and uncertainty quantification for calibrated confidence. Additional enhancements such as large language model coupling, co-spectral transfer alignment, adversarial robustness, efficient GPU kernels, generalized Laplacians, and causal interventions further expand the versatility of the framework. Empirical evaluation on state-of-the-art reasoning benchmarks such as ProofWriter and CLUTRR demonstrates that Spectral NSR achieves superior accuracy, faster inference, improved robustness to adversarial perturbations, and higher interpretability compared to leading baselines including transformers, message-passing neural networks, and neuro-symbolic logic programming systems. Spectral attribution and proof-band agreement analyses confirm that model decisions align closely with symbolic proof structures, while transfer experiments validate effective domain adaptation through co-spectral alignment. These results establish Spectral NSR as a scalable and principled foundation for the next generation of reasoning systems, offering transparency, robustness, and generalization beyond conventional approaches.
zh
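"在图谱域直接推理"的基本操作是:在图拉普拉斯特征基上对节点信号施加频率选择性滤波器。下面的 numpy 草图演示这一机制(示意性实现;滤波器形状取假设的扩散式低通,邻接矩阵为虚构的四节点玩具图):

```python
import numpy as np

def graph_laplacian(A):
    return np.diag(A.sum(axis=1)) - A

def spectral_filter(A, signal, h):
    """Apply a frequency-selective filter h(lambda) to a node signal in the
    Laplacian eigenbasis: x_filtered = U h(Lambda) U^T x."""
    lam, U = np.linalg.eigh(graph_laplacian(A))
    return U @ (h(lam) * (U.T @ signal))

# Toy 4-node knowledge-graph adjacency and a one-hot "query" signal.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
low_pass = lambda lam: np.exp(-1.0 * lam)   # diffusion-style rule template
print(np.round(spectral_filter(A, x, low_pass), 3))
```

论文中的"频谱模板"可理解为一组这样的 h(λ),每条逻辑规则对应一种频率响应形状。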

[NLP-54] ArGen: Auto-Regulation of Generative AI via GRPO and Policy-as-Code

【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在复杂、多维度治理规则下的对齐难题,这些规则涵盖伦理原则、操作安全协议及合规标准等。传统偏好对齐方法难以满足高度结构化和可验证的政策约束,尤其在跨文化价值体系中面临挑战。解决方案的关键在于提出ArGen(Auto-Regulation of Generative AI systems)框架,其核心创新包括:基于原则的自动化奖励评分机制、群体相对策略优化(Group Relative Policy Optimisation, GRPO)以及受开放策略代理(Open Policy Agent, OPA)启发的治理层,从而实现LLM对多元且细粒度政策的精确遵循与可验证合规性。

链接: https://arxiv.org/abs/2509.07006
作者: Kapil Madan
机构: Principled Evolution
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 53 pages, 7 figures, 8 tables. Open-source implementation available at: this https URL . Work explores the integration of policy-as-code for AI alignment, with a case study in culturally-nuanced, ethical AI using Dharmic principles

点击查看摘要

Abstract:This paper introduces ArGen (Auto-Regulation of Generative AI systems), a framework for aligning Large Language Models (LLMs) with complex sets of configurable, machine-readable rules spanning ethical principles, operational safety protocols, and regulatory compliance standards. Moving beyond just preference-based alignment, ArGen is designed to ensure LLMs adhere to these multifaceted policies through a novel synthesis of principle-based automated reward scoring, Group Relative Policy Optimisation (GRPO), and an Open Policy Agent (OPA) inspired governance layer. This approach provides the technical foundation for achieving and demonstrating compliance with diverse and nuanced governance requirements. To showcase the framework's capability to operationalize a deeply nuanced and culturally-specific value system, we present an in-depth case study: the development of a medical AI assistant guided by principles from Dharmic ethics (such as Ahimsa and Dharma), as derived from texts like the Bhagavad Gita. This challenging application demonstrates ArGen's adaptability, achieving a 70.9% improvement in domain-scope adherence over the baseline. Through our open-source repository, we show that ArGen's methodology offers a path to 'Governable AI' systems that are technically proficient, ethically robust, and verifiably compliant for safe deployment in diverse global contexts.
zh

[NLP-55] VLMs-in-the-Wild: Bridging the Gap Between Academic Benchmarks and Enterprise Reality

【速读】: 该论文旨在解决当前开放源代码视觉语言模型(Vision-Language Models, VLMs)在学术评估与企业实际部署需求之间存在的显著脱节问题。现有基准测试主要依赖多项选择题和合成数据,无法反映真实业务场景(如社交媒体内容分析)的复杂性。为此,作者提出VLM-in-the-Wild(ViLD)框架,通过定义十项企业关键任务(如Logo检测、OCR、对象检测、人类活动与外观分析等)来系统化评估VLMs在实际应用中的表现。其核心创新在于提出的BlockWeaver算法,该算法无需依赖嵌入或大语言模型(LLM),即可高效可靠地比较无序且分组不固定的OCR输出,从而显著提升评估效率与准确性。此外,研究构建了一个包含7,500个多样化样本的真实世界数据集,并结合语义匹配(基于嵌入和LLM作为裁判)、传统指标及新提出的完整性与忠实度测量方法,为VLMs在企业环境中的部署提供可操作的洞察。

链接: https://arxiv.org/abs/2509.06994
作者: Srihari Bandraupalli,Anupam Purwar
机构: Sprinklr(斯普林克尔)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:Open-source Vision-Language Models show immense promise for enterprise applications, yet a critical disconnect exists between academic evaluation and enterprise deployment requirements. Current benchmarks rely heavily on multiple-choice questions and synthetic data, failing to capture the complexity of real-world business applications like social media content analysis. This paper introduces VLM-in-the-Wild (ViLD), a comprehensive framework to bridge this gap by evaluating VLMs on operational enterprise requirements. We define ten business-critical tasks: logo detection, OCR, object detection, human presence and demographic analysis, human activity and appearance analysis, scene detection, camera perspective and media quality assessment, dominant colors, comprehensive description, and NSFW detection. To this framework, we bring an innovative BlockWeaver Algorithm that solves the challenging problem of comparing unordered, variably-grouped OCR outputs from VLMs without relying on embeddings or LLMs, achieving remarkable speed and reliability. To demonstrate the efficacy of ViLD, we constructed a new benchmark dataset of 7,500 diverse samples, carefully stratified from a corpus of one million real-world images and videos. ViLD provides actionable insights by combining semantic matching (both embedding-based and LLM-as-a-judge approaches), traditional metrics, and novel methods to measure the completeness and faithfulness of descriptive outputs. By benchmarking leading open-source VLMs (Qwen, MIMO, and InternVL) against a powerful proprietary baseline as per the ViLD framework, we provide one of the first industry-grounded, task-driven assessments of VLM capabilities, offering actionable insights for their deployment in enterprise environments.
zh

[NLP-56] CARE: Decoding Time Safety Alignment via Rollback and Introspection Intervention

【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在实际应用中解码阶段输出安全性不足的问题,尤其针对现有解码时干预方法(如对比解码)常导致安全性和响应质量之间存在严重权衡的挑战。其解决方案的关键在于提出一种名为CARE的新型解码时安全对齐框架,该框架整合三项核心技术:(1)一个用于实时安全监控的防护模型(guard model),实现对潜在不安全内容的精准检测;(2)基于令牌缓冲区的回滚机制(rollback mechanism),可在早期阶段高效纠正不安全输出而不显著影响用户体验;(3)一种基于内省的干预策略(introspection-based intervention strategy),通过模型生成对其先前输出的自我反思性批评,并将这些反思纳入上下文以指导后续解码步骤,从而实现有效的自我修正。该框架通过上述组件协同作用,在保障高安全性的同时最小化对响应质量与用户体验的影响。

链接: https://arxiv.org/abs/2509.06982
作者: Xiaomeng Hu,Fei Huang,Chenhan Yuan,Junyang Lin,Tsung-Yi Ho
机构: Qwen Team, Alibaba Group (通义实验室,阿里巴巴集团); The Chinese University of Hong Kong (香港中文大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:

点击查看摘要

Abstract:As large language models (LLMs) are increasingly deployed in real-world applications, ensuring the safety of their outputs during decoding has become a critical challenge. However, existing decoding-time interventions, such as Contrastive Decoding, often force a severe trade-off between safety and response quality. In this work, we propose CARE, a novel framework for decoding-time safety alignment that integrates three key components: (1) a guard model for real-time safety monitoring, enabling detection of potentially unsafe content; (2) a rollback mechanism with a token buffer to correct unsafe outputs efficiently at an earlier stage without disrupting the user experience; and (3) a novel introspection-based intervention strategy, where the model generates self-reflective critiques of its previous outputs and incorporates these reflections into the context to guide subsequent decoding steps. The framework achieves a superior safety-quality trade-off by using its guard model for precise interventions, its rollback mechanism for timely corrections, and our novel introspection method for effective self-correction. Experimental results demonstrate that our framework achieves a superior balance of safety, quality, and efficiency, attaining a low harmful response rate and minimal disruption to the user experience while maintaining high response quality.
zh
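
下面用纯 Python 给出一个玩具级示意,演示 CARE 所述“防护模型监控 + token 缓冲回滚 + 内省批评注入上下文”三个组件如何协同;其中的生成器与防护函数均为假设的桩实现,仅用于说明控制流。

```python
import random
random.seed(0)

BUFFER = 4  # tokens held back before being committed to the user

def toy_generate_token(context: str) -> str:
    """Stand-in for LLM next-token sampling (hypothetical)."""
    vocab = ["hello", "world", "attack", "plan", "peace", "safely"]
    return random.choice(vocab)

def toy_guard_unsafe(text: str) -> bool:
    """Stand-in guard model: flags a buffered span as unsafe."""
    return "attack" in text

def care_decode(prompt: str, max_tokens: int = 16) -> str:
    context, committed, buffer = prompt, [], []
    for _ in range(max_tokens):
        buffer.append(toy_generate_token(context + " ".join(committed + buffer)))
        if len(buffer) >= BUFFER:
            span = " ".join(buffer)
            if toy_guard_unsafe(span):
                # Rollback: discard the unsafe buffered span, then steer future
                # steps with a self-reflective critique added to the context.
                critique = f"[Introspection] The draft '{span}' was unsafe; respond safely."
                context += " " + critique
                buffer = []
            else:
                committed.extend(buffer)
                buffer = []
    committed.extend(buffer)
    return " ".join(committed)

print(care_decode("User asks a question."))
```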

计算机视觉

[CV-0] Visual Representation Alignment for Multimodal Large Language Models

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在视觉主导任务(如物体计数和空间推理)中表现不足的问题。作者指出,当前主流的纯文本监督范式对视觉路径提供的是间接指导,导致模型在训练过程中丢弃细粒度视觉信息。解决方案的关键在于提出一种名为VIRAL(VIsual Representation ALignment)的正则化策略,通过显式对齐MLLM内部视觉表示与预训练视觉基础模型(Vision Foundation Models, VFMs)的特征表示,使模型既能保留来自输入视觉编码器的关键视觉细节,又能从VFMs中补充额外的视觉知识,从而提升其对复杂视觉输入的推理能力。

链接: https://arxiv.org/abs/2509.07979
作者: Heeji Yoon,Jaewoo Jung,Junwan Kim,Hyungyu Choi,Heeseong Shin,Sangbeom Lim,Honggyu An,Chaehyun Kim,Jisang Han,Donghyun Kim,Chanho Eom,Sunghwan Hong,Seungryong Kim
机构: KAIST AI(韩国科学技术院人工智能); NYU(纽约大学); Chung-Ang University(中央大学); Korea University(高丽大学); ETH Zürich(苏黎世联邦理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL

点击查看摘要

Abstract:Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
zh
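
以下为该对齐正则项的一种可能写法(PyTorch):将 MLLM 内部视觉 token 投影到 VFM 特征空间后,以余弦相似度约束二者对齐。投影头与具体损失形式为笔者假设,论文未必采用完全相同的公式。

```python
import torch
import torch.nn.functional as F

def viral_alignment_loss(mllm_visual_tokens: torch.Tensor,
                         vfm_features: torch.Tensor,
                         proj: torch.nn.Module) -> torch.Tensor:
    """Regularizer aligning MLLM internal visual tokens with frozen VFM features.

    mllm_visual_tokens: (B, N, D_mllm); vfm_features: (B, N, D_vfm), kept frozen.
    """
    aligned = proj(mllm_visual_tokens)                       # (B, N, D_vfm)
    cos = F.cosine_similarity(aligned, vfm_features.detach(), dim=-1)
    return (1.0 - cos).mean()

proj = torch.nn.Linear(512, 768)  # hypothetical widths: MLLM 512 -> VFM 768
loss = viral_alignment_loss(torch.randn(2, 16, 512), torch.randn(2, 16, 768), proj)
print(loss.item())
```

实际训练中,该项通常以一个权重系数加到原有的文本监督损失上,作为附加正则。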

[CV-1] One View Many Worlds: Single-Image to 3D Object Meets Generative Domain Randomization for One-Shot 6D Pose Estimation

【速读】:该论文旨在解决从单张参考图像中估计任意未见物体的6D位姿(即三维位置和姿态)这一挑战性问题,该任务对机器人在真实世界长尾场景中的操作至关重要。现有方法面临三大难题:缺乏3D模型、单视角重建无法获得度量尺度,以及合成数据与真实图像之间的域差异导致鲁棒性不足。解决方案的关键在于提出OnePoseViaGen框架,其核心包含两个创新组件:一是粗到精的对齐模块,通过多视图特征匹配与渲染-比对优化联合精调尺度与位姿;二是文本引导的生成式域随机化策略,通过多样化纹理增强合成数据多样性,从而有效微调位姿估计器。该方案实现了高保真单视角3D重建,并显著提升了单样本6D位姿估计的准确性与实用性,在多个基准测试(YCBInEOAT、Toyota-Light、LM-O)上达到当前最优性能,且已在真实机器人手抓取任务中验证了其有效性。

链接: https://arxiv.org/abs/2509.07978
作者: Zheng Geng,Nan Wang,Shaocong Xu,Chongjie Ye,Bohan Li,Zhaoxi Chen,Sida Peng,Hao Zhao
机构: Beijing Academy of Artificial Intelligence, BAAI; Zhejiang University; Institute for AI Industry Research (AIR), Tsinghua University; Nanyang Technological University; FNii, The Chinese University of Hong Kong, Shenzhen; Shanghai Jiao Tong University; Eastern Institute of Technology, Ningbo
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: CoRL 2025 Oral, Project page: this https URL

点击查看摘要

Abstract:Estimating the 6D pose of arbitrary unseen objects from a single reference image is critical for robotics operating in the long-tail of real-world instances. However, this setting is notoriously challenging: 3D models are rarely available, single-view reconstructions lack metric scale, and domain gaps between generated models and real-world images undermine robustness. We propose OnePoseViaGen, a pipeline that tackles these challenges through two key components. First, a coarse-to-fine alignment module jointly refines scale and pose by combining multi-view feature matching with render-and-compare refinement. Second, a text-guided generative domain randomization strategy diversifies textures, enabling effective fine-tuning of pose estimators with synthetic data. Together, these steps allow high-fidelity single-view 3D generation to support reliable one-shot 6D pose estimation. On challenging benchmarks (YCBInEOAT, Toyota-Light, LM-O), OnePoseViaGen achieves state-of-the-art performance far surpassing prior approaches. We further demonstrate robust dexterous grasping with a real robot hand, validating the practicality of our method in real-world manipulation. Project page: this https URL
zh

[CV-2] Feature Space Analysis by Guided Diffusion Model

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)中特征提取过程的黑箱特性问题,特别是在视觉相关领域,如何明确理解DNN对图像中不同属性的编码机制。其解决方案的关键在于提出了一种可生成图像的解码器(decoder),该解码器能确保生成图像的特征与用户指定的目标特征高度接近。该解码器基于预训练扩散模型构建,通过引导反向图像生成过程,最小化每一步生成图像特征与目标特征之间的欧氏距离,从而实现对DNN特征空间的可解释性分析。此方法无需额外训练即可适用于多种DNN架构(如CLIP图像编码器、ResNet-50和视觉Transformer),且可在单张消费级GPU(COTS GPU)上高效运行。

链接: https://arxiv.org/abs/2509.07936
作者: Kimiaki Shirahama,Miki Yanobu,Kaduki Yamashita,Miho Ohsaki
机构: Doshisha University (同志社大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注: 19 pages, 13 figures, codes: this https URL

点击查看摘要

Abstract:One of the key issues in Deep Neural Networks (DNNs) is the black-box nature of their internal feature extraction process. Targeting vision-related domains, this paper focuses on analysing the feature space of a DNN by proposing a decoder that can generate images whose features are guaranteed to closely match a user-specified feature. Owing to this guarantee that is missed in past studies, our decoder allows us to evidence which of various attributes in an image are encoded into a feature by the DNN, by generating images whose features are in proximity to that feature. Our decoder is implemented as a guided diffusion model that guides the reverse image generation of a pre-trained diffusion model to minimise the Euclidean distance between the feature of a clean image estimated at each step and the user-specified feature. One practical advantage of our decoder is that it can analyse feature spaces of different DNNs with no additional training and run on a single COTS GPU. The experimental results targeting CLIP’s image encoder, ResNet-50 and vision transformer demonstrate that images generated by our decoder have features remarkably similar to the user-specified ones and reveal valuable insights into these DNNs’ feature spaces.
zh
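
下面给出“特征引导”一步更新的简化示意(PyTorch):先按 DDPM 公式由 x_t 估计干净图像 x0_hat,再对其特征与目标特征的欧氏距离求梯度并回推到 x_t,这正是摘要所述的引导思想。噪声预测器与特征提取器均为演示用桩函数。

```python
import torch

def guided_update(x_t, t, eps_model, feat_extractor, target_feat,
                  alpha_bar, guidance_scale=1.0):
    """Nudge x_t so that the feature of the estimated clean image x0_hat
    moves toward the user-specified target feature (one guidance step)."""
    x_t = x_t.detach().requires_grad_(True)
    a = alpha_bar[t]
    x0_hat = (x_t - (1 - a).sqrt() * eps_model(x_t, t)) / a.sqrt()  # DDPM x0 estimate
    dist = torch.linalg.vector_norm(feat_extractor(x0_hat) - target_feat)
    grad = torch.autograd.grad(dist, x_t)[0]
    return (x_t - guidance_scale * grad).detach()

# Toy stand-ins: a zero noise predictor and a channel-mean "feature extractor".
eps_model = lambda x, t: torch.zeros_like(x)
feat_extractor = lambda x: x.mean(dim=(2, 3))
alpha_bar = torch.linspace(0.99, 0.01, 100)
x_t = torch.randn(1, 3, 8, 8)
x_t = guided_update(x_t, 50, eps_model, feat_extractor, torch.zeros(1, 3), alpha_bar)
print(x_t.shape)
```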

[CV-3] Dynamic Scene 3D Reconstruction of an Uncooperative Resident Space Object

【速读】:该论文旨在解决在轨服务(On-Orbit Servicing, OOS)与主动碎片清除(Active Debris Removal, ADR)任务中,对非合作空间目标(Uncooperative Resident Space Objects, RSO)进行几何与运动特性表征的难题,特别是针对翻滚状态下的动态场景实现高保真度的三维重建。其解决方案的关键在于构建一个基于Isaac Sim的物理精确仿真环境,生成在真实轨道光照条件下具有动态特性的2D图像序列,并评估现有先进3D重建算法(如Neuralangelo)在动态场景中的几何准确性。初步结果表明,该方法可生成与原始CAD模型高度一致的3D网格,误差和伪影最小,且能保留关键细节,为后续动态场景重建评估提供了可靠基准。

链接: https://arxiv.org/abs/2509.07932
作者: Bala Prenith Reddy Gopu,Timothy Jacob Huber,George M. Nehma,Patrick Quinn,Madhur Tiwari,Matt Ueckermann,David Hinckley,Christopher McKenna
机构: Florida Institute of Technology (佛罗里达理工学院); Creare LLC
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Characterization of uncooperative Resident Space Objects (RSO) plays a crucial role in On-Orbit Servicing (OOS) and Active Debris Removal (ADR) missions to assess the geometry and motion properties. To address the challenges of reconstructing tumbling uncooperative targets, this study evaluates the performance of existing state-of-the-art 3D reconstruction algorithms for dynamic scenes, focusing on their ability to generate geometrically accurate models with high fidelity. To support our evaluation, we developed a simulation environment using Isaac Sim to generate physics-accurate 2D image sequences of a tumbling satellite under realistic orbital lighting conditions. Our preliminary results on static scenes using Neuralangelo demonstrate promising reconstruction quality. The generated 3D meshes closely match the original CAD models with minimal errors and artifacts when compared using CloudCompare (CC). The reconstructed models were able to capture critical fine details for mission planning. This provides a baseline for our ongoing evaluation of dynamic scene reconstruction.
zh

[CV-4] Accelerating Local AI on Consumer GPUs: A Hardware-Aware Dynamic Strategy for YOLOv10s

【速读】:该论文旨在解决当前生成式AI(Generative AI)在消费级硬件上部署时存在的性能瓶颈问题,即模型在高算力GPU上的基准检测速度与实际在资源受限设备(如配备RTX 4060 GPU的笔记本电脑)上的运行效率之间存在显著差距。研究表明,这类系统性能并非由计算能力决定,而是受制于系统级瓶颈。解决方案的关键在于提出一种无需架构改动的两阶段自适应推理算法(Two-Pass Adaptive Inference),通过快速低分辨率初步推理,在置信度不足时才触发高分辨率模型进行精细检测,从而实现吞吐量最大化。实验表明,在COCO数据集上相较PyTorch早期退出基线方法可获得1.85倍加速,仅损失5.51%的平均精度(mAP)。

链接: https://arxiv.org/abs/2509.07928
作者: Mahmudul Islam Masum,Miad Islam,Arif I. Sarwat
机构: Florida International University (佛罗里达国际大学); Saint Leo University (圣利奥大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 7 figures

点击查看摘要

Abstract:As local AI grows in popularity, there is a critical gap between the benchmark performance of object detectors and their practical viability on consumer-grade hardware. While models like YOLOv10s promise real-time speeds, these metrics are typically achieved on high-power, desktop-class GPUs. This paper reveals that on resource-constrained systems, such as laptops with RTX 4060 GPUs, performance is not compute-bound but is instead dominated by system-level bottlenecks, as illustrated by a simple bottleneck test. To overcome this hardware-level constraint, we introduce a Two-Pass Adaptive Inference algorithm, a model-independent approach that requires no architectural changes. This study mainly focuses on adaptive inference strategies and undertakes a comparative analysis of architectural early-exit and resolution-adaptive routing, highlighting their respective trade-offs within a unified evaluation framework. The system uses a fast, low-resolution pass and only escalates to a high-resolution model pass when detection confidence is low. On a 5000-image COCO dataset, our method achieves a 1.85x speedup over a PyTorch Early-Exit baseline, with a modest mAP loss of 5.51%. This work provides a practical and reproducible blueprint for deploying high-performance, real-time AI on consumer-grade devices by shifting the focus from pure model optimization to hardware-aware inference strategies that maximize throughput.
zh
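
其两阶段自适应推理的核心逻辑可以概括为如下几行代码(检测器接口为假设的桩实现):低分辨率快速通道置信度不足时,才升级到高分辨率模型。

```python
def two_pass_detect(image, fast_model, accurate_model, conf_threshold: float = 0.5):
    """Run a cheap low-resolution pass; escalate to the high-resolution model
    only when the fast pass is not confident enough (or finds nothing)."""
    dets = fast_model(image)
    if dets and min(d["conf"] for d in dets) >= conf_threshold:
        return dets, "fast"
    return accurate_model(image), "accurate"

# Toy stand-ins for YOLO-style detectors (hypothetical interfaces).
fast = lambda img: [{"label": "car", "conf": 0.35}]
slow = lambda img: [{"label": "car", "conf": 0.91}]
print(two_pass_detect(None, fast, slow))  # escalates to the accurate pass
```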

[CV-5] Multimodal Contrastive Pretraining of CBCT and IOS for Enhanced Tooth Segmentation

【速读】:该论文旨在解决数字牙科中牙齿分割(tooth segmentation)的准确性与临床适用性不足的问题,尤其针对现有方法缺乏严格验证、性能有限且难以适应不同成像条件的挑战。解决方案的关键在于提出一种多模态对比学习预训练框架——ToothMCL,该框架融合了体积型CBCT(Cone-Beam Computed Tomography)和表面型IOS(Intraoral Scan)两种模态数据,通过模态不变表示学习捕捉细粒度解剖特征,从而实现高精度的多类别牙齿分割及国际牙科联合会(FDI)编号识别。这一方法显著提升了分割性能,在Dice相似系数(DSC)上较现有技术提升12%(CBCT)和8%(IOS),并展现出跨数据集和临床场景的强泛化能力。

链接: https://arxiv.org/abs/2509.07923
作者: Moo Hyun Son,Juyoung Bae,Zelin Qiu,Jiale Peng,Kai Xin Li,Yifan Lin,Hao Chen
机构: The University of Hong Kong (香港大学); The Hong Kong University of Science and Technology (香港科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Digital dentistry represents a transformative shift in modern dental practice. The foundational step in this transformation is the accurate digital representation of the patient’s dentition, which is obtained from segmented Cone-Beam Computed Tomography (CBCT) and Intraoral Scans (IOS). Despite the growing interest in digital dental technologies, existing segmentation methodologies frequently lack rigorous validation and demonstrate limited performance and clinical applicability. To the best of our knowledge, this is the first work to introduce a multimodal pretraining framework for tooth segmentation. We present ToothMCL, a Tooth Multimodal Contrastive Learning for pretraining that integrates volumetric (CBCT) and surface-based (IOS) modalities. By capturing modality-invariant representations through multimodal contrastive learning, our approach effectively models fine-grained anatomical features, enabling precise multi-class segmentation and accurate identification of Fédération Dentaire Internationale (FDI) tooth numbering. Along with the framework, we curated CBCT-IOS3.8K, the largest paired CBCT and IOS dataset to date, comprising 3,867 patients. We then evaluated ToothMCL on a comprehensive collection of independent datasets, representing the largest and most diverse evaluation to date. Our method achieves state-of-the-art performance in both internal and external testing, with an increase of 12% for CBCT segmentation and 8% for IOS segmentation in the Dice Similarity Coefficient (DSC). Furthermore, ToothMCL consistently surpasses existing approaches in tooth groups and demonstrates robust generalizability across varying imaging conditions and clinical scenarios.
zh
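
论文称采用多模态对比学习进行预训练,下面给出一种常见的对称 InfoNCE 写法作为示意(PyTorch):同一病例配对的 CBCT/IOS 嵌入互为正样本,批内其余组合为负样本。具体损失形式为笔者假设,非论文原文公式。

```python
import torch
import torch.nn.functional as F

def cross_modal_info_nce(z_cbct: torch.Tensor, z_ios: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: paired CBCT/IOS embeddings are positives,
    all other pairs in the batch are negatives."""
    z1 = F.normalize(z_cbct, dim=-1)
    z2 = F.normalize(z_ios, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```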

[CV-6] ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion ICCV2025

【速读】:该论文旨在解决人类-物体交互(Human-Object Interaction, HOI)联合重建中因缺乏先验知识而导致的物理合理性不足问题。现有优化方法难以生成符合物理规律的交互姿态,尤其在复杂场景下表现不佳。解决方案的关键在于提出ScoreHOI,一种基于扩散模型(diffusion-based optimizer)的优化框架,其核心创新是引入扩散先验(diffusion priors),通过得分引导采样(score-guided sampling)实现给定图像观测和物体特征下的条件分布重建;同时,在推理阶段利用特定物理约束指导去噪过程,并结合接触驱动的迭代精化策略(contact-driven iterative refinement),显著提升交互接触的合理性与重建精度。

链接: https://arxiv.org/abs/2509.07920
作者: Ao Li,Jinpeng Liu,Yixuan Zhu,Yansong Tang
机构: Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV 2025

点击查看摘要

Abstract:Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, the ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI’s superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.
zh

[CV-7] Object-level Correlation for Few-Shot Segmentation ICCV2025

【速读】:该论文旨在解决少样本语义分割(Few-shot Semantic Segmentation, FSS)中因图像级关联导致的背景过拟合问题,即现有方法在支持样本与查询图像之间建立全局关联时,难以抑制无关背景对象引入的硬像素噪声(hard pixel noise),从而影响分割精度。其解决方案的关键在于模仿生物视觉机制,提出一种基于目标级关联的网络架构——Object-level Correlation Network (OCNet),通过构建支持目标对象与查询图像中一般对象(general objects,包括前景目标和背景对象)之间的对象级关联来增强目标识别的有效性。该方法的核心模块包括:General Object Mining Module (GOMM) 用于学习显著性和高层相似性特征以提取查询中的通用对象特征,以及 Correlation Construction Module (CCM) 将目标原型分配至匹配的通用对象特征,从而生成具有判别力的对象级关联,有效挖掘目标特征并抑制硬像素噪声,最终提升分割性能。

链接: https://arxiv.org/abs/2509.07917
作者: Chunlin Wen,Yu Zhang,Jie Fan,Hongyuan Zhu,Xiu-Shen Wei,Yijun Wang,Zhiqiang Kou,Shuzhou Sun
机构: Southeast University (东南大学); Samsung Electronics (China) R&D Centre (三星电子(中国)研发中心); Shanghai AI Laboratory (上海人工智能实验室); Institute for Infocomm Research (I2R), A*STAR Singapore (新加坡科技研究局信息通信研究所); Center for Machine Vision and Signal Analysis (CMVS), University of Oulu (奥卢大学机器视觉与信号分析中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This paper was accepted by ICCV 2025

点击查看摘要

Abstract:Few-shot semantic segmentation (FSS) aims to segment objects of novel categories in the query images given only a few annotated support samples. Existing methods primarily build the image-level correlation between the support target object and the entire query image. However, this correlation contains the hard pixel noise, i.e., irrelevant background objects, that is intractable to trace and suppress, leading to the overfitting of the background. To address the limitation of this correlation, we imitate the biological vision process to identify novel objects in the object-level information. Target identification in the general objects is more valid than in the entire image, especially in the low-data regime. Inspired by this, we design an Object-level Correlation Network (OCNet) by establishing the object-level correlation between the support target object and query general objects, which is mainly composed of the General Object Mining Module (GOMM) and Correlation Construction Module (CCM). Specifically, GOMM constructs the query general object feature by learning saliency and high-level similarity cues, where the general objects include the irrelevant background objects and the target foreground object. Then, CCM establishes the object-level correlation by allocating the target prototypes to match the general object feature. The generated object-level correlation can mine the query target feature and suppress the hard pixel noise for the final prediction. Extensive experiments on PASCAL-5^i and COCO-20^i show that our model achieves the state-of-the-art performance.
zh

[CV-8] Active Membership Inference Test (aMINT): Enhancing Model Auditability with Multi-Task Learning ICCV

【速读】:该论文旨在解决机器学习模型中训练数据泄露检测的问题,即判断给定数据是否曾被用于模型训练,从而提升AI模型的可审计性与透明度。其解决方案的关键在于提出了一种名为Active Membership Inference Test (aMINT) 的多任务学习框架,在训练过程中同时优化主模型(Audited Model)和一个辅助的MINT模型,后者专门用于识别训练数据;其中,通过将中间激活特征图(intermediate activation maps)作为输入注入到MINT模块中,使模型在训练阶段就具备对训练数据的敏感性,从而显著提高检测准确率(超过80%),优于现有方法。

链接: https://arxiv.org/abs/2509.07879
作者: Daniel DeAlcala,Aythami Morales,Julian Fierrez,Gonzalo Mancera,Ruben Tolosana,Javier Ortega-Garcia
机构: Universidad Autonoma de Madrid (马德里自治大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: In Proc. IEEE/CVF Intenational Conference on Computer Vision, ICCV, 2025

点击查看摘要

Abstract:Active Membership Inference Test (aMINT) is a method designed to detect whether given data were used during the training of machine learning models. In Active MINT, we propose a novel multitask learning process that involves training simultaneously two models: the original or Audited Model, and a secondary model, referred to as the MINT Model, responsible for identifying the data used for training the Audited Model. This novel multi-task learning approach has been designed to incorporate the auditability of the model as an optimization objective during the training process of neural networks. The proposed approach incorporates intermediate activation maps as inputs to the MINT layers, which are trained to enhance the detection of training data. We present results using a wide range of neural networks, from lighter architectures such as MobileNet to more complex ones such as Vision Transformers, evaluated in 5 public benchmarks. Our proposed Active MINT achieves over 80% accuracy in detecting if given data was used for training, significantly outperforming previous approaches in the literature. Our aMINT and related methodological developments contribute to increasing transparency in AI models, facilitating stronger safeguards in AI deployments to achieve proper security, privacy, and copyright protection.
zh
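
下面用 PyTorch 给出该多任务训练的最小示意:被审计模型输出分类 logits 与中间激活,MINT 模型以该激活为输入预测样本是否参与过训练,两个损失联合反向传播。网络结构为演示用玩具实现,非论文原始架构。

```python
import torch
import torch.nn as nn

class ToyAuditedModel(nn.Module):
    def __init__(self, d=32, n_classes=10):
        super().__init__()
        self.backbone = nn.Linear(d, 64)
        self.head = nn.Linear(64, n_classes)
    def forward(self, x):
        act = torch.relu(self.backbone(x))   # intermediate activation map
        return self.head(act), act

class MintModel(nn.Module):
    """Predicts from intermediate activations whether a sample was in training."""
    def __init__(self, d_act=64):
        super().__init__()
        self.clf = nn.Sequential(nn.Linear(d_act, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, act):
        return self.clf(act).squeeze(-1)

audited, mint = ToyAuditedModel(), MintModel()
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
is_member = torch.randint(0, 2, (16,)).float()  # 1 = sample used for training

logits, act = audited(x)
loss = nn.functional.cross_entropy(logits, y) \
     + nn.functional.binary_cross_entropy_with_logits(mint(act), is_member)
loss.backward()  # both models are optimized jointly
print(loss.item())
```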

[CV-9] D-LEAF: Localizing and Correcting Hallucinations in Multimodal LLM s via Layer-to-head Attention Diagnostics

【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在图像描述生成和视觉问答等任务中普遍存在幻觉(hallucination)的问题,即模型生成的文本与输入图像内容不一致。现有方法虽尝试通过注意力机制检测和缓解幻觉,但通常对所有层和注意力头采用统一调整策略,难以精确定位错误来源。论文的关键创新在于提出两种诊断指标:Layer Image Attention Entropy (LIAE) 用于识别异常层,Image Attention Focus (IAF) 用于量化特定层内注意力头的重要性;基于此,进一步设计了动态层级熵与注意力融合(Dynamic Layer-wise Entropy and Attention Fusion, D-LEAF)方法,能够在推理阶段无任务依赖地动态定位并修正错误,显著降低幻觉率,同时保持计算效率。

链接: https://arxiv.org/abs/2509.07864
作者: Tiancheng Yang,Lin Zhang,Jiaye Lin,Guimin Hu,Di Wang,Lijie Hu
机构: MBZUAI; Provable Responsible AI and Data Analytics (PRADA) Lab; King Abdullah University of Science and Technology; School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences; University of Copenhagen; Tsinghua University
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Multimodal Large Language Models (MLLMs) achieve strong performance on tasks like image captioning and visual question answering, but remain prone to hallucinations, where generated text conflicts with the visual input. Prior work links this partly to insufficient visual attention, but existing attention-based detectors and mitigation typically apply uniform adjustments across layers and heads, obscuring where errors originate. In this paper, we first show these methods fail to accurately localize problematic layers. Then, we introduce two diagnostics: Layer Image Attention Entropy (LIAE) which flags anomalous layers, and Image Attention Focus (IAF) which scores attention heads within those layers. Analysis shows that LIAE pinpoints faulty layers and IAF reliably ranks heads that warrant correction. Guided by these signals, we propose Dynamic Layer-wise Entropy and Attention Fusion (D-LEAF), a task-agnostic, attention-guided method that dynamically localizes and corrects errors during inference with negligible overhead. Results show our D-LEAF delivers a 53% relative improvement on standard captioning benchmarks, and on VQA both accuracy and F1-score improve by approximately 4%, substantially suppressing hallucinations while preserving efficiency.
zh
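
LIAE 的一种可能实现如下(PyTorch):对生成 token 在图像 token 上的注意力分布计算熵,并按层平均,熵异常的层即被标记为可疑层。具体公式细节为笔者依据摘要所作的假设。

```python
import torch

def layer_image_attention_entropy(attn: torch.Tensor,
                                  image_idx: torch.Tensor) -> torch.Tensor:
    """LIAE-style diagnostic (form assumed): entropy of the attention mass that
    generated tokens place on image tokens, averaged per layer.

    attn: (L, H, Q, K) attention weights; image_idx: indices of image tokens in K.
    """
    img_attn = attn[..., image_idx]                       # (L, H, Q, n_img)
    p = img_attn / img_attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
    ent = -(p * p.clamp_min(1e-9).log()).sum(dim=-1)      # (L, H, Q)
    return ent.mean(dim=(1, 2))                           # one score per layer

n_layers, n_heads, n_q, n_k = 4, 8, 5, 20
attn = torch.softmax(torch.randn(n_layers, n_heads, n_q, n_k), dim=-1)
print(layer_image_attention_entropy(attn, torch.arange(10)))
```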

[CV-10] Deep Learning-Based Burned Area Mapping Using Bi-Temporal Siamese Networks and AlphaEarth Foundation Datasets

【速读】:该论文旨在解决火灾损毁区域自动化制图中精度不足与泛化能力弱的问题,尤其在复杂背景和多样化生态系统中准确识别部分燃烧植被及火边界存在挑战。解决方案的关键在于结合AlphaEarth数据集(包含高分辨率光学与热红外影像及详尽地面真值标注)与Siamese U-Net深度学习架构,构建了一种集成学习方法,实现了95%的整体准确率、0.6的交并比(IoU)和74%的F1分数,展现出优异的跨区域迁移能力和广泛适用性,为全球尺度的烧毁区域监测提供了可扩展的技术路径。

链接: https://arxiv.org/abs/2509.07852
作者: Seyd Teymoor Seydi
机构: University of Tehran (德黑兰大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Accurate and timely mapping of burned areas is crucial for environmental monitoring, disaster management, and assessment of climate change. This study presents a novel approach to automated burned area mapping using the AlphaEarth dataset combined with the Siamese U-Net deep learning architecture. The AlphaEarth dataset, comprising high-resolution optical and thermal infrared imagery with comprehensive ground-truth annotations, provides an unprecedented resource for training robust burned area detection models. We trained our model with the Monitoring Trends in Burn Severity (MTBS) dataset in the contiguous US and evaluated it on 17 regions across Europe. Our experimental results demonstrate that the proposed ensemble approach achieves superior performance with an overall accuracy of 95%, IoU of 0.6, and F1-score of 74% on the test dataset. The model successfully identifies burned areas across diverse ecosystems with complex backgrounds, showing particular strength in detecting partially burned vegetation and fire boundaries, as well as strong transferability and generalization in burned area mapping. This research contributes to the advancement of automated fire damage assessment and provides a scalable solution for global burned area monitoring using the AlphaEarth dataset.
zh

[CV-11] Point Linguist Model: Segment Any Object via Bridged Large 3D-Language Model

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)与密集3D点云之间因表征错位(representation misalignment)导致的语义-几何不一致问题。具体而言,LLMs处理的是高层语义token,而3D点云仅包含密集几何结构,这使得输入阶段需依赖大量预对齐操作以提取对象级语义,且易受相似干扰项混淆;输出阶段则因缺乏显式几何线索而丢失细粒度分割精度。其解决方案的核心在于提出Point Linguist Model(PLM),通过两个关键模块实现:一是引入面向对象的判别性表征(Object-centric Discriminative Representation, OcDR),在硬负样本感知训练目标下学习捕捉目标语义与场景关系的对象中心token,从而缓解LLM token与3D点之间的表征错位并增强抗干扰能力;二是设计几何重激活解码器(Geometric Reactivation Decoder, GRD),将OcDR中蕴含的LLM推断几何信息与对应密集特征融合预测掩码,保留完整密集特征流,显著提升分割精度。

链接: https://arxiv.org/abs/2509.07825
作者: Zhuoxu Huang,Mingqi Gao,Jungong Han
机构: Aberystwyth University (阿伯里斯特威斯大学); University of Sheffield (谢菲尔德大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Preprint

点击查看摘要

Abstract:3D object segmentation with Large Language Models (LLMs) has become a prevailing paradigm due to its broad semantics, task flexibility, and strong generalization. However, this paradigm is hindered by representation misalignment: LLMs process high-level semantic tokens, whereas 3D point clouds convey only dense geometric structures. In prior methods, misalignment limits both input and output. At the input stage, dense point patches require heavy pre-alignment, weakening object-level semantics and confusing similar distractors. At the output stage, predictions depend only on dense features without explicit geometric cues, leading to a loss of fine-grained accuracy. To address these limitations, we present the Point Linguist Model (PLM), a general framework that bridges the representation gap between LLMs and dense 3D point clouds without requiring large-scale pre-alignment between 3D-text or 3D-images. Specifically, we introduce Object-centric Discriminative Representation (OcDR), which learns object-centric tokens that capture target semantics and scene relations under a hard negative-aware training objective. This mitigates the misalignment between LLM tokens and 3D points, enhances resilience to distractors, and facilitates semantic-level reasoning within LLMs. For accurate segmentation, we introduce the Geometric Reactivation Decoder (GRD), which predicts masks by combining OcDR tokens carrying LLM-inferred geometry with corresponding dense features, preserving comprehensive dense features throughout the pipeline. Extensive experiments show that PLM achieves significant improvements of +7.3 mIoU on ScanNetv2 and +6.0 mIoU on Multi3DRefer for 3D referring segmentation, with consistent gains across 7 benchmarks spanning 4 different tasks, demonstrating the effectiveness of comprehensive object-centric reasoning for robust 3D understanding.
zh

[CV-12] SplatFill: 3D Scene Inpainting via Depth-Guided Gaussian Splatting

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 场景中缺失区域修复(inpainting)的问题,尤其是在遮挡或场景编辑导致的空洞区域,传统方法常出现细节模糊、伪影和几何不一致等缺陷。其解决方案的关键在于提出一种深度引导的修复方法 SplatFill,核心创新包括:(1) 联合使用基于深度和基于物体的监督机制,确保修复后的高斯点在三维空间中精确定位并与周围几何结构对齐;(2) 设计一致性感知的精细化修正策略,能够选择性识别并修正不一致区域,同时保持其他区域不变。该方法在 SPIn-NeRF 数据集上实现了优于现有 NeRF 和 3DGS 方法的视觉保真度,并将训练时间减少 24.5%。

链接: https://arxiv.org/abs/2509.07809
作者: Mahtab Dahaghin,Milind G. Padalkar,Matteo Toso,Alessio Del Bue
机构: Istituto Italiano di Tecnologia (意大利技术研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has enabled the creation of highly realistic 3D scene representations from sets of multi-view images. However, inpainting missing regions, whether due to occlusion or scene editing, remains a challenging task, often leading to blurry details, artifacts, and inconsistent geometry. In this work, we introduce SplatFill, a novel depth-guided approach for 3DGS scene inpainting that achieves state-of-the-art perceptual quality and improved efficiency. Our method combines two key ideas: (1) joint depth-based and object-based supervision to ensure inpainted Gaussians are accurately placed in 3D space and aligned with surrounding geometry, and (2) we propose a consistency-aware refinement scheme that selectively identifies and corrects inconsistent regions without disrupting the rest of the scene. Evaluations on the SPIn-NeRF dataset demonstrate that SplatFill not only surpasses existing NeRF-based and 3DGS-based inpainting methods in visual fidelity but also reduces training time by 24.5%. Qualitative results show our method delivers sharper details, fewer artifacts, and greater coherence across challenging viewpoints.
zh

[CV-13] Faster Self-Supervised Super-Resolution for Anisotropic Multi-View MRI Using a Sparse Coordinate Loss

【速读】:该论文旨在解决医学磁共振(MR)成像中高分辨率(HR)图像获取困难的问题,特别是在扫描时间与图像质量、患者舒适度之间难以平衡的场景下。传统方法通常对两个不同低分辨率(LR)方向的各向异性扫描分别进行分析,不仅耗时且易导致误判。其解决方案的关键在于提出一种基于多视角神经网络的自监督图像融合方法,通过引入稀疏坐标损失(sparse coordinate-based loss),实现任意缩放比例下的LR图像融合,从而重建出统一表示的高分辨率结构。该方法无需配对的HR数据即可训练,并结合离线预训练与在线微调策略,在保持或提升超分辨率(SR)质量的同时,使患者特异性重建速度提升达十倍。

链接: https://arxiv.org/abs/2509.07798
作者: Maja Schlereth,Moritz Schillinger,Katharina Breininger
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 2 figures

点击查看摘要

Abstract:Acquiring images in high resolution is often a challenging task. Especially in the medical sector, image quality has to be balanced with acquisition time and patient comfort. To strike a compromise between scan time and quality for Magnetic Resonance (MR) imaging, two anisotropic scans with different low-resolution (LR) orientations can be acquired. Typically, LR scans are analyzed individually by radiologists, which is time consuming and can lead to inaccurate interpretation. To tackle this, we propose a novel approach for fusing two orthogonal anisotropic LR MR images to reconstruct anatomical details in a unified representation. Our multi-view neural network is trained in a self-supervised manner, without requiring corresponding high-resolution (HR) data. To optimize the model, we introduce a sparse coordinate-based loss, enabling the integration of LR images with arbitrary scaling. We evaluate our method on MR images from two independent cohorts. Our results demonstrate comparable or even improved super-resolution (SR) performance compared to state-of-the-art (SOTA) self-supervised SR methods for different upsampling scales. By combining a patient-agnostic offline and a patient-specific online phase, we achieve a substantial speed-up of up to ten times for patient-specific reconstruction while achieving similar or better SR quality. Code is available at this https URL.
zh

[CV-14] RayGaussX: Accelerating Gaussian-Based Ray Marching for Real-Time and High-Quality Novel View Synthesis

【速读】:该论文旨在解决RayGauss在真实世界场景中渲染效率低、难以实现实时渲染的问题。其核心挑战在于,尽管RayGauss在合成数据和室内场景中已实现先进的新视角合成质量,但其基于不规则分布椭圆基函数的表示方法与体素射线投射(volume ray casting)策略导致计算开销过高。为提升训练与推理速度,RayGaussX引入多项关键技术改进:首先,采用空空间跳过(empty-space skipping)和自适应采样(adaptive sampling)加速体积渲染;其次,增强射线相干性(ray coherence)以优化GPU并行效率;再次,提出尺度正则化(scale regularization)减少虚假交点;最后,设计新的密化准则(densification criterion),改善远距离区域的密度分布,从而显著提升大场景下的视觉质量。这些改进使RayGaussX在真实世界数据集上实现5倍至12倍的训练加速和50至80倍的帧率提升(FPS),同时PSNR提升达+0.56 dB。

链接: https://arxiv.org/abs/2509.07782
作者: Hugo Blanc,Jean-Emmanuel Deschaud,Alexis Paljic
机构: Mines Paris, PSL University, Centre for robotics (机器人中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page with videos and code: this https URL

点击查看摘要

Abstract:RayGauss has achieved state-of-the-art rendering quality for novel-view synthesis on synthetic and indoor scenes by representing radiance and density fields with irregularly distributed elliptical basis functions, rendered via volume ray casting using a Bounding Volume Hierarchy (BVH). However, its computational cost prevents real-time rendering on real-world scenes. Our approach, RayGaussX, builds on RayGauss by introducing key contributions that accelerate both training and inference. Specifically, we incorporate volumetric rendering acceleration strategies such as empty-space skipping and adaptive sampling, enhance ray coherence, and introduce scale regularization to reduce false-positive intersections. Additionally, we propose a new densification criterion that improves density distribution in distant regions, leading to enhanced graphical quality on larger scenes. As a result, RayGaussX achieves 5x to 12x faster training and 50x to 80x higher rendering speeds (FPS) on real-world datasets while improving visual quality by up to +0.56 dB in PSNR. Project page with videos and code: this https URL.
zh
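
“空空间跳过”的思想可用如下 NumPy 玩具示意说明:依据占据栅格,仅在被占据体素内密集采样,空区域用大步长快速越过。与论文基于 BVH 与高斯基函数的实际实现不同,此处仅演示加速原理。

```python
import numpy as np

def march_ray_with_skipping(origin, direction, occupancy, voxel_size,
                            t_max=10.0, coarse_step=0.5, fine_step=0.05):
    """Sample a ray densely only inside occupied voxels; skip empty space
    with large steps (a simplified version of the acceleration idea)."""
    samples, t = [], 0.0
    while t < t_max:
        p = origin + t * direction
        idx = tuple((p / voxel_size).astype(int))
        inside = all(0 <= i < s for i, s in zip(idx, occupancy.shape))
        if inside and occupancy[idx]:
            samples.append(t)          # dense sampling in occupied regions
            t += fine_step
        else:
            t += coarse_step           # skip empty space quickly
    return np.array(samples)

occ = np.zeros((8, 8, 8), dtype=bool)
occ[4:6, 4:6, 4:6] = True              # a small occupied block
ts = march_ray_with_skipping(np.zeros(3), np.ones(3) / np.sqrt(3), occ, voxel_size=1.0)
print(len(ts))                         # samples concentrate inside the block
```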

[CV-15] HairGS: Hair Strand Reconstruction based on 3D Gaussian Splatting BMVC2025

【速读】:该论文致力于解决从多视角图像中重建头发丝级几何结构(hair strand-level geometry reconstruction)的问题,这是计算机视觉领域中一个具有挑战性的任务,尤其在虚拟现实和数字人建模等应用中日益重要。现有方法通常仅关注几何精度,而忽视了头发丝之间的连通性和拓扑结构,导致重建结果缺乏真实感。解决方案的关键在于扩展3D高斯泼溅(3D Gaussian Splatting, 3DGS)框架,提出一个多阶段流程:首先利用可微分高斯光栅化器重建精细的头发几何;接着设计一种新颖的合并策略将离散的高斯片段融合为连贯的发丝;最后在光度监督下对发丝进行精细化与生长优化。此外,作者还提出了一个新的评估指标,作为衡量发丝拓扑准确性的代理指标,从而更全面地评价重建质量。

链接: https://arxiv.org/abs/2509.07774
作者: Yimin Pan,Matthias Nießner,Tobias Kirschstein
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This is the arXiv preprint of the paper “Hair Strand Reconstruction based on 3D Gaussian Splatting” published at BMVC 2025. Project website: this https URL

点击查看摘要

Abstract:Human hair reconstruction is a challenging problem in computer vision, with growing importance for applications in virtual reality and digital human modeling. Recent advances in 3D Gaussian Splatting (3DGS) provide efficient and explicit scene representations that naturally align with the structure of hair strands. In this work, we extend the 3DGS framework to enable strand-level hair geometry reconstruction from multi-view images. Our multi-stage pipeline first reconstructs detailed hair geometry using a differentiable Gaussian rasterizer, then merges individual Gaussian segments into coherent strands through a novel merging scheme, and finally refines and grows the strands under photometric supervision. While existing methods typically evaluate reconstruction quality at the geometric level, they often neglect the connectivity and topology of hair strands. To address this, we propose a new evaluation metric that serves as a proxy for assessing topological accuracy in strand reconstruction. Extensive experiments on both synthetic and real-world datasets demonstrate that our method robustly handles a wide range of hairstyles and achieves efficient reconstruction, typically completing within one hour. The project page can be found at: this https URL
zh

[CV-16] XSRD-Net: EXplainable Stroke Relapse Detection

【速读】:该论文旨在降低脑卒中(stroke)复发率,因其复发死亡率高达40%,是临床亟需解决的关键问题。解决方案的核心在于利用多模态深度学习模型(multimodal XSRD-net)对患者进行早期风险分层:一方面通过整合3D颅内CTA影像数据(vision modality)与患者的心脏疾病史、年龄和性别等表型数据(tabular modality),实现脑卒中复发的二分类预测(Task 1)和无复发生存时间(relapse-free survival, RFS)的回归预测(Task 2)。模型在测试集上取得AUC=0.71和c-index=0.68,且可解释性分析揭示了心脏疾病与颈动脉病变之间的关联,为精准干预提供了依据。

链接: https://arxiv.org/abs/2509.07772
作者: Christian Gapp,Elias Tappeiner,Martin Welk,Karl Fritscher,Stephanie Mangesius,Constantin Eisenschink,Philipp Deisl,Michael Knoflach,Astrid E. Grams,Elke R. Gizewski,Rainer Schubert
机构: University of Innsbruck (因斯布鲁克大学); Medical University of Innsbruck (因斯布鲁克医科大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Contribution to MICAD 2025 conference, Nov. 19-21, 2025 | London, UK

点击查看摘要

Abstract:Stroke is the second most frequent cause of death worldwide with an annual mortality of around 5.5 million. Recurrence rates of stroke are between 5 and 25% in the first year. As mortality rates for relapses are extraordinarily high (40%), it is of utmost importance to reduce the recurrence rates. We address this issue by detecting patients at risk of stroke recurrence at an early stage in order to enable appropriate therapy planning. To this end we collected 3D intracranial CTA image data and recorded concomitant heart diseases, the age and the gender of stroke patients between 2010 and 2024. We trained single- and multimodal deep learning based neural networks for binary relapse detection (Task 1) and for relapse-free survival (RFS) time prediction together with a subsequent classification (Task 2). The separation of relapse from non-relapse patients (Task 1) could be solved with tabular data (AUC on test dataset: 0.84). However, for the main task, the regression (Task 2), our multimodal XSRD-net processed the modalities vision:tabular with 0.68:0.32 according to modality contribution measures. The c-index with respect to relapses for the multimodal model reached 0.68, and the AUC is 0.71 for the test dataset. Finally, deeper interpretability analysis results could highlight a link between both heart diseases (tabular) and carotid arteries (vision) for the detection of relapses and the prediction of the RFS time. This is a central outcome that we strive to strengthen with ongoing data collection and model retraining.
zh

[CV-17] Spectral and Rhythm Feature Performance Evaluation for Category and Class Level Audio Classification with Deep Convolutional Neural Networks

【速读】:该论文旨在解决音频分类任务中特征表示对深度卷积神经网络(CNN)性能影响的问题,特别是针对环境声音(environmental sounds)分类场景下,不同频谱和节奏特征作为输入时模型表现的差异性。其解决方案的关键在于系统性地评估多种常见特征(如梅尔尺度频谱图、梅尔频率倒谱系数(MFCC)、循环时谱图、短时傅里叶变换(STFT)色度图、恒定Q变换(CQT)色度图及色度能量归一化统计量(CENS)色度图)在ESC-50数据集上的分类效果,并通过端到端深度学习流程对比它们在类别级和类别内级别分类任务中的准确率、精确率、召回率和F1分数。结果表明,梅尔尺度频谱图和MFCC在所有评估指标上均显著优于其他特征,成为最适合用于深度CNN音频分类的输入表示形式。

链接: https://arxiv.org/abs/2509.07756
作者: Friedrich Wolf-Monheim
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
备注:

点击查看摘要

Abstract:Next to decision tree and k-nearest neighbours algorithms, deep convolutional neural networks (CNNs) are widely used to classify audio data in many domains like music, speech or environmental sounds. To train a specific CNN, various spectral and rhythm features like mel-scaled spectrograms, mel-frequency cepstral coefficients (MFCC), cyclic tempograms, short-time Fourier transform (STFT) chromagrams, constant-Q transform (CQT) chromagrams and chroma energy normalized statistics (CENS) chromagrams can be used as digital image input data for the neural network. The performance of these spectral and rhythm features for audio category level as well as audio class level classification is investigated in detail with a deep CNN and the ESC-50 dataset with 2,000 labeled environmental audio recordings using an end-to-end deep learning pipeline. The evaluated metrics accuracy, precision, recall and F1 score for multiclass classification clearly show that the mel-scaled spectrograms and the mel-frequency cepstral coefficients (MFCC) perform significantly better than the other spectral and rhythm features investigated in this research for audio classification tasks using deep CNNs.
zh
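
下面给出用 librosa 提取文中表现最好的两类特征(梅尔频谱图与 MFCC)的示例,输出的二维数组即可作为深度 CNN 的图像输入;此处用合成正弦信号代替 ESC-50 音频片段。

```python
import numpy as np
import librosa

# A synthetic 1-second signal stands in for an ESC-50 clip.
sr = 22050
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

# Mel-scaled spectrogram (log-compressed), one of the two best-performing inputs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MFCCs, the other top-performing representation.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)

print(mel_db.shape, mfcc.shape)  # each is a 2D "image" fed to the CNN
```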

[CV-18] Enhancing Online Learning by Integrating Biosensors and Multimodal Learning Analytics for Detecting and Predicting Student Behavior: A Review WWW

【速读】:该论文旨在解决在线学习环境中对学生行为理解与预测的难题,以提升学习参与度并优化教育成效。其解决方案的关键在于整合生物传感器(biosensors)与多模态学习分析(Multimodal Learning Analytics, MmLA),通过融合生理信号(如心率、脑电活动和眼动追踪)与传统交互数据及自评报告,实现对学生认知状态和参与水平的深度洞察。研究指出,借助先进的机器学习算法和多模态数据预处理技术,可有效提升个性化学习体验、实时反馈与智能教育干预的能力,从而推动自适应在线学习系统的创新发展。

链接: https://arxiv.org/abs/2509.07742
作者: Alvaro Becerra,Ruth Cobos,Charles Lang
机构: Universidad Autónoma de Madrid (马德里自治大学); Digital Futures Institute, Teachers College Columbia University (哥伦比亚大学教师学院数字未来研究所)
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for publication in Behaviour & Information Technology (Taylor & Francis). Final published version will be available soon at this https URL

点击查看摘要

Abstract:In modern online learning, understanding and predicting student behavior is crucial for enhancing engagement and optimizing educational outcomes. This systematic review explores the integration of biosensors and Multimodal Learning Analytics (MmLA) to analyze and predict student behavior during computer-based learning sessions. We examine key challenges, including emotion and attention detection, behavioral analysis, experimental design, and demographic considerations in data collection. Our study highlights the growing role of physiological signals, such as heart rate, brain activity, and eye-tracking, combined with traditional interaction data and self-reports to gain deeper insights into cognitive states and engagement levels. We synthesize findings from 54 key studies, analyzing commonly used methodologies such as advanced machine learning algorithms and multimodal data pre-processing techniques. The review identifies current research trends, limitations, and emerging directions in the field, emphasizing the transformative potential of biosensor-driven adaptive learning systems. Our findings suggest that integrating multimodal data can facilitate personalized learning experiences, real-time feedback, and intelligent educational interventions, ultimately advancing toward a more customized and adaptive online learning experience.
zh

[CV-19] SEEC: Segmentation-Assisted Multi-Entropy Models for Learned Lossless Image Compression

【速读】:该论文旨在解决传统图像压缩方法中因使用单一熵模型(entropy model)无法有效捕捉图像不同语义区域统计特性而导致的压缩效率受限问题。其解决方案的关键在于提出一种基于分割辅助的多熵模型(Segmentation-Assisted Multi-Entropy Models, SEEC)框架,通过引入语义分割(semantic segmentation)指导多个专用熵模型的选择与自适应调整,从而在不同语义区域内实现更精确的概率分布估计;同时采用多通道离散对数混合似然(multi-channel discrete logistic mixture likelihood)建模像素值分布,显著提升无损压缩性能,并支持基于分割掩码的感兴趣区域(Regions of Interest, ROIs)编码。

链接: https://arxiv.org/abs/2509.07704
作者: Chunhang Zheng,Zichang Ren,Dou Li
机构: Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: under review

点击查看摘要

Abstract:Recently, learned image compression has attracted considerable attention due to its superior performance over traditional methods. However, most existing approaches employ a single entropy model to estimate the probability distribution of pixel values across the entire image, which limits their ability to capture the diverse statistical characteristics of different semantic regions. To overcome this limitation, we propose Segmentation-Assisted Multi-Entropy Models for Lossless Image Compression (SEEC). Our framework utilizes semantic segmentation to guide the selection and adaptation of multiple entropy models, enabling more accurate probability distribution estimation for distinct semantic regions. Specifically, SEEC first extracts image features and then applies semantic segmentation to identify different regions, each assigned a specialized entropy model to better capture its unique statistical properties. Finally, a multi-channel discrete logistic mixture likelihood is employed to model the pixel value distributions effectively. Experimental results on benchmark datasets demonstrate that SEEC achieves state-of-the-art compression ratios while introducing only minimal encoding and decoding latency. With superior performance, the proposed model also supports Regions of Interest (ROIs) coding conditioned on the provided segmentation mask. Our code is available at this https URL.
zh

[CV-20] CAViAR: Critic-Augmented Video Agent ic Reasoning

【速读】:该论文旨在解决视频理解任务中复杂推理能力不足的问题,尤其是在查询复杂度和视频长度增加时,现有模型性能显著下降的现象。其解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的智能体(agent),该智能体可调用视频模块作为子代理或工具,并根据每次模块调用的结果动态决定后续步骤,从而实现灵活、自适应的推理流程。此外,受文本推理领域启发,作者引入一个批评者(critic)机制,用于区分成功与失败的推理序列,进而指导智能体优化决策路径,最终在LVBench、Neptune和ActivityNet-RTL等多个基准数据集上取得显著提升。

链接: https://arxiv.org/abs/2509.07680
作者: Sachit Menon,Ahmet Iscen,Arsha Nagrani,Tobias Weyand,Carl Vondrick,Cordelia Schmid
机构: Google DeepMind(谷歌深度思维); Columbia University (哥伦比亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieve strong performance on the previously-mentioned datasets.
zh

[CV-21] Nearest Neighbor Projection Removal Adversarial Training

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在图像分类任务中对对抗样本(adversarial examples)敏感的问题,尤其是现有标准对抗训练方法未能有效处理类别间特征重叠(inter-class feature overlap)这一关键因素。其解决方案的关键在于提出一种新颖的对抗训练框架,通过在特征空间中显式地移除对抗样本与干净样本中的类别间依赖关系,具体做法是识别每个对抗样本的最近邻异类样本,并将其投影方向从特征表示中剔除,从而增强类别间的特征可分性。理论分析表明,该方法通过修正 logits 降低了神经网络的 Lipschitz 常数,进而减小 Rademacher 复杂度,提升了模型的泛化能力和鲁棒性。

链接: https://arxiv.org/abs/2509.07673
作者: Himanshu Singh,A. V. Subramanyam,Shivank Rajput,Mohan Kankanhalli
机构: IIIT Delhi (印度信息技术研究所); NUS (新加坡国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep neural networks have exhibited impressive performance in image classification tasks but remain vulnerable to adversarial examples. Standard adversarial training enhances robustness but typically fails to explicitly address inter-class feature overlap, a significant contributor to adversarial susceptibility. In this work, we introduce a novel adversarial training framework that actively mitigates inter-class proximity by projecting out inter-class dependencies from adversarial and clean samples in the feature space. Specifically, our approach first identifies the nearest inter-class neighbors for each adversarial sample and subsequently removes projections onto these neighbors to enforce stronger feature separability. Theoretically, we demonstrate that our proposed logits correction reduces the Lipschitz constant of neural networks, thereby lowering the Rademacher complexity, which directly contributes to improved generalization and robustness. Extensive experiments across standard benchmarks including CIFAR-10, CIFAR-100, and SVHN show that our method demonstrates strong performance that is competitive with leading adversarial training techniques, highlighting significant achievements in both robust and clean accuracy. Our findings reveal the importance of addressing inter-class feature proximity explicitly to bolster adversarial robustness in DNNs.
zh
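
“移除最近邻投影”的核心向量运算如下(PyTorch):将特征在其最近异类邻居方向上的分量投影并减去。至于在特征空间还是 logits 上施加、如何选取邻居,以论文为准,此处仅示意该运算本身。

```python
import torch

def remove_neighbor_projection(feat: torch.Tensor, neighbor: torch.Tensor) -> torch.Tensor:
    """Project the feature onto its nearest inter-class neighbor and subtract
    that component, enforcing stronger separation between classes.

    feat, neighbor: (B, D) feature vectors.
    """
    n = torch.nn.functional.normalize(neighbor, dim=-1)
    coeff = (feat * n).sum(dim=-1, keepdim=True)   # scalar projection onto n
    return feat - coeff * n                        # result is orthogonal to n

f = torch.tensor([[2.0, 1.0]])
nb = torch.tensor([[1.0, 0.0]])
print(remove_neighbor_projection(f, nb))  # tensor([[0., 1.]])
```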

[CV-22] EDFFDNet: Towards Accurate and Efficient Unsupervised Multi-Grid Image Registration

【速读】:该论文旨在解决现有深度图像配准方法(如单仿射变换、多网格仿射变换或薄板样条)在存在深度差异的真实场景中性能受限的问题。其关键解决方案在于提出一种基于指数衰减基函数的自由形变网络(Exponential-Decay Free-Form Deformation Network, EDFFDNet),该设计利用局部性优势提升对深度变化的适应能力;同时引入自适应稀疏运动聚合器(Adaptive Sparse Motion Aggregator, ASMA)替代传统全连接层(MLP)结构,通过将密集交互转为稀疏交互显著降低参数量并提高精度;此外,采用渐进式相关性精化策略,融合全局-局部相关性模式实现从粗到细的运动估计,从而在保持低计算成本的同时提升配准准确性和泛化能力。

链接: https://arxiv.org/abs/2509.07662
作者: Haokai Zhu,Bo Qu,Si-Yuan Cao,Runmin Zhang,Shujie Chen,Bailin Yang,Hui-Liang Shen
机构: Ningbo Global Innovation Center, Zhejiang University (浙江大学宁波全球创新港); College of Information Science and Electronic Engineering, Zhejiang University (浙江大学信息科学与电子工程学院); NingboTech University (宁波理工学院); Zhejiang Key Laboratory of Big Data and Future E-Commerce Technology, Hangzhou, China (浙江省大数据与未来电子商务技术重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Previous deep image registration methods that employ single homography, multi-grid homography, or thin-plate spline often struggle with real scenes containing depth disparities due to their inherent limitations. To address this, we propose an Exponential-Decay Free-Form Deformation Network (EDFFDNet), which employs free-form deformation with an exponential-decay basis function. This design achieves higher efficiency and performs well in scenes with depth disparities, benefiting from its inherent locality. We also introduce an Adaptive Sparse Motion Aggregator (ASMA), which replaces the MLP motion aggregator used in previous methods. By transforming dense interactions into sparse ones, ASMA reduces parameters and improves accuracy. Additionally, we propose a progressive correlation refinement strategy that leverages global-local correlation patterns for coarse-to-fine motion estimation, further enhancing efficiency and accuracy. Experiments demonstrate that EDFFDNet reduces parameters, memory, and total runtime by 70.5%, 32.6%, and 33.7%, respectively, while achieving a 0.5 dB PSNR gain over the state-of-the-art method. With an additional local refinement stage,EDFFDNet-2 further improves PSNR by 1.06 dB while maintaining lower computational costs. Our method also demonstrates strong generalization ability across datasets, outperforming previous deep learning methods.
zh
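
论文摘要未给出基函数的具体公式,以下按“指数衰减径向基”的通常理解给出一个 NumPy 示意:每个控制点的位移以 exp(-‖x-c‖²/σ²) 的权重影响空间点,权重随距离衰减,天然具备摘要强调的局部性。基函数形式与参数均为笔者假设。

```python
import numpy as np

def edffd_deform(points, control_pts, control_disp, sigma=0.2):
    """Free-form deformation with an exponential-decay radial basis (form assumed):
    each control point's displacement influences a point with weight
    exp(-||x - c||^2 / sigma^2), which decays to zero away from the control point."""
    d2 = ((points[:, None, :] - control_pts[None, :, :]) ** 2).sum(-1)  # (N, K)
    w = np.exp(-d2 / sigma**2)           # locality: far points get ~0 displacement
    return points + w @ control_disp     # (N, 2) deformed coordinates

pts = np.random.rand(5, 2)
ctrl = np.array([[0.25, 0.25], [0.75, 0.75]])
disp = np.array([[0.05, 0.0], [0.0, -0.05]])
print(edffd_deform(pts, ctrl, disp))
```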

[CV-23] Beyond Motion Cues and Structural Sparsity: Revisiting Small Moving Target Detection

【速读】:该论文旨在解决小目标检测(small moving target detection)在复杂背景下的鲁棒性问题,尤其是在低信噪比、视觉线索模糊和背景杂乱等挑战场景中。传统方法通常依赖于目标特定特征或运动信息,难以适应多样化环境。其解决方案的关键在于提出一种基于张量低秩与稀疏分解(tensor-based low-rank and sparse decomposition)的新范式,将小目标检测与背景建模视为耦合任务,并利用背景的强低秩结构作为稳定先验。具体而言,作者设计了TenRPCANet网络,通过自注意力机制隐式施加多阶张量低秩约束以建模背景,同时引入特征精炼模块增强目标显著性,从而无需对目标特性做过多假设即可实现高精度检测。

链接: https://arxiv.org/abs/2509.07654
作者: Guoyi Zhang,Siyang Chen,Guangsheng Xu,Zhihua Shen,Han Wang,Xiaohu Zhang
机构: AVP lab, the School of Aeronautics and Astronautics, Sun Yat-sen University, Shenzhen 518107, Guangdong, China(中山大学航空航天学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Small moving target detection is crucial for many defense applications but remains highly challenging due to low signal-to-noise ratios, ambiguous visual cues, and cluttered backgrounds. In this work, we propose a novel deep learning framework that differs fundamentally from existing approaches, which often rely on target-specific features or motion cues and tend to lack robustness in complex environments. Our key insight is that small target detection and background discrimination are inherently coupled, even cluttered video backgrounds often exhibit strong low-rank structures that can serve as stable priors for detection. We reformulate the task as a tensor-based low-rank and sparse decomposition problem and conduct a theoretical analysis of the background, target, and noise components to guide model design. Building on these insights, we introduce TenRPCANet, a deep neural network that requires minimal assumptions about target characteristics. Specifically, we propose a tokenization strategy that implicitly enforces multi-order tensor low-rank priors through a self-attention mechanism. This mechanism captures both local and non-local self-similarity to model the low-rank background without relying on explicit iterative optimization. In addition, inspired by the sparse component update in tensor RPCA, we design a feature refinement module to enhance target saliency. The proposed method achieves state-of-the-art performance on two highly distinct and challenging tasks: multi-frame infrared small target detection and space object detection. These results demonstrate both the effectiveness and the generalizability of our approach.
zh
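
其理论出发点,即“低秩背景 + 稀疏目标”的分解,可用经典 RPCA 的简化交替迭代示意(NumPy):对展开后的视频矩阵交替做奇异值阈值与软阈值。这只是原理演示(非严格的 PCP 求解器),也不是论文的网络实现。

```python
import numpy as np

def soft_threshold(x, tau):
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def svt(x, tau):
    """Singular value thresholding: proximal operator of the nuclear norm."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return u @ np.diag(soft_threshold(s, tau)) @ vt

def rpca_decompose(X, lam=None, tau=10.0, n_iter=50):
    """Simplified alternating low-rank + sparse decomposition of a matricized
    video tensor: background ~ low-rank L, small targets ~ sparse S.
    tau/lam need tuning for real data."""
    lam = lam or 1.0 / np.sqrt(max(X.shape))
    L, S = np.zeros_like(X), np.zeros_like(X)
    for _ in range(n_iter):
        L = svt(X - S, tau)
        S = soft_threshold(X - L, lam)
    return L, S

# Frames unfolded as columns: rank-1 background plus one bright moving pixel.
T, P = 20, 100
X = np.outer(np.ones(P), np.linspace(1, 1.2, T))
for t in range(T):
    X[(5 * t) % P, t] += 5.0          # tiny moving "target"
L, S = rpca_decompose(X)
print(np.unravel_index(np.argmax(np.abs(S)), S.shape))  # lands on a target pixel
```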

[CV-24] Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity ICCV

【速读】:该论文旨在解决潜在扩散模型(Latent Diffusion Models, LDMs)中语义水印技术在面对再生攻击(regeneration attacks)和裁剪攻击(cropping attacks)时,因频域完整性丢失而导致检测性能下降的问题。解决方案的关键在于提出一种名为赫米特对称傅里叶水印(Hermitian Symmetric Fourier Watermarking, SFW)的新嵌入方法,通过强制执行赫米特对称性(Hermitian symmetry)来保持频域结构完整性;同时引入中心感知嵌入策略(center-aware embedding strategy),增强水印在图像裁剪下的鲁棒性,从而提升语义水印的识别准确率与抗攻击能力。实验表明,该方法在保证图像保真度(FID 和 CLIP 分数)的同时实现了最优的检测与识别性能。

链接: https://arxiv.org/abs/2509.07647
作者: Sung Ju Lee,Nam Ik Cho
机构: Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to the IEEE/CVF International Conference on Computer Vision (ICCV) 2025. Project page: this https URL Code: this https URL

点击查看摘要

Abstract:Semantic watermarking techniques for latent diffusion models (LDMs) are robust against regeneration attacks, but often suffer from detection performance degradation due to the loss of frequency integrity. To tackle this problem, we propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW), which maintains frequency integrity by enforcing Hermitian symmetry. Additionally, we introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking due to cropping attacks by ensuring robust information retention. To validate our approach, we apply these techniques to existing semantic watermarking schemes, enhancing their frequency-domain structures for better robustness and retrieval accuracy. Extensive experiments demonstrate that our methods achieve state-of-the-art verification and identification performance, surpassing previous approaches across various attack scenarios. Ablation studies confirm the impact of SFW on detection capabilities, the effectiveness of the center-aware embedding against cropping, and how message capacity influences identification accuracy. Notably, our method achieves the highest detection accuracy while maintaining superior image fidelity, as evidenced by FID and CLIP scores. Conclusively, our proposed SFW is shown to be an effective framework for balancing robustness and image fidelity, addressing the inherent trade-offs in semantic watermarking. Code available at this https URL
zh
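
赫米特对称嵌入的关键是成对修改共轭镜像频点,使 F[u, v] = conj(F[-u, -v]) 始终成立,从而保证逆变换为实值图像。下面的 NumPy 示意演示了这一点;嵌入位置与强度均为演示用假设,非论文的具体配置。

```python
import numpy as np

def embed_hermitian_watermark(img: np.ndarray, mark: np.ndarray,
                              row: int = 5, strength: float = 2.0) -> np.ndarray:
    """Embed bits in the 2D Fourier domain while enforcing Hermitian symmetry
    F[u, v] = conj(F[-u, -v]), so the inverse transform stays real-valued."""
    F = np.fft.fft2(img)
    H, W = F.shape
    for k, bit in enumerate(mark):
        u, v = row, k + 1
        val = strength * (1.0 if bit else -1.0)
        F[u, v] += val
        F[(-u) % H, (-v) % W] += np.conj(val)  # mirror coefficient keeps symmetry
    out = np.fft.ifft2(F)
    assert np.abs(out.imag).max() < 1e-8       # real-valued by construction
    return out.real

img = np.random.rand(64, 64)
wm = embed_hermitian_watermark(img, np.array([1, 0, 1, 1]))
print(np.abs(wm - img).max())  # small, spatially spread perturbation
```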

[CV-25] Self-Supervised Cross-Encoder for Neurodegenerative Disease Diagnosis

【速读】:该论文旨在解决深度学习在神经退行性疾病(如阿尔茨海默病)MRI影像诊断中面临的两大挑战:一是对大规模标注数据的依赖性强,二是模型学习到的表征缺乏可解释性。解决方案的关键在于提出一种新颖的自监督交叉编码框架(self-supervised cross-encoder framework),利用纵向MRI扫描中的时间连续性作为监督信号,将学习到的表征解耦为两个组件:一个由对比学习约束的静态表征,用于捕捉稳定的解剖学特征;另一个由输入梯度正则化引导的动态表征,反映时间变化并可用于下游分类任务的高效微调。这一设计显著提升了模型的分类准确率与可解释性,并展现出优异的零样本泛化能力和跨任务迁移性能。

链接: https://arxiv.org/abs/2509.07623
作者: Fangqi Cheng,Yingying Zhao,Xiaochen Yang
机构: University of Strathclyde (斯特拉斯克莱德大学); School of Mathematics and Statistics, Glasgow, UK. (格拉斯哥大学数学与统计学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Deep learning has shown significant potential in diagnosing neurodegenerative diseases from MRI data. However, most existing methods rely heavily on large volumes of labeled data and often yield representations that lack interpretability. To address both challenges, we propose a novel self-supervised cross-encoder framework that leverages the temporal continuity in longitudinal MRI scans for supervision. This framework disentangles learned representations into two components: a static representation, constrained by contrastive learning, which captures stable anatomical features; and a dynamic representation, guided by input-gradient regularization, which reflects temporal changes and can be effectively fine-tuned for downstream classification tasks. Experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate that our method achieves superior classification accuracy and improved interpretability. Furthermore, the learned representations exhibit strong zero-shot generalization on the Open Access Series of Imaging Studies (OASIS) dataset and cross-task generalization on the Parkinson Progression Marker Initiative (PPMI) dataset. The code for the proposed method will be made publicly available.
zh
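
为直观说明"将表示解耦为静态与动态分量"的思路,下面给出一个极简 PyTorch 示意:同一受试者两个时间点的静态分量用对比损失约束,动态分量留给下游任务。编码器结构、维度划分与温度系数均为示意假设,并非论文官方实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossEncoder(nn.Module):
    """把输入编码为 [静态 | 动态] 两段表示(维度划分为示意假设)。"""
    def __init__(self, in_dim=256, static_dim=64, dynamic_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, static_dim + dynamic_dim),
        )
        self.static_dim = static_dim

    def forward(self, x):
        z = self.backbone(x)
        return z[:, :self.static_dim], z[:, self.static_dim:]

def static_contrastive_loss(s1, s2, temperature=0.1):
    """同一受试者两个时间点的静态分量应彼此接近(InfoNCE 式对比损失)。"""
    s1, s2 = F.normalize(s1, dim=1), F.normalize(s2, dim=1)
    logits = s1 @ s2.t() / temperature        # 批内相似度矩阵
    labels = torch.arange(s1.size(0))         # 对角线为正样本对
    return F.cross_entropy(logits, labels)

enc = CrossEncoder()
x_t0, x_t1 = torch.randn(8, 256), torch.randn(8, 256)   # 同批受试者两个时间点的特征
s0, d0 = enc(x_t0)
s1, d1 = enc(x_t1)
loss = static_contrastive_loss(s0, s1)   # 动态分量 d0/d1 交由梯度正则与下游任务约束
loss.backward()
```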

[CV-26] Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimers Disease

【速读】:该论文旨在解决当前医学视觉语言模型(Medical Vision-Language Models, Med-VLMs)在临床应用中的两大核心问题:一是对患者结构化元数据(structured metadata)利用不足,二是缺乏与临床诊断知识的深度融合,且现有模型多基于2D图像训练,在3D医学影像(如CT和MRI)上的泛化能力有限。解决方案的关键在于提出一种数据高效的微调流程,通过两个创新机制实现:其一,将结构化元数据转化为合成报告以增强文本输入,提升图像-文本对齐效果;其二,引入一个辅助标记(auxiliary token)用于预测简易精神状态检查(Mini-Mental State Examination, MMSE)评分,提供额外的临床监督信号,从而增强模型对阿尔茨海默病(Alzheimer’s Disease, AD)严重程度的感知能力。结合轻量级提示微调(prompt tuning)策略,该方法在仅使用1,500张训练图像的情况下即达到优于使用10,000张图像微调模型的性能,显著提升了3D MRI影像上AD诊断的准确性与效率。

链接: https://arxiv.org/abs/2509.07613
作者: Fangqi Cheng,Surajit Ray,Xiaochen Yang
机构: University of Glasgow (格拉斯哥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer’s disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.
zh
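
其中"辅助 token 预测 MMSE 分数"这一设计可以抽象为如下 PyTorch 草图:在视觉 token 序列前附加一个可学习 token,用其编码后的输出回归 MMSE。维度、序列长度与损失权重均为假设,非论文确切结构。

```python
import torch
import torch.nn as nn

class MMSEAuxHead(nn.Module):
    """在视觉 token 序列前附加一个辅助 token,用其编码后的输出回归 MMSE 分数。"""
    def __init__(self, dim=768):
        super().__init__()
        self.aux_token = nn.Parameter(torch.randn(1, 1, dim))
        self.regressor = nn.Linear(dim, 1)

    def append_token(self, seq):          # seq: (B, L, dim) 的视觉 token
        aux = self.aux_token.expand(seq.size(0), -1, -1)
        return torch.cat([aux, seq], dim=1)

    def forward(self, encoded_seq):       # 编码后序列,第 0 位为辅助 token
        return self.regressor(encoded_seq[:, 0])

head = MMSEAuxHead()
seq = torch.randn(2, 196, 768)        # 假设的图像 token 序列
seq_aux = head.append_token(seq)      # 实际应再送入(冻结或微调的)编码器
mmse_pred = head(seq_aux)             # 训练时:总损失 = 诊断损失 + λ·MSE(mmse_pred, MMSE真值)
```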

[CV-27] Bias in Gender Bias Benchmarks: How Spurious Features Distort Evaluation ICCV2025

【速读】:该论文旨在解决当前视觉语言基础模型(Vision-Language Foundation Models, VLMs)性别偏见评估中存在的关键缺陷——即现有基准测试(benchmarks)中性别与非性别特征(如物体、背景)之间存在伪相关(spurious correlations),导致偏见评估结果可能反映的是模型对这些伪特征的响应,而非真正的性别偏见。解决方案的关键在于通过系统性地扰动非性别特征(例如遮挡10%的物体或弱模糊背景),量化其对偏见评分的影响,并发现微小扰动即可显著改变偏见指标(生成式VLMs偏见分数变化高达175%,CLIP变体达43%),从而揭示当前评估方法的不可靠性;因此,作者建议在报告偏见指标的同时补充特征敏感性测量(feature-sensitivity measurements),以提升偏见评估的可信度和透明度。

链接: https://arxiv.org/abs/2509.07596
作者: Yusuke Hirota,Ryo Hachiuma,Boyi Li,Ximing Lu,Michael Ross Boone,Boris Ivanovic,Yejin Choi,Marco Pavone,Yu-Chiang Frank Wang,Noa Garcia,Yuta Nakashima,Chao-Han Huck Yang
机构: NVIDIA; Osaka University (大阪大学); UC Berkeley (加州大学伯克利分校); Stanford University (斯坦福大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025

点击查看摘要

Abstract:Gender bias in vision-language foundation models (VLMs) raises concerns about their safe deployment and is typically evaluated using benchmarks with gender annotations on real-world images. However, as these benchmarks often contain spurious correlations between gender and non-gender features, such as objects and backgrounds, we identify a critical oversight in gender bias evaluation: Do spurious features distort gender bias evaluation? To address this question, we systematically perturb non-gender features across four widely used benchmarks (COCO-gender, FACET, MIAP, and PHASE) and various VLMs to quantify their impact on bias evaluation. Our findings reveal that even minimal perturbations, such as masking just 10% of objects or weakly blurring backgrounds, can dramatically alter bias scores, shifting metrics by up to 175% in generative VLMs and 43% in CLIP variants. This suggests that current bias evaluations often reflect model responses to spurious features rather than gender bias, undermining their reliability. Since creating spurious feature-free benchmarks is fundamentally challenging, we recommend reporting bias metrics alongside feature-sensitivity measurements to enable a more reliable bias assessment.
zh
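
论文的核心扰动实验(遮挡约 10% 物体像素、弱模糊背景)逻辑上可以用几行 NumPy 复现,示意如下;遮挡比例与模糊强度为示意取值,非论文确切参数。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def perturb_non_gender_features(img, obj_mask, mask_ratio=0.10, blur_sigma=2.0):
    """img: (H, W, 3) float 图像;obj_mask: (H, W) bool,True 为物体区域。
    随机遮挡 mask_ratio 的物体像素,并对背景做弱高斯模糊。"""
    out = img.copy()
    ys, xs = np.nonzero(obj_mask)                     # 物体像素坐标
    idx = np.random.choice(len(ys), size=int(len(ys) * mask_ratio), replace=False)
    out[ys[idx], xs[idx]] = 0.0                       # 遮挡部分物体像素
    blurred = gaussian_filter(out, sigma=(blur_sigma, blur_sigma, 0))
    out[~obj_mask] = blurred[~obj_mask]               # 仅模糊背景区域
    return out

img = np.random.rand(224, 224, 3)
obj_mask = np.zeros((224, 224), dtype=bool)
obj_mask[60:160, 60:160] = True                       # 假设的物体掩码
perturbed = perturb_non_gender_features(img, obj_mask)
# 分别在原图与 perturbed 上计算偏见指标,观察得分漂移即可复现文中的敏感性分析
```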

[CV-28] Can SSD-Mamba2 Unlock Reinforcement Learning for End-to-End Motion Control?

【速读】:该论文旨在解决端到端强化学习(Reinforcement Learning, RL)在运动控制中面临的两大核心挑战:一是现有控制器要么缺乏视觉感知(仅依赖本体感觉),要么采用基于Transformer的融合架构,导致计算与内存开销呈二次增长,难以支持长时程决策和高分辨率时空上下文;二是传统递归控制器在长期信用分配上表现不佳。其解决方案的关键在于提出一种基于SSD-Mamba2的视觉驱动跨模态RL框架,该框架利用选择性状态空间模型(Selective State-Space Model, SSD)实现递归与卷积扫描的统一,并通过硬件感知的数据流处理实现近线性扩展。该设计显著降低了延迟和内存消耗,相比二次复杂度的自注意力机制保留了长距离依赖关系,从而支持更远的前瞻视野、更高分辨率的token表示以及在有限算力下的稳定训练。

链接: https://arxiv.org/abs/2509.07593
作者: Gavin Tao,Yinuo Wang,Jinzhao Zhou
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注: 4 figures and 6 tables

点击查看摘要

Abstract:End-to-end reinforcement learning for motion control promises unified perception-action policies that scale across embodiments and tasks, yet most deployed controllers are either blind (proprioception-only) or rely on fusion backbones with unfavorable compute-memory trade-offs. Recurrent controllers struggle with long-horizon credit assignment, and Transformer-based fusion incurs quadratic cost in token length, limiting temporal and spatial context. We present a vision-driven cross-modal RL framework built on SSD-Mamba2, a selective state-space backbone that applies state-space duality (SSD) to enable both recurrent and convolutional scanning with hardware-aware streaming and near-linear scaling. Proprioceptive states and exteroceptive observations (e.g., depth tokens) are encoded into compact tokens and fused by stacked SSD-Mamba2 layers. The selective state-space updates retain long-range dependencies with markedly lower latency and memory use than quadratic self-attention, enabling longer look-ahead, higher token resolution, and stable training under limited compute. Policies are trained end-to-end under curricula that randomize terrain and appearance and progressively increase scene complexity. A compact, state-centric reward balances task progress, energy efficiency, and safety. Across diverse motion-control scenarios, our approach consistently surpasses strong state-of-the-art baselines in return, safety (collisions and falls), and sample efficiency, while converging faster at the same compute budget. These results suggest that SSD-Mamba2 provides a practical fusion backbone for scalable, foresightful, and efficient end-to-end motion control.
zh

[CV-29] Temporal Image Forensics: A Review and Critical Evaluation

【速读】:该论文旨在解决数字图像年龄估计(temporal image forensics)中的可靠性与可解释性问题,特别是针对由图像采集流水线引入的时间依赖性痕迹(age traces)的利用有效性及潜在内容偏差(content bias)的影响。其解决方案的关键在于:首先提出更贴近现实的取证场景以提升研究实用性;其次通过实证验证传感器缺陷等典型年龄痕迹的生长速率和空间分布特性;更重要的是,借助可解释人工智能(eXplainable Artificial Intelligence, XAI)方法揭示当前主流神经网络模型在图像年龄估计中可能并非真正学习了预期的年龄痕迹,而是受到内容偏差干扰——例如,一种用于利用传感器缺陷进行年龄估算的方法实际上可能是在捕捉图像内容特征而非真实时间痕迹;最后,通过实验展示神经网络极易被误导而无法稳定学习有效的年龄特征,从而强调未来研究需更加注重模型的可解释性和鲁棒性设计。

链接: https://arxiv.org/abs/2509.07591
作者: Robert Jöchl,Andreas Uhl
机构: University of Salzburg (萨尔茨堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Temporal image forensics is the science of estimating the age of a digital image. Usually, time-dependent traces (age traces) introduced by the image acquisition pipeline are exploited for this purpose. In this review, a comprehensive overview of the field of temporal image forensics based on time-dependent traces from the image acquisition pipeline is given. This includes a detailed insight into the properties of known age traces (i.e., in-field sensor defects and sensor dust) and temporal image forensics techniques. Another key aspect of this work is to highlight the problem of content bias and to illustrate how important eXplainable Artificial Intelligence methods are to verify the reliability of temporal image forensics techniques. Apart from reviewing material presented in previous works, in this review: (i) a new (probably more realistic) forensic setting is proposed; (ii) the main properties (growth rate and spatial distribution) of in-field sensor defects are verified; (iii) it is shown that a method proposed to utilize in-field sensor defects for image age approximation actually exploits other traces (most likely content bias); (iv) the features learned by a neural network dating palmprint images are further investigated; (v) it is shown how easily a neural network can be distracted from learning age traces. For this purpose, previous work is analyzed, re-implemented if required and experiments are conducted.
zh

[CV-30] Attention Maps in 3D Shape Classification for Dental Stage Estimation with Class Node Graph Attention Networks

【速读】:该论文旨在解决深度学习模型在高风险应用场景(如医学领域)中因“黑箱特性”导致的信任与责任缺失问题,特别是在3D形状识别任务中缺乏可解释性。其解决方案的关键在于提出一种类节点图注意力网络(Class Node Graph Attention Network, CGAT),通过引入图注意力卷积和内嵌注意力机制,并借助注意力传播(attention rollout)可视化决策过程,使模型输出具有人类可理解的解释能力。实验表明,结合局部平均曲率与到中心节点距离作为节点特征,并引入指向全局CLS节点的有向边结构,不仅能提升分类性能(加权F1达0.76),还能生成更直观的注意力热力图,从而增强专家对模型决策的可信度与验证能力。

链接: https://arxiv.org/abs/2509.07581
作者: Barkin Buyukcakir,Rocharles Cavalcante Fontenele,Reinhilde Jacobs,Jannick De Tobel,Patrick Thevissen,Dirk Vandermeulen,Peter Claes
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 25 pages, 8 figures, 2nd International Conference on Explainable AI for Neural or Symbolic Methods

点击查看摘要

Abstract:Deep learning offers a promising avenue for automating many recognition tasks in fields such as medicine and forensics. However, the black-box nature of these models hinders their adoption in high-stakes applications where trust and accountability are required. For 3D shape recognition tasks in particular, this paper introduces the Class Node Graph Attention Network (CGAT) architecture to address this need. Applied to 3D meshes of third molars derived from CBCT images, for Demirjian stage allocation, CGAT utilizes graph attention convolutions and an inherent attention mechanism, visualized via attention rollout, to explain its decision-making process. We evaluated the local mean curvature and distance to centroid node features, both individually and in combination, as well as model depth, finding that models incorporating directed edges to a global CLS node produced more intuitive attention maps, while also yielding desirable classification performance. We analyzed the attention-based explanations of the models, and their predictive performances to propose optimal settings for the CGAT. The combination of local mean curvature and distance to centroid as node features yielded a slight performance increase with 0.76 weighted F1 score, and more comprehensive attention visualizations. The CGAT architecture’s ability to generate human-understandable attention maps can enhance trust and facilitate expert validation of model decisions. While demonstrated on dental data, CGAT is broadly applicable to graph-based classification and regression tasks, promoting wider adoption of transparent and competitive deep learning models in high-stakes environments.
zh

[CV-31] PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image

【速读】:该论文旨在解决从单张未对齐图像中快速重建高保真度的全头高斯表示(Gaussian full-head model)的问题,传统方法依赖耗时的生成对抗网络(GAN)反演和测试时优化,难以满足实时性需求。其解决方案的关键在于提出一个前向传播框架,能够在一次前向推理中完成高质量重建,同时利用由训练好的3D GAN生成的大规模合成数据集进行模型训练,避免了真实3D头部数据稀缺的问题;此外,设计了一个从粗到精的高斯头生成流水线,通过FLAME模型稀疏点与图像特征经Transformer模块交互实现粗略形状重建,再进行密集化处理以提升细节 fidelity,并引入双分支结构有效融合结构化的球面三平面特征(spherical triplane feature)与非结构化的点特征,从而充分利用预训练3D GAN中的先验知识,显著提升重建效果。

链接: https://arxiv.org/abs/2509.07552
作者: Peng Li,Yisheng He,Yingdong Hu,Yuan Dong,Weihao Yuan,Yuan Liu,Zilong Dong,Yike Guo
机构: HKUST; Tongyi Lab, Alibaba Group
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework compared to existing work.
zh

[CV-32] TextlessRAG: End-to-End Visual Document RAG by Speech Without Text

【速读】:该论文旨在解决在大规模文档图像上进行基于语音查询的知识库问答问题,即如何在不依赖传统文本转换技术(如自动语音识别 ASR、光学字符识别 OCR 和文本到语音 TTS)的前提下,实现端到端的语音驱动文档理解与答案生成。其解决方案的关键在于提出 TextlessRAG 框架,该框架构建了一个完全无文本(textless)的处理流程:直接解析语音输入,从文档图像中检索相关视觉知识,并生成答案,从而避免了中间文本转换环节带来的误差累积和复杂性;同时引入布局感知重排序机制以提升检索精度,显著优化了效率与准确性。

链接: https://arxiv.org/abs/2509.07538
作者: Peijin Xie,Shun Qian,Bingquan Liu,Dexin Wang,Lin Sun,Xiangzheng Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, 4 figures,

点击查看摘要

Abstract:Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at repository: this https URL
zh

[CV-33] HU-based Foreground Masking for 3D Medical Masked Image Modeling MICCAI

【速读】:该论文旨在解决当前掩码图像建模(Masked Image Modeling, MIM)在3D医学图像计算中应用受限的问题,即传统随机掩码策略忽视了解剖结构的密度分布,导致学习到的表示缺乏临床相关性。其解决方案的关键在于提出一种基于Hounsfield Unit (HU)的前景掩码(HU-based Foreground Masking)策略,该策略通过利用HU值的强度分布聚焦于实质性脏器区域,同时排除空气和液体等无诊断意义的非组织区域,从而提升预训练任务对医学图像语义特征的捕捉能力。实验表明,该方法在多个公开3D医学影像数据集上显著提升了分割质量与Dice分数,验证了领域定制化MIM在医学图像分割中的有效性。

链接: https://arxiv.org/abs/2509.07534
作者: Jin Lee,Vu Dang,Gwang-Hyun Yu,Anh Le,Zahid Rahman,Jin-Ho Jang,Heonzoo Lee,Kun-Yung Kim,Jin-Sul Kim,Jin-Young Kim
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by MICCAI AMAI Workshop 2025

点击查看摘要

Abstract:While Masked Image Modeling (MIM) has revolutionized fields of computer vision, its adoption in 3D medical image computing has been limited by the use of random masking, which overlooks the density of anatomical objects. To address this limitation, we enhance the pretext task with a simple yet effective masking strategy. Leveraging Hounsfield Unit (HU) measurements, we implement an HU-based Foreground Masking, which focuses on the intensity distribution of visceral organs and excludes non-tissue regions, such as air and fluid, that lack diagnostically meaningful features. Extensive experiments on five public 3D medical imaging datasets demonstrate that our masking consistently improves performance, both in quality of segmentation and Dice score (BTCV: 84.64%, Flare22: 92.43%, MM-WHS: 90.67%, Amos22: 88.64%, BraTS: 78.55%). These results underscore the importance of domain-centric MIM and suggest a promising direction for representation learning in medical image segmentation. Implementation is available at this http URL.
zh
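
HU 前景掩码的要点是只在软组织 patch 上做掩码、跳过空气与液体区域,可用如下 NumPy 草图表达;其中软组织 HU 区间、patch 大小与掩码比例均为假设取值。

```python
import numpy as np

def hu_foreground_mask(volume_hu, patch=16, hu_range=(-100, 300), mask_ratio=0.6):
    """volume_hu: (D, H, W) 的 CT 体数据(单位 HU)。
    只在平均 HU 落在软组织区间内的 patch 上随机掩码,跳过空气/液体。"""
    D, H, W = volume_hu.shape
    mask = np.zeros_like(volume_hu, dtype=bool)
    candidates = []
    for z in range(0, D - patch + 1, patch):
        for y in range(0, H - patch + 1, patch):
            for x in range(0, W - patch + 1, patch):
                block = volume_hu[z:z+patch, y:y+patch, x:x+patch]
                if hu_range[0] <= block.mean() <= hu_range[1]:   # 软组织 patch
                    candidates.append((z, y, x))
    np.random.shuffle(candidates)
    for z, y, x in candidates[: int(len(candidates) * mask_ratio)]:
        mask[z:z+patch, y:y+patch, x:x+patch] = True
    return mask   # True 处体素在 MIM 预训练中被遮挡并要求重建

vol = np.random.randint(-1000, 1000, size=(64, 128, 128)).astype(np.float32)
m = hu_foreground_mask(vol)
```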

[CV-34] Universal Few-Shot Spatial Control for Diffusion Models

【速读】:该论文旨在解决预训练文本到图像扩散模型中空间控制(Spatial Conditioning)适应性不足的问题,即现有控制适配器在面对与训练任务差异较大的新型空间控制条件时,泛化能力有限且训练成本高昂。解决方案的关键在于提出通用少样本控制(Universal Few-Shot Control, UFC),其通过构建查询条件与支持条件之间的类比关系,利用匹配机制和少量任务特定参数的更新,动态生成任务相关的控制特征,从而实现对未见任务的高效、细粒度空间控制。

链接: https://arxiv.org/abs/2509.07530
作者: Kiet T. Nguyen,Chanhuyk Lee,Donggyun Kim,Dong Hoon Lee,Seunghoon Hong
机构: KAIST
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Spatial conditioning in pretrained text-to-image diffusion models has significantly improved fine-grained control over the structure of generated images. However, existing control adapters exhibit limited adaptability and incur high training costs when encountering novel spatial control conditions that differ substantially from the training tasks. To address this limitation, we propose Universal Few-Shot Control (UFC), a versatile few-shot control adapter capable of generalizing to novel spatial conditions. Given a few image-condition pairs of an unseen task and a query condition, UFC leverages the analogy between query and support conditions to construct task-specific control features, instantiated by a matching mechanism and an update on a small set of task-specific parameters. Experiments on six novel spatial control tasks show that UFC, fine-tuned with only 30 annotated examples of novel tasks, achieves fine-grained control consistent with the spatial conditions. Notably, when fine-tuned with 0.1% of the full training data, UFC achieves competitive performance with the fully supervised baselines in various control tasks. We also show that UFC is applicable agnostically to various diffusion backbones and demonstrate its effectiveness on both UNet and DiT architectures. Code is available at this https URL.
zh

[CV-35] EHWGesture – A dataset for multimodal understanding of clinical gestures ICCV2025

【速读】:该论文旨在解决动态手势理解中因复杂时空变化带来的挑战,以及现有数据集在多模态多样性、多视角覆盖、精确轨迹标注和动作质量评估方面的不足。其解决方案的关键在于构建一个名为EHWGesture的多模态视频数据集,该数据集包含5种临床相关的手势,涵盖超过1,100段录制视频(总计6小时),由25名健康受试者在双高分辨率RGB-Depth相机与事件相机下采集,并通过动作捕捉系统提供精确的手部关键点跟踪,同时所有设备经过空间标定与同步以确保跨模态对齐;此外,为嵌入动作质量评估任务,数据按执行速度分组,模拟临床手部灵巧性评估标准,从而支持手势分类、触发检测与动作质量评估等多任务基准测试。

链接: https://arxiv.org/abs/2509.07525
作者: Gianluca Amprimo,Alberto Ancilotto,Alessandro Savino,Fabio Quazzolo,Claudia Ferraris,Gabriella Olmo,Elisabetta Farella,Stefano Di Carlo
机构: Politecnico di Torino (都灵理工大学); Fondazione Bruno Kessler (布鲁诺·凯斯勒基金会); CNR-IEIIT (意大利国家研究委员会-电气电子与信息技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted at ICCV 2025 Workshop on AI-driven Skilled Activity Understanding, Assessment & Feedback Generation

点击查看摘要

Abstract:Hand gesture understanding is essential for several applications in human-computer interaction, including automatic clinical assessment of hand dexterity. While deep learning has advanced static gesture recognition, dynamic gesture understanding remains challenging due to complex spatiotemporal variations. Moreover, existing datasets often lack multimodal and multi-view diversity, precise ground-truth tracking, and an action quality component embedded within gestures. This paper introduces EHWGesture, a multimodal video dataset for gesture understanding featuring five clinically relevant gestures. It includes over 1,100 recordings (6 hours), captured from 25 healthy subjects using two high-resolution RGB-Depth cameras and an event camera. A motion capture system provides precise ground-truth hand landmark tracking, and all devices are spatially calibrated and synchronized to ensure cross-modal alignment. Moreover, to embed an action quality task within gesture understanding, collected recordings are organized in classes of execution speed that mirror clinical evaluations of hand dexterity. Baseline experiments highlight the dataset’s potential for gesture classification, gesture trigger detection, and action quality assessment. Thus, EHWGesture can serve as a comprehensive benchmark for advancing multimodal clinical gesture understanding.
zh

[CV-36] Neural Cone Radiosity for Interactive Global Illumination with Glossy Materials

【速读】:该论文旨在解决高频率出射辐射分布建模难题,尤其针对具有强视角依赖性的镜面材质(glossy material)的渲染问题。现有基于神经辐射度(neural radiosity)的方法主要依赖位置特征编码,在捕捉此类高频率、视点敏感的反射分布时存在显著局限。其解决方案的关键在于提出一种反射感知的射线锥体编码方法(reflectance-aware ray cone encoding),构建名为神经锥体辐射度(neural cone radiosity)的新框架:通过预滤波的多分辨率哈希网格对镜面BSDF主瓣进行精确逼近,并利用连续空间聚合将视点相关的反射特性直接嵌入编码过程,从而显著提升网络对高频反射分布的建模能力,同时兼顾不同光泽度范围的表面表现,且保持模型结构紧凑高效。

链接: https://arxiv.org/abs/2509.07522
作者: Jierui Ren,Haojie Jin,Bo Pang,Yisong Chen,Guoping Wang,Sheng Li
机构: Peking University (北京大学); National Key Laboratory of Intelligent Parallel Technology (智能并行技术国家重点实验室)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Modeling of high-frequency outgoing radiance distributions has long been a key challenge in rendering, particularly for glossy material. Such distributions concentrate radiative energy within a narrow lobe and are highly sensitive to changes in view direction. However, existing neural radiosity methods, which primarily rely on positional feature encoding, exhibit notable limitations in capturing these high-frequency, strongly view-dependent radiance distributions. To address this, we propose a highly-efficient approach by reflectance-aware ray cone encoding based on the neural radiosity framework, named neural cone radiosity. The core idea is to employ a pre-filtered multi-resolution hash grid to accurately approximate the glossy BSDF lobe, embedding view-dependent reflectance characteristics directly into the encoding process through continuous spatial aggregation. Our design not only significantly improves the network’s ability to model high-frequency reflection distributions but also effectively handles surfaces with a wide range of glossiness levels, from highly glossy to low-gloss finishes. Meanwhile, our method reduces the network’s burden in fitting complex radiance distributions, allowing the overall architecture to remain compact and efficient. Comprehensive experimental results demonstrate that our method consistently produces high-quality, noise-free renderings in real time under various glossiness conditions, and delivers superior fidelity and realism compared to baseline approaches.
zh

[CV-37] MVAT: Multi-View Aware Teacher for Weakly Supervised 3D Object Detection WACV2026

【速读】:该论文旨在解决弱监督三维目标检测中因仅依赖二维边界框(2D box)标注而引发的投影歧义和单视角下部分遮挡导致的三维边界框(3D box)估计不准问题。其解决方案的关键在于提出一种名为MVAT(Multi-View Aware Teacher)的新框架,该框架利用序列数据中的时序多视角信息,通过时间维度上对物体中心点云的聚合构建尽可能密集完整的3D对象表示;同时采用教师-学生知识蒸馏机制,使学生网络从单视角学习到由时序聚合静态物体生成的高质量伪标签,从而提升对静止与运动目标的3D检测性能,并引入多视角2D投影一致性损失以强化预测3D框与所有可用2D标注之间的几何一致性。

链接: https://arxiv.org/abs/2509.07507
作者: Saad Lahlali,Alexandre Fournier Montgieux,Nicolas Granger,Hervé Le Borgne,Quoc Cuong Pham
机构: Université Paris-Saclay (巴黎-萨克雷大学); CEA (法国原子能和替代能源委员会); List (List实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at WACV 2026

点击查看摘要

Abstract:Annotating 3D data remains a costly bottleneck for 3D object detection, motivating the development of weakly supervised annotation methods that rely on more accessible 2D box annotations. However, relying solely on 2D boxes introduces projection ambiguities since a single 2D box can correspond to multiple valid 3D poses. Furthermore, partial object visibility under a single viewpoint setting makes accurate 3D box estimation difficult. We propose MVAT, a novel framework that leverages the temporal multi-view information present in sequential data to address these challenges. Our approach aggregates object-centric point clouds across time to build 3D object representations as dense and complete as possible. A Teacher-Student distillation paradigm is employed: The Teacher network learns from single viewpoints but targets are derived from temporally aggregated static objects. Then the Teacher generates high quality pseudo-labels that the Student learns to predict from a single viewpoint for both static and moving objects. The whole framework incorporates a multi-view 2D projection loss to enforce consistency between predicted 3D boxes and all available 2D annotations. Experiments on the nuScenes and Waymo Open datasets demonstrate that MVAT achieves state-of-the-art performance for weakly supervised 3D object detection, significantly narrowing the gap with fully supervised methods without requiring any 3D box annotations. Our code is available in our public repository (this https URL).
zh

[CV-38] Generating Transferrable Adversarial Examples via Local Mixing and Logits Optimization for Remote Sensing Object Recognition

【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)在遥感应用中易受对抗攻击的问题,特别是现有基于混合策略的攻击方法在提升对抗样本迁移性时存在的局限性:如全局混叠或区域直接替换会破坏图像的全局语义特征,并且依赖交叉熵损失进行扰动优化时容易出现梯度消失问题,从而影响对抗样本质量。其解决方案的关键在于提出一种新型非目标攻击框架,包含三个核心创新:(1) 提出局部混合策略,在保留全局语义信息的前提下生成多样化且语义一致的输入;(2) 将目标攻击中的logit损失适配至非目标场景,缓解梯度消失问题;(3) 引入扰动平滑损失以抑制高频噪声,增强对抗样本的迁移能力。实验表明,该方法在FGSCR-42和MTARSI数据集上优于12种先进方法,尤其在ResNet作为代理模型时,黑盒攻击成功率平均提升17.28%。

链接: https://arxiv.org/abs/2509.07495
作者: Chun Liu,Hailong Wang,Bingqian Zhu,Panpan Ding,Zheng Zheng,Tao Xu,Zhigang Han,Jiayao Wang
机构: Henan University (河南大学); Beihang University (北京航空航天大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, posing significant security threats to their deployment in remote sensing applications. Research on adversarial attacks not only reveals model vulnerabilities but also provides critical insights for enhancing robustness. Although current mixing-based strategies have been proposed to increase the transferability of adversarial examples, they either perform global blending or directly exchange a region in the images, which may destroy global semantic features and mislead the optimization of adversarial examples. Furthermore, their reliance on cross-entropy loss for perturbation optimization leads to gradient diminishing during iterative updates, compromising adversarial example quality. To address these limitations, we focus on non-targeted attacks and propose a novel framework via local mixing and logits optimization. First, we present a local mixing strategy to generate diverse yet semantically consistent inputs. Different from MixUp, which globally blends two images, and MixCut, which stitches images together, our method merely blends local regions to preserve global semantic information. Second, we adapt the logit loss from targeted attacks to non-targeted scenarios, mitigating the gradient vanishing problem of cross-entropy loss. Third, a perturbation smoothing loss is applied to suppress high-frequency noise and enhance transferability. Extensive experiments on FGSCR-42 and MTARSI datasets demonstrate superior performance over 12 state-of-the-art methods across 6 surrogate models. Notably, with ResNet as the surrogate on MTARSI, our method achieves a 17.28% average improvement in black-box attack success rate.
zh
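
该方法的两个关键点(仅混合局部区域、以真实类 logit 作为非目标攻击损失)可以写成如下单步更新的 PyTorch 示意;模型、窗口位置与步长均为占位假设,非论文的完整迭代流程。

```python
import torch

def local_mix(x, x_ref, top=8, left=8, size=16, lam=0.5):
    """仅在局部窗口内按 lam 混合两张图,其余区域保持不变,以保留全局语义。"""
    mixed = x.clone()
    mixed[..., top:top+size, left:left+size] = (
        lam * x[..., top:top+size, left:left+size]
        + (1 - lam) * x_ref[..., top:top+size, left:left+size]
    )
    return mixed

def nontargeted_logit_step(model, x, y, eps=2 / 255):
    """非目标攻击的单步更新:直接下压真实类别的 logit,避免交叉熵在迭代中梯度消失。"""
    x_adv = x.clone().requires_grad_(True)
    true_logit = model(x_adv).gather(1, y.unsqueeze(1)).sum()
    true_logit.backward()
    return (x_adv - eps * x_adv.grad.sign()).detach().clamp(0, 1)

# 占位模型与数据,仅演示调用方式
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, x_ref = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
y = torch.randint(0, 10, (4,))
x_adv = nontargeted_logit_step(model, local_mix(x, x_ref), y)  # 实际中需多步迭代
```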

[CV-39] DiGS: Accurate and Complete Surface Reconstruction from 3D Gaussians via Direct SDF Learning

【速读】:该论文旨在解决3D Gaussian Splatting (3DGS) 在表面重建中精度与完整性不足的问题,其核心挑战源于该表示方法的非结构化特性以及缺乏显式的几何监督信号。解决方案的关键在于提出DiGS框架,通过将有符号距离场(Signed Distance Field, SDF)学习直接嵌入3DGS管线,为每个高斯分布赋予可学习的SDF值,从而引入强且可解释的表面先验,使高斯原语显式对齐于底层几何结构并提升跨视角一致性;此外,设计了一种基于几何引导的网格生长策略,在多尺度层次下自适应地沿几何一致区域分布高斯,以实现稠密且连贯的覆盖。

链接: https://arxiv.org/abs/2509.07493
作者: Wenzhi Guo,Bing Wang
机构: The Hong Kong Polytechnic University (香港理工大学); Nanjing University (南京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computational Geometry (cs.CG)
备注:

点击查看摘要

Abstract:3D Gaussian Splatting (3DGS) has recently emerged as a powerful paradigm for photorealistic view synthesis, representing scenes with spatially distributed Gaussian primitives. While highly effective for rendering, achieving accurate and complete surface reconstruction remains challenging due to the unstructured nature of the representation and the absence of explicit geometric supervision. In this work, we propose DiGS, a unified framework that embeds Signed Distance Field (SDF) learning directly into the 3DGS pipeline, thereby enforcing strong and interpretable surface priors. By associating each Gaussian with a learnable SDF value, DiGS explicitly aligns primitives with underlying geometry and improves cross-view consistency. To further ensure dense and coherent coverage, we design a geometry-guided grid growth strategy that adaptively distributes Gaussians along geometry-consistent regions under a multi-scale hierarchy. Extensive experiments on standard benchmarks, including DTU, Mip-NeRF 360, and Tanks & Temples, demonstrate that DiGS consistently improves reconstruction accuracy and completeness while retaining high rendering fidelity.
zh

[CV-40] Fine-Tuning Vision-Language Models for Visual Navigation Assistance

【速读】:该论文旨在解决视觉障碍人士在室内环境中依赖图像和自然语言引导进行导航的难题,传统导航系统因缺乏精确位置数据而在室内场景中失效。其解决方案的关键在于融合视觉与语言模型,通过在人工标注的室内导航数据集上对BLIP-2模型采用低秩适应(Low Rank Adaptation, LoRA)进行微调,从而生成更准确的分步导航指令;同时提出一种改进的评估指标,基于BERT F1分数并强化方向性和顺序性变量,以更全面地衡量导航性能,显著提升了模型生成方向性指令的能力,克服了原始BLIP-2模型的局限性。

链接: https://arxiv.org/abs/2509.07488
作者: Xiao Li,Bharat Gandhi,Ming Zhan,Mohit Nehra,Zhicheng Zhang,Yuchen Sun,Meijia Song,Naisheng Zhang,Xi Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We address vision-language-driven indoor navigation to assist visually impaired individuals in reaching a target location using images and natural language guidance. Traditional navigation systems are ineffective indoors due to the lack of precise location data. Our approach integrates vision and language models to generate step-by-step navigational instructions, enhancing accessibility and independence. We fine-tune the BLIP-2 model with Low Rank Adaptation (LoRA) on a manually annotated indoor navigation dataset. We propose an evaluation metric that refines the BERT F1 score by emphasizing directional and sequential variables, providing a more comprehensive measure of navigational performance. After applying LoRA, the model significantly improved in generating directional instructions, overcoming limitations in the original BLIP-2 model.
zh
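
用 LoRA 微调 BLIP-2 的一般做法可借助 Hugging Face transformers 与 peft 实现,大致如下;检查点名称与 target_modules 的选取依具体模型变体而定,此处仅为常见示例,并非论文的确切配置。

```python
from transformers import Blip2ForConditionalGeneration, Blip2Processor
from peft import LoraConfig, get_peft_model

ckpt = "Salesforce/blip2-opt-2.7b"   # 检查点名称为常见示例,论文未指明具体变体
processor = Blip2Processor.from_pretrained(ckpt)
model = Blip2ForConditionalGeneration.from_pretrained(ckpt)

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # 注入 LoRA 的注意力投影层,取法为假设
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()          # 仅极少量参数参与训练

# 随后在(室内图像, 逐步导航指令)标注对上按常规 seq2seq 方式微调即可
```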

[CV-41] LINR Bridge: Vector Graphic Animation via Neural Implicits and Video Diffusion Priors ICIP

【速读】:该论文旨在解决向量图形(Vector Graphics)动画生成过程中人工干预多、自动化程度低的问题。传统方法在处理向量图形动画时,难以实现高质量且自然的运动效果,同时受限于像素图像与向量图形之间的域差异(domain gap)。其解决方案的关键在于:首先利用分层隐式神经表示(Layered Implicit Neural Representations)重建向量图形,保留其无限分辨率、精确颜色和形状约束等固有特性;其次,通过视频得分蒸馏采样(Video Score Distillation Sampling)优化神经表示,借助预训练文本到视频扩散模型(text-to-video diffusion models)中的运动先验(motion priors),从而实现对向量图形的平滑变形与自然动画生成。该方法有效弥合了向量图形与扩散模型之间的域差距,显著提升了动画的质量与灵活性。

链接: https://arxiv.org/abs/2509.07484
作者: Wenshuo Gao,Xicheng Lan,Luyao Zhang,Shuai Yang
机构: Peking University (北京大学); Key Laboratory of Science, Technology and Standard in Press Industry (Press Industry 科技与标准重点实验室); Fundamental Research Funds for the Central Universities (中央高校基本科研业务费专项资金)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 5 pages, ICIPW 2025, Website: this https URL

点击查看摘要

Abstract:Vector graphics, known for their scalability and user-friendliness, provide a unique approach to visual content compared to traditional pixel-based images. Animation of these graphics, driven by the motion of their elements, offers enhanced comprehensibility and controllability but often requires substantial manual effort. To automate this process, we propose a novel method that integrates implicit neural representations with text-to-video diffusion models for vector graphic animation. Our approach employs layered implicit neural representations to reconstruct vector graphics, preserving their inherent properties such as infinite resolution and precise color and shape constraints, which effectively bridges the large domain gap between vector graphics and diffusion models. The neural representations are then optimized using video score distillation sampling, which leverages motion priors from pretrained text-to-video diffusion models. Finally, the vector graphics are warped to match the representations resulting in smooth animation. Experimental results validate the effectiveness of our method in generating vivid and natural vector graphic animations, demonstrating significant improvement over existing techniques that suffer from limitations in flexibility and animation quality.
zh

[CV-42] MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification

【速读】:该论文旨在解决深度神经网络在放射影像分类中因可解释性差而限制临床接受度的问题。其核心解决方案是提出MedicalPatchNet,一种内在自解释的架构,通过将胸部X光图像分割为非重叠图像块(patch),独立分类每个图像块并聚合预测结果,从而实现对决策来源的透明化可视化,无需依赖后处理的解释技术。该方法显著提升了病灶定位准确性(在CheXlocalize数据集上平均命中率0.485 vs. Grad-CAM的0.376),同时保持与EfficientNet-B0相当的分类性能(AUROC 0.907 vs. 0.908),有效缓解了捷径学习(shortcut learning)带来的风险,增强了临床信任。

链接: https://arxiv.org/abs/2509.07477
作者: Patrick Wienholt,Christiane Kuhl,Jakob Nikolas Kather,Sven Nebelung,Daniel Truhn
机构: University Hospital Aachen (亚琛大学医院); Technical University Dresden (德累斯顿工业大学); University Hospital RWTH Aachen (亚琛工业大学附属医院); University of Leeds (利兹大学); University Hospital Dresden (德累斯顿大学医院); University Hospital Heidelberg (海德堡大学医院)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:

点击查看摘要

Abstract:Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance. We present MedicalPatchNet, an inherently self-explainable architecture for chest X-ray classification that transparently attributes decisions to distinct image regions. MedicalPatchNet splits images into non-overlapping patches, independently classifies each patch, and aggregates predictions, enabling intuitive visualization of each patch’s diagnostic contribution without post-hoc techniques. Trained on the CheXpert dataset (223,414 images), MedicalPatchNet matches the classification performance (AUROC 0.907 vs. 0.908) of EfficientNet-B0, while substantially improving interpretability: it achieves higher pathology localization accuracy (mean hit-rate 0.485 vs. 0.376 with Grad-CAM) on the CheXlocalize dataset. By providing explicit, reliable explanations accessible even to non-AI experts, MedicalPatchNet mitigates risks associated with shortcut learning, thus improving clinical trust. Our model is publicly available with reproducible training and inference scripts and contributes to safer, explainable AI-assisted diagnostics across medical imaging domains. We make the code publicly available: this https URL
zh
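
"逐 patch 独立分类再聚合"的架构可以直接用张量 unfold 写出,下面是一个结构示意;patch 大小与 patch 分类器均为假设的简化版本,非论文原始网络。

```python
import torch
import torch.nn as nn

class PatchNet(nn.Module):
    """将图像切成不重叠 patch,共享分类器独立打分,再对 logits 取平均;
    每个 patch 的 logits 即其诊断贡献,可直接重排为热力图。"""
    def __init__(self, patch=56, n_classes=5):
        super().__init__()
        self.patch = patch
        self.patch_clf = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):                               # x: (B, 1, H, W)
        p = self.patch
        patches = x.unfold(2, p, p).unfold(3, p, p)     # (B, 1, nh, nw, p, p)
        B, _, nh, nw, _, _ = patches.shape
        flat = patches.reshape(B * nh * nw, 1, p, p)
        logits = self.patch_clf(flat).reshape(B, nh * nw, -1)
        return logits.mean(dim=1), logits               # 整图预测 + 逐 patch 贡献

net = PatchNet()
image_logits, patch_logits = net(torch.randn(2, 1, 224, 224))
```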

[CV-43] ANYPORTAL: Zero-Shot Consistent Video Background Replacement ICCV2025

【速读】:该论文旨在解决视频生成中难以实现细粒度控制的问题,尤其是视频背景替换时面临的前景一致性与光照一致性难题。现有方法在保持前景对象的时空一致性及自然光照过渡方面表现不足,限制了其在实际应用中的效果。解决方案的关键在于提出一种零样本(zero-shot)框架ANYPORTAL,该框架通过协同利用视频扩散模型的时间先验与图像扩散模型的再照明能力,在无需训练的情况下实现高质量背景替换。其中,核心创新是提出的精炼投影算法(Refinement Projection Algorithm),该算法能够在像素级别进行细节调整,从而精确保留前景内容并确保光照变化的连续性,最终在消费级GPU上实现高效且逼真的视频编辑效果。

链接: https://arxiv.org/abs/2509.07472
作者: Wenshuo Gao,Xicheng Lan,Shuai Yang
机构: Peking University (北京大学); Wangxuan Institute of Computer Technology (王选计算机技术研究所); State Key Laboratory of Multimedia Information Processing (多媒体信息处理国家重点实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, ICCV 2025, Website: this https URL

点击查看摘要

Abstract:Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.
zh

[CV-44] DepthVision: Robust Vision-Language Understanding through GAN-Based LiDAR-to-RGB Synthesis

【速读】:该论文旨在解决机器人在视觉输入退化或不足(如低光照、运动模糊等)条件下仍能可靠运行的核心挑战。解决方案的关键在于提出DepthVision框架,其通过条件生成对抗网络(conditional GAN)结合集成精炼网络,从稀疏LiDAR点云中合成RGB图像,并利用亮度感知模态适配(Luminance-Aware Modality Adaptation, LAMA)动态融合真实RGB数据与合成RGB视图,从而在不微调下游视觉语言模型(Vision-Language Models, VLMs)的前提下提升系统鲁棒性。该方法显著改善了低光环境下的性能表现,同时保持与冻结VLMs的兼容性。

链接: https://arxiv.org/abs/2509.07463
作者: Sven Kirchner,Nils Purschke,Ross Greer,Alois C. Knoll
机构: Technical University of Munich (慕尼黑工业大学); University of California Merced (加州大学默塞德分校)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Ensuring reliable robot operation when visual input is degraded or insufficient remains a central challenge in robotics. This letter introduces DepthVision, a framework for multimodal scene understanding designed to address this problem. Unlike existing Vision-Language Models (VLMs), which use only camera-based visual input alongside language, DepthVision synthesizes RGB images from sparse LiDAR point clouds using a conditional generative adversarial network (GAN) with an integrated refiner network. These synthetic views are then combined with real RGB data using a Luminance-Aware Modality Adaptation (LAMA), which blends the two types of data dynamically based on ambient lighting conditions. This approach compensates for sensor degradation, such as darkness or motion blur, without requiring any fine-tuning of downstream vision-language models. We evaluate DepthVision on real and simulated datasets across various models and tasks, with particular attention to safety-critical tasks. The results demonstrate that our approach improves performance in low-light conditions, achieving substantial gains over RGB-only baselines while preserving compatibility with frozen VLMs. This work highlights the potential of LiDAR-guided RGB synthesis for achieving robust robot operation in real-world environments.
zh
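
LAMA 按环境亮度动态融合真实 RGB 与 LiDAR 合成视图,其核心可抽象为一个亮度驱动的加权平均;以下 NumPy 示意中的亮度阈值与线性加权方式均为假设,非论文的实际模块。

```python
import numpy as np

def lama_blend(rgb_real, rgb_synth, lo=40.0, hi=120.0):
    """按真实图像的平均亮度计算融合权重:光照越差,越依赖 LiDAR 合成视图。
    输入均为 (H, W, 3) 的 0-255 图像;lo/hi 为假设的亮度阈值。"""
    lum = (0.299 * rgb_real[..., 0] + 0.587 * rgb_real[..., 1]
           + 0.114 * rgb_real[..., 2]).mean()
    alpha = float(np.clip((lum - lo) / (hi - lo), 0.0, 1.0))   # 真实 RGB 的权重
    return alpha * rgb_real + (1.0 - alpha) * rgb_synth

rgb_real = np.random.rand(256, 512, 3) * 30     # 模拟低光照场景
rgb_synth = np.random.rand(256, 512, 3) * 255   # GAN 由 LiDAR 点云合成的视图
fused = lama_blend(rgb_real, rgb_synth)         # 低光下 alpha 接近 0
```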

[CV-45] Bias-Aware Machine Unlearning: Towards Fairer Vision Models via Controllable Forgetting ICCV2025

【速读】:该论文旨在解决深度神经网络在训练过程中依赖训练数据中的虚假相关性(spurious correlations)而导致的偏见问题,尤其在医疗和自动驾驶等安全关键领域中可能引发不公平预测。其解决方案的关键在于提出“偏见感知的机器遗忘”(Bias-Aware Machine Unlearning)范式,通过选择性移除有偏样本或特征表示来缓解视觉模型中的多样化偏见。该方法基于隐私保护的机器遗忘技术,评估了梯度上升(Gradient Ascent)、LoRA 和教师-学生蒸馏(Teacher-Student distillation)等多种策略,在 CUB-200-2011(姿态偏见)、CIFAR-10(合成补丁偏见)和 CelebA(微笑检测中的性别偏见)三个基准数据集上验证了其有效性,显著降低了子群体差异,同时保持了较高的模型性能与公平性平衡,且无需重新训练整个模型。

链接: https://arxiv.org/abs/2509.07456
作者: Sai Siddhartha Chary Aylapuram,Veeraraju Elluru,Shivang Agarwal
机构: BITS Pilani Dubai Campus (比特·皮拉尼迪拜校区); IIT Jodhpur (印度理工学院乔德普尔分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted for publication at ICCV 2025 UnMe workshop

点击查看摘要

Abstract:Deep neural networks often rely on spurious correlations in training data, leading to biased or unfair predictions in safety-critical domains such as medicine and autonomous driving. While conventional bias mitigation typically requires retraining from scratch or redesigning data pipelines, recent advances in machine unlearning provide a promising alternative for post-hoc model correction. In this work, we investigate Bias-Aware Machine Unlearning, a paradigm that selectively removes biased samples or feature representations to mitigate diverse forms of bias in vision models. Building on privacy-preserving unlearning techniques, we evaluate various strategies including Gradient Ascent, LoRA, and Teacher-Student distillation. Through empirical analysis on three benchmark datasets, CUB-200-2011 (pose bias), CIFAR-10 (synthetic patch bias), and CelebA (gender bias in smile detection), we demonstrate that post-hoc unlearning can substantially reduce subgroup disparities, with improvements in demographic parity of up to 94.86% on CUB-200, 30.28% on CIFAR-10, and 97.37% on CelebA. These gains are achieved with minimal accuracy loss and with methods scoring an average of 0.62 across the 3 settings on the joint evaluation of utility, fairness, quality, and privacy. Our findings establish machine unlearning as a practical framework for enhancing fairness in deployed vision systems without necessitating full retraining.
zh
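
文中评估的策略之一"梯度上升遗忘"只需在待遗忘样本上最大化损失,几行即可表达;学习率与轮数为假设取值,实践中通常与保留集的常规训练交替进行。

```python
import torch
import torch.nn.functional as F

def gradient_ascent_unlearn(model, forget_loader, lr=1e-4, epochs=1):
    """在待遗忘(有偏)样本上做梯度上升:最小化负交叉熵即等价于最大化损失。"""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in forget_loader:
            loss = -F.cross_entropy(model(x), y)   # 取负号即梯度上升
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# 占位模型与数据,仅演示调用方式
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(784, 10))
forget_set = torch.utils.data.TensorDataset(
    torch.rand(64, 1, 28, 28), torch.randint(0, 10, (64,))
)
gradient_ascent_unlearn(model, torch.utils.data.DataLoader(forget_set, batch_size=16))
```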

[CV-46] XOCT: Enhancing OCT to OCTA Translation via Cross-Dimensional Supervised Multi-Scale Feature Learning MICCAI2025

【速读】:该论文旨在解决光学相干断层扫描血管成像(Optical Coherence Tomography Angiography, OCTA)在临床应用中面临的两大挑战:一是传统OCT设备对运动伪影敏感且软件改造成本高,难以获取高质量图像;二是现有深度学习方法在OCT到OCTA的图像转换中忽视了视网膜各层血管结构的差异性,导致难以重建精细密集的血管细节,影响诊断可靠性。解决方案的关键在于提出XOCT框架,其核心创新为两个模块:一是跨维度监督(Cross-Dimensional Supervision, CDS),利用分割加权z轴平均生成的2D层间en-face投影作为监督信号,实现逐层精细化引导;二是多尺度特征融合(Multi-Scale Feature Fusion, MSFF)网络,结合多尺度特征提取与通道重加权策略,增强不同空间尺度下的血管边界清晰度。该方案显著提升了en-face投影质量,从而提升OCTA在眼科疾病诊断中的可用性与准确性。

链接: https://arxiv.org/abs/2509.07455
作者: Pooya Khosravi,Kun Han,Anthony T. Wu,Arghavan Rezvani,Zexin Feng,Xiaohui Xie
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages, 3 figures, Accepted to MICCAI 2025

点击查看摘要

Abstract:Optical Coherence Tomography Angiography (OCTA) and its derived en-face projections provide high-resolution visualization of the retinal and choroidal vasculature, which is critical for the rapid and accurate diagnosis of retinal diseases. However, acquiring high-quality OCTA images is challenging due to motion sensitivity and the high costs associated with software modifications for conventional OCT devices. Moreover, current deep learning methods for OCT-to-OCTA translation often overlook the vascular differences across retinal layers and struggle to reconstruct the intricate, dense vascular details necessary for reliable diagnosis. To overcome these limitations, we propose XOCT, a novel deep learning framework that integrates Cross-Dimensional Supervision (CDS) with a Multi-Scale Feature Fusion (MSFF) network for layer-aware vascular reconstruction. Our CDS module leverages 2D layer-wise en-face projections, generated via segmentation-weighted z-axis averaging, as supervisory signals to compel the network to learn distinct representations for each retinal layer through fine-grained, targeted guidance. Meanwhile, the MSFF module enhances vessel delineation through multi-scale feature extraction combined with a channel reweighting strategy, effectively capturing vascular details at multiple spatial scales. Our experiments on the OCTA-500 dataset demonstrate XOCT’s improvements, especially for the en-face projections which are significant for clinical evaluation of retinal pathologies, underscoring its potential to enhance OCTA accessibility, reliability, and diagnostic value for ophthalmic disease detection and monitoring. The code is available at this https URL.
zh
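
CDS 模块用作监督信号的"分割加权 z 轴平均"本质是带掩码的加权平均,可写成如下 NumPy 草图;体数据与层掩码的形状约定为假设。

```python
import numpy as np

def layer_enface_projection(volume, layer_mask, eps=1e-6):
    """volume: (H, W, D) 体数据,D 为深度方向;layer_mask: 同形状 0/1 层分割掩码。
    返回该视网膜层的 2D en-face 投影(z 轴上按掩码加权平均)。"""
    weighted = (volume * layer_mask).sum(axis=-1)
    counts = layer_mask.sum(axis=-1)
    return weighted / (counts + eps)

vol = np.random.rand(304, 304, 160)
mask = (np.random.rand(304, 304, 160) > 0.7).astype(np.float32)   # 假设的层掩码
enface = layer_enface_projection(vol, mask)   # 作为 CDS 的逐层 2D 监督信号
```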

[CV-47] In the Eye of MLLM : Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

【速读】:该论文旨在解决现有多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理第一人称视角视频(egocentric videos)时,因忽视用户注视(gaze)信息而导致对用户意图理解不准确的问题。其解决方案的关键在于提出EgoGazeVQA基准,该基准通过融合由MLLM生成并经人工校正的基于注视的问答对,引入空间、时间与意图相关的多维线索,并采用注视引导的提示(gaze-guided intent prompting)方法显著提升模型对日常长视频的理解能力。实验表明,注视信息能有效增强AI助手在第一人称场景下的个性化和响应准确性。

链接: https://arxiv.org/abs/2509.07447
作者: Taiying Peng,Jiacheng Hua,Miao Liu,Feng Lu
机构: Beihang University (北京航空航天大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants’ ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in a unified coordinate frame, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.
zh

[CV-48] DreamLifting: A Plug-in Module Lifting MV Diffusion Models for 3D Asset Generation

【速读】:该论文旨在解决当前3D资产生成中缺乏端到端物理渲染材质(PBR)就绪的自动化流程问题,现有方法多集中于几何建模,通常将纹理烘焙为简单顶点颜色或依赖后处理图像扩散模型进行纹理合成,难以实现高质量、可重光照的PBR资产生成。解决方案的关键在于提出轻量级高斯资产适配器(Lightweight Gaussian Asset Adapter, LGAA),其核心创新是通过多视角(MV)扩散先验统一建模几何与PBR材质:首先利用LGAA Wrapper复用并适配多视角扩散模型中的网络层以高效利用海量图像知识;其次引入LGAA Switcher对不同先验知识封装的模块进行对齐;最后设计一个受控变分自编码器(VAE)作为LGAA Decoder,直接预测包含PBR通道的2D高斯溅射(2DGS)表示,并结合专用后处理步骤提取高质量、可重光照的网格资产。该方案在仅69k多视角样本下即可高效收敛,且支持文本和图像条件输入,展现出优越的生成性能与模块化扩展能力。

链接: https://arxiv.org/abs/2509.07435
作者: Ze-Xin Yin,Jiaxiong Qiu,Liu Liu,Xinjie Wang,Wei Sui,Zhizhong Su,Jian Yang,Jin Xie
机构: Nankai University (南开大学); Nanjing University (南京大学); Horizon Robotics (地平线机器人); D-Robotics (大疆创新)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 7 figures, project page: this https URL

点击查看摘要

Abstract:The labor- and experience-intensive creation of 3D assets with physically based rendering (PBR) materials demands an autonomous 3D asset creation pipeline. However, most existing 3D generation methods focus on geometry modeling, either baking textures into simple vertex colors or leaving texture synthesis to post-processing with image diffusion models. To achieve end-to-end PBR-ready 3D asset generation, we present Lightweight Gaussian Asset Adapter (LGAA), a novel framework that unifies the modeling of geometry and PBR materials by exploiting multi-view (MV) diffusion priors from a novel perspective. The LGAA features a modular design with three components. Specifically, the LGAA Wrapper reuses and adapts network layers from MV diffusion models, which encapsulate knowledge acquired from billions of images, enabling better convergence in a data-efficient manner. To incorporate multiple diffusion priors for geometry and PBR synthesis, the LGAA Switcher aligns multiple LGAA Wrapper layers encapsulating different knowledge. Then, a tamed variational autoencoder (VAE), termed LGAA Decoder, is designed to predict 2D Gaussian Splatting (2DGS) with PBR channels. Finally, we introduce a dedicated post-processing procedure to effectively extract high-quality, relightable mesh assets from the resulting 2DGS. Extensive quantitative and qualitative experiments demonstrate the superior performance of LGAA with both text- and image-conditioned MV diffusion models. Additionally, the modular design enables flexible incorporation of multiple diffusion priors, and the knowledge-preserving scheme leads to efficient convergence trained on merely 69k multi-view instances. Our code, pre-trained weights, and the dataset used will be publicly available via our project page: this https URL.
zh

[CV-49] A smart fridge with AI-enabled food computing

【速读】:该论文旨在解决高密度库存环境下智能冰箱中因物品重叠、遮挡及多角度观测导致的食品检测与计数不准确问题,进而影响食品管理效率和浪费控制。其解决方案的关键在于构建一个包含数据预处理、目标检测与管理、以及基于Web的可视化三个模块的系统,并引入一种改进的焦点损失(focal loss)机制,通过温度缩放实现类别自适应的误差校准,从而缓解模型过自信或欠自信预测带来的校准偏差,显著提升在不同光照条件和扩展性挑战下的检测可靠性。

链接: https://arxiv.org/abs/2509.07400
作者: Khue Nong Thuc,Khoa Tran Nguyen Anh,Tai Nguyen Huy,Du Nguyen Hao Hong,Khanh Dinh Ba
机构: Ho Chi Minh City University of Technology (胡志明市科技大学); Vietnam National University Ho Chi Minh City (胡志明市国家大学)
类目: ystems and Control (eess.SY); Computer Vision and Pattern Recognition (cs.CV); Software Engineering (cs.SE)
备注:

点击查看摘要

Abstract:The Internet of Things (IoT) plays a crucial role in enabling seamless connectivity and intelligent home automation, particularly in food management. By integrating IoT with computer vision, the smart fridge employs an ESP32-CAM to establish a monitoring subsystem that enhances food management efficiency through real-time food detection, inventory tracking, and temperature monitoring. This benefits waste reduction, grocery planning improvement, and household consumption optimization. In high-density inventory conditions, capturing partial or layered images complicates object detection, as overlapping items and occluded views hinder accurate identification and counting. Besides, varied angles and obscured details in multi-layered setups reduce algorithm reliability, often resulting in miscounts or misclassifications. Our proposed system is structured into three core modules: data pre-processing, object detection and management, and a web-based visualization. To address the challenge of poor model calibration caused by overconfident predictions, we implement a variant of focal loss that mitigates over-confidence and under-confidence in multi-category classification. This approach incorporates adaptive, class-wise error calibration via temperature scaling and evaluates the distribution of predicted probabilities across methods. Our results demonstrate that robust functional calibration significantly improves detection reliability under varying lighting conditions and scalability challenges. Further analysis demonstrates a practical, user-focused approach to modern food management, advancing sustainable living goals through reduced waste and more informed consumption.
zh
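
摘要中"类别自适应温度缩放 + focal loss 变体"的组合方式大致可示意如下:每个类别维护一个可学习温度,先缩放 logits 再计算 focal loss;γ 取值与温度参数化方式均为假设,非论文确切公式。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClasswiseTemperedFocalLoss(nn.Module):
    """每个类别一个可学习温度:先对 logits 做类别级缩放再计算 focal loss,
    以同时缓解过自信与欠自信带来的校准偏差。"""
    def __init__(self, n_classes, gamma=2.0):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(n_classes))   # 温度 = exp(log_t) > 0
        self.gamma = gamma

    def forward(self, logits, target):
        scaled = logits / torch.exp(self.log_t)             # 类别自适应温度缩放
        logp = F.log_softmax(scaled, dim=1)
        logp_t = logp.gather(1, target.unsqueeze(1)).squeeze(1)
        p_t = logp_t.exp()
        return (-(1 - p_t) ** self.gamma * logp_t).mean()

crit = ClasswiseTemperedFocalLoss(n_classes=10)
logits, y = torch.randn(8, 10), torch.randint(0, 10, (8,))
loss = crit(logits, y)   # 温度参数随网络一同优化
```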

[CV-50] EfficientNet in Digital Twin-based Cardiac Arrest Prediction and Analysis

【速读】:该论文旨在解决心脏骤停(cardiac arrest)早期识别与管理难题,以提升患者预后。其核心解决方案在于构建一个融合基于EfficientNet的深度学习模型与数字孪生(digital twin, DT)系统的新型框架:通过复合缩放(compound scaling)和EfficientNet提取心血管图像特征实现高精度预测,同时利用物联网(IoT)设备实时数据生成个体化心血管系统数字孪生模型,从而持续评估患者状态并模拟治疗方案影响,实现了主动且个性化的疾病预测与干预策略。

链接: https://arxiv.org/abs/2509.07388
作者: Qasim Zia,Avais Jan,Zafar Iqbal,Muhammad Mumtaz Ali,Mukarram Ali,Murray Patterson
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Cardiac arrest is one of the biggest global health problems, and early identification and management are key to enhancing the patient’s prognosis. In this paper, we propose a novel framework that combines an EfficientNet-based deep learning model with a digital twin system to improve the early detection and analysis of cardiac arrest. We use compound scaling and EfficientNet to learn the features of cardiovascular images. In parallel, the digital twin creates a realistic and individualized cardiovascular system model of the patient based on data received from the Internet of Things (IoT) devices attached to the patient, which can help in the constant assessment of the patient and the impact of possible treatment plans. As shown by our experiments, the proposed system is highly accurate in its prediction abilities and, at the same time, efficient. Combining highly advanced techniques such as deep learning and digital twin (DT) technology presents the possibility of using an active and individual approach to predicting cardiac disease.
zh

[CV-51] Parse Graph-Based Visual-Language Interaction for Human Pose Estimation

【速读】:该论文旨在解决单模态人体姿态估计(Human Pose Estimation, HPE)方法在处理遮挡场景时性能受限的问题,特别是现有视觉-语言融合方法因全局特征整合导致遮挡区域响应减弱及定位不准的问题。其解决方案的关键在于提出基于解析图(Parse Graph)的视觉-语言交互框架(PGVL),其中核心创新是引入引导模块(Guided Module, GM)。GM通过跨注意力机制驱动高语义节点指导低语义节点的特征更新,实现局部细节与全局语义的有效融合,从而增强遮挡区域的特征响应并提升姿态估计精度。

链接: https://arxiv.org/abs/2509.07385
作者: Shibang Liu,Xuemei Xie,Guangming Shi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:

点击查看摘要

Abstract:Parse graphs boost human pose estimation (HPE) by integrating context and hierarchies, yet prior work mostly focuses on single modality modeling, ignoring the potential of multimodal fusion. Notably, language offers rich HPE priors like spatial relations for occluded scenes, but existing visual-language fusion via global feature integration weakens occluded region responses and causes alignment and location failures. To address this issue, we propose Parse Graph-based Visual-Language interaction (PGVL) with a core novel Guided Module (GM). In PGVL, low-level nodes focus on local features, maximizing the maintenance of responses in occluded areas, while high-level nodes integrate global features to infer occluded or invisible parts. GM enables high-semantic nodes to guide the feature update of low-semantic nodes that have undergone cross-attention, ensuring effective fusion of diverse information. PGVL includes top-down decomposition and bottom-up composition. In the first stage, modality-specific parse graphs are constructed. In the next stage, recursive bidirectional cross-attention is applied and purified by GM. We also design a network based on PGVL. PGVL and our network are validated on major pose estimation datasets. We will release the code soon.
zh

[CV-52] G3CN: Gaussian Topology Refinement Gated Graph Convolutional Network for Skeleton-Based Action Recognition IROS

【速读】:该论文旨在解决骨架动作识别中因图卷积网络(GCN)难以区分相似动作而导致的表征能力不足问题,尤其在处理语义模糊的动作样本时表现不佳。其解决方案的关键在于提出一种名为高斯拓扑精炼门控图卷积(G^3CN)的新方法:首先引入高斯滤波器对骨架拓扑图进行精炼,增强对模糊动作的拓扑与空间特征表达;其次,在GCN框架中集成门控循环单元(GRU),以提升骨架关键点间的信息传播效率,从而显著改善模型对复杂动作类别的判别能力。

链接: https://arxiv.org/abs/2509.07335
作者: Haiqing Ren,Zhongkai Luo,Heng Fan,Xiaohui Yuan,Guanchen Wang,Libo Zhang
机构: Institute of Software, Chinese Academy of Sciences (中国科学院软件研究所); University of North Texas (北德克萨斯大学); Chadwick School (查德威克学校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures, IROS

点击查看摘要

Abstract:Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose a novel approach, Gaussian Topology Refinement Gated Graph Convolution (G^3CN). G^3CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G^3CN effectively improves action recognition, particularly for ambiguous samples.
zh

[CV-53] DEPF: A UAV Multispectral Object Detector with Dual-Domain Enhancement and Priority-Guided Mamba Fusion

【速读】:该论文针对无人机(UAV)多光谱目标检测中的三个核心问题展开研究:一是低光照条件下多模态融合的互补性下降;二是融合阶段冗余信息对局部小目标建模的干扰;三是基于Transformer的方法因二次计算复杂度难以部署于资源受限的无人机平台。解决方案的关键在于提出一种双域增强与优先引导Mamba融合(DEPF)框架,其核心创新包括:1)设计双域增强模块(DDE),通过跨尺度小波Mamba(CSWM)提升图像全局亮度,结合傅里叶细节恢复块(FDR)重建频域纹理特征,以改善低光图像质量;2)引入优先引导Mamba融合模块(PGMF),基于模态差异计算优先级得分,从局部目标特征出发进行优先扫描,有效抑制冗余信息干扰并强化小目标建模能力。该方法利用Mamba线性复杂度特性,实现高效且鲁棒的多光谱目标检测。

链接: https://arxiv.org/abs/2509.07327
作者: Shucong Li,Zhenyu Liu,Zijie Hong,Zhiheng Zhou,Xianghai Cao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Multispectral remote sensing object detection is one of the important application of unmanned aerial vehicle (UAV). However, it faces three challenges. Firstly, the low-light remote sensing images reduce the complementarity during multi-modality fusion. Secondly, the local small target modeling is interfered with redundant information in the fusion stage easily. Thirdly, due to the quadratic computational complexity, it is hard to apply the transformer-based methods on the UAV platform. To address these limitations, motivated by Mamba with linear complexity, a UAV multispectral object detector with dual-domain enhancement and priority-guided mamba fusion (DEPF) is proposed. Firstly, to enhance low-light remote sensing images, Dual-Domain Enhancement Module (DDE) is designed, which contains Cross-Scale Wavelet Mamba (CSWM) and Fourier Details Recovery block (FDR). CSWM applies cross-scale mamba scanning for the low-frequency components to enhance the global brightness of images, while FDR constructs spectrum recovery network to enhance the frequency spectra features for recovering the texture-details. Secondly, to enhance local target modeling and reduce the impact of redundant information during fusion, Priority-Guided Mamba Fusion Module (PGMF) is designed. PGMF introduces the concept of priority scanning, which starts from local targets features according to the priority scores obtained from modality difference. Experiments on DroneVehicle dataset and VEDAI dataset reports that, DEPF performs well on object detection, comparing with state-of-the-art methods. Our code is available in the supplementary material.

[CV-54] Reconstruction Alignment Improves Unified Multimodal Models

【Quick Read】: This paper addresses the loss of visual detail in unified multimodal models (UMMs) trained on sparse image-text pairs, where even lengthy captions fail to capture fine-grained image information. The key to the solution is Reconstruction Alignment (RecA), a resource-efficient post-training method that uses embeddings from the visual understanding encoder as dense "text prompts": the model is conditioned on its own understanding embeddings and optimized with a self-supervised reconstruction loss to rebuild the input image, thereby realigning generation with understanding. The method requires no extra annotations and significantly improves generation and editing quality across autoregressive, masked-autoregressive, and diffusion-based architectures.

Link: https://arxiv.org/abs/2509.07295
Authors: Ji Xie,Trevor Darrell,Luke Zettlemoyer,XuDong Wang
Affiliations: UC Berkeley; University of Washington
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 28 pages, 24 figures and 10 tables

Abstract:Unified multimodal models (UMMs) unify visual understanding and generation within a single architecture. However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details–even when they use hundreds of words to describe a simple image. We introduce Reconstruction Alignment (RecA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense “text prompts,” providing rich supervision without captions. Concretely, RecA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation. Despite its simplicity, RecA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity. With only 27 GPU-hours, post-training with RecA substantially improves image generation performance on GenEval (0.73 → 0.90) and DPGBench (80.93 → 88.15), while also boosting editing benchmarks (ImgEdit 3.38 → 3.75, GEdit 6.94 → 7.25). Notably, RecA surpasses much larger open-source models and applies broadly across diverse UMM architectures, establishing it as an efficient and general post-training alignment strategy for UMMs.
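To make the core idea concrete, here is a minimal sketch of a RecA-style objective: condition the model on its own understanding embeddings and penalize reconstruction error. `generate_from_embedding` is a hypothetical hook standing in for whatever decoding path a given UMM exposes; the authors' actual loss and conditioning interface may differ.

```python
import torch
import torch.nn.functional as F

def reca_loss(umm, understanding_encoder, images):
    """One RecA-style post-training step (sketch, not the official code)."""
    with torch.no_grad():
        # Dense "text prompt": the model's own visual-understanding embeddings.
        cond = understanding_encoder(images)
    # Hypothetical differentiable decoding pass conditioned on those embeddings.
    recon = umm.generate_from_embedding(cond)
    # Self-supervised reconstruction loss realigning generation with understanding.
    return F.mse_loss(recon, images)
```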

[CV-55] Breast Cancer Detection in Thermographic Images via Diffusion-Based Augmentation and Nonlinear Feature Fusion ALT

【Quick Read】: This paper addresses the limits that data scarcity places on deep learning in medical imaging, specifically for breast cancer classification from thermograms. The key to the solution lies in two components: first, data augmentation with a Diffusion Probabilistic Model (DPM), which clearly outperforms traditional methods and a ProGAN baseline; second, fusing deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (such as fractal dimension) computed from U-Net-segmented tumors, with an XGBoost classifier on the fused features reaching 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant contributors to model performance.

Link: https://arxiv.org/abs/2509.07277
Authors: Sepehr Salem,M. Moein Esfahani,Jingyu Liu,Vince Calhoun
Affiliations: Georgia State University; Georgia Institute of Technology; Emory University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2025)

Abstract:Data scarcity hinders deep learning for medical imaging. We propose a framework for breast cancer classification in thermograms that addresses this using a Diffusion Probabilistic Model (DPM) for data augmentation. Our DPM-based augmentation is shown to be superior to both traditional methods and a ProGAN baseline. The framework fuses deep features from a pre-trained ResNet-50 with handcrafted nonlinear features (e.g., Fractal Dimension) derived from U-Net segmented tumors. An XGBoost classifier trained on these fused features achieves 98.0% accuracy and 98.1% sensitivity. Ablation studies and statistical tests confirm that both the DPM augmentation and the nonlinear feature fusion are critical, statistically significant components of this success. This work validates the synergy between advanced generative models and interpretable features for creating highly accurate medical diagnostic tools.
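A minimal sketch of the fusion-plus-XGBoost stage described above, assuming deep features come from a pre-trained ResNet-50 penultimate layer and that handcrafted features (e.g., fractal dimension) are computed elsewhere; the hyperparameters here are placeholders, not the paper's.

```python
import numpy as np
import torch
import torchvision.models as models
import xgboost as xgb

# Pre-trained ResNet-50 as a 2048-d feature extractor (drop the classifier head).
resnet = models.resnet50(weights="IMAGENET1K_V2")
resnet.fc = torch.nn.Identity()
resnet.eval()

def fused_features(thermogram_batch, handcrafted):
    """Concatenate deep features with handcrafted nonlinear features."""
    with torch.no_grad():
        deep = resnet(thermogram_batch).cpu().numpy()  # (N, 2048)
    return np.concatenate([deep, handcrafted], axis=1)

# X_train = fused_features(images, fractal_dims); y_train = labels
clf = xgb.XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")
# clf.fit(X_train, y_train)
```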

[CV-56] GCond: Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning

【Quick Read】: This paper targets gradient conflict in multi-task learning (MTL): existing methods such as PCGrad, CAGrad, and GradNorm are too computationally expensive in their original implementations for modern large models and Transformers. The key to the solution is Gradient Conductor (GCond), which builds on PCGrad principles and combines them with gradient accumulation and an adaptive arbitration mechanism, improving computational efficiency while preserving optimization quality. Experiments show a two-fold speedup in its stochastic mode, lower L1 and SSIM losses than baseline linear combinations and state-of-the-art conflict-resolution methods on ImageNet-1K and a combined head-and-neck CT dataset, good scalability, and compatibility with modern optimizers such as AdamW and Lion/LARS.

Link: https://arxiv.org/abs/2509.07252
Authors: Evgeny Alves Limarenko,Anastasiia Alexandrovna Studenikina
Affiliations: Moscow Institute of Physics and Technology
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Preprint. Submitted to PeerJ

Abstract:In multi-task learning (MTL), gradient conflict poses a significant challenge. Effective methods for addressing this problem, including PCGrad, CAGrad, and GradNorm, in their original implementations are computationally demanding, which significantly limits their application in modern large models and transformers. We propose Gradient Conductor (GCond), a method that builds upon PCGrad principles by combining them with gradient accumulation and an adaptive arbitration mechanism. We evaluated GCond on self-supervised learning tasks using MobileNetV3-Small and ConvNeXt architectures on the ImageNet 1K dataset and a combined head and neck CT scan dataset, comparing the proposed method against baseline linear combinations and state-of-the-art gradient conflict resolution methods. The stochastic mode of GCond achieved a two-fold computational speedup while maintaining optimization quality, and demonstrated superior performance across all evaluated metrics, achieving lower L1 and SSIM losses compared to other methods on both datasets. GCond exhibited high scalability, being successfully applied to both compact models (MobileNetV3-Small) and large architectures (ConvNeXt-tiny and ConvNeXt-Base). It also showed compatibility with modern optimizers such as AdamW and Lion/LARS. Therefore, GCond offers a scalable and efficient solution to the problem of gradient conflicts in multi-task learning.
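The adaptive arbitration mechanism is not specified in this summary, but the PCGrad projection that GCond builds on, plus the accumulation idea, can be sketched as follows (our own simplification, not the authors' implementation).

```python
import torch

def pcgrad_project(grads):
    """PCGrad rule: project each task gradient off the conflicting
    directions of the others. `grads` is a list of flattened task gradients."""
    projected = []
    for i, g in enumerate(grads):
        g = g.clone()
        for j, other in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g, other)
            if dot < 0:  # conflict: remove the component along `other`
                g -= dot / (other.norm() ** 2 + 1e-12) * other
        projected.append(g)
    return torch.stack(projected).sum(dim=0)

# Accumulation idea: average each task's gradient over k micro-batches first,
# then resolve conflicts once on the smoothed gradients, amortizing the cost.
```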

[CV-57] XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

【Quick Read】: This paper addresses insufficient segmentation accuracy for small or low-contrast lesions in breast ultrasound (BUS) images, where fuzzy margins and speckle noise dominate. Existing methods that inject clinical semantics via text prompts often rely on weakly localized text-image cues (e.g., CAM/CLIP-derived signals) and produce coarse segmentations with diffuse boundaries. The key to the solution is XBusNet, a dual-prompt, dual-branch multimodal model: a global pathway built on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway sharpens boundary detail and is modulated by text prompts describing shape, margin characteristics, and BI-RADS terms. Prompts are assembled automatically from structured metadata without manual annotation, so global semantic consistency and local boundary precision reinforce each other; on the BLU dataset the model reaches a state-of-the-art Dice of 0.8765 and IoU of 0.8149, with the largest gains and best robustness on small, low-contrast lesions.

Link: https://arxiv.org/abs/2509.07213
Authors: Raja Mallina,Bryar Shareef
Affiliations: University of Nevada, Las Vegas
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures, 4 tables

Abstract:Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.

[CV-58] Dimensionally Reduced Open-World Clustering: DROWCULA

【Quick Read】: This paper addresses novel class discovery in open-world image classification: when training uses labeled data from known classes only, how can unseen categories that appear later be identified without supervision? Unlike prior work that mostly relies on semi-supervised frameworks, this paper proposes a fully unsupervised solution. The key idea is to use Vision Transformers (ViT), whose attention mechanisms yield high-quality image embeddings, and then apply manifold learning to refine these embeddings by exploiting the intrinsic geometry of the data, improving clustering performance. Notably, the number of clusters can be estimated without being known in advance, and the method sets new state-of-the-art results for single-modal clustering and novel class discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet.

Link: https://arxiv.org/abs/2509.07184
Authors: Erencem Ozbey,Dimitrios I. Diochnos
Affiliations: Bogazici University; University of Oklahoma
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 16 pages, 12 Figures, 12 Tables

Abstract:Working with annotated data is the cornerstone of supervised learning. Nevertheless, providing labels to instances is a task that requires significant human effort. Several critical real-world applications make things more complicated because no matter how many labels may have been identified in a task of interest, it could be the case that examples corresponding to novel classes may appear in the future. Not unsurprisingly, prior work in this, so-called, `open-world’ context has focused a lot on semi-supervised approaches. Focusing on image classification, somehow paradoxically, we propose a fully unsupervised approach to the problem of determining the novel categories in a particular dataset. Our approach relies on estimating the number of clusters using Vision Transformers, which utilize attention mechanisms to generate vector embeddings. Furthermore, we incorporate manifold learning techniques to refine these embeddings by exploiting the intrinsic geometry of the data, thereby enhancing the overall image clustering performance. Overall, we establish new State-of-the-Art results on single-modal clustering and Novel Class Discovery on CIFAR-10, CIFAR-100, ImageNet-100, and Tiny ImageNet. We do so, both when the number of clusters is known or unknown ahead of time. The code is available at: this https URL.
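A rough sketch of the pipeline under stated assumptions: UMAP stands in for the unspecified manifold-learning step and the silhouette score for the unspecified cluster-count criterion; `embeddings` are ViT features extracted beforehand.

```python
import umap  # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def discover_clusters(embeddings, k_range=range(2, 51)):
    """Refine ViT embeddings with manifold learning, then pick the number
    of clusters with a simple unsupervised criterion (silhouette score)."""
    reduced = umap.UMAP(n_components=32, metric="cosine").fit_transform(embeddings)
    best_k, best_score, best_labels = None, -1.0, None
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(reduced)
        score = silhouette_score(reduced, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels
```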

[CV-59] Realism to Deception: Investigating Deepfake Detectors Against Face Enhancement

【Quick Read】: The problem this paper investigates is that face enhancement techniques, while improving visual quality, can inadvertently distort biometric features and sharply reduce the accuracy of deepfake detectors, effectively acting as anti-forensic tools. The key to the solution is a systematic evaluation of both traditional image-processing and GAN-based face enhancement methods against different detection strategies (naïve, spatial-domain, and frequency-based), together with adversarial training experiments that test whether exposure to enhancement transformations improves model robustness. Results on FaceForensics++, DeepFakeDetection, and CelebDF-v2 show that even basic enhancement filters push the attack success rate (ASR) up to 64.63%, and GAN-based enhancement raises it further to 75.12%, demonstrating that face enhancement can serve as an effective anti-forensic tool and underscoring the need for more resilient and adaptive forensic methods.

Link: https://arxiv.org/abs/2509.07178
Authors: Muhammad Saad Saeed,Ijaz Ul Haq,Khalid Malik
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Face enhancement techniques are widely used to enhance facial appearance. However, they can inadvertently distort biometric features, leading to significant decrease in the accuracy of deepfake detectors. This study hypothesizes that these techniques, while improving perceptual quality, can degrade the performance of deepfake detectors. To investigate this, we systematically evaluate whether commonly used face enhancement methods can serve an anti-forensic role by reducing detection accuracy. We use both traditional image processing methods and advanced GAN-based enhancements to evaluate the robustness of deepfake detectors. We provide a comprehensive analysis of the effectiveness of these enhancement techniques, focusing on their impact on Naïve, Spatial, and Frequency-based detection methods. Furthermore, we conduct adversarial training experiments to assess whether exposure to face enhancement transformations improves model robustness. Experiments conducted on the FaceForensics++, DeepFakeDetection, and CelebDF-v2 datasets indicate that even basic enhancement filters can significantly reduce detection accuracy achieving ASR up to 64.63%. In contrast, GAN-based techniques further exploit these vulnerabilities, achieving ASR up to 75.12%. Our results demonstrate that face enhancement methods can effectively function as anti-forensic tools, emphasizing the need for more resilient and adaptive forensic methods.

[CV-60] Adversarial Attacks on Audio Deepfake Detection: A Benchmark and Comparative Study

【Quick Read】: This paper addresses the security threat that generative-AI-produced deepfake audio poses to voice biometric applications, in particular the insufficient robustness of existing audio deepfake detection (ADD) methods against anti-forensic (AF) attacks. The key to the solution is a systematic evaluation of state-of-the-art ADD methods on five benchmark datasets, split into raw-waveform and spectrogram-based approaches, with an in-depth analysis of their vulnerability under diverse AF attacks (statistical modifications and optimization-based adversarial attacks). This comparative analysis reveals the strengths and limitations of existing methods, informs the design of more robust and generalizable detectors, and guides the development of adaptive defense strategies.

Link: https://arxiv.org/abs/2509.07132
Authors: Kutub Uddin,Muhammad Umar Farooq,Awais Khan,Khalid Mahmood Malik
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:The widespread use of generative AI has shown remarkable success in producing highly realistic deepfakes, posing a serious threat to various voice biometric applications, including speaker verification, voice biometrics, audio conferencing, and criminal investigations. To counteract this, several state-of-the-art (SoTA) audio deepfake detection (ADD) methods have been proposed to identify generative AI signatures to distinguish between real and deepfake audio. However, the effectiveness of these methods is severely undermined by anti-forensic (AF) attacks that conceal generative signatures. These AF attacks span a wide range of techniques, including statistical modifications (e.g., pitch shifting, filtering, noise addition, and quantization) and optimization-based attacks (e.g., FGSM, PGD, C&W, and DeepFool). In this paper, we investigate the SoTA ADD methods and provide a comparative analysis to highlight their effectiveness in exposing deepfake signatures, as well as their vulnerabilities under adversarial conditions. We conducted an extensive evaluation of ADD methods on five deepfake benchmark datasets using two categories: raw and spectrogram-based approaches. This comparative analysis enables a deeper understanding of the strengths and limitations of SoTA ADD methods against diverse AF attacks. It not only highlights vulnerabilities of ADD methods, but also informs the design of more robust and generalized detectors for real-world voice biometrics. It will further guide future research in developing adaptive defense strategies that can effectively counter evolving AF techniques.

[CV-61] Detection and Recovery of Adversarial Slow-Pose Drift in Offloaded Visual-Inertial Odometry

【Quick Read】: This paper addresses pose spoofing threats when visual-inertial odometry (VIO) is offloaded to edge servers: subtle but persistent pose forgeries accumulate into substantial drift yet evade conventional heuristic checks. The key to the solution is an unsupervised, label-free detection and recovery mechanism: the model is trained on attack-free sessions to learn the temporal regularities of motion, flags runtime deviations from those patterns, and triggers recovery to restore pose consistency, effectively reducing trajectory and pose errors.

Link: https://arxiv.org/abs/2509.07130
Authors: Soruya Saha,Md Nurul Absurd,Saptarshi Debroy
Affiliations: The Graduate Center, City University of New York; Hunter College and The Graduate Center, City University of New York
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 12 Pages, 8 Figures

Abstract:Visual-Inertial Odometry (VIO) supports immersive Virtual Reality (VR) by fusing camera and Inertial Measurement Unit (IMU) data for real-time pose. However, the current trend of offloading VIO to edge servers creates a server-side threat surface where subtle pose spoofing can accumulate into substantial drift while evading heuristic checks. In this paper, we study this threat and present an unsupervised, label-free detection and recovery mechanism. The proposed model is trained on attack-free sessions to learn the temporal regularities of motion, detect runtime deviations, and initiate recovery to restore pose consistency. We evaluate the approach in a realistic offloaded-VIO environment using the ILLIXR testbed across multiple spoofing intensities. Experimental results in terms of well-known performance metrics show substantial reductions in trajectory and pose error compared to a no-defense baseline.

[CV-62] SVGauge: Towards Human-Aligned Evaluation for SVG Generation

【Quick Read】: This paper addresses the inability of existing image-generation metrics (FID, LPIPS, CLIPScore) to evaluate generated Scalable Vector Graphics (SVG), since they are not adapted to SVG's symbolic and vectorial nature. The key to the solution is SVGauge, a human-aligned, reference-based metric for text-to-SVG generation with two components: (i) visual fidelity, measured by extracting SigLIP image embeddings and refining them with PCA and whitening for domain alignment; and (ii) semantic consistency, measured by comparing BLIP-2-generated captions of the SVGs against the original prompts in the combined SBERT and TF-IDF space. On the proposed SHE benchmark, SVGauge correlates best with human judgments and reproduces the system-level rankings of eight zero-shot LLM-based generators more faithfully than existing metrics, confirming both the necessity and the practicality of vector-specific evaluation.

Link: https://arxiv.org/abs/2509.07127
Authors: Leonardo Zini,Elia Frigieri,Sebastiano Aloscari,Marcello Generali,Lorenzo Dodi,Robert Dosen,Lorenzo Baraldi
Affiliations: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 23rd edition of International Conference on Image Analysis and Processing 2025

Abstract:Generated Scalable Vector Graphics (SVG) images demand evaluation criteria tuned to their symbolic and vectorial nature: criteria that existing metrics such as FID, LPIPS, or CLIPScore fail to satisfy. In this paper, we introduce SVGauge, the first human-aligned, reference based metric for text-to-SVG generation. SVGauge jointly measures (i) visual fidelity, obtained by extracting SigLIP image embeddings and refining them with PCA and whitening for domain alignment, and (ii) semantic consistency, captured by comparing BLIP-2-generated captions of the SVGs against the original prompts in the combined space of SBERT and TF-IDF. Evaluation on the proposed SHE benchmark shows that SVGauge attains the highest correlation with human judgments and reproduces system-level rankings of eight zero-shot LLM-based generators more faithfully than existing metrics. Our results highlight the necessity of vector-specific evaluation and provide a practical tool for benchmarking future text-to-SVG generation models.
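One plausible way to combine the two components into a single score is shown below; the linear combination, the weight `alpha`, and the fitted `pca` (e.g., sklearn's PCA with whiten=True) are our assumptions, and embedding extraction with SigLIP/BLIP-2/SBERT is omitted.

```python
import numpy as np

def svg_score(img_emb, ref_emb, caption_vec, prompt_vec, pca, alpha=0.5):
    """Combine visual fidelity (whitened SigLIP embeddings) with semantic
    consistency (BLIP-2 caption vs. prompt in a joint SBERT/TF-IDF space)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    # (i) visual fidelity after PCA + whitening for domain alignment
    fidelity = cos(pca.transform(img_emb[None])[0], pca.transform(ref_emb[None])[0])
    # (ii) semantic consistency of the generated caption against the prompt
    consistency = cos(caption_vec, prompt_vec)
    return alpha * fidelity + (1 - alpha) * consistency
```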

[CV-63] Faster VGGT with Block-Sparse Global Attention

【Quick Read】: This paper addresses the inference bottleneck of Transformer-based multi-view reconstruction models such as VGGT and π³, caused by the quadratic complexity of global attention when processing large image collections. The key to the solution is an empirical observation: probability mass in the global attention matrix concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric matches. Building on this, the dense global attention operation is replaced with highly optimized block-sparse kernels, yielding up to 4× faster inference with comparable task performance, requiring no retraining of the backbone, and scaling to large image sets.

Link: https://arxiv.org/abs/2509.07120
Authors: Chung-Shien Brian Wang,Christian Schmidt,Jens Piekenbrinck,Bastian Leibe
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page at this https URL

Abstract:Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and π³ have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck, due to the quadratic complexity of the global attention layers, that limits the scalability to large image sets. In this paper, we empirically analyze the global attention matrix of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by the structured attention and inspired by recent advancement in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to 4× faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and π³, and supports large image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
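The selection logic can be illustrated with a dense, mask-based reference implementation: score block pairs cheaply, keep the top fraction, and mask out the rest. Real speedups require fused block-sparse kernels; the block size and keep ratio below are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_ratio=0.1):
    """Mask-based sketch of block-sparse global attention. q, k, v: (B, N, D)."""
    B, N, D = q.shape
    nb = N // block
    # Cheap block-level scores from mean-pooled queries/keys.
    qb = q[:, :nb * block].reshape(B, nb, block, D).mean(2)
    kb = k[:, :nb * block].reshape(B, nb, block, D).mean(2)
    scores = qb @ kb.transpose(-1, -2)                 # (B, nb, nb)
    k_keep = max(1, int(keep_ratio * nb))
    keep = scores.topk(k_keep, dim=-1).indices         # top blocks per query block
    mask = torch.full((B, nb, nb), float("-inf"), device=q.device)
    mask.scatter_(-1, keep, 0.0)
    # Expand the block mask to token resolution and attend only within kept blocks.
    mask = mask.repeat_interleave(block, 1).repeat_interleave(block, 2)
    attn = F.softmax(q[:, :nb * block] @ k[:, :nb * block].transpose(-1, -2) / D**0.5 + mask, -1)
    return attn @ v[:, :nb * block]
```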

[CV-64] Automated Evaluation of Gender Bias Across 13 Large Multimodal Models

【Quick Read】: This paper addresses the risk that large multimodal models (LMMs) amplify social gender bias in text-to-image generation, together with the lack of large-scale, comparable cross-model evaluation. The key to the solution is the Aymara Image Fairness Evaluation benchmark: 75 procedurally generated, gender-neutral prompts drive 13 commercial LMMs to generate occupation-related images, and a validated LLM-as-a-judge system scores the resulting 965 images for gender representation, quantifying and comparing each model's bias across occupational categories. This first systematic fairness evaluation across mainstream LMMs shows that bias is not inevitable but a consequence of design choices, and provides a standardized, automated evaluation tool for promoting accountability and fairness in AI development.

Link: https://arxiv.org/abs/2509.07050
Authors: Juan Manuel Contreras
Affiliations: Aymara
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments:

Abstract:Large multimodal models (LMMs) have revolutionized text-to-image generation, but they risk perpetuating the harmful social biases in their training data. Prior work has identified gender bias in these models, but methodological limitations prevented large-scale, comparable, cross-model analysis. To address this gap, we introduce the Aymara Image Fairness Evaluation, a benchmark for assessing social bias in AI-generated images. We test 13 commercially available LMMs using 75 procedurally-generated, gender-neutral prompts to generate people in stereotypically-male, stereotypically-female, and non-stereotypical professions. We then use a validated LLM-as-a-judge system to score the 965 resulting images for gender representation. Our results reveal (p < .001 for all): 1) LMMs systematically not only reproduce but actually amplify occupational gender stereotypes relative to real-world labor data, generating men in 93.0% of images for male-stereotyped professions but only 22.5% for female-stereotyped professions; 2) Models exhibit a strong default-male bias, generating men 68.3% of the time for non-stereotyped professions; and 3) The extent of bias varies dramatically across models, with overall male representation ranging from 46.7% to 73.3%. Notably, the top-performing model de-amplified gender stereotypes and approached gender parity, achieving the highest fairness scores. This variation suggests high bias is not an inevitable outcome but a consequence of design choices. Our work provides the most comprehensive cross-model benchmark of gender bias to date and underscores the necessity of standardized, automated evaluation tools for promoting accountability and fairness in AI development.

[CV-65] Enhancing Classification of Streaming Data with Image Distillation

【Quick Read】: This paper addresses the efficient classification of streaming image data under limited memory and compute. The key to the solution is data distillation: by distilling essential features from the data stream, the method markedly reduces computational demands while preserving the information critical for accurate classification, enabling high-accuracy streaming image classification.

Link: https://arxiv.org/abs/2509.07049
Authors: Rwad Khatib,Yehudit Aperstein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages

Abstract:This study tackles the challenge of efficiently classifying streaming data in environments with limited memory and computational resources. It delves into the application of data distillation as an innovative approach to improve the precision of streaming image data classification. By focusing on distilling essential features from data streams, our method aims to minimize computational demands while preserving crucial information for accurate classification. Our investigation compares this approach against traditional algorithms like Hoeffding Trees and Adaptive Random Forest, adapted through embeddings for image data. The Distillation Based Classification (DBC) demonstrated superior performance, achieving a 73.1% accuracy rate, surpassing both traditional methods and the Reservoir Sampling Based Classification (RBC) technique. This marks a significant advancement in streaming data classification, showcasing the effectiveness of our method in processing complex data streams and setting a new standard for accuracy and efficiency.

[CV-66] SAM*: Task-Adaptive SAM with Physics-Guided Rewards

【Quick Read】: This paper addresses the limited usability of foundational models for real-time streaming microscopy segmentation, caused by their many complex, non-transparent tuning parameters. The key to the solution is reward-function-based optimization: reward functions that reflect the physics of the imaged system (such as particle size distributions and geometries) are used to fine-tune the Segment Anything Model (SAM), producing an optimized, more adaptable variant, SAM*, that markedly improves segmentation accuracy and real-time performance on complex scenarios such as cellular structures, material interfaces, and nanoscale features.

Link: https://arxiv.org/abs/2509.07047
Authors: Kamyar Barakati,Utkarsh Pratiush,Sheryl L. Sanchez,Aditya Raghavan,Delia J. Milliron,Mahshid Ahmadi,Philip D. Rack,Sergei V. Kalinin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)
Comments: 19 pages, 8 figures

Abstract:Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. This task can be performed using custom models trained on domain-specific datasets, transfer learning from pre-trained models, or foundational models that offer broad applicability. However, foundational models often present a considerable number of non-transparent tuning parameters that require extensive manual optimization, limiting their usability for real-time streaming data analysis. Here, we introduce a reward function-based optimization to fine-tune foundational models and illustrate this approach for the SAM (Segment Anything Model) framework by Meta. The reward functions can be constructed to represent the physics of the imaged system, including particle size distributions, geometries, and other criteria. By integrating a reward-driven optimization framework, we enhance SAM’s adaptability and performance, leading to an optimized variant, SAM*, that better aligns with the requirements of diverse segmentation tasks and particularly allows for real-time streaming data segmentation. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.
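A toy physics-guided reward under assumed priors (the target mean particle size and coefficient of variation are made up for illustration); `segment_with_sam` is a hypothetical wrapper around a SAM configuration being tuned.

```python
import numpy as np

def physics_reward(masks, expected_mean_px=250.0, expected_cv=0.3):
    """Score a segmentation by how well instance sizes match physical priors.
    `masks` is a list of boolean instance masks; the priors are illustrative."""
    if not masks:
        return -np.inf
    areas = np.array([m.sum() for m in masks], dtype=float)
    mean_term = -abs(areas.mean() - expected_mean_px) / expected_mean_px
    cv_term = -abs(areas.std() / (areas.mean() + 1e-9) - expected_cv)
    return mean_term + cv_term

# Tune SAM's non-transparent knobs by maximizing the reward over a grid:
# best = max(param_grid, key=lambda p: physics_reward(segment_with_sam(img, **p)))
```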

[CV-67] Benchmarking Vision Transformers and CNNs for Thermal Photovoltaic Fault Detection with Explainable AI Validation

【Quick Read】: This paper addresses the adoption barrier that limited interpretability creates for AI-based automated photovoltaic (PV) monitoring: deep learning models achieve high accuracy in thermal fault detection, but whether their decisions follow thermal-physics principles has not been validated, undermining trust in deployment. The key to the solution is thermal-physics-guided interpretability, using XRAI saliency analysis to assess whether model decisions align with thermal physics, enabling a systematic comparison of convolutional neural networks (CNNs) and vision transformers (ViTs) on thermal imagery. Swin Transformer performs best on both binary and multiclass tasks (94% and 73% accuracy, respectively), and XRAI shows that learned features (localized hotspots, linear thermal paths, thermal boundaries) are highly consistent with expected thermal signatures, strengthening trust in AI decisions and providing a methodology for deploying explainable AI in renewable-energy monitoring.

Link: https://arxiv.org/abs/2509.07039
Authors: Serra Aksoy
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 28 Pages, 4 Figures

Abstract:Artificial intelligence deployment for automated photovoltaic (PV) monitoring faces interpretability barriers that limit adoption in energy infrastructure applications. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles remains lacking, creating deployment hesitancy where understanding model reasoning is critical. This study provides a systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles. This represents the first systematic comparison of CNNs and vision transformers for thermal PV fault detection with physics-validated interpretability. Evaluation on 20,000 infrared images spanning normal operation and 11 fault categories shows that Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis reveals that models learn physically meaningful features, such as localized hotspots for cell defects, linear thermal paths for diode failures, and thermal boundaries for vegetation shading, consistent with expected thermal signatures. However, performance varies significantly across fault types: electrical faults achieve strong detection (F1-scores > 0.90) while environmental factors like soiling remain challenging (F1-scores 0.20-0.33), indicating limitations imposed by thermal imaging resolution. The thermal physics-guided interpretability approach provides methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure.

[CV-68] Moment- and Power-Spectrum-Based Gaussianity Regularization for Text-to-Image Models NEURIPS2025

【Quick Read】: This paper addresses performance degradation in latent-space optimization for text-to-image models when sample distributions drift away from the standard Gaussian, which in test-time reward alignment tasks (such as improving aesthetics and text alignment) leads to reward hacking and slow convergence. The key to the solution is a novel regularization loss that enforces standard Gaussianity by jointly constraining spatial-domain moments and frequency-domain power spectra: the known analytical expectations of these quantities define a differentiable loss, and applying it to randomly permuted inputs ensures permutation invariance. The method retains computational efficiency while clearly outperforming existing Gaussianity-based regularizations.

Link: https://arxiv.org/abs/2509.07027
Authors: Jisung Hwang,Jaihoon Kim,Minhyuk Sung
Affiliations: KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Submitted to NeurIPS 2025

Abstract:We propose a novel regularization loss that enforces standard Gaussianity, encouraging samples to align with a standard Gaussian distribution. This facilitates a range of downstream tasks involving optimization in the latent space of text-to-image models. We treat elements of a high-dimensional sample as one-dimensional standard Gaussian variables and define a composite loss that combines moment-based regularization in the spatial domain with power spectrum-based regularization in the spectral domain. Since the expected values of moments and power spectrum distributions are analytically known, the loss promotes conformity to these properties. To ensure permutation invariance, the losses are applied to randomly permuted inputs. Notably, existing Gaussianity-based regularizations fall within our unified framework: some correspond to moment losses of specific orders, while the previous covariance-matching loss is equivalent to our spectral loss but incurs higher time complexity due to its spatial-domain computation. We showcase the application of our regularization in generative modeling for test-time reward alignment with a text-to-image model, specifically to enhance aesthetics and text alignment. Our regularization outperforms previous Gaussianity regularization, effectively prevents reward hacking and accelerates convergence.
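A minimal sketch of a combined moment/power-spectrum Gaussianity loss, using standard-normal targets (mean 0, variance 1, skewness 0, kurtosis 3, flat expected power spectrum); the weights and exact normalization are our assumptions, not the paper's composite loss.

```python
import torch

def gaussianity_loss(z, w_moment=1.0, w_spec=1.0):
    """Penalize deviation of a randomly permuted latent from i.i.d. N(0, 1)."""
    z = z.flatten()[torch.randperm(z.numel(), device=z.device)]  # permutation invariance
    # Spatial domain: first four moments of N(0,1) are 0, 1, 0, 3.
    mu, var = z.mean(), z.var()
    zc = z - mu
    skew = (zc ** 3).mean() / var.clamp_min(1e-12) ** 1.5
    kurt = (zc ** 4).mean() / var.clamp_min(1e-12) ** 2
    moment = mu ** 2 + (var - 1) ** 2 + skew ** 2 + (kurt - 3) ** 2
    # Spectral domain: white noise has expected DFT power N at every bin.
    power = torch.fft.rfft(z).abs() ** 2
    spec = ((power / z.numel() - 1.0) ** 2).mean()
    return w_moment * moment + w_spec * spec
```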

[CV-69] MEGS2: Memory-Efficient Gaussian Splatting via Spherical Gaussians and Unified Pruning

【Quick Read】: This paper addresses the high memory consumption that limits 3D Gaussian Splatting (3DGS) on edge devices, with rendering-time VRAM as the key bottleneck. The key to the solution is MEGS², a memory-efficient framework that jointly optimizes two factors, the total number of primitives and the parameters per primitive, achieving unprecedented memory compression. Concretely, memory-intensive spherical harmonics are replaced with lightweight arbitrarily-oriented spherical Gaussian lobes as the color representation, and a unified soft pruning framework casts primitive-number and lobe-number pruning as a single constrained optimization problem, substantially reducing static and rendering VRAM while preserving rendering quality.

Link: https://arxiv.org/abs/2509.07021
Authors: Jiarui Chen,Yikeng Chen,Yingshuang Zou,Ye Huang,Peng Wang,Yuan Liu,Yujing Sun,Wenping Wang
Affiliations: HKUST; SZU; SYSU; Adobe; NTU; TAMU
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures

Abstract:3D Gaussian Splatting (3DGS) has emerged as a dominant novel-view synthesis technique, but its high memory consumption severely limits its applicability on edge devices. A growing number of 3DGS compression methods have been proposed to make 3DGS more efficient, yet most only focus on storage compression and fail to address the critical bottleneck of rendering memory. To address this problem, we introduce MEGS², a novel memory-efficient framework that tackles this challenge by jointly optimizing two key factors: the total primitive number and the parameters per primitive, achieving unprecedented memory compression. Specifically, we replace the memory-intensive spherical harmonics with lightweight arbitrarily-oriented spherical Gaussian lobes as our color representations. More importantly, we propose a unified soft pruning framework that models primitive-number and lobe-number pruning as a single constrained optimization problem. Experiments show that MEGS² achieves a 50% static VRAM reduction and a 40% rendering VRAM reduction compared to existing methods, while maintaining comparable rendering quality.

[CV-70] Human-in-the-Loop: Quantitative Evaluation of 3D Models Generation by Large Language Models

【Quick Read】: This paper addresses the lack of reliable evaluation of geometric and structural fidelity when large language models (LLMs) generate three-dimensional (3D) shapes: although LLMs can process multimodal inputs and produce complex 3D models, quantitatively measuring similarity and complexity against ground-truth CAD references remains immature. The key to the solution is a human-in-the-loop quantitative evaluation framework with a comprehensive suite of metrics covering volumetric accuracy, surface alignment, dimensional fidelity, and topological complexity. Using an L-bracket as a case study, it systematically compares generation across four input modalities (orthographic 2D views, isometric sketches, geometric structure trees, and code-level correction prompts). Experiments show that semantically richer inputs improve generation fidelity, with code-level prompts achieving perfect reconstruction across all metrics, and that this quantitative evaluation converges toward the ground-truth geometry significantly faster than traditional qualitative methods based on visual inspection and subjective judgment.

Link: https://arxiv.org/abs/2509.07010
Authors: Ahmed R. Sadik,Mariusz Bujny
Affiliations: Honda Research Institute Europe - Germany; NUMETO - Poland
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments:

Abstract:Large Language Models are increasingly capable of interpreting multimodal inputs to generate complex 3D shapes, yet robust methods to evaluate geometric and structural fidelity remain underdeveloped. This paper introduces a human in the loop framework for the quantitative evaluation of LLM generated 3D models, supporting applications such as democratization of CAD design, reverse engineering of legacy designs, and rapid prototyping. We propose a comprehensive suite of similarity and complexity metrics, including volumetric accuracy, surface alignment, dimensional fidelity, and topological intricacy, to benchmark generated models against ground truth CAD references. Using an L bracket component as a case study, we systematically compare LLM performance across four input modalities: 2D orthographic views, isometric sketches, geometric structure trees, and code based correction prompts. Our findings demonstrate improved generation fidelity with increased semantic richness, with code level prompts achieving perfect reconstruction across all metrics. A key contribution of this work is demonstrating that our proposed quantitative evaluation approach enables significantly faster convergence toward the ground truth, especially compared to traditional qualitative methods based solely on visual inspection and human intuition. This work not only advances the understanding of AI assisted shape synthesis but also provides a scalable methodology to validate and refine generative models for diverse CAD applications.
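Two standard stand-ins for the similarity metrics named above, shown for concreteness; the paper's exact metric definitions may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between point clouds sampled from the
    generated and ground-truth CAD surfaces: a simple surface-alignment metric."""
    d_ab, _ = cKDTree(pts_b).query(pts_a)  # nearest reference point per generated point
    d_ba, _ = cKDTree(pts_a).query(pts_b)
    return d_ab.mean() + d_ba.mean()

def volumetric_iou(vox_a, vox_b):
    """Volumetric accuracy via IoU of boolean occupancy grids."""
    inter = np.logical_and(vox_a, vox_b).sum()
    union = np.logical_or(vox_a, vox_b).sum()
    return inter / max(union, 1)
```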

[CV-71] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

【Quick Read】: This paper asks whether current models can transfer attribute knowledge across semantically and perceptually dissimilar categories, i.e., abstract general attributes and apply them to conceptually distant object types (for example, recognizing that "has four legs" applies to both "dog" and "chair"). The key to the solution is a family of train-test split strategies built on different principles, including LLM-driven semantic grouping, embedding-similarity thresholding, embedding-based clustering, and supercategory partitioning using ground-truth labels, which progressively reduce the correlation between training and test sets to stress-test attribute prediction under extreme conditions. Experiments show performance drops sharply as category correlation weakens, while clustering offers the most effective trade-off, reducing hidden correlations while preserving learnability, yielding insights into the limits of current representations and informing future benchmarks for attribute reasoning (see the split sketch after the abstract below).

Link: https://arxiv.org/abs/2509.06998
Authors: Liviu Nicolae Fircă,Antonio Bărbălau,Dan Oneata,Elena Burceanu
Affiliations: Bitdefender; University of Bucharest; National University of Science and Technology Politehnica Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute “has four legs” is common to both “dogs” and “chairs”. To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.
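A sketch of the clustering-based split that the results favor, assuming category-level embeddings from any text or image encoder; the cluster count and test fraction are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_split(category_embeddings, test_frac=0.25, n_clusters=20, seed=0):
    """Assign whole clusters of semantically similar categories to either the
    train or test side, so attribute knowledge cannot leak through
    near-duplicate categories."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(category_embeddings)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_clusters)
    test_clusters = set(shuffled[: int(test_frac * n_clusters)].tolist())
    test_mask = np.isin(labels, list(test_clusters))
    return ~test_mask, test_mask  # boolean masks over categories
```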

[CV-72] K-Syn: K-space Data Synthesis in Ultra Low-data Regimes

【Quick Read】: This paper addresses the difficulty of high-quality reconstruction in dynamic cardiac magnetic resonance (CMR) imaging caused by data scarcity: high-quality, diverse k-space data are hard to obtain in practice, limiting stable dynamic MRI reconstruction. The key to the solution is twofold: first, feature-level learning is performed directly in the frequency domain (k-space), exploiting the global representation capacity of the Fourier transform by treating the frequency domain as a natural global feature space, avoiding the limitations of pixel-level convolutional modeling in the image domain; second, a temporal-fusion strategy serves as generative guidance, fusing k-space data across time frames with multiple strategies to steer and optimize the generative trajectory, enabling stable and rich data synthesis even in ultra-low-data regimes.

Link: https://arxiv.org/abs/2509.06997
Authors: Guan Yu,Zhang Jianhua,Liang Dong,Liu Qiegen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Owing to the inherently dynamic and complex characteristics of cardiac magnetic resonance (CMR) imaging, high-quality and diverse k-space data are rarely available in practice, which in turn hampers robust reconstruction of dynamic cardiac MRI. To address this challenge, we perform feature-level learning directly in the frequency domain and employ a temporal-fusion strategy as the generative guidance to synthesize k-space data. Specifically, leveraging the global representation capacity of the Fourier transform, the frequency domain can be considered a natural global feature space. Therefore, unlike traditional methods that use pixel-level convolution for feature learning and modeling in the image domain, this letter focuses on feature-level modeling in the frequency domain, enabling stable and rich generation even with ultra low-data regimes. Moreover, leveraging the advantages of feature-level modeling in the frequency domain, we integrate k-space data across time frames with multiple fusion strategies to steer and further optimize the generative trajectory. Experimental results demonstrate that the proposed method possesses strong generative ability in low-data regimes, indicating practical potential to alleviate data scarcity in dynamic MRI reconstruction.

[CV-73] Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

【Quick Read】: This paper investigates whether current vision language models (VLMs) lack the human resilience of literacy: their text recognition degrades sharply when characters are fragmented, overlapping, or partially occluded. Although VLMs perform well on clean text, under perturbations modeled on human psychophysics experiments their outputs are often irrelevant or incoherent, suggesting over-reliance on generic visual invariances and neglect of compositional and structural priors for symbols. The key contribution is a perturbation benchmark spanning writing systems (Chinese logographs and English alphabetic words) that exposes the models' weak modeling of symbol segmentation, composition, and binding, motivating architectures and training strategies that explicitly encode multi-script symbolic structure to improve robustness in real-world settings such as education, accessibility, cultural heritage, and security.

Link: https://arxiv.org/abs/2509.06996
Authors: Jie Zhang,Ting Xu,Gelei Deng,Runyi Hu,Han Qiu,Tianwei Zhang,Qing Guo,Ivor Tsang
Affiliations: CFAR and IPHC, A*STAR; National University of Singapore; Nanyang Technological University; Tsinghua University; Nankai University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield “visible but unreadable” stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

[CV-74] he Protocol Genome A Self Supervised Learning Framework from DICOM Headers

【Quick Read】: This paper addresses the generalization drop that medical imaging models suffer when deployed across centers and devices (scanner make/model, imaging protocol parameters) due to imaging-protocol differences implicit in DICOM headers, i.e., protocol bias affecting image-only networks. The key to the solution is the Protocol Genome, a self-supervised system that treats structured DICOM headers as labels and jointly models image features with embeddings of protocol fields through three core mechanisms: (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation, yielding representations that are protocol-aware yet clinically robust. Experiments across tasks (PE triage on chest CT, glioma grading on brain MRI, cardiomegaly detection on chest X-ray) show higher external-validation AUROC (about +0.05 on average) and 25-37% better calibration, with the gains preserved at only 10-20% labeled data.

Link: https://arxiv.org/abs/2509.06995
Authors: Jimmy Joseph
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments:

Abstract:In this paper, we introduce the Protocol Genome, a self-supervised learning system that learns correlations from DICOM headers and achieves AUROC 0.901 (vs 0.847 baseline) and ECE 0.036 (vs 0.058) on fully held-out external validation. Our method also improves calibration and robustness across modalities (CT, MRI, CXR) and vendors. Clinical imaging is funneled through PACS/DICOM, where procedure choices (scanner make/model, sequence, kernel, kVp, TR/TE, and slice thickness) have consequences for contrast, noise, and artifact. These latent confounders impede the generalization of image-only networks across sites. We consider structured DICOM headers as a label and learn protocol-aware but clinically robust image representations. Protocol Genome obtains tokenized embeddings of de-identified header fields and models them along with image features using: (1) protocol-image contrastive learning, (2) masked protocol prediction, and (3) protocol-protocol translation. With 1.26M studies (7 health systems, 31 scanners, 3 vendors; CT, MR, CR/DR), we experiment on: (A) chest CT triage for PE, (B) brain MRI glioma grading, and (C) chest radiograph cardiomegaly detection. Relative to strong SSL baselines (SimCLR, MAE) as well as ImageNet transfer, Protocol Genome (+0.046: PE, +0.058: glioma, +0.041: cardiomegaly) is associated with higher external AUROC; 25-37% calibration improvements are obtained (p < 0.01, DeLong tests). While the gains may be task-dependent, they are preserved with 10-20% of labeled data. From a clinical point of view, the technique reduces false positives at protocol borders and is applicable in a PACS (DICOM C-FIND/C-MOVE, DICOMweb QIDO/WADO). We publish a model card and deployment guide, complete with both de-identification and bias audits.

[CV-75] Geospatial Foundational Embedder: Top-1 Winning Solution on EarthVision Embed2Scale Challenge (CVPR 2025) CVPR2025

【Quick Read】: This paper addresses how to embed SSL4EO-S12 hyperspectral geospatial data cubes into low-dimensional vector representations that support downstream tasks such as classification and regression. The key to the solution is an embedding model built on multi-scale feature fusion and contrastive learning that captures both the spatial structure and the spectral information of remote sensing imagery, with self-supervised pre-training improving the generalization of the embedding vectors and enabling strong cross-task transfer.

Link: https://arxiv.org/abs/2509.06993
Authors: Zirui Xu,Raphael Tang,Mike Bianco,Qi Zhang,Rishi Madhok,Nikolaos Karianakis,Fuxun Yu
Affiliations: Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: CVPR 2025 EarthVision Embed2Scale challenge Top-1 Winning Solution

Abstract:The EarthVision Embed2Scale challenge (CVPR 2025) aims to develop foundational geospatial models that embed SSL4EO-S12 hyperspectral geospatial data cubes into embedding vectors that facilitate various downstream tasks, e.g., classification, regression, etc. In this technical report, we introduce our proposed method for the Top-1 winning solution on the Embed2Scale Challenge.

[CV-76] FedAPT: Federated Adversarial Prompt Tuning for Vision-Language Models ACM-MM25

【Quick Read】: This paper addresses the lack of adversarial robustness of federated prompt tuning (FPT) under non-IID data: adversarial examples generated from clients' limited local labels are inconsistent with the global-label attacks the global model must defend against, causing misclassification in downstream tasks. The key to the solution is a class-aware prompt generator that produces visual prompts from text prompts, guided by a Global Label Embedding (serving as a "beacon") that encodes cross-client class information and yields more globally aligned visual prompts; a cross-layer generator sharing strategy further strengthens prompt coupling across model layers and boosts robustness under adversarial perturbations. Experiments on multiple image classification datasets show FedAPT clearly outperforms existing methods and generalizes well across domains and datasets.

Link: https://arxiv.org/abs/2509.06992
Authors: Kun Zhai,Siheng Chen,Xingjun Ma,Yu-Gang Jiang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM MM25

Abstract:Federated Prompt Tuning (FPT) is an efficient method for cross-client collaborative fine-tuning of large Vision-Language Models (VLMs). However, models tuned using FPT are vulnerable to adversarial attacks, leading to misclassification in downstream tasks. In this work, we introduce Federated Adversarial Prompt Tuning (FedAPT), a novel method designed to enhance the adversarial robustness of FPT. We identify a key issue in FedAPT under non-independent and identically distributed (non-IID) settings: a class information gap between clients and the global model. Clients rely solely on limited local label information to generate adversarial samples for training, while the global model must defend against adversarial attacks from global labels. To address this issue, we propose a class-aware prompt generator that generates visual prompts from text prompts. This generator is guided by a Global Label Embedding (serving as a “beacon”) which encodes cross-client label information to create more globally-aligned visual prompts. Additionally, we propose a cross-layer generator sharing strategy to enhance prompt coupling across different layers of the model, further boosting adversarial robustness. Extensive experiments on multiple image classification datasets demonstrate the superiority of FedAPT in improving adversarial robustness, outperforming existing methods by a large margin. FedAPT also exhibits exceptional generalization in cross-domain and cross-dataset scenarios, indicating its effectiveness in real-world applications.

[CV-77] DIET-CP: Lightweight and Data Efficient Self Supervised Continued Pretraining

【Quick Read】: This paper addresses the difficulty of continued pretraining of foundation models in small, specialized domains: when target-domain data are limited, SSL methods developed for large-scale pretraining do not apply, hyperparameter search is infeasible, and pretrained models are usually released as backbone weights only, lacking the information needed to continue pretraining. The key to the solution is DIET-CP, a continued pretraining strategy with a very simple objective that needs no labels and introduces no more hyperparameters than supervised fine-tuning, steering any strong foundation model toward the new data distribution; it is stable across modalities and backbones and provides a significant boost for state-of-the-art models such as DINOv3 using only 1000 images.

Link: https://arxiv.org/abs/2509.06990
Authors: Bryan Rodas,Natalie Montesino,Jakob Ambsdorf,David Klindt,Randall Balestriero
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Continued pretraining offers a promising solution for adapting foundation models to a new target domain. However, in specialized domains, available datasets are often very small, limiting the applicability of SSL methods developed for large-scale pretraining and making hyperparameter search infeasible. In addition, pretrained models are usually released as backbone-weights only, lacking important information to continue pretraining. We propose to bridge this gap with DIET-CP, a simple continued pretraining strategy, where any strong foundation model can be steered towards the new data distribution of interest. DIET-CP relies on a very simple objective, requires no labels, and introduces no more hyperparameters than supervised finetuning. It is stable across data modalities and backbone choices, while providing a significant performance boost for state-of-the-art models such as DINOv3 using only 1000 images.

[CV-78] Frustratingly Easy Feature Reconstruction for Out-of-Distribution Detection

【Quick Read】: This paper addresses **out-of-distribution (OOD) detection**: identifying test samples outside the training label distribution without access to the training data, which matters in data-privacy-sensitive settings. The key to the solution is Classifier-based Feature Reconstruction (ClaFR), a post-hoc method built on an orthogonal decomposition of the classifier weights: an orthogonal basis of the weights defines the class-known subspace, input features are mapped into this subspace to obtain new representations, and the reconstruction error serves as the OOD score. The method needs no training data yet achieves leading performance on multiple OOD benchmarks.

Link: https://arxiv.org/abs/2509.06988
Authors: Yingsheng Wang,Shuo Lu,Jian Liang,Aihua Zheng,Ran He
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to PRCV2025

Abstract:Out-of-distribution (OOD) detection helps models identify data outside the training categories, crucial for security applications. While feature-based post-hoc methods address this by evaluating data differences in the feature space without changing network parameters, they often require access to training data, which may not be suitable in scenarios where data privacy protection is a concern. In this paper, we propose a simple yet effective post-hoc method, termed Classifier-based Feature Reconstruction (ClaFR), from the perspective of subspace projection. It first performs an orthogonal decomposition of the classifier’s weights to extract the class-known subspace, then maps the original data features into this subspace to obtain new data representations. Subsequently, the OOD score is determined by calculating the feature reconstruction error of the data within the subspace. Compared to existing OOD detection algorithms, our method does not require access to training data while achieving leading performance on multiple OOD benchmarks. Our code is released at this https URL.
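The described scoring rule is simple enough to sketch directly: build an orthonormal basis of the classifier's row space via SVD, project, and score by reconstruction error. This follows the abstract's description; the paper's exact normalization may differ.

```python
import torch

def clafr_score(features, W):
    """ClaFR-style OOD score (sketch). features: (N, D); W: classifier
    weights (num_classes, D). Higher error suggests an OOD sample."""
    # Orthonormal basis of the row space of W (the class-known subspace).
    _, _, Vh = torch.linalg.svd(W, full_matrices=False)  # Vh: (num_classes, D)
    proj = features @ Vh.T @ Vh                          # projection onto the subspace
    return (features - proj).norm(dim=1)                 # per-sample reconstruction error
```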

[CV-79] FusWay: Multimodal hybrid fusion approach. Application to Railway Defect Detection

【Quick Read】: This paper addresses over-detection in railway structural defect detection caused by relying on the image modality alone: when a defect looks similar to normal structural elements, image-based detectors (such as the YOLO family) produce false positives. The key to the solution is a multimodal fusion architecture built on domain rules that combines YOLOv8n's fast object detection with a Vision Transformer (ViT) fusing feature maps from multiple layers (7, 16, and 19) and synthesized audio representations, enabling effective image-audio cooperation and improving detection precision and overall accuracy. On a real-world railway dataset, the method improves average precision and accuracy by 0.2 points over the vision-only approach, with statistically significant differences.

Link: https://arxiv.org/abs/2509.06987
Authors: Alexey Zhukov(UB, CNRS, Bordeaux INP, Inria, LaBRI),Jenny Benois-Pineau(UB, CNRS, Bordeaux INP, Inria, LaBRI),Amira Youssef(SNCF Réseau),Akka Zemmari(UB, CNRS, Bordeaux INP, Inria, LaBRI),Mohamed Mosbah(UB, CNRS, Bordeaux INP, Inria, LaBRI),Virginie Taillandier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal fusion is a multimedia technique that has become popular in a wide range of tasks where image information is accompanied by a signal/audio. The latter may not convey highly semantic information, such as speech or music, but rather measurements such as the audio signal recorded by microphones with the goal of detecting rail structure elements or defects. While classical detection approaches such as the You Only Look Once (YOLO) family of detectors can be efficiently deployed for defect detection on the image modality, single-modality approaches remain limited: they yield over-detection when a defect's appearance is similar to that of normal structural elements. The paper proposes a new multimodal fusion architecture built on the basis of domain rules with YOLO and Vision Transformer backbones. It integrates YOLOv8n for rapid object detection with a Vision Transformer (ViT) to combine feature maps extracted from multiple layers (7, 16, and 19) and synthesised audio representations for two defect classes: rail Rupture and Surface defect. Fusion is performed between audio and image. Experimental evaluation on a real-world railway dataset demonstrates that our multimodal fusion improves precision and overall accuracy by 0.2 points compared to the vision-only approach. Student’s unpaired t-test also confirms the statistical significance of the differences in mean accuracy.

[CV-80] CellPainTR: Generalizable Representation Learning for Cross-Dataset Cell Painting Analysis

【Quick Read】: This paper addresses the difficulty of integrating heterogeneous datasets for large-scale biological discovery in cell morphology imaging, where technical batch effects and the lack of generalizable models are critical roadblocks. The key to the solution is CellPainTR, a Transformer-based model with source-specific context tokens that learns foundational representations of cellular morphology robust to batch effects and generalizes to out-of-distribution (OOD) data without fine-tuning, clearly improving cross-study integration and preservation of biological signal.

Link: https://arxiv.org/abs/2509.06986
Authors: Cedric Caruzzo,Jong Chul Ye
Affiliations: Kim Jaechul Graduate School of AI, KAIST
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 14 pages, 4 figures. Code available at: this https URL

Abstract:Large-scale biological discovery requires integrating massive, heterogeneous datasets like those from the JUMP Cell Painting consortium, but technical batch effects and a lack of generalizable models remain critical roadblocks. To address this, we introduce CellPainTR, a Transformer-based architecture designed to learn foundational representations of cellular morphology that are robust to batch effects. Unlike traditional methods that require retraining on new data, CellPainTR’s design, featuring source-specific context tokens, allows for effective out-of-distribution (OOD) generalization to entirely unseen datasets without fine-tuning. We validate CellPainTR on the large-scale JUMP dataset, where it outperforms established methods like ComBat and Harmony in both batch integration and biological signal preservation. Critically, we demonstrate its robustness through a challenging OOD task on the unseen Bray et al. dataset, where it maintains high performance despite significant domain and feature shifts. Our work represents a significant step towards creating truly foundational models for image-based profiling, enabling more reliable and scalable cross-study biological analysis.

[CV-81] Estimating forest carbon stocks from high-resolution remote sensing imagery by reducing domain shift with style transfer

【Quick Read】: This paper addresses the insufficient accuracy of remote-sensing-based estimation in forest carbon stock monitoring, in particular how to improve the accuracy of estimating the spatial distribution of carbon stocks under large-scale observation. The key to the solution is a style-transfer-based method using a Swin Transformer, whose attention mechanisms extract global image features, recasting carbon stock estimation as an image-to-image translation task and enabling more accurate inversion of the spatial distribution of forest carbon stocks from multi-source remote sensing imagery (GF-1 WFV and Landsat TM).

Link: https://arxiv.org/abs/2502.00784
Authors: Zhenyu Yu,Jinnian Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Image and Video Processing (eess.IV)
Comments:

Abstract:Forests function as crucial carbon reservoirs on land, and their carbon sinks can efficiently reduce atmospheric CO2 concentrations and mitigate climate change. Currently, the overall trend for monitoring and assessing forest carbon stocks is to integrate ground monitoring sample data with satellite remote sensing imagery. This style of analysis facilitates large-scale observation. However, these techniques require improvement in accuracy. We used GF-1 WFV and Landsat TM images to analyze Huize County, Qujing City, Yunnan Province in China. Using the style transfer method, we introduced Swin Transformer to extract global features through attention mechanisms, converting the carbon stock estimation into an image translation.

[CV-82] Enhanced SegNet with Integrated Grad-CAM for Interpretable Retinal Layer Segmentation in OCT Images

【Quick Read】: This paper addresses the insufficient accuracy and interpretability of automatic retinal layer segmentation in optical coherence tomography (OCT) images: manual segmentation is time-consuming and subjective, while conventional deep learning models often lack the transparency needed for clinical trust. The key to the solution is threefold: improving the SegNet architecture with modified pooling strategies to strengthen feature extraction from noisy OCT images; a hybrid loss combining categorical cross-entropy with Dice loss to improve performance on thin and class-imbalanced retinal layers; and integrated Gradient-weighted Class Activation Mapping (Grad-CAM) to visualize the decision process, so that model outputs align with clinical anatomical landmarks and diagnostic trust and acceptance improve.

Link: https://arxiv.org/abs/2509.07795
Authors: S M Asiful Islam Saky,Ugyen Tshering
Affiliations: Albukhary International University
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Optical Coherence Tomography (OCT) is essential for diagnosing conditions such as glaucoma, diabetic retinopathy, and age-related macular degeneration. Accurate retinal layer segmentation enables quantitative biomarkers critical for clinical decision-making, but manual segmentation is time-consuming and variable, while conventional deep learning models often lack interpretability. This work proposes an improved SegNet-based deep learning framework for automated and interpretable retinal layer segmentation. Architectural innovations, including modified pooling strategies, enhance feature extraction from noisy OCT images, while a hybrid loss function combining categorical cross-entropy and Dice loss improves performance for thin and imbalanced retinal layers. Gradient-weighted Class Activation Mapping (Grad-CAM) is integrated to provide visual explanations, allowing clinical validation of model decisions. Trained and validated on the Duke OCT dataset, the framework achieved 95.77% validation accuracy, a Dice coefficient of 0.9446, and a Jaccard Index (IoU) of 0.8951. Class-wise results confirmed robust performance across most layers, with challenges remaining for thinner boundaries. Grad-CAM visualizations highlighted anatomically relevant regions, aligning segmentation with clinical biomarkers and improving transparency. By combining architectural improvements, a customized hybrid loss, and explainable AI, this study delivers a high-performing SegNet-based framework that bridges the gap between accuracy and interpretability. The approach offers strong potential for standardizing OCT analysis, enhancing diagnostic efficiency, and fostering clinical trust in AI-driven ophthalmic tools.
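
The hybrid loss is the most directly reusable ingredient of this framework. Below is a minimal PyTorch sketch of a categorical cross-entropy plus soft Dice combination, assuming logits of shape (N, C, H, W) and integer label maps; the mixing weight `alpha`, class count, and epsilon are illustrative placeholders rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def hybrid_ce_dice_loss(logits, targets, num_classes=8, alpha=0.5, eps=1e-6):
    """Cross-entropy plus soft Dice, mixed by `alpha` (illustrative values)."""
    ce = F.cross_entropy(logits, targets)

    probs = F.softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()

    # Soft Dice per class, then averaged; eps guards against empty classes
    dims = (0, 2, 3)
    intersection = (probs * one_hot).sum(dims)
    cardinality = probs.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    dice_loss = 1.0 - dice.mean()

    return alpha * ce + (1.0 - alpha) * dice_loss
```

The Dice term rewards per-class overlap regardless of class frequency, which is why such combinations help with thin, under-represented retinal layers.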

[CV-83] Understanding Ice Crystal Habit Diversity with Self-Supervised Learning

【Quick Read】: This paper addresses the difficulty of accurately representing ice-containing clouds in climate models, which stems from the diversity of ice crystal habits (shapes). The key to the solution is using self-supervised learning (SSL) to learn latent representations from ice crystal imagery: a vision transformer is pre-trained on large numbers of cloud particle images to obtain representations that are robust to crystal morphology, which are then applied to science-driven tasks such as quantifying ice crystal diversity, helping to better constrain the role of ice crystals in Earth's climate system.

Link: https://arxiv.org/abs/2509.07688
Authors: Joseph Ko, Hariprasath Govindarajan, Fredrik Lindsten, Vanessa Przybylo, Kara Sulia, Marcus van Lier-Walqui, Kara Lamb
Affiliation: Columbia University; Qualcomm Auto Ltd Sweden Filial & Linköping University; University at Albany
Subjects: Atmospheric and Oceanic Physics (physics.ao-ph); Computer Vision and Pattern Recognition (cs.CV)
Comments: None

Abstract:Ice-containing clouds strongly impact climate, but they are hard to model due to ice crystal habit (i.e., shape) diversity. We use self-supervised learning (SSL) to learn latent representations of crystals from ice crystal imagery. By pre-training a vision transformer with many cloud particle images, we learn robust representations of crystal morphology, which can be used for various science-driven tasks. Our key contributions include (1) validating that our SSL approach can be used to learn meaningful representations, and (2) presenting a relevant application where we quantify ice crystal diversity with these latent representations. Our results demonstrate the power of SSL-driven representations to improve the characterization of ice crystals and subsequently constrain their role in Earth’s climate system.

[CV-84] Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

【Quick Read】: This paper addresses the limitation that existing self-supervised learning (SSL) methods operate mainly in Euclidean space and therefore struggle to capture nonlinear dependencies and complex geometric structure. The key to the solution is Kernel VICReg, which lifts the variance, invariance, and covariance terms of the VICReg loss into a Reproducing Kernel Hilbert Space (RKHS). Operating on double-centered kernel matrices and Hilbert-Schmidt norms, it enables nonlinear feature learning without explicit feature maps, avoids representational collapse, and yields notable gains on small-scale or structurally complex data.

Link: https://arxiv.org/abs/2509.07289
Authors: M. Hadi Sepanj, Benyamin Ghojogh, Paul Fieguth
Affiliation: Unknown
Subjects: Machine Learning (stat.ML); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: None

Abstract:Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives, such as invariance to augmentations, variance preservation, and feature decorrelation, without requiring labels. However, most existing methods operate in Euclidean space, limiting their ability to capture nonlinear dependencies and geometric structures. In this work, we propose Kernel VICReg, a novel self-supervised learning framework that lifts the VICReg objective into a Reproducing Kernel Hilbert Space (RKHS). By kernelizing each term of the loss (variance, invariance, and covariance), we obtain a general formulation that operates on double-centered kernel matrices and Hilbert-Schmidt norms, enabling nonlinear feature learning without explicit mappings. We demonstrate that Kernel VICReg not only avoids representational collapse but also improves performance on tasks with complex or small-scale data. Empirical evaluations across MNIST, CIFAR-10, STL-10, TinyImageNet, and ImageNet100 show consistent gains over Euclidean VICReg, with particularly strong improvements on datasets where nonlinear structures are prominent. UMAP visualizations further confirm that kernel-based embeddings exhibit better isometry and class separation. Our results suggest that kernelizing SSL objectives is a promising direction for bridging classical kernel methods with modern representation learning.
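
To make the kernelized ingredients concrete, here is a rough NumPy sketch of the pieces the abstract names: an RBF kernel, double-centered kernel matrices, and Frobenius/Hilbert-Schmidt norms. The variance and covariance terms below are simplified proxies of my own construction, so the paper's exact loss composition will differ.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * d2)

def double_center(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return H @ K @ H

def kernel_vicreg_terms(Xa, Xb, gamma=1.0):
    """Xa, Xb: embeddings of two augmented views, shape (n, d)."""
    Ka = double_center(rbf_kernel(Xa, gamma))
    Kb = double_center(rbf_kernel(Xb, gamma))
    n = Ka.shape[0]
    # Invariance: discrepancy between the two views' centered kernels
    invariance = np.linalg.norm(Ka - Kb, "fro") ** 2 / n**2
    # Variance proxy: keep per-sample feature norms away from zero
    variance = np.mean(np.maximum(0.0, 1.0 - np.sqrt(np.clip(np.diag(Ka), 0, None))))
    # Covariance/decorrelation proxy: penalize off-diagonal mass
    off = Ka - np.diag(np.diag(Ka))
    covariance = (off**2).sum() / n**2
    return invariance, variance, covariance
```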

[CV-85] Evaluation of Machine Learning Reconstruction Techniques for Accelerated Brain MRI Scans

【Quick Read】: This paper addresses the long acquisition times of brain MRI scans, and in particular how to preserve diagnostic image quality under accelerated acquisition. The proposed solution is DeepFoqus-Accelerate, a deep-learning-based MRI reconstruction algorithm that reconstructs phase-encoding-undersampled 2D/3D T1, T2, and FLAIR sequences with a deep neural network, achieving fourfold acceleration (a 75% reduction in scan time) while maintaining diagnostic quality comparable to standard-of-care (SOC) images. In evaluations, no AI-reconstructed scan scored below 3 (minimally acceptable) and 95% scored 4 or higher, while objective metrics such as the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Haar wavelet-based Perceptual Similarity Index (HaarPSI) confirmed strong agreement, showing that workflow efficiency can be improved without sacrificing diagnostic accuracy.

Link: https://arxiv.org/abs/2509.07193
Authors: Jonathan I. Mandel, Shivaprakash Hiremath, Hedyeh Keshtgar, Timothy Scholl, Sadegh Raeisi
Affiliation: Unknown
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: This work has been submitted to Radiology: Artificial Intelligence for possible publication

Abstract:This retrospective-prospective study evaluated whether a deep learning-based MRI reconstruction algorithm can preserve diagnostic quality in brain MRI scans accelerated up to fourfold, using both public and prospective clinical data. The study included 18 healthy volunteers (scans acquired at 3T, January 2024-March 2025), as well as selected fastMRI public datasets with diverse pathologies. Phase-encoding-undersampled 2D/3D T1, T2, and FLAIR sequences were reconstructed with DeepFoqus-Accelerate and compared with standard-of-care (SOC). Three board-certified neuroradiologists and two MRI technologists independently reviewed 36 paired SOC/AI reconstructions from both datasets using a 5-point Likert scale, while quantitative similarity was assessed for 408 scans and 1224 datasets using the Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Haar wavelet-based Perceptual Similarity Index (HaarPSI). No AI-reconstructed scan scored below 3 (minimally acceptable), and 95% scored 4 or higher. Mean SSIM was 0.95 ± 0.03 (above 0.90 in 90% of cases), mean PSNR 41.0 dB, and mean HaarPSI 0.94. Inter-rater agreement was slight to moderate. Rare artifacts did not affect diagnostic interpretation. These findings demonstrate that DeepFoqus-Accelerate enables robust fourfold brain MRI acceleration with 75% reduced scan time, while preserving diagnostic image quality and supporting improved workflow efficiency.
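
The quantitative comparison relies on standard full-reference image metrics. A small sketch using scikit-image's SSIM and PSNR implementations is shown below; HaarPSI is not in scikit-image and would need a separate implementation, and the synthetic arrays merely stand in for paired SOC/AI-reconstructed slices.

```python
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def compare_reconstruction(reference, accelerated):
    """reference, accelerated: 2D float arrays (e.g., SOC vs. AI slices)."""
    rng = float(reference.max() - reference.min())
    ssim = structural_similarity(reference, accelerated, data_range=rng)
    psnr = peak_signal_noise_ratio(reference, accelerated, data_range=rng)
    return ssim, psnr

# Toy usage with synthetic data standing in for real reconstructions
ref = np.random.rand(256, 256)
rec = ref + 0.01 * np.random.randn(256, 256)
print(compare_reconstruction(ref, rec))
```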

Artificial Intelligence

[AI-0] Probing the Preferences of a Language Model: Integrating Verbal and Behavioral Tests of AI Welfare

【Quick Read】: This paper tackles the core question of how welfare might be measured in language models, against a backdrop of ongoing debate about whether AI systems have welfare states at all. The key to the solution is a set of new experimental paradigms that compare models' verbal reports of their preferences with preferences expressed through behavior in a virtual environment, introduce utility-style costs and rewards, and test whether responses to a eudaimonic welfare scale (measuring states such as autonomy and purpose in life) remain consistent under semantically equivalent prompt perturbations, checking how far the different measures support one another. The study finds reliable correlations between stated preferences and behavior in some models and conditions, suggesting that preference satisfaction could in principle serve as a measurable welfare proxy for some modern language models; however, consistency varies across models, conditions, and perturbations, so it remains uncertain whether genuine welfare states are being measured, even as the results demonstrate the feasibility of such measurement.

Link: https://arxiv.org/abs/2509.07961
Authors: Valen Tagliabue, Leonard Dung
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:We develop new experimental paradigms for measuring welfare in language models. We compare verbal reports of models about their preferences with preferences expressed through behavior when navigating a virtual environment and selecting conversation topics. We also test how costs and rewards affect behavior and whether responses to an eudaimonic welfare scale - measuring states such as autonomy and purpose in life - are consistent across semantically equivalent prompts. Overall, we observed a notable degree of mutual support between our measures. The reliable correlations observed between stated preferences and behavior across conditions suggest that preference satisfaction can, in principle, serve as an empirically measurable welfare proxy in some of today’s AI systems. Furthermore, our design offered an illuminating setting for qualitative observation of model behavior. Yet, the consistency between measures was more pronounced in some models and conditions than others and responses were not consistent across perturbations. Due to this, and the background uncertainty about the nature of welfare and the cognitive states (and welfare subjecthood) of language models, we are currently uncertain whether our methods successfully measure the welfare state of language models. Nevertheless, these findings highlight the feasibility of welfare measurement in language models, inviting further exploration.

[AI-1] ACE and Diverse Generalization via Selective Disagreement

【Quick Read】: This paper addresses the generalization problem of deep neural networks under complete spurious correlations, where a model can learn shortcut features that are highly correlated with, but irrelevant to, the task, degrading out-of-distribution performance. Existing approaches typically target incomplete spurious correlations, using labeled instances that break the correlation; under complete spurious correlations, the correct generalization is fundamentally underspecified. The key to the solution is ACE, a self-training method that encourages confident and selective disagreement: it learns a set of concepts that are consistent with the training data yet make distinct predictions on a subset of novel unlabeled inputs, yielding more robust generalization without additional labels.

Link: https://arxiv.org/abs/2509.07955
Authors: Oliver Daniels, Stuart Armstrong, Alexandre Maranhão, Mahirah Fairuz Rahman, Benjamin M. Marlin, Rebecca Gorman
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: None

Abstract:Deep neural networks are notoriously sensitive to spurious correlations, where a model learns a shortcut that fails out-of-distribution. Existing work on spurious correlations has often focused on incomplete correlations, leveraging access to labeled instances that break the correlation. But in cases where the spurious correlations are complete, the correct generalization is fundamentally underspecified. To resolve this underspecification, we propose learning a set of concepts that are consistent with training data but make distinct predictions on a subset of novel unlabeled inputs. Using a self-training approach that encourages confident and selective disagreement, our method ACE matches or outperforms existing methods on a suite of complete-spurious-correlation benchmarks, while remaining robust to incomplete spurious correlations. ACE is also more configurable than prior approaches, allowing for straightforward encoding of prior knowledge and principled unsupervised model selection. In an early application to language-model alignment, we find that ACE achieves competitive performance on the measurement tampering detection benchmark without access to untrusted measurements. While still subject to important limitations, ACE represents significant progress towards overcoming underspecification.

[AI-2] Bringing Multi-Modal Multi-Task Federated Foundation Models to Education Domain: Prospects and Challenges

【Quick Read】: This paper addresses three obstacles to deploying multi-modal multi-task (M3T) foundation models in education: privacy regulations, data silos, and the scarcity of domain-specific data. The key to the solution is M3T Federated Foundation Models (FedFMs) for education, a paradigm that integrates federated learning (FL) with M3T foundation models so that decentralized institutions can train collaboratively while sensitive student and institutional data stay local, enabling cross-institution knowledge sharing under privacy guarantees. Modular architectures further support personalization and encourage participation by under-resourced or underrepresented entities, providing a scalable, equitable, and interpretable foundation for next-generation intelligent education systems.

Link: https://arxiv.org/abs/2509.07946
Authors: Kasra Borazjani, Naji Khosravan, Rajeev Sahay, Bita Akram, Seyyedali Hosseinalipour
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 12 pages, 2 figures

Abstract:Multi-modal multi-task (M3T) foundation models (FMs) have recently shown transformative potential in artificial intelligence, with emerging applications in education. However, their deployment in real-world educational settings is hindered by privacy regulations, data silos, and limited domain-specific data availability. We introduce M3T Federated Foundation Models (FedFMs) for education: a paradigm that integrates federated learning (FL) with M3T FMs to enable collaborative, privacy-preserving training across decentralized institutions while accommodating diverse modalities and tasks. This position paper aims to unveil M3T FedFMs as a promising yet underexplored approach for the education community, explore their potential, and outline related future research directions. We describe how M3T FedFMs can advance three critical pillars of next-generation intelligent education systems: (i) privacy preservation, by keeping sensitive multi-modal student and institutional data local; (ii) personalization, through modular architectures enabling tailored models for students, instructors, and institutions; and (iii) equity and inclusivity, by facilitating participation from underrepresented and resource-constrained entities. We finally identify open research challenges, including (i) heterogeneous inter-institution privacy regulations, (ii) the non-uniform characteristics of data modalities, (iii) unlearning approaches for M3T FedFMs, (iv) continual learning frameworks for M3T FedFMs, and (v) M3T FedFM model interpretability, which must be collectively addressed for practical deployment.

[AI-3] ImportSnare: Directed “Code Manual” Hijacking in Retrieval-Augmented Code Generation CCS

【Quick Read】: This paper addresses a new class of security risks that arise when generative AI relies on external knowledge sources (such as code manuals) during code generation: supply-chain attacks on Retrieval-Augmented Code Generation (RACG) via malicious dependency hijacking. Although RAG can improve the correctness and security of generated code, poisoned documentation can exploit a dual trust chain, namely the LLM's reliance on RAG and developers' blind trust in LLM suggestions, to inject malicious dependencies into generated code. The key contribution is the ImportSnare attack framework with two synergistic strategies: position-aware beam search, which optimizes hidden ranking sequences to elevate poisoned documents in retrieval results, and multilingual inductive suggestions, which generate jailbreaking sequences that steer LLMs into recommending malicious packages. Experiments across Python, Rust, and JavaScript show success rates above 50% (and effectiveness even at a 0.01% poisoning ratio), exposing serious supply-chain risks in LLM-powered development.

Link: https://arxiv.org/abs/2509.07941
Authors: Kai Ye, Liangcai Su, Chenxiong Qian
Affiliation: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted by the ACM Conference on Computer and Communications Security (CCS) 2025

Abstract:Code generation has emerged as a pivotal capability of Large Language Models (LLMs), revolutionizing development efficiency for programmers of all skill levels. However, the complexity of data structures and algorithmic logic often results in functional deficiencies and security vulnerabilities in generated code, reducing it to a prototype requiring extensive manual debugging. While Retrieval-Augmented Generation (RAG) can enhance correctness and security by leveraging external code manuals, it simultaneously introduces new attack surfaces. In this paper, we pioneer the exploration of attack surfaces in Retrieval-Augmented Code Generation (RACG), focusing on malicious dependency hijacking. We demonstrate how poisoned documentation containing hidden malicious dependencies (e.g., matplotlib_safe) can subvert RACG, exploiting dual trust chains: LLM reliance on RAG and developers' blind trust in LLM suggestions. To construct poisoned documents, we propose ImportSnare, a novel attack framework employing two synergistic strategies: 1) Position-aware beam search optimizes hidden ranking sequences to elevate poisoned documents in retrieval results, and 2) Multilingual inductive suggestions generate jailbreaking sequences to manipulate LLMs into recommending malicious dependencies. Through extensive experiments across Python, Rust, and JavaScript, ImportSnare achieves significant attack success rates (over 50% for popular libraries such as matplotlib and seaborn) in general, and is also able to succeed even when the poisoning ratio is as low as 0.01%, targeting both custom and real-world malicious packages. Our findings reveal critical supply chain risks in LLM-powered development, highlighting inadequate security alignment for code generation tasks. To support future research, we will release the multilingual benchmark suite and datasets. The project homepage is this https URL.

[AI-4] Breaking Android with AI: A Deep Dive into LLM -Powered Exploitation

【Quick Read】: This paper addresses the efficiency and accuracy of automating Android penetration testing, specifically how generative AI and large language models (LLMs) can automate the rooting process. The key elements of the solution are: first, using LLM-based tools such as PentestGPT to automatically generate attack scripts against Android devices and validating them in a Genymotion Android emulator; and second, building a web application that integrates the OpenAI API to turn LLM outputs into executable automation scripts, producing an end-to-end pipeline from vulnerability identification to privilege escalation. The study finds that LLMs can markedly streamline penetration-testing workflows, but their outputs still require human verification for accuracy and ethical compliance, underscoring the central role of human-in-the-loop oversight in AI-enabled cybersecurity.

Link: https://arxiv.org/abs/2509.07933
Authors: Wanni Vidulige Ishan Perera, Xing Liu, Fan Liang, Junyi Zhang
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: None

Abstract:The rapid evolution of Artificial Intelligence (AI) and Large Language Models (LLMs) has opened up new opportunities in cybersecurity, especially in exploitation automation and penetration testing. This study explores Android penetration-testing automation using LLM-based tools, especially PentestGPT, to identify and execute rooting techniques. By comparing the traditional manual rooting process with AI-produced exploitation methods, it evaluates the efficacy, reliability, and scalability of automated penetration testing in achieving high-level privilege access on Android devices. Using an Android emulator (Genymotion) as the testbed, we fully execute both traditional and exploit-based rooting methods, automating the process with AI-generated scripts. We then create a web application that integrates OpenAI's API to facilitate automated script generation from LLM-processed responses. The research focuses on the effectiveness of AI-enabled exploitation by comparing automated and manual penetration-testing protocols, identifying LLM weaknesses and strengths along the way. We also provide security recommendations for AI-enabled exploitation, covering ethical factors and potential misuse. The findings show that while LLMs can significantly streamline exploitation workflows, they need human oversight to ensure accuracy and ethical application. This study adds to the growing body of literature on AI-powered cybersecurity and its effect on ethical hacking, security research, and mobile device security.

[AI-5] HiPhO: How Far Are (M)LLM s from Humans in the Latest High School Physics Olympiad Benchmark?

【Quick Read】: This paper addresses two gaps in current physics benchmarks: they neither provide systematic, up-to-date coverage of real-world physics competitions such as the physics Olympiads, nor enable direct comparison with human performance. The key to the solution is HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation, built on three innovations: (1) comprehensive data covering the 13 latest international and regional Olympiad exams from 2024-2025, with mixed modalities ranging from text-only to diagram-based problems; (2) professional evaluation using official marking schemes for fine-grained grading at both answer and step level, aligned with human examiners; and (3) medal assignment based on official gold/silver/bronze thresholds, enabling direct comparison between (M)LLMs and human contestants. This makes HiPhO a rigorous, comparable, competition-oriented benchmark for multimodal physical reasoning.

Link: https://arxiv.org/abs/2509.07894
Authors: Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 of the latest Olympiad exams from 2024-2025, spanning both international and regional competitions, with mixed modalities ranging from text-only to diagram-based problems. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at this https URL.

[AI-6] CP-Model-Zoo: A Natural Language Query System for Constraint Programming Models

【Quick Read】: This paper addresses the barriers that keep non-experts from adopting Constraint Programming (CP): complex modeling languages, a large number of global constraints, and the craft required to build good models. The key to the solution is a tutoring system, CP-Model-Zoo, that exploits expert-written models accumulated over the years: given a user's natural-language description of a combinatorial problem, it retrieves the closest matching source-code model from a database, ensuring that expert-validated models are presented to the user while eliminating the need for human data labeling and substantially lowering the barrier to high-quality models for non-experts.

Link: https://arxiv.org/abs/2509.07867
Authors: Augustin Crespin, Ioannis Kostis, Hélène Verhaeghe, Pierre Schaus
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Presented at the "LLMs meet Constraint Solving" workshop at CP2025 in Glasgow

Abstract:Constraint Programming and its high-level modeling languages have long been recognized for their potential to achieve the holy grail of problem-solving. However, the complexity of modeling languages, the large number of global constraints, and the art of creating good models have often hindered non-experts from choosing CP to solve their combinatorial problems. While generating an expert-level model from a natural-language description of a problem would be the dream, we are not yet there. We propose a tutoring system called CP-Model-Zoo, exploiting expert-written models accumulated through the years. CP-Model-Zoo retrieves the closest source code model from a database based on a user’s natural language description of a combinatorial problem. It ensures that expert-validated models are presented to the user while eliminating the need for human data labeling. Our experiments show excellent accuracy in retrieving the correct model based on a user-input description of a problem simulated with different levels of expertise.

[AI-7] SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLM s

【Quick Read】: This paper addresses the high cost of fine-tuning code LLMs on large-scale instruction data distilled from proprietary LLMs. The key to the solution is an iterative self-distillation approach that bootstraps small open-source models (e.g., 7B) into powerful, low-cost synthesizers of code instruction data, starting from a few superior synthesis samples from proprietary LLMs. In each iteration, multi-checkpoint sampling and multi-aspect scoring strategies select diverse, high-quality self-distilled data, and a gradient-based influence estimation method performs final filtering to keep the most influential samples. The SCoder family of code generation models, fine-tuned from DeepSeek-Coder on data produced by these small-scale synthesizers, achieves state-of-the-art code generation, demonstrating the effectiveness and economy of the approach.

Link: https://arxiv.org/abs/2509.07858
Authors: Xinyu Zhang, Changzhi Zhou, Linmei Hu, Luhao Zhang, Xiancai Chen, Haomin Fu, Yang Yang, Mengdi Zhang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source LLMs (e.g., 7B) as synthesizers for high-quality code instruction data construction. We first observe that the data synthesis capability of small-scale LLMs can be enhanced by training on a few superior data synthesis samples from proprietary LLMs. Building on this, we propose a novel iterative self-distillation approach to bootstrap small-scale LLMs, transforming them into powerful synthesizers that reduce reliance on proprietary LLMs and minimize costs. Concretely, in each iteration, to obtain diverse and high-quality self-distilled data, we design multi-checkpoint sampling and multi-aspect scoring strategies for initial data selection. Furthermore, to identify the most influential samples, we introduce a gradient-based influence estimation method for final data filtering. Based on the code instruction datasets from the small-scale synthesizers, we develop SCoder, a family of code generation models fine-tuned from DeepSeek-Coder. SCoder models achieve state-of-the-art code generation capabilities, demonstrating the effectiveness of our method.

[AI-8] Aligning LLM s for the Classroom with Knowledge-Based Retrieval – A Comparative RAG Study

【Quick Read】: This paper addresses the risk that large language models (LLMs) used in classrooms mislead students with outdated or fabricated information, and asks how to make LLM question answering both reliable and pedagogically appropriate. The key to the solution is Retrieval Augmented Generation (RAG), with a systematic comparison of two accessible paradigms: vector-based and graph-based retrieval. Using the new EduScopeQA dataset of 3,176 questions across academic subjects, the study finds that OpenAI Vector Search RAG serves as a low-cost generalist suited to quick factual questions, GraphRAG Global gives pedagogically richer answers to thematic queries, and GraphRAG Local is most accurate when corpus integrity is critical (e.g., systematically altered textbooks that contradict the LLM's latent knowledge). Accounting for graph-based RAG's 10-20x higher resource usage, a dynamic branching framework that routes each query to the optimal retrieval method further boosts fidelity and efficiency, offering actionable guidance for deploying RAG-augmented LLMs in education.

Link: https://arxiv.org/abs/2509.07846
Authors: Amay Jain, Liu Cui, Si Chen
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Large language models like ChatGPT are increasingly used in classrooms, but they often provide outdated or fabricated information that can mislead students. Retrieval Augmented Generation (RAG) improves reliability of LLMs by grounding responses in external resources. We investigate two accessible RAG paradigms, vector-based retrieval and graph-based retrieval to identify best practices for classroom question answering (QA). Existing comparative studies fail to account for pedagogical factors such as educational disciplines, question types, and practical deployment costs. Using a novel dataset, EduScopeQA, of 3,176 questions across academic subjects, we measure performance on various educational query types, from specific facts to broad thematic discussions. We also evaluate system alignment with a dataset of systematically altered textbooks that contradict the LLM’s latent knowledge. We find that OpenAI Vector Search RAG (representing vector-based RAG) performs well as a low-cost generalist, especially for quick fact retrieval. On the other hand, GraphRAG Global excels at providing pedagogically rich answers to thematic queries, and GraphRAG Local achieves the highest accuracy with the dense, altered textbooks when corpus integrity is critical. Accounting for the 10-20x higher resource usage of GraphRAG (representing graph-based RAG), we show that a dynamic branching framework that routes queries to the optimal retrieval method boosts fidelity and efficiency. These insights provide actionable guidelines for educators and system designers to integrate RAG-augmented LLMs into learning environments effectively.
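
The dynamic branching idea reduces to a dispatcher placed in front of the retrieval back ends. The toy Python sketch below routes by crude keyword cues; the function names and rules are placeholders for illustration, whereas the paper's router would be informed by query type and corpus-integrity requirements rather than keywords.

```python
def route_query(query: str) -> str:
    """Toy dispatcher: cheap vector RAG for factual lookups, graph-based
    RAG for thematic or integrity-critical questions (illustrative rules)."""
    q = query.lower()
    if any(w in q for w in ("when", "who", "define", "what year")):
        return "vector_rag"       # quick fact retrieval, low cost
    if any(w in q for w in ("theme", "compare", "discuss", "why")):
        return "graphrag_global"  # pedagogically rich thematic answers
    return "graphrag_local"       # dense or altered corpora, integrity-critical
```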

[AI-9] Certainty-Guided Reasoning in Large Language Models : A Dynamic Thinking Budget Approach

【Quick Read】: This paper addresses the difficulty of balancing efficiency and reliability in large reasoning language models (LRLMs): a fixed thinking budget (a preset number of reasoning tokens) wastes compute through over-reasoning on easy cases and hurts accuracy through under-reasoning on hard ones. The key to the solution is Certainty-Guided Reasoning (CGR), inspired by the generator/discriminator framing of GANs: a critic model periodically probes the model's own reasoning to assess confidence, terminating early once a target certainty threshold is met and continuing otherwise. By adapting reasoning length in this way, CGR improves accuracy on AIME2024/AIME2025 while reducing token usage, and extended multi-seed evaluations show lower variance and more stable, exam-like performance, yielding a more efficient, trustworthy, and tunable reasoning strategy.

Link: https://arxiv.org/abs/2509.07820
Authors: João Paulo Nogueira, Wentao Sun, Alonso Silva, Laith Zumot
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:The rise of large reasoning language models (LRLMs) has unlocked new potential for solving complex tasks. These models operate with a thinking budget, that is, a predefined number of reasoning tokens used to arrive at a solution. We propose a novel approach, inspired by the generator/discriminator framework in generative adversarial networks, in which a critic model periodically probes its own reasoning to assess whether it has reached a confident conclusion. If not, reasoning continues until a target certainty threshold is met. This mechanism adaptively balances efficiency and reliability by allowing early termination when confidence is high, while encouraging further reasoning when uncertainty persists. Through experiments on the AIME2024 and AIME2025 datasets, we show that Certainty-Guided Reasoning (CGR) improves baseline accuracy while reducing token usage. Importantly, extended multi-seed evaluations over 64 runs demonstrate that CGR is stable, reducing variance across seeds and improving exam-like performance under penalty-based grading. Additionally, our token savings analysis shows that CGR can eliminate millions of tokens in aggregate, with tunable trade-offs between certainty thresholds and efficiency. Together, these findings highlight certainty as a powerful signal for reasoning sufficiency. By integrating confidence into the reasoning process, CGR makes large reasoning language models more adaptive, trustworthy, and resource efficient, paving the way for practical deployment in domains where both accuracy and computational cost matter.
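
The core loop is easy to express in pseudocode-style Python. In this sketch, `generate_step` wraps the reasoning model and `estimate_certainty` wraps the critic; both names, the threshold, and the chunk size are assumptions for illustration, not the paper's interface.

```python
def certainty_guided_generate(generate_step, estimate_certainty,
                              threshold=0.9, max_budget=4096, chunk=256):
    """Extend the reasoning trace in chunks until the critic's certainty
    estimate clears `threshold`, or the token budget runs out."""
    trace = ""
    used = 0
    while used < max_budget:
        trace += generate_step(trace, max_new_tokens=chunk)
        used += chunk
        if estimate_certainty(trace) >= threshold:
            break  # early termination when confidence is high
    return trace
```

The trade-off is tuned by the certainty threshold: higher thresholds buy reliability at the cost of more tokens, which matches the token-savings analysis the abstract describes.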

[AI-10] Forecasting Russian Equipment Losses Using Time Series and Deep Learning Models

【Quick Read】: This paper addresses how to model and forecast trends in Russian equipment losses during the war, in particular how open-source intelligence (OSINT) data can be used to quantify attrition dynamics. The key to the solution is applying a range of time-series forecasting models (ARIMA, Prophet, LSTM, TCN, and XGBoost) and comparing model architectures and input structures. The study finds that deep learning models, especially TCN and LSTM, produce stable and consistent forecasts at high temporal granularity, and it highlights the importance of ensemble forecasting in conflict modeling and the value of public OSINT data for assessing long-term material attrition.

Link: https://arxiv.org/abs/2509.07813
Authors: Jonathan Teagan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: None

Abstract:This study applies a range of forecasting techniques, including ARIMA, Prophet, Long Short-Term Memory networks (LSTM), Temporal Convolutional Networks (TCN), and XGBoost, to model and predict Russian equipment losses during the ongoing war in Ukraine. Drawing on daily and monthly open-source intelligence (OSINT) data from WarSpotting, we aim to assess trends in attrition, evaluate model performance, and estimate future loss patterns through the end of 2025. Our findings show that deep learning models, particularly TCN and LSTM, produce stable and consistent forecasts, especially under conditions of high temporal granularity. By comparing different model architectures and input structures, this study highlights the importance of ensemble forecasting in conflict modeling and the value of publicly available OSINT data in quantifying material degradation over time.
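
As one concrete baseline from the model suite, a classical ARIMA forecast can be produced with statsmodels in a few lines. The placeholder series and the (p, d, q) order below are illustrative choices, not the study's fitted configuration.

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Placeholder standing in for a daily OSINT-derived loss series
losses = pd.Series(range(100), dtype="float64")

model = ARIMA(losses, order=(2, 1, 1))  # order chosen for illustration
fitted = model.fit()
forecast = fitted.forecast(steps=30)    # 30-day-ahead point forecast
print(forecast.tail())
```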

[AI-11] What Were You Thinking? An LLM -Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects

【Quick Read】: This paper asks how developers' motivations for code refactoring can be understood and quantified more effectively, so that refactoring can be applied more widely and systematically to improve software quality. The key to the solution is using large language models (LLMs) to identify the motivations behind refactoring activity from version control data and comparing them with motivation taxonomies from the literature. LLMs matched human judgment in 80% of cases but aligned with literature-based motivations in only 47%, while enriching 22% of motivations with more detailed, localized rationale (often readability, clarity, and structural improvements). Since LLMs capture surface-level motivations well but struggle with architectural reasoning, the paper proposes a hybrid approach that combines LLM-generated local explanations with software metrics to prioritize refactoring more systematically, balancing short-term maintainability gains with long-term architectural goals.

Link: https://arxiv.org/abs/2509.07763
Authors: Mikel Robredo, Matteo Esposito, Fabio Palomba, Rafael Peñaloza, Valentina Lenarduzzi
Affiliation: Unknown
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: None

Abstract:Context. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Aim. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. Method. We performed a large-scale empirical study to analyze developers refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. Results. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. Conclusions. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.

[AI-12] he Carbon Footprint Wizard: A Knowledge-Augmented AI Interface for Streamlining Food Carbon Footprint Analysis

【Quick Read】: This paper addresses the practical complexity of traditional life cycle assessment (LCA), caused by opaque, global supply chains and fragmented data, which makes it hard to estimate cradle-to-gate carbon footprints of food products efficiently and accurately. The key to the solution is combining advances in LCA and publicly available databases with knowledge-augmented AI techniques, including retrieval-augmented generation, delivered through an interactive chatbot interface that lets users explore the carbon impact of composite meals and relate the results to familiar activities, while also surfacing limitations such as database uncertainties and AI misinterpretations.

Link: https://arxiv.org/abs/2509.07733
Authors: Mustafa Kaan Aslan, Reinout Heijungs, Filip Ilievski
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:Environmental sustainability, particularly in relation to climate change, is a key concern for consumers, producers, and policymakers. The carbon footprint, based on greenhouse gas emissions, is a standard metric for quantifying the contribution to climate change of activities and is often assessed using life cycle assessment (LCA). However, conducting LCA is complex due to opaque and global supply chains, as well as fragmented data. This paper presents a methodology that combines advances in LCA and publicly available databases with knowledge-augmented AI techniques, including retrieval-augmented generation, to estimate cradle-to-gate carbon footprints of food products. We introduce a chatbot interface that allows users to interactively explore the carbon impact of composite meals and relate the results to familiar activities. A live web demonstration showcases our proof-of-concept system with arbitrary food items and follow-up questions, highlighting both the potential and limitations - such as database uncertainties and AI misinterpretations - of delivering LCA insights in an accessible format.

[AI-13] BDPM: A Machine Learning-Based Feature Extractor for Parkinsons Disease Classification via Gut Microbiota Analysis

【Quick Read】: This paper addresses the high misdiagnosis rate of Parkinson's disease (PD) under clinical rating scales, and the limitations of existing approaches that use gut microbiota as an early-prediction biomarker, which mostly rely on single classifiers and overlook inter-strain correlations and temporal dynamics. The key to the solution is BDPM (A Machine Learning-Based Feature Extractor for Parkinson's Disease Classification via Gut Microbiota Analysis), with two core innovations: (1) RFRE, a feature selection framework combining Random Forest with Recursive Feature Elimination and integrating ecological knowledge to improve biological interpretability; and (2) a hybrid classification model designed to capture temporal and spatial patterns in microbiome data, improving discriminative power and robustness for PD detection.

Link: https://arxiv.org/abs/2509.07723
Authors: Bo Yu, Zhixiu Hua, Bo Zhao
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Comments: 11 pages, 7 figures

Abstract:Background: Parkinson's disease remains a major neurodegenerative disorder with high misdiagnosis rates, primarily due to reliance on clinical rating scales. Recent studies have demonstrated a strong association between gut microbiota and Parkinson's disease, suggesting that microbial composition may serve as a promising biomarker. Although deep learning models based on gut microbiota show potential for early prediction, most approaches rely on single classifiers and often overlook inter-strain correlations or temporal dynamics. Therefore, there is an urgent need for more robust feature extraction methods tailored to microbiome data. Methods: We proposed BDPM (A Machine Learning-Based Feature Extractor for Parkinson's Disease Classification via Gut Microbiota Analysis). First, we collected gut microbiota profiles from 39 Parkinson's patients and their healthy spouses to identify differentially abundant taxa. Second, we developed an innovative feature selection framework named RFRE (Random Forest combined with Recursive Feature Elimination), integrating ecological knowledge to enhance biological interpretability. Finally, we designed a hybrid classification model to capture temporal and spatial patterns in microbiome data.
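
Reading RFRE as Random Forest plus Recursive Feature Elimination, a minimal scikit-learn rendering looks like the sketch below. How the authors inject ecological knowledge into the selection is not described in the abstract, so that part is omitted; the feature count and forest size are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def rfre_select(X, y, n_features=20):
    """Recursively eliminate microbiome features ranked by a random
    forest's importances, keeping the top `n_features` taxa."""
    rf = RandomForestClassifier(n_estimators=300, random_state=0)
    selector = RFE(rf, n_features_to_select=n_features, step=1)
    selector.fit(X, y)
    return selector.support_  # boolean mask over microbial taxa/features
```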

[AI-14] RIMO: An Easy-to-Evaluate Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning

【Quick Read】: This paper addresses grading noise and potential bias in existing International Mathematical Olympiad (IMO)-level benchmarks, which stem from heterogeneous answer formats requiring model-based judges and from reliance on potentially flawed solutions. The key to the solution is RIMO, a two-track benchmark: the first track, RIMO-N, rewrites 335 IMO problems so each admits a single unique integer answer, enabling deterministic correctness checking; the second track, RIMO-P, contains 456 proof problems with expert-checked solutions decomposed into sequences of sub-problems, so an automated grading system can evaluate step-by-step reasoning. The design preserves Olympiad-level difficulty while sharply reducing evaluation noise, providing a high-resolution, reproducible yardstick for evaluating LLMs' advanced mathematical reasoning.

Link: https://arxiv.org/abs/2509.07711
Authors: Ziye Chen, Chengwei Qin, Yao Shu
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:As large language models (LLMs) reach high scores on established mathematical benchmarks, such as GSM8K and MATH, the research community has turned to International Mathematical Olympiad (IMO) problems to push the evaluation frontier. However, existing Olympiad-level benchmarks suffer from practical constraints that introduce grading noise and potential bias, such as heterogeneous answer formats requiring model-based judges and a reliance on potentially flawed solutions. We introduce RIMO, a two-track benchmark designed to preserve peak Olympiad difficulty while eliminating this evaluation noise. The first track, RIMO-N, rewrites 335 IMO problems to admit a single, unique integer answer, allowing for deterministic correctness checking. The second track, RIMO-P, features 456 proof problems with expert-checked solutions, which are decomposed into a sequence of sub-problems to evaluate the step-by-step reasoning process via an automated grading system. Our benchmarking of ten frontier LLMs, including GPT-4o and Gemini 2.5 Flash, reveals that while these systems excel on older benchmarks, their performance drops sharply on RIMO. These results highlight a substantial gap between current LLM capabilities and actual Olympiad-level reasoning. By providing a challenging yet easy-to-evaluate suite, RIMO offers a high-resolution yardstick for future research, presenting a clear target for closing the profound reasoning gap our findings expose.
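
The appeal of RIMO-N is that grading collapses to exact integer comparison. A tiny grader in this spirit is sketched below; the regex extraction heuristic is an assumption, since the benchmark's actual answer parser is not described in the abstract.

```python
import re

def grade_integer_answer(model_output: str, gold: int) -> bool:
    """Deterministic check: extract the last integer in the output
    (tolerating thousands separators) and compare it exactly to `gold`."""
    matches = re.findall(r"-?\d+", model_output.replace(",", ""))
    return bool(matches) and int(matches[-1]) == gold

assert grade_integer_answer("... so the answer is 1,024.", 1024)
```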

[AI-15] FHIR-RAG -MEDS: Integrating HL7 FHIR with Retrieval-Augmented Large Language Models for Enhanced Medical Decision Support

【Quick Read】: This paper addresses the limited ability of current medical decision support systems to deliver personalized, evidence-based guidance grounded in clinical guidelines. The key to the solution is the FHIR-RAG-MEDS system, which integrates Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) with a Retrieval-Augmented Generation (RAG)-based approach, combining structured, interoperable health data with guideline retrieval and semantic understanding to improve the accuracy and personalization of clinical decisions.

Link: https://arxiv.org/abs/2509.07706
Authors: Yildiray Kabak, Gokce B. Laleci Erturkmen, Mert Gencturk, Tuncay Namli, A. Anil Sinaci, Ruben Alcantud Corcoles, Cristina Gomez Ballesteros, Pedro Abizanda, Asuman Dogac
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 31 pages, submitted to Journal of Biomedical Informatics, under review

Abstract:In this study, we propose the FHIR-RAG-MEDS system, which integrates Health Level 7 Fast Healthcare Interoperability Resources (HL7 FHIR) with a Retrieval-Augmented Generation (RAG)-based system to improve personalized medical decision support grounded in evidence-based clinical guidelines. In the evolving landscape of medical decision support systems, integrating advanced technologies such as RAG and HL7 FHIR can significantly enhance clinical decision-making processes. Despite the potential of these technologies, there is limited research on their integration in practical applications, which motivates this work.

[AI-16] Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

【Quick Read】: This paper addresses the serious vulnerability of voice authentication systems (VAS) to AI-generated audio such as deepfakes and adversarial attacks; existing anti-spoofing countermeasures (CMs) often rely on static detection models that novel adaptive attacks can bypass. The key contribution is the Spectral Masking and Interpolation Attack (SMIA), which strategically manipulates spectral regions of AI-generated audio that are inaudible to humans, producing adversarial samples that sound authentic while deceiving detection models. Evaluated against state-of-the-art models under simulated real-world conditions, SMIA achieves an attack success rate of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification, and 100% against countermeasures, exposing the inadequacy of current defenses and motivating dynamic, context-aware defense frameworks that evolve with the threat landscape.

Link: https://arxiv.org/abs/2509.07677
Authors: Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal
Affiliation: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: None

Abstract:Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.

[AI-17] Unleashing the True Potential of LLM s: A Feedback-Triggered Self-Correction with Long-Term Multipath Decoding

【Quick Read】: This paper addresses the tendency of large language models (LLMs) to generate incorrect content during inference; existing self-correction methods are limited by the absence of reliable guidance signals for error localization and by the restricted reasoning depth of conventional next-token decoding. The key to the solution is Feedback-Triggered Regeneration (FTR), which regenerates a response only upon negative user feedback, avoiding error propagation from faulty self-assessment while preserving originally correct outputs, combined with Long-Term Multipath (LTM) decoding, which systematically explores multiple reasoning trajectories through delayed sequence evaluation and thereby overcomes the myopic decision-making of standard next-token prediction.

Link: https://arxiv.org/abs/2509.07676
Authors: Jipeng Li, Zeyu Gao, Yubin Qi, Hande Dong, Weijian Chen, Qiang Lin
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:Large Language Models (LLMs) have achieved remarkable performance across diverse tasks, yet their susceptibility to generating incorrect content during inference remains a critical unsolved challenge. While self-correction methods offer potential solutions, their effectiveness is hindered by two inherent limitations: (1) the absence of reliable guidance signals for error localization, and (2) the restricted reasoning depth imposed by conventional next-token decoding paradigms. To address these issues, we propose Feedback-Triggered Regeneration (FTR), a novel framework that synergizes user feedback with enhanced decoding dynamics. Specifically, FTR activates response regeneration only upon receiving negative user feedback, thereby circumventing error propagation from faulty self-assessment while preserving originally correct outputs. Furthermore, we introduce Long-Term Multipath (LTM) decoding, which enables systematic exploration of multiple reasoning trajectories through delayed sequence evaluation, effectively overcoming the myopic decision-making characteristic of standard next-token prediction. Extensive experiments on mathematical reasoning and code generation benchmarks demonstrate that our framework achieves consistent and significant improvements over state-of-the-art prompt-based self-correction methods.
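
The interaction pattern can be captured in a few lines. In this sketch, `generate`, `score`, and `get_user_feedback` are assumed hooks around the model, a whole-sequence evaluator (standing in for LTM's delayed sequence evaluation), and the user; the retry and path counts are arbitrary illustrative values.

```python
def feedback_triggered_regeneration(query, generate, score, get_user_feedback,
                                    max_retries=3, n_paths=4):
    """Regenerate only on negative user feedback, so correct outputs are
    never overwritten by faulty self-assessment. On a retry, sample several
    full trajectories and pick one by delayed whole-sequence evaluation."""
    response = generate(query)
    for _ in range(max_retries):
        if get_user_feedback(response) != "negative":
            break
        candidates = [generate(query) for _ in range(n_paths)]
        response = max(candidates, key=score)
    return response
```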

[AI-18] DeepGraphLog for Layered Neurosymbolic AI

【Quick Read】: This paper addresses a limitation of existing neurosymbolic AI (NeSy) frameworks in modeling complex dependencies: frameworks such as DeepProbLog enforce a fixed data flow in which symbolic reasoning always follows neural processing, restricting their handling of irregular data structures such as graphs. The key to the solution is DeepGraphLog, a NeSy framework extending ProbLog with Graph Neural Predicates: symbolic representations are explicitly treated as graphs and processed by graph neural networks (GNNs), enabling multi-layer reasoning in which neural and symbolic components can be composed in arbitrary order and broadening NeSy's applicability to graph-structured domains.

Link: https://arxiv.org/abs/2509.07665
Authors: Adem Kikaj, Giuseppe Marra, Floris Geerts, Robin Manhaeve, Luc De Raedt
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:Neurosymbolic AI (NeSy) aims to integrate the statistical strengths of neural networks with the interpretability and structure of symbolic reasoning. However, current NeSy frameworks like DeepProbLog enforce a fixed flow where symbolic reasoning always follows neural processing. This restricts their ability to model complex dependencies, especially in irregular data structures such as graphs. In this work, we introduce DeepGraphLog, a novel NeSy framework that extends ProbLog with Graph Neural Predicates. DeepGraphLog enables multi-layer neural-symbolic reasoning, allowing neural and symbolic components to be layered in arbitrary order. In contrast to DeepProbLog, which cannot handle symbolic reasoning via neural methods, DeepGraphLog treats symbolic representations as graphs, enabling them to be processed by Graph Neural Networks (GNNs). We showcase the capabilities of DeepGraphLog on tasks in planning, knowledge graph completion with distant supervision, and GNN expressivity. Our results demonstrate that DeepGraphLog effectively captures complex relational dependencies, overcoming key limitations of existing NeSy systems. By broadening the applicability of neurosymbolic AI to graph-structured domains, DeepGraphLog offers a more expressive and flexible framework for neural-symbolic integration.

[AI-19] Getting In Contract with Large Language Models – An Agency Theory Perspective On Large Language Model Alignment

【Quick Read】: This paper addresses AI alignment problems that arise when organizations adopt large language models (LLMs), where information asymmetries and the models' black-box nature can lead to off-topic, discriminating, or harmful outputs. The key to the solution is LLM ATLAS (LLM Agency Theory-Led Alignment Strategy), a conceptual framework grounded in agency (contract) theory that maps the phases of organizational LLM adoption to agency-theoretic concepts, systematically identifying alignment problems and building a corresponding problem-solution space to mitigate alignment failures caused by information asymmetry.

Link: https://arxiv.org/abs/2509.07642
Authors: Sascha Kaltenpoth, Oliver Müller
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Presented at the 19th International Conference on Wirtschaftsinformatik 2024, Würzburg, Germany (this https URL)

Abstract:Adopting large language models (LLMs) in organizations could revolutionize our lives and work. However, they can generate off-topic, discriminating, or harmful content. This AI alignment problem often stems from misspecifications during LLM adoption that go unnoticed by the principal due to the LLM's black-box nature. While various research disciplines have investigated AI alignment, they neither address the information asymmetries between organizational adopters and black-box LLM agents nor consider organizational AI adoption processes. Therefore, we propose LLM ATLAS (LLM Agency Theory-Led Alignment Strategy), a conceptual framework grounded in agency (contract) theory, to mitigate alignment problems during organizational LLM adoption. We conduct a conceptual literature analysis using the organizational LLM adoption phases and agency theory as concepts. Our approach results in (1) an extended literature analysis process specific to AI alignment methods during organizational LLM adoption and (2) a first LLM alignment problem-solution space.

[AI-20] ransferable Direct Prompt Injection via Activation-Guided MCMC Sampling EMNLP2025

【Quick Read】: This paper addresses Direct Prompt Injection (DPI) attacks on large language models (LLMs), which pose a serious threat because they are easy to mount and highly damaging; existing white-box/gray-box methods are impractical, and black-box methods transfer poorly across models. The key to the solution is an activation-guided prompt injection framework: an Energy-based Model (EBM) is built from a surrogate model's activations to score the quality of adversarial prompts, and token-level Markov Chain Monte Carlo (MCMC) sampling then adaptively optimizes the prompts without gradients, enabling efficient black-box attacks. The method reaches a 49.6% attack success rate (ASR) across five mainstream LLMs, a 34.6% improvement over human-crafted prompts, and retains 36.6% ASR on unseen task scenarios, demonstrating strong transferability and generalization.

Link: https://arxiv.org/abs/2509.07617
Authors: Minghui Li, Hao Zhang, Yechao Zhang, Wei Wan, Shengshan Hu, Pei Xiaobing, Jing Wang
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025

Abstract:Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier of execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activations-guided prompt injection attack framework. We first construct an Energy-based Model (EBM) using activations from a surrogate model to evaluate the quality of adversarial prompts. Guided by the trained EBM, we employ the token-level Markov Chain Monte Carlo (MCMC) sampling to adaptively optimize adversarial prompts, thereby enabling gradient-free black-box attacks. Experimental results demonstrate our superior cross-model transferability, achieving 49.6% attack success rate (ASR) across five mainstream LLMs and 34.6% improvement over human-crafted prompts, and maintaining 36.6% ASR on unseen task scenarios. Interpretability analysis reveals a correlation between activations and attack effectiveness, highlighting the critical role of semantic patterns in transferable vulnerability exploitation.

[AI-21] Beyond Rebalancing: Benchmarking Binary Classifiers Under Class Imbalance Without Rebalancing Techniques

【Quick Read】: This paper studies how class imbalance affects supervised classifiers, which matters most in domains such as medical diagnostics and anomaly detection where minority-class samples are rare and traditional classifiers degrade sharply. Rather than applying any explicit rebalancing technique, the key idea is to systematically evaluate a diverse set of binary classifiers "as-is" on real-world and synthetic datasets while progressively shrinking the minority class down to few-shot and one-shot baselines, with synthetic decision-boundary generation used to vary data complexity. The central finding is that classification gets harder as complexity rises and the minority class shrinks: traditional classifiers deteriorate under extreme imbalance, while gradient-boosting-based ensembles and advanced models such as TabPFN retain relatively higher performance and better generalization, providing empirical guidance for model selection without rebalancing.

Link: https://arxiv.org/abs/2509.07605
Authors: Ali Nawaz, Amir Ahmad, Shehroz S. Khan
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: None

Abstract:Class imbalance poses a significant challenge to supervised classification, particularly in critical domains like medical diagnostics and anomaly detection where minority class instances are rare. While numerous studies have explored rebalancing techniques to address this issue, less attention has been given to evaluating the performance of binary classifiers under imbalance when no such techniques are applied. Therefore, the goal of this study is to assess the performance of binary classifiers “as-is”, without performing any explicit rebalancing. Specifically, we systematically evaluate the robustness of a diverse set of binary classifiers across both real-world and synthetic datasets, under progressively reduced minority class sizes, using one-shot and few-shot scenarios as baselines. Our approach also explores varying data complexities through synthetic decision boundary generation to simulate real-world conditions. In addition to standard classifiers, we include experiments using undersampling, oversampling strategies, and one-class classification (OCC) methods to examine their behavior under severe imbalance. The results confirm that classification becomes more difficult as data complexity increases and the minority class size decreases. While traditional classifiers deteriorate under extreme imbalance, advanced models like TabPFN and boosting-based ensembles retain relatively higher performance and better generalization compared to traditional classifiers. Visual interpretability and evaluation metrics further validate these findings. Our work offers valuable guidance on model selection for imbalanced learning, providing insights into classifier robustness without dependence on explicit rebalancing techniques.
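
The evaluation protocol, training classifiers "as-is" while shrinking the minority class, is straightforward to reproduce on synthetic data with scikit-learn, as in the sketch below. The classifiers, sample sizes, and minority fractions are illustrative choices, not the study's experimental grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Progressively rarer minority class, no resampling applied anywhere
for minority_frac in (0.2, 0.05, 0.01):
    X, y = make_classification(n_samples=5000, n_features=20,
                               weights=[1 - minority_frac, minority_frac],
                               random_state=0)
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)
    for clf in (LogisticRegression(max_iter=1000),
                GradientBoostingClassifier()):
        score = balanced_accuracy_score(yte, clf.fit(Xtr, ytr).predict(Xte))
        print(f"minority={minority_frac:.2f} {type(clf).__name__}: {score:.3f}")
```

Balanced accuracy is used here because plain accuracy is nearly meaningless at a 1% minority rate.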

[AI-22] ransformer-Based Approach to Optimal Sensor Placement for Structural Health Monitoring of Probe Cards

【Quick Read】: This paper addresses yield and reliability losses in semiconductor manufacturing caused by probe card failures such as substrate cracks and loosened screws, with the goal of optimizing sensor placement for structural health monitoring (SHM). The key to the solution is a Transformer-based deep learning strategy built on frequency response functions (FRFs) from simulated failure scenarios in a finite element model of the probe card: a dataset enriched by physics-informed scenario expansion and physics-aware statistical augmentation is used to train a hybrid CNN-Transformer that classifies probe card health states (baseline, loose screw, crack) with 99.83% accuracy and 99.73% crack-detection recall, validated by repeated stratified cross-validation. The attention weights additionally pinpoint critical sensor locations, offering interpretable, actionable guidance for designing low-cost, efficient monitoring configurations.

Link: https://arxiv.org/abs/2509.07603
Authors: Mehdi Bejani, Marco Mauri, Daniele Acconcia, Simone Todaro, Stefano Mariani
Affiliation: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages, 11 figures

Abstract:This paper presents an innovative Transformer-based deep learning strategy for optimizing the placement of sensors aiming at structural health monitoring of semiconductor probe cards. Failures in probe cards, including substrate cracks and loosened screws, would critically affect semiconductor manufacturing yield and reliability. Some failure modes could be detected by equipping a probe card with adequate sensors. Frequency response functions from simulated failure scenarios are adopted within a finite element model of a probe card. A comprehensive dataset, enriched by physics-informed scenario expansion and physics-aware statistical data augmentation, is exploited to train a hybrid Convolutional Neural Network and Transformer model. The model achieves high accuracy (99.83%) in classifying the probe card health states (baseline, loose screw, crack) and an excellent crack detection recall (99.73%). Model robustness is confirmed through a rigorous framework of 3 repetitions of 10-fold stratified cross-validation. The attention mechanism also pinpoints critical sensor locations: an analysis of the attention weights offers actionable insights for designing efficient, cost-effective monitoring systems by optimizing sensor configurations. This research highlights the capability of attention-based deep learning to advance proactive maintenance, enhancing operational reliability and yield in semiconductor manufacturing.

[AI-23] owards explainable decision support using hybrid neural models for logistic terminal automation

【Quick Read】: This paper addresses the loss of explainability and causal reliability that occurs when deep learning (DL) is applied to system dynamics (SD) modeling, a serious concern for critical decision-making systems. The key to the solution is a framework for interpretable-by-design neural system dynamics modeling that combines concept-based interpretability, mechanistic interpretability, and causal machine learning, so that neural models operate on semantically meaningful, actionable variables while retaining the causal grounding and transparency of traditional SD models. The approach bridges black-box predictive models and the decision-support needs of complex dynamical environments in cyber-physical systems enabled by the industrial Internet-of-Things, with planned application to multimodal logistics terminals in the EU-funded AutoMoTIF project.

Link: https://arxiv.org/abs/2509.07577
Authors: Riccardo D'Elia, Alberto Termine, Francesco Flammini
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: None

Abstract:The integration of Deep Learning (DL) in System Dynamics (SD) modeling for transportation logistics offers significant advantages in scalability and predictive accuracy. However, these gains are often offset by the loss of explainability and causal reliability, key requirements in critical decision-making systems. This paper presents a novel framework for interpretable-by-design neural system dynamics modeling that synergizes DL with techniques from Concept-Based Interpretability, Mechanistic Interpretability, and Causal Machine Learning. The proposed hybrid approach enables the construction of neural network models that operate on semantically meaningful and actionable variables, while retaining the causal grounding and transparency typical of traditional SD models. The framework is conceived to be applied to real-world case studies from the EU-funded project AutoMoTIF, focusing on data-driven decision support, automation, and optimization of multimodal logistic terminals. We aim to show how neuro-symbolic methods can bridge the gap between black-box predictive models and the need for critical decision support in complex dynamical environments within cyber-physical systems enabled by the industrial Internet-of-Things.

[AI-24] owards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference

【Quick Read】: This paper addresses the difficulty of routing diverse, multi-domain user queries through a complex, heterogeneous ecosystem of AI services, i.e., how to direct each query to the best execution unit (an LLM or a specialized agent) while optimizing both performance and cost. The key to the solution is MoMA (Mixture of Models and Agents), a generalized routing framework with two core components: a detailed training dataset that profiles the capabilities of various LLMs under different routing structures to identify each model's best-suited tasks, and an inference-time dynamic routing strategy that sends each query to the most cost-effective LLM for its intent, complemented by an efficient agent selection mechanism based on a context-aware state machine with dynamic masking.

Link: https://arxiv.org/abs/2509.07571
Authors: Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng
Affiliation: Unknown
Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: None

Abstract:The rapid advancement of large language models (LLMs) and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries, however, are highly diverse and often span multiple domains and task types, resulting in a complex and heterogeneous landscape. This diversity presents a fundamental routing challenge: how to accurately direct each query to an appropriate execution unit while optimizing both performance and efficiency. To address this, we propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates both LLM and agent-based routing. Built upon a deep understanding of model and agent capabilities, MoMA effectively handles diverse queries through precise intent recognition and adaptive routing strategies, achieving an optimal balance between efficiency and cost. Specifically, we construct a detailed training dataset to profile the capabilities of various LLMs under different routing model structures, identifying the most suitable tasks for each LLM. During inference, queries are dynamically routed to the LLM with the best cost-performance efficiency. We also introduce an efficient agent selection strategy based on a context-aware state machine and dynamic masking. Experimental results demonstrate that the MoMA router offers superior cost-efficiency and scalability compared to existing approaches.
zh

[AI-25] ΔL Normalization: Rethink Loss Aggregation in RLVR

【速读】:该论文旨在解决强化学习中可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)训练过程中因生成长度动态变化导致的梯度方差高、优化不稳定的问题。现有方法如GRPO、DAPO和Dr. GRPO虽尝试通过不同的损失归一化策略缓解此问题,但仍存在估计偏差或梯度方差未充分降低的局限性。解决方案的关键在于提出一种名为ΔL Normalization的新颖损失聚合方法,其核心思想是将问题建模为寻找最小方差无偏估计器(minimum-variance unbiased estimator),从而在理论上同时保证对真实策略损失的无偏估计与梯度方差最小化,实验表明该方法在不同模型规模、最大长度和任务上均能稳定提升性能。

链接: https://arxiv.org/abs/2509.07558
作者: Zhiyuan He,Xufang Luo,Yike Zhang,Yuqing Yang,Lili Qiu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:We propose ΔL Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed ΔL Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at this https URL.
zh
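
下面给出一个示意性的 Python 草图,演示摘要中"最小方差无偏估计器"的经典构造思路:对不同长度响应的平均损失做逆方差加权组合。其中的方差估计方式与接口均为示例假设,并非论文中 ΔL Normalization 的精确公式:

```python
import torch

def inverse_variance_aggregate(token_losses_list):
    """对不同长度响应的平均损失做逆方差加权组合(示意)。

    经典结论:若各样本估计量无偏且相互独立,则取逆方差权重并归一化时,
    组合估计量仍无偏且方差最小。这里用逐 token 损失的样本方差 / 长度
    来估计每条响应平均损失的方差,仅为对摘要思想的演示。
    """
    means, inv_vars = [], []
    for tl in token_losses_list:
        means.append(tl.mean())
        var_of_mean = tl.var(unbiased=True) / tl.numel() + 1e-8
        inv_vars.append(1.0 / var_of_mean)
    means = torch.stack(means)
    w = torch.stack(inv_vars)
    w = w / w.sum()                      # 归一化保证无偏
    return (w * means).sum()

# 三条长度差异很大的响应的逐 token 损失
losses = [torch.rand(12), torch.rand(240), torch.rand(60)]
print(inverse_variance_aggregate(losses))
```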

[AI-26] FLeW: Facet-Level and Adaptive Weighted Representation Learning of Scientific Documents DASFAA2025

【速读】:该论文旨在解决科学文献表示学习中三大方法的局限性:(1)对比训练虽利用引文结构信号但引文信息利用不足且生成单一向量表示;(2)细粒度表示学习虽能生成句子或方面级多向量,但整合成本高且缺乏领域泛化能力;(3)任务感知学习依赖人工预定义任务分类,忽视任务间的细微差异并需额外任务特定训练数据。其解决方案的关键在于提出FLeW方法,通过引入一种基于引文意图(background, method, result)和频率的新颖三元组采样策略增强引文结构信号,并以此构建领域通用的细粒度语义切片;进一步采用简单权重搜索机制自适应融合各语义切片嵌入,生成任务特异性文档嵌入,无需任务感知微调即可实现跨任务、跨领域的鲁棒表现。

链接: https://arxiv.org/abs/2509.07531
作者: Zheng Dou,Deqing Wang,Fuzhen Zhuang,Jian Ren,Yanlin Hu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by DASFAA2025

点击查看摘要

Abstract:Scientific document representation learning provides powerful embeddings for various tasks, while current methods face challenges across three approaches. 1) Contrastive training with citation-structural signals underutilizes citation information and still generates single-vector representations. 2) Fine-grained representation learning, which generates multiple vectors at the sentence or aspect level, requires costly integration and lacks domain generalization. 3) Task-aware learning depends on manually predefined task categorization, overlooking nuanced task distinctions and requiring extra training data for task-specific modules. To address these problems, we propose a new method that unifies the three approaches for better representations, namely FLeW. Specifically, we introduce a novel triplet sampling method that leverages citation intent and frequency to enhance citation-structural signals for training. Citation intents (background, method, result), aligned with the general structure of scientific writing, facilitate a domain-generalized facet partition for fine-grained representation learning. Then, we adopt a simple weight search to adaptively integrate three facet-level embeddings into a task-specific document embedding without task-aware fine-tuning. Experiments show the applicability and robustness of FLeW across multiple scientific tasks and fields, compared to prior models.
zh
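
针对摘要中"通过简单权重搜索自适应融合三个 facet 向量"的步骤,下面是一个最小示意(eval_fn 为假设的下游评估接口,数据为随机生成,仅演示在单纯形上的搜索逻辑):

```python
import itertools
import numpy as np

def search_facet_weights(facet_embs, eval_fn, step=0.1):
    """在单纯形上网格搜索三个 facet 向量的组合权重(示意)。

    facet_embs: [N, 3, D],每篇文档的 background/method/result 三个向量。
    eval_fn: 接收 [N, D] 文档向量并返回下游任务分数的可调用对象(假设接口)。
    """
    best_w, best_score = None, -np.inf
    grid = np.arange(0.0, 1.0 + 1e-9, step)
    for w1, w2 in itertools.product(grid, grid):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:                      # 权重需落在单纯形内
            continue
        w = np.array([w1, w2, max(w3, 0.0)])
        doc = (facet_embs * w[None, :, None]).sum(axis=1)
        score = eval_fn(doc)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

# 用随机数据演示接口
rng = np.random.default_rng(0)
embs = rng.normal(size=(100, 3, 64))
labels = rng.integers(0, 2, size=100)
def toy_eval(doc):                          # 假设的评估函数:与标签的简单相关性
    return float(np.corrcoef(doc.mean(axis=1), labels)[0, 1])
print(search_facet_weights(embs, toy_eval))
```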

[AI-27] Water Demand Forecasting of District Metered Areas through Learned Consumer Representations

【速读】:该论文旨在解决短时水需求预测(short-term water demand forecasting)在复杂多变的用水行为和非确定性气象因素影响下的准确性问题。其关键解决方案在于引入一种结合无监督对比学习(unsupervised contrastive learning)与小波变换卷积网络(wavelet-transformed convolutional networks)的混合模型:首先通过对比学习对区域计量区(District Metered Areas, DMAs)内的终端用户按消费行为进行聚类,提取出具有代表性的消费模式作为特征;随后将这些行为特征与历史用水数据一同输入带有交叉注意力机制(cross-attention mechanism)的深度神经网络中进行融合建模,从而提升预测精度。实证结果表明,该方法在多个DMA上均实现了平均绝对百分比误差(MAPE)的显著降低,最高达4.9%,并识别出受社会经济因素驱动的用户群体,增强了对需求驱动机制的理解。

链接: https://arxiv.org/abs/2509.07515
作者: Adithya Ramachandran,Thorkil Flensmark B. Neergaard,Tomás Arias-Vergara,Andreas Maier,Siming Bayer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注: Presented at European Conference for Signal Processing - EUSIPCO 2025

点击查看摘要

Abstract:Advancements in smart metering technologies have significantly improved the ability to monitor and manage water utilities. In the context of increasing uncertainty due to climate change, securing water resources and supply has emerged as an urgent global issue with extensive socioeconomic ramifications. Hourly consumption data from end-users have yielded substantial insights for projecting demand across regions characterized by diverse consumption patterns. Nevertheless, the prediction of water demand remains challenging due to influencing non-deterministic factors, such as meteorological conditions. This work introduces a novel method for short-term water demand forecasting for District Metered Areas (DMAs) which encompass commercial, agricultural, and residential consumers. Unsupervised contrastive learning is applied to categorize end-users according to distinct consumption behaviors present within a DMA. Subsequently, the distinct consumption behaviors are utilized as features in the ensuing demand forecasting task using wavelet-transformed convolutional networks that incorporate a cross-attention mechanism combining both historical data and the derived representations. The proposed approach is evaluated on real-world DMAs over a six-month period, demonstrating improved forecasting performance in terms of MAPE across different DMAs, with a maximum improvement of 4.9%. Additionally, it identifies consumers whose behavior is shaped by socioeconomic factors, enhancing prior knowledge about the deterministic patterns that influence demand.
zh

[AI-28] SheetDesigner: MLLM-Powered Spreadsheet Layout Generation with Rule-Based and Vision-Based Reflection EMNLP2025

【速读】:该论文旨在解决电子表格(spreadsheet)自动布局生成问题,即如何在无需人工干预的情况下,根据内容语义和结构特征自动生成符合网格布局规范且具备良好可读性的电子表格。现有自动化布局方法存在两个核心缺陷:一是将单元格视为连续坐标的轴对齐矩形,忽略了电子表格固有的离散网格结构;二是未能建模数据依赖与上下文关联等独特语义关系。解决方案的关键在于提出 SheetDesigner 框架,该框架基于多模态大语言模型(Multimodal Large Language Models, MLLMs),采用“规则+视觉反思”的混合策略进行组件定位与内容填充,实现零样本、无需训练的高质量布局生成,在七个评估指标上显著优于五种基线方法(提升至少 22.6%)。研究进一步发现,MLLMs 通过视觉模态能较好处理重叠与平衡问题,但在对齐方面表现不足,因此必须结合规则推理与视觉反馈以达到最优效果。

链接: https://arxiv.org/abs/2509.07473
作者: Qin Chen,Yuanyi Ren,Xiaojun Ma,Mugeng Liu,Han Shi,Dongmei Zhang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to EMNLP 2025 Main Conference

点击查看摘要

Abstract:Spreadsheets are critical to data-centric tasks, with rich, structured layouts that enable efficient information transmission. Given the time and expertise required for manual spreadsheet layout design, there is an urgent need for automated solutions. However, existing automated layout models are ill-suited to spreadsheets, as they often (1) treat components as axis-aligned rectangles with continuous coordinates, overlooking the inherently discrete, grid-based structure of spreadsheets; and (2) neglect interrelated semantics, such as data dependencies and contextual links, unique to spreadsheets. In this paper, we first formalize the spreadsheet layout generation task, supported by a seven-criterion evaluation protocol and a dataset of 3,326 spreadsheets. We then introduce SheetDesigner, a zero-shot and training-free framework using Multimodal Large Language Models (MLLMs) that combines rule and vision reflection for component placement and content population. SheetDesigner outperforms five baselines by at least 22.6%. We further find that through the vision modality, MLLMs handle overlap and balance well but struggle with alignment, necessitating hybrid rule and visual reflection strategies. Our code and data are available at GitHub.
zh

[AI-29] Text2Touch: Tactile In-Hand Manipulation with LLM-Designed Reward Functions

【速读】:该论文旨在解决多轴物体在手内旋转任务中缺乏高效、自动化奖励设计的问题,尤其是如何利用触觉感知(tactile sensing)提升机器人灵巧操作的性能。传统方法依赖人工精心设计的奖励函数,不仅耗时且难以适应复杂场景。其解决方案的关键在于提出Text2Touch框架,通过大语言模型(LLM)生成简短而高效的奖励函数,并结合提示工程(prompt engineering)扩展至超过70个环境变量,同时采用模拟到现实(sim-to-real)蒸馏策略,成功将基于视觉-触觉融合感知的策略迁移至具备全驱动四指灵巧手的真实机器人平台。该方法显著优于人工调优的基线,在旋转速度和稳定性上表现更优,且奖励函数长度和复杂度降低一个数量级,从而大幅缩短从概念到可部署触觉灵巧技能的研发周期。

链接: https://arxiv.org/abs/2509.07445
作者: Harrison Field,Max Yang,Yijiong Lin,Efi Psomopoulou,David Barton,Nathan F. Lepora
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注: Accepted at CoRL 2025

点击查看摘要

Abstract:Large language models (LLMs) are beginning to automate reward design for dexterous manipulation. However, no prior work has considered tactile sensing, which is known to be critical for human-like dexterity. We present Text2Touch, bringing LLM-crafted rewards to the challenging task of multi-axis in-hand object rotation with real-world vision based tactile sensing in palm-up and palm-down configurations. Our prompt engineering strategy scales to over 70 environment variables, and sim-to-real distillation enables successful policy transfer to a tactile-enabled fully actuated four-fingered dexterous robot hand. Text2Touch significantly outperforms a carefully tuned human-engineered baseline, demonstrating superior rotation speed and stability while relying on reward functions that are an order of magnitude shorter and simpler. These results illustrate how LLM-designed rewards can significantly reduce the time from concept to deployable dexterous tactile skills, supporting more rapid and scalable multimodal robot learning. Project website: this https URL
zh

[AI-30] The Choice of Divergence: A Neglected Key to Mitigating Diversity Collapse in Reinforcement Learning with Verifiable Reward

【速读】:该论文旨在解决强化学习中基于可验证奖励(Reinforcement Learning with Verifiable Reward, RLVR)微调大语言模型(Large Language Models, LLMs)时出现的核心矛盾:尽管单次尝试准确率(Pass@1)提升,但多次尝试成功率(Pass@k)常显著下降,且伴随灾难性遗忘(catastrophic forgetting)现象。问题根源在于现有RLVR目标函数缺乏对知识保留的机制——标准方法中使用的模式导向逆KL散度(reverse KL-divergence)会压缩策略空间加速知识丢失,而完全省略散度项则无法防止模型偏离原有知识分布。解决方案的关键在于重新定位散度项的作用:提出多样性保持混合强化学习(Diversity-Preserving Hybrid RL, DPH-RL),采用质量覆盖型f-散度(如前向KL和JS散度)作为持续记忆机制,通过持续参考初始策略强制模型维持广泛解空间覆盖,从而在数学和SQL生成任务上同时提升Pass@1与Pass@k性能,并实现更高效的训练(仅需采样初始策略即可计算f-散度)。

链接: https://arxiv.org/abs/2509.07430
作者: Long Li,Jiaran Hao,Jason Klein Liu,Zhijian Zhou,Xiaoyu Tan,Wei Chu,Zhe Wang,Shirui Pan,Chao Qu,Yuan Qi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 26 pages, 5 figures

点击查看摘要

Abstract:A central paradox in fine-tuning Large Language Models (LLMs) with Reinforcement Learning with Verifiable Reward (RLVR) is the frequent degradation of multi-attempt performance (Pass@k) despite improvements in single-attempt accuracy (Pass@1). This is often accompanied by catastrophic forgetting, where models lose previously acquired skills. While various methods have been proposed, the choice and function of the divergence term have been surprisingly unexamined as a proactive solution. We argue that standard RLVR objectives – both those using the mode-seeking reverse KL-divergence and those forgoing a divergence term entirely – lack a crucial mechanism for knowledge retention. The reverse-KL actively accelerates this decay by narrowing the policy, while its absence provides no safeguard against the model drifting from its diverse knowledge base. We propose a fundamental shift in perspective: using the divergence term itself as the solution. Our framework, Diversity-Preserving Hybrid RL (DPH-RL), leverages mass-covering f-divergences (like forward-KL and JS-divergence) to function as a rehearsal mechanism. By continuously referencing the initial policy, this approach forces the model to maintain broad solution coverage. Extensive experiments on math and SQL generation demonstrate that DPH-RL not only resolves the Pass@k degradation but improves both Pass@1 and Pass@k in- and out-of-domain. Additionally, DPH-RL is more training-efficient because it computes f-divergence using generator functions, requiring only sampling from the initial policy and no online reference model. Our work highlights a crucial, overlooked axis for improving RLVR, demonstrating that the proper selection of a divergence measure is a powerful tool for building more general and diverse reasoning models.
zh
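
摘要提到 DPH-RL 仅需从初始策略采样即可计算 f-散度。下面以 forward-KL 为例给出一个示意性草图(变量与接口为示例假设,非论文实现):

```python
import torch

def forward_kl_penalty(logp_init: torch.Tensor, logp_cur: torch.Tensor) -> torch.Tensor:
    """用初始策略 π0 的样本估计 forward-KL(π0 || πθ)(示意)。

    logp_init: π0 在其自身采样序列上的对数概率(训练中为常数,可离线缓存)。
    logp_cur:  当前策略 πθ 在同一批序列上的对数概率(参与反向传播)。
    按摘要所述,这种质量覆盖型散度促使 πθ 保持对 π0 支撑集的覆盖,
    从而缓解 Pass@k 退化;此处只演示惩罚项本身。
    """
    return (logp_init - logp_cur).mean()

# 演示:惩罚随当前策略偏离初始策略而增大,且梯度只经过 logp_cur
logp0 = torch.tensor([-3.0, -2.5, -4.0])
logp = torch.tensor([-3.2, -5.0, -4.1], requires_grad=True)
penalty = forward_kl_penalty(logp0, logp)
penalty.backward()
print(penalty.item(), logp.grad)
```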

[AI-31] Hybrid GCN-GRU Model for Anomaly Detection in Cryptocurrency Transactions

【速读】:该论文旨在解决区块链交易网络中非法活动的检测问题,该问题具有时序动态性和节点间复杂关联性的特点。解决方案的关键在于提出一种混合图卷积网络-门控循环单元(Graph Convolutional Network-Gated Recurrent Unit, GCN-GRU)模型,该模型能够同时捕捉交易网络的结构特征和时间序列演化模式,从而提升对异常行为的识别精度。实验基于2020至2024年的真实比特币交易数据,结果显示该模型在准确率(Accuracy)和受试者工作特征曲线下面积(AUC-ROC)上分别达到0.9470和0.9807,显著优于现有基线方法。

链接: https://arxiv.org/abs/2509.07392
作者: Gyuyeon Na,Minjung Park,Hyeonjeong Cha,Soyoun Kim,Sunyoung Moon,Sua Lee,Jaeyoung Choi,Hyemin Lee,Sangmi Chai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Blockchain transaction networks are complex, with evolving temporal patterns and inter-node relationships. To detect illicit activities, we propose a hybrid GCN-GRU model that captures both structural and sequential features. Using real Bitcoin transaction data (2020-2024), our model achieved 0.9470 Accuracy and 0.9807 AUC-ROC, outperforming all baselines.
zh
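
下面是对"GCN 捕捉结构特征 + GRU 捕捉时序模式"这一混合结构的最小 PyTorch 示意(层数、维度与邻接矩阵均为示例假设,非论文原始网络):

```python
import torch
import torch.nn as nn

class GCNGRU(nn.Module):
    """GCN 提取交易网络结构特征、GRU 建模时序的混合网络(示意)。

    输入 x: [T, N, F](T 个时间步、N 个节点、F 维特征),
    adj: [N, N] 归一化邻接矩阵。
    """
    def __init__(self, in_dim, hid_dim, num_classes):
        super().__init__()
        self.gcn_w = nn.Linear(in_dim, hid_dim)
        self.gru = nn.GRU(hid_dim, hid_dim, batch_first=False)
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, x, adj):
        # 简化的 GCN 传播:A_hat @ X @ W,逐时间步共享参数
        h = torch.relu(torch.einsum("nm,tmf->tnf", adj, self.gcn_w(x)))
        out, _ = self.gru(h)             # 沿时间维建模每个节点的序列
        return self.head(out[-1])        # 取最后时刻的节点表示做异常分类

model = GCNGRU(in_dim=8, hid_dim=16, num_classes=2)
x = torch.randn(12, 5, 8)                # 12 个时间步、5 个节点
adj = torch.eye(5)                        # 演示用的占位邻接矩阵
print(model(x, adj).shape)                # torch.Size([5, 2])
```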

[AI-32] SBS: Enhancing Parameter-Efficiency of Neural Representations for Neural Networks via Spectral Bias Suppression ICONIP2025

【速读】:该论文旨在解决神经网络参数压缩中因标准多层感知机(Multilayer Perceptron, MLP)存在显著频谱偏差(Spectral Bias)而导致高频率细节重建能力受限的问题。其解决方案的关键在于提出一种名为SBS的参数高效增强方法,通过两项核心技术实现:一是基于单向排序的平滑策略,提升输出空间中的核函数平滑性;二是基于单向排序的平滑感知随机傅里叶特征(Smooth-aware Random Fourier Features),根据每层参数数量自适应调节输入编码的频率带宽,从而有效抑制频谱偏差并提升重建精度。

链接: https://arxiv.org/abs/2509.07373
作者: Qihu Xie,Yuan Li,Yi Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICONIP 2025

点击查看摘要

Abstract:Implicit neural representations have recently been extended to represent convolutional neural network weights via neural representation for neural networks, offering promising parameter compression benefits. However, standard multi-layer perceptrons used in neural representation for neural networks exhibit a pronounced spectral bias, hampering their ability to reconstruct high-frequency details effectively. In this paper, we propose SBS, a parameter-efficient enhancement to neural representation for neural networks that suppresses spectral bias using two techniques: (1) a unidirectional ordering-based smoothing that improves kernel smoothness in the output space, and (2) unidirectional ordering-based smoothing aware random Fourier features that adaptively modulate the frequency bandwidth of input encodings based on layer-wise parameter count. Extensive evaluations on various ResNet models with the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that SBS achieves significantly better reconstruction accuracy with fewer parameters compared to SOTA.
zh
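
针对摘要中"按层参数量自适应调节输入编码频率带宽的随机傅里叶特征",下面给出一个示意实现(带宽随参数量的缩放规则为示例假设):

```python
import numpy as np

def layer_adaptive_rff(coords, n_feats, param_count, base_sigma=1.0, ref_count=1e4):
    """按层参数量自适应调节带宽的随机傅里叶特征(示意)。

    假设:参数越多的层需要越高的频率带宽来表达细节,
    因此令 sigma 随 param_count 单调放大(具体放大规则为示例假设)。
    coords: [N, d] 的归一化坐标输入。
    """
    sigma = base_sigma * np.sqrt(param_count / ref_count)
    rng = np.random.default_rng(0)
    B = rng.normal(scale=sigma, size=(coords.shape[1], n_feats // 2))
    proj = coords @ B
    return np.concatenate([np.sin(2 * np.pi * proj),
                           np.cos(2 * np.pi * proj)], axis=-1)

coords = np.linspace(0, 1, 32).reshape(-1, 1)
feats_small = layer_adaptive_rff(coords, 64, param_count=1e3)   # 低带宽编码
feats_large = layer_adaptive_rff(coords, 64, param_count=1e6)   # 高带宽编码
print(feats_small.shape, feats_large.shape)
```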

[AI-33] Autonomous Code Evolution Meets NP-Completeness

【速读】:该论文旨在解决大规模代码库中生成式 AI (Generative AI) 自主演化能力的局限性问题,特别是如何将基于大语言模型(Large Language Models, LLMs)的代码自进化机制从孤立的数百行代码片段扩展至整个软件仓库级别(涵盖数百个文件与数万行C/C++代码)。其解决方案的关键在于提出SATLUTION框架,该框架通过LLM代理在严格正确性保障下对SAT求解器仓库进行直接演化,并结合分布式运行时反馈机制实现多层级自进化:不仅优化目标代码本身,还同步进化自身的演化策略与规则,从而在SAT竞赛基准上超越人类专家设计的最优解。

链接: https://arxiv.org/abs/2509.07367
作者: Cunxi Yu,Rongjian Liang,Chia-Tung Ho,Haoxing Ren
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Logic in Computer Science (cs.LO)
备注: 31 pages, 11 figures

点击查看摘要

Abstract:Large language models (LLMs) have recently shown strong coding abilities, enabling not only static code generation but also iterative code self-evolving through agentic frameworks. Recently, AlphaEvolve [Novikov et al., 2025] demonstrated that LLM-based coding agents can autonomously improve algorithms and surpass human experts, with scopes limited to isolated kernels spanning hundreds of lines of code. Inspired by AlphaEvolve, we present SATLUTION, the first framework to extend LLM-based code evolution to the full repository scale, encompassing hundreds of files and tens of thousands of lines of C/C++ code, and targeting Boolean Satisfiability (SAT), the canonical NP-complete problem and a cornerstone of both theory and applications. SATLUTION orchestrates LLM agents to directly evolve solver repositories under strict correctness guarantees and distributed runtime feedback, while simultaneously self-evolving its own evolution policies and rules. Starting from SAT Competition 2024 codebases and benchmarks, SATLUTION evolved solvers that decisively outperformed the human-designed winners of the SAT Competition 2025, and also surpassed both the 2024 and 2025 champions on the 2024 benchmarks.
zh

[AI-34] Word2Spike: Poisson Rate Coding for Associative Memories and Neuromorphic Algorithms ALT

【速读】:该论文旨在解决如何在神经形态架构中实现高效、鲁棒的语义编码与关联记忆问题,以支持类脑计算系统的低功耗运行。其核心挑战在于将连续空间中的词向量(word embeddings)映射到脉冲神经网络(Spiking Neural Networks, SNNs)中的可计算状态,同时保持语义保真度和抗噪能力。解决方案的关键在于提出一种名为Word2Spike的新颖率编码机制,通过泊松过程(Poisson processes)建立多维词向量到脉冲吸引子状态(spike-based attractor states)的一对一映射,并结合BitNet b1.58量化方法,在SimLex-999数据集上维持97%的语义相似性,在10,000个词上实现100%重建准确率,且在引入噪声条件下仍保持原始嵌入的类比推理性能(100%),表明该机制具备强大的语义编码鲁棒性。

链接: https://arxiv.org/abs/2509.07361
作者: Archit Kalra,Midhun Sadanand
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI)
备注: Presented at 2025 AI in Health Conference, Ken Kennedy Institute, Rice University

点击查看摘要

Abstract:Spiking neural networks offer a promising path toward energy-efficient, brain-like associative memory. This paper introduces Word2Spike, a novel rate coding mechanism that combines continuous word embeddings and neuromorphic architectures. We develop a one-to-one mapping that converts multi-dimensional word vectors into spike-based attractor states using Poisson processes. Using BitNet b1.58 quantization, we maintain 97% semantic similarity of continuous embeddings on SimLex-999 while achieving 100% reconstruction accuracy on 10,000 words from OpenAI’s text-embedding-3-large. We preserve analogy performance (100% of original embedding performance) even under intentionally introduced noise, indicating a resilient mechanism for semantic encoding in neuromorphic systems. Next steps include integrating the mapping with spiking transformers and liquid state machines (resembling Hopfield Networks) for further evaluation.
zh
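
下面用 NumPy 演示摘要中"通过泊松过程把词向量映射为脉冲状态"的率编码思想:每维数值决定发放概率,再由发放率重构原值(归一化方式为示例假设,非论文原始映射):

```python
import numpy as np

def poisson_rate_encode(embedding, t_steps=200, max_rate=0.5, rng=None):
    """把词向量各维映射为泊松脉冲序列,再由发放率近似重构(示意)。

    假设:先把每维线性归一化到 [0, 1],乘以 max_rate 作为每步发放概率。
    """
    rng = rng or np.random.default_rng(0)
    lo, hi = embedding.min(), embedding.max()
    rates = (embedding - lo) / (hi - lo + 1e-12) * max_rate
    spikes = rng.random((t_steps, embedding.size)) < rates      # [T, D] 的 0/1 脉冲
    est_rates = spikes.mean(axis=0)                             # 由脉冲计数估计发放率
    recon = est_rates / max_rate * (hi - lo) + lo               # 逆映射回嵌入空间
    return spikes, recon

vec = np.random.default_rng(1).normal(size=16)
spikes, recon = poisson_rate_encode(vec, t_steps=5000)
print(np.corrcoef(vec, recon)[0, 1])   # 时间步足够多时,相关性应接近 1
```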

[AI-35] Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity

【速读】:该论文试图解决的问题是:中间 token 生成(Intermediate Token Generation, ITG)是否真正反映了问题难度的自适应计算,即较长的推理链(Chain of Thoughts, CoTs)是否意味着模型在进行更深入的“思考”或更高复杂度的计算。现有研究普遍假设较长的中间 token 序列代表更强的问题适应性计算,但其内在机制尚不明确。论文的关键解决方案在于:通过在 A* 搜索算法的推导轨迹上从头训练 Transformer 模型,构建一个具有精确可验证问题复杂度(即求解迷宫所需操作数)的可控实验环境,从而系统评估中间 token 长度与真实问题难度之间的相关性。结果表明,中间 token 长度与问题复杂度仅有弱相关性,且强相关性仅出现在接近训练分布的问题中,说明这种现象主要源于近似记忆而非真正的自适应计算,从而挑战了将长推理序列等同于“思考努力”的主流观点。

链接: https://arxiv.org/abs/2509.07339
作者: Vardhan Palod,Karthik Valmeekam,Kaya Stechly,Subbarao Kambhampati
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Intermediate token generation (ITG), where a model produces output before the solution, has been proposed as a method to improve the performance of language models on reasoning tasks. While these reasoning traces or Chain of Thoughts (CoTs) are correlated with performance gains, the mechanisms underlying them remain unclear. A prevailing assumption in the community has been to anthropomorphize these tokens as “thinking”, treating longer traces as evidence of higher problem-adaptive computation. In this work, we critically examine whether intermediate token sequence length reflects or correlates with problem difficulty. To do so, we train transformer models from scratch on derivational traces of the A* search algorithm, where the number of operations required to solve a maze problem provides a precise and verifiable measure of problem complexity. We first evaluate the models on trivial free-space problems, finding that even for the simplest tasks, they often produce excessively long reasoning traces and sometimes fail to generate a solution. We then systematically evaluate the model on out-of-distribution problems and find that the intermediate token length and ground truth A* trace length only loosely correlate. We notice that the few cases where correlation appears are those where the problems are closer to the training distribution, suggesting that the effect arises from approximate recall rather than genuine problem-adaptive computation. This suggests that the inherent computational complexity of the problem instance is not a significant factor, but rather its distributional distance from the training data. These results challenge the assumption that intermediate trace generation is adaptive to problem difficulty and caution against interpreting longer sequences in systems like R1 as automatically indicative of “thinking effort”.
zh

[AI-36] General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases

【速读】:该论文旨在解决电子健康记录(Electronic Health Records, EHR)中年龄和性别等人口统计学特征(Demographic Attributes)在临床预测模型中常被当作辅助变量、其表示学习受限的问题。解决方案的关键在于提出一种通用的人口统计学预训练模型(General Demographic Pre-trained, GDP),通过系统探索排序策略与编码方法的组合,将结构化表格形式的人口统计学输入转化为高阶潜在嵌入(Latent Embeddings),从而提升其在下游梯度提升模型中的表征能力和预测贡献。实验表明,GDP在多种疾病和人群背景下均能增强人口统计学特征的信息价值,尤其在风险分层依赖年龄和性别的情况下显著改善判别能力、校准性能及决策树分割的信息增益,验证了该基础模型的跨任务与跨人群泛化潜力。

链接: https://arxiv.org/abs/2509.07330
作者: Li-Chin Chen,Ji-Tian Sheu,Yuh-Jue Chuang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Demographic attributes are universally present in electronic health records and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often relegated to auxiliary roles in model design, with limited attention given to learning their representations. This study proposes a General Demographic Pre-trained (GDP) model as a foundational representation framework tailored to age and gender. The model is pre-trained and evaluated using datasets with diverse diseases and population compositions from different geographic regions. The GDP architecture explores combinations of ordering strategies and encoding methods to transform tabular demographic inputs into latent embeddings. Experimental results demonstrate that sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances the representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundational models for tabular demographic attributes can generalize across tasks and populations, offering a promising direction for improving predictive performance in healthcare applications.
zh

[AI-37] MEGG: Replay via Maximally Extreme GGscore in Incremental Learning for Neural Recommendation Models

【速读】:该论文旨在解决神经协同过滤(Neural Collaborative Filtering)模型在动态环境中因用户偏好演化而导致的灾难性遗忘(catastrophic forgetting)问题,即传统静态训练假设下模型难以适应数据分布变化。其解决方案的关键在于提出一种基于经验回放(experience replay)的增量学习框架MEGG(Replay Samples with Maximally Extreme GGscore),其中引入了GGscore这一新颖样本影响力度量指标,能够精准识别并选择最具影响力的样本进行重放,从而有效缓解遗忘现象。该方法具有模型无关性,可无缝集成于不同推荐架构与框架中,并在多个基准数据集和神经推荐模型上展现出卓越的性能、可扩展性和鲁棒性。

链接: https://arxiv.org/abs/2509.07319
作者: Yunxiao Shi,Shuo Yang,Haimin Zhang,Li Wang,Yongze Wang,Qiang Wu,Min Xu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by Data Mining and Knowledge Discovery (DMKD) in Sep 2025

点击查看摘要

Abstract:Neural Collaborative Filtering models are widely used in recommender systems but are typically trained under static settings, assuming fixed data distributions. This limits their applicability in dynamic environments where user preferences evolve. Incremental learning offers a promising solution, yet conventional methods from computer vision or NLP face challenges in recommendation tasks due to data sparsity and distinct task paradigms. Existing approaches for neural recommenders remain limited and often lack generalizability. To address this, we propose MEGG, Replay Samples with Maximally Extreme GGscore, an experience replay based incremental learning framework. MEGG introduces GGscore, a novel metric that quantifies sample influence, enabling the selective replay of highly influential samples to mitigate catastrophic forgetting. Being model-agnostic, MEGG integrates seamlessly across architectures and frameworks. Experiments on three neural models and four benchmark datasets show superior performance over state-of-the-art baselines, with strong scalability, efficiency, and robustness. Implementation will be released publicly upon acceptance.
zh
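
摘要的核心是按 GGscore 选取影响力最极端的样本进入重放缓冲区。GGscore 本身的计算见原文,下面仅以任意实数分数演示这一选择逻辑的示意:

```python
import numpy as np

def select_replay_by_extreme_score(scores, k):
    """按影响力分数绝对值选取最极端的 k 个样本索引(示意)。

    scores: 每个历史交互样本的 GGscore 式影响力度量;此处用随机数代替,
    原文的具体选择策略与分数定义以论文为准。
    """
    return np.argsort(np.abs(scores))[-k:]

scores = np.random.default_rng(0).normal(size=1000)
replay_idx = select_replay_by_extreme_score(scores, k=64)
print(replay_idx.shape, np.abs(scores[replay_idx]).min())  # 入选样本的最小极端度
```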

[AI-38] zkUnlearner: A Zero-Knowledge Framework for Verifiable Unlearning with Multi-Granularity and Forgery-Resistance

【速读】:该论文旨在解决机器学习模型在执行“被遗忘权”(right to be forgotten)时缺乏可验证性与安全性的问题,尤其是在面对伪造攻击(forging attacks)时难以保障数据删除的可靠性。其核心挑战在于如何实现高效、细粒度且抗伪造的机器学习遗忘机制,以确保模型更新过程既透明又可信。解决方案的关键在于提出首个零知识框架 zkUnlearner,通过引入一种基于比特掩码(bit-masking)的通用计算模型,使现有的梯度下降算法训练过程中的零知识证明具备选择性(selectivity),从而支持样本级、特征级和类别级等多种粒度的可验证遗忘;同时,该框架首次设计了针对随机梯度下降(Stochastic Gradient Descent, SGD)优化中典型伪造攻击的有效防御策略,显著提升了遗忘验证的隐私保护强度与实用性。

链接: https://arxiv.org/abs/2509.07290
作者: Nan Wang,Nan Wu,Xiangyu Hui,Jiafan Wang,Xin Yuan
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:As the demand for exercising the “right to be forgotten” grows, the need for verifiable machine unlearning has become increasingly evident to ensure both transparency and accountability. We present zkUnlearner, the first zero-knowledge framework for verifiable machine unlearning, specifically designed to support multi-granularity and forgery-resistance. First, we propose a general computational model that employs a bit-masking technique to enable the selectivity of existing zero-knowledge proofs of training for gradient descent algorithms. This innovation enables not only traditional sample-level unlearning but also more advanced feature-level and class-level unlearning. Our model can be translated to arithmetic circuits, ensuring compatibility with a broad range of zero-knowledge proof systems. Furthermore, our approach overcomes key limitations of existing methods in both efficiency and privacy. Second, forging attacks present a serious threat to the reliability of unlearning. Specifically, in Stochastic Gradient Descent optimization, gradients from unlearned data, or from minibatches containing it, can be forged using alternative data samples or minibatches that exclude it. We propose the first effective strategies to resist state-of-the-art forging attacks. Finally, we benchmark a zkSNARK-based instantiation of our framework and perform comprehensive performance evaluations to validate its practicality.
zh

[AI-39] Paladin: Defending LLM-enabled Phishing Emails with a New Trigger-Tag Paradigm

【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)被恶意用于生成难以检测的钓鱼邮件(phishing content)所带来的安全威胁。由于LLM生成的内容在语法和语义上高度自然,传统基于规则或语义特征的检测方法难以有效识别此类内容。为应对这一挑战,作者提出Paladin方案,其核心创新在于通过多种插入策略将“触发器-标签”关联嵌入到原始LLM中,构建出可被检测的“仪器化LLM”(instrumented LLM)。当该模型生成与钓鱼相关的文本时,会自动注入可识别的标签,从而实现高效、隐蔽且鲁棒的检测。实验表明,该方法在四种不同场景下均能实现超过90%的检测准确率,显著优于现有基线方法。

链接: https://arxiv.org/abs/2509.07287
作者: Yan Pang,Wenlong Meng,Xiaojing Liao,Tianhao Wang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 20 pages

点击查看摘要

Abstract:With the rapid development of large language models, the potential threat of their malicious use, particularly in generating phishing content, is becoming increasingly prevalent. Leveraging the capabilities of LLMs, malicious users can synthesize phishing emails that are free from spelling mistakes and other easily detectable features. Furthermore, such models can generate topic-specific phishing messages, tailoring content to the target domain and increasing the likelihood of success. Detecting such content remains a significant challenge, as LLM-generated phishing emails often lack clear or distinguishable linguistic features. As a result, most existing semantic-level detection approaches struggle to identify them reliably. While certain LLM-based detection methods have shown promise, they suffer from high computational costs and are constrained by the performance of the underlying language model, making them impractical for large-scale deployment. In this work, we aim to address this issue. We propose Paladin, which embeds trigger-tag associations into vanilla LLMs using various insertion strategies, turning them into instrumented LLMs. When an instrumented LLM generates content related to phishing, it will automatically include detectable tags, enabling easier identification. Based on the design of implicit and explicit triggers and tags, we consider four distinct scenarios in our work. We evaluate our method from three key perspectives: stealthiness, effectiveness, and robustness, and compare it with existing baseline methods. Experimental results show that our method outperforms the baselines, achieving over 90% detection accuracy across all scenarios.
zh
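
下面是一个检测端的示意草图:假设仪器化 LLM 以零宽字符序列作为隐式标签注入钓鱼相关输出(标签形态与插入策略均为示例假设,原文的四种触发器-标签设计更为丰富):

```python
import re

# 示例假设:以零宽空格/零宽连接符序列作为水印标签
IMPLICIT_TAG = "\u200b\u200c\u200b"

def is_tagged_phishing(text: str) -> bool:
    """检测文本中是否含有仪器化模型注入的隐式标签。"""
    return IMPLICIT_TAG in text

def strip_tags(text: str) -> str:
    """移除标签后还原可读文本,便于后续人工复核。"""
    return re.sub(r"[\u200b\u200c]+", "", text)

sample = "Dear user, please verify your acc\u200b\u200c\u200bount immediately."
print(is_tagged_phishing(sample))        # True:标签存在即可判定
print(strip_tags(sample))                # 还原后的明文
```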

[AI-40] Datasets for Navigating Sensitive Topics in Recommendation Systems

【速读】:该论文旨在解决个性化AI系统(如推荐系统和聊天机器人)可能对用户造成负面影响的问题,尤其是这些系统在基于用户偏好分发内容时,存在无意中暴露用户于敏感或有害信息的风险。为实现对此类风险的量化评估,论文提出关键解决方案:构建包含敏感性标签(sensitivity labels)的新型数据集,用于超越单纯互动指标(如点击率)来衡量个性化系统的安全性与影响。其核心创新在于整合两个来源的数据——一是MovieLens评分数据与Does the Dog Die?社区提供的内容警告标签,二是Archive of Our Own平台上的同人小说互动数据与用户生成的警告信息,从而建立具有细粒度敏感性分类的标注数据集,支持更全面的模型评估与优化。

链接: https://arxiv.org/abs/2509.07269
作者: Amelia Kovacs,Jerry Chee,Kimia Kazemian,Sarah Dean
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Companion Proceedings of the ACM on Web Conference 2025, 2025

点击查看摘要

Abstract:Personalized AI systems, from recommendation systems to chatbots, are a prevalent method for distributing content to users based on their learned preferences. However, there is growing concern about the adverse effects of these systems, including their potential tendency to expose users to sensitive or harmful material, negatively impacting overall well-being. To address this concern quantitatively, it is necessary to create datasets with relevant sensitivity labels for content, enabling researchers to evaluate personalized systems beyond mere engagement metrics. To this end, we introduce two novel datasets that include a taxonomy of sensitivity labels alongside user-content ratings: one that integrates MovieLens rating data with content warnings from the Does the Dog Die? community ratings website, and another that combines fan-fiction interaction data and user-generated warnings from Archive of Our Own.
zh

[AI-41] HealthSLM-Bench: Benchmarking Small Language Models for Mobile and Wearable Healthcare Monitoring

【速读】:该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的医疗健康预测方案普遍依赖云端部署所引发的隐私泄露、高内存占用和延迟问题。其解决方案的关键在于系统性评估轻量级小语言模型(Small Language Models, SLMs)在健康预测任务中的表现,采用零样本(zero-shot)、少样本(few-shot)和指令微调(instruction fine-tuning)三种策略,并将性能最优的微调后SLMs部署于移动设备上,验证其在真实场景下的效率与预测能力。结果表明,SLMs可在保持与LLMs相当预测性能的同时,显著提升计算效率和隐私保护水平,尽管在类别不平衡和少样本场景下仍存在挑战。

链接: https://arxiv.org/abs/2509.07260
作者: Xin Wang,Ting Dang,Xinyu Zhang,Vassilis Kostakos,Michael J. Witbrock,Hong Jia
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: 9 pages, 6 tables, 6 figures

点击查看摘要

Abstract:Mobile and wearable healthcare monitoring play a vital role in facilitating timely interventions, managing chronic health conditions, and ultimately improving individuals’ quality of life. Previous studies on large language models (LLMs) have highlighted their impressive generalization abilities and effectiveness in healthcare prediction tasks. However, most LLM-based healthcare solutions are cloud-based, which raises significant privacy concerns and results in increased memory usage and latency. To address these challenges, there is growing interest in compact models, Small Language Models (SLMs), which are lightweight and designed to run locally and efficiently on mobile and wearable devices. Nevertheless, how well these models perform in healthcare prediction remains largely unexplored. We systematically evaluated SLMs on health prediction tasks using zero-shot, few-shot, and instruction fine-tuning approaches, and deployed the best performing fine-tuned SLMs on mobile devices to evaluate their real-world efficiency and predictive performance in practical healthcare scenarios. Our results show that SLMs can achieve performance comparable to LLMs while offering substantial gains in efficiency and privacy. However, challenges remain, particularly in handling class imbalance and few-shot scenarios. These findings highlight SLMs, though imperfect in their current form, as a promising solution for next-generation, privacy-preserving healthcare monitoring.
zh

[AI-42] Systematic Optimization of Open Source Large Language Models for Mathematical Reasoning

【速读】:该论文旨在解决大规模语言模型在数学推理任务中效率与性能之间的权衡问题,即如何在不牺牲解题准确率的前提下显著降低计算成本并提升推理速度。其解决方案的关键在于提出一个系统化的参数优化框架,通过精细化调整温度(temperature)、推理步数(reasoning steps)、规划周期(planning periods)和核采样阈值(nucleus sampling)等关键超参数,在五种不同架构的先进模型(Qwen2.5-72B、Llama-3.1-70B、DeepSeek-V3、Mixtral-8x22B 和 Yi-Lightning)上实现统一且高效的配置优化。实验表明,较低的温度范围(0.1–0.4)和较少的推理步数(4–6)能稳定提升效率而不影响准确性,最终达成平均29.4%的计算成本下降和23.9%的推理速度提升,验证了该框架的通用性与生产可用性。

链接: https://arxiv.org/abs/2509.07238
作者: Pranav Pawar,Dhwaj Jain,Varun Gupta,Kaustav Dedhia,Dashrath Kale,Sudhir Dhekane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:This paper presents a practical investigation into fine-tuning model parameters for mathematical reasoning tasks through experimenting with various configurations including randomness control, reasoning depth, and sampling strategies; careful tuning demonstrates substantial improvements in both efficiency and performance. A holistically optimized framework is introduced for five state-of-the-art models on mathematical reasoning tasks, exhibiting significant performance boosts while maintaining solution correctness. Through systematic parameter optimization across Qwen2.5-72B, Llama-3.1-70B, DeepSeek-V3, Mixtral-8x22B, and Yi-Lightning, consistent efficiency gains are demonstrated with 100% optimization success rate. The methodology achieves an average 29.4% reduction in computational cost and 23.9% improvement in inference speed across all tested models. This framework systematically searches parameter spaces including temperature (0.1-0.5), reasoning steps (4-12), planning periods (1-4), and nucleus sampling (0.85-0.98), determining optimal configurations through testing on mathematical reasoning benchmarks. Critical findings show that lower temperature regimes (0.1-0.4) and reduced reasoning steps (4-6) consistently enhance efficiency without compromising accuracy. DeepSeek-V3 achieves the highest accuracy at 98%, while Mixtral-8x22B delivers the most cost-effective performance at 361.5 tokens per accurate response. Key contributions include: (1) the first comprehensive optimization study for five diverse SOTA models in mathematical reasoning, (2) a standardized production-oriented parameter optimization framework, (3) discovery of universal optimization trends applicable across model architectures, and (4) production-ready configurations with extensive performance characterization.
zh
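
摘要描述的参数空间搜索可以用一个简单的网格搜索骨架来说明;下面的 evaluate 接口与打分折中均为示例假设,参数区间取自摘要:

```python
import itertools

def search_inference_configs(evaluate):
    """在摘要给出的参数区间上做网格搜索(示意)。

    evaluate(config) 需返回 (accuracy, cost);接口与折中系数为示例假设,
    论文中的评测基准与成本度量见原文。
    """
    space = {
        "temperature":     [0.1, 0.2, 0.3, 0.4, 0.5],
        "reasoning_steps": [4, 6, 8, 10, 12],
        "planning_period": [1, 2, 3, 4],
        "top_p":           [0.85, 0.90, 0.95, 0.98],
    }
    best = None
    for values in itertools.product(*space.values()):
        config = dict(zip(space.keys(), values))
        acc, cost = evaluate(config)
        score = acc - 0.001 * cost        # 示例:准确率与成本的线性折中
        if best is None or score > best[0]:
            best = (score, config)
    return best

def toy_evaluate(cfg):                     # 演示用的假设评测函数
    acc = 0.9 - 0.2 * abs(cfg["temperature"] - 0.2) \
              - 0.01 * max(0, cfg["reasoning_steps"] - 6)
    cost = cfg["reasoning_steps"] * 30
    return acc, cost

print(search_inference_configs(toy_evaluate))
```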

[AI-43] Breaking the Conventional Forward-Backward Tie in Neural Networks: Activation Functions

【速读】:该论文旨在解决传统基于梯度的神经网络训练中对前向与反向传播对称性的强依赖问题,这种对称性要求激活函数必须可微(或次可微)且在某些区域严格单调,从而限制了激活函数的选择,尤其是排除了具有显著平坦或不可微区域的函数。解决方案的关键在于通过数学分析证明:只要保持梯度方向正确,梯度大小的精确性是冗余的;进而实验证明,放宽前向-反向传播对称性、用更简单或随机的梯度替代传统梯度,不仅不会损害学习性能,反而可能提升训练稳定性和效率。该研究首次明确展示了使用如Heaviside阶跃函数等非连续、非光滑激活函数的神经网络仍可有效训练,从而显著扩展了激活函数的设计灵活性和计算效率。

链接: https://arxiv.org/abs/2509.07236
作者: Luigi Troiano,Francesco Gissi,Vincenzo Benedetto,Genny Tortora
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 30 pages, 8 figures, 14 tables, in press, available online 11 August 2025

点击查看摘要

Abstract:Gradient-based neural network training traditionally enforces symmetry between forward and backward propagation, requiring activation functions to be differentiable (or sub-differentiable) and strictly monotonic in certain regions to prevent flat gradient areas. This symmetry, linking forward activations closely to backward gradients, significantly restricts the selection of activation functions, particularly excluding those with substantial flat or non-differentiable regions. In this paper, we challenge this assumption through mathematical analysis, demonstrating that precise gradient magnitudes derived from activation functions are largely redundant, provided the gradient direction is preserved. Empirical experiments conducted on foundational architectures - such as Multi-Layer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and Binary Neural Networks (BNNs) - confirm that relaxing forward-backward symmetry and substituting traditional gradients with simpler or stochastic alternatives does not impair learning and may even enhance training stability and efficiency. We explicitly demonstrate that neural networks with flat or non-differentiable activation functions, such as the Heaviside step function, can be effectively trained, thereby expanding design flexibility and computational efficiency. Further empirical validation with more complex architectures remains a valuable direction for future research.
zh
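
摘要的核心论点是"只要梯度方向正确,梯度大小可以是近似的"。下面用直通估计器(straight-through estimator)演示如何训练使用 Heaviside 阶跃激活的网络;这是实现该思想的一种常见方式,并非论文实验的唯一设定:

```python
import torch

class HeavisideSTE(torch.autograd.Function):
    """前向用不可导的阶跃函数,反向用保持梯度方向的代理梯度(示意)。"""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x > 0).float()               # 前向:Heaviside 阶跃,几乎处处梯度为 0

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        surrogate = (x.abs() < 1).float()    # 反向:在 |x|<1 内按恒等函数传梯度
        return grad_out * surrogate

x = torch.randn(8, requires_grad=True)
y = HeavisideSTE.apply(x).sum()
y.backward()
print(x.grad)                                # 尽管前向不可导,仍得到可用的更新方向
```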

[AI-44] Explaining How Quantization Disparately Skews a Model

【速读】:该论文旨在解决后训练量化(Post Training Quantization, PTQ)在部署神经网络时可能加剧群体间性能差异的问题,尤其对少数群体造成不公平影响。研究指出,量化导致权重和激活值的变化在前向与反向传播中引发级联效应,表现为logits方差降低、损失增加以及群体准确率下降,并进一步通过梯度范数和海森矩阵特征值揭示了优化状态的不均衡性。解决方案的关键在于将混合精度量化感知训练(Mixed Precision Quantization Aware Training, QAT)与数据采样策略及加权损失函数相结合,从而在保持模型压缩效率的同时实现更公平的量化模型部署。

链接: https://arxiv.org/abs/2509.07222
作者: Abhimanyu Bellam,Jung-Eun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
备注:

点击查看摘要

Abstract:Post Training Quantization (PTQ) is widely adopted due to its high compression capacity and speed with minimal impact on accuracy. However, we observed that disparate impacts are exacerbated by quantization, especially for minority groups. Our analysis explains that in the course of quantization there is a chain of factors attributed to a disparate impact across groups during forward and backward passes. We explore how the changes in weights and activations induced by quantization cause cascaded impacts in the network, resulting in logits with lower variance, increased loss, and compromised group accuracies. We extend our study to verify the influence of these impacts on group gradient norms and eigenvalues of the Hessian matrix, providing insights into the state of the network from an optimization point of view. To mitigate these effects, we propose integrating mixed precision Quantization Aware Training (QAT) with dataset sampling methods and weighted loss functions, therefore providing fair deployment of quantized neural networks.
zh

[AI-45] OmniAcc: Personalized Accessibility Assistant Using Generative AI AAAI2025

【速读】:该论文旨在解决行动不便者在城市环境中因缺乏可访问信息与工具而面临的显著导航障碍问题。其解决方案的关键在于提出OmniAcc系统,该系统融合GPT-4、卫星影像和OpenStreetMap数据,利用零样本学习(zero-shot learning)与定制化提示词(customized prompts)实现对轮椅可通行设施(如坡道、人行横道)的精准识别、分类与地图构建,并提供个性化路径规划、实时免提导航及即时物理可达性查询功能,从而提升无障碍出行体验并支持城市规划决策。

链接: https://arxiv.org/abs/2509.07220
作者: Siddhant Karki,Ethan Han,Nadim Mahmud,Suman Bhunia,John Femiani,Vaskar Raychoudhury
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 Pages, 9 Figures, Published in the 1st Workshop on AI for Urban Planning, AAAI 2025 Workshop

点击查看摘要

Abstract:Individuals with ambulatory disabilities often encounter significant barriers when navigating urban environments due to the lack of accessible information and tools. This paper presents OmniAcc, an AI-powered interactive navigation system that utilizes GPT-4, satellite imagery, and OpenStreetMap data to identify, classify, and map wheelchair-accessible features such as ramps and crosswalks in the built environment. OmniAcc offers personalized route planning, real-time hands-free navigation, and instant query responses regarding physical accessibility. By using zero-shot learning and customized prompts, the system ensures precise detection of accessibility features, while supporting validation through structured workflows. This paper introduces OmniAcc and explores its potential to assist urban planners and mobility-aid users, demonstrated through a case study on crosswalk detection. With a crosswalk detection accuracy of 97.5%, OmniAcc highlights the transformative potential of AI in improving navigation and fostering more inclusive urban spaces.
zh

[AI-46] A multi-strategy improved gazelle optimization algorithm for solving numerical optimization and engineering applications

【速读】:该论文旨在解决猎豹优化算法(Gazelle Optimization Algorithm, GOA)中存在的探索与开发平衡失调及种群内信息交换不足的问题。解决方案的关键在于提出一种多策略改进的猎豹优化算法(Multi-Strategy Improved Gazelle Optimization Algorithm, MSIGOA),其核心包括:基于迭代的更新框架,根据优化进程动态切换探索与开发阶段以增强局部开发与全局探索的平衡并提升收敛速度;两种自适应参数调节策略,提高算法适用性并促进优化过程平滑;以及基于优势种群的重启策略,有效帮助算法跳出局部最优并避免早熟收敛。这些改进显著提升了MSIGOA在复杂优化问题中的探索能力、开发能力和整体收敛性能。

链接: https://arxiv.org/abs/2509.07211
作者: Qi Diao,Chengyue Xie,Yuchen Yin,Hoileong Lee,Haolong Yang
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
备注: This is the author’s preprint of the article published in Cluster Computing (Springer): Diao, Q., Xie, C., Yin, Y. et al. A multi-strategy improved gazelle optimization algorithm for solving numerical optimization and engineering applications. Cluster Comput 28, 643 (2025). The final authenticated version is available online at SpringerLink

点击查看摘要

Abstract:Aiming at the shortcomings of the gazelle optimization algorithm, such as the imbalance between exploration and exploitation and the insufficient information exchange within the population, this paper proposes a multi-strategy improved gazelle optimization algorithm (MSIGOA). To address these issues, MSIGOA adopts an iteration-based updating framework that switches between exploitation and exploration according to the optimization process, which effectively enhances the balance between local exploitation and global exploration and improves the convergence speed. Two adaptive parameter tuning strategies improve the applicability of the algorithm and promote a smoother optimization process. The dominant population-based restart strategy enhances the algorithm's ability to escape from local optima and avoid premature convergence. These enhancements significantly improve the exploration and exploitation capabilities of MSIGOA, bringing superior convergence and efficiency in dealing with complex problems. In this paper, the parameter sensitivity, strategy effectiveness, convergence, and stability of the proposed method are evaluated on two benchmark test sets, CEC2017 and CEC2022. Test results and statistical tests show that MSIGOA outperforms basic GOA and other advanced algorithms. On the CEC2017 and CEC2022 test sets, the proportion of functions where MSIGOA is not worse than GOA is 92.2% and 83.3%, respectively, and the proportion of functions where MSIGOA is not worse than other algorithms is 88.57% and 87.5%, respectively. Finally, the extensibility of MSIGOA is further verified by several engineering design optimization problems.
zh

[AI-47] BlendedNet: A Blended Wing Body Aircraft Dataset and Surrogate Model for Aerodynamic Predictions

【速读】:该论文旨在解决非传统气动构型(如混合翼身融合体,Blended Wing Body, BWB)中数据稀缺问题,并推动基于数据驱动的代理建模(surrogate modeling)在气动设计中的应用。其解决方案的关键在于构建了一个公开的高保真度气动数据库 BlendedNet,包含999个BWB几何体在约九种飞行条件下共8830个收敛的RANS(Reynolds-Averaged Navier-Stokes)模拟案例,每个案例网格量达900万至1400万单元;同时提出了一种端到端的代理预测框架:首先使用排列不变的PointNet回归器从采样表面点云中预测几何参数,随后利用特征调制(Feature-wise Linear Modulation, FiLM)网络结合预测参数与飞行条件,实现对表面压力系数 Cp、壁面剪切应力分量 Cfx 与 Cfz 的逐点预测,实验表明该方法在多种BWB构型上均能实现低误差的表面气动性能预测。

链接: https://arxiv.org/abs/2509.07209
作者: Nicholas Sung,Steven Spreizer,Mohamed Elrefaie,Kaira Samuel,Matthew C. Jones,Faez Ahmed
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted at ASME IDETC/CIE 2025 (DETC2025-168977). Dataset availability: BlendedNet dataset is openly available at Harvard Dataverse ( this https URL )

点击查看摘要

Abstract:BlendedNet is a publicly available aerodynamic dataset of 999 blended wing body (BWB) geometries. Each geometry is simulated across about nine flight conditions, yielding 8830 converged RANS cases with the Spalart-Allmaras model and 9 to 14 million cells per case. The dataset is generated by sampling geometric design parameters and flight conditions, and includes detailed pointwise surface quantities needed to study lift and drag. We also introduce an end-to-end surrogate framework for pointwise aerodynamic prediction. The pipeline first uses a permutation-invariant PointNet regressor to predict geometric parameters from sampled surface point clouds, then conditions a Feature-wise Linear Modulation (FiLM) network on the predicted parameters and flight conditions to predict pointwise coefficients Cp, Cfx, and Cfz. Experiments show low errors in surface predictions across diverse BWBs. BlendedNet addresses data scarcity for unconventional configurations and enables research on data-driven surrogate modeling for aerodynamic design.
zh
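
针对摘要中"FiLM 网络以预测的几何参数与飞行条件为条件做逐点气动系数预测"的环节,下面给出一个最小 PyTorch 示意(维度与层宽为示例取值):

```python
import torch
import torch.nn as nn

class FiLMSurface(nn.Module):
    """以几何参数 + 飞行条件调制逐点特征的 FiLM 网络(示意)。

    输入 pts: [B, N, 3] 表面点坐标;cond: [B, C] 几何参数与飞行条件的拼接。
    输出逐点的 (Cp, Cfx, Cfz) 三个系数。
    """
    def __init__(self, cond_dim, hid=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(3, hid), nn.ReLU(), nn.Linear(hid, hid))
        self.film = nn.Linear(cond_dim, 2 * hid)     # 产生逐通道的 gamma 与 beta
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(hid, 3))

    def forward(self, pts, cond):
        h = self.enc(pts)                            # [B, N, hid]
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = gamma.unsqueeze(1) * h + beta.unsqueeze(1)   # FiLM:按条件做仿射调制
        return self.head(h)                          # [B, N, 3]

model = FiLMSurface(cond_dim=12)
out = model(torch.randn(2, 1024, 3), torch.randn(2, 12))
print(out.shape)                                     # torch.Size([2, 1024, 3])
```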

[AI-48] A Hybrid CNN-LSTM Deep Learning Model for Intrusion Detection in Smart Grid

【速读】:该论文旨在解决智能电网(Smart Grid)在向数字化和网络化演进过程中所面临的网络安全威胁问题,特别是基于SCADA(数据采集与监控系统)的通信协议因缺乏有效防护而易受未授权访问和拒绝服务(DoS)攻击的风险。解决方案的关键在于提出一种混合深度学习入侵检测系统(Intrusion Detection System, IDS),该模型融合了卷积神经网络(Convolutional Neural Networks, CNN)的特征提取能力与长短期记忆网络(Long Short-Term Memory, LSTM)的时间序列模式识别优势,通过DNP3和IEC104两个典型协议的数据集进行训练与验证,在检测准确率上达到99.70%,显著优于其他传统深度学习方法。

链接: https://arxiv.org/abs/2509.07208
作者: Abdulhakim Alsaiari,Mohammad Ilyas
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:The evolution of the traditional power grid into the “smart grid” has resulted in a fundamental shift in energy management, which allows the integration of renewable energy sources with modern communication technology. However, this interconnection has increased smart grids’ vulnerability to attackers, which might result in privacy breaches, operational interruptions, and massive outages. The SCADA-based smart grid protocols are critical for real-time data collection and control, but they are vulnerable to attacks like unauthorized access and denial of service (DoS). This research proposes a hybrid deep learning-based Intrusion Detection System (IDS) intended to improve the cybersecurity of smart grids. The suggested model takes advantage of Convolutional Neural Networks’ (CNN) feature extraction capabilities as well as Long Short-Term Memory (LSTM) networks’ temporal pattern recognition skills. DNP3 and IEC104 intrusion detection datasets are employed to train and test our CNN-LSTM model to recognize and classify the potential cyber threats. Compared to other deep learning approaches, the results demonstrate considerable improvements in accuracy, precision, recall, and F1-score, with a detection accuracy of 99.70%.
zh
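
下面是"CNN 特征提取 + LSTM 时序建模"入侵检测结构的最小 PyTorch 示意(通道数、核大小与类别数均为示例假设,论文在 DNP3/IEC104 上的具体配置以原文为准):

```python
import torch
import torch.nn as nn

class CNNLSTMIDS(nn.Module):
    """CNN 提取局部流量特征、LSTM 捕捉时序模式的入侵检测模型(示意)。

    输入 x: [B, T, F] 的流量特征序列,输出各攻击类别的 logits。
    """
    def __init__(self, n_feats, n_classes):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_feats, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)  # [B, T, 64]
        _, (hn, _) = self.lstm(h)                        # 取末时刻隐状态
        return self.head(hn[-1])

model = CNNLSTMIDS(n_feats=20, n_classes=5)
print(model(torch.randn(4, 50, 20)).shape)               # torch.Size([4, 5])
```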

[AI-49] PaVeRL-SQL: Text-to-SQL via Partial-Match Rewards and Verbal Reinforcement Learning

【速读】:该论文旨在解决当前Text-to-SQL模型在工业级数据库和复杂查询场景下执行准确率低的问题,尤其是面对领域特定业务逻辑时表现不佳。其解决方案的关键在于提出了一种名为PaVeRL-SQL的框架,结合了部分匹配奖励(Partial-Match Rewards)与言语强化学习(Verbal Reinforcement Learning),通过两种并行管道实现自提升:一是基于上下文学习的群体自评估机制(verbal-RL),利用开源和闭源大语言模型(LLMs)作为骨干;二是基于思维链(Chain-of-Thought, CoT)的强化学习管道,采用OmniSQL-7B小模型并设计特殊奖励函数与两阶段强化学习训练策略。该方法在Spider、Spider 2.0和BIRD等主流基准上达到最先进性能,尤其在工业级Spider2.0-SQLite数据集上相比现有最优方法提升达7.4%,且混合SQL方言训练带来三倍性能增益,显著提升了实际应用场景中的可靠性与泛化能力。

链接: https://arxiv.org/abs/2509.07159
作者: Heng Hao,Wenjun Hu,Oxana Verkholyak,Davoud Ataee Tarzanagh,Baruch Gutow,Sima Didari,Masoud Faraki,Hankyu Moon,Seungjai Min
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 10 pages

点击查看摘要

Abstract:Text-to-SQL models allow users to interact with a database more easily by generating executable SQL statements from natural-language questions. Despite recent successes on simpler databases and questions, current Text-to-SQL methods still suffer from low execution accuracy on industry-scale databases and complex questions involving domain-specific business logic. We present PaVeRL-SQL, a framework that combines Partial-Match Rewards and Verbal Reinforcement Learning to drive self-improvement in reasoning language models (RLMs) for Text-to-SQL. To handle practical use cases, we adopt two pipelines: (1) a newly designed in-context learning framework with group self-evaluation (verbal-RL), using capable open- and closed-source large language models (LLMs) as backbones; and (2) a chain-of-thought (CoT) RL pipeline with a small backbone model (OmniSQL-7B) trained with a specially designed reward function and two-stage RL. These pipelines achieve state-of-the-art (SOTA) results on popular Text-to-SQL benchmarks – Spider, Spider 2.0, and BIRD. For the industrial-level Spider2.0-SQLite benchmark, the verbal-RL pipeline achieves an execution accuracy 7.4% higher than SOTA, and the CoT pipeline is 1.4% higher. RL training with mixed SQL dialects yields strong, threefold gains, particularly for dialects with limited training data. Overall, PaVeRL-SQL delivers reliable, SOTA Text-to-SQL under realistic industrial constraints. The code is available at this https URL.
zh
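
摘要中的 Partial-Match Rewards 旨在给出比执行结果 0/1 奖励更平滑的信号。下面以"SQL 关键子句命中率"做一个示意实现(具体打分规则为示例假设,非论文公式):

```python
import re

CLAUSES = ("select", "from", "where", "group by", "order by", "limit")

def _split_clauses(sql: str) -> dict:
    """把 SQL 粗略切分为关键子句(仅用于演示,不是完整解析器)。"""
    sql = re.sub(r"\s+", " ", sql.lower()).strip()
    alt = "|".join(CLAUSES)
    parts = {}
    for c in CLAUSES:
        m = re.search(rf"{c}\s+(.*?)(?=\s(?:{alt})\s|$)", sql)
        if m:
            parts[c] = m.group(1).strip()
    return parts

def partial_match_reward(pred_sql: str, gold_sql: str) -> float:
    """按子句逐项比对,返回 [0, 1] 的部分匹配奖励(示意)。"""
    p, g = _split_clauses(pred_sql), _split_clauses(gold_sql)
    if not g:
        return 0.0
    hits = sum(1 for c in g if p.get(c) == g[c])
    return hits / len(g)

print(partial_match_reward(
    "SELECT name FROM users WHERE age > 30",
    "SELECT name FROM users WHERE age > 18"))   # 2/3:仅 WHERE 子句不匹配
```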

[AI-50] Autoencoder-Based Denoising of Muscle Artifacts in ECG to Preserve Skin Nerve Activity (SKNA) for Cognitive Stress Detection

【速读】:该论文旨在解决皮肤神经活动(SKNA)在高频率心电图(ECG)记录中因肌电干扰(EMG)而导致信号失真的问题,尤其在持续肌肉活动时,传统固定频段带通滤波方法难以区分SKNA与EMG的重叠频谱成分。解决方案的关键在于提出一种基于轻量级一维卷积自动编码器结合长短期记忆(LSTM)瓶颈层的去噪模型,通过深度学习重构受EMG污染的SKNA信号,在保留生理相关性的同时显著提升信噪比(最高达9.65 dB),恢复爆发型SKNA特征的判别能力(AUROC ≥ 0.96),并在严重噪声条件下实现基线与交感刺激状态分类准确率91–98%,优于传统方法,从而支持在自然环境中进行更鲁棒的SKNA监测。

链接: https://arxiv.org/abs/2509.07146
作者: Farnoush Baghestani,Jihye Moon,Youngsun Kong,Ki Chon
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 11 pages, 7 figures, 6 tables

点击查看摘要

Abstract:The sympathetic nervous system (SNS) plays a central role in regulating the body’s responses to stress and maintaining physiological stability. Its dysregulation is associated with a wide range of conditions, from cardiovascular disease to anxiety disorders. Skin nerve activity (SKNA) extracted from high-frequency electrocardiogram (ECG) recordings provides a noninvasive window into SNS dynamics, but its measurement is highly susceptible to electromyographic (EMG) contamination. Traditional preprocessing based on bandpass filtering within a fixed range (e.g., 500–1000 Hz) is susceptible to overlapping EMG and SKNA spectral components, especially during sustained muscle activity. We present a denoising approach using a lightweight one-dimensional convolutional autoencoder with a long short-term memory (LSTM) bottleneck to reconstruct clean SKNA from EMG-contaminated recordings. Using clean ECG-derived SKNA data from cognitive stress experiments and EMG noise from chaotic muscle stimulation recordings, we simulated contamination at realistic noise levels (−4 dB, −8 dB signal-to-noise ratio) and trained the model in the leave-one-subject-out cross-validation framework. The method improved signal-to-noise ratio by up to 9.65 dB, increased cross correlation with clean SKNA from 0.40 to 0.72, and restored burst-based SKNA features to near-clean discriminability (AUROC ≥ 0.96). Classification of baseline versus sympathetic stimulation (cognitive stress) conditions reached accuracies of 91–98% across severe noise levels, comparable to clean data. These results demonstrate that deep learning–based reconstruction can preserve physiologically relevant sympathetic bursts during substantial EMG interference, enabling more robust SKNA monitoring in naturalistic, movement-rich environments.
zh
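
下面是摘要所述"一维卷积自动编码器 + LSTM 瓶颈"去噪结构的最小 PyTorch 示意(通道数与步长为示例取值):

```python
import torch
import torch.nn as nn

class SKNADenoiser(nn.Module):
    """一维卷积编码/解码 + LSTM 瓶颈的去噪自动编码器(示意)。

    输入 x: [B, 1, T] 的含 EMG 污染的高频 ECG 片段,输出同形状的重构信号。
    """
    def __init__(self, hid=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, hid, 9, stride=2, padding=4), nn.ReLU(),
        )
        self.bottleneck = nn.LSTM(hid, hid, batch_first=True)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(hid, 16, 9, stride=2, padding=4, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 9, stride=2, padding=4, output_padding=1),
        )

    def forward(self, x):
        h = self.enc(x)                       # [B, hid, T/4]
        h, _ = self.bottleneck(h.transpose(1, 2))
        return self.dec(h.transpose(1, 2))    # [B, 1, T]

model = SKNADenoiser()
noisy = torch.randn(2, 1, 1024)
print(model(noisy).shape)                     # torch.Size([2, 1, 1024])
```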

[AI-51] SoK: Security and Privacy of AI Agents for Blockchain

【速读】:该论文旨在解决当前区块链与人工智能(AI)交叉领域中缺乏系统性知识整合的问题,尤其针对AI代理(AI agents)在区块链环境中的应用研究尚不充分的现状。其关键解决方案是首次提出一个面向区块链的AI驱动系统的知识体系化(Systematization of Knowledge),聚焦于安全性和隐私维度,梳理了相关应用、技术局限及未来研究方向,从而为非专家用户提供更清晰的技术路径,并推动该领域的规范化发展。

链接: https://arxiv.org/abs/2509.07131
作者: Nicolò Romandini,Carlo Mazzocca,Kai Otsuki,Rebecca Montanari
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: This work has been accepted to the 7th International Conference on Blockchain Computing and Applications (BCCA 2025)

点击查看摘要

Abstract:Blockchain and smart contracts have garnered significant interest in recent years as the foundation of a decentralized, trustless digital ecosystem, thereby eliminating the need for traditional centralized authorities. Despite their central role in powering Web3, their complexity still presents significant barriers for non-expert users. To bridge this gap, Artificial Intelligence (AI)-based agents have emerged as valuable tools for interacting with blockchain environments, supporting a range of tasks, from analyzing on-chain data and optimizing transaction strategies to detecting vulnerabilities within smart contracts. While interest in applying AI to blockchain is growing, the literature still lacks a comprehensive survey that focuses specifically on the intersection with AI agents. Most of the related work only provides general considerations, without focusing on any specific domain. This paper addresses this gap by presenting the first Systematization of Knowledge dedicated to AI-driven systems for blockchain, with a special focus on their security and privacy dimensions, shedding light on their applications, limitations, and future research directions.
zh

[AI-52] Riemannian Batch Normalization: A Gyro Approach

【速读】:该论文旨在解决深度学习中归一化层(Normalization layers)在流形数据上的适用性问题,传统欧氏空间中的归一化方法无法有效处理定义在黎曼流形(Riemannian manifolds)上的数据。其解决方案的关键在于提出一种基于gyrogroup结构的原理性黎曼批量归一化框架——GyroBN,该框架通过引入两个必要条件:伪还原性(pseudo-reduction)和gyro等距旋转(gyroisometric gyrations),确保对样本统计量的理论控制,并证明这些条件在所有已知用于机器学习的gyrogroup上均成立。此外,GyroBN可统一现有多种黎曼归一化方法作为特例,并在七种代表性几何空间(如Grassmann流形、常曲率空间及相关性流形)上实现具体实例化,从而显著提升模型在非欧几里得域中的训练稳定性和性能。

链接: https://arxiv.org/abs/2509.07115
作者: Ziheng Chen,Xiao-Jun Wu,Nicu Sebe
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:

点击查看摘要

Abstract:Normalization layers are crucial for deep learning, but their Euclidean formulations are inadequate for data on manifolds. On the other hand, many Riemannian manifolds in machine learning admit gyro-structures, enabling principled extensions of Euclidean neural networks to non-Euclidean domains. Inspired by this, we introduce GyroBN, a principled Riemannian batch normalization framework for gyrogroups. We establish two necessary conditions, namely pseudo-reduction and gyroisometric gyrations, that guarantee GyroBN with theoretical control over sample statistics, and show that these conditions hold for all known gyrogroups in machine learning. Our framework also incorporates several existing Riemannian normalization methods as special cases. We further instantiate GyroBN on seven representative geometries, including the Grassmannian, five constant curvature spaces, and the correlation manifold, and derive novel gyro and Riemannian structures to enable these instantiations. Experiments across these geometries demonstrate the effectiveness of GyroBN. The code is available at this https URL.
zh

[AI-53] Lookup multivariate Kolmogorov-Arnold Networks

【Quick Read】: This paper targets the dominant parameter counts and computational cost of high-dimensional linear mappings (linear layers) in modern deep learning models. The proposed general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), expresses high-dimensional mappings through trainable low-dimensional multivariate functions implemented as spline lookup tables, so each function carries dozens or hundreds of trainable parameters yet needs only a few multiplications to evaluate. Compared with MLPs, lmKANs match their flexibility for general high-dimensional function approximation while cutting inference FLOPs by up to 6.0x, and they deliver substantially higher throughput at matched accuracy in feedforward and convolutional benchmarks.

Links: https://arxiv.org/abs/2509.07103
Authors: Sergey Pozdnyakov, Philippe Schwaller
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF); Software Engineering (cs.SE)
Comments:

Click to view abstract

Abstract:High-dimensional linear mappings, or linear layers, dominate both the parameter count and the computational cost of most modern deep-learning models. We introduce a general drop-in replacement, lookup multivariate Kolmogorov-Arnold Networks (lmKANs), which deliver a substantially better trade-off between capacity and inference cost. Our construction expresses a general high-dimensional mapping through trainable low-dimensional multivariate functions. These functions can carry dozens or hundreds of trainable parameters each, and yet it takes only a few multiplications to compute them because they are implemented as spline lookup tables. Empirically, lmKANs reduce inference FLOPs by up to 6.0x while matching the flexibility of MLPs in general high-dimensional function approximation. In another feedforward fully connected benchmark, on the tabular-like dataset of randomly displaced methane configurations, lmKANs enable more than 10x higher H100 throughput at equal accuracy. Within frameworks of Convolutional Neural Networks, lmKAN-based CNNs cut inference FLOPs at matched accuracy by 1.6-2.1x and by 1.7x on the CIFAR-10 and ImageNet-1k datasets, respectively. Our code, including dedicated CUDA kernels, is available online at this https URL.
zh
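
The core trick is that a dense low-dimensional function can be stored as a grid and evaluated with a handful of multiplies. A rough illustration of one trainable 2-D lookup function with bilinear interpolation (our simplification; the paper uses spline tables inside full network layers, and the grid size and range here are arbitrary):

```python
import numpy as np

class LookupFunction2D:
    """Trainable 2-D function stored as a grid, evaluated by bilinear
    interpolation: four table lookups plus a few multiplications."""
    def __init__(self, grid_size=16, lo=-1.0, hi=1.0):
        self.lo, self.hi, self.n = lo, hi, grid_size
        self.table = np.random.randn(grid_size, grid_size) * 0.1  # trainable

    def __call__(self, x, y):
        # map inputs to fractional grid coordinates
        gx = (np.clip(x, self.lo, self.hi) - self.lo) / (self.hi - self.lo) * (self.n - 1)
        gy = (np.clip(y, self.lo, self.hi) - self.lo) / (self.hi - self.lo) * (self.n - 1)
        x0, y0 = int(gx), int(gy)
        x1, y1 = min(x0 + 1, self.n - 1), min(y0 + 1, self.n - 1)
        tx, ty = gx - x0, gy - y0
        t = self.table
        return ((1 - tx) * (1 - ty) * t[x0, y0] + tx * (1 - ty) * t[x1, y0]
                + (1 - tx) * ty * t[x0, y1] + tx * ty * t[x1, y1])

f = LookupFunction2D()
print(f(0.3, -0.5))
```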

[AI-54] Statistical Methods in Generative AI

【Quick Read】: This paper addresses the lack of reliability guarantees (correctness, safety, fairness, and related properties) when generative AI is deployed in practice. The key idea is to bring statistical methods to bear on generative AI: probabilistic modeling and statistical inference can quantify uncertainty in model outputs, improve the quality and efficiency of AI evaluation, and guide the design of interventions and experiments, providing both theoretical grounding and practical paths toward trustworthy deployment of generative AI.

Links: https://arxiv.org/abs/2509.07054
Authors: Edgar Dobriban
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)
Comments: Invited review paper for Annual Review of Statistics and Its Application. Feedback welcome

Click to view abstract

Abstract:Generative Artificial Intelligence is emerging as an important technology, promising to be transformative in many areas. At the same time, generative AI techniques are based on sampling from probabilistic models, and by default, they come with no guarantees about correctness, safety, fairness, or other properties. Statistical methods offer a promising potential approach to improve the reliability of generative AI techniques. In addition, statistical methods are also promising for improving the quality and efficiency of AI evaluation, as well as for designing interventions and experiments in AI. In this paper, we review some of the existing work on these topics, explaining both the general statistical techniques used, as well as their applications to generative AI. We also discuss limitations and potential future directions.
zh

[AI-55] Controllable Singing Voice Synthesis using Phoneme-Level Energy Sequence

【Quick Read】: This paper tackles the lack of dynamics control in controllable singing voice synthesis (SVS): existing probabilistic SVS systems achieve high audio quality but cannot precisely control dynamic attributes such as loudness, limiting musical expressiveness. The key idea is to condition the SVS model explicitly on energy sequences extracted from ground-truth spectrograms, giving precise control over temporal loudness variation, and to propose a phoneme-level energy sequence for user-friendly control. The approach cuts annotation costs and reduces the mean absolute error of energy sequences for phoneme-level inputs by over 50% without compromising synthesis quality.

Links: https://arxiv.org/abs/2509.07038
Authors: Yerin Ryu, Inseop Shin, Chanwoo Kim
Institution: Unknown
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: Accepted to ASRU 2025

Click to view abstract

Abstract:Controllable Singing Voice Synthesis (SVS) aims to generate expressive singing voices reflecting user intent. While recent SVS systems achieve high audio quality, most rely on probabilistic modeling, limiting precise control over attributes such as dynamics. We address this by focusing on dynamic control–temporal loudness variation essential for musical expressiveness–and explicitly condition the SVS model on energy sequences extracted from ground-truth spectrograms, reducing annotation costs and improving controllability. We also propose a phoneme-level energy sequence for user-friendly control. To the best of our knowledge, this is the first attempt enabling user-driven dynamics control in SVS. Experiments show our method achieves over 50% reduction in mean absolute error of energy sequences for phoneme-level inputs compared to baseline and energy-predictor models, without compromising synthesis quality.
zh
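
A phoneme-level energy sequence can be obtained by collapsing frame-level spectrogram energy over phoneme boundaries. A minimal sketch, assuming a mel spectrogram and a frame-level phoneme alignment are available (both stand-ins here):

```python
import numpy as np

def phoneme_energy(spectrogram, phoneme_frames):
    """Collapse frame-level energy into one value per phoneme.
    spectrogram: (n_mels, n_frames); phoneme_frames: list of (start, end)."""
    frame_energy = np.sqrt((spectrogram ** 2).sum(axis=0))  # energy per frame
    return np.array([frame_energy[s:e].mean() for s, e in phoneme_frames])

spec = np.abs(np.random.randn(80, 200))        # stand-in mel spectrogram
alignment = [(0, 50), (50, 120), (120, 200)]   # phoneme boundaries in frames
print(phoneme_energy(spec, alignment))         # conditioning sequence for SVS
```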

[AI-56] Methodological Insights into Structural Causal Modelling and Uncertainty-Aware Forecasting for Economic Indicators ECAI-2025

【Quick Read】: This paper addresses two difficulties in conventional financial time-series analysis: unclear causal identification and hard-to-quantify forecast uncertainty, particularly for macroeconomic indicators such as GDP, economic growth, inflation, and unemployment. The key idea combines causal discovery with uncertainty-aware forecasting. First, the LPCMCI framework with Gaussian Process Distance Correlation (GPDC) uncovers dynamic causal structure in quarterly data from 1970 to 2021, revealing a robust unidirectional causal link from economic growth to GDP and weak connectivity of inflation that suggests latent factors. Then Chronos, a large language model trained for time series, performs zero-shot probabilistic forecasting, accurately predicting unemployment one and two quarters ahead with 90% confidence intervals that support statistically principled anomaly detection. The approach improves causal interpretability and forecasting robustness for economic policy.

Links: https://arxiv.org/abs/2509.07036
Authors: Federico Cerutti
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at the 2nd edition of the Workshop in AI and Finance at ECAI-2025

Click to view abstract

Abstract:This paper presents a methodological approach to financial time series analysis by combining causal discovery and uncertainty-aware forecasting. As a case study, we focus on four key U.S. macroeconomic indicators – GDP, economic growth, inflation, and unemployment – and we apply the LPCMCI framework with Gaussian Process Distance Correlation (GPDC) to uncover dynamic causal relationships in quarterly data from 1970 to 2021. Our results reveal a robust unidirectional causal link from economic growth to GDP and highlight the limited connectivity of inflation, suggesting the influence of latent factors. Unemployment exhibits strong autoregressive dependence, motivating its use as a case study for probabilistic forecasting. Leveraging the Chronos framework, a large language model trained for time series, we perform zero-shot predictions on unemployment. This approach delivers accurate forecasts one and two quarters ahead, without requiring task-specific training. Crucially, the model’s uncertainty-aware predictions yield 90% confidence intervals, enabling effective anomaly detection through statistically principled deviation analysis. This study demonstrates the value of combining causal structure learning with probabilistic language models to inform economic policy and enhance forecasting robustness.
zh
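
The anomaly-detection step reduces to checking whether an observation falls outside the central forecast interval. A minimal sketch on synthetic forecast draws (a real system would sample trajectories from the probabilistic forecaster; all numbers here are made up):

```python
import numpy as np

def anomaly_flags(forecast_samples, actuals, level=0.9):
    """Flag observations outside the central `level` forecast interval.
    forecast_samples: (n_samples, horizon) draws from a probabilistic
    forecaster; actuals: (horizon,) observed values."""
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(forecast_samples, alpha, axis=0)
    hi = np.quantile(forecast_samples, 1.0 - alpha, axis=0)
    return (actuals < lo) | (actuals > hi), (lo, hi)

samples = np.random.normal(4.0, 0.3, size=(500, 2))  # toy unemployment draws
flags, (lo, hi) = anomaly_flags(samples, np.array([4.1, 5.2]))
print(flags)  # second quarter falls outside the 90% interval -> anomaly
```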

[AI-57] A Maslow-Inspired Hierarchy of Engagement with AI Model

【Quick Read】: This paper addresses the absence of a systematic framework to guide responsible and sustainable AI engagement by organizations and individuals amid the rapid spread of AI across industry, government, and education. The key contribution is the Hierarchy of Engagement with AI, a maturity framework inspired by Maslow's hierarchy of needs that models AI adoption as a progression through eight levels, from initial exposure and basic understanding to ecosystem collaboration and societal impact. Each level integrates technical, organisational, and ethical dimensions, emphasising that AI maturity depends not only on infrastructure and capability but also on trust, governance, and responsibility, giving scholars an analytical lens and practitioners and policymakers a diagnostic and strategic planning tool.

Links: https://arxiv.org/abs/2509.07032
Authors: Madara Ogot
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 30 pages, 14 tables

Click to view abstract

Abstract:The rapid proliferation of artificial intelligence (AI) across industry, government, and education highlights the urgent need for robust frameworks to conceptualise and guide engagement. This paper introduces the Hierarchy of Engagement with AI model, a novel maturity framework inspired by Maslow’s hierarchy of needs. The model conceptualises AI adoption as a progression through eight levels, beginning with initial exposure and basic understanding and culminating in ecosystem collaboration and societal impact. Each level integrates technical, organisational, and ethical dimensions, emphasising that AI maturity is not only a matter of infrastructure and capability but also of trust, governance, and responsibility. Initial validation of the model using four diverse case studies (General Motors, the Government of Estonia, the University of Texas System, and the African Union AI Strategy) demonstrate the model’s contextual flexibility across various sectors. The model provides scholars with a framework for analysing AI maturity and offers practitioners and policymakers a diagnostic and strategic planning tool to guide responsible and sustainable AI engagement. The proposed model demonstrates that AI maturity progression is multi-dimensional, requiring technological capability, ethical integrity, organisational resilience, and ecosystem collaboration.
zh

[AI-58] A Minimalist Bayesian Framework for Stochastic Optimization

【Quick Read】: This paper addresses the modeling burden and inflexibility of the traditional Bayesian paradigm for sequential decision-making under complex structural constraints, where a probabilistic model must be placed on all parameters. The key idea is a minimalist Bayesian framework that places a prior only on the component of interest, such as the location of the optimum, and eliminates nuisance parameters via profile likelihood, which naturally handles constraints. The framework accommodates structured problems such as continuum-armed Lipschitz bandits and dynamic pricing, offers a probabilistic lens on classical convex optimization algorithms (the center-of-gravity and ellipsoid methods), and yields the MINimalist Thompson Sampling (MINTS) algorithm with near-optimal regret guarantees for multi-armed bandits.

Links: https://arxiv.org/abs/2509.07030
Authors: Kaizheng Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 25 pages

Click to view abstract

Abstract:The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the component of interest, such as the location of the optimum. Nuisance parameters are eliminated via profile likelihood, which naturally handles constraints. As a direct instantiation, we develop a MINimalist Thompson Sampling (MINTS) algorithm. Our framework accommodates structured problems, including continuum-armed Lipschitz bandits and dynamic pricing. It also provides a probabilistic lens on classical convex optimization algorithms such as the center of gravity and ellipsoid methods. We further analyze MINTS for multi-armed bandits and establish near-optimal regret guarantees.
zh

[AI-59] The Impact of Artificial Intelligence on Traditional Art Forms: A Disruption or Enhancement

【Quick Read】: This paper examines whether the introduction of AI into traditional art forms (visual arts, performing arts, and crafts) acts as a disruptive force or an enhancing tool, and how its dual impact should be assessed and steered. The key argument is that AI's impact on traditional art is context-dependent: ethical guidelines, collaborative co-creation models, and inclusive technology development are needed to balance the benefits of democratized creativity, higher productivity, and cultural preservation against risks such as authenticity crises, data-ethics concerns, and socio-economic harms like job displacement, so that AI becomes a mechanism for artistic evolution rather than a substitute for the artist's soul.

Links: https://arxiv.org/abs/2509.07029
Authors: Viswa Chaitanya Marella, Sai Teja Erukude, Suhasnadh Reddy Veluru
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 13 pages

Click to view abstract

Abstract:The introduction of Artificial Intelligence (AI) into the domains of traditional art (visual arts, performing arts, and crafts) has sparked a complicated discussion about whether this might be an agent of disruption or an enhancement of our traditional art forms. This paper looks at the duality of AI, exploring the ways that recent technologies like Generative Adversarial Networks and Diffusion Models, and text-to-image generators are changing the fields of painting, sculpture, calligraphy, dance, music, and the arts of craft. Using examples and data, we illustrate the ways that AI can democratize creative expression, improve productivity, and preserve cultural heritage, while also examining the negative aspects, including: the threats to authenticity within art, ethical concerns around data, and issues including socio-economic factors such as job losses. While we argue for the context-dependence of the impact of AI (the potential for creative homogenization and the devaluation of human agency in artmaking), we also illustrate the potential for hybrid practices featuring AI in cuisine, etc. We advocate for the development of ethical guidelines, collaborative approaches, and inclusive technology development. In sum, we are articulating a vision of AI in which it amplifies our innate creativity while resisting the displacement of the cultural, nuanced, and emotional aspects of traditional art. The future will be determined by human choices about how to govern AI so that it becomes a mechanism for artistic evolution and not a substitute for the artist’s soul.
zh

[AI-60] Contradictions

【Quick Read】: This paper addresses the limitations of binary resolution in classical automated theorem proving (ATP), where each inference step involves only two clauses and eliminates at most two literals, constraining deductive efficiency and expressiveness. The key contribution is the systematic construction of standard contradictions, in particular two principal forms: the maximum triangular standard contradiction and the triangular-type standard contradiction. Building on these structures, the paper proposes a procedure for deciding the satisfiability and unsatisfiability of clause sets and derives formulas for counting the standard sub-contradictions embedded in both forms, providing a methodological basis for contradiction-separation-based dynamic multi-clause automated deduction that extends automated reasoning beyond the classical binary paradigm.

Links: https://arxiv.org/abs/2509.07026
Authors: Yang Xu, Shuwei Chen, Xiaomei Zhong, Jun Liu, Xingxing He
Institution: Unknown
Subjects: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 37 pages, 9 figures

Click to view abstract

Abstract:Trustworthy AI requires reasoning systems that are not only powerful but also transparent and reliable. Automated Theorem Proving (ATP) is central to formal reasoning, yet classical binary resolution remains limited, as each step involves only two clauses and eliminates at most two literals. To overcome this bottleneck, the concept of standard contradiction and the theory of contradiction-separation-based deduction were introduced in 2018. This paper advances that framework by focusing on the systematic construction of standard contradictions. Specifically, this study investigates construction methods for two principal forms of standard contradiction: the maximum triangular standard contradiction and the triangular-type standard contradiction. Building on these structures, we propose a procedure for determining the satisfiability and unsatisfiability of clause sets via maximum standard contradiction. Furthermore, we derive formulas for computing the number of standard sub-contradictions embedded within both the maximum triangular standard contradiction and the triangular-type standard contradiction. The results presented herein furnish the methodological basis for advancing contradiction-separation-based dynamic multi-clause automated deduction, thereby extending the expressive and deductive capabilities of automated reasoning systems beyond the classical binary paradigm.
zh

[AI-61] 1 bit is all we need: binary normalized neural networks

【Quick Read】: This paper addresses the high memory footprint and low computational efficiency of deploying large neural networks such as language models and foundational image models. The core solution is a new layer type, the binary normalized layer, in which all parameters (including kernel weights and biases) take only the values 0 or 1, enabling extreme parameter compression. Crucially, binary normalized variants can be built for many conventional layer types (fully connected, convolutional, attention, etc.) and match the performance of equivalent models with 32-bit floating-point parameters on image classification and language modeling, while cutting memory requirements 32-fold and allowing efficient implementation with 1-bit arrays on existing hardware, without dedicated electronics.

Links: https://arxiv.org/abs/2509.07025
Authors: Eduardo Lobo Lustoda Cabral, Paulo Pirozelli, Larissa Driemeier
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages; 2 figures; 5 tables; 8 algorithms

Click to view abstract

Abstract:The increasing size of large neural network models, specifically language models and foundational image models, poses deployment challenges, prompting efforts to reduce memory requirements and enhance computational efficiency. These efforts are critical to ensure practical deployment and effective utilization of these models across various applications. In this work, a novel type of neural network layers and models is developed that uses only single-bit parameters. In this novel type of models all parameters of all layers, including kernel weights and biases, only have values equal to zero or one. This novel type of models uses layers named as binary normalized layer. These binary normalized layers can be of any type, such as fully connected, convolutional, attention, etc., and they consist of slight variations of the corresponding conventional layers. To show the effectiveness of the binary normalized layers, two different models are configured to solve a multiclass image classification problem and a language decoder to predict the next token of a sequence. The model to solve the image classification has convolutional and fully connected layers, and the language model is composed of transformer blocks with multi-head attention. The results show that models with binary normalized layers present almost the same results obtained by equivalent models with real 32-bit parameters. The binary normalized layers allow to develop models that use 32 times less memory than current models and have equivalent performance. Besides, the binary normalized layers can be easily implemented on current computers using 1-bit arrays, and do not require the development of dedicated electronic hardware. This novel type of layers opens a new era for large neural network models with reduced memory requirements that can be deployed using simple and cheap hardware, such as mobile devices or only cpus.
zh
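
To make the idea concrete, here is a rough forward pass of a "binary normalized" style dense layer: real-valued latent weights are thresholded to {0, 1} and the pre-activation is normalized to keep activations well-scaled. This is our sketch of the general concept, not the paper's exact layer definition:

```python
import numpy as np

def binary_normalized_dense(x, w_real, b_real):
    """Dense layer with 1-bit weights/biases plus normalization."""
    w_bin = (w_real > 0).astype(np.float32)   # weights in {0, 1}
    b_bin = (b_real > 0).astype(np.float32)   # biases in {0, 1}
    z = x @ w_bin + b_bin
    # normalization compensates for the coarse, non-negative weights
    return (z - z.mean(axis=-1, keepdims=True)) / (z.std(axis=-1, keepdims=True) + 1e-5)

x = np.random.randn(4, 64).astype(np.float32)
w = np.random.randn(64, 32).astype(np.float32)  # latent trainable weights
b = np.random.randn(32).astype(np.float32)
print(binary_normalized_dense(x, w, b).shape)   # (4, 32)
```

At inference time only `w_bin` and `b_bin` need to be stored, which is where the 32x memory saving comes from.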

[AI-62] Preventing Another Tessa: Modular Safety Middleware For Health-Adjacent AI Assistants AAAI

【Quick Read】: Using the suspension of the National Eating Disorders Association (NEDA) chatbot Tessa, which gave harmful weight-loss advice, as a case study, this paper addresses the risks of deploying AI in health-adjacent settings without safety engineering. The key solution is a lightweight, modular hybrid safety middleware that combines deterministic lexical gates with an in-line LLM policy filter, enforcing fail-closed verdicts and traceable escalation pathways within a single model call. In synthetic evaluations the design intercepts unsafe prompts perfectly at baseline cost and latency, outperforming traditional multi-stage pipelines; the paper also maps Tessa's failure patterns to established governance frameworks (OWASP LLM Top 10, NIST SP 800-53).

Links: https://arxiv.org/abs/2509.07022
Authors: Pavan Reddy, Nithin Reddy
Institution: Unknown
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 7 pages content, 1 page reference, 1 figure, Accepted at AAAI Fall Symposium Series

Click to view abstract

Abstract:In 2023, the National Eating Disorders Association’s (NEDA) chatbot Tessa was suspended after providing harmful weight-loss advice to vulnerable users-an avoidable failure that underscores the risks of unsafe AI in healthcare contexts. This paper examines Tessa as a case study in absent safety engineering and demonstrates how a lightweight, modular safeguard could have prevented the incident. We propose a hybrid safety middleware that combines deterministic lexical gates with an in-line large language model (LLM) policy filter, enforcing fail-closed verdicts and escalation pathways within a single model call. Using synthetic evaluations, we show that this design achieves perfect interception of unsafe prompts at baseline cost and latency, outperforming traditional multi-stage pipelines. Beyond technical remedies, we map Tessa’s failure patterns to established frameworks (OWASP LLM Top10, NIST SP 800-53), connecting practical safeguards to actionable governance controls. The results highlight that robust, auditable safety in health-adjacent AI does not require heavyweight infrastructure: explicit, testable checks at the last mile are sufficient to prevent “another Tessa”, while governance and escalation ensure sustainability in real-world deployment.
zh
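
The two-tier, fail-closed control flow can be expressed in a few lines. A schematic sketch, in which the block list, the stubbed LLM verdict, and the return labels are all invented for illustration:

```python
import re

BLOCKLIST = re.compile(r"\b(lose weight fast|calorie deficit|restrict)\b", re.I)

def llm_policy_filter(message: str) -> str:
    """Stub for the in-line LLM policy check; a real system would make a
    single model call that returns an 'allow' or 'block' verdict."""
    return "allow"  # placeholder

def safety_middleware(message: str) -> str:
    # Tier 1: deterministic lexical gate -- cheap, testable, auditable
    if BLOCKLIST.search(message):
        return "block_and_escalate"
    # Tier 2: LLM policy filter; any failure falls through to a block
    try:
        verdict = llm_policy_filter(message)
    except Exception:
        return "block_and_escalate"   # fail-closed on errors
    return "respond" if verdict == "allow" else "block_and_escalate"

print(safety_middleware("How do I lose weight fast?"))  # block_and_escalate
```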

[AI-63] An efficient deep reinforcement learning environment for flexible job-shop scheduling

【Quick Read】: This paper addresses inadequate environment modeling in deep reinforcement learning (DRL) approaches to the flexible job-shop scheduling problem (FJSP): existing DRL schedulers focus on agent design while neglecting principled modeling of the DRL environment, which limits scheduling performance. The key contribution is a simple chronological DRL environment based on discrete event simulation, together with an end-to-end DRL scheduling model built on proximal policy optimization (PPO); a compact state representation uses only two state variables of the scheduling environment, and an interpretable reward function is designed from the scheduling area of machines. Experiments on public benchmark instances show that simple priority dispatching rules (PDR) improve in this environment and that the DRL model is competitive with OR-Tools, meta-heuristic, DRL, and PDR methods.

Links: https://arxiv.org/abs/2509.07019
Authors: Xinquan Wu, Xuefeng Yan, Mingqiang Wei, Donghai Guan
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:The Flexible Job-shop Scheduling Problem (FJSP) is a classical combinatorial optimization problem that has a wide-range of applications in the real world. In order to generate fast and accurate scheduling solutions for FJSP, various deep reinforcement learning (DRL) scheduling methods have been developed. However, these methods are mainly focused on the design of DRL scheduling Agent, overlooking the modeling of DRL environment. This paper presents a simple chronological DRL environment for FJSP based on discrete event simulation and an end-to-end DRL scheduling model is proposed based on the proximal policy optimization (PPO). Furthermore, a short novel state representation of FJSP is proposed based on two state variables in the scheduling environment and a novel comprehensible reward function is designed based on the scheduling area of machines. Experimental results on public benchmark instances show that the performance of simple priority dispatching rules (PDR) is improved in our scheduling environment and our DRL scheduling model obtains competing performance compared with OR-Tools, meta-heuristic, DRL and PDR scheduling methods.
zh

[AI-64] Random Forest Stratified K-Fold Cross Validation on SYN DoS Attack SD-IoV

【Quick Read】: This paper addresses the pervasive security threat of TCP SYN flood attacks in Software-Defined Internet of Vehicles (SD-IoV). The key solution is an optimized Random Forest Classifier pipeline: the dataset is preprocessed with feature scaling and label encoding, and Stratified K-Fold cross-validation is used to tune for maximum accuracy and minimal detection time. The tuned model (20 estimators, depth 10) averages 0.999998 across accuracy, precision, recall, and F1-score with a SYN DoS detection time of 0.24 seconds, markedly improving real-time attack detection in SD-IoV networks while preserving network efficiency and reliability.

Links: https://arxiv.org/abs/2509.07016
Authors: Muhammad Arif Hakimi Zamrai, Kamaludin Mohd Yusof
Institution: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:In response to the prevalent concern of TCP SYN flood attacks within the context of Software-Defined Internet of Vehicles (SD-IoV), this study addresses the significant challenge of network security in rapidly evolving vehicular communication systems. This research focuses on optimizing a Random Forest Classifier model to achieve maximum accuracy and minimal detection time, thereby enhancing vehicular network security. The methodology involves preprocessing a dataset containing SYN attack instances, employing feature scaling and label encoding techniques, and applying Stratified K-Fold cross-validation to target key metrics such as accuracy, precision, recall, and F1-score. This research achieved an average value of 0.999998 for all metrics with a SYN DoS attack detection time of 0.24 seconds. Results show that the fine-tuned Random Forest model, configured with 20 estimators and a depth of 10, effectively differentiates between normal and malicious traffic with high accuracy and minimal detection time, which is crucial for SD-IoV networks. This approach marks a significant advancement and introduces a state-of-the-art algorithm in detecting SYN flood attacks, combining high accuracy with minimal detection time. It contributes to vehicular network security by providing a robust solution against TCP SYN flood attacks while maintaining network efficiency and reliability.
zh
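
The evaluation pipeline maps directly onto scikit-learn. A runnable sketch using the reported configuration (20 estimators, depth 10) on stand-in data; the synthetic features below are placeholders for real SYN-flood flow records:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: X are flow features, y marks benign (0) vs. attack (1)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X = StandardScaler().fit_transform(X)  # feature scaling step
clf = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(clf, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall", "f1"])
for name, vals in scores.items():
    if name.startswith("test_"):
        print(name, vals.mean())
```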

[AI-65] Renewable Energy Sources Selection Analysis with the Maximizing Deviation Method

【Quick Read】: This paper addresses multi-criteria renewable energy selection under uncertain, complex, and conflicting conditions, where the vagueness of subjective judgments is hard to quantify with traditional methods. The key idea is to work in a Fermatean fuzzy environment, a generalization of classical fuzzy sets that expresses the uncertainty and imprecision of decision-makers' judgments more flexibly, combined with interval-valued Fermatean fuzzy sets and an optimization model based on deviation maximization to determine partially known attribute weights. The method is applied to renewable energy source selection, accounting for technical feasibility alongside managerial and political considerations.

Links: https://arxiv.org/abs/2509.07011
Authors: Kirisci Murat
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI)
Comments: 25 pages, 5 figures, 6 tables

Click to view abstract

Abstract:Multi-criteria decision-making methods provide decision-makers with appropriate tools to make better decisions in uncertain, complex, and conflicting situations. Fuzzy set theory primarily deals with the uncertainty inherent in human thoughts and perceptions and attempts to quantify this uncertainty. Fuzzy logic and fuzzy set theory are utilized with multi-criteria decision-making methods because they effectively handle uncertainty and fuzziness in decision-makers’ judgments, allowing for verbal judgments of the problem. This study utilizes the Fermatean fuzzy environment, a generalization of fuzzy sets. An optimization model based on the deviation maximization method is proposed to determine partially known feature weights. This method is combined with interval-valued Fermatean fuzzy sets. The proposed method was applied to the problem of selecting renewable energy sources. The reason for choosing renewable energy sources is that meeting energy needs from renewable sources, balancing carbon emissions, and mitigating the effects of global climate change are among the most critical issues of the recent period. Even though selecting renewable energy sources is a technical issue, the managerial and political implications of this issue are also important, and are discussed in this study.
zh
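
The weighting step is easiest to see with crisp numbers: a criterion that separates the alternatives more receives a larger weight. A simplified sketch of the maximizing deviation method (the paper works with interval-valued Fermatean fuzzy numbers, which this version omits; the decision matrix is invented):

```python
import numpy as np

def maximizing_deviation_weights(D):
    """Crisp-number maximizing deviation method.
    D: normalized decision matrix (alternatives x criteria).
    Each criterion's weight is proportional to the total pairwise
    deviation of the alternatives' scores on that criterion."""
    m, n = D.shape
    dev = np.array([np.abs(D[:, j][:, None] - D[:, j][None, :]).sum()
                    for j in range(n)])
    return dev / dev.sum()

D = np.array([[0.7, 0.4, 0.9],
              [0.6, 0.8, 0.3],
              [0.9, 0.5, 0.5]])   # 3 energy sources, 3 criteria (toy values)
print(maximizing_deviation_weights(D))
```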

[AI-66] FediLoRA: Heterogeneous LoRA for Federated Multimodal Fine-tuning under Missing Modalities

【Quick Read】: This paper addresses performance degradation in federated multimodal learning caused by heterogeneous client resources and missing modalities: existing federated LoRA methods typically assume uniform LoRA rank configurations and unimodal inputs, ignoring real-world differences in client compute and potentially incomplete modalities. The key solution, FediLoRA, has two components: a dimension-wise aggregation strategy that reweights LoRA updates so information is not diluted during aggregation, and a lightweight layer-wise model editing mechanism that selectively incorporates global parameters to repair local components, improving both client and global model performance.

Links: https://arxiv.org/abs/2509.06984
Authors: Lishan Yang, Nam Kha Nguygen, Po Hu, Wei Emma Zhang, Yanjun Shu, Mong Yuan Sim, Weitong Chen
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 7 figures

Click to view abstract

Abstract:Foundation models have demonstrated remarkable performance across a wide range of tasks, yet their large parameter sizes pose challenges for practical deployment, especially in decentralized environments. Parameter-efficient fine-tuning (PEFT), such as Low-Rank Adaptation (LoRA), reduces local computing and memory overhead, making it attractive for federated learning. However, existing federated LoRA methods typically assume uniform rank configurations and unimodal inputs, overlooking two key real-world challenges: (1) heterogeneous client resources have different LoRA ranks, and (2) multimodal data settings with potentially missing modalities. In this work, we propose FediLoRA, a simple yet effective framework for federated multimodal fine-tuning under heterogeneous LoRA ranks and missing modalities. FediLoRA introduces a dimension-wise aggregation strategy that reweights LoRA updates without information dilution during aggregation. It also includes a lightweight layer-wise model editing method that selectively incorporates global parameters to repair local components which improves both client and global model performances. Experimental results on three multimodal benchmark datasets demonstrate that FediLoRA achieves superior performance over competitive baselines in both global and personalized settings, particularly in the presence of modality incompleteness.
zh
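
One way to aggregate LoRA updates of different ranks without dilution is to zero-pad to the maximum rank and normalize each rank dimension only by the clients that actually populate it. A sketch of this idea under those assumptions (the paper's exact reweighting may differ):

```python
import numpy as np

def aggregate_lora(updates, weights):
    """Dimension-wise aggregation of heterogeneous-rank LoRA updates.
    updates: list of (A, B) with A: (r_i, d_in), B: (d_out, r_i)."""
    r_max = max(A.shape[0] for A, _ in updates)
    d_in, d_out = updates[0][0].shape[1], updates[0][1].shape[0]
    A_sum = np.zeros((r_max, d_in))
    B_sum = np.zeros((d_out, r_max))
    mass = np.zeros(r_max)  # total client weight per rank dimension
    for (A, B), w in zip(updates, weights):
        r = A.shape[0]
        A_sum[:r] += w * A
        B_sum[:, :r] += w * B
        mass[:r] += w
    mass = np.maximum(mass, 1e-12)
    return A_sum / mass[:, None], B_sum / mass[None, :]

ups = [(np.random.randn(4, 16), np.random.randn(8, 4)),   # rank-4 client
       (np.random.randn(8, 16), np.random.randn(8, 8))]   # rank-8 client
A, B = aggregate_lora(ups, weights=[0.5, 0.5])
print(A.shape, B.shape)   # (8, 16) (8, 8)
```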

[AI-67] RLFactory: A Plug-and-Play Reinforcement Learning Post-Training Framework for LLM Multi-Turn Tool-Use

【Quick Read】: This paper addresses two core challenges for large language models (LLMs) in multi-turn tool use: the stability and adaptability of tool calls under tool heterogeneity and interface inconsistencies, and diverse evaluation needs that are hard to measure uniformly in complex interactive settings. The key innovations of the proposed RLFactory framework are: an asyncio-based asynchronous caller with a decoupled tool/training architecture for robust, scalable tool invocation; a reward layer supporting rule-based, model-judgment, and tool-verification signals for flexible, multi-dimensional evaluation; and a reconstructed MDP that introduces observation markers from tool feedback, closing a generate-parse-invoke-update loop for dynamic policy optimization. On the Search-R1 benchmark with Qwen3-4B it surpasses larger models trained with similar techniques and raises training throughput by 6.8x.

Links: https://arxiv.org/abs/2509.06980
Authors: Jiajun Chai, Guojun Yin, Zekun Xu, Chuhuai Yue, Yi Jia, Siyu Xia, Xiaohan Wang, Jiwen Jiang, Xiaoguang Li, Chengqi Dong, Hang He, Wei Lin
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Large language models excel at basic reasoning but struggle with tasks that require interaction with external tools. We present RLFactory, a plug-and-play reinforcement learning post-training framework for multi-round tool use. RLFactory tackles (i) tool-call stability and adaptability amid tool heterogeneity and interface issues via an asyncio-based asynchronous caller and a decoupled tool/training architecture, and (ii) diverse evaluation needs via a reward layer supporting rule-based, model-judgment, and tool-verification signals. It reconstructs the MDP by introducing observation markers from tool feedback, closing the loop among model, tools, and environment, and implements a generate-parse-invoke-update workflow for dynamic policy optimization. On Search-R1 with Qwen3-4B, RLFactory achieves a 0.486 test score on the Natural Questions (NQ) dataset, surpassing larger models trained with similar techniques (e.g., Qwen2.5-7B-Instruct-GRPO at 0.473), and increases training throughput by 6.8x. RLFactory provides a low-barrier, highly adaptable framework for strengthening multi-round tool use of LLMs in real-world scenarios. Code: this https URL.
zh
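
The generate-parse-invoke-update loop with an asynchronous tool caller can be sketched with plain asyncio. Everything below (the toy policy, tool, and observation markers) is illustrative, not RLFactory's API:

```python
import asyncio
import json

async def call_tool(name: str, args: dict) -> str:
    """Stand-in asynchronous tool; real tools would be remote calls."""
    await asyncio.sleep(0.01)
    return json.dumps({"tool": name, "result": sum(args.get("xs", []))})

async def rollout(model_step, max_turns=4):
    """Generate -> parse -> invoke -> update loop: tool feedback is
    wrapped in observation markers and appended to the context."""
    context = []
    for _ in range(max_turns):
        action = model_step(context)                       # generate + parse
        if action["type"] == "final":
            return action["answer"]
        obs = await call_tool(action["tool"], action["args"])  # invoke
        context.append(f"<obs>{obs}</obs>")                # update
    return None

def toy_policy(context):
    if not context:
        return {"type": "tool", "tool": "adder", "args": {"xs": [1, 2, 3]}}
    return {"type": "final", "answer": context[-1]}

print(asyncio.run(rollout(toy_policy)))
```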

[AI-68] Exploring Over-stationarization in Deep Learning-based Bus/Tram Arrival Time Prediction: Analysis and Non-stationary Effect Recovery

【Quick Read】: This paper addresses performance degradation in multi-step arrival time prediction (ATP) caused by non-stationary data, where the joint distribution of variables shifts over time. Existing approaches normalize the series to remove non-stationarity and improve predictability, but risk over-stationarization, which obscures useful characteristics inherent in the data. The key solution is a non-stationary ATP (NSATP) method with two stages: series stationarization to improve predictability, followed by non-stationarity effect recovery, which extends a state-of-the-art method from one-dimensional to two-dimensional modeling to capture hidden periodicity and adds a compensation module that learns scaling and shifting factors from the raw data, balancing predictability against preservation of non-stationary information. On 125 days of Dresden public transport data, NSATP reduces RMSE, MAE, and MAPE for both trams and buses relative to baselines.

Links: https://arxiv.org/abs/2509.06979
Authors: Zirui Li, Bin Yang, Meng Wang
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 26 pages, 13 figures

Click to view abstract

Abstract:Arrival time prediction (ATP) of public transport vehicles is essential in improving passenger experience and supporting traffic management. Deep learning has demonstrated outstanding performance in ATP due to its ability to model non-linear and temporal dynamics. In the multi-step ATP, non-stationary data will degrade the model performance due to the variation in variables’ joint distribution along the temporal direction. Previous studies mainly applied normalization to eliminate the non-stationarity in time series, thereby achieving better predictability. However, the normalization may obscure useful characteristics inherent in non-stationarity, which is known as the over-stationarization. In this work, to trade off predictability and non-stationarity, a new approach for multi-step ATP, named non-stationary ATP ( NSATP), is proposed. The method consists of two stages: series stationarization and non-stationarity effect recovery. The first stage aims at improving the predictability. As for the latter, NSATP extends a state-of-the-art method from one-dimensional to two dimensional based models to capture the hidden periodicity in time series and designs a compensation module of over-stationarization by learning scaling and shifting factors from raw data. 125 days’ public transport operational data of Dresden is collected for validation. Experimental results show that compared to baseline methods, the proposed NSATP can reduce RMSE, MAE, and MAPE by 2.37%, 1.22%, and 2.26% for trams and by 1.72%, 0.60%, and 1.17% for buses, respectively.
zh
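
The stationarize-then-recover pattern is compact to express. A minimal sketch, in which the fixed scale/shift constants stand in for factors that a compensation network would learn from the raw series:

```python
import numpy as np

def stationarize(x):
    """Per-series instance normalization applied before prediction."""
    mu, sigma = x.mean(), x.std() + 1e-8
    return (x - mu) / sigma, (mu, sigma)

def recover(y_hat_norm, stats, scale=1.0, shift=0.0):
    """Undo normalization, with scale/shift compensation for
    over-stationarization (learned factors in the real model)."""
    mu, sigma = stats
    return (y_hat_norm * scale + shift) * sigma + mu

x = np.cumsum(np.random.randn(96)) + 50.0  # toy non-stationary series
x_norm, stats = stationarize(x)
y_hat_norm = x_norm[-1]                    # placeholder model output
print(recover(y_hat_norm, stats))
```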

[AI-69] Toward Reproducible Cross-Backend Compatibility for Deep Learning: A Configuration-First Framework with Three-Tier Verification

【Quick Read】: This paper addresses cross-backend drift in deep learning systems deployed across CPUs, GPUs, and compiled runtimes: inconsistent model outputs across hardware and software environments that undermine reproducibility and reliability. The key solution is a configuration-first evaluation framework that decouples experiments from code via YAML files and applies a three-tier verification protocol (tensor-level closeness, activation alignment, and task-level metrics) for systematic quantification. Across 672 checks, 72.0% of runs pass, with detection models and compiled backends most prone to drift, often due to nondeterministic post-processing; deterministic adapters and selective fallbacks substantially improve agreement without significant performance loss, yielding a reproducible methodology for dependable deployment on heterogeneous runtimes.

Links: https://arxiv.org/abs/2509.06977
Authors: Zehua Li
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures, 3 tables, appendix, code available at this https URL

Click to view abstract

Abstract:This paper presents a configuration-first framework for evaluating cross-backend compatibility in deep learning systems deployed on CPU, GPU, and compiled runtimes. The framework decouples experiments from code using YAML, supports both library and repository models, and employs a three-tier verification protocol covering tensor-level closeness, activation alignment, and task-level metrics. Through 672 checks across multiple models and tolerance settings, we observe that 72.0% of runs pass, with most discrepancies occurring under stricter thresholds. Our results show that detection models and compiled backends are particularly prone to drift, often due to nondeterministic post-processing. We further demonstrate that deterministic adapters and selective fallbacks can substantially improve agreement without significant performance loss. To our knowledge, this is the first unified framework that systematically quantifies and mitigates cross-backend drift in deep learning, providing a reproducible methodology for dependable deployment across heterogeneous runtimes.
zh
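
The three tiers correspond to progressively coarser notions of agreement. A minimal sketch of what each check might compute (tolerances and similarity measures are our choices, not the paper's exact configuration):

```python
import numpy as np

def tier1_tensor_closeness(out_a, out_b, rtol=1e-4, atol=1e-5):
    """Elementwise numerical agreement of raw outputs."""
    return np.allclose(out_a, out_b, rtol=rtol, atol=atol)

def tier2_activation_alignment(acts_a, acts_b):
    """Cosine similarity between flattened intermediate activations."""
    a, b = acts_a.ravel(), acts_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def tier3_task_metric(preds_a, preds_b):
    """Task-level agreement, e.g. fraction of matching class labels."""
    return float((preds_a == preds_b).mean())

cpu = np.random.rand(8, 10)
gpu = cpu + 1e-6 * np.random.randn(8, 10)   # simulated backend drift
print(tier1_tensor_closeness(cpu, gpu),
      tier2_activation_alignment(cpu, gpu),
      tier3_task_metric(cpu.argmax(1), gpu.argmax(1)))
```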

[AI-70] A Knowledge-Guided Cross-Modal Feature Fusion Model for Local Traffic Demand Prediction

【Quick Read】: This paper addresses the limitation that existing traffic demand prediction models rely mainly on temporal traffic data and rarely incorporate human knowledge and experience, constraining accuracy and robustness. The key solution is a knowledge-guided cross-modal feature representation learning (KGCM) model that combines structured temporal traffic data with textual data encoding human knowledge and experience; a prior-knowledge dataset is built with a large language model plus manual authoring and revision, covering regional and global knowledge. Designed local and global adaptive graph networks and a cross-modal feature fusion mechanism learn multimodal features effectively, and a reasoning-based dynamic update strategy optimizes the graph model's parameters, significantly improving traffic demand prediction over existing SOTA models.

Links: https://arxiv.org/abs/2509.06976
Authors: Lingyu Zhang, Pengfei Xu, Guobin Wu, Jian Liang, Ruiyang Dong, Yunhai Wang, Xuan Song
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Traffic demand prediction plays a critical role in intelligent transportation systems. Existing traffic prediction models primarily rely on temporal traffic data, with limited efforts incorporating human knowledge and experience for urban traffic demand forecasting. However, in real-world scenarios, traffic knowledge and experience derived from human daily life significantly influence precise traffic prediction. Such knowledge and experiences can guide the model in uncovering latent patterns within traffic data, thereby enhancing the accuracy and robustness of predictions. To this end, this paper proposes integrating structured temporal traffic data with textual data representing human knowledge and experience, resulting in a novel knowledge-guided cross-modal feature representation learning (KGCM) model for traffic demand prediction. Based on regional transportation characteristics, we construct a prior knowledge dataset using a large language model combined with manual authoring and revision, covering both regional and global knowledge and experiences. The KGCM model then learns multimodal data features through designed local and global adaptive graph networks, as well as a cross-modal feature fusion mechanism. A proposed reasoning-based dynamic update strategy enables dynamic optimization of the graph model’s parameters, achieving optimal performance. Experiments on multiple traffic datasets demonstrate that our model accurately predicts future traffic demand and outperforms existing state-of-the-art (SOTA) models.
zh

[AI-71] GSTBench: A Benchmark Study on the Transferability of Graph Self-Supervised Learning CIKM’25

【Quick Read】: This paper addresses the lack of systematic study of cross-dataset transferability in graph self-supervised learning (graph SSL): most existing methods are developed and evaluated on a single dataset, leaving knowledge transfer and large-scale pretraining underexplored and limiting progress toward generalized intelligence beyond fitting the training data. The key contribution is GSTBench, the first systematic benchmark for the transferability of graph SSL methods: five representative methods are pretrained at scale on ogbn-papers100M and evaluated on diverse target graphs under a standardized setup that decouples confounders such as model architecture, dataset characteristics, and adaptation protocols, so comparisons focus purely on pretraining objectives. Most graph SSL methods struggle to generalize, some performing worse than random initialization, while GraphMAE, a masked autoencoder approach, consistently improves transfer, grounding future work on the "pretrain-then-transfer" paradigm in graph learning.

Links: https://arxiv.org/abs/2509.06975
Authors: Yu Song, Zhigang Hua, Yan Xie, Jingzhe Liu, Bo Long, Hui Liu
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted at CIKM'25

Click to view abstract

Abstract:Self-supervised learning (SSL) has shown great promise in graph representation learning. However, most existing graph SSL methods are developed and evaluated under a single-dataset setting, leaving their cross-dataset transferability largely unexplored and limiting their ability to leverage knowledge transfer and large-scale pretraining, factors that are critical for developing generalized intelligence beyond fitting training data. To address this gap and advance foundation model research for graphs, we present GSTBench, the first systematic benchmark for evaluating the transferability of graph SSL methods. We conduct large-scale pretraining on ogbn-papers100M and evaluate five representative SSL methods across a diverse set of target graphs. Our standardized experimental setup decouples confounding factors such as model architecture, dataset characteristics, and adaptation protocols, enabling rigorous comparisons focused solely on pretraining objectives. Surprisingly, we observe that most graph SSL methods struggle to generalize, with some performing worse than random initialization. In contrast, GraphMAE, a masked autoencoder approach, consistently improves transfer performance. We analyze the underlying factors that drive these differences and offer insights to guide future research on transferable graph SSL, laying a solid foundation for the “pretrain-then-transfer” paradigm in graph learning. Our code is available at this https URL.
zh

[AI-72] Individualized and Interpretable Sleep Forecasting via a Two-Stage Adaptive Spatial-Temporal Model

【Quick Read】: This paper addresses personalization and interpretability in sleep-quality forecasting, in particular weak generalization and the inability to adapt to new users given sparse wearable-device data. The key solution is a two-stage adaptive spatial-temporal framework: multi-scale convolutional layers capture spatial interactions among input variables, and recurrent layers with attention model long-term temporal dependencies; a two-stage domain adaptation strategy applies domain alignment during training to mitigate overfitting and source-free, label-free test-time adaptation to transfer quickly to new users. The model outperforms LSTM, Informer, PatchTST, and TimesNet across combinations of input and prediction windows, remains interpretable, and suits individualized sleep-quality forecasting from commercial wearables.

Links: https://arxiv.org/abs/2509.06974
Authors: Xueyi Wang, Elisabeth Wilhelm
Institution: Unknown
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Sleep quality significantly impacts well-being. Therefore, healthcare providers and individuals need accessible and reliable forecasting tools for preventive interventions. This paper introduces an interpretable, individualized two-stage adaptive spatial-temporal model for predicting sleep quality scores. Our proposed framework combines multi-scale convolutional layers to model spatial interactions across multiple input variables, recurrent layers and attention mechanisms to capture long-term temporal dependencies, and a two-stage domain adaptation strategy to enhance generalization. The first adaptation stage is applied during training to mitigate overfitting on the training set. In the second stage, a source-free test-time adaptation mechanism is employed to adapt the model to new users without requiring labels. We conducted various experiments with five input window sizes (3, 5, 7, 9, and 11 days) and five prediction window sizes (1, 3, 5, 7, and 9 days). Our model consistently outperformed time series forecasting baseline approaches, including Long Short-Term Memory (LSTM), Informer, PatchTST, and TimesNet. The best performance was achieved with a three-day input window and a one-day prediction window, yielding a root mean square error (RMSE) of 0.216. Furthermore, the model demonstrated good predictive performance even for longer forecasting horizons (e.g., a 0.257 RMSE for a three-day prediction window), highlighting its practical utility for real-world applications. We also conducted an explainability analysis to examine how different features influence sleep quality. These findings show that the proposed framework offers a robust, adaptive, and explainable solution for personalized sleep forecasting using sparse data from commercial wearable devices.
zh

[AI-73] A data-driven discretized CS:GO simulation environment to facilitate strategic multi-agent planning research

【Quick Read】: This paper addresses the tension between high fidelity and computational efficiency when simulating complex multi-agent interactions. The key idea behind the proposed DECOY framework is to abstract strategic, long-horizon planning in 3D terrains into a high-level discretized simulation while preserving low-level environmental detail: a waypoint system simplifies and discretizes continuous states and actions, and neural predictive and generative models trained on real CS:GO tournament data reconstruct event outcomes. Gameplay can thus be simulated accurately from movement decisions alone, without explicitly modeling low-level mechanics such as aiming and shooting, and replays generated from human data closely match those observed in the original game.

Links: https://arxiv.org/abs/2509.06355
Authors: Yunzhe Wang, Volkan Ustun, Chris McGroarty
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted at the Winter Simulation Conference 2025, December, Seattle USA

Click to view abstract

Abstract:Modern simulation environments for complex multi-agent interactions must balance high-fidelity detail with computational efficiency. We present DECOY, a novel multi-agent simulator that abstracts strategic, long-horizon planning in 3D terrains into high-level discretized simulation while preserving low-level environmental fidelity. Using Counter-Strike: Global Offensive (CS:GO) as a testbed, our framework accurately simulates gameplay using only movement decisions as tactical positioning – without explicitly modeling low-level mechanics such as aiming and shooting. Central to our approach is a waypoint system that simplifies and discretizes continuous states and actions, paired with neural predictive and generative models trained on real CS:GO tournament data to reconstruct event outcomes. Extensive evaluations show that replays generated from human data in DECOY closely match those observed in the original game. Our publicly available simulation environment provides a valuable tool for advancing research in strategic multi-agent planning and behavior generation.
zh

[AI-74] VoltanaLLM : Feedback-Driven Frequency Control and State-Space Routing for Energy-Efficient LLM Serving

【Quick Read】: This paper addresses the growing energy cost of large language model (LLM) inference, aiming for sustainable, cost-effective deployment; the core challenge is reducing energy consumption while meeting service level objectives (SLOs). Taking a control-theory perspective, the key idea is VoltanaLLM, which co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, exploiting their decoupled execution for fine-grained, phase-specific control: a feedback-driven frequency controller dynamically adapts GPU frequency for the prefill and decode phases, and a state-space router explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. Implemented in SGLang, the system achieves up to 36.3% energy savings while maintaining a near-perfect SLO attainment rate.

Links: https://arxiv.org/abs/2509.04827
Authors: Jiahuan Yu (1), Aryan Taneja (1), Junfeng Lin (2), Minjia Zhang (1) ((1) University of Illinois Urbana-Champaign, (2) Tsinghua University)
Institution: Unknown
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern Large Language Model (LLM) serving systems increasingly support interactive applications, like real-time chat assistants, code generation tools, and agentic workflows. However, the soaring energy cost of LLM inference presents a growing challenge for sustainable and cost-effective deployment. This paper introduces VoltanaLLM, a system for SLO-aware, energy-efficient LLM serving, built from a control theory perspective. VoltanaLLM co-designs frequency scaling and request routing in emerging prefill/decode disaggregated architectures, leveraging their decoupled execution to enable fine-grained phase-specific control. It consists of a feedback-driven frequency controller that dynamically adapts GPU frequency for prefill and decode phases, and a state-space router that explores routing decisions across frequency-scaled instances to minimize energy under latency constraints. We implement VoltanaLLM in SGLang and evaluate its performance over multiple state-of-the-art LLMs and real-world datasets. The results demonstrate that VoltanaLLM achieves up to 36.3% energy savings while maintaining near-perfect SLO attainment rate, paving the way for sustainable and intelligent LLM serving.
zh
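
A feedback frequency controller of this general shape raises the clock when latency approaches the SLO and lowers it when there is headroom. A toy sketch: the frequencies, step size, and margin below are illustrative placeholders, not VoltanaLLM's actual policy:

```python
def frequency_controller(freq_mhz, latency_ms, slo_ms,
                         f_min=800, f_max=1980, step=60, margin=0.9):
    """One feedback iteration of an SLO-aware frequency controller."""
    if latency_ms > margin * slo_ms:
        return min(freq_mhz + step, f_max)   # protect the SLO
    return max(freq_mhz - step, f_min)       # harvest energy savings

f = 1400
for lat in [30.0, 38.0, 46.0, 33.0]:         # observed decode latencies (ms)
    f = frequency_controller(f, lat, slo_ms=50.0)
    print(f)
```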

[AI-75] VISION: Robust and Interpretable Code Vulnerability Detection Leveraging Counterfactual Augmentation

【Quick Read】: This paper addresses spurious correlations learned by graph neural network (GNN) vulnerability detectors due to imbalanced training data and label noise, which hurt robustness and generalization to unseen real-world code. The key solution is VISION, a unified framework that mitigates spurious learning by systematically augmenting a counterfactual training set: (i) a large language model (LLM) is prompted to generate code samples with minimal semantic modifications but opposite labels; (ii) the GNN is trained on paired code examples with opposite labels; and (iii) graph-based interpretability identifies the code statements crucial for vulnerability predictions while ignoring spurious ones. On the CWE-20 vulnerability, VISION lifts overall accuracy from 51.8% to 97.8%, pairwise contrast accuracy from 4.5% to 95.8%, and worst-group accuracy from 0.7% to 85.5%, and the work releases the CWE-20-CFA benchmark of 27,556 functions.

Links: https://arxiv.org/abs/2508.18933
Authors: David Egea, Barproda Halder, Sanghamitra Dutta
Institution: Unknown
Subjects: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Automated detection of vulnerabilities in source code is an essential cybersecurity challenge, underpinning trust in digital systems and services. Graph Neural Networks (GNNs) have emerged as a promising approach as they can learn structural and logical code relationships in a data-driven manner. However, their performance is severely constrained by training data imbalances and label noise. GNNs often learn ‘spurious’ correlations from superficial code similarities, producing detectors that fail to generalize well to unseen real-world data. In this work, we propose a unified framework for robust and interpretable vulnerability detection, called VISION, to mitigate spurious correlations by systematically augmenting a counterfactual training dataset. Counterfactuals are samples with minimal semantic modifications but opposite labels. Our framework includes: (i) generating counterfactuals by prompting a Large Language Model (LLM); (ii) targeted GNN training on paired code examples with opposite labels; and (iii) graph-based interpretability to identify the crucial code statements relevant for vulnerability predictions while ignoring spurious ones. We find that VISION reduces spurious learning and enables more robust, generalizable detection, improving overall accuracy (from 51.8% to 97.8%), pairwise contrast accuracy (from 4.5% to 95.8%), and worst-group accuracy (from 0.7% to 85.5%) on the Common Weakness Enumeration (CWE)-20 vulnerability. We further demonstrate gains using proposed metrics: intra-class attribution variance, inter-class attribution distance, and node score dependency. We also release CWE-20-CFA, a benchmark of 27,556 functions (real and counterfactual) from the high-impact CWE-20 category. Finally, VISION advances transparent and trustworthy AI-based cybersecurity systems through interactive visualization for human-in-the-loop analysis.
zh

[AI-76] Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

【Quick Read】: This paper asks how well-being should be prioritised in society and what trade-offs individuals are willing to make between fairness and personal well-being. The key approach is a stated preference experiment with a nationally representative UK sample (n = 300), in which participants evaluate life-satisfaction outcomes for themselves and others under uncertainty; individual-level utility functions are estimated within an Expected Utility Maximisation (EUM) framework and tested for overweighting of small probabilities, the hallmark of Cumulative Prospect Theory (CPT). Most participants show concave (risk-averse) utility curves and stronger aversion to inequality in societal life satisfaction than to personal risk, with preferences unrelated to political alignment, indicating a cross-ideological consensus on fair well-being. This challenges average life satisfaction as a policy metric and supports nonlinear utility-based alternatives that better reflect collective human values.

Links: https://arxiv.org/abs/2509.07793
Authors: Crispin Cooper, Ana Friedrich, Tommaso Reggiani, Wouter Poortinga
Institution: Unknown
Subjects: General Economics (econ.GN); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 28 pages, 4 figures

Click to view abstract

Abstract:How should well-being be prioritised in society, and what trade-offs are people willing to make between fairness and personal well-being? We investigate these questions using a stated preference experiment with a nationally representative UK sample (n = 300), in which participants evaluated life satisfaction outcomes for both themselves and others under conditions of uncertainty. Individual-level utility functions were estimated using an Expected Utility Maximisation (EUM) framework and tested for sensitivity to the overweighting of small probabilities, as characterised by Cumulative Prospect Theory (CPT). A majority of participants displayed concave (risk-averse) utility curves and showed stronger aversion to inequality in societal life satisfaction outcomes than to personal risk. These preferences were unrelated to political alignment, suggesting a shared normative stance on fairness in well-being that cuts across ideological boundaries. The results challenge the use of average life satisfaction as a policy metric and support the development of nonlinear utility-based alternatives that more accurately reflect collective human values. Implications for public policy, well-being measurement, and the design of value-aligned AI systems are discussed.
zh

[AI-77] Variational Quantum Circuits in Offline Contextual Bandit Problems

【Quick Read】: This paper addresses offline contextual bandit problems in industrial optimization tasks, i.e., learning optimal decision policies from historical data without online interaction. The key idea is to build quantum regression models with variational quantum circuits (VQCs), search for optimal configurations efficiently via particle swarm optimization (PSO), and validate the models' ability to fit complex reward functions and generalize under noisy and sparse data, using the Industrial Benchmark (IB) environment to compare against classical models. The results provide a proof of concept for applying VQCs to offline contextual bandits in industrial optimization.

Links: https://arxiv.org/abs/2509.07633
Authors: Lukas Schulte, Daniel Hein, Steffen Udluft, Thomas A. Runkler
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This paper explores the application of variational quantum circuits (VQCs) for solving offline contextual bandit problems in industrial optimization tasks. Using the Industrial Benchmark (IB) environment, we evaluate the performance of quantum regression models against classical models. Our findings demonstrate that quantum models can effectively fit complex reward functions, identify optimal configurations via particle swarm optimization (PSO), and generalize well in noisy and sparse datasets. These results provide a proof of concept for utilizing VQCs in offline contextual bandit problems and highlight their potential in industrial optimization tasks.
zh

[AI-78] From Classical Data to Quantum Advantage – Quantum Policy Evaluation on Quantum Hardware

【Quick Read】: This paper addresses the difficulty of obtaining environment parameters automatically for quantum policy evaluation (QPE): how to learn, from classical observational data, a quantum environment model that can run directly on quantum hardware. The key idea is to use quantum machine learning (QML) to train and optimize the quantum environment's parameters from a batch of classical data on quantum hardware, then apply the learned environment in QPE so that policy evaluation itself also runs on quantum hardware, yielding an end-to-end quantum-enhanced reinforcement learning pipeline. Despite challenges such as noise and short coherence times, the integration of QML and QPE shows promising potential for quantum advantage in RL.

Links: https://arxiv.org/abs/2509.07614
Authors: Daniel Hein, Simon Wiedemann, Markus Baumann, Patrik Felbinger, Justin Klein, Maximilian Schieder, Jonas Stein, Daniëlle Schuman, Thomas Cope, Steffen Udluft
Institution: Unknown
Subjects: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:Quantum policy evaluation (QPE) is a reinforcement learning (RL) algorithm which is quadratically more efficient than an analogous classical Monte Carlo estimation. It makes use of a direct quantum mechanical realization of a finite Markov decision process, in which the agent and the environment are modeled by unitary operators and exchange states, actions, and rewards in superposition. Previously, the quantum environment has been implemented and parametrized manually for an illustrative benchmark using a quantum simulator. In this paper, we demonstrate how these environment parameters can be learned from a batch of classical observational data through quantum machine learning (QML) on quantum hardware. The learned quantum environment is then applied in QPE to also compute policy evaluations on quantum hardware. Our experiments reveal that, despite challenges such as noise and short coherence times, the integration of QML and QPE shows promising potential for achieving quantum advantage in RL.
zh

[AI-79] Benchmarking Universal Interatomic Potentials on Zeolite Structures

【Quick Read】: This paper investigates the applicability and accuracy of universal interatomic potentials (IPs) for high-throughput screening of zeolites across chemistries: pure-silica frameworks, aluminosilicates, and systems with metal or organic cations. The key contribution is a systematic benchmark of two classes of universal IPs, (i) universal analytic IPs (GFN-FF, UFF, Dreiding) and (ii) pretrained machine-learning IPs (CHGNet, ORB-v3, MatterSim, eSEN-30M-OAM, PFP-v7, EquiformerV2-lE4-lF100-S2EFS-OC22), compared against validated tailor-made IPs (SLC, ClayFF, BSFF), with experimental data and dispersion-corrected density functional theory (DFT) as references. Modern pretrained universal MLIPs, especially eSEN-30M-OAM, show consistently high accuracy across all zeolite structures studied, supporting their use as reliable tools in zeolite screening workflows.

Links: https://arxiv.org/abs/2509.07417
Authors: Shusuke Ito, Koki Muraoka, Akira Nakayama
Institution: Unknown
Subjects: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph)
Comments: 26 pages, 9 figures

Click to view abstract

Abstract:Interatomic potentials (IPs) with wide elemental coverage and high accuracy are powerful tools for high-throughput materials discovery. While the past few years witnessed the development of multiple new universal IPs that cover wide ranges of the periodic table, their applicability to target chemical systems should be carefully investigated. We benchmark several universal IPs using equilibrium zeolite structures as testbeds. We select a diverse set of universal IPs encompassing two major categories: (i) universal analytic IPs, including GFN-FF, UFF, and Dreiding; (ii) pretrained universal machine learning IPs (MLIPs), comprising CHGNet, ORB-v3, MatterSim, eSEN-30M-OAM, PFP-v7, and EquiformerV2-lE4-lF100-S2EFS-OC22. We compare them with established tailor-made IPs, SLC, ClayFF, and BSFF using experimental data and density functional theory (DFT) calculations with dispersion correction as the reference. The tested zeolite structures comprise pure silica frameworks and aluminosilicates containing copper species, potassium, and organic cations. We found that GFN-FF is the best among the tested universal analytic IPs, but it does not achieve satisfactory accuracy for highly strained silica rings and aluminosilicate systems. All MLIPs can well reproduce experimental or DFT-level geometries and energetics. Among the universal MLIPs, the eSEN-30M-OAM model shows the most consistent performance across all zeolite structures studied. These findings show that the modern pretrained universal MLIPs are practical tools in zeolite screening workflows involving various compositions.
zh

[AI-80] Toward Lifelong-Sustainable Electronic-Photonic AI Systems via Extreme Efficiency, Reconfigurability, and Robustness

【Quick Read】: This paper addresses the tension between the soaring computational demands of large-scale AI and the energy, bandwidth, and scaling limits of conventional electronic platforms, together with the urgent need for sustainable computing: how to build next-generation AI hardware with higher energy efficiency and a lower carbon footprint without sacrificing performance. The key idea is to exploit the inherent advantages of electronic-photonic integrated circuits (EPICs) and amplify them through electronic-photonic design automation (EPDA) and cross-layer co-design: EPDA tools generate more compact layouts that reduce chip area and metal-layer usage, while device-circuit-architecture co-design enables ultra-compact photonic circuits, reconfigurable hardware topologies that adapt to evolving AI workloads, and intelligent resilience mechanisms that tolerate variations and faults to extend lifetime. The paper thus outlines a vision for lifelong-sustainable electronic-photonic AI systems that combine high performance with a low carbon footprint.

Links: https://arxiv.org/abs/2509.07396
Authors: Ziang Yin, Hongjian Zhou, Chetan Choppali Sudarshan, Vidya Chhabria, Jiaqi Gu
Institution: Unknown
Subjects: Optics (physics.optics); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: 8 pages

Click to view abstract

Abstract:The relentless growth of large-scale artificial intelligence (AI) has created unprecedented demand for computational power, straining the energy, bandwidth, and scaling limits of conventional electronic platforms. Electronic-photonic integrated circuits (EPICs) have emerged as a compelling platform for next-generation AI systems, offering inherent advantages in ultra-high bandwidth, low latency, and energy efficiency for computing and interconnection. Beyond performance, EPICs also hold unique promises for sustainability. Fabricated in relaxed process nodes with fewer metal layers and lower defect densities, photonic devices naturally reduce embodied carbon footprint (CFP) compared to advanced digital electronic integrated circuits, while delivering orders-of-magnitude higher computing performance and interconnect bandwidth. To further advance the sustainability of photonic AI systems, we explore how electronic-photonic design automation (EPDA) and cross-layer co-design methodologies can amplify these inherent benefits. We present how advanced EPDA tools enable more compact layout generation, reducing both chip area and metal layer usage. We will also demonstrate how cross-layer device-circuit-architecture co-design unlocks new sustainability gains for photonic hardware: ultra-compact photonic circuit designs that minimize chip area cost, reconfigurable hardware topology that adapts to evolving AI workloads, and intelligent resilience mechanisms that prolong lifetime by tolerating variations and faults. By uniting intrinsic photonic efficiency with EPDA- and co-design-driven gains in area efficiency, reconfigurability, and robustness, we outline a vision for lifelong-sustainable electronic-photonic AI systems. This perspective highlights how EPIC AI systems can simultaneously meet the performance demands of modern AI and the urgent imperative for sustainable computing.
zh

[AI-81] A transformer-based generative model for planetary systems

【Quick Read】: This paper addresses the high computational cost of numerical simulations of planetary system formation while seeking a generative model that captures the statistical correlations between planets within the same system, to help prioritize observational campaigns (e.g., searches for Earth-like planets). The key idea is a transformer-based generative model that efficiently learns the complex correlations among planetary properties and, once trained on the Bern model, generates large numbers of synthetic planetary systems at negligible computational cost. Visual, statistical, and machine-learning-driven comparisons against systems computed directly with the Bern model confirm that the generated systems are realistic; as a use case, the TOI-469 system shows how properties of the not-yet-observed planets c and d can be predicted from the observed planet b, illustrating the method's potential for guiding astronomical observations.

Links: https://arxiv.org/abs/2509.07226
Authors: Yann Alibert, Jeanne Davoult, Sara Marques
Institution: Unknown
Subjects: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted in A&A

Click to view abstract

Abstract:Numerical calculations of planetary system formation are very demanding in terms of computing power. These synthetic planetary systems can however provide access to correlations, as predicted in a given numerical framework, between the properties of planets in the same system. Such correlations can, in return, be used in order to guide and prioritize observational campaigns aiming at discovering some types of planets, as Earth-like planets. Our goal is to develop a generative model which is capable of capturing correlations and statistical relationships between planets in the same system. Such a model, trained on the Bern model, offers the possibility to generate large number of synthetic planetary systems with little computational cost, that can be used, for example, to guide observational campaigns. Our generative model is based on the transformer architecture which is well-known to efficiently capture correlations in sequences and is at the basis of all modern Large Language Models. To assess the validity of the generative model, we perform visual and statistical comparisons, as well as a machine learning driven tests. Finally, as a use case example, we consider the TOI-469 system, in which we aim at predicting the possible properties of planets c and d, based on the properties of planet b (the first that has been detected). We show using different comparison methods that the properties of systems generated by our model are very similar to the ones of the systems computed directly by the Bern model. We also show in the case of the TOI-469 system, that using the generative model allows to predict the properties of planets not yet observed, based on the properties of the already observed planet. We provide our model to the community on our website this http URL.

[AI-82] Computational Concept of the Psyche

【Quick Read】: This paper addresses the problem of building general artificial intelligence (GAI): enabling an agent, under uncertainty, to make optimal decisions within its agent-specific space of needs while maximizing the success of goal achievement, minimizing existential risks, and maximizing energy efficiency. The key to the solution is a proposed cognitive architecture in which the psyche is treated as the operating system of a living or artificial subject, comprising a space of needs driven by stimuli from the external world and an intelligence that decides on actions toward that world in order to satisfy those needs. Building on this concept, a computational formalization is proposed for creating AI systems that learn from experience within the space of needs, taking into account the biological or existential significance of those needs for the intelligent agent.

Link: https://arxiv.org/abs/2509.07009
Authors: Anton Kolonin, Vladimir Kryukov
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: 14 pages, in Russian, 2 figures, submitted to the Neuroinformatics-2025 conference

Click to view abstract

Abstract:The article provides an overview of approaches to modeling the human psyche from the perspective of building an artificial one. Based on the review, a concept of cognitive architecture is proposed, where the psyche is considered as the operating system of a living or artificial subject, including a space of needs that determines its life meanings in connection with stimuli from the external world, and intelligence as a decision-making system for actions in relation to this world in order to satisfy these needs. Based on the concept, a computational formalization is proposed for creating artificial intelligence systems through learning from experience in the space of needs, taking into account their biological or existential significance for an intelligent agent. Thus, the problem of building general artificial intelligence as a system for making optimal decisions in the space of agent-specific needs under conditions of uncertainty is formalized, with maximization of success in achieving goals, minimization of existential risks, and maximization of energy efficiency. A minimal experimental implementation of the model is also provided.

[AI-83] Impact of Neuron Models on Spiking Neural Networks performance. A Complexity Based Classification Approach

【Quick Read】: This paper investigates how the choice of neuron model and learning rule affects the classification performance of Spiking Neural Networks (SNNs), with a focus on bio-signal processing. The core challenges are the differing robustness and interpretability of different model-rule combinations, and the lack of a unified standard for comparing performance across configurations. The key to the solution is a complexity-based decision mechanism: Lempel-Ziv Complexity (LZC), a measure related to the entropy rate, quantifies the structural regularity of spike trains and provides a consistent, interpretable metric for systematically analyzing the interactions among neuron model, network size, and learning rule, and for identifying SNN configurations that remain robust under weak or noisy signals.

Link: https://arxiv.org/abs/2509.06970
Authors: Zofia Rudnicka, Janusz Szczepanski, Agnieszka Pregowska
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract

Abstract:This study explores how the selection of neuron models and learning rules impacts the classification performance of Spiking Neural Networks (SNNs), with a focus on applications in bio-signal processing. We compare biologically inspired neuron models, including Leaky Integrate-and-Fire (LIF), metaneurons, and probabilistic Levy-Baxter (LB) neurons, across multiple learning rules, including spike-timing-dependent plasticity (STDP), tempotron, and reward-modulated updates. A novel element of this work is the integration of a complexity-based decision mechanism into the evaluation pipeline. Using Lempel-Ziv Complexity (LZC), a measure related to entropy rate, we quantify the structural regularity of spike trains and assess classification outcomes in a consistent and interpretable manner across different SNN configurations. To investigate neural dynamics and assess algorithm performance, we employed synthetic datasets with varying temporal dependencies and stochasticity levels. These included Markov and Poisson processes, well-established models to simulate neuronal spike trains and capture the stochastic firing behavior of biological neurons. Analysis of synthetic Poisson and Markov-modeled data reveals clear performance trends: classification accuracy depends on the interaction between neuron model, network size, and learning rule, with the LZC-based evaluation highlighting configurations that remain robust to weak or noisy signals. This work delivers a systematic analysis of how neuron model selection interacts with network parameters and learning strategies, supported by a novel complexity-based evaluation approach that offers a consistent benchmark for SNN performance.
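Since the evaluation hinges on Lempel-Ziv Complexity, a minimal sketch helps: the function below counts new phrases in an incremental LZ78-style parsing of a binarized spike train. The paper may use a different LZ variant or normalization; the function name and example train are hypothetical.

```python
def lempel_ziv_complexity(sequence: str) -> int:
    """Number of distinct phrases in an incremental (LZ78-style) parsing.

    Highly regular trains produce few phrases; random trains produce
    many, mirroring the connection to the entropy rate noted above.
    """
    phrases, start = set(), 0
    for end in range(1, len(sequence) + 1):
        if sequence[start:end] not in phrases:   # current phrase is new
            phrases.add(sequence[start:end])     # record it ...
            start = end                          # ... and start a new phrase
    return len(phrases)

spikes = "0110100010111001"                      # binarized spike train (stand-in)
print(lempel_ziv_complexity(spikes))
```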

[AI-84] Association of Timing and Duration of Moderate-to-Vigorous Physical Activity with Cognitive Function and Brain Aging: A Population-Based Study Using the UK Biobank

【Quick Read】: This paper addresses the poorly understood relationship between the intensity and timing of moderate-to-vigorous physical activity (MVPA) and cognitive function as well as region-specific brain structure in older adults. The key to the solution is a systematic evaluation based on wrist-worn accelerometer data, cognitive tests, and structural brain MRI from 45,892 UK Biobank participants aged 60 and over: multivariable linear models relate MVPA (measured continuously in minutes per week, or categorized by the WHO recommendation of ≥150 min/week) to performance in cognitive domains (reasoning, memory, executive function, and processing speed) and to subcortical regions (caudate, putamen, pallidum, thalamus) and regional gray matter volumes, complemented by analyses of MVPA timing and subgroup heterogeneity. The results show that higher MVPA is significantly associated with better cognitive performance and better-preserved brain structure, with a dose-response relationship, and that the associations are most pronounced in the midday-afternoon and evening.

Link: https://arxiv.org/abs/2509.06969
Authors: Wasif Khan, Lin Gu, Noah Hammarlund, Lei Xing, Joshua K. Wong, Ruogu Fang
Institutions: Unknown
Subjects: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Comments: This article is currently under review. The Supplementary Tables A1-A7 could not be attached with the current submission but can be requested from the corresponding author

Click to view abstract

Abstract:Physical activity is a modifiable lifestyle factor with potential to support cognitive resilience. However, the associations of moderate-to-vigorous physical activity (MVPA) intensity and timing with cognitive function and region-specific brain structure remain poorly understood. We analyzed data from 45,892 UK Biobank participants aged 60 years and older with valid wrist-worn accelerometer data, cognitive testing, and structural brain MRI. MVPA was measured both continuously (minutes per week) and categorically (thresholded at ≥150 min/week based on WHO guidelines). Associations with cognitive performance and regional brain volumes were evaluated using multivariable linear models adjusted for demographic, socioeconomic, and health-related covariates. We conducted secondary analyses on MVPA timing and subgroup effects. Higher MVPA was associated with better performance across cognitive domains, including reasoning, memory, executive function, and processing speed. These associations persisted in fully adjusted models and were higher among participants meeting WHO guidelines. Greater MVPA was also associated with the volumes of subcortical brain regions (caudate, putamen, pallidum, thalamus), as well as with regional gray matter volumes involved in emotion, working memory, and perceptual processing. Secondary analyses showed that MVPA at any time of day was associated with cognitive function and brain volume, with associations particularly pronounced in the midday-afternoon and evening. Sensitivity analyses show consistent findings across subgroups, with evidence of dose-response relationships. Higher MVPA is associated with preserved brain structure and enhanced cognitive function in later life. Public health strategies to increase MVPA may support healthy cognitive aging and generate substantial economic benefits, with global gains projected to reach USD 760 billion annually by 2050.
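The core statistical machinery here is a multivariable linear model of cognition on MVPA plus covariates. A toy sketch with synthetic stand-in data (all variable names and coefficients are hypothetical, not the UK Biobank schema):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-ins for the variables used in the paper.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "mvpa": rng.gamma(2.0, 60.0, n),      # minutes of MVPA per week
    "age": rng.uniform(60, 80, n),
    "sex": rng.integers(0, 2, n),
})
df["cognition"] = 0.002 * df["mvpa"] - 0.03 * df["age"] + rng.normal(0, 1, n)

# Multivariable linear model: MVPA association adjusted for covariates.
fit = smf.ols("cognition ~ mvpa + age + C(sex)", data=df).fit()
print(fit.params["mvpa"], fit.pvalues["mvpa"])
```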

[AI-85] Deep Learning-based Techniques for Integrated Sensing and Communication Systems: State-of-the-Art Challenges and Opportunities

【Quick Read】: This paper addresses the optimization challenges of integrated sensing and communication (ISAC) systems under constrained hardware resources, high computational complexity, and stringent real-time requirements. Conventional iterative or optimization-based signal processing struggles to meet the low-latency and energy-efficiency demands of 6G and beyond networks, so the paper surveys deep learning (DL)-based alternatives. Their key advantage is the efficient mapping capability of DL models, which deliver near-optimal performance at much lower computational complexity, enabling fast solutions to core ISAC tasks such as waveform design, channel estimation, sensing signal processing, data demodulation, and interference mitigation; this is particularly relevant for emerging applications such as vehicular networks and industrial robotics.

Link: https://arxiv.org/abs/2509.06968
Authors: Murat Temiz, Yongwei Zhang, Yanwei Fu, Chi Zhang, Chenfeng Meng, Orhan Kaplan, Christos Masouros
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments: 35 pages, 13 figures, 11 tables, corrected version of the published journal article in IEEE Open Journal of the Communications Society

Click to view abstract

Abstract:This article comprehensively reviews recent developments and research on deep learning-based (DL-based) techniques for integrated sensing and communication (ISAC) systems. ISAC, which combines sensing and communication functionalities, is regarded as a key enabler for 6G and beyond networks, as many emerging applications, such as vehicular networks and industrial robotics, necessitate both sensing and communication capabilities for effective operation. A unified platform that provides both functions can reduce hardware complexity, alleviate frequency spectrum congestion, and improve energy efficiency. However, integrating these functionalities on the same hardware requires highly optimized signal processing and system design, introducing significant computational complexity when relying on conventional iterative or optimization-based techniques. As an alternative to conventional techniques, DL-based techniques offer efficient and near-optimal solutions with reduced computational complexity. Hence, such techniques are well-suited for operating under limited computational resources and low latency requirements in real-time systems. DL-based techniques can swiftly and effectively yield near-optimal solutions for a wide range of sophisticated ISAC-related tasks, including waveform design, channel estimation, sensing signal processing, data demodulation, and interference mitigation. Therefore, motivated by these advantages, recent studies have proposed various DL-based approaches for ISAC system design. After briefly introducing DL architectures and ISAC fundamentals, this survey presents a comprehensive and categorized review of state-of-the-art DL-based techniques for ISAC, highlights their key advantages and major challenges, and outlines potential directions for future research.

[AI-86] Cross-field SNR Analysis and Tensor Channel Estimation for Multi-UAV Near-field Communications

【Quick Read】: This paper addresses the degradation of conventional channel estimation in distributed multi-UAV near-field communication systems, where spatial sparsity and the breakdown of the far-field plane-wave assumption invalidate standard methods. The key to the solution is a hybrid spherical-plane wave model (HSPWM) that balances modeling accuracy with analytical tractability; building on it, two channel estimation algorithms are designed: spherical-domain orthogonal matching pursuit (SD-OMP) and tensor orthogonal matching pursuit (tensor-OMP). Tensor-OMP exploits the channel's natural tensor structure under the HSPWM, achieving normalized mean square error (NMSE) performance comparable to SD-OMP while reducing computational complexity and improving scalability.

Link: https://arxiv.org/abs/2509.06967
Authors: Tianyu Huo, Jian Xiong, Yiyan Wu, Songjie Yang, Bo Liu, Wenjun Zhang
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments:

Click to view abstract

Abstract:Extremely large antenna array (ELAA) is key to enhancing spectral efficiency in 6G networks. Leveraging the distributed nature of multi-unmanned aerial vehicle (UAV) systems enables the formation of distributed ELAA, which often operate in the near-field region with spatial sparsity, rendering the conventional far-field plane wave assumption invalid. This paper investigates channel estimation for distributed near-field multi-UAV communication systems. We first derive closed-form signal-to-noise ratio (SNR) expressions under the plane wave model (PWM), spherical wave model (SWM), and a hybrid spherical-plane wave model (HSPWM), also referred to as the cross-field model, within a distributed uniform planar array (UPA) scenario. The analysis shows that HSPWM achieves a good balance between modeling accuracy and analytical tractability. Based on this, we propose two channel estimation algorithms: the spherical-domain orthogonal matching pursuit (SD-OMP) and the tensor-OMP. The SD-OMP generalizes the polar domain to jointly consider elevation, azimuth, and range. Under the HSPWM, the channel is naturally formulated as a tensor, enabling the use of tensor-OMP. Simulation results demonstrate that tensor-OMP achieves normalized mean square error (NMSE) performance comparable to SD-OMP, while offering reduced computational complexity and improved scalability.
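Both proposed estimators build on orthogonal matching pursuit. The generic core over an arbitrary dictionary looks like the sketch below; the paper's SD-OMP and tensor-OMP apply this idea to spherical-domain and tensorized dictionaries, which are not reproduced here.

```python
import numpy as np

def omp(A, y, k):
    """Orthogonal matching pursuit: recover a k-sparse x with y ≈ A @ x.

    Greedily pick the dictionary atom most correlated with the residual,
    then re-fit all selected atoms jointly by least squares.
    """
    residual, support = y.astype(complex), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = coef
    return x

# Toy complex-valued recovery with a random dictionary (stand-in sizes).
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 256)) + 1j * rng.normal(size=(64, 256))
x_true = np.zeros(256, dtype=complex)
x_true[[5, 77, 140]] = [1.0, -2.0j, 0.5]
x_hat = omp(A, A @ x_true, k=3)
```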

[AI-87] Cross-device Zero-shot Label Transfer via Alignment of Time Series Foundation Model Embeddings

【Quick Read】: This paper addresses the lack of high-quality, medically validated labels for consumer wearables such as the Apple Watch, where manual labeling is expensive and does not scale. The key to the solution is a cross-device label transfer framework built on time-series foundation models (TSFMs): data from the source domain (e.g., clinical-grade actigraphy) and the target domain (e.g., the Apple Watch) are projected into a shared latent embedding space, and an adversarial alignment mechanism forces the two embedding distributions to match, enabling label transfer without any paired data.

Link: https://arxiv.org/abs/2509.06966
Authors: Neal G. Ravindra, Arijit Sehanobish
Institutions: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures, 1 table. tl;dr: Adversarial alignment of Time-Series Foundation Model (TSFM) embeddings enables transfer of high-quality clinical labels from medical-grade to consumer-grade wearables, enabling zero-shot prediction of gestational age without requiring paired data

Click to view abstract

Abstract:High-quality, medically validated labels exist for clinical actigraphy data but not for ubiquitous consumer wearables like the Apple Watch. Manually labeling wearables data is expensive and doesn’t scale. This paper offers a novel framework that transfers valuable labels from a source domain (e.g., actigraphy) to a target domain (e.g., Apple Watch) without requiring paired data. Instead of working with raw time-series signals, we project both domains into a shared latent embedding space using time-series foundation models (TSFMs) and develop a new framework to align the cross-device representations. Our method, Adversarial Alignment of TSFM Embeddings forces the distributions of source and target embeddings to align within this space, facilitating label transfer across device type.
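The adversarial-alignment idea can be sketched with a standard GAN-style loop over precomputed embeddings. This is a generic domain-adversarial sketch, not the paper's implementation; the frozen TSFM encoders are assumed to have already produced the embedding tensors, and all sizes are stand-ins.

```python
import torch
import torch.nn as nn

src_emb = torch.randn(256, 128)   # actigraphy embeddings (labelled source)
tgt_emb = torch.randn(256, 128)   # Apple Watch embeddings (unlabelled target)

aligner = nn.Linear(128, 128)                       # maps target -> source space
critic = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt_a = torch.optim.Adam(aligner.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    # Critic: distinguish source embeddings from aligned target embeddings.
    aligned = aligner(tgt_emb).detach()
    loss_c = bce(critic(src_emb), torch.ones(256, 1)) + \
             bce(critic(aligned), torch.zeros(256, 1))
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()

    # Aligner: fool the critic so the two distributions match.
    loss_a = bce(critic(aligner(tgt_emb)), torch.ones(256, 1))
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```

Once aligned, a classifier trained on labelled source embeddings can be applied directly to aligned target embeddings, which is what makes the zero-shot label transfer possible.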

Machine Learning

[LG-0] Theoretical Analysis on how Learning Rate Warmup Accelerates Convergence

Link: https://arxiv.org/abs/2509.07972
Authors: Yuxing Liu, Yuze Ge, Rui Pan, An Kang, Tong Zhang
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments:

Click to view abstract

Abstract:Learning rate warmup is a popular and practical technique in training large-scale deep neural networks. Despite the huge success in practice, the theoretical advantages of this strategy of gradually increasing the learning rate at the beginning of the training process have not been fully understood. To resolve this gap between theory and practice, we first propose a novel family of generalized smoothness assumptions, and validate its applicability both theoretically and empirically. Under the novel smoothness assumption, we study the convergence properties of gradient descent (GD) in both deterministic and stochastic settings. It is shown that learning rate warmup consistently accelerates GD, and GD with warmup can converge at most \Theta(T) times faster than with a non-increasing learning rate schedule in some specific cases, providing insights into the benefits of this strategy from an optimization theory perspective.
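As a reference point for the schedules analyzed here, a minimal linear-warmup rule is sketched below; the constants are arbitrary and the paper studies why such ramps accelerate GD rather than prescribing this particular form.

```python
def warmup_lr(step: int, warmup_steps: int = 1000, peak_lr: float = 3e-4) -> float:
    """Linearly ramp the learning rate up to peak_lr, then hold it constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```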

[LG-1] Customizing the Inductive Biases of Softmax Attention using Structured Matrices ICML2025

Link: https://arxiv.org/abs/2509.07963
Authors: Yilun Kuang, Noah Amsel, Sanae Lotfi, Shikai Qiu, Andres Potapczynski, Andrew Gordon Wilson
Subjects: Machine Learning (cs.LG)
Comments: ICML 2025. Code available at this https URL

Click to view abstract

Abstract:The core component of attention is the scoring function, which transforms the inputs into low-dimensional queries and keys and takes the dot product of each pair. While the low-dimensional projection improves efficiency, it causes information loss for certain tasks that have intrinsically high-dimensional inputs. Additionally, attention uses the same scoring function for all input pairs, without imposing a distance-dependent compute bias for neighboring tokens in the sequence. In this work, we address these shortcomings by proposing new scoring functions based on computationally efficient structured matrices with high ranks, including Block Tensor-Train (BTT) and Multi-Level Low Rank (MLR) matrices. On in-context regression tasks with high-dimensional inputs, our proposed scoring functions outperform standard attention for any fixed compute budget. On language modeling, a task that exhibits locality patterns, our MLR-based attention method achieves improved scaling laws compared to both standard attention and variants of sliding window attention. Additionally, we show that both BTT and MLR fall under a broader family of efficient structured matrices capable of encoding either full-rank or distance-dependent compute biases, thereby addressing significant shortcomings of standard attention. Finally, we show that MLR attention has promising results for long-range time-series forecasting.
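To convey the flavor of replacing the single low-rank dot-product with a richer structured score, here is a toy multi-level low-rank scoring function; it is only MLR-flavored and is not the paper's exact parameterization.

```python
import torch

def mlr_scores(x, level_maps):
    """Toy attention weights from a sum of query/key dot-products
    computed at several ranks (illustrative only)."""
    logits = 0.0
    for Wq, Wk in level_maps:
        q, k = x @ Wq, x @ Wk                      # (batch, tokens, r)
        logits = logits + q @ k.transpose(-2, -1)  # (batch, tokens, tokens)
    return torch.softmax(logits / x.size(-1) ** 0.5, dim=-1)

d = 32
x = torch.randn(2, 10, d)                          # (batch, tokens, dim)
level_maps = [(torch.randn(d, r), torch.randn(d, r)) for r in (2, 4, 8)]
attn = mlr_scores(x, level_maps)                   # attention weights
```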

[LG-2] RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction

Link: https://arxiv.org/abs/2509.07953
Authors: Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, Aviral Kumar
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Modern paradigms for robot imitation train expressive policy architectures on large amounts of human demonstration data. Yet performance on contact-rich, deformable-object, and long-horizon tasks plateau far below perfect execution, even with thousands of expert demonstrations. This is due to the inefficiency of existing ``expert’’ data collection procedures based on human teleoperation. To address this issue, we introduce RaC, a new phase of training on human-in-the-loop rollouts after imitation learning pre-training. In RaC, we fine-tune a robotic policy on human intervention trajectories that illustrate recovery and correction behaviors. Specifically, during a policy rollout, human operators intervene when failure appears imminent, first rewinding the robot back to a familiar, in-distribution state and then providing a corrective segment that completes the current sub-task. Training on this data composition expands the robotic skill repertoire to include retry and adaptation behaviors, which we show are crucial for boosting both efficiency and robustness on long-horizon tasks. Across three real-world bimanual control tasks: shirt hanging, airtight container lid sealing, takeout box packing, and a simulated assembly task, RaC outperforms the prior state-of-the-art using 10 \times less data collection time and samples. We also show that RaC enables test-time scaling: the performance of the trained RaC policy scales linearly in the number of recovery maneuvers it exhibits. Videos of the learned policy are available at this https URL.

[LG-3] One Model for All Tasks: Leveraging Efficient World Models in Multi-Task Planning

Link: https://arxiv.org/abs/2509.07945
Authors: Yuan Pu, Yazhe Niu, Jia Tang, Junyu Xiong, Shuai Hu, Hongsheng Li
Subjects: Machine Learning (cs.LG)
Comments: 43 pages, 19 figures

Click to view abstract

Abstract:In heterogeneous multi-task learning, tasks not only exhibit diverse observation and action spaces but also vary substantially in intrinsic difficulty. While conventional multi-task world models like UniZero excel in single-task settings, we find that when handling large-scale heterogeneous environments, gradient conflicts and the loss of model plasticity often constrain their sample and computational efficiency. In this work, we address these challenges from two perspectives: the single learning iteration and the overall learning process. First, we investigate the impact of key design spaces on extending UniZero to multi-task planning. We find that a Mixture-of-Experts (MoE) architecture provides the most substantial performance gains by mitigating gradient conflicts, leading to our proposed model, \textit{ScaleZero}. Second, to dynamically balance the computational load across the learning process, we introduce an online, LoRA-based \textit{dynamic parameter scaling} (DPS) strategy. This strategy progressively integrates LoRA adapters in response to task-specific progress, enabling adaptive knowledge retention and parameter expansion. Empirical evaluations on standard benchmarks such as Atari, DMControl (DMC), and Jericho demonstrate that ScaleZero, relying exclusively on online reinforcement learning with one model, attains performance on par with specialized single-task baselines. Furthermore, when augmented with our dynamic parameter scaling strategy, our method achieves competitive performance while requiring only 80% of the single-task environment interaction steps. These findings underscore the potential of ScaleZero for effective large-scale multi-task learning. Our code is available at this https URL.
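The MoE ingredient ScaleZero relies on can be pictured with a minimal top-1 routed layer. This is an illustrative sketch, not the paper's implementation; expert and gate shapes are arbitrary.

```python
import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    """Minimal top-1 mixture-of-experts layer (illustrative only)."""
    def __init__(self, dim: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                         # x: (batch, dim)
        route = self.gate(x).argmax(dim=-1)       # route each input to one expert
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = route == i
            if sel.any():
                out[sel] = expert(x[sel])         # only the chosen expert runs
        return out

moe = TopOneMoE(dim=16)
y = moe(torch.randn(8, 16))
```

Routing different tasks through different experts is one way gradient conflicts are mitigated: conflicting tasks mostly update disjoint parameters.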

[LG-4] Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Link: https://arxiv.org/abs/2509.07939
Authors: Katsuaki Nakano, Reza Feyyazi, Shanchieh Jay Yang, Michael Zuzak
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM’s reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments
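The guided-reasoning idea boils down to restricting the agent's next action to the frontier of a deterministic task tree. A toy sketch with a hypothetical miniature tree follows; the real pipeline derives the tree from the MITRE ATT&CK Matrix.

```python
# Hypothetical miniature tactic tree: node -> children unlocked once done.
attack_tree = {
    "recon": ["scan_ports", "enumerate_services"],
    "scan_ports": ["exploit_service"],
    "enumerate_services": ["exploit_service"],
    "exploit_service": ["escalate_privileges"],
    "escalate_privileges": [],
}

def allowed_next(completed: set) -> list:
    """Frontier of the tree: children of completed nodes not yet done.
    The LLM's next action is constrained to this list."""
    frontier = set()
    for node in completed:
        frontier.update(attack_tree.get(node, []))
    return sorted(frontier - completed)

print(allowed_next({"recon", "scan_ports"}))
# -> ['enumerate_services', 'exploit_service']
```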

[LG-5] Smart Fast Finish: Preventing Overdelivery via Daily Budget Pacing at DoorDash

Link: https://arxiv.org/abs/2509.07929
Authors: Rohan Garg, Yongjin Xiao, Jason (Dianxia) Yang, Mandar Rahurkar
Subjects: Computer Science and Game Theory (cs.GT); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:We present a budget pacing feature called Smart Fast Finish (SFF). SFF builds upon the industry standard Fast Finish (FF) feature in budget pacing systems that depletes remaining advertising budget as quickly as possible towards the end of some fixed time period. SFF dynamically updates system parameters such as start time and throttle rate depending on historical ad-campaign data. SFF is currently in use at DoorDash, one of the largest delivery platforms in the US, and is part of its budget pacing system. We show via online budget-split experimentation data and offline simulations that SFF is a robust solution for overdelivery mitigation when pacing budget.
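A cartoon of the underlying pacing logic: spend evenly until a fast-finish window opens, then stop throttling. Per the abstract, SFF's contribution is setting the window start and rates dynamically from historical campaign data; everything below is a stand-in, not DoorDash's system.

```python
def throttle_fraction(remaining_budget: float, forecast_spend_rate: float,
                      hours_left: float, ff_window_hours: float) -> float:
    """Fraction of eligible auctions to enter this hour (toy model)."""
    if hours_left <= ff_window_hours:
        return 1.0                                   # fast finish: open fully
    target_rate = remaining_budget / hours_left      # even-pacing target
    return min(1.0, target_rate / forecast_spend_rate)

print(throttle_fraction(120.0, 40.0, 6.0, 1.0))      # paced: 0.5
print(throttle_fraction(120.0, 40.0, 0.5, 1.0))      # inside FF window: 1.0
```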

[LG-6] Bio-KGvec2go: Serving up-to-date Dynamic Biomedical Knowledge Graph Embeddings ISWC

Link: https://arxiv.org/abs/2509.07905
Authors: Hamid Ahmad, Heiko Paulheim, Rita T. Sousa
Subjects: Machine Learning (cs.LG)
Comments: Accepted at ISWC Poster and Demo Track 2025

Click to view abstract

Abstract:Knowledge graphs and ontologies represent entities and their relationships in a structured way, having gained significance in the development of modern AI applications. Integrating these semantic resources with machine learning models often relies on knowledge graph embedding models to transform graph data into numerical representations. Therefore, pre-trained models for popular knowledge graphs and ontologies are increasingly valuable, as they spare the need to retrain models for different tasks using the same data, thereby helping to democratize AI development and enabling sustainable computing. In this paper, we present Bio-KGvec2go, an extension of the KGvec2go Web API, designed to generate and serve knowledge graph embeddings for widely used biomedical ontologies. Given the dynamic nature of these ontologies, Bio-KGvec2go also supports regular updates aligned with ontology version releases. By offering up-to-date embeddings with minimal computational effort required from users, Bio-KGvec2go facilitates efficient and timely biomedical research.

[LG-7] A Modular Algorithm for Non-Stationary Online Convex-Concave Optimization KR

Link: https://arxiv.org/abs/2509.07901
Authors: Qing-xin Meng, Xia Lei, Jian-wei Liu
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC)
Comments: Earlier Version: this https URL

Click to view abstract

Abstract:This paper investigates the problem of Online Convex-Concave Optimization, which extends Online Convex Optimization to two-player time-varying convex-concave games. The goal is to minimize the dynamic duality gap (D-DGap), a critical performance measure that evaluates players’ strategies against arbitrary comparator sequences. Existing algorithms fail to deliver optimal performance, particularly in stationary or predictable environments. To address this, we propose a novel modular algorithm with three core components: an Adaptive Module that dynamically adjusts to varying levels of non-stationarity, a Multi-Predictor Aggregator that identifies the best predictor among multiple candidates, and an Integration Module that effectively combines their strengths. Our algorithm achieves a minimax optimal D-DGap upper bound, up to a logarithmic factor, while also ensuring prediction error-driven D-DGap bounds. The modular design allows for the seamless replacement of components that regulate adaptability to dynamic environments, as well as the incorporation of components that integrate "side knowledge" from multiple predictors. Empirical results further demonstrate the effectiveness and adaptability of the proposed method.
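For orientation, the performance measure can be stated compactly. With $(x_t, y_t)$ the strategies actually played, $\{u_t\}$ and $\{v_t\}$ arbitrary comparator sequences, and each $f_t$ convex in its first argument and concave in its second, the dynamic duality gap is commonly written as follows (our paraphrase of the standard definition, not necessarily the paper's exact display):

```latex
\mathrm{D\text{-}DGap}\big(\{u_t\},\{v_t\}\big)
  = \sum_{t=1}^{T} f_t(x_t, v_t) - \sum_{t=1}^{T} f_t(u_t, y_t)
```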

[LG-8] Feasibility of In-Ear Single-Channel ExG for Wearable Sleep Monitoring in Real-World Settings

Link: https://arxiv.org/abs/2509.07896
Authors: Philipp Lepold, Jonas Leichtle, Tobias Röddiger, Michael Beigl
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Automatic sleep staging typically relies on gold-standard EEG setups, which are accurate but obtrusive and impractical for everyday use outside sleep laboratories. This limits applicability in real-world settings, such as home environments, where continuous, long-term monitoring is needed. Detecting sleep onset is particularly relevant, enabling consumer applications (e.g. automatically pausing media playback when the user falls asleep). Recent research has shown correlations between in-ear EEG and full-scalp EEG for various phenomena, suggesting wearable, in-ear devices could allow unobtrusive sleep monitoring. We investigated the feasibility of using single-channel in-ear electrophysiological (ExG) signals for automatic sleep staging in a wearable device by conducting a sleep study with 11 participants (mean age: 24), using a custom earpiece with a dry eartip electrode (Dätwyler SoftPulse) as a measurement electrode in one ear and a reference in the other. Ground truth sleep stages were obtained from an Apple Watch Ultra, validated for sleep staging. Our system achieved 90.5% accuracy for binary sleep detection (Awake vs. Asleep) and 65.1% accuracy for four-class staging (Awake, REM, Core, Deep) using leave-one-subject-out validation. These findings demonstrate the potential of in-ear electrodes as a low-effort, comfortable approach to sleep monitoring, with applications such as stopping podcasts when users fall asleep.
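The leave-one-subject-out validation used above is easy to reproduce in outline. A hedged sketch with synthetic stand-ins for the per-epoch features and labels (the actual feature extraction and classifier are not specified here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

# Stand-ins: per-epoch ExG features, binary labels, 11 subject ids.
rng = np.random.default_rng(0)
X = rng.normal(size=(1100, 16))
y = rng.integers(0, 2, 1100)              # Awake vs. Asleep
groups = np.repeat(np.arange(11), 100)    # which subject each epoch belongs to

scores = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
    scores.append(accuracy_score(y[test], clf.predict(X[test])))
print(np.mean(scores))                    # accuracy averaged over held-out subjects
```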

[LG-9] A Survey of Graph Neural Networks for Drug Discovery: Recent Developments and Challenges

Link: https://arxiv.org/abs/2509.07887
Authors: Katherine Berry, Liang Cheng
Subjects: Machine Learning (cs.LG)
Comments: 16 pages, 1 figure

Click to view abstract

Abstract:Graph Neural Networks (GNNs) have gained traction in the complex domain of drug discovery because of their ability to process graph-structured data such as drug molecule models. This approach has resulted in a myriad of methods and models in published literature across several categories of drug discovery research. This paper covers the research categories comprehensively with recent papers, namely molecular property prediction, including drug-target binding affinity prediction, drug-drug interaction study, microbiome interaction prediction, drug repositioning, retrosynthesis, and new drug design, and provides guidance for future work on GNNs for drug discovery.

[LG-10] Leveraging Support Vector Regression for Outcome Prediction in Personalized Ultra-fractionated Stereotactic Adaptive Radiotherapy

Link: https://arxiv.org/abs/2509.07872
Authors: Yajun Yu, Steve Jiang, Robert Timmerman, Hao Peng
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Personalized ultra-fractionated stereotactic adaptive radiotherapy (PULSAR) is a novel treatment that delivers radiation in pulses of protracted intervals. Accurate prediction of gross tumor volume (GTV) changes through regression models has substantial prognostic value. This study aims to develop a multi-omics based support vector regression (SVR) model for predicting GTV change. A retrospective cohort of 39 patients with 69 brain metastases was analyzed, based on radiomics (MRI images) and dosiomics (dose maps) features. Delta features were computed to capture relative changes between two time points. A feature selection pipeline using least absolute shrinkage and selection operator (Lasso) algorithm with weight- or frequency-based ranking criterion was implemented. SVR models with various kernels were evaluated using the coefficient of determination (R2) and relative root mean square error (RRMSE). Five-fold cross-validation with 10 repeats was employed to mitigate the limitation of small data size. Multi-omics models that integrate radiomics, dosiomics, and their delta counterparts outperform individual-omics models. Delta-radiomic features play a critical role in enhancing prediction accuracy relative to features at single time points. The top-performing model achieves an R2 of 0.743 and an RRMSE of 0.022. The proposed multi-omics SVR model shows promising performance in predicting continuous change of GTV. It provides a more quantitative and personalized approach to assist patient selection and treatment adjustment in PULSAR.
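The modeling pipeline described above (Lasso-based feature selection feeding an SVR, evaluated with repeated five-fold cross-validation) maps directly onto standard sklearn components. A hedged sketch with random stand-in data in place of the radiomic/dosiomic features:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Stand-ins for multi-omics (+ delta) features and relative GTV change.
rng = np.random.default_rng(0)
X = rng.normal(size=(69, 200))             # 69 lesions, 200 features
y = rng.normal(size=69)

pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.05)),    # Lasso-based feature selection
    SVR(kernel="rbf"),
)
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
print(cross_val_score(pipe, X, y, cv=cv, scoring="r2").mean())  # mean R² over folds
```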

[LG-11] Addressing the Cold-Start Problem for Personalized Combination Drug Screening

Link: https://arxiv.org/abs/2509.07850
Authors: Antoine de Mathelin, Christopher Tosh, Wesley Tansey
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Personalizing combination therapies in oncology requires navigating an immense space of possible drug and dose combinations, a task that remains largely infeasible through exhaustive experimentation. Recent developments in patient-derived models have enabled high-throughput ex vivo screening, but the number of feasible experiments is limited. Further, a tight therapeutic window makes gathering molecular profiling information (e.g. RNA-seq) impractical as a means of guiding drug response prediction. This leads to a challenging cold-start problem: how do we select the most informative combinations to test early, when no prior information about the patient is available? We propose a strategy that leverages a pretrained deep learning model built on historical drug response data. The model provides both embeddings for drug combinations and dose-level importance scores, enabling a principled selection of initial experiments. We combine clustering of drug embeddings to ensure functional diversity with a dose-weighting mechanism that prioritizes doses based on their historical informativeness. Retrospective simulations on large-scale drug combination datasets show that our method substantially improves initial screening efficiency compared to baselines, offering a viable path for more effective early-phase decision-making in personalized combination drug screens.
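One way to picture the selection strategy: cluster the pretrained combination embeddings for functional diversity, then sample doses in proportion to their historical informativeness. This is our own toy rendering of the abstract's description, with random stand-in embeddings and scores:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 32))           # pretrained combo embeddings (stand-in)
dose_informativeness = np.array([0.1, 0.3, 0.4, 0.2])   # historical scores (stand-in)

k = 8                                      # experiments affordable at cold start
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(emb)
# One representative combination per cluster keeps the panel diverse.
picks = [int(np.argmin(np.linalg.norm(emb - c, axis=1)))
         for c in km.cluster_centers_]
# Doses sampled in proportion to how informative they were historically.
doses = rng.choice(len(dose_informativeness), size=k, p=dose_informativeness)
print(list(zip(picks, doses.tolist())))
```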

[LG-12] Predicting person-level injury severity using crash narratives: A balanced approach with roadway classification and natural language process techniques

Link: https://arxiv.org/abs/2509.07845
Authors: Mohammad Zana Majidi, Sajjad Karimi, Teng Wang, Robert Kluger, Reginald Souleyrette
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Predicting injuries and fatalities in traffic crashes plays a critical role in enhancing road safety, improving emergency response, and guiding public health interventions. This study investigates the added value of unstructured crash narratives (written by police officers at the scene) when combined with structured crash data to predict injury severity. Two widely used Natural Language Processing (NLP) techniques, Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec, were employed to extract semantic meaning from the narratives, and their effectiveness was compared. To address the challenge of class imbalance, a K-Nearest Neighbors-based oversampling method was applied to the training data prior to modeling. The dataset consists of crash records from Kentucky spanning 2019 to 2023. To account for roadway heterogeneity, three road classification schemes were used: (1) eight detailed functional classes (e.g., Urban Two-Lane, Rural Interstate, Urban Multilane Divided), (2) four broader paired categories (e.g., Urban vs. Rural, Freeway vs. Non-Freeway), and (3) a unified dataset without classification. A total of 102 machine learning models were developed by combining structured features and narrative-based features using the two NLP techniques alongside three ensemble algorithms: XGBoost, Random Forest, and AdaBoost. Results demonstrate that models incorporating narrative data consistently outperform those relying solely on structured data. Among all combinations, TF-IDF coupled with XGBoost yielded the most accurate predictions in most subgroups. The findings highlight the power of integrating textual and structured crash information to enhance person-level injury prediction. This work offers a practical and adaptable framework for transportation safety professionals to improve crash severity modeling, guide policy decisions, and design more effective countermeasures.
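Combining TF-IDF text features with structured crash attributes is a standard pattern. The hedged sketch below uses sklearn, with GradientBoostingClassifier standing in for XGBoost and toy rows standing in for the Kentucky crash records (all column names hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import make_pipeline

# Toy rows standing in for crash records with officer narratives.
df = pd.DataFrame({
    "narrative": ["vehicle ran off road and struck a tree",
                  "rear end collision at low speed in parking lot",
                  "driver lost control on wet pavement and overturned",
                  "sideswipe while changing lanes on the interstate"],
    "speed_limit": [55, 25, 45, 70],
    "severity": [1, 0, 1, 0],                 # injury vs. no injury
})

features = ColumnTransformer([
    ("text", TfidfVectorizer(), "narrative"),  # unstructured narrative
    ("num", "passthrough", ["speed_limit"]),   # structured crash data
])
clf = make_pipeline(features, GradientBoostingClassifier())
clf.fit(df[["narrative", "speed_limit"]], df["severity"])
print(clf.predict(df[["narrative", "speed_limit"]]))
```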

[LG-13] Quantum Computing for Large-scale Network Optimization: Opportunities and Challenges

Link: https://arxiv.org/abs/2509.07773
Authors: Sebastian Macaluso, Giovanni Geraci, Elías F. Combarro, Sergi Abadal, Ioannis Arapakis, Sofia Vallecorsa, Eduard Alarcón
Subjects: Networking and Internet Architecture (cs.NI); Information Theory (cs.IT); Machine Learning (cs.LG); Signal Processing (eess.SP); Quantum Physics (quant-ph)
Comments: 7 pages, 4 figures

Click to view abstract

Abstract:The complexity of large-scale 6G-and-beyond networks demands innovative approaches for multi-objective optimization over vast search spaces, a task often intractable. Quantum computing (QC) emerges as a promising technology for efficient large-scale optimization. We present our vision of leveraging QC to tackle key classes of problems in future mobile networks. By analyzing and identifying common features, particularly their graph-centric representation, we propose a unified strategy involving QC algorithms. Specifically, we outline a methodology for optimization using quantum annealing as well as quantum reinforcement learning. Additionally, we discuss the main challenges that QC algorithms and hardware must overcome to effectively optimize future networks.

[LG-14] MoE-Compression: How the Compression Error of Experts Affects the Inference Accuracy of MoE Model?

Link: https://arxiv.org/abs/2509.07727
Authors: Songkai Ma, Zhaorui Zhang, Sheng Di, Benben Liu, Xiaodong Yu, Xiaoyi Lu, Dan Wang
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments:

Click to view abstract

Abstract:With the widespread application of Mixture of Experts (MoE) reasoning models in the field of LLM learning, efficiently serving MoE models under limited GPU memory constraints has emerged as a significant challenge. Offloading the non-activated experts to main memory has been identified as an efficient approach to address such a problem, while it brings the challenges of transferring the expert between the GPU memory and main memory. We need to explore an efficient approach to compress the expert and analyze how the compression error affects the inference performance. To bridge this gap, we propose employing error-bounded lossy compression algorithms (such as SZ3 and CuSZp) to compress non-activated experts, thereby reducing data transfer overhead during MoE inference. We conduct extensive experiments across various benchmarks and present a comprehensive analysis of how compression-induced errors in different experts affect overall inference accuracy. The results indicate that experts in the shallow layers, which are primarily responsible for the attention mechanism and the transformation of input tokens into vector representations, exhibit minimal degradation in inference accuracy when subjected to bounded errors. In contrast, errors in the middle-layer experts, which are central to model reasoning, significantly impair inference accuracy. Interestingly, introducing bounded errors in the deep-layer experts, which are mainly responsible for instruction following and output integration, can sometimes lead to improvements in inference accuracy.
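The essence of error-bounded lossy compression can be seen in a uniform quantizer whose reconstruction error is guaranteed to stay within the bound. This is a crude stand-in for SZ3/CuSZp, which additionally use prediction and entropy coding:

```python
import numpy as np

def quantize(w: np.ndarray, error_bound: float):
    """Uniform quantization with |w - dequantize(codes, step)| <= error_bound."""
    step = 2.0 * error_bound            # rounding error is at most step/2
    return np.round(w / step).astype(np.int64), step

def dequantize(codes: np.ndarray, step: float) -> np.ndarray:
    return codes * step

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
codes, step = quantize(w, error_bound=1e-2)
print(np.max(np.abs(w - dequantize(codes, step))))   # stays below 1e-2
```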

[LG-15] IBN: An Interpretable Bidirectional-Modeling Network for Multivariate Time Series Forecasting with Variable Missing

Link: https://arxiv.org/abs/2509.07725
Authors: Shusen Ma, Tianhao Zhang, Qijiu Xia, Yun-Bo Zhao
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Multivariate time series forecasting (MTSF) often faces challenges from missing variables, which hinder conventional spatial-temporal graph neural networks in modeling inter-variable correlations. While GinAR addresses variable missing using attention-based imputation and adaptive graph learning for the first time, it lacks interpretability and fails to capture more latent temporal patterns due to its simple recursive units (RUs). To overcome these limitations, we propose the Interpretable Bidirectional-modeling Network (IBN), integrating Uncertainty-Aware Interpolation (UAI) and Gaussian kernel-based Graph Convolution (GGCN). IBN estimates the uncertainty of reconstructed values using MC Dropout and applies an uncertainty-weighted strategy to mitigate high-risk reconstructions. GGCN explicitly models spatial correlations among variables, while a bidirectional RU enhances temporal dependency modeling. Extensive experiments show that IBN achieves state-of-the-art forecasting performance under various missing-rate scenarios, providing a more reliable and interpretable framework for MTSF with missing variables. Code is available at: this https URL.
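The MC Dropout ingredient of the Uncertainty-Aware Interpolation is straightforward to sketch: keep dropout active at inference and take the spread of repeated stochastic forward passes as the uncertainty estimate. A minimal, generic version (not IBN's actual network):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Dropout(0.2), nn.Linear(32, 1))

def mc_dropout(net: nn.Module, x: torch.Tensor, T: int = 50):
    """Mean and std of T stochastic forward passes with dropout left on."""
    net.train()                          # keeps Dropout stochastic at inference
    with torch.no_grad():
        samples = torch.stack([net(x) for _ in range(T)])
    return samples.mean(0), samples.std(0)

x = torch.randn(4, 8)
mean, std = mc_dropout(net, x)           # std ≈ uncertainty of reconstructed values
```

High-std reconstructions would then be down-weighted, which is the "uncertainty-weighted strategy" the abstract refers to.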

[LG-16] FUnc-SNE: A flexible Fast and Unconstrained algorithm for neighbour embeddings

Link: https://arxiv.org/abs/2509.07681
Authors: Pierre Lambert, Edouard Couplet, Michel Verleysen, John Aldo Lee
Subjects: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Comments: Preprint submitted to Neurocomputing

Click to view abstract

Abstract:Neighbour embeddings (NE) allow the representation of high dimensional datasets into lower dimensional spaces and are often used in data visualisation. In practice, accelerated approximations are employed to handle very large datasets. Accelerating NE is challenging, and two main directions have been explored: very coarse approximations based on negative sampling (as in UMAP) achieve high effective speed but may lack quality in the extracted structures; less coarse approximations, as used in FIt-SNE or BH-t-SNE, offer better structure preservation at the cost of speed, while also restricting the target dimensionality to 2 or 3, limiting NE to visualisation. In some variants, the precision of these costlier accelerations also enables finer-grained control on the extracted structures through dedicated hyperparameters. This paper proposes to bridge the gap between the two approaches by introducing a novel way to accelerate NE, requiring a small number of computations per iteration while maintaining good fine-grained structure preservation and flexibility through hyperparameter tuning, without limiting the dimensionality of the embedding space. The method was designed for interactive exploration of data; as such, it abandons the traditional two-phased approach of other NE methods, allowing instantaneous visual feedback when changing hyperparameters, even when these control processes happening on the high-dimensional side of the computations. Experiments using a publicly available, GPU accelerated GUI integration of the method show promising results in terms of speed, flexibility in the structures getting extracted, and show potential uses in broader machine learning contexts with minimal algorithmic modifications. Central to this algorithm is a novel approach to iterative approximate nearest neighbour search, which shows promising results compared to nearest neighbour descent.

[LG-17] Graph-based Integrated Gradients for Explaining Graph Neural Networks

Link: https://arxiv.org/abs/2509.07648
Authors: Lachlan Simpson, Kyle Millar, Adriel Cheng, Cheng-Chew Lim, Hong Gunn Chew
Subjects: Machine Learning (cs.LG)
Comments: Accepted at the Australasian Joint Conference on Artificial Intelligence (AJCAI) 2025

Click to view abstract

Abstract:Integrated Gradients (IG) is a common explainability technique to address the black-box problem of neural networks. Integrated gradients assumes continuous data. Graphs are discrete structures making IG ill-suited to graphs. In this work, we introduce graph-based integrated gradients (GB-IG); an extension of IG to graphs. We demonstrate on four synthetic datasets that GB-IG accurately identifies crucial structural components of the graph used in classification tasks. We further demonstrate on three prevalent real-world graph datasets that GB-IG outperforms IG in highlighting important features for node classification tasks.
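For reference, the classical IG formula that GB-IG extends attributes feature i as (x_i − x′_i) times the path integral of the gradient from a baseline x′ to the input x. The sketch below shows only this continuous version, approximated by a Riemann sum; the paper's graph-based variant is not reproduced here.

```python
import torch

def integrated_gradients(f, x, baseline, steps=64):
    """IG_i = (x_i - x'_i) * mean over the path of dF/dx_i (Riemann sum)."""
    alphas = torch.linspace(0, 1, steps).view(-1, *([1] * x.dim()))
    path = baseline + alphas * (x - baseline)      # (steps, *x.shape)
    path.requires_grad_(True)
    f(path).sum().backward()                       # gradients at every path point
    return (x - baseline) * path.grad.mean(dim=0)

f = lambda z: (z ** 2).sum(dim=-1)                 # toy differentiable model
x, baseline = torch.ones(3), torch.zeros(3)
print(integrated_gradients(f, x, baseline))        # ≈ [1, 1, 1] for this toy f
```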

[LG-18] Neural Proxies for Sound Synthesizers: Learning Perceptually Informed Preset Representations

Link: https://arxiv.org/abs/2509.07635
Authors: Paolo Combes, Stefan Weinzierl, Klaus Obermayer
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: 17 pages, 4 figures, published in the Journal of the Audio Engineering Society

Click to view abstract

Abstract:Deep learning appears as an appealing solution for Automatic Synthesizer Programming (ASP), which aims to assist musicians and sound designers in programming sound synthesizers. However, integrating software synthesizers into training pipelines is challenging due to their potential non-differentiability. This work tackles this challenge by introducing a method to approximate arbitrary synthesizers. Specifically, we train a neural network to map synthesizer presets onto an audio embedding space derived from a pretrained model. This facilitates the definition of a neural proxy that produces compact yet effective representations, thereby enabling the integration of audio embedding loss into neural-based ASP systems for black-box synthesizers. We evaluate the representations derived by various pretrained audio models in the context of neural-based ASP and assess the effectiveness of several neural network architectures, including feedforward, recurrent, and transformer-based models, in defining neural proxies. We evaluate the proposed method using both synthetic and hand-crafted presets from three popular software synthesizers and assess its performance in a synthesizer sound matching downstream task. While the benefits of the learned representation are nuanced by resource requirements, encouraging results were obtained for all synthesizers, paving the way for future research into the application of synthesizer proxies for neural-based ASP systems.

[LG-19] K2-Think: A Parameter-Efficient Reasoning System

Link: https://arxiv.org/abs/2509.07604
Authors: Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
Subjects: Machine Learning (cs.LG)
Comments: To access the K2-Think reasoning system, please visit this https URL

Click to view abstract

Abstract:K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at this http URL, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.

[LG-20] Homogenization with Guaranteed Bounds via Primal-Dual Physically Informed Neural Networks

Link: https://arxiv.org/abs/2509.07579
Authors: Liya Gaynutdinova, Martin Doškář, Ondřej Rokoš, Ivana Pultarová
Subjects: Machine Learning (cs.LG); Analysis of PDEs (math.AP); Computational Physics (physics.comp-ph)
Comments:

Click to view abstract

Abstract:Physics-informed neural networks (PINNs) have shown promise in solving partial differential equations (PDEs) relevant to multiscale modeling, but they often fail when applied to materials with discontinuous coefficients, such as media with piecewise constant properties. This paper introduces a dual formulation for the PINN framework to improve the reliability of the homogenization of periodic thermo-conductive composites, for both strong and variational (weak) formulations. The dual approach facilitates the derivation of guaranteed upper and lower error bounds, enabling more robust detection of PINN failure. We compare standard PINNs applied to smoothed material approximations with variational PINNs (VPINNs) using both spectral and neural network-based test functions. Our results indicate that while strong-form PINNs may outperform VPINNs in controlled settings, they are sensitive to material discontinuities and may fail without clear diagnostics. In contrast, VPINNs accommodate piecewise constant material parameters directly but require careful selection of test functions to avoid instability. Dual formulation serves as a reliable indicator of convergence quality, and its integration into PINN frameworks enhances their applicability to homogenization problems in micromechanics.

[LG-21] uGMM-NN: Univariate Gaussian Mixture Model Neural Network

Link: https://arxiv.org/abs/2509.07569
Authors: Zakeria Sharif Ali
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 2 figures

Click to view abstract

Abstract:This paper introduces the Univariate Gaussian Mixture Model Neural Network (uGMM-NN), a novel neural architecture that embeds probabilistic reasoning directly into the computational units of deep networks. Unlike traditional neurons, which apply weighted sums followed by fixed nonlinearities, each uGMM-NN node parameterizes its activations as a univariate Gaussian mixture, with learnable means, variances, and mixing coefficients. This design enables richer representations by capturing multimodality and uncertainty at the level of individual neurons, while retaining the scalability of standard feedforward networks. We demonstrate that uGMM-NN can achieve competitive discriminative performance compared to conventional multilayer perceptrons, while additionally offering a probabilistic interpretation of activations. The proposed framework provides a foundation for integrating uncertainty-aware components into modern neural architectures, opening new directions for both discriminative and generative modeling.
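A guess at what such a unit might look like in code: each output unit projects its input to a scalar and scores it under a learnable univariate Gaussian mixture. This is our own hedged reading of the abstract; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn

class UGMMLayer(nn.Module):
    """Each unit outputs the log-likelihood of a scalar pre-activation
    under its own learnable univariate Gaussian mixture (illustrative)."""
    def __init__(self, in_dim: int, out_dim: int, n_comp: int = 3):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)           # scalar input per unit
        self.mu = nn.Parameter(torch.randn(out_dim, n_comp))
        self.log_sigma = nn.Parameter(torch.zeros(out_dim, n_comp))
        self.logit_pi = nn.Parameter(torch.zeros(out_dim, n_comp))

    def forward(self, x):
        z = self.proj(x).unsqueeze(-1)                   # (batch, out_dim, 1)
        sigma = self.log_sigma.exp()
        # Component log-densities, up to the constant -0.5*log(2*pi).
        log_norm = -0.5 * ((z - self.mu) / sigma) ** 2 - sigma.log()
        log_pi = torch.log_softmax(self.logit_pi, dim=-1)
        return torch.logsumexp(log_pi + log_norm, dim=-1)  # (batch, out_dim)

layer = UGMMLayer(8, 4)
out = layer(torch.randn(2, 8))
```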

[LG-22] RoseCDL: Robust and Scalable Convolutional Dictionary Learning for Rare-event Detection

Link: https://arxiv.org/abs/2509.07523
Authors: Jad Yehya, Mansour Benbakoura, Cédric Allain, Benoît Malezieux, Matthieu Kowalski, Thomas Moreau
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:Identifying recurring patterns and rare events in large-scale signals is a fundamental challenge in fields such as astronomy, physical simulations, and biomedical science. Convolutional Dictionary Learning (CDL) offers a powerful framework for modeling local structures in signals, but its use for detecting rare or anomalous events remains largely unexplored. In particular, CDL faces two key challenges in this setting: high computational cost and sensitivity to artifacts and outliers. In this paper, we introduce RoseCDL, a scalable and robust CDL algorithm designed for unsupervised rare event detection in long signals. RoseCDL combines stochastic windowing for efficient training on large datasets with inline outlier detection to enhance robustness and isolate anomalous patterns. This reframes CDL as a practical tool for event discovery and characterization in real-world signals, extending its role beyond traditional tasks like compression or denoising.

[LG-23] Conv4Rec: A 1-by-1 Convolutional AutoEncoder for User Profiling through Joint Analysis of Implicit and Explicit Feedbacks

Link: https://arxiv.org/abs/2509.07499
Authors: Antoine Ledent, Petr Kasalický, Rodrigo Alves, Hady W. Lauw
Subjects: Machine Learning (cs.LG)
Comments: Accepted at Transactions on Neural Networks and Learning Systems (TNNLS)

Click to view abstract

Abstract:We introduce a new convolutional AutoEncoder architecture for user modelling and recommendation tasks with several improvements over the state of the art. Firstly, our model has the flexibility to learn a set of associations and combinations between different interaction types in a way that carries over to each user and item. Secondly, our model is able to learn jointly from both the explicit ratings and the implicit information in the sampling pattern (which we refer to as "implicit feedback"). It can also make separate predictions for the probability of consuming content and the likelihood of granting it a high rating if observed. This not only allows the model to make predictions for both the implicit and explicit feedback, but also increases the informativeness of the predictions: in particular, our model can identify items which users would not have been likely to consume naturally, but would be likely to enjoy if exposed to them. Finally, we provide several generalization bounds for our model, which to the best of our knowledge, are among the first generalization bounds for auto-encoders in a Recommender Systems setting; we also show that optimizing our loss function guarantees the recovery of the exact sampling distribution over interactions up to a small error in total variation. In experiments on several real-life datasets, we achieve state-of-the-art performance on both the implicit and explicit feedback prediction tasks despite relying on a single model for both, and benefiting from additional interpretability in the form of individual predictions for the probabilities of each possible rating.

[LG-24] EMORF-II: Adaptive EM-based Outlier-Robust Filtering with Correlated Measurement Noise

Link: https://arxiv.org/abs/2509.07415
Authors: Arslan Majal, Aamir Hussain Chughtai, Muhammad Tahir
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 6 pages, 4 figures, to appear in MLSP 2025 proceedings

Click to view abstract

Abstract:We present a learning-based outlier-robust filter for a general setup where the measurement noise can be correlated. Since it is an enhanced version of the EM-based outlier-robust filter (EMORF), we call it EMORF-II. As it is equipped with an additional powerful feature to learn the outlier characteristics during inference along with outlier detection, EMORF-II has improved outlier-mitigation capability. Numerical experiments confirm performance gains as compared to the state-of-the-art methods in terms of accuracy, with an increased computational overhead. However, thankfully the computational complexity order remains on par with other practical methods, making it a useful choice for diverse applications.

[LG-25] FedTeddi: Temporal Drift and Divergence Aware Scheduling for Timely Federated Edge Learning

Link: https://arxiv.org/abs/2509.07342
Authors: Yuxuan Bai, Yuxuan Sun, Tan Chen, Wei Chen, Sheng Zhou, Zhisheng Niu
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: Submitted to IEEE for possible publication

Click to view abstract

Abstract:Federated edge learning (FEEL) enables collaborative model training across distributed clients over wireless networks without exposing raw data. While most existing studies assume static datasets, in real-world scenarios clients may continuously collect data with time-varying and non-independent and identically distributed (non-i.i.d.) characteristics. A critical challenge is how to adapt models in a timely yet efficient manner to such evolving data. In this paper, we propose FedTeddi, a temporal-drift-and-divergence-aware scheduling algorithm that facilitates fast convergence of FEEL under dynamic data evolution and communication resource limits. We first quantify the temporal dynamics and non-i.i.d. characteristics of data using temporal drift and collective divergence, respectively, and represent them as the Earth Mover’s Distance (EMD) of class distributions for classification tasks. We then propose a novel optimization objective and develop a joint scheduling and bandwidth allocation algorithm, enabling the FEEL system to learn from new data quickly without forgetting previous knowledge. Experimental results show that our algorithm achieves higher test accuracy and faster convergence compared to benchmark methods, improving the rate of convergence by 58.4% on CIFAR-10 and 49.2% on CIFAR-100 compared to random scheduling.
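The EMD between class histograms that drives the scheduler has a closed form when the two distributions share the same ordered support. A minimal sketch, assuming unit spacing between classes (the paper's exact construction may differ):

```python
import numpy as np

def emd_1d(p: np.ndarray, q: np.ndarray) -> float:
    """Earth Mover's Distance between two discrete distributions over
    the same ordered support with unit spacing (CDF-difference form)."""
    return float(np.abs(np.cumsum(p - q)).sum())

# Temporal drift of one client's label distribution between rounds (stand-ins).
p_old = np.array([0.5, 0.3, 0.1, 0.1])
p_new = np.array([0.2, 0.2, 0.3, 0.3])
print(emd_1d(p_old, p_new))
```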

[LG-26] CancerGUIDE: Cancer Guideline Understanding via Internal Disagreement Estimation

Link: https://arxiv.org/abs/2509.07325
Authors: Alyssa Unell, Noel C. F. Codella, Sam Preston, Peniel Argaw, Wen-wai Yim, Zelalem Gero, Cliff Wong, Rajesh Jena, Eric Horvitz, Amanda K. Hall, Ruican Rachel Zhong, Jiachen Li, Shrey Jain, Mu Wei, Matthew Lungren, Hoifung Poon
Subjects: Machine Learning (cs.LG)
Comments:

Click to view abstract

Abstract:The National Comprehensive Cancer Network (NCCN) provides evidence-based guidelines for cancer treatment. Translating complex patient presentations into guideline-compliant treatment recommendations is time-intensive, requires specialized expertise, and is prone to error. Advances in large language model (LLM) capabilities promise to reduce the time required to generate treatment recommendations and improve accuracy. We present an LLM agent-based approach to automatically generate guideline-concordant treatment trajectories for patients with non-small cell lung cancer (NSCLC). Our contributions are threefold. First, we construct a novel longitudinal dataset of 121 cases of NSCLC patients that includes clinical encounters, diagnostic results, and medical histories, each expertly annotated with the corresponding NCCN guideline trajectories by board-certified oncologists. Second, we demonstrate that existing LLMs possess domain-specific knowledge that enables high-quality proxy benchmark generation for both model development and evaluation, achieving strong correlation (Spearman coefficient r=0.88, RMSE = 0.08) with expert-annotated benchmarks. Third, we develop a hybrid approach combining expensive human annotations with model consistency information to create both the agent framework that predicts the relevant guidelines for a patient, as well as a meta-classifier that verifies prediction accuracy with calibrated confidence scores for treatment recommendations (AUROC=0.800), a critical capability for communicating the accuracy of outputs, custom-tailoring tradeoffs in performance, and supporting regulatory compliance. This work establishes a framework for clinically viable LLM-based guideline adherence systems that balance accuracy, interpretability, and regulatory requirements while reducing annotation costs, providing a scalable pathway toward automated clinical decision support.

[LG-27] Learning Generalized Hamiltonian Dynamics with Stability from Noisy Trajectory Data

Link: https://arxiv.org/abs/2509.07280
Authors: Luke McLennan,Yi Wang,Ryan Farell,Minh Nguyen,Chandrajit Bajaj
Subjects: Machine Learning (cs.LG)
Note:

Abstract:We introduce a robust framework for learning various generalized Hamiltonian dynamics from noisy, sparse phase-space data and in an unsupervised manner based on variational Bayesian inference. Although conservative, dissipative, and port-Hamiltonian systems might share the same initial total energy of a closed system, it is challenging for a single Hamiltonian network model to capture the distinctive and varying motion dynamics and physics of a phase space, from sampled observational phase space trajectories. To address this complicated Hamiltonian manifold learning challenge, we extend sparse symplectic, random Fourier Gaussian processes learning with predictive successive numerical estimations of the Hamiltonian landscape, using a generalized form of state and conjugate momentum Hamiltonian dynamics, appropriate to different classes of conservative, dissipative and port-Hamiltonian physical systems. In addition to the kernelized evidence lower bound (ELBO) loss for data fidelity, we incorporate stability and conservation constraints as additional hyper-parameter balanced loss terms to regularize the model’s multi-gradients, enforcing physics correctness for improved prediction accuracy with bounded uncertainty.
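
For reference, the three dynamics classes named in the abstract, in their textbook forms (the paper's exact parameterization of the generalized dynamics may differ):

```latex
\begin{align}
  \dot{q} &= \frac{\partial H}{\partial p}, \qquad
  \dot{p} = -\frac{\partial H}{\partial q}
  && \text{(conservative)} \\
  \dot{x} &= (J - R)\,\nabla H(x), \quad J = -J^{\top},\; R \succeq 0
  && \text{(dissipative)} \\
  \dot{x} &= (J - R)\,\nabla H(x) + G u, \qquad
  y = G^{\top} \nabla H(x)
  && \text{(port-Hamiltonian)}
\end{align}
```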

[LG-28] IP-Basis PINNs: Efficient Multi-Query Inverse Parameter Estimation

Link: https://arxiv.org/abs/2509.07245
Authors: Shalev Manor,Mohammad Kohandel
Subjects: Machine Learning (cs.LG)
Note: 18 pages, 4 figures

Abstract:Solving inverse problems with Physics-Informed Neural Networks (PINNs) is computationally expensive for multi-query scenarios, as each new set of observed data requires a new, expensive training procedure. We present Inverse-Parameter Basis PINNs (IP-Basis PINNs), a meta-learning framework that extends the foundational work of Desai et al. (2022) to enable rapid and efficient inference for inverse problems. Our method employs an offline-online decomposition: a deep network is first trained offline to produce a rich set of basis functions that span the solution space of a parametric differential equation. For each new inverse problem online, this network is frozen, and solutions and parameters are inferred by training only a lightweight linear output layer against observed data. Key innovations that make our approach effective for inverse problems include: (1) a novel online loss formulation for simultaneous solution reconstruction and parameter identification, (2) a significant reduction in computational overhead via forward-mode automatic differentiation for PDE loss evaluation, and (3) a non-trivial validation and early-stopping mechanism for robust offline training. We demonstrate the efficacy of IP-Basis PINNs on three diverse benchmarks, including an extension to universal PINNs for unknown functional terms, showing consistent performance across constant and functional parameter estimation, a significant speedup per query over standard PINNs, and robust operation with scarce and noisy data.
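
A minimal sketch of the offline-online split on a toy inverse problem u' = -λu; the network sizes, names, and loss weights are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class BasisNet(nn.Module):
    """Hypothetical offline network mapping t to M basis functions phi_m(t)."""
    def __init__(self, m_basis=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(1, 128), nn.Tanh(),
                                  nn.Linear(128, 128), nn.Tanh(),
                                  nn.Linear(128, m_basis))
    def forward(self, t):
        return self.body(t)                   # (N, M) basis evaluations

basis = BasisNet()
# ... offline training of `basis` over a family of parametric ODE solutions ...
for p in basis.parameters():                  # online stage: freeze the basis
    p.requires_grad_(False)

head = nn.Linear(64, 1, bias=False)           # lightweight linear output layer
lam = torch.nn.Parameter(torch.tensor(0.5))   # unknown equation parameter
opt = torch.optim.Adam(list(head.parameters()) + [lam], lr=1e-2)

t_obs = torch.rand(32, 1, requires_grad=True)
u_obs = torch.exp(-t_obs).detach()            # toy observations of u' = -lam*u

for _ in range(500):
    u = head(basis(t_obs))                    # u(t) = sum_m w_m * phi_m(t)
    du = torch.autograd.grad(u.sum(), t_obs, create_graph=True)[0]
    # online loss: data fit + residual of the governing equation
    loss = ((u - u_obs) ** 2).mean() + ((du + lam * u) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```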

[LG-29] Predicting effect of novel treatments using molecular pathways and real-world data

Link: https://arxiv.org/abs/2509.07204
Authors: Adrien Couetoux,Thomas Devenyns,Lise Diagne,David Champagne,Pierre-Yves Mousset,Chris Anagnostopoulos
Subjects: Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Note:

Abstract:In pharmaceutical R&D, predicting the efficacy of a pharmaceutical in treating a particular disease prior to clinical testing or any real-world use has been challenging. In this paper, we propose a flexible and modular machine learning-based approach for predicting the efficacy of an untested pharmaceutical for treating a disease. We train a machine learning model using sets of pharmaceutical-pathway weight impact scores and patient data, which can include patient characteristics and observed clinical outcomes. The resulting model then analyses weighted impact scores of an untested pharmaceutical across human biological molecule-protein pathways to generate a predicted efficacy value. We demonstrate how the method works on a real-world dataset with patient treatments and outcomes, with two different weight impact score algorithms. We include methods for evaluating the generalisation performance on unseen treatments and for characterising conditions under which the approach can be expected to be most predictive. We discuss specific ways in which our approach can be iterated on, making it an initial framework to support future work on predicting the effect of untested drugs, leveraging real-world clinical data (RWD) and drug embeddings.

[LG-30] Fed-REACT: Federated Representation Learning for Heterogeneous and Evolving Data

Link: https://arxiv.org/abs/2509.07198
Authors: Yiyue Chen,Usman Akram,Chianing Wang,Haris Vikalo
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)
Note:

Abstract:Motivated by the high resource costs and privacy concerns associated with centralized machine learning, federated learning (FL) has emerged as an efficient alternative that enables clients to collaboratively train a global model while keeping their data local. However, in real-world deployments, client data distributions often evolve over time and differ significantly across clients, introducing heterogeneity that degrades the performance of standard FL algorithms. In this work, we introduce Fed-REACT, a federated learning framework designed for heterogeneous and evolving client data. Fed-REACT combines representation learning with evolutionary clustering in a two-stage process: (1) in the first stage, each client learns a local model to extract feature representations from its data; (2) in the second stage, the server dynamically groups clients into clusters based on these representations and coordinates cluster-wise training of task-specific models for downstream objectives such as classification or regression. We provide a theoretical analysis of the representation learning stage, and empirically demonstrate that Fed-REACT achieves superior accuracy and robustness on real-world datasets.
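
A minimal sketch of the stage-2 server step, with plain KMeans standing in for the paper's evolutionary clustering and random vectors standing in for the learned client representations:

```python
import numpy as np
from sklearn.cluster import KMeans

# One embedding vector per client, e.g. the mean feature vector produced by
# each client's stage-1 encoder (random stand-ins here).
rng = np.random.default_rng(0)
client_reprs = rng.normal(size=(20, 16))      # 20 clients, 16-dim embeddings

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(client_reprs)
for c in range(3):
    members = np.where(clusters == c)[0]
    # ... coordinate cluster-wise training of a task-specific model ...
    print(f"cluster {c}: clients {members.tolist()}")
```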

[LG-31] PLaID: A Preference Aligned Language Model for Targeted Inorganic Materials Design

Link: https://arxiv.org/abs/2509.07150
Authors: Andy Xu,Rohan Desai,Larry Wang,Gabriel Hope,Ethan Ritz
Subjects: Machine Learning (cs.LG); Materials Science (cond-mat.mtrl-sci)
Note:

Abstract:Discovering novel materials is critical for technological advancements such as solar cells, batteries, and carbon capture. However, the development of new materials is constrained by a slow and expensive trial-and-error process. To accelerate this pipeline, we introduce PLaID++, a Large Language Model (LLM) fine-tuned for stable and property-guided crystal generation. We fine-tune Qwen-2.5 7B to generate crystal structures using a novel Wyckoff-based text representation. We show that generation can be effectively guided with a reinforcement learning technique based on Direct Preference Optimization (DPO), with sampled structures categorized by their stability, novelty, and space group. By encoding symmetry constraints directly into text and guiding model outputs towards desirable chemical space, PLaID++ generates structures that are thermodynamically stable, unique, and novel at a ~50% greater rate than prior methods and conditionally generates structures with desired space group properties. Our experiments highlight the effectiveness of iterative DPO, achieving ~115% and ~50% improvements in unconditional and space group conditioned generation, respectively, compared to fine-tuning alone. Our work demonstrates the potential of adapting post-training techniques from natural language processing to materials design, paving the way for targeted and efficient discovery of novel materials.
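
For reference, the standard DPO objective (Rafailov et al., 2023) that this kind of preference alignment builds on, where (y_w, y_l) would be preferred and dispreferred sampled structures ranked by stability, novelty, and space group:

```latex
\begin{equation}
\mathcal{L}_{\mathrm{DPO}}
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\Big(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \Big)\right]
\end{equation}
```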

[LG-32] Of Graphs and Tables: Zero-Shot Node Classification with Tabular Foundation Models

Link: https://arxiv.org/abs/2509.07143
Authors: Adrian Hayler,Xingyue Huang,İsmail İlkan Ceylan,Michael Bronstein,Ben Finkelshtein
Subjects: Machine Learning (cs.LG)
Note:

Abstract:Graph foundation models (GFMs) have recently emerged as a promising paradigm for achieving broad generalization across various graph data. However, existing GFMs are often trained on datasets that were shown to poorly represent real-world graphs, limiting their generalization performance. In contrast, tabular foundation models (TFMs) not only excel at classical tabular prediction tasks but have also shown strong applicability in other domains such as time series forecasting, natural language processing, and computer vision. Motivated by this, we take an alternative view to the standard perspective of GFMs and reformulate node classification as a tabular problem. Each node can be represented as a row with feature, structure, and label information as columns, enabling TFMs to directly perform zero-shot node classification via in-context learning. In this work, we introduce TabGFM, a graph foundation model framework that first converts a graph into a table via feature and structural encoders, applies multiple TFMs to diversely subsampled tables, and then aggregates their outputs through ensemble selection. Through experiments on 28 real-world datasets, TabGFM achieves consistent improvements over task-specific GNNs and state-of-the-art GFMs, highlighting the potential of tabular reformulation for scalable and generalizable graph learning.
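
A minimal sketch of the graph-to-table reformulation on a toy graph; the structural columns chosen here (degree, clustering coefficient, PageRank) are illustrative stand-ins for the paper's learned feature and structural encoders:

```python
import numpy as np
import networkx as nx

g = nx.karate_club_graph()
feats = np.eye(g.number_of_nodes())           # stand-in node features
deg = dict(g.degree())
clus = nx.clustering(g)
pr = nx.pagerank(g)

rows = []
for v in g.nodes():
    structure = [deg[v], clus[v], pr[v]]
    label = 0 if g.nodes[v]["club"] == "Mr. Hi" else 1
    rows.append(list(feats[v]) + structure + [label])

table = np.array(rows)                        # one row per node, ready for a TFM
print(table.shape)                            # (34, 34 + 3 + 1)
```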

[LG-33] Avoiding Over-Personalization with Rule-Guided Knowledge Graph Adaptation for LLM Recommendations ISWC

Link: https://arxiv.org/abs/2509.07133
Authors: Fernando Spadea,Oshani Seneviratne
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Note: 5 pages, 2 figures, ISWC

Abstract:We present a lightweight neuro-symbolic framework to mitigate over-personalization in LLM-based recommender systems by adapting user-side Knowledge Graphs (KGs) at inference time. Instead of retraining models or relying on opaque heuristics, our method restructures a user’s Personalized Knowledge Graph (PKG) to suppress feature co-occurrence patterns that reinforce Personalized Information Environments (PIEs), i.e., algorithmically induced filter bubbles that constrain content diversity. These adapted PKGs are used to construct structured prompts that steer the language model toward more diverse, Out-PIE recommendations while preserving topical relevance. We introduce a family of symbolic adaptation strategies, including soft reweighting, hard inversion, and targeted removal of biased triples, and a client-side learning algorithm that optimizes their application per user. Experiments on a recipe recommendation benchmark show that personalized PKG adaptations significantly increase content novelty while maintaining recommendation quality, outperforming global adaptation and naive prompt-based methods.

[LG-34] Sequentially Auditing Differential Privacy

Link: https://arxiv.org/abs/2509.07055
Authors: Tomás González,Mateo Dulce-Rubio,Aaditya Ramdas,Mónica Ribero
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
Note:

Abstract:We propose a practical sequential test for auditing differential privacy guarantees of black-box mechanisms. The test processes streams of mechanisms' outputs providing anytime-valid inference while controlling Type I error, overcoming the fixed sample size limitation of previous batch auditing methods. Experiments show this test detects violations with sample sizes that are orders of magnitude smaller than existing methods, reducing this number from 50K to a few hundred examples, across diverse realistic mechanisms. Notably, it identifies DP-SGD privacy violations in under one training run, unlike prior methods needing full model training.

[LG-35] End-to-End Efficiency in Keyword Spotting: A System-Level Approach for Embedded Microcontrollers

Link: https://arxiv.org/abs/2509.07051
Authors: Pietro Bartoli,Tommaso Bondini,Christian Veronesi,Andrea Giudici,Niccolò Antonello,Franco Zappa
Subjects: Sound (cs.SD); Machine Learning (cs.LG)
Note: 4 pages, 2 figures, 1 table. Accepted for publication in IEEE Sensors 2025. © 2025 IEEE. Personal use permitted. Permission from IEEE required for all other uses

Abstract:Keyword spotting (KWS) is a key enabling technology for hands-free interaction in embedded and IoT devices, where stringent memory and energy constraints challenge the deployment of AI-enabled devices. In this work, we systematically evaluate and compare several state-of-the-art lightweight neural network architectures, including DS-CNN, LiCoNet, and TENet, alongside our proposed Typman-KWS (TKWS) architecture built upon MobileNet, specifically designed for efficient KWS on microcontroller units (MCUs). Unlike prior studies focused solely on model inference, our analysis encompasses the entire processing pipeline, from Mel-Frequency Cepstral Coefficient (MFCC) feature extraction to neural inference, and is benchmarked across three STM32 platforms (N6, H7, and U5). Our results show that TKWS with three residual blocks achieves up to 92.4% F1-score with only 14.4k parameters, reducing memory footprint without compromising the accuracy. Moreover, the N6 MCU with integrated neural acceleration achieves the best energy-delay product (EDP), enabling efficient, low-latency operation even with high-resolution features. Our findings highlight that model accuracy alone does not determine real-world effectiveness; rather, optimal keyword spotting deployments require careful consideration of feature extraction parameters and hardware-specific optimization.
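
A minimal sketch of the front end of such a pipeline; the window, hop, and coefficient counts are illustrative, not the paper's benchmarked settings:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)    # 1 s stand-in for a keyword clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=10,
                            n_fft=640, hop_length=320)  # 40 ms windows, 20 ms hop
x = mfcc.T[None, ...]                         # (1, frames, coeffs) model input
print(x.shape)                                # e.g. (1, 51, 10)
```

The point the abstract makes is that these front-end parameters trade off against MCU inference cost, so they belong in the benchmark alongside the network itself.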

[LG-36] Recursive State Inference for Linear PASFA

Link: https://arxiv.org/abs/2509.07028
Authors: Vishal Rishi
Subjects: Machine Learning (cs.LG)
Note: 5 pages, 1 figure

Abstract:Slow feature analysis (SFA), as a method for learning slowly varying features in classification and signal analysis, has attracted increasing attention in recent years. Recent probabilistic extensions to SFA learn effective representations for classification tasks. Notably, Probabilistic Adaptive Slow Feature Analysis (PASFA) models the slow features as states in an ARMA process and estimates the model from the observations. However, there is a need to develop efficient methods to infer the states (slow features) from the observations and the model. In this paper, a recursive extension to the linear PASFA has been proposed. The proposed algorithm performs MMSE estimation of states evolving according to an ARMA process, given the observations and the model. Although current methods tackle this problem using Kalman filters after transforming the ARMA process into a state space model, the original states (or slow features) that form useful representations cannot be easily recovered. The proposed technique is evaluated on a synthetic dataset to demonstrate its correctness.

[LG-37] Private Queries with Sigma-Counting

Link: https://arxiv.org/abs/2509.07018
Authors: Jun Gao,Jie Ding
Subjects: Databases (cs.DB); Machine Learning (cs.LG)
Note:

Abstract:Many data applications involve counting queries, where a client specifies a feasible range of variables and a database returns the corresponding item counts. A program that produces the counts of different queries often risks leaking sensitive individual-level information. A popular approach to enhance data privacy is to return a noisy version of the actual count. It is typically achieved by adding independent noise to each query and then controlling the total privacy budget within a period. This approach may be limited in the number of queries and output accuracy in practice. Also, the returned counts do not maintain the total order for nested queries, an important feature in many applications. This work presents the design and analysis of a new method, sigma-counting, that addresses these challenges. Sigma-counting uses the notion of sigma-algebra to construct privacy-preserving counting queries. We show that the proposed concepts and methods can significantly improve output accuracy while maintaining a desired privacy level in the presence of massive queries to the same data. We also discuss how the technique can be applied to address large and time-varying datasets.

[LG-38] Machine Generalize Learning in Agent-Based Models: Going Beyond Surrogate Models for Calibration in ABMs

Link: https://arxiv.org/abs/2509.07013
Authors: Sima Najafzadehkhoei,George Vega Yon,Bernardo Modenesi,Derek S.Meyer
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Methodology (stat.ME)
Note:

Abstract:Calibrating agent-based epidemic models is computationally demanding. We present a supervised machine learning calibrator that learns the inverse mapping from epidemic time series to SIR parameters. A three-layer bidirectional LSTM ingests 60-day incidence together with population size and recovery rate, and outputs transmission probability, contact rate, and R0. Training uses a composite loss with an epidemiology-motivated consistency penalty that encourages R0 * recovery rate to equal transmission probability * contact rate. In a 1000-scenario simulation study, we compare the calibrator with Approximate Bayesian Computation (likelihood-free MCMC). The method achieves lower error across all targets (MAE: R0 0.0616 vs 0.275; transmission 0.0715 vs 0.128; contact 1.02 vs 4.24), produces tighter predictive intervals with near nominal coverage, and reduces wall clock time from 77.4 s to 2.35 s per calibration. Although contact rate and transmission probability are partially nonidentifiable, the approach reproduces epidemic curves more faithfully than ABC, enabling fast and practical calibration. We evaluate it on SIR agent based epidemics generated with epiworldR and provide an implementation in R.
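
A minimal sketch of the composite loss with the consistency penalty described above; the penalty weight `lam` and the output ordering are assumptions:

```python
import torch

def composite_loss(pred, target, gamma, lam=0.1):
    """MSE on the three targets plus the epidemiological consistency
    penalty encouraging R0 * gamma == beta * c (for SIR, R0 = beta*c/gamma)."""
    beta_hat, c_hat, r0_hat = pred[:, 0], pred[:, 1], pred[:, 2]
    mse = torch.mean((pred - target) ** 2)
    consistency = torch.mean((r0_hat * gamma - beta_hat * c_hat) ** 2)
    return mse + lam * consistency

pred = torch.rand(8, 3, requires_grad=True)   # (transmission, contact, R0)
target = torch.rand(8, 3)
gamma = torch.full((8,), 0.2)                 # known recovery rate input
loss = composite_loss(pred, target, gamma)
loss.backward()
```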

[LG-39] veScale: Consistent and Efficient Tensor Programming with Eager-Mode SPMD

Link: https://arxiv.org/abs/2509.07003
Authors: Youjie Li,Cheng Wan,Zhiqi Lin,Hongyu Zhu,Jiacheng Yang,Ziang Song,Xinyi Di,Jiawei Wu,Huiyao Shu,Wenlei Bao,Yanghua Peng,Haibin Lin,Li-Wen Chang
Subjects: Programming Languages (cs.PL); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
Note: 21 pages, 16 figures, 5 tables

Abstract:Large Language Models (LLMs) have scaled rapidly in size and complexity, requiring increasingly intricate parallelism for distributed training, such as 3D parallelism. This sophistication motivates a shift toward simpler, more debuggable programming paradigm like Single Program Multiple Data (SPMD). However, SPMD in eager execution introduces two key challenges: ensuring consistency with single-device execution and achieving high performance at scale. In this paper, we introduce veScale, an eager-mode training system that fully embraces SPMD paradigm to democratize distributed tensor programming. veScale addresses the prevalent issue of inconsistent results in systems like PyTorch by introducing a novel algorithm of distributed Random Number Generation (RNG) compatible with arbitrary sharded operators. veScale also significantly boosts training performance by reducing PyTorch primitive’s overhead and improving communication efficiency. Evaluations show that veScale delivers up to 2.2x speedup over the state-of-the-art training systems, like TorchTitan, and cuts code complexity by 78.4%, while preserving single-device-equivalent results.

[LG-40] A Kriging-HDMR-based surrogate model with sample pool-free active learning strategy for reliability analysis

Link: https://arxiv.org/abs/2509.06978
Authors: Wenxiong Li,Hanyu Liao,Suiyin Chen
Subjects: Machine Learning (cs.LG)
Note:

Abstract:In reliability engineering, conventional surrogate models encounter the "curse of dimensionality" as the number of random variables increases. While the active learning Kriging surrogate approaches with high-dimensional model representation (HDMR) enable effective approximation of high-dimensional functions and are widely applied to optimization problems, few studies specifically focus on reliability analysis, which prioritizes prediction accuracy in critical regions over uniform accuracy across the entire domain. This study develops an active learning surrogate model method based on Kriging-HDMR modeling for reliability analysis. The proposed approach facilitates the approximation of high-dimensional limit state functions through a composite representation constructed from multiple low-dimensional sub-surrogate models. The architecture of the surrogate modeling framework comprises three distinct stages: developing single-variable sub-surrogate models for all random variables, identifying the requirements for coupling-variable sub-surrogate models, and constructing the coupling-variable sub-surrogate models. Optimization mathematical models for selection of design of experiment samples are formulated based on each stage's characteristics, with objectives incorporating uncertainty variance, predicted mean, sample location and inter-sample distances. A candidate sample pool-free approach is adopted to achieve the selection of informative samples. Numerical experiments demonstrate that the proposed method achieves high computational efficiency while maintaining strong predictive accuracy in solving high-dimensional reliability problems.

[LG-41] Nuclear Data Adjustment for Nonlinear Applications in the OECD/NEA WPNCS SG14 Benchmark – A Bayesian Inverse UQ-based Approach for Data Assimilation

Link: https://arxiv.org/abs/2509.07790
Authors: Christopher Brady(1),Xu Wu(1) ((1) North Carolina State University)
Subjects: Nuclear Theory (nucl-th); Machine Learning (cs.LG)
Note: 31 pages, 9 tables, 8 figures, submitted to Nuclear Science and Engineering, included in proceedings of International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering (MC 2025)

Abstract:The Organization for Economic Cooperation and Development (OECD) Working Party on Nuclear Criticality Safety (WPNCS) proposed a benchmark exercise to assess the performance of current nuclear data adjustment techniques applied to nonlinear applications and experiments with low correlation to applications. This work introduces Bayesian Inverse Uncertainty Quantification (IUQ) as a method for nuclear data adjustments in this benchmark, and compares IUQ to the more traditional methods of Generalized Linear Least Squares (GLLS) and Monte Carlo Bayes (MOCABA). Posterior predictions from IUQ showed agreement with GLLS and MOCABA for linear applications. When comparing GLLS, MOCABA, and IUQ posterior predictions to computed model responses using adjusted parameters, we observe that GLLS predictions fail to replicate computed response distributions for nonlinear applications, while MOCABA shows near agreement, and IUQ uses computed model responses directly. We also discuss observations on why experiments with low correlation to applications can be informative to nuclear data adjustments and identify some properties useful in selecting experiments for inclusion in nuclear data adjustment. Performance in this benchmark indicates potential for Bayesian IUQ in nuclear data adjustments.

[LG-42] Decentralized Online Riemannian Optimization Beyond Hadamard Manifolds

Link: https://arxiv.org/abs/2509.07779
Authors: Emre Sahinoglu,Shahin Shahrampour
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Note:

Abstract:We study decentralized online Riemannian optimization over manifolds with possibly positive curvature, going beyond the Hadamard manifold setting. Decentralized optimization techniques rely on a consensus step that is well understood in Euclidean spaces because of their linearity. However, in positively curved Riemannian spaces, a main technical challenge is that geodesic distances may not induce a globally convex structure. In this work, we first analyze a curvature-aware Riemannian consensus step that enables linear convergence beyond Hadamard manifolds. Building on this step, we establish an O(\sqrt{T}) regret bound for the decentralized online Riemannian gradient descent algorithm. Then, we investigate the two-point bandit feedback setup, where we employ computationally efficient gradient estimators using smoothing techniques, and we demonstrate the same O(\sqrt{T}) regret bound through the subconvexity analysis of smoothed objectives.

[LG-43] Toward Quantum Utility in Finance: A Robust Data-Driven Algorithm for Asset Clustering

Link: https://arxiv.org/abs/2509.07766
Authors: Shivam Sharma,Supreeth Mysore Venkatesh,Pushkin Kachroo
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Note: 9 pages, 2 figures, International Quantum Engineering conference and exhibition (QUEST-IS 2025)

Abstract:Clustering financial assets based on return correlations is a fundamental task in portfolio optimization and statistical arbitrage. However, classical clustering methods often fall short when dealing with signed correlation structures, typically requiring lossy transformations and heuristic assumptions such as a fixed number of clusters. In this work, we apply the Graph-based Coalition Structure Generation algorithm (GCS-Q) to directly cluster signed, weighted graphs without relying on such transformations. GCS-Q formulates each partitioning step as a QUBO problem, enabling it to leverage quantum annealing for efficient exploration of exponentially large solution spaces. We validate our approach on both synthetic and real-world financial data, benchmarking against state-of-the-art classical algorithms such as SPONGE and k-Medoids. Our experiments demonstrate that GCS-Q consistently achieves higher clustering quality, as measured by Adjusted Rand Index and structural balance penalties, while dynamically determining the number of clusters. These results highlight the practical utility of near-term quantum computing for graph-based unsupervised learning in financial applications.

[LG-44] Building causation links in stochastic nonlinear systems from data

Link: https://arxiv.org/abs/2509.07701
Authors: Sergio Chibbaro,Cyril Furtlehner,Théo Marchetta,Andrei-Tiberiu Pantea,Davide Rossetti
Subjects: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)
Note: 24 pages, 11 figures. Comments are welcome

Abstract:Causal relationships play a fundamental role in understanding the world around us. The ability to identify and understand cause-effect relationships is critical to making informed decisions, predicting outcomes, and developing effective strategies. However, deciphering causal relationships from observational data is a difficult task, as correlations alone may not provide definitive evidence of causality. In recent years, the field of machine learning (ML) has emerged as a powerful tool, offering new opportunities for uncovering hidden causal mechanisms and better understanding complex systems. In this work, we address the issue of detecting the intrinsic causal links of a large class of complex systems in the framework of the response theory in physics. We develop some theoretical ideas put forward by [1], and technically we use state-of-the-art ML techniques to build up models from data. We consider both linear stochastic and non-linear systems. Finally, we compute the asymptotic efficiency of the linear-response-based causal predictor for a large-scale Markov process network of linear interactions.

[LG-45] Exploring System Adaptations For Minimum Latency Real-Time Piano Transcription

Link: https://arxiv.org/abs/2509.07586
Authors: Patricia Hu,Silvan David Peter,Jan Schlüter,Gerhard Widmer
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG); Sound (cs.SD)
Note: to be published in Proceedings of the 26th International Society for Music Information Retrieval (ISMIR) Conference 2025, Daejeon, South Korea

Abstract:Advances in neural network design and the availability of large-scale labeled datasets have driven major improvements in piano transcription. Existing approaches target either offline applications, with no restrictions on computational demands, or online transcription, with delays of 128-320 ms. However, most real-time musical applications require latencies below 30 ms. In this work, we investigate whether and how the current state-of-the-art online transcription model can be adapted for real-time piano transcription. Specifically, we eliminate all non-causal processing, and reduce computational load through shared computations across core model components and variations in model size. Additionally, we explore different pre- and postprocessing strategies, and related label encoding schemes, and discuss their suitability for real-time transcription. Evaluating the adaptations on the MAESTRO dataset, we find a drop in transcription accuracy due to strictly causal processing as well as a tradeoff between the preprocessing latency and prediction accuracy. We release our system as a baseline to support researchers in designing models towards minimum latency real-time transcription.
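
As an illustration of the kind of change "eliminating all non-causal processing" entails, here is a minimal causal 1-D convolution that left-pads so no output frame sees future input (a generic building block, not the paper's architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Left-pad by (k-1)*dilation so output frame t only depends on frames <= t."""
    def __init__(self, c_in, c_out, k, dilation=1):
        super().__init__()
        self.pad = (k - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, k, dilation=dilation)

    def forward(self, x):                     # x: (batch, channels, frames)
        return self.conv(F.pad(x, (self.pad, 0)))

y = CausalConv1d(1, 8, k=3)(torch.randn(2, 1, 100))
print(y.shape)                               # (2, 8, 100), same frame count
```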

[LG-46] Asynchronous Gossip Algorithms for Rank-Based Statistical Methods

Link: https://arxiv.org/abs/2509.07543
Authors: Anna Van Elst,Igor Colin,Stephan Clémençon
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Note:

Abstract:As decentralized AI and edge intelligence become increasingly prevalent, ensuring robustness and trustworthiness in such distributed settings has become a critical issue, especially in the presence of corrupted or adversarial data. Traditional decentralized algorithms are vulnerable to data contamination as they typically rely on simple statistics (e.g., means or sums), motivating the need for more robust statistics. In line with recent work on decentralized estimation of trimmed means and ranks, we develop gossip algorithms for computing a broad class of rank-based statistics, including L-statistics and rank statistics, both known for their robustness to outliers. We apply our method to perform robust distributed two-sample hypothesis testing, introducing the first gossip algorithm for Wilcoxon rank-sum tests. We provide rigorous convergence guarantees, including the first convergence rate bound for asynchronous gossip-based rank estimation. We empirically validate our theoretical results through experiments on diverse network topologies.
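
A minimal sketch of the pairwise gossip primitive such algorithms build on, here just averaging to the network mean on an assumed ring topology; the paper's rank-statistic and Wilcoxon variants are considerably more elaborate:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10
values = rng.normal(size=n)                   # each node's local statistic
target = values.mean()

edges = [(i, (i + 1) % n) for i in range(n)]  # ring topology, an assumption
for _ in range(2000):
    i, j = edges[rng.integers(len(edges))]    # asynchronous wake-up of one edge
    values[i] = values[j] = 0.5 * (values[i] + values[j])

print(abs(values - target).max())             # near 0: consensus on the mean
```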

[LG-47] RINO: Renormalization Group Invariance with No Labels NEURIPS2025

Link: https://arxiv.org/abs/2509.07486
Authors: Zichun Hao,Raghav Kansal,Abhijith Gandrakota,Chang Sun,Ngadiuba Jennifer,Javier Duarte,Maria Spiropulu
Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Note: Submission for Machine Learning and the Physical Sciences Workshop @ NeurIPS 2025

Abstract:A common challenge with supervised machine learning (ML) in high energy physics (HEP) is the reliance on simulations for labeled data, which can often mismodel the underlying collision or detector response. To help mitigate this problem of domain shift, we propose RINO (Renormalization Group Invariance with No Labels), a self-supervised learning approach that can instead pretrain models directly on collision data, learning embeddings invariant to renormalization group flow scales. In this work, we pretrain a transformer-based model on jets originating from quantum chromodynamic (QCD) interactions from the JetClass dataset, emulating real QCD-dominated experimental data, and then finetune on the JetNet dataset – emulating simulations – for the task of identifying jets originating from top quark decays. RINO demonstrates improved generalization from the JetNet training data to JetClass data compared to supervised training on JetNet from scratch, demonstrating the potential for RINO pretraining on real collision data followed by fine-tuning on small, high-quality MC datasets, to improve the robustness of ML models in HEP.

[LG-48] Synthetic Data Generation with Lorenzetti for Time Series Anomaly Detection in High-Energy Physics Calorimeters

Link: https://arxiv.org/abs/2509.07451
Authors: Laura Boggia,Bogdan Malaescu
Subjects: High Energy Physics - Experiment (hep-ex); Machine Learning (cs.LG)
Note: 4 pages, 2 figures, Submission to SciPost proceedings for EuCAIFCon 2025

Abstract:Anomaly detection in multivariate time series is crucial to ensure the quality of data coming from a physics experiment. Accurately identifying the moments when unexpected errors or defects occur is essential, yet challenging due to scarce labels, unknown anomaly types, and complex correlations across dimensions. To address the scarcity and unreliability of labelled data, we use the Lorenzetti Simulator to generate synthetic events with injected calorimeter anomalies. We then assess the sensitivity of several time series anomaly detection methods, including transformer-based and other deep learning models. The approach employed here is generic and applicable to different detector designs and defects.

[LG-49] Reinforcement learning for online hyperparameter tuning in convex quadratic programming

Link: https://arxiv.org/abs/2509.07404
Authors: Jeremy Bertoncini,Alberto De Marchi,Matthias Gerdts,Simon Gottschalk
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG)
Note:

Abstract:Quadratic programming is a workhorse of modern nonlinear optimization, control, and data science. Although regularized methods offer convergence guarantees under minimal assumptions on the problem data, they can exhibit the slow tail-convergence typical of first-order schemes, thus requiring many iterations to achieve high-accuracy solutions. Moreover, hyperparameter tuning significantly impacts solver performance, but how to find an appropriate parameter configuration remains an elusive research question. To address these issues, we explore how data-driven approaches can accelerate the solution process. Aiming at high-accuracy solutions, we focus on a stabilized interior-point solver and carefully handle its two-loop flow and control parameters. We will show that reinforcement learning can make a significant contribution to facilitating the solver tuning and to speeding up the optimization process. Numerical experiments demonstrate that, after a lightweight training, the learned policy generalizes well to different problem classes with varying dimensions and to various solver configurations.

[LG-50] Identifying Neural Signatures from fMRI using Hybrid Principal Components Regression

Link: https://arxiv.org/abs/2509.07300
Authors: Jared Rieck,Julia Wrobel,Joshua L. Gowin,Yue Wang,Martin Paulus,Ryan Peterson
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Note:

Abstract:Recent advances in neuroimaging analysis have enabled accurate decoding of mental state from brain activation patterns during functional magnetic resonance imaging scans. A commonly applied tool for this purpose is principal components regression regularized with the least absolute shrinkage and selection operator (LASSO PCR), a type of multi-voxel pattern analysis (MVPA). This model presumes that all components are equally likely to harbor relevant information, when in fact the task-related signal may be concentrated in specific components. In such cases, the model will fail to select the optimal set of principal components that maximizes the total signal relevant to the cognitive process under study. Here, we present modifications to LASSO PCR that allow for a regularization penalty tied directly to the index of the principal component, reflecting a prior belief that task-relevant signal is more likely to be concentrated in components explaining greater variance. Additionally, we propose a novel hybrid method, Joint Sparsity-Ranked LASSO (JSRL), which integrates component-level and voxel-level activity under an information parity framework and imposes ranked sparsity to guide component selection. We apply the models to brain activation during risk taking, monetary incentive, and emotion regulation tasks. Results demonstrate that incorporating sparsity ranking into LASSO PCR produces models with enhanced classification performance, with JSRL achieving up to 51.7% improvement in cross-validated deviance R^2 and 7.3% improvement in cross-validated AUC. Furthermore, sparsity-ranked models perform as well as or better than standard LASSO PCR approaches across all classification tasks and allocate predictive weight to brain regions consistent with their established functional roles, offering a robust alternative for MVPA.
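
A minimal sketch of an index-ranked LASSO penalty on principal components, implemented via the standard column-rescaling trick; the exponent `gamma` and the toy data are assumptions, and the paper's JSRL additionally mixes in voxel-level terms:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 40))
y = (X @ rng.normal(size=40)) * 0.1 + rng.normal(size=100)

Z = PCA(n_components=20).fit_transform(X)
gamma = 0.5
w = np.arange(1, Z.shape[1] + 1) ** gamma     # penalty grows with component index

# Lasso on rescaled columns Z/w is equivalent to a weighted-L1 penalty on Z:
# ||w * beta||_1, so later components must earn their way into the model.
model = Lasso(alpha=0.1).fit(Z / w, y)
coefs = model.coef_ / w                       # map back to the original PC scale
print(np.nonzero(coefs)[0])                   # selected components
```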

[LG-51] NestGNN: A Graph Neural Network Framework Generalizing the Nested Logit Model for Travel Mode Choice

Link: https://arxiv.org/abs/2509.07123
Authors: Yuqi Zhou,Zhanhong Cheng,Lingqian Hu,Yuheng Bu,Shenhao Wang
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Note:

Abstract:Nested logit (NL) has been commonly used for discrete choice analysis, including a wide range of applications such as travel mode choice, automobile ownership, or location decisions. However, the classical NL models are restricted by their limited representation capability and handcrafted utility specification. While researchers introduced deep neural networks (DNNs) to tackle such challenges, the existing DNNs cannot explicitly capture inter-alternative correlations in the discrete choice context. To address the challenges, this study proposes a novel concept - alternative graph - to represent the relationships among travel mode alternatives. Using a nested alternative graph, this study further designs a nested-utility graph neural network (NestGNN) as a generalization of the classical NL model in the neural network family. Theoretically, NestGNNs generalize the classical NL models and existing DNNs in terms of model representation, while retaining the crucial two-layer substitution patterns of the NL models: proportional substitution within a nest but non-proportional substitution beyond a nest. Empirically, we find that the NestGNNs significantly outperform the benchmark models, particularly the corresponding NL models, by 9.2%. As shown by elasticity tables and substitution visualization, NestGNNs retain the two-layer substitution patterns of the NL model, yet present more flexibility in their model design space. Overall, our study demonstrates the power of NestGNN in prediction, interpretation, and its flexibility of generalizing the classical NL model for analyzing travel mode choice.

[LG-52] ADHAM: Additive Deep Hazard Analysis Mixtures for Interpretable Survival Regression

Link: https://arxiv.org/abs/2509.07108
Authors: Mert Ketenci,Vincent Jeanselme,Harry Reyes Nieva,Shalmali Joshi,Noémie Elhadad
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Note:

Abstract:Survival analysis is a fundamental tool for modeling time-to-event outcomes in healthcare. Recent advances have introduced flexible neural network approaches for improved predictive performance. However, most of these models do not provide interpretable insights into the association between exposures and the modeled outcomes, a critical requirement for decision-making in clinical practice. To address this limitation, we propose Additive Deep Hazard Analysis Mixtures (ADHAM), an interpretable additive survival model. ADHAM assumes a conditional latent structure that defines subgroups, each characterized by a combination of covariate-specific hazard functions. To select the number of subgroups, we introduce a post-training refinement that reduces the number of equivalent latent subgroups by merging similar groups. We perform comprehensive studies to demonstrate ADHAM’s interpretability at the population, subgroup, and individual levels. Extensive experiments on real-world datasets show that ADHAM provides novel insights into the association between exposures and outcomes. Further, ADHAM remains on par with existing state-of-the-art survival baselines in terms of predictive performance, offering a scalable and interpretable approach to time-to-event prediction in healthcare.

[LG-53] PUUMA (Placental patch and whole-Uterus dual-branch U-Mamba-based Architecture): Functional MRI Prediction of Gestational Age at Birth and Preterm Risk MICCAI2025

Link: https://arxiv.org/abs/2509.07042
Authors: Diego Fajardo-Rojas,Levente Baljer,Jordina Aviles Verdera,Megan Hall,Daniel Cromb,Mary A. Rutherford,Lisa Story,Emma C. Robinson,Jana Hutter
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Note: 11 pages, 4 figures, 2 tables, to be published with Springer in Lecture Notes in Computer Science, as part of the Perinatal, Preterm and Paediatric Image (PIPPI) Analysis workshop held in conjunction with MICCAI 2025

Abstract:Preterm birth is a major cause of mortality and lifelong morbidity in childhood. Its complex and multifactorial origins limit the effectiveness of current clinical predictors and impede optimal care. In this study, a dual-branch deep learning architecture (PUUMA) was developed to predict gestational age (GA) at birth using T2* fetal MRI data from 295 pregnancies, encompassing a heterogeneous and imbalanced population. The model integrates both global whole-uterus and local placental features. Its performance was benchmarked against linear regression using cervical length measurements obtained by experienced clinicians from anatomical MRI and other Deep Learning architectures. The GA at birth predictions were assessed using mean absolute error. Accuracy, sensitivity, and specificity were used to assess preterm classification. Both the fully automated MRI-based pipeline and the cervical length regression achieved comparable mean absolute errors (3 weeks) and good sensitivity (0.67) for detecting preterm birth, despite pronounced class imbalance in the dataset. These results provide a proof of concept for automated prediction of GA at birth from functional MRI, and underscore the value of whole-uterus functional imaging in identifying at-risk pregnancies. Additionally, we demonstrate that manual, high-definition cervical length measurements derived from MRI, not currently routine in clinical practice, offer valuable predictive information. Future work will focus on expanding the cohort size and incorporating additional organ-specific imaging to improve generalisability and predictive performance.

[LG-54] A Quantum Bagging Algorithm with Unsupervised Base Learners for Label Corrupted Datasets

Link: https://arxiv.org/abs/2509.07040
Authors: Neeshu Rathi,Sanjeev Kumar
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Note:

Abstract:The development of noise-resilient quantum machine learning (QML) algorithms is critical in the noisy intermediate-scale quantum (NISQ) era. In this work, we propose a quantum bagging framework that uses QMeans clustering as the base learner to reduce prediction variance and enhance robustness to label noise. Unlike bagging frameworks built on supervised learners, our method leverages the unsupervised nature of QMeans, combined with quantum bootstrapping via QRAM-based sampling and bagging aggregation through majority voting. Through extensive simulations on both noisy classification and regression tasks, we demonstrate that the proposed quantum bagging algorithm performs comparably to its classical counterpart using KMeans while exhibiting greater resilience to label corruption than supervised bagging methods. This highlights the potential of unsupervised quantum bagging in learning from unreliable data.
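
A classical analogue sketch of the proposed scheme, with scikit-learn's KMeans standing in for QMeans and NumPy resampling standing in for QRAM-based quantum bootstrapping:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
rng = np.random.default_rng(0)

votes = []
for b in range(11):                           # odd count for majority voting
    idx = rng.integers(0, len(X), len(X))     # bootstrap replicate
    km = KMeans(n_clusters=2, n_init=10, random_state=b).fit(X[idx])
    labels = km.predict(X)
    # crude alignment of arbitrary cluster ids across learners,
    # using the first point as a shared reference
    if labels[0] == 1:
        labels = 1 - labels
    votes.append(labels)

bagged = (np.array(votes).mean(axis=0) > 0.5).astype(int)  # majority vote
```

Because the base learners never see the (possibly corrupted) labels, label noise only affects evaluation, not training, which is the robustness argument the abstract makes.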

[LG-55] TGLF-SINN: Deep Learning Surrogate Model for Accelerating Turbulent Transport Modeling in Fusion

Link: https://arxiv.org/abs/2509.07024
Authors: Yadi Cao,Futian Zhang,Wesley Liu,Tom Neiser,Orso Meneghini,Lawson Fuller,Sterling Smith,Raffi Nazikian,Brian Sammuli,Rose Yu
Subjects: Plasma Physics (physics.plasm-ph); Machine Learning (cs.LG)
Note:

Abstract:The Trapped Gyro-Landau Fluid (TGLF) model provides fast, accurate predictions of turbulent transport in tokamaks, but whole device simulations requiring thousands of evaluations remain computationally expensive. Neural network (NN) surrogates offer accelerated inference with fully differentiable approximations that enable gradient-based coupling but typically require large training datasets to capture transport flux variations across plasma conditions, creating significant training burden and limiting applicability to expensive gyrokinetic simulations. We propose TGLF-SINN (Spectra-Informed Neural Network) with three key innovations: (1) principled feature engineering that reduces target prediction range, simplifying the learning task; (2) physics-guided regularization of transport spectra to improve generalization under sparse data; and (3) Bayesian Active Learning (BAL) to strategically select training samples based on model uncertainty, reducing data requirements while maintaining accuracy. Our approach achieves superior performance with significantly less training data. In offline settings, TGLF-SINN reduces logarithmic root mean squared error (LRMSE) by 12.4% compared to the current baseline. Using only 25% of the complete dataset with BAL, we achieve LRMSE only 0.0165 higher than the baseline and 0.0248 higher than our offline model (0.0583). In downstream flux matching applications, our NN surrogate provides 45x speedup over TGLF while maintaining comparable accuracy, demonstrating potential for training efficient surrogates for higher-fidelity models where data acquisition is costly and sparse.

[LG-56] Physics-Guided Diffusion Transformer with Spherical Harmonic Posterior Sampling for High-Fidelity Angular Super-Resolution in Diffusion MRI

Link: https://arxiv.org/abs/2509.07020
Authors: Mu Nan,Taohui Xiao,Ruoyou Wu,Shoujun Yu,Ye Li,Hairong Zheng,Shanshan Wang
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Note:

Abstract:Diffusion MRI (dMRI) angular super-resolution (ASR) aims to reconstruct high-angular-resolution (HAR) signals from limited low-angular-resolution (LAR) data without prolonging scan time. However, existing methods are limited in recovering fine-grained angular details or preserving high fidelity due to inadequate modeling of q-space geometry and insufficient incorporation of physical constraints. In this paper, we introduce a Physics-Guided Diffusion Transformer (PGDiT) designed to explore physical priors throughout both training and inference stages. During training, a Q-space Geometry-Aware Module (QGAM) with b-vector modulation and random angular masking facilitates direction-aware representation learning, enabling the network to generate directionally consistent reconstructions with fine angular details from sparse and noisy data. In inference, a two-stage Spherical Harmonics-Guided Posterior Sampling (SHPS) enforces alignment with the acquired data, followed by heat-diffusion-based SH regularization to ensure physically plausible reconstructions. This coarse-to-fine refinement strategy mitigates oversmoothing and artifacts commonly observed in purely data-driven or generative models. Extensive experiments on general ASR tasks and two downstream applications, Diffusion Tensor Imaging (DTI) and Neurite Orientation Dispersion and Density Imaging (NODDI), demonstrate that PGDiT outperforms existing deep learning models in detail recovery and data fidelity. Our approach presents a novel generative ASR framework that offers high-fidelity HAR dMRI reconstructions, with potential applications in neuroscience and clinical research.

[LG-57] Toric geometry of ReLU neural networks

Link: https://arxiv.org/abs/2509.05894
Authors: Yaoying Fu
Subjects: Algebraic Geometry (math.AG); Machine Learning (cs.LG); Machine Learning (stat.ML)
Note:

Abstract:Given a continuous finitely piecewise linear function f: \mathbb{R}^{n_0} \to \mathbb{R} and a fixed architecture (n_0, \ldots, n_k; 1) of feedforward ReLU neural networks, the exact function realization problem is to determine when some network with the given architecture realizes f. To develop a systematic way to answer these questions, we establish a connection between toric geometry and ReLU neural networks. This approach enables us to utilize numerous structures and tools from algebraic geometry to study ReLU neural networks. Starting with an unbiased ReLU neural network with rational weights, we define the ReLU fan, the ReLU toric variety, and the ReLU Cartier divisor associated with the network. This work also reveals the connection between the tropical geometry and the toric geometry of ReLU neural networks. As an application of the toric geometry framework, we prove a necessary and sufficient criterion of functions realizable by unbiased shallow ReLU neural networks by computing intersection numbers of the ReLU Cartier divisor and torus-invariant curves.

Information Retrieval

[IR-0] KLIPA: A Knowledge Graph and LLM-Driven QA Framework for IP Analysis

Link: https://arxiv.org/abs/2509.07860
Authors: Guanzhi Deng,Yi Xie,Yu-Keung Ng,Mingyang Liu,Peijun Zheng,Jie Liu,Dapeng Wu,Yinqiao Li,Linqi Song
Subjects: Information Retrieval (cs.IR)
Note:

Abstract:Effectively managing intellectual property is a significant challenge. Traditional methods for patent analysis depend on labor-intensive manual searches and rigid keyword matching. These approaches are often inefficient and struggle to reveal the complex relationships hidden within large patent datasets, hindering strategic decision-making. To overcome these limitations, we introduce KLIPA, a novel framework that leverages a knowledge graph and a large language model (LLM) to significantly advance patent analysis. Our approach integrates three key components: a structured knowledge graph to map explicit relationships between patents, a retrieval-augmented generation (RAG) system to uncover contextual connections, and an intelligent agent that dynamically determines the optimal strategy for resolving user queries. We validated KLIPA on a comprehensive, real-world patent database, where it demonstrated substantial improvements in knowledge extraction, discovery of novel connections, and overall operational efficiency. This combination of technologies enhances retrieval accuracy, reduces reliance on domain experts, and provides a scalable, automated solution for any organization managing intellectual property, including technology corporations and legal firms, allowing them to better navigate the complexities of strategic innovation and competitive intelligence.

[IR-1] Query Expansion in the Age of Pre-trained and Large Language Models: A Comprehensive Survey

Link: https://arxiv.org/abs/2509.07794
Authors: Minghan Li,Xinxuan Lv,Junjie Zou,Tongna Chen,Chao Zhang,Suchao An,Ercong Nie,Guodong Zhou
Subjects: Information Retrieval (cs.IR)
Note: 38 pages, 3 figures

Abstract:Modern information retrieval (IR) must bridge short, ambiguous queries and ever more diverse, rapidly evolving corpora. Query Expansion (QE) remains a key mechanism for mitigating vocabulary mismatch, but the design space has shifted markedly with pre-trained language models (PLMs) and large language models (LLMs). This survey synthesizes the field from three angles: (i) a four-dimensional framework of query expansion, from the point of injection (explicit vs. implicit QE), through grounding and interaction (knowledge bases, model-internal capabilities, multi-turn retrieval) and learning alignment, to knowledge graph-based argumentation; (ii) a model-centric taxonomy spanning encoder-only, encoder-decoder, decoder-only, instruction-tuned, and domain/multilingual variants, highlighting their characteristic affordances for QE (contextual disambiguation, controllable generation, zero-/few-shot reasoning); and (iii) practice-oriented guidance on where and how neural QE helps in first-stage retrieval, multi-query fusion, re-ranking, and retrieval-augmented generation (RAG). We compare traditional query expansion with PLM/LLM-based methods across seven key aspects, and we map applications across web search, biomedicine, e-commerce, open-domain QA/RAG, conversational and code search, and cross-lingual settings. The review distills key design choices, namely grounding and interaction, alignment/distillation (SFT/PEFT/DPO), and KG constraints, as robust remedies to topic drift and hallucination. We conclude with an agenda on quality control, cost-aware invocation, domain/temporal adaptation, evaluation beyond end-task metrics, and fairness/privacy. Collectively, these insights provide a principled blueprint for selecting and combining QE techniques under real-world constraints.

[IR-2] A Survey of Long-Document Retrieval in the PLM and LLM Era

Link: https://arxiv.org/abs/2509.07759
Authors: Minghan Li,Miyang Luo,Tianrui Lv,Yishuai Zhang,Siqi Zhao,Ercong Nie,Guodong Zhou
Subjects: Information Retrieval (cs.IR)
Note: 33 pages, 6 figures

Abstract:The proliferation of long-form documents presents a fundamental challenge to information retrieval (IR), as their length, dispersed evidence, and complex structures demand specialized methods beyond standard passage-level techniques. This survey provides the first comprehensive treatment of long-document retrieval (LDR), consolidating methods, challenges, and applications across three major eras. We systematize the evolution from classical lexical and early neural models to modern pre-trained (PLM) and large language models (LLMs), covering key paradigms like passage aggregation, hierarchical encoding, efficient attention, and the latest LLM-driven re-ranking and retrieval techniques. Beyond the models, we review domain-specific applications, specialized evaluation resources, and outline critical open challenges such as efficiency trade-offs, multimodal alignment, and faithfulness. This survey aims to provide both a consolidated reference and a forward-looking agenda for advancing long-document retrieval in the era of foundation models.

[IR-3] Towards End-to-End Model-Agnostic Explanations for RAG Systems SIGIR2025

Link: https://arxiv.org/abs/2509.07620
Authors: Viju Sudhi,Sinchana Ramakanth Bhat,Max Rudat,Roman Teucher,Nicolas Flores-Herr
Subjects: Information Retrieval (cs.IR)
Note: Accepted to Workshop on Explainability in Information Retrieval (WExIR), SIGIR 2025, July 17, 2025

Abstract:Retrieval Augmented Generation (RAG) systems, despite their growing popularity for enhancing model response reliability, often struggle with trustworthiness and explainability. In this work, we present a novel, holistic, model-agnostic, post-hoc explanation framework leveraging perturbation-based techniques to explain the retrieval and generation processes in a RAG system. We propose different strategies to evaluate these explanations and discuss the sufficiency of model-agnostic explanations in RAG systems. With this work, we further aim to catalyze a collaborative effort to build reliable and explainable RAG systems.
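
A minimal sketch of one perturbation-based strategy in this spirit: drop each retrieved passage, regenerate, and score how far the answer moves. The `generate` stand-in and the similarity metric are placeholders for the deployed system's LLM and scorer, not the paper's exact method:

```python
from difflib import SequenceMatcher

def generate(query: str, passages: list) -> str:
    # Toy stand-in generator that echoes the retrieved evidence;
    # swap in the real RAG system's LLM call here.
    return f"{query}: " + " ".join(passages)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def passage_importance(query: str, passages: list) -> list:
    full_answer = generate(query, passages)
    scores = []
    for i in range(len(passages)):
        perturbed = passages[:i] + passages[i + 1:]   # leave one passage out
        ans = generate(query, perturbed)
        scores.append(1.0 - similarity(full_answer, ans))  # bigger shift = more important
    return scores

passages = ["Paris is the capital of France.",
            "France is in Europe.",
            "The Seine flows through Paris."]
print(passage_importance("capital of France", passages))
```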

[IR-4] ELEC: Efficient Large Language Model-Empowered Click-Through Rate Prediction SIGIR2025

Link: https://arxiv.org/abs/2509.07594
Authors: Rui Dong,Wentao Ouyang,Xiangzheng Liu
Subjects: Information Retrieval (cs.IR)
Note: SIGIR 2025

Abstract:Click-through rate (CTR) prediction plays an important role in online advertising systems. On the one hand, traditional CTR prediction models capture the collaborative signals in tabular data via feature interaction modeling, but they lose semantics in text. On the other hand, Large Language Models (LLMs) excel in understanding the context and meaning behind text, but they face challenges in capturing collaborative signals and they have long inference latency. In this paper, we aim to leverage the benefits of both types of models and pursue collaboration, semantics and efficiency. We present ELEC, which is an Efficient LLM-Empowered CTR prediction framework. We first adapt an LLM for the CTR prediction task. In order to leverage the ability of the LLM but simultaneously keep efficiency, we utilize the pseudo-siamese network which contains a gain network and a vanilla network. We inject the high-level representation vector generated by the LLM into a collaborative CTR model to form the gain network such that it can take advantage of both tabular modeling and textual modeling. However, its reliance on the LLM limits its efficiency. We then distill the knowledge from the gain network to the vanilla network on both the score level and the representation level, such that the vanilla network takes only tabular data as input, but can still generate comparable performance as the gain network. Our approach is model-agnostic. It allows for the integration with various existing LLMs and collaborative CTR models. Experiments on real-world datasets demonstrate the effectiveness and efficiency of ELEC for CTR prediction.
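
A minimal sketch of two-level distillation from a gain (teacher) network into a vanilla (student) network, with score-level and representation-level terms; the loss weights and the MSE choices are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logit, teacher_logit,
                 student_repr, teacher_repr, y, a=0.5, b=0.5):
    """CTR loss on labels plus distillation at the score and representation levels."""
    ctr = F.binary_cross_entropy_with_logits(student_logit, y)
    score_kd = F.mse_loss(student_logit, teacher_logit.detach())
    repr_kd = F.mse_loss(student_repr, teacher_repr.detach())
    return ctr + a * score_kd + b * repr_kd

s_logit, t_logit = torch.randn(16), torch.randn(16)   # student / gain-network scores
s_repr, t_repr = torch.randn(16, 32), torch.randn(16, 32)
y = torch.randint(0, 2, (16,)).float()
loss = distill_loss(s_logit, t_logit, s_repr, t_repr, y)
```

After training, only the student (tabular-only) network is served, which is how the framework keeps the LLM out of the online inference path.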

[IR-5] Multi-view-guided Passage Reranking with Large Language Models

Link: https://arxiv.org/abs/2509.07485
Authors: Jeongwoo Na,Jun Kwon,Eunseong Choi,Jongwuk Lee
Subjects: Information Retrieval (cs.IR)
Note:

Abstract:Recent advances in large language models (LLMs) have shown impressive performance in passage reranking tasks. Despite their success, LLM-based methods still face challenges in efficiency and sensitivity to external biases. (1) Existing models rely mostly on autoregressive generation and sliding window strategies to rank passages, which incur heavy computational overhead as the number of passages increases. (2) External biases, such as position or selection bias, hinder the model’s ability to accurately represent passages and increase input-order sensitivity. To address these limitations, we introduce a novel passage reranking model, called Multi-View-guided Passage Reranking (MVP). MVP is a non-generative LLM-based reranking method that encodes query-passage information into diverse view embeddings without being influenced by external biases. For each view, it combines query-aware passage embeddings to produce a distinct anchor vector, which is then used to directly compute relevance scores in a single decoding step. In addition, it employs an orthogonal loss to make the views more distinctive. Extensive experiments demonstrate that MVP, with just 220M parameters, matches the performance of much larger 7B-scale fine-tuned models while achieving a 100x reduction in inference latency. Notably, the 3B-parameter variant of MVP achieves state-of-the-art performance on both in-domain and out-of-domain benchmarks. The source code is available at: this https URL
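
A minimal sketch of an orthogonality regularizer that pushes view embeddings apart, as the abstract describes; the exact formulation in the paper may differ:

```python
import torch
import torch.nn.functional as F

def orthogonal_loss(views: torch.Tensor) -> torch.Tensor:
    """views: (batch, n_views, dim). Penalize off-diagonal cosine similarity
    so the per-view anchor vectors stay distinctive."""
    views = F.normalize(views, dim=-1)
    gram = views @ views.transpose(1, 2)      # (batch, V, V) cosine similarities
    eye = torch.eye(gram.size(-1), device=gram.device)
    return ((gram - eye) ** 2).mean()         # drive off-diagonals toward 0

loss = orthogonal_loss(torch.randn(8, 4, 64))
```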

Attachment Download

Click to download the full list of today's papers