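This post lists the latest papers fetched from arXiv.org on 2025-09-01. It is updated automatically and grouped into five areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.

Note: the daily paper data is fetched from arXiv.org and updated automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper data by email, leave your email address in the comments.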


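Overview (2025-09-01)

350 papers were updated today, including:

  • Natural language processing: 48 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 116 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 71 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 95 papers (Machine Learning (cs.LG))

Natural Language Processing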

[NLP-0] Going over Fine Web with a Fine-Tooth Comb: Technical Report of Indexing Fine Web for Problematic Content Search and Retrieval

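Quick read: This paper addresses quality, safety, and ethics problems in the training data of large language models (LLMs), in particular the risk of harmful content that comes from indiscriminate web crawling. Because of computational constraints, prior work on harmful content has been limited to small samples and cannot assess an entire training set. The key to the solution is an ElasticSearch-based indexing and analysis framework, applied to SwissAI's FineWeb-2 corpus (1.5TB, four languages). It supports fast search and real-time analysis: most queries return in milliseconds and all in under 2 seconds, giving practitioners a practical tool for building safer, more accountable AI systems.

Link: https://arxiv.org/abs/2508.21788
Authors: Inés Altemir Marinas, Anastasiia Kucherenko, Andrei Kucharavy
Affiliations: École Polytechnique Fédérale de Lausanne; Institute of Entrepreneurship and Management, HES-SO Valais-Wallis; Institute of Informatics, HES-SO Valais-Wallis
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)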

Abstract: Large language models (LLMs) rely heavily on web-scale datasets like Common Crawl, which provides over 80% of training data for some modern models. However, the indiscriminate nature of web crawling raises challenges in data quality, safety, and ethics. Despite the critical importance of training data quality, prior research on harmful content has been limited to small samples due to computational constraints. This project presents a framework for indexing and analyzing LLM training datasets using an ElasticSearch-based pipeline. We apply it to SwissAI’s FineWeb-2 corpus (1.5TB, four languages), achieving fast query performance: most searches in milliseconds, all under 2 seconds. Our work demonstrates real-time dataset analysis, offering practical tools for safer, more accountable AI systems.
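
To illustrate why an index makes keyword lookups over a large corpus fast, here is a toy in-memory inverted index. It is only a stand-in for the idea: the paper uses an ElasticSearch-based pipeline, which this sketch does not reproduce, and the `InvertedIndex` class, its methods, and the sample documents are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set


class InvertedIndex:
    """Toy inverted index: maps each token to the ids of documents containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[int]] = defaultdict(set)
        self._docs: List[str] = []

    def add(self, text: str) -> int:
        # Store the document and register each lowercase token under its id
        doc_id = len(self._docs)
        self._docs.append(text)
        for token in text.lower().split():
            self._postings[token].add(doc_id)
        return doc_id

    def search(self, query: str) -> List[int]:
        # AND search: return ids of documents that contain every query term
        terms = query.lower().split()
        if not terms:
            return []
        result = set(self._postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self._postings.get(term, set())
        return sorted(result)


# Made-up documents for illustration only
index = InvertedIndex()
index.add("Buy cheap pills online now")
index.add("A recipe for sourdough bread")
index.add("Cheap flights and cheap hotels")

print(index.search("cheap"))
print(index.search("cheap pills"))
```

A lookup touches only the posting sets of the query terms rather than scanning every document, which is the property that lets a real search engine answer queries over terabytes of text quickly.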

[NLP-1] PiCSAR: Probabilistic Confidence Selection And Ranking

Quick read: This paper addresses the lack of a reliable scoring function for best-of-n sampling with large language models (LLMs) and large reasoning models (LRMs) on reasoning tasks, namely how to identify correct reasoning chains without access to ground-truth answers. The key is Probabilistic Confidence Selection And Ranking (PiCSAR), a simple, training-free method that scores each candidate generation by the joint log-likelihood of its reasoning and final answer. This joint log-likelihood decomposes naturally into reasoning confidence and answer confidence. The authors' analysis finds that correct reasoning chains score significantly higher on both, which supports the method's effectiveness.

Link: https://arxiv.org/abs/2508.21787
Authors: Joshua Ong Jun Leang, Zheng Zhao, Aryo Pradipta Gema, Sohee Yang, Wai-Chung Kwan, Xuanli He, Wenda Li, Pasquale Minervini, Eleonora Giunchiglia, Shay B. Cohen
Affiliations: Imperial College London; University of Edinburgh; UCL; Miniml.AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Best-of-n sampling improves the accuracy of large language models (LLMs) and large reasoning models (LRMs) by generating multiple candidate solutions and selecting the one with the highest reward. The key challenge for reasoning tasks is designing a scoring function that can identify correct reasoning chains without access to ground-truth answers. We propose Probabilistic Confidence Selection And Ranking (PiCSAR): a simple, training-free method that scores each candidate generation using the joint log-likelihood of the reasoning and final answer. The joint log-likelihood of the reasoning and final answer naturally decomposes into reasoning confidence and answer confidence. PiCSAR achieves substantial gains across diverse benchmarks (+10.18 on MATH500, +9.81 on AIME2025), outperforming baselines with at least 2x fewer samples in 16 out of 20 comparisons. Our analysis reveals that correct reasoning chains exhibit significantly higher reasoning and answer confidence, justifying the effectiveness of PiCSAR.
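
The scoring idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each candidate already comes with per-token log-probabilities for its reasoning and for its final answer, and the `Candidate` class, its field names, and the numbers are hypothetical. By the chain rule, the joint log-likelihood is the sum of the two parts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled generation (hypothetical container for this sketch)."""
    text: str
    reasoning_logprobs: List[float]  # per-token log-probs of the reasoning
    answer_logprobs: List[float]     # per-token log-probs of the final answer


def reasoning_confidence(c: Candidate) -> float:
    # log p(reasoning | prompt) as a sum of token log-probabilities
    return sum(c.reasoning_logprobs)


def answer_confidence(c: Candidate) -> float:
    # log p(answer | prompt, reasoning) as a sum of token log-probabilities
    return sum(c.answer_logprobs)


def joint_score(c: Candidate) -> float:
    # Chain rule: log p(reasoning, answer | prompt) is the sum of both parts
    return reasoning_confidence(c) + answer_confidence(c)


def select_best(candidates: List[Candidate]) -> Candidate:
    # Best-of-n: keep the candidate with the highest joint log-likelihood
    return max(candidates, key=joint_score)


# Made-up numbers for illustration only
candidates = [
    Candidate("A", [-0.2, -0.1, -0.3], [-0.05]),
    Candidate("B", [-0.5, -0.9], [-0.7]),
]
best = select_best(candidates)
print(best.text, round(joint_score(best), 2))
```

A raw sum of log-probabilities tends to favor shorter generations. The abstract does not say whether or how the paper normalizes for length, so this sketch leaves the scores unnormalized.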

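[NLP-2] Reasoning-Intensive Regression

Quick read: This paper addresses the weak performance of large language models (LLMs) on reasoning-intensive regression (RiR), that is, deducing subtle numerical properties from text. Such tasks arise in settings like rubric-based scoring and domain-specific retrieval, where task-specific training data and computation are limited. The authors hypothesize that both existing approaches, prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent, will often struggle on RiR. They propose MENTAT, a lightweight method whose key idea is to combine batch-reflective prompt optimization with neural ensemble learning. Experiments show up to 65% improvement over both baselines.

Link: https://arxiv.org/abs/2508.21762
Authors: Diane Tchuindjo, Omar Khattab
Affiliations: MIT
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)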

Abstract:AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e. deducing subtle numerical properties from text. Unlike standard language regression tasks, e.g. for sentiment or similarity, RiR often appears instead in ad-hoc problems like rubric-based scoring or domain-specific retrieval, where much deeper analysis of text is required while only limited task-specific training data and computation are available. We cast three realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and finetuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to 65% improvement over both baselines, though substantial room remains for future advances in RiR.

[NLP-3] Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance EMNLP2025

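Quick read: This paper addresses the "seesaw phenomenon" in supervised fine-tuning of large language models (LLMs), where indiscriminate parameter updates improve some tasks at the expense of others. The key is the Core Parameter Isolation Fine-Tuning (CPI-FT) framework. First, the LLM is fine-tuned independently on each task to identify that task's core parameter regions, quantified by parameter update magnitude, and tasks are clustered by how much their core regions overlap. Next, a parameter fusion step transplants each task's core parameters directly into a unified backbone, while non-core parameters from different tasks are merged smoothly via Spherical Linear Interpolation (SLERP) to mitigate destructive interference. Finally, a lightweight, pipelined SFT phase on mixed-task data freezes the core regions of prior tasks to prevent catastrophic forgetting. Experiments show that the approach significantly reduces task interference and forgetting and outperforms vanilla multi-task and multi-stage fine-tuning baselines.

Link: https://arxiv.org/abs/2508.21741
Authors: Yao Wang, Di Liang, Minlong Peng
Affiliations: University of New South Wales; ByteDance Inc.; Fudan University
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 Main Conference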

Abstract: Supervised fine-tuning (SFT) is a pivotal approach to adapting large language models (LLMs) for downstream tasks; however, performance often suffers from the "seesaw phenomenon", where indiscriminate parameter updates yield progress on certain tasks at the expense of others. To address this challenge, we propose a novel Core Parameter Isolation Fine-Tuning (CPI-FT) framework. Specifically, we first independently fine-tune the LLM on each task to identify its core parameter regions by quantifying parameter update magnitudes. Tasks with similar core regions are then grouped based on region overlap, forming clusters for joint modeling. We further introduce a parameter fusion technique: for each task, core parameters from its individually fine-tuned model are directly transplanted into a unified backbone, while non-core parameters from different tasks are smoothly integrated via Spherical Linear Interpolation (SLERP), mitigating destructive interference. A lightweight, pipelined SFT training phase using mixed-task data is subsequently employed, while freezing core regions from prior tasks to prevent catastrophic forgetting. Extensive experiments on multiple public benchmarks demonstrate that our approach significantly alleviates task interference and forgetting, consistently outperforming vanilla multi-task and multi-stage fine-tuning baselines.
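
The two ingredients named in the abstract, selecting core parameters by update magnitude and merging the rest with SLERP, can be sketched on flat parameter vectors. This is a simplified illustration under stated assumptions, not the CPI-FT implementation: the function names, the `top_fraction` threshold, the two-task setup, and the numbers are hypothetical, and real models would apply this per tensor rather than to a short Python list.

```python
import math
from typing import List


def slerp(v0: List[float], v1: List[float], t: float, eps: float = 1e-8) -> List[float]:
    """Spherical linear interpolation between two vectors."""
    linear = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    if norm0 < eps or norm1 < eps:
        return linear  # angle undefined for a zero vector
    cos_omega = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))
    if math.sin(omega) < eps:
        return linear  # vectors are (anti-)parallel: fall back to lerp
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


def core_mask(base: List[float], tuned: List[float], top_fraction: float) -> List[bool]:
    """Mark parameters with the largest |tuned - base| as core (ties may add extras)."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    k = max(1, int(len(deltas) * top_fraction))
    threshold = sorted(deltas, reverse=True)[k - 1]
    return [d >= threshold for d in deltas]


def fuse(tuned_a: List[float], tuned_b: List[float], mask_a: List[bool], t: float = 0.5) -> List[float]:
    """Keep task A's core parameters as-is; SLERP the non-core ones with task B."""
    idx = [i for i, is_core in enumerate(mask_a) if not is_core]
    blended = slerp([tuned_a[i] for i in idx], [tuned_b[i] for i in idx], t)
    fused = list(tuned_a)
    for i, value in zip(idx, blended):
        fused[i] = value
    return fused


# Made-up parameter vectors for illustration only
base = [0.0, 0.0, 0.0, 0.0]
tuned_a = [0.1, 2.0, -0.2, 0.0]
tuned_b = [0.3, -1.0, 0.1, 0.4]
mask_a = core_mask(base, tuned_a, top_fraction=0.25)
fused = fuse(tuned_a, tuned_b, mask_a)
print(mask_a, fused)
```

Unlike plain averaging, SLERP interpolates along the arc between the two vectors, so the blend of two unit vectors stays on the unit sphere instead of shrinking toward the origin.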

[NLP-4] Why Stop at Words? Unveiling the Bigger Picture through Line-Level OCR

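Quick read: This paper addresses error accumulation from character segmentation in conventional optical character recognition (OCR), and the new bottleneck that word-level OCR introduces through word-detection errors. Modern OCR skips character segmentation and recognizes one word at a time with sequence-to-sequence models, which makes better use of language models, but word-detection errors still cap overall accuracy. The key is a natural progression from word-level to line-level OCR: the model takes a whole line as input and bypasses word detection, which avoids those errors and gives the language model more sentence context. The authors also contribute a dataset of 251 English page images with line-level annotations. Experiments show a 5.4% end-to-end accuracy improvement and a 4x efficiency improvement over word-based pipelines.

Link: https://arxiv.org/abs/2508.21693
Authors: Shashank Vempati, Nishit Anand, Gaurav Talebailkar, Arpan Garai, Chetan Arora
Affiliations: Typeface (India); University of Maryland, College Park; Tata 1mg (India); Vellore Institute of Technology; Indian Institute of Technology Delhi
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 11 pages. Project Website: this https URL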

Abstract:Conventional optical character recognition (OCR) techniques segmented each character and then recognized. This made them prone to error in character segmentation, and devoid of context to exploit language models. Advances in sequence to sequence translation in last decade led to modern techniques first detecting words and then inputting one word at a time to a model to directly output full words as sequence of characters. This allowed better utilization of language models and bypass error-prone character segmentation step. We observe that the above transition in style has moved the bottleneck in accuracy to word segmentation. Hence, in this paper, we propose a natural and logical progression from word level OCR to line-level OCR. The proposal allows to bypass errors in word detection, and provides larger sentence context for better utilization of language models. We show that the proposed technique not only improves the accuracy but also efficiency of OCR. Despite our thorough literature survey, we did not find any public dataset to train and benchmark such shift from word to line-level OCR. Hence, we also contribute a meticulously curated dataset of 251 English page images with line-level annotations. Our experimentation revealed a notable end-to-end accuracy improvement of 5.4%, underscoring the potential benefits of transitioning towards line-level OCR, especially for document images. We also report a 4 times improvement in efficiency compared to word-based pipelines. With continuous improvements in large language models, our methodology also holds potential to exploit such advances. Project Website: this https URL

[NLP-5] Is this chart lying to me? Automating the detection of misleading visualizations

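Quick read: This paper addresses misinformation on social media and the web caused by misleading visualizations, which violate chart design principles, distort data, and lead readers to inaccurate conclusions. The key contribution is Misviz, a large, diverse, openly available benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders, together with Misviz-synth, a synthetic dataset of 81,814 visualizations generated with Matplotlib from real-world data tables. The datasets support training and evaluating AI models that detect misleading visualizations and identify the specific design rules they violate. An evaluation with state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers shows that the task remains highly challenging.

Link: https://arxiv.org/abs/2508.21675
Authors: Jonathan Tonglet, Jan Zimny, Tinne Tuytelaars, Iryna Gurevych
Affiliations: TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE; KU Leuven
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Preprint under review. Code and data available at: this https URL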

Abstract:Misleading visualizations are a potent driver of misinformation on social media and the web. By violating chart design principles, they distort data and lead readers to draw inaccurate conclusions. Prior work has shown that both humans and multimodal large language models (MLLMs) are frequently deceived by such visualizations. Automatically detecting misleading visualizations and identifying the specific design rules they violate could help protect readers and reduce the spread of misinformation. However, the training and evaluation of AI models has been limited by the absence of large, diverse, and openly available datasets. In this work, we introduce Misviz, a benchmark of 2,604 real-world visualizations annotated with 12 types of misleaders. To support model training, we also release Misviz-synth, a synthetic dataset of 81,814 visualizations generated using Matplotlib and based on real-world data tables. We perform a comprehensive evaluation on both datasets using state-of-the-art MLLMs, rule-based systems, and fine-tuned classifiers. Our results reveal that the task remains highly challenging. We release Misviz, Misviz-synth, and the accompanying code.

[NLP-6] QZhou-Embedding Technical Report

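Quick read: This paper addresses the limits of existing text embedding models in semantic representation and cross-task generalization, especially the need for higher-quality, more diverse training data for retrieval. The key is a unified multi-task framework combined with a data synthesis pipeline built on an LLM API (paraphrasing, augmentation, and hard negative example generation), which increases the semantic richness and sample difficulty of the training set. Training proceeds in two stages: retrieval-focused pretraining followed by full-task fine-tuning, so that the model extends a solid retrieval foundation to tasks such as reranking and clustering. The model is built on Qwen2.5-7B-Instruct and, according to the authors, ranked first on the MTEB and CMTEB leaderboards as of August 27, 2025.

Link: https://arxiv.org/abs/2508.21632
Authors: Peng Yu, En Xu, Bin Chen, Haibiao Chen, Yinfei Xu
Affiliations: Kingsoft AI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)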

Abstract:We present QZhou-Embedding, a general-purpose contextual text embedding model with exceptional text representation capabilities. Built upon the Qwen2.5-7B-Instruct foundation model, we designed a unified multi-task framework comprising specialized data transformation and training strategies. The data transformation scheme enables the incorporation of more diverse textual training datasets, while the task-specific training strategies enhance model learning efficiency. We developed a data synthesis pipeline leveraging LLM API, incorporating techniques such as paraphrasing, augmentation, and hard negative example generation to improve the semantic richness and sample difficulty of the training set. Additionally, we employ a two-stage training strategy, comprising initial retrieval-focused pretraining followed by full-task fine-tuning, enabling the embedding model to extend its capabilities based on robust retrieval performance. Our model achieves state-of-the-art results on the MTEB and CMTEB benchmarks, ranking first on both leaderboards (August 27 2025), and simultaneously achieves state-of-the-art performance on tasks including reranking, clustering, etc. Our findings demonstrate that higher-quality, more diverse data is crucial for advancing retrieval model performance, and that leveraging LLMs generative capabilities can further optimize data quality for embedding model breakthroughs. Our model weights are released on HuggingFace under Apache 2.0 license. For reproducibility, we provide evaluation code and instructions on GitHub.

[NLP-7] Personality Matters: User Traits Predict LLM Preferences in Multi-Turn Collaborative Tasks EMNLP2025

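Quick read: This paper asks whether users with different personality traits systematically prefer certain large language models (LLMs) as LLMs become part of everyday collaborative workflows. The study recruited 32 participants evenly distributed across the four Keirsey personality types and evaluated their interactions with GPT-4 and Claude 3.5 on four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Personality type significantly shaped model preference: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Sentiment analysis of qualitative feedback confirmed these patterns. Aggregate helpfulness ratings were similar across models, which shows that personality-based analysis reveals differences that traditional evaluations miss.

Link: https://arxiv.org/abs/2508.21628
Authors: Sarfaroz Yunusov, Kaige Chen, Kazi Nishat Anwar, Ali Emami
Affiliations: Brock University; Emory University
Categories: Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
Comments: Accepted to EMNLP 2025 Main Conference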

Abstract:As Large Language Models (LLMs) increasingly integrate into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.

[NLP-8] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning EMNLP2025

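Quick read: This paper addresses the dependence of supervised fine-tuning (SFT) of large language models (LLMs) on high-quality training data, and in particular the fact that statically curated datasets cannot adapt to a model's evolving capabilities. The key is Middo, a self-evolving, model-informed dynamic data optimization framework that forms a closed loop. (1) A self-referential diagnostic module identifies suboptimal samples from three model signals: loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality). (2) An adaptive optimization engine rewrites those samples into pedagogically valuable training points while preserving their meaning. (3) The optimization process evolves with the model's capability. Experiments report an average accuracy improvement of 7.15% while keeping the original dataset size.

Link: https://arxiv.org/abs/2508.21589
Authors: Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Affiliations: Beijing University of Posts and Telecommunications; Shanghai Artificial Intelligence Laboratory; Renmin University of China; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025 (main)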

Abstract: Supervised Fine-Tuning (SFT) Large Language Models (LLM) fundamentally rely on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that Middo consistently enhances the quality of seed data and boosts LLM’s performance with improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are coming soon.

[NLP-9] A Survey on Current Trends and Recent Advances in Text Anonymization

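Quick read: This survey addresses the protection of sensitive personal information in text so that data complies with regulations while remaining usable for downstream tasks. The central challenge is the trade-off between privacy and utility during anonymization. The survey covers foundational approaches centered on Named Entity Recognition (NER); the dual role of Large Language Models (LLMs) as both anonymizers and de-anonymization threats; domain-specific challenges in healthcare, law, finance, and education; advanced methods based on formal privacy models and risk-aware frameworks; and authorship anonymization. It also reviews evaluation frameworks, metrics, benchmarks, and practical toolkits for real-world deployment, and outlines directions for future research.

Link: https://arxiv.org/abs/2508.21587
Authors: Tobias Deußer, Lorenz Sparrenberg, Armin Berger, Max Hahnbück, Christian Bauckhage, Rafet Sifa
Affiliations: University of Bonn; Fraunhofer IAIS
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE DSAA 2025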

Abstract:The proliferation of textual data containing sensitive personal information across various domains requires robust anonymization techniques to protect privacy and comply with regulations, while preserving data usability for diverse and crucial downstream tasks. This survey provides a comprehensive overview of current trends and recent advances in text anonymization techniques. We begin by discussing foundational approaches, primarily centered on Named Entity Recognition, before examining the transformative impact of Large Language Models, detailing their dual role as sophisticated anonymizers and potent de-anonymization threats. The survey further explores domain-specific challenges and tailored solutions in critical sectors such as healthcare, law, finance, and education. We investigate advanced methodologies incorporating formal privacy models and risk-aware frameworks, and address the specialized subfield of authorship anonymization. Additionally, we review evaluation frameworks, comprehensive metrics, benchmarks, and practical toolkits for real-world deployment of anonymization solutions. This review consolidates current knowledge, identifies emerging trends and persistent challenges, including the evolving privacy-utility trade-off, the need to address quasi-identifiers, and the implications of LLM capabilities, and aims to guide future research directions for both academics and practitioners in this field.

[NLP-10] L3Cube-MahaSTS: A Marathi Sentence Similarity Dataset and Models

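Quick read: This paper addresses the lack of high-quality annotated data and dedicated models for Sentence Textual Similarity (STS) in low-resource languages such as Marathi. The key contribution is MahaSTS, a human-annotated dataset of 16,860 Marathi sentence pairs labeled with continuous similarity scores from 0 to 5. The pairs are distributed uniformly across six score-based buckets, which reduces label bias and improves model stability. On this dataset the authors fine-tune MahaSBERT-STS-v2, a Sentence-BERT model optimized for regression-based similarity scoring, and benchmark it against MahaBERT, MuRIL, IndicBERT, and IndicSBERT. The results highlight the value of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings.

Link: https://arxiv.org/abs/2508.21569
Authors: Aishwarya Mirashi, Ananya Joshi, Raviraj Joshi
Affiliations: Pune Institute of Computer Technology; Indian Institute of Technology Madras; MKSSS’ Cummins College of Engineering for Women; L3Cube Labs
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)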

Abstract:We present MahaSTS, a human-annotated Sentence Textual Similarity (STS) dataset for Marathi, along with MahaSBERT-STS-v2, a fine-tuned Sentence-BERT model optimized for regression-based similarity scoring. The MahaSTS dataset consists of 16,860 Marathi sentence pairs labeled with continuous similarity scores in the range of 0-5. To ensure balanced supervision, the dataset is uniformly distributed across six score-based buckets spanning the full 0-5 range, thus reducing label bias and enhancing model stability. We fine-tune the MahaSBERT model on this dataset and benchmark its performance against other alternatives like MahaBERT, MuRIL, IndicBERT, and IndicSBERT. Our experiments demonstrate that MahaSTS enables effective training for sentence similarity tasks in Marathi, highlighting the impact of human-curated annotations, targeted fine-tuning, and structured supervision in low-resource settings. The dataset and model are publicly shared at this https URL

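[NLP-11] Summarize-Exemplify-Reflect: Data-driven Insight Distillation Empowers LLMs for Few-shot Tabular Classification EMNLP25

Quick read: This paper addresses the unstable performance of large language models (LLMs) on few-shot tabular classification, which stems from the variability of structured data. The key is InsightTab, an insight distillation framework guided by three principles: divide-and-conquer, easy-first, and reflective learning. It combines rule summarization, strategic exemplification, and insight reflection through close collaboration between LLMs and data modeling techniques. The distilled insights help LLMs align their general knowledge with the requirements of a specific tabular task, and the evaluation on nine datasets shows consistent improvement over state-of-the-art methods.

Link: https://arxiv.org/abs/2508.21561
Authors: Yifei Yuan, Jiatong Li, Weijia Zhang, Mohammad Aliannejadi, Evangelos Kanoulas, Renjun Hu
Affiliations: ETH Zürich; University of Copenhagen; University of Science and Technology of China; University of Amsterdam; East China Normal University
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: EMNLP 25 Findings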

Abstract:Recent studies show the promise of large language models (LLMs) for few-shot tabular classification but highlight challenges due to the variability in structured data. To address this, we propose distilling data into actionable insights to enable robust and effective classification by LLMs. Drawing inspiration from human learning processes, we introduce InsightTab, an insight distillation framework guided by principles of divide-and-conquer, easy-first, and reflective learning. Our approach integrates rule summarization, strategic exemplification, and insight reflection through deep collaboration between LLMs and data modeling techniques. The obtained insights enable LLMs to better align their general knowledge and capabilities with the particular requirements of specific tabular tasks. We extensively evaluate InsightTab on nine datasets. The results demonstrate consistent improvement over state-of-the-art methods. Ablation studies further validate the principle-guided distillation process, while analyses emphasize InsightTab’s effectiveness in leveraging labeled data and managing bias.

[NLP-12] Accept or Deny? Evaluating LLM Fairness and Performance in Loan Approval across Table-to-Text Serialization Approaches

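Quick read: This paper addresses three weaknesses of large language models (LLMs) in financial decision-making: difficulty processing tabular data, unreliable fairness, and unreliable predictions. The authors systematically evaluate how different table-to-text serialization formats affect LLM performance and fairness on loan approval datasets from Ghana, Germany, and the United States, in both zero-shot and in-context learning (ICL) settings. Some formats, such as GReat and LIFT, yield higher F1 scores but exacerbate fairness disparities. ICL improves performance by 4.9-59.6% relative to zero-shot baselines, but its effect on fairness varies considerably across datasets. The work underscores the need for effective tabular data representations and fairness-aware models.

Link: https://arxiv.org/abs/2508.21512
Authors: Israel Abebe Azime, Deborah D. Kanubala, Tejumade Afonja, Mario Fritz, Isabel Valera, Dietrich Klakow, Philipp Slusallek
Affiliations: Saarland University; CISPA Helmholtz Center for Information Security; Max Planck Institute for Software Systems
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computers and Society (cs.CY)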

Abstract:Large Language Models (LLMs) are increasingly employed in high-stakes decision-making tasks, such as loan approvals. While their applications expand across domains, LLMs struggle to process tabular data, ensuring fairness and delivering reliable predictions. In this work, we assess the performance and fairness of LLMs on serialized loan approval datasets from three geographically distinct regions: Ghana, Germany, and the United States. Our evaluation focuses on the model’s zero-shot and in-context learning (ICL) capabilities. Our results reveal that the choice of serialization (Serialization refers to the process of converting tabular data into text formats suitable for processing by LLMs.) format significantly affects both performance and fairness in LLMs, with certain formats such as GReat and LIFT yielding higher F1 scores but exacerbating fairness disparities. Notably, while ICL improved model performance by 4.9-59.6% relative to zero-shot baselines, its effect on fairness varied considerably across datasets. Our work underscores the importance of effective tabular data representation methods and fairness-aware models to improve the reliability of LLMs in financial decision-making.
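
Serialization, as the abstract defines it, converts a table row into text an LLM can read. The sketch below shows three generic ways to do that for one loan application row. The templates, the column names, and the prompt wording are hypothetical and are not claimed to match the formats the paper evaluates (such as GReat or LIFT).

```python
import json
from typing import Callable, Dict, Union

Value = Union[str, int, float]
Row = Dict[str, Value]


def serialize_sentence(row: Row) -> str:
    # "feature is value" clauses joined into one sentence
    return ", ".join(f"{key} is {value}" for key, value in row.items()) + "."


def serialize_json(row: Row) -> str:
    # Plain JSON object, keeping the column order of the row
    return json.dumps(row)


def serialize_list(row: Row) -> str:
    # One "- feature: value" bullet per column
    return "\n".join(f"- {key}: {value}" for key, value in row.items())


def build_prompt(row: Row, serializer: Callable[[Row], str]) -> str:
    # Zero-shot prompt: instruction, serialized row, answer format
    return (
        "Decide whether the loan application below should be approved.\n"
        f"{serializer(row)}\n"
        "Answer with 'approve' or 'deny'."
    )


# Made-up row for illustration only
row = {"age": 35, "income": 52000, "loan_amount": 10000}
for serializer in (serialize_sentence, serialize_json, serialize_list):
    print(build_prompt(row, serializer))
    print()
```

The three prompts carry identical information, which is what makes the paper's finding notable: the choice of format alone can shift both performance and fairness.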

[NLP-13] HSFN: Hierarchical Selection for Fake News Detection building Heterogeneous Ensemble

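Quick read: This paper addresses fake news detection. Psychological biases such as confirmation bias make people prone to believing and spreading fake news on social media, and machine learning-based fact-checking systems have been studied to mitigate this. Ensemble methods are effective, but their performance depends on the diversity of the constituent classifiers, and models that learn redundant patterns limit the gains. The paper proposes an automatic classifier selection approach that prioritizes diversity and also accounts for performance. It computes pairwise diversity between classifiers, applies hierarchical clustering to group them at different levels of granularity, and then uses HierarchySelect to pick one pool of classifiers per level, each with a distinct intra-pool diversity. The most diverse pool is selected for the ensemble, with an evaluation metric reflecting each classifier's performance included in the selection. In experiments with 40 heterogeneous classifiers on six datasets, the approach achieves the highest accuracy on two of the six.

Link: https://arxiv.org/abs/2508.21482
Authors: Sara B. Coutinho, Rafael M.O. Cruz, Francimaria R. S. Nascimento, George D. C. Cavalcanti
Affiliations: Centro de Informática (CIn), Universidade Federal de Pernambuco (UFPE); École de Technologie Supérieure (ÉTS), Université du Québec
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by IEEE International Conference on Systems, Man, and Cybernetics (SMC) - IEEE SMC 2025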

Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning-based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers; selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity, also extended by performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. The most diverse pool is identified and selected for ensemble construction from these. The selection process incorporates an evaluation metric reflecting each classifier’s performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on two of six datasets. The implementation details are available on the project’s repository: this https URL .
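
The selection pipeline described in the abstract can be sketched in plain Python. Several details here are assumptions rather than the paper's method: diversity is measured as pairwise disagreement rate, clustering uses average linkage, each level's pool takes the most accurate classifier from every cluster, and pool diversity is the mean pairwise disagreement. All function names and the toy predictions are hypothetical.

```python
from itertools import combinations
from typing import Dict, List

Preds = Dict[str, List[int]]


def disagreement(a: List[int], b: List[int]) -> float:
    # Fraction of samples on which two classifiers predict different labels
    return sum(x != y for x, y in zip(a, b)) / len(a)


def accuracy(pred: List[int], y: List[int]) -> float:
    return sum(p == t for p, t in zip(pred, y)) / len(y)


def linkage(c1: List[str], c2: List[str], preds: Preds) -> float:
    # Average-linkage distance between two clusters of classifiers
    total = sum(disagreement(preds[a], preds[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster_levels(preds: Preds) -> List[List[List[str]]]:
    # Agglomerative clustering; records the partition at every level
    clusters = [[name] for name in preds]
    levels = [list(clusters)]
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: linkage(pair[0], pair[1], preds))
        clusters = [c for c in clusters if c is not c1 and c is not c2] + [c1 + c2]
        levels.append(list(clusters))
    return levels


def pool_diversity(pool: List[str], preds: Preds) -> float:
    # Mean pairwise disagreement inside a pool (0.0 for a single classifier)
    pairs = list(combinations(pool, 2))
    if not pairs:
        return 0.0
    return sum(disagreement(preds[a], preds[b]) for a, b in pairs) / len(pairs)


def select_pool(preds: Preds, y: List[int]) -> List[str]:
    # One pool per level: the most accurate member of each cluster
    pools = [[max(cluster, key=lambda n: accuracy(preds[n], y)) for cluster in level]
             for level in cluster_levels(preds)]
    return max(pools, key=lambda pool: pool_diversity(pool, preds))


# Made-up validation labels and predictions for illustration only
y_true = [1, 0, 1, 1, 0, 0]
predictions = {
    "nb": [1, 0, 1, 1, 0, 1],
    "logreg": [1, 0, 1, 1, 0, 1],  # redundant with "nb"
    "tree": [0, 0, 1, 0, 0, 0],
}
selected = select_pool(predictions, y_true)
print(sorted(selected))
```

In this toy run the two identical classifiers are merged into one cluster, so the selected pool keeps only one of them alongside the classifier that disagrees with both.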

[NLP-14] Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards EMNLP2025

Quick read: This paper addresses two obstacles to creative writing with Small Language Models (SLMs): Supervised Fine-Tuning (SFT) struggles with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. Within a Reinforcement Learning from AI Feedback (RLAIF) framework, it explores two AI-driven reward strategies for a 7B-parameter SLM that generates Chinese greetings. The first trains a reward model (RM) on high-quality preference data curated by a multi-agent rejection sampling framework. The second, more novel strategy uses a principle-guided LLM-as-a-Judge whose reward function is optimized through adversarial training with a reflection mechanism and which provides reward signals directly. Both strategies significantly improve creative output over baselines; the LLM-as-a-Judge yields better generation quality, trains more efficiently, and depends less on human-annotated data.

Link: https://arxiv.org/abs/2508.21476
Authors: Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Affiliations: Beihang University; Baidu Inc.; Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: EMNLP 2025 Main

Abstract:Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at this https URL.

[NLP-15] Morae: Proactively Pausing UI Agents for User Choices

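Quick read: This paper addresses the fact that current user interface (UI) agents serving blind and low-vision (BLV) users run tasks end-to-end without involving users in critical choices or surfacing important context, which reduces user agency. For example, an agent may pick one product automatically without mentioning alternatives with different flavors or better ratings. The key is Morae, a mixed-initiative UI agent that uses large multimodal models to interpret user queries alongside UI code and screenshots, automatically identifies decision points during task execution, and pauses to ask the user for clarification. In a study with BLV participants on real-world web tasks, Morae helped users complete more tasks and select options that better matched their preferences than baseline agents, including OpenAI Operator.

Link: https://arxiv.org/abs/2508.21456
Authors: Yi-Hao Peng, Dingzeyu Li, Jeffrey P. Bigham, Amy Pavel
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: ACM UIST 2025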

Abstract:User interface (UI) agents promise to make inaccessible or complex UIs easier to access for blind and low-vision (BLV) users. However, current UI agents typically perform tasks end-to-end without involving users in critical choices or making them aware of important contextual information, thus reducing user agency. For example, in our field study, a BLV participant asked to buy the cheapest available sparkling water, and the agent automatically chose one from several equally priced options, without mentioning alternative products with different flavors or better ratings. To address this problem, we introduce Morae, a UI agent that automatically identifies decision points during task execution and pauses so that users can make choices. Morae uses large multimodal models to interpret user queries alongside UI code and screenshots, and prompt users for clarification when there is a choice to be made. In a study over real-world web tasks with BLV participants, Morae helped users complete more tasks and select options that better matched their preferences, as compared to baseline agents, including OpenAI Operator. More broadly, this work exemplifies a mixed-initiative approach in which users benefit from the automation of UI agents while being able to express their preferences.
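
The sparkling-water example in the abstract can be made concrete with a small sketch. Morae itself relies on large multimodal models to find decision points; the rule-based check below is only a hypothetical stand-in, and the `Product` class, the `cheapest_or_ask` function, and the catalog data are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float
    rating: float


def cheapest_or_ask(products):
    """Return ("act", product) when one option is uniquely cheapest,
    otherwise ("ask", tied_options) so the user can make the choice."""
    lowest = min(p.price for p in products)
    tied = [p for p in products if p.price == lowest]
    if len(tied) == 1:
        return "act", tied[0]
    # Several equally priced options: this is a decision point, so pause.
    return "ask", tied


# Hypothetical catalog: two options tie on price but differ in flavor and rating.
catalog = [
    Product("Lime sparkling water", 1.99, 4.1),
    Product("Plain sparkling water", 1.99, 4.6),
    Product("Berry sparkling water", 2.49, 4.8),
]
decision, options = cheapest_or_ask(catalog)
```

Here the agent would pause and present the two tied options, with their ratings, instead of silently picking one.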

[NLP-16] Beyond the Surface: Probing the Ideological Depth of Large Language Models

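[Quick Read]: This paper asks whether the ideological leanings of Large Language Models (LLMs) are stable and deep, that is, whether a model's political stance reflects an internal, structured representation rather than a surface response that simple prompt engineering can manipulate. To answer this, the authors propose "ideological depth" as a quantifiable property and study it along two tracks: first, they measure the steerability of two open-source LLMs using instruction prompting and activation steering; second, they use Sparse Autoencoders (SAEs) to analyze internal representations, finding that models with lower steerability have more distinct and abstract ideological features. The key result is that targeted ablation of a core political feature produces consistent, logical shifts in reasoning in a "deep" model, whereas the same intervention in a "shallow" model increases refusals, suggesting that ideological depth can be identified and intervened on through concrete mechanisms.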

Link: https://arxiv.org/abs/2508.21448
Authors: Shariar Kabir, Kevin Esterling, Yue Dong
Affiliations: Bangladesh University of Engineering and Technology; University of California Riverside
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of “ideological depth” in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the “steerability” of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically “deep” model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a “shallow” model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.
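
Activation steering, mentioned in the abstract, is often implemented by adding a direction vector to a model's hidden states. The abstract does not say how the authors build that vector, so the difference-of-means construction below is an assumption, the activations are random toy data, and `steering_vector` and `steer` are hypothetical helper names.

```python
import numpy as np


def steering_vector(acts_a, acts_b):
    """Difference of mean activations between two prompt sets (assumed recipe)."""
    return acts_a.mean(axis=0) - acts_b.mean(axis=0)


def steer(hidden, vector, alpha):
    """Add the scaled steering vector to every token's hidden state."""
    return hidden + alpha * vector


# Toy activations standing in for one layer's outputs on two sets of prompts.
rng = np.random.default_rng(0)
acts_a = rng.normal(loc=1.0, size=(32, 8))
acts_b = rng.normal(loc=-1.0, size=(32, 8))
v = steering_vector(acts_a, acts_b)

hidden = np.zeros((5, 8))           # 5 tokens, hidden size 8
steered = steer(hidden, v, alpha=0.5)
```

In this picture, a model that is hard to steer is one whose outputs change little, or turn into refusals, as `alpha` grows.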

[NLP-17] Discovering Semantic Subdimensions through Disentangled Conceptual Representations

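[Quick Read]: This paper addresses the overly coarse characterization of the core dimensions of conceptual semantics in existing work: traditional approaches rely on predefined semantic dimensions and miss finer conceptual distinctions. The key to the solution is the Disentangled Continuous Semantic Representation Model (DCSRM), which decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information, and uses them to identify interpretable semantic subdimensions. The authors assess neural plausibility with voxel-wise encoding models that map the subdimensions to brain activation, and further analysis shows polarity emerging as a key factor driving the decomposition of semantic dimensions into subdimensions.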

Link: https://arxiv.org/abs/2508.21436
Authors: Yunhao Zhang, Shaonan Wang, Nan Lin, Xinyi Dong, Chong Li, Chengqing Zong
Affiliations: State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, CAS, Beijing, China; School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Cognitive Science and Mental Health, Institute of Psychology, CAS, Beijing, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Understanding the core dimensions of conceptual semantics is fundamental to uncovering how meaning is organized in language and the brain. Existing approaches often rely on predefined semantic dimensions that offer only broad representations, overlooking finer conceptual distinctions. This paper proposes a novel framework to investigate the subdimensions underlying coarse-grained semantic dimensions. Specifically, we introduce a Disentangled Continuous Semantic Representation Model (DCSRM) that decomposes word embeddings from large language models into multiple sub-embeddings, each encoding specific semantic information. Using these sub-embeddings, we identify a set of interpretable semantic subdimensions. To assess their neural plausibility, we apply voxel-wise encoding models to map these subdimensions to brain activation. Our work offers more fine-grained interpretable semantic subdimensions of conceptual meaning. Further analyses reveal that semantic dimensions are structured according to distinct principles, with polarity emerging as a key factor driving their decomposition into subdimensions. The neural correlates of the identified subdimensions support their cognitive and neuroscientific plausibility.

[NLP-18] Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models

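[Quick Read]: This paper addresses the lack of reliable reward models (RMs) and judges for multimodal large language models (MLLMs) in medical tasks such as diagnosis and clinical decision-making. Existing benchmarks focus on general MLLM capabilities or treat models as solvers, neglecting dimensions such as diagnostic accuracy and clinical relevance. The key to the solution is Med-RewardBench, described as the first benchmark designed specifically to evaluate medical reward models and judges. It contains 1,026 expert-annotated cases spanning 13 organ systems and 8 clinical departments, uses a rigorous three-step process to ensure data quality, and covers six clinically critical dimensions. The authors evaluate 32 MLLMs and find substantial challenges in aligning outputs with expert judgment.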

Link: https://arxiv.org/abs/2508.21430
Authors: Meidan Ding, Jipeng Zhang, Wenxuan Wang, Cheng-Yi Li, Wei-Chieh Fang, Hsin-Yu Wu, Haiqin Zhong, Wenting Chen, Linlin Shen
Affiliations: Shenzhen University; City University of Hong Kong; The Hong Kong University of Science and Technology; Renmin University of China; National Yang Ming Chiao Tung University; Taipei Veterans General Hospital; School of Biomedical Engineering, Shenzhen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 5 figures, 3 tables

Abstract:Multimodal large language models (MLLMs) hold significant potential in medical applications, including disease diagnosis and clinical decision-making. However, these tasks require highly accurate, context-sensitive, and professionally aligned responses, making reliable reward models and judges critical. Despite their importance, medical reward models (MRMs) and judges remain underexplored, with no dedicated benchmarks addressing clinical requirements. Existing benchmarks focus on general MLLM capabilities or evaluate models as solvers, neglecting essential evaluation dimensions like diagnostic accuracy and clinical relevance. To address this, we introduce Med-RewardBench, the first benchmark specifically designed to evaluate MRMs and judges in medical scenarios. Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases. A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions. We evaluate 32 state-of-the-art MLLMs, including open-source, proprietary, and medical-specific models, revealing substantial challenges in aligning outputs with expert judgment. Additionally, we develop baseline models that demonstrate substantial performance improvements through fine-tuning.

[NLP-19] Automatic Reviewers Fail to Detect Faulty Reasoning in Research Papers: A New Counterfactual Evaluation Framework

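[Quick Read]: This paper investigates whether state-of-the-art LLM-based automatic review generators (ARGs) can detect and respond to faulty research logic, a core skill in scholarly peer review. The key to the solution is a fully automated counterfactual evaluation framework that isolates and tests ARGs' sensitivity to flawed research logic under controlled conditions, exposing their limitations in checking the internal consistency between a paper's results, interpretations, and claims.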

Link: https://arxiv.org/abs/2508.21422
Authors: Nils Dycke, Iryna Gurevych
Affiliations: UKP Lab, Department of Computer Science and National Research Center for Applied Cybersecurity ATHENE, Technical University of Darmstadt
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) have great potential to accelerate and support scholarly peer review and are increasingly used as fully automatic review generators (ARGs). However, potential biases and systematic errors may pose significant risks to scientific integrity; understanding the specific capabilities and limitations of state-of-the-art ARGs is essential. We focus on a core reviewing skill that underpins high-quality peer review: detecting faulty research logic. This involves evaluating the internal consistency between a paper’s results, interpretations, and claims. We present a fully automated counterfactual evaluation framework that isolates and tests this skill under controlled conditions. Testing a range of ARG approaches, we find that, contrary to expectation, flaws in research logic have no significant effect on their output reviews. Based on our findings, we derive three actionable recommendations for future work and release our counterfactual dataset and evaluation framework publicly.

[NLP-20] AllSummedUp: un framework open-source pour comparer les metriques devaluation de resume

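[Quick Read]: This paper addresses reproducibility problems in the evaluation of automatic text summarization, namely significant discrepancies between the metric performance reported in the literature and what is observed experimentally. The key to the solution is a unified, open-source framework, applied to the SummEval dataset, for fair and transparent comparison of six representative metrics, ranging from classical approaches such as ROUGE to LLM-based methods such as G-Eval and SEval-Ex. The results reveal a structural trade-off: metrics that align best with human judgments tend to be computationally intensive and less stable across runs. The study also highlights the randomness, technical dependencies, and limited reproducibility of LLM-based evaluation, and advocates more robust protocols with exhaustive documentation and methodological standardization.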

Link: https://arxiv.org/abs/2508.21389
Authors: Tanguy Herserant, Vincent Guigue
Affiliations: AgroParisTech - MIA, 22 place de l’Agronomie, 91120 Palaiseau, France
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: in French language

Abstract:This paper investigates reproducibility challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics ranging from classical approaches like ROUGE to recent LLM-based methods (G-Eval, SEval-Ex), we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics. Our results reveal a structural trade-off: metrics with the highest alignment with human judgments tend to be computationally intensive and less stable across runs. Beyond comparative analysis, this study highlights key concerns about relying on LLMs for evaluation, stressing their randomness, technical dependencies, and limited reproducibility. We advocate for more robust evaluation protocols including exhaustive documentation and methodological standardization to ensure greater reliability in automatic summarization assessment.
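
To illustrate the "classical" end of the metric range, here is a minimal sketch of ROUGE-1 (unigram overlap). It assumes lowercased whitespace tokenization with no stemming, so its scores will not match the official ROUGE toolkit, and the function name `rouge1` is made up for this example. It is not taken from the AllSummedUp framework.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall and F1 between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
```

A metric like this is deterministic: repeated runs give identical scores, which is the stability property that the abstract contrasts with LLM-based metrics.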

[NLP-21] Normality and the Turing Test

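[Quick Read]: This paper revisits the nature of the Turing test and its significance for evaluating artificial intelligence (AI). Its core claim is that current large language models such as ChatGPT, although they behave in highly capable ways, may not match the standard of "normal human intelligence" implicit in the Turing test. The key move is to introduce the concept of normality and reinterpret the Turing test as a statistical test: a machine must imitate the cognition and behavior of an ordinary human, including mistakes and imperfect decisions, rather than exceptional human intelligence. In addition, the "average human interrogator" is not a single individual but a mathematical abstraction formed from the normalized aggregate of multiple interrogators' judgments. On this view, existing large language models embody what the author calls "artificial smartness" rather than artificial intelligence per se.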

Link: https://arxiv.org/abs/2508.21382
Authors: Alexandre Kabbach
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper proposes to revisit the Turing test through the concept of normality. Its core argument is that the statistical interpretation of the normal–understood as the average both in the normative and mathematical sense of the term–proves useful for understanding the Turing test in at least two ways. First, in the sense that the Turing test targets normal/average rather than exceptional human intelligence, so that successfully passing the test requires building machines that “make mistakes” and display imperfect behavior just like normal/average humans. Second, in the sense that the Turing test is a statistical test where judgments of intelligence are never carried out by a single “average” judge (understood as non-expert) but always by a full jury. As such, the notion of “average human interrogator” that Turing talks about in his original paper should be understood primarily as referring to a mathematical abstraction made of the normalized aggregate of individual judgments of multiple judges. In short, this paper argues that the Turing test is a test of normal intelligence as assessed by a normal judge characterizing the average judgment of a pool of human interrogators. Its conclusions are twofold. First, it argues that large language models such as ChatGPT are unlikely to pass the Turing test as those models precisely target exceptional rather than normal/average human intelligence. As such, they constitute models of what it proposes to call artificial smartness rather than artificial intelligence per se. Second, it argues that the core question of whether the Turing test can contribute anything to the understanding of human cognition is that of whether the human mind is really reducible to the normal/average mind–a question which largely extends beyond the Turing test itself and questions the conceptual underpinnings of the normalist paradigm it belongs to.

[NLP-22] Challenges and Applications of Large Language Models : A Comparison of GPT and DeepSeek family of models

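[Quick Read]: This survey addresses the complexity of developing and deploying Large Language Models (LLMs). It reviews 16 key challenges in building and using LLMs and compares how two state-of-the-art models handle them: OpenAI's closed-source GPT-4o (May 2024 update) and the open-source DeepSeek-V3-0324 (March 2025). The comparison highlights the trade-offs between closed-source models (robust safety, fine-tuned reliability) and open-source models (efficiency, adaptability), and relates model attributes to application domains such as chatbots, coding tools, healthcare, and education, to help researchers, developers, and decision-makers understand current LLM capabilities, limitations, and best practices.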

Link: https://arxiv.org/abs/2508.21377
Authors: Shubham Sharma, Sneha Tuli, Narendra Badam
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 7 figures

Abstract:Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI’s closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.

[NLP-23] AHELM: A Holistic Evaluation of Audio-Language Models

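[Quick Read]: This paper addresses the lack of standardized benchmarks for evaluating audio-language models (ALMs): existing evaluations usually cover only one or two capabilities, omit aspects such as fairness and safety, and are hard to compare across models because prompting methods and inference parameters differ. The key to the solution is AHELM, a benchmark covering 10 aspects (audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety). It aggregates multiple datasets, including two new synthetic audio-text datasets, PARADE and CoRe-Bench, and standardizes prompts, inference parameters, and evaluation metrics so that models can be compared on equal terms. The authors test 14 open-weight and closed-API ALMs plus 3 simple baseline systems, and report performance differences as well as fairness issues, offering a transparent benchmark intended to be extended over time.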

Link: https://arxiv.org/abs/2508.21376
Authors: Tony Lee, Haoqin Tu, Chi Heem Wong, Zijun Wang, Siwei Yang, Yifan Mai, Yuyin Zhou, Cihang Xie, Percy Liang
Affiliations: Stanford University; University of California, Santa Cruz; Hitachi America, Ltd.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Evaluations of audio-language models (ALMs) – multimodal models that take interleaved audio and text as input and output text – are hindered by the lack of standardized benchmarks; most benchmarks measure only one or two capabilities and omit evaluative aspects such as fairness or safety. Furthermore, comparison across models is difficult as separate evaluations test a limited number of models and use different prompting methods and inference parameters. To address these shortfalls, we introduce AHELM, a benchmark that aggregates various datasets – including 2 new synthetic audio-text datasets called PARADE, which evaluates the ALMs on avoiding stereotypes, and CoRe-Bench, which measures reasoning over conversational audio through inferential multi-turn question answering – to holistically measure the performance of ALMs across 10 aspects we have identified as important to the development and usage of ALMs: audio perception, knowledge, reasoning, emotion detection, bias, fairness, multilinguality, robustness, toxicity, and safety. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models. We test 14 open-weight and closed-API ALMs from 3 developers and 3 additional simple baseline systems each consisting of an automatic speech recognizer and a language model. Our results show that while Gemini 2.5 Pro ranks top in 5 out of 10 aspects, it exhibits group unfairness (p=0.01) on ASR tasks whereas most of the other models do not. We also find that the baseline systems perform reasonably well on AHELM, with one ranking 5th overall despite having only speech-to-text capabilities. For transparency, all raw prompts, model generations, and outputs are available on our website at this https URL. AHELM is intended to be a living benchmark and new datasets and models will be added over time.

[NLP-24] Stairway to Fairness: Connecting Group and Individual Fairness RECSYS2025

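[Quick Read]: This paper addresses the unclear relationship between group fairness and individual fairness in recommender systems (RSs). Prior work has used different evaluation measures or objectives for each fairness type, which prevents a proper comparison and leaves open how improving one type affects the other. The key to the solution is a systematic comparison of evaluation measures that can be used for both fairness types. Experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals, which gives RS practitioners empirical evidence about this fairness trade-off.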

Link: https://arxiv.org/abs/2508.21334
Authors: Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Falk Scholer, Christina Lioma
Affiliations: University of Copenhagen; LUT University; RMIT University
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: Accepted to RecSys 2025 (short paper)

Abstract:Fairness in recommender systems (RSs) is commonly categorised into group fairness and individual fairness. However, there is no established scientific understanding of the relationship between the two fairness types, as prior work on both types has used different evaluation measures or evaluation objectives for each fairness type, thereby not allowing for a proper comparison of the two. As a result, it is currently not known how increasing one type of fairness may affect the other. To fill this gap, we study the relationship of group and individual fairness through a comprehensive comparison of evaluation measures that can be used for both fairness types. Our experiments with 8 runs across 3 datasets show that recommendations that are highly fair for groups can be very unfair for individuals. Our finding is novel and useful for RS practitioners aiming to improve the fairness of their systems. Our code is available at: this https URL.
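
The headline finding, that group-fair recommendations can still be unfair to individuals, can be reproduced on toy numbers. The sketch below applies one measure, the Gini coefficient, at both levels. Choosing Gini is an assumption made for illustration (the abstract does not name the measures the paper compares), and the per-user utilities are invented.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(xs, start=1))
    return weighted / (n * total)


# Hypothetical recommendation utility per user, split into two user groups.
utility = {"A": [0.9, 0.1], "B": [0.1, 0.9]}

group_means = [sum(v) / len(v) for v in utility.values()]
individuals = [u for v in utility.values() for u in v]

group_gini = gini(group_means)        # inequality between group averages
individual_gini = gini(individuals)   # inequality between individual users
```

Both groups average 0.5, so the group-level score reports perfect equality, while the individual-level score is 0.4 because half of the users receive far less utility than the other half.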

[NLP-25] BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning

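[Quick Read]: This paper addresses the need for robust evaluation of Large Language Models (LLMs) in multilingual and non-English contexts, with particular attention to studies of data contamination in pretraining. The key to the solution is an updated and extended BLUEX dataset that adds the 2024-2025 exams and uses state-of-the-art models to automatically generate image captions. Captioning increases accessibility to text-only models by more than 40%, yielding 1,422 usable questions, more than double the number in the original BLUEX. This makes the dataset more useful for evaluating whether LLMs can leverage visual context through captions.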

Link: https://arxiv.org/abs/2508.21294
Authors: João Guilherme Alves Santos, Giovana Kerche Bonás, Thales Sales Almeida
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages, 5 figures, 2 tables

Abstract:With the growing capabilities of Large Language Models (LLMs), there is an increasing need for robust evaluation methods, especially in multilingual and non-English contexts. We present an updated version of the BLUEX dataset, now including 2024-2025 exams and automatically generated image captions using state-of-the-art models, enhancing its relevance for data contamination studies in LLM pretraining. Captioning strategies increase accessibility to text-only models by more than 40%, producing 1,422 usable questions, more than doubling the number in the original BLUEX. We evaluated commercial and open-source LLMs and their ability to leverage visual context through captions.

[NLP-26] Efficient Code Embeddings from Code Generation Models

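[Quick Read]: This paper addresses how to build high-quality code embedding models efficiently for retrieving code from natural language queries, technical question answering, and identifying semantically similar code snippets across programming languages. The key to the solution is jina-code-embeddings, a model suite built on an autoregressive backbone pre-trained on both text and code, which generates embeddings via last-token pooling and achieves state-of-the-art performance despite relatively small model sizes.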

Link: https://arxiv.org/abs/2508.21290
Authors: Daria Kryvosheieva, Saba Sturua, Michael Günther, Scott Martens, Han Xiao
Affiliations: Massachusetts Institute of Technology; Jina AI GmbH
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 9 pages, table and evaluations 5-9

Abstract:jina-code-embeddings is a novel code embedding model suite designed to retrieve code from natural language queries, perform technical question-answering, and identify semantically similar code snippets across programming languages. It makes innovative use of an autoregressive backbone pre-trained on both text and code, generating embeddings via last-token pooling. We outline the training recipe and demonstrate state-of-the-art performance despite the relatively small size of the models, validating this approach to code embedding model construction.
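
Last-token pooling takes the hidden state of the final real token of each sequence as the sequence embedding. This suits an autoregressive backbone, where only the last position has attended to the whole input. The NumPy sketch below is not the jina-code-embeddings implementation: it assumes right-padded batches, adds L2 normalization as an extra assumption, and `last_token_pool` is a hypothetical name.

```python
import numpy as np


def last_token_pool(hidden_states, attention_mask):
    """Pick the hidden state of each sequence's last non-padding token.

    hidden_states: (batch, seq_len, dim) float array
    attention_mask: (batch, seq_len) array, 1 for real tokens, 0 for padding
    Assumes right padding, i.e. real tokens come first.
    """
    lengths = attention_mask.sum(axis=1)
    last_idx = np.maximum(lengths - 1, 0)
    batch_idx = np.arange(hidden_states.shape[0])
    pooled = hidden_states[batch_idx, last_idx]
    # L2-normalize so that dot products behave like cosine similarity.
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)


# Toy batch: 2 sequences, 4 positions, hidden size 3; the second is padded.
hidden = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])
emb = last_token_pool(hidden, mask)
```

With left padding the last real token always sits at index `-1`, so the index computation would need to change accordingly.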

[NLP-27] CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

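[Quick Read]: This paper addresses the complexity of translating code among many programming languages: traditional approaches need a separate translator for each language pair, so development cost grows rapidly as languages are added. The key to the solution is a unified intermediate representation (IR) called CrossGL, through which CrossTL translates bidirectionally between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo. With this design, adding a language requires only language-specific frontend and backend components, which improves extensibility and moves toward language-agnostic, write-once, deploy-everywhere development.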

Link: https://arxiv.org/abs/2508.21256
Authors: Nripesh Niketan, Vaatsalya Shrivastva
Affiliations: Unknown
Categories: Programming Languages (cs.PL); Computation and Language (cs.CL); Graphics (cs.GR)
Comments: 15 Pages, 5 Figures, 1 Table. Introduces CrossTL, a universal programming language translator enabling bidirectional translation between 8 programming languages (CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, Mojo) through a unified intermediate representation called CrossGL. Includes comprehensive evaluation with complex real-world examples

Abstract:We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, and Mojo, with Slang support in development. Our system consists of: language-specific lexers/parsers converting source code to ASTs, bidirectional CrossGL translation modules implementing ToCrossGLConverter classes for importing code and CodeGen classes for target generation, and comprehensive backend implementations handling full translation pipelines. We demonstrate effectiveness through comprehensive evaluation across programming domains, achieving successful compilation and execution across all supported backends. The universal IR design enables adding new languages with minimal effort, requiring only language-specific frontend/backend components. Our contributions include: (1) a unified IR capturing semantics of multiple programming paradigms, (2) a modular architecture enabling extensibility, (3) a comprehensive framework supporting GPU compute, graphics programming, and systems languages, and (4) empirical validation demonstrating practical viability of universal code translation. CrossTL represents a significant step toward language-agnostic programming, enabling write-once, deploy-everywhere development.
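
The benefit of a shared IR is easy to see by counting components. The sketch below assumes one translator per ordered language pair for the direct approach, and one frontend plus one backend per language for the IR approach. Both counting rules are simplifying assumptions, not figures from the paper, and the function names are hypothetical.

```python
def pairwise_translators(n_languages):
    """Direct approach: one translator for every ordered (source, target) pair."""
    return n_languages * (n_languages - 1)


def ir_components(n_languages):
    """Shared-IR approach: one frontend (to IR) and one backend (from IR) each."""
    return 2 * n_languages


languages = [
    "CUDA", "HIP", "Metal", "DirectX HLSL",
    "OpenGL GLSL", "Vulkan SPIR-V", "Rust", "Mojo",
]
n = len(languages)
direct = pairwise_translators(n)   # 8 * 7 = 56
via_ir = ir_components(n)          # 2 * 8 = 16

# Cost of adding a ninth language under each scheme.
extra_direct = pairwise_translators(n + 1) - direct
extra_via_ir = ir_components(n + 1) - via_ir
```

Under these assumptions, supporting a ninth language would take 16 new pairwise translators but only 2 new IR components, which is the extensibility argument the abstract makes.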

[NLP-28] Decoding Memories: An Efficient Pipeline for Self-Consistency Hallucination Detection

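[Quick Read]: This paper addresses hallucination in Large Language Models (LLMs), and in particular the limits of existing detection methods, which often perform poorly on sentence-level generation or depend on domain-specific knowledge. Self-consistency approaches help with these limitations but are computationally expensive because they generate repeatedly. The key insight is that self-consistency methods contain redundancy: different generations share prefix tokens, and non-exact-answer tokens contribute minimally to the semantic content. Building on this, the authors propose the Decoding Memory Pipeline (DMP), which accelerates generation through selective inference and annealed decoding. DMP achieves up to a 3x speedup without sacrificing AUROC, and it is orthogonal to the model, dataset, decoding strategy, and self-consistency baseline.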

Link: https://arxiv.org/abs/2508.21228
Authors: Weizhi Gao, Xiaorui Liu, Feiyi Wang, Dan Lu, Junqi Yin
Affiliations: Oak Ridge National Laboratory
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, under review

Abstract:Large language models (LLMs) have demonstrated impressive performance in both research and real-world applications, but they still struggle with hallucination. Existing hallucination detection methods often perform poorly on sentence-level generation or rely heavily on domain-specific knowledge. While self-consistency approaches help address these limitations, they incur high computational costs due to repeated generation. In this paper, we conduct the first study on identifying redundancy in self-consistency methods, manifested as shared prefix tokens across generations, and observe that non-exact-answer tokens contribute minimally to the semantic content. Based on these insights, we propose a novel Decoding Memory Pipeline (DMP) that accelerates generation through selective inference and annealed decoding. Being orthogonal to the model, dataset, decoding strategy, and self-consistency baseline, our DMP consistently improves the efficiency of multi-response generation and holds promise for extension to alignment and reasoning tasks. Extensive experiments show that our method achieves up to a 3x speedup without sacrificing AUROC performance.
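
The redundancy the abstract describes, shared prefix tokens across sampled generations, can be measured with a few lines of code. This sketch only illustrates that observation and is not the Decoding Memory Pipeline itself. Whitespace-split words stand in for model tokens, and the sample responses are invented.

```python
def shared_prefix_length(sequences):
    """Number of leading tokens that are identical across all sequences."""
    if not sequences:
        return 0
    length = 0
    # zip stops at the shortest sequence, so no index can go out of range.
    for column in zip(*sequences):
        if any(tok != column[0] for tok in column):
            break
        length += 1
    return length


# Hypothetical sampled responses to the same prompt.
samples = [
    "The capital of France is Paris".split(),
    "The capital of France is Lyon".split(),
    "The capital of France is probably Paris".split(),
]
prefix = shared_prefix_length(samples)

# Tokens decoded again in every response after the first one.
recomputed = prefix * (len(samples) - 1)
```

In this toy case 10 of the 19 generated tokens repeat work already done for the first response, which is the kind of overhead a decoding memory could avoid.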

[NLP-29] Do Self-Supervised Speech Models Exhibit the Critical Period Effects in Language Acquisition?

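[Quick Read]: This paper asks whether self-supervised speech models (S3Ms) exhibit the Critical Period (CP) effects seen in human language acquisition. CP effects refer to greater difficulty in acquiring a second language (L2) when L2 exposure starts later, and greater retention of the first language (L1) when L1 exposure ends later. Earlier work studied these effects mainly with text-based language models, and their presence in speech models remained underexplored despite the central role of spoken language in human acquisition. The approach is to train S3Ms on child-directed speech while varying the L2 training onset and the L1 training offset, and then evaluate phone discrimination. The S3Ms show no clear evidence of either CP effect: models with delayed L2 exposure onset tend to perform better on L2, and delayed L1 exposure offset leads to L1 forgetting, which suggests that these models may differ from humans in how they acquire language.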

Link: https://arxiv.org/abs/2508.21210
Authors: Yurie Koga, Shunsuke Kando, Yusuke Miyao
Affiliations: The University of Tokyo; NII LLMC
Categories: Computation and Language (cs.CL)
Comments: Accepted to ASRU 2025

Abstract:This paper investigates whether the Critical Period (CP) effects in human language acquisition are observed in self-supervised speech models (S3Ms). CP effects refer to greater difficulty in acquiring a second language (L2) with delayed L2 exposure onset, and greater retention of their first language (L1) with delayed L1 exposure offset. While previous work has studied these effects using textual language models, their presence in speech models remains underexplored despite the central role of spoken language in human language acquisition. We train S3Ms with varying L2 training onsets and L1 training offsets on child-directed speech and evaluate their phone discrimination performance. We find that S3Ms do not exhibit clear evidence of either CP effects in terms of phonological acquisition. Notably, models with delayed L2 exposure onset tend to perform better on L2 and delayed L1 exposure offset leads to L1 forgetting.

[NLP-30] Designing Smarter Conversational Agents for Kids: Lessons from Cognitive Work and Means-Ends Analyses

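[Quick Read]: This paper addresses the inefficient interactions and weak learning outcomes that arise when children use conversational agents (CAs) for schoolwork, discovery, and entertainment without structured support. The key to the solution is a scaffolded conversation-tree design that uses structured prompting to mirror supportive parent-child interaction, with the aim of improving readability, question count, depth and diversity, and coherence in child-CA exchanges. The work combines an empirical study of how Brazilian children aged 9-11 use CAs with simulated tests using a large language model (LLM), and reports gains for the "recipe" scaffolding approach over an unstructured baseline.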

Link: https://arxiv.org/abs/2508.21209
Authors: Vanessa Figueiredo
Affiliations: University of Regina
Categories: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL)
Comments:

Abstract:This paper presents two studies on how Brazilian children (ages 9–11) use conversational agents (CAs) for schoolwork, discovery, and entertainment, and how structured scaffolds can enhance these interactions. In Study 1, a seven-week online investigation with 23 participants (children, parents, teachers) employed interviews, observations, and Cognitive Work Analysis to map children’s information-processing flows, the role of more knowledgeable others, functional uses, contextual goals, and interaction patterns to inform conversation-tree design. We identified three CA functions: School, Discovery, Entertainment, and derived “recipe” scaffolds mirroring parent-child support. In Study 2, we prompted GPT-4o-mini on 1,200 simulated child-CA exchanges, comparing conversation-tree recipes based on structured-prompting to an unstructured baseline. Quantitative evaluation of readability, question count/depth/diversity, and coherence revealed gains for the recipe approach. Building on these findings, we offer design recommendations: scaffolded conversation-trees, child-dedicated profiles for personalized context, and caregiver-curated content. Our contributions include the first CWA application with Brazilian children, an empirical framework of child-CA information flows, and an LLM-scaffolding “recipe” (i.e., structured-prompting) for effective, scaffolded learning.

[NLP-31] Enhancing Robustness of Autoregressive Language Models against Orthographic Attacks via Pixel-based Approach

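[Quick Read]: This paper addresses the vulnerability of autoregressive language models to orthographic attacks, in which input text is perturbed with characters from multilingual alphabets and performance drops substantially. The vulnerability stems mainly from the out-of-vocabulary (OOV) issue inherent in subword tokenizers and their embeddings. The key to the solution is a pixel-based generative language model that renders words as individual images and replaces text-based embeddings with pixel-based representations, which improves robustness to noisy inputs and extends compatibility to multilingual text across diverse writing systems.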

Link: https://arxiv.org/abs/2508.21206
Authors: Han Yang, Jian Lan, Yihong Liu, Hinrich Schütze, Thomas Seidl
Affiliations: LMU Munich; GESIS - Leibniz Institute for the Social Sciences; Center for Information and Language Processing, LMU Munich; Munich Center for Machine Learning
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Autoregressive language models are vulnerable to orthographic attacks, where input text is perturbed with characters from multilingual alphabets, leading to substantial performance degradation. This vulnerability primarily stems from the out-of-vocabulary issue inherent in subword tokenizers and their embeddings. To address this limitation, we propose a pixel-based generative language model that replaces the text-based embeddings with pixel-based representations by rendering words as individual images. This design provides stronger robustness to noisy inputs, while an extension of compatibility to multilingual text across diverse writing systems. We evaluate the proposed method on the multilingual LAMBADA dataset, WMT24 dataset and the SST-2 benchmark, demonstrating both its resilience to orthographic noise and its effectiveness in multilingual settings.
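
A tiny example shows why look-alike characters defeat a vocabulary lookup even though a rendered image of the word would look almost unchanged. The two-word vocabulary and the `lookup` helper are hypothetical. Real subword tokenizers usually fall back to rare pieces or bytes rather than a single unknown ID, so this is a simplified picture of the out-of-vocabulary issue, not the paper's setup.

```python
import unicodedata

# Toy word-level vocabulary (assumption: real models use subword vocabularies).
vocab = {"paypal": 0, "login": 1}
UNK = -1


def lookup(word):
    """Return the vocabulary ID for a word, or UNK if it is not present."""
    return vocab.get(word, UNK)


clean = "paypal"
# Orthographic attack: Latin "a" replaced by Cyrillic "а" (U+0430).
attacked = "p\u0430yp\u0430l"

clean_id = lookup(clean)
attacked_id = lookup(attacked)

# Unicode names of the characters that are not in the clean word.
swapped = [unicodedata.name(ch) for ch in attacked if ch not in clean]
```

The two strings look nearly identical on screen but differ in code points, so the text lookup fails. This visual similarity is what rendering words as images is meant to exploit.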

[NLP-32] Fuzzy Symbolic and Contextual: Enhancing LLM Instruction via Cognitive Scaffolding

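Quick read: This paper studies how architectural inductive biases shape the cognitive behavior of large language models (LLMs) in instructional dialogue, and in particular how structured design can improve adaptive reasoning in Socratic tutoring. The key to the solution is an architecture that pairs a symbolic scaffolding mechanism with a short-term memory schema to promote abstraction, adaptive probing, and conceptual continuity in tutoring interactions. Ablation experiments show that removing memory or symbolic structure degrades these behaviors, which supports the claim that architectural scaffolds can reliably shape emergent instructional strategies in LLMs.

Link: https://arxiv.org/abs/2508.21204
Authors: Vanessa Figueiredo
Affiliations: University of Regina
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)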

Abstract:We study how architectural inductive biases influence the cognitive behavior of large language models (LLMs) in instructional dialogue. We introduce a symbolic scaffolding mechanism paired with a short-term memory schema designed to promote adaptive, structured reasoning in Socratic tutoring. Using controlled ablation across five system variants, we evaluate model outputs via expert-designed rubrics covering scaffolding, responsiveness, symbolic reasoning, and conversational memory. We present preliminary results using an LLM-based evaluation framework aligned to a cognitively grounded rubric. This enables scalable, systematic comparisons across architectural variants in early-stage experimentation. The preliminary results show that our full system consistently outperforms baseline variants. Analysis reveals that removing memory or symbolic structure degrades key cognitive behaviors, including abstraction, adaptive probing, and conceptual continuity. These findings support a processing-level account in which architectural scaffolds can reliably shape emergent instructional strategies in LLMs.

[NLP-33] Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization

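Quick read: This paper addresses the limited scalability and consistency of traditional human-factors analysis of aviation accidents with HFACS, with the goal of automating aviation safety analysis. The key to the solution is a reinforcement-learning framework for automated HFACS classification that fine-tunes a Llama-3.1 8B language model with Group Relative Policy Optimization (GRPO), using a multi-component reward tailored to aviation safety analysis and synthetic data generation to counter class imbalance in accident datasets. Exact match accuracy rises from 0.0400 to 0.1800 and partial match accuracy reaches 0.8800. The model also outperforms current mainstream LLMs on key metrics, which suggests that small domain-optimized models can be computationally efficient and offer a feasible route to low-latency deployment on resource-constrained edge devices.

Link: https://arxiv.org/abs/2508.21201
Authors: Arash Ahmadi, Sarah Sharif, Yaser Banad
Affiliations: University of Oklahoma
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)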

Abstract:Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-flash, on key metrics. This research also proposes exact match accuracy in multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
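
The two accuracy figures quoted in this entry measure different things, and a small sketch may make the gap between 0.18 and 0.88 easier to read. The abstract does not define partial match, so the definition below (at least one predicted label appears in the gold set) is an assumption, and the label names are illustrative rather than taken from the paper's data. The last helper only checks the arithmetic behind the "350% increase".

```python
def exact_match(pred: set, gold: set) -> bool:
    """True only when the predicted label set equals the gold label set."""
    return pred == gold


def partial_match(pred: set, gold: set) -> bool:
    """Assumed definition: at least one predicted label is in the gold set."""
    return bool(pred & gold)


def accuracy(preds, golds, match) -> float:
    """Fraction of examples for which `match` holds."""
    return sum(match(p, g) for p, g in zip(preds, golds)) / len(golds)


def relative_increase(before: float, after: float) -> float:
    """Relative change; 3.5 corresponds to a 350% increase."""
    return (after - before) / before


# Illustrative labels only; they are not taken from the paper's dataset.
golds = [{"Decision Error", "Adverse Mental State"}, {"Skill-Based Error"}]
preds = [{"Decision Error"}, {"Skill-Based Error"}]

print(accuracy(preds, golds, exact_match))    # first example misses one label
print(accuracy(preds, golds, partial_match))  # both examples share a label
print(relative_increase(0.04, 0.18))          # (0.18 - 0.04) / 0.04
```

Exact match is strict in the multi-label setting because a single missing or extra label fails the whole example, which is why it sits far below the partial match score.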

[NLP-34] Model-Task Alignment Drives Distinct RL Outcomes

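Quick read: This paper asks why a series of counterintuitive phenomena appear when reinforcement learning (RL) is applied to large language models (LLMs), for example that a single training example can match a full dataset, that the reward signal need not be very accurate, and that training on negative samples alone can match or surpass sophisticated reward-based methods, and under what conditions these phenomena hold or fail. The key is identifying one core factor: the degree of Model-Task Alignment between the pretrained model and the target task, measured by pass@k accuracy. The counterintuitive results appear only when model and task are already strongly aligned. In harder, weakly aligned regimes, standard RL remains robust while these techniques fail.

Link: https://arxiv.org/abs/2508.21188
Authors: Haoze Wu, Cheng Wang, Wenshuo Zhao, Junxian He
Affiliations: Zhejiang University; National University of Singapore; HKUST
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)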

Abstract:Recent advances in applying reinforcement learning (RL) to large language models (LLMs) have led to substantial progress. In particular, a series of remarkable yet often counterintuitive phenomena have been reported in LLMs, exhibiting patterns not typically observed in traditional RL settings. For example, notable claims include that a single training example can match the performance achieved with an entire dataset, that the reward signal does not need to be very accurate, and that training solely with negative samples can match or even surpass sophisticated reward-based methods. However, the precise conditions under which these observations hold - and, critically, when they fail - remain unclear. In this work, we identify a key factor that differentiates RL observations: whether the pretrained model already exhibits strong Model-Task Alignment, as measured by pass@k accuracy on the evaluated task. Through a systematic and comprehensive examination of a series of counterintuitive claims, supported by rigorous experimental validation across different model architectures and task domains, our findings show that while standard RL training remains consistently robust across settings, many of these counterintuitive results arise only when the model and task already exhibit strong model-task alignment. In contrast, these techniques fail to drive substantial learning in more challenging regimes, where standard RL methods remain effective.
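
Since the alignment measure here is pass@k, a short sketch of how that quantity is commonly estimated may help. The abstract does not say which estimator the authors use; the code below assumes the widely used unbiased estimator from code-generation evaluation, where `n` samples are drawn per problem and `c` of them are correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: 16 samples, 2 of them correct.
for k in (1, 4, 8):
    print(k, pass_at_k(n=16, c=2, k=k))
```

A model that rarely solves a task at k=1 but often at larger k still has the behaviour within reach of sampling, which is one way to read "strong model-task alignment" in this entry.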

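[NLP-35] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Quick read: This paper addresses the lack of intelligent, adaptive information gathering by large language models (LLMs) in multi-turn dialogue, with the aim of making them more effective as interactive agents working with external environments. The core of the solution is BED-LLM, which uses the Bayesian experimental design (BED) framework to iteratively choose the question that maximizes the expected information gain (EIG). Key innovations include a probabilistic model derived from the LLM's belief distribution to formalize the EIG, a carefully designed EIG estimator, conditioning on previous responses without relying solely on in-context updates, and a targeted strategy for proposing candidate queries. The method yields substantial gains on tasks such as the 20-questions game and inferring user preferences.

Link: https://arxiv.org/abs/2508.21184
Authors: Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Affiliations: University of Oxford; Apple; City University of Hong Kong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)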

Abstract:We propose a general-purpose approach for improving the ability of Large Language Models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian Experimental Design with Large Language Models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) about the task of interest given the responses gathered previously. We show how this EIG can be formulated in a principled way using a probabilistic model derived from the LLM’s belief distribution and provide detailed insights into key decisions in its construction. Further key to the success of BED-LLM are a number of specific innovations, such as a carefully designed estimator for the EIG, not solely relying on in-context updates for conditioning on previous responses, and a targeted strategy for proposing candidate queries. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20-questions game and using the LLM to actively infer user preferences, compared to direct prompting of the LLM and other adaptive design strategies.
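
To make "choose the question that maximizes the expected information gain" concrete, here is a toy sketch using the textbook definition of EIG over a small discrete hypothesis set. It is not BED-LLM's estimator, which the abstract describes as more carefully designed and built on the LLM's belief distribution. The hypotheses, questions, and the helper `yes_no` are hypothetical.

```python
from math import log2


def entropy(probs) -> float:
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)


def expected_information_gain(prior: dict, likelihood: dict) -> float:
    """EIG = H[p(answer)] - E_prior[ H[p(answer | hypothesis)] ]."""
    answers = {a for dist in likelihood.values() for a in dist}
    marginal = [
        sum(prior[h] * likelihood[h].get(a, 0.0) for h in prior) for a in answers
    ]
    expected_conditional = sum(
        prior[h] * entropy(likelihood[h].values()) for h in prior
    )
    return entropy(marginal) - expected_conditional


def yes_no(hypotheses, yes_set) -> dict:
    """Deterministic answer model: 'yes' for hypotheses in yes_set, else 'no'."""
    return {h: {"yes": 1.0} if h in yes_set else {"no": 1.0} for h in hypotheses}


# Toy 20-questions state: four equally likely hidden objects.
prior = {h: 0.25 for h in ("cat", "dog", "oak", "fern")}
eig_animal = expected_information_gain(prior, yes_no(prior, {"cat", "dog"}))
eig_cat = expected_information_gain(prior, yes_no(prior, {"cat"}))

candidates = [("Is it an animal?", eig_animal), ("Is it a cat?", eig_cat)]
best = max(candidates, key=lambda item: item[1])
print(best)
```

The question that splits the remaining hypotheses evenly scores higher than the one that singles out one object, which is the behaviour a greedy EIG policy is meant to produce.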

[NLP-36] Quantifying Label-Induced Bias in Large Language Model Self- and Cross-Evaluations

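Quick read: This paper examines bias in self-evaluation and cross-model evaluation by large language models (LLMs), focusing on how model identity labels interfere with judgments. Model name labels such as "Claude" or "Gemini" noticeably distort results, producing systematic shifts in preference votes and in quality ratings (Coherence, Informativeness, Conciseness), with preference votes moving by up to 50 percentage points in some cases. The proposed remedy is blind evaluation or multimodel evaluation protocols, which remove the bias introduced by perceived model identity and support fairer, more objective LLM benchmarking.

Link: https://arxiv.org/abs/2508.21164
Authors: Muskan Saraf, Sajjad Rezvani Boroujeni, Justin Beaudry, Hossein Abedi, Tom Bush
Affiliations: Actual Reality Technologies
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)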

Abstract:Large language models (LLMs) are increasingly used to evaluate outputs, yet their judgments may be influenced. This study examines bias in self- and cross-model evaluations by ChatGPT, Gemini, and Claude under four conditions: no labels, true labels, and two false-label scenarios. Blog posts authored by each model were evaluated by all three using both overall preference voting and quality ratings for Coherence, Informativeness, and Conciseness, with all scores expressed as percentages for direct comparison. Results reveal striking asymmetries: the “Claude” label consistently boosts scores, while the “Gemini” label consistently depresses them, regardless of actual content. False labels frequently reversed rankings, producing shifts of up to 50 percentage points in preference votes and up to 12 percentage points in converted quality ratings. Gemini’s self-scores collapsed under true labels, while Claude’s self-preference intensified. These findings show that perceived model identity can heavily distort high-level judgments and subtly influence detailed quality ratings, underscoring the need for blind or multimodel evaluation protocols to ensure fairness in LLM benchmarking.

[NLP-37] A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers

Quick read: This survey addresses the bottleneck that the complexity of scientific data places on Scientific Large Language Models (Sci-LLMs), in particular how to build a high-quality, multimodal, cross-scale data foundation that preserves domain invariance and supports continued model development. Its approach is a data-centric reframing. First, it formulates a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, showing how scientific corpora differ from general NLP data. Second, it reviews over 270 pre-training and post-training datasets and spells out the demand for heterogeneous, uncertainty-laden, multimodal data. Third, it traces a shift in evaluation from static exams toward process-oriented and discovery-oriented assessment, and discusses semi-automated annotation pipelines combined with expert validation. It closes with a move toward closed-loop systems in which autonomous agents based on Sci-LLMs actively experiment, validate, and update a living scientific knowledge base, with trustworthy, continually evolving AI as a research partner as the goal.

Link: https://arxiv.org/abs/2508.21148
Authors: Ming Hu, Chenglong Ma, Wei Li, Wanghan Xu, Jiamin Wu, Jucheng Hu, Tianbin Li, Guohang Zhuang, Jiaqi Liu, Yingzhou Lu, Ying Chen, Chaoyang Zhang, Cheng Tan, Jie Ying, Guocheng Wu, Shujian Gao, Pengcheng Chen, Jiashi Lin, Haitao Wu, Lulu Chen, Fengxiang Wang, Yuanyuan Zhang, Xiangyu Zhao, Feilong Tang, Encheng Su, Junzhi Ning, Xinyao Liu, Ye Du, Changkai Ji, Cheng Tang, Huihui Xu, Ziyang Chen, Ziyan Huang, Jiyao Liu, Pengfei Jiang, Yizhou Wang, Chen Tang, Jianyu Wu, Yuchen Ren, Siyuan Yan, Zhonghua Wang, Zhongxing Xu, Shiyan Su, Shangquan Sun, Runkai Zhao, Zhisheng Zhang, Yu Liu, Fudi Wang, Yuanfeng Ji, Yanzhou Su, Hongming Shan, Chunmei Feng, Jiahao Xu, Jiangtao Yan, Wenhao Tang, Diping Song, Lihao Liu, Yanyan Huang, Lequan Yu, Bin Fu, Shujun Wang, Xiaomeng Li, Xiaowei Hu, Yun Gu, Ben Fei, Zhongying Deng, Benyou Wang, Yuewen Cao, Minjie Shen, Haodong Duan, Jie Xu, Yirong Chen, Fang Yan, Hongxia Hao, Jielan Li, Jiajun Du, Yanbo Wang, Imran Razzak, Chi Zhang, Lijun Wu, Conghui He, Zhaohui Lu, Jinhai Huang, Yihao Liu, Fenghua Ling, Yuqiang Li, Aoran Wang, Qihao Zheng, Nanqing Dong, Tianfan Fu, Dongzhan Zhou, Yan Lu, Wenlong Zhang, Jin Ye, Jianfei Cai, Wanli Ouyang, Yu Qiao, Zongyuan Ge, Shixiang Tang, Junjun He
Affiliations: Shanghai Artificial Intelligence Laboratory; Monash University; Fudan University; Shanghai Jiao Tong University; The Chinese University of Hong Kong; University College London; UNC-Chapel Hill; Stanford University; Virginia Tech; Purdue University; The University of Hong Kong; China Pharmaceutical University; Beijing Institute of Heart, Lung and Blood Vessel Diseases; Chinese Academy of Sciences; Fuzhou University; University College Dublin; The Hong Kong Polytechnic University; The Hong Kong University of Science and Technology; South China University of Technology; University of Cambridge; The Chinese University of Hong Kong, Shenzhen; Caltech; North University of China; MBZUAI; Nanjing University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)

Abstract:Scientific Large Language Models (Sci-LLMs) are transforming how knowledge is represented, integrated, and applied in scientific research, yet their progress is shaped by the complex nature of scientific data. This survey presents a comprehensive, data-centric synthesis that reframes the development of Sci-LLMs as a co-evolution between models and their underlying data substrate. We formulate a unified taxonomy of scientific data and a hierarchical model of scientific knowledge, emphasizing the multimodal, cross-scale, and domain-specific challenges that differentiate scientific corpora from general natural language processing datasets. We systematically review recent Sci-LLMs, from general-purpose foundations to specialized models across diverse scientific disciplines, alongside an extensive analysis of over 270 pre-/post-training datasets, showing why Sci-LLMs pose distinct demands – heterogeneous, multi-scale, uncertainty-laden corpora that require representations preserving domain invariance and enabling cross-modal reasoning. On evaluation, we examine over 190 benchmark datasets and trace a shift from static exams toward process- and discovery-oriented assessments with advanced evaluation protocols. These data-centric analyses highlight persistent issues in scientific data development and discuss emerging solutions involving semi-automated annotation pipelines and expert validation. Finally, we outline a paradigm shift toward closed-loop systems where autonomous agents based on Sci-LLMs actively experiment, validate, and contribute to a living, evolving knowledge base. Collectively, this work provides a roadmap for building trustworthy, continually evolving artificial intelligence (AI) systems that function as a true partner in accelerating scientific discovery.

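[NLP-38] Can Multimodal LLMs Solve the Basic Perception Problems of Percept-V?

Quick read: This paper addresses a gap in how Multimodal Large Language Models (MLLMs) are evaluated: current evaluation concentrates on complex tasks such as coding, mathematics, and scientific reasoning, with little systematic testing of basic perception, especially recognizing simple shapes and structures in clean, generated images. The authors introduce the Percept-V dataset, which contains 7,200 program-generated images divided equally into 30 categories, each testing a combination of visual perception skills, with tasks of varying complexity. The dataset allows an objective assessment of MLLM perception at different complexity levels and shows that performance drops significantly as task complexity increases.

Link: https://arxiv.org/abs/2508.21143
Authors: Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla
Affiliations: unknown
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)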

Abstract:The reasoning abilities of Multimodal Large Language Models (MLLMs) have garnered a lot of attention in recent times, with advances made in frontiers like coding, mathematics, and science. However, very limited experiments have been done to assess their performance in simple perception tasks performed over uncontaminated, generated images containing basic shapes and structures. To address this issue, the paper introduces a dataset, Percept-V, containing a total of 7200 program-generated images equally divided into 30 categories, each testing a combination of visual perception skills. Unlike previously proposed datasets, Percept-V comprises very basic tasks of varying complexity that test the perception abilities of MLLMs. This dataset is then tested on state-of-the-art MLLMs like GPT-4o, Gemini, and Claude as well as Large Reasoning Models (LRMs) like OpenAI o4-mini and DeepSeek R1 to gauge their performance. Contrary to the evidence that MLLMs excel in many complex tasks, our experiments show a significant drop in the models’ performance with increasing problem complexity across all categories. An analysis of the performances also reveals that the tested MLLMs exhibit a similar trend in accuracy across categories that test a particular cognitive skill, and that some skills are more difficult than others.

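[NLP-39] How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations

Quick read: This paper examines how cognitive biases affect the reliability of large language models (LLMs) in real-world settings, focusing on the anchoring effect in LLM-driven price negotiation. In the experiments, seller LLM agents are instructed to apply the anchoring effect, and negotiation outcomes are evaluated with both an objective and a subjective metric to test whether LLMs are influenced by anchoring as humans are. The paper also investigates how reasoning and personality relate to susceptibility: reasoning models with a long chain of thought are less prone to the effect, while personality traits show no significant correlation. The findings provide empirical support for safer and more responsible use of LLMs in society.

Link: https://arxiv.org/abs/2508.21137
Authors: Yoshiki Takenami, Yin Jou Huang, Yugo Murawaki, Chenhui Chu
Affiliations: Kyoto University
Categories: Computation and Language (cs.CL)
Notes: work in progress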

Abstract:Cognitive biases, well-studied in humans, can also be observed in LLMs, affecting their reliability in real-world applications. This paper investigates the anchoring effect in LLM-driven price negotiations. To this end, we instructed seller LLM agents to apply the anchoring effect and evaluated negotiations using not only an objective metric but also a subjective metric. Experimental results show that LLMs are influenced by the anchoring effect like humans. Additionally, we investigated the relationship between the anchoring effect and factors such as reasoning and personality. It was shown that reasoning models are less prone to the anchoring effect, suggesting that the long chain of thought mitigates the effect. However, we found no significant correlation between personality traits and susceptibility to the anchoring effect. These findings contribute to a deeper understanding of cognitive biases in LLMs and to the realization of safe and responsible application of LLMs in society.

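[NLP-40] TrInk: Ink Generation with Transformer Network EMNLP2025

Quick read: This paper addresses high character recognition error rates and imprecise alignment between text and stroke points in handwriting generation. The core of the solution is TrInk, a Transformer-based ink generation model that introduces scaled positional embeddings and a Gaussian memory mask in the cross-attention module to strengthen the alignment between the input text and the generated stroke points. Subjective and objective evaluation pipelines are used together to assess the legibility and style consistency of the generated handwriting. On the IAM-OnDB dataset, the method reduces character error rate (CER) by 35.56% and word error rate (WER) by 29.66% compared with previous methods.

Link: https://arxiv.org/abs/2508.21098
Authors: Zezhong Jin, Shubhang Desai, Xu Chen, Biyi Fang, Zhuoyi Huang, Zhe Li, Chong-Xin Gan, Xiao Tu, Man-Wai Mak, Yan Lu, Shujie Liu
Affiliations: The Hong Kong Polytechnic University; Microsoft Corporation
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted to EMNLP 2025 Main Conference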

Abstract:In this paper, we propose TrInk, a Transformer-based model for ink generation, which effectively captures global dependencies. To better facilitate the alignment between the input text and generated stroke points, we introduce scaled positional embeddings and a Gaussian memory mask in the cross-attention module. Additionally, we design both subjective and objective evaluation pipelines to comprehensively assess the legibility and style consistency of the generated handwriting. Experiments demonstrate that our Transformer-based model achieves a 35.56% reduction in character error rate (CER) and a 29.66% reduction in word error rate (WER) on the IAM-OnDB dataset compared to previous methods. We provide a demo page with handwriting samples from TrInk and baseline models at: this https URL
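
The headline numbers in this entry are character and word error rates, so a minimal sketch of how those metrics are usually computed may be useful. It assumes the standard definition (Levenshtein edit distance divided by reference length); how the paper obtains recognized text from generated ink is outside this sketch, and the sample strings are made up.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits per reference word."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Made-up recognizer output for a generated handwriting sample.
print(cer("hello world", "hallo world"))
print(wer("the quick brown fox", "the quick brown dog"))
```

Lower values mean the generated handwriting is read back more faithfully, which is how these metrics act as a proxy for legibility.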

[NLP-41] Granite Embedding R2 Models

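Quick read: This paper responds to the need for accurate and efficient embedding models in enterprise-scale dense retrieval, across diverse retrieval domains such as text, code, long documents, multi-turn conversation, and tabular data. The solution is the Granite Embedding R2 model family, whose main features are: 1) a context length of 8,192 tokens, 16 times that of the previous generation, for better handling of long sequences; 2) both bi-encoder and cross-encoder architectures, including a 22-layer retriever, an efficient 12-layer counterpart, and a high-quality reranker; 3) training exclusively on enterprise-appropriate data with governance oversight; 4) speed advantages of 19-44% over leading competitors while maintaining accuracy, with release under the Apache 2.0 license to meet enterprise requirements for performance, licensing, and data provenance.

Link: https://arxiv.org/abs/2508.21085
Authors: Parul Awasthy, Aashka Trivedi, Yulong Li, Meet Doshi, Riyaz Bhat, Vignesh P, Vishwajeet Kumar, Yushu Yang, Bhavani Iyer, Abraham Daniels, Rudra Murthy, Ken Barker, Martin Franz, Madison Lee, Todd Ward, Salim Roukos, David Cox, Luis Lastras, Jaydeep Sen, Radu Florian
Affiliations: IBM Research AI
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)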

Abstract:We introduce the Granite Embedding R2 models, a comprehensive family of high-performance English encoder-based embedding models engineered for enterprise-scale dense retrieval applications. Building upon our first-generation release, these models deliver substantial improvements, including 16x expanded context length (8,192 tokens), state-of-the-art performance across diverse retrieval domains - text, code, long-document search, multi-turn conversational, and tabular data - and measurable speed advantages of 19-44% over leading competitors while maintaining superior accuracy. Our release encompasses both bi-encoder and cross-encoder architectures, featuring a highly effective 22-layer retriever model and its efficient 12-layer counterpart, alongside a high-quality reranker model, all trained exclusively on enterprise-appropriate data with comprehensive governance oversight. The models demonstrate exceptional versatility across standard benchmarks, IBM-developed evaluation suites, and real-world enterprise use cases, establishing new performance standards for open-source embedding models. In an era where retrieval speed and accuracy are paramount for competitive advantage, the Granite R2 models deliver a compelling combination of cutting-edge performance, enterprise-ready licensing, and transparent data provenance that organizations require for mission-critical deployments. All models are publicly available under the Apache 2.0 license at this https URL, enabling unrestricted research and commercial use.

[NLP-42] Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting EMNLP2025

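Quick read: This paper addresses the lack of demographic context in existing toxic speech datasets, which limits understanding of how different age groups communicate online. The solution is the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. It contains 33,048 anonymized comments in total from Instagram, TikTok, and YouTube (3,024 human-annotated and 30,024 LLM-annotated). Comments were selected using predefined toxic keywords to ensure relevance, and annotation combined human experts with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The results show age-based differences in toxic speech: younger users favor expressive language, while older users more often engage in disinformation and devaluation. The dataset is a resource for building more equitable and age-aware content moderation systems.

Link: https://arxiv.org/abs/2508.21084
Authors: Jan Fillies, Michael Peter Hoffmann, Rebecca Reichel, Roman Salzwedel, Sven Bodemer, Adrian Paschke
Affiliations: Freie Universität Berlin; Fraunhofer FOKUS; MSB Medical School Berlin; funk - Content-Netzwerk; ARD / ZDF; InfAI Leipzig
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Notes: The paper has been accepted to the EMNLP 2025 main track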

Abstract:A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.

[NLP-43] CoBA: Counterbias Text Augmentation for Mitigating Various Spurious Correlations via Semantic Triples EMNLP2025

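Quick read: This paper addresses the tendency of deep learning models to learn and rely on spurious correlations in training data, which degrades performance and generalization on unseen data. The key to the solution is a more general form of counterfactual data augmentation, CounterBias Augmentation (CoBA), which operates on text at the level of semantic triples (subject-predicate-object). It first decomposes text into triples, then selectively modifies those triples to break spurious associations, and finally reconstructs text from the adjusted triples to produce counterbias data. This mitigates multiple biases at once (for example gender bias and simplicity bias) and improves out-of-distribution robustness.

Link: https://arxiv.org/abs/2508.21083
Authors: Kyohoon Jin, Juhwan Choi, Jungmin Yun, Junho Lee, Soojin Jang, Youngbin Kim
Affiliations: DATUMO; AITRICS; Graduate School of Advanced Imaging Sciences, Multimedia and Film, Chung-Ang University; Department of Artificial Intelligence, Chung-Ang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Notes: Accepted at EMNLP 2025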

Abstract:Deep learning models often learn and exploit spurious correlations in training data, using these non-target features to inform their predictions. Such reliance leads to performance degradation and poor generalization on unseen data. To address these limitations, we introduce a more general form of counterfactual data augmentation, termed counterbias data augmentation, which simultaneously tackles multiple biases (e.g., gender bias, simplicity bias) and enhances out-of-distribution robustness. We present CoBA: CounterBias Augmentation, a unified framework that operates at the semantic triple level: first decomposing text into subject-predicate-object triples, then selectively modifying these triples to disrupt spurious correlations. By reconstructing the text from these adjusted triples, CoBA generates counterbias data that mitigates spurious patterns. Through extensive experiments, we demonstrate that CoBA not only improves downstream task performance, but also effectively reduces biases and strengthens out-of-distribution resilience, offering a versatile and robust solution to the challenges posed by spurious correlations.

[NLP-44] Normalisation of SWIFT Message Counterparties with Feature Extraction and Clustering

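Quick read: This paper addresses the clustering of transaction counterparties in bank payment messaging systems such as SWIFT. The data consists of manually typed tags that lack natural-language structure and contain a great deal of noise, so conventional natural language models are a poor fit. The authors propose a hybrid pipeline that combines string similarity, topic modelling, hierarchical clustering, and rules, which clusters without fixing the number of clusters in advance and is evaluated with metrics derived from precision and recall. The approach retains most of the interpretability of rule-based systems while improving recognition of entity variations and reducing manual review, which is especially useful where risky entities must be pinpointed, as in sanctions investigations.

Link: https://arxiv.org/abs/2508.21081
Authors: Thanasis Schoinas, Benjamin Guinard, Diba Esbati, Richard Chalk
Affiliations: unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)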

Abstract:Short text clustering is a known use case in the text analytics community. When the structure and content fall in the natural language domain, e.g. Twitter posts or instant messages, then natural language techniques can be used, provided texts are of sufficient length to allow for use of (pre)trained models to extract meaningful information, such as part-of-speech or topic annotations. However, natural language models are not suitable for clustering transaction counterparties, as they are found in bank payment messaging systems, such as SWIFT. The manually typed tags are typically physical or legal entity details, which lack sentence structure, while containing all the variations and noise that manual entry introduces. This leaves a gap in an investigator or counter-fraud professional’s toolset when looking to augment their knowledge of payment flow originator and beneficiary entities and trace funds and assets, a gap that vendors traditionally try to close with fuzzy matching tools. With these considerations in mind, we are proposing a hybrid string similarity, topic modelling, hierarchical clustering and rule-based pipeline to facilitate clustering of transaction counterparties, also catering for an unknown number of expected clusters. We are also devising metrics to supplement the evaluation of the approach, based on the well-known measures of precision and recall. Testing on a real-life labelled dataset demonstrates significantly improved performance over a baseline rule-based (‘keyword’) approach. The approach retains most of the interpretability found in rule-based systems, as the former adds an additional level of cluster refinement to the latter. The resulting workflow reduces the need for manual review. When only a subset of the population needs to be investigated, such as in sanctions investigations, the approach allows for better control of the risks of missing entity variations.
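
As a rough illustration of clustering counterparty names without fixing the number of clusters, the sketch below covers only two of the ingredients named in the abstract: string similarity and hierarchical clustering. It uses single-linkage clustering cut at a similarity threshold, implemented as connected components with a union-find structure. The topic-modelling and rule-based stages are not shown, and the normalisation rules, the 0.8 threshold, and the sample names are all assumptions rather than the authors' settings.

```python
from difflib import SequenceMatcher


def normalise(name: str) -> str:
    """Upper-case, drop full stops and commas, collapse whitespace."""
    return " ".join(name.upper().replace(".", " ").replace(",", " ").split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalised names."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def cluster_names(names, threshold: float = 0.8):
    """Single-linkage clustering cut at `threshold`; cluster count is not preset."""
    parent = list(range(len(names)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity(names[i], names[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


# Hypothetical manually typed counterparty tags.
names = ["ACME Trading Ltd", "Acme Trading Ltd.", "ACME TRADNG LTD",
         "GLOBEX CORP", "Globex Corp Inc"]
clusters = cluster_names(names)
for group in clusters:
    print(group)
```

The threshold plays the role that a preset cluster count would otherwise play, and it is the main knob for trading missed entity variations against false merges.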

[NLP-45] Database Normalization via Dual-LLM Self-Refinement

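Quick read: This paper addresses the time-consuming and error-prone nature of manual database normalization, which is typically carried out by data engineers. The solution is the Miffie framework, built around a dual-model self-refinement architecture in which a generation module and a verification module work together: the generation module removes anomalies based on feedback from the verification module until the output schema satisfies the normalization requirements. Carefully designed task-specific zero-shot prompts improve accuracy and cost efficiency, enabling automated, high-accuracy database normalization without human effort.

Link: https://arxiv.org/abs/2508.17693
Authors: Eunjae Jo, Nakyung Lee, Gyuyeong Kim
Affiliations: Sungshin Women’s University
Categories: Databases (cs.DB); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Notes: 5 pages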

Abstract:Database normalization is crucial to preserving data integrity. However, it is time-consuming and error-prone, as it is typically performed manually by data engineers. To this end, we present Miffie, a database normalization framework that leverages the capability of large language models. Miffie enables automated data normalization without human effort while preserving high accuracy. The core of Miffie is a dual-model self-refinement architecture that combines the best-performing models for normalized schema generation and verification, respectively. The generation module eliminates anomalies based on the feedback of the verification module until the output schema satisfies the requirement for normalization. We also carefully design task-specific zero-shot prompts to guide the models for achieving both high accuracy and cost efficiency. Experimental results show that Miffie can normalize complex database schemas while maintaining high accuracy.
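
The generate-then-verify loop described in this entry can be sketched as a small control loop. This is only an illustration of the pattern, not Miffie's implementation: the two callables stand in for LLM calls, the toy versions below just strip anomaly tags from a string, and `max_rounds` is an assumed safety limit that the abstract does not mention.

```python
from typing import Callable, Optional


def self_refine(schema: str,
                generate: Callable[[str, Optional[str]], str],
                verify: Callable[[str], Optional[str]],
                max_rounds: int = 5) -> str:
    """Alternate generation and verification until the verifier has no feedback."""
    candidate, feedback = schema, None
    for _ in range(max_rounds):
        candidate = generate(candidate, feedback)  # revise using latest feedback
        feedback = verify(candidate)               # None means "accepted"
        if feedback is None:
            return candidate
    raise RuntimeError("no candidate passed verification within max_rounds")


# Toy stand-ins for the two models (hypothetical; no LLM is called here).
ANOMALIES = ("[partial-dependency]", "[transitive-dependency]")


def toy_verify(schema: str) -> Optional[str]:
    """Report the first anomaly tag still present, or None if the schema is clean."""
    for tag in ANOMALIES:
        if tag in schema:
            return tag
    return None


def toy_generate(schema: str, feedback: Optional[str]) -> str:
    """Remove the anomaly named in the feedback, if any."""
    return schema if feedback is None else schema.replace(feedback, "").strip()


result = self_refine("orders [partial-dependency] [transitive-dependency]",
                     toy_generate, toy_verify)
print(result)
```

Keeping generation and verification as separate callables mirrors the dual-model idea: each role can be filled by whichever model performs best at it.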

[NLP-46] From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics

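Quick read: This paper examines whether large language models (LLMs) are ready to serve as unsupervised tutoring aids in undergraduate science education, and in particular whether they show the consistent, principle-grounded reasoning needed to teach subtle topics such as thermodynamics. The authors build UTQA, a 50-item benchmark covering ideal-gas processes, reversibility, and diagram interpretation, and evaluate leading 2025-era models on text and image reasoning items. The best LLMs reach only 82% accuracy, well below the 95% competence threshold, with marked weaknesses in finite-rate/irreversible scenarios and in binding visual features to thermodynamic meaning, indicating that current models are not yet suitable for unsupervised tutoring in this domain.

Link: https://arxiv.org/abs/2508.21452
Authors: Anna Geißler, Luca-Sophie Bien, Friedrich Schöppler, Tobias Hertel
Affiliations: University of Würzburg
Categories: Physics Education (physics.ed-ph); Computation and Language (cs.CL); Chemical Physics (physics.chem-ph)
Notes: Benchmark downloadable at this https URL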

Abstract:Large language models (LLMs) are increasingly considered as tutoring aids in science education. Yet their readiness for unsupervised use in undergraduate instruction remains uncertain, as reliable teaching requires more than fluent recall: it demands consistent, principle-grounded reasoning. Thermodynamics, with its compact laws and subtle distinctions between state and path functions, reversibility, and entropy, provides an ideal testbed for evaluating such capabilities. Here we present UTQA, a 50-item undergraduate thermodynamics question answering benchmark, covering ideal-gas processes, reversibility, and diagram interpretation. No leading 2025-era model exceeded our 95% competence threshold: the best LLMs achieved 82% accuracy, with text-only items performing better than image reasoning tasks, which often fell to chance levels. Prompt phrasing and syntactic complexity showed modest to little correlation with performance. The gap concentrates in finite-rate/irreversible scenarios and in binding visual features to thermodynamic meaning, indicating that current LLMs are not yet suitable for unsupervised tutoring in this domain.

[NLP-47] Quantum-Enhanced Natural Language Generation: A Multi-Model Framework with Hybrid Quantum-Classical Architectures

[Quick Read]: This paper addresses the lack of systematic comparisons of quantum-inspired architectures for text generation in natural language processing (NLP). It evaluates a traditional Transformer baseline against the quantum-inspired models QKSAN, QRWKV, and QASA on five diverse datasets, using perplexity, BLEU scores, vocabulary diversity, repetition rates, and fluency measures, to show where quantum models help and where they fall short.

Link: https://arxiv.org/abs/2508.21332
Authors: Chi-Sheng Chen, En-Jui Kuo
Affiliations: National Yang Ming Chiao Tung University
Categories: Quantum Physics (quant-ph); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This paper presents a comprehensive evaluation of quantum text generation models against traditional Transformer/MLP architectures, addressing the growing interest in quantum computing applications for natural language processing. We conduct systematic experiments comparing five distinct models: Transformer (baseline), Quantum Kernel Self-Attention Network (QKSAN), Quantum RWKV (QRWKV), and Quantum Attention Sequence Architecture (QASA) across five diverse datasets including simple sentences, short stories, quantum phrases, haiku poetry, and proverbs. Our evaluation employs multiple metrics including perplexity, BLEU scores, vocabulary diversity, repetition rates, and fluency measures to assess different aspects of text generation quality. The experimental results reveal that while traditional Transformer models maintain overall superiority with the lowest average perplexity (1.21) and highest BLEU-1 score (0.2895), quantum-inspired models demonstrate competitive performance in specific scenarios. Notably, QKSAN achieves a competitive BLEU-1 score of 0.2800 while maintaining zero repetition rates, and QRWKV demonstrates perfect vocabulary diversity (Distinct-1 = 1.000) in certain tasks.
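The abstract reports Distinct-1 and repetition rates without spelling out how they are computed. The following is a minimal Python sketch of these two metrics, not the authors' implementation. `distinct_n` follows the common definition (unique n-grams divided by total n-grams); `repetition_rate` and its bigram default are assumptions made here for illustration, since the paper's exact definition is not given.

```python
from collections import Counter


def distinct_n(tokens, n=1):
    """Distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def repetition_rate(tokens, n=2):
    """Share of n-grams that repeat an earlier n-gram (assumed definition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repetition.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)
```

Under these definitions, a generation with Distinct-1 = 1.000 never reuses a token, and a repetition rate of zero means no n-gram appears twice.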

Computer Vision

[CV-0] DriveQA: Passing the Driving Knowledge Test ICCV2025

[Quick Read]: This paper examines how well current large language models (LLMs) understand driving knowledge, in particular traffic rules, signage, right-of-way, and reasoning over complex scenarios. The authors present DriveQA, a text and vision benchmark that covers traffic regulations and scenarios comprehensively. Its challenging tasks, including numerical reasoning, complex right-of-way decisions, traffic sign variations, and changes in spatial layout, expose weaknesses of existing LLMs and multimodal LLMs (MLLMs) in specific areas of driving knowledge. Fine-tuning and pretraining on DriveQA improve results on real-world datasets such as nuScenes and BDD.

Link: https://arxiv.org/abs/2508.21824
Authors: Maolin Wei, Wanzhou Liu, Eshed Ohn-Bar
Affiliations: Boston University; Washington University in St. Louis
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025. Project page: this https URL

Abstract:If a Large Language Model (LLM) were to take a driving knowledge test today, would it pass? Beyond standard spatial and visual question-answering (QA) tasks on current autonomous driving benchmarks, driving knowledge tests require a complete understanding of all traffic rules, signage, and right-of-way principles. To pass this test, human drivers must discern various edge cases that rarely appear in real-world datasets. In this work, we present DriveQA, an extensive open-source text and vision-based benchmark that exhaustively covers traffic regulations and scenarios. Through our experiments using DriveQA, we show that (1) state-of-the-art LLMs and Multimodal LLMs (MLLMs) perform well on basic traffic rules but exhibit significant weaknesses in numerical reasoning and complex right-of-way scenarios, traffic sign variations, and spatial layouts, (2) fine-tuning on DriveQA improves accuracy across multiple categories, particularly in regulatory sign recognition and intersection decision-making, (3) controlled variations in DriveQA-V provide insights into model sensitivity to environmental factors such as lighting, perspective, distance, and weather conditions, and (4) pretraining on DriveQA enhances downstream driving task performance, leading to improved results on real-world datasets such as nuScenes and BDD, while also demonstrating that models can internalize text and synthetic traffic knowledge to generalize effectively across downstream QA tasks.

[CV-1] The Demon is in Ambiguity: Revisiting Situation Recognition with Single Positive Multi-Label Learning ICDM2025

[Quick Read]: This paper addresses a limitation of verb classification in situation recognition (SR): existing methods treat it as a single-label problem, ignoring that several verb categories may reasonably describe the same image, and so fail to capture the ambiguity of visual events. The key idea is to reformulate verb classification as single positive multi-label learning (SPMLL). The proposed model uses graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. The authors also build a multi-label evaluation benchmark for fair assessment in this setting.

Link: https://arxiv.org/abs/2508.21816
Authors: Yiming Lin, Yuchen Niu, Shang Wang, Kaizhu Huang, Qiufeng Wang, Xiao-Bo Jin
Affiliations: Xi’an Jiaotong-Liverpool University; Duke Kunshan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ICDM 2025

Abstract:Situation recognition (SR) is a fundamental task in computer vision that aims to extract structured semantic summaries from images by identifying key events and their associated entities. Specifically, given an input image, the model must first classify the main visual events (verb classification), then identify the participating entities and their semantic roles (semantic role labeling), and finally localize these entities in the image (semantic role localization). Existing methods treat verb classification as a single-label problem, but we show through a comprehensive analysis that this formulation fails to address the inherent ambiguity in visual event recognition, as multiple verb categories may reasonably describe the same image. This paper makes three key contributions: First, we reveal through empirical analysis that verb classification is inherently a multi-label problem due to the ubiquitous semantic overlap between verb categories. Second, given the impracticality of fully annotating large-scale datasets with multiple labels, we propose to reformulate verb classification as a single positive multi-label learning (SPMLL) problem - a novel perspective in SR research. Third, we construct a comprehensive multi-label evaluation benchmark for SR, carefully designed to evaluate model performance fairly in a multi-label setting. To address the challenges of SPMLL, we further develop the Graph Enhanced Verb Multilayer Perceptron (GE-VerbMLP), which combines graph neural networks to capture label correlations and adversarial training to optimize decision boundaries. Extensive experiments on real-world datasets show that our approach achieves more than 3% MAP improvement while remaining competitive on traditional top-1 and top-5 accuracy metrics.

[CV-2] VoCap: Video Object Captioning and Segmentation from Any Prompt

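[Quick Read]: This paper tackles joint fine-grained object localization and semantic description in video: given a prompt of various modalities (text, box, or mask), produce a spatio-temporal masklet and a corresponding object-centric caption. The main obstacles are the scarcity of annotated data and the need to handle video object segmentation, referring expression segmentation, and object captioning together. The solution has two parts. First, an existing large-scale video segmentation dataset (SAV) is annotated with pseudo object captions by preprocessing videos with their ground-truth masks and feeding them to a large vision-language model (VLM), yielding the SAV-Caption dataset. Second, a flexible promptable video model, VoCap, is trained on this data. The model reports state-of-the-art results on referring expression video object segmentation and establishes a benchmark for video object captioning.

Link: https://arxiv.org/abs/2508.21809
Authors: Jasper Uijlings, Xingyi Zhou, Xiuye Gu, Arsha Nagrani, Anurag Arnab, Alireza Fathi, David Ross, Cordelia Schmid
Affiliations: Google DeepMind
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: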

Abstract:Understanding objects in videos in terms of fine-grained localization masks and detailed semantic properties is a fundamental task in video understanding. In this paper, we propose VoCap, a flexible video model that consumes a video and a prompt of various modalities (text, box or mask), and produces a spatio-temporal masklet with a corresponding object-centric caption. As such, our model simultaneously addresses the tasks of promptable video object segmentation, referring expression segmentation, and object captioning. Since obtaining data for this task is tedious and expensive, we propose to annotate an existing large-scale segmentation dataset (SAV) with pseudo object captions. We do so by preprocessing videos with their ground-truth masks to highlight the object of interest and feeding the result to a large Vision Language Model (VLM). For an unbiased evaluation, we collect manual annotations on the validation set. We call the resulting dataset SAV-Caption. We train our VoCap model at scale on SAV-Caption together with a mix of other image and video datasets. Our model yields state-of-the-art results on referring expression video object segmentation, is competitive on semi-supervised video object segmentation, and establishes a benchmark for video object captioning. Our dataset will be made available at this https URL.

[CV-3] TMUAD: Enhancing Logical Capabilities in Unified Anomaly Detection Models with a Text Memory Bank

[Quick Read]: This paper addresses anomaly detection under limited normal data, and in particular the detection of both structural and logical anomalies in one model. Existing unified methods typically rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects. The key idea is a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD) with three complementary memory banks: (1) a class-level text memory bank, built with a logic-aware text extractor that captures rich logical descriptions of objects from input images; (2) an object-level image memory bank, built from features of segmented objects so that complete object contours are preserved; and (3) a patch-level image memory bank, built from visual-encoder patch features for structural anomaly detection. The banks produce anomaly scores at multiple levels, which are fused into a final score. TMUAD reports state-of-the-art results on seven public datasets from industrial and medical domains.

Link: https://arxiv.org/abs/2508.21795
Authors: Jiawei Liu, Jiahe Hou, Wei Wang, Jinsong Du, Yang Cong, Huijie Fan
Affiliations: Shenyang Institute of Automation, Chinese Academy of Sciences; Liaoning Liaohe Laboratory; Key Laboratory on Intelligent Detection and Equipment Technology; University of Chinese Academy of Sciences; State Key Laboratory of Robotics and Intelligent Systems; College of Automation Science and Engineering, South China University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Anomaly detection, which aims to identify anomalies deviating from normal patterns, is challenging due to the limited amount of normal data available. Unlike most existing unified methods that rely on carefully designed image feature extractors and memory banks to capture logical relationships between objects, we introduce a text memory bank to enhance the detection of logical anomalies. Specifically, we propose a Three-Memory framework for Unified structural and logical Anomaly Detection (TMUAD). First, we build a class-level text memory bank for logical anomaly detection by the proposed logic-aware text extractor, which can capture rich logical descriptions of objects from input images. Second, we construct an object-level image memory bank that preserves complete object contours by extracting features from segmented objects. Third, we employ visual encoders to extract patch-level image features for constructing a patch-level memory bank for structural anomaly detection. These three complementary memory banks are used to retrieve and compare normal images that are most similar to the query image, compute anomaly scores at multiple levels, and fuse them into a final anomaly score. By unifying structural and logical anomaly detection through collaborative memory banks, TMUAD achieves state-of-the-art performance across seven publicly available datasets involving industrial and medical domains. The model and code are available at this https URL.

[CV-4] Learning Unified Representations from Heterogeneous Data for Robust Heart Rate Modeling

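[Quick Read]: This paper addresses the performance drop of heart rate prediction in real-world deployment caused by data heterogeneity, which comprises source heterogeneity (differing feature sets across a fragmented device market) and user heterogeneity (distinct physiological patterns across individuals and activities). Existing methods either discard device-specific information or fail to model user-specific differences. The key idea is a framework that learns latent representations agnostic to both kinds of heterogeneity: a random feature dropout strategy makes the model robust to varying feature sets, a time-aware attention module captures long-term physiological traits, and a contrastive learning objective builds a discriminative representation space, so that downstream predictors behave consistently under heterogeneous data.

Link: https://arxiv.org/abs/2508.21785
Authors: Peng Yang, Zhengdong Huang, Zicheng Xie, Wentao Tian, Jingyu Liu, Lunhong Dong
Affiliations: Southern University of Science and Technology; Guangdong Provincial Key Laboratory of Brain-Inspired Intelligent Computation
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: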

Abstract:Heart rate prediction is vital for personalized health monitoring and fitness, but it frequently faces a critical challenge when deployed in the real world: data heterogeneity. We characterize it along two key dimensions: source heterogeneity from fragmented device markets with varying feature sets, and user heterogeneity reflecting distinct physiological patterns across individuals and activities. Existing methods either discard device-specific information, or fail to model user-specific differences, limiting their real-world performance. To address this, we propose a framework that learns latent representations agnostic to both types of heterogeneity, enabling downstream predictors to work consistently under heterogeneous data patterns. Specifically, we introduce a random feature dropout strategy to handle source heterogeneity, making the model robust to various feature sets. To manage user heterogeneity, we employ a time-aware attention module to capture long-term physiological traits and use a contrastive learning objective to build a discriminative representation space. To reflect the heterogeneous nature of real-world data, we created and publicly released a new benchmark dataset, ParroTao. Evaluations on both ParroTao and the public FitRec dataset show that our model significantly outperforms existing baselines by 17% and 15%, respectively. Furthermore, analysis of the learned representations demonstrates their strong discriminative power, and a downstream application task confirms the practical value of our model.
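The random feature dropout idea in the abstract can be illustrated with a small Python sketch. This is one plausible reading, not the paper's implementation: the function name `random_feature_dropout`, the `drop_prob` and `mask_value` parameters, and the rule of always keeping at least one feature are assumptions introduced here.

```python
import random


def random_feature_dropout(sample, drop_prob=0.3, rng=None, mask_value=0.0):
    """Randomly hide features of one sample to mimic devices with fewer sensors.

    Returns (masked_values, mask), where mask[i] is 1 if feature i was kept.
    """
    if not sample:
        return [], []
    rng = rng or random.Random()
    mask = [0 if rng.random() < drop_prob else 1 for _ in sample]
    if not any(mask):
        # Assumed safeguard: never drop every feature of a sample.
        mask[rng.randrange(len(sample))] = 1
    masked = [v if m else mask_value for v, m in zip(sample, mask)]
    return masked, mask
```

Passing the mask along with the masked values lets a downstream model tell a missing feature from a genuine zero reading.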

[CV-5] Benchmarking GPT-5 in Radiation Oncology: Measurable Gains but Persistent Need for Expert Oversight

[Quick Read]: This paper evaluates generative AI for clinical decision support in radiation oncology, focusing on the accuracy and clinical applicability of the new LLM GPT-5. The key is a two-track evaluation: (i) a standardized multiple-choice test, the ACR Radiation Oncology In-Training Examination (TXIT), and (ii) treatment-plan generation for authentic clinical vignettes, rated by four board-certified radiation oncologists for correctness, comprehensiveness, and hallucinations. GPT-5 reaches 92.8% accuracy on TXIT, ahead of GPT-4 and GPT-3.5. On the vignettes its recommendations are rated highly, but errors cluster in complex scenarios that require precise trial knowledge or detailed clinical adaptation, so expert oversight remains necessary before clinical use even though hallucinations are rare.

Link: https://arxiv.org/abs/2508.21777
Authors: Ugur Dinc, Jibak Sarkar, Philipp Schubert, Sabine Semrau, Thomas Weissmann, Andre Karius, Johann Brand, Bernd-Niklas Axer, Ahmed Gomaa, Pluvio Stephan, Ishita Sheth, Sogand Beirami, Annette Schwarz, Udo Gaipl, Benjamin Frey, Christoph Bert, Stefanie Corradini, Rainer Fietkau, Florian Putz
Affiliations: University of Erlangen-Nuremberg; German Cancer Research Center; University Hospital Erlangen
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Under review in Frontiers in Artificial Intelligence

Abstract:Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncology vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss’ κ. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5’s treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss’ κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
Cite as: arXiv:2508.21777 [cs.CV]. Submitted (v1): Fri, 29 Aug 2025 16:55:25 UTC, by Florian Putz.
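The GPT-5 radiation oncology study quantifies inter-rater reliability with Fleiss' κ. For readers unfamiliar with the statistic, here is a minimal Python sketch of the standard formula, κ = (P̄ − P̄ₑ) / (1 − P̄ₑ). The input layout (`counts[i][j]` = number of raters who put item i in category j) and the handling of the degenerate single-category case are assumptions made here; the study's actual rating data is not reproduced.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j (same number of raters per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall category proportions.
    p_cats = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        # Degenerate case (all ratings in one category); returning 1.0 is an
        # assumed convention, the statistic is undefined here.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)
```

A value near zero, such as the 0.083 reported for correctness, means the raters agreed only slightly more often than chance would predict given how often each rating was used.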

[CV-6] A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI

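[Quick Read]: This paper addresses automated segmentation of pancreatic ductal adenocarcinoma (PDAC) in MRI, a task made difficult by poor tumor-tissue contrast and scarce annotated data. The key is a multi-stage cascaded pre-training strategy built on the nnU-Net framework: starting from a general anatomical foundation model, the network is fine-tuned sequentially on CT pancreatic lesion datasets and then on the target MRI modalities. Five-fold cross-validation over data augmentation schemes and training schedules shows that aggressive augmentation gives the highest volumetric accuracy, while default augmentation gives better boundary precision (MASD of 5.46 mm and HD95 of 17.33 mm on Task 1). The final submission uses metric-aware heterogeneous ensembles of specialist models and reaches top cross-validation tumor Dice scores of 0.661 (Task 1) and 0.523 (Task 2).

Link: https://arxiv.org/abs/2508.21775
Authors: Omer Faruk Durugol, Maximilian Rokuss, Yannick Kirchhoff, Klaus H. Maier-Hein
Affiliations: German Cancer Research Center
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 1 figure, PANTHER Challenge submission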

Abstract:Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
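The Dice score used to report the segmentation results above, and the general idea of combining several models, can be sketched in a few lines of Python. This is an illustration only: masks are flattened to plain lists, the empty-mask convention is an assumption, and `ensemble_average` shows simple probability averaging, which is a stand-in and not necessarily how the authors combine their specialist models.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2 * intersection / total


def ensemble_average(prob_maps, threshold=0.5):
    """Average per-voxel tumor probabilities from several models and
    threshold the mean to obtain one binary mask (hypothetical combiner)."""
    n_models = len(prob_maps)
    means = [sum(voxel) / n_models for voxel in zip(*prob_maps)]
    return [1 if m >= threshold else 0 for m in means]
```

For example, a prediction that covers two voxels, one of which overlaps a one-voxel ground truth, scores 2·1/(2+1) ≈ 0.67, which is in the range of the tumor Dice values quoted in the abstract.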

[CV-7] Unsupervised Video Continual Learning via Non-Parametric Deep Embedded Clustering BMVC2025

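[Quick Read]: This paper addresses unsupervised video continual learning (uVCL), where a model learns a succession of video tasks without task boundaries or labels. Videos have complex spatio-temporal structure, and prior methods depend on labels and task boundaries, which limits their use in practice. The key idea is to use kernel density estimation (KDE) over deep embedded features as a non-parametric probabilistic representation, combined with a novelty detection criterion that dynamically expands memory clusters. This lets the model recognize new task data without supervision and transfer knowledge from previous tasks, substantially improving performance when learning many tasks in succession.

Link: https://arxiv.org/abs/2508.21773
Authors: Nattapong Kurpukdee, Adrian G. Bors
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted to The 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK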

Abstract:We propose a realistic scenario for unsupervised video learning where neither task boundaries nor labels are provided when learning a succession of tasks. We also provide a non-parametric learning solution for the under-explored problem of unsupervised video continual learning. Videos are a complex and rich form of spatio-temporal media, widely used in many applications, but they have not been sufficiently explored in unsupervised continual learning. Prior studies have only focused on supervised continual learning, relying on the knowledge of labels and task boundaries, while having labeled data is costly and not practical. To address this gap, we study unsupervised video continual learning (uVCL). uVCL raises more challenges due to the additional computational and memory requirements of processing videos when compared to images. We introduce a general benchmark experimental protocol for uVCL by considering the learning of unstructured video data categories during each task. We propose to use the Kernel Density Estimation (KDE) of deep embedded video features extracted by unsupervised video transformer networks as a non-parametric probabilistic representation of the data. We introduce a novelty detection criterion for the incoming new task data, dynamically enabling the expansion of memory clusters, aiming to capture new knowledge when learning a succession of tasks. We leverage transfer learning from the previous tasks as an initial state for the knowledge transfer to the current learning task. We found that the proposed methodology substantially enhances the performance of the model when successively learning many tasks. We perform in-depth evaluations on three standard video action recognition datasets, including UCF101, HMDB51, and Something-Something V2, without using any labels or class boundaries.
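The combination of KDE and novelty detection described in the abstract can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the authors' method: embeddings are plain lists of floats, the kernel is an isotropic Gaussian, and the names `gaussian_kde_density` and `assign_or_expand` as well as the `bandwidth` and `threshold` values are hypothetical.

```python
import math


def gaussian_kde_density(x, points, bandwidth):
    """Kernel density estimate at x from the stored points of one cluster,
    using an isotropic Gaussian kernel."""
    dim = len(x)
    norm = (2 * math.pi * bandwidth ** 2) ** (dim / 2)
    total = 0.0
    for p in points:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, p))
        total += math.exp(-sq_dist / (2 * bandwidth ** 2))
    return total / (len(points) * norm)


def assign_or_expand(x, clusters, bandwidth=1.0, threshold=1e-3):
    """Assign embedding x to the cluster with the highest density, or open a
    new memory cluster when every density is below the novelty threshold.
    Mutates `clusters` and returns the index of the cluster that received x."""
    best_idx, best_density = -1, 0.0
    for idx, points in enumerate(clusters):
        density = gaussian_kde_density(x, points, bandwidth)
        if density > best_density:
            best_idx, best_density = idx, density
    if best_idx == -1 or best_density < threshold:
        clusters.append([x])  # novelty detected: expand the memory
        return len(clusters) - 1
    clusters[best_idx].append(x)
    return best_idx
```

An embedding close to an existing cluster is absorbed by it, while one far from all stored features falls below the threshold and starts a new cluster, which is how memory can grow without task labels.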

[CV-8] What Can We Learn from Harry Potter? An Exploratory Study of Visual Representation Learning from Atypical Videos

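[Quick Read]: This paper studies whether rare or atypical video data, which open-world visual representation learning has largely left unused, can improve generalization on open-world tasks. The authors collect a new dataset of atypical videos of various types (e.g., sci-fi, animation) and feed these samples directly into model training for representation learning. Even straightforward training approaches with such data improve out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR). Greater categorical diversity of the atypical samples further helps OOD detection, and a smaller but more semantically diverse atypical set outperforms a larger but more typical one on NCD; in ZSAR, the semantic diversity helps generalization to unseen action classes.

Link: https://arxiv.org/abs/2508.21770
Authors: Qiyue Sun, Qiming Huang, Yang Yang, Hongjun Wang, Jianbo Jiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: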

Abstract:Humans usually show exceptional generalisation and discovery ability in the open world when shown uncommon new concepts. While most existing studies in the literature focus on common typical data from closed sets, open-world novel discovery is under-explored in videos. In this paper, we are interested in asking: What if atypical unusual videos are exposed in the learning process? To this end, we collect a new video dataset consisting of various types of unusual atypical data (e.g., sci-fi, animation, etc.). To study how such atypical data may benefit open-world learning, we feed them into the model training process for representation learning. Focusing on three key tasks in open-world learning: out-of-distribution (OOD) detection, novel category discovery (NCD), and zero-shot action recognition (ZSAR), we found that even straightforward learning approaches with atypical data consistently improve performance across various settings. Furthermore, we found that increasing the categorical diversity of the atypical samples further boosts OOD detection performance. Additionally, in the NCD task, using a smaller yet more semantically diverse set of atypical samples leads to better performance compared to using a larger but more typical dataset. In the ZSAR setting, the semantic diversity of atypical videos helps the model generalise better to unseen action classes. These observations in our extensive experimental evaluations reveal the benefits of atypical videos for visual representation learning in the open world, together with the newly proposed dataset, encouraging further studies in this direction.

[CV-9] Domain Generalization in-the-Wild: Disentangling Classification from Domain-Aware Representations

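[Quick Read]: This paper addresses the difficulty of evaluating domain generalization (DG) for foundation models such as CLIP: web-scale pretraining data may cover existing benchmarks, so current evaluations may be neither challenging enough nor a true test of unseen data. To assess CLIP on DG in-the-wild, the authors (1) fine-tune CLIP on ImageNet and evaluate on 33 diverse datasets with quantified out-of-distribution (OOD) scores, and (2) use unlearning to make CLIP forget some domains as an approximation of unseen data. CLIP's performance deteriorates significantly on more OOD datasets. The proposed CLIP-DCA (Disentangling Classification from enhanced domain Aware representations) departs from standard domain-invariance losses, which force the model to discard domain-related features, and instead treats enhanced domain awareness as a prerequisite for effective domain-invariant classification. It identifies and strengthens domain-aware representations in CLIP's encoders using a separate domain head and synthetically generated diverse domain data, while disentangling the classifier from the domain features.

Link: https://arxiv.org/abs/2508.21769
Authors: Ha Min Son, Zhe Zhao, Shahbaz Rezaei, Xin Liu
Affiliations: University of California, Davis
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: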

Abstract:Evaluating domain generalization (DG) for foundational models like CLIP is challenging, as web-scale pretraining data potentially covers many existing benchmarks. Consequently, current DG evaluation may neither be sufficiently challenging nor adequately test genuinely unseen data scenarios. To better assess the performance of CLIP on DG in-the-wild, a scenario where CLIP encounters challenging unseen data, we consider two approaches: (1) evaluating on 33 diverse datasets with quantified out-of-distribution (OOD) scores after fine-tuning CLIP on ImageNet, and (2) using unlearning to make CLIP 'forget' some domains as an approximation. We observe that CLIP’s performance deteriorates significantly on more OOD datasets. To address this, we present CLIP-DCA (Disentangling Classification from enhanced domain Aware representations). Our approach is motivated by the observation that while standard domain invariance losses aim to make representations domain-invariant, this can be harmful to foundation models by forcing the discarding of domain-aware representations beneficial for generalization. We instead hypothesize that enhancing domain awareness is a prerequisite for effective domain-invariant classification in foundation models. CLIP-DCA identifies and enhances domain awareness within CLIP’s encoders using a separate domain head and synthetically generated diverse domain data. Simultaneously, it encourages domain-invariant classification through disentanglement from the domain features. CLIP-DCA shows significant improvements within this challenging evaluation compared to existing methods, particularly on datasets that are more OOD.

[CV-10] UItron: Foundational GUI Agent with Advanced Perception and Planning

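[Quick Read]: This paper addresses key bottlenecks in building GUI agents: scarce operation trajectories, missing interactive infrastructure, and limited initial capabilities of foundation models in GUI scenarios. The solution is UItron, an open-source foundational model supported by systematic data engineering and an interactive environment connecting mobile and PC devices. Training uses supervised fine-tuning on perception and planning tasks, followed by a curriculum reinforcement learning framework that enables complex reasoning and exploration in online environments. Because existing solutions generally lack support for Chinese mobile apps, the team manually collected over one million steps of operation trajectories across the top 100 most popular apps and built offline and online evaluation environments, reporting significant progress in Chinese app scenarios.

Link: https://arxiv.org/abs/2508.21767
Authors: Zhixiong Zeng, Jing Huang, Liming Zheng, Wenkang Han, Yufeng Zhong, Lei Chen, Longrong Yang, Yingjie Chu, Yuzhi He, Lin Ma
Affiliations: Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 24 pages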

Abstract:GUI agents aim to enable automated operations on Mobile/PC devices, which is an important task toward achieving artificial general intelligence. The rapid advancement of VLMs accelerates the development of GUI agents, owing to their powerful capabilities in visual understanding and task planning. However, building a GUI agent remains a challenging task due to the scarcity of operation trajectories, the limited availability of interactive infrastructure, and the limitation of initial capabilities in foundation models. In this work, we introduce UItron, an open-source foundational model for automatic GUI agents, featuring advanced GUI perception, grounding, and planning capabilities. UItron highlights the necessity of systemic data engineering and interactive infrastructure as foundational components for advancing GUI agent development. It not only systematically studies a series of data engineering strategies to enhance training effects, but also establishes an interactive environment connecting both Mobile and PC devices. In training, UItron adopts supervised finetuning over perception and planning tasks in various GUI scenarios, and then develops a curriculum reinforcement learning framework to enable complex reasoning and exploration for online environments. As a result, UItron achieves superior performance in benchmarks of GUI perception, grounding, and planning. In particular, UItron highlights the interaction proficiency with top-tier Chinese mobile APPs, as we identified a general lack of Chinese capabilities even in state-of-the-art solutions. To this end, we manually collect over one million steps of operation trajectories across the top 100 most popular apps, and build the offline and online agent evaluation environments. Experimental results demonstrate that UItron achieves significant progress in Chinese app scenarios, propelling GUI agents one step closer to real-world application.

[CV-11] Learning from Silence and Noise for Visual Sound Source Localization

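[Quick Read]: This paper addresses two shortcomings in visual sound source localization: current methods perform poorly when audio-visual semantic correspondence is low, as with silence, noise, and offscreen sounds (negative audio), and existing evaluations are limited to positive cases with a single visible sound source. The contributions are: a new self-supervised training strategy that incorporates silence and noise, improving performance in positive cases while being more robust to negative sounds, with the resulting SSL-SaN model reaching state-of-the-art results among self-supervised models in sound localization and cross-modal retrieval; a new metric quantifying the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs; and IS3+, an extended version of the IS3 synthetic dataset that includes negative audio.

Link: https://arxiv.org/abs/2508.21761
Authors: Xavier Juanola, Giovana Morais, Magdalena Fuentes, Gloria Haro
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: 10 pages, 2 figures, 4 tables + Supplementary Material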

Abstract:Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics convey scenarios with a single visible sound source in the scene. To address this, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases, while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available at this https URL.

[CV-12] From Drone Imagery to Livability Mapping: AI-powered Environment Perception in Rural China

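Quick read: This paper addresses the limitations of current rural livability assessment methods: traditional questionnaire surveys are hard to scale, while assessment methods based on urban visual perception do not suit rural scenes. The key to its solution is a rural livability assessment framework built on drone imagery and Multimodal Large Language Models (MLLMs). First, it takes a top-down approach to collect large-scale drone imagery of 1,766 villages in 146 counties across China. Second, it designs an efficient image comparison mechanism that uses a binary search interpolation strategy to reduce the number of comparison iterations. Third, it builds on expert knowledge to construct chain-of-thought prompting suited to nationwide use, covering both living quality and ecological habitability, which improves the rationality and reliability of the assessment.

Link: https://arxiv.org/abs/2508.21738
Authors: Weihuan Deng, Yaofu Huang, Luan Chen, Xun Li, Yao Yao
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Computer Vision and Pattern Recognition (cs.CV)
Comments: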

Abstract:With the deepening of poverty alleviation and rural revitalization strategies, improving the rural living environment and enhancing the quality of life have become key priorities. Rural livability is a key indicator for measuring the effectiveness of these efforts. Current measurement approaches face significant limitations, as questionnaire-based methods are difficult to scale, while urban-oriented visual perception methods are poorly suited for rural contexts. In this paper, a rural-specific livability assessment framework was proposed based on drone imagery and multimodal large language models (MLLMs). To comprehensively assess village livability, this study first used a top-down approach to collect large-scale drone imagery of 1,766 villages in 146 counties across China. In terms of the model framework, an efficient image comparison mechanism was developed, incorporating binary search interpolation to determine effective image pairs while reducing comparison iterations. Building on expert knowledge, a chain-of-thought prompting suitable for nationwide rural livability measurement was constructed, considering both living quality and ecological habitability dimensions. This approach enhanced the rationality and reliability of the livability assessment. Finally, this study characterized the spatial heterogeneity of rural livability across China and thoroughly analyzed its influential factors. The results show that: (1) The rural livability in China demonstrates a dual-core-periphery spatial pattern, radiating outward from Sichuan and Zhejiang provinces with declining gradients; (2) Among various influential factors, government fiscal expenditure emerged as the core determinant, with each unit increase corresponding to a 3.9 - 4.9 unit enhancement in livability. The findings provide valuable insights for rural construction policy-making.
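
To illustrate why a binary search can reduce the number of pairwise image comparisons mentioned in this abstract, here is a minimal sketch that inserts each item into an already ranked list by binary search. The `prefer` callback is a hypothetical stand-in for an MLLM judgment between two village images, and the toy scores are invented; the paper's actual mechanism, including its interpolation step, may differ.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

def binary_insert_rank(items: List[T], prefer: Callable[[T, T], bool]) -> Tuple[List[T], int]:
    """Rank items best-first; prefer(a, b) is True when a beats b.

    Returns the ranking and the number of comparator calls used.
    """
    ranking: List[T] = []
    calls = 0
    for item in items:
        lo, hi = 0, len(ranking)
        while lo < hi:
            mid = (lo + hi) // 2
            calls += 1
            if prefer(item, ranking[mid]):
                hi = mid          # item belongs above ranking[mid]
            else:
                lo = mid + 1      # item belongs below ranking[mid]
        ranking.insert(lo, item)
    return ranking, calls

# Toy comparator: a hidden numeric score stands in for the model's judgment.
scores = {"village_a": 0.2, "village_b": 0.9, "village_c": 0.5, "village_d": 0.7}
ranking, calls = binary_insert_rank(list(scores), lambda a, b: scores[a] > scores[b])
```

With four items this uses 5 comparisons instead of the 6 needed to compare every pair, and the gap widens as the list grows, since each insertion costs about log2(n) comparisons.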

[CV-13] CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models

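Quick read: This paper addresses a performance bottleneck of Large Vision-Language Models (LVLMs) when reading Digital Measurement Devices (DMDs) in real-world scenes, especially under challenging conditions such as cluttered backgrounds, occlusion, extreme viewpoints, and motion blur. The key to its solution is CAD2DMD-SET, a synthetic data generation tool that uses 3D CAD models, advanced rendering, and high-fidelity image composition to produce diverse, VQA-labelled synthetic DMD datasets for fine-tuning LVLMs. The authors also build DMDBench, a validation set of 1,000 real-world images, to evaluate models under practical constraints. Experiments show that LoRA fine-tuning on data generated by CAD2DMD-SET substantially improves ANLS (for example, a 200% increase for InternVL) without degrading performance on other tasks, which indicates that the approach improves the robustness and accuracy of LVLMs in complex environments.

Link: https://arxiv.org/abs/2508.21732
Authors: João Valente, Atabak Dehban, Rodrigo Ventura
Affiliations: Institute for Systems and Robotics; University of Lisbon
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: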

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur; common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA’s of these models with CAD2DMD-SET’s generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
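
For readers unfamiliar with the ANLS metric named in this abstract, the sketch below follows its commonly used definition: one minus the length-normalised edit distance, set to zero once the normalised distance reaches a threshold of 0.5, then averaged over questions. The lowercasing and whitespace stripping, the threshold default, and the example readings are assumptions for illustration; the paper's evaluation code may normalise answers differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average Normalised Levenshtein Similarity over a list of questions.

    references[i] is a list of acceptable answers for predictions[i].
    """
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            similarity = 1.0 - nl if nl < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)

# Invented device readings: exact match, one misread character, wrong value.
score = anls(["23.5", "1O1.2", "7"], [["23.5"], ["101.2"], ["42.0"]])
```

The three toy answers score 1.0, 0.8, and 0.0, so `score` is about 0.6. The threshold is what keeps a badly wrong reading from earning partial credit.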

[CV-14] Entropy-Based Non-Invasive Reliability Monitoring of Convolutional Neural Networks

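Quick read: This paper addresses the vulnerability of Convolutional Neural Networks (CNNs) to adversarial perturbations: models are highly sensitive to small input changes that are imperceptible to humans and that cause misclassification. Existing detection methods usually require retraining, changes to the network architecture, or a loss of performance on clean samples. The key idea is to use entropy changes in CNN activations as an immediately detectable signature of adversarial perturbation, which allows accurate detection without any model modification. The study finds that in the early convolutional layers of VGG-16, adversarial samples shift activation entropy by about 7%, and that the entropy distributions of clean and adversarial inputs are completely separated, which yields 90% detection accuracy while leaving the original model's performance unchanged. The method offers a theoretical basis and a practical path toward real-time, self-diagnostic vision systems.

Link: https://arxiv.org/abs/2508.21715
Authors: Amirhossein Nazeri, Wael Hafez
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Information Theory (cs.IT); Image and Video Processing (eess.IV)
Comments: 8 pages, 3 figures, 2 tables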

Abstract:Convolutional Neural Networks (CNNs) have become the foundation of modern computer vision, achieving unprecedented accuracy across diverse image recognition tasks. While these networks excel on in-distribution data, they remain vulnerable to adversarial perturbations: imperceptible input modifications that cause misclassification with high confidence. However, existing detection methods either require expensive retraining, modify network architecture, or degrade performance on clean inputs. Here we show that adversarial perturbations create immediate, detectable entropy signatures in CNN activations that can be monitored without any model modification. Using parallel entropy monitoring on VGG-16, we demonstrate that adversarial inputs consistently shift activation entropy by 7% in early convolutional layers, enabling 90% detection accuracy with false positive and false negative rates below 20%. The complete separation between clean and adversarial entropy distributions reveals that CNNs inherently encode distribution shifts in their activation patterns. This work establishes that CNN reliability can be assessed through activation entropy alone, enabling practical deployment of self-diagnostic vision systems that detect adversarial inputs in real-time without compromising original model performance.
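
One plausible way to monitor activation entropy without touching the model is to histogram a layer's activations and compare the Shannon entropy against a clean baseline. The sketch below assumes exactly that; the bin count, value range, 5% threshold, and toy activation values are all hypothetical, and the paper's entropy estimator may be defined differently.

```python
import math

def activation_entropy(values, bins=16, lo=0.0, hi=1.0):
    """Shannon entropy in bits of a fixed-range histogram of activations."""
    if not values:
        return 0.0
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        clipped = min(max(v, lo), hi)                      # keep values inside [lo, hi]
        idx = min(int((clipped - lo) / width), bins - 1)   # top edge falls in last bin
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def entropy_shift(entropy, baseline):
    """Relative change of entropy against a clean-input baseline."""
    return abs(entropy - baseline) / baseline

# Toy activations: the perturbed input spreads mass over more histogram bins.
clean = [0.0] * 8 + [0.5] * 8
perturbed = [0.0] * 4 + [0.25] * 4 + [0.5] * 4 + [0.75] * 4
baseline = activation_entropy(clean)
shift = entropy_shift(activation_entropy(perturbed), baseline)
flagged = shift > 0.05  # hypothetical alarm threshold
```

The toy shift here is far larger than the roughly 7% reported in the abstract; the point is only that the monitor reads activations and never modifies weights or outputs.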

[CV-15] FLORA: Efficient Synthetic Data Generation for Object Detection in Low-Data Regimes via finetuning Flux LoRA

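Quick read: This paper addresses the high compute cost and low data efficiency of current diffusion-based synthetic data generation for object detection. Existing methods usually rely on full fine-tuning of large diffusion models, which requires enterprise-grade GPUs (such as the NVIDIA V100) and thousands of synthetic images, and is hard to adopt in practice. The key to the solution is a lightweight synthetic data generation pipeline, Flux LoRA Augmentation (FLORA), which fine-tunes the Flux 1.1 Dev diffusion model only through Low-Rank Adaptation (LoRA). This sharply reduces compute requirements and lets the pipeline run on a consumer-grade GPU (such as the NVIDIA RTX 4090). Experiments show that 500 synthetic images are enough to outperform the ODGEN baseline trained with 5,000 images, with gains of up to 21.3% in mAP@.50:.95, which suggests that optimizing for quality and efficiency together beats simply adding data and compute.

Link: https://arxiv.org/abs/2508.21712
Authors: Alvaro Patricio, Atabak Dehban, Rodrigo Ventura
Affiliations: Institute for Systems and Robotics, University of Lisbon, Portugal
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: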

Abstract:Recent advances in diffusion-based generative models have demonstrated significant potential in augmenting scarce datasets for object detection tasks. Nevertheless, most recent models rely on resource-intensive full fine-tuning of large-scale diffusion models, requiring enterprise-grade GPUs (e.g., NVIDIA V100) and thousands of synthetic images. To address these limitations, we propose Flux LoRA Augmentation (FLORA), a lightweight synthetic data generation pipeline. Our approach uses the Flux 1.1 Dev diffusion model, fine-tuned exclusively through Low-Rank Adaptation (LoRA). This dramatically reduces computational requirements, enabling synthetic dataset generation with a consumer-grade GPU (e.g., NVIDIA RTX 4090). We empirically evaluate our approach on seven diverse object detection datasets. Our results demonstrate that training object detectors with just 500 synthetic images generated by our approach yields superior detection performance compared to models trained on 5000 synthetic images from the ODGEN baseline, achieving improvements of up to 21.3% in mAP@.50:.95. This work demonstrates that it is possible to surpass state-of-the-art performance with far greater efficiency, as FLORA achieves superior results using only 10% of the data and a fraction of the computational cost. This work demonstrates that a quality and efficiency-focused approach is more effective than brute-force generation, making advanced synthetic data creation more practical and accessible for real-world scenarios.
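
The compute savings attributed to LoRA come from training two small matrices instead of a full weight matrix. The sketch below shows the standard LoRA update, W + (alpha / r) * B A, on plain Python lists, together with a parameter count. The layer sizes, rank, and alpha are illustrative assumptions and are not taken from FLORA or Flux.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A under LoRA.

    w has shape (d, k), b has shape (d, r), a has shape (r, k).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(wrow, drow)]
            for wrow, drow in zip(w, delta)]

def trainable_params(d, k, r):
    """Parameter counts for full fine-tuning versus a rank-r adapter."""
    return {"full": d * k, "lora": r * (d + k)}

# B starts at zero, so the adapted layer initially equals the frozen one.
w = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
a = [[0.1, 0.2, 0.3]]
b = [[0.0], [0.0]]
initial = lora_weight(w, a, b, alpha=8)
counts = trainable_params(d=4096, k=4096, r=16)
```

For a hypothetical 4096 x 4096 layer at rank 16, the adapter trains 131,072 parameters against 16,777,216 for the full matrix, a 128-fold reduction for that layer.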

[CV-16] Activation Subspaces for Out-of-Distribution Detection

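Quick read: This paper addresses the reliability of deep models when they meet out-of-distribution (OOD) samples in real-world applications, that is, how to separate samples from the training distribution (in-distribution, ID) from OOD samples. The key to the solution is to apply singular value decomposition (SVD) to the weight matrix of the classification head and split the model's activations into two subspaces: a decisive one, which contributes most to the final classifier output, and an insignificant one, which contributes least. The study finds that under large distribution shifts (Far-OOD), the insignificant subspace separates ID from OOD data more effectively, because the classification objective leaves it largely unaffected and its features are therefore more robust. Under small distribution shifts (Near-OOD), keeping only the decisive subspace avoids interference and improves detection. Combining the two findings, the authors propose a unified method, ActSub, which achieves state-of-the-art results on several standard OOD benchmarks.

Link: https://arxiv.org/abs/2508.21695
Authors: Barış Zöngür, Robin Hesse, Stefan Roth
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: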

Abstract:To ensure the reliability of deep models in real-world applications, out-of-distribution (OOD) detection methods aim to distinguish samples close to the training distribution (in-distribution, ID) from those farther away (OOD). In this work, we propose a novel OOD detection method that utilizes singular value decomposition of the weight matrix of the classification head to decompose the model’s activations into decisive and insignificant components, which contribute maximally and minimally, respectively, to the final classifier output. We find that the subspace of insignificant components more effectively distinguishes ID from OOD data than raw activations in regimes of large distribution shifts (Far-OOD). This occurs because the classification objective leaves the insignificant subspace largely unaffected, yielding features that are "untainted" by the target classification task. Conversely, in regimes of smaller distribution shifts (Near-OOD), we find that activation shaping methods profit from only considering the decisive subspace, as the insignificant component can cause interference in the activation space. By combining the two findings into a single approach, termed ActSub, we achieve state-of-the-art results in various standard OOD benchmarks.
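
One way to write down the split described in this abstract is to project the activation onto the leading right singular vectors of the classifier weight matrix and onto their orthogonal complement. The notation and the cut-off rank $k$ below are assumptions for illustration; the paper may choose the split and score each part differently.

```latex
\begin{align}
W &= U \Sigma V^{\top}, \qquad W \in \mathbb{R}^{C \times D},\quad a \in \mathbb{R}^{D} \\
a_{\mathrm{dec}} &= V_k V_k^{\top} a, \qquad a_{\mathrm{ins}} = \left(I - V_k V_k^{\top}\right) a \\
W a &= W a_{\mathrm{dec}} + W a_{\mathrm{ins}}, \qquad
\lVert W a_{\mathrm{ins}} \rVert_2 \le \sigma_{k+1} \lVert a_{\mathrm{ins}} \rVert_2
\end{align}
```

Here $V_k$ holds the right singular vectors for the $k$ largest singular values. The bound in the last line is why the second component is "insignificant" for the logits: whatever it contains is scaled by at most $\sigma_{k+1}$, so the classification loss puts little pressure on it.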

[CV-17] Mapping like a Skeptic: Probabilistic BEV Projection for Online HD Mapping BMVC2025

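Quick read: This paper addresses the accuracy of the mapping from image space to Bird's Eye View (BEV) space in High-Definition (HD) map construction. Existing methods rely on standard mapping techniques (such as attention-based ones), which generalize poorly and often hallucinate non-existent road elements. The key to the solution is to start from a geometric mapping based on camera parameters and to introduce a novel probabilistic projection mechanism whose confidence scores serve two goals: (i) refining the mapping to better fit the scene, and (ii) filtering out irrelevant elements that should not influence HD map generation. The confidence scores are also used to selectively accumulate reliable information over time, which improves temporal processing.

Link: https://arxiv.org/abs/2508.21689
Authors: Fatih Erdoğan, Merve Rabia Barın, Fatma Güney
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: BMVC 2025. GitHub: this https URL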

Abstract:Constructing high-definition (HD) maps from sensory input requires accurately mapping the road elements in image space to the Bird’s Eye View (BEV) space. The precision of this mapping directly impacts the quality of the final vectorized HD map. Existing HD mapping approaches outsource the projection to standard mapping techniques, such as attention-based ones. However, these methods struggle with accuracy due to generalization problems, often hallucinating non-existent road elements. Our key idea is to start with a geometric mapping based on camera parameters and adapt it to the scene to extract relevant map information from camera images. To implement this, we propose a novel probabilistic projection mechanism with confidence scores to (i) refine the mapping to better align with the scene and (ii) filter out irrelevant elements that should not influence HD map generation. In addition, we improve temporal processing by using confidence scores to selectively accumulate reliable information over time. Experiments on new splits of the nuScenes and Argoverse2 datasets demonstrate improved performance over state-of-the-art approaches, indicating better generalization. The improvements are particularly pronounced on nuScenes and in the challenging long perception range. Our code and model checkpoints are available at this https URL .

[CV-18] Towards Interactive Lesion Segmentation in Whole-Body PET/CT with Promptable Models

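Quick read: This paper addresses the difficulty of accurate lesion segmentation in whole-body PET/CT, in particular how to achieve efficient and precise interactive segmentation despite tracer heterogeneity, physiological uptake, and multi-center variability. The key to the solution is an interactive segmentation framework that responds to user prompts: user-provided foreground and background clicks are encoded as additional input channels, and representing the spatial prompts with the Euclidean Distance Transform (EDT) clearly outperforms the conventional Gaussian kernel encoding. Online simulation of user interactions and a custom point sampling strategy further improve robustness under realistic interaction. The final EDT-based ensemble reduces both false positives and false negatives in cross-validation, which supports the potential of the method for efficient human-in-the-loop segmentation in multi-tracer, multi-center PET/CT.

Link: https://arxiv.org/abs/2508.21680
Authors: Maximilian Rokuss, Yannick Kirchhoff, Fabian Isensee, Klaus H. Maier-Hein
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: autoPET4 Team LesionLocator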

Abstract:Whole-body PET/CT is a cornerstone of oncological imaging, yet accurate lesion segmentation remains challenging due to tracer heterogeneity, physiological uptake, and multi-center variability. While fully automated methods have advanced substantially, clinical practice benefits from approaches that keep humans in the loop to efficiently refine predicted masks. The autoPET/CT IV challenge addresses this need by introducing interactive segmentation tasks based on simulated user prompts. In this work, we present our submission to Task 1. Building on the winning autoPET III nnU-Net pipeline, we extend the framework with promptable capabilities by encoding user-provided foreground and background clicks as additional input channels. We systematically investigate representations for spatial prompts and demonstrate that Euclidean Distance Transform (EDT) encodings consistently outperform Gaussian kernels. Furthermore, we propose online simulation of user interactions and a custom point sampling strategy to improve robustness under realistic prompting conditions. Our ensemble of EDT-based models, trained with and without external data, achieves the strongest cross-validation performance, reducing both false positives and false negatives compared to baseline models. These results highlight the potential of promptable models to enable efficient, user-guided segmentation workflows in multi-tracer, multi-center PET/CT. Code is publicly available at this https URL
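
The two click encodings compared in this abstract can be sketched on a small 2D grid. The brute-force distance computation, the grid size, and `sigma` are illustrative assumptions; real pipelines work on 3D volumes, typically use an optimized distance transform, and may normalise or truncate the result.

```python
import math

def edt_channel(shape, clicks):
    """Euclidean distance from each pixel to its nearest click.

    Brute force for clarity; assumes at least one click is given.
    """
    h, w = shape
    return [[min(math.hypot(y - cy, x - cx) for cy, cx in clicks)
             for x in range(w)] for y in range(h)]

def gaussian_channel(shape, clicks, sigma=1.0):
    """Maximum over clicks of a Gaussian bump centred on each click."""
    h, w = shape
    return [[max(math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
                 for cy, cx in clicks)
             for x in range(w)] for y in range(h)]

# One foreground click at the centre of a 5 x 5 slice.
edt = edt_channel((5, 5), [(2, 2)])
gauss = gaussian_channel((5, 5), [(2, 2)], sigma=1.0)
corner_edt, corner_gauss = edt[0][0], gauss[0][0]
```

At the corner the EDT channel still carries a usable value (about 2.83 pixels), while the Gaussian has decayed to about 0.018. That difference in long-range signal is one plausible intuition for the reported gap, though the abstract itself does not give a reason.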

[CV-19] Unfolding Framework with Complex-Valued Deformable Attention for High-Quality Computer-Generated Hologram Generation

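Quick read: This paper addresses the limited reconstruction accuracy and stability in Computer-Generated Holography (CGH) that result from its nonlinear and ill-posed nature. Specifically: (i) end-to-end networks treat the reconstruction model as a black box and lack physical interpretability and flexibility; (ii) methods based on convolutional neural networks (CNNs) have limited receptive fields and struggle to capture long-range dependencies and global context; (iii) models based on the Angular Spectrum Method (ASM) are restricted to a limited working distance. The key to the solution is a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an Adaptive Bandwidth-Preserving Model (ABPM) and a Phase-Domain Complex-Valued Denoiser (PCD). ABPM extends the working distance beyond the ASM limit; PCD uses a complex-valued deformable self-attention mechanism to capture global features and improve performance. The method reaches a peak signal-to-noise ratio (PSNR) above 35 dB and outperforms existing methods.

Link: https://arxiv.org/abs/2508.21657
Authors: Haomiao Zhang, Zhangyuan Li, Yanling Piao, Zhi Li, Xiaodong Wang, Miao Cao, Xiongfei Su, Qiang Song, Xin Yuan
Affiliations: Zhejiang University; Westlake University; Westlake Institute for Optoelectronics; Hunan University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: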

Abstract:Computer-generated holography (CGH) has gained wide attention with deep learning-based algorithms. However, due to its nonlinear and ill-posed nature, challenges remain in achieving accurate and stable reconstruction. Specifically, (i) the widely used end-to-end networks treat the reconstruction model as a black box, ignoring underlying physical relationships, which reduces interpretability and flexibility. (ii) CNN-based CGH algorithms have limited receptive fields, hindering their ability to capture long-range dependencies and global context. (iii) Angular spectrum method (ASM)-based models are constrained to finite working distances. In this paper, we propose a Deep Unfolding Network (DUN) that decomposes gradient descent into two modules: an adaptive bandwidth-preserving model (ABPM) and a phase-domain complex-valued denoiser (PCD), providing more flexibility. ABPM allows for wider working distances compared to ASM-based methods. At the same time, PCD leverages its complex-valued deformable self-attention module to capture global features and enhance performance, achieving a PSNR over 35 dB. Experiments on simulated and real data show state-of-the-art results.
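
To put the 35 dB figure in context, PSNR is a log-scaled inverse of the mean squared error between a target image and its reconstruction. The sketch below uses the usual definition for intensities in [0, 1]; the tiny 2 x 2 images are invented, and the paper's evaluation may use a different peak value or colour handling.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    flat_ref = [v for row in reference for v in row]
    flat_rec = [v for row in reconstruction for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_rec)) / len(flat_ref)
    if mse == 0:
        return math.inf           # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel is off by 0.01, so the MSE is 1e-4.
target = [[0.0, 1.0], [0.5, 0.5]]
recon = [[0.01, 0.99], [0.51, 0.49]]
value = psnr(target, recon)
```

A uniform error of 0.01 gives about 40 dB. For unit-range images, 35 dB corresponds to an MSE of roughly 3.2e-4, or a root-mean-square error near 0.018.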

[CV-20] The Rosario Dataset v2: Multimodal Dataset for Agricultural Robotics

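Quick read: This paper addresses the challenge of high-precision localization, mapping, and navigation for agricultural robots in complex natural environments, in particular under changing light, motion blur, rough terrain, and long, perceptually aliased sequences. The key to the solution is a multimodal dataset with synchronized data from a stereo infrared camera, a color camera, an inertial measurement unit (IMU), GNSS (Single Point Positioning, Real-Time Kinematic, and Post-Processed Kinematic), wheel odometry, and other sensors. It totals more than two hours of recordings and provides 6-DOF ground truth and loop closures on long trajectories to support the development and benchmarking of multimodal SLAM (Simultaneous Localization and Mapping) systems. Running state-of-the-art multimodal SLAM algorithms on the dataset exposes the limitations of existing methods in agricultural settings and motivates research on more robust perception and navigation algorithms.

Link: https://arxiv.org/abs/2508.21635
Authors: Nicolas Soncini, Javier Cremona, Erica Vidal, Maximiliano García, Gastón Castro, Taihú Pire
Affiliations: CIFASIS (CONICET-UNR); Universidad de San Andrés (UDESA-CONICET)
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Systems and Control (eess.SY)
Comments: First published on The International Journal of Robotics Research: this https URL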

Abstract:We present a multi-modal dataset collected in a soybean crop field, comprising over two hours of recorded data from sensors such as stereo infrared camera, color camera, accelerometer, gyroscope, magnetometer, GNSS (Single Point Positioning, Real-Time Kinematic and Post-Processed Kinematic), and wheel odometry. This dataset captures key challenges inherent to robotics in agricultural environments, including variations in natural lighting, motion blur, rough terrain, and long, perceptually aliased sequences. By addressing these complexities, the dataset aims to support the development and benchmarking of advanced algorithms for localization, mapping, perception, and navigation in agricultural robotics. The platform and data collection system is designed to meet the key requirements for evaluating multi-modal SLAM systems, including hardware synchronization of sensors, 6-DOF ground truth and loops on long trajectories. We run multimodal state-of-the-art SLAM methods on the dataset, showcasing the existing limitations in their application on agricultural settings. The dataset and utilities to work with it are released on this https URL.
ACM classes: I.2.9
Journal reference: The Rosario dataset v2: Multi-modal dataset for agricultural robotics. The International Journal of Robotics Research. 2025;0(0)
Related DOI: https://doi.org/10.1177/02783649251368909

[CV-21] Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer MICCAI2025

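Quick read: This paper addresses the limited accuracy of individualized postoperative recurrence risk prediction in clear cell renal cell carcinoma (ccRCC). Existing clinical tools such as the Leibovich score are widely used but offer little patient-level stratification and do not use imaging. The key to the solution is a modular deep learning framework built on pretrained encoders that integrates preoperative CT with postoperative histopathology whole-slide images (WSIs) through several setups (unimodal, late fusion, and intermediate fusion). Intermediate fusion gives the best model, with performance approaching the adjusted Leibovich score. The results indicate that pathology images carry strong prognostic value, while radiology contributes mainly through fusion.

Link: https://arxiv.org/abs/2508.21581
Authors: Daniël Boeke, Cedrik Blommestijn, Rebecca N. Wray, Kalina Chupetlovska, Shangqi Gao, Zeyu Gao, Regina G. H. Beets-Tan, Mireia Crispin-Ortuzar, James O. Jones, Wilson Silva, Ines P. Machado
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 2 figures, 1 table. Accepted at the Multimodal Learning and Fusion Across Scales for Clinical Decision Support (ML-CDS) Workshop, MICCAI 2025. This is the submitted version with authors, affiliations, and acknowledgements included; it has not undergone peer review or revisions. The final version will appear in the Springer Lecture Notes in Computer Science (LNCS) proceedings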

Abstract:Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
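
The remark about tie-breaking is easier to follow with a concrete ranking metric. The abstract does not name one, so the sketch below assumes a concordance-style index (Harrell's C) in which tied risk scores earn half credit; the follow-up times, event flags, and risk values are invented for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index; tied risk scores count as half-concordant.

    times[i]  : follow-up time for patient i
    events[i] : 1 if recurrence was observed, 0 if censored
    risks[i]  : predicted risk, higher means earlier expected recurrence
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                      # a censored patient cannot anchor a pair
        for j in range(n):
            if times[i] < times[j]:       # i recurred before j's last follow-up
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5     # tie in predicted risk
    return concordant / comparable if comparable else float("nan")

times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
continuous = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
discrete = concordance_index(times, events, [2, 2, 1, 1])   # two risk groups
```

The continuous scores rank every comparable pair correctly (1.0), while the two-group score ties one pair and drops to 0.9. How such ties are scored or broken therefore changes the comparison between a discrete clinical score and a continuous learned one.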

[CV-22] Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging

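Quick read: This paper addresses the difficulty of modeling temporal dynamics in medical imaging, especially for disease progression modeling, treatment planning, and anatomical development tracking. Existing deep learning methods are mostly limited to a single time point or to classification and regression tasks, and struggle with fine-grained spatial prediction. The key to the solution is Temporal Flow Matching (TFM), a unified generative trajectory method that learns the underlying temporal distribution, can by design fall back to predicting the last context image (LCI) as a special case, and supports 3D volumes, multiple prior scans, and irregular sampling. On three public longitudinal datasets it clearly outperforms existing spatio-temporal methods and sets a new baseline for 4D medical image prediction.

Link: https://arxiv.org/abs/2508.21580
Authors: Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein
Affiliations: German Cancer Research Center; Helmholtz Information and Data Science School for Health; University of Heidelberg; Heidelberg University Hospital
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: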

Abstract:Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports 3D volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for 4D medical image prediction.
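
The fallback property in point (ii) has a simple reading under generic linear-path flow matching: if the flow starts at the last context image and the learned velocity is zero, integration returns that image unchanged. The sketch below assumes this generic formulation on flattened toy "images"; it is not the TFM implementation, and all values are invented.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 (t = 0) to x1 (t = 1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_integrate(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

last_scan = [0.2, 0.4, 0.6]      # flattened last context image (toy values)
next_scan = [0.3, 0.4, 0.5]      # flattened follow-up image (toy values)

# A zero velocity field reproduces the last-context-image predictor.
fallback = euler_integrate(last_scan, lambda x, t: [0.0] * len(x))

# An oracle field equal to the training target reaches the follow-up scan.
oracle = target_velocity(last_scan, next_scan)
predicted = euler_integrate(last_scan, lambda x, t: oracle)
```

Starting the flow from the last scan rather than from noise means the model only has to learn the change between scans, and predicting no change is always available as a safe default.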

[CV-23] How Well Do Vision–Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images ICCV

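Quick read: This paper addresses the limited fine-grained spatial reasoning of current Vision-Language Models (VLMs) in urban scenes, and in particular the open question of how well models pretrained on general scenes transfer to the urban domain. The key to the solution is a synthetic visual question answering (VQA) dataset for urban scenes, with Chain-of-Thought (CoT) answers generated by a large language model (LLM) as step-by-step reasoning supervision, which improves on zero-shot performance for hard question types such as negation and counterfactuals. Experiments show that fine-tuning on this synthetic dataset substantially improves VLM performance on urban spatial understanding.

Link: https://arxiv.org/abs/2508.21565
Authors: Juneyoung Ro, Namwoo Kim, Yoonjin Yoon
Affiliations: Korea Advanced Institute of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV Workshop 2025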

Abstract:Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to the urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs (BLIP-2, InstructBLIP, and LLaVA-1.5), evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct such a dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.

[CV-24] ECHO: Ego-Centric modeling of Human-Object interactions

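Quick read: This paper addresses the reconstruction of Human-Object Interaction (HOI) from an egocentric perspective, in particular how to recover human pose, object motion, and contact from head and wrist tracking alone. The key to the solution is ECHO, the first unified framework to recover all three modalities together. It uses a Diffusion Transformer architecture and a three-variate diffusion process that jointly models human motion, object trajectory, and contact sequence; a conveyor-based inference scheme lets it process sequences of any length; and it operates in a head-centric canonical space, which makes it more robust to changes in global orientation.

Link: https://arxiv.org/abs/2508.21556
Authors: Ilya A. Petrov, Vladimir Guzov, Riccardo Marin, Emre Aksan, Xu Chen, Daniel Cremers, Thabo Beeler, Gerard Pons-Moll
Affiliations: Google; Stanford University; ETH Zurich; Max Planck Institute for Intelligent Systems; University of California, Berkeley
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: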

Abstract:Modeling human-object interactions (HOI) from an egocentric perspective is a largely unexplored yet important problem due to the increasing adoption of wearable devices, such as smart glasses and watches. We investigate how much information about interaction can be recovered from only head and wrists tracking. Our answer is ECHO (Ego-Centric modeling of Human-Object interactions), which, for the first time, proposes a unified framework to recover three modalities: human pose, object motion, and contact from such minimal observation. ECHO employs a Diffusion Transformer architecture and a unique three-variate diffusion process, which jointly models human motion, object trajectory, and contact sequence, allowing for flexible input configurations. Our method operates in a head-centric canonical space, enhancing robustness to global orientation. We propose a conveyor-based inference, which progressively increases the diffusion timestamp with the frame position, allowing us to process sequences of any length. Through extensive evaluation, we demonstrate that ECHO outperforms existing methods that do not offer the same flexibility, setting a state-of-the-art in egocentric HOI reconstruction.

[CV-25] EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting CIKM2025

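Quick read: This paper addresses the prohibitive annotation cost (O(n²)) of exhaustive pairwise comparison in subjective or difficult annotation tasks. The core of the solution is to combine zero-shot pre-ordering with uncertainty-guided human-in-the-loop sampling to cut the human annotation burden: it first uses the Contrastive Language-Image Pre-training (CLIP) model for a training-free coarse pre-ordering, then initializes bucket-aware Elo scores from that ordering, and finally runs an uncertainty-guided human-in-the-loop MergeSort. While maintaining or improving inter-rater reliability, the method reduces human annotation cost by 90.5% compared with exhaustive pairwise comparison and by a further 19.8% compared with the best prior method (at n = 100).

Link: https://arxiv.org/abs/2508.21550
Authors: Yujin Park, Haejun Chung, Ikbeom Jang
Affiliations: Hanyang University; Hankuk University of Foreign Studies
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 5 pages, 2 figures, Accepted at CIKM 2025 (ACM International Conference on Information and Knowledge Management)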

Abstract:Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical image chronology (DHCI), and retinal image quality assessment (EyePACS). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (when n = 100), while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking.
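
The three-stage idea in this abstract can be sketched as a merge sort whose comparator consults a prior before asking a person. The Elo expected-score formula below is the standard one; the `margin` cut-off, the prior ratings (standing in for CLIP-derived scores), and the `ask_human` callback are hypothetical, and EZ-Sort's bucket-aware initialization and uncertainty rule are not reproduced here.

```python
def elo_expected(r_a, r_b):
    """Standard Elo probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def merge_sort_with_prior(items, prior, ask_human, margin=0.9):
    """Sort ascending, querying the human only when the prior is uncertain.

    prior maps item -> Elo-like rating; ask_human(a, b) is True when a < b.
    Returns the sorted list and the number of human queries.
    """
    queries = 0

    def less(a, b):
        nonlocal queries
        p_b_higher = elo_expected(prior[b], prior[a])
        if p_b_higher >= margin:
            return True               # prior is confident that a < b
        if p_b_higher <= 1.0 - margin:
            return False              # prior is confident that a > b
        queries += 1
        return ask_human(a, b)

    def sort(seq):
        if len(seq) <= 1:
            return list(seq)
        mid = len(seq) // 2
        left, right = sort(seq[:mid]), sort(seq[mid:])
        out, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if less(right[j], left[i]):
                out.append(right[j])
                j += 1
            else:
                out.append(left[i])
                i += 1
        return out + left[i:] + right[j:]

    return sort(list(items)), queries

# Toy face-age task: the prior separates most pairs; only p1 vs p2 is close.
age = {"p1": 10, "p2": 12, "p3": 40, "p4": 70}
prior = {"p1": 1000, "p2": 1010, "p3": 1500, "p4": 2000}
order, queries = merge_sort_with_prior(
    ["p3", "p1", "p4", "p2"], prior, lambda a, b: age[a] < age[b])
```

In this toy run only one of the five comparisons reaches the human. For scale, exhaustive comparison of n = 100 items would need 100 * 99 / 2 = 4,950 judgments.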

[CV-26] Complete Gaussian Splats from a Single Image with Denoising Diffusion Models

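Quick read: This paper addresses the difficulty of fully recovering occluded regions and unobserved surfaces when reconstructing a 3D scene from a single image. Conventional regression methods predict only a single "mode", which leads to blurry, implausible reconstructions that cannot capture multiple possible explanations. The key to the solution is a generative modeling framework: a Variational AutoReconstructor learns, in a self-supervised manner from 2D images only, a latent space for 3D Gaussian splats, and a diffusion model is trained over that latent space. This yields diverse, high-quality reconstructions of occluded regions and supports 360-degree rendering.

Link: https://arxiv.org/abs/2508.21542
Authors: Ziwei Liao, Mohamed Sayed, Steven L. Waslander, Sara Vicente, Daniyar Turmukhambetov, Michael Firman
Affiliations: University of Toronto; Niantic Spatial
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: Main paper: 11 pages; Supplementary materials: 7 pages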

Abstract:Gaussian splatting typically requires dense observations of the scene and can fail to reconstruct occluded and unobserved areas. We propose a latent diffusion model to reconstruct a complete 3D scene with Gaussian splats, including the occluded parts, from only a single image during inference. Completing the unobserved surfaces of a scene is challenging due to the ambiguity of the plausible surfaces. Conventional methods use a regression-based formulation to predict a single “mode” for occluded and out-of-frustum surfaces, leading to blurriness, implausibility, and failure to capture multiple possible explanations. Thus, they often address this problem partially, focusing either on objects isolated from the background, reconstructing only visible surfaces, or failing to extrapolate far from the input views. In contrast, we propose a generative formulation to learn a distribution of 3D representations of Gaussian splats conditioned on a single input image. To address the lack of ground-truth training data, we propose a Variational AutoReconstructor to learn a latent space only from 2D images in a self-supervised manner, over which a diffusion model is trained. Our method generates faithful reconstructions and diverse samples with the ability to complete the occluded surfaces for high-quality 360-degree renderings.

[CV-27] HCCM: Hierarchical Cross-Granularity Contrastive and Matching Learning for Natural Language-Guided Drones ACM-MM’25

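[Quick read]: This paper addresses natural language-guided visual understanding in drone scenarios, in particular the difficulty of vision-language alignment caused by wide fields of view and complex compositional semantics. Mainstream Vision-Language Models (VLMs) emphasize global alignment but lack fine-grained semantics, while existing hierarchical methods depend on precise entity partitioning and strict containment relations, which limits their effectiveness in dynamic environments. The key to the solution is the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework, whose core contributions are: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; and (2) Region-Global Image-Text Matching (RG-ITM), which evaluates local semantic consistency without rigid constraints, improving compositional reasoning. In addition, a Momentum Contrast and Distillation (MCD) mechanism improves robustness to incomplete or ambiguous text descriptions.

Link: https://arxiv.org/abs/2508.21539
Authors: Hao Ruan, Jinliang Lin, Yingxin Lai, Zhiming Luo, Shaozi Li
Affiliations: Xiamen University; Wuyi University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM’25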

Abstract:Natural Language-Guided Drones (NLGD) provide a novel paradigm for tasks such as target matching and navigation. However, the wide field of view and complex compositional semantics in drone scenarios pose challenges for vision-language understanding. Mainstream Vision-Language Models (VLMs) emphasize global alignment while lacking fine-grained semantics, and existing hierarchical methods depend on precise entity partitioning and strict containment, limiting effectiveness in dynamic environments. To address this, we propose the Hierarchical Cross-Granularity Contrastive and Matching learning (HCCM) framework with two components: (1) Region-Global Image-Text Contrastive Learning (RG-ITC), which avoids precise scene partitioning and captures hierarchical local-to-global semantics by contrasting local visual regions with global text and vice versa; (2) Region-Global Image-Text Matching (RG-ITM), which dispenses with rigid constraints and instead evaluates local semantic consistency within global cross-modal representations, enhancing compositional reasoning. Moreover, drone text descriptions are often incomplete or ambiguous, destabilizing alignment. HCCM introduces a Momentum Contrast and Distillation (MCD) mechanism to improve robustness. Experiments on GeoText-1652 show HCCM achieves state-of-the-art Recall@1 of 28.8% (image retrieval) and 14.7% (text retrieval). On the unseen ERA dataset, HCCM demonstrates strong zero-shot generalization with 39.93% mean recall (mR), outperforming fine-tuned baselines.

[CV-28] Maybe you don't need a U-Net: convolutional feature upsampling for materials micrograph segmentation

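[Quick read]: This paper addresses two problems that vision-transformer-based feature foundation models face in micrograph analysis: patch-based features struggle to represent the fine structures found in micrographs (such as hairline cracks), and the models struggle with the large image sizes common in materials science and biological imaging. The key to the solution is to train a convolutional neural network (CNN) as an upsampler that upsamples low-resolution (i.e., large patch size) foundation-model features with reference to the input image, yielding richer fine-grained feature representations. Without any further training, the upsampler can be used to featurise and segment a variety of microscopy images efficiently; interactive segmentation with these features is faster and needs far fewer labels than training or fine-tuning a more traditional convolutional network.

Link: https://arxiv.org/abs/2508.21529
Authors: Ronan Docherty, Antonis Vamvakeros, Samuel J. Cooper
Affiliations: Imperial College London; Finden ltd; Dyson School of Design Engineering
Categories: Computer Vision and Pattern Recognition (cs.CV); Materials Science (cond-mat.mtrl-sci)
Comments: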

Abstract:Feature foundation models - usually vision transformers - offer rich semantic descriptors of images, useful for downstream tasks such as (interactive) segmentation and object detection. For computational efficiency these descriptors are often patch-based, and so struggle to represent the fine features often present in micrographs; they also struggle with the large image sizes present in materials and biological image analysis. In this work, we train a convolutional neural network to upsample low-resolution (i.e., large patch size) foundation model features with reference to the input image. We apply this upsampler network (without any further training) to efficiently featurise and then segment a variety of microscopy images, including plant cells, a lithium-ion battery cathode and organic crystals. The richness of these upsampled features admits separation of hard-to-segment phases, like hairline cracks. We demonstrate that interactive segmentation with these deep features produces high-quality segmentations far faster and with far fewer labels than training or finetuning a more traditional convolutional network.

[CV-29] ELV-Halluc: Benchmarking Semantic Aggregation Hallucinations in Long Video Understanding

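[Quick read]: This paper addresses a new type of hallucination in long-video understanding: Semantic Aggregation Hallucination (SAH). SAH refers to outputs that are wrong even though the frame-level semantics are correct, because the model incorrectly aggregates frame-level semantics into event-level semantic groups; the problem is especially pronounced in long videos, where semantic complexity is higher. The key to the solution is twofold: first, ELV-Halluc, the first benchmark dedicated to long-video hallucination, is built to systematically identify and quantify SAH; second, the authors show that the positional encoding strategy helps alleviate SAH, and adopt Direct Preference Optimization (DPO) to strengthen the model's ability to distinguish semantics within and across events. Using a curated dataset of 8K adversarial data pairs, the approach improves results on both ELV-Halluc and Video-MME, including a 27.7% reduction in SAH ratio.

Link: https://arxiv.org/abs/2508.21496
Authors: Hao Lu, Jiahao Wang, Yaolun Zhang, Ruohui Wang, Xuanyu Zheng, Yepeng Tang, Dahua Lin, Lewei Lu
Affiliations: SenseTime
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: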

Abstract:Video multimodal large language models (Video-MLLMs) have achieved remarkable progress in video understanding. However, they remain vulnerable to hallucination, producing content inconsistent with or unrelated to video inputs. Previous video hallucination benchmarks primarily focus on short videos. They attribute hallucinations to factors such as strong language priors, missing frames, or vision-language biases introduced by the visual encoder. While these causes indeed account for most hallucinations in short videos, they still oversimplify the cause of hallucinations. Sometimes, models generate incorrect outputs but with correct frame-level semantics. We refer to this type of hallucination as Semantic Aggregation Hallucination (SAH), which arises during the process of aggregating frame-level semantics into event-level semantic groups. Given that SAH becomes particularly critical in long videos due to increased semantic complexity across multiple events, it is essential to separate and thoroughly investigate the causes of this type of hallucination. To address the above issues, we introduce ELV-Halluc, the first benchmark dedicated to long-video hallucination, enabling a systematic investigation of SAH. Our experiments confirm the existence of SAH and show that it increases with semantic complexity. Additionally, we find that models are more prone to SAH on rapidly changing semantics. Moreover, we discuss potential approaches to mitigate SAH. We demonstrate that positional encoding strategy contributes to alleviating SAH, and further adopt DPO strategy to enhance the model’s ability to distinguish semantics within and across events. To support this, we curate a dataset of 8K adversarial data pairs and achieve improvements on both ELV-Halluc and Video-MME, including a substantial 27.7% reduction in SAH ratio.

[CV-30] Adversarial Patch Attack for Ship Detection via Localized Augmentation

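[Quick read]: This paper addresses the vulnerability of deep neural networks (DNNs) to adversarial patch attacks in ship detection from remote sensing imagery, focusing on the problem that existing data-augmentation-based methods over-augment backgrounds and non-target regions, which introduces interference and false detections. The key to the solution is a localized augmentation method that applies augmentation only to target regions and leaves non-target regions unaffected, so that the loss function focuses more directly on the actual effect of the adversarial patch on the detection model, improving both attack success rate and transferability.

Link: https://arxiv.org/abs/2508.21472
Authors: Chun Liu, Panpan Ding, Zheng Zheng, Hailong Wang, Bingqian Zhu, Tao Xu, Zhigang Han, Jiayao Wang
Affiliations: Henan University; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: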

Abstract:Current ship detection techniques based on remote sensing imagery primarily rely on the object detection capabilities of deep neural networks (DNNs). However, DNNs are vulnerable to adversarial patch attacks, which can lead to misclassification by the detection model or complete evasion of the targets. Numerous studies have demonstrated that data transformation-based methods can improve the transferability of adversarial examples. However, excessive augmentation of image backgrounds or irrelevant regions may introduce unnecessary interference, resulting in false detections of the object detection model. These errors are not caused by the adversarial patches themselves but rather by the over-augmentation of background and non-target areas. This paper proposes a localized augmentation method that applies augmentation only to the target regions, avoiding any influence on non-target areas. By reducing background interference, this approach enables the loss function to focus more directly on the impact of the adversarial patch on the detection model, thereby improving the attack success rate. Experiments conducted on the HRSC2016 dataset demonstrate that the proposed method effectively increases the success rate of adversarial patch attacks and enhances their transferability.
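
As a rough illustration of the localized-augmentation idea in the abstract above, the sketch below applies a transformation only inside target bounding boxes and leaves every background pixel unchanged. The function name `augment_targets_only`, the brightness-scaling transform, and the `(x0, y0, x1, y1)` box format are assumptions made for this example; none of them comes from the paper.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # grayscale image as rows of pixel values in [0, 1]
Box = Tuple[int, int, int, int]  # hypothetical format: (x0, y0, x1, y1), end-exclusive


def augment_targets_only(image: Image, boxes: List[Box],
                         transform: Callable[[float], float]) -> Image:
    """Return a copy of `image` with `transform` applied only inside `boxes`."""
    out = [row[:] for row in image]  # copy so the input stays untouched
    for x0, y0, x1, y1 in boxes:
        # Clamp the box to the image bounds before iterating.
        for y in range(max(y0, 0), min(y1, len(out))):
            for x in range(max(x0, 0), min(x1, len(out[y]))):
                # Read from the original so overlapping boxes are not transformed twice.
                out[y][x] = transform(image[y][x])
    return out


def brighten(pixel: float) -> float:
    """Illustrative augmentation (an assumption): scale brightness by 1.5, clip at 1.0."""
    return min(1.0, pixel * 1.5)


image = [[0.2] * 6 for _ in range(4)]
augmented = augment_targets_only(image, [(1, 1, 3, 3)], brighten)
```

Pixels outside the box keep their original value, so any change in a detector's output can only come from the target region.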

[CV-31] Multi-Method Ensemble for Out-of-Distribution Detection BMVC2025

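[Quick read]: This paper addresses the limited ability of neural networks to detect out-of-distribution (OOD) samples in open-world settings, particularly in safety-critical applications. Existing methods tend to rely on a single family of techniques (feature truncation or scoring functions) and generalize poorly across different types of OOD data. The key to the solution is a theoretical and empirical demonstration that state-of-the-art feature truncation and scoring functions can be combined effectively, together with the Multi-Method Ensemble (MME) score, which aggregates multiple scoring functions to improve robustness across OOD sample types. Experiments show that MME outperforms recent state-of-the-art methods across benchmarks; notably, with the BiT model it reaches an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, a 6% improvement over the best existing baseline.

Link: https://arxiv.org/abs/2508.21463
Authors: Lucas Rakotoarivony
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted paper for BMVC 2025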

Abstract:Detecting out-of-distribution (OOD) samples is essential for neural networks operating in open-world settings, particularly in safety-critical applications. Existing methods have improved OOD detection by leveraging two main techniques: feature truncation, which increases the separation between in-distribution (ID) and OOD samples, and scoring functions, which assign scores to distinguish between ID and OOD data. However, most approaches either focus on a single family of techniques or evaluate their effectiveness on a specific type of OOD dataset, overlooking the potential of combining multiple existing solutions. Motivated by this observation, we theoretically and empirically demonstrate that state-of-the-art feature truncation and scoring functions can be effectively combined. Moreover, we show that aggregating multiple scoring functions enhances robustness against various types of OOD samples. Based on these insights, we propose the Multi-Method Ensemble (MME) score, which unifies state-of-the-art OOD detectors into a single, more effective scoring function. Extensive experiments on both large-scale and small-scale benchmarks, covering near-OOD and far-OOD scenarios, show that MME significantly outperforms recent state-of-the-art methods across all benchmarks. Notably, using the BiT model, our method achieves an average FPR95 of 27.57% on the challenging ImageNet-1K benchmark, improving performance by 6% over the best existing baseline.
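
The FPR95 figure quoted above is the false positive rate on OOD samples at the threshold that still accepts about 95% of in-distribution samples. A minimal sketch of that metric follows; it assumes the common convention that a higher score means "more in-distribution", and the function name and sample scores are made up for illustration. It does not reproduce the MME score itself.

```python
from typing import Sequence


def fpr_at_95_tpr(id_scores: Sequence[float], ood_scores: Sequence[float]) -> float:
    """Fraction of OOD samples accepted when at least 95% of ID samples are accepted.

    Assumed convention: a higher score means "more in-distribution".
    """
    ranked = sorted(id_scores, reverse=True)
    # Number of ID samples that must be accepted: ceil(0.95 * n), in integer arithmetic.
    keep = max(1, (95 * len(ranked) + 99) // 100)
    threshold = ranked[keep - 1]  # lowest score among the accepted ID samples
    accepted_ood = sum(1 for score in ood_scores if score >= threshold)
    return accepted_ood / len(ood_scores)


# Illustrative scores only, not results from the paper.
id_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.88, 0.92, 0.65]
ood_scores = [0.3, 0.62, 0.1, 0.7]
fpr95 = fpr_at_95_tpr(id_scores, ood_scores)
```

Lower is better: a detector with a lower FPR95 lets fewer OOD samples through at the same in-distribution acceptance rate.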

[CV-32] Federated Fine-tuning of SAM-Med3D for MRI-based Dementia Classification MICCAI2025

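[Quick read]: This paper addresses the performance and efficiency challenges of integrating foundation models (FMs) into federated learning (FL) systems for dementia classification from brain MRI. Its key contribution is a systematic evaluation of three core design choices: classification head architecture, fine-tuning strategy, and aggregation method. The study finds that the classification head architecture substantially affects performance, that freezing the FM encoder achieves results comparable to full fine-tuning, and that advanced aggregation methods outperform standard federated averaging. These findings offer practical guidance for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.

Link: https://arxiv.org/abs/2508.21458
Authors: Kaouther Mouheb, Marawan Elbatel, Janne Papma, Geert Jan Biessels, Jurgen Claassen, Huub Middelkoop, Barbara van Munster, Wiesje van der Flier, Inez Ramakers, Stefan Klein, Esther E. Bron
Affiliations: Erasmus MC
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the MICCAI 2025 Workshop on Distributed, Collaborative and Federated Learning (DeCAF)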

Abstract:While foundation models (FMs) offer strong potential for AI-based dementia diagnosis, their integration into federated learning (FL) systems remains underexplored. In this benchmarking study, we systematically evaluate the impact of key design choices: classification head architecture, fine-tuning strategy, and aggregation method, on the performance and efficiency of federated FM tuning using brain MRI data. Using a large multi-cohort dataset, we find that the architecture of the classification head substantially influences performance, freezing the FM encoder achieves comparable results to full fine-tuning, and advanced aggregation methods outperform standard federated averaging. Our results offer practical insights for deploying FMs in decentralized clinical settings and highlight trade-offs that should guide future method development.
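
For readers unfamiliar with the "standard federated averaging" baseline mentioned above, here is a minimal sketch of the aggregation step: each client's parameters are averaged, weighted by that client's number of local samples. The parameter names (`head.weight`, `head.bias`), the dictionary layout, and the sample counts are hypothetical; the paper's actual implementation and its advanced aggregation methods are not shown.

```python
from typing import Dict, List, Tuple

Params = Dict[str, List[float]]  # parameter name -> flattened values


def federated_average(client_updates: List[Tuple[Params, int]]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    total = sum(count for _, count in client_updates)
    if total <= 0:
        raise ValueError("at least one client must contribute samples")
    reference = client_updates[0][0]  # all clients are assumed to share this layout
    averaged: Params = {}
    for name, values in reference.items():
        averaged[name] = [
            sum(params[name][i] * count for params, count in client_updates) / total
            for i in range(len(values))
        ]
    return averaged


# Two hypothetical clients holding 100 and 300 local samples.
clients = [
    ({"head.weight": [1.0, 2.0], "head.bias": [0.0]}, 100),
    ({"head.weight": [3.0, 4.0], "head.bias": [1.0]}, 300),
]
global_params = federated_average(clients)
```

The larger client pulls the average toward its own values, which is the behavior that more advanced aggregation schemes typically try to refine.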

[CV-33] One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist

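[Quick read]: This paper addresses the high computational cost of deploying multimodal large language models (MLLMs) for image captioning on local devices. The authors build a lightweight captioning specialist on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and find that its performance is comparable to large multimodal generalists, suggesting its potential as an on-device visual specialist. The model nonetheless suffers from visual blindness, which occasionally causes semantic captioning errors. The key to the solution is a new framework, Sharp-Eyed Refinement, whose core component, DeepLens, extracts detailed visual representations by concentrating on informative regions identified during the initial glance, improving visual grounding and overall caption quality.

Link: https://arxiv.org/abs/2508.21451
Authors: Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Affiliations: KAIST; UNIST; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL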

Abstract:Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.

[CV-34] Scale-GS: Efficient Scalable Gaussian Splatting via Redundancy-filtering Training on Streaming Content

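[Quick read]: This paper addresses two challenges in applying 3D Gaussian Splatting (3DGS) to dynamic scenes: the large data volume of dense Gaussians and the long per-frame training time. The key to the solution is Scale-GS, a scalable Gaussian Splatting framework in which Gaussian spheres are organized hierarchically by scale within an anchor-based structure: coarser-level Gaussians represent the low-resolution structure of the scene, and finer-level Gaussians, selectively activated by the coarser level, render detail. A hybrid deformation and spawning strategy models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to capture wide-range motion. A bidirectional adaptive masking mechanism removes static regions and prioritizes informative viewpoints, which significantly improves training efficiency while maintaining high visual quality.

Link: https://arxiv.org/abs/2508.21444
Authors: Jiayu Yang, Weijian Su, Songqian Zhang, Yuqi Han, Jinli Suo, Qiang Zhang
Affiliations: Dalian University of Technology; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: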

Abstract:3D Gaussian Splatting (3DGS) enables high-fidelity real-time rendering, a key requirement for immersive applications. However, the extension of 3DGS to dynamic scenes remains limited by the substantial data volume of dense Gaussians and the prolonged training time required for each frame. This paper presents Scale-GS, a scalable Gaussian Splatting framework designed for efficient training in streaming tasks. Specifically, Gaussian spheres are hierarchically organized by scale within an anchor-based structure. Coarser-level Gaussians represent the low-resolution structure of the scene, while finer-level Gaussians, responsible for detailed high-fidelity rendering, are selectively activated by the coarser-level Gaussians. To further reduce computational overhead, we introduce a hybrid deformation and spawning strategy that models inter-frame motion through Gaussian deformation and triggers Gaussian spawning to characterize wide-range motion. Additionally, a bidirectional adaptive masking mechanism enhances training efficiency by removing static regions and prioritizing informative viewpoints. Extensive experiments demonstrate that Scale-GS achieves superior visual quality while significantly reducing training time compared to state-of-the-art methods.

[CV-35] Trees as Gaussians: Large-Scale Individual Tree Mapping

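[Quick read]: This paper addresses the difficulty of monitoring individual trees at global scale. Available global products are limited to binary tree cover or canopy height and do not explicitly identify individual trees. The key to the solution is a deep learning approach on 3-m resolution PlanetScope imagery that simulates tree crowns with Gaussian kernels of scalable size, allowing crown centers to be extracted and binary tree cover maps to be generated. Training uses billions of points automatically extracted from airborne lidar data, enabling the model to identify trees both inside and outside forests, with balanced detection metrics across biomes; detection can be further improved by fine-tuning with manual labels. The method offers a scalable framework for global, high-resolution tree monitoring and is adaptable to future satellite missions.

Link: https://arxiv.org/abs/2508.21437
Authors: Dimitri Gominski, Martin Brandt, Xiaoye Tong, Siyu Liu, Maurice Mugabowindekwe, Sizhuo Li, Florian Reiner, Andrew Davies, Rasmus Fensholt
Affiliations: University of Copenhagen; Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: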

Abstract:Trees are key components of the terrestrial biosphere, playing vital roles in ecosystem function, climate regulation, and the bioeconomy. However, large-scale monitoring of individual trees remains limited by inadequate modelling. Available global products have focused on binary tree cover or canopy height, which do not explicitly identify trees at the individual level. In this study, we present a deep learning approach for detecting large individual trees in 3-m resolution PlanetScope imagery at a global scale. We simulate tree crowns with Gaussian kernels of scalable size, allowing the extraction of crown centers and the generation of binary tree cover maps. Training is based on billions of points automatically extracted from airborne lidar data, enabling the model to successfully identify trees both inside and outside forests. We compare against existing tree cover maps and airborne lidar with state-of-the-art performance (fractional cover R^2 = 0.81 against aerial lidar), report balanced detection metrics across biomes, and demonstrate how detection can be further improved through fine-tuning with manual labels. Our method offers a scalable framework for global, high-resolution tree monitoring, and is adaptable to future satellite missions offering improved imagery.
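
One plausible reading of "tree crowns as Gaussian kernels of scalable size" is a heatmap in which every tree contributes a 2D Gaussian whose width follows its crown size; crown centers are then the local peaks, and thresholding gives a binary cover map. The sketch below only renders and decodes such a heatmap. The `(row, col, sigma)` tree format, the 0.5 threshold, and all function names are assumptions for illustration, and the network that would predict the heatmap from imagery is not shown.

```python
import math
from typing import List, Tuple

Tree = Tuple[int, int, float]  # hypothetical format: (row, col, sigma in pixels)
Heatmap = List[List[float]]


def render_heatmap(height: int, width: int, trees: List[Tree]) -> Heatmap:
    """Draw one Gaussian per tree; overlapping crowns keep the per-pixel maximum."""
    heat = [[0.0] * width for _ in range(height)]
    for r0, c0, sigma in trees:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                heat[r][c] = max(heat[r][c], math.exp(-d2 / (2.0 * sigma ** 2)))
    return heat


def crown_centers(heat: Heatmap, threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Pixels at or above `threshold` that are strict maxima of their 3x3 neighbourhood."""
    rows, cols = len(heat), len(heat[0])
    centers = []
    for r in range(rows):
        for c in range(cols):
            value = heat[r][c]
            if value < threshold:
                continue
            neighbours = [
                heat[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                centers.append((r, c))
    return centers


def cover_mask(heat: Heatmap, threshold: float = 0.5) -> List[List[bool]]:
    """Binary tree-cover map obtained by thresholding the heatmap."""
    return [[value >= threshold for value in row] for row in heat]


heat = render_heatmap(12, 12, [(3, 3, 1.0), (8, 9, 2.0)])
centers = crown_centers(heat)
mask = cover_mask(heat)
```

Because the kernel width scales with `sigma`, the larger crown covers more pixels in `mask` while still decoding to a single center.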

[CV-36] MedShift: Implicit Conditional Transport for X-Ray Domain Adaptation ICCV2025

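[Quick read]: This paper addresses the significant domain gap between synthetic medical data and real clinical images, which limits how well models trained on synthetic data generalize to real-world settings. The core challenge is cross-domain image translation that bridges discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. The key to the solution is MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges. By learning a shared domain-agnostic latent space, it supports high-fidelity image translation between any pair of domains seen during training, without paired data or domain-specific training, and it can be tuned at inference time to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging.

Link: https://arxiv.org/abs/2508.21435
Authors: Francisco Caetano, Christiaan Viviers, Peter H.H. de With, Fons van der Sommen
Affiliations: Eindhoven University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at the ICCV 2025 AIM Workshop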

Abstract:Synthetic medical data offers a scalable solution for training robust models, but significant domain gaps limit its generalizability to real-world clinical settings. This paper addresses the challenge of cross-domain translation between synthetic and real X-ray images of the head, focusing on bridging discrepancies in attenuation behavior, noise characteristics, and soft tissue representation. We propose MedShift, a unified class-conditional generative model based on Flow Matching and Schrodinger Bridges, which enables high-fidelity, unpaired image translation across multiple domains. Unlike prior approaches that require domain-specific training or rely on paired data, MedShift learns a shared domain-agnostic latent space and supports seamless translation between any pair of domains seen during training. We introduce X-DigiSkull, a new dataset comprising aligned synthetic and real skull X-rays under varying radiation doses, to benchmark domain translation models. Experimental results demonstrate that, despite its smaller model size compared to diffusion-based approaches, MedShift offers strong performance and remains flexible at inference time, as it can be tuned to prioritize either perceptual fidelity or structural consistency, making it a scalable and generalizable solution for domain adaptation in medical imaging. The code and dataset are available at this https URL

[CV-37] Unsupervised Incremental Learning Using Confidence-Based Pseudo-Labels WACV2026

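[Quick read]: This paper addresses the unrealistic assumption made by Class-Incremental Learning (CIL) methods that the incremental dataset is fully labeled; in practice, labels for new classes are often unavailable. The key to the solution is unsupervised Incremental Learning using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels chosen by confidence-based selection and integrates them into various CIL methods, enabling incremental learning from unlabeled data. The method is evaluated on CIFAR100 and ImageNet100, applied to fine-grained datasets to demonstrate practicality, and its computational complexity is measured to validate its suitability for resource-constrained environments.

Link: https://arxiv.org/abs/2508.21424
Authors: Lucas Rakotoarivony
Affiliations: Thales, cortAIx Labs
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to WACV 2026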

Abstract:Deep learning models have achieved state-of-the-art performance in many computer vision tasks. However, in real-world scenarios, novel classes that were unseen during training often emerge, requiring models to acquire new knowledge incrementally. Class-Incremental Learning (CIL) methods enable a model to learn novel classes while retaining knowledge of previous classes. However, these methods make the strong assumption that the incremental dataset is fully labeled, which is unrealistic in practice. In this work, we propose an unsupervised Incremental Learning method using Confidence-based Pseudo-labels (ICPL), which replaces human annotations with pseudo-labels, enabling incremental learning from unlabeled datasets. We integrate these pseudo-labels into various CIL methods with confidence-based selection and evaluate performance degradation on CIFAR100 and ImageNet100. Then, we compare our approach to popular Class Incremental Novel Category Discovery (class-iNCD) methods addressing similar challenges. Additionally, we apply our method to fine-grained datasets to demonstrate its real-world practicality and measure its computational complexity to validate its suitability for resource-constrained environments. ICPL achieves competitive results compared to supervised methods and outperforms state-of-the-art class-iNCD methods by more than 5% in final accuracy.
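
Confidence-based pseudo-labeling, as described above, can be sketched as follows: the current model scores each unlabeled sample, and only samples whose top class probability clears a threshold receive a pseudo-label. The max-softmax confidence measure, the 0.9 threshold, and the function names are assumptions for this example; the abstract does not specify how ICPL measures or thresholds confidence.

```python
import math
from typing import List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one sample's class logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(logits: List[float],
                 threshold: float = 0.9) -> Optional[Tuple[int, float]]:
    """Return (class index, confidence) if the top probability reaches `threshold`."""
    probs = softmax(logits)
    confidence = max(probs)
    if confidence < threshold:
        return None  # too uncertain: leave the sample unlabeled
    return probs.index(confidence), confidence


# Hypothetical model outputs for three unlabeled samples.
batch_logits = [[4.0, 0.0, 0.0], [0.5, 0.4, 0.3], [0.0, 0.0, 5.0]]
results = [pseudo_label(logits) for logits in batch_logits]
labels = [None if result is None else result[0] for result in results]
```

The second sample is dropped because its probabilities are nearly uniform; only confident predictions would be passed on to the incremental learner.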

[CV-38] Standardized Multi-Layer Tissue Maps for Enhanced Artificial Intelligence Integration and Search in Large-Scale Whole Slide Image Archives

[Quick read]: This paper addresses the lack of a metadata standard describing the content of Whole Slide Images (WSIs) used to train and validate artificial intelligence (AI) algorithms; as a result, selecting and managing large WSI collections relies mainly on manual inspection, which is inefficient and does not scale. The key to the solution is a general framework for generating a 2D index map for each WSI, together with a profiling mechanism for specific application domains, and a multi-layer tissue map organized into three layers (source, tissue type, and pathological alterations), each of which assigns segments of the WSI to specific classes.

Link: https://arxiv.org/abs/2508.21418
Authors: Gernot Fiala, Markus Plass, Robert Harb, Peter Regitnig, Kristijan Skok, Wael Al Zoughbi, Carmen Zerner, Paul Torke, Michaela Kargl, Heimo Müller, Tomas Brazdil, Matej Gallo, Jaroslav Kubín, Roman Stoklasa, Rudolf Nenutil, Norman Zerbe, Andreas Holzinger, Petr Holub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:A Whole Slide Image (WSI) is a high-resolution digital image created by scanning an entire glass slide containing a biological specimen, such as tissue sections or cell samples, at multiple magnifications. These images can be viewed, analyzed, shared digitally, and are used today for Artificial Intelligence (AI) algorithm development. WSIs are used in a variety of fields, including pathology for diagnosing diseases and oncology for cancer research. They are also utilized in neurology, veterinary medicine, hematology, microbiology, dermatology, pharmacology, toxicology, immunology, and forensic science. When assembling cohorts for the training or validation of an AI algorithm, it is essential to know what is present on such a WSI. However, there is currently no standard for this metadata, so such selection has mainly been done through manual inspection, which is not suitable for large collections with several million objects. We propose a general framework to generate a 2D index map for WSI and a profiling mechanism for specific application domains. We demonstrate this approach in the field of clinical pathology, using common syntax and semantics to achieve interoperability between different catalogs. Our approach augments each WSI collection with a detailed tissue map that provides fine-grained information about the WSI content. The tissue map is organized into three layers: source, tissue type, and pathological alterations, with each layer assigning segments of the WSI to specific classes. We illustrate the advantages and applicability of the proposed standard through specific examples in WSI catalogs, Machine Learning (ML), and graph-based WSI representations. 
Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG). Cite as: arXiv:2508.21418 [cs.CV], https://doi.org/10.48550/arXiv.2508.21418. Related DOI: https://doi.org/10.5281/zenodo.16964541. Submitted by Gernot Fiala, [v1] Fri, 29 Aug 2025 08:39:07 UTC.
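
The three-layer organization described in the abstract (source, tissue type, pathological alterations) can be pictured as a small data structure. The sketch below is only an illustration, not the proposed standard: the class names `Segment` and `TissueMap`, the bounding-box representation, the helper `select_cohort`, and all example labels are assumptions made for this example.

```python
from dataclasses import dataclass, field

# The three layer names come from the abstract; everything else here
# (class names, field names, example labels) is hypothetical.
LAYERS = ("source", "tissue_type", "pathological_alteration")


@dataclass
class Segment:
    """One annotated region of a slide, given as a bounding box in pixels."""
    layer: str
    label: str
    bbox: tuple  # (x, y, width, height)

    def __post_init__(self):
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer}")


@dataclass
class TissueMap:
    """Index of what is present on a single whole slide image."""
    slide_id: str
    segments: list = field(default_factory=list)

    def add(self, layer, label, bbox):
        self.segments.append(Segment(layer, label, bbox))

    def labels(self, layer):
        """Return the set of class labels present on one layer."""
        return {s.label for s in self.segments if s.layer == layer}


def select_cohort(tissue_maps, layer, label):
    """Return the IDs of slides that contain `label` on `layer`."""
    return [m.slide_id for m in tissue_maps if label in m.labels(layer)]


slide_a = TissueMap("slide-A")
slide_a.add("source", "colon", (0, 0, 4000, 3000))
slide_a.add("tissue_type", "epithelium", (100, 200, 800, 600))
slide_a.add("pathological_alteration", "carcinoma", (150, 250, 300, 200))

slide_b = TissueMap("slide-B")
slide_b.add("source", "colon", (0, 0, 4000, 3000))
slide_b.add("tissue_type", "muscle", (500, 500, 900, 700))

cohort = select_cohort([slide_a, slide_b], "pathological_alteration", "carcinoma")
print(cohort)
```

With an index of this kind, cohort selection becomes a lookup over metadata rather than a manual inspection of each slide.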

[CV-39] SatDINO: A Deep Dive into Self-Supervised Pretraining for Remote Sensing

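Quick read: This paper addresses how to learn effective representations for remote sensing imagery when labeled data is scarce. The authors propose SatDINO, a self-supervised pretraining model designed for satellite imagery. Its core idea is to apply DINO, a contrastive self-supervised method, to remote sensing, with additions such as ground sample distance (GSD) encoding and adaptive view sampling. Across multiple datasets and testing setups, SatDINO outperforms state-of-the-art methods based on masked autoencoders (MAE) and achieves competitive results on multiple benchmarks.

Link: https://arxiv.org/abs/2508.21402
Authors: Jakub Straka, Ivan Gruber
Affiliation: University of West Bohemia in Pilsen
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Click to view abstract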

Abstract:Self-supervised learning has emerged as a powerful tool for remote sensing, where large amounts of unlabeled data are available. In this work, we investigate the use of DINO, a contrastive self-supervised method, for pretraining on remote sensing imagery. We introduce SatDINO, a model tailored for representation learning in satellite imagery. Through extensive experiments on multiple datasets in multiple testing setups, we demonstrate that SatDINO outperforms other state-of-the-art methods based on much more common masked autoencoders (MAE) and achieves competitive results in multiple benchmarks. We also provide a rigorous ablation study evaluating SatDINO’s individual components. Finally, we propose a few novel enhancements, such as a new way to incorporate ground sample distance (GSD) encoding and adaptive view sampling. These enhancements can be used independently on our SatDINO model. Our code and trained models are available at: this https URL.
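
For readers unfamiliar with DINO, its training signal is a cross-entropy between a teacher distribution (centered, then sharpened with a low temperature) and a student distribution computed on a different view of the same image. The sketch below shows only that generic objective in plain Python. It does not reproduce SatDINO's GSD encoding or adaptive view sampling, and the temperature defaults and toy logits are assumptions chosen for illustration.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_total for v in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student distributions."""
    # Teacher outputs are centered, then sharpened with a low temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    p_teacher = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))


teacher = [0.2, 0.1, 0.0]
center = [0.0, 0.0, 0.0]
loss_agree = dino_loss([0.2, 0.1, 0.0], teacher, center)
loss_disagree = dino_loss([0.0, 0.1, 0.2], teacher, center)
print(loss_agree < loss_disagree)
```

The loss is lower when the student ranks the output dimensions the same way as the teacher, which is what pushes the two views of an image toward the same representation.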

[CV-40] Identifying Surgical Instruments in Laparoscopy Using Deep Learning Instance Segmentation

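Quick read: This paper addresses automatic segmentation and recognition of surgical instruments in videos of laparoscopic gynecology, a key step toward content indexing and content-based retrieval of medical video. The approach uses a region-based fully convolutional network for two tasks: (1) instance-aware binary segmentation (instrument versus background) and (2) multi-class instrument recognition (identifying the instrument type). Experiments show that instruments can be localized and segmented with high accuracy even with a limited number of training examples. Identifying the specific instrument type remains difficult because surgical instruments look highly similar.

Link: https://arxiv.org/abs/2508.21399
Authors: Sabrina Kletz, Klaus Schoeffmann, Jenny Benois-Pineau, Heinrich Husslein
Affiliations: Klagenfurt University; University of Bordeaux; Medical University of Vienna
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract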

Abstract:Recorded videos from surgeries have become an increasingly important information source for the field of medical endoscopy, since the recorded footage shows every single detail of the surgery. However, while video recording is straightforward these days, automatic content indexing - the basis for content-based search in a medical video archive - is still a great challenge due to the very special video content. In this work, we investigate segmentation and recognition of surgical instruments in videos recorded from laparoscopic gynecology. More precisely, we evaluate the achievable performance of segmenting surgical instruments from their background by using a region-based fully convolutional network for instance-aware (1) instrument segmentation as well as (2) instrument recognition. While the first part addresses only binary segmentation of instances (i.e., distinguishing between instrument or background) we also investigate multi-class instrument recognition (i.e., identifying the type of instrument). Our evaluation results show that even with a moderately low number of training examples, we are able to localize and segment instrument regions with a pretty high accuracy. However, the results also reveal that determining the particular instrument is still very challenging, due to the inherently high similarity of surgical instruments.

[CV-41] GLENDA: Gynecologic Laparoscopy Endometriosis Dataset

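Quick read: This paper addresses the inefficiency of manually analyzing gynecologic laparoscopy videos, which limits post-surgical applications such as treatment planning, case documentation, and medical education. The key contribution is the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA), the first image dataset with region-based annotations of endometriosis. It was created in collaboration with leading medical experts and provides scarce, high-quality training samples for automated analysis based on computer vision and machine learning.

Link: https://arxiv.org/abs/2508.21398
Authors: Andreas Leibetseder, Sabrina Kletz, Klaus Schoeffmann, Simon Keckstein, Jörg Keckstein
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments:

Click to view abstract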

Abstract:Gynecologic laparoscopy as a type of minimally invasive surgery (MIS) is performed via a live feed of a patient’s abdomen surveying the insertion and handling of various instruments for conducting treatment. Adopting this kind of surgical intervention not only facilitates a great variety of treatments, the possibility of recording said video streams is as well essential for numerous post-surgical activities, such as treatment planning, case documentation and education. Nonetheless, the process of manually analyzing surgical recordings, as it is carried out in current practice, usually proves tediously time-consuming. In order to improve upon this situation, more sophisticated computer vision as well as machine learning approaches are actively developed. Since most of such approaches heavily rely on sample data, which especially in the medical field is only sparsely available, with this work we publish the Gynecologic Laparoscopy ENdometriosis DAtaset (GLENDA) - an image dataset containing region-based annotations of a common medical condition named endometriosis, i.e. the dislocation of uterine-like tissue. The dataset is the first of its kind and it has been created in collaboration with leading medical experts in the field.

[CV-42] Print2Volume: Generating Synthetic OCT-based 3D Fingerprint Volume from 2D Fingerprint Image

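Quick read: This paper addresses the scarcity of large-scale public datasets for Optical Coherence Tomography (OCT) fingerprint recognition, which results from costly and time-consuming data acquisition and holds back deep learning models. The proposed solution is Print2Volume, a three-stage framework: (1) a 2D style transfer module converts a binary fingerprint image into a grayscale image that mimics a Z-direction mean-projected OCT image; (2) a 3D Structure Expansion Network generates a plausible 3D anatomical volume from the 2D image; (3) an OCT Realism Refiner based on a 3D generative adversarial network (3D GAN) renders synthetic OCT data with realistic texture, speckle noise, and other imaging characteristics. The method produced a synthetic dataset of 420,000 samples and reduced the Equal Error Rate (EER) on the ZJUT-EIFD benchmark from 15.62% to 2.50%.

Link: https://arxiv.org/abs/2508.21371
Authors: Qingran Miao, Haixia Wang, Haohao Sun, Yilong Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract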

Abstract:Optical Coherence Tomography (OCT) enables the acquisition of high-resolution, three-dimensional fingerprint data, capturing rich subsurface structures for robust biometric recognition. However, the high cost and time-consuming nature of OCT data acquisition have led to a scarcity of large-scale public datasets, significantly hindering the development of advanced algorithms, particularly data-hungry deep learning models. To address this critical bottleneck, this paper introduces Print2Volume, a novel framework for generating realistic, synthetic OCT-based 3D fingerprints from a 2D fingerprint image. Our framework operates in three sequential stages: (1) a 2D style transfer module that converts a binary fingerprint into a grayscale image mimicking the style of a Z-direction mean-projected OCT scan; (2) a 3D Structure Expansion Network that extrapolates the 2D image into a plausible 3D anatomical volume; and (3) an OCT Realism Refiner, based on a 3D GAN, that renders the structural volume with authentic textures, speckle noise, and other imaging characteristics. Using Print2Volume, we generated a large-scale synthetic dataset of 420,000 samples. Quantitative experiments demonstrate the high quality of our synthetic data and its significant impact on recognition performance. By pre-training a recognition model on our synthetic data and fine-tuning it on a small real-world dataset, we achieved a remarkable reduction in the Equal Error Rate (EER) from 15.62% to 2.50% on the ZJUT-EIFD benchmark, proving the effectiveness of our approach in overcoming data scarcity.
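
The headline result is stated as an Equal Error Rate, the operating point at which the false acceptance rate equals the false rejection rate. As a reminder of how that number is obtained, here is a minimal sketch that scans every observed score as a candidate threshold. The function name, the acceptance convention (score at or above the threshold) and the toy scores are assumptions for this example, not the paper's evaluation code.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate EER by scanning every observed score as a threshold.

    A comparison is accepted when its score is >= the threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: genuine pairs scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: impostor pairs scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


genuine = [0.9, 0.8, 0.7, 0.4]    # similarity scores of same-finger pairs
impostor = [0.1, 0.2, 0.3, 0.5]   # similarity scores of different-finger pairs
eer_overlapping = equal_error_rate(genuine, impostor)
eer_separated = equal_error_rate([0.9, 0.8], [0.1, 0.2])
print(eer_overlapping, eer_separated)
```

In the toy data one genuine pair scores below one impostor pair, so no threshold separates the two groups and the EER is 25%. When the score distributions do not overlap, the EER is zero.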

[CV-43] Efficient Diffusion-Based 3D Human Pose Estimation with Hierarchical Temporal Pruning

Quick read: This paper addresses the high computational cost of diffusion models for 3D human pose estimation, which stems from their iterative nature and the need for multiple hypotheses. The key is a Hierarchical Temporal Pruning (HTP) strategy that works in three stages: Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames; Sparse-Focused Temporal MHSA (SFT MHSA) reduces the cost of attention; and the Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning, retaining the most informative pose tokens. The method achieves state-of-the-art performance while reducing training and inference MACs (multiply-accumulate operations) and improving inference speed by an average of 81.1%.

Link: https://arxiv.org/abs/2508.21363
Authors: Yuquan Bi, Hongsong Wang, Xinli Shi, Zhipeng Gui, Jie Gui, Yuan Yan Tang
Affiliations: Southeast University; Wuhan University; University of Macau; UOW College Hong Kong
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract

Abstract:Diffusion models have demonstrated strong capabilities in generating high-fidelity 3D human poses, yet their iterative nature and multi-hypothesis requirements incur substantial computational cost. In this paper, we propose an Efficient Diffusion-Based 3D Human Pose Estimation framework with a Hierarchical Temporal Pruning (HTP) strategy, which dynamically prunes redundant pose tokens across both frame and semantic levels while preserving critical motion dynamics. HTP operates in a staged, top-down manner: (1) Temporal Correlation-Enhanced Pruning (TCEP) identifies essential frames by analyzing inter-frame motion correlations through adaptive temporal graph construction; (2) Sparse-Focused Temporal MHSA (SFT MHSA) leverages the resulting frame-level sparsity to reduce attention computation, focusing on motion-relevant tokens; and (3) Mask-Guided Pose Token Pruner (MGPTP) performs fine-grained semantic pruning via clustering, retaining only the most informative pose tokens. Experiments on Human3.6M and MPI-INF-3DHP show that HTP reduces training MACs by 38.5%, inference MACs by 56.8%, and improves inference speed by an average of 81.1% compared to prior diffusion-based methods, while achieving state-of-the-art performance.
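
HTP's first stage keeps only essential frames so that later attention layers see fewer tokens. The sketch below illustrates that general idea with a deliberately simple stand-in: frames are scored by mean joint displacement and the top `keep` frames are retained. The scoring rule, the function names and the toy data are assumptions; the paper's TCEP instead builds an adaptive temporal graph over inter-frame motion correlations.

```python
def motion_scores(poses):
    """Score each frame by how far its joints moved since the previous frame.

    `poses` is a list of frames; each frame is a list of (x, y) joints.
    """
    scores = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(poses, poses[1:]):
        moved = sum(((cx - px) ** 2 + (cy - py) ** 2) ** 0.5
                    for (px, py), (cx, cy) in zip(prev, cur))
        scores.append(moved / len(cur))
    return scores


def prune_frames(poses, keep):
    """Keep the `keep` highest-scoring frames, preserving temporal order."""
    scores = motion_scores(poses)
    ranked = sorted(range(len(poses)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention over the tokens."""
    return num_tokens * num_tokens


# Six frames with a single joint that moves at frames 2 and 4.
poses = [[(0.0, 0.0)], [(0.0, 0.0)], [(1.0, 0.0)],
         [(1.0, 0.0)], [(4.0, 0.0)], [(4.0, 0.0)]]
kept = prune_frames(poses, keep=2)
print(kept, attention_pairs(len(poses)), attention_pairs(len(kept)))
```

Because full self-attention scales with the square of the token count, keeping two of six frames cuts the number of query-key pairs from 36 to 4 in this toy case.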

[CV-44] ARGS: Advanced Regularization on Aligning Gaussians over the Surface

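Quick read: This paper addresses the reconstruction of high-quality 3D meshes and visuals from 3D Gaussian Splatting (3DGS), targeting the limited visual fidelity and scene consistency of existing methods such as SuGaR. The solution introduces two complementary regularization strategies. The first is an effective rank regularization, motivated by recent studies of Gaussian primitive structure, which discourages extreme anisotropy (needle-like shapes) in favor of more stable disk-like forms. The second integrates a neural Signed Distance Function (SDF) into the optimization, regularized by an Eikonal loss to preserve proper distance properties; the SDF supplies a continuous global surface prior that guides the Gaussians toward the underlying geometry. Together, the two strategies aim to improve the fidelity of individual Gaussian primitives and the coherence of the overall surface.

Link: https://arxiv.org/abs/2508.21344
Authors: Jeong Uk Lee, Sung Hee Choi
Affiliation: KAIST
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures

Click to view abstract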

Abstract:Reconstructing high-quality 3D meshes and visuals from 3D Gaussian Splatting (3DGS) remains a central challenge in computer graphics. Although existing models such as SuGaR offer effective solutions for rendering, there is still room to improve both visual fidelity and scene consistency. This work builds upon SuGaR by introducing two complementary regularization strategies that address common limitations in both the shape of individual Gaussians and the coherence of the overall surface. The first strategy introduces an effective rank regularization, motivated by recent studies on Gaussian primitive structures. This regularization discourages extreme anisotropy, specifically “needle-like” shapes, by favoring more balanced, “disk-like” forms that are better suited for stable surface reconstruction. The second strategy integrates a neural Signed Distance Function (SDF) into the optimization process. The SDF is regularized with an Eikonal loss to maintain proper distance properties and provides a continuous global surface prior, guiding Gaussians toward better alignment with the underlying geometry. These two regularizations aim to improve both the fidelity of individual Gaussian primitives and their collective surface behavior. The final model can make more accurate and coherent visuals from 3DGS data.
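
Both regularizers rest on quantities that are easy to compute by hand. The sketch below assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalized squared scales of a Gaussian, and it evaluates the Eikonal residual (|∇f| − 1)² with central differences on an analytic sphere SDF. The exact loss formulations and weights used in ARGS are not given in the abstract, so the function names and definitions here are illustrative assumptions.

```python
import math


def effective_rank(scales):
    """exp of the Shannon entropy of the normalized squared scales."""
    variances = [s * s for s in scales]
    total = sum(variances)
    probs = [v / total for v in variances]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.exp(entropy)


def sphere_sdf(point, radius=1.0):
    """Exact signed distance to a sphere centered at the origin."""
    return math.sqrt(sum(c * c for c in point)) - radius


def eikonal_residual(sdf, point, eps=1e-4):
    """(|grad f| - 1)^2, with the gradient from central differences."""
    grad = []
    for axis in range(3):
        hi = list(point)
        lo = list(point)
        hi[axis] += eps
        lo[axis] -= eps
        grad.append((sdf(hi) - sdf(lo)) / (2 * eps))
    norm = math.sqrt(sum(g * g for g in grad))
    return (norm - 1.0) ** 2


needle = effective_rank((1.0, 0.01, 0.01))   # one dominant axis
disk = effective_rank((1.0, 1.0, 0.01))      # two dominant axes
ball = effective_rank((1.0, 1.0, 1.0))       # isotropic

query = (0.5, 0.2, -0.3)
valid = eikonal_residual(sphere_sdf, query)
scaled = eikonal_residual(lambda p: 2.0 * sphere_sdf(p), query)
print(round(needle, 2), round(disk, 2), round(ball, 2))
```

A needle-like Gaussian has an effective rank close to 1 and a disk-like one close to 2, so a regularizer can penalize values near 1. A true distance field has unit gradient norm, so its Eikonal residual is near zero, while the rescaled field violates the constraint.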

[CV-45] Mini Autonomous Car Driving based on 3D Convolutional Neural Networks

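Quick read: This paper addresses challenges in developing autonomous driving systems, namely high complexity, long training periods, and intrinsic uncertainty, using Mini Autonomous Cars (MACs) as a practical testbed. The proposed method uses RGB-D information and three-dimensional convolutional neural networks (3D CNNs) to drive a MAC in simulated environments. It is compared with recurrent neural networks (RNNs) on tracks of differing environmental complexity, using task completion success, lap time, and driving consistency as metrics. The 3D CNN showed promising results relative to the RNNs.

Link: https://arxiv.org/abs/2508.21271
Authors: Pablo Moraes, Monica Rodriguez, Kristofer S. Kappel, Hiago Sodre, Santiago Fernandez, Igor Nunes, Bruna Guterres, Ricardo Grando
Affiliations: Technological University of Uruguay; UTEC
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract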

Abstract:Autonomous driving applications have become increasingly relevant in the automotive industry due to their potential to enhance vehicle safety, efficiency, and user experience, thereby meeting the growing demand for sophisticated driving assistance features. However, the development of reliable and trustworthy autonomous systems poses challenges such as high complexity, prolonged training periods, and intrinsic levels of uncertainty. Mini Autonomous Cars (MACs) are used as a practical testbed, enabling validation of autonomous control methodologies on small-scale setups. This simplified and cost-effective environment facilitates rapid evaluation and comparison of machine learning models, which is particularly useful for algorithms requiring online training. To address these challenges, this work presents a methodology based on RGB-D information and three-dimensional convolutional neural networks (3D CNNs) for MAC autonomous driving in simulated environments. We evaluate the proposed approach against recurrent neural networks (RNNs), with architectures trained and tested on two simulated tracks with distinct environmental features. Performance was assessed using task completion success, lap-time metrics, and driving consistency. Results highlight how architectural modifications and track complexity influence the models’ generalization capability and vehicle control performance. The proposed 3D CNN demonstrated promising results when compared with RNNs.

[CV-46] PHD: Personalized 3D Human Body Fitting with Point Diffusion ICCV2025

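Quick read: This paper addresses a weakness of traditional 3D human mesh recovery (HMR) methods: they are user-agnostic, and ignoring person-specific body shape reduces 3D pose accuracy. The proposed solution, PHD, is a staged, personalized HMR framework. It first calibrates the user's body shape, then fits the 3D pose conditioned on that shape, using a Point Diffusion Transformer as a shape-conditioned pose prior. A Point Distillation Sampling loss guides the pose fitting iteratively. This reduces over-reliance on 2D image constraints and improves both pelvis-aligned and absolute pose accuracy.

Link: https://arxiv.org/abs/2508.21257
Authors: Hsuan-I Ho, Chen Guo, Po-Chen Wu, Ivan Shugurov, Chengcheng Tang, Abhay Mittal, Sizhe An, Manuel Kaufmann, Linguang Zhang
Affiliations: ETH Zürich; Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025, 19 pages, 18 figures

Click to view abstract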

Abstract:We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user’s body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy – an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Project page: this https URL
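
The abstract distinguishes pelvis-aligned pose accuracy from absolute pose accuracy. The sketch below shows the difference using the mean per-joint position error, a common metric for 3D pose. The function names, the choice of joint 0 as the pelvis, and the toy skeleton in meters are assumptions for this example.

```python
import math


def mpjpe(pred, target):
    """Mean Euclidean distance between corresponding 3D joints."""
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(target)


def pelvis_aligned(joints, pelvis_index=0):
    """Translate the skeleton so that the pelvis joint sits at the origin."""
    root = joints[pelvis_index]
    return [tuple(c - r for c, r in zip(j, root)) for j in joints]


# Toy three-joint skeleton; the prediction is the ground truth shifted
# by a constant offset, i.e. correct pose but wrong global position.
target = [(0.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.2, 1.0, 0.0)]
offset = (0.3, 0.0, 0.4)
pred = [tuple(c + o for c, o in zip(j, offset)) for j in target]

absolute_error = mpjpe(pred, target)
aligned_error = mpjpe(pelvis_aligned(pred), pelvis_aligned(target))
print(round(absolute_error, 3), round(aligned_error, 3))
```

The pelvis-aligned error is zero even though every joint is half a meter from its true position, which is why a method can score well on the aligned metric while still misplacing the body in space.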

[CV-47] Reverse Imaging for Wide-spectrum Generalization of Cardiac MRI Segmentation

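Quick read: This paper addresses the poor generalization of pretrained cardiac MRI segmentation models across imaging sequences. Image contrast varies considerably with the imaging protocol, yet all images are governed by the same spin properties (proton density, T1, and T2). The key idea is Reverse Imaging, a physics-driven method that infers the underlying spin properties from observed cardiac MRI images by solving ill-posed nonlinear inverse problems regularized by a prior over spin properties. The prior is learned with a generative diffusion model trained on the multiparametric SAturation-recovery single-SHot acquisition (mSASHA) dataset. The estimated spin properties allow images of arbitrary new sequences to be synthesized, which improves segmentation accuracy across protocols.

Link: https://arxiv.org/abs/2508.21254
Authors: Yidong Zhao, Peter Kellman, Hui Xue, Tongyun Yang, Yi Zhang, Yuchi Han, Orlando Simonetti, Qian Tao
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract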

Abstract:Pretrained segmentation models for cardiac magnetic resonance imaging (MRI) struggle to generalize across different imaging sequences due to significant variations in image contrast. These variations arise from changes in imaging protocols, yet the same fundamental spin properties, including proton density, T1, and T2 values, govern all acquired images. With this core principle, we introduce Reverse Imaging, a novel physics-driven method for cardiac MRI data augmentation and domain adaptation to fundamentally solve the generalization problem. Our method reversely infers the underlying spin properties from observed cardiac MRI images, by solving ill-posed nonlinear inverse problems regularized by the prior distribution of spin properties. We acquire this “spin prior” by learning a generative diffusion model from the multiparametric SAturation-recovery single-SHot acquisition sequence (mSASHA) dataset, which offers joint cardiac T1 and T2 maps. Our method enables approximate but meaningful spin-property estimates from MR images, which provide an interpretable “latent variable” that leads to highly flexible image synthesis of arbitrary novel sequences. We show that Reverse Imaging enables highly accurate segmentation across vastly different image contrasts and imaging protocols, realizing wide-spectrum generalization of cardiac MRI segmentation.
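
The premise that one set of spin properties yields very different contrasts under different protocols can be seen with the simplified textbook spin-echo signal equation, S = PD · (1 − e^(−TR/T1)) · e^(−TE/T2). The sketch below is a stand-in for illustration only: the paper's actual forward models are not described in the abstract, and the tissue values and sequence timings are made-up numbers rather than measured ones.

```python
import math


def spin_echo_signal(pd, t1, t2, tr, te):
    """Simplified spin-echo signal; times in milliseconds."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)


# Hypothetical spin properties (proton density, T1, T2) of two tissues.
tissue_a = {"pd": 0.8, "t1": 600.0, "t2": 50.0}
tissue_b = {"pd": 1.0, "t1": 1600.0, "t2": 250.0}

# Two hypothetical protocols: short TR/TE and long TR/TE.
t1_weighted = {"tr": 500.0, "te": 10.0}
t2_weighted = {"tr": 4000.0, "te": 100.0}

a_t1w = spin_echo_signal(**tissue_a, **t1_weighted)
b_t1w = spin_echo_signal(**tissue_b, **t1_weighted)
a_t2w = spin_echo_signal(**tissue_a, **t2_weighted)
b_t2w = spin_echo_signal(**tissue_b, **t2_weighted)
print(a_t1w > b_t1w, a_t2w < b_t2w)
```

Tissue A is the brighter of the two under the short-TR protocol and the darker under the long-TE protocol. A segmentation model trained on one contrast therefore sees a different image under the other, even though the underlying tissue is unchanged. Recovering (PD, T1, T2) from an image means running this mapping backwards, which is the ill-posed inverse problem the abstract refers to.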

[CV-48] Lightweight MRI-Based Automated Segmentation of Pancreatic Cancer with Auto3DSeg MICCAI

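Quick read: This paper addresses automated segmentation of pancreatic tumors in MRI to support diagnosis, treatment planning, and outcome assessment. The solution trains and evaluates SegResNet models from the Auto3DSeg framework on two MRI tasks: Task 1 uses 91 T1-weighted arterial contrast-enhanced MRI cases and Task 2 uses 50 T2-weighted MR-Linac cases. The algorithm applies 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region of interest (ROI). Performance on these small datasets is modest (DSC of 0.56 for Task 1 and 0.33 for Task 2), but the results indicate potential and underline the need for larger, standardized MRI datasets to improve robustness and clinical utility.

Link: https://arxiv.org/abs/2508.21227
Authors: Keshav Jha, William Sharp, Dominic LaBella
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, 3 tables, MICCAI

Click to view abstract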

Abstract:Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The Pancreatic Tumor Segmentation on Diagnostic MRI task 1 training set included 91 T1-weighted arterial contrast-enhanced MRI with expert annotated pancreas and tumor labels. The Pancreatic Tumor Segmentation on MR-Linac task 2 training set used 50 T2-weighted MR-Linac cases with expert annotated pancreas and tumor labels. Algorithm-automated segmentation performance of pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences. Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
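
The primary metric reported above is the Dice Similarity Coefficient, DSC = 2|A ∩ B| / (|A| + |B|), computed between the predicted and the expert-annotated tumor masks. A minimal sketch on flattened binary masks follows. The function name, the toy masks and the convention of returning 1.0 when both masks are empty are assumptions, not the challenge's evaluation code.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient between two binary masks (flat 0/1 lists)."""
    overlap = sum(p and t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total


predicted = [1, 1, 1, 0, 0, 0]
annotated = [0, 1, 1, 1, 0, 0]
score = dice(predicted, annotated)
print(round(score, 3))
```

Two of the three predicted voxels overlap the annotation, giving a DSC of about 0.667. The reported scores of 0.56 and 0.33 correspond to noticeably weaker overlap than in this toy case.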

[CV-49] Generalizable Object Re-Identification via Visual In-Context Prompting ICCV2025

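Quick read: This paper addresses the poor generalization of current domain-specific object re-identification (ReID) methods (for example, for persons or vehicles) to unseen categories, and their need for large amounts of labeled data for retraining. The core challenge is recognizing new categories from a few in-context examples without parameter adaptation. The proposed Visual In-Context Prompting (VICP) framework combines a large language model (LLM) with a vision foundation model (VFM): the LLM infers semantic identity rules from a few positive and negative pairs and produces dynamic visual prompts that guide the VFM to extract identity-discriminative features. Aligning the LLM's semantic concepts with the VFM's pretrained prior enables generalization to novel categories without retraining.

Link: https://arxiv.org/abs/2508.21222
Authors: Zhizhong Huang, Xiaoming Liu
Affiliation: Michigan State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025

Click to view abstract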

Abstract:Current object re-identification (ReID) methods train domain-specific models (e.g., for persons or vehicles), which lack generalization and demand costly labeled data for new categories. While self-supervised learning reduces annotation needs by learning instance-wise invariance, it struggles to capture identity-sensitive features critical for ReID. This paper proposes Visual In-Context Prompting (VICP), a novel framework where models trained on seen categories can directly generalize to unseen novel categories using only in-context examples as prompts, without requiring parameter adaptation. VICP synergizes LLMs and vision foundation models (VFM): LLMs infer semantic identity rules from few-shot positive/negative pairs through task-specific prompting, which then guides a VFM (e.g., DINO) to extract ID-discriminative features via dynamic visual prompts. By aligning LLM-derived semantic concepts with the VFM’s pre-trained prior, VICP enables generalization to novel categories, eliminating the need for dataset-specific retraining. To support evaluation, we introduce ShopID10K, a dataset of 10K object instances from e-commerce platforms, featuring multi-view images and cross-domain testing. Experiments on ShopID10K and diverse ReID benchmarks demonstrate that VICP outperforms baselines by a clear margin on unseen categories. Code is available at this https URL.

[CV-50] GCAV: A Global Concept Activation Vector Framework for Cross-Layer Consistency in Interpretability

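Quick read: This paper addresses the inconsistency of Concept Activation Vectors (CAVs) computed independently at different network layers, which makes cross-layer comparison unreliable and weakens explanations of a network's sensitivity to human-defined concepts. The proposed Global Concept Activation Vector (GCAV) framework uses contrastive learning to align concept representations across layers and an attention-based fusion mechanism to build a single, semantically consistent CAV. This significantly reduces the variance of TCAV scores while preserving concept relevance, giving more stable and reliable concept attributions.

Link: https://arxiv.org/abs/2508.21197
Authors: Zhenghao He, Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
Affiliation: University of Virginia
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract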

Abstract:Concept Activation Vectors (CAVs) provide a powerful approach for interpreting deep neural networks by quantifying their sensitivity to human-defined concepts. However, when computed independently at different layers, CAVs often exhibit inconsistencies, making cross-layer comparisons unreliable. To address this issue, we propose the Global Concept Activation Vector (GCAV), a novel framework that unifies CAVs into a single, semantically consistent representation. Our method leverages contrastive learning to align concept representations across layers and employs an attention-based fusion mechanism to construct a globally integrated CAV. By doing so, our method significantly reduces the variance in TCAV scores while preserving concept relevance, ensuring more stable and reliable concept attributions. To evaluate the effectiveness of GCAV, we introduce Testing with Global Concept Activation Vectors (TGCAV) as a method to apply TCAV to GCAV-based representations. We conduct extensive experiments on multiple deep neural networks, demonstrating that our method effectively mitigates concept inconsistency across layers, enhances concept localization, and improves robustness against adversarial perturbations. By integrating cross-layer information into a coherent framework, our method offers a more comprehensive and interpretable understanding of how deep learning models encode human-defined concepts. Code and models are available at this https URL.
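
The cross-layer inconsistency that GCAV targets is easiest to see in the TCAV score itself: the fraction of inputs whose class-logit gradient, taken at a given layer, has a positive component along that layer's CAV. The sketch below computes the score for two layers and the variance between them. The gradients and CAVs are made-up toy vectors, and the alignment and attention-based fusion steps of GCAV are not reproduced.

```python
from statistics import pvariance


def tcav_score(gradients, cav):
    """Fraction of examples whose logit gradient points along the CAV."""
    positive = sum(
        sum(g * c for g, c in zip(grad, cav)) > 0 for grad in gradients
    )
    return positive / len(gradients)


# Hypothetical per-example gradients of a class logit at two layers,
# with a hypothetical concept activation vector for each layer.
layer_1_grads = [(1.0, 0.0), (0.5, 0.2), (-1.0, 0.1), (0.3, -0.4)]
layer_1_cav = (1.0, 0.0)
layer_2_grads = [(0.2, -1.0), (0.1, -0.5), (0.0, 0.3), (1.0, -0.2)]
layer_2_cav = (0.0, 1.0)

scores = [
    tcav_score(layer_1_grads, layer_1_cav),
    tcav_score(layer_2_grads, layer_2_cav),
]
spread = pvariance(scores)
print(scores, spread)
```

Here the same concept appears influential at one layer (0.75) and not at the other (0.25). A lower variance across layers is the property the abstract reports for GCAV-based scores.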

[CV-51] Radially Distorted Homographies Revisited

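Quick read: This paper addresses accurate and efficient estimation of a homography between images in the presence of radial distortion. Real images are often geometrically distorted by the camera lens, so the homography and the radial distortion parameters must be estimated together. Previous methods handled the three configurations (distortion in only one image, identical distortion in both images, independent distortion in the two images) separately. The paper proposes a unified formulation that covers all three cases and uses it to construct new minimal solvers that are faster than existing state-of-the-art solvers while maintaining similar accuracy.

Link: https://arxiv.org/abs/2508.21190
Authors: Mårten Wadenbäck, Marcus Valtonen Örnhag, Johan Edstedt
Affiliations: Linköping University; Ericsson Research
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract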

Abstract:Homographies are among the most prevalent transformations occurring in geometric computer vision and projective geometry, and homography estimation is consequently a crucial step in a wide assortment of computer vision tasks. When working with real images, which are often afflicted with geometric distortions caused by the camera lens, it may be necessary to determine both the homography and the lens distortion (particularly the radial component, called radial distortion) simultaneously to obtain anything resembling useful estimates. When considering a homography with radial distortion between two images, there are three conceptually distinct configurations for the radial distortion: (i) distortion in only one image, (ii) identical distortion in the two images, and (iii) independent distortion in the two images. While these cases have been addressed separately in the past, the present paper provides a novel and unified approach to solve all three cases. We demonstrate how the proposed approach can be used to construct new fast, stable, and accurate minimal solvers for radially distorted homographies. In all three cases, our proposed solvers are faster than the existing state-of-the-art solvers while maintaining similar accuracy. The solvers are tested on well-established benchmarks including images taken with fisheye cameras. The source code for our solvers will be made available in the event our paper is accepted for publication.
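
The three configurations can be written as one constraint with two distortion parameters. The sketch below assumes the one-parameter division model, a common choice for minimal solvers, in which a distorted point x maps to x / (1 + λ‖x‖²) with the distortion center at the origin. The abstract does not name the distortion model, so the model, the function names and the numbers are assumptions for illustration.

```python
import math


def undistort(point, lam):
    """One-parameter division model, distortion center at the origin."""
    x, y = point
    scale = 1.0 + lam * (x * x + y * y)
    return (x / scale, y / scale)


def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def transfer_error(p1, p2, H, lam1, lam2):
    """Distance between H(undistort(p1)) and undistort(p2).

    lam2 == 0 is case (i), lam1 == lam2 is case (ii), and independent
    values give case (iii).
    """
    mapped = apply_homography(H, undistort(p1, lam1))
    return math.dist(mapped, undistort(p2, lam2))


H = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # toy translation
lam = -0.2                                                # illustrative value
p1 = (0.5, 0.0)
p2 = apply_homography(H, undistort(p1, lam))  # second image undistorted

with_model = transfer_error(p1, p2, H, lam, 0.0)
without_model = transfer_error(p1, p2, H, 0.0, 0.0)
print(with_model, round(without_model, 4))
```

With the correct distortion parameter the correspondence satisfies the homography exactly, while ignoring distortion leaves a residual even for the true homography. This is why the homography and the distortion have to be estimated jointly.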

[CV-52] SYNBUILD-3D: A large multi-modal and semantically rich synthetic dataset of 3D building models at Level of Detail 4

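Quick read: This paper addresses the limited accuracy and semantic richness of automatically generated 3D building models, which stems from the lack of large-scale annotated datasets. It introduces SYNBUILD-3D, a multi-modal dataset of over 6.2 million synthetic residential buildings. Each building is represented in three modalities: a semantically enriched 3D wireframe graph at Level of Detail (LoD) 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic information in the floor plan images (rooms, doors, and windows) can guide generation, supporting future generative AI algorithms that create 3D building models while enforcing semantic-geometric consistency.

Link: https://arxiv.org/abs/2508.21169
Authors: Kevin Mayer, Alex Vesel, Xinyi Zhao, Martin Fischer
Affiliation: Stanford University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract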

Abstract:3D building models are critical for applications in architecture, energy simulation, and navigation. Yet, generating accurate and semantically rich 3D buildings automatically remains a major challenge due to the lack of large-scale annotated datasets in the public domain. Inspired by the success of synthetic data in computer vision, we introduce SYNBUILD-3D, a large, diverse, and multi-modal dataset of over 6.2 million synthetic 3D residential buildings at Level of Detail (LoD) 4. In the dataset, each building is represented through three distinct modalities: a semantically enriched 3D wireframe graph at LoD 4 (Modality I), the corresponding floor plan images (Modality II), and a LiDAR-like roof point cloud (Modality III). The semantic annotations for each building wireframe are derived from the corresponding floor plan images and include information on rooms, doors, and windows. Through its tri-modal nature, future work can use SYNBUILD-3D to develop novel generative AI algorithms that automate the creation of 3D building models at LoD 4, subject to predefined floor plan layouts and roof geometries, while enforcing semantic-geometric consistency. Dataset and code samples are publicly available at this https URL.

[CV-53] RadGS-Reg: Registering Spine CT with Biplanar X-rays via Joint 3D Radiative Gaussians Reconstruction and 3D/3D Registration

【速读】:该论文旨在解决医学图像引导导航中CT/X射线(X-ray)配准问题,其核心挑战在于对高精度和实时性的严苛要求,传统“渲染与比较”方法因迭代投影与比对过程导致空间信息丢失及域差异问题,而现有基于双平面X射线的三维(3D)重建方法则受限于密集视角需求且难以应对噪声干扰。解决方案的关键在于提出RadGS-Reg框架,通过联合3D辐射高斯(Radiative Gaussians, RadGS)重建与3D/3D配准实现椎体层级的精准配准:首先利用带有反事实注意力学习(Counterfactual Attention Learning, CAL)机制的基于学习的RadGS重建模块,在噪声X射线中聚焦椎体区域以提升重建质量;其次引入患者特异性预训练策略,逐步从模拟数据向真实数据迁移的同时学习椎体形状先验知识,从而显著提升配准性能并超越现有方法。

Link: https://arxiv.org/abs/2508.21154
Authors: Ao Shen, Xueming Fu, Junfeng Jiang, Qiang Zeng, Ye Tang, Zhengming Chen, Luming Nong, Feng Wang, S. Kevin Zhou
Affiliations: Hohai University; University of Science and Technology of China
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 2 figures

Abstract:Computed Tomography (CT)/X-ray registration in image-guided navigation remains challenging because of its stringent requirements for high accuracy and real-time performance. Traditional “render and compare” methods, relying on iterative projection and comparison, suffer from spatial information loss and domain gap. 3D reconstruction from biplanar X-rays supplements spatial and shape information for 2D/3D registration, but current methods are limited by dense-view requirements and struggle with noisy X-rays. To address these limitations, we introduce RadGS-Reg, a novel framework for vertebral-level CT/X-ray registration through joint 3D Radiative Gaussians (RadGS) reconstruction and 3D/3D registration. Specifically, our biplanar X-rays vertebral RadGS reconstruction module explores a learning-based RadGS reconstruction method with a Counterfactual Attention Learning (CAL) mechanism, focusing on vertebral regions in noisy X-rays. Additionally, a patient-specific pre-training strategy progressively adapts RadGS-Reg from simulated to real data while simultaneously learning vertebral shape prior knowledge. Experiments on in-house datasets demonstrate the state-of-the-art performance for both tasks, surpassing existing methods. The code is available at: this https URL.

[CV-54] HiddenObject: Modality-Agnostic Fusion for Multimodal Hidden Object Detection

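[Quick Read]: This paper addresses the detection of hidden or partially occluded objects in multimodal environments, where occlusion, camouflage, and lighting variations severely limit traditional RGB-based detectors. The key to the solution is HiddenObject, a fusion framework that integrates RGB, thermal, and depth data through a Mamba-based fusion mechanism: it extracts modality-specific features and fuses them in a unified representation space, strengthening detection of occluded or camouflaged targets. Experiments show that the method matches or surpasses the state of the art on multiple benchmark datasets, supporting its effectiveness under visually degraded or complex conditions.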

Link: https://arxiv.org/abs/2508.21135
Authors: Harris Song, Tuan-Anh Vu, Sanjith Menon, Sriram Narasimhan, M. Khalid Jawed
Affiliations: University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Detecting hidden or partially concealed objects remains a fundamental challenge in multimodal environments, where factors like occlusion, camouflage, and lighting variations significantly hinder performance. Traditional RGB-based detection methods often fail under such adverse conditions, motivating the need for more robust, modality-agnostic approaches. In this work, we present HiddenObject, a fusion framework that integrates RGB, thermal, and depth data using a Mamba-based fusion mechanism. Our method captures complementary signals across modalities, enabling enhanced detection of obscured or camouflaged targets. Specifically, the proposed approach identifies modality-specific features and fuses them in a unified representation that generalizes well across challenging scenarios. We validate HiddenObject across multiple benchmark datasets, demonstrating state-of-the-art or competitive performance compared to existing methods. These results highlight the efficacy of our fusion design and expose key limitations in current unimodal and naïve fusion strategies. More broadly, our findings suggest that Mamba-based fusion architectures can significantly advance the field of multimodal object detection, especially under visually degraded or complex conditions.

[CV-55] R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

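[Quick Read]: This paper addresses the computational inefficiency of Multimodal Large Language Models (MLLMs) that are forced to reason step by step even on simple problems. The key to the solution is R-4B, an auto-thinking model that decides adaptively when to think. Bi-mode annealing gives the model both thinking and non-thinking capabilities, and Bi-mode Policy Optimization (BPO) improves its accuracy in deciding whether to activate the reasoning process. Specifically, R-4B is first trained on a curated, diverse dataset containing samples from both thinking and non-thinking modes. It then undergoes a second training phase under an improved GRPO framework, in which the policy model is forced to generate responses from both modes for each input query.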

Link: https://arxiv.org/abs/2508.21113
Authors: Jie Jiang, Qi Yang, Bolin Ni, Shiming Xiang, Han Hu, Houwen Peng
Affiliations: Tencent Hunyuan Team; Institute of Automation, CAS
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 20 pages, 14 figures, 5 tables

Abstract:Multimodal Large Language Models (MLLMs) equipped with step-by-step thinking capabilities have demonstrated remarkable performance on complex reasoning problems. However, this thinking process is redundant for simple problems solvable without complex reasoning. To address this inefficiency, we propose R-4B, an auto-thinking MLLM, which can adaptively decide when to think based on problem complexity. The central idea of R-4B is to empower the model with both thinking and non-thinking capabilities using bi-mode annealing, and apply Bi-mode Policy Optimization (BPO) to improve the model’s accuracy in determining whether to activate the thinking process. Specifically, we first train the model on a carefully curated dataset spanning various topics, which contains samples from both thinking and non-thinking modes. Then it undergoes a second phase of training under an improved GRPO framework, where the policy model is forced to generate responses from both modes for each input query. Experimental results show that R-4B achieves state-of-the-art performance across 25 challenging benchmarks. It outperforms Qwen2.5-VL-7B in most tasks and achieves performance comparable to larger models such as Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
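
GRPO scores each rollout relative to the other rollouts sampled for the same query. A minimal sketch of that group-relative normalization, applied to a group that mixes thinking and non-thinking rollouts, might look as follows. It assumes the common `(reward - mean) / std` form of the advantage and assumes that both modes are pooled into a single group; the abstract does not give the details of BPO, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bi_mode_group(think_rewards, no_think_rewards):
    """Pool rollouts from both modes for one query and normalize them together.

    Pooling the two modes into one group is an assumption made for illustration.
    """
    rewards = list(think_rewards) + list(no_think_rewards)
    modes = ["think"] * len(think_rewards) + ["no_think"] * len(no_think_rewards)
    return list(zip(modes, group_advantages(rewards)))
```

Under this assumption, a mode that earns the same reward with less effort is not penalized by the normalization itself; any preference for shorter answers would have to come from how the reward is defined.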

[CV-56] GENNAV: Polygon Mask Generation for Generalized Referring Navigable Regions

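[Quick Read]: This paper addresses the task of locating target regions from a natural language instruction and a front camera image captured by a mobile platform. It focuses on "stuff-type" target regions with ambiguous boundaries (such as grass or road surfaces) and on cases with absent or multiple targets, which require both existence prediction and segmentation. Existing methods perform poorly in these settings. The key to the solution is GENNAV, a model that predicts whether a target exists and generates segmentation masks for multiple stuff-type target regions. GENNAV outperforms baseline methods on the newly constructed GRiN-Drive benchmark, and real-world experiments with four automobiles in five geographically distinct urban areas validate its zero-shot transfer performance and robustness.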

Link: https://arxiv.org/abs/2508.21102
Authors: Kei Katsumata, Yui Iioka, Naoki Hosomi, Teruhisa Misu, Kentaro Yamada, Komei Sugiura
Affiliations: Keio University; Honda R&D Co., Ltd.; Honda Research Institute USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted for presentation at CoRL2025

Abstract:We focus on the task of identifying the location of target regions from a natural language instruction and a front camera image captured by a mobility. This task is challenging because it requires both existence prediction and segmentation, particularly for stuff-type target regions with ambiguous boundaries. Existing methods often underperform in handling stuff-type target regions, in addition to absent or multiple targets. To overcome these limitations, we propose GENNAV, which predicts target existence and generates segmentation masks for multiple stuff-type target regions. To evaluate GENNAV, we constructed a novel benchmark called GRiN-Drive, which includes three distinct types of samples: no-target, single-target, and multi-target. GENNAV achieved superior performance over baseline methods on standard evaluation metrics. Furthermore, we conducted real-world experiments with four automobiles operated in five geographically distinct urban areas to validate its zero-shot transfer performance. In these experiments, GENNAV outperformed baseline methods and demonstrated its robustness across diverse real-world environments. The project page is available at this https URL.

[CV-57] Safe-Control: A Safety Patch for Mitigating Unsafe Content in Text-to-Image Generation Models

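[Quick Read]: This paper addresses the safety risks that arise when Text-to-Image (T2I) generation models are misused or abused, and in particular the tendency of existing safety mechanisms either to be evaded under distribution shifts or to require extensive model-specific adjustment. The key to the solution is Safe-Control, a plug-and-play safety patch that uses data-driven strategies and safety-aware conditions to inject safety control signals into the locked T2I model, suppressing unsafe content in the manner of a patch update. The method leaves the original model unmodified, is compatible with other T2I models of similar denoising architecture, and allows multiple safety patches to be merged into one. Evaluated on six T2I models, it reduces the probability of unsafe content generation to 7% while preserving the quality and text alignment of benign images, outperforming seven state-of-the-art external and internal defenses.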

Link: https://arxiv.org/abs/2508.21099
Authors: Xiangtao Meng, Yingkai Dong, Ning Yu, Li Wang, Zheng Li, Shanqing Guo
Affiliations: Shandong University; Netflix Eyeline Studios
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Despite the advancements in Text-to-Image (T2I) generation models, their potential for misuse or even abuse raises serious safety concerns. Model developers have made tremendous efforts to introduce safety mechanisms that can address these concerns in T2I models. However, the existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, an innovative plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct various safety patches to meet the evolving safety requirements, which can be flexibly merged into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models of similar denoising architecture. We conduct extensive evaluations on six diverse and public T2I models. Empirical results highlight that Safe-Control is effective in reducing unsafe content generation across six diverse T2I models with similar generative architectures, yet it successfully maintains the quality and text alignment of benign images. Compared to seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.

[CV-58] ROBUST-MIPS: A Combined Skeletal Pose and Instance Segmentation Dataset for Laparoscopic Surgical Instruments

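[Quick Read]: This paper addresses the dependence of deep learning models for surgical tool localisation on large amounts of diverse annotated data, a constraint that limits progress in computer-assisted interventional technologies. The key to the solution is a more efficient annotation style, skeletal pose annotation. Compared with conventional tool instance segmentation, skeletal pose annotation retains rich semantic information while being much easier to produce, which can accelerate the growth of annotated data. To support this claim, the authors build ROBUST-MIPS, a dataset that combines tool pose and instance segmentation annotations, and release benchmark models and custom annotation software alongside it, enabling head-to-head comparison of the two annotation styles.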

Link: https://arxiv.org/abs/2508.21096
Authors: Zhe Han, Charlie Budd, Gongyu Zhang, Huanyu Tian, Christos Bergeles, Tom Vercauteren
Affiliations: King’s College London, School of Biomedical Engineering & Imaging Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Localisation of surgical tools constitutes a foundational building block for computer-assisted interventional technologies. Works in this field typically focus on training deep learning models to perform segmentation tasks. Performance of learning-based approaches is limited by the availability of diverse annotated data. We argue that skeletal pose annotations are a more efficient annotation approach for surgical tools, striking a balance between richness of semantic information and ease of annotation, thus allowing for accelerated growth of available annotated data. To encourage adoption of this annotation style, we present ROBUST-MIPS, a combined tool pose and tool instance segmentation dataset derived from the existing ROBUST-MIS dataset. Our enriched dataset facilitates the joint study of these two annotation styles and allows head-to-head comparison on various downstream tasks. To demonstrate the adequacy of pose annotations for surgical tool localisation, we set up a simple benchmark using popular pose estimation methods and observe high-quality results. To ease adoption, together with the dataset, we release our benchmark models and custom tool pose annotation software.

[CV-59] ScanMove: Motion Prediction and Transfer for Unregistered Body Meshes

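[Quick Read]: This paper addresses the difficulty of automatically computing plausible deformations for unregistered surface meshes, especially raw 3D scans, which lack point-wise correspondences and contain noise. The key to the solution is a rig-free, data-driven framework that couples a robust motion embedding network with a learned per-vertex feature field to generate a spatio-temporal deformation field that drives the mesh deformation. Quantitative and qualitative evaluations on tasks such as walking and running demonstrate the effectiveness and versatility of the method.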

Link: https://arxiv.org/abs/2508.21095
Authors: Thomas Besnier, Sylvain Arguillère, Mohamed Daoudi
Affiliations: Unknown
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Unregistered surface meshes, especially raw 3D scans, present significant challenges for automatic computation of plausible deformations due to the lack of established point-wise correspondences and the presence of noise in the data. In this paper, we propose a new, rig-free, data-driven framework for motion prediction and transfer on such body meshes. Our method couples a robust motion embedding network with a learned per-vertex feature field to generate a spatio-temporal deformation field, which drives the mesh deformation. Extensive evaluations, including quantitative benchmarks and qualitative visuals on tasks such as walking and running, demonstrate the effectiveness and versatility of our approach on challenging unregistered meshes.

[CV-60] Video-LLMs with Temporal Visual Screening

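[Quick Read]: This paper addresses the difficulty that current Video Large Language Models (Video-LLMs) have in capturing fine-grained temporal semantics, which stems from sparse frame sampling and insufficient inter-frame reasoning supervision during training. The key to the solution is Temporal Visual Screening (TVS), a new pre-processing task inspired by cognitive science. TVS retains focus-critical video segments, synchronously reconstructs queries into their most direct form while preserving answer consistency, and keeps every possible answer invariant, thereby redistributing reasoning burden and cognitive load. Formulated as a modular front-end adapter task, TVS integrates into both video instruction tuning (training) and video question answering (inference) pipelines: during training it aligns queries with focus-critical visual information, and at inference it enables query-aware segment focus and streamlined query representations, improving video-language understanding.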

Link: https://arxiv.org/abs/2508.21094
Authors: Zheyu Fan, Jiateng Liu, Yuji Zhang, Zihan Wang, Yi R. (May) Fung, Manling Li, Heng Ji
Affiliations: University of Illinois Urbana-Champaign; Northwestern University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Humans naturally perform temporal screening by dragging the progress bar and focusing on salient temporal segments, but current Video Large Language Models (Video-LLMs) struggle to capture fine-grained temporal semantics due to sparse frame sampling and insufficient inter-frame reasoning supervision during their training. To address this, inspired by well-established cognitive science principles, we propose Temporal Visual Screening (TVS), a new task that universally pre-processes video question answering and instruction tuning data by: (1) retaining focus-critical video segments, (2) synchronously reconstructing queries to their most direct form while preserving answer consistency, and (3) keeping the invariance and consistency for any possible answer. TVS is formulated as a modular front-end adapter task that can be seamlessly integrated into both Video Instruction Tuning (training) and Video Question Answering (inference) pipelines. TVS optimizes the distribution of reasoning burden and cognitive load; during training, it aligns queries with focus-critical visual information; at inference, it enables query-aware segment focus and streamlined query representations. In particular, we curate the first benchmark for TVS and propose ReSimplifyIt, a baseline outperforming prior approaches on seemingly similar tasks by 0.47 in F-1 score on video trimming while achieving competitive query rewriting performance. Experiments demonstrate that incorporating TVS yields relative gains of 7.33% (training) and 34.6% (inference), demonstrating the effectiveness of temporal information screening for improving video-language understanding.

[CV-61] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

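[Quick Read]: This paper addresses the high computational cost of diffusion model inference caused by its iterative nature. Feature caching accelerates inference by reusing intermediate outputs across timesteps, but naive reuse often degrades quality noticeably. The authors analyze the cumulative error introduced by caching and decompose it into two components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. The key to the solution is ERTACache, a framework built on three mechanisms: (1) an offline residual profiling stage that identifies reusable steps; (2) a trajectory-aware correction coefficient that dynamically adjusts integration intervals; and (3) a closed-form residual linearization model that analytically approximates cache-induced errors. This design balances accuracy and efficiency under aggressive cache reuse, achieving up to 2x inference speedup on image and video generation benchmarks while preserving or improving visual quality.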

Link: https://arxiv.org/abs/2508.21091
Authors: Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at this https URL.
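
The general idea of profiling reusable steps offline and then skipping model calls at inference can be shown with a toy sampler. The sketch below is not ERTACache: it covers only naive reuse with a relative-change threshold and omits the paper's correction coefficient and residual linearization. `toy_model`, the update rule, and `threshold` are hypothetical stand-ins.

```python
import math


def toy_model(x, t):
    """Hypothetical stand-in for an expensive denoiser call."""
    return [math.tanh(v) * (1.0 + 0.5 * t) for v in x]


def relative_change(a, b):
    """L1 difference between two outputs, relative to the magnitude of b."""
    num = sum(abs(p - q) for p, q in zip(a, b))
    den = sum(abs(q) for q in b) or 1.0
    return num / den


def step(x, out):
    """Toy update rule applied at every timestep."""
    return [v - 0.1 * o for v, o in zip(x, out)]


def profile_reusable_steps(x0, timesteps, threshold):
    """Offline pass: mark steps whose output barely differs from the previous step's."""
    reusable, x, prev = set(), list(x0), None
    for t in timesteps:
        out = toy_model(x, t)
        if prev is not None and relative_change(out, prev) < threshold:
            reusable.add(t)
        prev = out
        x = step(x, out)
    return reusable


def sample_with_cache(x0, timesteps, reusable):
    """Inference pass: reuse the cached output on steps marked reusable."""
    x, cached, calls = list(x0), None, 0
    for t in timesteps:
        if t in reusable and cached is not None:
            out = cached
        else:
            out = toy_model(x, t)
            cached = out
            calls += 1
        x = step(x, out)
    return x, calls
```

A larger `threshold` marks more steps as reusable and saves more model calls, at the price of the accumulated error that the abstract describes.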

[CV-62] Q-Align: Alleviating Attention Leakage in Zero-Shot Appearance Transfer via Query-Query Alignment

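[Quick Read]: This paper addresses attention leakage in zero-shot appearance transfer, which arises when the semantic mapping between two images is captured by Query-Key alignment and leads to inaccurate feature matching. The key to the solution is Q-Align, which uses Query-Query alignment for finer spatial semantic mapping, Key-Value rearrangement to strengthen feature correspondence, and attention refinement with the rearranged keys and values, improving appearance fidelity while maintaining competitive structure preservation.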

Link: https://arxiv.org/abs/2508.21090
Authors: Namu Kim, Wonbin Kweon, Minsoo Kim, Hwanjo Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We observe that zero-shot appearance transfer with large-scale image generation models faces a significant challenge: Attention Leakage. This challenge arises when the semantic mapping between two images is captured by the Query-Key alignment. To tackle this issue, we introduce Q-Align, utilizing Query-Query alignment to mitigate attention leakage and improve the semantic alignment in zero-shot appearance transfer. Q-Align incorporates three core contributions: (1) Query-Query alignment, facilitating the sophisticated spatial semantic mapping between two images; (2) Key-Value rearrangement, enhancing feature correspondence through realignment; and (3) Attention refinement using rearranged keys and values to maintain semantic consistency. We validate the effectiveness of Q-Align through extensive experiments and analysis, and Q-Align outperforms state-of-the-art methods in appearance fidelity while maintaining competitive structure preservation.
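
The first two contributions can be pictured with a toy example: match each query of one image to its most similar query in the other image, then reorder that image's keys and values by the match. The sketch below assumes cosine similarity with an argmax match, which the abstract does not specify; the function names are hypothetical and the real method operates on attention features inside a generation model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def query_query_match(target_q, source_q):
    """For each target query, return the index of the most similar source query."""
    return [
        max(range(len(source_q)), key=lambda j: cosine(q, source_q[j]))
        for q in target_q
    ]


def rearrange(items, match):
    """Reorder source keys or values so position i holds the item matched to target i."""
    return [items[j] for j in match]
```

For example, with target queries `[[1, 0], [0, 1]]` and source queries `[[0, 2], [3, 0.1]]`, the match is `[1, 0]`, so the source values are swapped before attention is computed.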

[CV-63] Advanced Deep Learning Techniques for Classifying Dental Conditions Using Panoramic X-Ray Images

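[Quick Read]: This paper addresses the automated classification of dental conditions in panoramic X-ray images, with the aim of improving diagnostic efficiency and accuracy. The key to the solution is a hybrid architecture that combines the feature extraction of a Convolutional Neural Network (CNN) with traditional ensemble classifiers such as Random Forest. The hybrid CNN and Random Forest model reaches 85.4% accuracy, above the custom CNN baseline (74.3%) and the best pre-trained model (VGG16, 82.3%), and it discriminates morphologically similar conditions better, offering a practical path toward automated dental diagnostic support.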

Link: https://arxiv.org/abs/2508.21088
Authors: Alireza Golkarieh, Kiana Kiashemshaki, Sajjad Rezvani Boroujeni
Affiliations: Oakland University; Bowling Green State University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 14 pages, 8 figures, 8 tables

Abstract:This study investigates deep learning methods for automated classification of dental conditions in panoramic X-ray images. A dataset of 1,512 radiographs with 11,137 expert-verified annotations across four conditions (fillings, cavities, implants, and impacted teeth) was used. After preprocessing and class balancing, three approaches were evaluated: a custom convolutional neural network (CNN), hybrid models combining CNN feature extraction with traditional classifiers, and fine-tuned pre-trained architectures. Experiments employed 5-fold cross-validation with accuracy, precision, recall, and F1 score as evaluation metrics. The hybrid CNN-Random Forest model achieved the highest performance with 85.4% accuracy, surpassing the custom CNN baseline of 74.3%. Among pre-trained models, VGG16 performed best at 82.3% accuracy, followed by Xception and ResNet50. Results show that hybrid models improve discrimination of morphologically similar conditions and provide efficient, reliable performance. These findings suggest that combining CNN-based feature extraction with ensemble classifiers offers a practical path toward automated dental diagnostic support, while also highlighting the need for larger datasets and further clinical validation.
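
The evaluation protocol named in the abstract (5-fold cross-validation with accuracy, precision, recall, and F1) can be sketched in plain Python. This is an illustration only: the unshuffled contiguous folds and the macro averaging are assumptions, since the abstract does not say how folds were drawn or how per-class scores were averaged.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 over all classes."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

In a full pipeline, each fold would train the classifier on `train_idx`, predict on `test_idx`, and average the resulting metric dictionaries across the five folds.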

[CV-64] 2COOOL: 2nd Workshop on the Challenge Of Out-Of-Label Hazards in Autonomous Driving ICCV2025

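[Quick Read]: This paper addresses the safety risks that autonomous driving systems still face in unseen, out-of-distribution hazard scenarios, one of the core barriers to fully safe self-driving. The key to the solution is to advance novelty handling: developing methods in anomaly detection, open-set recognition, open-vocabulary modeling, and domain adaptation, and building new benchmarks and evaluation methodologies to strengthen vision-language models for hazard understanding, thereby improving perception, decision making, and hazard avoidance in complex real-world environments.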

Link: https://arxiv.org/abs/2508.21080
Authors: Ali K. AlShami, Ryan Rabinowitz, Maged Shoman, Jianwu Fang, Lukas Picek, Shao-Yuan Lo, Steve Cruz, Khang Nhut Lam, Nachiket Kamod, Lei-Lei Li, Jugal Kalita, Terrance E. Boult
Affiliations: University of Colorado Colorado Springs; University of Tennessee–Oak Ridge Innovation Institute; Xi’an Jiaotong University; University of West Bohemia; Honda Research Institute USA; University of Notre Dame; Can Tho University
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 11 pages, 2 figures, Accepted to ICCV 2025 Workshop on Out-of-Label Hazards in Autonomous Driving (2COOOL)

Abstract:As the computer vision community advances autonomous driving algorithms, integrating vision-based insights with sensor data remains essential for improving perception, decision making, planning, prediction, simulation, and control. Yet we must ask: Why don’t we have entirely safe self-driving cars yet? A key part of the answer lies in addressing novel scenarios, one of the most critical barriers to real-world deployment. Our 2COOOL workshop provides a dedicated forum for researchers and industry experts to push the state of the art in novelty handling, including out-of-distribution hazard detection, vision-language models for hazard understanding, new benchmarking and methodologies, and safe autonomous driving practices. The 2nd Workshop on the Challenge of Out-of-Label Hazards in Autonomous Driving (2COOOL) will be held at the International Conference on Computer Vision (ICCV) 2025 in Honolulu, Hawaii, on October 19, 2025. We aim to inspire the development of new algorithms and systems for hazard avoidance, drawing on ideas from anomaly detection, open-set recognition, open-vocabulary modeling, domain adaptation, and related fields. Building on the success of its inaugural edition at the Winter Conference on Applications of Computer Vision (WACV) 2025, the workshop will feature a mix of academic and industry participation.

[CV-65] QuadKAN: KAN-Enhanced Quadruped Motion Control via End-to-End Reinforcement Learning

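[Quick Read]: This paper addresses robust vision-guided motion control for quadruped robots, in particular efficient, stable, and low-energy locomotion over complex terrain with dynamic obstacles. The core difficulty is that vision alone is sensitive to noise, while proprioception alone cannot adapt to environmental changes. The key to the solution is QuadKAN, a spline-parameterized cross-modal policy built on Kolmogorov-Arnold Networks (KANs) to model the state-to-action mapping. A spline encoder processes proprioceptive input and a spline fusion head combines visual and proprioceptive information, matching the piecewise-smooth nature of gait, which improves sample efficiency, reduces action jitter and energy consumption, and provides interpretable posture-action sensitivities.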

Link: https://arxiv.org/abs/2508.19153
Authors: Allen Wang, Gavin Tao
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
Comments: 14 pages, 9 figures, Journal paper

Abstract:We address vision-guided quadruped motion control with reinforcement learning (RL) and highlight the necessity of combining proprioception with vision for robust control. We propose QuadKAN, a spline-parameterized cross-modal policy instantiated with Kolmogorov-Arnold Networks (KANs). The framework incorporates a spline encoder for proprioception and a spline fusion head for proprioception-vision inputs. This structured function class aligns the state-to-action mapping with the piecewise-smooth nature of gait, improving sample efficiency, reducing action jitter and energy consumption, and providing interpretable posture-action sensitivities. We adopt Multi-Modal Delay Randomization (MMDR) and perform end-to-end training with Proximal Policy Optimization (PPO). Evaluations across diverse terrains, including both even and uneven surfaces and scenarios with static or dynamic obstacles, demonstrate that QuadKAN achieves consistently higher returns, greater distances, and fewer collisions than state-of-the-art (SOTA) baselines. These results show that spline-parameterized policies offer a simple, effective, and interpretable alternative for robust vision-guided locomotion. A repository will be made available upon acceptance.
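
A KAN places a learnable univariate function on each input-output edge and sums the results at each output node. The sketch below illustrates that structure with piecewise-linear interpolation as a simplified stand-in for the splines; it is not the QuadKAN policy, and the class and function names are hypothetical.

```python
from bisect import bisect_right


class LinearSpline:
    """Piecewise-linear function on fixed, strictly increasing knots.

    `values` plays the role of the learnable parameters (one per knot).
    """

    def __init__(self, knots, values):
        if len(knots) != len(values) or len(knots) < 2:
            raise ValueError("need at least two knots and one value per knot")
        self.knots = list(knots)
        self.values = list(values)

    def __call__(self, x):
        # Clamp to the knot range, then interpolate inside the enclosing segment.
        x = min(max(x, self.knots[0]), self.knots[-1])
        i = min(bisect_right(self.knots, x) - 1, len(self.knots) - 2)
        x0, x1 = self.knots[i], self.knots[i + 1]
        w = (x - x0) / (x1 - x0)
        return (1 - w) * self.values[i] + w * self.values[i + 1]


def kan_layer(inputs, splines):
    """splines[o][i] maps input i to its contribution to output o; outputs are sums."""
    return [sum(f(x) for f, x in zip(row, inputs)) for row in splines]
```

Because each edge function can be plotted on its own, one can read off how a single input, such as a joint angle, influences a given output, which is the kind of interpretability the abstract refers to.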

Artificial Intelligence

[AI-0] Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture

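[Quick Read]: This paper addresses the accuracy of automatically interpreting clinical notes (the S and O sections of SOAP notes) in order to improve clinical decision support. Because single-model approaches can lack robustness in high-stakes medical tasks, the authors propose a collaborative Multi-Agent System (MAS). Its key idea is to model the diagnostic reasoning of a clinical consultation team: a Manager agent dynamically assembles a team of Specialist agents, who reach consensus through a hierarchical, iterative debate. This structure improves the identification of congestive heart failure, acute kidney injury, and sepsis, and it makes the results more interpretable and the weighing of evidence more transparent.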

Link: https://arxiv.org/abs/2508.21803
Authors: Yeawon Lee, Xiaoyang Wang, Christopher C. Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: Accepted to The 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025) (Poster Paper)

Abstract:Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team’s reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.

[AI-1] Tree-Guided Diffusion Planner

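[Quick Read]: This paper addresses the challenges of test-time guided control with pretrained diffusion models. Standard gradient guidance performs poorly under non-convex objectives, non-differentiable constraints, and multi-reward structures, and existing supervised planning methods rely on task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. The key to the solution is the Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through a bi-level sampling process. It first generates diverse parent trajectories via training-free particle guidance to encourage broad exploration, and then refines sub-trajectories through fast conditional denoising guided by task objectives, expanding the solution space and exploiting gradient information using only pretrained models and test-time reward signals.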

Link: https://arxiv.org/abs/2508.21800
Authors: Hyeonseong Jeon, Cheolhong Min, Jaesik Park
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 20 pages, 11 figures, 14 tables (main paper + appendix) / under review / project page will be available after the paper becomes public on arXiv

Abstract:Planning with pretrained diffusion models has emerged as a promising approach for solving test-time guided control problems. However, standard gradient guidance typically performs optimally under convex and differentiable reward landscapes, showing substantially reduced effectiveness in real-world scenarios involving non-convex objectives, non-differentiable constraints, and multi-reward structures. Furthermore, recent supervised planning approaches require task-specific training or value estimators, which limits test-time flexibility and zero-shot generalization. We propose a Tree-guided Diffusion Planner (TDP), a zero-shot test-time planning framework that balances exploration and exploitation through structured trajectory generation. We frame test-time planning as a tree search problem using a bi-level sampling process: (1) diverse parent trajectories are produced via training-free particle guidance to encourage broad exploration, and (2) sub-trajectories are refined through fast conditional denoising guided by task objectives. TDP addresses the limitations of gradient guidance by exploring diverse trajectory regions and harnessing gradient information across this expanded solution space using only pretrained models and test-time reward signals. We evaluate TDP on three diverse tasks: maze gold-picking, robot arm block manipulation, and AntMaze multi-goal exploration. TDP consistently outperforms state-of-the-art approaches on all tasks. The project page can be found at: this http URL.
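
The bi-level structure (diverse parents first, then refinement of each candidate) can be shown on a one-dimensional toy problem. The sketch below is not TDP and uses no diffusion model: stratified sampling stands in for particle guidance, and an accept-if-better local search stands in for conditional denoising. The reward function and all names are hypothetical.

```python
import random


def reward(x):
    """Hypothetical non-convex, non-differentiable reward with two plateaus."""
    if 0.78 <= x <= 0.82:
        return 1.0
    if 0.18 <= x <= 0.22:
        return 0.5
    return -min(abs(x - 0.2), abs(x - 0.8))


def sample_parents(n, rng):
    """Level 1: one sample per stratum of [0, 1], a stand-in for diverse parents."""
    return [(i + rng.random()) / n for i in range(n)]


def refine(x, rng, steps=50, scale=0.05):
    """Level 2: keep a random perturbation only if it improves the reward."""
    best, best_r = x, reward(x)
    for _ in range(steps):
        cand = min(1.0, max(0.0, best + rng.gauss(0.0, scale)))
        r = reward(cand)
        if r > best_r:
            best, best_r = cand, r
    return best, best_r


def plan(n_parents=8, seed=0):
    """Sample diverse parents, refine each one, and return the best refined candidate."""
    rng = random.Random(seed)
    parents = sample_parents(n_parents, rng)
    children = [refine(p, rng) for p in parents]
    return parents, max(children, key=lambda c: c[1])
```

Spreading the parents over the whole domain keeps the search from committing to the lower plateau near 0.2 when a better one exists near 0.8, which is the exploration benefit the abstract attributes to the parent level.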

[AI-2] DynaMark: A Reinforcement Learning Framework for Dynamic Watermarking in Industrial Machine Tool Controllers

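[Quick Read]: This paper addresses replay attacks on Machine Tool Controllers (MTCs) in Industry 4.0, in which an attacker uses outdated sensor data to manipulate actuators. Existing dynamic watermarking schemes assume linear-Gaussian dynamics and use constant watermark statistics, so they break down on the time-varying, partly proprietary behavior of MTCs. The key to the solution is DynaMark, a framework that models dynamic watermarking as a Markov Decision Process (MDP) and uses reinforcement learning to learn an adaptive policy online. The policy adjusts the covariance of a zero-mean Gaussian watermark using only available measurements and detector feedback, without prior system knowledge. The framework defines a reward function that balances control performance, energy consumption, and detection confidence, and adds a Bayesian belief updating mechanism for online estimation of detection confidence in linear systems. It reduces watermark energy by 70% while preserving the nominal trajectory, and keeps the average detection delay equivalent to one sampling interval.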

Link: https://arxiv.org/abs/2508.21797
Authors: Navid Aftabi, Abhishek Hanchate, Satish Bukkapatnam, Dan Li
Affiliations: Unknown
Categories: Systems and Control (eess.SY); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Applications (stat.AP)
Comments:

Abstract:Industry 4.0’s highly networked Machine Tool Controllers (MTCs) are prime targets for replay attacks that use outdated sensor data to manipulate actuators. Dynamic watermarking can reveal such tampering, but current schemes assume linear-Gaussian dynamics and use constant watermark statistics, making them vulnerable to the time-varying, partly proprietary behavior of MTCs. We close this gap with DynaMark, a reinforcement learning framework that models dynamic watermarking as a Markov decision process (MDP). It learns an adaptive policy online that dynamically adapts the covariance of a zero-mean Gaussian watermark using available measurements and detector feedback, without needing system knowledge. DynaMark maximizes a unique reward function balancing control performance, energy consumption, and detection confidence dynamically. We develop a Bayesian belief updating mechanism for real-time detection confidence in linear systems. This approach, independent of specific system assumptions, underpins the MDP for systems with linear dynamics. On a Siemens Sinumerik 828D controller digital twin, DynaMark achieves a reduction in watermark energy by 70% while preserving the nominal trajectory, compared to constant variance baselines. It also maintains an average detection delay equivalent to one sampling interval. A physical stepper-motor testbed validates these findings, rapidly triggering alarms with less control performance decline and exceeding existing benchmarks.
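
Three ingredients named in the abstract can be written down in a few lines: a zero-mean Gaussian watermark added to the control input, a Bayesian update of the belief that an attack is under way, and a reward that trades off performance, energy, and detection confidence. The sketch below is a scalar toy under stated assumptions, not DynaMark's formulation: the generic Bayes' rule update, the linear reward form, and the weights `w_perf`, `w_energy`, and `w_detect` are all hypothetical.

```python
import math
import random


def inject_watermark(u, variance, rng):
    """Add a zero-mean Gaussian watermark with the given variance to control input u."""
    return u + rng.gauss(0.0, math.sqrt(variance))


def update_belief(prior, lik_attack, lik_normal):
    """Bayes' rule for P(attack | observation), given the two observation likelihoods."""
    num = prior * lik_attack
    den = num + (1.0 - prior) * lik_normal
    return num / den if den > 0 else prior


def reward(tracking_error, watermark_variance, detection_confidence,
           w_perf=1.0, w_energy=0.1, w_detect=1.0):
    """Hypothetical reward: penalize tracking error and watermark energy, reward confidence."""
    return (-w_perf * tracking_error ** 2
            - w_energy * watermark_variance
            + w_detect * detection_confidence)
```

A policy that picks `watermark_variance` at each step faces the trade-off directly: a larger variance makes a replayed signal easier to tell apart from a live one but costs energy and perturbs the trajectory.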

[AI-3] MoE-Health: A Mixture of Experts Framework for Robust Multimodal Healthcare Prediction

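[Quick read]: This paper addresses the difficulty of modeling multimodal healthcare data (such as electronic health records, clinical notes, and medical images) for clinical prediction when modalities are often missing or inconsistent in practice. Existing methods typically require complete modality data or rely on manual selection strategies, so they adapt poorly to real clinical settings where data availability varies across patients and institutions. The key to the solution is MoE-Health, a multimodal fusion framework based on Mixture of Experts (MoE). It combines specialized expert networks with a dynamic gating mechanism that selects and combines the relevant experts according to the modalities available for each input sample, which allows flexible handling of heterogeneous and incomplete data. On three clinical tasks (in-hospital mortality, long length of stay, and hospital readmission prediction), it outperforms existing multimodal fusion methods and remains robust across modality availability patterns.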

Link: https://arxiv.org/abs/2508.21793
Authors: Xiaoyang Wang, Christopher C. Yang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Accepted to The 16th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB 2025)

Abstract:Healthcare systems generate diverse multimodal data, including Electronic Health Records (EHR), clinical notes, and medical images. Effectively leveraging this data for clinical prediction is challenging, particularly as real-world samples often present with varied or incomplete modalities. Existing approaches typically require complete modality data or rely on manual selection strategies, limiting their applicability in real-world clinical settings where data availability varies across patients and institutions. To address these limitations, we propose MoE-Health, a novel Mixture of Experts framework designed for robust multimodal fusion in healthcare prediction. MoE-Health architecture is specifically developed to handle samples with differing modalities and improve performance on critical clinical tasks. By leveraging specialized expert networks and a dynamic gating mechanism, our approach dynamically selects and combines relevant experts based on available data modalities, enabling flexible adaptation to varying data availability scenarios. We evaluate MoE-Health on the MIMIC-IV dataset across three critical clinical prediction tasks: in-hospital mortality prediction, long length of stay, and hospital readmission prediction. Experimental results demonstrate that MoE-Health achieves superior performance compared to existing multimodal fusion methods while maintaining robustness across different modality availability patterns. The framework effectively integrates multimodal information, offering improved predictive performance and robustness in handling heterogeneous and incomplete healthcare data, making it particularly suitable for deployment in diverse healthcare environments with heterogeneous data availability.
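
The abstract does not give the gating formula, so the following is only a sketch of one common way to gate experts when some modalities are missing: a softmax restricted to the experts whose modality is present. The function names `masked_gate` and `combine`, and the one-expert-per-modality layout, are assumptions made for illustration.

```python
import math


def masked_gate(logits, available):
    """Softmax over expert logits, restricted to experts whose modality is present.

    logits:    one raw gating score per expert (hypothetical values)
    available: one boolean per expert, True if its modality exists for this sample
    """
    if not any(available):
        raise ValueError("at least one modality must be available")
    # Subtract the largest available logit for numerical stability.
    top = max(score for score, ok in zip(logits, available) if ok)
    exps = [math.exp(score - top) if ok else 0.0
            for score, ok in zip(logits, available)]
    total = sum(exps)
    return [value / total for value in exps]


def combine(expert_outputs, weights):
    """Weighted sum of scalar expert predictions."""
    return sum(w * out for w, out in zip(weights, expert_outputs))
```

For a patient with EHR and imaging but no clinical notes, `masked_gate([1.0, 2.0, 3.0], [True, False, True])` gives the notes expert a weight of exactly zero and renormalizes the other two.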

[AI-4] Orientability of Causal Relations in Time Series using Summary Causal Graphs and Faithful Distributions

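[Quick read]: This paper addresses causal inference in time series analysis, specifically how to orient causal edges between micro-level variables when the true causal structure is unknown. The key to the solution: given an expert-provided high-level abstraction of the causal graph (a summary causal graph, SCG), and assuming access to a distribution that is faithful and causally sufficient with respect to the true graph, the authors derive conditions that guarantee micro-level edges can be oriented. The results hold even when the macro-level graph contains cycles or bidirected edges, which gives a theoretical basis for using SCGs to guide causal discovery in complex temporal systems and underlines the value of expert knowledge for causal inference from observational data.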

Link: https://arxiv.org/abs/2508.21742
Authors: Timothée Loranchet, Charles K. Assaad
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Methodology (stat.ME)
Comments:

Abstract:Understanding causal relations between temporal variables is a central challenge in time series analysis, particularly when the full causal structure is unknown. Even when the full causal structure cannot be fully specified, experts often succeed in providing a high-level abstraction of the causal graph, known as a summary causal graph, which captures the main causal relations between different time series while abstracting away micro-level details. In this work, we present conditions that guarantee the orientability of micro-level edges between temporal variables given the background knowledge encoded in a summary causal graph and assuming having access to a faithful and causally sufficient distribution with respect to the true unknown graph. Our results provide theoretical guarantees for edge orientation at the micro-level, even in the presence of cycles or bidirected edges at the macro-level. These findings offer practical guidance for leveraging SCGs to inform causal discovery in complex temporal systems and highlight the value of incorporating expert knowledge to improve causal inference from observational time series data.

[AI-5] Neural Network Acceleration on MPSoC board: Integrating SLAC's SNL, Rogue Software and Auto-SNL

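[Quick read]: This paper addresses the massive data-processing problem created by the high-rate X-ray pulses of the LCLS-II Free Electron Laser (FEL), where detectors produce data throughputs exceeding 1 TB/s. Conventional data transmission and storage infrastructure becomes prohibitively expensive at this scale, and conventional machine learning (ML) implementations introduce too much latency for real-time data reduction. The key to the solution is the SLAC Neural Network Library (SNL), a real-time inference framework for Field-Programmable Gate Arrays (FPGAs) that can update model weights dynamically without FPGA resynthesis, which gives low-latency and flexible hardware deployment. The paper also introduces Auto-SNL, a Python extension that converts Python neural network models into SNL-compatible high-level synthesis code. In benchmarks against hls4ml, SNL achieves competitive or superior latency on most tested architectures and saves FPGA resources in some cases.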

Link: https://arxiv.org/abs/2508.21739
Authors: Hamza Ezzaoui Rahali, Abhilasha Dave, Larry Ruckman, Mohammad Mehdi Rahimifar, Audrey C. Therrien, James J. Russel, Ryan T. Herbst
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments:

Abstract:The LCLS-II Free Electron Laser (FEL) will generate X-ray pulses for beamline experiments at rates of up to 1 MHz, with detectors producing data throughputs exceeding 1 TB/s. Managing such massive data streams presents significant challenges, as transmission and storage infrastructures become prohibitively expensive. Machine learning (ML) offers a promising solution for real-time data reduction, but conventional implementations introduce excessive latency, making them unsuitable for high-speed experimental environments. To address these challenges, SLAC developed the SLAC Neural Network Library (SNL), a specialized framework designed to deploy real-time ML inference models on Field-Programmable Gate Arrays (FPGA). SNL’s key feature is the ability to dynamically update model weights without requiring FPGA resynthesis, enhancing flexibility for adaptive learning applications. To further enhance usability and accessibility, we introduce Auto-SNL, a Python extension that streamlines the process of converting Python-based neural network models into SNL-compatible high-level synthesis code. This paper presents a benchmark comparison against hls4ml, the current state-of-the-art tool, across multiple neural network architectures, fixed-point precisions, and synthesis configurations targeting a Xilinx ZCU102 FPGA. The results showed that SNL achieves competitive or superior latency in most tested architectures, while in some cases also offering FPGA resource savings. This adaptation demonstrates SNL’s versatility, opening new opportunities for researchers and academics in fields such as high-energy physics, medical imaging, robotics, and many more.

[AI-6] Developer Insights into Designing AI-Based Computer Perception Tools

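[Quick read]: This paper asks how AI-based computer perception (CP) tools can be integrated effectively into clinical workflows so that they are both clinically useful and accepted and trusted by users, including patients and clinicians. According to the interviewed developers, the key is to balance four design priorities: (1) account for context of use and ensure explainability for patients and clinicians; (2) align tools with existing clinical workflows; (3) customize appropriately for relevant stakeholders to improve usability and acceptability; and (4) push innovation while staying aligned with established paradigms. The study finds that developers see themselves as ethical stewards as well as technical architects, and suggests documenting how customization decisions are made, defining limits for customization, conveying output information transparently, and investing in user training.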

Link: https://arxiv.org/abs/2508.21733
Authors: Maya Guhan(1), Meghan E. Hurley(1), Eric A. Storch(2), John Herrington(3), Casey Zampella(3), Julia Parish-Morris(3), Gabriel Lázaro-Muñoz(4), Kristin Kostick-Quenet(1) ((1) Center for Ethics and Health Policy, Baylor College of Medicine, Houston, TX, USA, (2) Department of Psychiatry and Behavioral Sciences, Baylor College of Medicine, Houston, TX, USA, (3) Department of Child and Adolescent Psychiatry and Behavioral Sciences, Children’s Hospital of Philadelphia, Philadelphia, PA, USA, (4) Center for Bioethics, Harvard Medical School, Boston, MA, USA)
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 15 pages

Abstract:Artificial intelligence (AI)-based computer perception (CP) technologies use mobile sensors to collect behavioral and physiological data for clinical decision-making. These tools can reshape how clinical knowledge is generated and interpreted. However, effective integration of these tools into clinical workflows depends on how developers balance clinical utility with user acceptability and trustworthiness. Our study presents findings from 20 in-depth interviews with developers of AI-based CP tools. Interviews were transcribed and inductive, thematic analysis was performed to identify 4 key design priorities: 1) to account for context and ensure explainability for both patients and clinicians; 2) align tools with existing clinical workflows; 3) appropriately customize to relevant stakeholders for usability and acceptability; and 4) push the boundaries of innovation while aligning with established paradigms. Our findings highlight that developers view themselves as not merely technical architects but also ethical stewards, designing tools that are both acceptable by users and epistemically responsible (prioritizing objectivity and pushing clinical knowledge forward). We offer the following suggestions to help achieve this balance: documenting how design choices around customization are made, defining limits for customization choices, transparently conveying information about outputs, and investing in user training. Achieving these goals will require interdisciplinary collaboration between developers, clinicians, and ethicists.

[AI-7] Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem

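[Quick read]: This paper addresses efficient solution of the Traveling Salesman Problem (TSP) on noisy intermediate-scale quantum (NISQ) hardware. The core difficulty is that variational quantum algorithms that re-optimize the circuit structure (Ansatz) for every test instance are expensive and hard to deploy. The key to the solution is an optimize-freeze-reuse strategy: the circuit topology of the Ansatz is first optimized on a training instance with Simulated Annealing (SA), then frozen and reused on new instances, where only the circuit parameters are re-optimized. This avoids a fresh structure search at test time and reduces time-to-solution while keeping a high probability of sampling the optimal tour: on average 100%, 90%, and 80% for 4, 5, and 6 cities, dropping to about 20% for 7 cities.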

Link: https://arxiv.org/abs/2508.21730
Authors: Fabrizio Fagiolo, Nicolo’ Vescera
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which reduces the qubit requirement too, (ii) an optimize-freeze-reuse strategy: where the circuit topology ("Ansatz") is first optimized on a training instance by Simulated Annealing (SA), then "frozen" and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural research in testing, making the procedure immediately implementable on NISQ hardware. On a set of 40 randomly generated symmetric instances that span 4 - 7 cities, the resulting Ansatz achieves an average optimal trip sampling probability of 100% for 4 city cases, 90% for 5 city cases and 80% for 6 city cases. With 7 cities the success rate drops markedly to an average of ~20%, revealing the onset of scalability limitations of the proposed method. The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of "warm-start" initialization of parameters, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
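
The abstract does not say which compact permutation encoding is used. As an illustration of why such encodings save qubits, the sketch below uses the factorial number system (Lehmer code), which is one standard way to index all n! permutations with a single integer. The function names are hypothetical, and the one-hot figure of n*n qubits refers to the usual city-by-position binary formulation, not to anything stated in the paper.

```python
import math


def index_to_permutation(index: int, n: int) -> list[int]:
    """Decode an integer in [0, n!) into a permutation of range(n)."""
    if not 0 <= index < math.factorial(n):
        raise ValueError("index out of range for n cities")
    remaining = list(range(n))
    tour = []
    for i in range(n, 0, -1):
        # The next factorial-base digit selects one of the remaining cities.
        digit, index = divmod(index, math.factorial(i - 1))
        tour.append(remaining.pop(digit))
    return tour


def qubits_compact(n: int) -> int:
    """Bits needed to index every permutation of n cities."""
    return math.ceil(math.log2(math.factorial(n)))


def qubits_one_hot(n: int) -> int:
    """Bits used by a city-by-position one-hot formulation."""
    return n * n
```

For example, `qubits_compact(7)` is 13 while `qubits_one_hot(7)` is 49, which gives a sense of the gap that a compact encoding can close.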

[AI-8] OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization

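[Quick read]: This paper addresses the tension between robustness and capacity in watermarking images generated by diffusion models: zero-bit watermarking systems lack the capacity for large-scale user tracking, while multi-bit methods are sensitive to certain image transformations or generative attacks and so lack comprehensive robustness. The key to the solution is OptMark, an optimization-based method that embeds a multi-bit watermark into the intermediate latents of the diffusion denoising process in stages: a structural watermark is inserted early to resist generative attacks, and a detail watermark is inserted late to withstand image transformations, with tailored regularization terms to preserve image quality and imperceptibility. It also uses adjoint gradient methods to reduce memory usage from O(N) to O(1) in the number of denoising steps.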

Link: https://arxiv.org/abs/2508.21727
Authors: Jiazheng Xing, Hai Ci, Hongbin Xu, Hangjie Yuan, Yong Liu, Mike Zheng Shou
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Watermarking diffusion-generated images is crucial for copyright protection and user tracking. However, current diffusion watermarking methods face significant limitations: zero-bit watermarking systems lack the capacity for large-scale user tracking, while multi-bit methods are highly sensitive to certain image transformations or generative attacks, resulting in a lack of comprehensive robustness. In this paper, we propose OptMark, an optimization-based approach that embeds a robust multi-bit watermark into the intermediate latents of the diffusion denoising process. OptMark strategically inserts a structural watermark early to resist generative attacks and a detail watermark late to withstand image transformations, with tailored regularization terms to preserve image quality and ensure imperceptibility. To address the challenge of memory consumption growing linearly with the number of denoising steps during optimization, OptMark incorporates adjoint gradient methods, reducing memory usage from O(N) to O(1). Experimental results demonstrate that OptMark achieves invisible multi-bit watermarking while ensuring robust resilience against valuemetric transformations, geometric transformations, editing, and regeneration attacks.

[AI-9] PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

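[Quick read]: This paper addresses two shortcomings of automated scientific poster generation: existing methods largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, and they do not jointly optimize logical consistency, content fidelity, and visual coherence. The key to the solution is PosterForest, a training-free framework built around the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. A multi-agent collaboration strategy lets agents specializing in content summarization and layout planning coordinate iteratively and give each other feedback, which enables the joint optimization of those three properties.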

Link: https://arxiv.org/abs/2508.21720
Authors: Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We present a novel training-free framework, PosterForest, for automated scientific poster generation. Unlike prior approaches, which largely neglect the hierarchical structure of scientific documents and the semantic integration of textual and visual elements, our method addresses both challenges directly. We introduce the Poster Tree, a hierarchical intermediate representation that jointly encodes document structure and visual-textual relationships at multiple levels. Our framework employs a multi-agent collaboration strategy, where agents specializing in content summarization and layout planning iteratively coordinate and provide mutual feedback. This approach enables the joint optimization of logical consistency, content fidelity, and visual coherence. Extensive experiments on multiple academic domains show that our method outperforms existing baselines in both qualitative and quantitative evaluations. The resulting posters achieve quality closest to expert-designed ground truth and deliver superior information preservation, structural clarity, and user preference.

[AI-10] Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education

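[Quick read]: This paper addresses the lack of contextualized, dynamically adaptive learning experiences in climate resilience education; conventional instruction has difficulty tailoring support to changing real-time conditions. The key to the solution is the Future Atmospheric Conditions Training System (FACTS), a platform that combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. A Generative AI powered server analyzes learner responses and delivers personalized feedback and adaptive support, giving a place-based adaptive learning mechanism. In a user evaluation, participants found the system easy to use and effective for building knowledge related to climate resilience.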

Link: https://arxiv.org/abs/2508.21666
Authors: Imran S. A. Khan, Emmanuel G. Blanchard, Sébastien George
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG); Software Engineering (cs.SE)
Comments:

Abstract:This paper introduces the Future Atmospheric Conditions Training System (FACTS), a novel platform that advances climate resilience education through place-based, adaptive learning experiences. FACTS combines real-time atmospheric data collected by IoT sensors with curated resources from a Knowledge Base to dynamically generate localized learning challenges. Learner responses are analyzed by a Generative AI powered server, which delivers personalized feedback and adaptive support. Results from a user evaluation indicate that participants found the system both easy to use and effective for building knowledge related to climate resilience. These findings suggest that integrating IoT and Generative AI into atmospherically adaptive learning technologies holds significant promise for enhancing educational engagement and fostering climate awareness.

[AI-11] Leveraging Imperfection with MEDLEY: A Multi-Model Approach Harnessing Bias in Medical AI

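[Quick read]: This paper challenges the conventional view that bias in medical artificial intelligence is a defect to be eliminated. It notes that human reasoning itself incorporates biases shaped by education, culture, and experience, which suggests that bias may be inevitable and potentially valuable. The key to the solution is MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models and preserves their diverse outputs instead of collapsing them into a consensus. It documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification, which keeps diagnostic uncertainty transparent and supports medical reasoning under clinician supervision.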

Link: https://arxiv.org/abs/2508.21648
Authors: Farhad Abtahi, Mehdi Astaraki, Fernando Seoane
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Bias in medical artificial intelligence is conventionally viewed as a defect requiring elimination. However, human reasoning inherently incorporates biases shaped by education, culture, and experience, suggesting their presence may be inevitable and potentially valuable. We propose MEDLEY (Medical Ensemble Diagnostic system with Leveraged diversitY), a conceptual framework that orchestrates multiple AI models while preserving their diverse outputs rather than collapsing them into a consensus. Unlike traditional approaches that suppress disagreement, MEDLEY documents model-specific biases as potential strengths and treats hallucinations as provisional hypotheses for clinician verification. A proof-of-concept demonstrator was developed using over 30 large language models, creating a minimum viable product that preserved both consensus and minority views in synthetic cases, making diagnostic uncertainty and latent biases transparent for clinical oversight. While not yet a validated clinical tool, the demonstration illustrates how structured diversity can enhance medical reasoning under clinician supervision. By reframing AI imperfection as a resource, MEDLEY offers a paradigm shift that opens new regulatory, ethical, and innovation pathways for developing trustworthy medical AI systems.

[AI-12] A-MHA*: Anytime Multi-Heuristic A*

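[Quick read]: This paper addresses the fact that Multi-Heuristic A* (MHA*) has no anytime behavior: the original MHA* is a one-shot algorithm that produces a single suboptimal solution and does not improve it as more computation time becomes available. The key to the solution is A-MHA* (Anytime Multi-Heuristic A*), which adapts concepts from Anytime Repairing A* (ARA*) to the MHA* framework. It finds a feasible suboptimal solution quickly and keeps improving it until time runs out, while provably preserving the original suboptimality and completeness guarantees.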

Link: https://arxiv.org/abs/2508.21637
Authors: Ramkumar Natarajan, Muhammad Suhail Saleem, William Xiao, Sandip Aine, Howie Choset, Maxim Likhachev
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Designing good heuristic functions for graph search requires adequate domain knowledge. It is often easy to design heuristics that perform well and correlate with the underlying true cost-to-go values in certain parts of the search space but these may not be admissible throughout the domain thereby affecting the optimality guarantees of the search. Bounded suboptimal search using several such partially good but inadmissible heuristics was developed in Multi-Heuristic A* (MHA*). Although MHA* leverages multiple inadmissible heuristics to potentially generate a faster suboptimal solution, the original version does not improve the solution over time. It is a one shot algorithm that requires careful setting of inflation factors to obtain a desired one time solution. In this work, we tackle this issue by extending MHA* to an anytime version that finds a feasible suboptimal solution quickly and continually improves it until time runs out. Our work is inspired from the Anytime Repairing A* (ARA*) algorithm. We prove that our precise adaptation of ARA* concepts in the MHA* framework preserves the original suboptimal and completeness guarantees and enhances MHA* to perform in an anytime fashion. Furthermore, we report the performance of A-MHA* in 3-D path planning domain and sliding tiles puzzle and compare against MHA* and other anytime algorithms.
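
The anytime behavior described above can be illustrated with a much simpler stand-in: weighted A* with a single heuristic, rerun with a decreasing inflation factor while keeping the best cost found so far. This is not A-MHA* and not ARA* proper (ARA* reuses search effort between iterations, and MHA* uses several heuristics); the grid, the weight schedule, and the function names are assumptions made for illustration.

```python
import heapq

# Hypothetical 4-connected grid: 0 = free cell, 1 = obstacle.
DEMO_GRID = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]


def weighted_astar(grid, start, goal, w):
    """Weighted A* with a Manhattan heuristic; returns the path cost or None."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(w * h(start), 0, start)]
    best_g = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get(nxt, float("inf")):
                    best_g[nxt] = ng
                    heapq.heappush(open_heap, (ng + w * h(nxt), ng, nxt))
    return None


def anytime_search(grid, start, goal, weights=(3.0, 2.0, 1.5, 1.0)):
    """Run weighted A* with shrinking inflation; report the best cost after each pass."""
    best = None
    history = []
    for w in weights:
        cost = weighted_astar(grid, start, goal, w)
        if cost is not None and (best is None or cost < best):
            best = cost
        history.append((w, best))
    return history
```

A caller with a deadline could stop after any pass and use the best cost reported so far; the final pass with `w = 1.0` is plain A* and therefore optimal for this consistent heuristic.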

[AI-13] Integrating Large Language Models with Network Optimization for Interactive and Explainable Supply Chain Planning: A Real-World Case Study

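[Quick read]: This paper addresses the problem that the outputs of traditional operations research models for supply chain planning are complex and hard for business decision-makers to understand, particularly for multi-period, multi-item inventory redistribution, where interactive, explainable, and role-aware decision support is needed. The key to the solution is an integrated framework that combines traditional network optimization models with Large Language Models (LLMs). It turns optimization results into natural language summaries, contextual visualizations, and tailored key performance indicators (KPIs), and uses AI agents, RESTful APIs, and a dynamic user interface to support real-time interaction and simulation-based insights.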

Link: https://arxiv.org/abs/2508.21622
Authors: Saravanan Venkatachalam
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents an integrated framework that combines traditional network optimization models with large language models (LLMs) to deliver interactive, explainable, and role-aware decision support for supply chain planning. The proposed system bridges the gap between complex operations research outputs and business stakeholder understanding by generating natural language summaries, contextual visualizations, and tailored key performance indicators (KPIs). The core optimization model addresses tactical inventory redistribution across a network of distribution centers for multi-period and multi-item, using a mixed-integer formulation. The technical architecture incorporates AI agents, RESTful APIs, and a dynamic user interface to support real-time interaction, configuration updates, and simulation-based insights. A case study demonstrates how the system improves planning outcomes by preventing stockouts, reducing costs, and maintaining service levels. Future extensions include integrating private LLMs, transfer learning, reinforcement learning, and Bayesian neural networks to enhance explainability, adaptability, and real-time decision-making.

[AI-14] Physics-Informed Spectral Modeling for Hyperspectral Imaging

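[Quick read]: This paper addresses feature representation and modeling for hyperspectral images, in particular how to perform classification and regression effectively and interpretably with limited labeled data. The key to the solution is PhISM (Physics-Informed Spectral Modeling), a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions, which yields an interpretable latent representation while maintaining strong performance.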

Link: https://arxiv.org/abs/2508.21618
Authors: Zuzanna Gawrysiak, Krzysztof Krawiec
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:We present PhISM, a physics-informed deep learning architecture that learns without supervision to explicitly disentangle hyperspectral observations and model them with continuous basis functions. PhISM outperforms prior methods on several classification and regression benchmarks, requires limited labeled data, and provides additional insights thanks to interpretable latent representation.

[AI-15] Scalable Solution Methods for Dec-POMDPs with Deterministic Dynamics

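[Quick read]: This paper addresses the computational difficulty of high-level multi-agent planning problems with very large state and action spaces, focusing on problems that can be modeled with deterministic transitions and observations, a class the authors call Deterministic Decentralized POMDPs (Det-Dec-POMDPs). The core contribution is a practical solver, Iterative Deterministic POMDP Planning (IDPP), which builds on the classic Joint Equilibrium Search for Policies (JESP) framework and is optimized for large-scale Det-Dec-POMDPs that current Dec-POMDP solvers cannot handle efficiently.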

Link: https://arxiv.org/abs/2508.21595
Authors: Yang You, Alex Schutz, Zhikun Li, Bruno Lacerda, Robert Skilton, Nick Hawes
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Many high-level multi-agent planning problems, including multi-robot navigation and path planning, can be effectively modeled using deterministic actions and observations. In this work, we focus on such domains and introduce the class of Deterministic Decentralized POMDPs (Det-Dec-POMDPs). This is a subclass of Dec-POMDPs characterized by deterministic transitions and observations conditioned on the state and joint actions. We then propose a practical solver called Iterative Deterministic POMDP Planning (IDPP). This method builds on the classic Joint Equilibrium Search for Policies framework and is specifically optimized to handle large-scale Det-Dec-POMDPs that current Dec-POMDP solvers are unable to address efficiently.
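
One consequence of deterministic transitions and observations is that a belief can be tracked as a plain set of candidate states instead of a probability distribution. The sketch below shows that set-based update on a toy two-agent corridor. It is not the IDPP solver; the corridor model and the names `update_belief`, `transition`, and `observe` are hypothetical.

```python
def update_belief(belief, joint_action, observation, transition, observe):
    """Set-based belief update for deterministic dynamics.

    Each candidate state has exactly one successor and emits exactly one
    observation, so the update is a map followed by a filter.
    """
    successors = {transition(state, joint_action) for state in belief}
    return {s for s in successors if observe(s, joint_action) == observation}


# Toy model (assumed): two agents on cells 0..3 of a corridor.
def transition(state, joint_action):
    """Move each agent by its own action in {-1, 0, 1}, clamped to the corridor."""
    return tuple(min(3, max(0, pos + move)) for pos, move in zip(state, joint_action))


def observe(state, joint_action):
    """Agent 0 senses the left wall, agent 1 senses the right wall."""
    return (state[0] == 0, state[1] == 3)
```

Starting from `{(0, 3), (1, 3), (2, 3)}`, the joint action `(-1, 0)` followed by the observation `(True, True)` leaves only `(0, 3)`, since two candidates merge after the move and the third is ruled out by the wall sensor.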

[AI-16] Revisiting Landmarks: Learning from Previous Plans to Generalize over Problem Instances

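[Quick read]: This paper addresses the poor cross-instance generalization of traditional landmark extraction algorithms in planning: existing methods typically identify landmarks from the predicates of one specific problem, so they cannot capture structure that repeats across a domain or transfer to instances of different sizes. The key to the solution is a framework for generalized landmarks. From a set of solved instances it learns state functions that are independent of the objects of a specific problem and apply to all similar objects, which captures repetition. On top of these functions it builds a directed generalized landmark graph that defines landmark progression, including loops for repetitive subplans, and uses the graph in a heuristic for new instances. Experiments show that graphs learned from a few small instances remain effective on larger instances of the same domain, and that heuristic performance improves significantly over the baseline when a loop indicating repetition is identified.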

Link: https://arxiv.org/abs/2508.21564
Authors: Issa Hanou, Sebastijan Dumančić, Mathijs de Weerdt
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:We propose a new framework for discovering landmarks that automatically generalize across a domain. These generalized landmarks are learned from a set of solved instances and describe intermediate goals for planning problems where traditional landmark extraction algorithms fall short. Our generalized landmarks extend beyond the predicates of a domain by using state functions that are independent of the objects of a specific problem and apply to all similar objects, thus capturing repetition. Based on these functions, we construct a directed generalized landmark graph that defines the landmark progression, including loop possibilities for repetitive subplans. We show how to use this graph in a heuristic to solve new problem instances of the same domain. Our results show that the generalized landmark graphs learned from a few small instances are also effective for larger instances in the same domain. If a loop that indicates repetition is identified, we see a significant improvement in heuristic performance over the baseline. Generalized landmarks capture domain information that is interpretable and useful to an automated planner. This information can be discovered from a small set of plans for the same domain.

[AI-17] Limitations of Physics-Informed Neural Networks: a Study on Smart Grid Surrogation

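[Quick read]: This paper addresses two core problems of conventional data-driven models for smart grid modeling: poor generalization caused by data scarcity, and unreliable operation caused by a lack of physical consistency. The key to the solution is the use of Physics-Informed Neural Networks (PINNs), trained with the grid's physical laws (power balance, operational constraints, and grid stability) embedded directly in the loss function, to build surrogate models that combine data-driven flexibility with first-principles rigor. In interpolation, cross-validation, and episodic trajectory prediction experiments, PINNs outperform XGBoost, Random Forest, and Linear Regression, maintain comparatively lower mean absolute error (MAE), reliably capture state transitions, and enforce physical feasibility despite slight degradation in extreme operating regimes, which makes them suited to real-time control and digital twins in safety-critical settings.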

Link: https://arxiv.org/abs/2508.21559
Authors: Julen Cestero, Carmine Delle Femine, Kenji S. Muro, Marco Quartulli, Marcello Restelli
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Presented in PowerTech2025

Abstract:Physics-Informed Neural Networks (PINNs) present a transformative approach for smart grid modeling by integrating physical laws directly into learning frameworks, addressing critical challenges of data scarcity and physical consistency in conventional data-driven methods. This paper evaluates PINNs’ capabilities as surrogate models for smart grid dynamics, comparing their performance against XGBoost, Random Forest, and Linear Regression across three key experiments: interpolation, cross-validation, and episodic trajectory prediction. By training PINNs exclusively through physics-based loss functions (enforcing power balance, operational constraints, and grid stability) we demonstrate their superior generalization, outperforming data-driven models in error reduction. Notably, PINNs maintain comparatively lower MAE in dynamic grid operations, reliably capturing state transitions in both random and expert-driven control scenarios, while traditional models exhibit erratic performance. Despite slight degradation in extreme operational regimes, PINNs consistently enforce physical feasibility, proving vital for safety-critical applications. Our results contribute to establishing PINNs as a paradigm-shifting tool for smart grid surrogation, bridging data-driven flexibility with first-principles rigor. This work advances real-time grid control and scalable digital twins, emphasizing the necessity of physics-aware architectures in mission-critical energy systems.

[AI-18] What Data is Really Necessary? A Feasibility Study of Inference Data Minimization for Recommender Systems CIKM’25

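[Quick read]: This paper addresses the practical difficulty of applying the data minimization principle to recommender systems, namely how to reduce reliance on users' implicit feedback inference data without sacrificing system performance. The key to the solution is a novel problem formulation, together with an analysis of several minimization techniques, which shows that the technical setting (for example performance targets and choice of model) and user characteristics (for example history size and preference complexity) are the main factors that determine how well minimization works. The study shows that substantial inference data reduction is technically feasible without significant performance loss, but that it remains challenging in practice and that a universal standard of "necessity" is difficult to implement.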

Link: https://arxiv.org/abs/2508.21547
Authors: Jens Leysen, Marco Favier, Bart Goethals
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: Accepted for publication at the 34th ACM International Conference on Information and Knowledge Management (CIKM '25), November 10-14, 2025, Seoul, Republic of Korea

Abstract:Data minimization is a legal principle requiring personal data processing to be limited to what is necessary for a specified purpose. Operationalizing this principle for recommender systems, which rely on extensive personal data, remains a significant challenge. This paper conducts a feasibility study on minimizing implicit feedback inference data for such systems. We propose a novel problem formulation, analyze various minimization techniques, and investigate key factors influencing their effectiveness. We demonstrate that substantial inference data reduction is technically feasible without significant performance loss. However, its practicality is critically determined by two factors: the technical setting (e.g., performance targets, choice of model) and user characteristics (e.g., history size, preference complexity). Thus, while we establish its technical feasibility, we conclude that data minimization remains practically challenging and its dependence on the technical and user context makes a universal standard for data "necessity" difficult to implement.

[AI-19] HealthProcessAI: A Technical Framework and Proof-of-Concept for LLM-Enhanced Healthcare Process Mining

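[Quick read]: This paper addresses three barriers to applying process mining in healthcare and epidemiology: technical complexity, a lack of standardized approaches, and limited access to practical training resources. The key to the solution is HealthProcessAI, a GenAI framework that wraps existing Python (PM4PY) and R (bupaR) process mining libraries and integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation. This lowers the barrier for non-specialist users to understand complex analysis results and makes process mining outputs more accessible to clinicians, data scientists, and researchers.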

Link: https://arxiv.org/abs/2508.21540
Authors: Eduardo Illueca-Fernandez, Kaile Chen, Fernando Seoane, Farhad Abtahi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Process mining has emerged as a powerful analytical technique for understanding complex healthcare workflows. However, its application faces significant barriers, including technical complexity, a lack of standardized approaches, and limited access to practical training resources. We introduce HealthProcessAI, a GenAI framework designed to simplify process mining applications in healthcare and epidemiology by providing a comprehensive wrapper around existing Python (PM4PY) and R (bupaR) libraries. To address unfamiliarity and improve accessibility, the framework integrates multiple Large Language Models (LLMs) for automated process map interpretation and report generation, helping translate technical analyses into outputs that diverse users can readily understand. We validated the framework using sepsis progression data as a proof-of-concept example and compared the outputs of five state-of-the-art LLM models through the OpenRouter platform. To test its functionality, the framework successfully processed sepsis data across four proof-of-concept scenarios, demonstrating robust technical performance and its capability to generate reports through automated LLM analysis. LLM evaluation using five independent LLMs as automated evaluators revealed distinct model strengths: Claude Sonnet-4 and Gemini 2.5-Pro achieved the highest consistency scores (3.79/4.0 and 3.65/4.0) when evaluated by automated LLM assessors. By integrating multiple Large Language Models (LLMs) for automated interpretation and report generation, the framework addresses widespread unfamiliarity with process mining outputs, making them more accessible to clinicians, data scientists, and researchers. This structured analytics and AI-driven interpretation combination represents a novel methodological advance in translating complex process mining results into potentially actionable insights for healthcare applications.
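
For readers new to process mining, the core artifact that a process map is drawn from is a directly-follows graph: a count of how often one activity immediately follows another within the same case. The sketch below builds one in plain Python so that it does not depend on any particular PM4PY or bupaR call; the event tuple layout and the function name `directly_follows` are assumptions, and it is not part of HealthProcessAI.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count directly-follows pairs in an event log.

    events: iterable of (case_id, timestamp, activity) tuples, in any order.
    Returns a Counter mapping (activity, next_activity) to its frequency.
    """
    traces = defaultdict(list)
    # Order events within each case by timestamp before pairing them up.
    for case_id, _, activity in sorted(events, key=lambda e: (e[0], e[1])):
        traces[case_id].append(activity)
    graph = Counter()
    for activities in traces.values():
        graph.update(zip(activities, activities[1:]))
    return graph
```

Given a small log with hypothetical activity names such as `triage`, `lab`, `antibiotics`, and `discharge`, the resulting counts are what a process map visualizes as weighted arrows, and what an LLM-based interpretation layer would be asked to describe.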

[AI-20] Counterfactual Scenarios for Automated Planning KR2025

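Quick read: This paper addresses the inability of conventional counterfactual explanations (CEs) in automated planning to capture high-level properties of the problem being solved. Conventional CEs only consider minimal modifications to an existing plan that achieve a different goal, ignoring structural properties of the planning problem itself and higher-level constraints on the desired behavior. The authors propose a new explanation paradigm based on counterfactual scenarios: given a planning problem $P$ and a linear temporal logic formula $\psi$ (LTL$_f$ in the abstract) defining desired properties of a plan, a counterfactual scenario identifies minimal modifications to $P$ such that the modified problem admits plans satisfying $\psi$. The key idea is an explicit quantification over the plans that must satisfy $\psi$, from which two qualitative instantiations are derived. The authors also show that generating such counterfactual scenarios is often only as expensive as computing a plan for $P$, which supports the practical viability of the approach.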

Link: https://arxiv.org/abs/2508.21521
Authors: Nicola Gigante,Francesco Leofante,Andrea Micheli
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted at the 22nd International Conference on Principles of Knowledge Representation and Reasoning (KR 2025)

Abstract:Counterfactual Explanations (CEs) are a powerful technique used to explain Machine Learning models by showing how the input to a model should be minimally changed for the model to produce a different output. Similar proposals have been made in the context of Automated Planning, where CEs have been characterised in terms of minimal modifications to an existing plan that would result in the satisfaction of a different goal. While such explanations may help diagnose faults and reason about the characteristics of a plan, they fail to capture higher-level properties of the problem being solved. To address this limitation, we propose a novel explanation paradigm that is based on counterfactual scenarios. In particular, given a planning problem $P$ and an LTL$_f$ formula $\psi$ defining desired properties of a plan, counterfactual scenarios identify minimal modifications to $P$ such that it admits plans that comply with $\psi$. In this paper, we present two qualitative instantiations of counterfactual scenarios based on an explicit quantification over plans that must satisfy $\psi$. We then characterise the computational complexity of generating such counterfactual scenarios when different types of changes are allowed on $P$. We show that producing counterfactual scenarios is often only as expensive as computing a plan for $P$, thus demonstrating the practical viability of our proposal and ultimately providing a framework to construct practical algorithms in this area.

[AI-21] Modeling Wise Decision Making: A Z-Number Fuzzy Framework Inspired by Phronesis

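Quick read: This paper addresses the problem that conventional wisdom measures rely on self-reports and rarely reflect core features of wise reasoning such as humility and uncertainty. It aims to build a computational framework that captures both the multidimensionality of wisdom and epistemic uncertainty. The key is a fuzzy inference system based on Z-numbers, in which each decision is represented by two attributes, a wisdom score (restriction) and a confidence score (certainty). Component scores are combined through a base of 21 rules, with membership functions tuned via Gaussian kernel density estimation. The result is an interpretable, confidence-sensitive quantitative model of human wisdom judgments, offering psychological measurement and human-centered AI a reasoning mechanism that balances rigor with human-like judgment.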

Link: https://arxiv.org/abs/2508.21517
Authors: Sweta Kaman,Ankita Sharma,Romi Banerjee
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: total 17 pages, main manuscript 12 pages, supplementary 5 pages, 6 tables in main manuscript, 5 figures in main manuscript, 2 tables in supplementary, and 3 figures in supplementary

Abstract:Background: Wisdom is a superordinate construct that embraces perspective taking, reflectiveness, prosocial orientation, reflective empathetic action, and intellectual humility. Unlike conventional models of reasoning that are rigidly bound by binary thinking, wisdom unfolds in shades of ambiguity, requiring both graded evaluation and self-reflective humility. Current measures depend on self-reports and seldom reflect the humility and uncertainty inherent in wise reasoning. A computational framework that takes into account both multidimensionality and confidence has the potential to improve psychological science and allow humane AI. Method: We present a fuzzy inference system with Z-numbers, in which each decision is expressed in terms of a wisdom score (restriction) and a confidence score (certainty). As part of this study, participants (N = 100) were exposed to culturally neutral pictorial moral dilemma tasks, to which they generated think-aloud linguistic responses that were mapped into five theoretically based components of wisdom. The scores of each individual component were combined using a base of 21 rules, with membership functions tuned via Gaussian kernel density estimation. Results: In a proof-of-concept study, the system produced dual-attribute wisdom representations that correlated modestly but significantly with established scales while showing negligible relations with unrelated traits, supporting convergent and divergent validity. Contribution: The contribution is to formalize wisdom as a multidimensional, uncertainty-conscious construct, operationalized in the form of Z-numbers. In addition to advancing measurement in psychology, it shows how fuzzy Z-numbers can provide AI systems with interpretable, confidence-sensitive reasoning that affords a safe middle ground between rigorous computation and human-like judgment.

[AI-22] On the Hardness of Learning GNN-based SAT Solvers: The Role of Graph Ricci Curvature

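Quick read: This paper addresses the sharp performance drop of Graph Neural Networks (GNNs) on harder instances when used as solvers for Boolean Satisfiability Problems (SATs). The key is to use graph Ricci curvature (RC) as a geometric measure: the authors show that the bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that the curvature decreases as instance difficulty increases. They then attribute the performance bottleneck to oversquashing, in which long-range dependencies cannot be compressed into fixed-length node representations. Experiments confirm that curvature is strongly indicative of problem complexity and can be used to predict solver performance, giving a theoretical basis and directions for improving GNN architectures.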

Link: https://arxiv.org/abs/2508.21513
Authors: Geri Skenderi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Graph Neural Networks (GNNs) have recently shown promise as solvers for Boolean Satisfiability Problems (SATs) by operating on graph representations of logical formulas. However, their performance degrades sharply on harder instances, raising the question of whether this reflects fundamental architectural limitations. In this work, we provide a geometric explanation through the lens of graph Ricci Curvature (RC), which quantifies local connectivity bottlenecks. We prove that bipartite graphs derived from random k-SAT formulas are inherently negatively curved, and that this curvature decreases with instance difficulty. Building on this, we show that GNN-based SAT solvers are affected by oversquashing, a phenomenon where long-range dependencies become impossible to compress into fixed-length representations. We validate our claims empirically across different SAT benchmarks and confirm that curvature is both a strong indicator of problem complexity and can be used to predict performance. Finally, we connect our findings to design principles of existing solvers and outline promising directions for future work.
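
To make the curvature argument more concrete, here is a minimal sketch that builds the variable-clause bipartite graph of a random k-SAT formula and averages an edge curvature over it. It uses the basic combinatorial Forman curvature, `4 - deg(u) - deg(v)` for an unweighted edge in a triangle-free graph, purely as a stand-in: the abstract does not say which notion of Ricci curvature the paper uses. The function names are hypothetical, literal signs are ignored, and clause density is only a crude proxy for difficulty.

```python
import random
from collections import defaultdict


def random_ksat(num_vars, num_clauses, k=3, seed=0):
    """Random k-SAT instance as a list of clauses (variable indices only)."""
    rng = random.Random(seed)
    return [rng.sample(range(num_vars), k) for _ in range(num_clauses)]


def mean_forman_curvature(clauses):
    """Average Forman curvature over the edges of the variable-clause graph.

    Assumption: simple Forman curvature 4 - deg(u) - deg(v), valid for
    unweighted graphs without triangles (bipartite graphs have none).
    """
    degree = defaultdict(int)
    edges = []
    for c_idx, clause in enumerate(clauses):
        for var in clause:
            u, v = ("var", var), ("clause", c_idx)
            edges.append((u, v))
            degree[u] += 1
            degree[v] += 1
    return sum(4 - degree[u] - degree[v] for u, v in edges) / len(edges)


if __name__ == "__main__":
    for ratio in (1.0, 3.0, 5.0):
        clauses = random_ksat(num_vars=200, num_clauses=int(200 * ratio))
        print(ratio, round(mean_forman_curvature(clauses), 2))
```

Under this stand-in measure, denser formulas give more negative average curvature because each variable node participates in more clauses; whether the same holds for the paper's curvature notion is for the paper itself to establish.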

[AI-23] Priors Matter: Addressing Misspecification in Bayesian Deep Q-Learning

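Quick read: This paper addresses a gap in Bayesian reinforcement learning: work on model-free algorithms has focused on the accuracy of the posterior approximation, while the validity of the underlying prior and likelihood assumptions has received little attention. The authors find a cold posterior effect in Bayesian deep Q-learning, where lowering the posterior temperature improves performance, contrary to what Bayesian theory predicts. The key is to challenge and improve the common Gaussian likelihood assumption and the prior distributions: statistical tests show that the Gaussian likelihood assumption is frequently violated, and the authors propose simple, implementable priors that lead to better-performing Bayesian reinforcement learning algorithms.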

Link: https://arxiv.org/abs/2508.21488
Authors: Pascal R. van der Vaart,Neil Yorke-Smith,Matthijs T.J. Spaan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Uncertainty quantification in reinforcement learning can greatly improve exploration and robustness. Approximate Bayesian approaches have recently been popularized to quantify uncertainty in model-free algorithms. However, so far the focus has been on improving the accuracy of the posterior approximation, instead of studying the accuracy of the prior and likelihood assumptions underlying the posterior. In this work, we demonstrate that there is a cold posterior effect in Bayesian deep Q-learning, where contrary to theory, performance increases when reducing the temperature of the posterior. To identify and overcome likely causes, we challenge common assumptions made on the likelihood and priors in Bayesian model-free algorithms. We empirically study prior distributions and show through statistical tests that the common Gaussian likelihood assumption is frequently violated. We argue that developing more suitable likelihoods and priors should be a key focus in future Bayesian reinforcement learning research and we offer simple, implementable solutions for better priors in deep Q-learning that lead to more performant Bayesian algorithms.
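
For readers unfamiliar with the term, the cold posterior effect is usually stated in terms of a tempered posterior. The formula below is the standard textbook form and is not taken from the paper; $T$ is the temperature, $\theta$ the network parameters, and $\mathcal{D}$ the observed data.

```latex
p_T(\theta \mid \mathcal{D}) \;\propto\; \bigl( p(\mathcal{D} \mid \theta)\, p(\theta) \bigr)^{1/T}
```

Setting $T = 1$ recovers the ordinary Bayes posterior, while $T < 1$ sharpens it. A cold posterior effect means that performance improves for some $T < 1$, which would not be expected if the prior and likelihood were well specified.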

[AI-24] MMSearch-Plus: A Simple Yet Challenging Benchmark for Multimodal Browsing Agents

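Quick read: This paper addresses the fact that many multimodal browsing benchmarks for multimodal large language models (MLLMs) deployed as web agents can be solved by shallow, fixed workflows (high-recall image search plus nearby text), which masks the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. The key is the MMSearch-Plus benchmark of 311 tasks, each built so that weak, localized visual signals must be extracted, propagated through iterative text-image search, and cross-validated. A curation procedure called Spatial-Temporal Extrapolation seeds questions whose answers require extrapolating from cues such as micro-text, part-level appearance, layouts, broadcast overlays, and seasonal context to out-of-image facts such as events, dates, and venues. The work also provides a model-agnostic agent framework for evaluating a range of closed and open MLLMs.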

Link: https://arxiv.org/abs/2508.21475
Authors: Xijia Tao,Yihua Teng,Xinxing Su,Xinyu Fu,Jihao Wu,Chaofan Tao,Ziru Liu,Haoli Bai,Rui Liu,Lingpeng Kong
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Project Page: this https URL

Abstract:Large multimodal language models (MLLMs) are increasingly deployed as web agents, yet many multimodal browsing benchmarks can be solved by shallow, fixed workflows that lean on high-recall image search and nearby text, masking the genuinely multimodal challenges of fine-grained visual reasoning, provenance verification, and long-horizon tool use. We introduce MMSearch-Plus, a benchmark of 311 tasks that highly demand multimodal understanding while preserving the difficulty profile of strong text-only browsing suites. Each item is constructed to contain multiple weak, localized visual signals that must be extracted, propagated through iterative text-image search, and cross-validated under retrieval noise before answering. Our curation procedure, Spatial-Temporal Extrapolation, seeds questions whose answers require extrapolating from spatial cues (micro-text, part-level appearance, layouts, signage) and temporal traces (broadcast overlays, seasonal context) to out-of-image facts such as events, dates, and venues. We provide a model-agnostic agent framework with browsing tools and evaluate a range of closed and open MLLMs. The strongest agent (o3) attains 15.1% accuracy without search and 36.0% with rollout under our framework, while a strong open-source model (Qwen-2.5-VL-72B-Instruct) achieves 0.0% without search and 6.9% after 20 rounds of search. Beyond answer accuracy, we assess bounding-box production and cropped-image search, and conduct an error analysis that surfaces failures in source verification, part-based reasoning, and long-horizon planning.

[AI-25] Controllable 3D Molecular Generation for Structure-Based Drug Design Through Bayesian Flow Networks and Gradient Integration

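Quick read: This paper addresses the practical limitations of generative molecular models in Structure-based Drug Design (SBDD): existing diffusion-based generative models optimize binding affinity but neglect key pharmacological properties such as synthetic feasibility and selectivity, so the generated molecules often fall short of real drug-development needs. The key is CByG, a framework that extends the Bayesian Flow Network into a gradient-guided conditional generative model, using property-specific guidance to integrate and jointly optimize multiple pharmacological properties. The authors also introduce an evaluation scheme covering binding affinity, synthetic feasibility, and selectivity to assess the practical value of generative models more completely.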

Link: https://arxiv.org/abs/2508.21468
Authors: Seungyeon Choi,Hwanhee Kim,Chihyun Park,Dahyeon Lee,Seungyong Lee,Yoonju Kim,Hyoungjoon Park,Sein Kwon,Youngwan Jo,Sanghyun Park
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advances in Structure-based Drug Design (SBDD) have leveraged generative models for 3D molecular generation, predominantly evaluating model performance by binding affinity to target proteins. However, practical drug discovery necessitates high binding affinity along with synthetic feasibility and selectivity, critical properties that were largely neglected in previous evaluations. To address this gap, we identify fundamental limitations of conventional diffusion-based generative models in effectively guiding molecule generation toward these diverse pharmacological properties. We propose CByG, a novel framework extending Bayesian Flow Network into a gradient-based conditional generative model that robustly integrates property-specific guidance. Additionally, we introduce a comprehensive evaluation scheme incorporating practical benchmarks for binding affinity, synthetic feasibility, and selectivity, overcoming the limitations of conventional evaluation methods. Extensive experiments demonstrate that our proposed CByG framework significantly outperforms baseline models across multiple essential evaluation criteria, highlighting its effectiveness and practicality for real-world drug discovery applications.

[AI-26] Diffusion-based Multi-modal Synergy Interest Network for Click-through Rate Prediction SIGIR2025

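Quick read: Existing click-through rate (CTR) prediction models rely mainly on the ID modality and cannot comprehensively model users' multi-modal preferences. To address this, the paper proposes the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN). The core of the solution is three modules: the Multi-modal Feature Enhancement (MFE) module and the Synergistic Relationship Capture (SRC) module jointly extract synergistic, common, and modality-specific information, with a Knowledge Decoupling method encouraging distinct features; the Feature Dynamic Adaptive Fusion (FDAF) module focuses on capturing user preferences and reducing fusion noise. The framework targets the weaknesses of conventional multi-modal fusion in modeling cross-modal synergy and complex interactions. Experiments on Rec-Tmall and three Amazon datasets show an improvement of at least 1.67% over the baseline.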

Link: https://arxiv.org/abs/2508.21460
Authors: Xiaoxi Cui,Weihai Lu,Yu Tong,Yiheng Li,Zhejun Zhao
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: SIGIR 2025

Abstract:Click-through rate (CTR) prediction is used to model users’ interests. However, most existing CTR prediction methods are based mainly on the ID modality. As a result, they are unable to comprehensively model users’ multi-modal preferences. Therefore, it is necessary to introduce multi-modal CTR prediction. Although it seems appealing to directly apply existing multi-modal fusion methods to CTR prediction models, these methods (1) fail to effectively disentangle commonalities and specificities across different modalities, and (2) fail to consider the synergistic effects between modalities and to model the complex interactions between them. To address these issues, this paper proposes the Diffusion-based Multi-modal Synergy Interest Network (Diff-MSIN) framework for click-through rate prediction. The framework introduces three modules: the Multi-modal Feature Enhancement (MFE) Module, the Synergistic Relationship Capture (SRC) Module, and the Feature Dynamic Adaptive Fusion (FDAF) Module. The MFE Module and SRC Module extract synergistic, common, and special information among different modalities. They effectively enhance the representation of the modalities, improving the overall quality of the fusion. To encourage distinctiveness among different features, we design a Knowledge Decoupling method. Additionally, the FDAF Module focuses on capturing user preferences and reducing fusion noise. To validate the effectiveness of the Diff-MSIN framework, we conducted extensive experiments using the Rec-Tmall and three Amazon datasets. The results demonstrate that our approach yields a significant improvement of at least 1.67% compared to the baseline, highlighting its potential for enhancing multi-modal recommendation systems. Our code is available at the following link: this https URL.
Cite as: arXiv:2508.21460 [cs.IR] (or arXiv:2508.21460v1 [cs.IR] for this version)
arXiv DOI: https://doi.org/10.48550/arXiv.2508.21460 (arXiv-issued DOI via DataCite, pending registration)
Journal reference: SIGIR 2025: Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pages 581-591
Related DOI: https://doi.org/10.1145/3726302.3729949

[AI-27] Learning Lifted Action Models From Traces of Incomplete Actions and States KR2025

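Quick read: This paper addresses learning a lifted STRIPS model from random state-action traces in which both states and actions are incomplete. In the motivating sliding-tile puzzle example, states record only tile locations and actions are the argument-free labels up, down, left, and right. The core challenges are that states are incomplete (predicates such as the position of the blank are missing) and actions are incomplete (they do not reveal all the objects involved in their effects and preconditions). Previous approaches mostly assume full STRIPS actions or fully observable predicates, whereas this setting is closer to practice. The key is a STRIPS variant, STRIPS+, in which certain action arguments can be left implicit in preconditions, which may also involve a limited form of existential quantification. The learning algorithm, SYNTH, builds for each action a stratified sequence of precondition expressions ("queries") that denote unique objects in the state and thereby ground the implicit action arguments. SYNTH is shown to be correct and complete, and its scalability is tested on traces obtained from STRIPS+ models derived from existing STRIPS domains.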

Link: https://arxiv.org/abs/2508.21449
Authors: Niklas Jansen,Jonas Gösgens,Hector Geffner
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: To be presented at KR 2025

Abstract:Consider the problem of learning a lifted STRIPS model of the sliding-tile puzzle from random state-action traces where the states represent the location of the tiles only, and the actions are the labels up, down, left, and right, with no arguments. Two challenges are involved in this problem. First, the states are not full STRIPS states, as some predicates are missing, like the atoms representing the position of the "blank". Second, the actions are not full STRIPS either, as they do not reveal all the objects involved in the actions' effects and preconditions. Previous approaches have addressed different versions of this model learning problem, but most assume that actions in the traces are full STRIPS actions or that the domain predicates are all observable. The new setting considered in this work is "more realistic", as the atoms observed convey the state of the world but not full STRIPS states, and the actions reveal the arguments needed for selecting the action but not the ones needed for modeling it in STRIPS. For formulating and addressing the learning problem, we introduce a variant of STRIPS, which we call STRIPS+, where certain STRIPS action arguments can be left implicit in preconditions which can also involve a limited form of existential quantification. The learning problem becomes the problem of learning STRIPS+ models from STRIPS+ state-action traces. For this, the proposed learning algorithm, called SYNTH, constructs a stratified sequence (conjunction) of precondition expressions or "queries" for each action, that denote unique objects in the state and ground the implicit action arguments in STRIPS+. The correctness and completeness of SYNTH is established, and its scalability is tested on state-action traces obtained from STRIPS+ models derived from existing STRIPS domains.

[AI-28] A General Framework of Epistemic Forgetting and its Instantiation by Ranking Functions

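Quick read: This paper addresses how to define and perform forgetting in epistemic states with rich semantic structures, lifting classical propositional approaches to forgetting (such as variable elimination and AGM contraction) to a more expressive epistemic setting. Taking an epistemic perspective, the authors present five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn's ranking functions. Drawing on forgetting postulates from logic programming and AGM belief revision theory, they build a set of axioms for evaluation and use it to analyze systematically how the different operations behave and differ at the epistemic level.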

Link: https://arxiv.org/abs/2508.21441
Authors: Christoph Beierle,Alexander Hahn,Diana Howey,Gabriele Kern-Isberner,Kai Sauerwald
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Forgetting as a knowledge management operation deliberately ignores parts of the knowledge and beliefs of an agent, for various reasons. Forgetting has many facets, one may want to forget parts of the syntax, a proposition, or a conditional. In the literature, two main operators suitable for performing forgetting have been proposed and investigated in depth: First, variable elimination is a syntactical method that blends out certain atomic variables to focus on the rest of the language. It has been mainly used in the area of logic programming and answer set programming. Second, contraction in AGM belief revision theory effectively removes propositions from belief sets under logical deduction. Both operations rely mainly on classical logics. In this article, we take an epistemic perspective and study forgetting operations in epistemic states with richer semantic structures, but with clear links to propositional logic. This allows us to investigate what forgetting in the epistemic background means, thereby lifting well-known and novel forgetting operations to the epistemic level. We present five general types of epistemic forgetting and instantiate them with seven concrete forgetting operations for Spohn’s ranking functions. We take inspiration from postulates of forgetting both from logic programming and AGM theory to propose a rich landscape of axioms for evaluating forgetting operations. Finally, we evaluate all concrete forgetting operations according to all postulates, leading to a novel comprehensive overview highlighting differences and commonalities among the forgetting operators.

[AI-29] The Complexity Trap: Simple Observation Masking Is as Efficient as LLM Summarization for Agent Context Management

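Quick read: This paper addresses the high computational cost that LLM-based Software Engineering (SE) agents incur from long context histories built up through iterative reasoning, exploration, and tool use. The key is a simple observation-masking strategy that manages context length by masking older observations instead of relying on more complex LLM summarization. On SWE-bench Verified, the strategy halves cost relative to the raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization, suggesting that in this setting the most effective context management can be the simplest.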

Link: https://arxiv.org/abs/2508.21433
Authors: Tobias Lindenbauer,Igor Slinko,Ludwig Felder,Egor Bogomolov,Yaroslav Zharov
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Model (LLM)-based agents solve complex tasks through iterative reasoning, exploration, and tool-use, a process that can result in long, expensive context histories. While state-of-the-art Software Engineering (SE) agents like OpenHands or Cursor use LLM-based summarization to tackle this issue, it is unclear whether the increased complexity offers tangible performance benefits compared to simply omitting older observations. We present a systematic comparison of these strategies within SWE-agent on SWE-bench Verified across five diverse model configurations. We find that a simple observation-masking strategy halves cost relative to a raw agent while matching, and sometimes slightly exceeding, the solve rate of LLM summarization. For example, with Qwen3-Coder 480B, masking improves solve rate from 53.8% (raw agent) to 54.8%, while remaining competitive with summarization at a lower cost. These results suggest that, at least within SWE-agent on SWE-bench Verified, the most effective and efficient context management can be the simplest. We release code and data for reproducibility.
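
The masking idea can be sketched in a few lines. This is a minimal illustration and not the paper's or SWE-agent's implementation: the turn format (`role` and `content` keys), the placeholder text, and the `keep_last` parameter are all assumptions made for the example.

```python
# Hypothetical placeholder that replaces the content of masked observations.
PLACEHOLDER = "[observation omitted]"


def mask_old_observations(history, keep_last=2):
    """Return a copy of the history with all but the newest observations masked.

    `history` is assumed to be a list of dicts with "role" and "content" keys,
    where tool outputs carry role "observation". Actions and reasoning turns
    are kept in full; only older observation bodies are replaced.
    """
    obs_idx = [i for i, turn in enumerate(history) if turn["role"] == "observation"]
    to_mask = set(obs_idx[:-keep_last]) if keep_last > 0 else set(obs_idx)
    return [
        {**turn, "content": PLACEHOLDER} if i in to_mask else dict(turn)
        for i, turn in enumerate(history)
    ]
```

Unlike summarization, this needs no extra model call: the agent's own actions stay visible, so it can still see what it tried, while the bulky tool outputs from earlier steps no longer consume context.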

[AI-30] Benchmarking the State of Networks with a Low-Cost Method Based on Reservoir Computing

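Quick read: This paper addresses how to monitor the state of communication and mobility networks with a low-cost, non-invasive method. Conventional approaches typically depend on dedicated sensors or complex models, which are costly and hard to deploy widely. The key is to use the reservoir computing framework: mobile network utilization data is turned into a weighted network model, whose performance is then measured on proxy tasks. The method takes anonymous, aggregated mobile network data as input and trains only a single readout layer instead of every weight in the network, which lowers energy consumption while remaining sensitive to changes in network state. Experiments show that performance depends on the network configuration and visibly decreases when the network is perturbed, so the method could help identify potential weak spots and may extend to near-real-time monitoring.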

Link: https://arxiv.org/abs/2508.21420
Authors: Felix Simon Reimers,Carl-Hendrik Peters,Stefano Nichele
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Net-Zero Future 2025 Conference

Abstract:Using data from mobile network utilization in Norway, we showcase the possibility of monitoring the state of communication and mobility networks with a non-invasive, low-cost method. This method transforms the network data into a model within the framework of reservoir computing and then measures the model’s performance on proxy tasks. Experimentally, we show how the performance on these proxies relates to the state of the network. A key advantage of this approach is that it uses readily available data sets and leverages the reservoir computing framework for an inexpensive and largely agnostic method. Data from mobile network utilization is available in an anonymous, aggregated form with multiple snapshots per day. This data can be treated like a weighted network. Reservoir computing allows the use of weighted, but untrained networks as a machine learning tool. The network, initialized as a so-called echo state network (ESN), projects incoming signals into a higher dimensional space, on which a single trained layer operates. This consumes less energy than deep neural networks in which every weight of the network is trained. We use neuroscience-inspired tasks and train our ESN model to solve them. We then show how the performance depends on certain network configurations and also how it visibly decreases when perturbing the network. While this work serves as proof of concept, we believe it can be elevated to be used for near-real-time monitoring as well as the identification of possible weak spots of both mobile communication networks and transportation networks.
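
The echo state network setup described above can be sketched with NumPy. This is an illustrative toy under stated assumptions, not the authors' pipeline: the reservoir here is a random matrix standing in for the weighted network built from mobile data, the proxy task is a simple delayed-recall memory task chosen for the example, and all function names and hyperparameters are hypothetical.

```python
import numpy as np


def random_reservoir(n=100, spectral_radius=0.9, seed=1):
    """Random stand-in for the weighted network; rescaled to a target spectral radius."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, n))
    return W * spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))


def run_reservoir(W, w_in, inputs):
    """Drive the fixed (untrained) reservoir and collect its states."""
    state = np.zeros(W.shape[0])
    states = []
    for u in inputs:
        state = np.tanh(W @ state + w_in * u)
        states.append(state.copy())
    return np.array(states)


def train_readout(states, targets, ridge=1e-6):
    """Ridge regression for the single trained layer (the linear readout)."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)


def memory_task_error(W, delay=2, steps=1000, washout=100, seed=0):
    """Proxy task: reconstruct the input from `delay` steps ago; returns training MSE."""
    rng = np.random.default_rng(seed)
    inputs = rng.uniform(-1, 1, steps)
    w_in = rng.uniform(-0.5, 0.5, W.shape[0])
    states = run_reservoir(W, w_in, inputs)[washout:]
    targets = inputs[washout - delay: steps - delay]
    w_out = train_readout(states, targets)
    return float(np.mean((states @ w_out - targets) ** 2))
```

Only `w_out` is fitted; the reservoir matrix `W` stays fixed. Re-running `memory_task_error` on a perturbed copy of `W` (for example with some edges removed) gives a comparison in the spirit of the paper's perturbation experiments.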

[AI-31] CARJAN: Agent-Based Generation and Simulation of Traffic Scenarios with AJAN

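Quick read: This paper addresses the difficulty of user-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents, such as pedestrians, cyclists, and autonomous vehicles. The key is CARJAN, a tool built on the multi-agent engineering framework AJAN and the driving simulator CARLA. It provides a visual interface for modeling, storing, and maintaining scenario layouts, and uses SPARQL Behavior Trees for agent decision-making and interaction in dynamic scenarios, giving a first integrated approach to generating and simulating interactive, intelligent agent-based traffic scenarios in CARLA.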

Link: https://arxiv.org/abs/2508.21411
Authors: Leonard Frank Neis,Andre Antakli,Matthias Klusch
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:User-friendly modeling and virtual simulation of urban traffic scenarios with different types of interacting agents such as pedestrians, cyclists and autonomous vehicles remains a challenge. We present CARJAN, a novel tool for semi-automated generation and simulation of such scenarios based on the multi-agent engineering framework AJAN and the driving simulator CARLA. CARJAN provides a visual user interface for the modeling, storage and maintenance of traffic scenario layouts, and leverages SPARQL Behavior Tree-based decision-making and interactions for agents in dynamic scenario simulations in CARLA. CARJAN provides a first integrated approach for interactive, intelligent agent-based generation and simulation of virtual traffic scenarios in CARLA.

[AI-32] DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction

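Quick read: This paper addresses a limitation of pooling mechanisms in existing audio quality assessment: conventional methods operate at a single granularity, focusing either on global statistics or on frame-level detail, and may therefore miss complementary perceptual information. The key is the Dual-Resolution Attentive Statistics Pooling (DRASP) framework, which combines coarse-grained global statistical summaries with fine-grained attentive analysis of perceptually significant segments. This dual-view architecture models the overall structural context and salient local details together, producing a more complete and robust fixed-length representation and improving mean opinion score (MOS) prediction.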

Link: https://arxiv.org/abs/2508.21407
Authors: Cheng-Yeh Yang,Kuan-Tang Huang,Chien-Chun Wang,Hung-Shin Lee,Hsin-Min Wang,Berlin Chen
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: Accepted to APSIPA ASC 2025

Abstract:A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman’s rank correlation coefficient (SRCC) over the widely-used average pooling approach.
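
One plausible reading of "dual-resolution" pooling is sketched below with NumPy. The abstract does not specify the architecture, so everything here is an assumption for illustration: the segment length, the single scoring vector used for attention, and the function names are hypothetical, and a real model would learn the attention parameters.

```python
import numpy as np


def global_stats(frames):
    """Coarse branch: mean and standard deviation over all frames."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def attentive_stats(frames, score_vector, segment_len=4):
    """Fine branch: attention-weighted mean and std over short segments.

    Frames are averaged into non-overlapping segments; each segment gets a
    softmax weight from a (hypothetical, normally learned) scoring vector.
    """
    n_seg = frames.shape[0] // segment_len
    segs = frames[: n_seg * segment_len].reshape(n_seg, segment_len, -1).mean(axis=1)
    scores = segs @ score_vector
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    mean = weights @ segs
    var = weights @ (segs - mean) ** 2
    return np.concatenate([mean, np.sqrt(np.maximum(var, 0.0))])


def dual_resolution_pool(frames, score_vector, segment_len=4):
    """Concatenate both views into one fixed-size vector of length 4 * feature_dim."""
    return np.concatenate(
        [global_stats(frames), attentive_stats(frames, score_vector, segment_len)]
    )
```

The output length depends only on the feature dimension, not on the number of frames, which is the property a pooling layer needs in order to map variable-length audio to a fixed-size representation.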

[AI-33] AI Compute Architecture and Evolution Trends

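Quick read: This paper addresses the multi-level challenges that arise as AI development shifts from academic research to practical applications, including architectural complexity, system scalability, model evolution paths, and ecosystem sustainability. The key is a seven-layer model of AI compute architecture (Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, Application Layer). Based on a three-stage evolution of large language models (LLMs), the paper analyzes the development trajectory and key technologies of each layer, offering a structured framework and direction for building an efficient, scalable, and self-sustaining AI ecosystem.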

Link: https://arxiv.org/abs/2508.21394
Authors: Bor-Sung Liang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 29 pages, 26 figures

Abstract:The focus of AI development has shifted from academic research to practical applications. However, AI development faces numerous challenges at various levels. This article analyzes the opportunities and challenges of AI from several perspectives using a structured approach. It proposes a seven-layer model for AI compute architecture, comprising, from bottom to top, the Physical Layer, Link Layer, Neural Network Layer, Context Layer, Agent Layer, Orchestrator Layer, and Application Layer. It also explains how AI computing has evolved into this seven-layer architecture through the three-stage evolution of large-scale language models (LLMs). For each layer, we describe the development trajectory and key technologies. In Layers 1 and 2 we discuss AI computing issues and the impact of Scale-Up and Scale-Out strategies on computing architecture. In Layer 3 we explore two different development paths for LLMs. In Layer 4 we discuss the impact of contextual memory on LLMs and compare it to traditional processor memory. In Layers 5 to 7 we discuss the trends of AI agents and explore the issues in the evolution from a single AI agent to an AI-based ecosystem, and their impact on the AI industry. Furthermore, AI development involves not only technical challenges but also the economic issues of building a self-sustainable ecosystem. This article analyzes the internet industry to provide predictions on the future trajectory of AI development.

[AI-34] zkLoRA: Fine-Tuning Large Language Models with Verifiable Security via Zero-Knowledge Proofs

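Quick read: This paper addresses the security and verifiability of fine-tuning large language models (LLMs) in untrusted environments, in particular how to protect the privacy of model parameters and training data and guarantee the correctness of the process during parameter-efficient fine-tuning such as Low-Rank Adaptation (LoRA). The key is zkLoRA, presented as the first framework to combine LoRA fine-tuning with zero-knowledge proofs (ZKPs). Using cryptographic techniques such as lookup arguments, sumcheck protocols, and polynomial commitments, it verifies both arithmetic and non-arithmetic operations in Transformer architectures end to end, covering forward propagation, backward propagation, and parameter updates, so that privacy is preserved while security and correctness remain provable.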

Link: https://arxiv.org/abs/2508.21393
Authors: Guofu Liao,Taotao Wang,Shengli Zhang,Jiqun Zhang,Shi Long,Dacheng Tao
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Fine-tuning large language models (LLMs) is crucial for adapting them to specific tasks, yet it remains computationally demanding and raises concerns about correctness and privacy, particularly in untrusted environments. Although parameter-efficient methods like Low-Rank Adaptation (LoRA) significantly reduce resource requirements, ensuring the security and verifiability of fine-tuning under zero-knowledge constraints remains an unresolved challenge. To address this, we introduce zkLoRA, the first framework to integrate LoRA fine-tuning with zero-knowledge proofs (ZKPs), achieving provable security and correctness. zkLoRA employs advanced cryptographic techniques – such as lookup arguments, sumcheck protocols, and polynomial commitments – to verify both arithmetic and non-arithmetic operations in Transformer-based architectures. The framework provides end-to-end verifiability for forward propagation, backward propagation, and parameter updates during LoRA fine-tuning, while safeguarding the privacy of model parameters and training data. Leveraging GPU-based implementations, zkLoRA demonstrates practicality and efficiency through experimental validation on open-source LLMs like LLaMA, scaling up to 13 billion parameters. By combining parameter-efficient fine-tuning with ZKPs, zkLoRA bridges a critical gap, enabling secure and trustworthy deployment of LLMs in sensitive or untrusted environments.
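
As background, the LoRA parameterization that such a proof system has to cover can be written as follows. This is the standard formulation from the LoRA literature, not a detail taken from the zkLoRA paper; $W_0$ is the frozen pretrained weight, $A$ and $B$ are the trainable low-rank factors, $r$ is the rank, and $\alpha$ is a scaling constant.

```latex
W = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Only $A$ and $B$ receive gradient updates, so the trainable state is much smaller than the full weight matrix. The surrounding Transformer computation must still be handled when verifying forward and backward propagation, which is presumably why the abstract mentions both arithmetic and non-arithmetic operations.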

[AI-35] Iterative Inference in a Chess-Playing Neural Network

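Quick read: This paper asks whether neural networks build their representations through smooth, gradual refinement or through more complex computational processes. The key is to extend the logit lens to the policy network of Leela Chess Zero, a superhuman chess engine, and so reveal how the policy distribution evolves across layers. Playing strength and puzzle-solving ability increase strongly and monotonically with depth, yet policy distributions often follow non-smooth trajectories: correct solutions found early may be discarded by later layers, move rankings correlate poorly with the final output, and policy divergence stays high until late in the network. Policy formation is therefore not a simple convergence process.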

Link: https://arxiv.org/abs/2508.21380
Authors: Elias Sandmann,Sebastian Lapuschkin,Wojciech Samek
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Do neural networks build their representations through smooth, gradual refinement, or via more complex computational processes? We investigate this by extending the logit lens to analyze the policy network of Leela Chess Zero, a superhuman chess engine. We find strong monotonic trends in playing strength and puzzle-solving ability across layers, yet policy distributions frequently follow non-smooth trajectories. Evidence for this includes correct puzzle solutions that are discovered early but subsequently discarded, move rankings that remain poorly correlated with final outputs, and high policy divergence until late in the network. These findings contrast with the smooth distributional convergence typically observed in language models.
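
The logit-lens idea described above can be illustrated with a toy sketch: decode each layer's hidden state with the final output head and compare the per-layer distributions with the final one. Everything here is an assumption for illustration (the two-dimensional hidden states, the three-move `policy_head`, and the function names are hypothetical); it is not Leela Chess Zero's architecture or the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden_states, policy_head):
    """Decode every layer's hidden state with the same final policy head."""
    policies = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in policy_head]
        policies.append(softmax(logits))
    return policies


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support in q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical toy data: 3 layers, hidden size 2, 3 candidate moves.
hidden_states = [[0.2, 1.0], [1.5, 0.1], [0.3, 2.0]]
policy_head = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

policies = logit_lens(hidden_states, policy_head)
final_policy = policies[-1]
# Divergence of each layer's policy from the final policy.
divergences = [kl_divergence(final_policy, p) for p in policies]
# Top-ranked move per layer; here the final move appears early, is dropped, then returns.
top_moves = [max(range(len(p)), key=p.__getitem__) for p in policies]
```

In this toy setup the middle layer prefers a different move than the first and last layers, which mirrors the kind of non-smooth trajectory the abstract reports, though only as a constructed example.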

[AI-36] RoboInspector: Unveiling the Unreliability of Policy Code for LLM-enabled Robotic Manipulation

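[Quick read]: This paper addresses the unreliability of policy code generated by large language models (LLMs) for robotic manipulation, in particular generation failures caused by diverse user instructions and task complexity. Its key contribution is RoboInspector, a pipeline that systematically identifies and characterizes unreliable policy-code behaviors along two dimensions: task complexity and instruction granularity. Building on feedback from failed policy code, the authors also design a refinement approach that improves the reliability of policy code generation by up to 35%, validated in both simulation and real-world environments.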

Link: https://arxiv.org/abs/2508.21378
Authors: Chenduo Ying, Linkang Du, Peng Cheng, Yuanchao Shu
Affiliations: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) demonstrate remarkable capabilities in reasoning and code generation, enabling robotic manipulation to be initiated with just a single instruction. The LLM carries out various tasks by generating policy code required to control the robot. Despite advances in LLMs, achieving reliable policy code generation remains a significant challenge due to the diverse requirements of real-world tasks and the inherent complexity of user instructions. In practice, different users may provide distinct instructions to drive the robot for the same task, which may cause the unreliability of policy code generation. To bridge this gap, we design RoboInspector, a pipeline to unveil and characterize the unreliability of the policy code for LLM-enabled robotic manipulation from two perspectives: the complexity of the manipulation task and the granularity of the instruction. We perform comprehensive experiments with 168 distinct combinations of tasks, instructions, and LLMs in two prominent frameworks. The RoboInspector identifies four main unreliable behaviors that lead to manipulation failure. We provide a detailed characterization of these behaviors and their underlying causes, giving insight for practical development to reduce unreliability. Furthermore, we introduce a refinement approach guided by failure policy code feedback that improves the reliability of policy code generation by up to 35% in LLM-enabled robotic manipulation, evaluated in both simulation and real-world environments.

[AI-37] Think in Games: Learning to Reason in Games via Reinforcement Learning with Large Language Models

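[Quick read]: This paper addresses the poor performance of large language models (LLMs) on interactive tasks: LLMs hold rich declarative knowledge but struggle to turn it into dynamic decision-making ability (procedural knowledge). Traditional reinforcement learning (RL) can acquire procedural knowledge, but it behaves as a black box and needs large amounts of training data. The paper proposes the Think in Games (TiG) framework, whose core idea is to recast RL decision-making as a language modeling task: LLMs generate language-guided policies that are refined iteratively through online reinforcement learning based on environmental feedback. This lets LLMs gain interactive experience directly in game environments while retaining their reasoning and explanatory abilities, yielding efficient, interpretable decisions at lower data and compute cost.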

Link: https://arxiv.org/abs/2508.21365
Authors: Yi Liao, Yu Gu, Yuan Sui, Zining Zhu, Yifan Lu, Guohua Tang, Zhongqian Sun, Wei Yang
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) excel at complex reasoning tasks such as mathematics and coding, yet they frequently struggle with simple interactive tasks that young children perform effortlessly. This discrepancy highlights a critical gap between declarative knowledge (knowing about something) and procedural knowledge (knowing how to do something). Although traditional reinforcement learning (RL) agents can acquire procedural knowledge through environmental interaction, they often operate as black boxes and require substantial training data. In contrast, LLMs possess extensive world knowledge and reasoning capabilities, but are unable to effectively convert this static knowledge into dynamic decision-making in interactive settings. To address this challenge, we propose Think in Games (TiG), a novel framework that empowers LLMs to develop procedural understanding through direct interaction with game environments, while retaining their inherent reasoning and explanatory abilities. Specifically, TiG reformulates RL-based decision-making as a language modeling task: LLMs generate language-guided policies, which are refined iteratively through online reinforcement learning based on environmental feedback. Our experimental results show that TiG successfully bridges the gap between declarative and procedural knowledge, achieving competitive performance with dramatically lower data and computational demands compared to conventional RL methods. Moreover, TiG provides step-by-step natural language explanations for its decisions, greatly improving transparency and interpretability in complex interactive tasks.

[AI-38] Adaptive Heavy-Tailed Stochastic Gradient Descent

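[Quick read]: This paper addresses the poor generalization of optimization algorithms for large-scale neural networks caused by overreliance on training loss. Its core solution is Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD), which builds on two key empirical observations: gradient noise in stochastic gradient descent is inherently heavy-tailed, and neural network training exhibits the Edge of Stability phenomenon, in which curvature grows before settling at a plateau. AHTSGD injects heavy-tailed noise early in training to strengthen exploration and gradually shifts to lighter-tailed noise as sharpness stabilizes, dynamically adapting to the sharpness of the loss landscape and accelerating convergence to wide basins, which improves generalization. It is the first algorithm to adjust the nature of the noise injected into an optimizer based on the Edge of Stability phenomenon. It outperforms SGD and other noise-based methods on benchmarks such as MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN, and is robust to learning rate choices.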

Link: https://arxiv.org/abs/2508.21353
Authors: Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations - the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau, we introduce Adaptive Heavy Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of injected noise into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.
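
As a rough illustration of injecting noise whose tail heaviness depends on how much sharpness is still changing, the sketch below draws symmetric alpha-stable noise with the standard Chambers-Mallows-Stuck sampler and adds it to an SGD step. The abstract does not give the actual schedule or noise model, so `tail_index`, its `alpha_min`/`alpha_max` bounds, and `noisy_sgd_step` are hypothetical choices for illustration, not the AHTSGD algorithm itself.

```python
import math
import random


def sample_symmetric_alpha_stable(alpha, rng):
    """Chambers-Mallows-Stuck sampler for a symmetric alpha-stable draw, 0 < alpha <= 2.

    alpha = 2 gives Gaussian-like (light) tails; smaller alpha gives heavier tails.
    """
    u = rng.uniform(-math.pi / 2, math.pi / 2)
    w = max(rng.expovariate(1.0), 1e-12)  # guard against a zero exponential draw
    left = math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
    right = (math.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha)
    return left * right


def tail_index(prev_sharpness, curr_sharpness, alpha_min=1.2, alpha_max=2.0):
    """Hypothetical schedule: heavy tails while sharpness is moving, light tails once it plateaus."""
    rel_change = abs(curr_sharpness - prev_sharpness) / max(abs(prev_sharpness), 1e-12)
    return alpha_max - (alpha_max - alpha_min) * min(1.0, rel_change)


def noisy_sgd_step(weights, grads, lr, noise_scale, alpha, rng):
    """One SGD step with alpha-stable noise added to each gradient coordinate."""
    return [
        w - lr * (g + noise_scale * sample_symmetric_alpha_stable(alpha, rng))
        for w, g in zip(weights, grads)
    ]


rng = random.Random(0)
alpha_early = tail_index(prev_sharpness=1.0, curr_sharpness=3.0)  # sharpness still growing
alpha_late = tail_index(prev_sharpness=3.0, curr_sharpness=3.0)   # sharpness has plateaued
updated = noisy_sgd_step([1.0, -2.0], [0.5, 0.1], lr=0.1, noise_scale=0.01, alpha=alpha_early, rng=rng)
```

In practice the sharpness estimate would come from something like the largest Hessian eigenvalue of the loss; that part is omitted here.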

[AI-39] DLGAN: Time Series Synthesis Based on Dual-Layer Generative Adversarial Networks

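[Quick read]: This paper addresses two problems with existing time series synthesis methods: they struggle to preserve the temporal dependencies of the original series during generation, and modeling temporal features directly on random sequences makes it hard to capture the feature information of real data. The key to its solution is a simple but effective generative model, Dual-Layer Generative Adversarial Networks (DLGAN), which splits time series generation into two stages: sequence feature extraction and sequence reconstruction. First, the two stages form a complete time series autoencoder that is trained with supervision on the original series, so that reconstruction can restore the temporal dependencies of the sequence. Second, a generative adversarial network (GAN) generates synthetic feature vectors aligned with the feature vectors of real time series, so that the generator captures temporal features from real data.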

Link: https://arxiv.org/abs/2508.21340
Authors: Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 3 figures

Abstract: Time series synthesis is an effective approach to ensuring the secure circulation of time series data. Existing time series synthesis methods typically perform temporal modeling based on random sequences to generate target sequences, which often struggle to ensure the temporal dependencies in the generated time series. Additionally, directly modeling temporal features on random sequences makes it challenging to accurately capture the feature information of the original time series. To address the above issues, we propose a simple but effective generative model Dual-Layer Generative Adversarial Networks, named DLGAN. The model decomposes the time series generation process into two stages: sequence feature extraction and sequence reconstruction. First, these two stages form a complete time series autoencoder, enabling supervised learning on the original time series to ensure that the reconstruction process can restore the temporal dependencies of the sequence. Second, a Generative Adversarial Network (GAN) is used to generate synthetic feature vectors that align with the real-time sequence feature vectors, ensuring that the generator can capture the temporal features from real time series. Extensive experiments on four public datasets demonstrate the superiority of this model across various evaluation metrics.

[AI-40] Stage-Diff: Stage-wise Long-Term Time Series Generation Based on Diffusion Models

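[Quick read]: This paper addresses two core challenges in long-term time series generation: balancing long-range temporal dependencies against the gradual drift of the data distribution over time, and capturing both the complex relationships between different feature sequences (inter-sequence dependencies) and the structure within each sequence (intra-sequence dependencies). The key to its solution is Stage-Diff, a staged generation framework based on diffusion models. Stage-wise sequence generation with inter-stage information transfer preserves long-term dependencies while modeling distribution drift. Within each stage, progressive sequence decomposition enables channel-independent modeling at multiple time scales, and multi-channel fusion modeling is used to strengthen the integration of cross-sequence information, striking a balance between intra-sequence and inter-sequence dependencies.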

Link: https://arxiv.org/abs/2508.21330
Authors: Xuan Hou, Shuhan Liu, Zhaohui Peng, Yaohui Chu, Yue Zhang, Yining Wang
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures

Abstract:Generative models have been successfully used in the field of time series generation. However, when dealing with long-term time series, which span over extended periods and exhibit more complex long-term temporal patterns, the task of generation becomes significantly more challenging. Long-term time series exhibit long-range temporal dependencies, but their data distribution also undergoes gradual changes over time. Finding a balance between these long-term dependencies and the drift in data distribution is a key challenge. On the other hand, long-term time series contain more complex interrelationships between different feature sequences, making the task of effectively capturing both intra-sequence and inter-sequence dependencies another important challenge. To address these issues, we propose Stage-Diff, a staged generative model for long-term time series based on diffusion models. First, through stage-wise sequence generation and inter-stage information transfer, the model preserves long-term sequence dependencies while enabling the modeling of data distribution shifts. Second, within each stage, progressive sequence decomposition is applied to perform channel-independent modeling at different time scales, while inter-stage information transfer utilizes multi-channel fusion modeling. This approach combines the robustness of channel-independent modeling with the information fusion advantages of multi-channel modeling, effectively balancing the intra-sequence and inter-sequence dependencies of long-term time series. Extensive experiments on multiple real-world datasets validate the effectiveness of Stage-Diff in long-term time series generation tasks.

[AI-41] Multi-Ontology Integration with Dual-Axis Propagation for Medical Concept Representation CIKM2025

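[Quick read]: This paper addresses a limitation of existing medical concept representation learning methods that use ontology graphs: most work models relationships within a single ontology system, or handles multiple ontology systems (such as diseases, drugs, and procedures) in isolation, without a unified learning framework that integrates cross-ontology connections. The key to its solution is LINKO, a large language model (LLM)-augmented integrative ontology learning framework that fuses knowledge within and across ontologies through dual-axis propagation: vertical propagation along the hierarchy inside each ontology (intra-ontology vertical propagation), and horizontal propagation across ontologies at the same level (inter-ontology horizontal propagation). This improves medical concept representations and shows stronger robustness in rare disease prediction and limited-data scenarios.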

Link: https://arxiv.org/abs/2508.21320
Authors: Mohsen Nayebi Kerdabadi, Arya Hadizadeh Moghaddam, Dongjie Wang, Zijun Yao
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This work has been accepted as a full research paper at CIKM 2025

Abstract:Medical ontology graphs map external knowledge to medical codes in electronic health records via structured relationships. By leveraging domain-approved connections (e.g., parent-child), predictive models can generate richer medical concept representations by incorporating contextual information from related concepts. However, existing literature primarily focuses on incorporating domain knowledge from a single ontology system, or from multiple ontology systems (e.g., diseases, drugs, and procedures) in isolation, without integrating them into a unified learning structure. Consequently, concept representation learning often remains limited to intra-ontology relationships, overlooking cross-ontology connections. In this paper, we propose LINKO, a large language model (LLM)-augmented integrative ontology learning framework that leverages multiple ontology graphs simultaneously by enabling dual-axis knowledge propagation both within and across heterogeneous ontology systems to enhance medical concept representation learning. Specifically, LINKO first employs LLMs to provide a graph-retrieval-augmented initialization for ontology concept embedding, through an engineered prompt that includes concept descriptions, and is further augmented with ontology context. Second, our method jointly learns the medical concepts in diverse ontology graphs by performing knowledge propagation in two axes: (1) intra-ontology vertical propagation across hierarchical ontology levels and (2) inter-ontology horizontal propagation within every level in parallel. Last, through extensive experiments on two public datasets, we validate the superior performance of LINKO over state-of-the-art baselines. As a plug-in encoder compatible with existing EHR predictive models, LINKO further demonstrates enhanced robustness in scenarios involving limited data availability and rare disease prediction.

[AI-42] MultiFluxAI Enhancing Platform Engineering with Advanced Agent-Orchestrated Retrieval Systems

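[Quick read]: This paper addresses the difficulty of managing and integrating large, heterogeneous data sources in product engineering, especially across application domains, where the goal is to answer complex user queries efficiently and improve user engagement in the digital ecosystem. The key to its solution is MultiFluxAI, an AI platform that combines Generative AI, vectorization, and agentic orchestration to provide dynamic, context-aware responses to complex queries.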

Link: https://arxiv.org/abs/2508.21307
Authors: Sri Ram Macharla, Sridhar Murthy J, Anjaneyulu Pasala
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Abstract accepted for presentation at ACM ISEC 2025

Abstract:MultiFluxAI is an innovative AI platform developed to address the challenges of managing and integrating vast, disparate data sources in product engineering across application domains. It addresses both current and new service related queries that enhance user engagement in the digital ecosystem. This platform leverages advanced AI techniques, such as Generative AI, vectorization, and agentic orchestration to provide dynamic and context-aware responses to complex user queries.

[AI-43] Locus: Agentic Predicate Synthesis for Directed Fuzzing

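[Quick read]: This paper addresses the low search efficiency of directed fuzzing, which arises because target states are deeply nested in the program and the input space is enormous. Existing methods guide the search with branch distances or manually specified constraints, but branch information is often insufficient to characterize progress precisely, and manual constraints are hard to generalize across bug types and programs. The key to its solution is the Locus framework, which automatically synthesizes semantically meaningful intermediate-state predicates that serve as milestones toward the target state. Once instrumented into the program, these predicates can reject executions unlikely to reach the target state and provide additional coverage guidance. Locus uses an agentic framework with program analysis tools to generate and iteratively refine candidate predicates, and relies on symbolic execution to ensure the predicates strictly relax the target states so that no execution is falsely rejected. It substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, with an average speedup of 41.6x.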

Link: https://arxiv.org/abs/2508.21302
Authors: Jie Zhu, Chihao Shen, Ziyang Li, Jiahao Yu, Yizheng Chen, Kexin Pei
Affiliations: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments:

Abstract:Directed fuzzing aims to find program inputs that lead to specified target program states. It has broad applications, such as debugging system crashes, confirming reported bugs, and generating exploits for potential vulnerabilities. This task is inherently challenging because target states are often deeply nested in the program, while the search space manifested by numerous possible program inputs is prohibitively large. Existing approaches rely on branch distances or manually-specified constraints to guide the search; however, the branches alone are often insufficient to precisely characterize progress toward reaching the target states, while the manually specified constraints are often tailored for specific bug types and thus difficult to generalize to diverse target states and programs. We present Locus, a novel framework to improve the efficiency of directed fuzzing. Our key insight is to synthesize predicates to capture fuzzing progress as semantically meaningful intermediate states, serving as milestones towards reaching the target states. When used to instrument the program under fuzzing, they can reject executions unlikely to reach the target states, while providing additional coverage guidance. To automate this task and generalize to diverse programs, Locus features an agentic framework with program analysis tools to synthesize and iteratively refine the candidate predicates, while ensuring the predicates strictly relax the target states to prevent false rejections via symbolic execution. Our evaluation shows that Locus substantially improves the efficiency of eight state-of-the-art fuzzers in discovering real-world vulnerabilities, achieving an average speedup of 41.6x. So far, Locus has found eight previously unpatched bugs, with one already acknowledged with a draft patch. 

[AI-44] MyGO: Memory Yielding Generative Offline-consolidation for Lifelong Learning Systems

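[Quick read]: This paper addresses catastrophic forgetting in continual learning caused by sequential task updates, and in particular the privacy leakage, storage overhead, and cross-task performance degradation of existing methods that store raw data (such as experience replay) or rely on complex regularization terms. The key to its solution is MyGO (Memory Yielding Generative Offline-consolidation), a framework inspired by the biological wake-sleep cycle. In the "wake" phase, the system quickly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture that task's data distribution. In the "sleep" phase, it goes offline, uses all learned G-mem models to generate pseudo-data ("dreams"), and consolidates new and old knowledge into a core feature extractor via knowledge distillation. The mechanism stores no raw data, only lightweight generative models, which mitigates forgetting while preserving privacy and storage efficiency, and it is evaluated on both computer vision and natural language processing tasks.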

Link: https://arxiv.org/abs/2508.21296
Authors: Shihao Ji, Zihui Song
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 5 pages

Abstract:Continual or Lifelong Learning aims to develop models capable of acquiring new knowledge from a sequence of tasks without catastrophically forgetting what has been learned before. Existing approaches often rely on storing samples from previous tasks (experience replay) or employing complex regularization terms to protect learned weights. However, these methods face challenges related to data privacy, storage limitations, and performance degradation when tasks are dissimilar. To address these challenges, we introduce MyGO (Memory Yielding Generative Offline-consolidation), a novel lifelong learning framework inspired by the biological wake-sleep cycle. During the “wake” phase, the system rapidly learns a new task and trains a compact generative model (Generative Memory, G-mem) to capture its data distribution. During the “sleep” phase, the system enters an offline state, using all learned G-mem models to generate pseudo-data (“dreams”) and consolidate new and old knowledge into a core feature extractor via knowledge distillation. This approach obviates the need to store any raw data, retaining only compact generative models, which offers significant advantages in privacy and storage efficiency. We evaluate MyGO on computer vision (Split-MNIST) and natural language processing (Split-AG News) benchmarks, comparing it against a sequential fine-tuning baseline. The results demonstrate that MyGO significantly mitigates catastrophic forgetting and maintains high average accuracy across tasks, proving the framework’s effectiveness and domain-generality.

[AI-45] Breaking the Cold-Start Barrier: Reinforcement Learning with Double and Dueling DQNs

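[Quick read]: This paper addresses the cold-user problem in recommender systems, where new users with sparse interaction histories are hard to serve with accurate recommendations. The key to its solution is a reinforcement learning framework using Double and Dueling Deep Q-Networks (DQN) that dynamically learns user preferences from limited feedback and is integrated with a matrix factorization model. This improves recommendation accuracy without relying on sensitive demographic data, and in particular achieves better RMSE than popularity-based baselines and active learning strategies.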

Link: https://arxiv.org/abs/2508.21259
Authors: Minda Zhao
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recommender systems struggle to provide accurate suggestions to new users with limited interaction history, a challenge known as the cold-user problem. This paper proposes a reinforcement learning approach using Double and Dueling Deep Q-Networks (DQN) to dynamically learn user preferences from sparse feedback, enhancing recommendation accuracy without relying on sensitive demographic data. By integrating these advanced DQN variants with a matrix factorization model, we achieve superior performance on a large e-commerce dataset compared to traditional methods like popularity-based and active learning strategies. Experimental results show that our method, particularly Dueling DQN, reduces Root Mean Square Error (RMSE) for cold users, offering an effective solution for privacy-constrained environments.
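
For readers unfamiliar with the two DQN variants named above, here is a minimal sketch of their standard textbook forms: the dueling aggregation of a state value and per-action advantages, and the Double DQN bootstrap target. The function names and the toy numbers are illustrative assumptions; the abstract does not describe how the paper wires these into its matrix factorization model.

```python
def dueling_q_values(state_value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


def double_dqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


# Toy example with two candidate items (actions) to recommend next.
q_values = dueling_q_values(state_value=1.0, advantages=[0.5, -0.5])
target = double_dqn_target(
    reward=1.0,
    gamma=0.9,
    q_online_next=[0.2, 0.8],   # online network prefers action 1
    q_target_next=[0.5, 0.3],   # target network's estimate for action 1 is 0.3
    done=False,
)
```

Decoupling action selection from action evaluation is what reduces the overestimation bias of plain Q-learning, which matters when feedback is as sparse as it is for cold users.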

[AI-46] A Mixture of Experts Gating Network for Enhanced Surrogate Modeling in External Aerodynamics

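[Quick read]: This paper addresses the high computational cost of high-fidelity computational fluid dynamics (CFD) simulations in the automotive design and optimization cycle, and proposes a meta-learning framework based on a Mixture of Experts (MoE). Its key idea is to exploit the complementary strengths of three heterogeneous, state-of-the-art surrogate models (DoMINO, X-MeshGraphNet, and FigConvNet): a dedicated gating network assigns spatially adaptive weights to combine each expert's local predictions of the surface pressure and wall shear stress fields. An entropy regularization term prevents model collapse and encourages balanced expert contributions. The result is a significant reduction in L-2 prediction error on the DrivAerML dataset, outperforming both the individual experts and the ensemble average.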

Link: https://arxiv.org/abs/2508.21249
Authors: Mohammad Amin Nabian, Sanjay Choudhry
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA); Fluid Dynamics (physics.flu-dyn)
Comments:

Abstract:The computational cost associated with high-fidelity CFD simulations remains a significant bottleneck in the automotive design and optimization cycle. While ML-based surrogate models have emerged as a promising alternative to accelerate aerodynamic predictions, the field is characterized by a diverse and rapidly evolving landscape of specialized neural network architectures, with no single model demonstrating universal superiority. This paper introduces a novel meta-learning framework that leverages this architectural diversity as a strength. We propose a Mixture of Experts (MoE) model that employs a dedicated gating network to dynamically and optimally combine the predictions from three heterogeneous, state-of-the-art surrogate models: DoMINO, a decomposable multi-scale neural operator; X-MeshGraphNet, a scalable multi-scale graph neural network; and FigConvNet, a factorized implicit global convolution network. The gating network learns a spatially-variant weighting strategy, assigning credibility to each expert based on its localized performance in predicting surface pressure and wall shear stress fields. To prevent model collapse and encourage balanced expert contributions, we integrate an entropy regularization term into the training loss function. The entire system is trained and validated on the DrivAerML dataset, a large-scale, public benchmark of high-fidelity CFD simulations for automotive aerodynamics. Quantitative results demonstrate that the MoE model achieves a significant reduction in L-2 prediction error, outperforming not only the ensemble average but also the most accurate individual expert model across all evaluated physical quantities. This work establishes the MoE framework as a powerful and effective strategy for creating more robust and accurate composite surrogate models by synergistically combining the complementary strengths of specialized architectures.
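
The gating scheme described above can be sketched in a few lines: per-point gate logits are turned into softmax weights, the expert predictions are mixed with those weights, and an entropy term is subtracted from the loss so that the gate is discouraged from collapsing onto one expert. The exact loss form, the `entropy_weight` value, and all function names are assumptions made for illustration; the abstract only states that an entropy regularization term is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_experts(expert_preds, gate_logits):
    """Mix per-point expert predictions with per-point softmax gate weights.

    expert_preds[i][k] is expert k's prediction at surface point i;
    gate_logits[i][k] is the gating network's raw score for expert k at point i.
    """
    weights = [softmax(g) for g in gate_logits]
    mixed = [sum(w * p for w, p in zip(ws, ps)) for ws, ps in zip(weights, expert_preds)]
    return mixed, weights


def gate_entropy(weights_at_point):
    """Shannon entropy of the gate weights at one point (higher means more balanced)."""
    return -sum(w * math.log(w) for w in weights_at_point if w > 0.0)


def moe_loss(expert_preds, gate_logits, targets, entropy_weight=0.01):
    """Assumed loss: mean squared error minus a weighted mean gate entropy."""
    mixed, weights = combine_experts(expert_preds, gate_logits)
    mse = sum((m - t) ** 2 for m, t in zip(mixed, targets)) / len(targets)
    mean_entropy = sum(gate_entropy(w) for w in weights) / len(weights)
    return mse - entropy_weight * mean_entropy


# One surface point, three experts, uniform gate: the mix is the plain average.
mixed_uniform, weights_uniform = combine_experts([[1.0, 2.0, 3.0]], [[0.0, 0.0, 0.0]])
# A strongly peaked gate follows the first expert almost exactly.
mixed_peaked, weights_peaked = combine_experts([[1.0, 2.0, 3.0]], [[50.0, 0.0, 0.0]])
```

With uniform gate logits the mixture reduces to the ensemble average, the baseline the abstract says the learned gate outperforms.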

[AI-47] Full-Frequency Temporal Patching and Structured Masking for Enhanced Audio Classification

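[Quick read]: This paper addresses a problem in current Transformer-based and State-Space Model (SSM)-based audio classification: the square patching borrowed from computer vision breaks the continuous frequency patterns in spectrograms and produces too many patches, which slows training and increases computation. The key to its solution is a new patching strategy, Full-Frequency Temporal Patching (FFTP), which covers the full frequency range while modeling local context along the time axis. This better matches the time-frequency asymmetry of spectrograms, preserves harmonic structure, and significantly reduces patch count and computation. The paper also introduces SpecMask, a patch-aligned spectrogram masking method that combines full-frequency masks and localized time-frequency masks under a fixed masking budget, improving robustness to temporal perturbations while preserving spectral continuity. The approach improves performance and computational efficiency on AudioSet-18k and SpeechCommandsV2.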

Link: https://arxiv.org/abs/2508.21243
Authors: Aditya Makineni, Baocheng Geng, Qing Tian
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments:

Abstract:Transformers and State-Space Models (SSMs) have advanced audio classification by modeling spectrograms as sequences of patches. However, existing models such as the Audio Spectrogram Transformer (AST) and Audio Mamba (AuM) adopt square patching from computer vision, which disrupts continuous frequency patterns and produces an excessive number of patches, slowing training, and increasing computation. We propose Full-Frequency Temporal Patching (FFTP), a patching strategy that better matches the time-frequency asymmetry of spectrograms by spanning full frequency bands with localized temporal context, preserving harmonic structure, and significantly reducing patch count and computation. We also introduce SpecMask, a patch-aligned spectrogram augmentation that combines full-frequency and localized time-frequency masks under a fixed masking budget, enhancing temporal robustness while preserving spectral continuity. When applied on both AST and AuM, our patching method with SpecMask improves mAP by up to +6.76 on AudioSet-18k and accuracy by up to +8.46 on SpeechCommandsV2, while reducing computation by up to 83.26%, demonstrating both performance and efficiency gains.
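
To make the patch-count argument concrete, the sketch below compares non-overlapping square patches with patches that span the full frequency axis and a short time window. The spectrogram size (128 mel bins by 1024 frames) and the patch widths are hypothetical values chosen for illustration, not the configurations reported in the paper.

```python
def patch_count(n_freq, n_time, patch_f, patch_t):
    """Number of non-overlapping patches of size patch_f x patch_t."""
    return (n_freq // patch_f) * (n_time // patch_t)


def extract_patches(spec, patch_f, patch_t):
    """Cut a spectrogram (list of frequency rows) into non-overlapping patches."""
    n_freq, n_time = len(spec), len(spec[0])
    patches = []
    for f0 in range(0, n_freq - patch_f + 1, patch_f):
        for t0 in range(0, n_time - patch_t + 1, patch_t):
            patches.append([row[t0:t0 + patch_t] for row in spec[f0:f0 + patch_f]])
    return patches


# Hypothetical spectrogram shape: 128 mel bins x 1024 time frames.
square_patches = patch_count(128, 1024, patch_f=16, patch_t=16)
full_freq_patches = patch_count(128, 1024, patch_f=128, patch_t=8)

# Tiny 4 x 6 spectrogram to show the two layouts side by side.
toy_spec = [[f * 10 + t for t in range(6)] for f in range(4)]
toy_square = extract_patches(toy_spec, patch_f=2, patch_t=2)
toy_full_freq = extract_patches(toy_spec, patch_f=4, patch_t=2)
```

Each full-frequency patch keeps every frequency bin of its time window together, so a harmonic stack is never split across patches, and the sequence the model sees is shorter.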

[AI-48] Addressing accuracy and hallucination of LLMs in Alzheimer's disease research through knowledge graphs

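[Quick read]: This paper addresses the problems large language models (LLMs) face in biomedical applications such as Alzheimer's disease research: hallucinations, limited domain knowledge, and responses that lack explainability and traceability. The key to its approach is Graph-based Retrieval-Augmented Generation (GraphRAG): the authors build an Alzheimer's disease knowledge base from 50 papers and 70 expert questions and use GPT-4o as the generation model, integrating structured domain knowledge before answering to improve response quality and traceability.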

Link: https://arxiv.org/abs/2508.21238
Authors: Tingxuan Xu, Jiarui Feng, Justin Melendez, Kaleigh Roberts, Donghong Cai, Mingfang Zhu, Donald Elbert, Yixin Chen, Randall J. Bateman
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In the past two years, large language model (LLM)-based chatbots, such as ChatGPT, have revolutionized various domains by enabling diverse task completion and question-answering capabilities. However, their application in scientific research remains constrained by challenges such as hallucinations, limited domain-specific knowledge, and lack of explainability or traceability for the response. Graph-based Retrieval-Augmented Generation (GraphRAG) has emerged as a promising approach to improving chatbot reliability by integrating domain-specific contextual information before response generation, addressing some limitations of standard LLMs. Despite its potential, there are only limited studies that evaluate GraphRAG on specific domains that require intensive knowledge, like Alzheimer’s disease or other biomedical domains. In this paper, we assess the quality and traceability of two popular GraphRAG systems. We compile a database of 50 papers and 70 expert questions related to Alzheimer’s disease, construct a GraphRAG knowledge base, and employ GPT-4o as the LLM for answering queries. We then compare the quality of responses generated by GraphRAG with those from a standard GPT-4o model. Additionally, we discuss and evaluate the traceability of several Retrieval-Augmented Generation (RAG) and GraphRAG systems. Finally, we provide an easy-to-use interface with a pre-built Alzheimer’s disease database for researchers to test the performance of both standard RAG and GraphRAG.

[AI-49] Manifold Trajectories in Next-Token Prediction: From Replicator Dynamics to Softmax Equilibrium

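[Quick read]: This paper examines the mathematical nature of decoding in large language models (LLMs), that is, how to understand token-by-token generation at a given context and temperature from the viewpoint of probability distributions. The core question is why, for a fixed context and temperature, the next-token distribution evolves along a smooth trajectory and converges to the softmax equilibrium. The key to its answer is to treat decoding as a constrained variational principle on the probability simplex: the discrete, normalization-respecting ascent step is the classical multiplicative-weights update, and its continuous-time limit is the replicator flow. This framework formalizes the "manifold traversal" intuition and shows that temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face of the simplex with the same guarantees.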

Link: https://arxiv.org/abs/2508.21186
Authors: Christopher R. Lee-Jenkins
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Dynamical Systems (math.DS)
Comments:

Abstract: Decoding in large language models is often described as scoring tokens and normalizing with softmax. We give a minimal, self-contained account of this step as a constrained variational principle on the probability simplex. The discrete, normalization-respecting ascent is the classical multiplicative-weights (entropic mirror) update; its continuous-time limit is the replicator flow. From these ingredients we prove that, for a fixed context and temperature, the next-token distribution follows a smooth trajectory inside the simplex and converges to the softmax equilibrium. This formalizes the common "manifold traversal" intuition at the output-distribution level. The analysis yields precise, practice-facing consequences: temperature acts as an exact rescaling of time along the same trajectory, while top-k and nucleus sampling restrict the flow to a face with identical guarantees. We also outline a controlled account of path-dependent score adjustments and their connection to loop-like, hallucination-style behavior. We make no claims about training dynamics or internal representations; those are deferred to future work.
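
A small numerical sketch of the convergence claim: starting from the uniform distribution, repeated entropic mirror (multiplicative-weights) ascent steps on the objective F(p) = <s, p> + T·H(p) approach softmax(s / T). The objective, the step size `eta`, and the function names are a standard textbook formulation assumed here for illustration; the paper's exact construction may differ.

```python
import math


def softmax(scores, temperature=1.0):
    """Softmax of scores / temperature, computed stably."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropic_mirror_step(p, scores, temperature, eta):
    """One multiplicative-weights step: p_i <- p_i^(1 - eta*T) * exp(eta * s_i), renormalized.

    This is mirror ascent with the entropy mirror map on F(p) = <s, p> + T * H(p),
    whose gradient is s - T * (log p + 1); the constant shifts cancel on normalizing.
    """
    log_w = [(1.0 - eta * temperature) * math.log(pi) + eta * si for pi, si in zip(p, scores)]
    m = max(log_w)
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [wi / total for wi in w]


def run_trajectory(scores, temperature=1.0, eta=0.1, steps=500):
    """Iterate from the uniform distribution and return every visited point."""
    p = [1.0 / len(scores)] * len(scores)
    trajectory = [p]
    for _ in range(steps):
        p = entropic_mirror_step(p, scores, temperature, eta)
        trajectory.append(p)
    return trajectory


scores = [2.0, 1.0, 0.1]
trajectory = run_trajectory(scores, temperature=0.7, eta=0.1, steps=500)
equilibrium = softmax(scores, temperature=0.7)
gap = max(abs(a - b) for a, b in zip(trajectory[-1], equilibrium))
```

In log space each step contracts the distance to the fixed point by a factor of (1 - eta·T), so the iterates stay strictly inside the simplex and `gap` shrinks geometrically, provided 0 < eta·T ≤ 1.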

[AI-50] FUTURE: Flexible Unlearning for Tree Ensemble CIKM2025

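[Quick read]: This paper addresses sample unlearning for tree ensembles in the context of data privacy and the right to be forgotten, that is, how to remove the influence of specific sensitive samples from a trained model efficiently and accurately. Existing methods are usually tied to a particular model or depend on the discrete tree structure, which makes them hard to generalize to complex ensembles and inefficient on large-scale data. The key to its solution is to formulate unlearning as a gradient-based optimization task and to adopt probabilistic model approximations to handle the non-differentiability of tree ensembles, enabling effective and efficient end-to-end unlearning.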

Link: https://arxiv.org/abs/2508.21181
Authors: Ziheng Chen, Jin Huang, Jiali Cheng, Yuchan Guo, Mengjie Wang, Lalitesh Morishetti, Kaushiki Nag, Hadi Amiri
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: CIKM 2025

Abstract: Tree ensembles are widely recognized for their effectiveness in classification tasks, achieving state-of-the-art performance across diverse domains, including bioinformatics, finance, and medical diagnosis. With increasing emphasis on data privacy and the right to be forgotten, several unlearning algorithms have been proposed to enable tree ensembles to forget sensitive information. However, existing methods are often tailored to a particular model or rely on the discrete tree structure, making them difficult to generalize to complex ensembles and inefficient for large-scale datasets. To address these limitations, we propose FUTURE, a novel unlearning algorithm for tree ensembles. Specifically, we formulate the problem of forgetting samples as a gradient-based optimization task. In order to accommodate non-differentiability of tree ensembles, we adopt the probabilistic model approximations within the optimization framework. This enables end-to-end unlearning in an effective and efficient manner. Extensive experiments on real-world datasets show that FUTURE yields significant and successful unlearning performance.

[AI-51] Deep Residual Echo State Networks: exploring residual orthogonal connections in untrained Recurrent Neural Networks

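[Quick read]: This paper addresses the limited long-term information processing ability of traditional Echo State Networks (ESNs). The key to its solution is a class of deep untrained recurrent neural networks based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). Stacking multiple untrained residual recurrent layers significantly boosts memory capacity and the modeling of long-term temporal dependencies. The paper also systematically analyzes different orthogonal configurations of the temporal residual connections (randomly generated and fixed-structure) and gives mathematical conditions that ensure stable network dynamics, showing advantages over traditional shallow and deep Reservoir Computing (RC) models on a variety of time series tasks.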

Link: https://arxiv.org/abs/2508.21172
Authors: Matteo Pinna,Andrea Ceni,Claudio Gallicchio
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 6 figures

Abstract:Echo State Networks (ESNs) are a particular type of untrained Recurrent Neural Networks (RNNs) within the Reservoir Computing (RC) framework, popular for their fast and efficient learning. However, traditional ESNs often struggle with long-term information processing. In this paper, we introduce a novel class of deep untrained RNNs based on temporal residual connections, called Deep Residual Echo State Networks (DeepResESNs). We show that leveraging a hierarchy of untrained residual recurrent layers significantly boosts memory capacity and long-term temporal modeling. For the temporal residual connections, we consider different orthogonal configurations, including randomly generated and fixed-structure configurations, and we study their effect on network dynamics. A thorough mathematical analysis outlines necessary and sufficient conditions to ensure stable dynamics within DeepResESN. Our experiments on a variety of time series tasks showcase the advantages of the proposed approach over traditional shallow and deep RC.
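
As a rough illustration of a temporal residual connection in an untrained reservoir, the sketch below updates one layer as h_t = alpha * O h_{t-1} + beta * tanh(W h_{t-1} + W_in x_t), where O is a cyclic shift (one possible fixed-structure orthogonal map). The update rule, the scaling factors `alpha` and `beta`, the weight ranges, and every function name here are assumptions made for this example; the abstract does not give the exact equations.

```python
import math
import random


def cyclic_shift(h):
    """Apply a cyclic permutation, which is an orthogonal linear map."""
    return h[-1:] + h[:-1]


def make_reservoir(n_units, n_inputs, scale=0.1, seed=0):
    """Draw fixed (untrained) recurrent and input weights."""
    rng = random.Random(seed)
    w_rec = [[rng.uniform(-scale, scale) for _ in range(n_units)] for _ in range(n_units)]
    w_in = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_units)]
    return w_rec, w_in


def residual_step(h, x, w_rec, w_in, alpha=0.9, beta=0.5):
    """One state update: orthogonal residual branch plus nonlinear branch."""
    skip = cyclic_shift(h)
    new_h = []
    for i in range(len(h)):
        pre = sum(w * hj for w, hj in zip(w_rec[i], h))
        pre += sum(w * xj for w, xj in zip(w_in[i], x))
        new_h.append(alpha * skip[i] + beta * math.tanh(pre))
    return new_h


def run_reservoir(inputs, n_units=8, alpha=0.9, beta=0.5):
    """Drive the reservoir with an input sequence and return all states."""
    w_rec, w_in = make_reservoir(n_units, len(inputs[0]))
    h = [0.0] * n_units
    states = []
    for x in inputs:
        h = residual_step(h, x, w_rec, w_in, alpha, beta)
        states.append(h)
    return states


signal = [[math.sin(0.3 * t)] for t in range(50)]
states = run_reservoir(signal)
print(len(states), len(states[0]))
```

In this toy setup none of the weights are trained; a reservoir computing model would typically fit only a linear readout on the collected states, and a deep variant would feed one layer's states to the next layer as input.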

[AI-52] WaveLLDM: Design and Development of a Lightweight Latent Diffusion Model for Speech Enhancement and Restoration

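[Quick read]: This paper addresses audio quality degradation caused by noise, compression, and transmission artifacts in audio restoration and denoising, in particular the high computational cost of conventional diffusion models and their difficulty with long missing segments. The key to the solution is WaveLLDM (Wave Lightweight Latent Diffusion Model), which combines an efficient neural audio codec with a latent diffusion model and reconstructs audio in a compressed latent space, reducing computational complexity while preserving spectral reconstruction quality. On the Voicebank+DEMAND test set the architecture achieves low Log-Spectral Distance (LSD) scores (0.48 to 0.60). Perceptual quality (WB-PESQ 1.62 to 1.71) and speech intelligibility (STOI 0.76 to 0.78) still leave room for improvement, which the authors attribute to suboptimal architectural tuning, the absence of fine-tuning, and limited training time.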

Link: https://arxiv.org/abs/2508.21153
Authors: Kevin Putra Santoso,Rizka Wakhidatus Sholikah,Raden Venantius Hari Ginardi
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments:

Abstract:High-quality audio is essential in a wide range of applications, including online communication, virtual assistants, and the multimedia industry. However, degradation caused by noise, compression, and transmission artifacts remains a major challenge. While diffusion models have proven effective for audio restoration, they typically require significant computational resources and struggle to handle longer missing segments. This study introduces WaveLLDM (Wave Lightweight Latent Diffusion Model), an architecture that integrates an efficient neural audio codec with latent diffusion for audio restoration and denoising. Unlike conventional approaches that operate in the time or spectral domain, WaveLLDM processes audio in a compressed latent space, reducing computational complexity while preserving reconstruction quality. Empirical evaluations on the Voicebank+DEMAND test set demonstrate that WaveLLDM achieves accurate spectral reconstruction with low Log-Spectral Distance (LSD) scores (0.48 to 0.60) and good adaptability to unseen data. However, it still underperforms compared to state-of-the-art methods in terms of perceptual quality and speech clarity, with WB-PESQ scores ranging from 1.62 to 1.71 and STOI scores between 0.76 and 0.78. These limitations are attributed to suboptimal architectural tuning, the absence of fine-tuning, and insufficient training duration. Nevertheless, the flexible architecture that combines a neural audio codec and latent diffusion model provides a strong foundation for future development.
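
To make the LSD figures above easier to read, here is a minimal sketch of one common definition of Log-Spectral Distance: the per-frame root-mean-square difference between log10 power spectra, averaged over frames. Whether the paper uses exactly this variant (log base, power versus magnitude, flooring) is not stated in the abstract, so the formula, the `floor` parameter, and the function name are assumptions.

```python
import math


def log_spectral_distance(ref_power, est_power, floor=1e-10):
    """Mean over frames of the RMS difference between log10 power spectra.

    ref_power and est_power are lists of frames; each frame is a list of
    non-negative power values, one per frequency bin.
    """
    frame_scores = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        sq_diffs = []
        for r, e in zip(ref_frame, est_frame):
            # Floor the values so that log10 is always defined.
            diff = math.log10(max(r, floor)) - math.log10(max(e, floor))
            sq_diffs.append(diff * diff)
        frame_scores.append(math.sqrt(sum(sq_diffs) / len(sq_diffs)))
    return sum(frame_scores) / len(frame_scores)


clean = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0]]
louder = [[10.0 * p for p in frame] for frame in clean]
print(log_spectral_distance(clean, clean))
print(log_spectral_distance(clean, louder))
```

Under this definition identical spectra score zero, and a uniform tenfold change in power scores about one, so lower values mean a closer spectral match.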

[AI-53] EmbodiedOneVision: Interleaved Vision-Text-Action Pretraining for General Robot Control

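[Quick read]: This paper addresses the inability of current Vision-Language-Action (VLA) models to reach human-level flexibility in interleaved reasoning and physical interaction in the open world. The key to the solution is EO-Robotics, which comprises the EO-1 unified embodied foundation model and EO-Data1.5M, a large-scale, high-quality multimodal embodied reasoning dataset. EO-1 is pretrained on interleaved vision-text-action data, processes image, text, video, and action inputs in a single architecture, and is trained by combining auto-regressive decoding with flow-matching denoising. This improves robot control and multimodal embodied reasoning and yields better open-world understanding and generalization.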

Link: https://arxiv.org/abs/2508.21112
Authors: Delin Qu,Haoming Song,Qizhi Chen,Zhaoqing Chen,Xianqiang Gao,Xinyi Ye,Qi Lv,Modi Shi,Guanghui Ren,Cheng Ruan,Maoqing Yao,Haoran Yang,Jiacheng Bao,Bin Zhao,Dong Wang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Abstract:The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.

[AI-54] Automating the Deep Space Network Data Systems; A Case Study in Adaptive Anomaly Detection through Agentic AI

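[Quick read]: This paper addresses the detection of degradation and anomalies in long-running equipment of the Deep Space Network (DSN), in order to keep data flowing for the dozens of spacecraft that rely on the DSN for their link to Earth. The core solution is a system that integrates Machine Learning (ML), Reinforcement Learning (RL), and a Large Language Model (LLM) to reconstruct multivariate time-series data, identify anomalies in real time, and classify them by severity, with model performance improved over time through human feedback. A key contribution is a complete data pipeline covering data extraction, parsing, and processing through to modeling. An agentic AI system then applies complex reasoning to classify and predict anomalies, giving DSN maintenance explainable support that can be refined iteratively.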

Link: https://arxiv.org/abs/2508.21111
Authors: Evan J. Chou(1 and 2),Lisa S. Locke(3),Harvey M. Soldan(3) ((1) University of California San Diego, (2) Pasadena City College, (3) Jet Propulsion Laboratory California Institute of Technology)
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Deep Space Network (DSN) is NASA’s largest network of antenna facilities that generate a large volume of multivariate time-series data. These facilities contain DSN antennas and transmitters that undergo degradation over long periods of time, which may cause costly disruptions to the data flow and threaten the earth-connection of dozens of spacecraft that rely on the Deep Space Network for their lifeline. The purpose of this study was to experiment with different methods that would be able to assist JPL engineers with directly pinpointing anomalies and equipment degradation through collected data, and continue conducting maintenance and operations of the DSN for future space missions around our universe. As such, we have researched various machine learning techniques that can fully reconstruct data through predictive analysis, and determine anomalous data entries within real-time datasets through statistical computations and thresholds. On top of the fully trained and tested machine learning models, we have also integrated the use of a reinforcement learning subsystem that classifies identified anomalies based on severity level and a Large Language Model that labels an explanation for each anomalous data entry, all of which can be improved and fine-tuned over time through human feedback/input. Specifically, for the DSN transmitters, we have also implemented a full data pipeline system that connects the data extraction, parsing, and processing workflow all together as there was no coherent program or script for performing these tasks before. Using this data pipeline system, we were able to then also connect the models trained from DSN antenna data, completing the data workflow for DSN anomaly detection. This was all wrapped around and further connected by an agentic AI system, where complex reasoning was utilized to determine the classifications and predictions of anomalous data.

[AI-55] An Explainable Attention-Enhanced Bidirectional Long Short-Term Memory Neural Network for Joint 48-Hour Forecasting of Temperature, Irradiance, and Relative Humidity

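[Quick read]: This paper addresses the accuracy and interpretability of short-term weather forecasting for smart heating, ventilation, and air conditioning (HVAC) systems, in support of Model Predictive Control (MPC) for building energy efficiency. The key to the solution is a deep learning framework based on a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with an attention mechanism that jointly predicts temperature, solar irradiance, and relative humidity, capturing both temporal and cross-feature dependencies. Integrated Gradients and attention weights are used for interpretability, so the model combines high accuracy (e.g., a temperature MAE of 1.3 °C) with greater transparency, supporting the practical use of data-driven short-term weather forecasting in building energy management.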

Link: https://arxiv.org/abs/2508.21109
Authors: Georgios Vamvouras,Konstantinos Braimakis,Christos Tzivanidis
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 27 pages, 8 figures

Abstract:This paper presents a Deep Learning (DL) framework for 48-hour forecasting of temperature, solar irradiance, and relative humidity to support Model Predictive Control (MPC) in smart HVAC systems. The approach employs a stacked Bidirectional Long Short-Term Memory (BiLSTM) network with attention, capturing temporal and cross-feature dependencies by jointly predicting all three variables. Historical meteorological data (2019-2022) with encoded cyclical time features were used for training, while 2023 data evaluated generalization. The model achieved Mean Absolute Errors of 1.3 degrees Celsius (temperature), 31 W/m² (irradiance), and 6.7 percentage points (humidity), outperforming state-of-the-art numerical weather prediction and machine learning benchmarks. Integrated Gradients quantified feature contributions, and attention weights revealed temporal patterns, enhancing interpretability. By combining multivariate forecasting, attention-based DL, and explainability, this work advances data-driven weather prediction. The demonstrated accuracy and transparency highlight the framework’s potential for energy-efficient building control through reliable short-term meteorological forecasting.
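
The "encoded cyclical time features" mentioned in the abstract are commonly built with a sine/cosine pair, so that values at the two ends of a period (for example 23:00 and 00:00) stay close together. The sketch below shows that standard encoding; that the authors use exactly this form is an assumption, and the function names are illustrative.

```python
import math


def encode_cyclical(value, period):
    """Map a periodic value (e.g. hour of day) onto the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)


def circle_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


hours = [encode_cyclical(h, 24) for h in range(24)]
# 23:00 and 00:00 end up as close as 00:00 and 01:00.
print(circle_distance(hours[23], hours[0]), circle_distance(hours[0], hours[1]))
```

The same function could be applied to day of year with `period=365` to encode seasonality.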

[AI-56] Learning to Generate Unit Test via Adversarial Reinforcement Learning

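[Quick read]: This paper addresses the challenge of getting Large Language Models (LLMs) to generate high-quality unit tests, in particular how to train an LLM to produce tests that accurately expose code defects when sufficient labeled data is lacking. The key to the solution is UTRL (Unit Test Reinforcement Learning), a reinforcement learning framework that iteratively trains two LLMs in an adversarial manner: a unit test generator and a code generator. The test generator is trained to maximize a discrimination reward, which reflects its ability to expose faults in the code generator's solutions, while the code generator is trained to maximize a code reward, which reflects its ability to produce code that passes the generated tests. In the experiments, tests generated by Qwen3-4B trained with UTRL are of higher quality than those from the same model trained with supervised fine-tuning, and the model outperforms frontier models such as GPT-4.1.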

Link: https://arxiv.org/abs/2508.21107
Authors: Dongjun Lee,Changho Hwang,Kimin Lee
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Code is available at: this https URL

Abstract:Unit testing is a core practice in programming, enabling systematic evaluation of programs produced by human developers or large language models (LLMs). Given the challenges in writing comprehensive unit tests, LLMs have been employed to automate test generation, yet methods for training LLMs to produce high-quality tests remain underexplored. In this work, we propose UTRL, a novel reinforcement learning framework that trains an LLM to generate high-quality unit tests given a programming instruction. Our key idea is to iteratively train two LLMs, the unit test generator and the code generator, in an adversarial manner via reinforcement learning. The unit test generator is trained to maximize a discrimination reward, which reflects its ability to produce tests that expose faults in the code generator’s solutions, and the code generator is trained to maximize a code reward, which reflects its ability to produce solutions that pass the unit tests generated by the test generator. In our experiments, we demonstrate that unit tests generated by Qwen3-4B trained via UTRL show higher quality compared to unit tests generated by the same model trained via supervised fine-tuning on human-written ground-truth unit tests, yielding code evaluations that more closely align with those induced by the ground-truth tests. Moreover, Qwen3-4B trained with UTRL outperforms frontier models such as GPT-4.1 in generating high-quality unit tests, highlighting the effectiveness of UTRL in training LLMs for this task.

[AI-57] Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

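[Quick read]: This paper addresses a limitation of traditional adaptive gradient methods such as Adagrad: their diagonal preconditioning matrices cannot capture correlations between parameters, which limits optimization efficiency. The key to the solution is the AdaGram optimizer, which enables efficient full-matrix adaptive gradient updates. It uses fast symmetric factorization to compute the preconditioned update direction at each iteration and matrix integrator methods to maintain the low-rank structure of the preconditioner along the optimization trajectory, reducing computational and memory overhead while still modeling parameter correlations and improving convergence speed.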

Link: https://arxiv.org/abs/2508.21106
Authors: Tatyana Matveeva,Aleksandr Katrutsa,Evgeny Frolov
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods, approximating the exact Hessian, can model these correlations and may enable faster convergence. At the same time, their computational and memory costs are often prohibitive for large-scale models. To address this limitation, we propose AdaGram, an optimizer that enables efficient full-matrix adaptive gradient updates. To reduce memory and computational overhead, we utilize fast symmetric factorization for computing the preconditioned update direction at each iteration. Additionally, we maintain the low-rank structure of a preconditioner along the optimization trajectory using matrix integrator methods. Numerical experiments on standard machine learning tasks show that AdaGram converges faster or matches the performance of diagonal adaptive optimizers when using rank five and smaller rank approximations. This demonstrates AdaGram’s potential as a scalable solution for adaptive optimization in large models.

[AI-58] PVPO: Pre-Estimated Value-Based Policy Optimization for Agentic Reasoning

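[Quick read]: This paper addresses the local optima and high computational cost that arise in critic-free reinforcement learning methods based on group policies, which rely on extensive sampling and comparison within a group to estimate the advantage. The key to the solution is an advantage reference anchor together with data pre-sampling: a reference model performs rollouts in advance and the resulting reward score serves as the reference anchor, which corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. The reference model can also assess sample difficulty during pre-sampling, so that high-gain samples can be selected to improve training efficiency.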

Link: https://arxiv.org/abs/2508.21104
Authors: Wenfeng Feng,Penghong Zhao,Guochao Jiang,Chuzhan Hao,Yuewei Zhang,Hao Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 14 pages, 5 figures

Abstract:Critic-free reinforcement learning methods, particularly group policies, have attracted considerable attention for their efficiency in complex tasks. However, these methods rely heavily on multiple sampling and comparisons within the policy to estimate advantage, which may cause the policy to fall into a local optimum and increase computational cost. To address these issues, we propose PVPO, an efficient reinforcement learning method enhanced by an advantage reference anchor and data pre-sampling. Specifically, we use the reference model to perform rollouts in advance and employ the calculated reward score as a reference anchor. Our approach effectively corrects the cumulative bias introduced by intra-group comparisons and significantly reduces reliance on the number of rollouts. Meanwhile, the reference model can assess sample difficulty during data pre-sampling, enabling effective selection of high-gain data to improve training efficiency. Experiments conducted on nine datasets across two domains demonstrate that PVPO achieves State-Of-The-Art (SOTA) performance. Our approach not only demonstrates robust generalization across multiple tasks, but also exhibits scalable performance across models of varying scales.
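
The difference between a group-relative baseline and a pre-computed reference anchor can be sketched in a few lines. The code below is only an illustration of the idea described in the abstract: the exact PVPO advantage formula is not given there, so both functions, the mean-based anchor, and the toy rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """Baseline from the group itself: (r - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored the same, so the group carries no signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def anchored_advantages(rewards, reference_rewards):
    """Baseline from rollouts of a reference model, computed in advance."""
    anchor = mean(reference_rewards)
    return [r - anchor for r in rewards]


policy_rewards = [1.0, 1.0, 1.0, 1.0]      # every rollout in the group succeeded
reference_rewards = [0.0, 1.0, 0.0, 0.0]   # pre-sampled reference rollouts

print(group_relative_advantages(policy_rewards))
print(anchored_advantages(policy_rewards, reference_rewards))
```

In this toy case the group-relative estimate is zero for every rollout, while the anchored estimate still rewards the policy for doing better than the reference model, which is one way to read the claim that an anchor reduces dependence on the number of rollouts.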

[AI-59] Spatiotemporal EEG-Based Emotion Recognition Using SAM Ratings from Serious Games with Hybrid Deep Learning

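[Quick read]: This paper addresses a common limitation of existing EEG-based emotion recognition studies: most focus only on binary valence prediction or subject-specific classification, which limits generalization and deployment in real-world affective computing systems. The key to the solution is a unified, multigranularity emotion classification framework built on the GAMEEMO dataset (14-channel EEG signals with continuous self-reported emotion ratings). A structured preprocessing pipeline (temporal window segmentation, hybrid statistical and frequency-domain feature extraction, and z-score normalization) converts raw EEG signals into discriminative input vectors, and emotion labels are defined along three complementary axes: (i) binary valence classification based on averaged polarity, (ii) multi-class emotion recognition, and (iii) a fine-grained multi-label representation obtained by binning. Several machine learning and deep neural network models are compared on this basis, and the LSTM-GRU architecture performs best on all tasks.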

Link: https://arxiv.org/abs/2508.21103
Authors: Abdul Rehman,Ilona Heldal,Jerry Chun-Wei Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in EEG-based emotion recognition have shown promising outcomes using both deep learning and classical machine learning approaches; however, most existing studies focus narrowly on binary valence prediction or subject-specific classification, which limits generalizability and deployment in real-world affective computing systems. To address this gap, this paper presents a unified, multigranularity EEG emotion classification framework built on the GAMEEMO dataset, which consists of 14-channel EEG recordings and continuous self-reported emotion ratings (boring, horrible, calm, and funny) from 28 subjects across four emotion-inducing gameplay scenarios. Our pipeline employs a structured preprocessing strategy that comprises temporal window segmentation, hybrid statistical and frequency-domain feature extraction, and z-score normalization to convert raw EEG signals into robust, discriminative input vectors. Emotion labels are derived and encoded across three complementary axes: (i) binary valence classification based on the averaged polarity of positive and negative emotion ratings; (ii) multi-class emotion classification, where the presence of the most affective state is predicted; and (iii) fine-grained multi-label representation via binning each emotion into 10 ordinal classes. We evaluate a broad spectrum of models, including Random Forest, XGBoost, and SVM, alongside deep neural architectures such as LSTM, LSTM-GRU, and CNN-LSTM. Among these, the LSTM-GRU model consistently outperforms the others, achieving an F1-score of 0.932 in the binary valence task and 94.5% and 90.6% in multi-class and multi-label emotion classification, respectively.
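
Two of the preprocessing steps named in the abstract, temporal window segmentation and z-score normalization, can be sketched as follows. The window length, step, the three statistical features, and the synthetic signal are all assumptions for illustration, and the frequency-domain features the paper also uses are left out.

```python
from statistics import mean, pstdev


def segment(signal, window, step):
    """Split a 1-D signal into fixed-length, possibly overlapping windows."""
    return [signal[i:i + window] for i in range(0, len(signal) - window + 1, step)]


def window_features(win):
    """Simple statistical features of one window: mean, std, and range."""
    return [mean(win), pstdev(win), max(win) - min(win)]


def zscore_columns(rows):
    """Standardize each feature column to zero mean and unit variance."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    sigmas = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, mus, sigmas)] for row in rows]


# Synthetic stand-in for one EEG channel.
channel = [((7 * t) % 13) / 13.0 + 0.01 * t for t in range(256)]
windows = segment(channel, window=64, step=32)
features = zscore_columns([window_features(w) for w in windows])
print(len(windows), len(features[0]))
```

For a multi-channel recording the same steps would be applied per channel and the per-window feature lists concatenated before normalization.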

[AI-60] Beyond Prediction: Reinforcement Learning as the Defining Leap in Healthcare AI

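[Quick read]: This paper addresses the fact that AI applications in healthcare remain dominated by predictive models that lack dynamic decision-making ability, and asks how Reinforcement Learning (RL) can move from being a tool to being agentive intelligence that optimizes long-term interventions in line with clinical needs. The key to the solution is a systematic survey of RL techniques and applications in healthcare, covering model-based and model-free methods, offline and batch-constrained approaches, reward specification, and uncertainty calibration, together with multi-source information fusion (vitals, labs, imaging, and device telemetry) and deployment in centralized, federated, or edge architectures. The paper stresses the need to move beyond prediction toward safe, human-aligned policy learning, and analyzes RL applications in critical care, chronic disease management, mental health, diagnostics, and robotic assistance.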

Link: https://arxiv.org/abs/2508.21101
Authors: Dilruk Perera,Gousia Habib,Qianyi Xu,Daniel J. Tan,Kai He,Erik Cambria,Mengling Feng
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 40 pages in total (including appendix)

Abstract:Reinforcement learning (RL) marks a fundamental shift in how artificial intelligence is applied in healthcare. Instead of merely predicting outcomes, RL actively decides interventions with long-term goals. Unlike traditional models that operate on fixed associations, RL systems learn through trial, feedback, and long-term reward optimization, introducing transformative possibilities and new risks. From an information fusion lens, healthcare RL typically integrates multi-source signals such as vitals, labs, clinical notes, imaging, and device telemetry using temporal and decision-level mechanisms. These systems can operate within centralized, federated, or edge architectures to meet real-time clinical constraints, and naturally span data, features and decision fusion levels. This survey explores RL’s rise in healthcare as more than a set of tools: a shift toward agentive intelligence in clinical environments. We first structure the landscape of RL techniques including model-based and model-free methods, offline and batch-constrained approaches, and emerging strategies for reward specification and uncertainty calibration through the lens of healthcare constraints. We then comprehensively analyze RL applications spanning critical care, chronic disease, mental health, diagnostics, and robotic assistance, identifying their trends, gaps, and translational bottlenecks. In contrast to prior reviews, we critically analyze RL’s ethical, deployment, and reward design challenges, and synthesize lessons for safe, human-aligned policy learning. This paper serves as both a technical roadmap and a critical reflection of RL’s emerging transformative role in healthcare AI not as prediction machinery, but as agentive clinical intelligence.

[AI-61] Model-Driven Quantum Code Generation Using Large Language Models and Retrieval-Augmented Generation

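[Quick read]: This paper addresses the high costs and increased risks in developing quantum and hybrid quantum-classical software systems that stem from the heterogeneous platform landscape and the shortage of developer skills. The key to the solution is to use Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) pipelines to automate the transformation from UML model instances to executable quantum code. Experiments show that well-engineered prompts can improve CodeBLEU scores by up to a factor of four, yielding more accurate and consistent code. Future work could use software system models as the information source for RAG, or use LLMs for code-to-code transformations (e.g., transpilation), advancing model-driven software engineering for quantum computing.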

Link: https://arxiv.org/abs/2508.21097
Authors: Nazanin Siavash,Armin Moin
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: This paper is accepted to the New Ideas and Emerging Results (NIER) track of the ACM/IEEE 28th International Conference on Model Driven Engineering Languages and Systems (MODELS)

Abstract:This paper introduces a novel research direction for model-to-text/code transformations by leveraging Large Language Models (LLMs) that can be enhanced with Retrieval-Augmented Generation (RAG) pipelines. The focus is on quantum and hybrid quantum-classical software systems, where model-driven approaches can help reduce the costs and mitigate the risks associated with the heterogeneous platform landscape and lack of developers’ skills. We validate one of the proposed ideas regarding generating code out of UML model instances of software systems. This Python code uses a well-established library, called Qiskit, to execute on gate-based or circuit-based quantum computers. The RAG pipeline that we deploy incorporates sample Qiskit code from public GitHub repositories. Experimental results show that well-engineered prompts can improve CodeBLEU scores by up to a factor of four, yielding more accurate and consistent quantum code. However, the proposed research direction can go beyond this through further investigation in the future by conducting experiments to address our other research questions and ideas proposed here, such as deploying software system model instances as the source of information in the RAG pipelines, or deploying LLMs for code-to-code transformations, for instance, for transpilation use cases.

[AI-62] Evaluating Differentially Private Generation of Domain-Specific Text

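[Quick read]: This paper addresses the difficulty of using real-world data in high-stakes domains such as healthcare and finance because of privacy and regulatory barriers. The key to the solution is a unified benchmark for systematically evaluating the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. The benchmark covers the choice of representative data, realistic privacy budgets, the effect of pre-training, and a variety of evaluation metrics. It reveals the performance degradation of current privacy-preserving generation methods under strict privacy constraints and provides an evaluation standard and research directions for more advanced privacy-preserving data sharing.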

Link: https://arxiv.org/abs/2508.20452
Authors: Yidan Sun,Viktor Schlegel,Srinivasan Nandakumar,Iqra Zahid,Yuping Wu,Warren Del-Pinto,Goran Nenadic,Siew-Kei Lam,Jie Zhang,Anil A Bharath
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments:

Abstract:Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.

[AI-63] NSPDI-SNN: An efficient lightweight SNN based on nonlinear synaptic pruning and dendritic integration

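[Quick read]: This paper addresses the shortcomings of Spiking Neural Networks (SNNs) in modeling the complex structure and efficient computation of biological neurons, in particular their weak synaptic sparsity and information representation. The key to the solution is a lightweight method called NSPDI-SNN, based on nonlinear synaptic pruning and dendritic integration. Nonlinear Dendritic Integration (NDI) improves neurons' representation of spatiotemporal information, and a mechanism with heterogeneous state transition ratios of dendritic spines underpins a new Nonlinear Synaptic Pruning (NSP) strategy that substantially increases SNN sparsity with minimal loss of performance, giving efficient synaptic information transfer and computation.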

Link: https://arxiv.org/abs/2508.21566
Authors: Wuque Cai,Hongze Sun,Jiayi He,Qianqian Liao,Yunliang Zang,Duo Chen,Dezhong Yao,Daqing Guo
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: 13 pages, 8 figures, 5 tables; This manuscript has been submitted for possible publication

Abstract:Spiking neural networks (SNNs) are artificial neural networks based on simulated biological neurons and have attracted much attention in recent artificial intelligence technology studies. The dendrites in biological neurons have efficient information processing ability and computational power; however, the neurons of SNNs rarely match the complex structure of the dendrites. Inspired by the nonlinear structure and highly sparse properties of neuronal dendrites, in this study, we propose an efficient, lightweight SNN method with nonlinear pruning and dendritic integration (NSPDI-SNN). In this method, we introduce nonlinear dendritic integration (NDI) to improve the representation of the spatiotemporal information of neurons. We implement heterogeneous state transition ratios of dendritic spines and construct a new and flexible nonlinear synaptic pruning (NSP) method to achieve the high sparsity of SNN. We conducted systematic experiments on three benchmark datasets (DVS128 Gesture, CIFAR10-DVS, and CIFAR10) and extended the evaluation to two complex tasks (speech recognition and reinforcement learning-based maze navigation task). Across all tasks, NSPDI-SNN consistently achieved high sparsity with minimal performance degradation. In particular, our method achieved the best experimental results on all three event stream datasets. Further analysis showed that NSPDI significantly improved the efficiency of synaptic information transfer as sparsity increased. In conclusion, our results indicate that the complex structure and nonlinear computation of neuronal dendrites provide a promising approach for developing efficient SNN methods.

[AI-64] EconAgentic in DePIN Markets: A Large Language Model Approach to the Sharing Economy of Decentralized Physical Infrastructure

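[Quick read]: This paper addresses the inefficiencies and misalignment with human values that arise as Decentralized Physical Infrastructure (DePIN) markets expand rapidly without regulation and with AI agents deployed autonomously in smart contracts. The key to the solution is EconAgentic, a Large Language Model (LLM)-powered framework that simulates how AI agents respond to token incentives, invest in infrastructure, and adapt to market conditions, and compares their decisions with human heuristic benchmarks. This enables systematic analysis of the dynamic evolution of DePIN markets, stakeholder behavior, and macroeconomic indicators, with the aim of improving market efficiency, inclusion, and stability.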

Link: https://arxiv.org/abs/2508.21368
Authors: Yulin Liu,Mocca Schweitzer
Affiliations: Unknown
Categories: General Economics (econ.GN); Artificial Intelligence (cs.AI)
Comments:

Abstract:The Decentralized Physical Infrastructure (DePIN) market is revolutionizing the sharing economy through token-based economics and smart contracts that govern decentralized operations. By 2024, DePIN projects have exceeded $10 billion in market capitalization, underscoring their rapid growth. However, the unregulated nature of these markets, coupled with the autonomous deployment of AI agents in smart contracts, introduces risks such as inefficiencies and potential misalignment with human values. To address these concerns, we introduce EconAgentic, a Large Language Model (LLM)-powered framework designed to mitigate these challenges. Our research focuses on three key areas: 1) modeling the dynamic evolution of DePIN markets, 2) evaluating stakeholders’ actions and their economic impacts, and 3) analyzing macroeconomic indicators to align market outcomes with societal goals. Through EconAgentic, we simulate how AI agents respond to token incentives, invest in infrastructure, and adapt to market conditions, comparing AI-driven decisions with human heuristic benchmarks. Our results show that EconAgentic provides valuable insights into the efficiency, inclusion, and stability of DePIN markets, contributing to both academic understanding and practical improvements in the design and governance of decentralized, tokenized economies.

[AI-65] A Financial Brain Scan of the LLM

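[Quick read]: This paper addresses the lack of interpretability and controllability in the internal decision-making of Large Language Models (LLMs) when they are used for social science tasks such as generating economic forecasts. Conventional approaches treat the LLM as a black box, making it hard to identify the concepts that drive its reasoning or to intervene on them. The key to the solution is a "brain scanning" technique that identifies the plain-English concepts guiding the model's reasoning (such as sentiment, technical analysis, and timing), quantifies their relative importance without reducing performance, and steers the model to be more or less risk-averse, optimistic, or pessimistic, which supports bias correction and scenario simulation. The method is transparent, lightweight, and replicable, making it suitable for empirical social science research.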

Link: https://arxiv.org/abs/2508.21285
Authors: Hui Chen,Antoine Didisheim,Luciano Somoza,Hanqing Tian
Affiliations: Unknown
Categories: General Finance (q-fin.GN); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); General Economics (econ.GN)
Comments: 47 pages

Abstract:Emerging techniques in computer science make it possible to “brain scan” large language models (LLMs), identify the plain-English concepts that guide their reasoning, and steer them while holding other factors constant. We show that this approach can map LLM-generated economic forecasts to concepts such as sentiment, technical analysis, and timing, and compute their relative importance without reducing performance. We also show that models can be steered to be more or less risk-averse, optimistic, or pessimistic, which allows researchers to correct or simulate biases. The method is transparent, lightweight, and replicable for empirical research in the social sciences.

[AI-66] Deep Active Learning for Lung Disease Severity Classification from Chest X-rays: Learning with Less Data in the Presence of Class Imbalance

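[Quick read]: This paper addresses the large amount of labeled data required for lung disease severity classification from chest X-rays (CXRs) under class imbalance. The core solution is deep active learning with a Bayesian Neural Network (BNN) approximation, combined with a weighted loss function to mitigate class imbalance and several acquisition strategies (such as entropy sampling and Mean STD sampling) that iteratively select the most informative samples from an unlabeled pool for annotation. Experiments show that, using only 15.4% to 23.1% of the labeled data, the method substantially reduces the annotation burden while maintaining diagnostic performance (e.g., 93.7% accuracy and an AU ROC of 0.91 in binary classification).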

Link: https://arxiv.org/abs/2508.21263
Authors: Roy M. Gabriel,Mohammadreza Zandehshahvar,Marly van Assen,Nattakorn Kittisut,Kyle Peters,Carlo N. De Cecco,Ali Adibi
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:To reduce the amount of required labeled data for lung disease severity classification from chest X-rays (CXRs) under class imbalance, this study applied deep active learning with a Bayesian Neural Network (BNN) approximation and weighted loss function. This retrospective study collected 2,319 CXRs from 963 patients (mean age, 59.2 ± 16.6 years; 481 female) at Emory Healthcare affiliated hospitals between January and November 2020. All patients had clinically confirmed COVID-19. Each CXR was independently labeled by 3 to 6 board-certified radiologists as normal, moderate, or severe. A deep neural network with Monte Carlo Dropout was trained using active learning to classify disease severity. Various acquisition functions were used to iteratively select the most informative samples from an unlabeled pool. Performance was evaluated using accuracy, area under the receiver operating characteristic curve (AU ROC), and area under the precision-recall curve (AU PRC). Training time and acquisition time were recorded. Statistical analysis included descriptive metrics and performance comparisons across acquisition strategies. Entropy Sampling achieved 93.7% accuracy (AU ROC, 0.91) in binary classification (normal vs. diseased) using 15.4% of the training data. In the multi-class setting, Mean STD sampling achieved 70.3% accuracy (AU ROC, 0.86) using 23.1% of the labeled data. These methods outperformed more complex and computationally expensive acquisition functions and significantly reduced labeling needs. Deep active learning with BNN approximation and weighted loss effectively reduces labeled data requirements while addressing class imbalance, maintaining or exceeding diagnostic performance.
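
The two acquisition functions named in the abstract can be sketched from their usual definitions, given class probabilities from several stochastic forward passes (as Monte Carlo Dropout would produce). The sketch below is not the authors' code: the data layout, function names, and toy probabilities are assumptions, and the probabilities are typed in by hand rather than produced by a network.

```python
import math
from statistics import mean, pstdev


def predictive_entropy(mc_probs):
    """Entropy of the class distribution averaged over stochastic passes.

    mc_probs[t][n][c] is the probability of class c for sample n in pass t.
    """
    n_samples = len(mc_probs[0])
    n_classes = len(mc_probs[0][0])
    scores = []
    for n in range(n_samples):
        avg = [mean(p[n][c] for p in mc_probs) for c in range(n_classes)]
        scores.append(sum(-q * math.log(q) for q in avg if q > 0.0))
    return scores


def mean_std(mc_probs):
    """Per-class standard deviation across passes, averaged over classes."""
    n_samples = len(mc_probs[0])
    n_classes = len(mc_probs[0][0])
    return [
        mean(pstdev([p[n][c] for p in mc_probs]) for c in range(n_classes))
        for n in range(n_samples)
    ]


def select_top_k(scores, k):
    """Indices of the k highest-scoring pool samples."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Two stochastic passes over a pool of three unlabeled samples, two classes.
mc_probs = [
    [[0.99, 0.01], [0.5, 0.5], [0.9, 0.1]],
    [[0.99, 0.01], [0.5, 0.5], [0.3, 0.7]],
]
print(select_top_k(predictive_entropy(mc_probs), 1))
print(select_top_k(mean_std(mc_probs), 1))
```

Entropy favors the sample whose averaged prediction is closest to uniform, while Mean STD favors the sample on which the passes disagree most, so the two strategies can pick different samples from the same pool.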

[AI-67] Reinforcement Learning for Optimizing Large Qubit Array based Quantum Sensor Circuits

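Quick read: This paper addresses the exponential growth in design and control complexity that large-scale quantum sensor circuits face as the number of qubits increases, in particular the challenge of optimizing entanglement distribution to improve sensor sensitivity and efficiency. The key to the solution is combining reinforcement learning (RL) with tensor-network-based simulation, specifically the Matrix Product State (MPS) representation, which enables efficient, scalable optimization of circuits with up to 60 qubits. The RL agent learns to restructure circuits so as to maximize Quantum Fisher Information (QFI) and entanglement entropy while substantially reducing gate count and circuit depth. Experiments show QFI approaching 1, entanglement entropy in the 0.8-1.0 range, and up to a 90% reduction in circuit complexity, supporting the method's effectiveness for optimizing complex quantum circuits under realistic constraints.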

Link: https://arxiv.org/abs/2508.21253
Authors: Laxmisha Ashok Attisara, Sathish Kumar
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 10 pages, 13 figures, 2 tables

Abstract:As the number of qubits in a sensor increases, the complexity of designing and controlling the quantum circuits grows exponentially. Manually optimizing these circuits becomes infeasible. Optimizing entanglement distribution in large-scale quantum circuits is critical for enhancing the sensitivity and efficiency of quantum sensors [5], [6]. This paper presents an engineering integration of reinforcement learning with tensor-network-based simulation (MPS) for scalable optimization of quantum sensor circuits with up to 60 qubits. To enable efficient simulation and scalability, we adopt tensor network methods, specifically the Matrix Product State (MPS) representation, instead of traditional state vector or density matrix approaches. Our reinforcement learning agent learns to restructure circuits to maximize Quantum Fisher Information (QFI) and entanglement entropy while reducing gate counts and circuit depth. Experimental results show consistent improvements, with QFI values approaching 1, entanglement entropy in the 0.8-1.0 range, and up to 90% reduction in depth and gate count. These results highlight the potential of combining quantum machine learning and tensor networks to optimize complex quantum circuits under realistic constraints.

[AI-68] Quantum Machine Learning for Optimizing Entanglement Distribution in Quantum Sensor Circuits

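Quick read: This paper addresses how to optimize entanglement distribution in quantum sensor circuits to improve measurement sensitivity and system coherence. The key solution applies reinforcement learning, as a quantum machine learning technique, within a quantum environment to optimize the entanglement layout, maximizing Quantum Fisher Information (QFI) and entanglement entropy while minimizing circuit depth and gate count. The implementation is based on Qiskit and adds noise models and error mitigation strategies to simulate realistic quantum environments. While keeping QFI and entanglement entropy high (0.84-1.0), it reduces circuit depth and gate count by 20-86%, improving the performance of quantum sensing circuits.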

Link: https://arxiv.org/abs/2508.21252
Authors: Laxmisha Ashok Attisara, Sathish Kumar
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 11 pages, 13 figures, 4 tables

Abstract:In the rapidly evolving field of quantum computing, optimizing quantum circuits for specific tasks is crucial for enhancing performance and efficiency. More recently, quantum sensing has become a distinct and rapidly growing branch of research within the area of quantum science and technology. The field is expected to provide new opportunities, especially regarding high sensitivity and precision. Entanglement is one of the key factors in achieving high sensitivity and measurement precision [3]. This paper presents a novel approach utilizing quantum machine learning techniques to optimize entanglement distribution in quantum sensor circuits. By leveraging reinforcement learning within a quantum environment, we aim to optimize the entanglement layout to maximize Quantum Fisher Information (QFI) and entanglement entropy, which are key indicators of a quantum system’s sensitivity and coherence, while minimizing circuit depth and gate counts. Our implementation, based on Qiskit, integrates noise models and error mitigation strategies to simulate realistic quantum environments. The results demonstrate significant improvements in circuit performance and sensitivity, highlighting the potential of machine learning in quantum circuit optimization by measuring high QFI and entropy in the range of 0.84-1.0 with depth and gate count reduction by 20-86%.
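
Entanglement entropy, one of the two quantities this abstract optimizes, can be illustrated on the smallest possible case. The sketch below computes the von Neumann entropy (in bits) of one qubit of a normalized two-qubit pure state. It is a textbook calculation, not the paper's Qiskit pipeline, and the function name is hypothetical; whether the reported 0.84-1.0 range uses exactly this definition is an assumption.

```python
import math


def entanglement_entropy_two_qubits(amps):
    """Von Neumann entropy (bits) of qubit A for a normalized state
    a00|00> + a01|01> + a10|10> + a11|11>."""
    a00, a01, a10, a11 = amps
    # The reduced density matrix of qubit A has trace 1 and determinant |a00*a11 - a01*a10|^2.
    det = abs(a00 * a11 - a01 * a10) ** 2
    # Clamp to guard against tiny negative values caused by rounding.
    disc = math.sqrt(max(0.0, 1.0 - 4.0 * det))
    eigenvalues = [(1.0 + disc) / 2.0, (1.0 - disc) / 2.0]
    return -sum(p * math.log2(p) for p in eigenvalues if p > 1e-15)


s = 1.0 / math.sqrt(2.0)
bell_entropy = entanglement_entropy_two_qubits([s, 0.0, 0.0, s])       # maximally entangled
product_entropy = entanglement_entropy_two_qubits([1.0, 0.0, 0.0, 0.0])  # no entanglement
```

A maximally entangled Bell pair gives one bit of entropy and a product state gives zero, which are the two ends of the scale for a single qubit.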

[AI-69] Zero-Shot KWS for Children's Speech using Layer-Wise Features from SSL Models

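Quick read: This paper addresses the performance drop in Keyword Spotting (KWS) on children's speech, whose acoustic and linguistic characteristics differ markedly from those of adult speech. The key to the solution is a zero-shot KWS method that extracts layer-wise features from state-of-the-art self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and Data2Vec) and uses them to train a Kaldi-based DNN KWS system. On the PFSTAR children's speech dataset the method achieves state-of-the-art performance; features from layer 22 of Wav2Vec2 perform best, reaching an ATWV of 0.691 and an MTWV of 0.7003 on a 30-keyword set, with false alarm and miss probabilities of 0.0164 and 0.0547, respectively. The results indicate that SSL features are robust and generalize across age groups and noisy conditions.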

Link: https://arxiv.org/abs/2508.21248
Authors: Subham Kutum, Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Mahesh Chandra Govil
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Accepted

Abstract:Numerous methods have been proposed to enhance Keyword Spotting (KWS) in adult speech, but children’s speech presents unique challenges for KWS systems due to its distinct acoustic and linguistic characteristics. This paper introduces a zero-shot KWS approach that leverages state-of-the-art self-supervised learning (SSL) models, including Wav2Vec2, HuBERT and Data2Vec. Features are extracted layer-wise from these SSL models and used to train a Kaldi-based DNN KWS system. The WSJCAM0 adult speech dataset was used for training, while the PFSTAR children’s speech dataset was used for testing, demonstrating the zero-shot capability of our method. Our approach achieved state-of-the-art results across all keyword sets for children’s speech. Notably, the Wav2Vec2 model, particularly layer 22, performed the best, delivering an ATWV score of 0.691, an MTWV score of 0.7003, and probabilities of false alarm and miss of 0.0164 and 0.0547, respectively, for a set of 30 keywords. Furthermore, age-specific performance evaluation confirmed the system’s effectiveness across different age groups of children. To assess the system’s robustness against noise, additional experiments were conducted using the best-performing layer of the best-performing Wav2Vec2 model. The results demonstrated a significant improvement over a traditional MFCC-based baseline, emphasizing the potential of SSL embeddings even in noisy conditions. To further generalize the KWS framework, the experiments were repeated on an additional CMU dataset. Overall, the results highlight the significant contribution of SSL features in enhancing zero-shot KWS performance for children’s speech, effectively addressing the challenges associated with the distinct characteristics of child speakers.

[AI-70] HCQA: Hybrid Classical-Quantum Agent for Generating Optimal Quantum Sensor Circuits

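Quick read: This paper addresses the difficulty of designing optimal Quantum Sensor Circuits (QSCs) for complex quantum physics problems; the core challenge is automatically finding circuits that maximize Quantum Fisher Information (QFI) while minimizing the number of gates. The key to the solution is the Hybrid Classical-Quantum Agent (HCQA), which combines a Deep Q-Network (DQN) for learning and policy optimization with a quantum-based action selection mechanism: Ry gates encode the agent's current state and create a superposition of possible actions, and measuring the circuit yields probabilistic action outcomes. This enables automated generation and optimization of entangled quantum states with high QFI sensitivity, such as squeezed states, making circuit design for quantum sensing tasks more automated and efficient.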

Link: https://arxiv.org/abs/2508.21246
Authors: Ahmad Alomari, Sathish A. P. Kumar
Affiliations: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI)
Comments: 9 pages, 9 figures

Abstract:This study proposes an HCQA for designing optimal Quantum Sensor Circuits (QSCs) to address complex quantum physics problems. The HCQA integrates computational intelligence techniques by leveraging a Deep Q-Network (DQN) for learning and policy optimization, enhanced by a quantum-based action selection mechanism based on the Q-values. A quantum circuit encodes the agent’s current state using Ry gates, and then creates a superposition of possible actions. Measurement of the circuit results in probabilistic action outcomes, allowing the agent to generate optimal QSCs by selecting sequences of gates that maximize the Quantum Fisher Information (QFI) while minimizing the number of gates. This computational intelligence-driven HCQA enables the automated generation of entangled quantum states, specifically the squeezed states, with high QFI sensitivity for quantum state estimation and control. Evaluation of the HCQA on a QSC that consists of two qubits and a sequence of Rx, Ry, and S gates demonstrates its efficiency in generating optimal QSCs with a QFI of 1. This work highlights the synergy between AI-driven learning and quantum computation, illustrating how intelligent agents can autonomously discover optimal quantum circuit designs for enhanced sensing and estimation tasks.

[AI-71] Can Layer-wise SSL Features Improve Zero-Shot ASR Performance for Children's Speech?

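Quick read: This paper addresses the low accuracy of existing automatic speech recognition (ASR) systems on children's speech, which stems from its distinct and highly variable acoustic and linguistic characteristics. The key to the solution is extracting layer-wise features from pre-trained self-supervised learning (SSL) models (Wav2Vec2, HuBERT, Data2Vec, and WavLM) and integrating them into a DNN-based ASR system in a zero-shot setting. Experiments show that layer 22 of Wav2Vec2 achieves the lowest word error rate on the PFSTAR children's speech test set (WER = 5.15%), a 51.64% relative improvement over direct zero-shot decoding. The gains hold across children's age groups, indicating that layer-wise feature selection combined with SSL transfer generalizes well.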

Link: https://arxiv.org/abs/2508.21225
Authors: Abhijit Sinha, Hemant Kumar Kathania, Sudarsana Reddy Kadiri, Shrikanth Narayanan
Affiliations: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Sound (cs.SD); Signal Processing (eess.SP)
Comments: Accepted

Abstract:Automatic Speech Recognition (ASR) systems often struggle to accurately process children’s speech due to its distinct and highly variable acoustic and linguistic characteristics. While recent advancements in self-supervised learning (SSL) models have greatly enhanced the transcription of adult speech, accurately transcribing children’s speech remains a significant challenge. This study investigates the effectiveness of layer-wise features extracted from state-of-the-art SSL pre-trained models (specifically Wav2Vec2, HuBERT, Data2Vec, and WavLM) in improving the performance of ASR for children’s speech in zero-shot scenarios. A detailed analysis of features extracted from these models was conducted, integrating them into a simplified DNN-based ASR system using the Kaldi toolkit. The analysis identified the most effective layers for enhancing ASR performance on children’s speech in a zero-shot scenario, where WSJCAM0 adult speech was used for training and PFSTAR children’s speech for testing. Experimental results indicated that Layer 22 of the Wav2Vec2 model achieved the lowest Word Error Rate (WER) of 5.15%, representing a 51.64% relative improvement over the direct zero-shot decoding using Wav2Vec2 (WER of 10.65%). Additionally, age group-wise analysis demonstrated consistent performance improvements with increasing age, along with significant gains observed even in younger age groups using the SSL features. Further experiments on the CMU Kids dataset confirmed similar trends, highlighting the generalizability of the proposed approach.
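
The relative improvement quoted in this abstract follows directly from the two WER values it reports. A small sketch of the arithmetic, with a hypothetical helper name:

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# WER values as reported in the abstract: 10.65% (direct zero-shot decoding) vs. 5.15% (layer 22).
reduction = round(relative_wer_reduction(10.65, 5.15), 2)
print(reduction)  # 51.64
```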

[AI-72] Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS2-based Proteomics

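Quick read: This paper addresses the limited accuracy of peptide fragment ion probability prediction caused by reliance on global fragmentation statistics, an assumption that ignores how peptide sequence and charge state specifically affect fragmentation. The key to the solution is Pep2Prob, the first dataset for peptide-specific fragment ion probability prediction, built from more than 183 million high-quality, high-resolution HCD tandem mass spectrometry (MS²) spectra and covering fragment probability statistics for 608,780 unique precursors (each a pair of peptide sequence and charge state). Models that use peptide-specific features significantly outperform traditional methods based only on global fragmentation statistics, and the results suggest that the complex nonlinear relationship calls for more sophisticated machine learning approaches.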

Link: https://arxiv.org/abs/2508.21076
Authors: Hao Xu, Zhichao Wang, Shengqi Sang, Pisit Wajanasara, Nuno Bandeira
Affiliations: Unknown
Categories: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Dataset is available at HuggingFace: this https URL

Abstract:Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS²) is the major analytical technique in proteomics that identifies peptides by ionizing them, fragmenting them, and using the resulting mass spectra to identify and quantify proteins in biological samples. In MS² analysis, peptide fragment ion probability prediction plays a critical role, enhancing the accuracy of peptide identification from mass spectra as a complement to the intensity information. Current approaches rely on global statistics of fragmentation, which assumes that a fragment’s probability is uniform across all peptides. Nevertheless, this assumption is oversimplified from a biochemical principle point of view and limits accurate prediction. To address this gap, we present Pep2Prob, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The proposed dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of peptide sequence and charge state), summarized from more than 183 million high-quality, high-resolution, HCD MS² spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods using only global fragmentation statistics. Furthermore, performance across benchmark models with increasing capacities suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches.

Machine Learning

[LG-0] Achieving Hilbert-Schmidt Independence Under Rényi Differential Privacy for Fair and Private Data Generation

Link: https://arxiv.org/abs/2508.21815
Authors: Tobias Hyrup, Emmanouil Panagiotou, Arjun Roy, Arthur Zimek, Eirini Ntoutsi, Peter Schneider-Kamp
Categories: Machine Learning (cs.LG)
Comments:

Abstract:As privacy regulations such as the GDPR and HIPAA and responsibility frameworks for artificial intelligence such as the AI Act gain traction, the ethical and responsible use of real-world data faces increasing constraints. Synthetic data generation has emerged as a promising solution to risk-aware data sharing and model development, particularly for tabular datasets that are foundational to sensitive domains such as healthcare. To address both privacy and fairness concerns in this setting, we propose FLIP (Fair Latent Intervention under Privacy guarantees), a transformer-based variational autoencoder augmented with latent diffusion to generate heterogeneous tabular data. Unlike the typical setup in fairness-aware data generation, we assume a task-agnostic setup, not reliant on a fixed, defined downstream task, thus offering broader applicability. To ensure privacy, FLIP employs Rényi differential privacy (RDP) constraints during training and addresses fairness in the input space with RDP-compatible balanced sampling that accounts for group-specific noise levels across multiple sampling rates. In the latent space, we promote fairness by aligning neuron activation patterns across protected groups using Centered Kernel Alignment (CKA), a similarity measure extending the Hilbert-Schmidt Independence Criterion (HSIC). This alignment encourages statistical independence between latent representations and the protected feature. Empirical results demonstrate that FLIP effectively provides significant fairness improvements for task-agnostic fairness and across diverse downstream tasks under differential privacy constraints.
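
The similarity measure named in this abstract, Centered Kernel Alignment (CKA), can be sketched in its linear form, which normalizes a linear-kernel HSIC term. The code below is a generic illustration on small row-major activation matrices; how FLIP applies CKA across protected groups under privacy constraints is not shown, and all names are hypothetical.

```python
import math


def center_columns(matrix):
    """Subtract each column's mean (rows are samples, columns are features)."""
    n = len(matrix)
    means = [sum(col) / n for col in zip(*matrix)]
    return [[v - mu for v, mu in zip(row, means)] for row in matrix]


def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b, the linear-kernel HSIC up to a constant factor."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows."""
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator


acts = [[1.0, 2.0], [3.0, 1.0], [0.0, 5.0], [2.0, 2.0]]
same = linear_cka(acts, acts)                                    # identical representations
scaled = linear_cka(acts, [[2 * v for v in row] for row in acts])  # invariant to scaling
unrelated = linear_cka([[1.0], [-1.0], [1.0], [-1.0]], [[1.0], [1.0], [-1.0], [-1.0]])
```

Values near 1 indicate aligned representations and values near 0 indicate representations with no linear relationship.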

[LG-1] QR-LoRA: QR-Based Low-Rank Adaptation for Efficient Fine-Tuning of Large Language Models

Link: https://arxiv.org/abs/2508.21810
Authors: Jessica Liang, Anirudh Bharadwaj
Categories: Machine Learning (cs.LG)
Comments:

Abstract:The growing scale of Large Language Models (LLMs) has necessitated the development of parameter-efficient fine-tuning techniques. Low-Rank Adaptation (LoRA) has emerged as a promising approach, reducing the number of trainable parameters by applying low-rank updates to pretrained weights. While standard LoRA learns both update factors directly, several recent variants first initialize those matrices via an SVD of the pretrained weights – an operation that can be expensive on large models and yields singular vectors that are not always easy to interpret. In this work, we extract an orthonormal basis from the pretrained weight matrix using QR decomposition with column pivoting, and then express the LoRA update as a linear combination of these basis vectors – training only the scalar coefficients, which imposes clear structure on adaptation and drastically reduces parameter count. Experiments across GLUE tasks show that QR-LoRA matches or exceeds the performance of full fine-tuning, standard LoRA, and SVD-LoRA (LoRA with update matrices initialized via singular value decomposition) with as few as 601 parameters – a reduction of over 1000x compared to full fine-tuning and 77x fewer than typical LoRA setups.
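
The idea in this abstract, a fixed orthonormal basis taken from the pretrained weights plus a handful of trainable scalars, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: a greedy column-pivoted Gram-Schmidt stands in for a library QR routine with column pivoting, and the update form ΔW = Σ λ_i q_i (q_iᵀ W) is one assumed way to write "a linear combination of these basis vectors". All function names are hypothetical.

```python
import math


def pivoted_orthonormal_basis(w, rank):
    """Greedy column-pivoted Gram-Schmidt; returns up to `rank` orthonormal vectors."""
    rows, cols = len(w), len(w[0])
    # Residual copies of the columns of w.
    residual = [[w[i][j] for i in range(rows)] for j in range(cols)]
    basis = []
    for _ in range(rank):
        # Pivot: pick the column with the largest remaining norm.
        norms = [math.sqrt(sum(v * v for v in col)) for col in residual]
        pivot = max(range(cols), key=norms.__getitem__)
        if norms[pivot] < 1e-12:
            break  # remaining columns are numerically dependent
        q = [v / norms[pivot] for v in residual[pivot]]
        basis.append(q)
        # Remove the new direction from every residual column.
        for col in residual:
            proj = sum(a * b for a, b in zip(q, col))
            for i in range(rows):
                col[i] -= proj * q[i]
    return basis


def qr_lora_update(w, basis, coeffs):
    """Assumed update form: sum_i coeffs[i] * q_i (q_i^T W); only coeffs would be trained."""
    rows, cols = len(w), len(w[0])
    delta = [[0.0] * cols for _ in range(rows)]
    for q, lam in zip(basis, coeffs):
        r_row = [sum(q[i] * w[i][j] for i in range(rows)) for j in range(cols)]
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += lam * q[i] * r_row[j]
    return delta


w = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0], [1.0, 1.0, 4.0]]
basis = pivoted_orthonormal_basis(w, rank=3)
delta = qr_lora_update(w, basis, coeffs=[1.0, 1.0, 1.0])
```

With a full-rank basis and all coefficients set to 1 the update reproduces W itself, which is a convenient sanity check; in a fine-tuning setting the coefficients would start near zero and only those scalars would receive gradients.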

[LG-2] UniMLR: Modeling Implicit Class Significance for Multi-Label Ranking

Link: https://arxiv.org/abs/2508.21772
Authors: V. Bugra Yesilkaynak, Emine Dari, Alican Mertan, Gozde Unal
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Existing multi-label ranking (MLR) frameworks only exploit information deduced from the bipartition of labels into positive and negative sets. Therefore, they do not benefit from ranking among positive labels, which is the novel MLR approach we introduce in this paper. We propose UniMLR, a new MLR paradigm that models implicit class relevance/significance values as probability distributions using the ranking among positive labels, rather than treating them as equally important. This approach unifies ranking and classification tasks associated with MLR. Additionally, we address the challenges of scarcity and annotation bias in MLR datasets by introducing eight synthetic datasets (Ranked MNISTs) generated with varying significance-determining factors, providing an enriched and controllable experimental environment. We statistically demonstrate that our method accurately learns a representation of the positive rank order, which is consistent with the ground truth and proportional to the underlying significance values. Finally, we conduct comprehensive empirical experiments on both real-world and synthetic datasets, demonstrating the value of our proposed framework.

[LG-3] Inferring Effects of Major Events through Discontinuity Forecasting of Population Anxiety

Link: https://arxiv.org/abs/2508.21722
Authors: Siddharth Mangalik, Ojas Deshpande, Adithya V. Ganesan, Sean A. P. Clouston, H. Andrew Schwartz
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Estimating community-specific mental health effects of local events is vital for public health policy. While forecasting mental health scores alone offers limited insights into the impact of events on community well-being, quasi-experimental designs like the Longitudinal Regression Discontinuity Design (LRDD) from econometrics help researchers derive effects that are more likely to be causal from observational data. LRDDs aim to extrapolate the size of changes in an outcome (e.g. a discontinuity in running scores for anxiety) due to a time-specific event. Here, we propose adapting LRDDs beyond traditional forecasting into a statistical learning framework whereby future discontinuities (i.e. time-specific shifts) and changes in slope (i.e. linear trajectories) are estimated given a location’s history of the score, dynamic covariates (other running assessments), and exogenous variables (static representations). Applying our framework to predict discontinuities in the anxiety of US counties from COVID-19 events, we found the task was difficult but more achievable as the sophistication of models was increased, with the best results coming from integrating exogenous and dynamic covariates. Our approach shows strong improvement (r = +.46 for discontinuity and r = +.65 for slope) over traditional static community representations. Discontinuity forecasting raises new possibilities for estimating the idiosyncratic effects of potential future or hypothetical events on specific communities.
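
The two target quantities in this abstract, the discontinuity and the change in slope around an event, can be illustrated with a basic regression discontinuity estimate on a single time series. This sketch only measures an observed shift by fitting separate lines before and after the event; the paper's learning framework, which forecasts these quantities from history and covariates, is not reproduced. Names and data are hypothetical.

```python
def fit_line(ts, ys):
    """Ordinary least squares for y = intercept + slope * t."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def discontinuity_at(ts, ys, event_time):
    """Return (jump at event_time, change in slope) from separate pre/post linear fits."""
    pre = [(t, y) for t, y in zip(ts, ys) if t < event_time]
    post = [(t, y) for t, y in zip(ts, ys) if t >= event_time]
    b0, m0 = fit_line(*zip(*pre))
    b1, m1 = fit_line(*zip(*post))
    jump = (b1 + m1 * event_time) - (b0 + m0 * event_time)
    return jump, m1 - m0


# Synthetic weekly scores: trend 1 + 0.5t before week 5, then 5 + 1.0(t - 5) afterwards.
weeks = list(range(10))
scores = [1 + 0.5 * t if t < 5 else 5 + 1.0 * (t - 5) for t in weeks]
jump, slope_change = discontinuity_at(weeks, scores, event_time=5)
```

For this synthetic series the pre-event line predicts 3.5 at week 5 while the post-event line starts at 5, so the estimated jump is 1.5 and the slope increases by 0.5.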

[LG-4] A Soft Inducement Framework for Incentive-Aided Steering of No-Regret Players

Link: https://arxiv.org/abs/2508.21672
Authors: Asrin Efe Yorulmaz, Raj Kiriti Velicheti, Melih Bastopcu, Tamer Başar
Categories: Computer Science and Game Theory (cs.GT); Information Theory (cs.IT); Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
Comments:

Abstract:In this work, we investigate a steering problem in a mediator-augmented two-player normal-form game, where the mediator aims to guide players toward a specific action profile through information and incentive design. We first characterize the games for which successful steering is possible. Moreover, we establish that steering players to any desired action profile is not always achievable with information design alone, nor when accompanied by sublinear payment schemes. Consequently, we derive a lower bound on the constant payments required per round to achieve this goal. To address these limitations of information design, we introduce an augmented approach that involves a one-shot information design phase before the start of the repeated game, transforming the prior interaction into a Stackelberg game. Finally, we theoretically demonstrate that this approach improves the convergence rate of players’ action profiles to the target point by a constant factor with high probability, and support it with empirical results.

[LG-5] Trajectory learning for ensemble forecasts via the continuous ranked probability score: a Lorenz 96 case study

Link: https://arxiv.org/abs/2508.21664
Authors: Sagy Ephrati, James Woodfield
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)
Comments: 19 pages, 9 figures. All comments are welcome!

Abstract:This paper demonstrates the feasibility of trajectory learning for ensemble forecasts by employing the continuous ranked probability score (CRPS) as a loss function. Using the two-scale Lorenz '96 system as a case study, we develop and train both additive and multiplicative stochastic parametrizations to generate ensemble predictions. Results indicate that CRPS-based trajectory learning produces parametrizations that are both accurate and sharp. The resulting parametrizations are straightforward to calibrate and outperform derivative-fitting-based parametrizations in short-term forecasts. This approach is particularly promising for data assimilation applications due to its accuracy over short lead times.
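
The loss function named in this abstract has a standard empirical form for a finite ensemble: the mean absolute error of the members minus half the mean absolute difference between member pairs. The sketch below evaluates it for a scalar observation; how the paper aggregates it along trajectories of the Lorenz '96 system is not shown, and the function name is hypothetical.

```python
def crps_ensemble(members, observation):
    """Empirical CRPS for a scalar observation: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    accuracy = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return accuracy - spread


perfect = crps_ensemble([1.0], 1.0)        # single member equal to the observation
two_member = crps_ensemble([0.0, 2.0], 1.0)  # symmetric ensemble around the observation
```

The first term rewards accuracy and the second rewards a realistic ensemble spread, which is why minimizing CRPS favors forecasts that are both accurate and sharp.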

[LG-6] I Stolenly Swear That I Am Up to (No) Good: Design and Evaluation of Model Stealing Attacks

Link: https://arxiv.org/abs/2508.21654
Authors: Daryna Oliynyk, Rudolf Mayer, Kathrin Grosse, Andreas Rauber
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Comments: Under review

Abstract:Model stealing attacks endanger the confidentiality of machine learning models offered as a service. Although these models are kept secret, a malicious party can query a model to label data samples and train their own substitute model, violating intellectual property. While novel attacks in the field are continually being published, their design and evaluations are not standardised, making it challenging to compare prior works and assess progress in the field. This paper is the first to address this gap by providing recommendations for designing and evaluating model stealing attacks. To this end, we study the largest group of attacks that rely on training a substitute model – those attacking image classification models. We propose the first comprehensive threat model and develop a framework for attack comparison. Further, we analyse attack setups from related works to understand which tasks and models have been studied the most. Based on our findings, we present best practices for attack development before, during, and beyond experiments and derive an extensive list of open research questions regarding the evaluation of model stealing attacks. Our findings and recommendations also transfer to other problem domains, hence establishing the first generic evaluation methodology for model stealing attacks.

[LG-7] Predicting Social Media Engagement from Emotional and Temporal Features

Link: https://arxiv.org/abs/2508.21650
Authors: Yunwoo Kim, Junhyuk Hwang
Categories: Machine Learning (cs.LG)
Comments: 7 pages

Abstract:We present a machine learning approach for predicting social media engagement (comments and likes) from emotional and temporal features. The dataset contains 600 songs with annotations for valence, arousal, and related sentiment metrics. A multi-target regression model based on HistGradientBoostingRegressor is trained on log-transformed engagement ratios to address skewed targets. Performance is evaluated with both a custom order-of-magnitude accuracy and standard regression metrics, including the coefficient of determination (R²). Results show that emotional and temporal metadata, together with existing view counts, predict future engagement effectively. The model attains R² = 0.98 for likes but only R² = 0.41 for comments. This gap indicates that likes are largely driven by readily captured affective and exposure signals, whereas comments depend on additional factors not represented in the current feature set.

[LG-8] Introduction to the Analysis of Probabilistic Decision-Making Algorithms

Link: https://arxiv.org/abs/2508.21620
Authors: Agustinus Kristiadi
Categories: Machine Learning (cs.LG)
Comments:

Abstract:Decision theories offer principled methods for making choices under various types of uncertainty. Algorithms that implement these theories have been successfully applied to a wide range of real-world problems, including materials and drug discovery. Indeed, they are desirable since they can adaptively gather information to make better decisions in the future, resulting in data-efficient workflows. In scientific discovery, where experiments are costly, these algorithms can thus significantly reduce the cost of experimentation. Theoretical analyses of these algorithms are crucial for understanding their behavior and providing valuable insights for developing next-generation algorithms. However, theoretical analyses in the literature are often inaccessible to non-experts. This monograph aims to provide an accessible, self-contained introduction to the theoretical analysis of commonly used probabilistic decision-making algorithms, including bandit algorithms, Bayesian optimization, and tree search algorithms. Only basic knowledge of probability theory and statistics, along with some elementary knowledge about Gaussian processes, is assumed.

[LG-9] Adapting to Change: A Comparison of Continual and Transfer Learning for Modeling Building Thermal Dynamics under Concept Drifts

Link: https://arxiv.org/abs/2508.21615
Authors: Fabian Raisch, Max Langtry, Felix Koch, Ruchi Choudhary, Christoph Goebel, Benjamin Tischler
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG)
Comments: Currently under review

Abstract:Transfer Learning (TL) is currently the most effective approach for modeling building thermal dynamics when only limited data are available. TL uses a pretrained model that is fine-tuned to a specific target building. However, it remains unclear how to proceed after initial fine-tuning, as more operational measurement data are collected over time. This challenge becomes even more complex when the dynamics of the building change, for example, after a retrofit or a change in occupancy. In Machine Learning literature, Continual Learning (CL) methods are used to update models of changing systems. TL approaches can also address this challenge by reusing the pretrained model at each update step and fine-tuning it with new measurement data. A comprehensive study on how to incorporate new measurement data over time to improve prediction accuracy and address the challenges of concept drifts (changes in dynamics) for building thermal dynamics is still missing. Therefore, this study compares several CL and TL strategies, as well as a model trained from scratch, for thermal dynamics modeling during building operation. The methods are evaluated using 5–7 years of simulated data representative of single-family houses in Central Europe, including scenarios with concept drifts from retrofits and changes in occupancy. We propose a CL strategy, Seasonal Memory Learning (SML), that provides greater accuracy improvements than existing CL and TL methods, while maintaining low computational effort. SML outperformed the benchmark of initial fine-tuning by 28.1% without concept drifts and 34.9% with concept drifts.

[LG-10] Convergence of Stochastic Gradient Methods for Wide Two-Layer Physics-Informed Neural Networks

Link: https://arxiv.org/abs/2508.21571
Authors: Bangti Jin, Longjun Wu
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA); Machine Learning (stat.ML)
Comments: 24 pages

Abstract:Physics-informed neural networks (PINNs) represent a very popular class of neural solvers for partial differential equations. In practice, one often employs stochastic gradient descent type algorithms to train the neural network. Therefore, the convergence guarantee of stochastic gradient descent is of fundamental importance. In this work, we establish, with high probability, the linear convergence of stochastic gradient descent / flow in training over-parameterized two-layer PINNs for a general class of activation functions. These results extend the existing result [18] in which gradient descent was analyzed. The challenge of the analysis lies in handling the dynamic randomness introduced by stochastic optimization methods. The key to the analysis lies in ensuring the positive definiteness of suitable Gram matrices during the training. The analysis sheds light on the dynamics of the optimization process, and provides guarantees on the neural networks trained by stochastic algorithms.

[LG-11] OASIS: Harnessing Diffusion Adversarial Network for Ocean Salinity Imputation using Sparse Drifter Trajectories CIKM2025

Link: https://arxiv.org/abs/2508.21570
Authors: Bo Li, Yingqi Feng, Ming Jin, Xin Zheng, Yufei Tang, Laurent Cherubin, Alan Wee-Chung Liew, Can Wang, Qinghua Lu, Jingwei Yao, Shirui Pan, Hong Zhang, Xingquan Zhu
Categories: Machine Learning (cs.LG)
Comments: CIKM 2025 Accepted

Abstract:Ocean salinity plays a vital role in circulation, climate, and marine ecosystems, yet its measurement is often sparse, irregular, and noisy, especially in drifter-based datasets. Traditional approaches, such as remote sensing and optimal interpolation, rely on linearity and stationarity, and are limited by cloud cover, sensor drift, and low satellite revisit rates. While machine learning models offer flexibility, they often fail under severe sparsity and lack principled ways to incorporate physical covariates without specialized sensors. In this paper, we introduce the OceAn Salinity Imputation System (OASIS), a novel diffusion adversarial framework designed to address these challenges.

[LG-12] Comprehensive Signal Quality Evaluation of a Wearable Textile ECG Garment: A Sex-Balanced Study

Link: https://arxiv.org/abs/2508.21554
Authors: Maximilian P. Oppelt, Tobias S. Zech, Sarah H. Lorenz, Laurenz Ottmann, Jan Steffan, Bjoern M. Eskofier, Nadine R. Lang-Richter, Norman Pfeiffer
Categories: Machine Learning (cs.LG)
Comments:

Abstract:We introduce a novel wearable textile garment featuring an innovative electrode placement aimed at minimizing noise and motion artifacts, thereby enhancing signal fidelity in Electrocardiography (ECG) recordings. We present a comprehensive, sex-balanced evaluation involving 15 healthy male and 15 healthy female participants to ensure the device’s suitability across anatomical and physiological variations. The assessment framework encompasses distinct evaluation approaches: quantitative signal quality indices to objectively benchmark device performance; rhythm-based analyses of physiological parameters such as heart rate and heart rate variability; machine learning classification tasks to assess application-relevant predictive utility; morphological analysis of ECG features including amplitude and interval parameters; and investigations of the effects of electrode projection angle given by the textile / body shape, with all analyses stratified by sex to elucidate sex-specific influences. Evaluations were conducted across various activity phases representing real-world conditions. The results demonstrate that the textile system achieves signal quality highly concordant with reference devices in both rhythm and morphological analyses, exhibits robust classification performance, and enables identification of key sex-specific determinants affecting signal acquisition. These findings underscore the practical viability of textile-based ECG garments for physiological monitoring as well as psychophysiological state detection. Moreover, we identify the importance of incorporating sex-specific design considerations to ensure equitable and reliable cardiac diagnostics in wearable health technologies.

[LG-13] Binary Weight Multi-Bit Activation Quantization for Compute-in-Memory CNN Accelerators

Link: https://arxiv.org/abs/2508.21524
Authors: Wenyong Zhou, Zhengwu Liu, Yuan Ren, Ngai Wong
Subjects: Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 5 pages, 6 figures

Abstract: Compute-in-memory (CIM) accelerators have emerged as a promising way to enhance the energy efficiency of convolutional neural networks (CNNs). Deploying CNNs on CIM platforms generally requires quantization of network weights and activations to meet hardware constraints. However, existing approaches either prioritize hardware efficiency with binary weight and activation quantization at the cost of accuracy, or utilize multi-bit weights and activations for greater accuracy but limited efficiency. In this paper, we introduce a novel binary weight multi-bit activation (BWMA) method for CNNs on CIM-based accelerators. Our contributions include: deriving closed-form solutions for weight quantization in each layer, significantly improving the representational capabilities of binarized weights; and developing a differentiable function for activation quantization, approximating the ideal multi-bit function while bypassing the extensive search for optimal settings. Through comprehensive experiments on the CIFAR-10 and ImageNet datasets, we show that BWMA achieves notable accuracy improvements over existing methods, registering gains of 1.44%-5.46% and 0.35%-5.37% on the respective datasets. Moreover, hardware simulation results indicate that 4-bit activation quantization strikes the optimal balance between hardware cost and model performance.

[LG-14] Spiking Decision Transformers: Local Plasticity Phase-Coding and Dendritic Routing for Low-Power Sequence Control

Link: https://arxiv.org/abs/2508.21505
Authors: Vishal Pandey, Debasmita Biswas
Subjects: Machine Learning (cs.LG)
Comments: Preprint (31 pages, 19 images, 7 tables)

Abstract: Reinforcement learning agents based on Transformer architectures have achieved impressive performance on sequential decision-making tasks, but their reliance on dense matrix operations makes them ill-suited for energy-constrained, edge-oriented platforms. Spiking neural networks promise ultra-low-power, event-driven inference, yet no prior work has seamlessly merged spiking dynamics with return-conditioned sequence modeling. We present the Spiking Decision Transformer (SNN-DT), which embeds Leaky Integrate-and-Fire neurons into each self-attention block, trains end-to-end via surrogate gradients, and incorporates biologically inspired three-factor plasticity, phase-shifted spike-based positional encodings, and a lightweight dendritic routing module. Our implementation matches or exceeds standard Decision Transformer performance on classic control benchmarks (CartPole-v1, MountainCar-v0, Acrobot-v1, Pendulum-v1) while emitting fewer than ten spikes per decision, an energy proxy suggesting a reduction of over four orders of magnitude in per-inference energy. By marrying sequence modeling with neuromorphic efficiency, SNN-DT opens a pathway toward real-time, low-power control on embedded and wearable devices.

[LG-15] Failure Prediction Is a Better Performance Proxy for Early-Exit Networks Than Calibration

Link: https://arxiv.org/abs/2508.21495
Authors: Piotr Kubaty, Filip Szatkowski, Metod Jazbec, Bartosz Wójcik
Subjects: Machine Learning (cs.LG)

Abstract:Early-exit models speed up inference by attaching internal classifiers to intermediate layers of the model and allowing computation to stop once a prediction satisfies an exit criterion. Most early-exit methods rely on confidence-based exit strategies, which motivated some works to calibrate intermediate classifiers to improve the performance of the entire model. In this paper, we show that calibration measures can be misleading indicators of the performance of multi-exit models: a well-calibrated classifier may still waste computation, and common calibration methods do not preserve the sample ranking within a classifier. We demonstrate empirical cases where miscalibrated networks outperform calibrated ones. As an alternative, we propose to use failure prediction as a more useful proxy for early-exit model performance. Unlike calibration, failure prediction accounts for changes in the ranking of samples and shows a strong correlation with efficiency improvements, making it a more dependable basis for designing and evaluating early-exit models.

[LG-16] Normalized Maximum Likelihood Code-Length on Riemannian Manifold Data Spaces

Link: https://arxiv.org/abs/2508.21466
Authors: Kota Fukuzawa, Atsushi Suzuki, Kenji Yamanishi
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
Comments: 14 pages. This is a preprint of an article submitted to IEEE Transactions on Information Theory

Abstract:In recent years, with the large-scale expansion of graph data, there has been an increased focus on Riemannian manifold data spaces other than Euclidean space. In particular, the development of hyperbolic spaces has been remarkable, and they have high expressive power for graph data with hierarchical structures. Normalized Maximum Likelihood (NML) is employed in regret minimization and model selection. However, existing formulations of NML have been developed primarily in Euclidean spaces and are inherently dependent on the choice of coordinate systems, making it non-trivial to extend NML to Riemannian manifolds. In this study, we define a new NML that reflects the geometric structure of Riemannian manifolds, called the Riemannian manifold NML (Rm-NML). This Rm-NML is invariant under coordinate transformations and coincides with the conventional NML under the natural parameterization in Euclidean space. We extend existing computational techniques for NML to the setting of Riemannian manifolds. Furthermore, we derive a method to simplify the computation of Rm-NML on Riemannian symmetric spaces, which encompass data spaces of growing interest such as hyperbolic spaces. To illustrate the practical application of our proposed method, we explicitly computed the Rm-NML for normal distributions on hyperbolic spaces.

[LG-17] Beyond expected value: geometric mean optimization for long-term policy performance in reinforcement learning

Link: https://arxiv.org/abs/2508.21443
Authors: Xinyi Sheng, Dominik Baumann
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: Accepted final version to appear in the Proceedings of the IEEE Conference on Decision and Control

Abstract: Reinforcement learning (RL) algorithms typically optimize the expected cumulative reward, i.e., the expected value of the sum of scalar rewards an agent receives over the course of a trajectory. The expected value averages the performance over an infinite number of trajectories. However, when deploying the agent in the real world, this ensemble average may be uninformative for the performance of individual trajectories. Thus, in many applications, optimizing the long-term performance of individual trajectories might be more desirable. In this work, we propose a novel RL algorithm that combines the standard ensemble average with the time-average growth rate, a measure for the long-term performance of individual trajectories. We first define the Bellman operator for the time-average growth rate. We then show that, under multiplicative reward dynamics, the geometric mean aligns with the time-average growth rate. To address more general and unknown reward dynamics, we propose a modified geometric mean with an N-sliding window that captures the path-dependency as an estimator for the time-average growth rate. This estimator is embedded as a regularizer into the objective, forming a practical algorithm and enabling the policy to benefit from ensemble average and time-average simultaneously. We evaluate our algorithm in challenging simulations, where it outperforms conventional RL methods.
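
The gap between the ensemble average and the time-average growth rate can be seen in a small numeric sketch. The coin-flip factors below (1.5 and 0.6) are illustrative assumptions, not values from the paper, and the snippet shows only the underlying arithmetic, not the authors' algorithm.

```python
import math

# Hypothetical multiplicative dynamics (illustrative numbers, not from the paper):
# each step multiplies the accumulated reward by 1.5 or 0.6 with equal probability.
factors = [1.5, 0.6]
probs = [0.5, 0.5]

# Ensemble average of the per-step factor: what expected-value optimization sees.
expected_factor = sum(p * f for p, f in zip(probs, factors))

# Time-average growth rate: the expected log-factor.
# Its exponential is the geometric mean of the per-step factor.
growth_rate = sum(p * math.log(f) for p, f in zip(probs, factors))
geometric_mean = math.exp(growth_rate)

print(f"expected factor: {expected_factor:.4f}")
print(f"geometric mean:  {geometric_mean:.4f}")
```

Here the expected factor is above 1 while the geometric mean is below 1, so the average over many trajectories grows even though a typical single trajectory shrinks over time.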

[LG-18] Quantum enhanced ensemble GANs for anomaly detection in continuous biomanufacturing

Link: https://arxiv.org/abs/2508.21438
Authors: Rajiv Kailasanathan, William R. Clements, Mohammad Reza Boskabadi, Shawn M. Gibford, Emmanouil Papadakis, Christopher J. Savoie, Seyed Soheil Mansouri
Subjects: Machine Learning (cs.LG); Other Quantitative Biology (q-bio.OT); Quantum Physics (quant-ph)

Abstract:The development of continuous biomanufacturing processes requires robust and early anomaly detection, since even minor deviations can compromise yield and stability, leading to disruptions in scheduling, reduced weekly production, and diminished economic performance. These processes are inherently complex and exhibit non-linear dynamics with intricate relationships between process variables, thus making advanced methods for anomaly detection essential for efficient operation. In this work, we present a novel framework for unsupervised anomaly detection in continuous biomanufacturing based on an ensemble of generative adversarial networks (GANs). We first establish a benchmark dataset simulating both normal and anomalous operation regimes in a continuous process for the production of a small molecule. We then demonstrate the effectiveness of our GAN-based framework in detecting anomalies caused by sudden feedstock variability. Finally, we evaluate the impact of using a hybrid quantum/classical GAN approach with both a simulated quantum circuit and a real photonic quantum processor on anomaly detection performance. We find that the hybrid approach yields improved anomaly detection rates. Our work shows the potential of hybrid quantum/classical approaches for solving real-world problems in complex continuous biomanufacturing processes.

[LG-19] Rethinking Layer-wise Model Merging through Chain of Merges

Link: https://arxiv.org/abs/2508.21421
Authors: Pietro Buzzega, Riccardo Salami, Angelo Porrello, Simone Calderara
Subjects: Machine Learning (cs.LG)

Abstract: Fine-tuning pretrained models has become a standard pathway to achieve state-of-the-art performance across a wide range of domains, leading to a proliferation of task-specific model variants. As the number of such specialized modules increases, merging them into a unified model without retraining has become a critical challenge. Existing merging techniques often rely on interference heuristics, importance weighting, or activation matching while treating each layer independently, thereby failing to account for the inter-layer dependencies inherent in deep networks. This simplification leads to distributional mismatches, especially in activation-based methods, when changes in early layers are not properly reflected in downstream ones. We identify these mismatches as a form of internal covariate shift, comparable to the phenomenon encountered in the initial phases of neural network training. To address it, we propose Chain of Merges (CoM), a layer-wise merging procedure that updates activation statistics in an auto-regressive fashion, explicitly accounting for cross-layer interactions. CoM produces a coherent merged model through a series of conditionally optimal updates, effectively mitigating degradation caused by covariate shift. Experiments on standard benchmarks demonstrate that CoM achieves state-of-the-art performance.

[LG-20] PMODE: Theoretically Grounded and Modular Mixture Modeling

Link: https://arxiv.org/abs/2508.21396
Authors: Robert A. Vandermeulen
Subjects: Machine Learning (cs.LG)

Abstract:We introduce PMODE (Partitioned Mixture Of Density Estimators), a general and modular framework for mixture modeling with both parametric and nonparametric components. PMODE builds mixtures by partitioning the data and fitting separate estimators to each subset. It attains near-optimal rates for this estimator class and remains valid even when the mixture components come from different distribution families. As an application, we develop MV-PMODE, which scales a previously theoretical approach to high-dimensional density estimation to settings with thousands of dimensions. Despite its simplicity, it performs competitively against deep baselines on CIFAR-10 anomaly detection.

[LG-21] Faster Inference of Cell Complexes from Flows via Matrix Factorization

Link: https://arxiv.org/abs/2508.21372
Authors: Til Spreuer, Josef Hoppe, Michael T. Schaub
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Signal Processing (eess.SP)
Comments: 5 pages, 5 figures, accepted at EUSIPCO 2025 in Palermo, evaluation code available at this https URL

Abstract: We consider the following inference problem: Given a set of edge-flow signals observed on a graph, lift the graph to a cell complex, such that the observed edge-flow signals can be represented as a sparse combination of gradient and curl flows on the cell complex. Specifically, we aim to augment the observed graph by a set of 2-cells (polygons encircled by closed, non-intersecting paths), such that the eigenvectors of the Hodge Laplacian of the associated cell complex provide a sparse, interpretable representation of the observed edge flows on the graph. Because prior work has shown the general problem to be NP-hard, we develop a novel matrix-factorization-based heuristic to solve it. Using computational experiments, we demonstrate that our new approach is significantly less computationally expensive than prior heuristics, while achieving only marginally worse performance in most settings. In fact, we find that for noisy settings in particular, our new approach outperforms the previous state of the art in both solution quality and computational speed.

[LG-22] Distribution-Aware Feature Selection for SAEs

Link: https://arxiv.org/abs/2508.21324
Authors: Narmeen Oozeer, Nirmalendu Prakash, Michael Lan, Alice Rigg, Amirali Abdullah
Subjects: Machine Learning (cs.LG)

Abstract: Sparse autoencoders (SAEs) decompose neural activations into interpretable features. A widely adopted variant, the TopK SAE, reconstructs each token from its K most active latents. However, this approach is inefficient, as some tokens carry more information than others. BatchTopK addresses this limitation by selecting top activations across a batch of tokens. This improves average reconstruction but risks an “activation lottery,” where rare high-magnitude features crowd out more informative but lower-magnitude ones. To address this issue, we introduce Sampled-SAE: we score the columns (representing features) of the batch activation matrix (via L2 norm or entropy), forming a candidate pool of size Kl, and then apply Top-K to select tokens across the batch from the restricted pool of features. Varying l traces a spectrum between batch-level and token-specific selection. At l=1, tokens draw only from K globally influential features, while larger l expands the pool toward standard BatchTopK and more token-specific features across the batch. Small l thus enforces global consistency; large l favors fine-grained reconstruction. On Pythia-160M, no single value of l is optimal across all metrics: the best choice depends on the trade-off between shared structure, reconstruction fidelity, and downstream performance. Sampled-SAE thus reframes BatchTopK as a tunable, distribution-aware family.
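
The selection step described in the abstract can be sketched roughly as follows. This is one possible reading, not the authors' implementation: the function name `sampled_sae_mask`, the choice of the L2 norm as the column score, and the budget of `k` entries per token on average (as in BatchTopK) are all assumptions made for illustration.

```python
def sampled_sae_mask(acts, k, l):
    """Hypothetical sketch of pool-restricted batch selection.

    acts: list of token rows, each a list of non-negative feature activations.
    Returns a matrix of the same shape in which only selected entries survive.
    """
    n_tokens, n_feats = len(acts), len(acts[0])

    # 1) Score each feature column by its L2 norm over the batch.
    scores = [sum(row[j] ** 2 for row in acts) ** 0.5 for j in range(n_feats)]

    # 2) Candidate pool: the k*l highest-scoring features.
    pool_size = min(k * l, n_feats)
    pool = sorted(range(n_feats), key=lambda j: scores[j], reverse=True)[:pool_size]

    # 3) Batch-level top-k restricted to the pool: keep the k*n_tokens largest entries.
    entries = [(acts[i][j], i, j) for i in range(n_tokens) for j in pool]
    entries.sort(reverse=True)

    out = [[0.0] * n_feats for _ in range(n_tokens)]
    for value, i, j in entries[: k * n_tokens]:
        if value > 0:
            out[i][j] = value
    return out


# Toy batch: 3 tokens, 4 features; feature 3 fires rarely but strongly.
acts = [
    [0.9, 0.0, 0.1, 5.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.3, 0.0],
]
narrow = sampled_sae_mask(acts, k=1, l=1)  # pool holds a single feature
wider = sampled_sae_mask(acts, k=1, l=2)   # pool holds two features
```

With `l=1` the pool contains only the highest-norm feature, so every token is restricted to it; with `l=2` a second feature enters the pool and the batch budget is spread over both.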

[LG-23] Convergence of regularized agent-state-based Q-learning in POMDPs

Link: https://arxiv.org/abs/2508.21314
Authors: Amit Sinha, Matthieu Geist, Aditya Mahajan
Subjects: Machine Learning (cs.LG)
Comments: Accepted in CDC 2025

Abstract: In this paper, we present a framework to understand the convergence of commonly used Q-learning reinforcement learning algorithms in practice. Two salient features of such algorithms are: (i) the Q-table is recursively updated using an agent state (such as the state of a recurrent neural network) which is not a belief state or an information state and (ii) policy regularization is often used to encourage exploration and stabilize the learning algorithm. We investigate the simplest form of such Q-learning algorithms which we call regularized agent-state-based Q-learning (RASQL) and show that it converges under mild technical conditions to the fixed point of an appropriately defined regularized MDP, which depends on the stationary distribution induced by the behavioral policy. We also show that a similar analysis continues to work for a variant of RASQL that learns periodic policies. We present numerical examples to illustrate that the empirical convergence behavior matches the proposed theoretical limit.

[LG-24] Improving Fisher Information Estimation and Efficiency for LoRA-based LLM Unlearning

Link: https://arxiv.org/abs/2508.21300
Authors: Yejin Kim, Eunwon Kim, Buru Chang, Junsuk Choe
Subjects: Machine Learning (cs.LG)

Abstract:LLMs have demonstrated remarkable performance across various tasks but face challenges related to unintentionally generating outputs containing sensitive information. A straightforward approach to address this issue is to retrain the model after excluding the problematic data. However, this approach incurs prohibitively high computational costs. To overcome this limitation, machine unlearning has emerged as a promising solution that can effectively remove sensitive information without the need to retrain the model from scratch. Recently, FILA has been proposed as a parameter-efficient unlearning method by integrating LoRA adapters. Specifically, it calculates the Fisher information to identify parameters associated with the forget set and assigns them to LoRA adapters for updates. Despite its innovative approach, FILA still requires access to all model parameters and does not adequately account for fundamental assumptions underlying Fisher information, leading to inaccuracies in importance estimation. To address these limitations, we propose VILA, a novel unlearning framework that explicitly considers the assumptions overlooked in FILA, thereby enhancing the accuracy of parameter identification for the forget set. Moreover, VILA significantly reduces computational costs by enabling parameter identification without accessing the entire model. Our method achieves up to 100x higher parameter efficiency and 40x faster training speed compared to FILA, and sets new state-of-the-art performance on benchmarks including TOFU, WMDP, and MUSE. Our code is available at this https URL.

[LG-25] Detecting Domain Shifts in Myoelectric Activations: Challenges and Opportunities in Stream Learning PRICAI25

Link: https://arxiv.org/abs/2508.21278
Authors: Yibin Sun, Nick Lim, Guilherme Weigert Cassales, Heitor Murilo Gomes, Bernhard Pfahringer, Albert Bifet, Anany Dwivedi
Subjects: Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 16 pages, 5 figures, 1 table, PRICAI25

Abstract:Detecting domain shifts in myoelectric activations poses a significant challenge due to the inherent non-stationarity of electromyography (EMG) signals. This paper explores the detection of domain shifts using data stream (DS) learning techniques, focusing on the DB6 dataset from the Ninapro database. We define domains as distinct time-series segments based on different subjects and recording sessions, applying Kernel Principal Component Analysis (KPCA) with a cosine kernel to pre-process and highlight these shifts. By evaluating multiple drift detection methods such as CUSUM, Page-Hinckley, and ADWIN, we reveal the limitations of current techniques in achieving high performance for real-time domain shift detection in EMG signals. Our results underscore the potential of streaming-based approaches for maintaining stable EMG decoding models, while highlighting areas for further research to enhance robustness and accuracy in real-world scenarios.
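
For readers unfamiliar with the drift detectors named above, here is a minimal textbook-style sketch of the Page-Hinkley test for an upward shift in the mean. The parameter values (`delta`, `threshold`) and the synthetic step signal are illustrative assumptions and have no connection to the paper's EMG experiments.

```python
def page_hinkley(stream, delta=0.005, threshold=5.0):
    """Return the 0-based index at which an upward mean shift is flagged, or None."""
    mean = 0.0        # running mean of the values seen so far
    cumulative = 0.0  # cumulative deviation from the running mean, minus delta
    minimum = 0.0     # smallest cumulative value seen so far
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        cumulative += x - mean - delta
        minimum = min(minimum, cumulative)
        # Alarm when the cumulative sum rises far enough above its minimum.
        if cumulative - minimum > threshold:
            return t - 1
    return None


# Synthetic stream: the mean jumps from 0 to 1 at index 50.
step_stream = [0.0] * 50 + [1.0] * 50
alarm_index = page_hinkley(step_stream)
no_alarm = page_hinkley([0.0] * 100)
```

The detector fires a few samples after the change point, and stays silent on the constant stream; a larger `threshold` trades detection delay for fewer false alarms.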

[LG-26] CALM: A Framework for Continuous, Adaptive, and LLM-Mediated Anomaly Detection in Time-Series Streams

Link: https://arxiv.org/abs/2508.21273
Authors: Ashok Devireddy, Shunping Huang
Subjects: Machine Learning (cs.LG)

Abstract:The detection of anomalies in non-stationary time-series streams is a critical but challenging task across numerous industrial and scientific domains. Traditional models, trained offline, suffer significant performance degradation when faced with concept drift, where the underlying statistical properties of the data change over time. This paper introduces CALM (Continuous, Adaptive, and LLM-Mediated), a novel, end-to-end framework for real-time anomaly detection designed to address this challenge. CALM is built on the Apache Beam distributed processing framework and leverages the TimesFm foundation model for forecasting-based anomaly detection. The framework’s novelty lies in two core contributions. First, it implements a closed-loop, continuous fine-tuning mechanism that allows the anomaly detection model to adapt to evolving data patterns in near real-time. Second, it introduces an LLM-as-a-Judge component, a Large Language Model that provides semantic, context-aware judgments on detected anomalies to curate a high-quality training dataset, deciding whether an anomaly represents transient noise or a meaningful pattern shift. We evaluate CALM on the comprehensive TSB-UAD benchmark. Our results demonstrate that the continuously fine-tuned model improves the ROC AUC score in most datasets compared to the static, pre-trained base model, validating the efficacy of our adaptive, LLM-guided approach to maintaining high-performance anomaly detection in dynamic streaming environments.

[LG-27] Guess-and-Learn (GL): Measuring the Cumulative Error Cost of Cold-Start Adaptation

Link: https://arxiv.org/abs/2508.21270
Authors: Roland Arnold
Subjects: Machine Learning (cs.LG)
Comments: 15 pages, 7 figures. Main text is 10 pages. Code and data are available at this https URL

Abstract: Evaluation of machine learning models typically emphasizes final accuracy, overlooking the cost of adaptation: the cumulative errors incurred while learning from scratch. Guess-and-Learn (GL) v1.0 addresses this gap by measuring cold-start adaptability: the total mistakes a model makes while sequentially labeling an unlabeled dataset. At each step, the learner selects an instance, predicts its label, receives the ground truth, and updates parameters under either online (per-sample) or batch (delayed) mode. The resulting error trajectory exposes adaptation speed, selection quality, and bias, dynamics that are invisible to endpoint metrics. GL defines four tracks (Scratch/Pretrained × Online/Batch) to disentangle the effects of initialization and update frequency. We formalize the protocol, relate it to classical mistake-bound theory, and estimate a heuristic “oracle reference band” for MNIST as a plausibility reference. Baseline experiments on MNIST and AG News, spanning classical methods (Perceptron, k-NN), convolutional architectures (CNN, ResNet-50), and pretrained transformers (ViT-B/16, BERT-base), reveal systematic differences in early-phase efficiency: smaller models can adapt with fewer initial errors, while pretraining benefits vary by domain. Across settings, current models remain well above the oracle band, highlighting an adaptability gap. By quantifying the mistake cost of early learning, GL complements conventional benchmarks and provides a reproducible framework for developing learners that are not only accurate in the limit but also reliable from the first examples.
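
The online track of the protocol (predict, reveal the label, update, count mistakes) can be sketched as below. This is a simplified illustration under stated assumptions: the names `guess_and_learn_online` and `Perceptron` are hypothetical, instances are taken in their given order rather than chosen by a selection strategy, and the toy data is made up.

```python
class Perceptron:
    """Tiny perceptron for labels in {-1, +1} (illustrative learner)."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else -1

    def update(self, x, y):
        # Classic mistake-driven update: change weights only on an error.
        if self.predict(x) != y:
            self.w = [wi + y * xi for wi, xi in zip(self.w, x)]
            self.b += y


def guess_and_learn_online(pool, learner):
    """Label the pool one instance at a time; return the cumulative-mistake trajectory."""
    mistakes, trajectory = 0, []
    for x, y in pool:
        if learner.predict(x) != y:  # guess before seeing the label
            mistakes += 1
        learner.update(x, y)         # then learn from the revealed ground truth
        trajectory.append(mistakes)
    return trajectory


# Toy linearly separable pool (hypothetical data).
pool = [([1.0, 0.0], 1), ([-1.0, 0.0], -1)] * 5
trajectory = guess_and_learn_online(pool, Perceptron(dim=2))
```

The final entry of `trajectory` is the total mistake count, while its shape over time shows how quickly the learner stops making errors.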

[LG-28] Owen Sampling Accelerates Contribution Estimation in Federated Learning ECAI2025

Link: https://arxiv.org/abs/2508.21261
Authors: Hossein KhademSohi, Hadi Hemmati, Jiayu Zhou, Steve Drew
Subjects: Machine Learning (cs.LG)
Comments: ECAI 2025 camera-ready; 8 pages + appendix; code link inside

Abstract:Federated Learning (FL) aggregates information from multiple clients to train a shared global model without exposing raw data. Accurately estimating each client’s contribution is essential not just for fair rewards, but for selecting the most useful clients so the global model converges faster. The Shapley value is a principled choice, yet exact computation scales exponentially with the number of clients, making it infeasible for large federations. We propose FedOwen, an efficient framework that uses Owen sampling to approximate Shapley values under the same total evaluation budget as existing methods while keeping the approximation error small. In addition, FedOwen uses an adaptive client selection strategy that balances exploiting high-value clients with exploring under-sampled ones, reducing bias and uncovering rare but informative data. Under a fixed valuation cost, FedOwen achieves up to 23 percent higher final accuracy within the same number of communication rounds compared to state-of-the-art baselines on non-IID benchmarks.
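
As background, Owen sampling estimates a Shapley value by averaging marginal contributions over coalitions in which every other player joins independently with probability q, for q on a grid over [0, 1]. The sketch below is a generic illustration of that idea, not FedOwen itself: the function name, the grid size, and the additive toy game are assumptions, and in a federated setting `value` would be some utility of the model built from a subset of clients.

```python
import random


def owen_shapley(players, value, q_levels=11, samples_per_level=20, seed=0):
    """Estimate Shapley values with Owen sampling (generic sketch)."""
    rng = random.Random(seed)
    estimates = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for level in range(q_levels):
            q = level / (q_levels - 1)  # inclusion probability on a uniform grid
            for _ in range(samples_per_level):
                # Each other player joins the coalition independently with probability q.
                coalition = frozenset(p for p in others if rng.random() < q)
                total += value(coalition | {i}) - value(coalition)
        estimates[i] = total / (q_levels * samples_per_level)
    return estimates


# Additive toy game: a coalition is worth the sum of its members' weights,
# so each player's Shapley value equals its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
shapley = owen_shapley(list(weights), lambda s: sum(weights[p] for p in s))
```

The additive game is chosen because its exact Shapley values are known, which makes it a convenient sanity check before plugging in an expensive utility function.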

[LG-29] RelP: Faithful and Efficient Circuit Discovery via Relevance Patching

Link: https://arxiv.org/abs/2508.21258
Authors: Farnoush Rezaei Jafari, Oliver Eberle, Ashkan Khakzar, Neel Nanda
Subjects: Machine Learning (cs.LG)

Abstract:Activation patching is a standard method in mechanistic interpretability for localizing the components of a model responsible for specific behaviors, but it is computationally expensive to apply at scale. Attribution patching offers a faster, gradient-based approximation, yet suffers from noise and reduced reliability in deep, highly non-linear networks. In this work, we introduce Relevance Patching (RelP), which replaces the local gradients in attribution patching with propagation coefficients derived from Layer-wise Relevance Propagation (LRP). LRP propagates the network’s output backward through the layers, redistributing relevance to lower-level components according to local propagation rules that ensure properties such as relevance conservation or improved signal-to-noise ratio. Like attribution patching, RelP requires only two forward passes and one backward pass, maintaining computational efficiency while improving faithfulness. We validate RelP across a range of models and tasks, showing that it more accurately approximates activation patching than standard attribution patching, particularly when analyzing residual stream and MLP outputs in the Indirect Object Identification (IOI) task. For instance, for MLP outputs in GPT-2 Large, attribution patching achieves a Pearson correlation of 0.006, whereas RelP reaches 0.956, highlighting the improvement offered by RelP. Additionally, we compare the faithfulness of sparse feature circuits identified by RelP and Integrated Gradients (IG), showing that RelP achieves comparable faithfulness without the extra computational cost associated with IG.

[LG-30] Class Incremental Continual Learning with Self-Organizing Maps and Variational Autoencoders Using Synthetic Replay

Link: https://arxiv.org/abs/2508.21240
Authors: Pujan Thapa, Alexander Ororbia, Travis Desell
Subjects: Machine Learning (cs.LG)

Abstract: This work introduces a novel generative continual learning framework based on self-organizing maps (SOMs) and variational autoencoders (VAEs) to enable memory-efficient replay, eliminating the need to store raw data samples or task labels. For high-dimensional input spaces, such as those of CIFAR-10 and CIFAR-100, we design a scheme where the SOM operates over the latent space learned by a VAE, whereas for lower-dimensional inputs, such as those found in MNIST and FashionMNIST, the SOM operates in a standalone fashion. Our method stores a running mean, variance, and covariance for each SOM unit, from which synthetic samples are then generated during future learning iterations. For the VAE-based method, generated samples are fed through the decoder for use in subsequent replay. Experimental results on standard class-incremental benchmarks show that our approach performs competitively with state-of-the-art memory-based methods and outperforms memory-free methods, notably improving over the best state-of-the-art single-class incremental performance on CIFAR-10 and CIFAR-100 by nearly 10% and 7%, respectively. Our methodology further facilitates easy visualization of the learning process and can also be utilized as a generative model post-training. Results show our method’s capability as a scalable, task-label-free, and memory-efficient solution for continual learning.

[LG-31] Population-Scale Network Embeddings Expose Educational Divides in Network Structure Related to Right-Wing Populist Voting

Link: https://arxiv.org/abs/2508.21236
Authors: Malte Lüken (1 and 2 and 3), Javier Garcia-Bernardo (4), Sreeparna Deb (5), Flavio Hafner (1 and 3), Megha Khosla (5) ((1) Netherlands eScience Center, (2) University of Amsterdam, (3) Erasmus University Rotterdam, (4) Utrecht University, (5) Delft University of Technology)
Subjects: Social and Information Networks (cs.SI); Machine Learning (cs.LG); Applications (stat.AP)
Comments: 31 pages, 6 figures, Supplementary Materials available at this https URL

Abstract:Administrative registry data can be used to construct population-scale networks whose ties reflect shared social contexts between persons. With machine learning, such networks can be encoded into numerical representations – embeddings – that automatically capture individuals’ position within the network. We created embeddings for all persons in the Dutch population from a population-scale network that represents five shared contexts: neighborhood, work, family, household, and school. To assess the informativeness of these embeddings, we used them to predict right-wing populist voting. Embeddings alone predicted right-wing populist voting above chance-level but performed worse than individual characteristics. Combining the best subset of embeddings with individual characteristics only slightly improved predictions. However, after transforming the embeddings to make their dimensions more sparse and orthogonal, we found that one embedding dimension was strongly associated with the outcome. Mapping this dimension back to the population network revealed differences in network structure related to right-wing populist voting between different school ties and achieved education levels. Our study contributes methodologically by demonstrating how population-scale network embeddings can be made interpretable, and substantively by linking structural network differences in education to right-wing populist voting.

[LG-32] Multi-robot Path Planning and Scheduling via Model Predictive Optimal Transport (MPC-OT)

Link: https://arxiv.org/abs/2508.21205
Authors: Usman A. Khan, Mouhacine Benosman, Wenliang Liu, Federico Pecora, Joseph W. Durham
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: 2025 IEEE Conference on Decision and Control

Abstract: In this paper, we propose a novel methodology for path planning and scheduling for multi-robot navigation that is based on optimal transport theory and model predictive control. We consider a setup where N robots are tasked to navigate to M targets in a common space with obstacles. Mapping robots to targets first and then planning paths can result in overlapping paths that lead to deadlocks. We derive a strategy based on optimal transport that not only provides minimum cost paths from robots to targets but also guarantees non-overlapping trajectories. We achieve this by discretizing the space of interest into K cells and by imposing a K × K cost structure that describes the cost of transitioning from one cell to another. Optimal transport then provides optimal and non-overlapping cell transitions for the robots to reach the targets that can be readily deployed without any scheduling considerations. The proposed solution requires O(K^3 log K) computations in the worst case and O(K^2 log K) for well-behaved problems. To further accommodate potentially overlapping trajectories (unavoidable in certain situations) as well as robot dynamics, we show that a temporal structure can be integrated into optimal transport with the help of replans and model predictive control.

[LG-33] Synthetic CVs To Build and Test Fairness-Aware Hiring Tools

Link: https://arxiv.org/abs/2508.21179
Authors: Jorge Saldivar, Anna Gatzioura, Carlos Castillo
Categories: Computers and Society (cs.CY); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments:

Abstract:Algorithmic hiring has become increasingly necessary in some sectors as it promises to deal with hundreds or even thousands of applicants. At the heart of these systems are algorithms designed to retrieve and rank candidate profiles, which are usually represented by Curricula Vitae (CVs). Research has shown, however, that such technologies can inadvertently introduce bias, leading to discrimination based on factors such as candidates’ age, gender, or national origin. Developing methods to measure, mitigate, and explain bias in algorithmic hiring, as well as to evaluate and compare fairness techniques before deployment, requires sets of CVs that reflect the characteristics of people from diverse backgrounds. However, datasets of these characteristics that can be used to conduct this research do not exist. To address this limitation, this paper introduces an approach for building a synthetic dataset of CVs with features modeled on real materials collected through a data donation campaign. Additionally, the resulting dataset of 1,730 CVs is presented, which we envision as a potential benchmarking standard for research on algorithmic hiring discrimination.

[LG-34] RARR: Robust Real-World Activity Recognition with Vibration by Scavenging Near-Surface Audio Online

Link: https://arxiv.org/abs/2508.21167
Authors: Dong Yoon Lee, Alyssa Weakley, Hui Wei, Blake Brown, Keyana Carrion, Shijia Pan
Categories: Sound (cs.SD); Machine Learning (cs.LG)
Comments:

Abstract:One in four people with dementia live alone, leading family members to take on caregiving roles from a distance. Many researchers have developed remote monitoring solutions to lessen caregiving needs; however, limitations remain, including privacy-preserving solutions, activity recognition, and model generalizability to new users and environments. Structural vibration sensor systems are unobtrusive solutions that have been proven to accurately monitor human information, such as identification and activity recognition, in controlled settings by sensing surface vibrations generated by activities. However, when deploying in an end user’s home, current solutions require a substantial amount of labeled data for accurate activity recognition. Our scalable solution adapts synthesized data from near-surface acoustic audio to pretrain a model and allows fine-tuning with very limited data in order to create a robust framework for daily routine tracking.

[LG-35] Data-Driven Bifurcation Handling in Physics-Based Reduced-Order Vascular Hemodynamic Models

Link: https://arxiv.org/abs/2508.21165
Authors: Natalia L. Rubio, Eric F. Darve, Alison L. Marsden
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Numerical Analysis (math.NA)
Comments: 32 pages, 13 figures

Abstract:Three-dimensional (3D) finite-element simulations of cardiovascular flows provide high-fidelity predictions to support cardiovascular medicine, but their high computational cost limits clinical practicality. Reduced-order models (ROMs) offer computationally efficient alternatives but suffer reduced accuracy, particularly at vessel bifurcations where complex flow physics are inadequately captured by standard Poiseuille flow assumptions. We present an enhanced numerical framework that integrates machine learning-predicted bifurcation coefficients into zero-dimensional (0D) hemodynamic ROMs to improve accuracy while maintaining computational efficiency. We develop a resistor-resistor-inductor (RRI) model that uses neural networks to predict pressure-flow relationships from bifurcation geometry, incorporating linear and quadratic resistances along with inductive effects. The method employs non-dimensionalization to reduce training data requirements and a priori flow split prediction for improved bifurcation characterization. We incorporate the RRI model into a 0D model using an optimization-based solution strategy. We validate the approach in isolated bifurcations and vascular trees, across Reynolds numbers from 0 to 5,500, defining ROM accuracy by comparison to 3D finite-element simulation. Results demonstrate substantial accuracy improvements: averaged across all trees and Reynolds numbers, the RRI method reduces inlet pressure errors from 54 mmHg (45%) for standard 0D models to 25 mmHg (17%), while a simplified resistor-inductor (RI) variant achieves 31 mmHg (26%) error. The enhanced 0D models show particular effectiveness at high Reynolds numbers and in extensive vascular networks. This hybrid numerical approach enables accurate, real-time hemodynamic modeling for clinical decision support, uncertainty quantification, and digital twins in cardiovascular biomedical engineering.

[LG-36] Privacy Auditing Synthetic Data Release through Local Likelihood Attacks

Link: https://arxiv.org/abs/2508.21146
Authors: Joshua Ward, Chi-Hua Wang, Guang Cheng
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Auditing the privacy leakage of synthetic data is an important but unresolved problem. Most existing privacy auditing frameworks for synthetic data rely on heuristics and unreasonable assumptions to attack the failure modes of generative models, exhibiting limited capability to describe and detect the privacy exposure of training data through synthetic data release. In this paper, we study designing Membership Inference Attacks (MIAs) that specifically exploit the observation that tabular generative models tend to significantly overfit to certain regions of the training distribution. Here, we propose Generative Likelihood Ratio Attack (Gen-LRA), a novel, computationally efficient No-Box MIA that, with no assumption of model knowledge or access, formulates its attack by evaluating the influence a test observation has in a surrogate model’s estimation of a local likelihood ratio over the synthetic data. Assessed over a comprehensive benchmark spanning diverse datasets, model architectures, and attack parameters, we find that Gen-LRA consistently dominates other MIAs for generative models across multiple performance metrics. These results underscore Gen-LRA’s effectiveness as a privacy auditing tool for the release of synthetic data, highlighting the significant privacy risks posed by generative model overfitting in real-world applications.

[LG-37] Adaptive LLM Routing under Budget Constraints EMNLP2025

Link: https://arxiv.org/abs/2508.21141
Authors: Pranoy Panda, Raghav Magazine, Chaitanya Devaguptapu, Sho Takemori, Vishal Sharma
Categories: Machine Learning (cs.LG)
Comments: Accepted at EMNLP 2025 (findings)

Abstract:Large Language Models (LLMs) have revolutionized natural language processing, but their varying capabilities and costs pose challenges in practical applications. LLM routing addresses this by dynamically selecting the most suitable LLM for each query/task. Previous approaches treat this as a supervised learning problem, assuming complete knowledge of optimal query-LLM pairings. However, real-world scenarios lack such comprehensive mappings and face evolving user queries. We thus propose to study LLM routing as a contextual bandit problem, enabling adaptive decision-making using bandit feedback without requiring exhaustive inference across all LLMs for all queries (in contrast to supervised routing). To address this problem, we develop a shared embedding space for queries and LLMs, where query and LLM embeddings are aligned to reflect their affinity. This space is initially learned from offline human preference data and refined through online bandit feedback. We instantiate this idea through Preference-prior Informed Linucb fOr adaptive rouTing (PILOT), a novel extension of LinUCB. To handle diverse user budgets for model routing, we introduce an online cost policy modeled as a multi-choice knapsack problem, ensuring resource-efficient routing.
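For readers unfamiliar with the baseline that PILOT extends, the following is a minimal sketch of plain disjoint LinUCB, where each arm would correspond to one candidate LLM and the context vector to a query embedding. It is not the PILOT algorithm from the paper: the preference-based prior and the knapsack cost policy are omitted, and the names `LinUCBArm`, `select_arm`, and the `alpha` exploration weight are illustrative assumptions.

```python
import math


class LinUCBArm:
    """One arm of disjoint LinUCB with a ridge-regularised linear reward model."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha  # exploration weight (hypothetical default)
        # Inverse of A = I + sum of x x^T, maintained incrementally.
        self.a_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim  # running sum of reward * x

    def _a_inv_times(self, vec):
        return [sum(row[j] * vec[j] for j in range(len(vec))) for row in self.a_inv]

    def ucb(self, x):
        """Estimated reward plus confidence width for context x."""
        theta = self._a_inv_times(self.b)
        a_inv_x = self._a_inv_times(x)
        mean = sum(t * xi for t, xi in zip(theta, x))
        width = math.sqrt(sum(xi * v for xi, v in zip(x, a_inv_x)))
        return mean + self.alpha * width

    def update(self, x, reward):
        """Sherman-Morrison rank-one update of A^{-1}, then update b."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, a_inv_x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]


def select_arm(arms, x):
    """Pick the index of the arm with the highest upper confidence bound."""
    return max(range(len(arms)), key=lambda i: arms[i].ucb(x))


# Toy usage: arm 0 is consistently rewarded for this context, arm 1 is not.
arms = [LinUCBArm(2, alpha=0.1), LinUCBArm(2, alpha=0.1)]
context = [1.0, 0.0]
for _ in range(20):
    arms[0].update(context, 1.0)
    arms[1].update(context, 0.0)
chosen = select_arm(arms, context)
```

Only the chosen arm's reward is needed for each update, which is the bandit-feedback property the abstract contrasts with supervised routing.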

[LG-38] Considerations for Estimating Causal Effects of Informatively Timed Treatments

Link: https://arxiv.org/abs/2508.21804
Authors: Arman Oganisian
Categories: Methodology (stat.ME); Machine Learning (cs.LG)
Comments:

Abstract:Epidemiological studies are often concerned with estimating causal effects of a sequence of treatment decisions on survival outcomes. In many settings, treatment decisions do not occur at fixed, pre-specified follow-up times. Rather, timing varies across subjects in ways that may be informative of subsequent treatment decisions and potential outcomes. Awareness of the issue and its potential solutions is lacking in the literature, which motivates this work. Here, we formalize the issue of informative timing, problems associated with ignoring it, and show how g-methods can be used to analyze sequential treatments that are informatively timed. As we describe, in such settings, the waiting times between successive treatment decisions may be properly viewed as time-varying confounders. Using synthetic examples, we illustrate how g-methods that do not adjust for these waiting times may be biased and how adjustment can be done in scenarios where patients may die or be censored in between treatments. We draw connections between adjustment and identification with discrete-time versus continuous-time models. Finally, we provide implementation guidance and examples using publicly available software. Our concluding message is that 1) considering timing is important for valid inference and 2) correcting for informative timing can be done with g-methods that adjust for waiting times between treatments as time-varying confounders.

[LG-39] Surface Stability Modeling with Universal Machine Learning Interatomic Potentials: A Comprehensive Cleavage Energy Benchmarking Study

Link: https://arxiv.org/abs/2508.21663
Authors: Ardavan Mehdizadeh, Peter Schindler
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 70 pages total (main paper + supplementary information), 4 figures in main text, multiple supplementary figures and tables

Abstract:Machine learning interatomic potentials (MLIPs) have revolutionized computational materials science by bridging the gap between quantum mechanical accuracy and classical simulation efficiency, enabling unprecedented exploration of materials properties across the periodic table. Despite their remarkable success in predicting bulk properties, no systematic evaluation has assessed how well these universal MLIPs (uMLIPs) can predict cleavage energies, a critical property governing fracture, catalysis, surface stability, and interfacial phenomena. Here, we present a comprehensive benchmark of 19 state-of-the-art uMLIPs for cleavage energy prediction using our previously established density functional theory (DFT) database of 36,718 slab structures spanning elemental, binary, and ternary metallic compounds. We evaluate diverse architectural paradigms, analyzing their performance across chemical compositions, crystal systems, thickness, and surface orientations. Our results reveal that training data composition dominates architectural sophistication: models trained on the Open Materials 2024 (OMat24) dataset, which emphasizes non-equilibrium configurations, achieve mean absolute percentage errors below 6% and correctly identify the thermodynamically most stable surface terminations in 87% of cases, without any explicit surface energy training. In contrast, architecturally identical models trained on equilibrium-only datasets show five-fold higher errors, while models trained on surface-adsorbate data fail catastrophically with a 17-fold degradation. Remarkably, simpler architectures trained on appropriate data achieve comparable accuracy to complex transformers while offering 10-100x computational speedup. These findings show that the community should focus on strategic training data generation that captures the relevant physical phenomena.

[LG-40] Machine Intelligence on the Edge: Interpretable Cardiac Pattern Localisation Using Reinforcement Learning

Link: https://arxiv.org/abs/2508.21652
Authors: Haozhe Tian, Qiyu Rao, Nina Moutonnet, Pietro Ferraro, Danilo Mandic
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments:

Abstract:Matched filters are widely used to localise signal patterns due to their high efficiency and interpretability. However, their effectiveness deteriorates for low signal-to-noise ratio (SNR) signals, such as those recorded on edge devices, where prominent noise patterns can closely resemble the target within the limited length of the filter. One example is the ear-electrocardiogram (ear-ECG), where the cardiac signal is attenuated and heavily corrupted by artefacts. To address this, we propose the Sequential Matched Filter (SMF), a paradigm that replaces the conventional single matched filter with a sequence of filters designed by a Reinforcement Learning agent. By formulating filter design as a sequential decision-making process, SMF adaptively designs signal-specific filter sequences that remain fully interpretable by revealing key patterns driving the decision-making. The proposed SMF framework has strong potential for reliable and interpretable clinical decision support, as demonstrated by its state-of-the-art R-peak detection and physiological state classification performance on two challenging real-world ECG datasets. The proposed formulation can also be extended to a broad range of applications that require accurate pattern localisation from noise-corrupted signals.
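As background, the conventional single matched filter that SMF replaces can be sketched as a sliding dot product of a template against the signal, with the highest score marking the most likely pattern location. This is only an illustration of that baseline in plain Python, not the paper's SMF method; the function name `matched_filter_peak` and the toy signal are hypothetical.

```python
def matched_filter_peak(signal, template):
    """Slide the template over the signal and return (best_offset, scores)."""
    m = len(template)
    if m == 0 or m > len(signal):
        raise ValueError("template must be non-empty and no longer than the signal")
    # Score at offset i is the dot product of the template with signal[i:i+m].
    scores = [
        sum(signal[i + j] * template[j] for j in range(m))
        for i in range(len(signal) - m + 1)
    ]
    best_offset = max(range(len(scores)), key=scores.__getitem__)
    return best_offset, scores


# Toy example: a small spike pattern embedded at offset 3 with mild noise.
toy_signal = [0.0, 0.1, -0.1, 1.0, 2.0, 1.0, 0.0, 0.1]
toy_template = [1.0, 2.0, 1.0]
offset, scores = matched_filter_peak(toy_signal, toy_template)
```

When noise inside the window resembles the template, the score at a wrong offset can exceed the true one, which is the low-SNR failure mode the abstract describes.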

[LG-41] Adaptive generative moment matching networks for improved learning of dependence structures

Link: https://arxiv.org/abs/2508.21531
Authors: Marius Hofert, Gan Yao
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Computation (stat.CO)
Comments:

Abstract:An adaptive bandwidth selection procedure for the mixture kernel in the maximum mean discrepancy (MMD) for fitting generative moment matching networks (GMMNs) is introduced, and its ability to improve the learning of copula random number generators is demonstrated. Based on the relative error of the training loss, the number of kernels is increased during training; additionally, the relative error of the validation loss is used as an early stopping criterion. While training time of such adaptively trained GMMNs (AGMMNs) is similar to that of GMMNs, training performance is increased significantly in comparison to GMMNs, which is assessed and shown based on validation MMD trajectories, samples and validation MMD values. Superiority of AGMMNs over GMMNs, as well as typical parametric copula models, is demonstrated in terms of three applications. First, convergence rates of quasi-random versus pseudo-random samples from high-dimensional copulas are investigated for three functionals of interest and in dimensions as large as 100 for the first time. Second, replicated validation MMDs, as well as Monte Carlo and quasi-Monte Carlo applications based on the expected payoff of a basket call option and the risk measure expected shortfall as functionals are used to demonstrate the improved training of AGMMNs over GMMNs for a copula model fitted to the standardized residuals of the 50 constituents of the S&P 500 index after deGARCHing. Last, both the latter dataset and 50 constituents of the FTSE 100 are used to demonstrate that the improved training of AGMMNs over GMMNs and in comparison to the fitting of classical parametric copula models indeed also translates to an improved model prediction.
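To make the loss in question concrete, here is a minimal sketch of a squared MMD estimate with a mixture of Gaussian kernels, written in plain Python. It assumes an equally weighted mixture and the biased (V-statistic) estimator; the paper's adaptive procedure for growing the number of kernels during training is not implemented, and the function names and bandwidth values are illustrative only.

```python
import math


def mixture_kernel(x, y, bandwidths):
    """Equally weighted mixture of Gaussian kernels exp(-||x - y||^2 / (2 h^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return sum(math.exp(-sq_dist / (2.0 * h * h)) for h in bandwidths) / len(bandwidths)


def mmd2(xs, ys, bandwidths):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(mixture_kernel(u, v, bandwidths) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy usage with hypothetical bandwidths: identical samples give zero,
# a shifted sample gives a strictly positive value.
sample_a = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0)]
sample_b = [(5.0, 5.0), (6.0, 5.5), (5.5, 6.0)]
same = mmd2(sample_a, sample_a, bandwidths=[0.5, 1.0, 2.0])
shifted = mmd2(sample_a, sample_b, bandwidths=[0.5, 1.0, 2.0])
```

Adding a bandwidth to the list changes which scales of discrepancy the loss is sensitive to, which is the quantity an adaptive scheme would adjust during training.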

[LG-42] Data-driven Discovery of Digital Twins in Biomedical Research

Link: https://arxiv.org/abs/2508.21484
Authors: Clémence Métayer, Annabelle Ballesta, Julien Martinelli
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments:

Abstract:Recent technological advances have expanded the availability of high-throughput biological datasets, enabling the reliable design of digital twins of biomedical systems or patients. Such computational tools represent key reaction networks driving perturbation or drug response and can guide drug discovery and personalized therapeutics. Yet, their development still relies on laborious data integration by the human modeler, so that automated approaches are critically needed. The success of data-driven system discovery in Physics, rooted in clean datasets and well-defined governing laws, has fueled interest in applying similar techniques in Biology, which presents unique challenges. Here, we review methodologies for automatically inferring digital twins from biological time series, which mostly involve symbolic or sparse regression. We evaluate algorithms according to eight biological and methodological challenges, associated with noisy/incomplete data, multiple conditions, prior knowledge integration, latent variables, high dimensionality, unobserved variable derivatives, candidate library design, and uncertainty quantification. On these criteria, sparse regression generally outperformed symbolic regression, particularly when using Bayesian frameworks. We further highlight the emerging role of deep learning and large language models, which enable innovative prior knowledge integration, though the reliability and consistency of such approaches must be improved. While no single method addresses all challenges, we argue that progress in learning digital twins will come from hybrid and modular frameworks combining chemical reaction network-based mechanistic grounding, Bayesian uncertainty quantification, and the generative and knowledge integration capacities of deep learning. To support their development, we further propose a benchmarking framework to evaluate methods across all challenges.

[LG-43] Weighted Support Points from Random Measures: An Interpretable Alternative for Generative Modeling

Link: https://arxiv.org/abs/2508.21255
Authors: Peiqi Zhao, Carlos E. Rodríguez, Ramsés H. Mena, Stephen G. Walker
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 24 pages, 6 figures

Abstract:Support points summarize a large dataset through a smaller set of representative points that can be used for data operations, such as Monte Carlo integration, without requiring access to the full dataset. In this sense, support points offer a compact yet informative representation of the original data. We build on this idea to introduce a generative modeling framework based on random weighted support points, where the randomness arises from a weighting scheme inspired by the Dirichlet process and the Bayesian bootstrap. The proposed method generates diverse and interpretable sample sets from a fixed dataset, without relying on probabilistic modeling assumptions or neural network architectures. We present the theoretical formulation of the method and develop an efficient optimization algorithm based on the Convex–Concave Procedure (CCP). Empirical results on the MNIST and CelebA-HQ datasets show that our approach produces high-quality and diverse outputs at a fraction of the computational cost of black-box alternatives such as Generative Adversarial Networks (GANs) or Denoising Diffusion Probabilistic Models (DDPMs). These results suggest that random weighted support points offer a principled, scalable, and interpretable alternative for generative modeling. A key feature is their ability to produce genuinely interpolative samples that preserve underlying data structure.

[LG-44] Quantum-inspired probability metrics define a complete universal space for statistical learning

Link: https://arxiv.org/abs/2508.21086
Authors: Logan S. McCarty
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 42 pages, 1 figure

Abstract:Comparing probability distributions is a core challenge across the natural, social, and computational sciences. Existing methods, such as Maximum Mean Discrepancy (MMD), struggle in high-dimensional and non-compact domains. Here we introduce quantum probability metrics (QPMs), derived by embedding probability measures in the space of quantum states: positive, unit-trace operators on a Hilbert space. This construction extends kernel-based methods and overcomes the incompleteness of MMD on non-compact spaces. Viewed as an integral probability metric (IPM), QPMs have dual functions that uniformly approximate all bounded, uniformly continuous functions on $\mathbb{R}^n$, offering enhanced sensitivity to subtle distributional differences in high dimensions. For empirical distributions, QPMs are readily calculated using eigenvalue methods, with analytic gradients suited for learning and optimization. Although computationally more intensive for large sample sizes ($O(n^3)$ vs. $O(n^2)$), QPMs can significantly improve performance as a drop-in replacement for MMD, as demonstrated in a classic generative modeling task. By combining the rich mathematical framework of quantum mechanics with classical probability theory, this approach lays the foundation for powerful tools to analyze and manipulate probability measures.

[LG-45] ImmunoAI: Accelerated Antibody Discovery Using Gradient-Boosted Machine Learning with Thermodynamic-Hydrodynamic Descriptors and 3D Geometric Interface Topology CEC

Link: https://arxiv.org/abs/2508.21082
Authors: Shawnak Shivakumar, Matthew Sandora
Categories: Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Comments: 6 pages, accepted at IEEE International Conference on Electrical, Computer, Communications and Mechatronics Engineering (ICECCME) '25

Abstract:Human metapneumovirus (hMPV) poses serious risks to pediatric, elderly, and immunocompromised populations. Traditional antibody discovery pipelines require 10-12 months, limiting their applicability for rapid outbreak response. This project introduces ImmunoAI, a machine learning framework that accelerates antibody discovery by predicting high-affinity candidates using gradient-boosted models trained on thermodynamic, hydrodynamic, and 3D topological interface descriptors. A dataset of 213 antibody-antigen complexes was curated to extract geometric and physicochemical features, and a LightGBM regressor was trained to predict binding affinity with high precision. The model reduced the antibody candidate search space by 89%, and fine-tuning on 117 SARS-CoV-2 binding pairs further reduced Root Mean Square Error (RMSE) from 1.70 to 0.92. In the absence of an experimental structure for the hMPV A2.2 variant, AlphaFold2 was used to predict its 3D structure. The fine-tuned model identified two optimal antibodies with predicted picomolar affinities targeting key mutation sites (G42V and E96K), making them excellent candidates for experimental testing. In summary, ImmunoAI shortens design cycles and enables faster, structure-informed responses to viral outbreaks.

Information Retrieval

[IR-0] DMGIN: How Multimodal LLMs Enhance Large Recommendation Models for Lifelong User Post-click Behaviors

Link: https://arxiv.org/abs/2508.21801
Authors: Zhuoxing Wei, Qingchen Xie, Qi Liu
Categories: Information Retrieval (cs.IR)
Comments: 8 pages, 5 figures

Abstract:Modeling user interest based on lifelong user behavior sequences is crucial for enhancing Click-Through Rate (CTR) prediction. However, long post-click behavior sequences themselves pose severe performance issues: the sheer volume of data leads to high computational costs and inefficiencies in model training and inference. Traditional methods address this by introducing two-stage approaches, but this compromises model effectiveness due to incomplete utilization of the full sequence context. More importantly, integrating multimodal embeddings into existing large recommendation models (LRMs) presents significant challenges: these embeddings often exacerbate computational burdens and are mismatched with LRM architectures. To address these issues and enhance the model’s efficiency and accuracy, we introduce Deep Multimodal Group Interest Network (DMGIN). Given the observation that user post-click behavior sequences contain a large number of repeated items with varying behaviors and timestamps, DMGIN employs Multimodal LLMs (MLLMs) for grouping to reorganize complete lifelong post-click behavior sequences more effectively, with almost no additional computational overhead, as opposed to directly introducing multimodal embeddings. To mitigate the potential information loss from grouping, we have implemented two key strategies. First, we analyze behaviors within each group using both interest statistics and intra-group transformers to capture group traits. Second, we apply inter-group transformers to temporally ordered groups to capture the evolution of user group interests. Our extensive experiments on both industrial and public datasets confirm the effectiveness and efficiency of DMGIN. The A/B test in our LBS advertising system shows that DMGIN improves CTR by 4.7% and Revenue per Mille by 2.3%.

[IR-1] T-Retrievability: A Topic-Focused Approach to Measure Fair Document Exposure in Information Retrieval CIKM2025

Link: https://arxiv.org/abs/2508.21704
Authors: Xuejun Chang, Zaiqiao Meng, Debasis Ganguly
Categories: Information Retrieval (cs.IR)
Comments: Accepted by Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), November 10-14, 2025, Seoul, Republic of Korea

Abstract:Retrievability of a document is a collection-based statistic that measures its expected (reciprocal) rank of being retrieved within a specific rank cut-off. A collection with uniformly distributed retrievability scores across documents is an indicator of fair document exposure. While retrievability scores have been used to quantify the fairness of exposure for a collection, in our work, we use the distribution of retrievability scores to measure the exposure bias of retrieval models. We hypothesise that an uneven distribution of retrievability scores across the entire collection may not accurately reflect exposure bias but rather indicate variations in topical relevance. As a solution, we propose a topic-focused localised retrievability measure, which we call T-Retrievability (topic-retrievability), which first computes retrievability scores over multiple groups of topically-related documents, and then aggregates these localised values to obtain the collection-level statistics. Our analysis using this proposed T-Retrievability measure uncovers new insights into the exposure characteristics of various neural ranking models. The findings suggest that this localised measure provides a more nuanced understanding of exposure fairness, offering a more reliable approach for assessing document accessibility in IR systems.
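The two-step idea in the abstract (score documents, then measure unevenness within topical groups and aggregate) can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact definition: the Gini coefficient as the unevenness measure, the plain mean as the aggregation, and the names `retrievability`, `gini`, and `topic_localised_gini` are all choices made here for the example.

```python
def retrievability(rankings, cutoff):
    """Sum of reciprocal ranks per document over all queries, within the cut-off."""
    scores = {}
    for ranked_docs in rankings.values():
        for rank, doc in enumerate(ranked_docs[:cutoff], start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / rank
    return scores


def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * v for i, v in enumerate(vals))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def topic_localised_gini(rankings, topic_groups, cutoff):
    """Gini of retrievability inside each topic group, averaged over groups."""
    scores = retrievability(rankings, cutoff)
    per_topic = [
        gini([scores.get(doc, 0.0) for doc in docs])
        for docs in topic_groups.values()
    ]
    return sum(per_topic) / len(per_topic)


# Toy data: two queries, three documents, two hypothetical topic groups.
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["a", "c", "b"]}
toy_topics = {"t1": ["a", "b"], "t2": ["b", "c"]}
localised = topic_localised_gini(toy_rankings, toy_topics, cutoff=2)
```

In the toy data, document `a` dominates within topic `t1`, while `b` and `c` are equally exposed within `t2`, so the localised value separates the two situations where a single collection-wide statistic would blend them.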

[IR-2] Unidentified and Confounded? Understanding Two-Tower Models for Unbiased Learning to Rank (Extended Abstract)

Link: https://arxiv.org/abs/2508.21698
Authors: Philipp Hager, Onno Zoeter, Maarten de Rijke
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Additive two-tower models are popular learning-to-rank methods for handling biased user feedback in industry settings. Recent studies, however, report a concerning phenomenon: training two-tower models on clicks collected by well-performing production systems leads to decreased ranking performance. This paper investigates two recent explanations for this observation: confounding effects from logging policies and model identifiability issues. We theoretically analyze the identifiability conditions of two-tower models, showing that either document swaps across positions or overlapping feature distributions are required to recover model parameters from clicks. We also investigate the effect of logging policies on two-tower models, finding that they introduce no bias when models perfectly capture user behavior. However, logging policies can amplify biases when models imperfectly capture user behavior, particularly when prediction errors correlate with document placement across positions. We propose a sample weighting technique to mitigate these effects and provide actionable insights for researchers and practitioners using two-tower models.
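A minimal sketch of the additive structure, assuming the common formulation in which a relevance logit and a position-bias logit are summed before a sigmoid, is given below. It also shows the simplest kind of ambiguity such a model has: shifting a constant from one tower to the other leaves every click probability unchanged. The function names and numbers are illustrative, and the identifiability conditions analysed in the paper go well beyond this constant shift.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def click_probability(relevance_logit, position, position_bias_logits):
    """Additive two-tower: click logit = relevance tower + position-bias tower."""
    return sigmoid(relevance_logit + position_bias_logits[position])


# Hypothetical tower outputs for one document and three rank positions.
relevance = 1.0
bias = [0.5, -0.5, -1.5]

# Move a constant from the bias tower into the relevance tower.
shift = 0.3
shifted_relevance = relevance + shift
shifted_bias = [b - shift for b in bias]

original = [click_probability(relevance, k, bias) for k in range(3)]
shifted = [click_probability(shifted_relevance, k, shifted_bias) for k in range(3)]
```

Both parameterisations predict identical clicks, so click data alone cannot tell them apart; this is why extra structure in the logged data is needed to recover the towers.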

[IR-3] NewsReX: A More Efficient Approach to News Recommendation with Keras 3 and JAX

Link: https://arxiv.org/abs/2508.21572
Authors: Igor L.R. Azevedo, Toyotaro Suzumura, Yuichiro Yasui
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Reproducing and comparing results in news recommendation research has become increasingly difficult. This is due to a fragmented ecosystem of diverse codebases, varied configurations, and, above all, resource-intensive models. We introduce NewsReX, an open-source library designed to streamline this process. Our key contribution is a modern implementation built on Keras 3 and JAX, which provides an increase in computational efficiency. Experiments show that NewsReX is faster than current implementations. To support broader research, we provide a straightforward guide and scripts for training models on custom datasets. We validated this functionality using a proprietary Japanese news dataset from Nikkei News, a leading Japanese media corporation renowned for its comprehensive coverage of business, economic, and financial news. NewsReX makes reproducing complex experiments faster and more accessible on a wider range of hardware, with the speed-up also achieved on less powerful GPUs such as an 8GB RTX 3060 Ti. Beyond the library, this paper offers an analysis of key training parameters often overlooked in the literature, including the effect of different negative sampling strategies, the varying number of epochs, the impact of random batching, and more. This supplementary analysis serves as a valuable reference for future research, aiming to reduce redundant computation when comparing baselines and guide best practices. Code available at this https URL.

[IR-4] Geospatial Question Answering on Historical Maps Using Spatio-Temporal Knowledge Graphs and Large Language Models

Link: https://arxiv.org/abs/2508.21491
Authors: Ziyi Liu, Sidi Wu, Lorenz Hurni
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Recent advances have enabled the extraction of vectorized features from digital historical maps. To fully leverage this information, however, the extracted features must be organized in a structured and meaningful way that supports efficient access and use. One promising approach is question answering (QA), which allows users – especially those unfamiliar with database query languages – to retrieve knowledge in a natural and intuitive manner. In this project, we developed a GeoQA system by integrating a spatio-temporal knowledge graph (KG) constructed from historical map data with large language models (LLMs). Specifically, we have defined the ontology to guide the construction of the spatio-temporal KG and investigated workflows of two different types of GeoQA: factual and descriptive. Additional data sources, such as historical map images and internet search results, are incorporated into our framework to provide extra context for descriptive GeoQA. Evaluation results demonstrate that the system can generate answers with a high delivery rate and a high semantic accuracy. To make the framework accessible, we further developed a web application that supports interactive querying and visualization.

[IR-5] Evaluating Recabilities of Foundation Models: A Multi-Domain Multi-Dataset Benchmark

Link: https://arxiv.org/abs/2508.21354
Authors: Qijiong Liu, Jieming Zhu, Yingxin Lai, Xiaoyu Dong, Lu Fan, Zhipeng Bian, Zhenhua Dong, Xiao-Ming Wu
Categories: Information Retrieval (cs.IR)
Comments:

Abstract:Comprehensive evaluation of the recommendation capabilities of existing foundation models across diverse datasets and domains is essential for advancing the development of recommendation foundation models. In this study, we introduce RecBench-MD, a novel and comprehensive benchmark designed to assess the recommendation abilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective. Through extensive evaluations of 19 foundation models across 15 datasets spanning 10 diverse domains – including e-commerce, entertainment, and social media – we identify key characteristics of these models in recommendation tasks. Our findings suggest that in-domain fine-tuning achieves optimal performance, while cross-dataset transfer learning provides effective practical support for new recommendation scenarios. Additionally, we observe that multi-domain training significantly enhances the adaptability of foundation models. All code and data have been publicly released to facilitate future research.

[IR-6] Towards On-Device Personalization: Cloud-device Collaborative Data Augmentation for Efficient On-device Language Model

Link: https://arxiv.org/abs/2508.21313
Authors: Zhaofeng Zhong, Wei Yuan, Liang Qu, Tong Chen, Hao Wang, Xiangyu Zhao, Hongzhi Yin
Categories: Information Retrieval (cs.IR)

Abstract:With the advancement of large language models (LLMs), significant progress has been achieved in various Natural Language Processing (NLP) tasks. However, existing LLMs still face two major challenges that hinder their broader adoption: (1) their responses tend to be generic and lack personalization tailored to individual users, and (2) they rely heavily on cloud infrastructure due to intensive computational requirements, which makes them dependent on a stable network connection and introduces response delays. Recent research has predominantly focused on either developing cloud-based personalized LLMs or exploring the on-device deployment of general-purpose LLMs. However, few studies have addressed both limitations simultaneously by investigating personalized on-device language models. To bridge this gap, we propose CDCDA-PLM, a framework for deploying personalized on-device language models on user devices with support from a powerful cloud-based LLM. Specifically, CDCDA-PLM leverages the server-side LLM’s strong generalization capabilities to augment users’ limited personal data, mitigating the issue of data scarcity. Using both real and synthetic data, a personalized on-device language model (LM) is fine-tuned via parameter-efficient fine-tuning (PEFT) modules and deployed on the user’s local device, where it can process queries without depending on cloud-based LLMs. This approach eliminates reliance on network stability and ensures high response speeds. Experimental results across six tasks in a widely used personalization benchmark demonstrate the effectiveness of CDCDA-PLM.

[IR-7] The Hidden Cost of Defaults in Recommender System Evaluation RECSYS2025

Link: https://arxiv.org/abs/2508.21180
Authors: Hannah Berlin, Robin Svahn, Alan Said
Categories: Information Retrieval (cs.IR)
Comments: Accepted to RecSys 2025

Abstract:Hyperparameter optimization is critical for improving the performance of recommender systems, yet its implementation is often treated as a neutral or secondary concern. In this work, we shift focus from model benchmarking to auditing the behavior of RecBole, a widely used recommendation framework. We show that RecBole’s internal defaults, particularly an undocumented early-stopping policy, can prematurely terminate Random Search and Bayesian Optimization. This limits search coverage in ways that are not visible to users. Using six models and two datasets, we compare search strategies and quantify both performance variance and search path instability. Our findings reveal that hidden framework logic can introduce variability comparable to the differences between search strategies. These results highlight the importance of treating frameworks as active components of experimental design and call for more transparent, reproducibility-aware tooling in recommender systems research. We provide actionable recommendations for researchers and developers to mitigate hidden configuration behaviors and improve the transparency of hyperparameter tuning workflows.
