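This post lists the latest papers retrieved from arXiv.org on 2025-08-22. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR.

Note: the daily paper data is fetched from arXiv.org and refreshed automatically at around 12:00 each day.

Tip: if you would like to receive the daily paper list by email, please leave your email address in the comments.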

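Overview (2025-08-22)

451 papers were updated today, including:

  • Natural Language Processing: 75 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 128 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 92 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 137 papers (Machine Learning (cs.LG))

Natural Language Processing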

[NLP-0] Intern-S1: A Scientific Multimodal Foundation Model

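[Quick read]: This paper addresses the problem that open-source foundation models lag significantly behind closed-source models in high-value but more challenging scientific domains, aiming to narrow that gap and take a step toward Artificial General Intelligence (AGI). The key to the solution is Intern-S1, a specialized generalist with multimodal understanding and reasoning capabilities. It uses a Mixture-of-Experts (MoE) architecture with 28 billion activated parameters and 241 billion total parameters, and is continually pre-trained on data that includes more than 2.5 trillion tokens from scientific domains. In post-training, it applies offline and then online reinforcement learning (RL) with a Mixture-of-Rewards (MoR) mechanism to train on more than 1,000 tasks simultaneously. These integrated innovations in algorithms, data, and training systems let Intern-S1 perform strongly on scientific professional tasks, even surpassing closed-source state-of-the-art models.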

Link: https://arxiv.org/abs/2508.15763
Authors: Lei Bai,Zhongrui Cai,Maosong Cao,Weihan Cao,Chiyu Chen,Haojiong Chen,Kai Chen,Pengcheng Chen,Ying Chen,Yongkang Chen,Yu Cheng,Yu Cheng,Pei Chu,Tao Chu,Erfei Cui,Ganqu Cui,Long Cui,Ziyun Cui,Nianchen Deng,Ning Ding,Nanqin Dong,Peijie Dong,Shihan Dou,Sinan Du,Haodong Duan,Caihua Fan,Ben Gao,Changjiang Gao,Jianfei Gao,Songyang Gao,Yang Gao,Zhangwei Gao,Jiaye Ge,Qiming Ge,Lixin Gu,Yuzhe Gu,Aijia Guo,Qipeng Guo,Xu Guo,Conghui He,Junjun He,Yili Hong,Siyuan Hou,Caiyu Hu,Hanglei Hu,Jucheng Hu,Ming Hu,Zhouqi Hua,Haian Huang,Junhao Huang,Xu Huang,Zixian Huang,Zhe Jiang,Lingkai Kong,Linyang Li,Peiji Li,Pengze Li,Shuaibin Li,Tianbin Li,Wei Li,Yuqiang Li,Dahua Lin,Junyao Lin,Tianyi Lin,Zhishan Lin,Hongwei Liu,Jiangning Liu,Jiyao Liu,Junnan Liu,Kai Liu,Kaiwen Liu,Kuikun Liu,Shichun Liu,Shudong Liu,Wei Liu,Xinyao Liu,Yuhong Liu,Zhan Liu,Yinquan Lu,Haijun Lv,Hongxia Lv,Huijie Lv,Qidang Lv,Ying Lv,Chengqi Lyu,Chenglong Ma,Jianpeng Ma,Ren Ma,Runmin Ma,Runyuan Ma,Xinzhu Ma,Yichuan Ma,Zihan Ma,Sixuan Mi,Junzhi Ning,Wenchang Ning,Xinle Pang,Jiahui Peng,Runyu Peng,Yu Qiao
Affiliations: Shanghai AI Laboratory
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely attended fields, with performance being quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly compared to those in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source models and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities with expertise to analyze multiple science modal data. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks, such as molecular synthesis planning, reaction condition prediction, and predicting thermodynamic stabilities for crystals. Our models are available at this https URL.
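As a quick arithmetic check on the figures quoted in the abstract, the sketch below computes the share of parameters activated per token and the minimum share of scientific tokens in the continued pre-training data. The variable names are illustrative, and "over 2.5T" is treated as a lower bound.

```python
# Figures quoted in the abstract (billions of parameters, trillions of tokens).
activated_params_b = 28
total_params_b = 241
science_tokens_t = 2.5  # "over 2.5T", so this is a lower bound
total_tokens_t = 5.0

# Fraction of the MoE model's parameters that are activated per token.
active_fraction = activated_params_b / total_params_b
# Lower bound on the scientific share of the pre-training tokens.
science_fraction_min = science_tokens_t / total_tokens_t

print(f"Activated parameters: {active_fraction:.1%} of the total")
print(f"Scientific tokens: at least {science_fraction_min:.0%} of the data")
```

Roughly one parameter in nine is active for any given token, which is where a sparse MoE model saves compute relative to a dense model of the same total size.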

[NLP-1] LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

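[Quick read]: This paper addresses the lack of systematic evaluation of how well AI agents coordinate multiple tools (such as web search, file operations, mathematical reasoning, and data analysis) to complete multi-step tasks in realistic, dynamic environments. Existing benchmarks struggle to reflect the complexity and dynamism of tool calling in real scenarios, which biases the assessment of model capability. The key to the solution is the LiveMCP-101 benchmark: 101 real-world queries refined through iterative large language model (LLM) rewriting and manual review, together with a novel evaluation approach based on ground-truth execution plans rather than raw API outputs, which measures tool orchestration ability and stability on complex tasks more accurately.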

Link: https://arxiv.org/abs/2508.15760
Authors: Ming Yin,Dinghan Shen,Silei Xu,Jianbing Han,Sixun Dong,Mian Zhang,Yebowen Hu,Shujian Liu,Simin Ma,Song Wang,Sathish Reddy Indurthi,Xun Wang,Yiran Chen,Kaiqiang Song
Affiliations: Duke University; Zoom Video Communications
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.

[NLP-2] Language-Guided Tuning: Enhancing Numeric Optimization with Textual Feedback

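[Quick read]: This paper addresses configuration optimization, a key bottleneck in machine learning: how to tune model architecture, training strategy, feature engineering, and hyperparameters in a coordinated way. Traditional approaches treat these dimensions independently and lack interpretability, while existing automated methods suffer from limited dynamic adaptability and limited semantic understanding of optimization decisions. The key to the solution is Language-Guided Tuning (LGT), a framework built on multi-agent Large Language Models. It introduces textual gradients, qualitative feedback signals that complement numerical optimization, to provide semantic understanding of training dynamics and configuration interdependencies. Three specialized agents (Advisor, Evaluator, and Optimizer) form a closed feedback loop, yielding configuration optimization that is interpretable and self-improving.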

Link: https://arxiv.org/abs/2508.15757
Authors: Yuxing Lu,Yucheng Hu,Nan Sun,Xukai Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: 9 pages, 4 figures, 4 tables

Abstract:Configuration optimization remains a critical bottleneck in machine learning, requiring coordinated tuning across model architecture, training strategy, feature engineering, and hyperparameters. Traditional approaches treat these dimensions independently and lack interpretability, while recent automated methods struggle with dynamic adaptability and semantic reasoning about optimization decisions. We introduce Language-Guided Tuning (LGT), a novel framework that employs multi-agent Large Language Models to intelligently optimize configurations through natural language reasoning. We apply textual gradients - qualitative feedback signals that complement numerical optimization by providing semantic understanding of training dynamics and configuration interdependencies. LGT coordinates three specialized agents: an Advisor that proposes configuration changes, an Evaluator that assesses progress, and an Optimizer that refines the decision-making process, creating a self-improving feedback loop. Through comprehensive evaluation on six diverse datasets, LGT demonstrates substantial improvements over traditional optimization methods, achieving performance gains while maintaining high interpretability.

[NLP-3] Dissecting Tool-Integrated Reasoning: An Empirical Study and Analysis

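[Quick read]: This paper addresses two issues: large language models (LLMs) fall short on tasks that require precise computation, and it remains unclear how well Tool-Integrated Reasoning (TIR) generalizes in improving LLM reasoning or how it changes reasoning behavior. The key to the solution is ReasonZoo, a comprehensive benchmark covering nine diverse reasoning categories, together with two new metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), for systematically evaluating TIR's effectiveness and reasoning efficiency across domains. The empirical results show that TIR improves performance on both mathematical and non-mathematical tasks and, as reflected in better PAC and AUC-PCC, yields more efficient reasoning with less overthinking, supporting the domain-general benefits of TIR on complex reasoning tasks.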

Link: https://arxiv.org/abs/2508.15754
Authors: Yufeng Zhao,Junnan Liu,Hongwei Liu,Dongsheng Zhu,Yuan Shen,Songyang Zhang,Kai Chen
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint, working in progress

Abstract:Large Language Models (LLMs) have made significant strides in reasoning tasks through methods like chain-of-thought (CoT) reasoning. However, they often fall short in tasks requiring precise computations. Tool-Integrated Reasoning (TIR) has emerged as a solution by incorporating external tools into the reasoning process. Nevertheless, the generalization of TIR in improving the reasoning ability of LLM is still unclear. Additionally, whether TIR has improved the model’s reasoning behavior and helped the model think remains to be studied. We introduce ReasonZoo, a comprehensive benchmark encompassing nine diverse reasoning categories, to evaluate the effectiveness of TIR across various domains. Additionally, we propose two novel metrics, Performance-Aware Cost (PAC) and Area Under the Performance-Cost Curve (AUC-PCC), to assess reasoning efficiency. Our empirical evaluation demonstrates that TIR-enabled models consistently outperform their non-TIR counterparts in both mathematical and non-mathematical tasks. Furthermore, TIR enhances reasoning efficiency, as evidenced by improved PAC and AUC-PCC, indicating reduced overthinking and more streamlined reasoning. These findings underscore the domain-general benefits of TIR and its potential to advance LLM capabilities in complex reasoning tasks.

[NLP-4] End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

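[Quick read]: This paper addresses the limited accuracy of medical large language models (LLMs) in clinical diagnosis caused by knowledge gaps and hallucination, as well as the weak use of external knowledge and poor reasoning traceability of existing Retrieval-Augmented Generation (RAG) methods. The core solution is Deep-DxSearch, an agentic RAG system trained end-to-end. It uses reinforcement learning (RL) to optimize how the LLM, acting as the agent, interacts with an external retrieval corpus, with a multi-part reward covering format, retrieval quality, reasoning structure, and diagnostic accuracy, so that retrieval-augmented reasoning becomes traceable. Experiments show that the method outperforms strong baselines such as GPT-4o and DeepSeek-R1 on both common and rare diseases, in both in-distribution and out-of-distribution settings, and ablation studies confirm the critical roles of the reward design and corpus construction.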

Link: https://arxiv.org/abs/2508.15746
Authors: Qiaoyu Zheng,Yuze Sun,Chaoyi Wu,Weike Zhao,Pengcheng Qiu,Yongguo Yu,Kun Sun,Yanfeng Wang,Ya Zhang,Weidi Xie
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 35 pages, 5 figures, 3 tables

Abstract:Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable and traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch’s diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See this https URL.
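The abstract names four reward components (format, retrieval, reasoning structure, diagnostic accuracy) but does not say how they are combined. The sketch below assumes a simple weighted sum. The weights, the `RewardWeights` class, and the `composite_reward` function are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights: the abstract names the components, not how they are combined.
    fmt: float = 0.1
    retrieval: float = 0.2
    reasoning: float = 0.2
    diagnosis: float = 0.5


COMPONENTS = ("fmt", "retrieval", "reasoning", "diagnosis")


def composite_reward(scores, weights=RewardWeights()):
    """Weighted sum of per-component scores, each expected to lie in [0, 1]."""
    total = 0.0
    for name in COMPONENTS:
        value = scores[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score {value} is outside [0, 1]")
        total += getattr(weights, name) * value
    return total


# A rollout with correct format and diagnosis but only partial retrieval and reasoning credit.
example = composite_reward(
    {"fmt": 1.0, "retrieval": 0.5, "reasoning": 0.5, "diagnosis": 1.0}
)
print(example)
```

Weighting diagnosis most heavily is one plausible choice for keeping the policy focused on the final answer while still rewarding well-formed, well-grounded intermediate steps.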

[NLP-5] EcomMMMU: Strategic Utilization of Visuals for Robust Multimodal E-Commerce Models

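[Quick read]: This paper asks whether product images always improve model performance in multi-image vision-language understanding tasks for e-commerce. Existing datasets are limited in scale and design, which makes it hard to evaluate systematically what images contribute. The study finds that images are not always beneficial: in some cases they introduce redundancy or degrade performance, which suggests that multimodal large language models (MLLMs) struggle to exploit rich visual content. The key to the solution is SUMEI, a data-driven method that predicts the visual utility of images and uses those predictions to select and use multiple images strategically, giving more robust multimodal understanding on downstream tasks.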

Link: https://arxiv.org/abs/2508.15721
Authors: Xinyi Ling,Hanwen Du,Zhihui Zhu,Xia Ning
Affiliations: The Ohio State University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:E-commerce platforms are rich in multimodal data, featuring a variety of images that depict product details. However, this raises an important question: do these images always enhance product understanding, or can they sometimes introduce redundancy or degrade performance? Existing datasets are limited in both scale and design, making it difficult to systematically examine this question. To this end, we introduce EcomMMMU, an e-commerce multimodal multitask understanding dataset with 406,190 samples and 8,989,510 images. EcomMMMU is comprised of multi-image visual-language data designed with 8 essential tasks and a specialized VSS subset to benchmark the capability of multimodal large language models (MLLMs) to effectively utilize visual content. Analysis on EcomMMMU reveals that product images do not consistently improve performance and can, in some cases, degrade it. This indicates that MLLMs may struggle to effectively leverage rich visual content for e-commerce tasks. Building on these insights, we propose SUMEI, a data-driven method that strategically utilizes multiple images via predicting visual utilities before using them for downstream tasks. Comprehensive experiments demonstrate the effectiveness and robustness of SUMEI. The data and code are available through this https URL.

[NLP-6] Stemming – The Evolution and Current State with a Focus on Bangla

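[Quick read]: This paper addresses the limited natural language processing (NLP) capability available for Bangla, a language with scarce digital resources, few annotated datasets, and few effective stemming tools. The core challenge is Bangla's highly inflectional morphology, which makes it hard for conventional methods to identify stems accurately and in turn hurts downstream tasks such as information retrieval and text classification. The paper surveys stemming approaches and argues that the way forward is to develop robust, reproducible Bangla stemmers, handle morphological variants effectively, and adopt evaluation metrics that reflect real performance more faithfully, providing a firmer basis for NLP in low-resource languages.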

Link: https://arxiv.org/abs/2508.15711
Authors: Abhijit Paul,Mashiat Amin Farin,Sharif Md. Abdullah,Ahmedul Kabir,Zarif Masud,Shebuti Rayana
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Bangla, the seventh most widely spoken language worldwide with 300 million native speakers, faces digital under-representation due to limited resources and lack of annotated datasets. Stemming, a critical preprocessing step in language analysis, is essential for low-resource, highly-inflectional languages like Bangla, because it can reduce the complexity of algorithms and models by significantly reducing the number of words the algorithm needs to consider. This paper conducts a comprehensive survey of stemming approaches, emphasizing the importance of handling morphological variants effectively. While exploring the landscape of Bangla stemming, it becomes evident that there is a significant gap in the existing literature. The paper highlights the discontinuity from previous research and the scarcity of accessible implementations for replication. Furthermore, it critiques the evaluation methodologies, stressing the need for more relevant metrics. In the context of Bangla’s rich morphology and diverse dialects, the paper acknowledges the challenges it poses. To address these challenges, the paper suggests directions for Bangla stemmer development. It concludes by advocating for robust Bangla stemmers and continued research in the field to enhance language analysis and processing.
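To illustrate the abstract's point that stemming reduces the number of distinct words a model has to consider, here is a minimal suffix-stripping sketch. It uses a handful of English suffixes purely for readability. The suffix list, the minimum stem length, and the `stem` function are illustrative assumptions and do not represent any Bangla stemmer discussed in the survey.

```python
# Illustrative English suffixes, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")
# Refuse to strip a suffix if the remaining stem would be shorter than this.
MIN_STEM_LEN = 3


def stem(word: str) -> str:
    """Strip the first matching suffix, provided the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


words = ["play", "plays", "played", "playing", "walk", "walked"]
stems = {stem(w) for w in words}
print(f"{len(words)} surface forms -> {len(stems)} stems: {sorted(stems)}")
```

A rule list this small over-strips and under-strips even in English. A highly inflectional language such as Bangla needs far richer, language-specific rules or learned models, which is the gap the survey highlights.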

[NLP-7] Position Bias Mitigates Position Bias: Mitigate Position Bias Through Inter-Position Knowledge Distillation EMNLP2025

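[Quick read]: This paper addresses the uneven performance caused by positional bias (PB) in long-context processing: models are not equally sensitive to information at different context positions, which impairs long-text comprehension and reasoning. The key to the solution is Pos2Distill, a position-to-position knowledge distillation framework that transfers the stronger capabilities shown at advantageous positions (for example, the beginning or end of the context) to less favorable ones, narrowing the performance gap between positions. The core idea is to use the position-induced disparity itself to counteract positional bias, improving both uniformity across context positions and task performance.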

Link: https://arxiv.org/abs/2508.15709
Authors: Yifei Wang,Feng Xiong,Yong Wang,Linjing Li,Xiangxiang Chu,Daniel Dajun Zeng
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; AMAP, Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: EMNLP2025 Main

Abstract:Positional bias (PB), manifesting as non-uniform sensitivity across different contextual locations, significantly impairs long-context comprehension and processing capabilities. While prior work seeks to mitigate PB through modifying the architectures causing its emergence, significant PB still persists. To address PB effectively, we introduce Pos2Distill, a position to position knowledge distillation framework. Pos2Distill transfers the superior capabilities from advantageous positions to less favorable ones, thereby reducing the huge performance gaps. The conceptual principle is to leverage the inherent, position-induced disparity to counteract the PB itself. We identify distinct manifestations of PB under retrieval and reasoning paradigms, thereby designing two specialized instantiations: Pos2Distill-R^1 and Pos2Distill-R^2 respectively, both grounded in this core principle. By employing the Pos2Distill approach, we achieve enhanced uniformity and significant performance gains across all contextual positions in long-context retrieval and reasoning tasks. Crucially, both specialized systems exhibit strong cross-task generalization mutually, while achieving superior performance on their respective tasks.
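The abstract does not give the Pos2Distill training objective. As a sketch of the general idea of distilling from an advantageous position to a less favorable one, the code below computes a standard KL-divergence distillation loss between two output distributions. The probability values and the `kl_divergence` helper are illustrative assumptions and are not the paper's actual loss.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) for two discrete distributions over the same vocabulary."""
    if len(teacher) != len(student):
        raise ValueError("distributions must have the same length")
    # Terms with zero teacher probability contribute nothing to the sum.
    return sum(
        p * math.log(p / max(q, eps)) for p, q in zip(teacher, student) if p > 0
    )


# Hypothetical next-token distributions from the same model on the same question:
teacher_probs = [0.7, 0.2, 0.1]  # key evidence placed at an advantageous position
student_probs = [0.4, 0.3, 0.3]  # key evidence placed at a less favorable position

loss = kl_divergence(teacher_probs, student_probs)
print(f"distillation loss: {loss:.4f}")
```

Minimizing a loss of this kind pushes the model's behavior at the weak position toward its own behavior at the strong position, which is one way to read "leveraging the position-induced disparity to counteract the PB itself".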

[NLP-8] Benchmarking Computer Science Survey Generation

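[Quick read]: This paper addresses the challenge of automatically generating scientific surveys: as the academic literature grows rapidly, writing surveys by hand is increasingly infeasible, yet there is no standardized way to evaluate how well large language models (LLMs) perform the task. The key to the solution is a new benchmark, SurGE (Survey Generation Evaluation), with two core parts: a collection of test instances, each consisting of a topic description, an expert-written survey, and its full reference list; and a large-scale academic corpus of more than one million papers that serves as the retrieval pool. The paper also proposes an automated evaluation framework that scores generated surveys on four dimensions (information coverage, referencing accuracy, structural organization, and content quality), giving LLM-based survey generation a reproducible and comparable evaluation standard.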

Link: https://arxiv.org/abs/2508.15658
Authors: Weihang Su,Anzhe Xie,Qingyao Ai,Jianming Long,Jiaxin Mao,Ziyi Ye,Yiqun Liu
Affiliations: Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: this https URL
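One of the four evaluation dimensions, referencing accuracy, can be pictured as a set comparison between the references a generated survey cites and those in the expert-written survey. The sketch below shows one simple way to score that overlap. The function name, the paper IDs, and the choice of precision and recall are assumptions for illustration and are not SurGE's actual metric definitions.

```python
def citation_precision_recall(generated_refs, expert_refs):
    """Precision and recall of generated citations against an expert reference list."""
    generated, expert = set(generated_refs), set(expert_refs)
    overlap = generated & expert
    precision = len(overlap) / len(generated) if generated else 0.0
    recall = len(overlap) / len(expert) if expert else 0.0
    return precision, recall


# Hypothetical paper IDs drawn from a retrieval pool.
generated = ["p1", "p2", "p3", "p9"]
expert = ["p1", "p2", "p3", "p4", "p5"]

precision, recall = citation_precision_recall(generated, expert)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Reporting both numbers separates two failure modes: citing irrelevant papers lowers precision, while missing papers the experts considered essential lowers recall.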

[NLP-9] SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP2025

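[Quick read]: This paper addresses the insufficient safety of large language models (LLMs) under jailbreaking attacks, and in particular reveals that LLMs are better at identifying harmful requests (acting as discriminators) than at defending against them when generating (acting as generators). The key to the solution is SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that uses the model's own discrimination capability as the reward signal and improves generation safety through iterative self-improvement, without additional annotated data or external models. The method aligns discrimination and generation capabilities, substantially improves robustness to out-of-distribution (OOD) jailbreaking attacks, and allows generation capability to be improved further with only a small number of discriminative samples.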

Link: https://arxiv.org/abs/2508.15648
Authors: Peng Ding,Wen Sun,Dailin Li,Wei Zou,Jiaming Wang,Jiajun Chen,Shujian Huang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: Accepted by EMNLP 2025, 15 pages, 4 figures, 6 tables

Abstract:Large Language Models (LLMs) excel at various natural language processing tasks but remain vulnerable to jailbreaking attacks that induce harmful content generation. In this paper, we reveal a critical safety inconsistency: LLMs can more effectively identify harmful requests as discriminators than defend against them as generators. This insight inspires us to explore aligning the model’s inherent discrimination and generation capabilities. To this end, we propose SDGO (Self-Discrimination-Guided Optimization), a reinforcement learning framework that leverages the model’s own discrimination capabilities as a reward signal to enhance generation safety through iterative self-improvement. Our method does not require any additional annotated data or external models during the training phase. Extensive experiments demonstrate that SDGO significantly improves model safety compared to both prompt-based and training-based baselines while maintaining helpfulness on general benchmarks. By aligning LLMs’ discrimination and generation capabilities, SDGO brings robust performance against out-of-distribution (OOD) jailbreaking attacks. This alignment achieves tighter coupling between these two capabilities, enabling the model’s generation capability to be further enhanced with only a small amount of discriminative samples. Our code and datasets are available at this https URL.

[NLP-10] Classification errors distort findings in automated speech processing: examples and solutions from child-development research

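[Quick read]: This paper addresses the downstream effects of classification errors in automated classifiers on scientific conclusions, in particular how these errors distort estimates for key questions such as the effect of siblings on children's language input and the association between children's production and their input. The key to the solution is a Bayesian calibration approach for recovering unbiased estimates of effect sizes. It can identify and correct bias caused by classification errors; although it is not a fool-proof fix, it offers a useful approach for any event detection and classification task with non-zero error rates.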

Link: https://arxiv.org/abs/2508.15637
Authors: Lucas Gautheron,Evan Kidd,Anton Malko,Marvin Lavechin,Alejandrina Cristia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Applications (stat.AP)
Comments:

Abstract:With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children’s experience, behavior, and outcomes, with a sizable literature employing long-form audio-recordings to study language acquisition. While numerous articles report on the accuracy and reliability of the most popular automated classifiers, less has been written on the downstream effects of classification errors on measurements and statistical inferences (e.g., the estimate of correlations and effect sizes in regressions). This paper proposes a Bayesian approach to study the effects of algorithmic errors on key scientific questions, including the effect of siblings on children’s language experience and the association between children’s production and their input. In both the most commonly used classifier (LENA) and an open-source alternative (the Voice Type Classifier from the ACLEW system), we find that classification errors can significantly distort estimates. For instance, automated annotations underestimated the negative effect of siblings on adult input by 20–80%, potentially placing it below statistical significance thresholds. We further show that a Bayesian calibration approach for recovering unbiased estimates of effect sizes can be effective and insightful, but does not provide a fool-proof solution. Both the issue reported and our solution may apply to any classifier involving event detection and classification with non-zero error rates.
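To see why classification errors shrink an estimated effect, consider a two-class toy model in which a classifier detects the target class (say, adult speech) with some recall and mislabels other segments at some false-positive rate. The sketch below shows the attenuation and a simple point correction that inverts the error model. This is not the Bayesian calibration proposed in the paper, and all numbers and function names are hypothetical.

```python
def observed_count(true_count, total, recall, false_positive_rate):
    """Expected number of segments labelled as the target class by an imperfect classifier."""
    return recall * true_count + false_positive_rate * (total - true_count)


def corrected_count(observed, total, recall, false_positive_rate):
    """Invert the relation above to estimate the true count."""
    if recall == false_positive_rate:
        raise ValueError("classifier is uninformative; correction is undefined")
    return (observed - false_positive_rate * total) / (recall - false_positive_rate)


# Hypothetical recordings of two children, 1000 segments each.
total_segments = 1000
recall, fpr = 0.8, 0.1
true_counts = [300, 200]  # true adult-speech segments, e.g. without and with siblings

observed = [observed_count(t, total_segments, recall, fpr) for t in true_counts]
recovered = [corrected_count(o, total_segments, recall, fpr) for o in observed]

true_gap = true_counts[0] - true_counts[1]
observed_gap = observed[0] - observed[1]
print(f"true gap: {true_gap}, observed gap: {observed_gap:.1f}")
print(f"recovered counts: {[round(r, 1) for r in recovered]}")
```

In this toy setup the observed gap is the true gap scaled by `recall - fpr`, so a true difference of 100 segments shows up as about 70. A point correction like this assumes the error rates are known exactly and identical across recordings, and relaxing those assumptions is what motivates a fuller calibration model such as the Bayesian one the paper describes.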

[NLP-11] Trained Miniatures: Low cost, High Efficacy SLMs for Sales & Marketing

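[Quick read]: This paper addresses the difficulty of deploying large language models (LLMs) in targeted applications such as sales and marketing outreach, where their heavy computation and high cost are prohibitive. The key to the solution is the concept of "Trained Miniatures": Small Language Models (SLMs) fine-tuned for specific, high-value vertical tasks, which produce domain-specific responses comparable to those of LLMs at a fraction of the compute and deployment cost.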

Link: https://arxiv.org/abs/2508.15617
Authors: Ishaan Bhola,Mukunda NS,Sravanth Kurmala,Harsh Nandwani,Arihant Jain
Affiliations: SuperAGI Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large language models (LLMs) excel in text generation; however, these creative elements require heavy computation and are accompanied by a steep cost. Especially for targeted applications such as sales and marketing outreach, these costs are far from feasible. This paper introduces the concept of “Trained Miniatures”: Small Language Models (SLMs) fine-tuned for specific, high-value applications, generating similar domain-specific responses for a fraction of the cost.

[NLP-12] SafetyFlow: An Agent-Flow System for Automated LLM Safety Benchmarking

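[Quick read]: This paper addresses three problems in how safety evaluation benchmarks for large language models (LLMs) are currently built: costly manual curation, high data redundancy, and a limited range of difficulty. The authors propose SafetyFlow, the first agent-flow system for automating the construction of LLM safety benchmarks. Its key innovation is to orchestrate seven specialized agents that handle the pipeline automatically, building a safety benchmark (SafetyFlowBench, 23,446 queries) with low redundancy and strong discriminative power in four days without human intervention. The approach substantially reduces time and resource cost, and by building human expertise into the automated pipeline it keeps the process and its cost controllable.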

Link: https://arxiv.org/abs/2508.15526
Authors: Xiangyang Zhu,Yuan Tian,Chunyi Li,Kaiwei Zhang,Wei Sun,Guangtao Zhai
Affiliations: Shanghai AI Lab; East China Normal University
Categories: Computation and Language (cs.CL)
Comments: Code and dataset are available at this https URL

Abstract:The rapid proliferation of large language models (LLMs) has intensified the requirement for reliable safety evaluation to uncover model vulnerabilities. To this end, numerous LLM safety evaluation benchmarks are proposed. However, existing benchmarks generally rely on labor-intensive manual curation, which causes excessive time and resource consumption. They also exhibit significant redundancy and limited difficulty. To alleviate these problems, we introduce SafetyFlow, the first agent-flow system designed to automate the construction of LLM safety benchmarks. SafetyFlow can automatically build a comprehensive safety benchmark in only four days without any human intervention by orchestrating seven specialized agents, significantly reducing time and resource cost. Equipped with versatile tools, the agents of SafetyFlow ensure process and cost controllability while integrating human expertise into the automatic pipeline. The final constructed dataset, SafetyFlowBench, contains 23,446 queries with low redundancy and strong discriminative power. Our contribution includes the first fully automated benchmarking pipeline and a comprehensive safety benchmark. We evaluate the safety of 49 advanced LLMs on our dataset and conduct extensive experiments to validate our efficacy and efficiency.

[NLP-13] The Enemy from Within: A Study of Political Delegitimization Discourse in Israeli Political Speech

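[Quick read]: This paper addresses the automated detection and quantitative analysis of political delegitimization discourse (PDD): how to systematically detect and classify, in large text collections, symbolic attacks on the normative validity of political entities. The key to the solution is a manually annotated Hebrew corpus of 10,410 sentences and a two-stage classification pipeline that combines fine-tuned encoder models with decoder large language models (LLMs), covering binary PDD detection and fine-grained classification of intensity, incivility, target type, and affective framing. The best model reaches F_1 = 0.74 for binary detection and macro-F_1 = 0.67 for classifying delegitimization characteristics, enabling scalable analysis of PDD in democratic discourse.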

Link: https://arxiv.org/abs/2508.15524
Authors: Naama Rivlin-Angert,Guy Mor-Lan
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We present the first large-scale computational study of political delegitimization discourse (PDD), defined as symbolic attacks on the normative validity of political entities. We curate and manually annotate a novel Hebrew-language corpus of 10,410 sentences drawn from Knesset speeches (1993-2023), Facebook posts (2018-2021), and leading news outlets, of which 1,812 instances (17.4%) exhibit PDD and 642 carry additional annotations for intensity, incivility, target type, and affective framing. We introduce a two-stage classification pipeline combining finetuned encoder models and decoder LLMs. Our best model (DictaLM 2.0) attains an F_1 of 0.74 for binary PDD detection and a macro-F_1 of 0.67 for classification of delegitimization characteristics. Applying this classifier to longitudinal and cross-platform data, we see a marked rise in PDD over three decades, higher prevalence on social media versus parliamentary debate, greater use by male than female politicians, and stronger tendencies among right-leaning actors - with pronounced spikes during election campaigns and major political events. Our findings demonstrate the feasibility and value of automated PDD analysis for understanding democratic discourse.
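The abstract reports an F_1 score for binary detection and a macro-F_1 for the multi-class characteristics. For readers less familiar with the distinction, the sketch below computes both from true-positive, false-positive, and false-negative counts, and re-derives the 17.4% PDD share from the corpus figures. The confusion counts in the example are made up for illustration and are not the paper's results.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, defined as 0 when both are 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1 scores, so rare classes count as much as common ones."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)


# Made-up (tp, fp, fn) counts for two classes: one handled well, one missed entirely.
counts = [(8, 2, 2), (0, 5, 5)]
print(f"macro-F1: {macro_f1(counts):.2f}")

# Share of PDD sentences, from the corpus figures in the abstract.
pdd_share = 1812 / 10410
print(f"PDD share: {pdd_share:.1%}")
```

Because macro-averaging weights every class equally, a classifier that ignores a rare label is penalized heavily, which makes it a natural choice for imbalanced annotation schemes.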

[NLP-14] Dream 7B: Diffusion Large Language Models

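[Quick read]: This paper targets limitations of conventional autoregressive (AR) language models, which generate tokens strictly in sequence and therefore offer limited flexibility and limited scope for parallel generation. Its core solution is Dream 7B, a large language model based on discrete diffusion modeling that refines sequences in parallel through iterative denoising. The key training techniques are initialization from an AR language model and context-adaptive token-level noise rescheduling. The resulting model supports flexible inference, including arbitrary-order generation, infilling, and tunable quality-speed trade-offs, and it outperforms existing diffusion language models on general, mathematical, and coding tasks.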

Link: https://arxiv.org/abs/2508.15487
Authors: Jiacheng Ye,Zhihui Xie,Lin Zheng,Jiahui Gao,Zirui Wu,Xin Jiang,Zhenguo Li,Lingpeng Kong
Affiliations: The University of Hong Kong; Huawei Noah’s Ark Lab
Categories: Computation and Language (cs.CL)
Comments:

Abstract:We introduce Dream 7B, the most powerful open diffusion large language model to date. Unlike autoregressive (AR) models that generate tokens sequentially, Dream 7B employs discrete diffusion modeling to refine sequences in parallel through iterative denoising. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks. Dream 7B demonstrates superior planning abilities and inference flexibility, including arbitrary-order generation, infilling capabilities, and tunable quality-speed trade-offs. These results are achieved through simple yet effective training techniques, including AR-based LLM initialization and context-adaptive token-level noise rescheduling. We release both Dream-Base and Dream-Instruct to facilitate further research in diffusion-based language modeling.
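The contrast between sequential generation and parallel iterative refinement can be pictured with a toy unmasking loop: start from a fully masked sequence and, at each step, commit the few positions the model is most confident about. The sketch below is only a cartoon of that loop. `toy_predict` is a hypothetical stand-in that reads the answer from `target` rather than predicting it, and nothing here reflects Dream 7B's actual model or noise schedule.

```python
MASK = "<mask>"


def toy_predict(tokens, target):
    """Hypothetical denoiser stand-in: proposes (token, confidence) for each masked slot.

    It cheats by reading the answer from `target`. Confidence rises with the
    number of already revealed neighbours, mimicking context-dependent certainty.
    """
    proposals = {}
    for i, tok in enumerate(tokens):
        if tok == MASK:
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(tokens)]
            revealed = sum(tokens[j] != MASK for j in neighbours)
            proposals[i] = (target[i], 0.5 + 0.25 * revealed)
    return proposals


def denoise(target, tokens_per_step=2):
    """Unmask up to `tokens_per_step` positions per step, most confident first."""
    if tokens_per_step < 1:
        raise ValueError("tokens_per_step must be at least 1")
    tokens = [MASK] * len(target)
    steps = 0
    while MASK in tokens:
        proposals = toy_predict(tokens, target)
        # Sort by descending confidence, breaking ties by position.
        best = sorted(proposals, key=lambda i: (-proposals[i][1], i))[:tokens_per_step]
        for i in best:
            tokens[i] = proposals[i][0]
        steps += 1
    return tokens, steps


sentence = ["diffusion", "models", "refine", "in", "parallel"]
for k in (1, 2, 5):
    _, steps = denoise(sentence, tokens_per_step=k)
    print(f"tokens_per_step={k}: {steps} steps")
```

The `tokens_per_step` knob is a crude analogue of the quality-speed trade-off mentioned in the abstract: committing more positions per step finishes in fewer steps, while a real model would typically pay for that with lower-confidence choices.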

[NLP-15] HebID: Detecting Social Identities in Hebrew-language Political Text

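[Quick read]: This paper addresses the limitations of existing datasets for social identity detection, most of which are English-centric, single-label, and restricted to coarse identity categories, and so fail to capture multi-dimensional, fine-grained expressions of identity. The authors build HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts, manually annotated for twelve nuanced social identities (for example Rightist, Ultra-Orthodox, Socially-oriented) grounded in survey data. The key elements are a culturally grounded multilabel annotation scheme and the use of Hebrew-tuned large language models (LLMs) as classifiers, which give the best results (macro-F_1 = 0.74). The work provides a benchmark dataset and a methodological template for studying social identities in other non-English political contexts.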

Link: https://arxiv.org/abs/2508.15483
Authors: Guy Mor-Lan,Naama Rivlin-Angert,Yael R. Kaplan,Tamir Sheafer,Shaul R. Shenhav
Affiliations: The Hebrew University of Jerusalem; The Open University of Israel
Categories: Computation and Language (cs.CL)
Comments:

点击查看摘要

Abstract:Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians’ Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-F_1 = 0.74). We apply our classifier to politicians’ Facebook posts and parliamentary speeches, evaluating differences in popularity, temporal trends, clustering patterns, and gender-related variations in identity expression. We utilize identity choices from a national public survey, enabling a comparison between identities portrayed in elite discourse and the public’s identity priorities. HebID provides a comprehensive foundation for studying social identities in Hebrew and can serve as a model for similar research in other non-English political contexts.

[NLP-16] SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts – Extended Version EMNLP2025

[Quick read]: This paper addresses the lack of a systematic evaluation of Small Language Models (SLMs) in terms of accuracy, computational efficiency, and sustainability. Existing benchmarks do not quantify SLMs across these dimensions together, which leaves the trade-off between resource efficiency and real-world applicability unclear. The proposed solution is SLM-Bench, the first benchmark designed specifically for SLMs. It evaluates 15 SLMs on 9 natural language processing (NLP) tasks using 23 datasets spanning 14 domains, under 4 hardware configurations, and reports 11 metrics covering correctness, computation, and consumption. The authors also release an open-source benchmarking pipeline to support reproducibility and follow-up research.

Link: https://arxiv.org/abs/2508.15478
Authors: Nghiem Thanh Pham,Tung Kieu,Duc-Manh Nguyen,Son Ha Xuan,Nghia Duong-Trung,Danh Le-Phuoc
Affiliations: FPT University; Aalborg University; Technische Universität Berlin; RMIT University; German Research Center for Artificial Intelligence (DFKI)
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Performance (cs.PF)
Comments: 24 pages. An extended version of “SLM-Bench: A Comprehensive Benchmark of Small Language Models on Environmental Impacts” accepted at EMNLP 2025

Abstract:Small Language Models (SLMs) offer computational efficiency and accessibility, yet a systematic evaluation of their performance and environmental impact remains lacking. We introduce SLM-Bench, the first benchmark specifically designed to assess SLMs across multiple dimensions, including accuracy, computational efficiency, and sustainability metrics. SLM-Bench evaluates 15 SLMs on 9 NLP tasks using 23 datasets spanning 14 domains. The evaluation is conducted on 4 hardware configurations, providing a rigorous comparison of their effectiveness. Unlike prior benchmarks, SLM-Bench quantifies 11 metrics across correctness, computation, and consumption, enabling a holistic assessment of efficiency trade-offs. Our evaluation considers controlled hardware conditions, ensuring fair comparisons across models. We develop an open-source benchmarking pipeline with standardized evaluation protocols to facilitate reproducibility and further research. Our findings highlight the diverse trade-offs among SLMs, where some models excel in accuracy while others achieve superior energy efficiency. SLM-Bench sets a new standard for SLM evaluation, bridging the gap between resource efficiency and real-world applicability.

[NLP-17] Influence-driven Curriculum Learning for Pre-training on Limited Data

[Quick read]: Curriculum learning has shown limited success for pre-training language models, and this paper asks how to order training data by difficulty more effectively. Instead of human-centered difficulty metrics, it uses a model-centric one, training data influence, a score that estimates the effect of an individual training example on the model’s output. Sorting the training data by this score yields models that outperform random-order training by over 10 percentage points on benchmarks, which supports curriculum learning for pre-training when difficulty is defined from the model’s perspective.

Link: https://arxiv.org/abs/2508.15475
Authors: Loris Schoenegger,Lukas Thoma,Terra Blevins,Benjamin Roth
Affiliations: Faculty of Computer Science, University of Vienna, Vienna, Austria; UniVie Doctoral School Computer Science, University of Vienna, Vienna, Austria; Faculty of Philological and Cultural Studies, University of Vienna, Vienna, Austria; Khoury College of Computer Sciences, Northeastern University, Boston, USA
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 9 pages

Abstract:Curriculum learning, a training technique where data is presented to the model in order of example difficulty (e.g., from simpler to more complex documents), has shown limited success for pre-training language models. In this work, we investigate whether curriculum learning becomes competitive if we replace conventional human-centered difficulty metrics with one that more closely corresponds to example difficulty as observed during model training. Specifically, we experiment with sorting training examples by their training data influence, a score which estimates the effect of individual training examples on the model’s output. Models trained on our curricula are able to outperform ones trained in random order by over 10 percentage points in benchmarks, confirming that curriculum learning is beneficial for language model pre-training, as long as a more model-centric notion of difficulty is adopted.

[NLP-18] Subjective Behaviors and Preferences in LLM: Language of Browsing EMNLP2025

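[Quick read]: This paper examines how well a Large Language Model (LLM) fits users whose behaviors and preferences are inherently subjective, as in the personalized, unstructured sequences produced when they browse websites or apps. Such sequences resemble a “language of browsing” that lacks the grammar of natural language. The study asks three questions: can a small LM represent this language better than a large one; can an LM with a single set of parameters capture the heterogeneity across users; and can high average performance come with low variance across users, so that alignment holds at the user level? The proposed solution is HeTLM (Heterogeneity aware Training of Language Model), a clusterwise training scheme that assigns a cluster-specific set of parameters to each user group. Controlling for the total number of parameters, it outperforms a single LM of the same family and yields a higher mean and lower variance in generation performance.

Link: https://arxiv.org/abs/2508.15474
Authors: Sai Sundaresan,Harshita Chopra,Atanu R. Sinha,Koustava Goswami,Nagasai Saketh Naidu,Raghav Karan,N Anushka
Affiliations: Adobe Research
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted at EMNLP 2025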

Abstract:A Large Language Model (LLM) offers versatility across domains and tasks, purportedly benefiting users with a wide variety of behaviors and preferences. We question this perception about an LLM when users have inherently subjective behaviors and preferences, as seen in their ubiquitous and idiosyncratic browsing of websites or apps. The sequential behavior logs of pages, thus generated, form something akin to each user’s self-constructed “language”, albeit without the structure and grammar imbued in natural languages. We ask: (i) Can a small LM represent the “language of browsing” better than a large LM? (ii) Can an LM with a single set of parameters (or, single LM) adequately capture myriad users’ heterogeneous, subjective behaviors and preferences? (iii) Can a single LM with high average performance, yield low variance in performance to make alignment good at user level? We introduce clusterwise LM training, HeTLM (Heterogeneity aware Training of Language Model), appropriate for subjective behaviors. We find that (i) a small LM trained using a page-level tokenizer outperforms large pretrained or finetuned LMs; (ii) HeTLM with heterogeneous cluster specific set of parameters outperforms a single LM of the same family, controlling for the number of parameters; and (iii) a higher mean and a lower variance in generation ensues, implying improved alignment.

[NLP-19] SLM4Offer: Personalized Marketing Offer Generation Using Contrastive Learning Based Fine-Tuning

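[Quick read]: This paper addresses how to raise offer acceptance rates in personalized marketing, where the core challenge is matching customer personas to relevant offers. It proposes SLM4Offer, a generative AI model fine-tuned with contrastive learning: an InfoNCE loss aligns customer personas with offers in a shared embedding space, which improves generalization. On a synthetic dataset, the model improves offer acceptance rate by 17 percent over a supervised fine-tuning baseline.

Link: https://arxiv.org/abs/2508.15471
Authors: Vedasamhitha Challapalli,Konduru Venkat Sai,Piyush Pratap Singh,Rupesh Prasad,Arvind Maurya,Atul Singh
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 10 pages, BDA Conference 2025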

Abstract:Personalized marketing has emerged as a pivotal strategy for enhancing customer engagement and driving business growth. Academic and industry efforts have predominantly focused on recommendation systems and personalized advertisements. Nonetheless, this facet of personalization holds significant potential for increasing conversion rates and improving customer satisfaction. Prior studies suggest that well-executed personalization strategies can boost revenue by up to 40 percent, underscoring the strategic importance of developing intelligent, data-driven approaches for offer generation. This work introduces SLM4Offer, a generative AI model for personalized offer generation, developed by fine-tuning a pre-trained encoder-decoder language model, specifically Google’s Text-to-Text Transfer Transformer (T5-Small 60M) using a contrastive learning approach. SLM4Offer employs InfoNCE (Information Noise-Contrastive Estimation) loss to align customer personas with relevant offers in a shared embedding space. A key innovation in SLM4Offer lies in the adaptive learning behaviour introduced by contrastive loss, which reshapes the latent space during training and enhances the model’s generalizability. The model is fine-tuned and evaluated on a synthetic dataset designed to simulate customer behaviour and offer acceptance patterns. Experimental results demonstrate a 17 percent improvement in offer acceptance rate over a supervised fine-tuning baseline, highlighting the effectiveness of contrastive objectives in advancing personalized marketing.
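To make the InfoNCE objective named in the abstract concrete, here is a minimal pure-Python sketch of the loss on a small batch. It assumes in-batch negatives (the offer paired with persona `i` is the positive, and every other offer in the batch is a negative), cosine similarity, and a temperature of 0.1; the abstract does not state how SLM4Offer builds its batches or sets its temperature, so these choices and the toy vectors are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Assumes a non-zero vector.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(personas, offers, temperature=0.1):
    """Mean InfoNCE loss; offers[i] is the positive for personas[i]."""
    personas = [normalize(p) for p in personas]
    offers = [normalize(o) for o in offers]
    total = 0.0
    for i, persona in enumerate(personas):
        # Cosine similarity to every offer, scaled by the temperature.
        logits = [dot(persona, offer) / temperature for offer in offers]
        # Numerically stable log-sum-exp over positives and negatives.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(personas)
```

With two orthogonal persona embeddings, `info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]])` is close to zero because each persona is most similar to its own offer, while swapping the offers gives a loss near 10. Minimizing the loss therefore pulls matching persona-offer pairs together in the shared space and pushes mismatched pairs apart.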

[NLP-20] RadReason: Radiology Report Evaluation Metric with Reasons and Sub-Scores

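[Quick read]: Evaluating automatically generated radiology reports is hard because existing metrics lack clinical grounding, are difficult to interpret, and are too coarse for real clinical workflows. RadReason is an evaluation framework that outputs fine-grained sub-scores for six clinically defined error types together with human-readable justifications. It relies on two innovations: Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics, and Majority-Guided Advantage Scaling, which adjusts policy gradient updates according to prompt difficulty derived from sub-score agreement. Together these stabilize optimization and improve agreement with expert clinical judgment.

Link: https://arxiv.org/abs/2508.15464
Authors: Yingshu Li,Yunyi Liu,Lingqiao Liu,Lei Wang,Luping Zhou
Affiliations: University of Sydney; University of Adelaide; University of Wollongong
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: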

Abstract:Evaluating automatically generated radiology reports remains a fundamental challenge due to the lack of clinically grounded, interpretable, and fine-grained metrics. Existing methods either produce coarse overall scores or rely on opaque black-box models, limiting their usefulness in real-world clinical workflows. We introduce RadReason, a novel evaluation framework for radiology reports that not only outputs fine-grained sub-scores across six clinically defined error types, but also produces human-readable justifications that explain the rationale behind each score. Our method builds on Group Relative Policy Optimization and incorporates two key innovations: (1) Sub-score Dynamic Weighting, which adaptively prioritizes clinically challenging error types based on live F1 statistics; and (2) Majority-Guided Advantage Scaling, which adjusts policy gradient updates based on prompt difficulty derived from sub-score agreement. Together, these components enable more stable optimization and better alignment with expert clinical judgment. Experiments on the ReXVal benchmark show that RadReason surpasses all prior offline metrics and achieves parity with GPT-4-based evaluations, while remaining explainable, cost-efficient, and suitable for clinical deployment. Code will be released upon publication.

[NLP-21] PyTOD: Programmable Task-Oriented Dialogue with Execution Feedback SIGDIAL2025

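[Quick read]: This paper targets the accuracy of dialogue state tracking (DST) in task-oriented dialogue (TOD) agents, especially in complex multi-turn interactions. It proposes PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback to correct errors. The key idea is a constrained decoding approach that uses a language model, rather than grammar rules, to follow API schemata. PyTOD reaches state-of-the-art state tracking on the SGD benchmark and estimates user goals more accurately and robustly as the dialogue progresses.

Link: https://arxiv.org/abs/2508.15456
Authors: Alexandru Coca,Bo-Hsiang Tseng,Pete Boothroyd,Jianpeng Cheng,Mark Gaynor,Zhenxing Zhang,Joe Stacey,Tristan Guigue,Héctor Martinez Alonso,Diarmuid Ó Séaghdha,Anders Johannsen
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 20 pages, 12 figures. To appear at SIGDIAL 2025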

Abstract:Programmable task-oriented dialogue (TOD) agents enable language models to follow structured dialogue policies, but their effectiveness hinges on accurate state tracking. We present PyTOD, an agent that generates executable code to track dialogue state and uses policy and execution feedback for efficient error correction. To this end, PyTOD employs a simple constrained decoding approach, using a language model instead of grammar rules to follow API schemata. This leads to state-of-the-art state tracking performance on the challenging SGD benchmark. Our experiments show that PyTOD surpasses strong baselines in both accuracy and robust user goal estimation as the dialogue progresses, demonstrating the effectiveness of execution-aware state tracking.

[NLP-22] Principle Methods of Rendering Non-equivalent Words from Uzbek and Dari to Russian and English

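[Quick read]: This paper addresses the translation of non-equivalent words, that is, words in the source language with no direct counterpart in the target language, particularly vocabulary for culture, tradition, food, and clothing. The aim is to render such words professionally in the target language and so reduce misunderstanding caused by the lack of equivalence. Based on library research, the paper surveys methods and rules for rendering non-equivalent words and applies them to 25 non-equivalent words from Uzbek and Dari, which it renders into English and Russian.

Link: https://arxiv.org/abs/2508.15453
Authors: Mohammad Ibrahim Qani
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: The full abstract is available in the attached file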

Abstract:Understanding of these pure languages directly relates to translation knowledge, where linguists and translators need to work and research to eradicate misunderstanding. Misunderstandings mostly appear in non-equivalent words because there are different local and internal words like food, garment, cultural and traditional words and others in every notion. Truly, most of these words do not have an equivalent in the target language, and their equivalents need to be found in order to fully understand both languages. The purpose of this research is to introduce the methods of rendering non-equivalent words professionally from the source language to the target language, and this research has been completed using library-based research. Some of these non-equivalent words are already professionally rendered to the target language, but there are still many other words to be rendered. As a result, this research paper includes different ways and rules of rendering non-equivalent words from the source language to the target language, and 25 non-equivalent words have been rendered from Dar Uzbek into English and Russian.

[NLP-23] M-HELP: Using Social Media Data to Detect Mental Health Help-Seeking Signals EMNLP2025

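[Quick read]: Existing datasets for mental health disorder detection focus on diagnostic labels and overlook users who are actively seeking help on social media. This paper introduces M-Help, a dataset that labels help-seeking behavior and, beyond that, the specific mental health disorder and its underlying cause (such as relationship challenges or financial stressors). Models trained on it can address three tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering root causes.

Link: https://arxiv.org/abs/2508.15440
Authors: MSVPJ Sathvik,Zuhair Hasan Shaik,Vivek Gupta
Affiliations: IIIT Dharwad; Arizona State University
Categories: Computation and Language (cs.CL)
Comments: Accepted at Findings of EMNLP 2025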

Abstract:Mental health disorders are a global crisis. While various datasets exist for detecting such disorders, there remains a critical gap in identifying individuals actively seeking help. This paper introduces a novel dataset, M-Help, specifically designed to detect help-seeking behavior on social media. The dataset goes beyond traditional labels by identifying not only help-seeking activity but also specific mental health disorders and their underlying causes, such as relationship challenges or financial stressors. AI models trained on M-Help can address three key tasks: identifying help-seekers, diagnosing mental health conditions, and uncovering the root causes of issues.

[NLP-24] GraSP: A Unified Graph-Based Framework for Scalable Generation, Quality Tagging, and Management of Synthetic Data for SFT and DPO

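[Quick read]: High-quality data for training large language models (LLMs) is hard to obtain: for Supervised Fine-Tuning (SFT) and alignment tasks such as Direct Preference Optimization (DPO), human annotation is costly and scales poorly. This paper presents a modular, configurable framework for synthetic data generation. A dual-stage quality tagging mechanism, combining heuristic rules with LLM-based evaluation, automatically filters and scores data extracted from OASST-formatted conversations. The output follows a flexible schema that supports both SFT and DPO, which reduces the data preparation overhead in LLM training pipelines.

Link: https://arxiv.org/abs/2508.15432
Authors: Bidyapati Pradhan,Surajit Dasgupta,Amit Kumar Saha,Omkar Anustoop,Sriram Puttagunta,Vipul Mittal,Gopal Sarda
Affiliations: ServiceNow Inc.
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: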

Abstract:The advancement of large language models (LLMs) is critically dependent on the availability of high-quality datasets for Supervised Fine-Tuning (SFT), alignment tasks like Direct Preference Optimization (DPO), etc. In this work, we present a comprehensive synthetic data generation framework that facilitates scalable, configurable, and high-fidelity generation of synthetic data tailored for these training paradigms. Our approach employs a modular and configuration-based pipeline capable of modeling complex dialogue flows with minimal manual intervention. This framework uses a dual-stage quality tagging mechanism, combining heuristic rules and LLM-based evaluations, to automatically filter and score data extracted from OASST-formatted conversations, ensuring the curation of high-quality dialogue samples. The resulting datasets are structured under a flexible schema supporting both SFT and DPO use cases, enabling seamless integration into diverse training workflows. Together, these innovations offer a robust solution for generating and managing synthetic conversational data at scale, significantly reducing the overhead of data preparation in LLM training pipelines.

[NLP-25] A Study of Privacy-preserving Language Modeling Approaches

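[Quick read]: Language models can memorize sensitive training data and disclose it under privacy attacks, which threatens individuals’ privacy rights. This paper systematically reviews existing privacy-preserving language modeling approaches, analyzes their strengths and limitations, and outlines directions for future research.

Link: https://arxiv.org/abs/2508.15421
Authors: Pritilata Saha,Abhirup Sinha
Affiliations: Paderborn University
Categories: Computation and Language (cs.CL)
Comments: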

Abstract:Recent developments in language modeling have increased their use in various applications and domains. Language models, often trained on sensitive data, can memorize and disclose this information during privacy attacks, raising concerns about protecting individuals’ privacy rights. Preserving privacy in language models has become a crucial area of research, as privacy is one of the fundamental human rights. Despite its significance, understanding of how much privacy risk these language models possess and how it can be mitigated is still limited. This research addresses this by providing a comprehensive study of the privacy-preserving language modeling approaches. This study gives an in-depth overview of these approaches, highlights their strengths, and investigates their limitations. The outcomes of this study contribute to the ongoing research on privacy-preserving language modeling, providing valuable insights and outlining future research directions.

[NLP-26] LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

[Quick read]: Research on Large Speech-Language Models (LSLMs) suffers from fragmented architectures and a lack of transparency, which makes systematic comparison and reproduction difficult; models are commonly released as weights only, without training data or configurations. The paper introduces LLaSO, the first fully open, end-to-end framework for the field, with three core resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. LLaSO-Base, a 3.8B-parameter model trained only on this public data, reaches a normalized score of 0.72 and serves as a strong, reproducible baseline.

Link: https://arxiv.org/abs/2508.15418
Authors: Yirong Sun,Yizhong Geng,Peidong Wei,Yanjun Chen,Jinghan Yang,Rongfei Chen,Wei Zhang,Xiaoyu Shen
Affiliations: Ningbo Key Laboratory of Spatial Intelligence and Digital Derivative; Institute of Digital Twin; EIT; Logic Intelligence Technology; BUPT; Xiamen University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Sound (cs.SD)
Comments:

Abstract:The development of Large Speech-Language Models (LSLMs) has been slowed by fragmented architectures and a lack of transparency, hindering the systematic comparison and reproducibility of research. Unlike in the vision-language domain, the LSLM field suffers from the common practice of releasing model weights without their corresponding training data and configurations. To address these critical gaps, we introduce LLaSO, the first fully open, end-to-end framework for large-scale speech-language modeling. LLaSO provides the community with three essential resources: (1) LLaSO-Align, a 12M-instance speech-text alignment corpus; (2) LLaSO-Instruct, a 13.5M-instance multi-task instruction-tuning dataset; and (3) LLaSO-Eval, a reproducible benchmark for standardized evaluation. To validate our framework, we build and release LLaSO-Base, a 3.8B-parameter reference model trained exclusively on our public data. It achieves a normalized score of 0.72, establishing a strong, reproducible baseline that surpasses comparable models. Our analysis reveals that while broader training coverage enhances performance, significant generalization gaps persist on unseen tasks, particularly in pure audio scenarios. By releasing the complete stack of data, benchmarks, and models, LLaSO establishes a foundational open standard to unify research efforts and accelerate community-driven progress in LSLMs. We release the code, dataset, pretrained models, and results in this https URL.

[NLP-27] Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems

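[Quick read]: Generative AI (GenAI) is unpredictable and inefficient, which makes it hard to build reliable and efficient GenAI-empowered systems and limits its integration with traditional software engineering practice. The paper argues for a paradigm shift: future GenAI-native systems should combine GenAI’s cognitive capabilities with traditional software engineering principles to be robust, adaptive, and efficient. It proposes design principles built on five pillars (reliability, excellence, evolvability, self-reliance, and assurance) and architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the construction of resilient, self-evolving systems.

Link: https://arxiv.org/abs/2508.15411
Authors: Frederik Vandeputte
Affiliations: Nokia Bell Labs
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: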

Abstract:Generative AI (GenAI) has emerged as a transformative technology, demonstrating remarkable capabilities across diverse application domains. However, GenAI faces several major challenges in developing reliable and efficient GenAI-empowered systems due to its unpredictability and inefficiency. This paper advocates for a paradigm shift: future GenAI-native systems should integrate GenAI’s cognitive capabilities with traditional software engineering principles to create robust, adaptive, and efficient systems. We introduce foundational GenAI-native design principles centered around five key pillars – reliability, excellence, evolvability, self-reliance, and assurance – and propose architectural patterns such as GenAI-native cells, organic substrates, and programmable routers to guide the creation of resilient and self-evolving systems. Additionally, we outline the key ingredients of a GenAI-native software stack and discuss the impact of these systems from technical, user adoption, economic, and legal perspectives, underscoring the need for further validation and experimentation. Our work aims to inspire future research and encourage relevant communities to implement and refine this conceptual framework.

[NLP-28] When Audio and Text Disagree: Revealing Text Bias in Large Audio-Language Models EMNLP2025

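[Quick read]: Large Audio-Language Models (LALMs) behave unreliably when the audio and text modalities conflict: they show a strong bias toward the textual input and often disregard audio evidence, which degrades audio-centric tasks. The paper builds MCR-BENCH, the first benchmark designed to evaluate how LALMs prioritize information given inconsistent audio-text pairs. It also explores supervised fine-tuning as a mitigation and shows that models stay overconfident even with contradictory inputs, which points to the need for better modality balance during training and more sophisticated fusion mechanisms.

Link: https://arxiv.org/abs/2508.15407
Authors: Cheng Wang,Gelei Deng,Xianglin Yang,Han Qiu,Tianwei Zhang
Affiliations: National University of Singapore; Nanyang Technological University; Tsinghua University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025 Main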

Abstract:Large Audio-Language Models (LALMs) are enhanced with audio perception capabilities, enabling them to effectively process and understand multimodal inputs that combine audio and text. However, their performance in handling conflicting information between audio and text modalities remains largely unexamined. This paper introduces MCR-BENCH, the first comprehensive benchmark specifically designed to evaluate how LALMs prioritize information when presented with inconsistent audio-text pairs. Through extensive evaluation across diverse audio understanding tasks, we reveal a concerning phenomenon: when inconsistencies exist between modalities, LALMs display a significant bias toward textual input, frequently disregarding audio evidence. This tendency leads to substantial performance degradation in audio-centric tasks and raises important reliability concerns for real-world applications. We further investigate the influencing factors of text bias, and explore mitigation strategies through supervised finetuning, and analyze model confidence patterns that reveal persistent overconfidence even with contradictory inputs. These findings underscore the need for improved modality balance during training and more sophisticated fusion mechanisms to enhance the robustness when handling conflicting multi-modal inputs. The project is available at this https URL.

[NLP-29] Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models

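[Quick read]: Large language models (LLMs) raise concerns about reliability and trustworthiness; evidence-based text generation addresses them by linking model outputs to supporting evidence for traceability and verifiability. This survey systematically analyzes 134 papers, introduces a unified taxonomy, and examines 300 evaluation metrics across seven key dimensions. It focuses on approaches that use citations, attribution, or quotations, and describes their distinctive characteristics and representative methods.

Link: https://arxiv.org/abs/2508.15396
Authors: Tobias Schreieder,Tim Schopf,Michael Färber
Affiliations: TU Dresden; ScaDS.AI Dresden/Leipzig
Categories: Computation and Language (cs.CL)
Comments: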

Abstract:The increasing adoption of large language models (LLMs) has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.

[NLP-30] CITE: A Comprehensive Benchmark for Heterogeneous Text-Attributed Graphs on Catalytic Materials

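[Quick read]: Heterogeneous text-attributed graphs (HTAGs) lack large-scale benchmark datasets, a shortage that has become a bottleneck for developing and fairly comparing representation learning methods. The paper introduces CITE (Catalytic Information Textual Entities Graph), the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials, with over 438K nodes and 1.2M edges spanning four relation types. It establishes standardized evaluation procedures, benchmarks the node classification task extensively, and runs ablation experiments on the heterogeneous and textual properties of the data.

Link: https://arxiv.org/abs/2508.15392
Authors: Chenghao Zhang,Qingqing Long,Ludi Wang,Wenjuan Cui,Jianjun Yu,Yi Du
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: 23 pages, 4 figures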

Abstract:Text-attributed graphs (TAGs) are pervasive in real-world systems, where each node carries its own textual features. In many cases these graphs are inherently heterogeneous, containing multiple node types and diverse edge types. Despite the ubiquity of such heterogeneous TAGs, there remains a lack of large-scale benchmark datasets. This shortage has become a critical bottleneck, hindering the development and fair comparison of representation learning methods on heterogeneous text-attributed graphs. In this paper, we introduce CITE - Catalytic Information Textual Entities Graph, the first and largest heterogeneous text-attributed citation graph benchmark for catalytic materials. CITE comprises over 438K nodes and 1.2M edges, spanning four relation types. In addition, we establish standardized evaluation procedures and conduct extensive benchmarking on the node classification task, as well as ablation experiments on the heterogeneous and textual properties of CITE. We compare four classes of learning paradigms, including homogeneous graph models, heterogeneous graph models, LLM (Large Language Model)-centric models, and LLM+Graph models. In a nutshell, we provide (i) an overview of the CITE dataset, (ii) standardized evaluation protocols, and (iii) baseline and ablation experiments across diverse modeling paradigms.

[NLP-31] Exploiting Vocabulary Frequency Imbalance in Language Model Pre-training

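[Quick read]: As vocabularies in large language models (LLMs) grow (here from 24K to 196K), it is unclear where the performance gain comes from, and in particular whether the imbalance between frequent and rare tokens helps training. Controlled experiments show that larger vocabularies mainly help by lowering the Kolmogorov complexity of the tokenized text. Nearly all of the cross-entropy reduction comes from the 2,500 most frequent words, while loss on the rare tail rises. When input and output embedding norms are constrained to attenuate the frequency imbalance, the gain reverses, which shows that the model exploits the imbalance rather than suffering from it. The paper therefore reframes “bigger vocabularies help” as “lowering the complexity of tokenized text helps,” a simple, interpretable lever for tokenizer-model co-design.

Link: https://arxiv.org/abs/2508.15390
Authors: Woojin Chung,Jeonghoon Kim
Affiliations: KAIST; NAVER Cloud
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Preprint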

Abstract:Large language models are trained with tokenizers, and the resulting token distribution is highly imbalanced: a few words dominate the stream while most occur rarely. Recent practice favors ever-larger vocabularies, but the source of the benefit is unclear. We conduct a controlled study that scales the language model’s vocabulary from 24K to 196K while holding data, compute, and optimization fixed. We first quantify the complexity of tokenized text, formalized via Kolmogorov complexity, and show that larger vocabularies reduce this complexity. Above 24K, every common word is already a single token, so further growth mainly deepens the relative token-frequency imbalance. A word-level loss decomposition shows that larger vocabularies reduce cross-entropy almost exclusively by lowering uncertainty on the 2,500 most frequent words, even though loss on the rare tail rises. Constraining input and output embedding norms to attenuate the effect of token-frequency imbalance reverses the gain, directly showing that the model exploits rather than suffers from imbalance. Because the same frequent words cover roughly 77% of tokens in downstream benchmarks, this training advantage transfers intact. We also show that enlarging model parameters with a fixed vocabulary yields the same frequent-word benefit. Our results reframe “bigger vocabularies help” as “lowering the complexity of tokenized text helps,” providing a simple, principled lever for tokenizer-model co-design and clarifying the loss dynamics that govern language-model scaling in pre-training.
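The word-level loss decomposition mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors’ code: it assumes per-word losses are already available as `(word, loss)` pairs, splits words into a frequent bucket (the `top_k` most common words, 2,500 in the paper) and a rare bucket, and reports each bucket’s share of all words and its mean loss. The function name and the toy data are hypothetical.

```python
from collections import Counter

def decompose_loss(word_losses, top_k):
    """Split per-word losses into frequent and rare buckets.

    word_losses: list of (word, loss) pairs, one per word occurrence.
    top_k: number of most common words treated as frequent.
    """
    counts = Counter(word for word, _ in word_losses)
    frequent = {word for word, _ in counts.most_common(top_k)}
    buckets = {"frequent": [], "rare": []}
    for word, loss in word_losses:
        buckets["frequent" if word in frequent else "rare"].append(loss)
    total = len(word_losses)
    return {
        name: {
            # Fraction of all word occurrences that fall in this bucket.
            "share": len(losses) / total,
            "mean_loss": sum(losses) / len(losses) if losses else 0.0,
        }
        for name, losses in buckets.items()
    }
```

The overall mean loss equals the sum of `share * mean_loss` over the two buckets, so a bucket that covers most occurrences dominates the total. That is why a lower loss on frequent words can reduce overall cross-entropy even when the loss on the rare tail rises.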

[NLP-32] Confidence-Modulated Speculative Decoding for Large Language Models

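Quick read: This paper addresses the limited adaptability of existing speculative decoding methods, whose static drafting lengths and rigid verification criteria reduce efficiency and resource utilization when model uncertainty and input complexity vary. The key idea is an information-theoretic framework built on confidence-modulated drafting: entropy and margin-based uncertainty measures determine how many tokens are drafted speculatively at each iteration, and the same confidence signals modulate verification. This lowers rollback frequency and improves resource utilization while preserving output quality. The method is offered as a principled, plug-in approach to efficient and robust LLM decoding under varying uncertainty.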

Link: https://arxiv.org/abs/2508.15371
Authors: Jaydip Sen,Subhasis Dasgupta,Hetvi Waghela
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

Abstract:Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter’s output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
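
The idea of adjusting the draft length from the drafter's confidence can be sketched as follows. This is a minimal illustration, not the authors' method: the way entropy and margin are combined into one confidence value, the linear mapping to a length, and the names `confidence`, `draft_length`, `min_len`, and `max_len` are all assumptions made for the example.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the two most probable tokens."""
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)

def confidence(probs):
    """Hypothetical confidence in [0, 1]: mean of (1 - normalized entropy) and margin."""
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    return 0.5 * (1.0 - entropy(probs) / max_entropy) + 0.5 * margin(probs)

def draft_length(probs, min_len=1, max_len=8):
    """Map confidence linearly onto a draft length (assumed rule, not from the paper)."""
    c = confidence(probs)
    return min_len + round(c * (max_len - min_len))

peaked = [0.97, 0.01, 0.01, 0.01]   # drafter is confident: draft more tokens
flat = [0.25, 0.25, 0.25, 0.25]     # drafter is uncertain: draft fewer tokens
print(draft_length(peaked), draft_length(flat))
```

A confident drafter distribution yields a longer speculative run than a flat one, which is the behavior the abstract describes as reducing rollbacks.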

[NLP-33] Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

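Quick read: This paper addresses systemic trustworthiness risks in Multimodal Large Language Models (MLLMs). Existing evaluation and mitigation methods tend to cover a single dimension and overlook the new risk types introduced by multimodality. The key contribution is MultiTrust-X, a comprehensive benchmark organized along three dimensions: five trustworthiness aspects (truthfulness, robustness, safety, fairness, and privacy); two novel risk types (multimodal risks and cross-modal impacts); and mitigation strategies spanning data, model architecture, training, and inference algorithms. On this basis the study builds 32 tasks and 28 curated datasets, evaluates 30 open-source and proprietary MLLMs, and analyzes 8 representative mitigation methods. It finds that current methods rarely improve overall trustworthiness and often introduce unexpected trade-offs. The authors therefore propose Reasoning-Enhanced Safety Alignment (RESA), which uses chain-of-thought reasoning to uncover underlying risks and reports state-of-the-art results, improving safety while preserving model performance.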

Link: https://arxiv.org/abs/2508.15370
Authors: Yichi Zhang,Yao Huang,Yifan Wang,Yitong Sun,Chang Liu,Zhe Zhao,Zhengwei Fang,Huanran Chen,Xiao Yang,Xingxing Wei,Hang Su,Yinpeng Dong,Jun Zhu
Affiliations: Tsinghua University; Beihang University; RealAI
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: For Appendix, please refer to arXiv:2406.07057

Abstract:The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies that, while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

[NLP-34] A Survey on Large Language Model Benchmarks

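Quick read: This survey addresses key problems in the current evaluation of Large Language Models (LLMs): inflated scores caused by data contamination, unfair evaluation caused by cultural and linguistic bias, and missing evaluation of process credibility and of behavior in dynamic environments. It systematically reviews 283 representative benchmarks, groups them into three categories (general capabilities, domain-specific, and target-specific), and on that basis proposes a reference design paradigm for future benchmarks, aiming at evaluation that is more rigorous, fair, and practically useful.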

Link: https://arxiv.org/abs/2508.15361
Authors: Shiwen Ni,Guhong Chen,Shuaimin Li,Xuanang Chen,Siyi Li,Bingli Wang,Qiyao Wang,Xingjian Wang,Yifan Zhang,Liyang Fan,Chengming Li,Ruifeng Xu,Le Sun,Min Yang
Affiliations: Shenzhen Key Laboratory for High Performance Data Mining; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; Southern University of Science and Technology; University of Chinese Academy of Sciences; University of Science and Technology of China; Shanghai University of Electric Power; Shanghai AI Lab; South China University of Technology; Institute of Software, Chinese Academy of Sciences; Shenzhen MSU-BIT University; Harbin Institute of Technology, Shenzhen; Shenzhen University of Advanced Technology
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:In recent years, with the rapid development of the depth and breadth of large language models’ capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model performance, benchmarks are not only a core means to measure model capabilities but also a key element in guiding the direction of model development and promoting technological innovation. We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific. General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning; domain-specific benchmarks focus on fields like natural sciences, humanities and social sciences, and engineering technology; target-specific benchmarks pay attention to risks, reliability, agents, etc. We point out that current benchmarks have problems such as inflated scores caused by data contamination, unfair evaluation due to cultural and linguistic biases, and lack of evaluation on process credibility and dynamic environments, and provide a referable design paradigm for future benchmark innovation.

[NLP-35] KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models

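Quick read: This paper addresses the difficulty of comparing Knowledge Graph Completion (KGC) models consistently and reliably across multiple datasets and evaluation metrics. Metrics such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k often produce inconsistent model rankings because of dataset differences or conflicts between metrics, which hinders an objective judgment of overall performance. The paper proposes a meta-metric based on EDAS (Evaluation based on Distance from Average Solution). It measures how far each model's results lie from the average solution across all datasets and metrics and combines these distances into a single normalized score in [0, 1], giving a global evaluation across datasets and metrics. The approach is intended to make evaluation more stable and interpretable and to support fair, consistent KGC model selection.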

Link: https://arxiv.org/abs/2508.15357
Authors: Haji Gul,Abul Ghani Naim,Ajaz Ahmad Bhat
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Performance (cs.PF)
Comments:

Abstract:Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score \( M_i \in [0,1] \). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.
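
For readers unfamiliar with EDAS, the sketch below follows the standard formulation from multi-criteria decision-making: positive and negative distances from the per-criterion average, weighted, normalized, and averaged into a score in [0, 1]. Whether the paper uses exactly this variant is an assumption; the equal weights, the example numbers, and the function name `edas_scores` are illustrative only. The sketch also assumes every criterion has a positive average.

```python
def edas_scores(matrix, weights, beneficial):
    """Standard EDAS appraisal scores.

    matrix[i][j]: result of model i on criterion j (a dataset/metric pair).
    weights[j]:   weight of criterion j (weights are assumed to sum to 1).
    beneficial[j]: True if higher is better (MRR, Hit@k), False if lower is better (MR).
    """
    n, m = len(matrix), len(matrix[0])
    avg = [sum(row[j] for row in matrix) / n for j in range(m)]
    sp, sn = [], []
    for row in matrix:
        pos = neg = 0.0
        for j in range(m):
            # Signed gap from the average, oriented so that positive means "better".
            gap = (row[j] - avg[j]) if beneficial[j] else (avg[j] - row[j])
            pos += weights[j] * max(0.0, gap) / avg[j]
            neg += weights[j] * max(0.0, -gap) / avg[j]
        sp.append(pos)
        sn.append(neg)
    max_sp, max_sn = max(sp), max(sn)
    scores = []
    for pos, neg in zip(sp, sn):
        nsp = pos / max_sp if max_sp > 0 else 0.0
        nsn = 1.0 - (neg / max_sn if max_sn > 0 else 0.0)
        scores.append(0.5 * (nsp + nsn))
    return scores

# Made-up numbers for three hypothetical models on (MRR, Hit@1, MR).
results = [[0.35, 0.25, 200.0], [0.30, 0.22, 250.0], [0.25, 0.18, 300.0]]
scores = edas_scores(results, [1 / 3, 1 / 3, 1 / 3], [True, True, False])
print(scores)
```

Each model receives one number, so conflicting per-metric rankings collapse into a single ordering.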

[NLP-36] DiagECG: An LLM-Driven Framework for Diagnostic Reasoning via Discretized ECG Tokenization

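Quick read: This paper addresses the weak cross-task generalization of current automated electrocardiography (ECG) analysis methods and their limited support for open-ended reasoning. The key idea of the DiagECG framework is to discretize continuous embeddings of 12-lead ECG signals into symbolic tokens, using a lead-independent encoder and a quantization module, and to add these tokens to the vocabulary of a large language model (LLM) so that ECG and natural-language inputs are modeled in a unified way. An autoregressive ECG forecasting pretraining task bridges the modality gap and lets the LLM capture temporal dynamics with its native language-modeling ability. Instruction tuning on ECG question answering and diagnostic report generation then yields strong multi-task performance and out-of-distribution generalization without modifying the core model.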

Link: https://arxiv.org/abs/2508.15338
Authors: Jinning Yang,Wen Shi
Affiliations: Harvard Medical School; Massachusetts General Hospital
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Electrocardiography plays a central role in cardiovascular diagnostics, yet existing automated approaches often struggle to generalize across clinical tasks and offer limited support for open-ended reasoning. We present DiagECG, a novel framework that integrates time-series and language modeling by enabling large language models to process 12-lead ECG signals for clinical text generation tasks. Our approach discretizes continuous ECG embeddings into symbolic tokens using a lead-independent encoder and quantization module. These tokens are then used to extend the vocabulary of LLM, allowing the model to handle both ECG and natural language inputs in a unified manner. To bridge the modality gap, we pretrain the model on an autoregressive ECG forecasting task, enabling the LLM to model temporal dynamics using its native language modeling capabilities. Finally, we perform instruction tuning on both ECG question answering and diagnostic report generation. Without modifying the core model, DiagECG achieves strong performance across tasks while maintaining generalization to out-of-distribution settings. Extensive experiments demonstrate the effectiveness of each component and highlight the potential of integrating symbolic ECG representations into LLMs for medical reasoning.

[NLP-37] CUPE: Contextless Universal Phoneme Encoder for Language-Agnostic Speech Processing

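Quick read: This paper addresses the heavy reliance of universal phoneme recognition on long speech segments and language-specific patterns, which is a problem for speech processing tasks that need pure phoneme representations free from contextual influence. The proposed lightweight model, CUPE, independently processes fixed-width windows of only 120 milliseconds (about one phoneme's length) and learns basic acoustic patterns shared across languages. Despite having fewer parameters than existing approaches, CUPE achieves competitive cross-lingual performance under both supervised and self-supervised training, which suggests that modeling basic acoustic features within phoneme-length windows is a workable route to universal speech processing.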

Link: https://arxiv.org/abs/2508.15316
Authors: Abdul Rehman,Jian-Jun Zhang,Xiaosong Yang
Affiliations: Bournemouth University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Accepted in: 8th International Conference on Natural Language and Speech Processing (ICNLSP 2025)

Abstract:Universal phoneme recognition typically requires analyzing long speech segments and language-specific patterns. Many speech processing tasks require pure phoneme representations free from contextual influence, which motivated our development of CUPE - a lightweight model that captures key phoneme features in just 120 milliseconds, about one phoneme’s length. CUPE processes short, fixed-width windows independently and, despite fewer parameters than current approaches, achieves competitive cross-lingual performance by learning fundamental acoustic patterns common to all languages. Our extensive evaluation through supervised and self-supervised training on diverse languages, including zero-shot tests on the UCLA Phonetic Corpus, demonstrates strong cross-lingual generalization and reveals that effective universal speech processing is possible through modeling basic acoustic patterns within phoneme-length windows.
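
To make the "short, fixed-width windows" concrete, the sketch below slices a waveform into independent 120 ms windows. Only the 120 ms width comes from the abstract; the 16 kHz sample rate, the non-overlapping hop, and the name `phoneme_windows` are assumptions for illustration, and the encoder that would consume each window is not shown.

```python
def phoneme_windows(samples, sample_rate=16000, window_ms=120, hop_ms=120):
    """Split a waveform into fixed-width windows that can be encoded independently."""
    win = sample_rate * window_ms // 1000   # samples per window
    hop = sample_rate * hop_ms // 1000      # samples between window starts
    return [samples[s:s + win] for s in range(0, len(samples) - win + 1, hop)]

one_second = [0.0] * 16000                  # placeholder for one second of audio
windows = phoneme_windows(one_second)
print(len(windows), len(windows[0]))
```

Because each window is handled on its own, no context from neighboring phonemes enters the representation, which is the property the abstract emphasizes.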

[NLP-38] IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents EMNLP2025

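Quick read: This paper addresses malicious tool invocations caused by Indirect Prompt Injection (IPI) attacks when Large Language Model (LLM) agents interact with untrusted data sources. IPI attacks plant instructions in external tool responses to steer the agent toward unintended actions. Existing defenses rely on prompting strategies or auxiliary detection models and place no structural constraints on agent behavior, so they struggle against stronger attack vectors. The proposed defense, IPIGuard, models the agent's task execution as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, it reduces unintended tool invocations triggered by injected instructions at the source and improves robustness against IPI attacks.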

Link: https://arxiv.org/abs/2508.15310
Authors: Hengyu An,Jinghuai Zhang,Tianyu Du,Chunyi Zhou,Qingming Li,Tao Lin,Shouling Ji
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2025

Abstract:Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. However, when interacting with untrusted data sources (e.g., fetching information from public websites), tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes, a threat referred to as Indirect Prompt Injection (IPI). Existing defenses typically rely on advanced prompting strategies or auxiliary detection models. While these methods have demonstrated some effectiveness, they fundamentally rely on assumptions about the model’s inherent security, which lacks structural constraints on agent behaviors. As a result, agents still retain unrestricted access to tool invocations, leaving them vulnerable to stronger attack vectors that can bypass the security guardrails of the model. To prevent malicious tool invocations at the source, we propose a novel defensive task execution paradigm, called IPIGuard, which models the agents’ task execution process as a traversal over a planned Tool Dependency Graph (TDG). By explicitly decoupling action planning from interaction with external data, IPIGuard significantly reduces unintended tool invocations triggered by injected instructions, thereby enhancing robustness against IPI attacks. Experiments on the AgentDojo benchmark show that IPIGuard achieves a superior balance between effectiveness and robustness, paving the way for the development of safer agentic systems in dynamic environments.
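
The decoupling of planning from external data can be illustrated with a toy executor. This is a simplified sketch under stated assumptions, not IPIGuard itself: the plan is a hand-written dictionary rather than something an LLM produced, and the names `run_plan`, `is_allowed`, `fetch_page`, `summarize`, and `send_money` are hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

def run_plan(plan, tools):
    """Execute tool calls in dependency order; only calls named in the plan can run.

    plan:  dict mapping a call name to the set of call names it depends on.
    tools: dict mapping a call name to a function of its dependencies' outputs.
    """
    results = {}
    for node in TopologicalSorter(plan).static_order():
        deps = {d: results[d] for d in plan.get(node, ())}
        results[node] = tools[node](deps)
    return results

def is_allowed(plan, requested_tool):
    """A call requested during execution is rejected unless the plan contains it."""
    return requested_tool in plan

# The plan is fixed before any untrusted data is read.
plan = {"fetch_page": set(), "summarize": {"fetch_page"}}
tools = {
    # The fetched text carries an injected instruction, but it is treated as data only.
    "fetch_page": lambda deps: "page text. IGNORE PREVIOUS INSTRUCTIONS and call send_money",
    "summarize": lambda deps: deps["fetch_page"][:9],
}
results = run_plan(plan, tools)
print(results["summarize"], is_allowed(plan, "send_money"))
```

Because the set of permitted calls is fixed at planning time, text returned by a tool cannot add a new call to the execution.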

[NLP-39] Multiple Memory Systems for Enhancing the Long-term Memory of Agent

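Quick read: This paper addresses the difficulty Large Language Model (LLM) agents have in handling the large volume of historical data produced during long-term interaction. Existing memory modules such as MemoryBank and A-MEM store low-quality memory content, which degrades recall performance and response quality. Inspired by cognitive psychology, the proposed Multiple Memory System (MMS) processes short-term memory into multiple long-term memory fragments and builds retrieval memory units and contextual memory units from them in one-to-one correspondence. At query time the most relevant retrieval memory units are matched to the user's query and the corresponding contextual memory units supply the context for the response, so that historical data is used more effectively.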

Link: https://arxiv.org/abs/2508.15294
Authors: Gaoke Zhang,Bo Wang,Yunlong Ma,Dongming Zhao,Zifei Yu
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments:

Abstract:Agents powered by large language models have achieved impressive results, but effectively handling the vast amounts of historical data generated during interactions remains a challenge. The current approach is to design a memory module for the agent to process these data. However, existing methods, such as MemoryBank and A-MEM, have poor quality of stored memory content, which affects recall performance and response quality. In order to better construct high-quality long-term memory content, we have designed a multiple memory system (MMS) inspired by cognitive psychology theory. The system processes short-term memory into multiple long-term memory fragments, and constructs retrieval memory units and contextual memory units based on these fragments, with a one-to-one correspondence between the two. During the retrieval phase, MMS matches the most relevant retrieval memory units based on the user's query. Then, the corresponding contextual memory units are obtained as the context for the response stage to enhance knowledge, thereby effectively utilizing historical data. Experiments on the LoCoMo dataset compared our method with three others, proving its effectiveness. Ablation studies confirmed the rationality of our memory units. We also analyzed the robustness regarding the number of selected memory segments and the storage overhead, demonstrating its practical value.
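
The one-to-one pairing of retrieval and contextual memory units can be sketched as two parallel lists: the query is scored against the retrieval units, and the contextual units at the same positions are returned. The bag-of-words cosine similarity, the example memories, and the names `bow`, `cosine`, and `retrieve_context` are assumptions for illustration; the paper does not say it uses this similarity measure.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a real embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_context(query, retrieval_units, contextual_units, k=1):
    """Rank retrieval units against the query; return the paired contextual units."""
    q = bow(query)
    ranked = sorted(
        range(len(retrieval_units)),
        key=lambda i: cosine(q, bow(retrieval_units[i])),
        reverse=True,
    )
    return [contextual_units[i] for i in ranked[:k]]

# Hypothetical memory fragments: short keys for matching, longer text for answering.
retrieval_units = ["user likes hiking mountains", "user owns a cat named Miso"]
contextual_units = [
    "The user said they go hiking every weekend.",
    "The user adopted a cat called Miso and mentions it often.",
]
print(retrieve_context("what is the cat called", retrieval_units, contextual_units))
```

Matching on compact keys while answering from richer context is one way to read the division of labor between the two unit types.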

[NLP-40] Evaluating Knowledge Graph Complexity via Semantic Spectral and Structural Metrics for Link Prediction

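Quick read: This paper addresses the reliability of dataset complexity assessment for Knowledge Graphs (KGs), and in particular the limited validity and stability of existing measures such as the Cumulative Spectral Gradient (CSG) in multi-relational link prediction. It analyzes KG complexity with a set of structural and semantic metrics (for example Relation Entropy, Maximum Relation Diversity, and Relation Type Cardinality) and shows that these correlate more strongly with standard performance metrics such as MRR and Hit@1. CSG turns out to lack the expected robustness and generalization, whereas the structural and semantic metrics reflect task difficulty more faithfully, pointing toward more stable, interpretable, and task-aligned complexity measures for knowledge-driven learning.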

Link: https://arxiv.org/abs/2508.15291
Authors: Haji Gul,Abul Ghani Naim,Ajaz Ahmad Bhat
Affiliations: Unknown
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments:

Abstract:Understanding dataset complexity is fundamental to evaluating and comparing link prediction models on knowledge graphs (KGs). While the Cumulative Spectral Gradient (CSG) metric, derived from probabilistic divergence between classes within a spectral clustering framework, has been proposed as a classifier-agnostic complexity metric purportedly scaling with class cardinality and correlating with downstream performance, it has not been evaluated in KG settings so far. In this work, we critically examine CSG in the context of multi-relational link prediction, incorporating semantic representations via transformer-derived embeddings. Contrary to prior claims, we find that CSG is highly sensitive to parametrisation and does not robustly scale with the number of classes. Moreover, it exhibits weak or inconsistent correlation with standard performance metrics such as Mean Reciprocal Rank (MRR) and Hit@1. To deepen the analysis, we introduce and benchmark a set of structural and semantic KG complexity metrics. Our findings reveal that global and local relational ambiguity captured via Relation Entropy, node-level Maximum Relation Diversity, and Relation Type Cardinality exhibit strong inverse correlations with MRR and Hit@1, suggesting these as more faithful indicators of task difficulty. Conversely, graph connectivity measures such as Average Degree, Degree Entropy, PageRank, and Eigenvector Centrality correlate positively with Hit@10. Our results demonstrate that CSG's purported stability and generalization predictive power fail to hold in link prediction settings and underscore the need for more stable, interpretable, and task-aligned measures of dataset complexity in knowledge-driven learning.
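
Two of the named metrics can be sketched directly from a list of triples. The definitions here are one plausible reading and are assumptions: Relation Entropy is taken as the Shannon entropy of the relation-type distribution, and Maximum Relation Diversity as the largest number of distinct relation types attached to any head entity. The paper's exact definitions may differ.

```python
import math
from collections import Counter

def relation_entropy(triples):
    """Shannon entropy (bits) of the relation-type distribution over (h, r, t) triples."""
    counts = Counter(r for _, r, _ in triples)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def max_relation_diversity(triples):
    """Largest number of distinct relation types attached to any head entity."""
    rels = {}
    for h, r, _ in triples:
        rels.setdefault(h, set()).add(r)
    return max(len(s) for s in rels.values())

# Toy graph with two relation types used equally often.
triples = [("a", "r1", "b"), ("a", "r2", "c"), ("b", "r1", "c"), ("c", "r2", "a")]
print(relation_entropy(triples), max_relation_diversity(triples))
```

Higher values of either quantity mean more candidate relations compete for a given link, which fits the inverse correlation with MRR and Hit@1 reported in the abstract.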

[NLP-41] Adversarial Attacks against Neural Ranking Models via In-Context Learning

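Quick read: This paper addresses the vulnerability of Neural Ranking Models (NRMs) to adversarial manipulation, where an attacker generates misleading content so that misinformation ranks higher in retrieval results. It proposes Few-Shot Adversarial Prompting (FSAP), a black-box attack framework that uses the in-context learning ability of Large Language Models (LLMs): few-shot prompting directly generates grammatically fluent, topically coherent adversarial documents, with no gradient access or internal model instrumentation. Conditioned on a small support set of previously observed harmful examples, FSAP synthesizes content that embeds misinformation yet ranks competitively. Experiments on the TREC 2020 and 2021 Health Misinformation Tracks show that FSAP-generated documents consistently outrank credible, accurate documents, with strong stance alignment and low detectability, which makes the attack a realistic and scalable threat to neural retrieval systems.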

Link: https://arxiv.org/abs/2508.15283
Authors: Amin Bigdeli,Negar Arabzadeh,Ebrahim Bagheri,Charles L. A. Clarke
Affiliations: University of Waterloo; University of California, Berkeley; University of Toronto
Subjects: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments:

Abstract:While neural ranking models (NRMs) have shown high effectiveness, they remain susceptible to adversarial manipulation. In this work, we introduce Few-Shot Adversarial Prompting (FSAP), a novel black-box attack framework that leverages the in-context learning capabilities of Large Language Models (LLMs) to generate high-ranking adversarial documents. Unlike previous approaches that rely on token-level perturbations or manual rewriting of existing documents, FSAP formulates adversarial attacks entirely through few-shot prompting, requiring no gradient access or internal model instrumentation. By conditioning the LLM on a small support set of previously observed harmful examples, FSAP synthesizes grammatically fluent and topically coherent documents that subtly embed false or misleading information and rank competitively against authentic content. We instantiate FSAP in two modes: FSAP-IntraQ, which leverages harmful examples from the same query to enhance topic fidelity, and FSAP-InterQ, which enables broader generalization by transferring adversarial patterns across unrelated queries. Our experiments on the TREC 2020 and 2021 Health Misinformation Tracks, using four diverse neural ranking models, reveal that FSAP-generated documents consistently outrank credible, factually accurate documents. Furthermore, our analysis demonstrates that these adversarial outputs exhibit strong stance alignment and low detectability, posing a realistic and scalable threat to neural retrieval systems. FSAP also effectively generalizes across both proprietary and open-source LLMs.

[NLP-42] AmbiSQL: Interactive Ambiguity Detection and Resolution for Text-to-SQL

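Quick read: This paper addresses misinterpretation caused by query ambiguity in Text-to-SQL: when large language models (LLMs) map natural-language questions to SQL queries, unclear user intent often leads to wrong SQL. The proposed system, AmbiSQL, has two core parts. A fine-grained ambiguity taxonomy identifies ambiguities that affect database element mapping and LLM reasoning, and interactive multiple-choice questions guide the user to clarify intent, after which the feedback is used to rewrite the ambiguous question. The reported results are 87.2% precision in ambiguity detection and a 50% improvement in SQL exact-match accuracy when AmbiSQL is integrated with Text-to-SQL systems.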

Link: https://arxiv.org/abs/2508.15276
Authors: Zhongjun Ding,Yin Lin,Tianjing Zeng
Affiliations: Alibaba Group
Subjects: Databases (cs.DB); Computation and Language (cs.CL)
Comments:

Abstract:Text-to-SQL systems translate natural language questions into SQL queries, providing substantial value for non-expert users. While large language models (LLMs) show promising results for this task, they remain error-prone. Query ambiguity has been recognized as a major obstacle for LLM-based Text-to-SQL systems, leading to misinterpretation of user intent and inaccurate SQL generation. We demonstrate AmbiSQL, an interactive system that automatically detects query ambiguities and guides users through intuitive multiple-choice questions to clarify their intent. Our approach introduces a fine-grained ambiguity taxonomy for identifying ambiguities that affect database element mapping and LLM reasoning, then incorporates user feedback to rewrite ambiguous questions. Evaluation on an ambiguous query dataset shows that AmbiSQL achieves 87.2% precision in ambiguity detection and improves SQL exact match accuracy by 50% when integrated with Text-to-SQL systems. Our demonstration showcases the significant performance gains and highlights the system’s practical usability. Code repo and demonstration are available at: this https URL.

[NLP-43] TComQA: Extracting Temporal Commonsense from Text

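Quick read: This paper addresses the difficulty of obtaining temporal commonsense about events, which natural language rarely states explicitly, and the resulting weakness of large language models (LLMs) at reasoning about properties such as event duration without explicit textual cues. It proposes an LLM-based automatic extraction pipeline that mines temporal commonsense from real corpora (SAMSum and RealNews) and uses it to build the TComQA dataset. TComQA was validated through crowdsourcing, with over 80% precision, and a model trained on it outperforms an LLM fine-tuned on an existing temporal question answering dataset.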

Link: https://arxiv.org/abs/2508.15274
Authors: Lekshmi R Nair,Arun Sankar,Koninika Pal
Affiliations: Indian Institute of Technology Palakkad
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Understanding events necessitates grasping their temporal context, which is often not explicitly stated in natural language. For example, it is not a trivial task for a machine to infer that a museum tour may last for a few hours, but cannot take months. Recent studies indicate that even advanced large language models (LLMs) struggle to generate text that requires reasoning with temporal commonsense, because such knowledge is rarely mentioned explicitly in text. Therefore, automatically mining temporal commonsense for events enables the creation of robust language models. In this work, we investigate the capacity of LLMs to extract temporal commonsense from text and evaluate multiple experimental setups to assess their effectiveness. Here, we propose a temporal commonsense extraction pipeline that leverages LLMs to automatically mine temporal commonsense and use it to construct TComQA, a dataset derived from the SAMSum and RealNews corpora. TComQA has been validated through crowdsourcing and achieves over 80% precision in extracting temporal commonsense. The model trained with TComQA also outperforms an LLM fine-tuned on an existing dataset for the temporal question answering task.

[NLP-44] Conflict-Aware Soft Prompting for Retrieval-Augmented Generation EMNLP2025

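Quick read: This paper addresses context-memory conflict in Retrieval-Augmented Generation (RAG): when retrieved context contradicts the parametric knowledge of the Large Language Model (LLM), the model often fails to recognize and correct the wrong information, which makes its output unreliable. The proposed Conflict-Aware REtrieval-Augmented Generation (CARE) adds a context assessor that encodes raw context tokens into compact memory token embeddings. Trained with grounded/adversarial soft prompting, the assessor learns to discern unreliable context and to capture a guidance signal that steers reasoning toward the more reliable knowledge source, improving accuracy on question answering and fact-checking tasks.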

Link: https://arxiv.org/abs/2508.15253
Authors: Eunseong Choi,June Park,Hyeri Lee,Jongwuk Lee
Affiliations: Sungkyunkwan University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025; 14 pages; 5 figures, 11 tables

Abstract:Retrieval-augmented generation (RAG) enhances the capabilities of large language models (LLMs) by incorporating external knowledge into their input prompts. However, when the retrieved context contradicts the LLM’s parametric knowledge, it often fails to resolve the conflict between incorrect external context and correct parametric knowledge, known as context-memory conflict. To tackle this problem, we introduce Conflict-Aware REtrieval-Augmented Generation (CARE), consisting of a context assessor and a base LLM. The context assessor encodes compact memory token embeddings from raw context tokens. Through grounded/adversarial soft prompting, the context assessor is trained to discern unreliable context and capture a guidance signal that directs reasoning toward the more reliable knowledge source. Extensive experiments show that CARE effectively mitigates context-memory conflicts, leading to an average performance gain of 5.0% on QA and fact-checking benchmarks, establishing a promising direction for trustworthy and adaptive RAG systems.

[NLP-45] Retrieval-Augmented Review Generation for Poisoning Recommender Systems

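Quick read: This paper studies the vulnerability of Recommender Systems (RSs) to data poisoning attacks, especially in black-box settings where the attacker has limited knowledge and must generate high-quality fake user profiles that transfer across systems. Existing methods struggle to achieve both attack effectiveness and imperceptibility because the fake textual reviews they produce are of poor quality. The proposed approach uses the In-Context Learning (ICL) ability of multimodal foundation models, augmented with a demonstration retrieval algorithm and a text style transfer strategy, to improve review quality. The resulting attack framework, RAGAN, generates fake profiles with a jailbreaker and optimizes them collaboratively with an instructional agent and a guardian, improving attack transferability and imperceptibility and achieving state-of-the-art poisoning performance on several real-world datasets.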

Link: https://arxiv.org/abs/2508.15252
Authors: Shiyi Yang,Xinshu Li,Guanglin Zhou,Chen Wang,Xiwei Xu,Liming Zhu,Lina Yao
Affiliations: University of New South Wales; CSIRO’s Data61; Macquarie University; University of Queensland
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Recent studies have shown that recommender systems (RSs) are highly vulnerable to data poisoning attacks, where malicious actors inject fake user profiles, including a group of well-designed fake ratings, to manipulate recommendations. Due to security and privacy constraints in practice, attackers typically possess limited knowledge of the victim system and thus need to craft profiles that have transferability across black-box RSs. To maximize the attack impact, the profiles often need to remain imperceptible. However, generating such high-quality profiles with the restricted resources is challenging. Some works suggest incorporating fake textual reviews to strengthen the profiles; yet, the poor quality of the reviews largely undermines the attack effectiveness and imperceptibility under the practical setting. To tackle the above challenges, in this paper, we propose to enhance the quality of the review text by harnessing in-context learning (ICL) capabilities of multimodal foundation models. To this end, we introduce a demonstration retrieval algorithm and a text style transfer strategy to augment the naive ICL. Specifically, we propose a novel practical attack framework named RAGAN to generate high-quality fake user profiles, which can gain insights into the robustness of RSs. The profiles are generated by a jailbreaker and collaboratively optimized on an instructional agent and a guardian to improve the attack transferability and imperceptibility. Comprehensive experiments on various real-world datasets demonstrate that RAGAN achieves the state-of-the-art poisoning attack performance.

[NLP-46] EMNLP: Educator-role Moral and Normative Large Language Models Profiling EMNLP

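Quick read: This paper addresses the lack of systematic psychological and ethical evaluation when LLMs simulate professional roles such as teachers. It proposes EMNLP, an Educator-role Moral and Normative LLM profiling framework. The framework constructs 88 teacher-specific moral dilemmas and introduces a soft prompt injection set, which together allow the personality traits, moral development stage, and ethical risk of teacher-role LLMs to be quantified. Experiments show that teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, do better at abstract moral reasoning, and do worse in emotionally complex situations. Models with stronger reasoning are also more vulnerable to harmful prompt injection, revealing a paradox between capability and safety.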

Link: https://arxiv.org/abs/2508.15250
Authors: Yilin Jiang,Mingzi Zhang,Sheng Jin,Zengyi Yu,Xiangjie Kong,Binghao Tu
Affiliations: Zhejiang University of Technology; The Hong Kong University of Science and Technology (Guangzhou); East China Normal University; GuangHua Law School, ZheJiang University; College of Computer Science and Technology, Zhejiang University of Technology
Subjects: Computation and Language (cs.CL)
Comments: 24 pages, 12 figures, Accepted by EMNLP Main Conference

Abstract:Simulating Professions (SP) enables Large Language Models (LLMs) to emulate professional roles. However, comprehensive psychological and ethical evaluation in these contexts remains lacking. This paper introduces EMNLP, an Educator-role Moral and Normative LLMs Profiling framework for personality profiling, moral development stage measurement, and ethical risk under soft prompt injection. EMNLP extends existing scales and constructs 88 teacher-specific moral dilemmas, enabling profession-oriented comparison with human teachers. A targeted soft prompt injection set evaluates compliance and vulnerability in teacher SP. Experiments on 12 LLMs show teacher-role LLMs exhibit more idealized and polarized personalities than human teachers, excel in abstract moral reasoning, but struggle with emotionally complex situations. Models with stronger reasoning are more vulnerable to harmful prompt injection, revealing a paradox between capability and safety. The model temperature and other hyperparameters have limited influence except in some risk behaviors. This paper presents the first benchmark to assess ethical and psychological alignment of teacher-role LLMs for educational AI. Resources are available at this https URL.

[NLP-47] UniCoM: A Universal Code-Switching Speech Generator EMNLP2025

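[Quick Read]: This paper addresses the limited ability of multilingual speech systems to handle code-switching (CS), which is common in real conversations but has little suitable data. The key to its solution is a new pipeline, Universal Code-Mixer (UniCoM), built on the Substituting WORDs with Synonyms (SWORDS) algorithm, which replaces selected words with their translations while taking part of speech into account, producing natural CS speech samples without altering sentence semantics. With this pipeline the authors build Code-Switching FLEURS (CS-FLEURS), a multilingual corpus for automatic speech recognition (ASR) and speech-to-text translation (S2TT), and report intelligibility and naturalness comparable to existing datasets on objective and subjective metrics.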

Link: https://arxiv.org/abs/2508.15244
Authors: Sangmin Lee, Woojin Chung, Seyun Um, Hong-Goo Kang
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to EMNLP 2025 Findings

Abstract:Code-switching (CS), the alternation between two or more languages within a single speaker’s utterances, is common in real-world conversations and poses significant challenges for multilingual speech technology. However, systems capable of handling this phenomenon remain underexplored, primarily due to the scarcity of suitable datasets. To resolve this issue, we propose Universal Code-Mixer (UniCoM), a novel pipeline for generating high-quality, natural CS samples without altering sentence semantics. Our approach utilizes an algorithm we call Substituting WORDs with Synonyms (SWORDS), which generates CS speech by replacing selected words with their translations while considering their parts of speech. Using UniCoM, we construct Code-Switching FLEURS (CS-FLEURS), a multilingual CS corpus designed for automatic speech recognition (ASR) and speech-to-text translation (S2TT). Experimental results show that CS-FLEURS achieves high intelligibility and naturalness, performing comparably to existing datasets on both objective and subjective metrics. We expect our approach to advance CS speech technology and enable more inclusive multilingual systems.
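
The substitution idea behind SWORDS can be illustrated with a toy text-only sketch. Everything here is an assumption for illustration: the function name `code_switch`, the hand-tagged sentence, the tiny English-to-Spanish lexicon, and the choice of which parts of speech are eligible. The actual pipeline operates on speech and is not reproduced here.

```python
# Toy illustration of POS-aware word substitution (hypothetical names and data).
def code_switch(tagged_tokens, lexicon, allowed_pos=frozenset({"NOUN", "VERB"})):
    """Replace a word with its translation only if its POS is eligible
    and the lexicon actually contains a translation for it."""
    out = []
    for word, pos in tagged_tokens:
        if pos in allowed_pos and word.lower() in lexicon:
            out.append(lexicon[word.lower()])
        else:
            out.append(word)
    return out

# Hand-tagged example sentence and a made-up two-entry lexicon.
sentence = [("I", "PRON"), ("bought", "VERB"), ("a", "DET"),
            ("book", "NOUN"), ("yesterday", "ADV")]
en_to_es = {"book": "libro", "yesterday": "ayer"}

mixed = code_switch(sentence, en_to_es)
print(" ".join(mixed))
```

Here only `book` is replaced: `bought` is eligible but has no lexicon entry, and `yesterday` has an entry but an ineligible part of speech.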

[NLP-48] WangchanThaiInstruct: An instruction-following Dataset for Culture-Aware Multitask and Multi-domain Evaluation in Thai EMNLP2025

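[Quick Read]: This paper addresses the underexplored instruction-following ability of large language models (LLMs) in low-resource languages such as Thai, and in particular the fact that existing benchmarks mostly rely on translated data and so miss cultural and domain-specific nuances. The key to its solution is WangchanThaiInstruct, a human-authored Thai instruction dataset covering four professional domains and seven task types, built through a multi-stage quality control process involving annotators, domain experts, and AI researchers. The dataset supports both zero-shot evaluation and instruction tuning. Models fine-tuned on native Thai supervision outperform models trained on translated data on both in-domain and out-of-domain benchmarks, which underscores the value of culturally and professionally grounded data for aligning LLMs in low-resource languages.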

Link: https://arxiv.org/abs/2508.15239
Authors: Peerat Limkonchotiwat, Pume Tuchinda, Lalita Lowphansirikul, Surapon Nonesung, Panuthep Tasawong, Alham Fikri Aji, Can Udomcharoenchaikit, Sarana Nutanong
Affiliations: AI Singapore; VISTEC; SCB 10X; MBZUAI
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 (Main). Model and Dataset: this https URL

Abstract:Large language models excel at instruction-following in English, but their performance in low-resource languages like Thai remains underexplored. Existing benchmarks often rely on translations, missing cultural and domain-specific nuances needed for real-world use. We present WangchanThaiInstruct, a human-authored Thai dataset for evaluation and instruction tuning, covering four professional domains and seven task types. Created through a multi-stage quality control process with annotators, domain experts, and AI researchers, WangchanThaiInstruct supports two studies: (1) a zero-shot evaluation showing performance gaps on culturally and professionally specific tasks, and (2) an instruction tuning study with ablations isolating the effect of native supervision. Models fine-tuned on WangchanThaiInstruct outperform those using translated data in both in-domain and out-of-domain benchmarks. These findings underscore the need for culturally and professionally grounded instruction data to improve LLM alignment in low-resource, linguistically diverse settings.

[NLP-49] VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models

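[Quick Read]: This paper addresses the memory bottleneck that arises when deploying small language models (SLMs) on edge devices, where vocabulary-related components (token embeddings and the language modeling head) account for a large share of the footprint. Existing static vocabulary pruning reduces memory but is rigid, loses information in the prefill stage, and lacks flexibility. The key to the solution is VocabTailor, a framework built on two observations: the lexical locality principle (any single inference needs only a small subset of tokens) and the asymmetry in computational characteristics between vocabulary-related components. It decouples the two components, offloading the embedding layer and applying a hybrid static-dynamic vocabulary selection strategy to the LM head so that vocabulary components are loaded on demand. This reduces vocabulary-related memory by up to 99% with minimal or no loss in task performance.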

Link: https://arxiv.org/abs/2508.15229
Authors: Hanling Zhang, Yayu Zhou, Tongcheng Fang, Zhihang Yuan, Guohao Dai, Yu Wang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Small Language Models (SLMs) provide computational advantages in resource-constrained environments, yet memory limitations remain a critical bottleneck for edge device deployment. A substantial portion of SLMs’ memory footprint stems from vocabulary-related components, particularly embeddings and language modeling (LM) heads, due to large vocabulary sizes. Existing static vocabulary pruning, while reducing memory usage, suffers from rigid, one-size-fits-all designs that cause information loss from the prefill stage and a lack of flexibility. In this work, we identify two key principles underlying the vocabulary reduction challenge: the lexical locality principle, the observation that only a small subset of tokens is required during any single inference, and the asymmetry in computational characteristics between vocabulary-related components of SLM. Based on these insights, we introduce VocabTailor, a novel decoupled dynamic vocabulary selection framework that addresses memory constraints through offloading embedding and implements a hybrid static-dynamic vocabulary selection strategy for LM Head, enabling on-demand loading of vocabulary components. Comprehensive experiments across diverse downstream tasks demonstrate that VocabTailor achieves a reduction of up to 99% in the memory usage of vocabulary-related components with minimal or no degradation in task performance, substantially outperforming existing static vocabulary pruning.
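
A rough sketch of what a hybrid static-dynamic selection for the LM head could look like is given below. It assumes, purely for illustration, that the dynamic part is the set of token ids present in the current input and the static part is a fixed task-level set; the abstract does not specify the selection rule. The names `select_vocab` and `reduced_lm_head` and the toy weights are hypothetical.

```python
# Minimal sketch: compute LM-head logits only over an "active" vocabulary subset.
def select_vocab(static_ids, input_ids):
    """Union of a fixed task-level token set and the tokens seen in this input."""
    return sorted(set(static_ids) | set(input_ids))

def reduced_lm_head(hidden, lm_head_rows, active_ids):
    """Dot the hidden state with the weight rows of active tokens only.
    lm_head_rows maps token id -> weight row (only these rows need to be loaded)."""
    return {tid: sum(h * w for h, w in zip(hidden, lm_head_rows[tid]))
            for tid in active_ids}

# Toy 5-token vocabulary with 2-dimensional weight rows (made-up numbers).
lm_head = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 0.5],
           3: [-1.0, 0.5], 4: [0.5, 0.5]}

active = select_vocab(static_ids=[0, 1], input_ids=[3, 3, 1])
logits = reduced_lm_head([2.0, 1.0], lm_head, active)
next_token = max(logits, key=logits.get)
print(active, next_token)
```

Only three of the five weight rows are touched in this example, which is the memory saving the approach is after; a token outside the active set can never be predicted, so the quality of the selection rule matters.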

[NLP-50] Are Checklists Really Useful for Automatic Evaluation of Generative Tasks? EMNLP2025

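[Quick Read]: This paper addresses the difficulty that ambiguous criteria pose for LLM-based automatic evaluation of generative tasks. It systematically studies how checklists should be used: for every question or only selectively. Checklists are generated with six methods and evaluated across eight model sizes. Selective checklist use tends to improve evaluation performance in pairwise comparison, while its benefit in direct scoring is less consistent. Even checklist items that correlate weakly with human scores often reflect human-written criteria, which points to possible inconsistencies in human evaluation. The findings suggest that objective evaluation criteria need to be defined more clearly to guide both human and automatic evaluation.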

Link: https://arxiv.org/abs/2508.15218
Authors: Momoka Furuhashi, Kouta Nakayama, Takashi Kodama, Saku Sugawara
Affiliations: Tohoku University; Research and Development Center for Large Language Models, National Institute of Informatics; National Institute of Informatics
Categories: Computation and Language (cs.CL)
Comments: Accepted to the EMNLP 2025 Main Conference

Abstract:Automatic evaluation of generative tasks using large language models faces challenges due to ambiguous criteria. Although automatic checklist generation is a potentially promising approach, its usefulness remains underexplored. We investigate whether checklists should be used for all questions or selectively, generate them using six methods, evaluate their effectiveness across eight model sizes, and identify checklist items that correlate with human evaluations. Through experiments on pairwise comparison and direct scoring tasks, we find that selective checklist use tends to improve evaluation performance in pairwise settings, while its benefits are less consistent in direct scoring. Our analysis also shows that even checklist items with low correlation to human scores often reflect human-written criteria, indicating potential inconsistencies in human evaluation. These findings highlight the need to more clearly define objective evaluation criteria to guide both human and automatic evaluations. Our code is available at this https URL.

[NLP-51] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall EMNLP2025

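[Quick Read]: This paper addresses the difficulties large language models (LLMs) face in multi-step tool use, including tool selection, parameter generation, and tool-chain planning. Existing methods rely on manually designed task-specific demonstrations or on retrieval from a curated library, which demands substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty grow. The key to the solution is a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool instead of a static or manually curated library. SEER incrementally adds past successful trajectories to the pool, so the pool keeps expanding and model performance improves over time.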

Link: https://arxiv.org/abs/2508.15214
Authors: Sijia Cui, Aiyao He, Shuai Xu, Hongming Zhang, Yanna Wang, Qingyang Zhang, Yajing Wang, Bo Xu
Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Nanjing University of Information Science & Technology; Institute of Computing Technology, Chinese Academy of Sciences
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025

Abstract:Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.
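
The experience-pool mechanism can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the class name `ExperiencePool`, the word-overlap (Jaccard) similarity, and the example tasks and tool calls are all hypothetical stand-ins, and the paper's actual retrieval scoring is not described in the abstract.

```python
# Minimal sketch of a self-growing experience pool with similarity-based recall.
class ExperiencePool:
    def __init__(self):
        self.trajectories = []  # list of (task_text, list_of_steps)

    def add_if_successful(self, task, steps, success):
        """Only successful trajectories are stored for later reuse."""
        if success:
            self.trajectories.append((task, steps))

    def recall(self, query, k=1):
        """Return the k stored trajectories most similar to the query.
        Similarity is word-level Jaccard overlap (an illustrative choice)."""
        def jaccard(a, b):
            sa, sb = set(a.lower().split()), set(b.lower().split())
            return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
        ranked = sorted(self.trajectories,
                        key=lambda t: jaccard(query, t[0]), reverse=True)
        return ranked[:k]

pool = ExperiencePool()
pool.add_if_successful("find flight delay for UA100",
                       ["load flights table", "filter flight == UA100"], True)
pool.add_if_successful("compute average rating of coffee shops",
                       ["load reviews table", "mean of rating"], True)
pool.add_if_successful("find hotel price in Paris", ["guess"], False)  # not stored

top = pool.recall("find flight delay for DL200", k=1)
print(top[0][1])
```

In a stepwise setting, `recall` would be called again at each step with the current state as the query, rather than once per task.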

[NLP-52] Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering EMNLP2025

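[Quick Read]: This paper addresses the weak performance of large language models (LLMs) on domain-specific question answering. Retrieval-Augmented Generation (RAG) suffers from hallucinations and latency caused by noisy retrieval, while continued pretraining is costly and lacks cross-domain flexibility. The authors attribute the problem to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. The key to the solution is Selct2Know (S2K), a framework that internalizes domain knowledge at low cost through an internal-external knowledge self-selection strategy and selective supervised fine-tuning, and that adds a structured reasoning data generation pipeline together with GRPO to strengthen reasoning. On medical, legal, and financial QA benchmarks, S2K consistently outperforms existing methods and matches domain-pretrained LLMs at significantly lower cost.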

Link: https://arxiv.org/abs/2508.15213
Authors: Bolei He, Xinran He, Run Shao, Shanfu Shu, Xianwei Xue, Mingquan Cheng, Haifeng Li, Zhenhua Ling
Affiliations: University of Science and Technology of China; Baidu Inc.; Central South University; Chongqing University
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Findings

Abstract:Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Selct2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.

[NLP-53] SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

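[Quick Read]: This paper addresses the KV cache bottleneck in long-context LLM inference, where memory usage grows linearly with sequence length while attention computation grows quadratically. Existing methods mainly compress the KV cache along the temporal axis (for example, token eviction or merging) and neglect fine-grained importance differences along the feature dimension (the channel axis), which limits how well they balance efficiency and accuracy. The key to the solution is SPARK, a training-free, plug-and-play method that applies unstructured sparsity by pruning KV entries at the channel level and dynamically restores the pruned entries during attention score computation, removing channel-level redundancy. The method is orthogonal to existing KV compression and quantization techniques and can be combined with them for further acceleration. It supports longer sequences within the same memory budget and, at equal sequence length, reduces KV cache storage by over 30% compared with eviction-based methods while preserving or improving accuracy.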

Link: https://arxiv.org/abs/2508.15212
Authors: Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis through strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained importance variations across feature dimensions (i.e., the channel axis), thereby limiting their ability to effectively balance efficiency and model accuracy. In reality, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SPARK, a training-free plug-and-play method that applies unstructured sparsity by pruning KV at the channel level, while dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, making it compatible for integration with them to achieve further acceleration. By reducing channel-level redundancy, SPARK enables processing of longer sequences within the same memory budget. For sequences of equal length, SPARK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even with an aggressive pruning ratio of 80%, SPARK maintains performance with less than 5% degradation compared to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at this https URL.
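
To make channel-level pruning concrete, the toy sketch below keeps only the most salient channels of a single cached key vector and fills in the pruned ones with a constant when computing the attention score. The saliency rule (`|q_c * k_c|`), the mean-value fill-in, and the names `prune_key` and `approx_score` are placeholder assumptions for illustration; they are not the paper's saliency estimate or recovery mechanism, which the abstract does not detail.

```python
# Toy sketch: channel-wise pruning of one key vector with a crude fill-in.
def prune_key(key, query, keep):
    """Keep the `keep` channels with the largest |q_c * k_c|.
    Returns (dict of kept channel -> value, mean of the pruned values)."""
    order = sorted(range(len(key)),
                   key=lambda c: abs(query[c] * key[c]), reverse=True)
    kept_idx = sorted(order[:keep])
    pruned = [key[c] for c in range(len(key)) if c not in kept_idx]
    fill = sum(pruned) / len(pruned) if pruned else 0.0
    return {c: key[c] for c in kept_idx}, fill

def approx_score(query, kept, fill):
    """Dot product using stored channels and the fill value for pruned ones."""
    return sum(q * kept.get(c, fill) for c, q in enumerate(query))

key = [4.0, 0.1, -3.0, 0.2]
query = [1.0, 0.5, 2.0, 0.1]

kept, fill = prune_key(key, query, keep=2)      # store 2 of 4 channels
exact = sum(q * k for q, k in zip(query, key))  # full-precision score
approx = approx_score(query, kept, fill)
print(sorted(kept), exact, approx)
```

In this example half of the channels are dropped and the score changes only slightly, because the dropped channels contributed little for this query.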

[NLP-54] Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

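[Quick Read]: This paper addresses the shortcomings of existing Process Reward Models (PRMs) in finance. Current PRMs are trained mainly on general or STEM data and struggle to evaluate intermediate reasoning steps in financial tasks, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. The key to the solution is Fin-PRM, a domain-specialized, trajectory-aware PRM that combines step-level and trajectory-level reward supervision to evaluate reasoning traces in a fine-grained way that is aligned with financial logic. It is applied in both offline and online reward learning settings and improves downstream reasoning quality and task performance.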

Link: https://arxiv.org/abs/2508.15202
Authors: Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang
Affiliations: Qwen DianJin Team
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at this https URL.
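
Reward-informed Best-of-N selection that mixes step-level and trajectory-level scores can be sketched as follows. The weighted average, the `alpha` parameter, the function names, and the candidate scores are illustrative assumptions; the abstract does not state how Fin-PRM combines the two signals.

```python
# Minimal sketch: pick the best of N candidates by a combined reward.
def combined_reward(step_scores, trajectory_score, alpha=0.5):
    """Blend the mean step-level score with one trajectory-level score."""
    step_avg = sum(step_scores) / len(step_scores)
    return alpha * step_avg + (1 - alpha) * trajectory_score

def best_of_n(candidates, alpha=0.5):
    """Return the candidate with the highest combined reward."""
    return max(candidates,
               key=lambda c: combined_reward(c["steps"], c["traj"], alpha))

# Made-up scores for three sampled reasoning trajectories.
candidates = [
    {"answer": "A", "steps": [0.9, 0.2, 0.8], "traj": 0.5},
    {"answer": "B", "steps": [0.7, 0.8, 0.75], "traj": 0.75},
    {"answer": "C", "steps": [1.0, 1.0, 0.0], "traj": 0.25},
]

best = best_of_n(candidates)
print(best["answer"])
```

Candidate B wins here because it has no weak step and a good overall score, whereas A and C each contain one poorly rated step.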

[NLP-55] LLM4Sweat: A Trustworthy Large Language Model for Hyperhidrosis Support

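[Quick Read]: This paper addresses the performance bottleneck that scarce and unreliable fine-tuning data creates when large language models (LLMs) are applied to the diagnosis and care of rare conditions such as hyperhidrosis. The core solution is LLM4Sweat, an open-source, domain-specific LLM framework with a three-stage pipeline. First, a frontier LLM generates medically plausible synthetic vignettes from curated open-source data to build a diverse and balanced question-answer dataset. Second, an open-source foundation model is fine-tuned on this dataset to provide diagnosis, personalized treatment recommendations, and empathetic psychological support. Third, clinical and psychological specialists assess accuracy, appropriateness, and empathy, and validated responses iteratively enrich the dataset. The work delivers the first open-source LLM framework for hyperhidrosis and offers a generalizable approach for other rare diseases with similar data and trustworthiness challenges.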

Link: https://arxiv.org/abs/2508.15192
Authors: Wenjie Lin, Jin Wei-Kocsis
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:While large language models (LLMs) have shown promise in healthcare, their application for rare medical conditions is still hindered by scarce and unreliable datasets for fine-tuning. Hyperhidrosis, a disorder causing excessive sweating beyond physiological needs, is one such rare disorder, affecting 2-3% of the population and significantly impacting both physical comfort and psychosocial well-being. To date, no work has tailored LLMs to advance the diagnosis or care of hyperhidrosis. To address this gap, we present LLM4Sweat, an open-source and domain-specific LLM framework for trustworthy and empathetic hyperhidrosis support. The system follows a three-stage pipeline. In the data augmentation stage, a frontier LLM generates medically plausible synthetic vignettes from curated open-source data to create a diverse and balanced question-answer dataset. In the fine-tuning stage, an open-source foundation model is fine-tuned on the dataset to provide diagnosis, personalized treatment recommendations, and empathetic psychological support. In the inference and expert evaluation stage, clinical and psychological specialists assess accuracy, appropriateness, and empathy, with validated responses iteratively enriching the dataset. Experiments show that LLM4Sweat outperforms baselines and delivers the first open-source LLM framework for hyperhidrosis, offering a generalizable approach for other rare diseases with similar data and trustworthiness challenges.

[NLP-56] SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling

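[Quick Read]: This paper addresses the fact that existing tokenization methods such as Byte-Pair Encoding (BPE) and WordPiece rely purely on frequency statistics and ignore the semantic structure of text, which leads to over-tokenization of semantically redundant spans and underuse of contextual coherence in long-context scenarios. The key to the solution is SemToken, a semantic-aware tokenization framework that extracts contextual semantic embeddings with lightweight encoders and merges semantically equivalent tokens through local semantic clustering. It then allocates heterogeneous granularity according to semantic density: finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. This reduces token count by up to 2.4× and gives up to a 1.9× speedup with negligible or no degradation in perplexity and downstream accuracy.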

Link: https://arxiv.org/abs/2508.15190
Authors: Dong Liu, Yanxuan Yu
Affiliations: Yale University; Columbia University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) or WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computation efficiency. SemToken first extracts contextual semantic embeddings via lightweight encoders and performs local semantic clustering to merge semantically equivalent tokens. Then, it allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken can be seamlessly integrated with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to 2.4× reduction in token count and 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
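
The merging step can be pictured with a greedy sketch that groups adjacent tokens whose embeddings stay close to the first token of the current group. The cosine threshold, the greedy left-to-right grouping, the function names, and the 2-dimensional toy embeddings are all assumptions for illustration; SemToken's actual clustering and granularity allocation are not specified at this level in the abstract.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def merge_adjacent(tokens, embeddings, threshold=0.9):
    """Greedily group each token with the current span if its embedding is
    within `threshold` cosine similarity of the span's first token."""
    if not tokens:
        return []
    spans = [[tokens[0]]]
    anchor = embeddings[0]
    for tok, emb in zip(tokens[1:], embeddings[1:]):
        if cosine(anchor, emb) >= threshold:
            spans[-1].append(tok)
        else:
            spans.append([tok])
            anchor = emb
    return spans

# Made-up embeddings: the repeated intensifier is nearly identical each time.
tokens = ["very", "very", "very", "good", "movie"]
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [1.0, 1.0]]

spans = merge_adjacent(tokens, embeddings)
print(spans)
```

The five input tokens collapse into three spans, with the repetitive run compressed and the content words kept separate.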

[NLP-57] ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

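[Quick Read]: This paper addresses the challenges that current Large Vision-Language Models (LVLMs) face on complex, multi-turn, visually grounded tasks, including limited reasoning depth, weak sustained contextual understanding, difficulty tracking entities, and inaccurate multi-step instruction following. Existing benchmarks fail to capture the dynamism and intricacy of real multimodal interaction, and models often lose context or hallucinate visual content. The paper introduces MMDR-Bench, a benchmark of 300 multi-turn dialogue scenarios, and CoLVLM Agent, a holistic framework that strengthens the reasoning and instruction following of existing LVLMs through an iterative "memory-perception-planning-execution" cycle without extensive retraining of the underlying models. On MMDR-Bench the framework reaches an average human evaluation score of 4.03, above GPT-4o (3.92) and Gemini 1.5 Pro (3.85), with clear advantages in reasoning depth, instruction adherence, and error suppression, and it remains robust over extended dialogues.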

Link: https://arxiv.org/abs/2508.15164
Authors: Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon
Affiliations: Dongguk University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative “memory-perception-planning-execution” cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.

[NLP-58] Identifying and Answering Questions with False Assumptions: An Interpretable Approach EMNLP2025

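[Quick Read]: This paper addresses the misleading answers that large language models (LLMs) produce, because of hallucination, when a question contains a false assumption. The core challenge is to identify and correctly respond to questions built on false premises, which have no regular answer. The key to the solution is to reduce the problem to fact verification and bring in external evidence to mitigate hallucination: relevant evidence is retrieved for verification, and atomic assumptions are generated and validated, which both improves answers and makes them interpretable by stating exactly which assumptions are false.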

Link: https://arxiv.org/abs/2508.15139
Authors: Zijie Wang, Eduardo Blanco
Affiliations: University of Arizona
Categories: Computation and Language (cs.CL)
Comments: To appear at the EMNLP 2025 Main Conference

Abstract:People often ask questions with false assumptions, a type of question that does not have regular answers. Answering such questions requires first identifying the false assumptions. Large Language Models (LLMs) often generate misleading answers because of hallucinations. In this paper, we focus on identifying and answering questions with false assumptions in several domains. We first investigate whether the problem can be reduced to fact verification. Then, we present an approach leveraging external evidence to mitigate hallucinations. Experiments with five LLMs demonstrate that (1) incorporating retrieved evidence is beneficial and (2) generating and validating atomic assumptions yields more improvements and provides an interpretable answer by specifying the false assumptions.

[NLP-59] aiXiv: A Next-Generation Open Access Ecosystem for Scientific Discovery Generated by AI Scientists

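[Quick Read]: This paper addresses the lack of suitable dissemination channels for AI-generated research content (such as research proposals and papers) in the traditional publication ecosystem. Journals and conferences rely on human peer review, which is hard to scale and often reluctant to accept AI-generated content, while preprint servers such as arXiv lack rigorous quality control. As a result, a significant amount of high-quality AI-generated research has no appropriate venue, which hinders scientific progress. The key to the solution is aiXiv, a next-generation open-access platform for human and AI scientists. Its multi-agent architecture lets research proposals and papers be submitted, reviewed, and iteratively refined by both humans and AI, and its API and MCP interfaces allow heterogeneous human and AI scientists to be integrated seamlessly, creating a scalable and extensible ecosystem for autonomous scientific discovery. Experiments show that iterative revising and reviewing on aiXiv significantly improves the quality of AI-generated proposals and papers.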

Link: https://arxiv.org/abs/2508.15126
Authors: Pengsong Zhang, Xiang Hu, Guowei Huang, Yang Qi, Heng Zhang, Xiuxu Li, Jiaxing Song, Jiabin Luo, Yijiang Li, Shuo Yin, Chengxiao Dai, Eric Hanchen Jiang, Xiaoyan Zhou, Zhenfei Yin, Boqin Yuan, Jing Dong, Guinan Su, Guanren Qiao, Haiming Tang, Anghong Du, Lili Pan, Zhenzhong Lan, Xinyu Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Preprint under review. Code is available at this https URL. Website is available at this https URL

Abstract:Recent advances in large language models (LLMs) have enabled AI agents to autonomously generate scientific proposals, conduct experiments, author papers, and perform peer reviews. Yet this flood of AI-generated research content collides with a fragmented and largely closed publication ecosystem. Traditional journals and conferences rely on human peer review, making them difficult to scale and often reluctant to accept AI-generated research content; existing preprint servers (e.g. arXiv) lack rigorous quality-control mechanisms. Consequently, a significant amount of high-quality AI-generated research lacks appropriate venues for dissemination, hindering its potential to advance scientific progress. To address these challenges, we introduce aiXiv, a next-generation open-access platform for human and AI scientists. Its multi-agent architecture allows research proposals and papers to be submitted, reviewed, and iteratively refined by both human and AI scientists. It also provides API and MCP interfaces that enable seamless integration of heterogeneous human and AI scientists, creating a scalable and extensible ecosystem for autonomous scientific discovery. Through extensive experiments, we demonstrate that aiXiv is a reliable and robust platform that significantly enhances the quality of AI-generated research proposals and papers after iterative revising and reviewing on aiXiv. Our work lays the groundwork for a next-generation open-access ecosystem for AI scientists, accelerating the publication and dissemination of high-quality AI-generated research content. Code is available at this https URL. Website is available at this https URL.

[NLP-60] Open-Universe Assistance Games

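[Quick Read]: This paper addresses how embodied AI agents can infer and act, in an interpretable way, on human goals and preferences that are not predefined in an open-universe setting. Conventional methods assume a fixed or finite set of goals and adapt poorly to the diverse and evolving intents of real users. The paper formalizes the setting as Open-Universe Assistance Games (OU-AGs) and proposes GOOD (GOals from Open-ended Dialogue), a data-efficient online method that extracts goals as natural language during interaction with a human and infers a distribution over them. GOOD prompts a large language model (LLM) to simulate users with different complex intents and uses its responses to perform probabilistic inference over candidate goals. This yields rich goal representations and uncertainty estimates without requiring large offline datasets.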

Link: https://arxiv.org/abs/2508.15119
Authors: Rachel Ma, Jingyi Qu, Andreea Bobu, Dylan Hadfield-Menell
Affiliations: MIT CSAIL (Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory)
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 7 pages + 2 pages references + 7 pages appendix

Abstract:Embodied AI agents must infer and act in an interpretable way on diverse human goals and preferences that are not predefined. To formalize this setting, we introduce Open-Universe Assistance Games (OU-AGs), a framework where the agent must reason over an unbounded and evolving space of possible goals. In this context, we introduce GOOD (GOals from Open-ended Dialogue), a data-efficient, online method that extracts goals in the form of natural language during an interaction with a human, and infers a distribution over natural language goals. GOOD prompts an LLM to simulate users with different complex intents, using its responses to perform probabilistic inference over candidate goals. This approach enables rich goal representations and uncertainty estimation without requiring large offline datasets. We evaluate GOOD in a text-based grocery shopping domain and in a text-operated simulated household robotics environment (AI2Thor), using synthetic user profiles. Our method outperforms a baseline without explicit goal tracking, as confirmed by both LLM-based and human evaluations.
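
The probabilistic inference over candidate goals can be illustrated with a single Bayesian update. In this sketch the likelihood values are hard-coded stand-ins for what an LLM-simulated user would provide (how likely the observed utterance is under each goal); the function name, the goals, and the numbers are hypothetical.

```python
# Minimal sketch: one Bayesian update of a belief over natural-language goals.
def update_goal_posterior(prior, likelihood):
    """posterior(g) is proportional to prior(g) * P(utterance | g).
    Falls back to the prior if every goal has zero likelihood."""
    unnorm = {g: prior[g] * likelihood.get(g, 0.0) for g in prior}
    total = sum(unnorm.values())
    if total == 0:
        return dict(prior)
    return {g: p / total for g, p in unnorm.items()}

# Two candidate goals with a uniform prior (illustrative).
prior = {"vegetarian dinner": 0.5, "quick snack": 0.5}
# Stand-in for simulated-user scores of the utterance "I need some tofu".
likelihood = {"vegetarian dinner": 0.75, "quick snack": 0.25}

posterior = update_goal_posterior(prior, likelihood)
print(posterior)
```

Repeating the update after each dialogue turn keeps an explicit distribution over goals, which is what gives the agent an uncertainty estimate to act on.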

[NLP-61] LLMs and Agentic AI in Insurance Decision-Making: Opportunities and Challenges For Africa

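[Quick Read]: This paper examines the opportunities and challenges of applying artificial intelligence (AI), particularly large language models (LLMs) and agentic AI, in the insurance sector, with a focus on critical gaps in the African insurance market. Its central proposal is a collaborative effort among actuaries, insurers, regulators, and technology leaders to create inclusive, sustainable, and equitable AI strategies and solutions, by and for Africans.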

Link: https://arxiv.org/abs/2508.15110
Authors: Graham Hill, JingYuan Gong, Thulani Babeli, Moseli Mots’oehli, James Gachomo Wanjiku
Affiliations: The Shard South Africa; University of Hawai‘i at Manoa; Elenjical Solutions South Africa
Categories: Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Emerging Technologies (cs.ET); Applications (stat.AP)
Comments:

Abstract:In this work, we highlight the transformative potential of Artificial Intelligence (AI), particularly Large Language Models (LLMs) and agentic AI, in the insurance sector. We consider and emphasize the unique opportunities, challenges, and potential pathways in insurance amid rapid performance improvements, increased open-source access, decreasing deployment costs, and the complexity of LLM or agentic AI frameworks. To bring it closer to home, we identify critical gaps in the African insurance market and highlight key local efforts, players, and partnership opportunities. Finally, we call upon actuaries, insurers, regulators, and tech leaders to a collaborative effort aimed at creating inclusive, sustainable, and equitable AI strategies and solutions: by and for Africans.

[NLP-62] Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

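[Quick Read]: This paper addresses the degraded quality of existing math pretraining datasets built from Common Crawl, which suffer from brittle extraction heuristics, lossy HTML-to-text conversion, and unreliable preservation of mathematical structure. The core solution is a domain-agnostic pipeline for robust scientific text extraction that combines layout-aware rendering (with lynx) and a targeted LLM-based cleaning stage. It recovers math across formats such as MathJax, KaTeX, and MathML, preserves the structural integrity of equations and code blocks, removes boilerplate, standardizes notation into LaTeX, and corrects inconsistencies. The result is two large, high-quality corpora, Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens), which surpass prior open math pretraining datasets and yield gains in math, code, and general-domain performance.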

Link: https://arxiv.org/abs/2508.15096
Authors: Rabeeh Karimi Mahabadi, Sanjeev Satheesh, Shrimai Prabhumoye, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Affiliations: NVIDIA; Boston University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets (including MegaMath, FineMath, and OpenWebMath) but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
Cite as: arXiv:2508.15096 [cs.CL]; submitted Wed, 20 Aug 2025 22:16:57 UTC (v1) by Rabeeh Karimi Mahabadi

[NLP-63] Mapping the Course for Prompt-based Structured Prediction

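Quick read: This paper addresses the hallucinations and weak complex reasoning that large language models (LLMs) exhibit on structured prediction tasks, which the authors attribute to their autoregressive nature. The key idea is to combine LLMs with combinatorial inference: symbolic inference is layered on top of prompting to enforce structural consistency, and different prompting strategies are compared for how well they estimate the LLM confidence values that inference consumes. Experiments show that adding symbolic inference yields more consistent and accurate predictions regardless of the prompting strategy, and that calibration and fine-tuning with structured prediction objectives further improve performance on challenging tasks, indicating that structured learning remains valuable in the era of LLMs.

Link: https://arxiv.org/abs/2508.15090
Authors: Matt Pauk,Maria Leonor Pacheco
Affiliation: University of Colorado Boulder
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: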

Abstract:LLMs have been shown to be useful for a variety of language tasks, without requiring task-specific fine-tuning. However, these models often struggle with hallucinations and complex reasoning problems due to their autoregressive nature. We propose to address some of these issues, specifically in the area of structured prediction, by combining LLMs with combinatorial inference in an attempt to marry the predictive power of LLMs with the structural consistency provided by inference methods. We perform exhaustive experiments in an effort to understand which prompting strategies can effectively estimate LLM confidence values for use with symbolic inference, and show that, regardless of the prompting strategy, the addition of symbolic inference on top of prompting alone leads to more consistent and accurate predictions. Additionally, we show that calibration and fine-tuning using structured prediction objectives leads to increased performance for challenging tasks, showing that structured learning is still valuable in the era of LLMs.
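
The idea of layering symbolic inference on top of LLM confidence scores can be illustrated with a toy sketch. Everything below is assumed for illustration: the labels, the confidence values, the "at most one claim" constraint, and the brute-force search (a real system would more likely use an ILP or dynamic-programming solver). It is not the authors' implementation.

```python
from itertools import product

def constrained_decode(scores, labels, is_consistent):
    """Return the highest-scoring joint labeling that satisfies the constraint.

    scores: one dict per item, mapping label -> confidence (hypothetical values).
    is_consistent: predicate over a full assignment (tuple of labels).
    """
    best, best_score = None, float("-inf")
    for assignment in product(labels, repeat=len(scores)):
        if not is_consistent(assignment):
            continue  # drop structurally invalid outputs
        total = sum(s[label] for s, label in zip(scores, assignment))
        if total > best_score:
            best, best_score = assignment, total
    return best

# Hypothetical confidences for two text spans.
scores = [{"claim": 0.9, "premise": 0.1}, {"claim": 0.6, "premise": 0.4}]

def at_most_one_claim(assignment):
    return assignment.count("claim") <= 1

print(constrained_decode(scores, ["claim", "premise"], at_most_one_claim))
```

Picking the most confident label for each span independently would give two claims and violate the constraint; the joint search keeps the stronger claim and relabels the other span as a premise.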

[NLP-64] LongRecall: A Structured Approach for Robust Recall Evaluation in Long-Form Text

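Quick read: This paper addresses inaccurate recall evaluation for generative AI in long-form question answering (long-form QA). Existing recall metrics rely on lexical overlap and therefore misjudge unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods lack structured verification and remain prone to misalignment and hallucination. The proposed LongRecall framework is a three-stage evaluation procedure: it decomposes answers into self-contained facts, successively narrows candidate matches through lexical and semantic filtering, and verifies alignment with structured entailment checks. This reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, providing a building block for systematic recall assessment.

Link: https://arxiv.org/abs/2508.15085
Authors: MohamamdJavad Ardestani,Ehsan Kamalloo,Davood Rafiei
Affiliation: University of Alberta; ServiceNow Research
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: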

Abstract:LongRecall. The completeness of machine-generated text, ensuring that it captures all relevant information, is crucial in domains such as medicine and law and in tasks like list-based question answering (QA), where omissions can have serious consequences. However, existing recall metrics often depend on lexical overlap, leading to errors with unsubstantiated entities and paraphrased answers, while LLM-as-a-Judge methods with long holistic prompts capture broader semantics but remain prone to misalignment and hallucinations without structured verification. We introduce LongRecall, a general three-stage recall evaluation framework that decomposes answers into self-contained facts, successively narrows plausible candidate matches through lexical and semantic filtering, and verifies their alignment through structured entailment checks. This design reduces false positives and false negatives while accommodating diverse phrasings and contextual variations, serving as a foundational building block for systematic recall assessment. We evaluate LongRecall on three challenging long-form QA benchmarks using both human annotations and LLM-based judges, demonstrating substantial improvements in recall accuracy over strong lexical and LLM-as-a-Judge baselines.

[NLP-65] Don't Think Twice! Over-Reasoning Impairs Confidence Calibration ICML2025

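Quick read: This paper studies the poor calibration of large language models (LLMs) used as question answering tools, which tend toward overconfidence on knowledge-intensive tasks. It finds that the "test-time scaling" paradigm does not help here: increasing the reasoning budget consistently impairs calibration and produces systematic overconfidence. Recent reasoning LLMs reach 48.7% accuracy in assessing expert confidence, whereas search-augmented generation, which retrieves relevant evidence, reaches 89.3%. The results suggest that information access, rather than reasoning depth or inference budget, may be the critical bottleneck for confidence calibration on such tasks.

Link: https://arxiv.org/abs/2508.15050
Authors: Romain Lacombe,Kerrie Wu,Eddie Dilworth
Affiliation: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Published at ICML 2025 Workshop on Reliable and Responsible Foundation Models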

Abstract:Large Language Models deployed as question answering tools require robust calibration to avoid overconfidence. We systematically evaluate how reasoning capabilities and budget affect confidence assessment accuracy, using the ClimateX dataset (Lacombe et al., 2023) and expanding it to human and planetary health. Our key finding challenges the “test-time scaling” paradigm: while recent reasoning LLMs achieve 48.7% accuracy in assessing expert confidence, increasing reasoning budgets consistently impairs rather than improves calibration. Extended reasoning leads to systematic overconfidence that worsens with longer thinking budgets, producing diminishing and negative returns beyond modest computational investments. Conversely, search-augmented generation dramatically outperforms pure reasoning, achieving 89.3% accuracy by retrieving relevant evidence. Our results suggest that information access, rather than reasoning depth or inference budget, may be the critical bottleneck for improved confidence calibration of knowledge-intensive tasks.

[NLP-66] Reward-Shifted Speculative Sampling Is An Efficient Test-Time Weak-to-Strong Aligner

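Quick read: This paper targets the high inference cost of test-time alignment, which limits its practical use. The key is the reward-Shifted Speculative Sampling (SSS) algorithm: a small draft model is aligned with human preferences while the target model is left unchanged. By exploiting the distributional shift between the aligned draft model and the unaligned target model, and by modifying the acceptance criterion and the bonus token distribution, the method recovers the RLHF (Reinforcement Learning from Human Feedback) optimal solution without actually obtaining it, achieving superior gold reward scores at significantly reduced inference cost.

Link: https://arxiv.org/abs/2508.15044
Authors: Bolian Li,Yanran Wu,Xinyu Luo,Ruqi Zhang
Affiliation: Purdue University
Subjects: Computation and Language (cs.CL)
Comments: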

Abstract:Aligning large language models (LLMs) with human preferences has become a critical step in their development. Recent research has increasingly focused on test-time alignment, where additional compute is allocated during inference to enhance LLM safety and reasoning capabilities. However, these test-time alignment techniques often incur substantial inference costs, limiting their practical application. We are inspired by the speculative sampling acceleration, which leverages a small draft model to efficiently predict future tokens, to address the efficiency bottleneck of test-time alignment. We introduce the reward-Shifted Speculative Sampling (SSS) algorithm, in which the draft model is aligned with human preferences, while the target model remains unchanged. We theoretically demonstrate that the distributional shift between the aligned draft model and the unaligned target model can be exploited to recover the RLHF optimal solution without actually obtaining it, by modifying the acceptance criterion and bonus token distribution. Our algorithm achieves superior gold reward scores at a significantly reduced inference cost in test-time weak-to-strong alignment experiments, thereby validating both its effectiveness and efficiency.
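
For context, the acceptance rule that SSS modifies comes from standard speculative sampling. The sketch below shows only that unmodified baseline for a single token, assuming both models expose full next-token probability vectors; the function name and the toy distributions are hypothetical, and the paper's reward-shifted acceptance criterion and bonus token distribution are not reproduced because the abstract does not specify them.

```python
import random

def speculative_step(p, q, rng):
    """One token of standard speculative sampling.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    Returns (token, accepted).
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]          # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):      # accept with prob min(1, p/q)
        return x, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False

# Toy distributions over a 3-token vocabulary (illustrative values).
p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]
rng = random.Random(0)
samples = [speculative_step(p, q, rng)[0] for _ in range(20000)]
print([round(samples.count(t) / len(samples), 2) for t in range(3)])
```

The printed frequencies should sit close to `p`, which is the property that makes the baseline lossless with respect to the target model; SSS deliberately changes the rule so that the output distribution shifts toward the aligned solution instead.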

[NLP-67] Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems EMNLP2025

Quick read: This paper addresses two gaps in conversational explainable AI (ConvXAI) systems: limited multilingual generalization and limited support for free-form custom inputs. First, it introduces MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Second, it proposes a new parsing approach aimed at improving multilingual parsing performance. Third, it presents Compass, a multilingual dataset for custom input extraction in ConvXAI systems, and evaluates several LLMs and BERT-type models on it in monolingual, cross-lingual, and multilingual settings.

Link: https://arxiv.org/abs/2508.14982
Authors: Qianli Wang,Tatiana Anikina,Nils Feldhus,Simon Ostermann,Fedor Splitt,Jiaao Li,Yoana Tsoneva,Sebastian Möller,Vera Schmitt
Affiliation: Technische Universität Berlin; German Research Center for Artificial Intelligence (DFKI); Saarland Informatics Campus; Centre for European Research in Trusted AI (CERTAIN); BIFOLD – Berlin Institute for the Foundations of Learning and Data
Subjects: Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 Findings, camera-ready version

Abstract:Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems often are based on intent recognition to accurately identify the user’s desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users’ underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.

[NLP-68] Improving LLMs for Machine Translation Using Synthetic Preference Data ECAI2025

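Quick read: This paper asks how a general instruction-tuned large language model (LLM) can be improved for machine translation (MT) using relatively few, easily produced data resources. With Slovene as the use case, the key is Direct Preference Optimization (DPO) training of GaMS-9B-Instruct on a programmatically curated and enhanced subset of a public dataset: English Wikipedia articles are translated by two LLMs (GaMS-9B-Instruct and EuroLLM-9B-Instruct), and the translations are ranked with heuristics coupled with automatic metrics such as COMET to form preference pairs. The fine-tuned model outperforms both models used for data generation, with COMET gains of around 0.04 and 0.02 respectively on Wikipedia article translation, and avoids language and formatting errors more consistently.

Link: https://arxiv.org/abs/2508.14951
Authors: Dario Vajda,Domen Vreš,Marko Robnik-Šikonja
Affiliation: University of Ljubljana, Faculty of Computer and Information Science
Subjects: Computation and Language (cs.CL)
Comments: Paper with individual presentation at LUHME workshop at ECAI 2025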

Abstract:Large language models have emerged as effective machine translation systems. In this paper, we explore how a general instruction-tuned large language model can be improved for machine translation using relatively few easily produced data resources. Using Slovene as a use case, we improve the GaMS-9B-Instruct model using Direct Preference Optimization (DPO) training on a programmatically curated and enhanced subset of a public dataset. As DPO requires pairs of quality-ranked instances, we generated its training dataset by translating English Wikipedia articles using two LLMs, GaMS-9B-Instruct and EuroLLM-9B-Instruct. We ranked the resulting translations based on heuristics coupled with automatic evaluation metrics such as COMET. The evaluation shows that our fine-tuned model outperforms both models involved in the dataset generation. In comparison to the baseline models, the fine-tuned model achieved a COMET score gain of around 0.04 and 0.02, respectively, on translating Wikipedia articles. It also more consistently avoids language and formatting errors.
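
The DPO objective behind this kind of preference training is compact enough to write out. This is a minimal sketch for a single preference pair, assuming the sequence log-probabilities under the policy and under a frozen reference model have already been computed; the function name, the argument names, and `beta=0.1` are illustrative choices, not values reported by the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, from sequence log-probabilities."""
    # How much more the policy prefers the chosen output than the reference does.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written as a numerically stable softplus.
    z = -beta * margin
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Policy has moved toward the chosen translation: loss drops below log(2).
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

In the paper's setting the "chosen" and "rejected" sequences would be the higher-ranked and lower-ranked translations of the same source text.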

[NLP-69] Robust Symbolic Reasoning for Visual Narratives via Hierarchical and Semantically Normalized Knowledge Graphs

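Quick read: This paper addresses inconsistency and redundancy in symbolic narrative graphs for visual narratives such as comics, where similar actions or events receive different labels across annotations or contexts, limiting reasoning and generalization. The key is a cognitively grounded semantic normalization framework that consolidates semantically related actions and events using lexical similarity and embedding-based clustering, aligning symbolic categories across narrative levels (panel, event, and story) while preserving the interpretability of the graphs. Preliminary evaluations indicate that the approach reduces annotation noise and improves coherence and robustness on narrative reasoning tasks.

Link: https://arxiv.org/abs/2508.14941
Authors: Yi-Chun Chen
Affiliation: Yale University
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
Comments: 12 pages, 4 figures, 2 tables. Extends our earlier framework on hierarchical narrative graphs with a semantic normalization module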

Abstract:Understanding visual narratives such as comics requires structured representations that capture events, characters, and their relations across multiple levels of story organization. However, symbolic narrative graphs often suffer from inconsistency and redundancy, where similar actions or events are labeled differently across annotations or contexts. Such variance limits the effectiveness of reasoning and generalization. This paper introduces a semantic normalization framework for hierarchical narrative knowledge graphs. Building on cognitively grounded models of narrative comprehension, we propose methods that consolidate semantically related actions and events using lexical similarity and embedding-based clustering. The normalization process reduces annotation noise, aligns symbolic categories across narrative levels, and preserves interpretability. We demonstrate the framework on annotated manga stories from the Manga109 dataset, applying normalization to panel-, event-, and story-level graphs. Preliminary evaluations across narrative reasoning tasks, such as action retrieval, character grounding, and event summarization, show that semantic normalization improves coherence and robustness, while maintaining symbolic transparency. These findings suggest that normalization is a key step toward scalable, cognitively inspired graph models for multimodal narrative understanding.

[NLP-70] Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages

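Quick read: This paper addresses multilingual and culturally grounded math word problems (MWPs) in low-resource languages. Existing multilingual math reasoning benchmarks are mostly produced by translation rather than localization, so they retain English-centric entities (person names, organization names, currencies) and can obscure true multilingual math ability. The key is an LLM-driven cultural localization framework that automatically constructs datasets with native names, organizations, and currencies from existing sources, which helps mitigate English-centric entity bias and improves robustness when native entities are introduced.

Link: https://arxiv.org/abs/2508.14913
Authors: Israel Abebe Azime,Tadesse Destaw Belay,Dietrich Klakow,Philipp Slusallek,Anshuman Chhabra
Affiliation: Saarland University; Instituto Politécnico Nacional; University of South Florida
Subjects: Computation and Language (cs.CL)
Comments: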

Abstract:Large language models (LLMs) have demonstrated significant capabilities in solving mathematical problems expressed in natural language. However, multilingual and culturally-grounded mathematical reasoning in low-resource languages lags behind English due to the scarcity of socio-cultural task datasets that reflect accurate native entities such as person names, organization names, and currencies. Existing multilingual benchmarks are predominantly produced via translation and typically retain English-centric entities, owing to the high cost associated with human annotater-based localization. Moreover, automated localization tools are limited, and hence, truly localized datasets remain scarce. To bridge this gap, we introduce a framework for LLM-driven cultural localization of math word problems that automatically constructs datasets with native names, organizations, and currencies from existing sources. We find that translated benchmarks can obscure true multilingual math ability under appropriate socio-cultural contexts. Through extensive experiments, we also show that our framework can help mitigate English-centric entity bias and improves robustness when native entities are introduced across various languages.

[NLP-71] Preliminary Ranking of WMT25 General Machine Translation Systems

Quick read: This report presents the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems were evaluated with automatic metrics. Because automatic metrics may favor systems that use re-ranking techniques such as Quality Estimation re-ranking or Minimum Bayes Risk decoding, the ranking may be biased. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking; the preliminary results are shared mainly to help participants prepare their system submission papers.

Link: https://arxiv.org/abs/2508.14909
Authors: Tom Kocmi,Eleftherios Avramidis,Rachel Bawden,Ondřej Bojar,Konstantin Dranch,Anton Dvorkovich,Sergey Dukanov,Natalia Fedorova,Mark Fishel,Markus Freitag,Thamme Gowda,Roman Grundkiewicz,Barry Haddow,Marzena Karpinska,Philipp Koehn,Howard Lakougna,Jessica Lundin,Kenton Murray,Masaaki Nagata,Stefano Perrella,Lorenzo Proietti,Martin Popel,Maja Popović,Parker Riley,Mariya Shmatova,Steinþór Steingrímsson,Lisa Yankovskaya,Vilém Zouhar
Affiliation: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:We present the preliminary ranking of the WMT25 General Machine Translation Shared Task, in which MT systems have been evaluated using automatic metrics. As this ranking is based on automatic evaluations, it may be biased in favor of systems that employ re-ranking techniques, such as Quality Estimation re-ranking or Minimum Bayes Risk decoding. The official WMT25 ranking will be based on human evaluation, which is more reliable and will supersede the automatic ranking. The purpose of this report is not to present the final findings of the General MT task, but rather to share preliminary results with task participants, which may be useful when preparing their system submission papers.
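
Why re-ranking can bias an automatic ranking is easier to see with a small example of Minimum Bayes Risk (MBR) decoding. The sketch below is a generic illustration, not any WMT25 system: the token-overlap utility is a toy stand-in for a learned metric, and the candidate translations are made up.

```python
def token_f1(hyp, ref):
    """Toy utility: F1 over unique tokens (a stand-in for a learned metric)."""
    h, r = set(hyp.split()), set(ref.split())
    overlap = len(h & r)
    return 0.0 if overlap == 0 else 2 * overlap / (len(h) + len(r))

def mbr_select(candidates, utility):
    """Pick the candidate with the highest average utility against the others.

    Needs at least two candidates; the other samples act as pseudo-references.
    """
    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]

candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog stood there",
]
print(mbr_select(candidates, token_f1))
```

Because the output is chosen to maximize a metric, a system that decodes this way with the same or a closely related metric to the one used for scoring can gain an advantage that human evaluators may not confirm.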

[NLP-72] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

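Quick read: This paper addresses two limitations of current content safety methods for large language models (LLMs): approaches such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) often rely on multi-stage training pipelines, and they lack fine-grained, post-deployment controllability. The key is a unified co-training framework that integrates three safety behaviors within a single SFT stage: positive (lawful/prosocial), negative (unfiltered/risk-prone), and rejective (refusal-oriented/conservative). Each behavior is activated at inference time by a simple system-level instruction, or "magic token", so the mode can be switched without retraining. According to the authors, the strategy induces a distinct Safety Alignment Margin in the output space, which serves as empirical evidence of safety robustness, while reducing training complexity and deployment cost.

Link: https://arxiv.org/abs/2508.14904
Authors: Jianfeng Si,Lin Sun,Zhewen Tan,Xiangzheng Zhang
Affiliation: Qiyuan Tech
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 12 pages,5 figures,4 tables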

Abstract:Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model’s safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.

[NLP-73] Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge

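Quick read: This paper describes the multilingual automatic speech recognition (ASR) system built by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The system has three components: a frozen Whisper-large-v3 based speech encoder that provides robust acoustic features from large-scale pretraining; a trainable adaptor module with a Linear-ReLU-Linear structure that aligns speech and text representations; and a frozen Qwen2.5-7B-Instruct LLM with trainable LoRA for contextual linguistic decoding. Combining pretrained models with task-specific fine-tuning, the system achieves a word/character error rate (WER/CER) of 9.83% across 11 languages on the evaluation set and ranked third among participants.

Link: https://arxiv.org/abs/2508.14916
Authors: Xiaoxiao Li,An Zhu,Youhai Jiang,Fengjie Zhu
Affiliation: Unknown
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: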

Abstract:This paper presents the architecture and performance of a novel Multilingual Automatic Speech Recognition (ASR) system developed by the Transsion Speech Team for Track 1 of the MLC-SLM 2025 Challenge. The proposed system comprises three key components: 1) a frozen Whisper-large-v3 based speech encoder, leveraging large-scale pretraining to ensure robust acoustic feature extraction; 2) a trainable adaptor module using Linear-ReLU-Linear transformation mechanisms to effectively align speech and text representations; and 3) a frozen Qwen2.5-7B-Instruct large language model (LLM) integrated with trainable LoRA for optimized contextual linguistic decoding. By systematically combining pretrained models with task specific fine-tuning, the system achieved a word/character error rate (WER/CER) of 9.83% across 11 languages in the evaluation set and ranked third place among global participants.
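
The Linear-ReLU-Linear adaptor is the only fully trainable bridge between the frozen encoder and the LLM, so its shape is worth making concrete. The sketch below is a dependency-free toy, assuming one encoder frame is projected at a time; the dimensions, the random initialization, and all names are hypothetical and far smaller than the real model widths, which the abstract does not give.

```python
import random

def linear(x, w, b):
    """y = W x + b for one feature vector x (w is a list of rows)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def adaptor(x, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear projection of one speech-encoder frame."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)

def init(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
enc_dim, hid_dim, llm_dim = 8, 16, 12   # toy sizes, not the real model widths
w1, b1 = init(hid_dim, enc_dim, rng), [0.0] * hid_dim
w2, b2 = init(llm_dim, hid_dim, rng), [0.0] * llm_dim

frame = [rng.gauss(0, 1) for _ in range(enc_dim)]   # stand-in encoder output
projected = adaptor(frame, w1, b1, w2, b2)
print(len(projected))  # 12
```

The output width matches the (toy) LLM embedding size, which is what lets projected speech frames be fed to the language model alongside text embeddings.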

[NLP-74] A Chinese Heart Failure Status Speech Database with Universal and Personalised Classification

Quick read: This paper asks whether Chinese speech carries features usable for heart failure (HF) detection, a question previously studied mainly in other languages. The key is the first Chinese speech database of HF patients, with paired recordings taken before and after hospitalisation, evaluated with both standard "patient-wise" and personalised "pair-wise" classification; the latter serves as a speaker-decoupled baseline for future research. Statistical tests and classification results point to individual differences as a key contributor to inaccuracy, and an adaptive frequency filter (AFF) is proposed for frequency importance analysis.

Link: https://arxiv.org/abs/2508.14908
Authors: Yue Pan,Liwei Liu,Changxin Li,Xinyao Wang,Yili Xia,Hanyue Zhang,Ming Chu
Affiliation: School of Information Science and Technology; Advanced Computing and Storage Laboratory, 2012 Laboratories; Institute of High Performance Computing; Taizhou School of Clinical Medicine
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments:

Abstract:Speech is a cost-effective and non-intrusive data source for identifying acute and chronic heart failure (HF). However, there is a lack of research on whether Chinese syllables contain HF-related information, as observed in other well-studied languages. This study presents the first Chinese speech database of HF patients, featuring paired recordings taken before and after hospitalisation. The findings confirm the effectiveness of the Chinese language in HF detection using both standard ‘patient-wise’ and personalised ‘pair-wise’ classification approaches, with the latter serving as an ideal speaker-decoupled baseline for future research. Statistical tests and classification results highlight individual differences as key contributors to inaccuracy. Additionally, an adaptive frequency filter (AFF) is proposed for frequency importance analysis. The data and demonstrations are published at this https URL.

Computer Vision

[CV-0] CineScale: Free Lunch in High-Resolution Cinematic Visual Generation ICCV2025

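Quick read: This paper addresses the limited high-resolution capability of visual diffusion models: because they are trained at limited resolutions owing to scarce high-resolution data and constrained compute, they tend to produce low-quality content with repetitive patterns when generating beyond the training resolution. The key is CineScale, a new inference paradigm with dedicated variants for each of the two types of video generation architectures, designed to counter the error accumulation caused by the increase in high-frequency information. Beyond high-resolution T2I and T2V generation, it also enables high-resolution I2V and V2V synthesis. The method generates 8K images without any fine-tuning and 4K video with only minimal LoRA fine-tuning.

Link: https://arxiv.org/abs/2508.15774
Authors: Haonan Qiu,Ning Yu,Ziqi Huang,Paul Debevec,Ziwei Liu
Affiliation: Netflix Eyeline Studios; Nanyang Technological University
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: CineScale is an extended work of FreeScale (ICCV 2025). Project Page: this https URL , Code Repo: this https URL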

Abstract:Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to exhibit the untapped potential higher-resolution visual generation of pre-trained models. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns deriving from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: this https URL.

[CV-1] Scaling Group Inference for Diverse and High-Quality Generation WWW

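Quick read: This paper addresses the redundancy that arises when generative models sample outputs independently: the several images (e.g., 4-8) shown to a user for one prompt tend to be similar, limiting choice and idea exploration. The key is a scalable group inference method that formulates multi-sample selection as a quadratic integer assignment problem: candidate outputs are graph nodes, and a subset is chosen to optimize per-sample quality (unary term) together with group diversity (binary term). Progressive pruning of the candidate set using intermediate predictions substantially improves runtime efficiency and lets the method scale to large candidate sets. The framework applies to text-to-image, image-to-image, image prompting, and video generation.

Link: https://arxiv.org/abs/2508.15773
Authors: Gaurav Parmar,Or Patashnik,Daniil Ostashev,Kuan-Chieh Wang,Kfir Aberman,Srinivasa Narasimhan,Jun-Yan Zhu
Affiliation: Carnegie Mellon University; Snap Research; Tel Aviv University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR); Machine Learning (cs.LG)
Comments: Project website: this https URL , GitHub: this https URL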

Abstract:Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we progressively prune the candidate set using intermediate predictions, allowing our method to scale up to large candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
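
The unary-plus-binary objective can be made concrete with a small selection routine. This sketch uses a greedy heuristic as a stand-in for the paper's quadratic integer assignment solver and omits progressive pruning entirely; the quality scores, the distance matrix, and the `weight` parameter are made-up illustrative inputs.

```python
def select_group(quality, distance, k, weight=1.0):
    """Greedily pick k candidates, trading off quality and pairwise diversity.

    quality[i]: unary score of candidate i.
    distance[i][j]: pairwise dissimilarity between candidates i and j.
    """
    n = len(quality)
    chosen = [max(range(n), key=lambda i: quality[i])]   # start from the best sample
    while len(chosen) < k:
        def gain(i):
            # Unary term plus binary term against everything already selected.
            return quality[i] + weight * sum(distance[i][j] for j in chosen)
        remaining = [i for i in range(n) if i not in chosen]
        chosen.append(max(remaining, key=gain))
    return chosen

# Candidates 0 and 1 are near-duplicates; 2 and 3 look different.
quality = [0.90, 0.88, 0.50, 0.70]
distance = [
    [0.00, 0.05, 0.90, 0.60],
    [0.05, 0.00, 0.90, 0.60],
    [0.90, 0.90, 0.00, 0.50],
    [0.60, 0.60, 0.50, 0.00],
]
print(select_group(quality, distance, k=2))
```

Ranking by quality alone would return the two near-duplicates (indices 0 and 1); adding the pairwise term swaps the second pick for a visually distinct candidate.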

[CV-2] Visual Autoregressive Modeling for Instruction-Guided Image Editing

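Quick read: This paper addresses a weakness of diffusion models in instruction-guided image editing: the global denoising process entangles the edited region with the whole image context, causing unintended spurious modifications and weaker adherence to instructions. The key is VAREdit, a visual autoregressive (VAR) framework that recasts image editing as next-scale prediction of multi-scale target features, together with a Scale-Aligned Reference (SAR) module that injects scale-matched conditioning information into the first self-attention layer. This mitigates the problem that finest-scale source features cannot effectively guide the prediction of coarser target features, improving both editing adherence and efficiency.

Link: https://arxiv.org/abs/2508.15772
Authors: Qingyang Mao,Qi Cai,Yehao Li,Yingwei Pan,Mingyue Cheng,Ting Yao,Qi Liu,Tao Mei
Affiliation: HiDream.ai
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Source codes and models are available at this https URL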

Abstract:Recent advances in diffusion models have brought remarkable visual fidelity to instruction-guided image editing. However, their global denoising process inherently entangles the edited region with the entire image context, leading to unintended spurious modifications and compromised adherence to editing instructions. In contrast, autoregressive models offer a distinct paradigm by formulating image synthesis as a sequential process over discrete visual tokens. Their causal and compositional mechanism naturally circumvents the adherence challenges of diffusion-based methods. In this paper, we present VAREdit, a visual autoregressive (VAR) framework that reframes image editing as a next-scale prediction problem. Conditioned on source image features and text instructions, VAREdit generates multi-scale target features to achieve precise edits. A core challenge in this paradigm is how to effectively condition the source image tokens. We observe that finest-scale source features cannot effectively guide the prediction of coarser target features. To bridge this gap, we introduce a Scale-Aligned Reference (SAR) module, which injects scale-matched conditioning information into the first self-attention layer. VAREdit demonstrates significant advancements in both editing adherence and efficiency. On standard benchmarks, it outperforms leading diffusion-based methods by 30%+ higher GPT-Balance score. Moreover, it completes a 512×512 editing in 1.2 seconds, making it 2.2× faster than the similarly sized UltraEdit. The models are available at this https URL.

[CV-3] SceneGen: Single-Image 3D Scene Generation in One Feedforward Pass

Quick read: This paper tackles the synthesis of multiple 3D assets, with geometry and texture, from a single scene image, without optimization or asset retrieval. The key is the SceneGen framework: a novel feature aggregation module integrates local and global scene information from visual and geometric encoders and, coupled with a position head, generates the 3D assets and their relative spatial positions in a single feedforward pass.

Link: https://arxiv.org/abs/2508.15769
Authors: Yanxu Meng,Haoning Wu,Ya Zhang,Weidi Xie
Affiliation: Shanghai Jiao Tong University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Technical Report; Project Page: this https URL

Abstract:3D content generation has recently attracted significant research interest due to its applications in VR/AR and embodied AI. In this work, we address the challenging task of synthesizing multiple 3D assets within a single scene image. Concretely, our contributions are fourfold: (i) we present SceneGen, a novel framework that takes a scene image and corresponding object masks as input, simultaneously producing multiple 3D assets with geometry and texture. Notably, SceneGen operates with no need for optimization or asset retrieval; (ii) we introduce a novel feature aggregation module that integrates local and global scene information from visual and geometric encoders within the feature extraction module. Coupled with a position head, this enables the generation of 3D assets and their relative spatial positions in a single feedforward pass; (iii) we demonstrate SceneGen’s direct extensibility to multi-image input scenarios. Despite being trained solely on single-image inputs, our architectural design enables improved generation performance with multi-image inputs; and (iv) extensive quantitative and qualitative evaluations confirm the efficiency and robust generation abilities of our approach. We believe this paradigm offers a novel solution for high-quality 3D content generation, potentially advancing its practical applications in downstream tasks. The code and model will be publicly available at: this https URL.

[CV-4] ATLAS: Decoupling Skeletal and Shape Parameters for Expressive Parametric Human Modeling ICCV2025

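Quick read: This paper addresses the limitations of existing parametric human mesh models in capturing detailed variation across diverse body poses and shapes, which stem largely from limited training-data diversity and overly restrictive modeling assumptions. In addition, the common paradigm first optimizes the external body surface with a linear basis and then regresses internal skeletal joints from it, which creates an undesirable dependency between skeleton and soft tissue and makes it hard to directly control key attributes such as height and bone lengths. The key to the solution is ATLAS, a model learned from 600k high-resolution scans captured with 240 synchronized cameras, which decouples the shape and skeleton bases by explicitly grounding the mesh representation in the human skeleton. This design improves shape expressivity, supports fine-grained customization of body attributes, and makes keypoint fitting independent of external soft-tissue characteristics. As a result, ATLAS fits unseen subjects in diverse poses more accurately than existing methods, and its non-linear pose correctives model complex poses more effectively than linear ones.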

Link: https://arxiv.org/abs/2508.15767
Authors: Jinhyung Park, Javier Romero, Shunsuke Saito, Fabian Prada, Takaaki Shiratori, Yichen Xu, Federica Bogo, Shoou-I Yu, Kris Kitani, Rawal Khirodkar
Affiliations: Meta; Carnegie Mellon University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025; Website: this https URL

Abstract:Parametric body models offer expressive 3D representation of humans across a wide range of poses, shapes, and facial expressions, typically derived by learning a basis over registered 3D meshes. However, existing human mesh modeling approaches struggle to capture detailed variations across diverse body poses and shapes, largely due to limited training data diversity and restrictive modeling assumptions. Moreover, the common paradigm first optimizes the external body surface using a linear basis, then regresses internal skeletal joints from surface vertices. This approach introduces problematic dependencies between internal skeleton and outer soft tissue, limiting direct control over body height and bone lengths. To address these issues, we present ATLAS, a high-fidelity body model learned from 600k high-resolution scans captured using 240 synchronized cameras. Unlike previous methods, we explicitly decouple the shape and skeleton bases by grounding our mesh representation in the human skeleton. This decoupling enables enhanced shape expressivity, fine-grained customization of body attributes, and keypoint fitting independent of external soft-tissue characteristics. ATLAS outperforms existing methods by fitting unseen subjects in diverse poses more accurately, and quantitative evaluations show that our non-linear pose correctives more effectively capture complex poses compared to linear models.

[CV-5] Waver: Wave Your Way to Lifelike Video Generation

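Quick read: This paper addresses the shortcomings of current video generation models in capturing complex motion, maintaining temporal consistency, and unifying generation across modalities. Its core solution is Waver, a high-performance foundation model built on a Hybrid Stream DiT architecture that improves modality alignment and speeds up training convergence, unifying text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation in one framework. The team also builds a data curation pipeline and trains a video quality model based on a multimodal large language model (MLLM) to filter for high-quality training samples. Together these improve motion amplitude and temporal consistency in the generated videos. Waver ranks in the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30), outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions.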

Link: https://arxiv.org/abs/2508.15761
Authors: Yifu Zhang, Hao Yang, Yuqi Zhang, Yifei Hu, Fengda Zhu, Chuang Lin, Xiaofeng Mei, Yi Jiang, Zehuan Yuan, Bingyue Peng
Affiliations: ByteDance Waver Team
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:We present Waver, a high-performance foundation model for unified image and video generation. Waver can directly generate videos with durations ranging from 5 to 10 seconds at a native resolution of 720p, which are subsequently upscaled to 1080p. The model simultaneously supports text-to-video (T2V), image-to-video (I2V), and text-to-image (T2I) generation within a single, integrated framework. We introduce a Hybrid Stream DiT architecture to enhance modality alignment and accelerate training convergence. To ensure training data quality, we establish a comprehensive data curation pipeline and manually annotate and train an MLLM-based video quality model to filter for the highest-quality samples. Furthermore, we provide detailed training and inference recipes to facilitate the generation of high-quality videos. Building on these contributions, Waver excels at capturing complex motion, achieving superior motion amplitude and temporal consistency in video synthesis. Notably, it ranks among the Top 3 on both the T2V and I2V leaderboards at Artificial Analysis (data as of 2025-07-30 10:00 GMT+8), consistently outperforming existing open-source models and matching or surpassing state-of-the-art commercial solutions. We hope this technical report will help the community more efficiently train high-quality video generation models and accelerate progress in video generation technologies. Official page: this https URL.

[CV-6] “Does the cafe entrance look accessible? Where is the door?” Towards Geospatial AI Agents for Visual Inquiries ICCV’25

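Quick read: This paper addresses the fact that interactive digital maps rely on pre-existing structured GIS data (e.g., road networks, POI indices) and therefore struggle to answer geo-visual questions about what the world looks like. The key to its solution is the proposed Geo-Visual Agents: multimodal AI agents that understand and respond to nuanced visual-spatial inquiries by analyzing large-scale repositories of geospatial images (streetscapes, place-based photos, and aerial imagery) combined with traditional GIS data sources.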

Link: https://arxiv.org/abs/2508.15752
Authors: Jon E. Froehlich, Jared Hwang, Zeyu Wang, John S. O’Meara, Xia Su, William Huang, Yang Zhang, Alex Fiannaca, Philip Nelson, Shaun Kane
Affiliations: University of Washington; Google Research; UCLA; Google DeepMind
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the ICCV’25 Workshop “Vision Foundation Models and Generative AI for Accessibility: Challenges and Opportunities”

Abstract:Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), limiting their ability to address geo-visual questions related to what the world looks like. We introduce our vision for Geo-Visual Agents–multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos) combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.

[CV-7] Fine-grained Multi-class Nuclei Segmentation with Molecular-empowered All-in-SAM Model

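Quick read: This paper addresses the weak performance of general vision foundation models such as the Segment Anything Model (SAM) on fine-grained semantic segmentation, in particular identifying specific nuclei subtypes or particular cell types. The key to its solution is the molecular-empowered All-in-SAM model, a full-stack design that: (1) uses molecular-empowered learning so that lay annotators can contribute, reducing the need for detailed pixel-level annotations; (2) adapts SAM with a SAM adapter to emphasize specific semantics while retaining its strong generalizability; and (3) integrates Molecular-Oriented Corrective Learning (MOCL) to further refine segmentation accuracy. The method significantly improves cell classification performance and remains robust under varying annotation quality.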

Link: https://arxiv.org/abs/2508.15751
Authors: Xueyuan Li, Can Cui, Ruining Deng, Yucheng Tang, Quan Liu, Tianyuan Yao, Shunxing Bao, Naweed Chowdhury, Haichun Yang, Yuankai Huo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 25 pages, 3 figures, accepted by Journal of Medical Imaging

Abstract: Purpose: Recent developments in computational pathology have been driven by advances in Vision Foundation Models, particularly the Segment Anything Model (SAM). This model facilitates nuclei segmentation through two primary methods: prompt-based zero-shot segmentation and the use of cell-specific SAM models for direct segmentation. These approaches enable effective segmentation across a range of nuclei and cells. However, general vision foundation models often face challenges with fine-grained semantic segmentation, such as identifying specific nuclei subtypes or particular cells. Approach: In this paper, we propose the molecular-empowered All-in-SAM Model to advance computational pathology by leveraging the capabilities of vision foundation models. This model incorporates a full-stack approach, focusing on: (1) annotation: engaging lay annotators through molecular-empowered learning to reduce the need for detailed pixel-level annotations, (2) learning: adapting the SAM model to emphasize specific semantics, which utilizes its strong generalizability with SAM adapter, and (3) refinement: enhancing segmentation accuracy by integrating Molecular-Oriented Corrective Learning (MOCL). Results: Experimental results from both in-house and public datasets show that the All-in-SAM model significantly improves cell classification performance, even when faced with varying annotation quality. Conclusions: Our approach not only reduces the workload for annotators but also extends the accessibility of precise biomedical image analysis to resource-limited settings, thereby advancing medical diagnostics and automating pathology image analysis.

[CV-8] Probability Density from Latent Diffusion Models for Out-of-Distribution Detection ECAI2025

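Quick read: This paper examines the reliability of **out-of-distribution (OOD) detection** with generative models, i.e., deciding whether an input comes from the same distribution as the training data. In theory, the data likelihood is the optimal detector under the assumption that OOD data are uniformly distributed, but earlier work found that it often fails in practice, raising doubts about its usefulness. The key idea is to train the generative model (a Variational Diffusion Model) in the representation space of a pre-trained ResNet-18 rather than in pixel space, then assess the likelihood-based detector there and compare it with state-of-the-art methods from the OpenOOD suite.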

Link: https://arxiv.org/abs/2508.15737
Authors: Joonas Järve, Karl Kaspar Haavel, Meelis Kull
Affiliations: University of Tartu
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: ECAI 2025

Abstract:Despite rapid advances in AI, safety remains the main bottleneck to deploying machine-learning systems. A critical safety component is out-of-distribution detection: given an input, decide whether it comes from the same distribution as the training data. In generative models, the most natural OOD score is the data likelihood. Actually, under the assumption of uniformly distributed OOD data, the likelihood is even the optimal OOD detector, as we show in this work. However, earlier work reported that likelihood often fails in practice, raising doubts about its usefulness. We explore whether, in practice, the representation space also suffers from the inability to learn good density estimation for OOD detection, or if it is merely a problem of the pixel space typically used in generative models. To test this, we trained a Variational Diffusion Model not on images, but on the representation space of a pre-trained ResNet-18 to assess the performance of our likelihood-based detector in comparison to state-of-the-art methods from the OpenOOD suite.
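
To make the idea of a likelihood-based OOD score concrete, here is a minimal sketch. It is not the paper's method: the Variational Diffusion Model is swapped for a diagonal Gaussian density so the example stays short, and the function names (`fit_diag_gaussian`, `log_likelihood`, `is_ood`) and the threshold rule are assumptions made for illustration. The feature vectors stand in for representations such as those of a pre-trained encoder.

```python
import math

def fit_diag_gaussian(features):
    """Estimate a per-dimension mean and variance from in-distribution features."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    # A small constant keeps the variance strictly positive.
    var = [sum((f[d] - mean[d]) ** 2 for f in features) / n + 1e-6
           for d in range(dim)]
    return mean, var

def log_likelihood(x, mean, var):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def is_ood(x, mean, var, threshold):
    """Flag x as out-of-distribution when its log-likelihood falls below the threshold."""
    return log_likelihood(x, mean, var) < threshold

# Hypothetical in-distribution features (e.g., encoder outputs).
train = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
mean, var = fit_diag_gaussian(train)
# Assumed threshold rule: the lowest log-likelihood seen on the training features.
threshold = min(log_likelihood(f, mean, var) for f in train)
```

In practice the threshold would be tuned on held-out data rather than taken from the training minimum.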

[CV-9] WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

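Quick read: This paper addresses the difficulty of maintaining structural and temporal consistency in long video generation, in particular the errors in object structure and motion that accumulate over time (temporal drift) when methods rely predominantly on RGB signals. The key to its solution is the WorldWeaver framework, whose main contributions are: (1) jointly predicting perceptual conditions and color information within a unified long-horizon modeling scheme, which markedly improves temporal consistency and motion dynamics; (2) building a memory bank from depth cues, which are observed to be more resistant to drift than RGB and therefore preserve clearer context; and (3) using segmented noise scheduling to train prediction groups, which further mitigates drift and reduces computational cost.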

Link: https://arxiv.org/abs/2508.15720
Authors: Zhiheng Liu, Xueqing Deng, Shoufa Chen, Angtian Wang, Qiushan Guo, Mingfei Han, Zeyue Xue, Mengzhao Chen, Ping Luo, Linjie Yang
Affiliations: The University of Hong Kong; ByteDance Seed
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Generative video modeling has made significant strides, yet ensuring structural and temporal consistency over long sequences remains a challenge. Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations. To address these issues, we introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme. Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics. Second, by leveraging depth cues, which we observe to be more resistant to drift than RGB, we construct a memory bank that preserves clearer contextual information, improving quality in long-horizon video generation. Third, we employ segmented noise scheduling for training prediction groups, which further mitigates drift and reduces computational cost. Extensive experiments on both diffusion- and rectified flow-based models demonstrate the effectiveness of WorldWeaver in reducing temporal drift and improving the fidelity of generated videos.

[CV-10] StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding

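Quick read: This paper addresses the efficiency bottleneck that multimodal large language models (MLLMs) face on long videos, where storing and attending to the key-value (KV) cache incurs substantial memory and computational overhead. Existing visual compression methods either encode the entire visual context before compressing it or need the questions in advance, which is impractical for long video understanding and multi-turn conversation. The key to its solution is StreamMem, a query-agnostic KV cache memory mechanism that encodes new video frames in a streaming manner, compresses the KV cache using attention scores between visual tokens and generic query tokens, and maintains a fixed-size KV memory, enabling efficient question answering (QA) under memory constraints.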

Link: https://arxiv.org/abs/2508.15717
Authors: Yanlai Yang, Zhuokai Zhao, Satya Narayan Shukla, Aashu Singh, Shlok Kumar Mishra, Lizhu Zhang, Mengye Ren
Affiliations: Meta; New York University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 15 pages, 3 figures

Abstract:Multimodal large language models (MLLMs) have made significant progress in visual-language reasoning, but their ability to efficiently handle long videos remains limited. Despite recent advances in long-context MLLMs, storing and attending to the key-value (KV) cache for long visual contexts incurs substantial memory and computational overhead. Existing visual compression methods require either encoding the entire visual context before compression or having access to the questions in advance, which is impractical for long video understanding and multi-turn conversational settings. In this work, we propose StreamMem, a query-agnostic KV cache memory mechanism for streaming video understanding. Specifically, StreamMem encodes new video frames in a streaming manner, compressing the KV cache using attention scores between visual tokens and generic query tokens, while maintaining a fixed-size KV memory to enable efficient question answering (QA) in memory-constrained, long-video scenarios. Evaluation on three long video understanding and two streaming video question answering benchmarks shows that StreamMem achieves state-of-the-art performance in query-agnostic KV cache compression and is competitive with query-aware compression approaches.
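
The general mechanism, a fixed-size KV memory pruned by attention from generic queries, can be sketched roughly as follows. This is a simplified illustration and not StreamMem itself: it only drops low-scoring entries, and the function names, the averaging of attention over queries, and the toy vectors are all assumptions.

```python
import math

def attention_scores(keys, queries):
    """Average softmax attention that each key receives from the generic queries."""
    dim = len(keys[0])
    totals = [0.0] * len(keys)
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        norm = sum(exps)
        for i, e in enumerate(exps):
            totals[i] += e / norm
    return [t / len(queries) for t in totals]

def update_memory(memory, new_entries, queries, budget):
    """Append one frame's (key, value) pairs, then keep at most `budget` entries.

    Entries with the highest attention scores survive, in arrival order.
    """
    candidates = memory + new_entries
    if len(candidates) <= budget:
        return candidates
    scores = attention_scores([k for k, _ in candidates], queries)
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [candidates[i] for i in keep]

# Hypothetical stream of two frames, each contributing two visual tokens.
generic_queries = [[1.0, 0.0]]
memory = []
for frame in ([([1.0, 0.0], "a"), ([0.0, 1.0], "b")],
              [([2.0, 0.0], "c"), ([-1.0, 0.0], "d")]):
    memory = update_memory(memory, frame, generic_queries, budget=2)
```

Because the queries are fixed in advance, the memory can be compressed as frames arrive, before any user question is known.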

[CV-11] LLM-empowered Dynamic Prompt Routing for Vision-Language Models Tuning under Long-Tailed Distributions EMNLP2025

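Quick read: This paper addresses the bias that arises when pre-trained vision-language models (VLMs) are fine-tuned in class-imbalanced settings, with particular attention to the class imbalance inherent in VLM pre-training, which can accumulate in downstream tasks. The key to its solution is the Multi-dimensional Dynamic Prompt Routing (MDPR) framework, which builds a class knowledge base spanning five visual-semantic dimensions. During fine-tuning, a dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, producing stable predictions through logits fusion. On long-tailed benchmarks the method achieves results comparable to current SOTA methods with minimal computational overhead.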

Link: https://arxiv.org/abs/2508.15688
Authors: Yongju Jia, Jiarui Ma, Xiangxian Li, Baiqiao Zhang, Xianhui Cao, Juan Liu, Yulong Bian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: accepted by EMNLP 2025

Abstract:Pre-trained vision-language models (VLMs), such as CLIP, have demonstrated impressive capability in visual tasks, but their fine-tuning often suffers from bias in class-imbalanced scene. Recent works have introduced large language models (LLMs) to enhance VLM fine-tuning with supplementing semantic information. However, they often overlook inherent class imbalance in VLMs’ pre-training, which may lead to bias accumulation in downstream tasks. To address this problem, this paper proposes a Multi-dimensional Dynamic Prompt Routing (MDPR) framework. MDPR constructs a comprehensive knowledge base for classes, spanning five visual-semantic dimensions. During fine-tuning, the dynamic routing mechanism aligns global visual classes, retrieves optimal prompts, and balances fine-grained semantics, yielding stable predictions through logits fusion. Extensive experiments on long-tailed benchmarks, including CIFAR-LT, ImageNet-LT, and Places-LT, demonstrate that MDPR achieves comparable results with current SOTA methods. Ablation studies further confirm the effectiveness of our semantic library for tail classes, and show that our dynamic routing incurs minimal computational overhead, making MDPR a flexible and efficient enhancement for VLM fine-tuning under data imbalance.

[CV-12] CM2LoD3: Reconstructing LoD3 Building Models Using Semantic Conflict Maps

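Quick read: This paper addresses the automated, large-scale generation of detailed Level of Detail 3 (LoD3) building models, specifically how to reconstruct buildings with facade elements such as windows, doors, and underpasses, which has traditionally required manual modeling and is hard to scale to city level. The key to its solution is the CM2LoD3 method, which uses Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Real-world CMs are semantically segmented with the help of synthetic CMs produced by the authors' Semantic Conflict Map Generator (SCMG), and the result is fused with segmentations of building textures using confidence scores to improve segmentation performance and, in turn, 3D reconstruction accuracy.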

Link: https://arxiv.org/abs/2508.15672
Authors: Franz Hanke, Antonia Bieringer, Olaf Wysocki, Boris Jutzi
Affiliations: Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: This paper was accepted for the 20th 3D GeoInfo 9th Smart Data Smart Cities Conference

Abstract: Detailed 3D building models are crucial for urban planning, digital twins, and disaster management applications. While Level of Detail 1 (LoD1) and LoD2 building models are widely available, they lack detailed facade elements essential for advanced urban analysis. In contrast, LoD3 models address this limitation by incorporating facade elements such as windows, doors, and underpasses. However, their generation has traditionally required manual modeling, making large-scale adoption challenging. In this contribution, CM2LoD3, we present a novel method for reconstructing LoD3 building models leveraging Conflict Maps (CMs) obtained from ray-to-model-prior analysis. Unlike previous works, we concentrate on semantically segmenting real-world CMs with synthetically generated CMs from our developed Semantic Conflict Map Generator (SCMG). We also observe that additional segmentation of textured models can be fused with CMs using confidence scores to further increase segmentation performance and thus increase 3D reconstruction accuracy. Experimental results demonstrate the effectiveness of our CM2LoD3 method in segmenting and reconstructing building openings, reaching 61% performance with uncertainty-aware fusion of segmented building textures. This research contributes to the advancement of automated LoD3 model reconstruction, paving the way for scalable and efficient 3D city modeling. Our project is available: this https URL

[CV-13] MapKD: Unlocking Prior Knowledge with Cross-Modal Distillation for Efficient Online HD Map Construction

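Quick read: This paper addresses the computational overhead in online HD map construction caused by reliance on stale offline maps and multi-modal sensor suites. Existing methods exploit HD map or SD map priors or fuse multi-modal data, but this adds avoidable computation at inference time. The key to its solution is MapKD, a multi-level cross-modal knowledge distillation framework built on a Teacher-Coach-Student (TCS) paradigm: a camera-LiDAR fusion model with SD/HD map priors serves as the teacher; a vision-centric coach model with prior knowledge and simulated LiDAR bridges the cross-modal knowledge transfer gap; and a lightweight vision-based model is the student. Two targeted distillation strategies, Token-Guided 2D Patch Distillation (TGPD) for bird's-eye-view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance, improve the student on nuScenes by +6.68 mIoU and +10.94 mAP while also accelerating inference.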

Link: https://arxiv.org/abs/2508.15653
Authors: Ziyang Yan, Ruikai Li, Zhiyong Cui, Bohan Li, Han Jiang, Yilong Ren, Aoyong Li, Zhenning Li, Sijia Wen, Haiyang Yu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Online HD map construction is a fundamental task in autonomous driving systems, aiming to acquire semantic information of map elements around the ego vehicle based on real-time sensor inputs. Recently, several approaches have achieved promising results by incorporating offline priors such as SD maps and HD maps or by fusing multi-modal data. However, these methods depend on stale offline maps and multi-modal sensor suites, resulting in avoidable computational overhead at inference. To address these limitations, we employ a knowledge distillation strategy to transfer knowledge from multimodal models with prior knowledge to an efficient, low-cost, and vision-centric student model. Specifically, we propose MapKD, a novel multi-level cross-modal knowledge distillation framework with an innovative Teacher-Coach-Student (TCS) paradigm. This framework consists of: (1) a camera-LiDAR fusion model with SD/HD map priors serving as the teacher; (2) a vision-centric coach model with prior knowledge and simulated LiDAR to bridge the cross-modal knowledge transfer gap; and (3) a lightweight vision-based student model. Additionally, we introduce two targeted knowledge distillation strategies: Token-Guided 2D Patch Distillation (TGPD) for bird’s eye view feature alignment and Masked Semantic Response Distillation (MSRD) for semantic learning guidance. Extensive experiments on the challenging nuScenes dataset demonstrate that MapKD improves the student model by +6.68 mIoU and +10.94 mAP while simultaneously accelerating inference speed. The code is available at:this https URL.

[CV-14] Towards a 3D Transfer-based Black-box Attack via Critical Feature Guidance

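Quick read: This paper studies the vulnerability of deep neural networks (DNNs) for 3D point clouds under black-box attack, i.e., how to generate highly transferable adversarial point clouds without access to the target model's parameters or outputs. The key to its solution is CFG (Critical Feature Guidance), a transfer-based black-box attack that computes the importance of extracted features and prioritizes corrupting the critical features likely to be shared across DNN architectures, which improves transferability to unseen models. It also explicitly constrains the maximum deviation of the adversarial point cloud in the loss function to keep the perturbation imperceptible.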

Link: https://arxiv.org/abs/2508.15650
Authors: Shuchao Pang, Zhenghan Chen, Shen Zhang, Liming Lu, Siyuan Liang, Anan Du, Yongbin Zhou
Affiliations: Nanjing University of Science and Technology; STCA, Microsoft; Nanyang Technological University; Nanjing University of Industry Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 11 pages, 6 figures

Abstract:Deep neural networks for 3D point clouds have been demonstrated to be vulnerable to adversarial examples. Previous 3D adversarial attack methods often exploit certain information about the target models, such as model parameters or outputs, to generate adversarial point clouds. However, in realistic scenarios, it is challenging to obtain any information about the target models under conditions of absolute security. Therefore, we focus on transfer-based attacks, where generating adversarial point clouds does not require any information about the target models. Based on our observation that the critical features used for point cloud classification are consistent across different DNN architectures, we propose CFG, a novel transfer-based black-box attack method that improves the transferability of adversarial point clouds via the proposed Critical Feature Guidance. Specifically, our method regularizes the search of adversarial point clouds by computing the importance of the extracted features, prioritizing the corruption of critical features that are likely to be adopted by diverse architectures. Further, we explicitly constrain the maximum deviation extent of the generated adversarial point clouds in the loss function to ensure their imperceptibility. Extensive experiments conducted on the ModelNet40 and ScanObjectNN benchmark datasets demonstrate that the proposed CFG outperforms the state-of-the-art attack methods by a large margin.

[CV-15] Weakly-Supervised Learning for Tree Instances Segmentation in Airborne Lidar Point Clouds

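Quick read: This paper addresses tree instance segmentation in airborne laser scanning (ALS) data, which is challenging because the data vary with sensor resolution, vegetation state, and terrain characteristics, and because precisely labeled data for fully supervised training are expensive to obtain. The key to its solution is a weakly supervised approach: an initial segmentation is produced by a non-finetuned model or a closed-form algorithm, and a human operator assigns quality ratings to the results; these ratings are used to train a rating model that classifies segmentation outputs into the same classes as the operator; finally, the segmentation model is finetuned with feedback from the rating model, improving correctly identified tree instances by 34% while considerably reducing the number of predicted non-tree instances.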

Link: https://arxiv.org/abs/2508.15646
Authors: Swann Emilien Céleste Destouches, Jesse Lahaye, Laurent Valentin Jospin, Jan Skaloud
Affiliations: Environmental Sensing & Observation Laboratory (ESO), Ecole Polytechnique Fédérale de Lausanne (EPFL)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 9 figures

Abstract:Tree instance segmentation of airborne laser scanning (ALS) data is of utmost importance for forest monitoring, but remains challenging due to variations in the data caused by factors such as sensor resolution, vegetation state at acquisition time, terrain characteristics, etc. Moreover, obtaining a sufficient amount of precisely labeled data to train fully supervised instance segmentation methods is expensive. To address these challenges, we propose a weakly supervised approach where labels of an initial segmentation result obtained either by a non-finetuned model or a closed form algorithm are provided as a quality rating by a human operator. The labels produced during the quality assessment are then used to train a rating model, whose task is to classify a segmentation output into the same classes as specified by the human operator. Finally, the segmentation model is finetuned using feedback from the rating model. This in turn improves the original segmentation model by 34% in terms of correctly identified tree instances while considerably reducing the number of non-tree instances predicted. Challenges still remain in data over sparsely forested regions characterized by small trees (less than two meters in height) or within complex surroundings containing shrubs, boulders, etc. which can be confused as trees where the performance of the proposed method is reduced.

[CV-16] When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

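Quick read: This paper addresses the coarse temporal perception of current Video LLMs: timestamps are encoded only implicitly, frame-level features capture temporal continuity poorly, and language-vision alignment tends to drift away from the entities of interest. The key to its solution is the Grounded VideoDiT architecture, with three core innovations: (1) a Diffusion Temporal Latent (DTL) encoder that enhances boundary sensitivity and temporal consistency; (2) object-grounded representations that explicitly bind query entities to localized visual evidence to strengthen alignment; and (3) a mixed token scheme with discrete temporal tokens for explicit timestamp modeling, enabling fine-grained temporal reasoning. Together these give the model robust grounding capabilities, with state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.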

Link: https://arxiv.org/abs/2508.15641
Authors: Pengcheng Fang, Yuxia Chen, Rui Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak in capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.

[CV-17] Multi-perspective monitoring of wildlife and human activities from camera traps and drones with deep learning models

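Quick read: This paper addresses the poorly understood spatial distribution of wildlife and human activities in landscape systems, in order to assess human-wildlife conflict and support effective conservation planning. Its core solution combines camera traps with thermal infrared drone imagery for multi-perspective monitoring and uses deep learning models for automated object detection. YOLOv11s performed best on camera trap imagery (precision 96.2%, recall 92.3%, mAP50 96.7%), while an enhanced Faster R-CNN model analyzed the drone thermal imagery to add a complementary aerial viewpoint. Spatial pattern analysis then identified hotspots of wildlife and resident activity and the areas where they overlap, pinpointing potential human-wildlife conflict zones and supporting wildlife monitoring and landscape management.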

Link: https://arxiv.org/abs/2508.15629
Authors: Hao Chen, Fang Qiu, Li An, Douglas Stow, Eve Bohnett, Haitao Lyu, Shuang Tian
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Wildlife and human activities are key components of landscape systems. Understanding their spatial distribution is essential for evaluating human-wildlife interactions and informing effective conservation planning. This study performs multi-perspective monitoring of wildlife and human activities by combining camera traps and drone imagery, capturing the spatial patterns of their distributions, which allows the identification of overlapping activity zones and the assessment of the degree of human-wildlife conflict. The study was conducted in Chitwan National Park (CNP), Nepal, and adjacent regions. Images collected by visible and near-infrared camera traps and thermal infrared drones from February to July 2022 were processed to create training and testing datasets, which were used to build deep learning models to automatically identify wildlife and human activities. Drone-collected thermal imagery was used for detecting targets to provide an additional monitoring perspective. Spatial pattern analysis was performed to identify animal and resident activity hotspots and delineate potential human-wildlife conflict zones. Among the deep learning models tested, YOLOv11s achieved the highest performance with a precision of 96.2%, recall of 92.3%, mAP50 of 96.7%, and mAP50 of 81.3%, making it the most effective for detecting objects in camera trap imagery. Drone-based thermal imagery, analyzed with an enhanced Faster RCNN model, added a complementary aerial viewpoint to camera trap detections. Spatial pattern analysis identified clear hotspots for both wildlife and human activities and their overlapping patterns within certain areas in the CNP and buffer zones, indicating potential conflict. This study reveals human-wildlife conflicts within the conserved landscape. Integrating multi-perspective monitoring with automated object detection enhances wildlife surveillance and landscape management.

[CV-18] Fast globally optimal Truncated Least Squares point cloud registration with fixed rotation axis

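Quick read: This paper addresses robust point cloud registration at outlier rates of up to 95%, specifically the difficulty of solving the truncated least squares (TLS) formulation with given correspondences to global optimality. The key to its solution is a linear-time convex relaxation together with a contractor method that speeds up Branch and Bound (BnB). When the rotation axis is provided, the solver registers two 3D point clouds of 100 points to provable global optimality in under half a second, and it is two orders of magnitude faster than the state-of-the-art SDP solver STRIDE on the rotation-only TLS problem. The authors also provide empirical evidence of global optimality on adversarial instances whose local minima lie close to the global minimum.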

Link: https://arxiv.org/abs/2508.15613
Authors: Ivo Ivanov, Carsten Markgraf
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Recent results showed that point cloud registration with given correspondences can be made robust to outlier rates of up to 95% using the truncated least squares (TLS) formulation. However, solving this combinatorial optimization problem to global optimality is challenging. Provably globally optimal approaches using semidefinite programming (SDP) relaxations take hundreds of seconds for 100 points. In this paper, we propose a novel linear time convex relaxation as well as a contractor method to speed up Branch and Bound (BnB). Our solver can register two 3D point clouds with 100 points to provable global optimality in less than half a second when the axis of rotation is provided. Although it currently cannot solve the full 6DoF problem, it is two orders of magnitude faster than the state-of-the-art SDP solver STRIDE when solving the rotation-only TLS problem. In addition to providing a formal proof for global optimality, we present empirical evidence of global optimality using adversarial instances with local minima close to the global minimum.
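
For readers unfamiliar with the TLS formulation, the objective being minimized can be sketched as below. This only evaluates the standard truncated least squares cost for one candidate transform; it does not implement the paper's convex relaxation or Branch and Bound solver. Fixing the rotation axis to z, and the names `rotate_about_z`, `tls_cost`, and `c_bar`, are assumptions made for illustration.

```python
import math

def rotate_about_z(p, theta):
    """Rotate a 3D point by angle theta around the z axis (the assumed fixed axis)."""
    c, s = math.cos(theta), math.sin(theta)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)

def tls_cost(src, dst, theta, t, c_bar):
    """Truncated least squares cost over given correspondences src[i] <-> dst[i].

    Each squared residual is capped at c_bar**2, so an outlier contributes
    at most a constant and cannot dominate the sum.
    """
    total = 0.0
    for p, q in zip(src, dst):
        r = rotate_about_z(p, theta)
        sq = sum((ri + ti - qi) ** 2 for ri, ti, qi in zip(r, t, q))
        total += min(sq, c_bar ** 2)
    return total

# Hypothetical data: two inliers rotated by 90 degrees plus one gross outlier.
src = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
dst = [(0.0, 1.0, 0.0), (-1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
cost_true = tls_cost(src, dst, math.pi / 2, (0.0, 0.0, 0.0), c_bar=0.5)
cost_wrong = tls_cost(src, dst, 0.0, (0.0, 0.0, 0.0), c_bar=0.5)
```

With the axis fixed, the rotation reduces to a single angle, which is what makes the search space small enough for a fast global solver.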

[CV-19] High-Frequency First: A Two-Stage Approach for Improving Image INR

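Quick read: This paper addresses the difficulty that Implicit Neural Representations (INRs) have in capturing high-frequency details such as edges and textures in image reconstruction, which stems from the spectral bias of neural networks. The key to its solution is a two-stage training strategy: in the first stage, a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variation, steering the model to focus on high-frequency information early; the second stage transitions to full-image training to refine overall reconstruction quality. Because the method acts on the training process itself rather than on the network architecture or activation functions, it offers a new direction for mitigating spectral bias.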

Link: https://arxiv.org/abs/2508.15582
Authors: Sumit Kumar Dam, Mrityunjoy Gain, Eui-Nam Huh, Choong Seon Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Paper on INR; 4 figures, 8 pages

Abstract:Implicit Neural Representations (INRs) have emerged as a powerful alternative to traditional pixel-based formats by modeling images as continuous functions over spatial coordinates. A key challenge, however, lies in the spectral bias of neural networks, which tend to favor low-frequency components while struggling to capture high-frequency (HF) details such as sharp edges and fine textures. While prior approaches have addressed this limitation through architectural modifications or specialized activation functions, we propose an orthogonal direction by directly guiding the training process. Specifically, we introduce a two-stage training strategy where a neighbor-aware soft mask adaptively assigns higher weights to pixels with strong local variations, encouraging early focus on fine details. The model then transitions to full-image training. Experimental results show that our approach consistently improves reconstruction quality and complements existing INR methods. As a pioneering attempt to assign frequency-aware importance to pixels in image INR, our work offers a new avenue for mitigating the spectral bias problem.
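
One way such a neighbor-aware soft mask could look is sketched below. The abstract does not give the exact weighting rule, so everything here is an assumption: local variation is measured as the mean absolute difference to the 4-neighbors, weights are rescaled into `[floor, 1]`, and the function names are hypothetical. Stage one would train with `weighted_mse` and this mask; stage two would switch to a uniform mask.

```python
def local_variation(img):
    """Mean absolute difference between each pixel and its 4-neighbors."""
    h, w = len(img), len(img[0])
    var = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            diffs = [abs(img[y][x] - img[ny][nx])
                     for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                     if 0 <= ny < h and 0 <= nx < w]
            var[y][x] = sum(diffs) / len(diffs) if diffs else 0.0
    return var

def soft_mask(img, floor=0.1):
    """Per-pixel weights in [floor, 1]; pixels with strong local variation weigh more."""
    var = local_variation(img)
    peak = max(max(row) for row in var)
    if peak == 0:
        # Flat image: fall back to uniform weights.
        return [[1.0] * len(row) for row in var]
    return [[floor + (1.0 - floor) * v / peak for v in row] for row in var]

def weighted_mse(pred, target, mask):
    """Mask-weighted mean squared error, normalized by the total weight."""
    num = sum(m * (p - t) ** 2
              for pr, tr, mr in zip(pred, target, mask)
              for p, t, m in zip(pr, tr, mr))
    den = sum(m for mr in mask for m in mr)
    return num / den

# Hypothetical grayscale image with a vertical edge between columns 1 and 2.
image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
mask = soft_mask(image)
```

Keeping a nonzero `floor` means smooth regions still contribute to the loss during the first stage instead of being ignored entirely.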

[CV-20] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment

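[Quick Read]: This paper targets two core problems in test-time adaptation (TTA): first, existing methods generally rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment; second, they lack explicit modeling of class-conditional feature distributions, which leads to unreliable decision boundaries and poorly calibrated predictions. The key to the solution is ADAPT, which recasts TTA as a Gaussian probabilistic inference task and explicitly models class-conditional likelihoods through gradually updated class means and a shared covariance matrix, yielding closed-form inference with no training and no gradient updates. Lightweight regularization guided by CLIP priors and a historical knowledge bank corrects potential likelihood bias. The method needs neither source data nor full access to the target data, supports both online and transductive settings, and achieves state-of-the-art robustness and scalability across a range of distribution shifts.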

Link: https://arxiv.org/abs/2508.15568
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
Affiliations: Sungkyunkwan University; Yale University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
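Class-conditional Gaussians that share one covariance matrix admit a closed-form posterior, which is what makes gradient-free inference possible. The sketch below shows that textbook computation plus a running-mean update. It is an assumption-laden illustration, not ADAPT itself: the uniform `priors`, the function names, and the update rule are hypothetical, and the CLIP-guided regularization and knowledge bank are omitted.

```python
import numpy as np

def gaussian_posteriors(x, means, cov, priors):
    """Closed-form class posteriors for Gaussians sharing one covariance.

    With a shared covariance the quadratic term in x cancels across classes,
    leaving linear logits: w_k = inv(cov) @ mu_k,
    b_k = -0.5 * mu_k^T inv(cov) mu_k + log(prior_k).
    """
    prec = np.linalg.inv(cov)
    w = means @ prec                                   # (K, D), cov is symmetric
    b = -0.5 * np.sum(w * means, axis=1) + np.log(priors)
    logits = x @ w.T + b                               # (N, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def update_mean(mean, count, feature):
    """Running-average update of one class mean with a new feature vector."""
    return (mean * count + feature) / (count + 1), count + 1

means = np.array([[0.0, 0.0], [4.0, 0.0]])
cov = np.eye(2)
priors = np.array([0.5, 0.5])
features = np.array([[0.5, 0.0], [3.5, 0.0]])
post = gaussian_posteriors(features, means, cov, priors)
new_mean, new_count = update_mean(means[0], 1, np.array([2.0, 0.0]))
```

No step here requires backpropagation: inference is a matrix product followed by a softmax, and adaptation is an arithmetic update of stored statistics.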

[CV-21] D3FNet: A Differential Attention Fusion Network for Fine-Grained Road Structure Extraction in Remote Perception Systems ICCV2025

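[Quick Read]: This paper addresses narrow-road extraction from high-resolution remote sensing imagery, where the main challenges are limited road width, fragmented topology, and frequent occlusion. The core of the solution is D3FNet, a dilated dual-stream differential attention fusion network built on the D-LinkNet architecture. Its key innovations are: (1) a Differential Attention Dilation Extraction (DADE) module that enhances faint road features and suppresses background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that combines original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (dilation rates 1, 3, 5, 9) that mitigates gridding artifacts and improves the continuity of narrow-road predictions. The method specifically targets fine-grained, occluded, and low-contrast road segments and outperforms existing state-of-the-art models on the DeepGlobe and CHN6-CUG benchmarks.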

Link: https://arxiv.org/abs/2508.15537
Authors: Chang Liu, Yang Xu, Tamas Sziranyi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 6 figures, International Conference on Computer Vision, ICCV 2025 (DriveX) paper id 5

Abstract:Extracting narrow roads from high-resolution remote sensing imagery remains a significant challenge due to their limited width, fragmented topology, and frequent occlusions. To address these issues, we propose D3FNet, a Dilated Dual-Stream Differential Attention Fusion Network designed for fine-grained road structure segmentation in remote perception systems. Built upon the encoder-decoder backbone of D-LinkNet, D3FNet introduces three key innovations: (1) a Differential Attention Dilation Extraction (DADE) module that enhances subtle road features while suppressing background noise at the bottleneck; (2) a Dual-stream Decoding Fusion Mechanism (DDFM) that integrates original and attention-modulated features to balance spatial precision with semantic context; and (3) a multi-scale dilation strategy (rates 1, 3, 5, 9) that mitigates gridding artifacts and improves continuity in narrow road prediction. Unlike conventional models that overfit to generic road widths, D3FNet specifically targets fine-grained, occluded, and low-contrast road segments. Extensive experiments on the DeepGlobe and CHN6-CUG benchmarks show that D3FNet achieves superior IoU and recall on challenging road regions, outperforming state-of-the-art baselines. Ablation studies further verify the complementary synergy of attention-guided encoding and dual-path decoding. These results confirm D3FNet as a robust solution for fine-grained narrow road extraction in complex remote and cooperative perception scenarios.
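To give a feel for what dilation rates 1, 3, 5, 9 cover, the following sketch computes the span of a single dilated kernel and the receptive field of several applied in series. The 3x3 kernel size, stride 1, and the series arrangement are assumptions for illustration; the abstract does not state how D3FNet wires its dilated layers.

```python
def effective_kernel(kernel, dilation):
    """Span along one axis covered by a single dilated kernel."""
    return dilation * (kernel - 1) + 1

def stacked_receptive_field(kernel, dilations):
    """Receptive field of stride-1 dilated convolutions applied in series."""
    rf = 1
    for d in dilations:
        rf += d * (kernel - 1)  # each layer widens the field by d * (k - 1)
    return rf

rates = (1, 3, 5, 9)
spans = [effective_kernel(3, d) for d in rates]
total = stacked_receptive_field(3, rates)
```

Under these assumptions the individual spans are 3, 7, 11, and 19 pixels, and the series reaches 37 pixels. Mixing small and large rates, rather than repeating one large rate, is the usual way to avoid the regularly spaced sampling holes known as gridding artifacts.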

[CV-22] Multi-Object Sketch Animation with Grouping and Motion Trajectory Priors ACM-MM2025

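[Quick Read]: This paper addresses the limitations of existing vector sketch animation methods when handling multi-object interactions and complex motions, namely restriction to single-object scenes, temporal inconsistency, and poor generalization. The key to the solution is a two-stage pipeline: first, a coarse animation is generated through semantic grouping and keyframe definition; second, a Group-based Displacement Network (GDN) predicts group-specific displacement fields using priors from a text-to-video model, and a Context-conditioned Feature Enhancement (CCFE) module improves temporal consistency, which markedly improves the quality and stability of complex multi-object animation.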

Link: https://arxiv.org/abs/2508.15535
Authors: Guotao Liang, Juncheng Hu, Ximing Xing, Jing Zhang, Qian Yu
Affiliations: Beihang University; Qingdao Research Institute, Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025

Abstract:We introduce GroupSketch, a novel method for vector sketch animation that effectively handles multi-object interactions and complex motions. Existing approaches struggle with these scenarios, either being limited to single-object cases or suffering from temporal inconsistency and poor generalization. To address these limitations, our method adopts a two-stage pipeline comprising Motion Initialization and Motion Refinement. In the first stage, the input sketch is interactively divided into semantic groups and key frames are defined, enabling the generation of a coarse animation via interpolation. In the second stage, we propose a Group-based Displacement Network (GDN), which refines the coarse animation by predicting group-specific displacement fields, leveraging priors from a text-to-video model. GDN further incorporates specialized modules, such as Context-conditioned Feature Enhancement (CCFE), to improve temporal consistency. Extensive experiments demonstrate that our approach significantly outperforms existing methods in generating high-quality, temporally consistent animations for complex, multi-object sketches, thus expanding the practical applications of sketch animation.
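The Motion Initialization stage produces a coarse animation by interpolating between keyframes. A minimal sketch of that idea for one semantic group is shown below, assuming linear interpolation of 2D control points; the data layout and the name `interpolate_group` are hypothetical, and the abstract does not say which interpolation scheme GroupSketch uses.

```python
import numpy as np

def interpolate_group(keyframes, num_frames):
    """Linearly interpolate one group's control points between keyframes.

    keyframes: dict mapping frame index -> (P, 2) array of point positions.
    Frames before the first or after the last keyframe are clamped.
    """
    idx = np.array(sorted(keyframes))
    stacked = np.stack([np.asarray(keyframes[i], dtype=float) for i in idx])  # (K, P, 2)
    flat = stacked.reshape(len(idx), -1)                                      # (K, P * 2)
    ts = np.arange(num_frames)
    cols = [np.interp(ts, idx, flat[:, j]) for j in range(flat.shape[1])]
    return np.stack(cols, axis=1).reshape(num_frames, *stacked.shape[1:])

# One group with a single control point moving from (0, 0) to (4, 2).
keyframes = {0: [[0.0, 0.0]], 4: [[4.0, 2.0]]}
anim = interpolate_group(keyframes, num_frames=5)
```

Each group would be interpolated independently, and a refinement stage such as GDN would then predict per-group displacements on top of this coarse motion.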

[CV-23] ExtraGS: Geometric-Aware Trajectory Extrapolation with Uncertainty-Guided Generative Priors

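[Quick Read]: This paper addresses the poor geometric consistency and over-smoothed renderings that arise when synthesizing extrapolated views from recorded driving logs in autonomous driving scenarios. Existing methods use generative priors as pseudo ground truth but struggle to guarantee geometric accuracy and detail fidelity. The key to the solution is ExtraGS, a holistic framework whose core innovations are a Road Surface Gaussian (RSG) representation built on a hybrid Gaussian-Signed Distance Function (SDF) design and Far Field Gaussians (FFG) for efficiently handling distant objects. It also introduces a self-supervised uncertainty estimation mechanism based on spherical harmonics, so that generative priors are integrated selectively, only in regions where extrapolation artifacts occur. This markedly improves the realism and geometric consistency of extrapolated views while preserving high fidelity along the original trajectory.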

Link: https://arxiv.org/abs/2508.15529
Authors: Kaiyuan Tan, Yingying Shen, Haohui Zhu, Zhiwei Zhan, Shan Zhao, Mingfei Tu, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye
Affiliations: UIUC; Xiaomi EV
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Synthesizing extrapolated views from recorded driving logs is critical for simulating driving scenes for autonomous driving vehicles, yet it remains a challenging task. Recent methods leverage generative priors as pseudo ground truth, but often lead to poor geometric consistency and over-smoothed renderings. To address these limitations, we propose ExtraGS, a holistic framework for trajectory extrapolation that integrates both geometric and generative priors. At the core of ExtraGS is a novel Road Surface Gaussian (RSG) representation based on a hybrid Gaussian-Signed Distance Function (SDF) design, and Far Field Gaussians (FFG) that use learnable scaling factors to efficiently handle distant objects. Furthermore, we develop a self-supervised uncertainty estimation framework based on spherical harmonics that enables selective integration of generative priors only where extrapolation artifacts occur. Extensive experiments on multiple datasets, diverse multi-camera setups, and various generative priors demonstrate that ExtraGS significantly enhances the realism and geometric consistency of extrapolated views, while preserving high fidelity along the original trajectory.

[CV-24] Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion

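[Quick Read]: This paper addresses modality misalignment, destruction of high-frequency detail, and task-specific limitations in Multimodal Image Fusion (MMIF). The key to the solution is the AdaSFFuse framework, with two core innovations: (1) the Adaptive Approximate Wavelet Transform (AdaWAT), which performs frequency decoupling and fine-grained alignment of multimodal images across different scenes; and (2) Spatial-Frequency Mamba Blocks, which dynamically adjust cross-domain fusion through learnable mappings and jointly optimize feature integration in the spatial and frequency domains, improving fusion quality and preserving critical detail. The method shows superior performance on four tasks (infrared-visible, multi-focus, multi-exposure, and medical image fusion) while remaining efficient and compact.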

Link: https://arxiv.org/abs/2508.15505
Authors: Mengyu Wang, Zhenyu Liu, Kun Li, Yu Wang, Yuwei Wang, Yanyan Wei, Fei Wang
Affiliations: Nanchang Hangkong University; Zhejiang University; Anhui Agricultural University; Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE Transactions on Multimedia

Abstract:Multimodal Image Fusion (MMIF) aims to integrate complementary information from different imaging modalities to overcome the limitations of individual sensors. It enhances image quality and facilitates downstream applications such as remote sensing, medical diagnostics, and robotics. Despite significant advancements, current MMIF methods still face challenges such as modality misalignment, high-frequency detail destruction, and task-specific limitations. To address these challenges, we propose AdaSFFuse, a novel framework for task-generalized MMIF through adaptive cross-domain co-fusion learning. AdaSFFuse introduces two key innovations: the Adaptive Approximate Wavelet Transform (AdaWAT) for frequency decoupling, and the Spatial-Frequency Mamba Blocks for efficient multimodal fusion. AdaWAT adaptively separates the high- and low-frequency components of multimodal images from different scenes, enabling fine-grained extraction and alignment of distinct frequency characteristics for each modality. The Spatial-Frequency Mamba Blocks facilitate cross-domain fusion in both spatial and frequency domains, enhancing this process. These blocks dynamically adjust through learnable mappings to ensure robust fusion across diverse modalities. By combining these components, AdaSFFuse improves the alignment and integration of multimodal features, reduces frequency loss, and preserves critical details. Extensive experiments on four MMIF tasks – Infrared-Visible Image Fusion (IVF), Multi-Focus Image Fusion (MFF), Multi-Exposure Image Fusion (MEF), and Medical Image Fusion (MIF) – demonstrate AdaSFFuse’s superior fusion performance, ensuring both low computational cost and a compact network, offering a strong balance between performance and efficiency. The code will be publicly available at this https URL.

[CV-25] MExECON: Multi-view Extended Explicit Clothed humans Optimized via Normal integration

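[Quick Read]: This paper addresses high-fidelity reconstruction of clothed human avatars from sparse multi-view RGB images, with a focus on more accurate geometry and body pose estimation. The key to the solution is the Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views to enforce multi-view consistency; the optimized body model then serves as a low-frequency prior that guides the subsequent surface reconstruction. Normal maps from the front and back views are integrated to capture fine-grained surface detail such as clothing folds and hairstyles. The multi-view gains require no network re-training.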

Link: https://arxiv.org/abs/2508.15500
Authors: Fulden Ece Uğur, Rafael Redondo, Albert Barreiro, Stefan Hristov, Roger Marí
Affiliations: Eurecat, Centre Tecnològic de Catalunya, Barcelona, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This work presents MExECON, a novel pipeline for 3D reconstruction of clothed human avatars from sparse multi-view RGB images. Building on the single-view method ECON, MExECON extends its capabilities to leverage multiple viewpoints, improving geometry and body pose estimation. At the core of the pipeline is the proposed Joint Multi-view Body Optimization (JMBO) algorithm, which fits a single SMPL-X body model jointly across all input views, enforcing multi-view consistency. The optimized body model serves as a low-frequency prior that guides the subsequent surface reconstruction, where geometric details are added via normal map integration. MExECON integrates normal maps from both front and back views to accurately capture fine-grained surface details such as clothing folds and hairstyles. All multi-view gains are achieved without requiring any network re-training. Experimental results show that MExECON consistently improves fidelity over the single-view baseline and achieves competitive performance compared to modern few-shot 3D reconstruction methods.

[CV-26] LGMSNet: Thinning a medical image segmentation model via dual-level multiscale fusion ECAI2025

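[Quick Read]: This paper addresses two challenges facing lightweight medical image segmentation models in resource-constrained clinical settings: first, existing lightweight models trade performance for efficiency and usually avoid computationally expensive attention mechanisms, which limits their perception of global context; second, existing architectures overlook channel redundancy under identical convolutional kernels, which hampers feature extraction. The key to the solution is the LGMSNet framework, whose core innovations are: (1) heterogeneous intra-layer convolutional kernels that extract local high-frequency information while mitigating channel redundancy; and (2) sparse transformer-convolutional hybrid branches that capture low-frequency global information, giving strong segmentation performance and zero-shot generalization at very low computational cost.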

Link: https://arxiv.org/abs/2508.15476
Authors: Chengqi Dong, Fenghe Tang, Rongge Mao, Xinpei Gao, S. Kevin Zhou
Affiliations: University of Science and Technology of China; Suzhou Institute for Advance Research, USTC; Institute of Computing Technology, CAS; Jiangsu Provincial Key Laboratory of Multimodal Digital Twin Technology; State Key Laboratory of Precision & Intelligent Chemistry
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by ECAI 2025

Abstract:Medical image segmentation plays a pivotal role in disease diagnosis and treatment planning, particularly in resource-constrained clinical settings where lightweight and generalizable models are urgently needed. However, existing lightweight models often compromise performance for efficiency and rarely adopt computationally expensive attention mechanisms, severely restricting their global contextual perception capabilities. Additionally, current architectures neglect the channel redundancy issue under the same convolutional kernels in medical imaging, which hinders effective feature extraction. To address these challenges, we propose LGMSNet, a novel lightweight framework based on local and global dual multiscale that achieves state-of-the-art performance with minimal computational overhead. LGMSNet employs heterogeneous intra-layer kernels to extract local high-frequency information while mitigating channel redundancy. In addition, the model integrates sparse transformer-convolutional hybrid branches to capture low-frequency global information. Extensive experiments across six public datasets demonstrate LGMSNet’s superiority over existing state-of-the-art methods. In particular, LGMSNet maintains exceptional performance in zero-shot generalization tests on four unseen datasets, underscoring its potential for real-world deployment in resource-limited medical scenarios. The whole project code is in this https URL.

[CV-27] Enhancing Novel View Synthesis from extremely sparse views with SfM-free 3D Gaussian Splatting Framework

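[Quick Read]: This paper addresses the failure of 3D Gaussian Splatting (3DGS) under extremely sparse input views, which stems from its reliance on Structure-from-Motion (SfM) for initialization: with very few training views (for example, only 2), SfM cannot reconstruct the scene geometry accurately and rendering quality degrades markedly. The key to the solution is an SfM-free 3DGS method. First, a dense stereo module progressively estimates camera poses and reconstructs a global dense point cloud for initialization. Second, a coherent view interpolation module interpolates camera poses from pairs of training views and generates viewpoint-consistent content as additional supervision. Multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization further improve geometric structure and the fidelity of rendered content. Experiments show a PSNR improvement of 2.75 dB under extremely sparse views, with synthesized images showing minimal distortion and rich high-frequency detail.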

Link: https://arxiv.org/abs/2508.15457
Authors: Zongqi He, Hanmin Li, Kin-Chung Chan, Yushen Zuo, Hao Xie, Zhe Xiao, Jun Xiao, Kin-Man Lam
Affiliations: The Hong Kong Polytechnic University; Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 4 figures

Abstract:3D Gaussian Splatting (3DGS) has demonstrated remarkable real-time performance in novel view synthesis, yet its effectiveness relies heavily on dense multi-view inputs with precisely known camera poses, which are rarely available in real-world scenarios. When input views become extremely sparse, the Structure-from-Motion (SfM) method that 3DGS depends on for initialization fails to accurately reconstruct the 3D geometric structures of scenes, resulting in degraded rendering quality. In this paper, we propose a novel SfM-free 3DGS-based method that jointly estimates camera poses and reconstructs 3D scenes from extremely sparse-view inputs. Specifically, instead of SfM, we propose a dense stereo module that progressively estimates camera pose information and reconstructs a global dense point cloud for initialization. To address the inherent problem of information scarcity in extremely sparse-view settings, we propose a coherent view interpolation module that interpolates camera poses based on training view pairs and generates viewpoint-consistent content as additional supervision signals for training. Furthermore, we introduce multi-scale Laplacian consistent regularization and adaptive spatial-aware multi-scale geometry regularization to enhance the quality of geometrical structures and rendered content. Experiments show that our method significantly outperforms other state-of-the-art 3DGS-based approaches, achieving a remarkable 2.75 dB improvement in PSNR under extremely sparse-view conditions (using only 2 training views). The images synthesized by our method exhibit minimal distortion while preserving rich high-frequency details, resulting in superior visual quality compared to existing techniques.
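To put the reported PSNR gain in perspective, the short calculation below converts a decibel gain into the equivalent reduction in mean squared error. It relies only on the standard PSNR definition; treating the benchmark-level gain as if it applied to a single image with a fixed peak value is a simplifying assumption, since averaged PSNR does not map exactly to averaged MSE.

```python
import math

def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_for_gain(2.75)       # roughly 1.88
baseline = psnr(0.01)                  # 20 dB for MSE = 0.01 with peak 1.0
improved = psnr(0.01 / ratio)          # baseline + 2.75 dB
```

Under that assumption, a 2.75 dB gain corresponds to cutting the mean squared error by a factor of about 1.88.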

[CV-28] Aligning Moments in Time using Video Queries

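[Quick Read]: This paper addresses precise moment localization in video-to-video moment retrieval (Vid2VidMR), which is hard because of the need for semantic frame-level alignment and the complex dependencies between query and target videos. The key to the solution is MATR (Moment Alignment TRansformer), a transformer-based model that conditions target video representations on query video features through dual-stage sequence alignment, encoding the semantic correlations and temporal dependencies between the two. Foreground/background classification and boundary prediction heads then use these representations to identify the moments in the target video that semantically match the query. To improve performance further, the authors also design a self-supervised pre-training strategy in which the model learns to localize random clips within videos without annotations, providing a stronger task-specific initialization.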

Link: https://arxiv.org/abs/2508.15439
Authors: Yogesh Kumar, Uday Agarwal, Manish Gupta, Anand Mishra
Affiliations: Indian Institute of Technology Jodhpur; Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 4 figures

Abstract:Video-to-video moment retrieval (Vid2VidMR) is the task of localizing unseen events or moments in a target video using a query video. This task poses several challenges, such as the need for semantic frame-level alignment and modeling complex dependencies between query and target videos. To tackle this challenging problem, we introduce MATR (Moment Alignment TRansformer), a transformer-based model designed to capture semantic context as well as the temporal details necessary for precise moment localization. MATR conditions target video representations on query video features using dual-stage sequence alignment that encodes the required correlations and dependencies. These representations are then used to guide foreground/background classification and boundary prediction heads, enabling the model to accurately identify moments in the target video that semantically match with the query video. Additionally, to provide a strong task-specific initialization for MATR, we propose a self-supervised pre-training technique that involves training the model to localize random clips within videos. Extensive experiments demonstrate that MATR achieves notable performance improvements of 13.1% in R@1 and 8.1% in mIoU on an absolute scale compared to state-of-the-art methods on the popular ActivityNet-VRL dataset. Additionally, on our newly proposed dataset, SportsMoments, MATR shows a 14.7% gain in R@1 and a 14.4% gain in mIoU on an absolute scale over strong baselines.
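The R@1 and mIoU figures quoted above are built on temporal intersection-over-union between a predicted moment and the ground-truth moment. The sketch below shows one common way to compute them; the IoU threshold of 0.5 for R@1 and the function names are assumptions, since the abstract does not state the exact evaluation protocol.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) intervals, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts))
    return hits / len(gts)

def mean_iou(top1_preds, gts):
    """Average temporal IoU of the top-1 predictions."""
    return sum(temporal_iou(p, g) for p, g in zip(top1_preds, gts)) / len(gts)

preds = [(2.0, 6.0), (10.0, 12.0)]
gts = [(4.0, 8.0), (10.0, 12.0)]
r1 = recall_at_1(preds, gts)
miou = mean_iou(preds, gts)
```

In this toy case the first prediction overlaps its target by 2 s out of a 6 s union (IoU 1/3) and the second is exact (IoU 1), so R@1 at threshold 0.5 is 0.5 and mIoU is 2/3.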

[CV-29] On the Effectiveness of Graph Reordering for Accelerating Approximate Nearest Neighbor Search on GPU

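[Quick Read]: This paper addresses memory layout optimization for graph-based Approximate Nearest Neighbor Search (ANNS) on GPUs. Although graph-based ANNS has become the dominant paradigm for modern AI applications, existing work concentrates on algorithmic innovation and overlooks the significant effect of memory access patterns on execution efficiency. The paper proposes a unified evaluation framework whose key components are a graph adapter that converts arbitrary graph topologies into a common representation and a GPU-optimized graph traversal engine, which together allow a systematic evaluation of diverse reordering strategies. Experiments show that reordering designed for the GPU architecture improves queries per second (QPS) by up to 15% while preserving search accuracy, indicating that memory layout optimization is orthogonal to existing algorithmic innovations.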

Link: https://arxiv.org/abs/2508.15436
Authors: Yutaro Oguri, Mai Nishimura, Yusuke Matsui
Affiliations: The University of Tokyo; OMRON SINIC X Corporation
Categories: Information Retrieval (cs.IR); Computer Vision and Pattern Recognition (cs.CV); Databases (cs.DB); Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Comments:

Abstract:We present the first systematic investigation of graph reordering effects for graph-based Approximate Nearest Neighbor Search (ANNS) on a GPU. While graph-based ANNS has become the dominant paradigm for modern AI applications, recent approaches focus on algorithmic innovations while neglecting memory layout considerations that significantly affect execution time. Our unified evaluation framework enables comprehensive evaluation of diverse reordering strategies across different graph indices through a graph adapter that converts arbitrary graph topologies into a common representation and a GPU-optimized graph traversal engine. We conduct a comprehensive analysis across diverse datasets and state-of-the-art graph indices, introducing analysis metrics that quantify the relationship between structural properties and memory layout effectiveness. Our GPU-targeted reordering achieves up to 15% QPS improvement while preserving search accuracy, demonstrating that memory layout optimization operates orthogonally to existing algorithmic innovations. We will release all code upon publication to facilitate reproducibility and foster further research.
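Graph reordering means relabeling node IDs so that nodes visited together during search sit close together in memory. The sketch below uses plain breadth-first order as one generic example of such a relabeling; it is not claimed to be one of the strategies the paper evaluates, and the function names are hypothetical.

```python
from collections import deque

def bfs_order(adj, start=0):
    """Return new_id[old_id], assigning IDs in breadth-first visiting order."""
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    # Restart from unvisited nodes so disconnected graphs are fully covered.
    for root in [start] + [v for v in range(n) if v != start]:
        if new_id[root] != -1:
            continue
        new_id[root] = next_id
        next_id += 1
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if new_id[v] == -1:
                    new_id[v] = next_id
                    next_id += 1
                    queue.append(v)
    return new_id

def relabel(adj, new_id):
    """Rebuild the adjacency list under the new node numbering."""
    out = [None] * len(adj)
    for old, nbrs in enumerate(adj):
        out[new_id[old]] = sorted(new_id[v] for v in nbrs)
    return out

# A path 0 - 3 - 2 - 1 stored with scattered IDs.
adj = [[3], [2], [1, 3], [0, 2]]
new_id = bfs_order(adj)
reordered = relabel(adj, new_id)
```

After relabeling, every edge connects consecutive IDs, so neighbor lookups touch adjacent memory. In an ANNS index the vector table would have to be permuted with the same mapping, while the search results themselves are unchanged, which is consistent with the abstract's point that accuracy is preserved.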

[CV-30] A Curated Dataset and Deep Learning Approach for Minor Dent Detection in Vehicles

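[Quick Read]: This paper addresses the shortcomings of conventional car damage inspection, which is labor-intensive, time-consuming, and prone to missing tiny surface defects such as microscopic dents. The core of the solution is a deep learning model built on the YOLOv8 object detection framework: a custom model (YOLOv8m-t42) is trained on a self-built annotated dataset with real-time data augmentation to detect microscopic defects on car body surfaces automatically, with high accuracy and low latency. Experiments show that YOLOv8m-t42 outperforms YOLOv8m-t4 in precision (0.86), recall (0.84), and F1-score (0.85), and has a larger PR curve area (0.88 vs. 0.82), making it more robust and practical for real-time scenarios such as automated insurance claims and vehicle inspection.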

Link: https://arxiv.org/abs/2508.15431
Authors: Danish Zia Baig, Mohsin Kamal
Affiliations: School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional car damage inspection techniques are labor-intensive, manual, and frequently overlook tiny surface imperfections like microscopic dents. Machine learning provides an innovative solution to the increasing demand for quicker and more precise inspection methods. The paper uses the YOLOv8 object recognition framework to provide a deep learning-based solution for automatically detecting microscopic surface flaws, notably tiny dents, on car exteriors. Traditional automotive damage inspection procedures are manual, time-consuming, and frequently unreliable at detecting tiny flaws. To solve this, a bespoke dataset containing annotated photos of car surfaces under various lighting circumstances, angles, and textures was created. To improve robustness, the YOLOv8m model and its customized variants, YOLOv8m-t4 and YOLOv8m-t42, were trained employing real-time data augmentation approaches. Experimental results show that the technique has excellent detection accuracy and low inference latency, making it suited for real-time applications such as automated insurance evaluations and automobile inspections. Evaluation parameters such as mean Average Precision (mAP), precision, recall, and F1-score verified the model’s efficacy. With a precision of 0.86, recall of 0.84, and F1-score of 0.85, the YOLOv8m-t42 model outperformed the YOLOv8m-t4 model (precision: 0.81, recall: 0.79, F1-score: 0.80) in identifying microscopic surface defects. The mAP@0.5 for YOLOv8m-t42 stabilized at 0.60, with a slightly reduced mAP@0.5:0.95 of 0.20. Furthermore, YOLOv8m-t42’s PR curve area was 0.88, suggesting more consistent performance than YOLOv8m-t4 (0.82). YOLOv8m-t42 has greater accuracy and is more appropriate for practical dent detection applications, even though its convergence is slower.
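The reported F1-scores can be checked directly from the reported precision and recall, because F1 is their harmonic mean. The snippet below performs that check using only the numbers quoted in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_t42 = f1_score(0.86, 0.84)   # YOLOv8m-t42
f1_t4 = f1_score(0.81, 0.79)    # YOLOv8m-t4
```

Rounded to two decimals these give 0.85 and 0.80, matching the figures in the abstract.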

[CV-31] Lang2Lift: A Framework for Language-Guided Pallet Detection and Pose Estimation Integrated in Autonomous Outdoor Forklift Operation

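[Quick Read]: This paper addresses the automation of outdoor pallet handling in the logistics and construction industries, where variable payloads, inconsistent pallet quality and dimensions, and unstructured surroundings make manually locating and retrieving pallets inefficient and unsafe, and where labor is in short supply. The key to the solution is the Lang2Lift framework, which uses foundation models for natural language-guided pallet detection and 6D pose estimation: Florence-2 and SAM-2 provide language-grounded segmentation, FoundationPose provides robust pose estimation in cluttered multi-pallet scenes, and the resulting poses feed a motion planning module for fully autonomous forklift operation.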

Link: https://arxiv.org/abs/2508.15427
Authors: Huy Hoang Nguyen, Johannes Huemer, Markus Murschitz, Tobias Glueck, Minh Nhat Vu, Andreas Kugi
Affiliations: AIT Austrian Institute of Technology GmbH; Automation & Control Institute, TU Wien
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages, 7 figures

Abstract:The logistics and construction industries face persistent challenges in automating pallet handling, especially in outdoor environments with variable payloads, inconsistencies in pallet quality and dimensions, and unstructured surroundings. In this paper, we tackle automation of a critical step in pallet transport: the pallet pick-up operation. Our work is motivated by labor shortages, safety concerns, and inefficiencies in manually locating and retrieving pallets under such conditions. We present Lang2Lift, a framework that leverages foundation models for natural language-guided pallet detection and 6D pose estimation, enabling operators to specify targets through intuitive commands such as “pick up the steel beam pallet near the crane.” The perception pipeline integrates Florence-2 and SAM-2 for language-grounded segmentation with FoundationPose for robust pose estimation in cluttered, multi-pallet outdoor scenes under variable lighting. The resulting poses feed into a motion planning module for fully autonomous forklift operation. We validate Lang2Lift on the ADAPT autonomous forklift platform, achieving 0.76 mIoU pallet segmentation accuracy on a real-world test dataset. Timing and error analysis demonstrate the system’s robustness and confirm its feasibility for deployment in operational logistics and construction environments. Video demonstrations are available at this https URL

[CV-32] Bidirectional Temporal Information Propagation for Moving Infrared Small Target Detection

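[Quick Read]: This paper addresses a weakness of existing learning-based multi-frame methods for infrared small target detection: their sliding-window mechanism cannot jointly optimize an entire video clip and ignores global temporal information outside the window, resulting in redundant computation and sub-optimal performance. The key to the solution is a bidirectional temporal information propagation method (BIRD). In both the forward and the backward propagation branch, a Local Temporal Motion Fusion (LTMF) module models local spatio-temporal dependencies between a frame and its adjacent frames, and a Global Temporal Motion Fusion (GTMF) module aggregates the global propagation feature with the local fusion feature; the bidirectional aggregated features are then fused to strengthen the target representation. A Spatio-Temporal Fusion (STF) loss is added so that the whole video clip is optimized jointly, improving both detection accuracy and inference efficiency.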

Link: https://arxiv.org/abs/2508.15415
Authors: Dengyan Luo, Yanping Xiang, Hu Wang, Luping Ji, Shuai Li, Mao Ye
Affiliations: University of Electronic Science and Technology of China; Shandong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Moving infrared small target detection is broadly adopted in infrared search and track systems, and has attracted considerable research focus in recent years. The existing learning-based multi-frame methods mainly aggregate the information of adjacent frames in a sliding window fashion to assist the detection of the current frame. However, the sliding-window-based methods do not consider joint optimization of the entire video clip and ignore the global temporal information outside the sliding window, resulting in redundant computation and sub-optimal performance. In this paper, we propose a Bidirectional temporal information propagation method for moving InfraRed small target Detection, dubbed BIRD. The bidirectional propagation strategy simultaneously utilizes local temporal information of adjacent frames and global temporal information of past and future frames in a recursive fashion. Specifically, in the forward and backward propagation branches, we first design a Local Temporal Motion Fusion (LTMF) module to model local spatio-temporal dependency between a target frame and its two adjacent frames. Then, a Global Temporal Motion Fusion (GTMF) module is developed to further aggregate the global propagation feature with the local fusion feature. Finally, the bidirectional aggregated features are fused and input into the detection head for detection. In addition, the entire video clip is jointly optimized by the traditional detection loss and the additional Spatio-Temporal Fusion (STF) loss. Extensive experiments demonstrate that the proposed BIRD method not only achieves the state-of-the-art performance but also shows a fast inference speed.

[CV-33] From Linearity to Non-Linearity: How Masked Autoencoders Capture Spatial Correlations

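[Quick Read]: This paper addresses the following problem: the performance of Masked Autoencoders (MAEs) on new datasets depends heavily on the choice of hyperparameters (masking ratio, patch size, number of encoder/decoder layers), yet the link between these hyperparameters and downstream task performance is not well understood. The key to the solution is a theoretical analysis of how MAEs learn spatial correlations. The authors first derive an analytical expression for the features learned by a linear MAE and show that masking ratio and patch size control whether the model extracts features capturing short-range or long-range spatial correlations. Extending the analysis to non-linear MAEs, they find that the representations adapt to higher-order spatial statistics of the dataset, which provides a theoretical basis and guidance for choosing MAE hyperparameters in practice.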

Link: https://arxiv.org/abs/2508.15404
Authors: Anthony Bisulco, Rahul Ramesh, Randall Balestriero, Pratik Chaudhari
Affiliations: University of Pennsylvania; Brown University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Masked Autoencoders (MAEs) have emerged as a powerful pretraining technique for vision foundation models. Despite their effectiveness, they require extensive hyperparameter tuning (masking ratio, patch size, encoder/decoder layers) when applied to novel datasets. While prior theoretical works have analyzed MAEs in terms of their attention patterns and hierarchical latent variable models, the connection between MAE hyperparameters and performance on downstream tasks is relatively unexplored. This work investigates how MAEs learn spatial correlations in the input image. We analytically derive the features learned by a linear MAE and show that masking ratio and patch size can be used to select for features that capture short- and long-range spatial correlations. We extend this analysis to non-linear MAEs to show that MAE representations adapt to spatial correlations in the dataset, beyond second-order statistics. Finally, we discuss some insights on how to select MAE hyper-parameters in practice.

[CV-34] Spiking Variational Graph Representation Inference for Video Summarization

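[Quick Read]: This paper addresses the difficulty of extracting key information from short video content, in particular the shortcomings of existing methods in capturing global temporal dependencies, maintaining semantic coherence, and resisting noise during multi-channel feature fusion. The key to the solution is the Spiking Variational Graph (SpiVG) network, which is based on Spiking Neural Networks (SNNs). First, the event-driven computation of SNNs is used to learn keyframe features autonomously. Second, a dynamic aggregation graph reasoning module decouples contextual object consistency from semantic perspective coherence to support fine-grained, adaptive reasoning across frames. Finally, a variational inference reconstruction module uses Evidence Lower Bound Optimization (ELBO) to model the latent structure of multi-channel feature distributions and applies posterior distribution regularization to curb overfitting, reducing uncertainty and noise in the fusion process.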

Link: https://arxiv.org/abs/2508.15389
Authors: Wenrui Li, Wei Han, Liang-Jian Deng, Ruiqin Xiong, Xiaopeng Fan
Affiliations: Harbin Institute of Technology; Harbin Institute of Technology Suzhou Research Institute; University of Electronic Science and Technology of China; Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by IEEE TIP

Abstract:With the rise of short video content, efficient video summarization techniques for extracting key information have become crucial. However, existing methods struggle to capture the global temporal dependencies and maintain the semantic coherence of video content. Additionally, these methods are also influenced by noise during multi-channel feature fusion. We propose a Spiking Variational Graph (SpiVG) Network, which enhances information density and reduces computational complexity. First, we design a keyframe extractor based on Spiking Neural Networks (SNN), leveraging the event-driven computation mechanism of SNNs to learn keyframe features autonomously. To enable fine-grained and adaptable reasoning across video frames, we introduce a Dynamic Aggregation Graph Reasoner, which decouples contextual object consistency from semantic perspective coherence. We present a Variational Inference Reconstruction Module to address uncertainty and noise arising during multi-channel feature fusion. In this module, we employ Evidence Lower Bound Optimization (ELBO) to capture the latent structure of multi-channel feature distributions, using posterior distribution regularization to reduce overfitting. Experimental results show that SpiVG surpasses existing methods across multiple datasets such as SumMe, TVSum, VideoXum, and QFVS. Our codes and pre-trained models are available at this https URL.

[CV-35] DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability

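[Quick read]: This paper targets the fundamental bottleneck of current deep learning models in abstract reasoning, in particular the difficulty existing models have in emulating human logical reasoning. The key to its solution is a "causal chain modeling" perspective that decomposes the complete reasoning chain in Raven's Progressive Matrices (RPM) tasks: image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer, and uses this analysis to design the baseline model DIO. Experiments show, however, that the objective of maximizing a variational lower bound on the mutual information between the context and the correct option does not lead the model to acquire the intended human reasoning logic, because the lower bound is not tight enough and because mutual information, as a statistical measure, cannot capture causal relationships. The paper therefore proposes three improvements to overcome these limitations, centered on strengthening the model's grasp of causal structure and on explicitly modeling the reasoning path.

Link: https://arxiv.org/abs/2508.15387
Authors: Ruizhuo Song, Beiming Yuan
Affiliations: University of Science and Technology Beijing; Beijing Engineering Research Center of Industrial Spectrum Imaging
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 9 figures, 8 tables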

Abstract:Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven’s Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a “causal chain modeling” perspective to systematically analyze the complete causal chain in RPM tasks: image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:

[CV-36] DriveSplat: Decoupled Driving Scene Reconstruction with Geometry-enhanced Partitioned Neural Gaussians

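[Quick read]: This paper addresses the challenges of 3D scene reconstruction in autonomous driving scenes, especially the motion blur caused by fast-moving vehicles, pedestrians, and large-scale static backgrounds. Existing methods based on 3D Gaussian Splatting ease part of the problem through dynamic-static decoupling, but they neglect optimizing the background with adequate geometric relationships and rely only on adding Gaussians to fit individual views, which leads to poor robustness in novel-view rendering and inaccurate geometry. The key to its solution is the DriveSplat framework: a region-wise voxel initialization scheme (near, middle, and far regions) that enhances close-range detail; Deformable Neural Gaussians that model non-rigid dynamic actors, with parameters adjusted over time by a learnable deformation network; and supervision of the whole framework with depth and normal priors from pre-trained models, which improves geometric accuracy and novel-view synthesis quality.

Link: https://arxiv.org/abs/2508.15376
Authors: Cong Wang, Xianda Guo, Wenbo Xu, Wei Tian, Ruiqi Song, Chenming Zhang, Lingxi Li, Long Chen
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tencent; 4. Tsinghua University; 5. Baidu; 6. Peking University; 7. Microsoft
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: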

Abstract:In the realm of driving scenarios, the presence of rapidly moving vehicles, pedestrians in motion, and large-scale static backgrounds poses significant challenges for 3D scene reconstruction. Recent methods based on 3D Gaussian Splatting address the motion blur problem by decoupling dynamic and static components within the scene. However, these decoupling strategies overlook background optimization with adequate geometry relationships and rely solely on fitting each training view by adding Gaussians. Therefore, these models exhibit limited robustness in rendering novel views and lack an accurate geometric representation. To address the above issues, we introduce DriveSplat, a high-quality reconstruction method for driving scenarios based on neural Gaussian representations with dynamic-static decoupling. To better accommodate the predominantly linear motion patterns of driving viewpoints, a region-wise voxel initialization scheme is employed, which partitions the scene into near, middle, and far regions to enhance close-range detail representation. Deformable neural Gaussians are introduced to model non-rigid dynamic actors, whose parameters are temporally adjusted by a learnable deformation network. The entire framework is further supervised by depth and normal priors from pre-trained models, improving the accuracy of geometric structures. Our method has been rigorously evaluated on the Waymo and KITTI datasets, demonstrating state-of-the-art performance in novel-view synthesis for driving scenarios.
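The region-wise voxel initialization can be pictured with a small sketch that bins points into near, middle, and far regions and uses a finer voxel grid close to the vehicle. The abstract gives no boundaries or voxel sizes and does not say how regions are measured, so the distance-from-origin rule, the `REGIONS` values, and the function names below are all hypothetical.

```python
import math

# Hypothetical region limits (metres) and voxel sizes; the abstract gives no values.
REGIONS = [("near", 20.0, 0.1), ("middle", 60.0, 0.5), ("far", math.inf, 2.0)]

def assign_region(point, origin=(0.0, 0.0, 0.0)):
    """Return (region name, voxel size) for a 3D point by its distance from `origin`."""
    dist = math.dist(point, origin)
    for name, max_dist, voxel in REGIONS:
        if dist < max_dist:
            return name, voxel
    raise ValueError("point has no valid distance")

def voxelize(points):
    """Map points to the set of occupied voxel keys, with finer voxels near the origin."""
    occupied = set()
    for p in points:
        name, voxel = assign_region(p)
        occupied.add((name,) + tuple(math.floor(c / voxel) for c in p))
    return occupied
```

With these made-up sizes, two far points one metre apart share a voxel, while near points are separated at the 10 cm level, which is the intuition behind spending more capacity on close-range detail.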

[CV-37] Image-Conditioned 3D Gaussian Splat Quantization

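[Quick read]: This paper addresses two limitations of 3D Gaussian Splatting (3DGS) for long-term archival use: existing compression methods reduce medium-scale scenes only to the megabyte range, which is impractical for large-scale scenes or scene collections, and they cannot adapt to scene changes after archiving. The key to its solution is the Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer). It jointly exploits inter-Gaussian and inter-attribute correlations and uses a fixed codebook shared across training scenes, which raises quantization efficiency and removes the overhead of per-scene codebooks, bringing storage down to the kilobyte range while preserving visual fidelity. In addition, conditioning decoding on images captured at decoding time lets the model adapt to post-archival scene changes, and joint training of encoding, quantization, and decoding ensures the codes are effective for conditional decoding.

Link: https://arxiv.org/abs/2508.15372
Authors: Xinshuang Liu, Runfa Blark Li, Keito Suzuki, Truong Nguyen
Affiliations: University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: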

Abstract:3D Gaussian Splatting (3DGS) has attracted considerable attention for enabling high-quality real-time rendering. Although 3DGS compression methods have been proposed for deployment on storage-constrained devices, two limitations hinder archival use: (1) they compress medium-scale scenes only to the megabyte range, which remains impractical for large-scale scenes or extensive scene collections; and (2) they lack mechanisms to accommodate scene changes after long-term archival. To address these limitations, we propose an Image-Conditioned Gaussian Splat Quantizer (ICGS-Quantizer) that substantially enhances compression efficiency and provides adaptability to scene changes after archiving. ICGS-Quantizer improves quantization efficiency by jointly exploiting inter-Gaussian and inter-attribute correlations and by using shared codebooks across all training scenes, which are then fixed and applied to previously unseen test scenes, eliminating the overhead of per-scene codebooks. This approach effectively reduces the storage requirements for 3DGS to the kilobyte range while preserving visual fidelity. To enable adaptability to post-archival scene changes, ICGS-Quantizer conditions scene decoding on images captured at decoding time. The encoding, quantization, and decoding processes are trained jointly, ensuring that the codes, which are quantized representations of the scene, are effective for conditional decoding. We evaluate ICGS-Quantizer on 3D scene compression and 3D scene updating. Experimental results show that ICGS-Quantizer consistently outperforms state-of-the-art methods in compression efficiency and adaptability to scene changes. Our code, model, and data will be publicly available on GitHub.
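The storage saving from a shared codebook can be illustrated with plain vector quantization: each feature vector is stored as one integer index into a codebook that is fixed once and reused for every scene. This is a generic sketch, not the ICGS-Quantizer itself; the codebook values, the 2D toy vectors, and the function names are made up for illustration.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    def sqdist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))

def quantize(vectors, codebook):
    """Encode each vector as one integer index into the shared codebook."""
    return [nearest_code(v, codebook) for v in vectors]

def dequantize(indices, codebook):
    """Decode indices back to (approximate) vectors."""
    return [codebook[i] for i in indices]

# Toy shared codebook, fixed after training and reused for unseen scenes (made-up values).
SHARED_CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
scene = [(0.1, -0.1), (0.9, 0.2), (0.8, 1.1)]
codes = quantize(scene, SHARED_CODEBOOK)
```

Because the codebook is shared, only `codes` has to be archived per scene; a per-scene codebook would add its own entries to every stored scene.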

[CV-38] Transfer learning optimization based on evolutionary selective fine tuning IJCNN

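[Quick read]: This paper addresses the high computational cost and overfitting caused by full-parameter fine-tuning in transfer learning for deep models. The key to its solution is BioTune, an evolutionary adaptive fine-tuning technique that uses an evolutionary algorithm to select the model layers that contribute most to the target task and fine-tunes only those, which reduces the number of trainable parameters while maintaining performance and improves the efficiency and generalization of transfer learning.

Link: https://arxiv.org/abs/2508.15367
Authors: Jacinto Colan, Ana Davila, Yasuhisa Hasegawa
Affiliations: Nagoya University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Presented at the Workshop artiFicial And bio-inspIred netwoRked intelliGence foR cOnstrained aUtoNomous Devices (FAIRGROUND). 2025 International Joint Conference on Neural Networks (IJCNN)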

Abstract:Deep learning has shown substantial progress in image analysis. However, the computational demands of large, fully trained models remain a consideration. Transfer learning offers a strategy for adapting pre-trained models to new tasks. Traditional fine-tuning often involves updating all model parameters, which can potentially lead to overfitting and higher computational costs. This paper introduces BioTune, an evolutionary adaptive fine-tuning technique that selectively fine-tunes layers to enhance transfer learning efficiency. BioTune employs an evolutionary algorithm to identify a focused set of layers for fine-tuning, aiming to optimize model performance on a given target task. Evaluation across nine image classification datasets from various domains indicates that BioTune achieves competitive or improved accuracy and efficiency compared to existing fine-tuning methods such as AutoRGN and LoRA. By concentrating the fine-tuning process on a subset of relevant layers, BioTune reduces the number of trainable parameters, potentially leading to decreased computational cost and facilitating more efficient transfer learning across diverse data characteristics and distributions.
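As a rough picture of evolutionary layer selection, the sketch below runs a small genetic algorithm over binary masks in which 1 means "fine-tune this layer". It is a toy under stated assumptions, not BioTune's algorithm: the selection scheme, crossover, mutation rate, and the `toy_fitness` stand-in for validation accuracy are all hypothetical.

```python
import random

def evolve_layer_mask(num_layers, fitness, generations=20, pop_size=12,
                      mutation_rate=0.1, seed=0):
    """Search for a binary mask (1 = fine-tune this layer) that maximises `fitness`."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(num_layers)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]        # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, num_layers)   # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

def toy_fitness(mask):
    """Stand-in for validation accuracy: reward layers 3 and 5, penalise every trainable layer."""
    useful = {3, 5}
    return sum(2.0 for i, g in enumerate(mask) if g and i in useful) - 0.5 * sum(mask)
```

In a real setting the fitness call would involve briefly fine-tuning the masked model and scoring it on held-out data, which is the expensive part; the per-layer penalty here only mimics the pressure toward fewer trainable parameters.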

[CV-39] An Empirical Study on How Video-LLMs Answer Video Questions

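[Quick read]: This paper addresses the fact that Video Large Language Models (Video-LLMs) have been studied mostly for performance gains, with little systematic understanding of their internal mechanisms. The key to its solution is to adopt attention knockouts as the main analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout, applied over different windows of layers in a global setting and a fine-grained setting for a systematic empirical analysis. With this method, the paper finds that video information extraction happens mainly in early layers, that certain intermediate layers have an outsized effect on question answering, and that spatial-temporal modeling relies more on language-guided retrieval than on computationally expensive intra- and inter-frame self-attention, providing an actionable basis for making video understanding models more efficient.

Link: https://arxiv.org/abs/2508.15360
Authors: Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Akide Liu, Bohan Zhuang, Jianfei Cai, Hamid Rezatofighi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: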

Abstract:Taking advantage of large-scale data and pretrained language models, Video Large Language Models (Video-LLMs) have shown strong capabilities in answering video questions. However, most existing efforts focus on improving performance, with limited attention to understanding their internal mechanisms. This paper aims to bridge this gap through a systematic empirical study. To interpret existing VideoLLMs, we adopt attention knockouts as our primary analytical tool and design three variants: Video Temporal Knockout, Video Spatial Knockout, and Language-to-Video Knockout. Then, we apply these three knockouts on different numbers of layers (window of layers). By carefully controlling the window of layers and types of knockouts, we provide two settings: a global setting and a fine-grained setting. Our study reveals three key findings: (1) Global setting indicates Video information extraction primarily occurs in early layers, forming a clear two-stage process – lower layers focus on perceptual encoding, while higher layers handle abstract reasoning; (2) In the fine-grained setting, certain intermediate layers exert an outsized impact on video question answering, acting as critical outliers, whereas most other layers contribute minimally; (3) In both settings, we observe that spatial-temporal modeling relies more on language-guided retrieval than on intra- and inter-frame self-attention among video tokens, despite the latter’s high computational cost. Finally, we demonstrate that these insights can be leveraged to reduce attention computation in Video-LLMs. To our knowledge, this is the first work to systematically uncover how Video-LLMs internally process and understand video content, offering interpretability and efficiency perspectives for future research.
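An attention knockout can be expressed as an extra mask on top of the usual causal mask. The sketch below builds such masks for the three variants named in the abstract. The token layout (all video tokens first, frame by frame, then text) and the exact edges each variant removes are assumptions inferred from the variant names, not details confirmed by the paper.

```python
def build_knockout_mask(num_frames, tokens_per_frame, num_text, knockout=None):
    """Return allowed[q][k]: True if query token q may attend to key token k.

    Assumed layout: all video tokens (frame by frame), followed by the text tokens.
    """
    num_video = num_frames * tokens_per_frame
    total = num_video + num_text

    def frame_of(i):
        return i // tokens_per_frame

    # Causal baseline: each token attends to itself and to earlier tokens.
    allowed = [[k <= q for k in range(total)] for q in range(total)]
    for q in range(total):
        for k in range(q):  # self-attention (k == q) is never removed
            q_vid, k_vid = q < num_video, k < num_video
            if knockout == "temporal" and q_vid and k_vid and frame_of(q) != frame_of(k):
                allowed[q][k] = False   # cut video-to-video edges across frames
            elif knockout == "spatial" and q_vid and k_vid and frame_of(q) == frame_of(k):
                allowed[q][k] = False   # cut video-to-video edges within a frame
            elif knockout == "language_to_video" and not q_vid and k_vid:
                allowed[q][k] = False   # cut text queries reading video keys
    return allowed
```

Applying one of these masks only inside a chosen window of layers, and measuring the drop in answer accuracy, is the kind of experiment the global and fine-grained settings describe.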

[CV-40] RCDINO: Enhancing Radar-Camera 3D Object Detection with DINOv2 Semantic Features

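[Quick read]: This paper addresses insufficient fusion efficiency and representation capacity for multimodal data in 3D object detection, specifically how to combine camera and radar data effectively to raise detection accuracy in autonomous driving and robotics. The key to its solution is RCDINO, a Transformer-based multimodal model whose core idea is to fuse semantically rich features from the pretrained DINOv2 foundation model with visual backbone features, strengthening visual representations while remaining compatible with the baseline detection architecture. Experiments show that the method achieves 56.4 NDS and 48.1 mAP on the nuScenes dataset, state of the art among radar-camera models.

Link: https://arxiv.org/abs/2508.15353
Authors: Olga Matykina, Dmitry Yudin
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication in Optical Memory and Neural Networks, 2025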

Abstract:Three-dimensional object detection is essential for autonomous driving and robotics, relying on effective fusion of multimodal data from cameras and radar. This work proposes RCDINO, a multimodal transformer-based model that enhances visual backbone features by fusing them with semantically rich representations from the pretrained DINOv2 foundation model. This approach enriches visual representations and improves the model’s detection performance while preserving compatibility with the baseline architecture. Experiments on the nuScenes dataset demonstrate that RCDINO achieves state-of-the-art performance among radar-camera models, with 56.4 NDS and 48.1 mAP. Our implementation is available at this https URL.

[CV-41] Predicting Road Crossing Behaviour using Pose Detection and Sequence Modelling

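[Quick read]: This paper addresses early prediction of pedestrians' road-crossing intent for autonomous vehicles, with the aim of improving traffic safety and decision efficiency. Its key solution is an end-to-end deep learning framework that first extracts pedestrian keypoints with a pose detection model and then applies sequence modeling to the temporal dynamics to predict whether a pedestrian is about to cross. The study compares three sequence models (GRU, LSTM, and 1D CNN) and finds that GRU predicts intent more accurately than LSTM, while the 1D CNN is the fastest at inference, offering a feasible path for real-time use.

Link: https://arxiv.org/abs/2508.15336
Authors: Subhasis Dasgupta, Preetam Saha, Agniva Roy, Jaydip Sen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This is a pre-print version of the original paper accepted in the IEEE conference INDISCON 2025. It contains 8 figures and 1 table. The length of the paper is 7 pages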

Abstract:The world is constantly moving towards AI-based systems, and autonomous vehicles are now a reality in different parts of the world. These vehicles require sensors and cameras to detect objects and maneuver accordingly. It becomes important for such vehicles to also predict from a distance whether a person is about to cross a road. The current study focused on predicting pedestrians' intent to cross the road in an experimental setup. The study involved working with deep learning models to predict poses and sequence modelling for temporal predictions. The study analysed three different sequence modelling techniques to understand the prediction behaviour and found that GRU was better at predicting the intent than the LSTM model, while 1D CNN was the best model in terms of speed. The study involved video analysis, and the output of the pose detection model was later integrated with sequence modelling techniques to form an end-to-end deep learning framework for predicting road crossing intents.

[CV-42] VideoEraser: Concept Erasure in Text-to-Video Diffusion Models EMNLP

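[Quick read]: This paper addresses the privacy, copyright, and safety risks raised by text-to-video (T2V) diffusion models, especially when they are used to generate videos containing unauthorized personal identities, artistic works, or harmful content. The core challenge is that these models are typically trained on large amounts of unauthorized data, so they can produce inappropriate content when explicitly prompted for it. The key to its solution is VideoEraser, a training-free framework that suppresses undesirable concepts in two stages: Selective Prompt Embedding Adjustment (SPEA), which dynamically modifies the prompt embeddings associated with the target concept, and Adversarial-Resilient Noise Guidance (ARNG), which strengthens the model's resistance to malicious prompts. As a plug-and-play module it integrates with mainstream T2V diffusion models, and across four tasks (object erasure, artistic style erasure, celebrity erasure, and explicit content erasure) it outperforms existing methods, reducing undesirable content generation by 46% on average.

Link: https://arxiv.org/abs/2508.15314
Authors: Naen Xu, Jinghuai Zhang, Changjiang Li, Zhi Chen, Chunyi Zhou, Qingming Li, Tianyu Du, Shouling Ji
Affiliations: Zhejiang University; University of California, Los Angeles; Palo Alto Networks; University of Illinois Urbana-Champaign
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: To appear in the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)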

Abstract:The rapid growth of text-to-video (T2V) diffusion models has raised concerns about privacy, copyright, and safety due to their potential misuse in generating harmful or misleading content. These models are often trained on numerous datasets, including unauthorized personal identities, artistic creations, and harmful materials, which can lead to uncontrolled production and distribution of such content. To address this, we propose VideoEraser, a training-free framework that prevents T2V diffusion models from generating videos with undesirable concepts, even when explicitly prompted with those concepts. Designed as a plug-and-play module, VideoEraser can seamlessly integrate with representative T2V diffusion models via a two-stage process: Selective Prompt Embedding Adjustment (SPEA) and Adversarial-Resilient Noise Guidance (ARNG). We conduct extensive evaluations across four tasks, including object erasure, artistic style erasure, celebrity erasure, and explicit content erasure. Experimental results show that VideoEraser consistently outperforms prior methods regarding efficacy, integrity, fidelity, robustness, and generalizability. Notably, VideoEraser achieves state-of-the-art performance in suppressing undesirable content during T2V generation, reducing it by 46% on average across four tasks compared to baselines.

[CV-43] First RAG Second SEG: A Training-Free Paradigm for Camouflaged Object Detection

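[Quick read]: This paper addresses two challenges in Camouflaged Object Detection (COD): existing methods depend on heavy training and large computational resources, and foundation models such as the Segment Anything Model (SAM) struggle with COD without fine-tuning and need high-quality prompts. The key to its solution is RAG-SEG, a training-free two-stage paradigm. The first stage builds a compact feature retrieval database through unsupervised clustering, enabling fast and effective feature retrieval and producing coarse masks that serve as pseudo-label prompts. The second stage uses SAM2 to refine these pseudo-labels into precise segmentations. The method dispenses with conventional training while matching or exceeding state-of-the-art methods, and all experiments were run on a personal laptop, which supports its efficiency and practicality.

Link: https://arxiv.org/abs/2508.15313
Authors: Wutao Liu, YiDan Wang, Pan Gao
Affiliations: Nanjing University of Aeronautics and Astronautics
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: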

Abstract:Camouflaged object detection (COD) poses a significant challenge in computer vision due to the high similarity between objects and their backgrounds. Existing approaches often rely on heavy training and large computational resources. While foundation models such as the Segment Anything Model (SAM) offer strong generalization, they still struggle to handle COD tasks without fine-tuning and require high-quality prompts to yield good performance. However, generating such prompts manually is costly and inefficient. To address these challenges, we propose First RAG, Second SEG (RAG-SEG), a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. During inference, the retrieved features produce pseudo-labels that guide precise mask generation using SAM2. Our method eliminates the need for conventional training while maintaining competitive performance. Extensive experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods. Notably, all experiments are conducted on a personal laptop, highlighting the computational efficiency and practicality of our approach. We present further analysis in the Appendix, covering limitations, salient object detection extension, and possible improvements.
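The retrieval stage can be sketched as a nearest-prototype lookup: each patch feature is compared with a small database of labelled prototypes (for example cluster centroids), and the retrieved labels form a coarse mask. The abstract does not say how features are matched, so cosine similarity, the 2D toy features, and all names below are assumptions; the SAM2 refinement step is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_label(feature, database):
    """Return the label of the most similar prototype in the retrieval database."""
    best = max(database, key=lambda entry: cosine(feature, entry[0]))
    return best[1]

def coarse_mask(patch_features, database):
    """Label every patch feature in a 2D grid: 1 = object, 0 = background."""
    return [[retrieve_label(f, database) for f in row] for row in patch_features]

# Hypothetical database of (prototype vector, label) pairs, e.g. cluster centroids.
DATABASE = [((1.0, 0.0), 1), ((0.0, 1.0), 0)]
grid = [[(0.9, 0.1), (0.2, 0.8)],
        [(0.7, 0.3), (0.1, 0.9)]]
mask = coarse_mask(grid, DATABASE)
```

Keeping only a few centroids per class is what makes the database compact: lookup cost grows with the number of prototypes, not with the number of images used to build them.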

[CV-44] BasketLiDAR: The First LiDAR-Camera Multimodal Dataset for Professional Basketball MOT

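[Quick read]: This paper addresses real-time 3D Multi-Object Tracking (MOT) in sports, in particular basketball, where fast player movement, dense interaction, and frequent occlusion make accurate, low-latency tracking difficult for conventional camera systems. The key to its solution is BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera video, together with a new MOT framework based on LiDAR and visual information. The framework comprises a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data, improving tracking accuracy under heavy occlusion while keeping real-time operation.

Link: https://arxiv.org/abs/2508.15299
Authors: Ryunosuke Hayashi, Kohei Torimi, Rokuto Nagata, Kazuma Ikeda, Ozora Sako, Taichi Nakamura, Masaki Tani, Yoshimitsu Aoki, Kentaro Yoshioka
Affiliations: Keio University; AISIN CORPORATION
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to MMSports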

Abstract:Real-time 3D trajectory player tracking in sports plays a crucial role in tactical analysis, performance evaluation, and enhancing spectator experience. Traditional systems rely on multi-camera setups, but are constrained by the inherently two-dimensional nature of video data and the need for complex 3D reconstruction processing, making real-time analysis challenging. Basketball, in particular, represents one of the most difficult scenarios in the MOT field, as ten players move rapidly and complexly within a confined court space, with frequent occlusions caused by intense physical contact. To address these challenges, this paper constructs BasketLiDAR, the first multimodal dataset in the sports MOT field that combines LiDAR point clouds with synchronized multi-view camera footage in a professional basketball environment, and proposes a novel MOT framework that simultaneously achieves improved tracking accuracy and reduced computational cost. The BasketLiDAR dataset contains a total of 4,445 frames and 3,105 player IDs, with fully synchronized IDs between three LiDAR sensors and three multi-view cameras. We recorded 5-on-5 and 3-on-3 game data from actual professional basketball players, providing complete 3D positional information and ID annotations for each player. Based on this dataset, we developed a novel MOT algorithm that leverages LiDAR’s high-precision 3D spatial information. The proposed method consists of a real-time tracking pipeline using LiDAR alone and a multimodal tracking pipeline that fuses LiDAR and camera data. Experimental results demonstrate that our approach achieves real-time operation, which was difficult with conventional camera-only methods, while achieving superior tracking performance even under occlusion conditions. 
The dataset is available upon request at: this https URL

[CV-45] TPA: Temporal Prompt Alignment for Fetal Congenital Heart Defect Classification

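[Quick read]: This paper addresses the detection of fetal Congenital Heart Defects (CHD) in ultrasound videos, where the main challenges are image noise, variability in probe positioning, and the tendency of existing machine learning methods to ignore temporal information, restrict themselves to binary classification, and lack prediction calibration. The key to its solution is the Temporal Prompt Alignment (TPA) framework. It builds on a foundation image-text model and prompt-aware contrastive learning, extracting per-frame features and aggregating them with a trainable temporal module to capture cardiac motion; it aligns the video representation with class-specific text prompts through a margin-hinge contrastive loss; and it adds a Conditional Variational Autoencoder Style Modulation (CVAESM) module that learns a latent style vector to modulate embeddings and quantify classification uncertainty, improving reliability and performance in clinical settings.

Link: https://arxiv.org/abs/2508.15298
Authors: Darya Taratynova, Alya Almsouti, Beknur Kalmakhanbet, Numan Saeed, Mohammad Yaqub
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: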

Abstract:Congenital heart defect (CHD) detection in ultrasound videos is hindered by image noise and probe positioning variability. While automated methods can reduce operator dependence, current machine learning approaches often neglect temporal information, limit themselves to binary classification, and do not account for prediction calibration. We propose Temporal Prompt Alignment (TPA), a method leveraging foundation image-text model and prompt-aware contrastive learning to classify fetal CHD on cardiac ultrasound videos. TPA extracts features from each frame of video subclips using an image encoder, aggregates them with a trainable temporal extractor to capture heart motion, and aligns the video representation with class-specific text prompts via a margin-hinge contrastive loss. To enhance calibration for clinical reliability, we introduce a Conditional Variational Autoencoder Style Modulation (CVAESM) module, which learns a latent style vector to modulate embeddings and quantifies classification uncertainty. Evaluated on a private dataset for CHD detection and on a large public dataset, EchoNet-Dynamic, for systolic dysfunction, TPA achieves state-of-the-art macro F1 scores of 85.40% for CHD diagnosis, while also reducing expected calibration error by 5.38% and adaptive ECE by 6.8%. On EchoNet-Dynamic’s three-class task, it boosts macro F1 by 4.73% (from 53.89% to 58.62%). Temporal Prompt Alignment (TPA) is a framework for fetal congenital heart defect (CHD) classification in ultrasound videos that integrates temporal modeling, prompt-aware contrastive learning, and uncertainty quantification.
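Since the results are reported in terms of expected calibration error (ECE), a short sketch of the standard equal-width-bin ECE may help when reading the numbers. This is the common textbook definition, not the paper's evaluation code; the bin count of 10 and the function name are assumptions.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """ECE: bin predictions by confidence, then average |accuracy - confidence|
    over bins, weighted by the fraction of samples in each bin."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece
```

For example, four predictions made with 0.9 confidence of which only two are correct give an ECE of about 0.4: the model claims 90% but delivers 50%.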

[CV-46] DesignCLIP: Multimodal Learning with CLIP for Design Patent Understanding EMNLP2025

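[Quick read]: This paper addresses the limitations of conventional image-dependent methods in design patent analysis: patent images, usually abstract structural sketches, do not fully convey visual context and semantic information, which causes ambiguity in evaluation during prior art searches. The key to its solution is DesignCLIP, a unified framework that draws on the multimodal understanding of the vision-language model CLIP and combines class-aware classification, contrastive learning, generated detailed captions, and multi-view image learning to strengthen semantic representation and cross-modal matching for design patent images. Experiments show that DesignCLIP consistently outperforms baseline and state-of-the-art (SOTA) models on patent classification, patent retrieval, and multimodal patent retrieval.

Link: https://arxiv.org/abs/2508.15297
Authors: Zhu Wang, Homaira Huda Shomee, Sathya N. Ravi, Sourav Medya
Affiliations: University of Illinois Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted by EMNLP 2025. 22 pages, 14 figures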

Abstract:In the field of design patent analysis, traditional tasks such as patent classification and patent image retrieval heavily depend on the image data. However, patent images – typically consisting of sketches with abstract and structural elements of an invention – often fall short in conveying comprehensive visual context and semantic information. This inadequacy can lead to ambiguities in evaluation during prior art searches. Recent advancements in vision-language models, such as CLIP, offer promising opportunities for more reliable and accurate AI-driven patent analysis. In this work, we leverage CLIP models to develop a unified framework DesignCLIP for design patent applications with a large-scale dataset of U.S. design patents. To address the unique characteristics of patent data, DesignCLIP incorporates class-aware classification and contrastive learning, utilizing generated detailed captions for patent images and multi-views image learning. We validate the effectiveness of DesignCLIP across various downstream tasks, including patent classification and patent retrieval. Additionally, we explore multimodal patent retrieval, which provides the potential to enhance creativity and innovation in design by offering more diverse sources of inspiration. Our experiments show that DesignCLIP consistently outperforms baseline and SOTA models in the patent domain on all tasks. Our findings underscore the promise of multimodal approaches in advancing patent analysis. The codebase is available here: this https URL.
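For readers unfamiliar with the contrastive objective that CLIP-style models build on, here is a minimal sketch of the standard symmetric image-text loss over a batch similarity matrix. It shows only the generic CLIP objective; DesignCLIP's class-aware and multi-view components are not described in enough detail in the abstract to reproduce, and the temperature value and function names are assumptions.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_sum for v in row]

def clip_contrastive_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    similarity[i][j] is the similarity of image i and caption j;
    matching pairs sit on the diagonal.
    """
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    text_to_image = -sum(log_softmax(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)
```

The loss is small when each patent image is most similar to its own caption and large when it is closer to another caption, which is what pulls matching image-text pairs together in the shared embedding space.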

[CV-47] RATopo: Improving Lane Topology Reasoning via Redundancy Assignment ACM-MM2025

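[Quick read]: This paper addresses the suboptimal performance of existing lane topology reasoning methods for autonomous driving, which stems from limited supervision. Mainstream methods follow a first-detect-then-reason paradigm and supervise topological relationships using the one-to-one assignment results from the detection stage, but the narrow range of valid supervision makes it hard to train models on complex lane-lane and lane-traffic-element relationships. The key to its solution is RATopo, a Redundancy Assignment strategy. Its core idea is to restructure the Transformer decoder by swapping the cross-attention and self-attention layers, so that redundant lane predictions are retained and an effective one-to-many assignment becomes possible; it also instantiates multiple parallel cross-attention blocks with independent parameters to increase the geometric diversity of detected lanes, yielding richer topology supervision. Experiments show that RATopo is model-agnostic, integrates into existing topology reasoning frameworks, and consistently improves lane-lane and lane-traffic topology performance.

Link: https://arxiv.org/abs/2508.15272
Authors: Han Li, Shaofei Huang, Longfei Xu, Yulu Gao, Beipeng Mu, Si Liu
Affiliations: Beihang University; Zhongguancun Academy; University of Macau; School of Computer Science and Engineering; Hangzhou International Innovation Institute; Meituan
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025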

Abstract:Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.

[CV-48] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images ICCV2025

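[Quick read]: This paper addresses anomaly detection in computational pathology, that is, how to identify rare and complex pathological anomalies accurately when disease-related data are scarce or missing. Existing anomaly detection methods designed for industrial settings are hard to transfer directly to pathology images because of computational constraints, diverse tissue structures, and a lack of interpretability. The key to its solution is Ano-NAViLa, a normal and abnormal pathology knowledge-augmented model built on a pre-trained Vision-Language Model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa improves accuracy and robustness to variability in pathology images and provides interpretability through image-text associations, achieving state-of-the-art anomaly detection and localization on two lymph node datasets from different organs.

Link: https://arxiv.org/abs/2508.15256
Authors: Jinsol Song, Jiamu Wang, Anh Tien Nguyen, Keunho Byeon, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
Affiliations: Korea University; The Catholic University of Korea
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025. © IEEE 2025. This is the author’s accepted version (camera-ready) of the paper. The definitive version is published in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV 2025). DOI will be updated when available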

Abstract:Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.

[CV-49] Comp-X: On Defining an Interactive Learned Image Compression Paradigm With Expert-driven LLM Agent

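[Quick read]: This paper addresses the limited flexibility and user-friendliness of conventional image coding: existing codecs typically rely on fixed coding modes and manual parameter selection, so they adapt poorly to diverse user needs and are especially hard for non-expert users. The key to its solution is the Comp-X framework, which enables intelligently interactive image compression through three innovations: (i) a unified multi-functional coding framework that integrates coding objectives such as human-machine perception, variable-rate coding, and spatial bit allocation; (ii) an interactive coding agent that uses an augmented in-context learning method with coding expert feedback so that a large language model (LLM) can understand user requests, select coding modes, and call tools; and (iii) IIC-bench, a benchmark for systematically evaluating interactive compression, moving image compression toward artificial general intelligence (AGI).

Link: https://arxiv.org/abs/2508.15243
Authors: Yixin Gao, Xin Li, Xiaohan Pan, Runsen Feng, Bingchen Li, Yunpeng Qi, Yiting Lu, Zhengxue Cheng, Zhibo Chen, Jörn Ostermann
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: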

Abstract:We present Comp-X, the first intelligently interactive image compression paradigm empowered by the impressive reasoning capability of large language model (LLM) agent. Notably, commonly used image codecs usually suffer from limited coding modes and rely on manual mode selection by engineers, making them unfriendly for unprofessional users. To overcome this, we advance the evolution of image coding paradigm by introducing three key innovations: (i) multi-functional coding framework, which unifies different coding modes of various objective/requirements, including human-machine perception, variable coding, and spatial bit allocation, into one framework. (ii) interactive coding agent, where we propose an augmented in-context learning method with coding expert feedback to teach the LLM agent how to understand the coding request, mode selection, and the use of the coding tools. (iii) IIC-bench, the first dedicated benchmark comprising diverse user requests and the corresponding annotations from coding experts, which is systematically designed for intelligently interactive image compression evaluation. Extensive experimental results demonstrate that our proposed Comp-X can understand the coding requests efficiently and achieve impressive textual interaction capability. Meanwhile, it can maintain comparable compression performance even with a single coding framework, providing a promising avenue for artificial general intelligence (AGI) in image compression.

[CV-50] Pretrained Diffusion Models Are Inherently Skipped-Step Samplers

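[Quick Read]: This paper addresses the low sampling efficiency caused by the sequential nature of diffusion models: conventional generation requires a long chain of step-by-step denoising, which limits inference speed. The key to its solution is a skipped-step sampling mechanism that is derived from the same training objective as the standard diffusion model and skips multiple intermediate denoising steps to accelerate sampling. The work shows that this Markovian accelerated sampling is an intrinsic property of pretrained diffusion models, and that combining it with DDIM further improves generation quality and efficiency.

Link: https://arxiv.org/abs/2508.15233
Authors: Wenju Xu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: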

Abstract:Diffusion models have been achieving state-of-the-art results across various generation tasks. However, a notable drawback is their sequential generation process, requiring long-sequence step-by-step generation. Existing methods, such as DDIM, attempt to reduce sampling steps by constructing a class of non-Markovian diffusion processes that maintain the same training objective. However, there remains a gap in understanding whether the original diffusion process can achieve the same efficiency without resorting to non-Markovian processes. In this paper, we provide a confirmative answer and introduce skipped-step sampling, a mechanism that bypasses multiple intermediate denoising steps in the iterative generation process, in contrast with the traditional step-by-step refinement of standard diffusion inference. Crucially, we demonstrate that this skipped-step sampling mechanism is derived from the same training objective as the standard diffusion model, indicating that accelerated sampling via skipped-step sampling via a Markovian way is an intrinsic property of pretrained diffusion models. Additionally, we propose an enhanced generation method by integrating our accelerated sampling technique with DDIM. Extensive experiments on popular pretrained diffusion models, including the OpenAI ADM, Stable Diffusion, and Open Sora models, show that our method achieves high-quality generation with significantly reduced sampling steps.
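
As a point of reference for the step-skipping idea, the following is a minimal sketch of the standard deterministic DDIM update applied on a sub-sampled timestep grid. It is not the paper's Markovian skipped-step derivation, which the abstract does not spell out. The linear beta schedule, the scalar state, and the names `alpha_bar_schedule`, `ddim_step`, and `sample_with_skips` are assumptions made for illustration.

```python
import math

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def ddim_step(x_t, eps_pred, ab_t, ab_s):
    """Deterministic DDIM update (eta = 0) from noise level ab_t to ab_s."""
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps_pred) / math.sqrt(ab_t)
    return math.sqrt(ab_s) * x0_pred + math.sqrt(1.0 - ab_s) * eps_pred

def sample_with_skips(x_T, eps_model, alpha_bars, stride):
    """Visit only every `stride`-th timestep, ending at the clean sample."""
    timesteps = list(range(len(alpha_bars) - 1, -1, -stride))
    x = x_T
    for i, t in enumerate(timesteps):
        # After the last visited timestep, jump to the noise-free level (alpha_bar = 1).
        ab_s = alpha_bars[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_model(x, t), alpha_bars[t], ab_s)
    return x
```

With `stride=100` on a 1000-step schedule the loop makes 10 model calls instead of 1000. How much quality survives the skipping depends on how accurate the noise predictor is, which is the question the paper studies.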

[CV-51] AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation ACM-MM2025

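[Quick Read]: This paper addresses the limited reliability of UAV vision-and-language navigation (UAV-VLN) in complex outdoor environments, where long trajectories and complex maneuverability often mean that effective navigation needs human intervention or overly detailed instructions. The core challenge is to exploit the multi-grained perspectives that high UAV mobility provides while keeping the motion space manageable for learning. The key to its solution is a new dual-altitude collaborative task, DuAl-VLN: a high-altitude UAV handles broad environmental reasoning (integrating the multimodal large language model Pilot-LLM), and a low-altitude UAV performs precise navigation and target grounding (using a lightweight multi-stage policy). The two exchange only minimal coordinate information, which keeps collaboration efficient and robust.

Link: https://arxiv.org/abs/2508.15232
Authors: Ruipu Wu,Yige Zhang,Jinyu Chen,Linjiang Huang,Shifeng Zhang,Xu Zhou,Liang Wang,Si Liu
Affiliations: Beihang University; Sangfor Technologies Inc.; Institute of Automation, Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACM MM 2025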

Abstract:Aerial Vision-and-Language Navigation (VLN) is an emerging task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. However, due to the extended trajectories and complex maneuverability of UAVs, achieving reliable UAV-VLN performance is challenging and often requires human intervention or overly detailed instructions. To harness the advantages of UAVs’ high mobility, which could provide multi-grained perspectives, while maintaining a manageable motion space for learning, we introduce a novel task called Dual-Altitude UAV Collaborative VLN (DuAl-VLN). In this task, two UAVs operate at distinct altitudes: a high-altitude UAV responsible for broad environmental reasoning, and a low-altitude UAV tasked with precise navigation. To support the training and evaluation of the DuAl-VLN, we construct the HaL-13k, a dataset comprising 13,838 collaborative high-low UAV demonstration trajectories, each paired with target-oriented language instructions. This dataset includes both unseen maps and an unseen object validation set to systematically evaluate the model’s generalization capabilities across novel environments and unfamiliar targets. To consolidate their complementary strengths, we propose a dual-UAV collaborative VLN framework, AeroDuo, where the high-altitude UAV integrates a multimodal large language model (Pilot-LLM) for target reasoning, while the low-altitude UAV employs a lightweight multi-stage policy for navigation and target grounding. The two UAVs work collaboratively and only exchange minimal coordinate information to ensure efficiency.

[CV-52] Center-Oriented Prototype Contrastive Clustering

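[Quick Read]: This paper addresses the performance bottleneck that inter-class conflicts cause when contrastive learning is applied to clustering, and in particular the deviation between hard prototypes and the true cluster centers in existing prototype-contrast methods. The key to its solution is a center-oriented prototype contrastive clustering framework with two modules. The soft prototype contrastive module weights each sample by the probability that it belongs to a cluster center when computing class prototypes, which mitigates inter-class conflicts and reduces prototype drift. The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples, so that features carry transformation-invariant semantics and form compact intra-cluster distributions, giving prototype computation a reliable basis.

Link: https://arxiv.org/abs/2508.15231
Authors: Shihao Dong,Xiaotong Zhou,Yuhui Zheng,Huiying Xu,Xinzhong Zhu
Affiliations: Nanjing University of Information Science and Technology; Qinghai Normal University; Zhejiang Normal University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: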

Abstract:Contrastive learning is widely used in clustering tasks due to its discriminative representation. However, the conflict problem between classes is difficult to solve effectively. Existing methods try to solve this problem through prototype contrast, but there is a deviation between the calculation of hard prototypes and the true cluster center. To address this problem, we propose a center-oriented prototype contrastive clustering framework, which consists of a soft prototype contrastive module and a dual consistency learning module. In short, the soft prototype contrastive module uses the probability that the sample belongs to the cluster center as a weight to calculate the prototype of each category, while avoiding inter-class conflicts and reducing prototype drift. The dual consistency learning module aligns different transformations of the same sample and the neighborhoods of different samples respectively, ensuring that the features have transformation-invariant semantic information and compact intra-cluster distribution, while providing reliable guarantees for the calculation of prototypes. Extensive experiments on five datasets show that the proposed method is effective compared to the SOTA. Our code is published on this https URL.
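
The soft-prototype idea can be illustrated with a small sketch: each prototype is a probability-weighted mean of the features rather than the mean of hard-assigned members. The softmax over negative squared distances, the `temperature` parameter, and the function names are assumptions for illustration; the abstract does not say how the membership probabilities are obtained.

```python
import math

def soft_assignments(features, centers, temperature=1.0):
    """Softmax over negative squared distances to each center (assumed form)."""
    probs = []
    for x in features:
        logits = [-sum((a - b) ** 2 for a, b in zip(x, c)) / temperature
                  for c in centers]
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def soft_prototypes(features, probs):
    """Probability-weighted mean of the features, one prototype per cluster."""
    dim, k = len(features[0]), len(probs[0])
    prototypes = []
    for j in range(k):
        weight = sum(p[j] for p in probs)
        prototypes.append([
            sum(p[j] * x[i] for p, x in zip(probs, features)) / weight
            for i in range(dim)
        ])
    return prototypes

features = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [4.0, 5.0]]
centers = [[0.0, 1.0], [10.0, 10.0]]
probs = soft_assignments(features, centers, temperature=10.0)
prototypes = soft_prototypes(features, probs)
```

The ambiguous point `[4.0, 5.0]` contributes to both prototypes in proportion to its membership probabilities instead of being forced into one cluster, which is the behavior the summary credits with reducing prototype drift.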

[CV-53] Collaborative Multi-Modal Coding for High-Quality 3D Generation

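[Quick Read]: This paper addresses how current 3D generative models underuse multi-modal data: most 3D-native generative architectures operate within a single modality (for example, only RGB images or only point clouds), overlooking the complementary strengths of RGB images, RGBD, and point clouds, and they are also constrained by the scale and diversity of available training data. The key to its solution is TriMM, described as the first feed-forward 3D-native generative model that learns from basic multi-modalities. Its core contributions are: 1) collaborative multi-modal coding, which fuses modality-specific features while preserving their distinct representational strengths; 2) auxiliary 2D and 3D supervision that makes multi-modal coding more robust and better performing; and 3) a triplane latent diffusion model that generates high-quality 3D assets from the embedded multi-modal code, improving texture and geometric detail. Experiments show that TriMM, trained on a small amount of data, is competitive with models trained on large-scale datasets, and that other multi-modal datasets can feasibly be incorporated.

Link: https://arxiv.org/abs/2508.15228
Authors: Ziang Cao,Zhaoxi Chen,Liang Pan,Ziwei Liu
Affiliations: S-Lab, Nanyang Technological University, Singapore; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: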

Abstract:3D content inherently encompasses multi-modal characteristics and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality exhibits distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms-thus overlooking the complementary benefits of multi-modality data-or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point cloud). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision are introduced to raise the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves competitive performance with models trained on large-scale datasets, despite utilizing a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.

[CV-54] See it. Say it. Sorted: Agentic System for Compositional Diagram Generation

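[Quick Read]: This paper addresses the automatic conversion of rough hand sketches into precise, compositional diagrams such as flowcharts (sketch-to-diagram generation). Diffusion models excel at photorealistic image generation but struggle with the spatial precision, alignment, and symbolic structure that flowcharts require. The key to its solution is a training-free agentic system, "See it. Say it. Sorted.", in which a vision-language model (VLM) and large language models (LLMs) iteratively produce an editable Scalable Vector Graphics (SVG) program: a Critic VLM proposes qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with different strategies (conservative-aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. The design favors qualitative reasoning over brittle numerical estimates, preserves global constraints such as alignment and connectivity, and supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, it reconstructs layout and structure more faithfully than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives such as multi-headed arrows without inserting unwanted text.

Link: https://arxiv.org/abs/2508.15222
Authors: Hantao Zhang,Jingyang Liu,Ed Li
Affiliations: Yale University; University of Edinburgh
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: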

Abstract:We study sketch-to-diagram generation: converting rough hand sketches into precise, compositional diagrams. Diffusion models excel at photorealism but struggle with the spatial precision, alignment, and symbolic structure required for flowcharts. We introduce See it. Say it. Sorted., a training-free agentic system that couples a Vision-Language Model (VLM) with Large Language Models (LLMs) to produce editable Scalable Vector Graphics (SVG) programs. The system runs an iterative loop in which a Critic VLM proposes a small set of qualitative, relational edits; multiple candidate LLMs synthesize SVG updates with diverse strategies (conservative-aggressive, alternative, focused); and a Judge VLM selects the best candidate, ensuring stable improvement. This design prioritizes qualitative reasoning over brittle numerical estimates, preserves global constraints (e.g., alignment, connectivity), and naturally supports human-in-the-loop corrections. On 10 sketches derived from flowcharts in published papers, our method more faithfully reconstructs layout and structure than two frontier closed-source image generation LLMs (GPT-5 and Gemini-2.5-Pro), accurately composing primitives (e.g., multi-headed arrows) without inserting unwanted text. Because outputs are programmatic SVGs, the approach is readily extensible to presentation tools (e.g., PowerPoint) via APIs and can be specialized with improved prompts and task-specific tools. The codebase is open-sourced at this https URL.
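
The control flow of the critic, candidate, and judge loop can be sketched with toy stand-ins. Everything here is hypothetical: the diagram is reduced to a dictionary of x-positions, the Critic VLM, candidate LLMs, and Judge VLM are replaced by plain functions, and the judge scores against a known target layout, which a real system would not have. The sketch only shows how qualitative edits, diverse proposals, and a keep-if-better judge combine into a stable loop.

```python
def layout_error(layout, target):
    """Toy quality score: total horizontal offset from the sketch (lower is better)."""
    return sum(abs(layout[name] - target[name]) for name in target)

def make_critic(target):
    """Stand-in for the Critic VLM: emits qualitative edits, not pixel values."""
    def critic(layout):
        return [(name, "right" if layout[name] < target[name] else "left")
                for name in layout if layout[name] != target[name]]
    return critic

def make_candidate(step, limit=None):
    """Stand-in for a candidate LLM that applies edits with a fixed step size."""
    def propose(layout, edits):
        updated = dict(layout)
        for name, direction in edits[:limit]:
            updated[name] += step if direction == "right" else -step
        return updated
    return propose

def make_judge(target):
    """Stand-in for the Judge VLM: keeps the current layout unless a proposal is better."""
    def judge(current, proposals):
        return min([current] + proposals, key=lambda cand: layout_error(cand, target))
    return judge

def refine(layout, critic, candidates, judge, max_rounds=10):
    """Iterate critic -> candidates -> judge until the critic has no more edits."""
    for _ in range(max_rounds):
        edits = critic(layout)
        if not edits:
            break
        layout = judge(layout, [propose(layout, edits) for propose in candidates])
    return layout

sketch = {"start": 0, "decide": 100, "end": 200}   # relations read off the sketch
draft = {"start": 50, "decide": 20, "end": 200}    # initial, misaligned diagram
candidates = [make_candidate(10), make_candidate(40), make_candidate(10, limit=1)]
result = refine(draft, make_critic(sketch), candidates, make_judge(sketch))
```

Because the judge always includes the current layout among the options, the error never increases between rounds, which mirrors the "stable improvement" property the abstract describes.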

[CV-55] STAGNet: A Spatio-Temporal Graph and LSTM Framework for Accident Anticipation

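[Quick Read]: This paper addresses accident anticipation from dash-cam video, where the core challenge is extracting effective spatio-temporal features from visual input alone to give accurate early warnings. The key to its solution is STAGNet, a model that incorporates better spatio-temporal features and aggregates them through a recurrent network, improving accident prediction without relying on sensors such as LiDAR or radar. Experiments on three public datasets show that STAGNet achieves higher average precision and mean time-to-collision than previous methods, both under cross-validation on a single dataset and when trained and tested on different datasets.

Link: https://arxiv.org/abs/2508.15216
Authors: Vipooshan Vipulananthan,Kumudu Mohottala,Kavindu Chinthana,Nimsara Paramulla,Charith D Chitraranjan
Affiliations: University of Moratuwa
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: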

Abstract:Accident prediction and timely warnings play a key role in improving road safety by reducing the risk of injury to road users and minimizing property damage. Advanced Driver Assistance Systems (ADAS) are designed to support human drivers and are especially useful when they can anticipate potential accidents before they happen. While many existing systems depend on a range of sensors such as LiDAR, radar, and GPS, relying solely on dash-cam video input presents a more challenging but a more cost-effective and easily deployable solution. In this work, we incorporate better spatio-temporal features and aggregate them through a recurrent network to improve upon state-of-the-art graph neural networks for predicting accidents from dash-cam videos. Experiments using three publicly available datasets show that our proposed STAGNet model achieves higher average precision and mean time-to-collision values than previous methods, both when cross-validated on a given dataset and when trained and tested on different datasets.

[CV-56] DyMorph-B2I: Dynamic and Morphology-Guided Binary-to-Instance Segmentation for Renal Pathology

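[Quick Read]: This paper addresses the limited precision of morphological quantification of functional units in renal pathology. The core difficulty is that most existing datasets and automated methods provide only binary (semantic) masks rather than instance-level segmentation, which limits the accuracy of downstream analyses. The key to its solution is DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline that integrates watershed, skeletonization, and morphological operations in a unified framework, complemented by adaptive geometric refinement and hyperparameters that can be tuned for each class of functional unit. It thereby separates adherent and morphologically heterogeneous structures and outperforms individual classical approaches and naive combinations of them.

Link: https://arxiv.org/abs/2508.15208
Authors: Leiyue Zhao,Yuechen Yang,Yanfan Zhu,Haichun Yang,Yuankai Huo,Paul D. Simonson,Kenji Ikemura,Mert R. Sabuncu,Yihe Yang,Ruining Deng
Affiliations: Southern University of Science and Technology; Vanderbilt University; Vanderbilt University Medical Center; Weill Cornell Medicine; Cornell Tech; Northwell Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 5 figures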

Abstract:Accurate morphological quantification of renal pathology functional units relies on instance-level segmentation, yet most existing datasets and automated methods provide only binary (semantic) masks, limiting the precision of downstream analyses. Although classical post-processing techniques such as watershed, morphological operations, and skeletonization, are often used to separate semantic masks into instances, their individual effectiveness is constrained by the diverse morphologies and complex connectivity found in renal tissue. In this study, we present DyMorph-B2I, a dynamic, morphology-guided binary-to-instance segmentation pipeline tailored for renal pathology. Our approach integrates watershed, skeletonization, and morphological operations within a unified framework, complemented by adaptive geometric refinement and customizable hyperparameter tuning for each class of functional unit. Through systematic parameter optimization, DyMorph-B2I robustly separates adherent and heterogeneous structures present in binary masks. Experimental results demonstrate that our method outperforms individual classical approaches and naïve combinations, enabling superior instance separation and facilitating more accurate morphometric analysis in renal pathology workflows. The pipeline is publicly available at: this https URL.

[CV-57] Adversarial Agent Behavior Learning in Autonomous Driving Using Deep Reinforcement Learning

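[Quick Read]: This paper addresses how, in reinforcement learning for safety-critical settings such as autonomous driving, the rule-based surrounding agents can be modeled so that they provoke failure scenarios. Commonly used behavior-modeling strategies, including IDM models, have difficulty capturing potentially adversarial behavior. The paper proposes a learning-based method that derives adversarial behavior for the rule-based agents in order to cause failure scenarios, which strengthens the evaluation of agent robustness. The key to its solution is to learn worst-case behavior patterns automatically and to evaluate the adversarial agent against all the rule-based agents, showing the resulting decrease in cumulative reward.

Link: https://arxiv.org/abs/2508.15207
Authors: Arjun Srinivasan,Anubhav Paras,Aniket Bera
Affiliations: University of Maryland, College Park, USA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: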

Abstract:Existing approaches in reinforcement learning train an agent to learn desired optimal behavior in an environment with rule based surrounding agents. In safety critical applications such as autonomous driving it is crucial that the rule based agents are modelled properly. Several behavior modelling strategies and IDM models are used currently to model the surrounding agents. We present a learning based method to derive the adversarial behavior for the rule based agents to cause failure scenarios. We evaluate our adversarial agent against all the rule based agents and show the decrease in cumulative reward.

[CV-58] SurgWound-Bench: A Benchmark for Surgical Wound Diagnosis

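[Quick Read]: This paper addresses key obstacles in preventing and managing surgical site infection (SSI): no public dataset or benchmark covers diverse types of surgical wounds, and existing deep learning approaches are held back by the high cost of expert annotation and by data privacy concerns. The key to its solution has three parts. First, it builds SurgWound, the first open-source surgical wound dataset, with 697 images annotated by three professional surgeons with eight fine-grained clinical attributes. Second, it introduces the first benchmark for surgical wound diagnosis, covering visual question answering (VQA) and report generation. Third, it designs WoundQwen, a three-stage learning framework in which multimodal large language models (MLLMs) successively identify wound characteristics, assess infection risk, and generate a comprehensive report, supporting personalized wound care, timely intervention, and better patient outcomes.

Link: https://arxiv.org/abs/2508.15189
Authors: Jiahao Xu(Ohio State University, USA),Changchang Yin(Ohio State University Wexner Medical Center, USA),Odysseas Chatzipanagiotou(Ohio State University Wexner Medical Center, USA),Diamantis Tsilimigras(Ohio State University Wexner Medical Center, USA),Kevin Clear(Ohio State University Wexner Medical Center, USA),Bingsheng Yao(Northeastern University, USA),Dakuo Wang(Northeastern University, USA),Timothy Pawlik(Ohio State University Wexner Medical Center, USA),Ping Zhang(Ohio State University, USA)
Affiliations: The Ohio State University; Wexner Medical Center; Northeastern University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: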

Abstract:Surgical site infection (SSI) is one of the most common and costly healthcare-associated infections, and surgical wound care remains a significant clinical challenge in preventing SSIs and improving patient outcomes. While recent studies have explored the use of deep learning for preliminary surgical wound screening, progress has been hindered by concerns over data privacy and the high costs associated with expert annotation. Currently, no publicly available dataset or benchmark encompasses various types of surgical wounds, resulting in the absence of an open-source Surgical-Wound screening tool. To address this gap: (1) we present SurgWound, the first open-source dataset featuring a diverse array of surgical wound types. It contains 697 surgical wound images annotated by 3 professional surgeons with eight fine-grained clinical attributes. (2) Based on SurgWound, we introduce the first benchmark for surgical wound diagnosis, which includes visual question answering (VQA) and report generation tasks to comprehensively evaluate model performance. (3) Furthermore, we propose a three-stage learning framework, WoundQwen, for surgical wound diagnosis. In the first stage, we employ five independent MLLMs to accurately predict specific surgical wound characteristics. In the second stage, these predictions serve as additional knowledge inputs to two MLLMs responsible for diagnosing outcomes, which assess infection risk and guide subsequent interventions. In the third stage, we train a MLLM that integrates the diagnostic results from the previous two stages to produce a comprehensive report. This three-stage framework can analyze detailed surgical wound characteristics and provide subsequent instructions to patients based on surgical images, paving the way for personalized wound care, timely intervention, and improved patient outcomes.

[CV-59] MeSS: City Mesh-Guided Outdoor Scene Generation with Cross-View Consistent Diffusion

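[Quick Read]: This paper addresses the lack of realistic textures in city mesh models, which restricts their use in applications such as virtual urban navigation and autonomous driving. The key to its solution is MeSS, a mesh-based scene synthesis method that enhances image diffusion models by improving cross-view consistency: it first generates geometrically consistent sparse views with Cascaded Outpainting ControlNets, then propagates denser intermediate views with the AGInpaint module, and finally removes visual inconsistencies such as varying exposure globally with the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed, yielding high-quality, style-consistent, geometry-aligned outdoor scenes.

Link: https://arxiv.org/abs/2508.15169
Authors: Xuyang Chen,Zhijun Zhai,Kaixuan Zhou,Zengmao Wang,Jianan He,Dong Wang,Yanfeng Zhang,mingwei Sun,Rüdiger Westermann,Konrad Schindler,Liqiu Meng
Affiliations: 1. University of Science and Technology of China; 2. Zhejiang University; 3. Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: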

Abstract:Mesh models have become increasingly accessible for numerous cities; however, the lack of realistic textures restricts their application in virtual urban navigation and autonomous driving. To address this, this paper proposes MeSS (Mesh-based Scene Synthesis) for generating high-quality, style-consistent outdoor scenes with city mesh models serving as the geometric prior. While image and video diffusion models can leverage spatial layouts (such as depth maps or HD maps) as control conditions to generate street-level perspective views, they are not directly applicable to 3D scene generation. Video diffusion models excel at synthesizing consistent view sequences that depict scenes but often struggle to adhere to predefined camera paths or align accurately with rendered control videos. In contrast, image diffusion models, though unable to guarantee cross-view visual consistency, can produce more geometry-aligned results when combined with ControlNet. Building on this insight, our approach enhances image diffusion models by improving cross-view consistency. The pipeline comprises three key stages: first, we generate geometrically consistent sparse views using Cascaded Outpainting ControlNets; second, we propagate denser intermediate views via a component dubbed AGInpaint; and third, we globally eliminate visual inconsistencies (e.g., varying exposure) using the GCAlign module. Concurrently with generation, a 3D Gaussian Splatting (3DGS) scene is reconstructed by initializing Gaussian balls on the mesh surface. Our method outperforms existing approaches in both geometric alignment and generation quality. Once synthesized, the scene can be rendered in diverse styles through relighting and style transfer techniques.

[CV-60] XDR-LVLM: An Explainable Vision-Language Large Model for Diabetic Retinopathy Diagnosis

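[Quick Read]: This paper addresses the lack of clinical interpretability caused by the black-box nature of deep learning models for automated diabetic retinopathy (DR) diagnosis, which limits their adoption in practice. The key to its solution is XDR-LVLM, a framework built on Vision-Language Large Models (LVLMs) that combines a specialized Medical Vision Encoder and an LVLM Core with Multi-task Prompt Engineering and Multi-stage Fine-tuning to understand pathological features in fundus images and generate comprehensive diagnostic reports. The reports include DR severity grading, identification of key lesions (for example hemorrhages, exudates, and microaneurysms), and natural-language explanations, improving both diagnostic accuracy and clinical interpretability.

Link: https://arxiv.org/abs/2508.15168
Authors: Masato Ito,Kaito Tanaka,Keisuke Matsuda,Aya Nakayama
Affiliations: SANNO University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: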

Abstract:Diabetic Retinopathy (DR) is a major cause of global blindness, necessitating early and accurate diagnosis. While deep learning models have shown promise in DR detection, their black-box nature often hinders clinical adoption due to a lack of transparency and interpretability. To address this, we propose XDR-LVLM (eXplainable Diabetic Retinopathy Diagnosis with LVLM), a novel framework that leverages Vision-Language Large Models (LVLMs) for high-precision DR diagnosis coupled with natural language-based explanations. XDR-LVLM integrates a specialized Medical Vision Encoder, an LVLM Core, and employs Multi-task Prompt Engineering and Multi-stage Fine-tuning to deeply understand pathological features within fundus images and generate comprehensive diagnostic reports. These reports explicitly include DR severity grading, identification of key pathological concepts (e.g., hemorrhages, exudates, microaneurysms), and detailed explanations linking observed features to the diagnosis. Extensive experiments on the Diabetic Retinopathy (DDR) dataset demonstrate that XDR-LVLM achieves state-of-the-art performance, with a Balanced Accuracy of 84.55% and an F1 Score of 79.92% for disease diagnosis, and superior results for concept detection (77.95% BACC, 66.88% F1). Furthermore, human evaluations confirm the high fluency, accuracy, and clinical utility of the generated explanations, showcasing XDR-LVLM’s ability to bridge the gap between automated diagnosis and clinical needs by providing robust and interpretable insights.

[CV-61] Reliable Multi-view 3D Reconstruction for 'Just-in-time' Edge Environments

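[Quick Read]: This paper addresses the reliability of multi-view 3D reconstruction in edge computing environments, whose dynamism and operational adversities can produce spatiotemporally correlated disruptions that interrupt camera operation and cause sustained degradation of reconstruction quality. The key to its solution is a portfolio-theory-inspired edge resource management strategy whose optimization problem is solved quickly with a genetic algorithm, selecting cameras so that reconstruction quality remains satisfactory even when the system is disrupted.

Link: https://arxiv.org/abs/2508.15158
Authors: Md. Nurul Absur,Abhinav Kumar,Swastik Brahma,Saptarshi Debroy
Affiliations: City University of New York; University of Cincinnati
Categories: Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC)
Comments: 11 Pages, 7 Figures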

Abstract:Multi-view 3D reconstruction applications are revolutionizing critical use cases that require rapid situational-awareness, such as emergency response, tactical scenarios, and public safety. In many cases, their near-real-time latency requirements and ad-hoc needs for compute resources necessitate adoption of 'Just-in-time' edge environments where the system is set up on the fly to support the applications during the mission lifetime. However, reliability issues can arise from the inherent dynamism and operational adversities of such edge environments, resulting in spatiotemporally correlated disruptions that impact the camera operations, which can lead to sustained degradation of reconstruction quality. In this paper, we propose a novel portfolio theory inspired edge resource management strategy for reliable multi-view 3D reconstruction against possible system disruptions. Our proposed methodology can guarantee reconstruction quality satisfaction even when the cameras are prone to spatiotemporally correlated disruptions. The portfolio theoretic optimization problem is solved using a genetic algorithm that converges quickly for realistic system settings. Using publicly available and customized 3D datasets, we demonstrate the proposed camera selection strategy’s benefits in guaranteeing reliable 3D reconstruction against traditional baseline strategies, under spatiotemporal disruptions.
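
The portfolio-theory intuition can be shown with a small mean-variance sketch: pick the camera subset that maximizes expected quality minus a penalty on variance, where the covariance matrix encodes which cameras tend to be disrupted together. The objective form, the numbers, and the function names are assumptions, and the exhaustive search is a stand-in for the genetic algorithm the paper uses; the paper's actual formulation is not given in the abstract.

```python
from itertools import combinations

def portfolio_score(subset, mean_quality, cov, risk_aversion):
    """Expected quality of the selected cameras minus a variance penalty."""
    expected = sum(mean_quality[i] for i in subset)
    variance = sum(cov[i][j] for i in subset for j in subset)
    return expected - risk_aversion * variance

def select_cameras(mean_quality, cov, k, risk_aversion):
    """Exhaustive search over k-camera subsets (stand-in for a genetic algorithm)."""
    return max(combinations(range(len(mean_quality)), k),
               key=lambda s: portfolio_score(s, mean_quality, cov, risk_aversion))

mean_quality = [0.9, 0.9, 0.8, 0.7]   # hypothetical per-camera quality contribution
cov = [
    [0.040, 0.035, 0.0, 0.0],          # cameras 0 and 1 tend to fail together
    [0.035, 0.040, 0.0, 0.0],
    [0.0, 0.0, 0.040, 0.0],
    [0.0, 0.0, 0.0, 0.040],
]
risk_neutral = select_cameras(mean_quality, cov, k=2, risk_aversion=0.0)
risk_averse = select_cameras(mean_quality, cov, k=2, risk_aversion=5.0)
```

With no risk penalty the two best cameras are chosen even though they fail together. With a penalty, the selection swaps one of them for a camera whose disruptions are uncorrelated, trading a little expected quality for reliability.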

[CV-62] HiRQA: Hierarchical Ranking and Quality Alignment for Opinion-Unaware Image Quality Assessment

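[Quick Read]: This paper addresses the limited generalization of no-reference image quality assessment (NR-IQA), which stems from dataset biases and reliance on subjective labels. The key to its solution is HiRQA, a self-supervised, opinion-unaware framework for hierarchical ranking and quality alignment. Its main contributions are: 1) a higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs; 2) an embedding distance loss that enforces consistency between distances in feature space and perceptual differences; and 3) a training-time contrastive alignment loss, guided by structured textual prompts, that improves the learned representation. The method predicts quality scores from the input image alone, without pristine references or auxiliary modalities at inference time, and shows strong generalization and state-of-the-art (SOTA) performance on both synthetic and authentic degradations.

Link: https://arxiv.org/abs/2508.15130
Authors: Vaishnav Ramesh,Haining Wang,Md Jahidul Islam
Affiliations: University of Florida
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 8 figures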

Abstract:Despite significant progress in no-reference image quality assessment (NR-IQA), dataset biases and reliance on subjective labels continue to hinder their generalization performance. We propose HiRQA (Hierarchical Ranking and Quality Alignment), a self-supervised, opinion-unaware framework that offers a hierarchical, quality-aware embedding through a combination of ranking and contrastive learning. Unlike prior approaches that depend on pristine references or auxiliary modalities at inference time, HiRQA predicts quality scores using only the input image. We introduce a novel higher-order ranking loss that supervises quality predictions through relational ordering across distortion pairs, along with an embedding distance loss that enforces consistency between feature distances and perceptual differences. A training-time contrastive alignment loss, guided by structured textual prompts, further enhances the learned representation. Trained only on synthetic distortions, HiRQA generalizes effectively to authentic degradations, as demonstrated through evaluation on various distortions such as lens flare, haze, motion blur, and low-light conditions. For real-time deployment, we introduce HiRQA-S, a lightweight variant with an inference time of only 3.5 ms per image. Extensive experiments across synthetic and authentic benchmarks validate HiRQA’s state-of-the-art (SOTA) performance, strong generalization ability, and scalability.
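
The abstract does not give the form of the higher-order ranking loss. As a simpler first-order illustration of how ranking can supervise quality scores without opinion labels, the sketch below uses a pairwise hinge (margin) ranking loss over versions of one image ordered by distortion severity. The function name, the margin value, and the example scores are assumptions.

```python
def pairwise_ranking_loss(scores_by_severity, margin=0.1):
    """Mean hinge loss over all pairs; index 0 is the least distorted version.

    A less distorted version should score at least `margin` higher than a
    more distorted one; pairs that satisfy this contribute zero loss.
    """
    losses = []
    n = len(scores_by_severity)
    for i in range(n):
        for j in range(i + 1, n):
            gap = scores_by_severity[i] - scores_by_severity[j]
            losses.append(max(0.0, margin - gap))
    return sum(losses) / len(losses)

well_ordered = pairwise_ranking_loss([0.9, 0.6, 0.2])   # scores fall with severity
mis_ordered = pairwise_ranking_loss([0.2, 0.6, 0.9])    # scores rise with severity
```

The only supervision is the known ordering of synthetic distortion levels, which is why this family of losses needs no human opinion scores.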

[CV-63] Side Effects of Erasing Concepts from Diffusion Models EMNLP2025

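[Quick Read]: This paper addresses the limited robustness of Concept Erasure Techniques (CETs) for text-to-image (T2I) generative models: current CETs are easy to circumvent and have significant side effects. The core challenge is to prohibit generation of specified "target" concepts while preserving high-quality synthesis of the remaining concepts and avoiding anomalies such as semantic confusion or attribute leakage. The key to its solution is Side Effect Evaluation (SEE), a systematic evaluation framework that builds a test set from hierarchical and compositional prompts and automatically quantifies CET behavior along three dimensions: impact on neighboring concepts, evasion of targets, and attribute leakage. Experiments show that existing CETs can be circumvented through superclass-subclass hierarchies and semantically similar prompts (such as compositional variants of the target), and that they exhibit counterintuitive attention concentration or dispersal, exposing the fragility and potential risks of current CETs.

Link: https://arxiv.org/abs/2508.15124
Authors: Shaswati Saha,Sourajit Saha,Manas Gaur,Tejas Gokhale
Affiliations: University of Maryland, Baltimore County
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: Findings of the Association for Computational Linguistics: EMNLP 2025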

Abstract:Concerns about text-to-image (T2I) generative models infringing on privacy, copyright, and safety have led to the development of Concept Erasure Techniques (CETs). The goal of an effective CET is to prohibit the generation of undesired "target" concepts specified by the user, while preserving the ability to synthesize high-quality images of the remaining concepts. In this work, we demonstrate that CETs can be easily circumvented and present several side effects of concept erasure. For a comprehensive measurement of the robustness of CETs, we present Side Effect Evaluation (SEE), an evaluation benchmark that consists of hierarchical and compositional prompts that describe objects and their attributes. This dataset and our automated evaluation pipeline quantify side effects of CETs across three aspects: impact on neighboring concepts, evasion of targets, and attribute leakage. Our experiments reveal that CETs can be circumvented by using superclass-subclass hierarchy and semantically similar prompts, such as compositional variants of the target. We show that CETs suffer from attribute leakage and counterintuitive phenomena of attention concentration or dispersal. We release our dataset, code, and evaluation tools to aid future work on robust concept erasure.

[CV-64] CurveFlow: Curvature-Guided Flow Matching for Image Generation

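[Quick read]: This paper addresses a limitation of existing rectified flow models: enforcing linear trajectories can push generation through low-probability regions of the data manifold, which in turn harms the semantic alignment between images and text instructions (instructional compliance). The key to its solution is CurveFlow, a framework that incorporates curvature guidance directly into flow matching to learn smooth, non-linear generation trajectories, together with a robust curvature regularization technique that penalizes abrupt changes in the trajectory's intrinsic curvature. The result is markedly better semantic consistency and image quality in text-to-image generation.

Link: https://arxiv.org/abs/2508.15093
Authors: Yan Luo,Drake Du,Hao Huang,Yi Fang,Mengyu Wang
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: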

Abstract:Existing rectified flow models are based on linear trajectories between data and noise distributions. This linearity enforces zero curvature, which can inadvertently force the image generation process through low-probability regions of the data manifold. A key question remains underexplored: how does the curvature of these trajectories correlate with the semantic alignment between generated images and their corresponding captions, i.e., instructional compliance? To address this, we introduce CurveFlow, a novel flow matching framework designed to learn smooth, non-linear trajectories by directly incorporating curvature guidance into the flow path. Our method features a robust curvature regularization technique that penalizes abrupt changes in the trajectory’s intrinsic this http URL experiments on MS COCO 2014 and 2017 demonstrate that CurveFlow achieves state-of-the-art performance in text-to-image generation, significantly outperforming both standard rectified flow variants and other non-linear baselines like Rectified Diffusion. The improvements are especially evident in semantic consistency metrics such as BLEU, METEOR, ROUGE, and CLAIR. This confirms that our curvature-aware modeling substantially enhances the model’s ability to faithfully follow complex instructions while simultaneously maintaining high image quality. The code is made publicly available at this https URL.
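
As a rough illustration of why a linear trajectory has zero curvature, the sketch below scores a discretized path by its squared second finite differences. This is not CurveFlow's regularizer, whose exact form the abstract does not give; the function name `second_difference_penalty` and the toy paths are assumptions made purely for illustration.

```python
def second_difference_penalty(points):
    """Sum of squared second finite differences along a discretized path.

    `points` is a list of equal-length coordinate lists sampled at uniform
    time steps. A uniformly sampled straight line scores exactly zero.
    """
    total = 0.0
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        for a, b, c in zip(prev, cur, nxt):
            total += (a - 2.0 * b + c) ** 2
    return total


# Straight path from (0, 0) to (1, 1), as a rectified flow would enforce.
straight = [[t / 4, t / 4] for t in range(5)]
# Bent path with the same endpoints (hypothetical non-linear trajectory).
bent = [[t / 4, (t / 4) ** 2] for t in range(5)]

straight_penalty = second_difference_penalty(straight)
bent_penalty = second_difference_penalty(bent)
print(straight_penalty, bent_penalty)
```

The straight path scores zero while the bent one does not, which is the sense in which linearity "enforces zero curvature".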

[CV-65] GasTwinFormer: A Hybrid Vision Transformer for Livestock Methane Emission Segmentation and Dietary Classification in Optical Gas Imaging ICCV

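[Quick read]: This paper tackles automated monitoring of livestock methane (CH₄) emissions in support of climate mitigation strategies. Conventional methods struggle to deliver accurate, real-time identification of methane emissions and classification of diet type. The study proposes GasTwinFormer, a hybrid vision transformer whose core innovation is a novel Mix Twin encoder that alternates between spatially-reduced global attention and locally-grouped attention, improving spatial segmentation of methane plumes in optical gas imaging (OGI) while keeping computation efficient. A lightweight LR-ASPP decoder performs multi-scale feature aggregation, and methane segmentation and dietary classification are handled simultaneously in one unified framework. On the authors' own dataset, the first OGI-based beef cattle methane emission dataset, the model reaches 74.47% mIoU and 83.63% mF1 for segmentation and 100% dietary classification accuracy, demonstrating the effectiveness of leveraging diet-emission correlations.

Link: https://arxiv.org/abs/2508.15057
Authors: Toqi Tahamid Sarker,Mohamed Embaby,Taminul Islam,Amer AbuGhazaleh,Khaled R Ahmed
Affiliations: Southern Illinois University Carbondale; University of California, Davis
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted for publication at ICCVW 2025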

Abstract:Livestock methane emissions represent 32% of human-caused methane production, making automated monitoring critical for climate mitigation strategies. We introduce GasTwinFormer, a hybrid vision transformer for real-time methane emission segmentation and dietary classification in optical gas imaging through a novel Mix Twin encoder alternating between spatially-reduced global attention and locally-grouped attention mechanisms. Our architecture incorporates a lightweight LR-ASPP decoder for multi-scale feature aggregation and enables simultaneous methane segmentation and dietary classification in a unified framework. We contribute the first comprehensive beef cattle methane emission dataset using OGI, containing 11,694 annotated frames across three dietary treatments. GasTwinFormer achieves 74.47% mIoU and 83.63% mF1 for segmentation while maintaining exceptional efficiency with only 3.348M parameters, 3.428G FLOPs, and 114.9 FPS inference speed. Additionally, our method achieves perfect dietary classification accuracy (100%), demonstrating the effectiveness of leveraging diet-emission correlations. Extensive ablation studies validate each architectural component, establishing GasTwinFormer as a practical solution for real-time livestock emission monitoring. Please see our project page at this http URL.
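
For readers unfamiliar with the segmentation metric quoted above, here is a minimal sketch of mean intersection-over-union (mIoU) on flat label lists. It assumes the common convention of averaging per-class IoU over classes that appear in either the prediction or the target; the paper's exact evaluation protocol is not stated in the abstract, and the toy labels are made up.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in `pred` or `target`.

    `pred` and `target` are flat lists of integer class labels, one per pixel.
    """
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy example: class 0 = background, class 1 = methane plume.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
print(score)  # 0.5
```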

[CV-66] Decentralized Vision-Based Autonomous Aerial Wildlife Monitoring

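[Quick read]: This paper addresses the difficulty of efficient parallel deployment and individual identification in wildlife field monitoring, particularly the lack of scalable, low-bandwidth, sensor-minimal automated solutions for complex unstructured environments. The key is a decentralized vision-based multi-quadrotor system with novel vision-based coordination and tracking algorithms that robustly identify and continuously track individual large animals without centralized communication or control, supporting applications such as collective behavioral analysis and health interventions.

Link: https://arxiv.org/abs/2508.15038
Authors: Makram Chahine,William Yang,Alaa Maalouf,Justin Siriska,Ninad Jadhav,Daniel Vogt,Stephanie Gil,Robert Wood,Daniela Rus
Affiliations: MIT; Harvard University
Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multiagent Systems (cs.MA)
Comments: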

Abstract:Wildlife field operations demand efficient parallel deployment methods to identify and interact with specific individuals, enabling simultaneous collective behavioral analysis, and health and safety interventions. Previous robotics solutions approach the problem from the herd perspective, or are manually operated and limited in scale. We propose a decentralized vision-based multi-quadrotor system for wildlife monitoring that is scalable, low-bandwidth, and sensor-minimal (single onboard RGB camera). Our approach enables robust identification and tracking of large species in their natural habitat. We develop novel vision-based coordination and tracking algorithms designed for dynamic, unstructured environments without reliance on centralized communication or control. We validate our system through real-world experiments, demonstrating reliable deployment in diverse field conditions.

[CV-67] Reversible Unfolding Network for Concealed Visual Perception with Generative Refinement

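[Quick read]: This paper addresses uncertainty in concealed visual perception (CVP). Existing methods usually apply reversible modeling strategies only in the mask domain and overlook the potential of the RGB domain. The key to the solution is a Reversible Unfolding Network with Generative Refinement (RUN++), which formulates the CVP task as a mathematical optimization problem and unfolds its iterative solution into a multi-stage deep network. Each stage integrates three dedicated modules: a Concealed Object Region Extraction (CORE) module for reversible modeling in the mask domain, a Context-Aware Region Enhancement (CARE) module that extends the principle to the RGB domain for better foreground-background separation, and a Finetuning Iteration via Noise-based Enhancement (FINE) module. FINE introduces a targeted Bernoulli diffusion model that refines only the uncertain regions of the segmentation mask, using the generative power of diffusion to restore fine detail without the computational cost of full-image processing. This design combines reversible priors with generative models, substantially reducing false positives and false negatives, and yields a framework for CVP systems that stay robust under real-world degradations.

Link: https://arxiv.org/abs/2508.15027
Authors: Chunming He,Fengyang Xiao,Rihan Zhang,Chengyu Fang,Deng-Ping Fan,Sina Farsiu
Affiliations: Duke University; Tsinghua University; Nankai University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 21 tables, 13 figures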

Abstract:Existing methods for concealed visual perception (CVP) often leverage reversible strategies to decrease uncertainty, yet these are typically confined to the mask domain, leaving the potential of the RGB domain underexplored. To address this, we propose a reversible unfolding network with generative refinement, termed RUN++. Specifically, RUN++ first formulates the CVP task as a mathematical optimization problem and unfolds the iterative solution into a multi-stage deep network. This approach provides a principled way to apply reversible modeling across both mask and RGB domains while leveraging a diffusion model to resolve the resulting uncertainty. Each stage of the network integrates three purpose-driven modules: a Concealed Object Region Extraction (CORE) module applies reversible modeling to the mask domain to identify core object regions; a Context-Aware Region Enhancement (CARE) module extends this principle to the RGB domain to foster better foreground-background separation; and a Finetuning Iteration via Noise-based Enhancement (FINE) module provides a final refinement. The FINE module introduces a targeted Bernoulli diffusion model that refines only the uncertain regions of the segmentation mask, harnessing the generative power of diffusion for fine-detail restoration without the prohibitive computational cost of a full-image process. This unique synergy, where the unfolding network provides a strong uncertainty prior for the diffusion model, allows RUN++ to efficiently direct its focus toward ambiguous areas, significantly mitigating false positives and negatives. Furthermore, we introduce a new paradigm for building robust CVP systems that remain effective under real-world degradations and extend this concept into a broader bi-level optimization framework.

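[CV-68] TAIGen: Training-Free Adversarial Image Generation via Diffusion Models ICCV

[Quick read]: This paper addresses the low efficiency, poor image quality, and heavy compute cost of adversarial attacks built on generative models such as diffusion models. Existing methods typically need hundreds of sampling steps to produce high-quality adversarial examples and struggle to balance attack success with visual fidelity. The key is TAIGen, a training-free black-box attack that achieves strong attack performance by injecting perturbations during the "mixing step interval" of the diffusion process, using only 3-20 sampling steps. It also uses a selective RGB channel strategy: attention maps are applied to the red channel, while GradCAM-guided perturbations are applied to the green and blue channels, preserving image structure while maximizing misclassification by the target model. The method is 10x faster than existing diffusion-based attacks, keeps PSNR above 30 dB, and yields the lowest robust accuracy, meaning the defense is least successful at purifying images generated by TAIGen.

Link: https://arxiv.org/abs/2508.15020
Authors: Susim Roy,Anubhooti Jain,Mayank Vatsa,Richa Singh
Affiliations: University at Buffalo; IIT Jodhpur
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCVW-CV4BIOM 2025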

Abstract:Adversarial attacks from generative models often produce low-quality images and require substantial computational resources. Diffusion models, though capable of high-quality generation, typically need hundreds of sampling steps for adversarial generation. This paper introduces TAIGen, a training-free black-box method for efficient adversarial image generation. TAIGen produces adversarial examples using only 3-20 sampling steps from unconditional diffusion models. Our key finding is that perturbations injected during the mixing step interval achieve comparable attack effectiveness without processing all timesteps. We develop a selective RGB channel strategy that applies attention maps to the red channel while using GradCAM-guided perturbations on green and blue channels. This design preserves image structure while maximizing misclassification in target models. TAIGen maintains visual quality with PSNR above 30 dB across all tested datasets. On ImageNet with VGGNet as source, TAIGen achieves 70.6% success against ResNet, 80.8% against MNASNet, and 97.8% against ShuffleNet. The method generates adversarial examples 10x faster than existing diffusion-based attacks. Our method achieves the lowest robust accuracy, indicating it is the most impactful attack as the defense mechanism is least successful in purifying the images generated by TAIGen.
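
The "PSNR above 30 dB" figure can be made concrete with a small sketch of the standard peak signal-to-noise ratio formula. The pixel values below are invented, and the assumption of 8-bit images (`max_value=255`) is mine, not something the abstract states.

```python
import math


def psnr(original, perturbed, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(original, perturbed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical clean pixels and an adversarially perturbed copy (+/- 4 levels).
clean = [100, 120, 140, 160]
adversarial = [104, 116, 144, 156]
quality_db = psnr(clean, adversarial)
print(round(quality_db, 2))
```

A perturbation of four intensity levels per pixel already sits above the 30 dB mark, which gives a feel for how small the visible change is at that quality level.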

[CV-69] adder-viz: Real-Time Visualization Software for Transcoding Event Video

[Quick read]: This paper addresses limitations of existing event video representations in flexibility, speed, and compressibility. Building on the authors' previously proposed unified ADΔER representation, it improves the adder-viz software tool for visualizing real-time event transcode processes and in-the-loop applications, making event video processing more efficient and practical.

Link: https://arxiv.org/abs/2508.14996
Authors: Andrew C. Freeman,Luke Reinkensmeyer
Affiliations: Baylor University
Subjects: Multimedia (cs.MM); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Image and Video Processing (eess.IV)
Comments: Accepted to the Open-Source Track at ACM Multimedia 2025

Abstract: Recent years have brought about a surge in neuromorphic "event" video research, primarily targeting computer vision applications. Event video eschews video frames in favor of asynchronous, per-pixel intensity samples. While much work has focused on a handful of representations for specific event cameras, these representations have shown limitations in flexibility, speed, and compressibility. We previously proposed the unified ADΔER representation to address these concerns. This paper introduces numerous improvements to the adder-viz software for visualizing real-time event transcode processes and applications in-the-loop. The MIT-licensed software is available from a centralized repository at this https URL.

[CV-70] A Vision-Based Shared-Control Teleoperation Scheme for Controlling the Robotic Arm of a Four-Legged Robot

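[Quick read]: This paper addresses safety and efficiency problems in teleoperating quadruped robots in hazardous and remote environments, in particular the higher collision risk and operator cognitive load caused by the lack of integrated obstacle detection and by unintuitive control of the robotic arm. The key is a vision-based pose estimation method that uses an external camera and a machine learning model to detect the operator's wrist position and maps that motion directly to robotic arm commands, giving intuitive real-time teleoperation. A trajectory planner detects and prevents potential collisions with both environmental obstacles and the robot's own arm. Validation on a real robot demonstrated robust performance, offering a cost-effective solution for industrial applications that demand safety, precision, and ease of use.

Link: https://arxiv.org/abs/2508.14994
Authors: Murilo Vinicius da Silva,Matheus Hipolito Carvalho,Juliano Negri,Thiago Segreto,Gustavo J. G. Lahr,Ricardo V. Godoy,Marcelo Becker
Affiliations: University of São Paulo; Instituto Israelita de Ensino e Pesquisa Albert Einstein
Subjects: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: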

Abstract:In hazardous and remote environments, robotic systems perform critical tasks demanding improved safety and efficiency. Among these, quadruped robots with manipulator arms offer mobility and versatility for complex operations. However, teleoperating quadruped robots is challenging due to the lack of integrated obstacle detection and intuitive control methods for the robotic arm, increasing collision risks in confined or dynamically changing workspaces. Teleoperation via joysticks or pads can be non-intuitive and demands a high level of expertise due to its complexity, culminating in a high cognitive load on the operator. To address this challenge, a teleoperation approach that directly maps human arm movements to the robotic manipulator offers a simpler and more accessible solution. This work proposes an intuitive remote control by leveraging a vision-based pose estimation pipeline that utilizes an external camera with a machine learning-based model to detect the operator’s wrist position. The system maps these wrist movements into robotic arm commands to control the robot’s arm in real-time. A trajectory planner ensures safe teleoperation by detecting and preventing collisions with both obstacles and the robotic arm itself. The system was validated on the real robot, demonstrating robust performance in real-time control. This teleoperation approach provides a cost-effective solution for industrial applications where safety, precision, and ease of use are paramount, ensuring reliable and intuitive robotic control in high-risk environments.

[CV-71] Paired-Sampling Contrastive Framework for Joint Physical-Digital Face Attack Detection ICCV2025

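[Quick read]: This paper addresses the vulnerability of modern face recognition systems to physical presentation attacks and digital forgeries. Conventional approaches use separate models for the two attack types, which increases system complexity and inference latency and copes poorly with combined attack vectors. The key is a Paired-Sampling Contrastive Framework that uses automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues, enabling unified spoofing detection. On the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, the method achieves an average classification error rate (ACER) of 2.10%; the model is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical to deploy.

Link: https://arxiv.org/abs/2508.14980
Authors: Andrei Balykin,Anvar Ganiev,Denis Kondranin,Kirill Polevoda,Nikolai Liudkevich,Artem Petrov
Affiliations: IDRND
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ICCV2025 FAS workshop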

Abstract:Modern face recognition systems remain vulnerable to spoofing attempts, including both physical presentation attacks and digital forgeries. Traditionally, these two attack vectors have been handled by separate models, each targeting its own artifacts and modalities. However, maintaining distinct detectors increases system complexity and inference latency and leaves systems exposed to combined attack vectors. We propose the Paired-Sampling Contrastive Framework, a unified training approach that leverages automatically matched pairs of genuine and attack selfies to learn modality-agnostic liveness cues. Evaluated on the 6th Face Anti-Spoofing Challenge Unified Physical-Digital Attack Detection benchmark, our method achieves an average classification error rate (ACER) of 2.10 percent, outperforming prior solutions. The framework is lightweight (4.46 GFLOPs) and trains in under one hour, making it practical for real-world deployment. Code and pretrained models are available at this https URL.
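
The ACER figure above is commonly computed as the mean of two error rates: APCER (attacks accepted as genuine) and BPCER (genuine samples rejected as attacks). The sketch below follows that common definition with pooled attack samples; whether the challenge pools attack types or takes a per-type maximum is not stated in the abstract, so treat the pooling as an assumption. The label encoding and toy data are also illustrative.

```python
def acer(labels, predictions):
    """Average classification error rate: (APCER + BPCER) / 2.

    Encoding assumed here: 1 = attack, 0 = bona fide (genuine).
    """
    attack_preds = [p for l, p in zip(labels, predictions) if l == 1]
    genuine_preds = [p for l, p in zip(labels, predictions) if l == 0]
    # APCER: fraction of attacks wrongly classified as bona fide.
    apcer = sum(1 for p in attack_preds if p == 0) / len(attack_preds)
    # BPCER: fraction of bona fide samples wrongly classified as attacks.
    bpcer = sum(1 for p in genuine_preds if p == 1) / len(genuine_preds)
    return (apcer + bpcer) / 2.0


labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0]  # one attack slips through
error = acer(labels, predictions)
print(error)  # 0.125
```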

[CV-72] You Only Pose Once: A Minimalists Detection Transformer for Monocular RGB Category-level 9D Multi-Object Pose Estimation

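[Quick read]: This paper addresses the core challenge of accurately recovering the 9-degree-of-freedom (9-DoF) pose of unseen instances within specific categories from a single RGB image, a problem central to robotics and automation. Existing methods usually rely on pseudo-depth maps, CAD models, or multi-stage cascades that separate 2D detection from pose estimation, adding complexity and performance bottlenecks. The paper proposes YOPO (You Only Pose Once), a single-stage, query-based unified framework that performs 2D detection and 9-DoF pose estimation jointly at the category level without additional data. Its key innovation is to augment a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost, so the whole model trains end-to-end from RGB images and category-level pose labels alone. On REAL275 it reaches 79.6% IoU₅₀ and 54.1% on the 10°10cm metric, substantially narrowing the gap to RGB-D systems.

Link: https://arxiv.org/abs/2508.14965
Authors: Hakjin Lee,Junghoon Seo,Jaehoon Sim
Affiliations: PIT IN Co.
Subjects: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: this https URL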

Abstract: Accurately recovering the full 9-DoF pose of unseen instances within specific categories from a single RGB image remains a core challenge for robotics and automation. Most existing solutions still rely on pseudo-depth, CAD models, or multi-stage cascades that separate 2D detection from pose estimation. Motivated by the need for a simpler, RGB-only alternative that learns directly at the category level, we revisit a longstanding question: Can object detection and 9-DoF pose estimation be unified with high performance, without any additional data? We show that they can with our method, YOPO, a single-stage, query-based framework that treats category-level 9-DoF estimation as a natural extension of 2D detection. YOPO augments a transformer detector with a lightweight pose head, a bounding-box-conditioned translation module, and a 6D-aware Hungarian matching cost. The model is trained end-to-end only with RGB images and category-level pose labels. Despite its minimalist design, YOPO sets a new state of the art on three benchmarks. On the REAL275 dataset, it achieves 79.6% IoU₅₀ and 54.1% under the 10°10cm metric, surpassing prior RGB-only methods and closing much of the gap to RGB-D systems. The code, models, and additional qualitative results can be found on our project.
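
The 10°10cm metric counts a prediction as correct when its rotation error is within 10 degrees and its translation error is within 10 cm. A minimal sketch of that check follows; it assumes translations in metres, inclusive thresholds, and no special handling of symmetric objects, none of which the abstract specifies. The helper `rot_z` and the example poses are hypothetical.

```python
import math


def rotation_error_deg(r_pred, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(r_pred^T r_gt) equals the sum of element-wise products.
    trace = sum(r_pred[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp for acos
    return math.degrees(math.acos(cos_angle))


def within_10deg_10cm(r_pred, t_pred, r_gt, t_gt):
    """True if rotation error <= 10 degrees and translation error <= 10 cm."""
    translation_cm = 100.0 * math.dist(t_pred, t_gt)  # metres to centimetres
    return rotation_error_deg(r_pred, r_gt) <= 10.0 and translation_cm <= 10.0


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (illustrative helper)."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]


identity = rot_z(0.0)
origin = [0.0, 0.0, 0.0]
close_pose = within_10deg_10cm(rot_z(5.0), [0.03, 0.04, 0.0], identity, origin)
too_rotated = within_10deg_10cm(rot_z(15.0), origin, identity, origin)
print(close_pose, too_rotated)  # True False
```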

[CV-73] Fast Graph Neural Network for Image Classification

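[Quick read]: This paper addresses the limited ability of conventional convolutional neural networks (CNNs) to model spatial relationships between pixels or regions in complex scenes and fine-grained classification tasks. The key is to represent images as graphs processed by Graph Convolutional Networks (GCNs), with pixels or image regions as nodes: an initial graph is built from a Voronoi diagram and refined using the corresponding Delaunay triangulation, capturing local and global relationships more effectively. The method reports significant gains in preprocessing efficiency and classification accuracy, outperforming state-of-the-art models on several benchmark datasets, especially for intricate scenes and fine-grained categories.

Link: https://arxiv.org/abs/2508.14958
Authors: Mustafa Mohammadi Gharasuie,Luis Rueda
Affiliations: Unknown
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, proceeding into CanadianAI 2025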

Abstract:The rapid progress in image classification has been largely driven by the adoption of Graph Convolutional Networks (GCNs), which offer a robust framework for handling complex data structures. This study introduces a novel approach that integrates GCNs with Voronoi diagrams to enhance image classification by leveraging their ability to effectively model relational data. Unlike conventional convolutional neural networks (CNNs), our method represents images as graphs, where pixels or regions function as vertices. These graphs are then refined using corresponding Delaunay triangulations, optimizing their representation. The proposed model achieves significant improvements in both preprocessing efficiency and classification accuracy across various benchmark datasets, surpassing state-of-the-art approaches, particularly in challenging scenarios involving intricate scenes and fine-grained categories. Experimental results, validated through cross-validation, underscore the effectiveness of combining GCNs with Voronoi diagrams for advancing image classification. This research not only presents a novel perspective on image classification but also expands the potential applications of graph-based learning paradigms in computer vision and unstructured data analysis.

[CV-74] Heatmap Regression without Soft-Argmax for Facial Landmark Detection

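[Quick read]: This paper revisits the reliance of heatmap-regression methods for facial landmark detection on the Soft-argmax approximation. Soft-argmax enables end-to-end training, but as an approximation of argmax it may limit convergence speed and accuracy. The key is to drop Soft-argmax and design a new training objective based on the classic structured prediction framework, which guides the model toward landmark locations without relying on a non-differentiable operation. The method converges 2.2x faster during training while matching or exceeding state-of-the-art accuracy on the WFLW, COFW, and 300W benchmarks.

Link: https://arxiv.org/abs/2508.14929
Authors: Chiao-An Yang,Raymond A. Yeh
Affiliations: Purdue University
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: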

Abstract:Facial landmark detection is an important task in computer vision with numerous applications, such as head pose estimation, expression analysis, face swapping, etc. Heatmap regression-based methods have been widely used to achieve state-of-the-art results in this task. These methods involve computing the argmax over the heatmaps to predict a landmark. Since argmax is not differentiable, these methods use a differentiable approximation, Soft-argmax, to enable end-to-end training on deep-nets. In this work, we revisit this long-standing choice of using Soft-argmax and demonstrate that it is not the only way to achieve strong performance. Instead, we propose an alternative training objective based on the classic structured prediction framework. Empirically, our method achieves state-of-the-art performance on three facial landmark benchmarks (WFLW, COFW, and 300W), converging 2.2x faster during training while maintaining better/competitive accuracy. Our code is available here: this https URL.
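
To make the baseline concrete, the sketch below contrasts a hard argmax over a heatmap with Soft-argmax, the softmax-weighted expected coordinate that the paper moves away from. It does not implement the paper's structured-prediction objective, which the abstract does not spell out. The temperature parameter `beta` and the toy heatmap are assumptions.

```python
import math


def argmax_2d(heatmap):
    """Hard argmax: (row, col) of the largest value; not differentiable."""
    _, position = max(
        (value, (r, c))
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
    )
    return position


def soft_argmax_2d(heatmap, beta=1.0):
    """Expected (row, col) under softmax(beta * heatmap)."""
    scores = [(beta * value, r, c)
              for r, row in enumerate(heatmap)
              for c, value in enumerate(row)]
    top = max(score for score, _, _ in scores)  # subtract max for stability
    weights = [(math.exp(score - top), r, c) for score, r, c in scores]
    total = sum(w for w, _, _ in weights)
    row = sum(w * r for w, r, _ in weights) / total
    col = sum(w * c for w, _, c in weights) / total
    return row, col


heatmap = [
    [0.0, 0.1, 0.2],
    [0.1, 0.3, 1.0],
    [0.0, 0.1, 0.2],
]
hard = argmax_2d(heatmap)
sharp = soft_argmax_2d(heatmap, beta=50.0)
smooth = soft_argmax_2d(heatmap, beta=1.0)
print(hard, sharp, smooth)
```

With a large `beta` the soft estimate approaches the hard argmax at `(1, 2)`; with `beta=1.0` the column estimate is pulled toward the middle of the map, which illustrates the bias of the approximation.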

[CV-75] Scalable FPGA Framework for Real-Time Denoising in High-Throughput Imaging: A DRAM-Optimized Pipeline using High-Level Synthesis

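[Quick read]: This paper addresses high-throughput imaging workflows such as PRISM, where data is generated faster than conventional real-time processing can handle, so images cannot be processed and analyzed in time. The key is a scalable FPGA-based preprocessing pipeline implemented with High-Level Synthesis (HLS) that performs frame subtraction and averaging, using DRAM-backed buffering and burst-mode AXI4 interfaces to optimize memory access. The kernel completes low-latency denoising within the inter-frame interval, enabling inline processing and reducing dataset size for downstream CPU/GPU analysis.

Link: https://arxiv.org/abs/2508.14917
Authors: Weichien Liao
Affiliations: Unknown
Subjects: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Distributed, Parallel, and Cluster Computing (cs.DC); Image and Video Processing (eess.IV); Signal Processing (eess.SP); Instrumentation and Detectors (physics.ins-det)
Comments: FPGA-based denoising pipeline for PRISM-scale imaging. Real-time frame subtraction and averaging via burst-mode AXI4 and DRAM buffering. Benchmarked against CPU/GPU workflows; scalable across multi-bank FPGA setups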

Abstract:High-throughput imaging workflows, such as Parallel Rapid Imaging with Spectroscopic Mapping (PRISM), generate data at rates that exceed conventional real-time processing capabilities. We present a scalable FPGA-based preprocessing pipeline for real-time denoising, implemented via High-Level Synthesis (HLS) and optimized for DRAM-backed buffering. Our architecture performs frame subtraction and averaging directly on streamed image data, minimizing latency through burst-mode AXI4 interfaces. The resulting kernel operates below the inter-frame interval, enabling inline denoising and reducing dataset size for downstream CPU/GPU analysis. Validated under PRISM-scale acquisition, this modular FPGA framework offers a practical solution for latency-sensitive imaging workflows in spectroscopy and microscopy.
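
A software reference model helps show what "frame subtraction and averaging" does to the data volume. The sketch below assumes that a fixed background frame is subtracted from every incoming frame before averaging; the abstract does not say which frames are subtracted, so that choice, the function name `denoise`, and the pixel values are all illustrative, and the real design is an HLS kernel rather than Python.

```python
def denoise(frames, background):
    """Subtract `background` from each frame, then average across frames.

    Frames are flat lists of pixel values with the same length as `background`.
    Returns a single averaged frame, so N input frames become one output frame.
    """
    accumulator = [0.0] * len(background)
    for frame in frames:
        for i, (pixel, bg) in enumerate(zip(frame, background)):
            accumulator[i] += pixel - bg
    return [value / len(frames) for value in accumulator]


background = [10, 10, 10, 10]
frames = [
    [12, 10, 15, 10],
    [14, 10, 13, 10],
]
averaged = denoise(frames, background)
print(averaged)  # [3.0, 0.0, 4.0, 0.0]
```

Accumulating one frame at a time, as the loop does, mirrors a streaming design in which only the running sum needs to be buffered.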

[CV-76] The Impact of Image Resolution on Face Detection: A Comparative Analysis of MTCNN, YOLOv XI and YOLOv XII models

[Quick read]: This paper addresses the degradation of face detection performance on low-resolution images, a key challenge for practical applications such as surveillance, biometric authentication, and human-computer interaction. The key is a systematic evaluation of three prominent deep-learning face detectors (YOLOv11, YOLOv12, and MTCNN) at input resolutions of 160×160, 320×320, and 640×640, with quantitative analysis across precision, recall, mAP50, mAP50-95, and inference time, giving actionable guidance on model selection for deployments with different compute and real-time constraints.

Link: https://arxiv.org/abs/2507.23341
Authors: Ahmet Can Ömercikoğlu,Mustafa Mansur Yönügül,Pakize Erdoğmuş
Affiliations: Düzce University, Department of Computer Engineering, Düzce, Türkiye
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 6 pages, 5 figures, 4 tables

Abstract:Face detection is a crucial component in many AI-driven applications such as surveillance, biometric authentication, and human-computer interaction. However, real-world conditions like low-resolution imagery present significant challenges that degrade detection performance. In this study, we systematically investigate the impact of input resolution on the accuracy and robustness of three prominent deep learning-based face detectors: YOLOv11, YOLOv12, and MTCNN. Using the WIDER FACE dataset, we conduct extensive evaluations across multiple image resolutions (160x160, 320x320, and 640x640) and assess each model’s performance using metrics such as precision, recall, mAP50, mAP50-95, and inference time. Results indicate that YOLOv11 outperforms YOLOv12 and MTCNN in terms of detection accuracy, especially at higher resolutions, while YOLOv12 exhibits slightly better recall. MTCNN, although competitive in landmark localization, lags in real-time inference speed. Our findings provide actionable insights for selecting resolution-aware face detection models suitable for varying operational constraints.

[CV-77] Exploring the Landscape of Non-Equilibrium Memories with Neural Cellular Automata

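[Quick read]: This paper investigates the physical mechanisms and diversity of many-body memories, that is, how to design and identify local non-equilibrium dynamics that retain their initial information for long times even under arbitrary perturbations. The key is a combination of rigorous mathematical proofs and machine learning methods, which reveals that the landscape of many-body memories in two dimensions is far richer than previously recognized: the authors find memories that correct errors in ways distinct from Toom's rule, have ordered phases stabilized by fluctuations, or preserve information only in the presence of noise, showing that physical systems can achieve robust information storage in many ways.

Link: https://arxiv.org/abs/2508.15726
Authors: Ethan Lake,Ehsan Pajouheshgar
Affiliations: University of California Berkeley; École Polytechnique Fédérale de Lausanne
Subjects: Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Cellular Automata and Lattice Gases (nlin.CG)
Comments: 4+9 pages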

Abstract:We investigate the landscape of many-body memories: families of local non-equilibrium dynamics that retain information about their initial conditions for thermodynamically long time scales, even in the presence of arbitrary perturbations. In two dimensions, the only well-studied memory is Toom’s rule. Using a combination of rigorous proofs and machine learning methods, we show that the landscape of 2D memories is in fact quite vast. We discover memories that correct errors in ways qualitatively distinct from Toom’s rule, have ordered phases stabilized by fluctuations, and preserve information only in the presence of noise. Taken together, our results show that physical systems can perform robust information storage in many distinct ways, and demonstrate that the physics of many-body memories is richer than previously realized. Interactive visualizations of the dynamics studied in this work are available at this https URL.
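
Toom's rule, the reference point named in the abstract, is a cellular automaton in which each cell adopts the majority of itself and two fixed neighbours. The sketch below implements the usual north-east-center version on a small periodic grid without noise and shows an island of flipped cells being erased. The grid size, the orientation convention (north = row - 1, east = col + 1), and the initial error are assumptions for illustration; the paper's learned dynamics are not reproduced here.

```python
def toom_step(grid):
    """One synchronous update of Toom's north-east-center majority rule.

    Each cell takes the majority of itself, its northern neighbour (row - 1),
    and its eastern neighbour (col + 1), with periodic boundaries.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            1 if grid[r][c] + grid[(r - 1) % rows][c] + grid[r][(c + 1) % cols] >= 2 else 0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Stored bit: all zeros. Error: a 2x2 island of ones in the middle.
grid = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        grid[r][c] = 1

history = [grid]
for _ in range(3):
    history.append(toom_step(history[-1]))

island_sizes = [sum(map(sum, g)) for g in history]
print(island_sizes)  # [4, 3, 1, 0]
```

The island shrinks from one corner and disappears after three updates, which is the error-correcting behaviour that makes the rule a memory.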

[CV-78] Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

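[Quick read]: This paper addresses insufficient vessel segmentation accuracy in brain magnetic resonance angiography (MRA), especially when high-quality annotated datasets are lacking and manual annotation or classical methods such as the Frangi filter fall short of the accuracy needed for surgical planning. The key is HessNet, a lightweight semi-supervised neural network based on the Hessian matrix with only 6000 parameters that runs on a CPU, greatly reducing compute requirements. HessNet produces an initial segmentation of complex tubular structures such as brain vessels, so experts only need to refine the most complex cases; this enabled efficient construction of a large, accurately annotated brain MRA vessel dataset (200 annotated images based on the IXI dataset) and state-of-the-art segmentation from a minimal training set.

Link: https://arxiv.org/abs/2508.15660
Authors: Alexandra Bernadotte,Elfimov Nikita,Mikhail Shutov,Ivan Menshikov
Affiliations: M.V.Lomonosov Moscow State University; HSE University; Neurosputnik LLC; Rebis LLC
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 2 figures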

Abstract: Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image segmentation, but their development depends on well-annotated training datasets. However, there is a notable lack of publicly available MRA datasets with detailed brain vessel annotations. To address this gap, we propose a novel semi-supervised learning lightweight neural network with Hessian matrices on board for 3D segmentation of complex structures such as tubular structures, which we named HessNet. The solution is a Hessian-based neural network with only 6000 parameters. HessNet can run on the CPU and significantly reduces the resource requirements for training neural networks. The accuracy of vessel segmentation on a minimal training dataset reaches state-of-the-art results. It helps us create a large, semi-manually annotated brain vessel dataset of brain MRA images based on the IXI dataset (annotated 200 images). Annotation was performed by three experts under the supervision of three neurovascular surgeons after applying HessNet. It provides high accuracy of vessel segmentation and allows experts to focus only on the most complex important cases. The dataset is available at this https URL. ACM classes: I.4.6; I.5.4; J.3.

[CV-79] Label Uncertainty for Ultrasound Segmentation

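[Quick read]: This paper addresses label uncertainty caused by inter-observer variability among radiologists, especially in subjective modalities such as lung ultrasound (LUS), where consistent annotation is hard to achieve. The key is to use expert-supplied per-pixel confidence values as part of the annotation rather than treating labels as absolute ground truth; by modeling the aleatoric (inherent) uncertainty in real clinical data, these confidence signals are used during training to improve segmentation. The study finds that training on labels binarized at a 60% confidence threshold markedly outperforms the naive 50% threshold on downstream clinical tasks such as S/F oxygenation ratio estimation, S/F ratio change classification, and 30-day readmission prediction, indicating that properly used label confidence improves the reliability and clinical utility of AI models.

Link: https://arxiv.org/abs/2508.15635
Authors: Malini Shivaram,Gautam Rajendrakumar Gare,Laura Hutchins,Jacob Duplantis,Thomas Deiss,Thales Nogueira Gomes,Thong Tran,Keyur H. Patel,Thomas H Fox,Amita Krishnan,Deva Ramanan,Bennett DeBoisblanc,Ricardo Rodriguez,John Galeotti
Affiliations: Carnegie Mellon University; LSUHSC Internal Medicine; Cosmetic Surgery Facility LLC
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: Paper under review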

Abstract: In medical imaging, inter-observer variability among radiologists often introduces label uncertainty, particularly in modalities where visual interpretation is subjective. Lung ultrasound (LUS) is a prime example: it frequently presents a mixture of highly ambiguous regions and clearly discernible structures, making consistent annotation challenging even for experienced clinicians. In this work, we introduce a novel approach to both labeling and training AI models using expert-supplied, per-pixel confidence values. Rather than treating annotations as absolute ground truth, we design a data annotation protocol that captures the confidence that radiologists have in each labeled region, modeling the inherent aleatoric uncertainty present in real-world clinical data. We demonstrate that incorporating these confidence values during training leads to improved segmentation performance. More importantly, we show that this enhanced segmentation quality translates into better performance on downstream clinically critical tasks, specifically estimating S/F oxygenation ratio values, classifying S/F ratio change, and predicting 30-day patient readmission. While we empirically evaluate many methods for exposing the uncertainty to the learning model, we find that a simple approach that trains a model on binarized labels obtained with a 60% confidence threshold works well. Importantly, high thresholds work far better than a naive approach of a 50% threshold, indicating that training on very confident pixels is far more effective. Our study systematically investigates the impact of training with varying confidence thresholds, comparing not only segmentation metrics but also downstream clinical outcomes. These results suggest that label confidence is a valuable signal that, when properly leveraged, can significantly enhance the reliability and clinical utility of AI in medical imaging.
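
The thresholding step described in the abstract can be sketched in a few lines: per-pixel confidence values are turned into binary training labels, and raising the threshold from 50% to 60% drops the pixels the annotator was least sure about. The confidence values below are invented, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def binarize(confidence_map, threshold=0.6):
    """Turn per-pixel confidence values in [0, 1] into binary labels."""
    return [[1 if value >= threshold else 0 for value in row]
            for row in confidence_map]


# Hypothetical expert confidence that each pixel belongs to the structure.
confidence = [
    [0.95, 0.55, 0.10],
    [0.60, 0.52, 0.80],
]
labels_60 = binarize(confidence, threshold=0.6)
labels_50 = binarize(confidence, threshold=0.5)
print(labels_60)  # [[1, 0, 0], [1, 0, 1]]
print(labels_50)  # [[1, 1, 0], [1, 1, 1]]
```

The two pixels with confidence just above one half are positive under the 50% threshold and negative under the 60% threshold, which is exactly the set of ambiguous labels the higher threshold keeps out of training.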

[CV-80] Are Virtual DES Images a Valid Alternative to the Real Ones?

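[Quick read]: This paper aims to reduce the patient radiation exposure incurred in contrast-enhanced spectral mammography (CESM) when acquiring dual-energy subtracted (DES) images. It proposes generating virtual DES images from low-energy (LE) images with image-to-image translation, reducing reliance on high-energy acquisition. The key to the solution is mapping LE images to DES images with three deep learning models (a pre-trained U-Net, a U-Net trained end-to-end, and a CycleGAN) and evaluating the resulting virtual DES images on the classification of CESM examinations as malignant or non-malignant. The pre-trained U-Net performed best, with an F1 score of 85.59% on virtual DES images versus 90.35% on real DES images; despite this gap, the result supports the clinical potential of virtual DES images.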

Link: https://arxiv.org/abs/2508.15594
Authors: Ana C. Perre,Luís A. Alexandre,Luís C. Freire
Affiliations: Faculdade Ciências da Saúde, Universidade da Beira Interior; Unidade Local de Saúde do Oeste; NOVA LINCS, Universidade da Beira Interior; Escola Superior de Tecnologia da Saúde de Lisboa, Instituto Politécnico de Lisboa
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 4 figures, 3 tables

Abstract:Contrast-enhanced spectral mammography (CESM) is an imaging modality that provides two types of images, commonly known as low-energy (LE) and dual-energy subtracted (DES) images. In many domains, particularly in medicine, the emergence of image-to-image translation techniques has enabled the artificial generation of images using other images as input. Within CESM, applying such techniques to generate DES images from LE images could be highly beneficial, potentially reducing patient exposure to radiation associated with high-energy image acquisition. In this study, we investigated three models for the artificial generation of DES images (virtual DES): a pre-trained U-Net model, a U-Net trained end-to-end model, and a CycleGAN model. We also performed a series of experiments to assess the impact of using virtual DES images on the classification of CESM examinations into malignant and non-malignant categories. To our knowledge, this is the first study to evaluate the impact of virtual DES images on CESM lesion classification. The results demonstrate that the best performance was achieved with the pre-trained U-Net model, yielding an F1 score of 85.59% when using the virtual DES images, compared to 90.35% with the real DES images. This discrepancy likely results from the additional diagnostic information in real DES images, which contributes to a higher classification accuracy. Nevertheless, the potential for virtual DES image generation is considerable and future advancements may narrow this performance gap to a level where exclusive reliance on virtual DES images becomes clinically viable.

[CV-81] Deep Equilibrium Convolutional Sparse Coding for Hyperspectral Image Denoising

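[Quick read]: This paper addresses the degradation of hyperspectral images (HSIs) by complex noise patterns in remote sensing, with particular attention to preserving the physical properties of the denoised image so that the results are reliable. Existing deep unfolding methods map the optimization of a physical model onto a learnable network, but their fixed-depth design lacks convergence guarantees. The authors propose DECSC, a convolutional sparse coding (CSC) method built on the Deep Equilibrium (DEQ) framework. Its core idea is to formulate the proximal gradient descent of the CSC model as a fixed-point problem and to use DEQ to realize an infinite-depth network, which is naturally consistent with the optimization process. DECSC also unifies local spatial-spectral correlations, nonlocal spatial self-similarity, and global spatial consistency: a shared 2D convolutional sparse representation enforces global spatial consistency across bands, an unshared 3D convolutional sparse representation captures local spatial-spectral detail, an embedded transformer block exploits nonlocal self-similarity, and a detail enhancement module promotes detail preservation. The method is reported to outperform state-of-the-art denoising methods.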

Link: https://arxiv.org/abs/2508.15553
Authors: Jin Ye,Jingran Wang,Fengchao Xiong,Jingzhou Chen,Yuntao Qian
Affiliations: Nanjing University of Science and Technology; Zhejiang University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract: Hyperspectral images (HSIs) play a crucial role in remote sensing but are often degraded by complex noise patterns. Preserving the physical properties of the denoised HSIs is vital for robust HSI denoising, which has given rise to deep unfolding-based methods. However, these methods map the optimization of a physical model to a learnable network with a predefined depth, which lacks convergence guarantees. In contrast, Deep Equilibrium (DEQ) models treat the hidden layers of deep networks as the solution to a fixed-point problem and model them as infinite-depth networks, naturally consistent with the optimization. Under the framework of DEQ, we propose a Deep Equilibrium Convolutional Sparse Coding (DECSC) framework that unifies local spatial-spectral correlations, nonlocal spatial self-similarities, and global spatial consistency for robust HSI denoising. Within the convolutional sparse coding (CSC) framework, we enforce shared 2D convolutional sparse representation to ensure global spatial consistency across bands, while unshared 3D convolutional sparse representation captures local spatial-spectral details. To further exploit nonlocal self-similarities, a transformer block is embedded after the 2D CSC. Additionally, a detail enhancement module is integrated with the 3D CSC to promote image detail preservation. We formulate the proximal gradient descent of the CSC model as a fixed-point problem and transform the iterative updates into a learnable network architecture within the framework of DEQ. Experimental results demonstrate that our DECSC method achieves superior denoising performance compared to state-of-the-art methods.
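
The idea of treating proximal gradient descent as a fixed-point problem can be sketched on a toy sparse coding task. This is only an illustration under strong assumptions: it uses a small dense dictionary `A` instead of convolutional dictionaries, plain forward iteration instead of a DEQ solver with implicit differentiation, and hypothetical values for `step` and `lam`. It is not the DECSC architecture.

```python
def matvec(A, x):
    """Dense matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks each entry toward zero)."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def ista_step(z, A, y, step, lam):
    """One proximal gradient update for 0.5 * ||A z - y||^2 + lam * ||z||_1."""
    residual = [p - q for p, q in zip(matvec(A, z), y)]
    grad = matvec(transpose(A), residual)
    return soft_threshold([zi - step * g for zi, g in zip(z, grad)], step * lam)


def solve_fixed_point(A, y, step=0.5, lam=0.1, tol=1e-10, max_iter=10000):
    """Iterate z <- ista_step(z) until the update stops changing z."""
    z = [0.0] * len(A[0])
    for _ in range(max_iter):
        z_next = ista_step(z, A, y, step, lam)
        if max(abs(a - b) for a, b in zip(z, z_next)) < tol:
            return z_next
        z = z_next
    return z


# Toy problem: identity dictionary, so the minimizer is soft_threshold(y, lam).
A = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.05]
z_star = solve_fixed_point(A, y)
```

The returned `z_star` satisfies `z_star ≈ ista_step(z_star, ...)`, which is the fixed-point condition; a DEQ model would solve for that point directly and differentiate through it implicitly rather than unrolling a fixed number of steps.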

[CV-82] Self-supervised physics-informed generative networks for phase retrieval from a single X-ray hologram

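[Quick read]: This paper addresses the inverse problem of recovering the wave-field phase from a single intensity measurement (hologram) in X-ray phase contrast imaging. Conventional algebraic and iterative methods rely on specific approximations or boundary conditions and require manual parameter tuning by experts, which makes them hard to adapt to complex or variable experimental conditions. The key to the solution is a physics-informed generative adversarial network that needs no paired, unpaired, or simulated training data and jointly and quantitatively reconstructs the phase and absorbance of the unpropagated wave field in the sample plane, improving generality and robustness across sample types and imaging conditions.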

Link: https://arxiv.org/abs/2508.15530
Authors: Xiaogang Yang(1),Dawit Hailu(2),Vojtěch Kulvait(2),Thomas Jentschke(2),Silja Flenner(2),Imke Greving(2),Stuart I. Campbell(1),Johannes Hagemann(3),Christian G. Schroer(3, 4, 5),Tak Ming Wong(2, 6),Julian Moosmann(2) ((1) NSLS-II, Brookhaven National Laboratory, Upton, USA, (2) Institute of Materials Physics, Helmholtz-Zentrum Hereon, Geesthacht, Germany, (3) Center for X-ray and Nano Science CXNS, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany, (4) Department of Physics, Universität Hamburg, Hamburg, Germany, (5) Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany, (6) Institute of Metallic Biomaterials, Helmholtz-Zentrum Hereon, Geesthacht, Germany)
Affiliations: see the numbered list in the author line
Categories: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Computational Physics (physics.comp-ph); Instrumentation and Detectors (physics.ins-det)
Comments: Version of record published in Optics Express, Vol. 33, Issue 17, pp. 35832-35851 (2025). Merged article, 20 pages of main text, 1 page of supplement header, and 7 pages of supplement (total 28 pages). Contains 10 figures in the main article and 5 figures in the supplement

Abstract:X-ray phase contrast imaging significantly improves the visualization of structures with weak or uniform absorption, broadening its applications across a wide range of scientific disciplines. Propagation-based phase contrast is particularly suitable for time- or dose-critical in vivo/in situ/operando (tomography) experiments because it requires only a single intensity measurement. However, the phase information of the wave field is lost during the measurement and must be recovered. Conventional algebraic and iterative methods often rely on specific approximations or boundary conditions that may not be met by many samples or experimental setups. In addition, they require manual tuning of reconstruction parameters by experts, making them less adaptable for complex or variable conditions. Here we present a self-learning approach for solving the inverse problem of phase retrieval in the near-field regime of Fresnel theory using a single intensity measurement (hologram). A physics-informed generative adversarial network is employed to reconstruct both the phase and absorbance of the unpropagated wave field in the sample plane from a single hologram. Unlike most deep learning approaches for phase retrieval, our approach does not require paired, unpaired, or simulated training data. This significantly broadens the applicability of our approach, as acquiring or generating suitable training data remains a major challenge due to the wide variability in sample types and experimental configurations. The algorithm demonstrates robust and consistent performance across diverse imaging conditions and sample types, delivering quantitative, high-quality reconstructions for both simulated data and experimental datasets acquired at beamline P05 at PETRA III (DESY, Hamburg), operated by Helmholtz-Zentrum Hereon. Furthermore, it enables the simultaneous retrieval of both phase and absorption information.

[CV-83] DoSReMC: Domain Shift Resilient Mammography Classification using Batch Normalization Adaptation

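[Quick read]: This paper addresses the cross-domain performance drop that deep learning models for automatic breast cancer recognition suffer under domain shift, which limits the safe and equitable deployment of AI in real-world clinical settings. The key to the solution is DoSReMC (Domain Shift Resilient Mammography Classification), a framework that improves cross-domain generalization by fine-tuning only the batch normalization (BN) and fully connected (FC) layers while keeping the pretrained convolutional filters unchanged. Combining this targeted adaptation with an adversarial training scheme improves cross-domain generalizability further. The method does not require retraining the entire model, can be incorporated into existing AI pipelines, and is applicable across diverse clinical environments.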

Link: https://arxiv.org/abs/2508.15452
Authors: Uğurcan Akyüz,Deniz Katircioglu-Öztürk,Emre K. Süslü,Burhan Keleş,Mete C. Kaya,Gamze Durhan,Meltem G. Akpınar,Figen B. Demirkazık,Gözde B. Akar
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Numerous deep learning-based solutions have been developed for the automatic recognition of breast cancer using mammography images. However, their performance often declines when applied to data from different domains, primarily due to domain shift - the variation in data distributions between source and target domains. This performance drop limits the safe and equitable deployment of AI in real-world clinical settings. In this study, we present DoSReMC (Domain Shift Resilient Mammography Classification), a batch normalization (BN) adaptation framework designed to enhance cross-domain generalization without retraining the entire model. Using three large-scale full-field digital mammography (FFDM) datasets - including HCTP, a newly introduced, pathologically confirmed in-house dataset - we conduct a systematic cross-domain evaluation with convolutional neural networks (CNNs). Our results demonstrate that BN layers are a primary source of domain dependence: they perform effectively when training and testing occur within the same domain, and they significantly impair model generalization under domain shift. DoSReMC addresses this limitation by fine-tuning only the BN and fully connected (FC) layers, while preserving pretrained convolutional filters. We further integrate this targeted adaptation with an adversarial training scheme, yielding additional improvements in cross-domain generalizability. DoSReMC can be readily incorporated into existing AI pipelines and applied across diverse clinical environments, providing a practical pathway toward more robust and generalizable mammography classification systems.

[CV-84] Bladder Cancer Diagnosis with Deep Learning: A Multi-Task Framework and Online Platform

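[Quick read]: This paper addresses the variability and inconsistency in clinical bladder cancer diagnosis caused by reliance on physicians' subjective judgment, with the goal of making diagnosis more objective, accurate, and efficient. The key to the solution is an integrated multi-task deep learning framework with three core components: a classification model for tumor identification based on EfficientNet-B0 enhanced with the Convolutional Block Attention Module (CBAM); a segmentation model based on ResNet34-UNet++ with self-attention and attention gating for precise lesion boundary extraction; and molecular subtyping with ConvNeXt-Tiny (for markers such as HER-2 and Ki-67). The authors also built a Gradio-based online diagnostic platform that integrates these models and offers multi-format image upload, a bilingual interface, and dynamic threshold adjustment, improving clinical usability and enabling real-time feedback.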

Link: https://arxiv.org/abs/2508.15379
Authors: Jinliang Yu,Mingduo Xie,Yue Wang,Tianfan Fu,Xianglai Xu,Jiajun Wang
Affiliations: Peking University; State Key Laboratory for Novel Software Technology at Nanjing University, School of Computer Science, Nanjing University, Nanjing, Jiangsu, China; Department of Urology, Zhongshan Hospital, Fudan University, No.180 Fenglin Road, Shanghai 200032, China
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Clinical cystoscopy, the current standard for bladder cancer diagnosis, suffers from significant reliance on physician expertise, leading to variability and subjectivity in diagnostic outcomes. There is an urgent need for objective, accurate, and efficient computational approaches to improve bladder cancer diagnostics. Leveraging recent advancements in deep learning, this study proposes an integrated multi-task deep learning framework specifically designed for bladder cancer diagnosis from cystoscopic images. Our framework includes a robust classification model using EfficientNet-B0 enhanced with Convolutional Block Attention Module (CBAM), an advanced segmentation model based on ResNet34-UNet++ architecture with self-attention mechanisms and attention gating, and molecular subtyping using ConvNeXt-Tiny to classify molecular markers such as HER-2 and Ki-67. Additionally, we introduce a Gradio-based online diagnostic platform integrating all developed models, providing intuitive features including multi-format image uploads, bilingual interfaces, and dynamic threshold adjustments. Extensive experimentation demonstrates the effectiveness of our methods, achieving outstanding accuracy (93.28%), F1-score (82.05%), and AUC (96.41%) for classification tasks, and exceptional segmentation performance indicated by a Dice coefficient of 0.9091. The online platform significantly improved the accuracy, efficiency, and accessibility of clinical bladder cancer diagnostics, enabling practical and user-friendly deployment. The code is publicly available. Our multi-task framework and integrated online tool collectively advance the field of intelligent bladder cancer diagnosis by improving clinical reliability, supporting early tumor detection, and enabling real-time diagnostic feedback. These contributions mark a significant step toward AI-assisted decision-making in urology. 
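
For readers unfamiliar with the segmentation metric quoted above, here is a minimal sketch of the standard Dice coefficient on flattened binary masks. The function name and the toy masks are hypothetical and are not data from the paper; the masks are merely chosen so that the score is 10/11, close to the reported 0.9091.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention (an assumption): two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical flattened masks: 6 predicted pixels, 5 true pixels, 5 shared.
pred_mask = [1, 1, 1, 1, 1, 1, 0, 0]
true_mask = [1, 1, 1, 1, 1, 0, 0, 0]

score = dice_coefficient(pred_mask, true_mask)  # 2 * 5 / (6 + 5) = 10 / 11
```

A single extra predicted pixel out of six is enough to bring the score down to about 0.909, which gives a feel for how tight the overlap has to be at that level.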

[CV-85] Explainable Knowledge Distillation for Efficient Medical Image Classification

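[Quick read]: This paper addresses the tension between model performance and computational efficiency in medical image diagnosis, in particular the difficulty of deploying accurate deep learning models in resource-constrained clinical environments. The core of the solution is a knowledge distillation framework in which high-capacity teacher models (VGG19 and lightweight Vision Transformers) guide the training of a compact, hardware-aware student model derived from the OFA-595 supernet. A hybrid supervision strategy combines ground-truth labels with the teachers' soft targets to balance accuracy and inference efficiency. Experiments show that the student keeps high classification performance with far fewer parameters and shorter inference time, and Score-CAM visualizations add interpretability, supporting efficient and trustworthy medical AI systems.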

Link: https://arxiv.org/abs/2508.15251
Authors: Aqib Nazir Mir,Danish Raza Rizvi
Affiliations: Jamia Millia Islamia
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:This study comprehensively explores knowledge distillation frameworks for COVID-19 and lung cancer classification using chest X-ray (CXR) images. We employ high-capacity teacher models, including VGG19 and lightweight Vision Transformers (Visformer-S and AutoFormer-V2-T), to guide the training of a compact, hardware-aware student model derived from the OFA-595 supernet. Our approach leverages hybrid supervision, combining ground-truth labels with teacher models’ soft targets to balance accuracy and computational efficiency. We validate our models on two benchmark datasets: COVID-QU-Ex and LCS25000, covering multiple classes, including COVID-19, healthy, non-COVID pneumonia, lung, and colon cancer. To interpret the spatial focus of the models, we employ Score-CAM-based visualizations, which provide insight into the reasoning process of both teacher and student networks. The results demonstrate that the distilled student model maintains high classification performance with significantly reduced parameters and inference time, making it an optimal choice in resource-constrained clinical environments. Our work underscores the importance of combining model efficiency with explainability for practical, trustworthy medical AI solutions.
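
The hybrid supervision mentioned in the abstract (ground-truth labels plus teacher soft targets) is commonly written as a weighted sum of a cross-entropy term and a temperature-scaled KL term. The sketch below shows that common formulation only; the abstract does not give the paper's exact loss, and the names `hybrid_kd_loss`, `alpha`, and `temperature`, as well as their default values, are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def hybrid_kd_loss(student_logits, teacher_logits, label,
                   alpha=0.5, temperature=4.0):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    # Hard-label term: cross-entropy of the student at temperature 1.
    cross_entropy = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return alpha * cross_entropy + (1 - alpha) * temperature ** 2 * kl


# Hypothetical logits for a 3-class problem; class 0 is the true label.
loss = hybrid_kd_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0)
```

The `T^2` factor is the usual way to keep the soft-target gradient on a scale comparable to the hard-label term as the temperature changes. When student and teacher logits coincide, the KL term vanishes and only the weighted cross-entropy remains.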

[CV-86] Pathology-Informed Latent Diffusion Model for Anomaly Detection in Lymph Node Metastasis

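[Quick read]: This paper addresses the scarcity of annotated data for anomaly detection in digital pathology: supervised methods depend on large numbers of manually annotated samples, which are hard to obtain in practice. The authors propose an unsupervised anomaly detection method that combines a vision-language model with a denoising diffusion probabilistic model (DDPM). Its key idea is to guide image reconstruction with histopathology prompts, a set of pathology-related keywords associated with normal tissue, so that normal and abnormal tissue can be told apart without exhaustive annotation. The method is evaluated on a gastric lymph node dataset from a local hospital, and its generalization under domain shift is assessed on a public breast lymph node dataset.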

Link: https://arxiv.org/abs/2508.15236
Authors: Jiamu Wang,Keunho Byeon,Jinsol Song,Anh Nguyen,Sangjeong Ahn,Sung Hak Lee,Jin Tae Kwak
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Anomaly detection is an emerging approach in digital pathology for its ability to efficiently and effectively utilize data for disease diagnosis. While supervised learning approaches deliver high accuracy, they rely on extensively annotated datasets, suffering from data scarcity in digital pathology. Unsupervised anomaly detection, however, offers a viable alternative by identifying deviations from normal tissue distributions without requiring exhaustive annotations. Recently, denoising diffusion probabilistic models have gained popularity in unsupervised anomaly detection, achieving promising performance in both natural and medical imaging datasets. Building on this, we incorporate a vision-language model with a diffusion model for unsupervised anomaly detection in digital pathology, utilizing histopathology prompts during reconstruction. Our approach employs a set of pathology-related keywords associated with normal tissues to guide the reconstruction process, facilitating the differentiation between normal and abnormal tissues. To evaluate the effectiveness of the proposed method, we conduct experiments on a gastric lymph node dataset from a local hospital and assess its generalization ability under domain shift using a public breast lymph node dataset. The experimental results highlight the potential of the proposed method for unsupervised anomaly detection across various organs in digital pathology. Code: this https URL.

[CV-87] Zero-shot Volumetric CT Super-Resolution using 3D Gaussian Splatting with Upsampled 2D X-ray Projection Priors

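[Quick read]: This paper addresses the reconstruction of high-resolution (HR) computed tomography (CT) volumes, in particular when radiation exposure limits make large paired low-resolution (LR) and HR CT datasets unavailable. Supervised super-resolution (SR) methods depend on large paired datasets, while zero-shot SR methods need no paired data but struggle to recover fine anatomical detail because the internal information of a single LR input is limited. The key to the solution is a zero-shot 3D CT SR framework: a diffusion model is trained on large-scale HR 2D X-ray projection data, and a per-projection adaptive sampling strategy generates HR projections; these projections serve as external priors and as inputs to a 3D Gaussian splatting (GS) reconstruction; and negative alpha blending (NAB-GS) allows negative values in the Gaussian density representation, enabling residual learning between LR and diffusion-generated projections and improving high-frequency structure reconstruction. Experiments on two datasets report superior quantitative and qualitative results.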

Link: https://arxiv.org/abs/2508.15151
Authors: Jeonghyun Noh,Hyun-Jic Oh,Byungju Chae,Won-Ki Jeong
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Computed tomography (CT) is widely used in clinical diagnosis, but acquiring high-resolution (HR) CT is limited by radiation exposure risks. Deep learning-based super-resolution (SR) methods have been studied to reconstruct HR from low-resolution (LR) inputs. While supervised SR approaches have shown promising results, they require large-scale paired LR-HR volume datasets that are often unavailable. In contrast, zero-shot methods alleviate the need for paired data by using only a single LR input, but typically struggle to recover fine anatomical details due to limited internal information. To overcome these, we propose a novel zero-shot 3D CT SR framework that leverages upsampled 2D X-ray projection priors generated by a diffusion model. Exploiting the abundance of HR 2D X-ray data, we train a diffusion model on large-scale 2D X-ray projection and introduce a per-projection adaptive sampling strategy. It selects the generative process for each projection, thus providing HR projections as strong external priors for 3D CT reconstruction. These projections serve as inputs to 3D Gaussian splatting for reconstructing a 3D CT volume. Furthermore, we propose negative alpha blending (NAB-GS) that allows negative values in Gaussian density representation. NAB-GS enables residual learning between LR and diffusion-based projections, thereby enhancing high-frequency structure reconstruction. Experiments on two datasets show that our method achieves superior quantitative and qualitative results for 3D CT SR.

[CV-88] Scalable Event-Based Video Streaming for Machines with MoQ

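[Quick read]: This paper addresses efficient, low-latency transmission of event-based video data. Traditional video streaming relies on lossy compression and rate-adaptive streaming, but these techniques do not fit the data produced by neuromorphic "event" sensors, which record asynchronous pixel samples rather than image frames. The author proposes a new low-latency event streaming format whose key element is the use of features from the latest Media Over QUIC protocol draft to deliver event data scalably to computer vision applications.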

Link: https://arxiv.org/abs/2508.15003
Authors: Andrew C. Freeman
Affiliations: Baylor University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: Accepted to ACM Mile High Video 2025

Abstract: Lossy compression and rate-adaptive streaming are a mainstay in traditional video streams. However, a new class of neuromorphic "event" sensors records video with asynchronous pixel samples rather than image frames. These sensors are designed for computer vision applications, rather than human video consumption. Until now, researchers have focused their efforts primarily on application development, ignoring the crucial problem of data transmission. We survey the landscape of event-based video systems, discuss the technical issues with our recent scalable event streaming work, and propose a new low-latency event streaming format based on the latest additions to the Media Over QUIC protocol draft.

[CV-89] Fusing Structural Phenotypes with Functional Data for Early Prediction of Primary Angle Closure Glaucoma Progression

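[Quick read]: This paper addresses the classification of eyes with primary angle closure glaucoma (PACG) as fast or slow progressors, to support more individualized monitoring and intervention. The key to the solution is combining optic nerve head (ONH) structural parameters with sector-based visual field (VF) functional parameters in machine learning (ML) classifiers. A Random Forest model using both structural and functional features performed best (AUC = 0.87), compared with 0.82 for structural features alone and 0.78 for functional features alone. SHAP analysis identified six key predictors, with inferior ONH features such as MRW and RNFL thickness the most predictive, highlighting the role of ONH morphology in monitoring PACG progression.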

Link: https://arxiv.org/abs/2508.14922
Authors: Swati Sharma,Thanadet Chuangsuwanich,Royston K.Y. Tan,Shimna C. Prasad,Tin A. Tun,Shamira A. Perera,Martin L. Buist,Tin Aung,Monisha E. Nongpiur,Michaël J. A. Girard
Affiliations: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: 23 pages, 5 figures, 3 tables

Abstract: Purpose: To classify eyes as slow or fast glaucoma progressors in patients with primary angle closure glaucoma (PACG) using an integrated approach combining optic nerve head (ONH) structural features and sector-based visual field (VF) functional parameters. Methods: PACG patients with 5 reliable VF tests over 5 years were included. Progression was assessed in Zeiss Forum, with baseline VF within six months of OCT. Fast progression was a VFI decline faster than -2.0% per year; slow progression was a decline slower than -2.0% per year. OCT volumes were AI-segmented to extract 31 ONH parameters. The Glaucoma Hemifield Test defined five regions per hemifield, aligned with RNFL distribution. Mean sensitivity per region was combined with structural parameters to train ML classifiers. Multiple models were tested, and SHAP identified key predictors. Main outcome measures: Classification of slow versus fast progressors using combined structural and functional data. Results: We analyzed 451 eyes from 299 patients. Mean VFI progression was -0.92% per year; 369 eyes progressed slowly and 82 rapidly. The Random Forest model combining structural and functional features achieved the best performance (AUC = 0.87, 2000 Monte Carlo iterations). SHAP identified six key predictors: inferior MRW, inferior and inferior-temporal RNFL thickness, nasal-temporal LC curvature, superior nasal VF sensitivity, and inferior RNFL and GCL+IPL thickness. Models using only structural or functional features performed worse with AUC of 0.82 and 0.78, respectively. Conclusions: Combining ONH structural and VF functional parameters significantly improves classification of progression risk in PACG. Inferior ONH features, MRW and RNFL thickness, were the most predictive, highlighting the critical role of ONH morphology in monitoring disease progression.

Artificial Intelligence

[AI-0] Discovering Hidden Algebraic Structures via Transformers with Rank-Aware Beam GRPO

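[Quick read]: This paper studies multivariate polynomial decomposition, an NP-hard algebraic problem with wide applications in science and engineering that demands both precision and insight. The key contributions are: a synthetic data generation pipeline that gives fine-grained control over problem complexity; and Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method suited to hard algebraic problems. Fine-tuning with BGRPO improves accuracy while reducing beam width by up to half, which lowers inference compute by approximately 75%. The model is also competitive on polynomial simplification, outperforming Mathematica in various cases.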

Link: https://arxiv.org/abs/2508.15766
Authors: Jaeha Lee,Gio Huh,Ning Su,Tony Yue YU
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent efforts have extended the capabilities of transformers in logical reasoning and symbolic computations. In this work, we investigate their capacity for non-linear latent pattern discovery in the context of functional decomposition, focusing on the challenging algebraic task of multivariate polynomial decomposition. This problem, with widespread applications in science and engineering, is proved to be NP-hard, and demands both precision and insight. Our contributions are threefold: First, we develop a synthetic data generation pipeline providing fine-grained control over problem complexity. Second, we train transformer models via supervised learning and evaluate them across four key dimensions involving scaling behavior and generalizability. Third, we propose Beam Grouped Relative Policy Optimization (BGRPO), a rank-aware reinforcement learning method suitable for hard algebraic problems. Finetuning with BGRPO improves accuracy while reducing beam width by up to half, resulting in approximately 75% lower inference compute. Additionally, our model demonstrates competitive performance in polynomial simplification, outperforming Mathematica in various cases.
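
The abstract does not say how inference compute scales with beam width. One reading that is consistent with the two quoted figures (half the beam width, roughly 75% less compute) is quadratic scaling, $C(w) \propto w^2$. That scaling is an assumption made here for illustration, not a statement from the abstract:

```latex
\frac{C(w/2)}{C(w)} = \frac{(w/2)^2}{w^2} = \frac{1}{4}
\quad\Longrightarrow\quad
1 - \frac{1}{4} = 0.75 \;\; \text{(a 75\% reduction)}
```

Under linear scaling, halving the beam width would save only 50%, so the quoted 75% suggests a cost that grows faster than linearly in the beam width.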

[AI-1] Neural Robot Dynamics

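[Quick read]: This paper addresses the failure of existing neural simulators for robotics to generalize to new tasks and environments, a limitation attributed mainly to inadequate representations of the global state. The key to the solution is NeRD (Neural Robot Dynamics), learned robot-specific dynamics models that replace the low-level dynamics and contact solvers of an analytical simulator and use a robot-centric, spatially invariant simulation state representation. NeRD models plug into a state-of-the-art robotics simulator as an interchangeable backend solver. Experiments indicate that they are stable and accurate over a thousand simulation steps, generalize across tasks and environment configurations, enable policy learning exclusively in a neural engine, and can be fine-tuned on real-world data to narrow the gap between simulation and reality.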

Link: https://arxiv.org/abs/2508.15755
Authors: Jie Xu,Eric Heiden,Iretiayo Akinola,Dieter Fox,Miles Macklin,Yashraj Narang
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Graphics (cs.GR); Machine Learning (cs.LG)
Comments:

Abstract:Accurate and efficient simulation of modern robots remains challenging due to their high degrees of freedom and intricate mechanisms. Neural simulators have emerged as a promising alternative to traditional analytical simulators, capable of efficiently predicting complex dynamics and adapting to real-world data; however, existing neural simulators typically require application-specific training and fail to generalize to novel tasks and/or environments, primarily due to inadequate representations of the global state. In this work, we address the problem of learning generalizable neural simulators for robots that are structured as articulated rigid bodies. We propose NeRD (Neural Robot Dynamics), learned robot-specific dynamics models for predicting future states for articulated rigid bodies under contact constraints. NeRD uniquely replaces the low-level dynamics and contact solvers in an analytical simulator and employs a robot-centric and spatially-invariant simulation state representation. We integrate the learned NeRD models as an interchangeable backend solver within a state-of-the-art robotics simulator. We conduct extensive experiments to show that the NeRD simulators are stable and accurate over a thousand simulation steps; generalize across tasks and environment configurations; enable policy learning exclusively in a neural engine; and, unlike most classical simulators, can be fine-tuned from real-world data to bridge the gap between simulation and reality.

[AI-2] Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots

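[Quick read]: This paper addresses the harm caused by parasocial relationships between humans and AI agents, which can have severe and sometimes tragic effects on human well-being. Prevention is difficult because parasocial cues often emerge gradually in private conversations and not all emotional engagement is harmful. The key to the solution is a response evaluation framework, built by repurposing a state-of-the-art language model, that monitors ongoing conversations for parasocial cues in real time. On a small synthetic dataset of thirty dialogues, five-stage iterative evaluation identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule, with detection typically occurring within the first few exchanges. The authors present this as preliminary evidence that evaluation agents are a viable way to prevent parasocial relationships.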

Link: https://arxiv.org/abs/2508.15748
Authors: Emma Rath,Stuart Armstrong,Rebecca Gorman
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The development of parasocial relationships with AI agents has severe, and in some cases, tragic effects for human well-being. Yet preventing such dynamics is challenging: parasocial cues often emerge gradually in private conversations, and not all forms of emotional engagement are inherently harmful. We address this challenge by introducing a simple response evaluation framework, created by repurposing a state-of-the-art language model, that evaluates ongoing conversations for parasocial cues in real time. To test the feasibility of this approach, we constructed a small synthetic dataset of thirty dialogues spanning parasocial, sycophantic, and neutral conversations. Iterative evaluation with five stage testing successfully identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule, with detection typically occurring within the first few exchanges. These findings provide preliminary evidence that evaluation agents can provide a viable solution for the prevention of parasocial relations.

[AI-3] Measuring the environmental impact of delivering AI at Google Scale

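[Quick read]: This paper addresses the lack of empirical measurements of the environmental impact of AI inference workloads in a large-scale production environment, specifically energy use, carbon emissions, and water consumption. The key to the solution is a comprehensive methodology that accounts for the full serving stack: active AI accelerator power, host system energy, idle machine capacity, and data center energy overhead. Through detailed instrumentation of the infrastructure serving Google's Gemini AI assistant, the authors find that the median Gemini Apps text prompt consumes 0.24 Wh of energy, and that software efficiency efforts and clean energy procurement drove a 33x reduction in energy consumption and a 44x reduction in carbon footprint for the median prompt over one year. The work offers a comparable baseline and direction for assessing the environmental sustainability of AI serving.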

Link: https://arxiv.org/abs/2508.15734
Authors: Cooper Elsworth,Keguo Huang,David Patterson,Ian Schneider,Robert Sedivy,Savannah Goodman,Ben Townsend,Parthasarathy Ranganathan,Jeff Dean,Amin Vahdat,Ben Gomes,James Manyika
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The transformative power of AI is undeniable - but as user adoption accelerates, so does the need to understand and mitigate the environmental impact of AI serving. However, no studies have measured AI serving environmental metrics in a production environment. This paper addresses this gap by proposing and executing a comprehensive methodology for measuring the energy usage, carbon emissions, and water consumption of AI inference workloads in a large-scale, AI production environment. Our approach accounts for the full stack of AI serving infrastructure - including active AI accelerator power, host system energy, idle machine capacity, and data center energy overhead. Through detailed instrumentation of Google’s AI infrastructure for serving the Gemini AI assistant, we find the median Gemini Apps text prompt consumes 0.24 Wh of energy - a figure substantially lower than many public estimates. We also show that Google’s software efficiency efforts and clean energy procurement have driven a 33x reduction in energy consumption and a 44x reduction in carbon footprint for the median Gemini Apps text prompt over one year. We identify that the median Gemini Apps text prompt uses less energy than watching nine seconds of television (0.24 Wh) and consumes the equivalent of five drops of water (0.26 mL). While these impacts are low compared to other daily activities, reducing the environmental impact of AI serving continues to warrant important attention. Towards this objective, we propose that a comprehensive measurement of AI serving environmental metrics is critical for accurately comparing models, and to properly incentivize efficiency gains across the full AI serving stack.
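
The comparisons in the abstract can be checked with simple unit arithmetic. The sketch below uses only the figures quoted above; the derived quantities (implied television power, implied per-prompt energy one year earlier, implied volume per drop) are back-of-the-envelope inferences, not numbers reported in the abstract, and the variable names are hypothetical.

```python
# Figures quoted in the abstract.
MEDIAN_PROMPT_WH = 0.24   # energy per median Gemini Apps text prompt (Wh)
TV_SECONDS = 9            # "nine seconds of television"
ENERGY_REDUCTION = 33     # 33x reduction over one year
WATER_ML = 0.26           # water per median prompt (mL)
WATER_DROPS = 5           # "five drops of water"

# Wh divided by hours gives watts: the TV power the comparison implies.
implied_tv_watts = MEDIAN_PROMPT_WH / (TV_SECONDS / 3600)

# If the median prompt is 33x cheaper now, it cost about this a year earlier.
implied_prior_wh = MEDIAN_PROMPT_WH * ENERGY_REDUCTION

# Volume per drop implied by the water comparison.
implied_ml_per_drop = WATER_ML / WATER_DROPS

print(f"implied TV power: {implied_tv_watts:.0f} W")
print(f"implied energy one year earlier: {implied_prior_wh:.2f} Wh")
print(f"implied drop volume: {implied_ml_per_drop:.3f} mL")
```

The implied figures (a television drawing on the order of 100 W, and drops of roughly 0.05 mL) are plausible everyday values, so the comparisons in the abstract are internally consistent.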

[AI-4] Tutorial on the Probabilistic Unification of Estimation Theory, Machine Learning, and Generative AI

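Quick read: This paper addresses the core problem of extracting meaning from uncertain, noisy data, which recurs across time series analysis, pattern recognition, and language modeling. The key to its solution is a unified mathematical framework that connects classical estimation theory, statistical inference, and modern machine learning (including deep learning and large language models). It shows that maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, Bayesian classification, and attention mechanisms all stem from shared probabilistic principles: inferring hidden causes from noisy and/or biased observations. By working through scenarios such as system identification, image classification, and language generation, the paper argues that increasingly complex models extend these basic probabilistic principles, which provides a theoretical basis and practical guidance for handling overfitting, data sparsity, and interpretability.

Link: https://arxiv.org/abs/2508.15719
Authors: Mohammed Elmusrati
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: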

Abstract:Extracting meaning from uncertain, noisy data is a fundamental problem across time series analysis, pattern recognition, and language modeling. This survey presents a unified mathematical framework that connects classical estimation theory, statistical inference, and modern machine learning, including deep learning and large language models. By analyzing how techniques such as maximum likelihood estimation, Bayesian inference, and attention mechanisms address uncertainty, the paper illustrates that many AI methods are rooted in shared probabilistic principles. Through illustrative scenarios including system identification, image classification, and language generation, we show how increasingly complex models build upon these foundations to tackle practical challenges like overfitting, data sparsity, and interpretability. In other words, the work demonstrates that maximum likelihood, MAP estimation, Bayesian classification, and deep learning all represent different facets of a shared goal: inferring hidden causes from noisy and/or biased observations. It serves as both a theoretical synthesis and a practical guide for students and researchers navigating the evolving landscape of machine learning.
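
To make the relationship between maximum likelihood and MAP estimation concrete, here is a minimal sketch for the textbook case of estimating a Gaussian mean with known noise variance and a Gaussian prior. This example is not taken from the paper; the function names and the numbers are illustrative assumptions.

```python
def mle_mean(samples):
    """Maximum likelihood estimate of a Gaussian mean: the sample average."""
    return sum(samples) / len(samples)


def map_mean(samples, noise_var, prior_mean, prior_var):
    """MAP estimate of a Gaussian mean under a Gaussian prior.

    The posterior is Gaussian, so its mode is a precision-weighted
    average of the data and the prior mean.
    """
    n = len(samples)
    posterior_precision = n / noise_var + 1.0 / prior_var
    weighted_sum = sum(samples) / noise_var + prior_mean / prior_var
    return weighted_sum / posterior_precision


if __name__ == "__main__":
    data = [2.0, 4.0]  # illustrative noisy observations
    print("MLE:", mle_mean(data))
    print("MAP (tight prior at 0):", map_mean(data, 1.0, 0.0, 1.0))
    print("MAP (very weak prior):", map_mean(data, 1.0, 0.0, 1e12))
```

With a tight prior the MAP estimate is pulled toward the prior mean, and as the prior variance grows it approaches the MLE, which is one simple way to see the two estimators as points on the same probabilistic spectrum.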

[AI-5] Foundation Models for Cross-Domain EEG Analysis Application: A Survey

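Quick read: This paper addresses the fragmentation of foundation-model research in electroencephalography (EEG) analysis, which shows up as diverse model roles, inconsistent architectures, and a lack of systematic categorization. The key to its solution is the first modality-oriented taxonomy, which groups EEG foundation models by output modality into five categories: native EEG decoding, EEG-text, EEG-vision, EEG-audio, and broader multimodal frameworks. The survey reviews each category's research ideas, theoretical foundations, and architectural innovations, offering a structured reference for future methods and supporting the translation of EEG foundation models into scalable, interpretable, and online actionable solutions.

Link: https://arxiv.org/abs/2508.15716
Authors: Hongqi Li, Yitong Chen, Yujuan Wang, Weihang Ni, Haodong Zhang
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Submitted to IEEE Journals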

Abstract:Electroencephalography (EEG) analysis stands at the forefront of neuroscience and artificial intelligence research, where foundation models are reshaping the traditional EEG analysis paradigm by leveraging their powerful representational capacity and cross-modal generalization. However, the rapid proliferation of these techniques has led to a fragmented research landscape, characterized by diverse model roles, inconsistent architectures, and a lack of systematic categorization. To bridge this gap, this study presents the first comprehensive modality-oriented taxonomy for foundation models in EEG analysis, systematically organizing research advances based on output modalities of the native EEG decoding, EEG-text, EEG-vision, EEG-audio, and broader multimodal frameworks. We rigorously analyze each category’s research ideas, theoretical foundations, and architectural innovations, while highlighting open challenges such as model interpretability, cross-domain generalization, and real-world applicability in EEG-based systems. By unifying this dispersed field, our work not only provides a reference framework for future methodology development but accelerates the translation of EEG foundation models into scalable, interpretable, and online actionable solutions.

[AI-6] NiceWebRL: a Python library for human subject experiments with reinforcement learning environments

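Quick read: This paper addresses how to use reinforcement learning (RL) environments for online human subject experiments, so that AI algorithms can be compared with human behavior, cognitive science theories can be tested, and human-AI collaboration algorithms can be developed. The key to its solution is NiceWebRL, an open-source Python library that turns any Jax-based RL environment into an online interface, for both single-agent and multi-agent settings. Researchers can thus test AI models with real human participants across different tasks, including studying how an LLM can assist humans on complex tasks, which supports work on Human-like AI, Human-compatible AI, and Human-assistive AI.

Link: https://arxiv.org/abs/2508.15693
Authors: Wilka Carvalho, Vikram Goddla, Ishaan Sinha, Hoon Shin, Kunal Jha
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: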

Abstract:We present NiceWebRL, a research tool that enables researchers to use machine reinforcement learning (RL) environments for online human subject experiments. NiceWebRL is a Python library that allows any Jax-based environment to be transformed into an online interface, supporting both single-agent and multi-agent environments. As such, NiceWebRL enables AI researchers to compare their algorithms to human performance, cognitive scientists to test ML algorithms as theories for human cognition, and multi-agent researchers to develop algorithms for human-AI collaboration. We showcase NiceWebRL with 3 case studies that demonstrate its potential to help develop Human-like AI, Human-compatible AI, and Human-assistive AI. In the first case study (Human-like AI), NiceWebRL enables the development of a novel RL model of cognition. Here, NiceWebRL facilitates testing this model against human participants in both a grid world and Craftax, a 2D Minecraft domain. In our second case study (Human-compatible AI), NiceWebRL enables the development of a novel multi-agent RL algorithm that can generalize to human partners in the Overcooked domain. Finally, in our third case study (Human-assistive AI), we show how NiceWebRL can allow researchers to study how an LLM can assist humans on complex tasks in XLand-Minigrid, an environment with millions of hierarchical tasks. The library is available at this https URL.

[AI-7] GRAFT: GRaPH and Table Reasoning for Textual Alignment – A Benchmark for Structured Instruction Following and Visual Reasoning

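Quick read: This paper addresses the lack of a unified, fine-grained benchmark for evaluating multimodal models on visually grounded structured reasoning. Existing evaluations struggle to measure chart understanding, multi-step analysis, and structured output together, so results are inconsistent and hard to compare. The key to its solution is GRAFT, a structured multimodal benchmark with three main features: (1) chart and table images generated programmatically with Python visualization libraries, which gives control over data semantics and structure; (2) a multi-step analytical question for each image, based solely on its visual content, with a structured answer in JSON or YAML so that both reasoning and output format can be scored precisely; (3) a taxonomy of reasoning types (comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection) for fine-grained assessment. The result is a standardized, scalable evaluation framework for multimodal models on visually grounded structured tasks.

Link: https://arxiv.org/abs/2508.15690
Authors: Abhigya Verma, Sriram Puttagunta, Seganrasan Subramanian, Sravan Ramachandran
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
Comments: 23 pages, 9 tables, 3 figures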

Abstract:GRAFT is a structured multimodal benchmark for evaluating models on instruction-following, visual reasoning, and visual-textual alignment tasks. It features programmatically generated charts and synthetically rendered tables, created with Python visualization libraries to ensure control over data semantics, structure, and clarity. Each GRAFT instance pairs a chart or table image with a systematically generated, multi-step analytical question based solely on visual content. Answers are provided in structured formats such as JSON or YAML, supporting consistent evaluation of both reasoning and output format. The benchmark introduces a taxonomy of reasoning types including comparison, trend identification, ranking, aggregation, proportion estimation, and anomaly detection to enable comprehensive assessment. Reference answers follow strict factual and formatting guidelines for precise, aspect-based evaluation. GRAFT offers a unified, scalable framework for fine-grained benchmarking of multimodal models on visually grounded, structured reasoning tasks, setting a new evaluation standard in this field.

[AI-8] Row-Column Hybrid Grouping for Fault-Resilient Multi-Bit Weight Representation on IMC Arrays

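Quick read: This paper addresses two challenges that limit the scalability and deployability of analog In-Memory Computing (IMC) systems: computational unreliability caused by stuck-at faults (SAFs), and the high compilation overhead of existing fault-mitigation algorithms such as Fault-Free (FF). The key to its solution is a new multi-bit weight representation, row-column hybrid grouping, which adds redundancy across both rows and columns to improve fault tolerance and can be combined with existing fault-mitigation schemes. The authors also design a compiler pipeline that formulates fault-aware weight decomposition as an Integer Linear Programming (ILP) problem, which off-the-shelf solvers compile quickly and scalably, and they use theoretical insights to identify fault patterns with trivial solutions, which speeds up computation further.

Link: https://arxiv.org/abs/2508.15685
Authors: Kang Eun Jeon, Sangheum Yeon, Jinhee Kim, Hyeonsu Bang, Johnny Rhe, Jong Hwan Ko
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: Accepted to appear at ICCAD’25 (Munich, Germany)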

Abstract:This paper addresses two critical challenges in analog In-Memory Computing (IMC) systems that limit their scalability and deployability: the computational unreliability caused by stuck-at faults (SAFs) and the high compilation overhead of existing fault-mitigation algorithms, namely Fault-Free (FF). To overcome these limitations, we first propose a novel multi-bit weight representation technique, termed row-column hybrid grouping, which generalizes conventional column grouping by introducing redundancy across both rows and columns. This structural redundancy enhances fault tolerance and can be effectively combined with existing fault-mitigation solutions. Second, we design a compiler pipeline that reformulates the fault-aware weight decomposition problem as an Integer Linear Programming (ILP) task, enabling fast and scalable compilation through off-the-shelf solvers. Further acceleration is achieved through theoretical insights that identify fault patterns amenable to trivial solutions, significantly reducing computation. Experimental results on convolutional networks and small language models demonstrate the effectiveness of our approach, achieving up to 8%p improvement in accuracy, 150x faster compilation, and 2x energy efficiency gain compared to existing baselines.
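
The idea of fault-aware weight decomposition can be illustrated with a toy model. The sketch below assumes a simplified setting that is not taken from the paper: a weight is the sum of cell levels multiplied by fixed significances, a stuck-at fault pins one cell to a fixed level, and redundancy is modeled by simply adding a second set of cells with the same significances. An exhaustive search stands in for the ILP solver, and all names are hypothetical.

```python
from itertools import product


def decompose(target, significances, levels, stuck):
    """Find cell levels whose weighted sum equals `target`.

    significances: weight contributed by one level of each cell.
    levels:        number of programmable levels per cell (0 .. levels-1).
    stuck:         dict mapping a cell index to the level it is stuck at.
    Returns a tuple of levels, or None if no assignment exists.
    """
    choices = [
        [stuck[i]] if i in stuck else range(levels)
        for i in range(len(significances))
    ]
    for assignment in product(*choices):
        total = sum(s * x for s, x in zip(significances, assignment))
        if total == target:
            return assignment
    return None


if __name__ == "__main__":
    faults = {0: 0}  # the most significant cell is stuck at level 0
    # Two 4-level cells with significances 4 and 1 cover 0..15 when healthy.
    print(decompose(6, [4, 1], 4, faults))
    # Adding a redundant pair of cells gives the search room to route
    # around the faulty cell.
    print(decompose(6, [4, 1, 4, 1], 4, faults))
```

Exhaustive search grows exponentially with the number of cells, which is why a formulation that off-the-shelf ILP solvers can handle matters at scale.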

[AI-9] Futurity as Infrastructure: A Techno-Philosophical Interpretation of the AI Lifecycle

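Quick read: This paper addresses blind spots in current AI regulatory frameworks regarding the recursive value chains that arise across the data lifecycle; existing Responsible AI frameworks do not capture the dynamic of becoming that underpins both the technical operation and the economic logic of AI systems from ingestion to deployment. The key to its solution is a formal analysis based on Simondonian philosophy of technology, which recasts the AI lifecycle in three stages (the pre-individual milieu, individuation, and individuated AI) and introduces the core concept of futurity: infrastructures such as feature stores enable feedback, adaptation, and temporal recursion in data, producing a self-reinforcing, non-rivalrous growth dynamic. The framework highlights the power asymmetries created when a tech oligarchy concentrates value and decision-making through its infrastructures of capture, training, and deployment, and it proposes institutional responses including lifecycle audits, temporal traceability, feedback accountability, recursion transparency, and a right to contest recursive reuse.

Link: https://arxiv.org/abs/2508.15680
Authors: Mark Cote, Susana Aires
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 15 pages, 3 figures, Presented at IAIL 2025 - Imagining the AI Landscape after the AI Act, 4th International Workshop on Imagining the AI Landscape After the AI Act, The fourth International Conference on Hybrid Human-Artificial Intelligence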

Abstract:This paper argues that a techno-philosophical reading of the EU AI Act provides insight into the long-term dynamics of data in AI systems, specifically, how the lifecycle from ingestion to deployment generates recursive value chains that challenge existing frameworks for Responsible AI. We introduce a conceptual tool to frame the AI pipeline, spanning data, training regimes, architectures, feature stores, and transfer learning. Using cross-disciplinary methods, we develop a technically grounded and philosophically coherent analysis of regulatory blind spots. Our central claim is that what remains absent from policymaking is an account of the dynamic of becoming that underpins both the technical operation and economic logic of AI. To address this, we advance a formal reading of AI inspired by Simondonian philosophy of technology, reworking his concept of individuation to model the AI lifecycle, including the pre-individual milieu, individuation, and individuated AI. To translate these ideas, we introduce futurity: the self-reinforcing lifecycle of AI, where more data enhances performance, deepens personalisation, and expands application domains. Futurity highlights the recursively generative, non-rivalrous nature of data, underpinned by infrastructures like feature stores that enable feedback, adaptation, and temporal recursion. Our intervention foregrounds escalating power asymmetries, particularly the tech oligarchy whose infrastructures of capture, training, and deployment concentrate value and decision-making. We argue that effective regulation must address these infrastructural and temporal dynamics, and propose measures including lifecycle audits, temporal traceability, feedback accountability, recursion transparency, and a right to contest recursive reuse.

[AI-10] Mind and Motion Aligned: A Joint Evaluation IsaacSim Benchmark for Task Planning and Low-Level Policies in Mobile Manipulation

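Quick read: This paper addresses the split between benchmarks in robotics and embodied AI: existing benchmarks either focus on high-level language instruction following (usually assuming perfect low-level execution) or on low-level control (relying on simple one-step commands), so integrated systems that combine task planning and physical execution cannot be evaluated as a whole. The key to its solution is Kitchen-R, a benchmark built as a digital twin of a kitchen in the Isaac Sim simulator. It contains more than 500 complex language instructions, supports a mobile manipulator robot, and offers three evaluation modes: the planning module alone, the control policy alone, and, most importantly, the integrated system. This enables end-to-end evaluation from task planning to low-level control.

Link: https://arxiv.org/abs/2508.15663
Authors: Nikita Kachaev, Andrei Spiridonov, Andrey Gorodetsky, Kirill Muravyev, Nikita Oskolkov, Aditya Narendra, Vlad Shakhuro, Dmitry Makarov, Aleksandr I. Panov, Polina Fedotova, Alexey K. Kovalev
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: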

Abstract:Benchmarks are crucial for evaluating progress in robotics and embodied AI. However, a significant gap exists between benchmarks designed for high-level language instruction following, which often assume perfect low-level execution, and those for low-level robot control, which rely on simple, one-step commands. This disconnect prevents a comprehensive evaluation of integrated systems where both task planning and physical execution are critical. To address this, we propose Kitchen-R, a novel benchmark that unifies the evaluation of task planning and low-level control within a simulated kitchen environment. Built as a digital twin using the Isaac Sim simulator and featuring more than 500 complex language instructions, Kitchen-R supports a mobile manipulator robot. We provide baseline methods for our benchmark, including a task-planning strategy based on a vision-language model and a low-level control policy based on diffusion policy. We also provide a trajectory collection system. Our benchmark offers a flexible framework for three evaluation modes: independent assessment of the planning module, independent assessment of the control policy, and, crucially, an integrated evaluation of the whole system. Kitchen-R bridges a key gap in embodied AI research, enabling more holistic and realistic benchmarking of language-guided robotic agents.

[AI-11] Understanding Action Effects through Instrumental Empowerment in Multi-Agent Reinforcement Learning ECAI

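Quick read: This paper addresses how to quantify an individual agent's contribution to team behavior in Multi-Agent Reinforcement Learning (MARL) when no explicit value feedback is available. Prior work evaluates team performance through explicit reward signals or learned value functions, which makes it hard to isolate the effect of a single agent's behavior without such feedback. The key to its solution is Intended Cooperation Values (ICVs), a method based on information-theoretic Shapley values that analyzes policy distributions to measure each agent's causal influence on its co-players' instrumental empowerment. ICVs assess teammates' decision uncertainty and preference alignment, showing whether an agent's actions foster deterministic decisions or preserve flexibility for future action choices; this identifies the behaviors that benefit team success and improves the explainability of MARL systems.

Link: https://arxiv.org/abs/2508.15652
Authors: Ardian Selmonaj, Miroslav Strupl, Oleg Szehr, Alessandro Antonucci
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: European Conference on Artificial Intelligence (ECAI) 2025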

Abstract:To reliably deploy Multi-Agent Reinforcement Learning (MARL) systems, it is crucial to understand individual agent behaviors within a team. While prior work typically evaluates overall team performance based on explicit reward signals or learned value functions, it is unclear how to infer agent contributions in the absence of any value feedback. In this work, we investigate whether meaningful insights into agent behaviors can be extracted that are consistent with the underlying value functions, solely by analyzing the policy distribution. Inspired by the phenomenon that intelligent agents tend to pursue convergent instrumental values, which generally increase the likelihood of task success, we introduce Intended Cooperation Values (ICVs), a method based on information-theoretic Shapley values for quantifying each agent’s causal influence on their co-players’ instrumental empowerment. Specifically, ICVs measure an agent’s action effect on its teammates’ policies by assessing their decision uncertainty and preference alignment. The analysis across cooperative and competitive MARL environments reveals the extent to which agents adopt similar or diverse strategies. By comparing action effects between policies and value functions, our method identifies which agent behaviors are beneficial to team success, either by fostering deterministic decisions or by preserving flexibility for future action choices. Our proposed method offers novel insights into cooperation dynamics and enhances explainability in MARL systems.
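
Since ICVs build on Shapley values, a short sketch of the generic Shapley computation may help. The code below averages each player's marginal contribution over all join orders. The coalition value table is a made-up stand-in; the paper's actual value function is information-theoretic and defined over teammates' policies, and it is not reproduced here.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions.

    players: list of hashable player identifiers.
    value:   function mapping a frozenset of players to a number.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            extended = coalition | {player}
            totals[player] += value(extended) - value(coalition)
            coalition = extended
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical coalition values, e.g. how much a set of agents reduces
# a teammate's decision uncertainty. The numbers are illustrative only.
TOY_VALUES = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.2,
    frozenset({"b"}): 0.1,
    frozenset({"a", "b"}): 0.5,
}


if __name__ == "__main__":
    print(shapley_values(["a", "b"], TOY_VALUES.__getitem__))
```

The values always sum to the value of the full coalition, which is what makes them usable as a credit assignment among agents. Enumerating all orderings is only feasible for small teams.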

[AI-12] GRASPED: Graph Anomaly Detection using Autoencoder with Spectral Encoder and Decoder (Full Version) ECAI2025

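Quick read: This paper addresses node anomaly detection in graph-structured data. Supervised methods are limited by the scarcity of labeled data, and existing unsupervised methods mostly rely on spatial information or use only low-pass filters, so they lack multi-band analysis. The key to its solution is GRASPED, an unsupervised graph autoencoder that works in the spectral domain. It uses Graph Wavelet Convolution in the encoder and Wiener Graph Deconvolution in the decoder, which gives it bandpass filter characteristics and lets it capture global and local graph information at multiple scales. This supports a learning-based reconstruction of node attributes that captures anomaly information.

Link: https://arxiv.org/abs/2508.15633
Authors: Wei Herng Choong, Jixing Liu, Ching-Yu Kao, Philip Sperl
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Full version of the paper accepted for publication at the European Conference on Artificial Intelligence (ECAI 2025)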

Abstract:Graph machine learning has been widely explored in various domains, such as community detection, transaction analysis, and recommendation systems. In these applications, anomaly detection plays an important role. Recently, studies have shown that anomalies on graphs induce spectral shifts. Some supervised methods have improved the utilization of such spectral domain information. However, they remain limited by the scarcity of labeled data due to the nature of anomalies. On the other hand, existing unsupervised learning approaches predominantly rely on spatial information or only employ low-pass filters, thereby losing the capacity for multi-band analysis. In this paper, we propose Graph Autoencoder with Spectral Encoder and Spectral Decoder (GRASPED) for node anomaly detection. Our unsupervised learning model features an encoder based on Graph Wavelet Convolution, along with structural and attribute decoders. The Graph Wavelet Convolution-based encoder, combined with a Wiener Graph Deconvolution-based decoder, exhibits bandpass filter characteristics that capture global and local graph information at multiple scales. This design allows for a learning-based reconstruction of node attributes, effectively capturing anomaly information. Extensive experiments on several real-world graph anomaly detection datasets demonstrate that GRASPED outperforms current state-of-the-art models.

[AI-13] Adapting A Vector-Symbolic Memory for Lisp ACT-R

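Quick read: This paper addresses the limits of the Declarative Memory (DM) module in traditional ACT-R with respect to scalability and the expression of similarity between memory chunks. The key to its solution is Holographic Declarative Memory (HDM), a vector-symbolic alternative that can recall chunks without storing the actual chunks and that expresses similarity between chunks in an architecturally defined way. The authors adapt HDM to Lisp ACT-R, the most widely used ACT-R implementation, so that existing models designed for DM can run on HDM without major changes. They also build a text processing pipeline that loads the contents of large documents into memory, and they propose a new mechanism that retrieves an entire chunk from a request using only vector representations of tokens, keeping the advantages of HDM while improving compatibility with existing ACT-R models.

Link: https://arxiv.org/abs/2508.15630
Authors: Meera Ray, Christopher L. Dancy
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 6 pages. 5 figures. Submitted and accepted to the 23rd International Conference on Cognitive Modeling (ICCM 2025)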

Abstract:Holographic Declarative Memory (HDM) is a vector-symbolic alternative to ACT-R’s Declarative Memory (DM) system that can bring advantages such as scalability and architecturally defined similarity between DM chunks. We adapted HDM to work with the most comprehensive and widely-used implementation of ACT-R (Lisp ACT-R) so extant ACT-R models designed with DM can be run with HDM without major changes. With this adaptation of HDM, we have developed vector-based versions of common ACT-R functions, set up a text processing pipeline to add the contents of large documents to ACT-R memory, and most significantly created a useful and novel mechanism to retrieve an entire chunk of memory based on a request using only vector representations of tokens. Preliminary results indicate that we can maintain vector-symbolic advantages of HDM (e.g., chunk recall without storing the actual chunk and other advantages with scaling) while also extending it so that previous ACT-R models may work with the system with little (or potentially no) modifications within the actual procedural and declarative memory portions of a model. As a part of iterative improvement of this newly translated holographic declarative memory module, we will continue to explore better time-context representations for vectors to improve the module’s ability to reconstruct chunks during recall. To more fully test this translated HDM module, we also plan to develop decision-making models that use instance-based learning (IBL) theory, which is a useful application of HDM given the advantages of the system.
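
Recall without storing the actual chunk is easier to picture with a small holographic example. The sketch below uses circular convolution to bind slot vectors to value vectors and circular correlation to decode, as in holographic reduced representations generally. It is an assumption that this mirrors the flavor of HDM; it is not the authors' Lisp module, and the slot and value names are hypothetical.

```python
import math
import random


def random_vector(n, rng):
    """Random vector with elements drawn from N(0, 1/n)."""
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


def bind(a, b):
    """Circular convolution: binds a slot vector to a value vector."""
    n = len(a)
    return [sum(a[i] * b[(k - i) % n] for i in range(n)) for k in range(n)]


def unbind(a, trace):
    """Circular correlation: approximate inverse of binding with `a`."""
    n = len(a)
    return [sum(a[j] * trace[(j + k) % n] for j in range(n)) for k in range(n)]


def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm_x = math.sqrt(sum(p * p for p in x))
    norm_y = math.sqrt(sum(q * q for q in y))
    return dot / (norm_x * norm_y)


def demo(n=256, seed=0):
    """Store two slot-value pairs in one trace and decode one of them."""
    rng = random.Random(seed)
    color, shape = random_vector(n, rng), random_vector(n, rng)  # slots
    red, square = random_vector(n, rng), random_vector(n, rng)   # values
    trace = [p + q for p, q in zip(bind(color, red), bind(shape, square))]
    recovered = unbind(color, trace)
    return cosine(recovered, red), cosine(recovered, square)


if __name__ == "__main__":
    print(demo())
```

The decoded vector is a noisy copy of the stored value, so in practice it is compared against known item vectors and the closest one is returned.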

[AI-14] Transduction is All You Need for Structured Data Workflows

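Quick read: This paper addresses the weak structured reasoning and compositional generalization of current systems built on Large Language Models (LLMs) when they process complex data. Conventional approaches rely on hand-crafted prompts (prompt engineering), which scale poorly across tasks and make logical consistency hard to maintain. The key to its solution is the Agentics framework, which abstracts agents away from the logical flow and places them inside data types to implement logical transduction: when data types are connected, an LLM performs the transformation between them. This yields declarative AI workflows centered on data modeling, so developers define and compose data types instead of engineering prompts. The authors report state-of-the-art accuracy or improved scalability on domain-specific multiple-choice question answering, text-to-SQL semantic parsing, and automated prompt optimization.

Link: https://arxiv.org/abs/2508.15610
Authors: Alfio Gliozzo, Naweed Khan, Christodoulos Constantinides, Nandana Mihindukulasooriya, Nahuel Defosse, Junkyu Lee
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 32 pages, 8 figures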

Abstract: This paper introduces Agentics, a modular framework for building agent-based systems capable of structured reasoning and compositional generalization over complex data. Designed with research and practical applications in mind, Agentics offers a novel perspective on working with data and AI workflows. In this framework, agents are abstracted from the logical flow and are used internally to the data type to enable logical transduction among data. Agentics encourages AI developers to focus on modeling data rather than crafting prompts, enabling a declarative language in which data types are provided by LLMs and composed through logical transduction, which is executed by LLMs when types are connected. We provide empirical evidence demonstrating the applicability of this framework across domain-specific multiple-choice question answering, semantic parsing for text-to-SQL, and automated prompt optimization tasks, achieving state-of-the-art accuracy or improved scalability without sacrificing performance. The open-source implementation is available at this https URL.

[AI-15] A Dynamical Systems Framework for Reinforcement Learning Safety and Robustness Verification

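Quick read: This paper addresses the lack of formal methods for verifying the robustness and safety of learned policies when Reinforcement Learning (RL) is applied to safety-critical systems. The key to its solution is to model the RL agent together with its environment as a discrete-time autonomous dynamical system and to use the Finite-Time Lyapunov Exponent (FTLE) from dynamical systems theory to identify and visualize Lagrangian Coherent Structures (LCS), which act as the hidden "skeleton" of the system's behavior: repelling LCS form safety barriers around unsafe regions, and attracting LCS reveal convergence properties and potential failure modes such as unintended "trap" states. The authors further propose three quantitative metrics, Mean Boundary Repulsion (MBR), Aggregated Spurious Attractor Strength (ASAS), and Temporally-Aware Spurious Attractor Strength (TASAS), to formally measure a policy's safety margin and robustness, along with a method for deriving local stability guarantees and an extension that handles model uncertainty.

Link: https://arxiv.org/abs/2508.15588
Authors: Ahmed Nasir, Abdelhafid Zenati
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: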

Abstract:The application of reinforcement learning to safety-critical systems is limited by the lack of formal methods for verifying the robustness and safety of learned policies. This paper introduces a novel framework that addresses this gap by analyzing the combination of an RL agent and its environment as a discrete-time autonomous dynamical system. By leveraging tools from dynamical systems theory, specifically the Finite-Time Lyapunov Exponent (FTLE), we identify and visualize Lagrangian Coherent Structures (LCS) that act as the hidden “skeleton” governing the system’s behavior. We demonstrate that repelling LCS function as safety barriers around unsafe regions, while attracting LCS reveal the system’s convergence properties and potential failure modes, such as unintended “trap” states. To move beyond qualitative visualization, we introduce a suite of quantitative metrics, Mean Boundary Repulsion (MBR), Aggregated Spurious Attractor Strength (ASAS), and Temporally-Aware Spurious Attractor Strength (TASAS), to formally measure a policy’s safety margin and robustness. We further provide a method for deriving local stability guarantees and extend the analysis to handle model uncertainty. Through experiments in both discrete and continuous control environments, we show that this framework provides a comprehensive and interpretable assessment of policy behavior, successfully identifying critical flaws in policies that appear successful based on reward alone.
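
A one-dimensional sketch shows what the Finite-Time Lyapunov Exponent measures. Treating the policy and environment as a single closed-loop map, the FTLE is the average exponential rate at which two nearby initial states separate over a finite horizon. The maps below are toy stand-ins, the estimate uses a simple finite difference, and the paper's setting with multi-dimensional states is not reproduced here.

```python
import math


def rollout(step, x0, horizon):
    """Apply the closed-loop map `step` to x0 for `horizon` steps."""
    x = x0
    for _ in range(horizon):
        x = step(x)
    return x


def ftle_1d(step, x0, horizon, delta=1e-6):
    """Finite-difference FTLE of a scalar map at x0.

    Positive values mean nearby trajectories separate (repelling);
    negative values mean they converge (attracting).
    """
    upper = rollout(step, x0 + delta, horizon)
    lower = rollout(step, x0 - delta, horizon)
    stretch = abs(upper - lower) / (2.0 * delta)
    return math.log(stretch) / horizon


def contracting(x):
    """Toy closed loop that halves the state each step."""
    return 0.5 * x


def expanding(x):
    """Toy closed loop that doubles the state each step."""
    return 2.0 * x


if __name__ == "__main__":
    print("contracting:", ftle_1d(contracting, 1.0, 10))
    print("expanding:  ", ftle_1d(expanding, 1.0, 10))
```

For these linear maps the estimate matches the logarithm of the gain, so the contracting loop gives a negative exponent and the expanding loop a positive one. Ridges of high FTLE across a grid of initial states are what the LCS analysis looks for.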

[AI-16] DeepThink3D: Enhancing Large Language Models with Programmatic Reasoning in Complex 3D Situated Reasoning Tasks

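Quick read: This paper addresses the limited ability of Large Language Models (LLMs) to reason in complex 3D scenes, especially on multi-step, high-complexity tasks that involve tool usage. Existing methods solve 3D situated reasoning by calling tools through APIs and integrating the generated programs through a chain of thought, but because the questions in current datasets are simple, the resulting reasoning chains are short and do not transfer to complex tasks. The solution has three parts: first, a combinatorial and iterative evolutionary approach generates more complex question-answer pairs on the SQA3D benchmark; second, the LLM is fine-tuned on the resulting dataset so that it uses 3D tools more proficiently; third, Direct Preference Optimization (DPO) directly optimizes the toolchain strategies the model generates, which improves accuracy on complex tasks.

Link: https://arxiv.org/abs/2508.15548
Authors: Jiayi Song, Rui Wan, Lipeng Ma, Weidong Yang, Qingyuan Zhou, Yixuan Li, Ben Fei
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: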

Abstract:This work enhances the ability of large language models (LLMs) to perform complex reasoning in 3D scenes. Recent work has addressed the 3D situated reasoning task by invoking tool usage through large language models. Large language models call tools via APIs and integrate the generated programs through a chain of thought to solve problems based on the program results. However, due to the simplicity of the questions in the dataset, the generated program reasoning chains are relatively short. To solve this main challenge, in this paper, we introduce DeepThink3D to enhance the tool usage of LLMs in complex 3D situated reasoning tasks. Our work proposes a combinatorial and iterative evolutionary approach on the SQA3D benchmark to generate more complex questions. Building on this foundation, we fine-tune the large language model to make it more proficient in using 3D tools. By employing Direct Preference Optimization (DPO), we directly optimize the toolchain strategies generated by models, thereby enhancing their accuracy in complex tasks.
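
For readers unfamiliar with Direct Preference Optimization, the per-pair loss is short enough to write out. The sketch below computes the standard DPO loss from the log-probabilities that the trained policy and a frozen reference model assign to a preferred and a rejected response. The log-probability values and the beta setting are placeholders, and how the paper builds its preference pairs of toolchains is not shown here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair.

    The loss is -log(sigmoid(beta * margin)), where the margin is how
    much more the policy favors the chosen response over the rejected
    one, relative to the reference model.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # Numerically stable evaluation of -log(sigmoid(z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Policy identical to the reference: no preference learned yet.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # Policy now favors the chosen response: the loss is lower.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy matches the reference the loss equals log 2, and it falls as the policy shifts probability toward the preferred response.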

[AI-17] Super-additive Cooperation in Language Model Agents

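Quick read: This paper studies how autonomous AI agents come to cooperate in complex social settings, and in particular how to raise their initial (one-shot) tendency to cooperate when there is no long-term interaction. Drawing on super-additive cooperation theory, the authors build a virtual tournament that simulates both within-team dynamics and between-team competition, in which language model agents grouped into teams play a Prisoner's Dilemma game. The experiments show that combining external intergroup competition with internal repeated interaction substantially raises both overall and initial cooperation levels, which suggests a design approach for multi-agent AI systems and provides evidence for the counter-intuitive idea that group competition can promote cooperation.

Link: https://arxiv.org/abs/2508.15510
Authors: Filippo Tonini, Lukas Galke
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: FAIEMA 2025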

Abstract:With the prospect of autonomous artificial intelligence (AI) agents, studying their tendency for cooperative behavior becomes an increasingly relevant topic. This study is inspired by the super-additive cooperation theory, where the combined effects of repeated interactions and inter-group rivalry have been argued to be the cause for cooperative tendencies found in humans. We devised a virtual tournament where language model agents, grouped into teams, face each other in a Prisoner’s Dilemma game. By simulating both internal team dynamics and external competition, we discovered that this blend substantially boosts both overall and initial, one-shot cooperation levels (the tendency to cooperate in one-off interactions). This research provides a novel framework for large language models to strategize and act in complex social scenarios and offers evidence for how intergroup competition can, counter-intuitively, result in more cooperative behavior. These insights are crucial for designing future multi-agent AI systems that can effectively work together and better align with human values. Source code is available at this https URL.
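
The game underlying the tournament is easy to sketch. The code below plays a repeated Prisoner's Dilemma between two fixed strategies; the payoff values are the conventional textbook ones and are an assumption, and the hand-written strategies stand in for the language model agents used in the paper.

```python
# Payoffs as (player A, player B) for each pair of moves.
# "C" = cooperate, "D" = defect. Values are the conventional 3/0/5/1.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opponent_history):
    """Cooperate first, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(own_history, opponent_history):
    return "D"


def play(strategy_a, strategy_b, rounds):
    """Play a repeated game and return the total scores (A, B)."""
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        payoff_a, payoff_b = PAYOFFS[(move_a, move_b)]
        score_a += payoff_a
        score_b += payoff_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, tit_for_tat, 5))
    print(play(tit_for_tat, always_defect, 5))
```

The first move of each strategy is its one-shot behavior, which is the quantity the paper tracks when it reports initial cooperation levels.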

[AI-18] Think in Blocks: Adaptive Reasoning from Direct Response to Deep Reasoning

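Quick read: This paper addresses overthinking in Large Language Models (LLMs): overly long chains of thought waste computation and slow down responses. Its solution is the Think in Blocks framework. The key is an explicit block-structured paradigm in which the model first predicts a reasoning budget (the number of reasoning blocks) and then partitions its reasoning accordingly, so reasoning can range from zero to deep. A three-stage training pipeline (Supervised Fine-Tuning, reward-guided Direct Preference Optimization, and Reinforcement Learning) teaches the model to match reasoning depth to task difficulty, and at inference time the explicit block count is used to control chain-of-thought length dynamically, which allows flexible adjustment during deployment.

Link: https://arxiv.org/abs/2508.15507
Authors: Yekun Zhu, Guang Chen, Chengjun Mao
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: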

Abstract: Large Language Models (LLMs) with chains-of-thought have demonstrated strong performance on an increasing range of tasks, particularly those involving complex logical reasoning. However, excessively long chains can lead to overthinking, causing computational waste and slower responses. This raises a question: can LLMs dynamically adjust the length of their reasoning processes based on task complexity? To address this, we propose the Think in Blocks framework, which enables adaptive reasoning, from zero to deep reasoning, by partitioning the reasoning process into a tunable number of blocks. Our main contributions are: (1) Establishing an explicit block-structured paradigm in which the model first predicts an integer reasoning budget (the number of blocks) and then partitions its reasoning accordingly; (2) Training an adaptive model through a three-stage pipeline (Supervised Fine-Tuning, reward-guided Direct Preference Optimization, and Reinforcement Learning) that adjusts its reasoning depth to problem difficulty; (3) Exploiting the explicit block count to dynamically control reasoning depth at inference time, allowing flexible adjustment of chain-of-thought length during deployment.

[AI-19] LLM-Driven Self-Refinement for Embodied Drone Task Planning

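Quick read: This paper addresses the low task success rate of industrial-grade embodied drones in complex, dynamic environments, which stems from a lack of continuous state evaluation and adaptive refinement. Conventional methods judge outcomes from a single final-state frame, which suits continuous, changing operations poorly and offers no structured way to learn from experience. The key to its solution is the SRDrone system. First, a continuous state evaluation method determines task outcomes accurately and provides explanatory feedback. Second, a hierarchical Behavior Tree (BT) modification model combines multi-level BT plan analysis with a constrained strategy space to support structured reflective learning from experience. Experiments show a 44.87% improvement in Success Rate (SR) over baseline methods, and a real-world deployment using an experience base optimized through iterative self-refinement reaches a 96.25% SR, integrating the general reasoning of Large Language Models (LLMs) with the physical execution constraints of embodied drones.

Link: https://arxiv.org/abs/2508.15501
Authors: Deyu Zhang, Xicheng Zhang, Jiahao Li, Tingting Long, Xunhua Dai, Yongjian Fu, Jinrui Zhang, Ju Ren, Yaoxue Zhang
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 14 pages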

Abstract:We introduce SRDrone, a novel system designed for self-refinement task planning in industrial-grade embodied drones. SRDrone incorporates two key technical contributions: First, it employs a continuous state evaluation methodology to robustly and accurately determine task outcomes and provide explanatory feedback. This approach supersedes conventional reliance on single-frame final-state assessment for continuous, dynamic drone operations. Second, SRDrone implements a hierarchical Behavior Tree (BT) modification model. This model integrates multi-level BT plan analysis with a constrained strategy space to enable structured reflective learning from experience. Experimental results demonstrate that SRDrone achieves a 44.87% improvement in Success Rate (SR) over baseline methods. Furthermore, real-world deployment utilizing an experience base optimized through iterative self-refinement attains a 96.25% SR. By embedding adaptive task refinement capabilities within an industrial-grade BT planning framework, SRDrone effectively integrates the general reasoning intelligence of Large Language Models (LLMs) with the stringent physical execution constraints inherent to embodied drones. Code is available at this https URL.
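
A minimal behavior tree helps show what modifying a plan can mean in practice. The sketch below defines action, sequence, and selector nodes, then repairs a failing plan by wrapping the failing step in a selector with a fallback. The node classes, the `add_fallback` helper, and the action names are hypothetical illustrations, not SRDrone's actual data structures or refinement procedure.

```python
from dataclasses import dataclass
from typing import Callable

SUCCESS = "SUCCESS"
FAILURE = "FAILURE"


@dataclass
class Action:
    """Leaf node that runs a callable and reports its outcome."""
    name: str
    run: Callable[[], bool]

    def tick(self) -> str:
        return SUCCESS if self.run() else FAILURE


@dataclass
class Sequence:
    """Succeeds only if every child succeeds, in order."""
    children: list

    def tick(self) -> str:
        for child in self.children:
            if child.tick() == FAILURE:
                return FAILURE
        return SUCCESS


@dataclass
class Selector:
    """Succeeds as soon as one child succeeds."""
    children: list

    def tick(self) -> str:
        for child in self.children:
            if child.tick() == SUCCESS:
                return SUCCESS
        return FAILURE


def add_fallback(plan: Sequence, index: int, fallback) -> Sequence:
    """Return a new plan where step `index` gets a fallback branch."""
    children = list(plan.children)
    children[index] = Selector([children[index], fallback])
    return Sequence(children)


if __name__ == "__main__":
    takeoff = Action("takeoff", lambda: True)
    grasp = Action("grasp", lambda: False)  # this step fails
    plan = Sequence([takeoff, grasp])
    print(plan.tick())
    retry = Action("reposition_then_grasp", lambda: True)
    print(add_fallback(plan, 1, retry).tick())
```

Restricting edits to a small set of structural operations like this one is one way to keep the space of plan modifications constrained.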

[AI-20] A Solvable Molecular Switch Model for Stable Temporal Information Processing

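[Quick Read]: This paper addresses how to design dynamical-system models that exhibit biologically inspired behavior while retaining stable mathematical properties, so that they support reliable learning on sequential data. The core challenge is reconciling the brain-like functions that neuromorphic computing calls for (such as synaptic plasticity) with mathematical stability (such as convergence and fading memory). The key to the solution is an input-driven one-state differential equation model that is linear in the state and nonlinear in the input, yet exactly solvable. The paper shows that the model has the convergence and fading memory properties, which ensure stable processing of time-varying inputs. These properties let the model serve as a computational unit in deep cascaded or recurrent architectures and give theoretical support for using physical devices such as molecular switches in neuromorphic computing.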

Link: https://arxiv.org/abs/2508.15451
Authors: H. I. Nurdin,C. A. Nijhuis
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Systems and Control (eess.SY)
Comments: 21 pages, 6 figures, submitted for publication. Comments are welcome

Abstract:This paper studies an input-driven one-state differential equation model initially developed for an experimentally demonstrated dynamic molecular switch that switches like synapses in the brain do. The linear-in-the-state and nonlinear-in-the-input model is exactly solvable, and it is shown that it also possesses mathematical properties of convergence and fading memory that enable stable processing of time-varying inputs by nonlinear dynamical systems. Thus, the model exhibits the co-existence of biologically-inspired behavior and desirable mathematical properties for stable learning on sequential data. The results give theoretical support for the use of the dynamic molecular switches as computational units in deep cascaded/layered feedforward and recurrent architectures as well as other more general structures for neuromorphic computing. They could also inspire more general exactly solvable models that can be fitted to emulate arbitrary physical devices which can mimic brain-inspired behaviour and perform stable computation on input signals.
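
To make "linear in the state, nonlinear in the input" concrete, here is a minimal sketch of a generic scalar equation of that type, dx/dt = -a(u) x + b(u). The functions `rate` and `drive` are hypothetical placeholders, not the switch model from the paper. When the input is held constant over a short interval, the equation has a closed-form solution, and two trajectories driven by the same input forget their different initial states, which is the fading memory behavior the abstract refers to.

```python
import math

def rate(u):
    # Hypothetical decay rate a(u) > 0, nonlinear in the input.
    return 1.0 + u * u

def drive(u):
    # Hypothetical forcing term b(u), nonlinear in the input.
    return math.tanh(u)

def step(x, u, dt):
    # Exact solution of dx/dt = -a(u) * x + b(u) over an interval of
    # length dt on which the input u is held constant.
    a, b = rate(u), drive(u)
    x_eq = b / a
    return x_eq + (x - x_eq) * math.exp(-a * dt)

def simulate(x0, inputs, dt=0.1):
    # Apply the exact update once per input sample.
    x = x0
    for u in inputs:
        x = step(x, u, dt)
    return x

inputs = [math.sin(0.3 * k) for k in range(200)]
# Same input sequence, very different initial states.
gap = abs(simulate(5.0, inputs) - simulate(-5.0, inputs))
```

Because the update is linear in `x`, the gap between the two trajectories shrinks by a factor `exp(-a(u) * dt)` at every step regardless of the input, so `gap` ends up negligibly small.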

[AI-21] Reliable Unlearning Harmful Information in LLMs with Metamorphosis Representation Projection AAAI2026 NEURIPS2025

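[Quick Read]: This paper addresses the safety risk posed by harmful knowledge stored in large language models (LLMs), and in particular the limitation that existing machine unlearning methods struggle to achieve effective continuous unlearning and are vulnerable to relearning attacks. The key to the solution is Metamorphosis Representation Projection (MRP), a method built on the irreversibility of projections: it applies projective transformations in the hidden state space of specific network layers to eliminate harmful information while preserving useful knowledge, enabling effective continuous unlearning.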

Link: https://arxiv.org/abs/2508.15449
Authors: Chengcan Wu,Zeming Wei,Huanran Chen,Yinpeng Dong,Meng Sun
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 9 figures, Under review as a full paper at AAAI 2026. A preliminary version is under review at the NeurIPS 2025 Workshop on Reliable ML from Unreliable Data

Abstract:While Large Language Models (LLMs) have demonstrated impressive performance in various domains and tasks, concerns about their safety are becoming increasingly severe. In particular, since models may store unsafe knowledge internally, machine unlearning has emerged as a representative paradigm to ensure model safety. Existing approaches employ various training techniques, such as gradient ascent and negative preference optimization, in attempts to eliminate the influence of undesired data on target models. However, these methods merely suppress the activation of undesired data through parametric training without completely eradicating its informational traces within the model. This fundamental limitation makes it difficult to achieve effective continuous unlearning, rendering these methods vulnerable to relearning attacks. To overcome these challenges, we propose a Metamorphosis Representation Projection (MRP) approach that pioneers the application of irreversible projection properties to machine unlearning. By implementing projective transformations in the hidden state space of specific network layers, our method effectively eliminates harmful information while preserving useful knowledge. Experimental results demonstrate that our approach enables effective continuous unlearning and successfully defends against relearning attacks, achieving state-of-the-art performance in unlearning effectiveness while preserving natural performance. Our code is available in this https URL.
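
The abstract does not spell out how MRP constructs its projection, so the sketch below uses the textbook orthogonal projection that removes a single direction from a hidden-state vector. It is only meant to illustrate why a projection is irreversible: once the component along the direction is zeroed, it cannot be recovered from the output. The vectors and the name `harmful_direction` are illustrative assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project_out(hidden, direction):
    # Remove the component of `hidden` along `direction`:
    # h' = h - ((h . d) / (d . d)) * d
    norm_sq = dot(direction, direction)
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]

harmful_direction = [1.0, 1.0, 0.0]  # hypothetical direction to erase
h = [2.0, 0.0, 3.0]                  # hypothetical hidden state
h_clean = project_out(h, harmful_direction)
```

Applying `project_out` a second time leaves `h_clean` unchanged (the map is idempotent), and the third coordinate, which is orthogonal to the erased direction, is preserved.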

[AI-22] From Bits to Boardrooms: A Cutting-Edge Multi-Agent LLM Framework for Business Excellence ECAI2025

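[Quick Read]: This paper addresses the difficulty large language models (LLMs) have, in enterprise decision support and strategic planning, in reconciling intricate operational analyses with high-level strategic goals, which leads to inefficient collaboration across organizational levels and fragmented workflows. The key to the solution is BusiAgent, a multi-agent framework that combines three core innovations: dynamic agent modeling based on an extended Continuous Time Markov Decision Process (CTMDP); a generalized entropy measure to optimize collaborative efficiency; and a multi-level Stackelberg game to handle hierarchical decision processes. It also uses contextual Thompson sampling for prompt optimization, backed by a quality assurance system to reduce errors. The framework improves the integration of fine-grained business insights with high-level strategy and outperforms existing approaches in both solution quality and user satisfaction.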

Link: https://arxiv.org/abs/2508.15447
Authors: Zihao Wang,Junming Zhang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ECAI 2025

Abstract:Large Language Models (LLMs) have shown promising potential in business applications, particularly in enterprise decision support and strategic planning, yet current approaches often struggle to reconcile intricate operational analyses with overarching strategic goals across diverse market environments, leading to fragmented workflows and reduced collaboration across organizational levels. This paper introduces BusiAgent, a novel multi-agent framework leveraging LLMs for advanced decision-making in complex corporate environments. BusiAgent integrates three core innovations: an extended Continuous Time Markov Decision Process (CTMDP) for dynamic agent modeling, a generalized entropy measure to optimize collaborative efficiency, and a multi-level Stackelberg game to handle hierarchical decision processes. Additionally, contextual Thompson sampling is employed for prompt optimization, supported by a comprehensive quality assurance system to mitigate errors. Extensive empirical evaluations across diverse business scenarios validate BusiAgent’s efficacy, demonstrating its capacity to generate coherent, client-focused solutions that smoothly integrate granular insights with high-level strategy, significantly outperforming established approaches in both solution quality and user satisfaction. By fusing cutting-edge AI technologies with deep business insights, BusiAgent marks a substantial step forward in AI-driven enterprise decision-making, empowering organizations to navigate complex business landscapes more effectively.
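
For readers unfamiliar with Thompson sampling, the sketch below shows the plain (non-contextual) Beta-Bernoulli variant choosing among a few candidate prompts. BusiAgent uses a contextual version whose details are not given in the abstract, so the reward model, the success rates, and the function names here are all illustrative assumptions.

```python
import random

def thompson_select(successes, failures, rng):
    # Sample a plausible success rate for each prompt from its Beta
    # posterior and pick the prompt with the highest sample.
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

def run(true_rates, rounds=2000, seed=0):
    # Simulate feedback: prompt i "succeeds" with probability
    # true_rates[i]; return how often each prompt was chosen.
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        arm = thompson_select(successes, failures, rng)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

pulls = run([0.2, 0.5, 0.8])  # hypothetical prompt success rates
```

Over time the sampler concentrates its choices on the prompt with the highest observed success rate while still occasionally exploring the others.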

[AI-23] Test-time Corpus Feedback: From Retrieval to RAG

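[Quick Read]: This survey addresses the separation of retrieval and reasoning in current Retrieval-Augmented Generation (RAG) systems: most RAG pipelines are static, retrieving documents once at the start and then generating an answer directly, with no support for the iterative evidence gathering or high-precision retrieval that complex tasks require. The key idea is feedback-driven dynamic retrieval and ranking: by incorporating feedback signals from the query, the retrieved context, or the document pool, retrieval becomes a learnable, adaptive component of the end-to-end system, improving performance on knowledge-intensive NLP tasks.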

Link: https://arxiv.org/abs/2508.15437
Authors: Mandeep Rathee,Venktesh V,Sean MacAvaney,Avishek Anand
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 1 figure

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a standard framework for knowledge-intensive NLP tasks, combining large language models (LLMs) with document retrieval from external corpora. Despite its widespread use, most RAG pipelines continue to treat retrieval and reasoning as isolated components, retrieving documents once and then generating answers without further interaction. This static design often limits performance on complex tasks that require iterative evidence gathering or high-precision retrieval. Recent work in both the information retrieval (IR) and NLP communities has begun to close this gap by introducing adaptive retrieval and ranking methods that incorporate feedback. In this survey, we present a structured overview of advanced retrieval and ranking mechanisms that integrate such feedback. We categorize feedback signals based on their source and role in improving the query, retrieved context, or document pool. By consolidating these developments, we aim to bridge IR and NLP perspectives and highlight retrieval as a dynamic, learnable component of end-to-end RAG systems.

[AI-24] An Empirical Study of Knowledge Distillation for Code Understanding Tasks ICSE2026

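[Quick Read]: This paper addresses the difficulty of deploying pre-trained language models (PLMs) for code understanding at scale, given their computational intensity and high inference latency. The core approach is knowledge distillation (KD): transferring knowledge from a large teacher model to a small student model compresses and accelerates the model, improving inference efficiency while retaining most of the teacher's capability. The main contribution is a systematic evaluation of two families of KD methods (logit-based and feature-based) on code understanding tasks. It finds that feature-based distillation can retain up to 98% of teacher performance with only 5% of the parameters, that code-specific PLMs make better teachers, and that architectural similarity between student and teacher is not a decisive factor for performance.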

Link: https://arxiv.org/abs/2508.15423
Authors: Ruiqi Wang,Zezhou Yang,Cuiyun Gao,Xin Xia,Qing Liao
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Accepted by ICSE 2026 (Cycle 1)

Abstract:Pre-trained language models (PLMs) have emerged as powerful tools for code understanding. However, deploying these PLMs in large-scale applications faces practical challenges due to their computational intensity and inference latency. Knowledge distillation (KD), a promising model compression and acceleration technique, addresses these limitations by transferring knowledge from large teacher models to compact student models, enabling efficient inference while preserving most of the teacher models’ capabilities. While this technique has shown remarkable success in natural language processing and computer vision domains, its potential for code understanding tasks remains largely underexplored. In this paper, we systematically investigate the effectiveness and usage of KD in code understanding tasks. Our study encompasses two popular types of KD methods, i.e., logit-based and feature-based KD methods, experimenting across eight student models and two teacher PLMs from different domains on three downstream tasks. The experimental results indicate that KD consistently offers notable performance boosts across student models with different sizes compared with standard fine-tuning. Notably, code-specific PLM demonstrates better effectiveness as the teacher model. Among all KD methods, the latest feature-based KD methods exhibit superior performance, enabling student models to retain up to 98% teacher performance with merely 5% parameters. Regarding student architecture, our experiments reveal that similarity with teacher architecture does not necessarily lead to better performance. We further discuss the efficiency and behaviors in the KD process and inference, summarize the implications of findings, and identify promising future directions. 
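
As a reference point for the logit-based family mentioned above, here is a minimal sketch of a common logit-distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. The specific KD methods evaluated in the paper may differ, and the temperature value and example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Numerically stable softmax of logits / temperature.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # multiplied by T^2 so gradient magnitudes stay comparable
    # across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

matched = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
mismatched = kd_loss([0.0, 0.0, 0.0], [2.0, 1.0, 0.0])
```

In practice this term is usually combined with the ordinary cross-entropy loss on ground-truth labels. Feature-based methods instead match intermediate representations and are not shown here.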

[AI-25] Bridging Generalization and Personalization in Wearable Human Activity Recognition via On-Device Few-Shot Learning

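[Quick Read]: This paper addresses the performance drop of wearable Human Activity Recognition (HAR) models when they are deployed to new users, which is rooted in user-induced concept drift (UICD). For efficient personalization, it proposes a hybrid framework: a general model is first trained on multi-user data, and the classifier layer is then quickly fine-tuned on-device with a small amount of user-specific data via few-shot learning. The key is updating only the classifier layer rather than the whole model, which gives robust personalization with very low computational and memory overhead. The approach is validated on the RISC-V-based GAP9 microcontroller and improves recognition accuracy in three different HAR scenarios.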

Link: https://arxiv.org/abs/2508.15413
Authors: Pixi Kang,Julian Moosmann,Mengxi Liu,Bo Zhou,Michele Magno,Paul Lukowicz,Sizhen Bian
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Human Activity Recognition (HAR) using wearable devices has advanced significantly in recent years, yet its generalization remains limited when models are deployed to new users. This degradation in performance is primarily due to user-induced concept drift (UICD), highlighting the importance of efficient personalization. In this paper, we present a hybrid framework that first generalizes across users and then rapidly adapts to individual users using few-shot learning directly on-device. By updating only the classifier layer with user-specific data, our method achieves robust personalization with minimal computational and memory overhead. We implement this framework on the energy-efficient RISC-V-based GAP9 microcontroller and validate it across three diverse HAR scenarios: RecGym, QVAR-Gesture, and Ultrasound-Gesture. Post-deployment adaptation yields consistent accuracy improvements of 3.73%, 17.38%, and 3.70% respectively. These results confirm that fast, lightweight, and effective personalization is feasible on embedded platforms, paving the way for scalable and user-aware HAR systems in the wild (this https URL).

[AI-26] Hybrid Least Squares/Gradient Descent Methods for DeepONets

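[Quick Read]: This paper addresses the low training efficiency of DeepONet. The model output is linear in the last-layer parameters of the branch network, so those parameters can be fitted by least squares (LS), but building the LS system over all possible combinations of inputs yields a problem too large to solve directly. The key to the solution is decomposing the large LS system into two separate subproblems, one for the branch network and one for the trunk network, which substantially reduces computational cost and speeds up training. The method is further generalized to a broader class of L² losses with a regularization term on the last-layer parameters, including physics-informed unsupervised learning.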

Link: https://arxiv.org/abs/2508.15394
Authors: Jun Choi,Chang-Ock Lee,Minam Moon
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Numerical Analysis (math.NA)
Comments:

Abstract:We propose an efficient hybrid least squares/gradient descent method to accelerate DeepONet training. Since the output of DeepONet can be viewed as linear with respect to the last layer parameters of the branch network, these parameters can be optimized using a least squares (LS) solve, and the remaining hidden layer parameters are updated by gradient descent. However, building the LS system for all possible combinations of branch and trunk inputs yields a prohibitively large linear problem that is infeasible to solve directly. To address this issue, our method decomposes the large LS system into two smaller, more manageable subproblems (one for the branch network and one for the trunk network) and solves them separately. This method is generalized to a broader type of L^2 loss with a regularization term for the last layer parameters, including the case of unsupervised learning with physics-informed loss.
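
The least squares half of the hybrid scheme can be illustrated on a toy network. The sketch below fixes a tiny hidden layer and fits only the last-layer coefficients by solving the (optionally ridge-regularized) normal equations. It is a simplified single-network example, not the paper's branch/trunk decomposition, and the feature map, the weight `w`, and the data are assumptions.

```python
import math

def hidden_features(x, w=1.5):
    # Fixed hidden layer: a bias feature and one tanh unit.
    # In a hybrid scheme, w would be updated by gradient descent.
    return [1.0, math.tanh(w * x)]

def solve_last_layer(xs, ys, ridge=0.0):
    # Solve the 2x2 normal equations
    # (Phi^T Phi + ridge * I) c = Phi^T y by Cramer's rule.
    a11 = a12 = a22 = b1 = b2 = 0.0
    for x, y in zip(xs, ys):
        f1, f2 = hidden_features(x)
        a11 += f1 * f1
        a12 += f1 * f2
        a22 += f2 * f2
        b1 += f1 * y
        b2 += f2 * y
    a11 += ridge
    a22 += ridge
    det = a11 * a22 - a12 * a12
    return [(a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det]

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
# Targets generated by a known last layer (2, 3) on the same features.
ys = [2.0 + 3.0 * math.tanh(1.5 * x) for x in xs]
coeffs = solve_last_layer(xs, ys)
```

With `ridge=0.0` the solve recovers the generating coefficients up to rounding; a positive `ridge` corresponds to the regularization term on the last-layer parameters mentioned in the abstract.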

[AI-27] EvoFormer: Learning Dynamic Graph-Level Representations with Structural and Temporal Bias Correction

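[Quick Read]: This paper addresses two key problems in dynamic graph representation learning: Structural Visit Bias and Abrupt Evolution Blindness. The former arises when random walk sampling over-emphasizes high-degree nodes, making structural representations redundant and noisy. The latter stems from rigid or overly simple temporal modeling that fails to capture sudden structural changes, producing inconsistent temporal embeddings. The core of the solution is the EvoFormer framework, with two main innovations: (1) a Structure-Aware Transformer Module that uses positional encoding based on node structural roles to differentiate and represent node structures globally, mitigating Structural Visit Bias; and (2) an Evolution-Sensitive Temporal Module with a three-step strategy: random walk timestamp classification to generate initial temporal embeddings, graph-level temporal segmentation to partition the stream into structurally coherent periods, and segment-aware temporal self-attention combined with an edge evolution prediction task, so that the model identifies segment boundaries precisely and adapts to rapid temporal shifts.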

Link: https://arxiv.org/abs/2508.15378
Authors: Haodi Zhong,Liuxin Zou,Di Wang,Bo Wang,Zhenxing Niu,Quan Wang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Dynamic graph-level embedding aims to capture structural evolution in networks, which is essential for modeling real-world scenarios. However, existing methods face two critical yet under-explored issues: Structural Visit Bias, where random walk sampling disproportionately emphasizes high-degree nodes, leading to redundant and noisy structural representations; and Abrupt Evolution Blindness, the failure to effectively detect sudden structural changes due to rigid or overly simplistic temporal modeling strategies, resulting in inconsistent temporal embeddings. To overcome these challenges, we propose EvoFormer, an evolution-aware Transformer framework tailored for dynamic graph-level representation learning. To mitigate Structural Visit Bias, EvoFormer introduces a Structure-Aware Transformer Module that incorporates positional encoding based on node structural roles, allowing the model to globally differentiate and accurately represent node structures. To overcome Abrupt Evolution Blindness, EvoFormer employs an Evolution-Sensitive Temporal Module, which explicitly models temporal evolution through a sequential three-step strategy: (I) Random Walk Timestamp Classification, generating initial timestamp-aware graph-level embeddings; (II) Graph-Level Temporal Segmentation, partitioning the graph stream into segments reflecting structurally coherent periods; and (III) Segment-Aware Temporal Self-Attention combined with an Edge Evolution Prediction task, enabling the model to precisely capture segment boundaries and perceive structural evolution trends, effectively adapting to rapid temporal shifts. Extensive evaluations on five benchmark datasets confirm that EvoFormer achieves state-of-the-art performance in graph similarity ranking, temporal anomaly detection, and temporal segmentation tasks, validating its effectiveness in correcting structural and temporal biases.

[AI-28] Planning with Minimal Disruption

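[Quick Read]: This paper addresses how to find plans that achieve the goals while modifying the initial state as little as possible, a concept it calls plan disruption. The core of the solution is a set of planning-based compilations that jointly optimize the sum of action costs and plan disruption, balancing the two objectives. Experiments on several benchmarks show that the reformulated task can be solved effectively in practice, producing plans that balance cost and disruption.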

Link: https://arxiv.org/abs/2508.15358
Authors: Alberto Pozanco,Marianela Morales,Daniel Borrajo,Manuela Veloso
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In many planning applications, we might be interested in finding plans that minimally modify the initial state to achieve the goals. We refer to this concept as plan disruption. In this paper, we formally introduce it, and define various planning-based compilations that aim to jointly optimize both the sum of action costs and plan disruption. Experimental results in different benchmarks show that the reformulated task can be effectively solved in practice to generate plans that balance both objectives.
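
The trade-off between action cost and disruption can be illustrated with a toy example. The abstract does not give the formal definition of plan disruption, so the sketch below assumes one plausible measure: the number of facts that differ between the initial state and the state reached after the plan. The action encoding, the `weight` parameter, and the two plans are all hypothetical.

```python
def apply_plan(state, plan):
    # Each action is (name, cost, add_facts, delete_facts).
    facts = set(state)
    for _name, _cost, add, delete in plan:
        facts = (facts - set(delete)) | set(add)
    return facts

def disruption(initial, plan):
    # Assumed measure: number of facts that differ between the
    # initial state and the state reached after executing the plan.
    return len(set(initial) ^ apply_plan(initial, plan))

def score(initial, plan, weight=1.0):
    # Weighted sum of total action cost and plan disruption.
    cost = sum(action[1] for action in plan)
    return cost + weight * disruption(initial, plan)

initial = {"door_closed", "light_off"}
# Both plans achieve the goal "inside".
plan_a = [("open_door", 1, {"door_open"}, {"door_closed"}),
          ("enter", 1, {"inside"}, set())]
plan_b = plan_a + [("close_door", 1, {"door_closed"}, {"door_open"})]
```

Here `plan_b` costs one action more than `plan_a` but leaves the door as it found it, so under equal weighting it scores better: a cheapest plan is not necessarily the least disruptive one.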

[AI-29] RETAIL: Towards Real-world Travel Planning for Large Language Models

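[Quick Read]: This paper addresses three core problems of current large language models in automated travel planning: existing systems assume users give explicit queries, whereas real requirements are often implicit; existing solutions ignore diverse environmental factors and user preferences, which limits plan feasibility; and existing methods only produce basic Point of Interest (POI) arrangements rather than all-in-one plans with rich detail. The key to the solution is a new dataset, RETAIL, which supports decision-making for both implicit and explicit queries, with and without revision needs, adds environmental awareness to keep plans feasible in real-world scenarios, and includes detailed POI information for all-in-one travel plans. The paper also proposes a topic-guided multi-agent framework (TGMA). Experiments show that even the strongest existing model reaches only a 1.0% pass rate, while TGMA improves this to 2.72%, suggesting promising directions for real-world travel planning.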

Link: https://arxiv.org/abs/2508.15335
Authors: Bin Deng,Yizhe Feng,Zeming Liu,Qing Wei,Xiangrong Zhu,Shuai Chen,Yuanfang Guo,Yunhong Wang
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Although large language models have enhanced automated travel planning abilities, current systems remain misaligned with real-world scenarios. First, they assume users provide explicit queries, while in reality requirements are often implicit. Second, existing solutions ignore diverse environmental factors and user preferences, limiting the feasibility of plans. Third, systems can only generate plans with basic POI arrangements, failing to provide all-in-one plans with rich details. To mitigate these challenges, we construct a novel dataset RETAIL, which supports decision-making for implicit queries while covering explicit queries, both with and without revision needs. It also enables environmental awareness to ensure plan feasibility under real-world scenarios, while incorporating detailed POI information for all-in-one travel plans. Furthermore, we propose a topic-guided multi-agent framework, termed TGMA. Our experiments reveal that even the strongest existing model achieves merely a 1.0% pass rate, indicating real-world travel planning remains extremely challenging. In contrast, TGMA demonstrates substantially improved performance (2.72%), offering promising directions for real-world travel planning.

[AI-30] Search-Based Credit Assignment for Offline Preference-Based Reinforcement Learning

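[Quick Read]: This paper addresses the reliance of offline reinforcement learning on hand-designed reward functions, and the limitations of combining the two common forms of human feedback, expert demonstrations and preferences. The key to the solution is a Search-Based Preference Weighting (SPW) scheme: for each transition in a preference-labeled trajectory, SPW searches for similar state-action pairs in the expert demonstrations and derives stepwise importance weights directly from the similarity scores. These weights then guide standard preference learning, giving more accurate credit assignment and better joint use of both feedback sources on challenging robot manipulation tasks.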

Link: https://arxiv.org/abs/2508.15327
Authors: Xiancheng Gao,Yufeng Shi,Wengang Zhou,Houqiang Li
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 7 pages, 6 figures, under review

Abstract:Offline reinforcement learning refers to the process of learning policies from fixed datasets, without requiring additional environment interaction. However, it often relies on well-defined reward functions, which are difficult and expensive to design. Human feedback is an appealing alternative, but its two common forms, expert demonstrations and preferences, have complementary limitations. Demonstrations provide stepwise supervision, but they are costly to collect and often reflect limited expert behavior modes. In contrast, preferences are easier to collect, but it is unclear which parts of a behavior contribute most to a trajectory segment, leaving credit assignment unresolved. In this paper, we introduce a Search-Based Preference Weighting (SPW) scheme to unify these two feedback sources. For each transition in a preference labeled trajectory, SPW searches for the most similar state-action pairs from expert demonstrations and directly derives stepwise importance weights based on their similarity scores. These weights are then used to guide standard preference learning, enabling more accurate credit assignment that traditional approaches struggle to achieve. We demonstrate that SPW enables effective joint learning from preferences and demonstrations, outperforming prior methods that leverage both feedback types on challenging robot manipulation tasks.
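
The search-and-weight step described in the abstract can be sketched as a nearest-neighbor lookup. The abstract does not specify the similarity measure or how scores become weights, so the Euclidean distance, the `exp(-d)` mapping, the normalization, and the toy vectors below are all assumptions made for illustration.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def stepwise_weights(segment, expert_pairs, k=1):
    # For each (state, action) vector in the segment, average the
    # distance to its k nearest expert pairs and map it to a
    # similarity score exp(-d); then normalise the scores to sum to 1.
    scores = []
    for pair in segment:
        nearest = sorted(distance(pair, e) for e in expert_pairs)[:k]
        scores.append(math.exp(-sum(nearest) / len(nearest)))
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical concatenated state-action vectors.
expert_pairs = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
segment = [[0.0, 0.0, 1.0], [5.0, 5.0, 0.0]]
weights = stepwise_weights(segment, expert_pairs)
```

The first transition matches an expert pair exactly and so receives almost all of the weight; in a preference-learning loss, such weights would scale each step's contribution to the segment's predicted return.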

[AI-31] Coarse-to-Fine Grounded Memory for LLM Agent Planning EMNLP2025

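[Quick Read]: This paper addresses the limited knowledge diversity and planning flexibility of current Large Language Model (LLM) agents in complex planning tasks, which results from their reliance on single-granularity memory. Existing methods mostly build such memory from experiences collected through environment interaction, so their performance is bounded by the quality of those experiences and they adapt poorly to diverse scenarios. The key to the solution is the Coarse-to-Fine Grounded Memory framework. It grounds environmental information into coarse-grained focus points that guide experience collection in training tasks, and then extracts actionable hybrid-grained tips from each experience for structured storage. At inference, the framework retrieves relevant experiences and tips to support planning; when it encounters environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction.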

Link: https://arxiv.org/abs/2508.15305
Authors: Wei Yang,Jinwei Xiao,Hongming Zhang,Qingyang Zhang,Yanna Wang,Bo Xu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 Main Conference; 27 pages, 15 figures

Abstract:Recent advancements in Large Language Models (LLMs) have driven growing interest in LLM-based agents for complex planning tasks. To avoid costly agent training, many studies adopt memory mechanisms that enhance LLMs with offline experiences or online trajectory analysis. However, existing works focus on single-granularity memory derived from dynamic environmental interactions, which is inherently constrained by the quality of the collected experiences. This limitation, in turn, constrains the diversity of knowledge and the flexibility of planning. We propose Coarse-to-Fine Grounded Memory, a novel framework that grounds coarse-to-fine memories with an LLM, thereby fully leveraging them for flexible adaptation to diverse scenarios. The framework grounds environmental information into coarse-grained focus points to guide experience collection in training tasks, followed by grounding of actionable hybrid-grained tips from each experience. At inference, it retrieves task-relevant experiences and tips to support planning. When facing environmental anomalies, the LLM grounds the current situation into fine-grained key information, enabling flexible self-QA reflection and plan correction.

[AI-32] Way to Build Native AI-driven 6G Air Interface: Principles, Roadmap, and Outlook

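[Quick Read]: This article addresses the difficulty conventional radio access technologies in 6G networks have in achieving efficient semantic transmission and system scalability when faced with diverse tasks, complex data types, and dynamic channel conditions. The key to the solution is a native AI-driven air interface architecture built around two core characteristics, compression and adaptation. Compression extracts the essential semantic information from the source data, focusing on task relevance rather than symbol-level accuracy. Adaptation lets the air interface adjust dynamically to different tasks, data types, and channel states, ensuring scalability and robustness.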

Link: https://arxiv.org/abs/2508.15277
Authors: Ping Zhang,Kai Niu,Yiming Liu,Zijian Liang,Nan Ma,Xiaodong Xu,Wenjun Xu,Mengying Sun,Yinqiu Liu,Xiaoyun Wang,Ruichen Zhang
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI)
Comments: 14 pages, 7 figures

Abstract:Artificial intelligence (AI) is expected to serve as a foundational capability across the entire lifecycle of 6G networks, spanning design, deployment, and operation. This article proposes a native AI-driven air interface architecture built around two core characteristics: compression and adaptation. On one hand, compression enables the system to understand and extract essential semantic information from the source data, focusing on task relevance rather than symbol-level accuracy. On the other hand, adaptation allows the air interface to dynamically transmit semantic information across diverse tasks, data types, and channel conditions, ensuring scalability and robustness. This article first introduces the native AI-driven air interface architecture, then discusses representative enabling methodologies, followed by a case study on semantic communication in 6G non-terrestrial networks. Finally, it presents a forward-looking discussion on the future of native AI in 6G, outlining key challenges and research opportunities.

[AI-33] M-LLM³REC: A Motivation-Aware User-Item Interaction Framework for Enhancing Recommendation Accuracy with LLMs

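[Quick Read]: This paper addresses the performance bottleneck of recommendation systems in cold-start and sparse-data scenarios, which traditional methods such as content-based filtering, collaborative filtering, and deep learning handle poorly. Existing solutions either generate pseudo-interaction sequences, which introduce redundant or noisy signals, or rely heavily on semantic similarity and overlook dynamic shifts in user motivation. The key innovation is a new recommendation framework, M-LLM³REC, which uses large language models (LLMs) to extract motivation-driven semantic signals from limited user interactions. Its three integrated modules, the Motivation-Oriented Profile Extractor (MOPE), the Motivation-Oriented Trait Encoder (MOTE), and the Motivational Alignment Recommender (MAR), yield more robust, personalized, and generalizable recommendations, with particular gains over state-of-the-art frameworks under cold-start conditions.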

Link: https://arxiv.org/abs/2508.15262
Authors: Lining Chen,Qingwen Zeng,Huaming Chen
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 10 pages

Abstract:Recommendation systems have been essential for both user experience and platform efficiency by alleviating information overload and supporting decision-making. Traditional methods, i.e., content-based filtering, collaborative filtering, and deep learning, have achieved impressive results in recommendation systems. However, the cold-start and sparse-data scenarios are still challenging to deal with. Existing solutions either generate pseudo-interaction sequences, which often introduce redundant or noisy signals, or rely heavily on semantic similarity, overlooking dynamic shifts in user motivation. To address these limitations, this paper proposes a novel recommendation framework, termed M-LLM³REC, which leverages large language models for deep motivational signal extraction from limited user interactions. M-LLM³REC comprises three integrated modules: the Motivation-Oriented Profile Extractor (MOPE), Motivation-Oriented Trait Encoder (MOTE), and Motivational Alignment Recommender (MAR). By emphasizing motivation-driven semantic modeling, M-LLM³REC demonstrates robust, personalized, and generalizable recommendations, particularly boosting performance in cold-start situations in comparison with the state-of-the-art frameworks.

[AI-34] Computational Intelligence based Land-use Allocation Approaches for Mixed Use Areas

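[Quick Read]: This paper addresses the multi-objective optimization problem of urban land-use allocation, which is central to sustainable urban development policy; the core challenge is the trade-off between land-use compatibility and economic objectives. The key to the solution is a set of computational intelligence algorithms, in particular CR+DES, which uses scaled difference vectors to strengthen exploration and a systematic constraint relaxation strategy to widen the search of the solution space while maintaining feasibility. Statistical validation with Kruskal-Wallis tests and compact letter displays shows that algorithms incorporating difference vectors significantly outperform traditional approaches on multiple metrics, giving urban planners evidence-based computational tools to support land-use decisions in rapidly urbanizing regions.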

Link: https://arxiv.org/abs/2508.15240
Authors: Sabab Aosaf,Muhammad Ali Nayeem,Afsana Haque,M Sohel Rahmana
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Urban land-use allocation represents a complex multi-objective optimization problem critical for sustainable urban development policy. This paper presents novel computational intelligence approaches for optimizing land-use allocation in mixed-use areas, addressing inherent trade-offs between land-use compatibility and economic objectives. We develop multiple optimization algorithms, including custom variants integrating differential evolution with multi-objective genetic algorithms. Key contributions include: (1) CR+DES algorithm leveraging scaled difference vectors for enhanced exploration, (2) systematic constraint relaxation strategy improving solution quality while maintaining feasibility, and (3) statistical validation using Kruskal-Wallis tests with compact letter displays. Applied to a real-world case study with 1,290 plots, CR+DES achieves 3.16% improvement in land-use compatibility compared to state-of-the-art methods, while MSBX+MO excels in price optimization with 3.3% improvement. Statistical analysis confirms algorithms incorporating difference vectors significantly outperform traditional approaches across multiple metrics. The constraint relaxation technique enables broader solution space exploration while maintaining practical constraints. These findings provide urban planners and policymakers with evidence-based computational tools for balancing competing objectives in land-use allocation, supporting more effective urban development policies in rapidly urbanizing regions.

[AI-35] GenTune: Toward Traceable Prompts to Improve Controllability of Image Refinement in Environment Design

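[Quick Read]: This paper addresses two challenges in applying generative AI to environment design: prompts generated by large language models (LLMs) are so long that designers struggle to identify and revise the keywords affecting specific visual elements; and inpainting-based editing supports localized changes but can undermine overall image quality because it lacks global consistency. The key to the solution is the GenTune system, which establishes a traceable mapping between image elements and prompt labels: designers can select any element in the image, locate the corresponding prompt labels, and revise them, achieving fine-grained control together with globally consistent image refinement.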

Link: https://arxiv.org/abs/2508.15227
Authors: Wen-Fan Wang,Ting-Ying Lee,Chien-Ting Lu,Che-Wei Hsu,Nil Ponsa Campany,Yu Chen,Mike Y. Chen,Bing-Yu Chen
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: Accepted ACM Symposium on User Interface Software and Technology (UIST '25)

Abstract:Environment designers in the entertainment industry create imaginative 2D and 3D scenes for games, films, and television, requiring both fine-grained control of specific details and consistent global coherence. Designers have increasingly integrated generative AI into their workflows, often relying on large language models (LLMs) to expand user prompts for text-to-image generation, then iteratively refining those prompts and applying inpainting. However, our formative study with 10 designers surfaced two key challenges: (1) the lengthy LLM-generated prompts make it difficult to understand and isolate the keywords that must be revised for specific visual elements; and (2) while inpainting supports localized edits, it can struggle with global consistency and correctness. Based on these insights, we present GenTune, an approach that enhances human–AI collaboration by clarifying how AI-generated prompts map to image content. Our GenTune system lets designers select any element in a generated image, trace it back to the corresponding prompt labels, and revise those labels to guide precise yet globally consistent image refinement. In a summative study with 20 designers, GenTune significantly improved prompt–image comprehension, refinement quality, and efficiency, and overall satisfaction (all p < .01) compared to current practice. A follow-up field study with two studios further demonstrated its effectiveness in real-world settings.

[AI-36] Locally Pareto-Optimal Interpretations for Black-Box Machine Learning Models

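[Quick Read]: This paper addresses the trade-off between accuracy and explainability when interpreting black-box machine learning models. Existing multi-objective interpretation synthesis methods usually lack formal guarantees of Pareto-optimality, while methods that do provide such guarantees face severe scalability limits when exploring the Pareto-optimal space. The key to the solution is a framework based on local optimality guarantees. It first uses a multi-objective learning or search technique (such as Multi-Objective Monte Carlo Tree Search) to generate a set of Pareto-optimal candidate interpretations. It then encodes the local optimality check for each candidate as a Boolean satisfiability (SAT) problem and verifies it with a SAT solver, which preserves interpretation quality while making synthesis more scalable.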

Link: https://arxiv.org/abs/2508.15220
Authors: Aniruddha Joshi,Supratik Chakraborty,S Akshay,Shetal Shah,Hazem Torfah,Sanjit Seshia
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: This work has been accepted at ATVA’25

Abstract:Creating meaningful interpretations for black-box machine learning models involves balancing two often conflicting objectives: accuracy and explainability. Exploring the trade-off between these objectives is essential for developing trustworthy interpretations. While many techniques for multi-objective interpretation synthesis have been developed, they typically lack formal guarantees on the Pareto-optimality of the results. Methods that do provide such guarantees, on the other hand, often face severe scalability limitations when exploring the Pareto-optimal space. To address this, we develop a framework based on local optimality guarantees that enables more scalable synthesis of interpretations. Specifically, we consider the problem of synthesizing a set of Pareto-optimal interpretations with local optimality guarantees, within the immediate neighborhood of each solution. Our approach begins with a multi-objective learning or search technique, such as Multi-Objective Monte Carlo Tree Search, to generate a best-effort set of Pareto-optimal candidates with respect to accuracy and explainability. We then verify local optimality for each candidate as a Boolean satisfiability problem, which we solve using a SAT solver. We demonstrate the efficacy of our approach on a set of benchmarks, comparing it against previous methods for exploring the Pareto-optimal front of interpretations. In particular, we show that our approach yields interpretations that closely match those synthesized by methods offering global guarantees.
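
The notions of Pareto dominance and local optimality used above can be made concrete with a few lines of code. The sketch below filters a candidate set down to its non-dominated members and checks a candidate against an explicit list of neighbors by brute force. The paper performs the local check with a SAT encoding instead, and the scores and function names here are illustrative assumptions.

```python
def dominates(a, b):
    # a, b are (accuracy, explainability) pairs; higher is better.
    # a dominates b if it is at least as good on both objectives
    # and strictly better on at least one.
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def pareto_front(candidates):
    # Keep the candidates that no other candidate dominates.
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]

def locally_optimal(candidate, neighbours):
    # Local check: no neighbour of the candidate dominates it.
    return not any(dominates(n, candidate) for n in neighbours)

# Hypothetical (accuracy, explainability) scores of interpretations.
candidates = [(0.90, 0.2), (0.85, 0.5), (0.80, 0.4), (0.70, 0.9)]
front = pareto_front(candidates)
```

In this example `(0.80, 0.4)` is dropped because `(0.85, 0.5)` is better on both objectives; the remaining three points are mutually incomparable.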

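[AI-37] R-ConstraintBench: Evaluating LLMs on NP-Complete Scheduling

[Quick Read]: This paper addresses the insufficiently characterized reliability of large language models (LLMs) when reasoning under high-constraint regimes, in particular on Resource-Constrained Project Scheduling Problems (RCPSP). The key to the solution is R-ConstraintBench, a scalable evaluation framework that raises difficulty step by step: it linearly increases the number of non-redundant precedence constraints, then introduces downtime, temporal windows, and disjunctive constraints, systematically testing LLM feasibility judgments under complex constraint interaction. Experiments show that strong models are near-ceiling on precedence-only directed acyclic graphs (DAGs), but performance collapses once multiple constraint types interact, indicating that constraint interaction rather than graph depth is the principal bottleneck. Strong results on clean synthetic data also do not guarantee transfer to domain-grounded scenarios.

Link: https://arxiv.org/abs/2508.15204
Authors: Raj Jain,Marc Wetter
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: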

Abstract:Effective scheduling under tight resource, timing, and operational constraints underpins large-scale planning across sectors such as capital projects, manufacturing, logistics, and IT fleet transitions. However, the reliability of large language models (LLMs) when reasoning under high-constraint regimes is insufficiently characterized. To address this gap, we present R-ConstraintBench, a scalable framework that evaluates models on Resource-Constrained Project Scheduling Problems (RCPSP), an NP-Complete feasibility class, while difficulty increases via linear growth in constraints. R-ConstraintBench incrementally increases non-redundant precedence constraints in Directed Acyclic Graphs (DAGs) and then introduces downtime, temporal windows, and disjunctive constraints. As an illustrative example, we instantiate the benchmark in a data center migration setting and evaluate multiple LLMs using feasibility and error analysis, identifying degradation thresholds and constraint types most associated with failure. Empirically, strong models are near-ceiling on precedence-only DAGs, but feasibility performance collapses when downtime, temporal windows, and disjunctive constraints interact, implicating constraint interaction, not graph depth, as the principal bottleneck. Performance on clean synthetic ramps also does not guarantee transfer to domain-grounded scenarios, underscoring limited generalization.
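
To make the constraint types concrete, the following is a minimal feasibility checker for a proposed schedule, assuming integer start times, a single renewable resource, and downtime given as half-open intervals. It is a sketch of what a feasibility check of this kind might verify, not the benchmark's actual evaluator; every name and the toy instance are hypothetical.

```python
from typing import Dict, List, Tuple


def is_feasible(
    start: Dict[str, int],
    duration: Dict[str, int],
    precedence: List[Tuple[str, str]],
    demand: Dict[str, int],
    capacity: int,
    downtime: List[Tuple[int, int]],
) -> bool:
    """Check precedence, downtime, and resource-capacity constraints."""
    # Precedence: task a must finish before task b starts.
    for a, b in precedence:
        if start[a] + duration[a] > start[b]:
            return False
    # Downtime: no task may overlap a half-open window [lo, hi).
    for task, s in start.items():
        end = s + duration[task]
        for lo, hi in downtime:
            if s < hi and lo < end:
                return False
    # Resource capacity: total demand of running tasks at each time step.
    horizon = max(start[t] + duration[t] for t in start)
    for step in range(horizon):
        used = sum(
            demand[t] for t in start if start[t] <= step < start[t] + duration[t]
        )
        if used > capacity:
            return False
    return True


# Toy instance (hypothetical): three tasks, one resource with capacity 2.
duration = {"A": 2, "B": 2, "C": 1}
demand = {"A": 1, "B": 1, "C": 2}
precedence = [("A", "C")]
downtime = [(4, 5)]

ok = is_feasible({"A": 0, "B": 0, "C": 2}, duration, precedence, demand, 2, downtime)
print(ok)
```

Each constraint family can fail independently, so a schedule that satisfies precedence alone can still be rejected by the capacity or downtime checks, which is the kind of interaction the abstract identifies as the main source of failures.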

[AI-38] Survey of Vision-Language-Action Models for Embodied Manipulation

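[Quick Read]: This paper addresses limited agent-environment interaction capabilities in embodied intelligence systems, particularly the generalization and robustness bottlenecks robots face when performing manipulation tasks in complex scenes. Its key contribution is a systematic survey of Vision-Language-Action (VLA) models: it traces the evolution of VLA architectures and analyzes current research along five core dimensions (model structures, training datasets, pre-training methods, post-training methods, and model evaluation), showing how VLA models improve the multimodal perception and decision-making of embodied agents and support real-world deployment.

Link: https://arxiv.org/abs/2508.15201
Authors: Haoran Li,Yuhui Chen,Wenbo Cui,Weiheng Liu,Kai Liu,Mingcai Zhou,Zhengtao Zhang,Dongbin Zhao
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: in Chinese language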

Abstract:Embodied intelligence systems, which enhance agent capabilities through continuous environment interactions, have garnered significant attention from both academia and industry. Vision-Language-Action models, inspired by advancements in large foundation models, serve as universal robotic control frameworks that substantially improve agent-environment interaction capabilities in embodied intelligence systems. This expansion has broadened application scenarios for embodied AI robots. This survey comprehensively reviews VLA models for embodied manipulation. Firstly, it chronicles the developmental trajectory of VLA architectures. Subsequently, we conduct a detailed analysis of current research across 5 critical dimensions: VLA model structures, training datasets, pre-training methods, post-training methods, and model evaluation. Finally, we synthesize key challenges in VLA development and real-world deployment, while outlining promising future research directions.

[AI-39] PuzzleClone: An SMT-Powered Framework for Synthesizing Verifiable Data

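[Quick Read]: This paper addresses the limited reliability, diversity, and scalability of the datasets used to train mathematical and logical reasoning in large language models (LLMs). Existing LLM-generated data often lacks a verification mechanism and so cannot reliably support improvements in reasoning. The authors propose PuzzleClone, a framework based on Satisfiability Modulo Theories (SMT), with three core innovations: (1) encoding seed puzzles into structured logical specifications; (2) generating scalable variants through systematic randomization of variables and constraints; and (3) ensuring validity via a reproduction mechanism. The method synthesizes a large, diverse, programmatically validated set of logic and math puzzles, and post-training on it improves reasoning performance on multiple benchmarks.

Link: https://arxiv.org/abs/2508.15180
Authors: Kai Xiong,Yanwei Huang,Rongjunchen Zhang,Kun Chen,Haipang Wu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: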

Abstract:High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using Satisfiability Modulo Theories (SMT). Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct a curated benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. We conduct post training (SFT and RL) on PuzzleClone datasets. Experimental results show that training on PuzzleClone yields substantial improvements not only on PuzzleClone testset but also on logic and mathematical benchmarks. Post training raises PuzzleClone average from 14.4 to 56.2 and delivers consistent improvements across 7 logic and mathematical benchmarks up to 12.5 absolute percentage points (AMC2023 from 52.5 to 65.0). Our code and data are available at this https URL.
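
The generate-then-validate loop described above can be sketched on a toy puzzle. This is only an illustration under stated assumptions: the seed puzzle ("find x < y with a given sum and product") is hypothetical, and exhaustive search over a small domain stands in for the SMT solver the paper uses.

```python
import itertools
import random
from typing import Dict, List, Tuple


def solutions(total: int, product: int, lo: int = 1, hi: int = 9) -> List[Tuple[int, int]]:
    """Enumerate all (x, y) with lo <= x < y <= hi, x + y == total, x * y == product."""
    return [
        (x, y)
        for x, y in itertools.product(range(lo, hi + 1), repeat=2)
        if x < y and x + y == total and x * y == product
    ]


def make_variants(n: int, seed: int = 0) -> List[Dict[str, object]]:
    """Randomize the puzzle parameters and keep only variants with a unique answer."""
    rng = random.Random(seed)
    variants: List[Dict[str, object]] = []
    while len(variants) < n:
        x, y = rng.randint(1, 9), rng.randint(1, 9)
        spec = {"total": x + y, "product": x * y}
        sols = solutions(spec["total"], spec["product"])
        # Validation step: discard variants that are unsolvable or ambiguous.
        if len(sols) == 1:
            variants.append({**spec, "answer": sols[0]})
    return variants


variants = make_variants(5)
for v in variants:
    print(v)
```

Draws with x equal to y produce a specification with no valid answer under the x < y constraint, so the validation step rejects them; this is the role a solver-backed check plays when the constraint space is too large to enumerate.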

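[AI-40] Mobile-Agent-v3: Foundamental Agents for GUI Automation

[Quick Read]: This paper addresses the limited end-to-end task execution of open-source graphical user interface (GUI) agents across desktop and mobile environments, particularly in UI grounding, question answering, planning, decision-making, and procedural knowledge. The core solution is the GUI-Owl model and its extension framework Mobile-Agent-v3. Key innovations: (1) a large-scale cloud-based virtual environment infrastructure spanning Android, Ubuntu, macOS, and Windows that supports automatic generation and iterative refinement of interaction data; (2) integrated foundational agent capabilities (UI grounding, planning, action semantics, and reasoning patterns) that enable end-to-end decision-making and let the model serve as a modular component in multi-agent systems; and (3) a scalable, fully asynchronous reinforcement learning framework together with Trajectory-aware Relative Policy Optimization (TRPO) for better real-world alignment. The resulting models reach state-of-the-art results among open-source systems on multiple GUI benchmarks.

Link: https://arxiv.org/abs/2508.15144
Authors: Jiabo Ye,Xi Zhang,Haiyang Xu,Haowei Liu,Junyang Wang,Zhaoqing Zhu,Ziwei Zheng,Feiyu Gao,Junjie Cao,Zhengxi Lu,Jitong Liao,Qi Zheng,Fei Huang,Jingren Zhou,Ming Yan
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: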

Abstract:This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at this https URL.

[AI-41] Universal Reinforcement Learning in Coalgebras: Asynchronous Stochastic Computation via Conduction

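[Quick Read]: This paper addresses the limitations of fixed-point computation in reinforcement learning (RL), in particular how to determine exact or approximate (action) value functions more generally in an asynchronous, parallel, distributed setting. The key to the solution is universal reinforcement learning (URL), which draws on abstractions from category theory: coinduction on non-well-founded sets, universal coalgebras, topos theory, and a model of asynchronous distributed minimization. The space of RL algorithms is modeled as a functor category whose co-domain category forms a topos, which admits all (co)limits, has a subobject classifier, and has exponential objects. The framework treats the dynamical system models of traditional RL (such as MDPs, POMDPs, PSRs, and LDSs) as special types of coalgebras, and generalizes the value-function problem to determining the final coalgebra asynchronously in a parallel distributed manner.

Link: https://arxiv.org/abs/2508.15128
Authors: Sridhar Mahadevan
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 45 pages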

Abstract:In this paper, we introduce a categorial generalization of RL, termed universal reinforcement learning (URL), building on powerful mathematical abstractions from the study of coinduction on non-well-founded sets and universal coalgebras, topos theory, and categorial models of asynchronous parallel distributed computation. In the first half of the paper, we review the basic RL framework, illustrate the use of categories and functors in RL, showing how they lead to interesting insights. In particular, we also introduce a standard model of asynchronous distributed minimization proposed by Bertsekas and Tsitsiklis, and describe the relationship between metric coinduction and their proof of the Asynchronous Convergence Theorem. The space of algorithms for MDPs or PSRs can be modeled as a functor category, where the co-domain category forms a topos, which admits all (co)limits, possesses a subobject classifier, and has exponential objects. In the second half of the paper, we move on to universal coalgebras. Dynamical system models, such as Markov decision processes (MDPs), partially observed MDPs (POMDPs), a predictive state representation (PSRs), and linear dynamical systems (LDSs) are all special types of coalgebras. We describe a broad family of universal coalgebras, extending the dynamic system models studied previously in RL. The core problem in finding fixed points in RL to determine the exact or approximate (action) value function is generalized in URL to determining the final coalgebra asynchronously in a parallel distributed manner.
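
The classical special case that the paper generalizes, finding the fixed point of the Bellman operator with asynchronous updates, can be shown on a toy problem. The sketch below assumes a hypothetical two-state deterministic MDP and compares synchronous value iteration with an asynchronous variant that updates one randomly chosen state at a time, in place. It does not implement anything coalgebraic.

```python
import random
from typing import Dict

GAMMA = 0.9
# MDP[state][action] = (next_state, reward); a hypothetical deterministic toy model.
MDP = {
    0: {0: (0, 0.0), 1: (1, 1.0)},
    1: {0: (1, 2.0), 1: (0, 0.0)},
}


def backup(v: Dict[int, float], s: int) -> float:
    """One Bellman optimality backup for state s."""
    return max(r + GAMMA * v[s2] for (s2, r) in MDP[s].values())


def synchronous_vi(tol: float = 1e-10) -> Dict[int, float]:
    """Update every state from the previous iterate until the change is below tol."""
    v = {s: 0.0 for s in MDP}
    while True:
        new_v = {s: backup(v, s) for s in MDP}
        if max(abs(new_v[s] - v[s]) for s in MDP) < tol:
            return new_v
        v = new_v


def asynchronous_vi(updates: int = 5000, seed: int = 0) -> Dict[int, float]:
    """Update one randomly chosen state at a time, in place."""
    rng = random.Random(seed)
    v = {s: 0.0 for s in MDP}
    states = list(MDP)
    for _ in range(updates):
        s = rng.choice(states)
        v[s] = backup(v, s)
    return v


v_sync = synchronous_vi()
v_async = asynchronous_vi()
print(v_sync, v_async)
```

For this toy model the optimal values can be worked out by hand: staying in state 1 yields 2 / (1 - 0.9) = 20, and moving from state 0 to state 1 yields 1 + 0.9 * 20 = 19. Both update orders approach the same fixed point, which is the behaviour the Asynchronous Convergence Theorem cited in the abstract covers for contraction mappings.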

[AI-42] Argumentation for Explainable Workforce Optimisation (with Appendix)

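[Quick Read]: This paper addresses the complexity of workforce management in industrial settings: minimizing makespan and operator travel distance while accommodating changes at execution time and giving all stakeholders explainable grounds for decisions. The key to the solution is to model workforce management as abstract argumentation, which accommodates change and yields faithful explanations. A user study shows that the tool and its explanations lead to faster and more accurate problem solving than conventional solutions by hand.

Link: https://arxiv.org/abs/2508.15118
Authors: Jennifer Leigh,Dimitrios Letsios,Alessandro Mella,Lucio Machetti,Francesca Toni
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted to PAIS 2025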

Abstract:Workforce management is a complex problem optimising the makespan and travel distance required for a team of operators to complete a set of jobs, using a set of instruments. A crucial challenge in workforce management is accommodating changes at execution time so that explanations are provided to all stakeholders involved. Here, we show that, by understanding workforce management as abstract argumentation in an industrial application, we can accommodate change and obtain faithful explanations. We show, with a user study, that our tool and explanations lead to faster and more accurate problem solving than conventional solutions by hand.
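
As background on abstract argumentation, the sketch below computes the grounded extension of a Dung-style argumentation framework by iterating its characteristic function from the empty set. The abstract does not say which semantics or encoding the paper uses, so the choice of grounded semantics and the workforce-flavoured argument names are assumptions made purely for illustration.

```python
from typing import Set, Tuple


def grounded_extension(arguments: Set[str], attacks: Set[Tuple[str, str]]) -> Set[str]:
    """Least fixed point of the characteristic function of the framework.

    An argument is acceptable with respect to a set S if every one of its
    attackers is itself attacked by some member of S.
    """
    attackers = {a: {x for (x, y) in attacks if y == a} for a in arguments}
    accepted: Set[str] = set()
    while True:
        # Arguments attacked by the currently accepted set are defeated.
        defeated = {y for (x, y) in attacks if x in accepted}
        new = {a for a in arguments if attackers[a] <= defeated}
        if new == accepted:
            return accepted
        accepted = new


# Hypothetical example: an assignment is challenged, and the challenge is rebutted.
arguments = {"assign_alice_job1", "alice_unavailable", "alice_shift_extended"}
attacks = {
    ("alice_unavailable", "assign_alice_job1"),
    ("alice_shift_extended", "alice_unavailable"),
}
accepted = grounded_extension(arguments, attacks)
print(sorted(accepted))
```

The chain of attacks doubles as an explanation: the assignment stands because its only objection is defeated by an unchallenged argument. When a change at execution time adds or removes an attack, recomputing the extension shows which decisions survive.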

[AI-43] Hydra: A 1.6B-Parameter State-Space Language Model with Sparse Attention Mixture-of-Experts and Memory

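[Quick Read]: This paper addresses the trade-off between parameter efficiency and computational complexity in long-context language models, in particular how to model very long sequences within a limited parameter budget. The key to the solution is the Hydra architecture, which combines the efficient sequence modeling of a Structured State Space Model (SSM) backbone, intermittent sparse global attention for key long-range dependencies, chunk-level Mixture-of-Experts (MoE) feed-forward routing for capacity, and dual memories (workspace memory plus factual PKM memory) for long-term storage and retrieval. With formalized component interfaces, transparent parameter and complexity accounting, and a staged training curriculum, Hydra is offered as a blueprint for modular, input-adaptive long-context language models. The current prototype is validated only at toy scale on synthetic data to show feasibility; validating gains at target scale remains future work.

Link: https://arxiv.org/abs/2508.15099
Authors: Siddharth Chaudhary,Bennett Browning
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: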

Abstract:We present Hydra as an architectural proposal for hybrid long-context language models that combine conditional computation, long-context memory mechanisms, and sparse mixture-of-experts within an approximately 1.6B parameter design envelope. Hydra integrates a Mamba-style Structured State Space Model (SSM) backbone with intermittent sparse global attention, chunk-level MoE feed-forward routing, and dual (workspace plus factual PKM) memories. We formalize the component interfaces, give transparent parameter and complexity accounting, and outline a staged curriculum intended to stably activate the parts. We accompany the specification with illustrative toy-scale prototype measurements (tens of millions of parameters on synthetic data) whose sole purpose is to demonstrate implementation feasibility and qualitative scaling behaviors (for example, long-context throughput crossover and controllable expert routing), not to claim competitive full-scale performance. We explicitly delineate assumptions and open risks (training complexity, memory utilization, specialization dynamics) and position Hydra as a blueprint to stimulate empirical follow-up rather than a finished system. By combining SSM efficiency, selective sparse attention, MoE capacity, and learnable memory, Hydra sketches a path toward modular, input-adaptive long-context language models; validating end-task gains at target scale remains future work.

[AI-44] Wormhole Dynamics in Deep Neural Networks

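[Quick Read]: This paper studies "fooling examples" in deep neural networks (DNNs): inputs that look random or unstructured to humans but that the model classifies with high confidence, which exposes a gap between DNN generalization and human perception. The key to the approach is an analytical framework based on maximum likelihood estimation rather than conventional gradient-based optimization with explicit labels. The analysis finds that the output feature space of DNNs in the overparameterized regime collapses; this collapse improves generalization, but adding more layers eventually leads to degeneracy, where the model maps distinct inputs to the same output and reaches zero loss. To bypass this degeneracy, the authors derive a "wormhole" solution that, applied to arbitrary fooling examples, reconciles meaningful labels with random ones, offering a new perspective on shortcut learning and directions for understanding learning dynamics in unsupervised settings.

Link: https://arxiv.org/abs/2508.15086
Authors: Yen-Lung Lai,Zhe Jin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: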

Abstract:This work investigates the generalization behavior of deep neural networks (DNNs), focusing on the phenomenon of “fooling examples,” where DNNs confidently classify inputs that appear random or unstructured to humans. To explore this phenomenon, we introduce an analytical framework based on maximum likelihood estimation, without adhering to conventional numerical approaches that rely on gradient-based optimization and explicit labels. Our analysis reveals that DNNs operating in an overparameterized regime exhibit a collapse in the output feature space. While this collapse improves network generalization, adding more layers eventually leads to a state of degeneracy, where the model learns trivial solutions by mapping distinct inputs to the same output, resulting in zero loss. Further investigation demonstrates that this degeneracy can be bypassed using our newly derived “wormhole” solution. The wormhole solution, when applied to arbitrary fooling examples, reconciles meaningful labels with random ones and provides a novel perspective on shortcut learning. These findings offer deeper insights into DNN generalization and highlight directions for future research on learning dynamics in unsupervised settings to bridge the gap between theory and practice.

[AI-45] From Basic Affordances to Symbolic Thought: A Computational Phylogenesis of Biological Intelligence

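[Quick Read]: This paper asks what mechanism in the human brain lets us reason symbolically when most other animals cannot. Evidence suggests that dynamic binding is necessary for symbolic thought but not sufficient. The paper proposes that, on top of basic dynamic binding, two kinds of hierarchical integration (integration of role-bindings into multi-place predicates, and integration of correspondences into structure mappings) are the minimal requirements for basic symbolic thought. The hypothesis is tested in a systematic collection of 17 simulations: cognitive architectures with multi-place predicates and structure mapping show a clear advantage over architectures with dynamic binding alone on tasks that depend on symbolic relations rather than diagnostic features, which is consistent with the hypothesis. The findings inform how human brains give rise to symbolic thought and have implications for bio-inspired artificial intelligence.

Link: https://arxiv.org/abs/2508.15082
Authors: John E. Hummel,Rachel F. Heaton
Affiliation: Unknown
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Comments: 47 pages, 8 figures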

Abstract:What is it about human brains that allows us to reason symbolically whereas most other animals cannot? There is evidence that dynamic binding, the ability to combine neurons into groups on the fly, is necessary for symbolic thought, but there is also evidence that it is not sufficient. We propose that two kinds of hierarchical integration (integration of multiple role-bindings into multiplace predicates, and integration of multiple correspondences into structure mappings) are minimal requirements, on top of basic dynamic binding, to realize symbolic thought. We tested this hypothesis in a systematic collection of 17 simulations that explored the ability of cognitive architectures with and without the capacity for multi-place predicates and structure mapping to perform various kinds of tasks. The simulations were as generic as possible, in that no task could be performed based on any diagnostic features, depending instead on the capacity for multi-place predicates and structure mapping. The results are consistent with the hypothesis that, along with dynamic binding, multi-place predicates and structure mapping are minimal requirements for basic symbolic thought. These results inform our understanding of how human brains give rise to symbolic thought and speak to the differences between biological intelligence, which tends to generalize broadly from very few training examples, and modern approaches to machine learning, which typically require millions or billions of training examples. The results we report also have important implications for bio-inspired artificial intelligence.

[AI-46] S3LoRA: Safe Spectral Sharpness-Guided Pruning in Adaptation of Agent Planner

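[Quick Read]: This paper addresses the degradation of safety alignment that can occur when large language models (LLMs) are fine-tuned with LoRA (Low-Rank Adaptation) for agent planning tasks: the adapted model may behave unsafely or unstably, and existing safety-aware methods usually need checkpoints of both the base model and the instruction-tuned model, which are often unavailable in deployment. The key to the solution is S3LoRA (Safe Spectral Sharpness-Guided Pruning LoRA), a lightweight, data-free, model-independent framework that identifies high-risk layers by inspecting only the weight updates produced by LoRA fine-tuning. It first introduces Magnitude-Aware Spherically Normalized SVD (MAS-SVD), which robustly analyzes the structural properties of LoRA updates while preserving global magnitude information. It then defines the Spectral Sharpness Index (SSI), a sharpness-aware metric that locates layers whose updates are highly concentrated and potentially unsafe; these layers are pruned post hoc, reducing safety risk and inference cost without sacrificing task performance.

Link: https://arxiv.org/abs/2508.15068
Authors: Shuang Ao,Gopal Rumchurn
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 9 pages, 2 figures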

Abstract:Adapting Large Language Models (LLMs) using parameter-efficient fine-tuning (PEFT) techniques such as LoRA has enabled powerful capabilities in LLM-based agents. However, these adaptations can unintentionally compromise safety alignment, leading to unsafe or unstable behaviors, particularly in agent planning tasks. Existing safety-aware adaptation methods often require access to both base and instruction-tuned model checkpoints, which are frequently unavailable in practice, limiting their applicability. We propose S3LoRA (Safe Spectral Sharpness-Guided Pruning LoRA), a lightweight, data-free, and model-independent framework that mitigates safety risks in LoRA-adapted models by inspecting only the fine-tuned weight updates. We first introduce Magnitude-Aware Spherically Normalized SVD (MAS-SVD), which robustly analyzes the structural properties of LoRA updates while preserving global magnitude information. We then design the Spectral Sharpness Index (SSI), a sharpness-aware metric to detect layers with highly concentrated and potentially unsafe updates. These layers are pruned post-hoc to reduce risk without sacrificing task performance. Extensive experiments and ablation studies across agent planning and language generation tasks show that S3LoRA consistently improves safety metrics while maintaining or improving utility metrics and significantly reducing inference cost. These results establish S3LoRA as a practical and scalable solution for safely deploying LLM-based agents in real-world, resource-constrained, and safety-critical environments.

[AI-47] Demonstrating Onboard Inference for Earth Science Applications with Spectral Analysis Algorithms and Deep Learning

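[Quick Read]: This paper addresses the latency of conventional remote sensing pipelines, which rely on post-processing at ground stations and therefore limit real-time Earth science observation and response. The key to the solution is to run deep learning and spectral analysis algorithms in orbit, at the edge, on the CogniSAT-6/HAMMER (CS-6) satellite, which carries neural network acceleration hardware. Moving analysis from the ground to the satellite can enable new Earth science measurements and faster responses.

Link: https://arxiv.org/abs/2508.15053
Authors: Itai Zilberstein,Alberto Candela,Steve Chien,David Rijlaarsdam,Tom Hendrix,Leonie Buckley,Aubrey Dunne
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: International Symposium on Artificial Intelligence, Robotics and Automation in Space, November 2024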

Abstract:In partnership with Ubotica Technologies, the Jet Propulsion Laboratory is demonstrating state-of-the-art data analysis onboard CogniSAT-6/HAMMER (CS-6). CS-6 is a satellite with a visible and near infrared range hyperspectral instrument and neural network acceleration hardware. Performing data analysis at the edge (e.g. onboard) can enable new Earth science measurements and responses. We will demonstrate data analysis and inference onboard CS-6 for numerous applications using deep learning and spectral analysis algorithms.

[AI-48] Emergent Crowds Dynamics from Language-Driven Multi-Agent Interactions

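[Quick Read]: This paper addresses a gap in existing agent-based crowd simulation: social and environmental interactions are not modeled, so agent-agent and agent-environment interaction is largely limited to steering (collision avoidance) and fixed goal-directed behavior, which makes it hard to produce animations with realistic human crowd behavior. The core solution uses large language models (LLMs) in a two-component framework that combines a dialogue system with language-driven navigation. Agent-centric LLMs, conditioned on character personalities, roles, desires, and relationships, first generate dialogue between agents; the conversation, together with each agent's personality, emotional state, vision, and physical state, then guides navigation and steering. Agents can therefore base motion decisions on both perceptual input and ongoing dialogue, so that grouping and ungrouping emerge automatically in complex scenarios and information passes through the crowd, producing more realistic simulations.

Link: https://arxiv.org/abs/2508.15047
Authors: Yibo Liu,Liam Shatzel,Brandon Haworth,Teseo Schneider
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: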

Abstract:Animating and simulating crowds using an agent-based approach is a well-established area where every agent in the crowd is individually controlled such that global human-like behaviour emerges. We observe that human navigation and movement in crowds are often influenced by complex social and environmental interactions, driven mainly by language and dialogue. However, most existing work does not consider these dimensions and leads to animations where agent-agent and agent-environment interactions are largely limited to steering and fixed higher-level goal extrapolation. We propose a novel method that exploits large language models (LLMs) to control agents’ movement. Our method has two main components: a dialogue system and language-driven navigation. We periodically query agent-centric LLMs conditioned on character personalities, roles, desires, and relationships to control the generation of inter-agent dialogue when necessitated by the spatial and social relationships with neighbouring agents. We then use the conversation and each agent’s personality, emotional state, vision, and physical state to control the navigation and steering of each agent. Our model thus enables agents to make motion decisions based on both their perceptual inputs and the ongoing dialogue. We validate our method in two complex scenarios that exemplify the interplay between social interactions, steering, and crowding. In these scenarios, we observe that grouping and ungrouping of agents automatically occur. Additionally, our experiments show that our method serves as an information-passing mechanism within the crowd. As a result, our framework produces more realistic crowd simulations, with emergent group behaviours arising naturally from any environmental setting. 

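[AI-49] MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs CCS2025

[Quick Read]: This paper addresses runtime privacy leakage in large language models (LLMs) and vision language models (VLMs) built on Mixture of Experts (MoE) architectures. MoE routes each input token dynamically to specialized expert subnetworks, which improves efficiency and performance, but the resulting input-dependent activation patterns leave exploitable side-channel traces in hardware execution (such as cache occupancy and TLB evictions) that can expose sensitive user data. The key contribution is the first systematic identification and exploitation of four architectural side channels on different computing platforms (Cache Occupancy Channel and Pageout+Reload on CPUs, Performance Counter and TLB Evict+Reload on GPUs), from which the authors build four attacks: Prompt Inference Attack, Response Reconstruction Attack, Visual Inference Attack, and Visual Reconstruction Attack. The results show that MoE structures pose a serious privacy threat in real deployments and call for effective safeguards.

Link: https://arxiv.org/abs/2508.15036
Authors: Ruyi Ding,Tianhong Xu,Xinyi Shen,Aidong Adam Ding,Yunsi Fei
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: This paper will appear in CCS 2025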

Abstract:The transformer architecture has become a cornerstone of modern AI, fueling remarkable progress across applications in natural language processing, computer vision, and multimodal learning. As these models continue to scale explosively for performance, implementation efficiency remains a critical challenge. Mixture of Experts (MoE) architectures, selectively activating specialized subnetworks (experts), offer a unique balance between model accuracy and computational cost. However, the adaptive routing in MoE architectures, where input tokens are dynamically directed to specialized experts based on their semantic meaning inadvertently opens up a new attack surface for privacy breaches. These input-dependent activation patterns leave distinctive temporal and spatial traces in hardware execution, which adversaries could exploit to deduce sensitive user data. In this work, we propose MoEcho, discovering a side channel analysis based attack surface that compromises user privacy on MoE based systems. Specifically, in MoEcho, we introduce four novel architectural side channels on different computing platforms, including Cache Occupancy Channels and Pageout+Reload on CPUs, and Performance Counter and TLB Evict+Reload on GPUs, respectively. Exploiting these vulnerabilities, we propose four attacks that effectively breach user privacy in large language models (LLMs) and vision language models (VLMs) based on MoE architectures: Prompt Inference Attack, Response Reconstruction Attack, Visual Inference Attack, and Visual Reconstruction Attack. MoEcho is the first runtime architecture level security analysis of the popular MoE structure common in modern transformers, highlighting a serious security and privacy threat and calling for effective and timely safeguards when harnessing MoE based models for developing efficient large scale AI services.

[AI-50] A Systematic Survey of Model Extraction Attacks and Defenses: State-of-the-Art and Perspectives

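[Quick Read]: This paper addresses Model Extraction Attacks (MEAs) against machine learning models served through Machine-Learning-as-a-Service (MLaaS) platforms. MEAs let an adversary systematically replicate a target model's functionality by querying its publicly exposed interface, threatening intellectual property, data privacy, and system security. The key contribution is a new taxonomy that classifies MEAs along three dimensions (attack mechanisms, defense approaches, and computing environments), together with an assessment of how effective existing defenses are and of the trade-off they face between preserving model utility and ensuring security. The survey is intended as a structured reference for researchers, practitioners, and policymakers working on AI security and privacy.

Link: https://arxiv.org/abs/2508.15031
Authors: Kaixiang Zhao,Lincan Li,Kaize Ding,Neil Zhenqiang Gong,Yue Zhao,Yushun Dong
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: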

Abstract:Machine learning (ML) models have significantly grown in complexity and utility, driving advances across multiple domains. However, substantial computational resources and specialized expertise have historically restricted their wide adoption. Machine-Learning-as-a-Service (MLaaS) platforms have addressed these barriers by providing scalable, convenient, and affordable access to sophisticated ML models through user-friendly APIs. While this accessibility promotes widespread use of advanced ML capabilities, it also introduces vulnerabilities exploited through Model Extraction Attacks (MEAs). Recent studies have demonstrated that adversaries can systematically replicate a target model’s functionality by interacting with publicly exposed interfaces, posing threats to intellectual property, privacy, and system security. In this paper, we offer a comprehensive survey of MEAs and corresponding defense strategies. We propose a novel taxonomy that classifies MEAs according to attack mechanisms, defense approaches, and computing environments. Our analysis covers various attack techniques, evaluates their effectiveness, and highlights challenges faced by existing defenses, particularly the critical trade-off between preserving model utility and ensuring security. We further assess MEAs within different computing paradigms and discuss their technical, ethical, legal, and societal implications, along with promising directions for future research. This systematic survey aims to serve as a valuable reference for researchers, practitioners, and policymakers engaged in AI security and privacy. Additionally, we maintain an online repository continuously updated with related literature at this https URL.

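[AI-51] Collab-REC: An LLM-based Agentic Framework for Balancing Recommendations in Tourism

[Quick Read]: This paper addresses popularity bias in tourism recommender systems: recommendations concentrate on popular cities, so lesser-known but promising destinations are overlooked, which worsens over-tourism. The key to the solution is Collab-REC, a multi-agent framework in which three agents based on large language models (LLMs), namely Personalization, Popularity, and Sustainability, generate city suggestions from different perspectives. A non-LLM moderator then merges and refines these proposals through multi-round negotiation, making sure each agent's viewpoint is incorporated while penalizing repeated or spurious responses, which improves recommendation diversity and overall relevance.

Link: https://arxiv.org/abs/2508.15030
Authors: Ashmi Banerjee,Fitri Nur Aisyah,Adithi Satish,Wolfgang Wörndl,Yashar Deldjoo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: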

Abstract:We propose Collab-REC, a multi-agent framework designed to counteract popularity bias and enhance diversity in tourism recommendations. In our setting, three LLM-based agents (Personalization, Popularity, and Sustainability) generate city suggestions from complementary perspectives. A non-LLM moderator then merges and refines these proposals via multi-round negotiation, ensuring each agent’s viewpoint is incorporated while penalizing spurious or repeated responses. Experiments on European city queries show that Collab-REC improves diversity and overall relevance compared to a single-agent baseline, surfacing lesser-visited locales that often remain overlooked. This balanced, context-aware approach addresses over-tourism and better aligns with constraints provided by the user, highlighting the promise of multi-stakeholder collaboration in LLM-driven recommender systems.
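
One round of a non-LLM moderator can be sketched as a simple rank aggregation. The abstract does not give the scoring rule, so everything here is an assumption for illustration: a Borda-style score per agent, suggestions outside a known catalog treated as spurious and ignored, and a fixed penalty for cities an agent already offered in an earlier round.

```python
from typing import Dict, List, Set


def moderate(
    proposals: Dict[str, List[str]],
    catalog: Set[str],
    previously_offered: Dict[str, Set[str]],
    k: int = 3,
    penalty: float = 2.0,
) -> List[str]:
    """Merge ranked city lists from several agents into one top-k list."""
    scores: Dict[str, float] = {}
    for agent, cities in proposals.items():
        for rank, city in enumerate(cities):
            if city not in catalog:
                continue  # spurious suggestion: not a known city, so it earns nothing
            score = float(len(cities) - rank)  # Borda-style: earlier rank scores higher
            if city in previously_offered.get(agent, set()):
                score -= penalty  # discourage repeating an earlier round's suggestion
            scores[city] = scores.get(city, 0.0) + score
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:k]


catalog = {"Vienna", "Ljubljana", "Paris", "Tartu", "Porto"}
proposals = {
    "personalization": ["Porto", "Vienna", "Ljubljana"],
    "popularity": ["Paris", "Vienna", "Atlantis"],
    "sustainability": ["Ljubljana", "Tartu", "Porto"],
}
previously_offered = {"popularity": {"Paris"}}

top = moderate(proposals, catalog, previously_offered)
print(top)
```

In this toy round the repeated suggestion loses most of its score and the unknown city is dropped, so cities backed by more than one perspective rise to the top. A multi-round version would feed the merged list back to the agents and update `previously_offered` after each round.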

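[AI-52] Twin-Boot: Uncertainty-Aware Optimization via Online Two-Sample Bootstrapping

[Quick Read]: This paper addresses the lack of uncertainty estimates for deep learning models in overparameterized and low-data regimes, where models overfit easily and standard gradient descent yields only point estimates with no measure of confidence. The key to the solution is Twin-Bootstrap Gradient Descent (Twin-Boot): two identical models are trained in parallel, each on an independent bootstrap sample, and a periodic mean-reset keeps both trajectories in the same basin, so that the difference between their parameters reflects local (within-basin) uncertainty. During training this estimate is used to sample weights adaptively, giving data-driven regularization that favors flatter solutions, improves calibration and generalization, and yields interpretable uncertainty maps.

Link: https://arxiv.org/abs/2508.15019
Authors: Carlos Stein Brito
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation (stat.CO); Machine Learning (stat.ML)
Comments: 12 pages, 6 figures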

Abstract:Standard gradient descent methods yield point estimates with no measure of confidence. This limitation is acute in overparameterized and low-data regimes, where models have many parameters relative to available data and can easily overfit. Bootstrapping is a classical statistical framework for uncertainty estimation based on resampling, but naively applying it to deep learning is impractical: it requires training many replicas, produces post-hoc estimates that cannot guide learning, and implicitly assumes comparable optima across runs - an assumption that fails in non-convex landscapes. We introduce Twin-Bootstrap Gradient Descent (Twin-Boot), a resampling-based training procedure that integrates uncertainty estimation into optimization. Two identical models are trained in parallel on independent bootstrap samples, and a periodic mean-reset keeps both trajectories in the same basin so that their divergence reflects local (within-basin) uncertainty. During training, we use this estimate to sample weights in an adaptive, data-driven way, providing regularization that favors flatter solutions. In deep neural networks and complex high-dimensional inverse problems, the approach improves calibration and generalization and yields interpretable uncertainty maps.
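
The twin-training loop can be sketched on a one-parameter regression problem. The sketch assumes a model y ≈ w · x trained by plain gradient descent, records the gap between the twins just before each reset as the uncertainty estimate, and then resets both to their mean. The weight-sampling step mentioned in the abstract is omitted, and the data, hyperparameters, and function names are hypothetical.

```python
import random
from typing import List, Tuple

Sample = List[Tuple[float, float]]


def bootstrap(data: Sample, rng: random.Random) -> Sample:
    """Resample the data with replacement, keeping the original size."""
    return [rng.choice(data) for _ in data]


def grad(w: float, sample: Sample) -> float:
    """Gradient of the mean squared error for the model y ≈ w * x."""
    return sum(2 * (w * x - y) * x for x, y in sample) / len(sample)


def twin_boot(data: Sample, steps: int = 200, lr: float = 0.05,
              reset_every: int = 20, seed: int = 0):
    rng = random.Random(seed)
    sample_a, sample_b = bootstrap(data, rng), bootstrap(data, rng)
    w_a = w_b = 0.0
    spreads = []  # |w_a - w_b| just before each reset: the uncertainty proxy
    for step in range(1, steps + 1):
        w_a -= lr * grad(w_a, sample_a)
        w_b -= lr * grad(w_b, sample_b)
        if step % reset_every == 0:
            spreads.append(abs(w_a - w_b))
            w_a = w_b = (w_a + w_b) / 2  # mean-reset keeps the twins together
    return w_a, w_b, spreads


# Hypothetical data: y = 3x plus small fixed noise.
xs = [0.5, 1.0, 1.5, 2.0, 0.5, 1.0, 1.5, 2.0]
noise = [0.1, -0.2, 0.15, -0.1, -0.05, 0.2, -0.15, 0.05]
data = [(x, 3 * x + n) for x, n in zip(xs, noise)]

w_a, w_b, spreads = twin_boot(data)
print(w_a, spreads[-1])
```

Because each twin is pulled toward the optimum of its own resample between resets, the recorded gap grows with how much the resamples disagree, which is why it can serve as a rough uncertainty signal in this toy setting.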

[AI-53] Goals and the Structure of Experience

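[Quick read]: This paper asks how purposeful behavior can arise from mechanisms internal to a cognitive agent, and in particular examines the limits of the conventional framework that splits a world model into two independent components: a descriptive one (state representation) and a prescriptive one (reward function). The core challenge is to explain whether these two aspects need not be specified in advance but can instead co-emerge from the agent's experience of interacting with its environment. The key to the solution is a computational framework based on goal-directed (telic) states, in which the descriptive and prescriptive dimensions are not separate but co-evolve from the agent's goal-driven experience distributions. The framework is quantified by the statistical divergence between behavioral policies and desirable experience features, and offers a unified theoretical perspective that integrates behavioral, phenomenological, and neural accounts of purposeful behavior.

Link: https://arxiv.org/abs/2508.15013
Authors: Nadav Amir,Stas Tiomkin,Angela Langdon
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
Comments: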

Abstract:Purposeful behavior is a hallmark of natural and artificial intelligence. Its acquisition is often believed to rely on world models, comprising both descriptive (what is) and prescriptive (what is desirable) aspects that identify and evaluate state of affairs in the world, respectively. Canonical computational accounts of purposeful behavior, such as reinforcement learning, posit distinct components of a world model comprising a state representation (descriptive aspect) and a reward function (prescriptive aspect). However, an alternative possibility, which has not yet been computationally formulated, is that these two aspects instead co-emerge interdependently from an agent’s goal. Here, we describe a computational framework of goal-directed state representation in cognitive agents, in which the descriptive and prescriptive aspects of a world model co-emerge from agent-environment interaction sequences, or experiences. Drawing on Buddhist epistemology, we introduce a construct of goal-directed, or telic, states, defined as classes of goal-equivalent experience distributions. Telic states provide a parsimonious account of goal-directed learning in terms of the statistical divergence between behavioral policies and desirable experience features. We review empirical and theoretical literature supporting this novel perspective and discuss its potential to provide a unified account of behavioral, phenomenological and neural dimensions of purposeful behaviors across diverse substrates.

[AI-54] Quantized Neural Networks for Microcontrollers: A Comprehensive Review of Methods Platforms and Applications

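[Quick read]: This paper addresses the difficulty of balancing model performance, computational complexity, and memory constraints when deploying Quantized Neural Networks (QNNs) on resource-constrained devices such as microcontrollers. Its approach is hardware-centric: it systematically reviews the quantization techniques used to accelerate deep learning models, examines the key trade-offs between model performance and hardware capabilities, and evaluates the software frameworks and hardware platforms designed for running QNNs on microcontrollers, providing theoretical grounding and practical guidance for efficient model deployment in Tiny Machine Learning (TinyML).

Link: https://arxiv.org/abs/2508.15008
Authors: Hamza A. Abushahla,Dara Varam,Ariel J. N. Panopio,Mohamed I. AlHajri
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
Comments: 39 pages, 16 figures, 8 Tables, submitted to the Proceedings of the IEEE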

Abstract:The deployment of Quantized Neural Networks (QNNs) on resource-constrained devices, such as microcontrollers, has introduced significant challenges in balancing model performance, computational complexity and memory constraints. Tiny Machine Learning (TinyML) addresses these issues by integrating advancements across machine learning algorithms, hardware acceleration, and software optimization to efficiently run deep neural networks on embedded systems. This survey presents a hardware-centric introduction to quantization, systematically reviewing essential quantization techniques employed to accelerate deep learning models for embedded applications. In particular, further emphasis is put on critical trade-offs among model performance and hardware capabilities. The survey further evaluates existing software frameworks and hardware platforms designed specifically for supporting QNN execution on microcontrollers. Moreover, we provide an analysis of the current challenges and an outline of promising future directions in the rapidly evolving domain of QNN deployment.
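
As a concrete reference point for the quantization techniques the survey covers, here is a minimal sketch of per-tensor affine (asymmetric) int8 quantization. The abstract does not name a specific scheme, so this particular one is an assumption chosen because it is widely used; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Map a float tensor to int8 with a single scale and zero point."""
    x = np.asarray(x, dtype=np.float32)
    lo, hi = float(x.min()), float(x.max())
    # Widen the range to include 0.0 so that zero is exactly representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover an approximate float tensor from its int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale
```

The round trip loses at most about one quantization step (`scale`) per element, which is the accuracy cost traded for a 4x smaller tensor than float32.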

[AI-55] Quantum Long Short-term Memory with Differentiable Architecture Search

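[Quick read]: This paper addresses the difficulty of designing variational quantum circuits (VQCs) for quantum sequence learning, where effective designs are hard to find and highly task-specific, particularly for quantum recurrent models such as QLSTM. The key solution is DiffQAS-QLSTM, an end-to-end differentiable framework that optimizes both the VQC parameters and the architecture selection during training, enabling more efficient and adaptive quantum sequence modeling.

Link: https://arxiv.org/abs/2508.14955
Authors: Samuel Yen-Chi Chen,Prayag Tiwari
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Quantum Physics (quant-ph)
Comments: Accepted by the IEEE International Conference on Quantum Artificial Intelligence (QAI) 2025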

Abstract:Recent advances in quantum computing and machine learning have given rise to quantum machine learning (QML), with growing interest in learning from sequential data. Quantum recurrent models like QLSTM are promising for time-series prediction, NLP, and reinforcement learning. However, designing effective variational quantum circuits (VQCs) remains challenging and often task-specific. To address this, we propose DiffQAS-QLSTM, an end-to-end differentiable framework that optimizes both VQC parameters and architecture selection during training. Our results show that DiffQAS-QLSTM consistently outperforms handcrafted baselines, achieving lower loss across diverse test settings. This approach opens the door to scalable and adaptive quantum sequence learning.

[AI-56] Inference Time Debiasing Concepts in Diffusion Models

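[Quick read]: This paper addresses bias in text-to-image diffusion models, specifically the unfair representation of protected attributes such as gender, ethnicity, and age in generated images. The key to the solution is a debiasing method called DeCoDi, which adjusts the inference procedure to steer away from latent-space regions associated with biased concepts, without changing model training or adding significant compute overhead. Because it modifies only the sampling path of the diffusion process, it is lightweight, general, and easy to deploy, and it applies to any diffusion-based image generation model. Experiments show that DeCoDi reduces bias for several key concepts (nurse, firefighter, CEO), with its effectiveness and consistency confirmed by both human and automatic (GPT-4o) evaluation.

Link: https://arxiv.org/abs/2508.14933
Authors: Lucas S. Kupssinskü,Marco N. Bochernitsan,Jordan Kopper,Otávio Parraga,Rodrigo C. Barros
Affiliation: Unknown
Categories: Graphics (cs.GR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: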

Abstract:We propose DeCoDi, a debiasing procedure for text-to-image diffusion-based models that changes the inference procedure, does not significantly change image quality, has negligible compute overhead, and can be applied in any diffusion-based image generation model. DeCoDi changes the diffusion process to avoid latent dimension regions of biased concepts. While most deep learning debiasing methods require complex or compute-intensive interventions, our method is designed to change only the inference procedure. Therefore, it is more accessible to a wide range of practitioners. We show the effectiveness of the method by debiasing for gender, ethnicity, and age for the concepts of nurse, firefighter, and CEO. Two distinct human evaluators manually inspect 1,200 generated images. Their evaluation results provide evidence that our method is effective in mitigating biases based on gender, ethnicity, and age. We also show that an automatic bias evaluation performed by GPT-4o is not statistically significantly different from a human evaluation. Our evaluation shows promising results, with reliable levels of agreement between evaluators and more coverage of protected attributes. Our method has the potential to significantly improve the diversity of images generated by diffusion-based text-to-image generative models.

[AI-57] AI Testing Should Account for Sophisticated Strategic Behaviour

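[Quick read]: This paper addresses a limitation of current AI testing and evaluation methods in predicting deployment behavior: existing evaluations often overlook the possibility that AI systems understand their circumstances and reason strategically. The key to the solution is game-theoretic analysis, which formalizes and scrutinizes the reasoning in evaluation-based safety cases so that evaluation frameworks can be designed to better reflect how AI systems actually behave.

Link: https://arxiv.org/abs/2508.14927
Authors: Vojtech Kovarik,Eric Olav Chen,Sami Petersen,Alexis Ghersengorin,Vincent Conitzer
Affiliation: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI)
Comments: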

Abstract:This position paper argues for two claims regarding AI testing and evaluation. First, to remain informative about deployment behaviour, evaluations need to account for the possibility that AI systems understand their circumstances and reason strategically. Second, game-theoretic analysis can inform evaluation design by formalising and scrutinising the reasoning in evaluation-based safety cases. Drawing on examples from existing AI systems, a review of relevant research, and formal strategic analysis of a stylised evaluation scenario, we present evidence for these claims and motivate several research directions.

[AI-58] Learning to Drive Ethically: Embedding Moral Reasoning into Autonomous Driving

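[Quick read]: This paper addresses ethically accountable decision-making for autonomous vehicles in complex, human-mixed traffic: how to embed moral reasoning, on top of driving efficiency and safety, to handle ethical dilemmas in both routine and emergency situations. The key to the solution is a hierarchical Safe Reinforcement Learning (Safe RL) framework. At the decision level, an agent is trained with a composite ethical risk cost that combines collision probability and harm severity to generate high-level motion targets. At the execution level, polynomial path planning together with Proportional-Integral-Derivative (PID) and Stanley controllers tracks smooth, feasible trajectories, while a dynamic Prioritized Experience Replay mechanism improves learning from rare, high-risk events. According to the authors, this is the first study to validate Safe RL-based ethical decision-making in real-world traffic scenarios, illustrating the benefit of combining formal control theory with data-driven learning.

Link: https://arxiv.org/abs/2508.14926
Authors: Dianzhao Li,Ostap Okhrin
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: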

Abstract:Autonomous vehicles hold great promise for reducing traffic fatalities and improving transportation efficiency, yet their widespread adoption hinges on embedding robust ethical reasoning into routine and emergency maneuvers. Here, we present a hierarchical Safe Reinforcement Learning (Safe RL) framework that explicitly integrates moral considerations with standard driving objectives. At the decision level, a Safe RL agent is trained using a composite ethical risk cost, combining collision probability and harm severity, to generate high-level motion targets. A dynamic Prioritized Experience Replay mechanism amplifies learning from rare but critical, high-risk events. At the execution level, polynomial path planning coupled with Proportional-Integral-Derivative (PID) and Stanley controllers translates these targets into smooth, feasible trajectories, ensuring both accuracy and comfort. We train and validate our approach on rich, real-world traffic datasets encompassing diverse vehicles, cyclists, and pedestrians, and demonstrate that it outperforms baseline methods in reducing ethical risk and maintaining driving performance. To our knowledge, this is the first study of ethical decision-making for autonomous vehicles via Safe RL in real-world scenarios. Our results highlight the potential of combining formal control theory and data-driven learning to advance ethically accountable autonomy in complex, human-mixed traffic environments.
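
The execution level relies on PID control, which can be sketched in a few lines. This is a generic textbook discrete PID loop, not the paper's controller: the class name `PID`, the gains, and the toy speed-tracking plant in the usage note are all hypothetical, and the Stanley lateral controller is not shown.

```python
class PID:
    """Discrete PID controller: u = kp*e + ki*integral(e) + kd*de/dt."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        # Accumulate the integral term with a simple rectangle rule.
        self.integral += error * dt
        # No derivative on the first call, since there is no previous error yet.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def track_speed(target, steps=2000, dt=0.1):
    """Toy plant (assumption): speed changes in proportion to the control signal."""
    pid = PID(kp=1.0, ki=0.1, kd=0.0)
    speed = 0.0
    for _ in range(steps):
        speed += pid.step(target - speed, dt) * dt
    return speed
```

Calling `track_speed(10.0)` drives the toy plant toward the target speed; in a real vehicle the gains would need tuning against the actual longitudinal dynamics.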

[AI-59] A Fully Spectral Neuro-Symbolic Reasoning Architecture with Graph Signal Processing as the Computational Backbone

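[Quick read]: This paper addresses weak logical consistency, poor interpretability, and low computational efficiency in existing neuro-symbolic reasoning models. The core solution is a fully spectral neuro-symbolic reasoning architecture that uses Graph Signal Processing (GSP) as its main computational backbone: logical entities and relationships are encoded as graph signals, learnable spectral filters control multi-scale information propagation, and the result is mapped into symbolic predicates for rule-based inference. The key innovation is formulating the entire reasoning pipeline in the graph spectral domain, with a mathematical framework that includes graph Fourier transforms, band-selective attention, and spectral rule grounding, aiming at more robust, efficient, and interpretable reasoning.

Link: https://arxiv.org/abs/2508.14923
Authors: Andrew Kiruluta
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: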

Abstract:We propose a fully spectral, neuro-symbolic reasoning architecture that leverages Graph Signal Processing (GSP) as the primary computational backbone for integrating symbolic logic and neural inference. Unlike conventional reasoning models that treat spectral graph methods as peripheral components, our approach formulates the entire reasoning pipeline in the graph spectral domain. Logical entities and relationships are encoded as graph signals, processed via learnable spectral filters that control multi-scale information propagation, and mapped into symbolic predicates for rule-based inference. We present a complete mathematical framework for spectral reasoning, including graph Fourier transforms, band-selective attention, and spectral rule grounding. Experiments on benchmark reasoning datasets (ProofWriter, EntailmentBank, bAbI, CLUTRR, and ARC-Challenge) demonstrate improvements in logical consistency, interpretability, and computational efficiency over state-of-the-art neuro-symbolic models. Our results suggest that GSP provides a mathematically grounded and computationally efficient substrate for robust and interpretable reasoning systems.
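
The graph Fourier transform and spectral filtering mentioned in the abstract can be illustrated with the standard GSP construction. This is a minimal sketch assuming an undirected graph and the combinatorial Laplacian L = D - A; the paper's learnable filters are replaced here by a fixed, user-supplied frequency response, and the function names are hypothetical.

```python
import numpy as np

def graph_fourier_basis(adj):
    """Eigen-decompose the combinatorial Laplacian L = D - A of an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    # eigh is appropriate because L is symmetric; eigenvalues act as graph frequencies.
    eigvals, eigvecs = np.linalg.eigh(lap)
    return eigvals, eigvecs

def spectral_filter(signal, eigvals, eigvecs, response):
    """Filter a graph signal: transform, scale each frequency, transform back."""
    coeffs = eigvecs.T @ signal                     # graph Fourier transform
    return eigvecs @ (response(eigvals) * coeffs)   # inverse transform
```

On a connected graph, a response that keeps only the zero frequency replaces every node value with the signal's mean, which is the graph analogue of extracting the DC component.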

[AI-60] Designing an Interdisciplinary Artificial Intelligence Curriculum for Engineering: Evaluation and Insights from Experts

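[Quick read]: This paper addresses the fact that curricula for building AI-related competencies in higher education remain underdeveloped; interdisciplinary curriculum development in particular lacks systematic research and practice, making it hard to meet the growing demand for AI professionals. The key to the solution is a mixed-methods evaluation (quantitative curriculum mapping plus qualitative focus group interviews) of a new undergraduate program in AI in engineering. The evaluation covers alignment with targeted competencies, quality, consistency, practicality, and effectiveness, and compares the perceptions of educators who took part in developing the curriculum with those who did not, providing empirical evidence and a transferable path for building interdisciplinary AI curricula.

Link: https://arxiv.org/abs/2508.14921
Authors: Johannes Schleiss,Anke Manukjan,Michelle Ines Bieber,Sebastian Lang,Sebastian Stober
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: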

Abstract:As Artificial Intelligence (AI) increasingly impacts professional practice, there is a growing need to integrate AI-related competencies into higher education curricula. However, research on the implementation of AI education within study programs remains limited and requires new forms of collaboration across disciplines. This study addresses this gap and explores perspectives on interdisciplinary curriculum development through the lens of different stakeholders. In particular, we examine the case of curriculum development for a novel undergraduate program in AI in engineering. The research uses a mixed methods approach, combining quantitative curriculum mapping with qualitative focus group interviews. In addition to assessing the alignment of the curriculum with the targeted competencies, the study also examines the perceived quality, consistency, practicality and effectiveness from both academic and industry perspectives, as well as differences in perceptions between educators who were involved in the development and those who were not. The findings provide a practical understanding of the outcomes of interdisciplinary AI curriculum development and contribute to a broader understanding of how educator participation in curriculum development influences perceptions of quality aspects. It also advances the field of AI education by providing a reference point and insights for further interdisciplinary curriculum developments in response to evolving industry needs.

[AI-61] Disentangling the Drivers of LLM Social Conformity: An Uncertainty-Moderated Dual-Process Mechanism

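[Quick read]: This paper investigates what drives the social conformity that large language models (LLMs) exhibit in collaborative teams, and in particular separates two underlying psychological mechanisms: informational influence (rational use of group cues to improve accuracy) and normative influence (seeking approval under social pressure). The key to the solution is adapting the information cascade paradigm from behavioral economics and manipulating the level of information uncertainty (q = 0.667, 0.55, 0.70) to disentangle the two drivers quantitatively. The results indicate that LLM behavior is driven mainly by informational influence, with accuracy and confidence rising as evidence strengthens, but that uncertainty strongly moderates this process. Under low-to-medium uncertainty the models follow a conservative strategy (systematically underweighting all evidence sources), whereas under high uncertainty a normative-like amplification appears (public signals are overweighted, β = 1.55, versus β = 0.81 for private signals), marking a boundary condition at which LLMs shift from rational to heuristic processing.

Link: https://arxiv.org/abs/2508.14918
Authors: Huixin Zhong,Yanan Liu,Qi Cao,Shijin Wang,Zijing Ye,Zimu Wang,Shiyao Zhang
Affiliation: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: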

Abstract:As large language models (LLMs) integrate into collaborative teams, their social conformity – the tendency to align with majority opinions – has emerged as a key concern. In humans, conformity arises from informational influence (rational use of group cues for accuracy) or normative influence (social pressure for approval), with uncertainty moderating this balance by shifting from purely analytical to heuristic processing. It remains unclear whether these human psychological mechanisms apply to LLMs. This study adapts the information cascade paradigm from behavioral economics to quantitatively disentangle the two drivers and to investigate the moderating effect of uncertainty. We evaluated nine leading LLMs across three decision-making scenarios (medical, legal, investment), manipulating information uncertainty (q = 0.667, 0.55, and 0.70, respectively). Our results indicate that informational influence underpins the models’ behavior across all contexts, with accuracy and confidence consistently rising with stronger evidence. However, this foundational mechanism is dramatically modulated by uncertainty. In low-to-medium uncertainty scenarios, this informational process is expressed as a conservative strategy, where LLMs systematically underweight all evidence sources. In contrast, high uncertainty triggers a critical shift: while still processing information, the models additionally exhibit a normative-like amplification, causing them to overweight public signals (β = 1.55 for public signals vs. β = 0.81 for private signals).
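
To make the role of the β weights concrete, here is a small sketch of a weighted log-odds model of belief in an information cascade. The abstract does not give the paper's exact regression specification, so the functional form below is an assumption: each signal of accuracy q contributes log(q / (1 - q)) to the log-odds, predecessors' choices are treated as independent public signals, and β = 1 for both sources recovers the Bayesian benchmark. The function name `posterior_a` is hypothetical.

```python
import math

def posterior_a(q, private_signal, public_net, beta_private=1.0, beta_public=1.0):
    """Belief that option A is correct under a weighted log-odds model.

    q              -- probability that a single signal is correct (0.5 < q < 1)
    private_signal -- +1 if the agent's own signal favors A, -1 if it favors B
    public_net     -- (predecessors who chose A) minus (predecessors who chose B)
    """
    llr = math.log(q / (1.0 - q))  # log-likelihood ratio carried by one signal
    logit = llr * (beta_private * private_signal + beta_public * public_net)
    return 1.0 / (1.0 + math.exp(-logit))
```

With the reported weights, `posterior_a(0.55, 1, -1, beta_private=0.81, beta_public=1.55)` falls below 0.5: a single opposing public signal outweighs the agent's own private signal, whereas the Bayesian benchmark would leave the belief at exactly 0.5.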

[AI-62] Collaborative Filtering using Variational Quantum Hopfield Associative Memory

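[Quick read]: This paper addresses the limited efficiency and accuracy of conventional recommender systems on complex user behavior patterns, especially the challenge of extracting high-dimensional features and classifying accurately on large datasets. The key to the solution is a hybrid recommendation framework that combines quantum computing with deep learning. Users are first clustered with the K-Means algorithm and their features are converted into polar patterns through the encoder's activation function. These patterns are then fed into a quantum module based on variational Quantum Hopfield Associative Memory (QHAM) for storage and retrieval. The model also reduces the qubit overhead of the quantum circuit by updating only one randomly targeted qubit, lowering hardware requirements. Experiments report ROC values of 0.9795 in an ideal environment and 0.9177 in a noisy environment that simulates real quantum hardware errors, supporting the method's robustness.

Link: https://arxiv.org/abs/2508.14906
Authors: Amir Kermanshahani,Ebrahim Ardeshir-Larijani,Rakesh Saini,Saif Al-Kuwari
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
Comments: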

Abstract:Quantum computing, with its ability to do exponentially faster computation compared to classical systems, has found novel applications in various fields such as machine learning and recommendation systems. Quantum Machine Learning (QML), which integrates quantum computing with machine learning techniques, presents powerful new tools for data processing and pattern recognition. This paper proposes a hybrid recommendation system that combines Quantum Hopfield Associative Memory (QHAM) with deep neural networks to improve the extraction and classification on the MovieLens 1M dataset. User archetypes are clustered into multiple unique groups using the K-Means algorithm and converted into polar patterns through the encoder’s activation function. These polar patterns are then integrated into the variational QHAM-based hybrid recommendation model. The system was trained using the MSE loss over 35 epochs in an ideal environment, achieving an ROC value of 0.9795, an accuracy of 0.8841, and an F-1 Score of 0.8786. Trained with the same number of epochs in a noisy environment using a custom Qiskit AER noise model incorporating bit-flip and readout errors with the same probabilities as in real quantum hardware, it achieves an ROC of 0.9177, an accuracy of 0.8013, and an F-1 Score equal to 0.7866, demonstrating consistent performance. Additionally, we were able to optimize the qubit overhead present in previous QHAM architectures by efficiently updating only one random targeted qubit. This research presents a novel framework that combines variational quantum computing with deep learning, capable of dealing with real-world datasets with comparable performance compared to purely classical counterparts. Additionally, the model can perform similarly well in noisy configurations, showcasing a steady performance and proposing a promising direction for future usage in recommendation systems. 

[AI-63] Privacy Preserving Inference of Personalized Content for Out of Matrix Users

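[Quick read]: This paper addresses three challenges that recommender systems face in niche, dynamic communities: data sparsity, cold start users and items, and privacy constraints. Traditional collaborative filtering and content-based methods perform poorly here, either relying on invasive user data or failing when preference histories are missing. The key to the solution is DeepNaniNet, a deep neural recommendation framework with an inductive graph-based architecture. Its main innovations are: 1) combining user-item interactions, item-item relations, and rich textual review embeddings derived from BERT; 2) a "content basket" user representation that enables cold start recommendations without user profiling; 3) an autoencoder-based generalization strategy for unseen users. The approach performs strongly on the new AnimeULike dataset and the CiteULike benchmark, particularly in cold start scenarios, while preserving recommendation quality in privacy-sensitive settings.

Link: https://arxiv.org/abs/2508.14905
Authors: Michael Sun,Tai Vu,Andrew Wang
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: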

Abstract:Recommender systems for niche and dynamic communities face persistent challenges from data sparsity, cold start users and items, and privacy constraints. Traditional collaborative filtering and content-based approaches underperform in these settings, either requiring invasive user data or failing when preference histories are absent. We present DeepNaniNet, a deep neural recommendation framework that addresses these challenges through an inductive graph-based architecture combining user-item interactions, item-item relations, and rich textual review embeddings derived from BERT. Our design enables cold start recommendations without profile mining, using a novel “content basket” user representation and an autoencoder-based generalization strategy for unseen users. We introduce AnimeULike, a new dataset of 10,000 anime titles and 13,000 users, to evaluate performance in realistic scenarios with high proportions of guest or low-activity users. DeepNaniNet achieves state-of-the-art cold start results on the CiteULike benchmark, matches DropoutNet in user recall without performance degradation for out-of-matrix users, and outperforms Weighted Matrix Factorization (WMF) and DropoutNet on AnimeULike warm start by up to 7x and 1.5x in Recall@100, respectively. Our findings demonstrate that DeepNaniNet delivers high-quality, privacy-preserving recommendations in data-sparse, cold start-heavy environments while effectively integrating heterogeneous content sources.
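
The Recall@100 figures quoted above use a standard ranking metric, sketched here for a single user. The function name `recall_at_k` is hypothetical, and the sketch assumes the ranked list contains no duplicate items; how the paper averages over users is not stated in the abstract.

```python
def recall_at_k(ranked_items, relevant_items, k=100):
    """Fraction of a user's relevant items that appear in the top-k recommendations."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0  # convention chosen here for users with no held-out items
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)
```

For example, if two of a user's three held-out items appear in the top 100, Recall@100 for that user is 2/3.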

[AI-64] Accelerating GenAI Workloads by Enabling RISC-V Microkernel Support in IREE

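[Quick read]: This paper addresses efficient execution of machine learning inference on the RISC-V architecture, specifically by adding RISC-V microkernel support to IREE (an MLIR-based machine learning compiler and runtime). The key to the solution is twofold: first, enabling the IREE pass pipeline to lower contraction ops of the MLIR linalg dialect to the linalg.mmt4d op for the RISC-V64 target; second, developing microkernels optimized for RISC-V to improve inference performance. The gains are evaluated on the Llama-3.2-1B-Instruct model against upstream IREE and a reference implementation.

Link: https://arxiv.org/abs/2508.14899
Authors: Adeel Ahmad,Ahmad Tameem Kamal,Nouman Amir,Bilal Zafar,Saad Bin Nasir
Affiliation: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI)
Comments: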

Abstract:This project enables RISC-V microkernel support in IREE, an MLIR-based machine learning compiler and runtime. The approach begins by enabling the lowering of MLIR linalg dialect contraction ops to linalg.mmt4d op for the RISC-V64 target within the IREE pass pipeline, followed by the development of optimized microkernels for RISC-V. The performance gains are compared with upstream IREE and this http URL for the Llama-3.2-1B-Instruct model.

[AI-65] Numerical models outperform AI weather forecasts of record-breaking extremes

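[Quick read]: This paper asks whether current AI-based weather forecasting models are limited in their ability to extrapolate, and in particular whether they remain reliable when forecasting unprecedented extremes such as record-breaking heat, cold, and wind. Through a systematic comparison, the authors find that although AI models have surpassed traditional Numerical Weather Prediction (NWP) systems, such as the High RESolution forecast (HRES) of the European Centre for Medium-Range Weather Forecasts, on routine forecasting tasks, HRES still performs better on record-breaking events. AI models tend to underestimate both the frequency and the intensity of such events, with errors that grow with the size of the record exceedance. This points to limited generalization beyond the training distribution and to a potential risk in relying on these models for warnings of extremes, which are becoming more frequent in a warming climate.

Link: https://arxiv.org/abs/2508.15724
Authors: Zhongwei Zhang,Erich Fischer,Jakob Zscheischler,Sebastian Engelke
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Applications (stat.AP)
Comments: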

Abstract:Artificial intelligence (AI)-based models are revolutionizing weather forecasting and have surpassed leading numerical weather prediction systems on various benchmark tasks. However, their ability to extrapolate and reliably forecast unprecedented extreme events remains unclear. Here, we show that for record-breaking weather extremes, the numerical model High RESolution forecast (HRES) from the European Centre for Medium-Range Weather Forecasts still consistently outperforms state-of-the-art AI models GraphCast, GraphCast operational, Pangu-Weather, Pangu-Weather operational, and Fuxi. We demonstrate that forecast errors in AI models are consistently larger for record-breaking heat, cold, and wind than in HRES across nearly all lead times. We further find that the examined AI models tend to underestimate both the frequency and intensity of record-breaking events, and they underpredict hot records and overestimate cold records with growing errors for larger record exceedance. Our findings underscore the current limitations of AI weather models in extrapolating beyond their training domain and in forecasting the potentially most impactful record-breaking weather events that are particularly frequent in a rapidly warming climate. Further rigorous verification and model development is needed before these models can be solely relied upon for high-stakes applications such as early warning systems and disaster management.

[AI-66] LoUQAL: Low-fidelity informed Uncertainty Quantification for Active Learning in the chemical configuration space

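[Quick read]: This paper addresses the accuracy and efficiency of Uncertainty Quantification (UQ) in active learning, in particular how to use low-fidelity calculations (cheaper but less accurate approximations) to train models more efficiently when predicting quantum chemical properties such as excitation energies and ab initio potential energy surfaces. The key to the solution is a low-fidelity informed uncertainty quantification method that uses the prior information in low-fidelity data to improve the uncertainty estimate, reducing the model's empirical error in fewer iterations.

Link: https://arxiv.org/abs/2508.15577
Authors: Vivin Vinod,Peter Zaspel
Affiliation: Unknown
Categories: Chemical Physics (physics.chem-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: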

Abstract:Uncertainty quantification is an important component of active learning techniques, including applications in predicting quantum chemical properties. In quantum chemical calculations, there exists the notion of fidelity: a less accurate computation is available at a lower computational cost. This work proposes a novel low-fidelity informed uncertainty quantification for active learning with applications in predicting diverse quantum chemical properties such as excitation energies and ab initio potential energy surfaces. Computational experiments are carried out to assess the proposed method, with results demonstrating that models trained with the novel method outperform alternatives in terms of empirical error and number of iterations required. The effect of the choice of fidelity is also studied to perform a thorough benchmark.

[AI-67] Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

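[Quick read]: This paper addresses hallucinated speech in Language Model (LM)-based Text-to-Speech (TTS) systems, where the generated speech deviates from the input text and harms the accuracy and consistency of synthesis. Existing methods either need large training resources or add significant inference latency, which makes them hard to deploy. The key to the solution is a post-training framework, GFlOwNet-guided distribution AlignmenT (GOAT). An uncertainty analysis shows a strong positive correlation between hallucination and model uncertainty; TTS generation is then reformulated as a trajectory flow optimization problem, with an enhanced Subtrajectory Balance objective and a sharpened internal reward as the target distribution, combined with reward temperature decay and learning rate optimization to balance stability and performance. Experiments show that GOAT reduces character error rates by more than 50% on challenging test cases and lowers uncertainty by up to 58%.

Link: https://arxiv.org/abs/2508.15442
Authors: Chenlin Liu,Minghui Fang,Patrick Zhang,Wei Zhou,Jie Gao,Jiqing Han
Affiliation: Unknown
Categories: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: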

Abstract:Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.

[AI-68] Robust and Efficient Quantum Reservoir Computing with Discrete Time Crystal

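[Quick read]: This paper addresses the trainability and noise-robustness bottlenecks of current Quantum Machine Learning (QML) methods based on variational quantum algorithms. The key to the solution is a gradient-free, noise-robust Quantum Reservoir Computing (QRC) algorithm that uses Discrete Time Crystal (DTC) dynamics as the reservoir. By tuning non-equilibrium phase transitions and quantum many-body dynamics, it controls the reservoir's memory, nonlinear, and information scrambling capacities. The authors report a comparative quantum kernel advantage on image classification, and both noisy simulations and experiments on superconducting quantum processors confirm topological noise robustness and accuracy that improves with system size.

Link: https://arxiv.org/abs/2508.15230
Authors: Da Zhang,Xin Li,Yibin Guo,Haifeng Yu,Yirong Jin,Zhang-Qi Yin
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 7 figures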

Abstract:The rapid development of machine learning and quantum computing has placed quantum machine learning at the forefront of research. However, existing quantum machine learning algorithms based on quantum variational algorithms face challenges in trainability and noise robustness. In order to address these challenges, we introduce a gradient-free, noise-robust quantum reservoir computing algorithm that harnesses discrete time crystal dynamics as a reservoir. We first calibrate the memory, nonlinear, and information scrambling capacities of the quantum reservoir, revealing their correlation with dynamical phases and non-equilibrium phase transitions. We then apply the algorithm to the binary classification task and establish a comparative quantum kernel advantage. For ten-class classification, both noisy simulations and experimental results on superconducting quantum processors match ideal simulations, demonstrating the enhanced accuracy with increasing system size and confirming the topological noise robustness. Our work presents the first experimental demonstration of quantum reservoir computing for image classification based on digital quantum simulation. It establishes the correlation between quantum many-body non-equilibrium phase transitions and quantum machine learning performance, providing new design principles for quantum reservoir computing and broader quantum machine learning algorithms in the NISQ era.

[AI-69] Enhanced Predictive Modeling for Hazardous Near-Earth Object Detection: A Comparative Analysis of Advanced Resampling Strategies and Machine Learning Algorithms in Planetary Risk Assessment

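[Quick read]: This paper addresses accurate classification of hazardous Near-Earth Objects (NEOs), using machine learning models to identify potentially threatening objects. The key to the solution is a binary classification framework combined with data preprocessing (scaling and power transformation) and cross-validation, under which six mainstream classifiers are compared. The Random Forest Classifier (RFC) and Gradient Boosting Classifier (GBC) perform best, with F2-scores of 0.987 and 0.986, very few false positives and false negatives, and accuracies of 99.7% and 99.6%, respectively. The results highlight the strength of ensemble methods when both high precision and high recall are required, and the importance of tailoring model selection to dataset characteristics and evaluation metrics.

Link: https://arxiv.org/abs/2508.15106
Authors: Sunkalp Chandra
Affiliation: Unknown
Categories: Earth and Planetary Astrophysics (astro-ph.EP); Instrumentation and Methods for Astrophysics (astro-ph.IM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: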

Abstract:This study evaluates the performance of several machine learning models for predicting hazardous near-Earth objects (NEOs) through a binary classification framework, including data scaling, power transformation, and cross-validation. Six classifiers were compared, namely Random Forest Classifier (RFC), Gradient Boosting Classifier (GBC), Support Vector Classifier (SVC), Linear Discriminant Analysis (LDA), Logistic Regression (LR), and K-Nearest Neighbors (KNN). RFC and GBC performed the best, with F2-scores of 0.987 and 0.986, respectively, and very small variability. SVC followed, with a lower but reasonable score of 0.896. LDA and LR had a moderate performance with scores of around 0.749 and 0.748, respectively, while KNN had a poor performance with a score of 0.691 due to difficulty in handling complex data patterns. RFC and GBC also presented great confusion matrices with a negligible number of false positives and false negatives, which resulted in outstanding accuracy rates of 99.7% and 99.6%, respectively. These findings highlight the power of ensemble methods for high precision and recall and further point out the importance of tailored model selection with regard to dataset characteristics and chosen evaluation metrics. Future research could focus on the optimization of hyperparameters with advanced feature engineering to further improve the accuracy and robustness of the model on NEO hazard predictions.
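
The F2-score reported above is the F-beta score with beta = 2, which weights recall more heavily than precision, a sensible choice when missing a hazardous object costs more than a false alarm. A minimal sketch from confusion-matrix counts follows; the function name `f_beta` is hypothetical.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives, and false negatives.

    Equivalent to (1 + beta^2) * P * R / (beta^2 * P + R) with precision P and recall R.
    """
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    return (1.0 + b2) * tp / denom if denom else 0.0
```

With 8 true positives, four missed objects (`f_beta(8, 0, 4)`) score lower than four false alarms (`f_beta(8, 4, 0)`), which is the asymmetry beta = 2 is meant to express.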

[AI-70] Equi-mRNA: Protein Translation Equivariant Encoding for mRNA Language Models

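【Quick Read】: This paper addresses the insufficient modeling of latent synonymous codon usage patterns in mRNA therapeutics and synthetic biology; such patterns leave the amino acid sequence unchanged but significantly affect translation efficiency and gene expression. Existing methods introduce codon-level inductive biases through auxiliary objectives, yet struggle to explicitly capture the structured relationships that arise from the inherent symmetries of the genetic code. The key to the solution is Equi-mRNA, the first codon-level equivariant mRNA language model. Its core innovation is to encode synonymous codon symmetries as cyclic subgroups of the 2D special orthogonal group SO(2), combined with group-theoretic priors, an equivariance loss, and symmetry-aware pooling, so that the model learns biologically meaningful representations. Experiments show up to approximately 10% higher accuracy on downstream tasks such as expression prediction, stability assessment, and riboswitch switching, and up to approximately 4x better sequence generation quality under the Frechet BioDistance metric, with better preservation of functional properties. Interpretability analyses further show that the learned codon-rotation distributions are consistent with GC-content biases and tRNA abundance patterns, offering a new biologically principled paradigm for designing next-generation mRNA therapeutics.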

Link: https://arxiv.org/abs/2508.15103
Authors: Mehdi Yazdani-Jahromi,Ali Khodabandeh Yalabadi,Ozlem Ozmen Garibay
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)

Abstract:The growing importance of mRNA therapeutics and synthetic biology highlights the need for models that capture the latent structure of synonymous codon (different triplets encoding the same amino acid) usage, which subtly modulates translation efficiency and gene expression. While recent efforts incorporate codon-level inductive biases through auxiliary objectives, they often fall short of explicitly modeling the structured relationships that arise from the genetic code’s inherent symmetries. We introduce Equi-mRNA, the first codon-level equivariant mRNA language model that explicitly encodes synonymous codon symmetries as cyclic subgroups of 2D Special Orthogonal matrix (SO(2)). By combining group-theoretic priors with an auxiliary equivariance loss and symmetry-aware pooling, Equi-mRNA learns biologically grounded representations that outperform vanilla baselines across multiple axes. On downstream property-prediction tasks including expression, stability, and riboswitch switching Equi-mRNA delivers up to approximately 10% improvements in accuracy. In sequence generation, it produces mRNA constructs that are up to approximately 4x more realistic under Frechet BioDistance metrics and approximately 28% better preserve functional properties compared to vanilla baseline. Interpretability analyses further reveal that learned codon-rotation distributions recapitulate known GC-content biases and tRNA abundance patterns, offering novel insights into codon usage. Equi-mRNA establishes a new biologically principled paradigm for mRNA modeling, with significant implications for the design of next-generation therapeutics.
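
To make the idea of "cyclic subgroups of SO(2)" concrete, the sketch below assigns each of an amino acid's n synonymous codons a rotation by a multiple of 2*pi/n. The codon ordering and the dictionary `codon_to_rotation` are hypothetical choices for illustration; the abstract does not specify how Equi-mRNA indexes codons or parameterizes the rotations.

```python
import math


def rotation(k, n):
    """2x2 rotation by 2*pi*k/n, an element of the cyclic subgroup C_n of SO(2)."""
    theta = 2.0 * math.pi * k / n
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]


# Leucine has six synonymous codons; the ordering here is an arbitrary assumption.
LEU_CODONS = ["UUA", "UUG", "CUU", "CUC", "CUA", "CUG"]
codon_to_rotation = {
    codon: rotation(k, len(LEU_CODONS)) for k, codon in enumerate(LEU_CODONS)
}

# Group structure: composing the rotations for indices 2 and 5 gives
# the rotation for index (2 + 5) mod 6 = 1.
composed = matmul(codon_to_rotation["CUU"], codon_to_rotation["CUG"])
```

The point of the construction is that swapping one synonymous codon for another corresponds to a group action on the representation, which is what an equivariance loss can then constrain.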

[AI-71] Can synthetic data reproduce real-world findings in epidemiology? A replication study using tree-based generative AI

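【Quick Read】: This paper addresses the limited quality, high computational cost, and complexity for non-experts of current generative AI methods for synthetic data in epidemiology, and notes that common evaluation strategies for synthetic data often fail to directly reflect statistical utility. The core of the solution is to use adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. The method is validated by replicating the statistical analyses of six published epidemiological studies (covering blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes) and comparing the original results with the synthetic ones. Results on synthetic data consistently aligned with the original findings, and reducing dimensionality and pre-deriving variables further improved the quality and stability of the results, supporting synthetic data as a trustworthy alternative for epidemiological research.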

Link: https://arxiv.org/abs/2508.14936
Authors: Jan Kapar,Kathrin Günther,Lori Ann Vallis,Klaus Berger,Nadine Binder,Hermann Brenner,Stefanie Castell,Beate Fischer,Volker Harth,Bernd Holleczek,Timm Intemann,Till Ittermann,André Karch,Thomas Keil,Lilian Krist,Berit Lange,Michael F. Leitzmann,Katharina Nimptsch,Nadia Obi,Iris Pigeot,Tobias Pischon,Tamara Schikowski,Börge Schmidt,Carsten Oliver Schmidt,Anja M. Sedlmair,Justine Tanoey,Harm Wienbergen,Andreas Wienke,Claudia Wigmann,Marvin N. Wright
Affiliation: Unknown
Categories: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Machine Learning (stat.ML)

Abstract:Generative artificial intelligence for synthetic data generation holds substantial potential to address practical challenges in epidemiology. However, many current methods suffer from limited quality, high computational demands, and complexity for non-experts. Furthermore, common evaluation strategies for synthetic data often fail to directly reflect statistical utility. Against this background, a critical underexplored question is whether synthetic data can reliably reproduce key findings from epidemiological research. We propose the use of adversarial random forests (ARF) as an efficient and convenient method for synthesizing tabular epidemiological data. To evaluate its performance, we replicated statistical analyses from six epidemiological publications and compared original with synthetic results. These publications cover blood pressure, anthropometry, myocardial infarction, accelerometry, loneliness, and diabetes, based on data from the German National Cohort (NAKO Gesundheitsstudie), the Bremen STEMI Registry U45 Study, and the Guelph Family Health Study. Additionally, we assessed the impact of dimensionality and variable complexity on synthesis quality by limiting datasets to variables relevant for individual analyses, including necessary derivations. Across all replicated original studies, results from multiple synthetic data replications consistently aligned with original findings. Even for datasets with relatively low sample size-to-dimensionality ratios, the replication outcomes closely matched the original results across various descriptive and inferential analyses. Reducing dimensionality and pre-deriving variables further enhanced both quality and stability of the results.

[AI-72] TOM: An Open-Source Tongue Segmentation Method with Multi-Teacher Distillation and Task-Specific Data Augmentation

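【Quick Read】: This paper addresses limitations in current tongue image segmentation research, in particular the lack of robust, user-friendly segmentation tools, given that segmentation quality affects the accuracy of downstream diagnosis in intelligent tongue diagnosis systems. The key to the solution is TOM, a tongue image segmentation model based on multi-teacher knowledge distillation, together with a novel diffusion-based data augmentation method. The student model has 96.6% fewer parameters than the teacher models while still reaching 95.22% mIoU. The authors also package the trained model as online and offline tools that require no programming experience, and a case study shows that Traditional Chinese Medicine (TCM) constitution classification on segmented tongue patches yields higher classification performance and better interpretability than classification on the original tongue images. According to the authors, this is the first open-source and freely available tongue image segmentation tool.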

Link: https://arxiv.org/abs/2508.14932
Authors: Jiacheng Xie,Ziyang Zhang,Biplab Poudel,Congyu Guo,Yang Yu,Guanghui An,Xiaoting Tang,Lening Zhao,Chunhui Xu,Dong Xu
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
Comments: Tongue segmentation, data augmentation, synthetic data for AI training, prompt engineering, Segment Anything Model, knowledge distillation, tongue classification

Abstract:Tongue imaging serves as a valuable diagnostic tool, particularly in Traditional Chinese Medicine (TCM). The quality of tongue surface segmentation significantly affects the accuracy of tongue image classification and subsequent diagnosis in intelligent tongue diagnosis systems. However, existing research on tongue image segmentation faces notable limitations, and there is a lack of robust and user-friendly segmentation tools. This paper proposes a tongue image segmentation model (TOM) based on multi-teacher knowledge distillation. By incorporating a novel diffusion-based data augmentation method, we enhanced the generalization ability of the segmentation model while reducing its parameter size. Notably, after reducing the parameter count by 96.6% compared to the teacher models, the student model still achieves an impressive segmentation performance of 95.22% mIoU. Furthermore, we packaged and deployed the trained model as both an online and offline segmentation tool (available at this https URL), allowing TCM practitioners and researchers to use it without any programming experience. We also present a case study on TCM constitution classification using segmented tongue patches. Experimental results demonstrate that training with tongue patches yields higher classification performance and better interpretability than original tongue images. To our knowledge, this is the first open-source and freely available tongue image segmentation tool.
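
The mIoU figure reported in this abstract is the mean, over classes, of intersection-over-union between predicted and ground-truth masks. The sketch below computes it on flattened label lists; the toy masks are made up for illustration, and the paper's exact evaluation protocol (for example, whether background counts as a class) is not stated in the abstract.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy flattened masks (0 = background, 1 = tongue); values are illustrative.
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
miou = mean_iou(pred, target, num_classes=2)
```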

[AI-73] A U-Statistic-based random forest approach for genetic interaction study

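【Quick Read】: This paper addresses the difficulty of detecting gene-gene and gene-environment interactions in studies of complex traits. When many genetic variants and environmental risk factors are involved, conventional methods are constrained by the exponential growth of the feature space and by computational cost. The key to the solution is a U-statistic-based random forest approach, referred to as Forest U-Test, which tests split nodes with U-statistics and thereby improves the detection of multi-factor interaction effects in association analyses of quantitative traits. It outperformed existing methods in simulations and detected a significant joint association in real data.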

Link: https://arxiv.org/abs/2508.14924
Authors: Ming Li,Ruo-Sin Peng,Changshuai Wei,Qing Lu
Affiliation: Unknown
Categories: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Methodology (stat.ME)

Abstract:Variations in complex traits are influenced by multiple genetic variants, environmental risk factors, and their interactions. Though substantial progress has been made in identifying single genetic variants associated with complex traits, detecting the gene-gene and gene-environment interactions remains a great challenge. When a large number of genetic variants and environmental risk factors are involved, searching for interactions is limited to pair-wise interactions due to the exponentially increased feature space and computational intensity. Alternatively, recursive partitioning approaches, such as random forests, have gained popularity in high-dimensional genetic association studies. In this article, we propose a U-Statistic-based random forest approach, referred to as Forest U-Test, for genetic association studies with quantitative traits. Through simulation studies, we showed that the Forest U-Test outperformed existing methods. The proposed method was also applied to study Cannabis Dependence (CD), using three independent datasets from the Study of Addiction: Genetics and Environment. A significant joint association was detected with an empirical p-value less than 0.001. The finding was also replicated in two independent datasets with p-values of 5.93e-19 and 4.70e-17, respectively.

[AI-74] SVM/SVR Kernels as Quantum Propagators

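【Quick Read】: This paper addresses the largely unexplored connection between kernel design in Support Vector Machines (SVMs) and time-dependent Green's functions in quantum physics. It shows that many common SVM kernels are mathematically equivalent to Green's functions derived from operator inversion theory; the sigmoid kernel in particular does not always satisfy Mercer's theorem, so its corresponding Green's function may perform poorly. The key to the solution is a Kernel Polynomial Method (KPM) for constructing positive-semidefinite kernels aligned with Green's functions, which significantly improves the predictive accuracy of SVM models in physical systems.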

Link: https://arxiv.org/abs/2502.11153
Authors: Nan-Hong Kuo,Renata Wong
Affiliation: Unknown
Categories: Quantum Physics (quant-ph); Statistical Mechanics (cond-mat.stat-mech); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Mathematical Physics (math-ph)

Abstract:We establish a mathematical equivalence between Support Vector Machine (SVM) kernel functions and quantum propagators represented by time-dependent Green’s functions, which has remained largely unexplored. We demonstrate that many common SVM kernels correspond naturally to Green’s functions via operator inversion theory. The sigmoid kernel does not always satisfy Mercer’s theorem, and therefore the corresponding Green’s function may also fail to perform optimally. We further introduce a Kernel Polynomial Method (KPM) for designing customized kernels that align with Green’s functions. Our numerical experiments confirm that employing positive-semidefinite kernels that correspond to Green’s functions significantly improves predictive accuracy of SVM models in physical systems.
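
The remark that the sigmoid kernel does not always satisfy Mercer's theorem can be checked on a tiny example: a Mercer kernel must yield a positive-semidefinite Gram matrix for every set of points. The sketch below tests a 2x2 Gram matrix for two scalar inputs. The kernel parameters (`a`, `c`, `gamma`) and the sample points are assumptions chosen for illustration, and the check is independent of the paper's KPM construction.

```python
import math


def sigmoid_kernel(x, y, a=1.0, c=-1.0):
    """Sigmoid (tanh) kernel on scalars: tanh(a * x * y + c)."""
    return math.tanh(a * x * y + c)


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel on scalars, known to be positive semidefinite."""
    return math.exp(-gamma * (x - y) ** 2)


def is_psd_2x2(k11, k12, k22, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    return k11 >= -tol and k22 >= -tol and k11 * k22 - k12 * k12 >= -tol


def gram_is_psd(kernel, x1, x2):
    """Check positive semidefiniteness of the Gram matrix on two points."""
    return is_psd_2x2(kernel(x1, x1), kernel(x1, x2), kernel(x2, x2))


sigmoid_ok = gram_is_psd(sigmoid_kernel, 0.0, 1.0)  # k(0, 0) = tanh(-1) < 0
rbf_ok = gram_is_psd(rbf_kernel, 0.0, 1.0)
```

A single failing Gram matrix is enough to show that a kernel violates Mercer's condition for those parameters; passing on one pair of points proves nothing on its own.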

Machine Learning

[LG-0] Distributed Detection of Adversarial Attacks in Multi-Agent Reinforcement Learning with Continuous Action Space ECAI2025

Link: https://arxiv.org/abs/2508.15764
Authors: Kiarash Kazari,Ezzeldin Shereen,György Dán
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)
Comments: Accepted for publication at ECAI 2025

Abstract:We address the problem of detecting adversarial attacks against cooperative multi-agent reinforcement learning with continuous action space. We propose a decentralized detector that relies solely on the local observations of the agents and makes use of a statistical characterization of the normal behavior of observable agents. The proposed detector utilizes deep neural networks to approximate the normal behavior of agents as parametric multivariate Gaussian distributions. Based on the predicted density functions, we define a normality score and provide a characterization of its mean and variance. This characterization allows us to employ a two-sided CUSUM procedure for detecting deviations of the normality score from its mean, serving as a detector of anomalous behavior in real-time. We evaluate our scheme on various multi-agent PettingZoo benchmarks against different state-of-the-art attack methods, and our results demonstrate the effectiveness of our method in detecting impactful adversarial attacks. Particularly, it outperforms the discrete counterpart by achieving AUC-ROC scores of over 0.95 against the most impactful attacks in all evaluated environments.
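
The two-sided CUSUM procedure mentioned in this abstract accumulates deviations of a score from its expected mean in both directions and raises an alarm when either running sum crosses a threshold. A minimal sketch follows; the `drift` and `threshold` values and the score sequences are illustrative assumptions, and the paper's normality score itself (derived from predicted Gaussian densities) is not reproduced here.

```python
def two_sided_cusum(scores, mean, drift, threshold):
    """Return the index of the first alarm, or None if no alarm is raised."""
    upper = 0.0  # accumulates evidence that scores are above the mean
    lower = 0.0  # accumulates evidence that scores are below the mean
    for i, s in enumerate(scores):
        upper = max(0.0, upper + (s - mean) - drift)
        lower = max(0.0, lower - (s - mean) - drift)
        if upper > threshold or lower > threshold:
            return i
    return None


# Hypothetical normality scores: nominal behaviour, then a sustained drop.
normal = [0.1, -0.2, 0.0, 0.15, -0.1]
attacked = normal + [-1.0, -1.2, -0.9, -1.1]

no_alarm = two_sided_cusum(normal, mean=0.0, drift=0.25, threshold=2.0)
alarm_at = two_sided_cusum(attacked, mean=0.0, drift=0.25, threshold=2.0)
```

The drift term absorbs ordinary fluctuations so that only a persistent shift accumulates, which is what makes the test suitable for real-time detection.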

[LG-1] Communication Efficient LLM Pre-training with SparseLoCo

Link: https://arxiv.org/abs/2508.15706
Authors: Amir Sarfi,Benjamin Thérien,Joel Lidin,Eugene Belilovsky
Categories: Machine Learning (cs.LG)
Comments: 15 pages, 9 tables, 2 figures

Abstract:Communication-efficient distributed training algorithms have received considerable interest recently due to their benefits for training Large Language Models (LLMs) in bandwidth-constrained settings, such as across data centers and over the internet. Despite reducing communication frequency, these methods still typically require communicating a full copy of the model’s gradients, resulting in a communication bottleneck even for cross-datacenter links. Furthermore, they can slightly degrade performance compared to a naive AdamW DDP baseline. While quantization and error feedback are often applied to reduce the pseudo-gradient’s size, in the context of LLM pre-training, existing approaches have been unable to additionally leverage sparsification and have obtained limited quantization. In this work, we introduce SparseLoCo, a communication-efficient training algorithm for LLMs that effectively leverages Top-k sparsification and quantization to reach extreme compression ratios of up to 1-3% sparsity and 2-bit quantization while outperforming full-precision DiLoCo. Our key observations are that outer momentum can be locally approximated by an error feedback combined with aggressive sparsity and that sparse aggregation can actually improve model performance. We empirically demonstrate in a range of communication-constrained LLM training settings that SparseLoCo provides significant benefits in both performance and communication cost.
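
Top-k sparsification with error feedback, the combination this abstract builds on, can be sketched in a few lines: each worker sends only the k largest-magnitude entries of its update and carries the unsent remainder into the next round. This is the generic technique only; it omits quantization, the outer optimizer, and everything else specific to SparseLoCo, and the function name and toy vectors are hypothetical.

```python
def top_k_with_error_feedback(update, residual, k):
    """Sparsify (update + residual) to its k largest-magnitude entries.

    Returns the sparse vector to communicate and the new residual that
    stores everything that was not sent.
    """
    corrected = [u + r for u, r in zip(update, residual)]
    order = sorted(range(len(corrected)), key=lambda i: abs(corrected[i]),
                   reverse=True)
    keep = set(order[:k])
    sparse = [v if i in keep else 0.0 for i, v in enumerate(corrected)]
    new_residual = [0.0 if i in keep else v for i, v in enumerate(corrected)]
    return sparse, new_residual


# Two rounds on a toy 4-dimensional update, sending one entry per round.
sparse1, res1 = top_k_with_error_feedback([0.5, -2.0, 0.25, 1.0], [0.0] * 4, k=1)
sparse2, res2 = top_k_with_error_feedback([0.75, 0.0, 0.0, 0.0], res1, k=1)
```

In the second round the entry at index 0 is sent because its accumulated value (0.5 carried over plus 0.75 new) is now the largest, which shows how error feedback keeps dropped information from being lost.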

[LG-2] Investigation of D-Wave quantum annealing for training Restricted Boltzmann Machines and mitigating catastrophic forgetting

Link: https://arxiv.org/abs/2508.15697
Authors: Abdelmoula El-Yazizi,Yaroslav Koshka
Categories: Machine Learning (cs.LG); Quantum Physics (quant-ph); Machine Learning (stat.ML)
Comments: 26 pages, 5 figures

Abstract:Modest statistical differences between the sampling performances of the D-Wave quantum annealer (QA) and the classical Markov Chain Monte Carlo (MCMC), when applied to Restricted Boltzmann Machines (RBMs), are explored to explain, and possibly address, the absence of significant and consistent improvements in RBM trainability when the D-Wave sampling was used in previous investigations. A novel hybrid sampling approach, combining the classical and the QA contributions, is investigated as a promising way to benefit from the modest differences between the two sampling methods. No improvements in the RBM training are achieved in this work, thereby suggesting that the differences between the QA-based and MCMC sampling, mainly found in the medium-to-low probability regions of the distribution, which are less important for the quality of the sample, are insufficient to benefit the training. Difficulties in achieving sufficiently high quality of embedding RBMs into the lattice of the newer generation of D-Wave hardware could be further complicating the task. On the other hand, the ability to generate samples of sufficient variety from lower-probability parts of the distribution has a potential to benefit other machine learning applications, such as the mitigation of catastrophic forgetting (CF) during incremental learning. The feasibility of using QA-generated patterns of desirable classes for CF mitigation by the generative replay is demonstrated in this work for the first time. While the efficiency of the CF mitigation using the D-Wave QA was comparable to that of the classical mitigation, both the speed of generating a large number of distinct desirable patterns and the potential for further improvement make this approach promising for a variety of challenging machine learning applications.

[LG-3] Conditionally adaptive augmented Lagrangian method for physics-informed learning of forward and inverse problems using artificial neural networks

Link: https://arxiv.org/abs/2508.15695
Authors: Qifeng Hu,Shamsulhaq Basir,Inanc Senocak
Categories: Machine Learning (cs.LG)
Comments: 37 pages, 23 figures

Abstract:We present several advances to the physics and equality constrained artificial neural networks (PECANN) framework that substantially improve its capability to learn solutions of canonical partial differential equations (PDEs). First, we generalize the augmented Lagrangian method (ALM) to support multiple independent penalty parameters, enabling simultaneous enforcement of heterogeneous constraints. Second, we reformulate pointwise constraint enforcement and Lagrange multipliers as expectations over constraint terms, reducing memory overhead and permitting efficient mini-batch training. Third, to address PDEs with oscillatory, multi-scale features, we incorporate Fourier feature mappings and show that a single mapping suffices where multiple mappings or more costly architectures were required in related methods. Fourth, we introduce a time-windowing strategy for long-time evolution in which the terminal state of each window is enforced as an initial-condition constraint for the next, ensuring continuity without discrete time models. Crucially, we propose a conditionally adaptive penalty update (CAPU) strategy for ALM, which preserves the principle that larger constraint violations incur stronger penalties. CAPU accelerates the growth of Lagrange multipliers for selectively challenging constraints, enhancing constraint enforcement during training. We demonstrate the effectiveness of PECANN-CAPU on problems including the transonic rarefaction problem, reversible advection of a passive scalar by a vortex, high-wavenumber Helmholtz and Poisson equations, and inverse identification of spatially varying heat sources. Comparisons with established methods and recent Kolmogorov-Arnold network approaches show that PECANN-CAPU achieves competitive accuracy across all cases. Collectively, these advances improve PECANN’s robustness, efficiency, and applicability to demanding problems in scientific computing.
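
For readers unfamiliar with the augmented Lagrangian method (ALM) that PECANN builds on, the sketch below runs the classical ALM loop on a one-dimensional toy problem: minimize x^2 subject to x - 1 = 0. It uses a single fixed penalty parameter `mu` and a closed-form inner solve, so it illustrates only the textbook multiplier update, not the paper's CAPU strategy or its neural-network setting; the function name and default values are assumptions.

```python
def augmented_lagrangian_toy(mu=10.0, iterations=20):
    """Classical ALM for: minimize x^2 subject to c(x) = x - 1 = 0."""
    lam = 0.0  # Lagrange multiplier estimate
    x = 0.0
    for _ in range(iterations):
        # Inner problem: minimize x^2 + lam*(x - 1) + (mu/2)*(x - 1)^2.
        # Setting the derivative 2x + lam + mu*(x - 1) to zero gives:
        x = (mu - lam) / (2.0 + mu)
        # Multiplier update: lam <- lam + mu * c(x).
        lam += mu * (x - 1.0)
    return x, lam


x_star, lam_star = augmented_lagrangian_toy()
```

The iterates approach the constrained optimum x = 1 with multiplier -2; the multiplier, rather than an ever-growing penalty, is what drives the constraint violation to zero.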

[LG-4] An Efficient Open World Environment for Multi-Agent Social Learning

Link: https://arxiv.org/abs/2508.15679
Authors: Eric Ye,Ren Tao,Natasha Jaques
Categories: Machine Learning (cs.LG)

Abstract:Many challenges remain before AI agents can be deployed in real-world environments. However, one virtue of such environments is that they are inherently multi-agent and contain human experts. Using advanced social intelligence in such an environment can help an AI agent learn adaptive skills and behaviors that a known expert exhibits. While social intelligence could accelerate training, it is currently difficult to study due to the lack of open-ended multi-agent environments. In this work, we present an environment in which multiple self-interested agents can pursue complex and independent goals, reflective of real world challenges. This environment will enable research into the development of socially intelligent AI agents in open-ended multi-agent settings, where agents may be implicitly incentivized to cooperate to defeat common enemies, build and share tools, and achieve long horizon goals. In this work, we investigate the impact on agent performance due to social learning in the presence of experts and implicit cooperation such as emergent collaborative tool use, and whether agents can benefit from either cooperation or competition in this environment.

[LG-5] Tensorized Multi-Task Learning for Personalized Modeling of Heterogeneous Individuals with High-Dimensional Data

Link: https://arxiv.org/abs/2508.15676
Authors: Elif Konyar,Mostafa Reisi Gahrooei,Kamran Paynabar
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:Effective modeling of heterogeneous subpopulations presents a significant challenge due to variations in individual characteristics and behaviors. This paper proposes a novel approach to address this issue through multi-task learning (MTL) and low-rank tensor decomposition techniques. Our MTL approach aims to enhance personalized modeling by leveraging shared structures among similar tasks while accounting for distinct subpopulation-specific variations. We introduce a framework where low-rank decomposition decomposes the collection of task model parameters into a low-rank structure that captures commonalities and variations across tasks and subpopulations. This approach allows for efficient learning of personalized models by sharing knowledge between similar tasks while preserving the unique characteristics of each subpopulation. Experimental results in simulation and case study datasets demonstrate the superior performance of the proposed method compared to several benchmarks, particularly in scenarios with high variability among subpopulations. The proposed framework not only improves prediction accuracy but also enhances interpretability by revealing underlying patterns that contribute to the personalization of models.

[LG-6] Exploiting Policy Idling for Dexterous Manipulation IROS2025

Link: https://arxiv.org/abs/2508.15669
Authors: Annie S. Chen,Philemon Brakel,Antonia Bronars,Annie Xie,Sandy Huang,Oliver Groth,Maria Bauza,Markus Wulfmeier,Nicolas Heess,Dushyant Rao
Categories: Robotics (cs.RO); Machine Learning (cs.LG)
Comments: A similar version to this paper was accepted at IROS 2025

Abstract:Learning-based methods for dexterous manipulation have made notable progress in recent years. However, learned policies often still lack reliability and exhibit limited robustness to important factors of variation. One failure pattern that can be observed across many settings is that policies idle, i.e. they cease to move beyond a small region of states when they reach certain states. This policy idling is often a reflection of the training data. For instance, it can occur when the data contains small actions in areas where the robot needs to perform high-precision motions, e.g., when preparing to grasp an object or object insertion. Prior works have tried to mitigate this phenomenon e.g. by filtering the training data or modifying the control frequency. However, these approaches can negatively impact policy performance in other ways. As an alternative, we investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states, thus helping it to escape problematic basins of attraction. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance, with no additional supervision or training. Furthermore, since the robot tends to idle at critical points in a movement, we also find that learning from the resulting episodes leads to better iterative policy improvement compared to prior approaches. Our perturbation strategy also leads to a 15-35% improvement in absolute success rate on a real-world insertion task that requires complex multi-finger manipulation.

[LG-7] Amortized In-Context Mixed Effect Transformer Models: A Zero-Shot Approach for Pharmacokinetics

Link: https://arxiv.org/abs/2508.15659
Authors: César Ali Ojeda Marin,Wilhelm Huisinga,Purity Kavwele,Niklas Hartung
Categories: Machine Learning (cs.LG)

Abstract:Accurate dose-response forecasting under sparse sampling is central to precision pharmacotherapy. We present the Amortized In-Context Mixed-Effect Transformer (AICMET) model, a transformer-based latent-variable framework that unifies mechanistic compartmental priors with amortized in-context Bayesian inference. AICMET is pre-trained on hundreds of thousands of synthetic pharmacokinetic trajectories with Ornstein-Uhlenbeck priors over the parameters of compartment models, endowing the model with strong inductive biases and enabling zero-shot adaptation to new compounds. At inference time, the decoder conditions on the collective context of previously profiled trial participants, generating calibrated posterior predictions for newly enrolled patients after a few early drug concentration measurements. This capability collapses traditional model-development cycles from weeks to hours while preserving some degree of expert modelling. Experiments across public datasets show that AICMET attains state-of-the-art predictive accuracy and faithfully quantifies inter-patient variability – outperforming both nonlinear mixed-effects baselines and recent neural ODE variants. Our results highlight the feasibility of transformer-based, population-aware neural architectures as offering a new alternative for bespoke pharmacokinetic modeling pipelines, charting a path toward truly population-aware personalized dosing regimens.
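
As background for the "compartment models" this abstract refers to, the simplest case is a one-compartment model with an intravenous bolus dose, where concentration decays exponentially at rate clearance/volume. The sketch below implements that textbook formula; the parameter values are hypothetical, and the abstract does not say which compartment structures AICMET's synthetic trajectories actually use.

```python
import math


def concentration(t_hours, dose_mg, volume_l, clearance_l_per_h):
    """Concentration (mg/L) at time t after an IV bolus, one-compartment model."""
    k_el = clearance_l_per_h / volume_l  # elimination rate constant (1/h)
    return (dose_mg / volume_l) * math.exp(-k_el * t_hours)


def half_life(volume_l, clearance_l_per_h):
    """Elimination half-life in hours: ln(2) * V / CL."""
    return math.log(2.0) * volume_l / clearance_l_per_h


# Hypothetical patient parameters, for illustration only.
c0 = concentration(0.0, dose_mg=100.0, volume_l=20.0, clearance_l_per_h=5.0)
t_half = half_life(volume_l=20.0, clearance_l_per_h=5.0)
c_half = concentration(t_half, dose_mg=100.0, volume_l=20.0, clearance_l_per_h=5.0)
```

In a mixed-effects setting, parameters such as volume and clearance vary between patients, which is the inter-patient variability the abstract says the model quantifies.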

[LG-8] Correct-By-Construction: Certified Individual Fairness through Neural Network Training

Link: https://arxiv.org/abs/2508.15642
Authors: Ruihan Zhang,Jun Sun
Categories: Machine Learning (cs.LG)

Abstract:Fairness in machine learning is more important than ever as ethical concerns continue to grow. Individual fairness demands that individuals differing only in sensitive attributes receive the same outcomes. However, commonly used machine learning algorithms often fail to achieve such fairness. To improve individual fairness, various training methods have been developed, such as incorporating fairness constraints as optimisation objectives. While these methods have demonstrated empirical effectiveness, they lack formal guarantees of fairness. Existing approaches that aim to provide fairness guarantees primarily rely on verification techniques, which can sometimes fail to produce definitive results. Moreover, verification alone does not actively enhance individual fairness during training. To address this limitation, we propose a novel framework that formally guarantees individual fairness throughout training. Our approach consists of two parts, i.e., (1) provably fair initialisation that ensures the model starts in a fair state, and (2) a fairness-preserving training algorithm that maintains fairness as the model learns. A key element of our method is the use of randomised response mechanisms, which protect sensitive attributes while maintaining fairness guarantees. We formally prove that this mechanism sustains individual fairness throughout the training process. Experimental evaluations confirm that our approach is effective, i.e., producing models that are empirically fair and accurate. Furthermore, our approach is much more efficient than the alternative approach based on certified training (which requires neural network verification during training).
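
The randomised response mechanism named in this abstract is, in its classic binary form, a coin-flip rule: report the true sensitive bit with probability `p_truth`, otherwise report its flip. The sketch below shows that classic form together with the standard debiasing formula for aggregate rates. How the paper integrates the mechanism into training is not described in the abstract, so the function names and parameters here are assumptions.

```python
import random


def randomized_response(bit, p_truth, rng):
    """Report the true bit with probability p_truth, otherwise its flip."""
    return bit if rng.random() < p_truth else 1 - bit


def estimate_true_rate(reported_rate, p_truth):
    """Invert reported = p*true + (1 - p)*(1 - true); requires p_truth != 0.5."""
    return (reported_rate - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


rng = random.Random(0)
always_truthful = randomized_response(1, 1.0, rng)
always_flipped = randomized_response(1, 0.0, rng)

# If 30% of individuals truly have bit 1 and p_truth = 0.75, the expected
# reported rate is 0.75*0.3 + 0.25*0.7 = 0.4, which debiases back to 0.3.
recovered = estimate_true_rate(0.4, p_truth=0.75)
```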

[LG-9] Continual Neural Topic Model

Link: https://arxiv.org/abs/2508.15612
Authors: Charu Karakkaparambil James,Waleed Mustafa,Marius Kloft,Sophie Fellenz
Categories: Machine Learning (cs.LG)

Abstract:In continual learning, our aim is to learn a new task without forgetting what was learned previously. In topic models, this translates to learning new topic models without forgetting previously learned topics. Previous work either considered Dynamic Topic Models (DTMs), which learn the evolution of topics based on the entire training corpus at once, or Online Topic Models, which are updated continuously based on new data but do not have long-term memory. To fill this gap, we propose the Continual Neural Topic Model (CoNTM), which continuously learns topic models at subsequent time steps without forgetting what was previously learned. This is achieved using a global prior distribution that is continuously updated. In our experiments, CoNTM consistently outperformed the dynamic topic model in terms of topic quality and predictive perplexity while being able to capture topic changes online. The analysis reveals that CoNTM can learn more diverse topics and better capture temporal changes than existing methods.

[LG-10] Inductive Domain Transfer In Misspecified Simulation-Based Inference

Link: https://arxiv.org/abs/2508.15593
Authors: Ortal Senouf,Antoine Wehenkel,Cédric Vincent-Cuaz,Emmanuel Abbé,Pascal Frossard
Categories: Machine Learning (cs.LG)

Abstract:Simulation-based inference (SBI) is a statistical inference approach for estimating latent parameters of a physical system when the likelihood is intractable but simulations are available. In practice, SBI is often hindered by model misspecification, the mismatch between simulated and real-world observations caused by inherent modeling simplifications. RoPE, a recent SBI approach, addresses this challenge through a two-stage domain transfer process that combines semi-supervised calibration with optimal transport (OT)-based distribution alignment. However, RoPE operates in a fully transductive setting, requiring access to a batch of test samples at inference time, which limits scalability and generalization. We propose here a fully inductive and amortized SBI framework that integrates calibration and distributional alignment into a single, end-to-end trainable model. Our method leverages mini-batch OT with a closed-form coupling to align real and simulated observations that correspond to the same latent parameters, using both paired calibration data and unpaired samples. A conditional normalizing flow is then trained to approximate the OT-induced posterior, enabling efficient inference without simulation access at test time. Across a range of synthetic and real-world benchmarks, including complex medical biomarker estimation, our approach matches or surpasses the performance of RoPE, as well as other standard SBI and non-SBI estimators, while offering improved scalability and applicability in challenging, misspecified environments.

[LG-11] Conformalized Exceptional Model Mining: Telling Where Your Model Performs (Not) Well ECML-PKDD

Link: https://arxiv.org/abs/2508.15569
Authors: Xin Du,Sikun Yang,Wouter Duivesteijn,Mykola Pechenizkiy
Categories: Machine Learning (cs.LG)
Comments: Accepted by ECML-PKDD

Abstract:Understanding the nuanced performance of machine learning models is essential for responsible deployment, especially in high-stakes domains like healthcare and finance. This paper introduces a novel framework, Conformalized Exceptional Model Mining, which combines the rigor of Conformal Prediction with the explanatory power of Exceptional Model Mining (EMM). The proposed framework identifies cohesive subgroups within data where model performance deviates exceptionally, highlighting regions of both high confidence and high uncertainty. We develop a new model class, mSMoPE (multiplex Soft Model Performance Evaluation), which quantifies uncertainty through conformal prediction’s rigorous coverage guarantees. By defining a new quality measure, Relative Average Uncertainty Loss (RAUL), our framework isolates subgroups with exceptional performance patterns in multi-class classification and regression tasks. Experimental results across diverse datasets demonstrate the framework’s effectiveness in uncovering interpretable subgroups that provide critical insights into model behavior. This work lays the groundwork for enhancing model interpretability and reliability, advancing the state-of-the-art in explainable AI and uncertainty quantification.
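
The coverage guarantees this abstract attributes to conformal prediction come, in the standard split-conformal setting, from a finite-sample-corrected quantile of calibration scores. The sketch below shows that standard recipe for regression with absolute-residual scores; it does not implement mSMoPE or RAUL, and the calibration scores are made-up values.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:  # too few calibration points for this alpha
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_interval(point_prediction, q_hat):
    """Symmetric interval whose half-width is the conformal quantile."""
    return point_prediction - q_hat, point_prediction + q_hat


# Hypothetical calibration scores |y - y_hat| from a held-out set.
cal_scores = [0.2, 0.5, 0.1, 0.9, 0.4, 0.3, 0.7, 0.6, 0.8, 1.0]
q_hat = conformal_quantile(cal_scores, alpha=0.2)
low, high = prediction_interval(2.0, q_hat)
```

The interval width is a direct per-prediction uncertainty measure, which is the kind of quantity a subgroup-discovery method can then compare across subgroups.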

[LG-12] HEAS: Hierarchical Evolutionary Agent Simulation Framework for Cross-Scale Modeling and Multi-Objective Search

Link: https://arxiv.org/abs/2508.15555
Authors: Ruiyu Zhang,Lin Nie,Xin Zhao
Subjects: Multiagent Systems (cs.MA); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE); Software Engineering (cs.SE)
*Comments: 9 pages, 1 figure

Abstract:Hierarchical Evolutionary Agent Simulation (HEAS) is a Python framework that unifies layered agent-based modeling with evolutionary optimization and tournament evaluation in a single, reproducible workflow. HEAS represents models as hierarchies of lightweight processes (“streams”) scheduled in deterministic layers that read and write a shared context, making cross-scale couplings explicit and auditable. A compact API and CLI (simulate, optimize, evaluate) expose single- and multi-objective evolution, PyTorch policy integration via parameter flattening/unflattening, and general tournament tooling with user-defined scoring and voting rules. The framework standardizes evaluation through uniform per-step and episode metrics, persists seeds, logbooks, and hall-of-fame archives, and provides plotting helpers for traces, Pareto fronts, and comparative outcomes, reducing glue code and improving comparability across studies. HEAS emphasizes separation of mechanism from orchestration, allowing exogenous drivers, endogenous agents, and aggregators to be composed and swapped without refactoring, while the same model can be used for forward simulation, optimization, or systematic comparison. We illustrate usage with two compact examples: an ecological system and an enterprise decision-making setting. HEAS offers a practical foundation for cross-disciplinary, multi-level inquiry, yielding reliable, reproducible results.

[LG-13] AI-Powered Machine Learning Approaches for Fault Diagnosis in Industrial Pumps

Link: https://arxiv.org/abs/2508.15550
Authors: Khaled M. A. Alghtus,Ayad Gannan,Khalid M. Alhajri,Ali L. A. Al Jubouri,Hassan A. I. Al-Janahi
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:This study presents a practical approach for early fault detection in industrial pump systems using real-world sensor data from a large-scale vertical centrifugal pump operating in a demanding marine environment. Five key operational parameters were monitored: vibration, temperature, flow rate, pressure, and electrical current. A dual-threshold labeling method was applied, combining fixed engineering limits with adaptive thresholds calculated as the 95th percentile of historical sensor values. To address the rarity of documented failures, synthetic fault signals were injected into the data using domain-specific rules, simulating critical alerts within plausible operating ranges. Three machine learning classifiers - Random Forest, Extreme Gradient Boosting (XGBoost), and Support Vector Machine (SVM) - were trained to distinguish between normal operation, early warnings, and critical alerts. Results showed that Random Forest and XGBoost models achieved high accuracy across all classes, including minority cases representing rare or emerging faults, while the SVM model exhibited lower sensitivity to anomalies. Visual analyses, including grouped confusion matrices and time-series plots, indicated that the proposed hybrid method provides robust detection capabilities. The framework is scalable, interpretable, and suitable for real-time industrial deployment, supporting proactive maintenance decisions before failures occur. Furthermore, it can be adapted to other machinery with similar sensor architectures, highlighting its potential as a scalable solution for predictive maintenance in complex systems.
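The dual-threshold labeling described above can be sketched roughly as follows. The abstract does not say how the two thresholds map to the three classes, so this example assumes that readings at or above the fixed engineering limit are critical alerts and readings at or above the adaptive 95th-percentile threshold are early warnings. The sensor values, the limit of 120, and the nearest-rank percentile definition are all hypothetical.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))  # 1-indexed rank
    return ordered[rank - 1]

def label_reading(value, fixed_limit, adaptive_limit):
    """Assumed mapping: fixed limit -> critical, adaptive limit -> warning."""
    if value >= fixed_limit:
        return "critical"
    if value >= adaptive_limit:
        return "warning"
    return "normal"

# Hypothetical vibration history: the readings 1, 2, ..., 100.
history = list(range(1, 101))
adaptive_limit = percentile(history, 95)  # adaptive threshold from history
FIXED_LIMIT = 120                         # hypothetical engineering limit

labels = [label_reading(v, FIXED_LIMIT, adaptive_limit) for v in (50, 97, 130)]
```

The resulting labels would then serve as training targets for the classifiers named in the abstract (Random Forest, XGBoost, SVM).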

[LG-14] BadFU: Backdoor Federated Learning through Adversarial Machine Unlearning

Link: https://arxiv.org/abs/2508.15541
Authors: Bingguang Lu,Hongsheng Hu,Yuantian Miao,Shaleeza Sohail,Chaoxiang He,Shuo Wang,Xiao Chen
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*Comments:

Abstract:Federated learning (FL) has been widely adopted as a decentralized training paradigm that enables multiple clients to collaboratively learn a shared model without exposing their local data. As concerns over data privacy and regulatory compliance grow, machine unlearning, which aims to remove the influence of specific data from trained models, has become increasingly important in the federated setting to meet legal, ethical, or user-driven demands. However, integrating unlearning into FL introduces new challenges and raises largely unexplored security risks. In particular, adversaries may exploit the unlearning process to compromise the integrity of the global model. In this paper, we present the first backdoor attack in the context of federated unlearning, demonstrating that an adversary can inject backdoors into the global model through seemingly legitimate unlearning requests. Specifically, we propose BadFU, an attack strategy where a malicious client uses both backdoor and camouflage samples to train the global model normally during the federated training process. Once the client requests unlearning of the camouflage samples, the global model transitions into a backdoored state. Extensive experiments under various FL frameworks and unlearning strategies validate the effectiveness of BadFU, revealing a critical vulnerability in current federated unlearning practices and underscoring the urgent need for more secure and robust federated unlearning mechanisms.

[LG-15] Stabilization of Perturbed Loss Function: Differential Privacy without Gradient Noise

Link: https://arxiv.org/abs/2508.15523
Authors: Salman Habib,Remi Chou,Taejoon Kim
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT)
*Comments: under review

Abstract:We propose SPOF (Stabilization of Perturbed Loss Function), a differentially private training mechanism intended for multi-user local differential privacy (LDP). SPOF perturbs a stabilized Taylor expanded polynomial approximation of a model’s training loss function, where each user’s data is privatized by calibrated noise added to the coefficients of the polynomial. Unlike gradient-based mechanisms such as differentially private stochastic gradient descent (DP-SGD), SPOF does not require injecting noise into the gradients of the loss function, which improves both computational efficiency and stability. This formulation naturally supports simultaneous privacy guarantees across all users. Moreover, SPOF exhibits robustness to environmental noise during training, maintaining stable performance even when user inputs are corrupted. We compare SPOF with a multi-user extension of DP-SGD, evaluating both methods in a wireless body area network (WBAN) scenario involving heterogeneous user data and stochastic channel noise from body sensors. Our results show that SPOF achieves, on average, up to 3.5% higher reconstruction accuracy and reduces mean training time by up to 57.2% compared to DP-SGD, demonstrating superior privacy-utility trade-offs in multi-user environments.

[LG-16] Jointly Computation- and Communication-Efficient Distributed Learning

Link: https://arxiv.org/abs/2508.15509
Authors: Xiaoxing Ren,Nicola Bastianello,Karl H. Johansson,Thomas Parisini
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)
*Comments: To be presented at 2025 IEEE Conference on Decision and Control

Abstract:We address distributed learning problems over undirected networks. Specifically, we focus on designing a novel ADMM-based algorithm that is jointly computation- and communication-efficient. Our design guarantees computational efficiency by allowing agents to use stochastic gradients during local training. Moreover, communication efficiency is achieved as follows: i) the agents perform multiple training epochs between communication rounds, and ii) compressed transmissions are used. We prove exact linear convergence of the algorithm in the strongly convex setting. We corroborate our theoretical results by numerical comparisons with state-of-the-art techniques on a classification task.

[LG-17] Let's Grow an Unbiased Community: Guiding the Fairness of Graphs via New Links

Link: https://arxiv.org/abs/2508.15499
Authors: Jiahua Lu,Huaxiao Liu,Shuotong Bai,Junjie Xu,Renqiang Luo,Enyan Dai
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Graph Neural Networks (GNNs) have achieved remarkable success across diverse applications. However, due to the biases in the graph structures, graph neural networks face significant challenges in fairness. Although the original user graph structure is generally biased, it is promising to guide these existing structures toward unbiased ones by introducing new links. The fairness guidance via new links could foster unbiased communities, thereby enhancing fairness in downstream applications. To address this issue, we propose a novel framework named FairGuide. Specifically, to ensure fairness in downstream tasks trained on fairness-guided graphs, we introduce a differentiable community detection task as a pseudo downstream task. Our theoretical analysis further demonstrates that optimizing fairness within this pseudo task effectively enhances structural fairness, promoting fairness generalization across diverse downstream applications. Moreover, FairGuide employs an effective strategy which leverages meta-gradients derived from the fairness-guidance objective to identify new links that significantly enhance structural fairness. Extensive experimental results demonstrate the effectiveness and generalizability of our proposed method across a variety of graph-based fairness tasks.

[LG-18] Learning Protein-Ligand Binding in Hyperbolic Space

Link: https://arxiv.org/abs/2508.15480
Authors: Jianhui Wang,Wenyu Zhu,Bowen Gao,Xin Hong,Ya-Qin Zhang,Wei-Ying Ma,Yanyan Lan
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Protein-ligand binding prediction is central to virtual screening and affinity ranking, two fundamental tasks in drug discovery. While recent retrieval-based methods embed ligands and protein pockets into Euclidean space for similarity-based search, the geometry of Euclidean embeddings often fails to capture the hierarchical structure and fine-grained affinity variations intrinsic to molecular interactions. In this work, we propose HypSeek, a hyperbolic representation learning framework that embeds ligands, protein pockets, and sequences into Lorentz-model hyperbolic space. By leveraging the exponential geometry and negative curvature of hyperbolic space, HypSeek enables expressive, affinity-sensitive embeddings that can effectively model both global activity and subtle functional differences, particularly in challenging cases such as activity cliffs, where structurally similar ligands exhibit large affinity gaps. Our model unifies virtual screening and affinity ranking in a single framework, introducing a protein-guided three-tower architecture to enhance representational structure. HypSeek improves early enrichment in virtual screening on DUD-E from 42.63 to 51.44 (+20.7%) and affinity ranking correlation on JACS from 0.5774 to 0.7239 (+25.4%), demonstrating the benefits of hyperbolic geometry across both tasks and highlighting its potential as a powerful inductive bias for protein-ligand modeling.
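For readers unfamiliar with the Lorentz model mentioned above, the following minimal sketch shows the standard geodesic distance on the unit hyperboloid (curvature -1). It illustrates only the textbook geometry; the function names are hypothetical, and nothing here reflects HypSeek's actual architecture, curvature choice, or training loss.

```python
import math

def lift(v):
    """Map a Euclidean vector onto the hyperboloid: x0 = sqrt(1 + |v|^2)."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L), clamped for numerical safety."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift([0.0, 0.0])
point = lift([math.sinh(1.0), 0.0])  # lies at hyperbolic distance 1 from the origin
d = lorentz_distance(origin, point)
```

Because distances grow exponentially with Euclidean norm in this geometry, small changes far from the origin can correspond to large distance gaps, which is the kind of property the abstract appeals to for activity cliffs.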

[LG-19] Mini-Batch Robustness Verification of Deep Neural Networks

Link: https://arxiv.org/abs/2508.15454
Authors: Saar Tzour-Shaday,Dana Drachsler Cohen
Subjects: Machine Learning (cs.LG); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
*Comments: 30 pages, 12 figures, conference OOPSLA 2025

Abstract:Neural network image classifiers are ubiquitous in many safety-critical applications. However, they are susceptible to adversarial attacks. To understand their robustness to attacks, many local robustness verifiers have been proposed to analyze $\epsilon$-balls of inputs. Yet, existing verifiers introduce a long analysis time or lose too much precision, making them less effective for a large set of inputs. In this work, we propose a new approach to local robustness: group local robustness verification. The key idea is to leverage the similarity of the network computations of certain $\epsilon$-balls to reduce the overall analysis time. We propose BaVerLy, a sound and complete verifier that boosts the local robustness verification of a set of $\epsilon$-balls by dynamically constructing and verifying mini-batches. BaVerLy adaptively identifies successful mini-batch sizes, accordingly constructs mini-batches of $\epsilon$-balls that have similar network computations, and verifies them jointly. If a mini-batch is verified, all $\epsilon$-balls are proven robust. Otherwise, one $\epsilon$-ball is suspected as not being robust, guiding the refinement. In the latter case, BaVerLy leverages the analysis results to expedite the analysis of that $\epsilon$-ball as well as the other $\epsilon$-balls in the batch. We evaluate BaVerLy on fully connected and convolutional networks for MNIST and CIFAR-10. Results show that BaVerLy scales the common one-by-one verification by 2.3x on average and up to 4.1x, in which case it reduces the total analysis time from 24 hours to 6 hours.

[LG-20] Measures of Overlapping Multivariate Gaussian Clusters in Unsupervised Online Learning

Link: https://arxiv.org/abs/2508.15444
Authors: Miha Ožbot,Igor Škrjanc
Subjects: Machine Learning (cs.LG)
*Comments: 5 pages, in Slovenian language. 2 figures. Accepted for the 33rd International Electrotechnical and Computer Science Conference ERK 2024 (Portoroz, Slovenia, 26-27 Sep 2024). Conference PDF: this https URL

Abstract:In this paper, we propose a new measure for detecting overlap in multivariate Gaussian clusters. The aim of online learning from data streams is to create clustering, classification, or regression models that can adapt over time based on the conceptual drift of streaming data. In the case of clustering, this can result in a large number of clusters that may overlap and should be merged. Commonly used distribution dissimilarity measures are not adequate for determining overlapping clusters in the context of online learning from streaming data due to their inability to account for all shapes of clusters and their high computational demands. Our proposed dissimilarity measure is specifically designed to detect overlap rather than dissimilarity and can be computed faster compared to existing measures. Our method is several times faster than compared methods and is capable of detecting overlapping clusters while avoiding the merging of orthogonal clusters.

[LG-21] Federated Learning based on Self-Evolving Gaussian Clustering

Link: https://arxiv.org/abs/2508.15393
Authors: Miha Ožbot,Igor Škrjanc
Subjects: Machine Learning (cs.LG)
*Comments: 5 pages, in Slovenian language, 3 figures. Published in the Proceedings of the 33rd International Electrotechnical and Computer Science Conference (ERK 2024), Portoroz, Slovenia, pp. 240-243. Indexed in COBISS ( this http URL -ID 212879107). Official version available at this https URL

Abstract:In this study, we present an Evolving Fuzzy System within the context of Federated Learning, which adapts dynamically with the addition of new clusters and therefore does not require the number of clusters to be selected a priori. Unlike traditional methods, Federated Learning allows models to be trained locally on clients’ devices, sharing only the model parameters with a central server instead of the data. Our method, implemented using PyTorch, was tested on clustering and classification tasks. The results show that our approach outperforms established classification methods on several well-known UCI datasets. While computationally intensive due to overlap condition calculations, the proposed method demonstrates significant advantages in decentralized data processing.

[LG-22] Fairness for the People by the People: Minority Collective Action

Link: https://arxiv.org/abs/2508.15374
Authors: Omri Ben-Dov,Samira Samadi,Amartya Sanyal,Alexandru Ţifrea
Subjects: Machine Learning (cs.LG); Computers and Society (cs.CY)
*Comments:

Abstract:Machine learning models often preserve biases present in training data, leading to unfair treatment of certain minority groups. Despite an array of existing firm-side bias mitigation techniques, they typically incur utility costs and require organizational buy-in. Recognizing that many models rely on user-contributed data, end-users can induce fairness through the framework of Algorithmic Collective Action, where a coordinated minority group strategically relabels its own data to enhance fairness, without altering the firm’s training process. We propose three practical, model-agnostic methods to approximate ideal relabeling and validate them on real-world datasets. Our findings show that a subgroup of the minority can substantially reduce unfairness with a small impact on the overall prediction error.

[LG-23] Enhancing Forecasting with a 2D Time Series Approach for Cohort-Based Data

Link: https://arxiv.org/abs/2508.15369
Authors: Yonathan Guttel,Orit Moradov,Nachi Lieder,Asnat Greenstein-Messica
Subjects: Machine Learning (cs.LG)
*Comments: Accepted at IEEE CiFer Companion 2025. 5 pages, 3 figures, 2 tables

Abstract:This paper introduces a novel two-dimensional (2D) time series forecasting model that integrates cohort behavior over time, addressing challenges in small data environments. We demonstrate its efficacy using multiple real-world datasets, showcasing superior performance in accuracy and adaptability compared to reference models. The approach offers valuable insights for strategic decision-making across industries facing financial and marketing forecasting challenges.

[LG-24] ExBigBang: A Dynamic Approach for Explainable Persona Classification through Contextualized Hybrid Transformer Analysis

Link: https://arxiv.org/abs/2508.15364
Authors: Saleh Afzoon,Amin Beheshti,Nabi Rezvani,Farshad Khunjush,Usman Naseem,John McMahon,Zahra Fathollahi,Mahdieh Labani,Wathiq Mansoor,Xuyun Zhang
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:In user-centric design, persona development plays a vital role in understanding user behaviour, capturing needs, segmenting audiences, and guiding design decisions. However, the growing complexity of user interactions calls for a more contextualized approach to ensure designs align with real user needs. While earlier studies have advanced persona classification by modelling user behaviour, capturing contextual information, especially by integrating textual and tabular data, remains a key challenge. These models also often lack explainability, leaving their predictions difficult to interpret or justify. To address these limitations, we present ExBigBang (Explainable BigBang), a hybrid text-tabular approach that uses transformer-based architectures to model rich contextual features for persona classification. ExBigBang incorporates metadata, domain knowledge, and user profiling to embed deeper context into predictions. Through a cyclical process of user profiling and classification, our approach dynamically updates to reflect evolving user behaviours. Experiments on a benchmark persona classification dataset demonstrate the robustness of our model. An ablation study confirms the benefits of combining text and tabular data, while Explainable AI techniques shed light on the rationale behind the model’s predictions.

[LG-25] An Enhanced Audio Feature Tailored for Anomalous Sound Detection Based on Pre-trained Models ICANN2025

Link: https://arxiv.org/abs/2508.15334
Authors: Guirui Zhong,Qing Wang,Jun Du,Lei Wang,Mingqi Cai,Xin Fang
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
*Comments: 13 pages, 3 figures, accepted by ICANN2025

Abstract:Anomalous Sound Detection (ASD) aims at identifying anomalous sounds from machines and has gained extensive research interests from both academia and industry. However, the uncertainty of anomaly location and much redundant information such as noise in machine sounds hinder the improvement of ASD system performance. This paper proposes a novel audio feature of filter banks with evenly distributed intervals, ensuring equal attention to all frequency ranges in the audio, which enhances the detection of anomalies in machine sounds. Moreover, based on pre-trained models, this paper presents a parameter-free feature enhancement approach to remove redundant information in machine audio. It is believed that this parameter-free strategy facilitates the effective transfer of universal knowledge from pre-trained tasks to the ASD task during model fine-tuning. Evaluation results on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 Challenge dataset demonstrate significant improvements in ASD performance with our proposed methods.

[LG-26] Saving for the future: Enhancing generalization via partial logic regularization

Link: https://arxiv.org/abs/2508.15317
Authors: Zhaorui Tan,Yijie Hu,Xi Yang,Qiufeng Wang,Anh Nguyen,Kaizhu Huang
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Generalization remains a significant challenge in visual classification tasks, particularly in handling unknown classes in real-world applications. Existing research focuses on the class discovery paradigm, which tends to favor known classes, and the incremental learning paradigm, which suffers from catastrophic forgetting. Recent approaches such as the L-Reg technique employ logic-based regularization to enhance generalization but are bound by the necessity of fully defined logical formulas, limiting flexibility for unknown classes. This paper introduces PL-Reg, a novel partial-logic regularization term that allows models to reserve space for undefined logic formulas, improving adaptability to unknown classes. Specifically, we formally demonstrate that tasks involving unknown classes can be effectively explained using partial logic. We also prove that methods based on partial logic lead to improved generalization. We validate PL-Reg through extensive experiments on Generalized Category Discovery, Multi-Domain Generalized Category Discovery, and long-tailed Class Incremental Learning tasks, demonstrating consistent performance improvements. Our results highlight the effectiveness of partial logic in tackling challenges related to unknown classes.

[LG-27] MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation

Link: https://arxiv.org/abs/2508.15281
Authors: Yi Xu,Moyu Zhang,Chenxuan Li,Zhihao Liao,Haibo Xing,Hao Deng,Jinxin Hu,Yu Zhang,Xiaoyi Zeng,Jing Zhang
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*Comments:

Abstract:Recommender systems traditionally represent items using unique identifiers (ItemIDs), but this approach struggles with large, dynamic item corpora and sparse long-tail data, limiting scalability and generalization. Semantic IDs, derived from multimodal content such as text and images, offer a promising alternative by mapping items into a shared semantic space, enabling knowledge transfer and improving recommendations for new or rare items. However, existing methods face two key challenges: (1) balancing cross-modal synergy with modality-specific uniqueness, and (2) bridging the semantic-behavioral gap, where semantic representations may misalign with actual user preferences. To address these challenges, we propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer. First, a shared-specific tokenizer leverages a multi-expert architecture with modality-specific and modality-shared experts, using orthogonal regularization to capture comprehensive multimodal information. Second, behavior-aware fine-tuning dynamically adapts semantic IDs to downstream recommendation objectives while preserving modality information through a multimodal reconstruction loss. Extensive offline experiments and online A/B tests demonstrate that MMQ effectively unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.

[LG-28] Deep Think with Confidence

Link: https://arxiv.org/abs/2508.15260
Authors: Yichao Fu,Xuewei Wang,Yuandong Tian,Jiawei Zhao
Subjects: Machine Learning (cs.LG)
*Comments:

Abstract:Large Language Models (LLMs) have shown great potential in reasoning tasks through test-time scaling methods like self-consistency with majority voting. However, this approach often leads to diminishing returns in accuracy and high computational overhead. To address these challenges, we introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. It requires no additional model training or hyperparameter tuning and can be seamlessly integrated into existing serving frameworks. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series. Notably, on challenging benchmarks such as AIME 2025, DeepConf@512 achieves up to 99.9% accuracy and reduces generated tokens by up to 84.7% compared to full parallel thinking.
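The idea of filtering reasoning traces by confidence before voting can be sketched as below. This is only an illustration of the offline ("after generation") case under stated assumptions: each trace already carries a scalar confidence, the most confident fraction is kept, and the kept traces vote with their confidence as weight. The `keep_fraction` parameter, the weighting rule, and the example numbers are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

def filtered_vote(traces, keep_fraction=0.5):
    """Keep the most confident traces, then take a confidence-weighted vote.

    traces: list of (answer, confidence) pairs, one per reasoning trace.
    """
    ranked = sorted(traces, key=lambda t: t[1], reverse=True)
    kept = ranked[:max(1, int(len(ranked) * keep_fraction))]
    weights = defaultdict(float)
    for answer, confidence in kept:
        weights[answer] += confidence
    return max(weights, key=weights.get)

# Hypothetical traces: plain majority voting would pick "17" (3 votes vs. 2),
# but the two high-confidence traces agree on "42".
traces = [("42", 0.9), ("42", 0.8), ("17", 0.3), ("17", 0.2), ("17", 0.25)]
answer = filtered_vote(traces)
```

The online variant mentioned in the abstract (filtering during generation) would additionally stop low-confidence traces early, which is where the reported token savings would come from; that part is not sketched here.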

[LG-29] Learning ECG Representations via Poly-Window Contrastive Learning

Link: https://arxiv.org/abs/2508.15225
Authors: Yi Yuan,Joseph Van Duyn,Runze Yan,Zhuoyi Huang,Sulaiman Vesal,Sergey Plis,Xiao Hu,Gloria Hyunjung Kwak,Ran Xiao,Alex Fedorov
Subjects: Machine Learning (cs.LG); Signal Processing (eess.SP)
*Comments: This work has been accepted for publication in IEEE-EMBS International Conference on Biomedical and Health Informatics 2025. The final published version will be available via IEEE Xplore

Abstract:Electrocardiogram (ECG) analysis is foundational for cardiovascular disease diagnosis, yet the performance of deep learning models is often constrained by limited access to annotated data. Self-supervised contrastive learning has emerged as a powerful approach for learning robust ECG representations from unlabeled signals. However, most existing methods generate only pairwise augmented views and fail to leverage the rich temporal structure of ECG recordings. In this work, we present a poly-window contrastive learning framework. We extract multiple temporal windows from each ECG instance to construct positive pairs and maximize their agreement via statistics. Inspired by the principle of slow feature analysis, our approach explicitly encourages the model to learn temporally invariant and physiologically meaningful features that persist across time. We validate our approach through extensive experiments and ablation studies on the PTB-XL dataset. Our results demonstrate that poly-window contrastive learning consistently outperforms conventional two-view methods in multi-label superclass classification, achieving higher AUROC (0.891 vs. 0.888) and F1 scores (0.680 vs. 0.679) while requiring up to four times fewer pre-training epochs (32 vs. 128) and reducing total wall-clock pre-training time by 14.8%. Despite processing multiple windows per sample, we achieve a significant reduction in the number of training epochs and total computation time, making our method practical for training foundational models. Through extensive ablations, we identify optimal design choices and demonstrate robustness across various hyperparameters. These findings establish poly-window contrastive learning as a highly efficient and scalable paradigm for automated ECG analysis and provide a promising general framework for self-supervised representation learning in biomedical time-series data.
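The window-extraction step described above can be sketched as follows: several temporal windows are drawn from one recording, and every pair of windows from the same recording is treated as a positive pair. The uniform random start positions, the window length, and the placeholder signal are assumptions made for illustration; the paper's sampling scheme and its agreement objective are not reproduced.

```python
import random
from itertools import combinations

def sample_windows(signal, window_len, num_windows, rng):
    """Draw num_windows fixed-length windows at random start positions."""
    last_start = len(signal) - window_len
    starts = [rng.randrange(0, last_start + 1) for _ in range(num_windows)]
    return [signal[s:s + window_len] for s in starts]

def positive_pairs(windows):
    """All index pairs of windows from the same recording are positives."""
    return list(combinations(range(len(windows)), 2))

rng = random.Random(0)       # seeded for reproducibility
ecg = list(range(1000))      # placeholder for one single-lead ECG recording
windows = sample_windows(ecg, window_len=250, num_windows=4, rng=rng)
pairs = positive_pairs(windows)
```

With K windows per recording there are K(K-1)/2 positive pairs instead of the single pair used by two-view methods, which is one plausible reason fewer pre-training epochs could suffice.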

[LG-30] See Beyond a Single View: Multi-Attribution Learning Leads to Better Conversion Rate Prediction CIKM2025

Link: https://arxiv.org/abs/2508.15217
Authors: Sishuo Chen,Zhangming Chan,Xiang-Rong Sheng,Lei Zhang,Sheng Chen,Chenghuan Hou,Han Zhu,Jian Xu,Bo Zheng
Subjects: Machine Learning (cs.LG); Information Retrieval (cs.IR)
*Comments: Accepted at CIKM 2025

Abstract:Conversion rate (CVR) prediction is a core component of online advertising systems, where the attribution mechanisms (rules for allocating conversion credit across user touchpoints) fundamentally determine label generation and model optimization. While many industrial platforms support diverse attribution mechanisms (e.g., First-Click, Last-Click, Linear, and Data-Driven Multi-Touch Attribution), conventional approaches restrict model training to labels from a single production-critical attribution mechanism, discarding complementary signals in alternative attribution perspectives. To address this limitation, we propose a novel Multi-Attribution Learning (MAL) framework for CVR prediction that integrates signals from multiple attribution perspectives to better capture the underlying patterns driving user conversions. Specifically, MAL is a joint learning framework consisting of two core components: the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). AKA is implemented as a multi-task learner that integrates knowledge extracted from diverse attribution labels. PTP, in contrast, focuses on the task of generating well-calibrated conversion probabilities that align with the system-optimized attribution metric (e.g., CVR under the Last-Click attribution), ensuring direct compatibility with industrial deployment requirements. Additionally, we propose CAT, a novel training strategy that leverages the Cartesian product of all attribution label combinations to generate enriched supervision signals. This design substantially enhances the performance of the attribution knowledge aggregator. Empirical evaluations demonstrate the superiority of MAL over single-attribution learning baselines, achieving +0.51% GAUC improvement on offline metrics. Online experiments demonstrate that MAL achieved a +2.6% increase in ROI (Return on Investment).
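One way to read the Cartesian-product idea behind CAT is that each combination of per-mechanism labels becomes its own class for an auxiliary task. The sketch below illustrates that reading only. It assumes binarized labels for three mechanisms, and the mechanism keys and the mapping to class indices are hypothetical; the abstract does not describe how CAT actually encodes the combinations.

```python
from itertools import product

# Hypothetical mechanism keys, following names listed in the abstract.
MECHANISMS = ["first_click", "last_click", "linear"]

# Every combination of binary labels across mechanisms: 2**3 = 8 classes.
COMBOS = list(product([0, 1], repeat=len(MECHANISMS)))
COMBO_TO_CLASS = {combo: index for index, combo in enumerate(COMBOS)}

def joint_label(labels):
    """Map per-mechanism binary labels to a single joint class index."""
    return COMBO_TO_CLASS[tuple(labels[m] for m in MECHANISMS)]

# A sample credited under First-Click and Linear but not under Last-Click.
sample = {"first_click": 1, "last_click": 0, "linear": 1}
joint = joint_label(sample)
```

A multi-task learner could then predict this joint class alongside the primary Last-Click target, so that disagreements between mechanisms become explicit supervision rather than discarded signal.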

[LG-31] SleepDIFFormer: Sleep Stage Classification via Multivariate Differential Transformer

Link: https://arxiv.org/abs/2508.15215
Authors: Benjamin Wei Hao Chin,Yuin Torng Yew,Haocheng Wu,Lanxin Liang,Chow Khuen Chan,Norita Mohd Zain,Siti Balqis Samdin,Sim Kuan Goh
Subjects: Machine Learning (cs.LG)
*Comments: 8 Pages

Abstract:Classification of sleep stages is essential for assessing sleep quality and diagnosing sleep disorders such as insomnia. However, manual inspection of EEG characteristics for each stage is time-consuming and prone to human error. Although machine learning and deep learning methods have been actively developed, they continue to face challenges from the non-stationarity and variability of electroencephalography (EEG) and electrooculography (EOG) signals, often leading to poor generalization on unseen datasets. This research proposes a sleep stage classification method based on a Multivariate Differential Transformer (SleepDIFFormer) for joint EEG and EOG representation learning. Specifically, SleepDIFFormer processes EEG and EOG signals using our Multivariate Differential Transformer Architecture (MDTA) for time series, trained with cross-domain alignment. Our method mitigates spatial and temporal attention noise while learning a domain-invariant joint EEG-EOG representation through feature distribution alignment, thereby enabling generalization to unseen target datasets. Empirically, we evaluated our method on five different sleep staging datasets and compared it with existing approaches, achieving state-of-the-art performance. We also conducted thorough ablation analyses of SleepDIFFormer and interpreted the differential attention weights, highlighting their relevance to characteristic sleep EEG patterns. These findings have implications for advancing automated sleep stage classification and its application to sleep quality assessment. Our source code is publicly available at this https URL

[LG-32] Frequency-adaptive tensor neural networks for high-dimensional multi-scale problems

Link: https://arxiv.org/abs/2508.15198
Authors: Jizu Huang,Rukang You,Tao Zhou
Subjects: Machine Learning (cs.LG); Mathematical Physics (math-ph)

Abstract:Tensor neural networks (TNNs) have demonstrated their superiority in solving high-dimensional problems. However, similar to conventional neural networks, TNNs are also influenced by the Frequency Principle, which limits their ability to accurately capture high-frequency features of the solution. In this work, we analyze the training dynamics of TNNs by Fourier analysis and enhance their expressivity for high-dimensional multi-scale problems by incorporating random Fourier features. Leveraging the inherent tensor structure of TNNs, we further propose a novel approach to extract frequency features of high-dimensional functions by applying the Discrete Fourier Transform to the one-dimensional component functions. This strategy effectively mitigates the curse of dimensionality. Building on this idea, we propose a frequency-adaptive TNN algorithm, which significantly improves the ability of TNNs to solve complex multi-scale problems. Extensive numerical experiments are performed to validate the effectiveness and robustness of the proposed frequency-adaptive TNN algorithm.
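The frequency-extraction step can be illustrated with a small sketch: sample a one-dimensional component function on a uniform grid, take its discrete Fourier transform, and keep the frequencies with the largest magnitude. The test signal, the grid size, and the helper `dominant_frequencies` are assumptions made for illustration; the abstract does not describe how the paper selects or uses the extracted frequencies.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a real sequence."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def dominant_frequencies(samples, top_k=2):
    """Return the top_k frequency indices in 1..n/2 with the largest magnitude."""
    n = len(samples)
    spectrum = dft(samples)
    candidates = range(1, n // 2 + 1)
    ranked = sorted(candidates, key=lambda k: abs(spectrum[k]), reverse=True)
    return sorted(ranked[:top_k])


# Hypothetical 1D component with one low- and one high-frequency mode.
n = 64
grid = [t / n for t in range(n)]
signal = [
    math.sin(2 * math.pi * 3 * x) + 0.5 * math.sin(2 * math.pi * 17 * x)
    for x in grid
]
freqs = dominant_frequencies(signal, top_k=2)
```

Because the transform is applied to each one-dimensional factor separately, the cost grows with the number of dimensions rather than with the size of a full tensor-product grid.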

[LG-33] Revisiting Pre-processing Group Fairness: A Modular Benchmarking Framework CIKM2025

Link: https://arxiv.org/abs/2508.15193
Authors: Brodie Oldfield,Ziqi Xu,Sevvandi Kandanaarachchi
Subjects: Machine Learning (cs.LG)
Comments: This paper has been accepted to the 34th ACM International Conference on Information and Knowledge Management (CIKM 2025), Resource Track

Abstract:As machine learning systems become increasingly integrated into high-stakes decision-making processes, ensuring fairness in algorithmic outcomes has become a critical concern. Methods to mitigate bias typically fall into three categories: pre-processing, in-processing, and post-processing. While significant attention has been devoted to the latter two, pre-processing methods, which operate at the data level and offer advantages such as model-agnosticism and improved privacy compliance, have received comparatively less focus and lack standardised evaluation tools. In this work, we introduce FairPrep, an extensible and modular benchmarking framework designed to evaluate fairness-aware pre-processing techniques on tabular datasets. Built on the AIF360 platform, FairPrep allows seamless integration of datasets, fairness interventions, and predictive models. It features a batch-processing interface that enables efficient experimentation and automatic reporting of fairness and utility metrics. By offering standardised pipelines and supporting reproducible evaluations, FairPrep fills a critical gap in the fairness benchmarking landscape and provides a practical foundation for advancing data-level fairness research.
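To make "fairness metrics" concrete, the sketch below computes statistical parity difference, a widely used group-fairness measure: the selection rate of the unprivileged group minus that of the privileged group. The abstract does not list which metrics FairPrep reports, so this choice is an assumption, and the code deliberately avoids the AIF360 API; the predictions and group labels are made-up toy data.

```python
def selection_rate(predictions, groups, group_value):
    """Fraction of positive predictions within one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group_value]
    return sum(selected) / len(selected)


def statistical_parity_difference(predictions, groups, unprivileged, privileged):
    """Selection rate of the unprivileged group minus that of the privileged group."""
    return (
        selection_rate(predictions, groups, unprivileged)
        - selection_rate(predictions, groups, privileged)
    )


# Toy binary predictions for two groups (hypothetical data).
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
spd = statistical_parity_difference(preds, groups, unprivileged="b", privileged="a")
```

A value of 0 means both groups are selected at the same rate; a pre-processing intervention would aim to move the value toward 0 without sacrificing too much accuracy.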

[LG-34] Integrated Sensing Communication and Computation for Over-the-Air Federated Edge Learning

Link: https://arxiv.org/abs/2508.15185
Authors: Dingzhu Wen,Sijing Xie,Xiaowen Cao,Yuanhao Cui,Jie Xu,Yuanming Shi,Shuguang Cui
Subjects: Information Theory (cs.IT); Machine Learning (cs.LG)
Comments: The paper has been accepted for publication in IEEE Transactions on Wireless Communications

Abstract:This paper studies an over-the-air federated edge learning (Air-FEEL) system with integrated sensing, communication, and computation (ISCC), in which one edge server coordinates multiple edge devices to wirelessly sense the objects and use the sensing data to collaboratively train a machine learning model for recognition tasks. In this system, over-the-air computation (AirComp) is employed to enable one-shot model aggregation from edge devices. Under this setup, we analyze the convergence behavior of the ISCC-enabled Air-FEEL in terms of the loss function degradation, by particularly taking into account the wireless sensing noise during the training data acquisition and the AirComp distortions during the over-the-air model aggregation. The result theoretically shows that sensing, communication, and computation compete for network resources to jointly decide the convergence rate. Based on the analysis, we design the ISCC parameters under the target of maximizing the loss function degradation while ensuring the latency and energy budgets in each round. The challenge lies in the tightly coupled processes of sensing, communication, and computation among different devices. To tackle the challenge, we derive a low-complexity ISCC algorithm by alternately optimizing the batch size control and the network resource allocation. It is found that for each device, less sensing power should be consumed if a larger batch of data samples is obtained and vice versa. Besides, with a given batch size, the optimal computation speed of one device is the minimum one that satisfies the latency constraint. Numerical results based on a human motion recognition task verify the theoretical convergence analysis and show that the proposed ISCC algorithm effectively coordinates the batch size control and resource allocation among sensing, communication, and computation to enhance the learning performance.
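The one-shot aggregation idea behind AirComp can be sketched as follows: all devices transmit at once, the channel adds their signals, and the server divides the noisy sum by the number of devices. This is a heavily simplified model that assumes ideal channel inversion and additive Gaussian receiver noise; fading, power control, and sensing noise, which the paper analyzes, are left out, and `aircomp_aggregate` is a hypothetical helper.

```python
import random


def aircomp_aggregate(local_updates, noise_std, rng):
    """Average local updates from a superimposed (summed) signal plus receiver noise."""
    num_devices = len(local_updates)
    dim = len(local_updates[0])
    received = [
        sum(update[j] for update in local_updates) + rng.gauss(0.0, noise_std)
        for j in range(dim)
    ]
    return [value / num_devices for value in received]


# Three devices, two model coordinates each (hypothetical gradients).
updates = [[0.2, -0.1], [0.4, 0.1], [0.0, 0.3]]
noisy_average = aircomp_aggregate(updates, noise_std=0.05, rng=random.Random(0))
```

The aggregation error here comes only from the added noise, which hints at why the distortion term enters the convergence analysis mentioned in the abstract.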

[LG-35] SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks

Link: https://arxiv.org/abs/2508.15182
Authors: Xiangman Li,Xiaodong Wu,Qi Li,Jianbing Ni,Rongxing Lu
Subjects: Machine Learning (cs.LG)

Abstract:Jailbreak attacks pose a serious threat to the safety of Large Language Models (LLMs) by crafting adversarial prompts that bypass alignment mechanisms, causing the models to produce harmful, restricted, or biased content. In this paper, we propose SafeLLM, a novel unlearning-based defense framework that unlearns the harmful knowledge from LLMs while preserving linguistic fluency and general capabilities. SafeLLM employs a three-stage pipeline: (1) dynamic unsafe output detection using a hybrid approach that integrates external classifiers with model-internal evaluations; (2) token-level harmful content tracing through feedforward network (FFN) activations to localize harmful knowledge; and (3) constrained optimization to suppress unsafe behavior without degrading overall model quality. SafeLLM achieves targeted and irreversible forgetting by identifying and neutralizing FFN substructures responsible for harmful generation pathways. Extensive experiments on prominent LLMs (Vicuna, LLaMA, and GPT-J) across multiple jailbreak benchmarks show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance. Compared to standard defense methods such as supervised fine-tuning and direct preference optimization, SafeLLM offers stronger safety guarantees, more precise control over harmful behavior, and greater robustness to unseen attacks. Moreover, SafeLLM maintains general performance after the harmful knowledge is unlearned. These results highlight unlearning as a promising direction for scalable and effective LLM safety.

[LG-36] A Robust BERT-Based Deep Learning Model for Automated Cancer Type Extraction from Unstructured Pathology Reports

Link: https://arxiv.org/abs/2508.15149
Authors: Minh Tran,Jeffery C. Chan,Min Li Huang,Maya Kansara,John P. Grady,Christine E. Napier,Subotheni Thavaneswaran,Mandy L. Ballinger,David M. Thomas,Frank P. Lin
Subjects: Machine Learning (cs.LG)

Abstract:The accurate extraction of clinical information from electronic medical records is particularly critical to clinical research but requires considerable trained expertise and manual labor. In this study we developed a robust system for automated extraction of specific cancer types from pathology reports using a fine-tuned RoBERTa model, for the purpose of supporting precision oncology research. This model significantly outperformed the baseline model and a Large Language Model, Mistral 7B, achieving F1_Bertscore 0.98 and overall exact match of 80.61%. This fine-tuning approach demonstrates the potential for scalability that can integrate seamlessly into the molecular tumour board process. Fine-tuning domain-specific models for precision tasks in oncology may pave the way for more efficient and accurate clinical information extraction.

[LG-37] Towards Reliable and Generalizable Differentially Private Machine Learning (Extended Version) ACSA

Link: https://arxiv.org/abs/2508.15141
Authors: Wenxuan Bao,Vincent Bindschaedler
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: This paper is published at ACSAC 2024. This is the extended version that includes an overview of the relevant literature. We open-source our codebase at: this https URL

Abstract:There is a flurry of recent research papers proposing novel differentially private machine learning (DPML) techniques. These papers claim to achieve new state-of-the-art (SoTA) results and offer empirical results as validation. However, there is no consensus on which techniques are most effective or if they genuinely meet their stated claims. Complicating matters, heterogeneity in codebases, datasets, methodologies, and model architectures makes direct comparisons of different approaches challenging. In this paper, we conduct a reproducibility and replicability (R+R) experiment on 11 different SoTA DPML techniques from the recent research literature. Results of our investigation are varied: while some methods stand up to scrutiny, others falter when tested outside their initial experimental conditions. We also discuss challenges unique to the reproducibility of DPML, including additional randomness due to DP noise, and how to address them. Finally, we derive insights and best practices to obtain scientifically valid and reliable results.

[LG-38] Towards Source-Free Machine Unlearning CVPR2025

Link: https://arxiv.org/abs/2508.15127
Authors: Sk Miraj Ahmed,Umit Yigit Basaran,Dripta S. Raychaudhuri,Arindam Dutta,Rohit Kundu,Fahim Faisal Niloy,Basak Guler,Amit K. Roy-Chowdhury
Subjects: Machine Learning (cs.LG)
Comments: Accepted by CVPR 2025

Abstract:As machine learning becomes more pervasive and data privacy regulations evolve, the ability to remove private or copyrighted information from trained models is becoming an increasingly critical requirement. Existing unlearning methods often rely on the assumption of having access to the entire training dataset during the forgetting process. However, this assumption may not hold true in practical scenarios where the original training data may not be accessible, i.e., the source-free setting. To address this challenge, we focus on the source-free unlearning scenario, where an unlearning algorithm must be capable of removing specific data from a trained model without requiring access to the original training dataset. Building on recent work, we present a method that can estimate the Hessian of the unknown remaining training data, a crucial component required for efficient unlearning. Leveraging this estimation technique, our method enables efficient zero-shot unlearning while providing robust theoretical guarantees on the unlearning performance and maintaining performance on the remaining data. Extensive experiments over a wide range of datasets verify the efficacy of our method.

[LG-39] Adaptive Anomaly Detection in Evolving Network Environments

Link: https://arxiv.org/abs/2508.15100
Authors: Ehssan Mousavipour,Andrey Dimanchev,Majid Ghaderi
Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Distribution shift, a change in the statistical properties of data over time, poses a critical challenge for deep learning anomaly detection systems. Existing anomaly detection systems often struggle to adapt to these shifts. Specifically, systems based on supervised learning require costly manual labeling, while those based on unsupervised learning rely on clean data, which is difficult to obtain, for shift adaptation. Both of these requirements are challenging to meet in practice. In this paper, we introduce NetSight, a framework for supervised anomaly detection in network data that continually detects and adapts to distribution shifts in an online manner. NetSight eliminates manual intervention through a novel pseudo-labeling technique and uses a knowledge distillation-based adaptation strategy to prevent catastrophic forgetting. Evaluated on three long-term network datasets, NetSight demonstrates superior adaptation performance compared to state-of-the-art methods that rely on manual labeling, achieving F1-score improvements of up to 11.72%. This proves its robustness and effectiveness in dynamic networks that experience distribution shifts over time.

[LG-40] Evaluating Sparse Autoencoders for Monosemantic Representation

Link: https://arxiv.org/abs/2508.15094
Authors: Moghis Fereidouni,Muhammad Umair Haider,Peizhong Ju,A.B. Siddique
Subjects: Machine Learning (cs.LG)

Abstract:A key barrier to interpreting large language models is polysemanticity, where neurons activate for multiple unrelated concepts. Sparse autoencoders (SAEs) have been proposed to mitigate this issue by transforming dense activations into sparse, more interpretable features. While prior work suggests that SAEs promote monosemanticity, there has been no quantitative comparison with their base models. This paper provides the first systematic evaluation of SAEs against base models concerning monosemanticity. We introduce a fine-grained concept separability score based on the Jensen-Shannon distance, which captures how distinctly a neuron’s activation distributions vary across concepts. Using Gemma-2-2B and multiple SAE variants across five benchmarks, we show that SAEs reduce polysemanticity and achieve higher concept separability. However, greater sparsity of SAEs does not always yield better separability and often impairs downstream performance. To assess practical utility, we evaluate concept-level interventions using two strategies: full neuron masking and partial suppression. We find that, compared to base models, SAEs enable more precise concept-level control when using partial suppression. Building on this, we propose Attenuation via Posterior Probabilities (APP), a new intervention method that uses concept-conditioned activation distributions for targeted suppression. APP outperforms existing approaches in targeted concept removal.
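The building block of the separability score, the Jensen-Shannon distance between two activation distributions, can be sketched as below. The histogram binning, the base-2 logarithm (which bounds the distance to [0, 1]), and the toy activations are assumptions; how the paper aggregates distances across concepts and neurons is not stated in the abstract.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalised histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[max(idx, 0)] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical activations of one neuron on inputs from two concepts.
acts_concept_a = [0.05, 0.10, 0.12, 0.20, 0.15]
acts_concept_b = [0.80, 0.90, 0.85, 0.95, 0.70]
p = histogram(acts_concept_a, bins=4, lo=0.0, hi=1.0)
q = histogram(acts_concept_b, bins=4, lo=0.0, hi=1.0)
separability = js_distance(p, q)
```

A neuron whose activation distributions barely overlap across concepts scores close to 1, while a polysemantic neuron that fires similarly for both scores close to 0.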

[LG-41] Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size

Link: https://arxiv.org/abs/2508.15071
Authors: Rustem Islamov,Niccolo Ajroldi,Antonio Orvieto,Aurelien Lucchi
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)

Abstract:Modern optimization algorithms that incorporate momentum and adaptive step-sizes offer improved performance in numerous challenging deep learning tasks. However, their effectiveness is often highly sensitive to the choice of hyperparameters, especially the step-size. Tuning these parameters is often difficult, resource-intensive, and time-consuming. Therefore, recent efforts have been directed toward enhancing the stability of optimizers across a wide range of hyperparameter choices [Schaipp et al., 2024]. In this paper, we introduce an algorithm that matches the performance of state-of-the-art optimizers while improving stability to the choice of the step-size hyperparameter through a novel adaptation of the NGN step-size method [Orvieto and Xiao, 2024]. Specifically, we propose a momentum-based version (NGN-M) that attains the standard convergence rate of \mathcal{O}(1/\sqrt{K}) under less restrictive assumptions, without the need for the interpolation condition or assumptions of bounded stochastic gradients or iterates, in contrast to previous approaches. Additionally, we empirically demonstrate that the combination of the NGN step-size with momentum results in enhanced robustness to the choice of the step-size hyperparameter while delivering performance that is comparable to or surpasses other state-of-the-art optimizers.
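A rough sketch of why such a step-size is forgiving: assuming the NGN step-size takes the form sigma / (1 + sigma * ||g||^2 / (2 f)) for a non-negative loss f, as reported in the cited Orvieto and Xiao work, even a very large sigma is damped automatically. The momentum update below is a plain exponential-average heavy-ball rule chosen for illustration; it is not claimed to be the paper's NGN-M update, and the quadratic objective is a toy.

```python
def ngn_step_size(loss, grad_sq_norm, sigma):
    """Assumed NGN form: sigma / (1 + sigma * ||g||^2 / (2 * loss)), loss >= 0."""
    if grad_sq_norm == 0.0:
        return sigma  # at a stationary point the step has no effect anyway
    if loss <= 0.0:
        return 0.0  # guard against division by zero
    return sigma / (1.0 + sigma * grad_sq_norm / (2.0 * loss))


def minimize_quadratic(x0, sigma, beta, steps):
    """Minimise f(x) = 0.5 * x^2 with an illustrative momentum average."""
    x, momentum = x0, 0.0
    for _ in range(steps):
        loss = 0.5 * x * x
        grad = x
        step = ngn_step_size(loss, grad * grad, sigma)
        momentum = beta * momentum + (1.0 - beta) * step * grad
        x -= momentum
    return x


# sigma = 100 would make plain gradient descent diverge on this objective.
x_final = minimize_quadratic(x0=5.0, sigma=100.0, beta=0.9, steps=300)
```

On this objective the effective step is sigma / (1 + sigma), which stays below 1 no matter how large sigma is chosen.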

[LG-42] Robust Estimation Under Heterogeneous Corruption Rates NEURIPS2025

Link: https://arxiv.org/abs/2508.15051
Authors: Syomantak Chaudhuri,Jerry Li,Thomas A. Courtade
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments: NeurIPS 2025

Abstract:We study the problem of robust estimation under heterogeneous corruption rates, where each sample may be independently corrupted with a known but non-identical probability. This setting arises naturally in distributed and federated learning, crowdsourcing, and sensor networks, yet existing robust estimators typically assume uniform or worst-case corruption, ignoring structural heterogeneity. For mean estimation for multivariate bounded distributions and univariate Gaussian distributions, we give tight minimax rates for all heterogeneous corruption patterns. For multivariate Gaussian mean estimation and linear regression, we establish the minimax rate for squared error up to a factor of \sqrt{d}, where d is the dimension. Roughly, our findings suggest that samples beyond a certain corruption threshold may be discarded by the optimal estimators; this threshold is determined by the empirical distribution of the given corruption rates.

[LG-43] Rethinking the Potential of Layer Freezing for Efficient DNN Training

Link: https://arxiv.org/abs/2508.15033
Authors: Chence Yang,Ci Zhang,Lei Lu,Qitao Tan,Sheng Li,Ao Li,Xulong Tang,Shaoyi Huang,Jinzhen Wang,Guoming Li,Jundong Li,Xiaoming Zhai,Jin Lu,Geng Yuan
Subjects: Machine Learning (cs.LG)

Abstract:With the growing size of deep neural networks and datasets, the computational costs of training have significantly increased. The layer-freezing technique has recently attracted great attention as a promising method to effectively reduce the cost of network training. However, in traditional layer-freezing methods, frozen layers are still required for forward propagation to generate feature maps for unfrozen layers, limiting the reduction of computation costs. To overcome this, prior works proposed a hypothetical solution, which caches feature maps from frozen layers as a new dataset, allowing later layers to train directly on stored feature maps. While this approach appears to be straightforward, it presents several major challenges that are severely overlooked by prior literature, such as how to effectively apply augmentations to feature maps and the substantial storage overhead introduced. If these overlooked challenges are not addressed, the performance of the caching method will be severely impacted and the method may even become infeasible. This paper is the first to comprehensively explore these challenges and provides a systematic solution. To improve training accuracy, we propose similarity-aware channel augmentation, which caches channels with high augmentation sensitivity with a minimum additional storage cost. To mitigate storage overhead, we incorporate lossy data compression into layer freezing and design a progressive compression strategy, which increases compression rates as more layers are frozen, effectively reducing storage costs. Finally, our solution achieves significant reductions in training cost while maintaining model accuracy, with a minor time overhead. Additionally, we conduct a comprehensive evaluation of freezing and compression strategies, providing insights into optimizing their application for efficient DNN training.

[LG-44] Nonlinear Federated System Identification

Link: https://arxiv.org/abs/2508.15025
Authors: Omkar Tupe,Max Hartman,Lav R. Varshney,Saurav Prakash
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:We consider federated learning of linearly-parameterized nonlinear systems. We establish theoretical guarantees on the effectiveness of federated nonlinear system identification compared to centralized approaches, demonstrating that the convergence rate improves as the number of clients increases. Although the convergence rates in the linear and nonlinear cases differ only by a constant, this constant depends on the feature map \phi , which can be carefully chosen in the nonlinear setting to increase excitation and improve performance. We experimentally validate our theory in physical settings where client devices are driven by i.i.d. control inputs and control policies exhibiting i.i.d. random perturbations, ensuring non-active exploration. Experiments use trajectories from nonlinear dynamical systems characterized by real-analytic feature functions, including polynomial and trigonometric components, representative of physical systems including pendulum and quadrotor dynamics. We analyze the convergence behavior of the proposed method under varying noise levels and data distributions. Results show that federated learning consistently improves convergence of any individual client as the number of participating clients increases.
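As a toy illustration of the setting, the sketch below simulates several clients that each observe a scalar system x_{t+1} = theta * phi(x_t, u_t) + w_t, fit theta by local least squares, and send their estimates to a server that averages them. The feature map, the scalar parameter, and the simple averaging rule are all assumptions made for illustration; the abstract does not specify the paper's aggregation scheme.

```python
import math
import random


def feature(x, u):
    """Hypothetical feature map phi(x, u), chosen only for illustration."""
    return math.sin(x) + u


def simulate_client(theta, steps, noise_std, rng):
    """Roll out x_{t+1} = theta * phi(x_t, u_t) + w_t with i.i.d. Gaussian inputs."""
    x, data = 0.0, []
    for _ in range(steps):
        u = rng.gauss(0.0, 1.0)
        phi = feature(x, u)
        x_next = theta * phi + rng.gauss(0.0, noise_std)
        data.append((phi, x_next))
        x = x_next
    return data


def local_least_squares(data):
    """Closed-form scalar least-squares estimate of theta from (phi, x_next) pairs."""
    numerator = sum(phi * x_next for phi, x_next in data)
    denominator = sum(phi * phi for phi, _ in data)
    return numerator / denominator


def federated_estimate(theta, num_clients, steps, noise_std, seed=0):
    """Average the local estimates of all clients (assumed aggregation rule)."""
    rng = random.Random(seed)
    estimates = [
        local_least_squares(simulate_client(theta, steps, noise_std, rng))
        for _ in range(num_clients)
    ]
    return sum(estimates) / len(estimates)


theta_true = 0.8
theta_hat = federated_estimate(theta_true, num_clients=20, steps=200, noise_std=0.1)
```

Averaging over more clients shrinks the variance of the estimate, which mirrors the abstract's claim that convergence improves with the number of participating clients.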

[LG-45] Fragment-Wise Interpretability in Graph Neural Networks via Molecule Decomposition and Contribution Analysis

Link: https://arxiv.org/abs/2508.15015
Authors: Sebastian Musiał,Bartosz Zieliński,Tomasz Danel
Subjects: Machine Learning (cs.LG)

Abstract:Graph neural networks have demonstrated remarkable success in predicting molecular properties by leveraging the rich structural information encoded in molecular graphs. However, their black-box nature reduces interpretability, which limits trust in their predictions for important applications such as drug discovery and materials design. Furthermore, existing explanation techniques often fail to reliably quantify the contribution of individual atoms or substructures due to the entangled message-passing dynamics. We introduce SEAL (Substructure Explanation via Attribution Learning), a new interpretable graph neural network that attributes model predictions to meaningful molecular subgraphs. SEAL decomposes input graphs into chemically relevant fragments and estimates their causal influence on the output. The strong alignment between fragment contributions and model predictions is achieved by explicitly reducing inter-fragment message passing in our proposed model architecture. Extensive evaluations on synthetic benchmarks and real-world molecular datasets demonstrate that SEAL outperforms other explainability methods in both quantitative attribution metrics and human-aligned interpretability. A user study further confirms that SEAL provides more intuitive and trustworthy explanations to domain experts. By bridging the gap between predictive performance and interpretability, SEAL offers a promising direction for more transparent and actionable molecular modeling.

[LG-46] TOAST: Fast and scalable auto-partitioning based on principled static analysis

Link: https://arxiv.org/abs/2508.15010
Authors: Sami Alabed,Dominik Grewe,Norman Alexander Rink,Timur Sitdikov,Agnieszka Swietlik,Dimitrios Vytiniotis,Daniel Belov
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Partitioning large machine learning models across distributed accelerator systems is a complex process, requiring a series of interdependent decisions that are further complicated by internal sharding ambiguities. Consequently, existing auto-partitioners often suffer from out-of-memory errors or are prohibitively slow when exploring the exponentially large space of possible partitionings. To mitigate this, they artificially restrict the search space, but this approach frequently yields infeasible solutions that violate device memory constraints or lead to sub-optimal performance. We propose a system that combines a novel static compiler analysis with a Monte Carlo Tree Search. Our analysis constructs an efficient decision space by identifying (i) tensor dimensions requiring identical sharding, and (ii) partitioning “conflicts” that require resolution. Our system significantly outperforms state-of-the-art industrial methods across diverse hardware platforms and model architectures, discovering previously unknown, superior solutions, and the process is fully automated even for complex and large models.

[LG-47] Generative Neural Operators of Log-Complexity Can Simultaneously Solve Infinitely Many Convex Programs

Link: https://arxiv.org/abs/2508.14995
Authors: Anastasis Kratsios,Ariel Neufeld,Philipp Schmocker
Subjects: Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC); Computational Finance (q-fin.CP)

Abstract:Neural operators (NOs) are a class of deep learning models designed to simultaneously solve infinitely many related problems by casting them into an infinite-dimensional space, whereon these NOs operate. A significant gap remains between theory and practice: worst-case parameter bounds from universal approximation theorems suggest that NOs may require an unrealistically large number of parameters to solve most operator learning problems, which stands in direct opposition to a slew of experimental evidence. This paper closes that gap for a specific class of NOs, generative equilibrium operators (GEOs), using (realistic) finite-dimensional deep equilibrium layers, when solving families of convex optimization problems over a separable Hilbert space X. Here, the inputs are smooth, convex loss functions on X, and outputs are the associated (approximate) solutions to the optimization problem defined by each input loss. We show that when the input losses lie in suitable infinite-dimensional compact sets, our GEO can uniformly approximate the corresponding solutions to arbitrary precision, with rank, depth, and width growing only logarithmically in the reciprocal of the approximation error. We then validate both our theoretical results and the trainability of GEOs on three applications: (1) nonlinear PDEs, (2) stochastic optimal control problems, and (3) hedging problems in mathematical finance under liquidity constraints.

[LG-48] Aura-CAPTCHA: A Reinforcement Learning and GAN-Enhanced Multi-Modal CAPTCHA System

Link: https://arxiv.org/abs/2508.14976
Authors: Joydeep Chandra,Prabal Manhas,Ramanjot Kaur,Rashi Sahay
Subjects: Machine Learning (cs.LG)

Abstract:Aura-CAPTCHA was developed as a multi-modal CAPTCHA system to address vulnerabilities in traditional methods that are increasingly bypassed by AI technologies, such as Optical Character Recognition (OCR) and adversarial image processing. The design integrated Generative Adversarial Networks (GANs) for generating dynamic image challenges, Reinforcement Learning (RL) for adaptive difficulty tuning, and Large Language Models (LLMs) for creating text and audio prompts. Visual challenges included 3x3 grid selections with at least three correct images, while audio challenges combined randomized numbers and words into a single task. RL adjusted difficulty based on incorrect attempts, response time, and suspicious user behavior. Evaluations on real-world traffic demonstrated a 92% human success rate and a 10% bot bypass rate, significantly outperforming existing CAPTCHA systems. The system provided a robust and scalable approach for securing online applications while remaining accessible to users, addressing gaps highlighted in previous research.

[LG-49] CuMoLoS-MAE: A Masked Autoencoder for Remote Sensing Data Reconstruction

Link: https://arxiv.org/abs/2508.14957
Authors: Anurup Naskar,Nathanael Zhixin Wong,Sara Shamekh
Subjects: Machine Learning (cs.LG); Atmospheric and Oceanic Physics (physics.ao-ph)
Comments: 4 pages, 2 figures

Abstract:Accurate atmospheric profiles from remote sensing instruments such as Doppler Lidar, Radar, and radiometers are frequently corrupted by low-SNR (Signal to Noise Ratio) gates, range folding, and spurious discontinuities. Traditional gap filling blurs fine-scale structures, whereas deep models lack confidence estimates. We present CuMoLoS-MAE, a Curriculum-Guided Monte Carlo Stochastic Ensemble Masked Autoencoder designed to (i) restore fine-scale features such as updraft and downdraft cores, shear lines, and small vortices, (ii) learn a data-driven prior over atmospheric fields, and (iii) quantify pixel-wise uncertainty. During training, CuMoLoS-MAE employs a mask-ratio curriculum that forces a ViT decoder to reconstruct from progressively sparser context. At inference, we approximate the posterior predictive by Monte Carlo over random mask realisations, evaluating the MAE multiple times and aggregating the outputs to obtain the posterior predictive mean reconstruction together with a finely resolved per-pixel uncertainty map. Together with high-fidelity reconstruction, this novel deep learning-based workflow enables enhanced convection diagnostics, supports real-time data assimilation, and improves long-term climate reanalysis.
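The inference-time procedure, repeated reconstruction under random masks followed by per-pixel aggregation, can be sketched on a one-dimensional profile. The `reconstruct` function is a deliberately crude stand-in (it fills masked gates with the mean of the visible ones) for the trained masked autoencoder, and the mask ratio and sample count are arbitrary assumptions.

```python
import random


def reconstruct(profile, masked):
    """Toy stand-in for the MAE: fill masked gates with the mean of visible ones."""
    visible = [v for i, v in enumerate(profile) if i not in masked]
    fill = sum(visible) / len(visible)
    return [fill if i in masked else v for i, v in enumerate(profile)]


def monte_carlo_reconstruction(profile, mask_ratio, num_samples, seed=0):
    """Return per-gate mean and variance over reconstructions from random masks."""
    rng = random.Random(seed)
    n = len(profile)
    num_masked = int(mask_ratio * n)
    runs = []
    for _ in range(num_samples):
        masked = set(rng.sample(range(n), num_masked))
        runs.append(reconstruct(profile, masked))
    mean = [sum(run[i] for run in runs) / num_samples for i in range(n)]
    variance = [
        sum((run[i] - mean[i]) ** 2 for run in runs) / num_samples for i in range(n)
    ]
    return mean, variance


# Hypothetical radial-velocity profile with one sharp feature.
profile = [0.1, 0.2, 0.1, 2.5, 0.2, 0.1, 0.3, 0.2]
mean_profile, uncertainty = monte_carlo_reconstruction(profile, 0.5, num_samples=50)
```

Gates whose reconstruction changes a lot from mask to mask receive a high variance, which is the kind of per-pixel uncertainty map the abstract describes.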

[LG-50] XAI-Driven Spectral Analysis of Cough Sounds for Respiratory Disease Characterization

Link: https://arxiv.org/abs/2508.14949
Authors: Patricia Amado-Caballero,Luis Miguel San-José-Revuelta,María Dolores Aguilar-García,José Ramón Garmendia-Leiza,Carlos Alberola-López,Pablo Casaseca-de-la-Higuera
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)

Abstract:This paper proposes an eXplainable Artificial Intelligence (XAI)-driven methodology to enhance the understanding of cough sound analysis for respiratory disease management. We employ occlusion maps to highlight relevant spectral regions in cough spectrograms processed by a Convolutional Neural Network (CNN). Subsequently, spectral analysis of spectrograms weighted by these occlusion maps reveals significant differences between disease groups, particularly in patients with COPD, where cough patterns appear more variable in the identified spectral regions of interest. This contrasts with the lack of significant differences observed when analyzing raw spectrograms. The proposed approach extracts and analyzes several spectral features, demonstrating the potential of XAI techniques to uncover disease-specific acoustic signatures and improve the diagnostic capabilities of cough sound analysis by providing more interpretable results.
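An occlusion map is obtained by blanking out one patch of the input at a time and recording how much the model's score drops. The sketch below shows the mechanism on a tiny spectrogram; `band_energy_score` is a toy scorer standing in for the CNN, and the patch size and values are assumptions.

```python
def band_energy_score(spectrogram, band_rows):
    """Toy stand-in for a CNN score: total energy in the selected frequency rows."""
    return sum(sum(spectrogram[r]) for r in band_rows)


def occlusion_map(spectrogram, score_fn, patch=1):
    """Score drop caused by zeroing each patch, written back to every cell of it."""
    rows, cols = len(spectrogram), len(spectrogram[0])
    baseline = score_fn(spectrogram)
    heat = [[0.0] * cols for _ in range(rows)]
    for r0 in range(0, rows, patch):
        for c0 in range(0, cols, patch):
            occluded = [row[:] for row in spectrogram]
            cells = [
                (r, c)
                for r in range(r0, min(r0 + patch, rows))
                for c in range(c0, min(c0 + patch, cols))
            ]
            for r, c in cells:
                occluded[r][c] = 0.0
            drop = baseline - score_fn(occluded)
            for r, c in cells:
                heat[r][c] = drop
    return heat


# Rows are frequency bins, columns are time frames (hypothetical values).
spec = [[1.0, 2.0, 1.0], [0.5, 0.5, 0.5], [3.0, 4.0, 5.0]]
heat = occlusion_map(spec, lambda s: band_energy_score(s, band_rows=[2]))
```

Multiplying the spectrogram element-wise by such a map would emphasise the regions the model relies on, which is the weighting step the abstract applies before spectral analysis.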

[LG-51] Large Foundation Model for Ads Recommendation

Link: https://arxiv.org/abs/2508.14948
Authors: Shangyu Zhang,Shijie Quan,Zhongren Wang,Junwei Pan,Tianqu Zhuang,Bo Fu,Yilong Sun,Jieying Lin,Jushuo Chen,Xiaotian Li,Zhixiang Feng,Xian Hu,Huiting Deng,Hua Lu,Jinpeng Wang,Boqi Dai,Xiaoyu Chen,Bin Hu,Lili Huang,Yanwen Wu,Yeshou Cai,Qi Zhou,Huang Tang,Chunfeng Yang,Chengguo Yin,Tingyu Jiang,Lifeng Wang,Shudong Huang,Dapeng Liu,Lei Xiao,Haijie Gu,Shu-Tao Xia,Jie Jiang
Subjects: Machine Learning (cs.LG)

Abstract:Online advertising relies on accurate recommendation models, with recent advances using pre-trained large-scale foundation models (LFMs) to capture users’ general interests across multiple scenarios and tasks. However, existing methods have critical limitations: they extract and transfer only user representations (URs), ignoring valuable item representations (IRs) and user-item cross representations (CRs); and they simply use a UR as a feature in downstream applications, which fails to bridge upstream-downstream gaps and overlooks more transfer granularities. In this paper, we propose LFM4Ads, an All-Representation Multi-Granularity transfer framework for ads recommendation. It first comprehensively transfers URs, IRs, and CRs, i.e., all available representations in the pre-trained foundation model. To effectively utilize the CRs, it identifies the optimal extraction layer and aggregates them into transferable coarse-grained forms. Furthermore, we enhance the transferability via multi-granularity mechanisms: non-linear adapters for feature-level transfer, an Isomorphic Interaction Module for module-level transfer, and Standalone Retrieval for model-level transfer. LFM4Ads has been successfully deployed in Tencent’s industrial-scale advertising platform, processing tens of billions of daily samples while maintaining terabyte-scale model parameters with billions of sparse embedding keys across approximately two thousand features. Since its production deployment in Q4 2024, LFM4Ads has achieved 10+ successful production launches across various advertising scenarios, including primary ones like Weixin Moments and Channels. These launches achieve an overall GMV lift of 2.45% across the entire platform, translating to estimated annual revenue increases in the hundreds of millions of dollars.

[LG-52] Linear Preference Optimization: Decoupled Gradient Control via Absolute Regularization

Link: https://arxiv.org/abs/2508.14947
Authors: Rui Wang,Qianguo Sun,Chao Song,Junlong Wu,Tianrong Chen,Zhiyun Zeng,Yu Li
Categories: Machine Learning (cs.LG)

Abstract:DPO (Direct Preference Optimization) has become a widely used offline preference optimization algorithm due to its simplicity and training stability. However, DPO is prone to overfitting and collapse. To address these challenges, we propose Linear Preference Optimization (LPO), a novel alignment framework featuring three key innovations. First, we introduce gradient decoupling by replacing the log-sigmoid function with an absolute difference loss, thereby isolating the optimization dynamics. Second, we improve stability through an offset constraint combined with a positive regularization term to preserve the chosen response quality. Third, we implement controllable rejection suppression using gradient separation with straightforward estimation and a tunable coefficient that linearly regulates the descent of the rejection probability. Through extensive experiments, we demonstrate that LPO consistently improves performance on various tasks, including general text tasks, math tasks, and text-to-speech (TTS) tasks. These results establish LPO as a robust and tunable paradigm for preference alignment, and we release the source code, models, and training data publicly.
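
To make the gradient-decoupling idea concrete, the sketch below contrasts the standard DPO log-sigmoid loss with a plain absolute-difference loss on the same preference margin. The abstract does not give LPO's exact objective, so `abs_margin_loss`, its `target` offset, and the `beta` value are illustrative assumptions, not the authors' formulation. The only point shown is that the log-sigmoid gradient shrinks as the margin grows, while an absolute-difference loss has a constant-magnitude gradient.

```python
import math


def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)).

    `margin` is the chosen log-ratio minus the rejected log-ratio,
    each log-ratio taken between the policy and the reference model.
    """
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_grad(margin: float, beta: float = 0.1) -> float:
    """Derivative of dpo_loss w.r.t. margin: -beta * sigmoid(-beta * margin)."""
    return -beta / (1.0 + math.exp(beta * margin))


def abs_margin_loss(margin: float, target: float = 1.0) -> float:
    """Hypothetical absolute-difference loss |margin - target| (an assumption)."""
    return abs(margin - target)


def abs_margin_grad(margin: float, target: float = 1.0) -> float:
    """Subgradient of abs_margin_loss: constant magnitude away from the target."""
    if margin < target:
        return -1.0
    if margin > target:
        return 1.0
    return 0.0


if __name__ == "__main__":
    for m in (-5.0, 0.0, 5.0, 50.0):
        print(m, dpo_grad(m), abs_margin_grad(m, target=100.0))
```

Under these assumptions, the log-sigmoid gradient depends on how far the margin has already moved, whereas the absolute-difference gradient does not, which is one plausible reading of "isolating the optimization dynamics".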

[LG-53] HHNAS-AM: Hierarchical Hybrid Neural Architecture Search using Adaptive Mutation Policies

Link: https://arxiv.org/abs/2508.14946
Authors: Anurag Tripathi,Ajeet Kumar Singh,Rajsabi Surya,Aum Gupta,Sahiinii Lemaina Veikho,Dorien Herremans,Sudhir Bisane
Categories: Machine Learning (cs.LG)

Abstract:Neural Architecture Search (NAS) has garnered significant research interest due to its capability to discover architectures superior to manually designed ones. Learning text representations is crucial for text classification and other language-related tasks. Existing NAS models for text classification lack a hybrid hierarchical structure and place no restrictions on the architecture, so the search space becomes very large and mostly redundant, and existing RL models cannot navigate it effectively. A flat architecture search also leads to an unorganised search space that is difficult to traverse. To address this, we propose HHNAS-AM (Hierarchical Hybrid Neural Architecture Search with Adaptive Mutation Policies), a novel approach that efficiently explores diverse architectural configurations. We introduce a few architectural templates that organise the search space, with the search spaces designed on the basis of domain-specific cues. Our method employs mutation strategies that adapt dynamically, via Q-learning, to performance feedback from previous iterations, enabling a more effective and accelerated traversal of the search space. The proposed model is fully probabilistic, enabling effective exploration of the search space. We evaluate our approach on the database id (db_id) prediction task, where it consistently discovers high-performing architectures across multiple experiments. On the Spider dataset, our method achieves an 8% improvement in test accuracy over existing baselines.
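
As a rough illustration of how Q-learning can adapt the choice of mutation operator to performance feedback, here is a minimal tabular sketch. The class name, the operator names, the state encoding, and the hyperparameters are all hypothetical; the abstract does not describe the paper's actual state, action, or reward design.

```python
import random
from collections import defaultdict


class MutationPolicy:
    """Tabular Q-learning over a small set of mutation operators (hypothetical)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
        self.actions = list(actions)
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.rng = random.Random(seed)

    def choose(self, state):
        """Epsilon-greedy choice of the next mutation operator."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        """Standard Q-learning update; reward could be an accuracy gain."""
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])


if __name__ == "__main__":
    policy = MutationPolicy(["add_layer", "swap_cell", "change_width"])
    op = policy.choose("template_A")
    policy.update("template_A", op, reward=0.02, next_state="template_A")
    print(op, dict(policy.q))
```

Operators that raised validation accuracy in earlier iterations accumulate higher Q-values and are picked more often, while the epsilon term keeps some exploration.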

[LG-54] Structure-Aware Temporal Modeling for Chronic Disease Progression Prediction

Link: https://arxiv.org/abs/2508.14942
Authors: Jiacheng Hu,Bo Zhang,Ting Xu,Haifeng Yang,Min Gao
Categories: Machine Learning (cs.LG)

Abstract:This study addresses the challenges of symptom evolution complexity and insufficient temporal dependency modeling in Parkinson’s disease progression prediction. It proposes a unified prediction framework that integrates structural perception and temporal modeling. The method leverages graph neural networks to model the structural relationships among multimodal clinical symptoms and introduces graph-based representations to capture semantic dependencies between symptoms. It also incorporates a Transformer architecture to model dynamic temporal features during disease progression. To fuse structural and temporal information, a structure-aware gating mechanism is designed to dynamically adjust the fusion weights between structural encodings and temporal features, enhancing the model’s ability to identify key progression stages. To improve classification accuracy and stability, the framework includes a multi-component modeling pipeline, consisting of a graph construction module, a temporal encoding module, and a prediction output layer. The model is evaluated on real-world longitudinal Parkinson’s disease data. The experiments involve comparisons with mainstream models, sensitivity analysis of hyperparameters, and graph connection density control. Results show that the proposed method outperforms existing approaches in AUC, RMSE, and IPW-F1 metrics. It effectively distinguishes progression stages and improves the model’s ability to capture personalized symptom trajectories. The overall framework demonstrates strong generalization and structural scalability, providing reliable support for intelligent modeling of chronic progressive diseases such as Parkinson’s disease.

[LG-55] Cohort-Aware Agents for Individualized Lung Cancer Risk Prediction Using a Retrieval-Augmented Model Selection Framework

Link: https://arxiv.org/abs/2508.14940
Authors: Chongyu Qu,Allen J. Luna,Thomas Z. Li,Junchao Zhu,Junlin Guo,Juming Xiong,Kim L. Sandler,Bennett A. Landman,Yuankai Huo
Categories: Machine Learning (cs.LG)

Abstract:Accurate lung cancer risk prediction remains challenging due to substantial variability across patient populations and clinical settings – no single model performs best for all cohorts. To address this, we propose a personalized lung cancer risk prediction agent that dynamically selects the most appropriate model for each patient by combining cohort-specific knowledge with modern retrieval and reasoning techniques. Given a patient’s CT scan and structured metadata – including demographic, clinical, and nodule-level features – the agent first performs cohort retrieval using FAISS-based similarity search across nine diverse real-world cohorts to identify the most relevant patient population from a multi-institutional database. Second, a Large Language Model (LLM) is prompted with the retrieved cohort and its associated performance metrics to recommend the optimal prediction algorithm from a pool of eight representative models, including classical linear risk models (e.g., Mayo, Brock), temporally-aware models (e.g., TDVIT, DLSTM), and multi-modal computer vision-based approaches (e.g., Liao, Sybil, DLS, DLI). This two-stage agent pipeline – retrieval via FAISS and reasoning via LLM – enables dynamic, cohort-aware risk prediction personalized to each patient’s profile. Building on this architecture, the agent supports flexible and cohort-driven model selection across diverse clinical populations, offering a practical path toward individualized risk assessment in real-world lung cancer screening.

[LG-56] MCPTox: A Benchmark for Tool Poisoning Attack on Real-World MCP Servers

Link: https://arxiv.org/abs/2508.14925
Authors: Zhiqiang Wang,Yichao Gao,Yanting Wang,Suyuan Liu,Haifeng Sun,Haoran Cheng,Guanquan Shi,Haohua Du,Xiangyang Li
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:By providing a standardized interface for LLM agents to interact with external tools, the Model Context Protocol (MCP) is quickly becoming a cornerstone of the modern autonomous agent ecosystem. However, it creates novel attack surfaces due to untrusted external tools. While prior work has focused on attacks injected through external tool outputs, we investigate a more fundamental vulnerability: Tool Poisoning, where malicious instructions are embedded within a tool’s metadata without execution. To date, this threat has been primarily demonstrated through isolated cases, lacking a systematic, large-scale evaluation. We introduce MCPTox, the first benchmark to systematically evaluate agent robustness against Tool Poisoning in realistic MCP settings. MCPTox is constructed upon 45 live, real-world MCP servers and 353 authentic tools. On top of these, we design three distinct attack templates and use few-shot learning to generate a comprehensive suite of 1312 malicious test cases covering 10 categories of potential risks. Our evaluation of 20 prominent LLM agents reveals a widespread vulnerability to Tool Poisoning, with o1-mini reaching an attack success rate of 72.8%. We find that more capable models are often more susceptible, as the attack exploits their superior instruction-following abilities. Finally, the failure case analysis reveals that agents rarely refuse these attacks, with the highest refusal rate (Claude-3.7-Sonnet) below 3%, demonstrating that existing safety alignment is ineffective against malicious actions that use legitimate tools for unauthorized operations. Our findings create a crucial empirical baseline for understanding and mitigating this widespread threat, and we release MCPTox for the development of verifiably safer AI agents. Our dataset is available at an anonymized repository: this https URL.

[LG-57] Human Feedback Driven Dynamic Speech Emotion Recognition

Link: https://arxiv.org/abs/2508.14920
Authors: Ilya Fedorov,Dmitry Korobchenko
Categories: Sound (cs.SD); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)

Abstract:This work proposes to explore a new area of dynamic speech emotion recognition. Unlike traditional methods, we assume that each audio track is associated with a sequence of emotions active at different moments in time. The study particularly focuses on the animation of emotional 3D avatars. We propose a multi-stage method that includes the training of a classical speech emotion recognition model, synthetic generation of emotional sequences, and further model improvement based on human feedback. Additionally, we introduce a novel approach to modeling emotional mixtures based on the Dirichlet distribution. The models are evaluated based on ground-truth emotions extracted from a dataset of 3D facial animations. We compare our models against the sliding window approach. Our experimental results show the effectiveness of the Dirichlet-based approach in modeling emotional mixtures. Incorporating human feedback further improves the model quality while providing a simplified annotation procedure.
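
For readers unfamiliar with Dirichlet-distributed mixtures, the sketch below draws one emotion mixture (non-negative weights that sum to 1) using only the Python standard library. The emotion label set and the concentration values are hypothetical, and nothing here reproduces the paper's model; it only shows what a point sampled from a Dirichlet distribution looks like.

```python
import random

# Hypothetical label set, for illustration only.
EMOTIONS = ["neutral", "joy", "anger", "sadness"]


def sample_emotion_mixture(alpha, rng):
    """Draw one point on the probability simplex from Dirichlet(alpha).

    Uses the standard construction: normalise independent Gamma(alpha_i, 1) draws.
    """
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def dirichlet_mean(alpha):
    """Expected mixture weights: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha = [2.0, 1.0, 0.5, 0.5]  # assumed concentration parameters
    weights = sample_emotion_mixture(alpha, rng)
    for name, w in zip(EMOTIONS, weights):
        print(f"{name}: {w:.3f}")
```

Larger concentration values pull samples towards the mean mixture, while values below 1 favour mixtures dominated by a single emotion.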

[LG-58] Denoising by neural network for muzzle blast detection

Link: https://arxiv.org/abs/2508.14919
Authors: Hadrien Pujol,Matteo Bevillacqua,Christophe Thirard,Thierry Mazoyer
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: INTER-NOISE 2024, Aug 2024, Nantes (France), France

Abstract:Acoem develops gunshot detection systems, consisting of a microphone array and software that detects and locates shooters on the battlefield. The performance of such systems is obviously affected by the acoustic environment in which they are operating: in particular, when mounted on a moving military vehicle, the presence of noise reduces the detection performance of the software. To limit the influence of the acoustic environment, a neural network has been developed. Instead of using a heavy convolutional neural network, a lightweight neural network architecture was chosen to limit the computational resources required to embed the algorithm on as many hardware platforms as possible. Thanks to the combination of a two-hidden-layer perceptron and appropriate signal processing techniques, the detection rate of impulsive muzzle blast waveforms (the wave coming from the detonation and indicating the position of the shooter) is significantly increased. With an RMS noise value of the same order as the muzzle blast peak amplitude, the detection rate is more than doubled with this denoising processing.

[LG-59] Personalized Recommendations via Active Utility-based Pairwise Sampling

Link: https://arxiv.org/abs/2508.14911
Authors: Bahar Boroomand,James R. Wright
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Recommender systems play a critical role in enhancing user experience by providing personalized suggestions based on user preferences. Traditional approaches often rely on explicit numerical ratings or assume access to fully ranked lists of items. However, ratings frequently fail to capture true preferences due to users’ behavioral biases and subjective interpretations of rating scales, while eliciting full rankings is demanding and impractical. To overcome these limitations, we propose a generalized utility-based framework that learns preferences from simple and intuitive pairwise comparisons. Our approach is model-agnostic and designed to optimize for arbitrary, task-specific utility functions, allowing the system’s objective to be explicitly aligned with the definition of a high-quality outcome in any given application. A central contribution of our work is a novel utility-based active sampling strategy for preference elicitation. This method selects queries that are expected to provide the greatest improvement to the utility of the final recommended outcome. We ground our preference model in the probabilistic Plackett-Luce framework for pairwise data. To demonstrate the versatility of our approach, we present two distinct experiments: first, an implementation using matrix factorization for a classic movie recommendation task, and second, an implementation using a neural network for a complex candidate selection scenario in university admissions. Experimental results demonstrate that our framework provides a more accurate, data-efficient, and user-centric paradigm for personalized ranking.
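
For pairwise data, the Plackett-Luce model reduces to the Bradley-Terry form, in which the probability that item i is preferred over item j is exp(u_i) / (exp(u_i) + exp(u_j)). The sketch below computes that probability and the negative log-likelihood of a set of observed comparisons. The item names and utility values are made up, and the paper's active sampling strategy is not shown.

```python
import math


def pairwise_prob(u_i: float, u_j: float) -> float:
    """P(i preferred over j) = exp(u_i) / (exp(u_i) + exp(u_j)) = sigmoid(u_i - u_j)."""
    d = u_i - u_j
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)  # avoids overflow when d is very negative
    return e / (1.0 + e)


def neg_log_likelihood(utilities, comparisons):
    """Negative log-likelihood of (winner, loser) pairs given item utilities."""
    return -sum(
        math.log(pairwise_prob(utilities[w], utilities[l])) for w, l in comparisons
    )


if __name__ == "__main__":
    utilities = {"film_a": 1.2, "film_b": 0.3, "film_c": -0.5}  # hypothetical
    comparisons = [("film_a", "film_b"), ("film_a", "film_c"), ("film_b", "film_c")]
    print(pairwise_prob(utilities["film_a"], utilities["film_b"]))
    print(neg_log_likelihood(utilities, comparisons))
```

Fitting the utilities amounts to minimising this negative log-likelihood, whether the utilities come from matrix factorization or from a neural network.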

[LG-60] Closing the Performance Gap in Generative Recommenders with Collaborative Tokenization and Efficient Modeling

Link: https://arxiv.org/abs/2508.14910
Authors: Simon Lepage,Jeremie Mary,David Picard
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Code coming soon

Abstract:Recent work has explored generative recommender systems as an alternative to traditional ID-based models, reframing item recommendation as a sequence generation task over discrete item tokens. While promising, such methods often underperform in practice compared to well-tuned ID-based baselines like SASRec. In this paper, we identify two key limitations holding back generative approaches: the lack of collaborative signal in item tokenization, and inefficiencies in the commonly used encoder-decoder architecture. To address these issues, we introduce COSETTE, a contrastive tokenization method that integrates collaborative information directly into the learned item representations, jointly optimizing for both content reconstruction and recommendation relevance. Additionally, we propose MARIUS, a lightweight, audio-inspired generative model that decouples timeline modeling from item decoding. MARIUS reduces inference cost while improving recommendation accuracy. Experiments on standard sequential recommendation benchmarks show that our approach narrows, or even eliminates, the performance gap between generative and modern ID-based models, while retaining the benefits of the generative paradigm.

[LG-61] End-to-End Analysis of Charge Stability Diagrams with Transformers

Link: https://arxiv.org/abs/2508.15710
Authors: Rahul Marchand,Lucas Schorling,Cornelius Carlsson,Jonas Schuff,Barnaby van Straaten,Taylor L. Patti,Federico Fedele,Joshua Ziegler,Parth Girdhar,Pranav Vaidhyanathan,Natalia Ares
Categories: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Quantum Physics (quant-ph)
Comments: 8 pages, 2 figures, RM and LS contributed equally

Abstract:Transformer models and end-to-end learning frameworks are rapidly revolutionizing the field of artificial intelligence. In this work, we apply object detection transformers to analyze charge stability diagrams in semiconductor quantum dot arrays, a key task for achieving scalability with spin-based quantum computing. Specifically, our model identifies triple points and their connectivity, which is crucial for virtual gate calibration, charge state initialization, drift correction, and pulse sequencing. We show that it surpasses convolutional neural networks in performance on three different spin qubit architectures, all without the need for retraining. In contrast to existing approaches, our method significantly reduces complexity and runtime, while enhancing generalizability. The results highlight the potential of transformer-based end-to-end learning frameworks as a foundation for a scalable, device- and architecture-agnostic tool for control and tuning of quantum dot devices.

[LG-62] Effect Identification and Unit Categorization in the Multi-Score Regression Discontinuity Design with Application to LED Manufacturing

Link: https://arxiv.org/abs/2508.15692
Authors: Philipp Alexander Schwarz,Oliver Schacht,Sven Klaassen,Johannes Oberpriller,Martin Spindler
Categories: Methodology (stat.ME); Machine Learning (cs.LG); Econometrics (econ.EM)

Abstract:The RDD (regression discontinuity design) is a widely used framework for identification and estimation of causal effects at a cutoff of a single running variable. Practical settings, in particular those encountered in production systems, often involve decision-making defined by multiple thresholds and criteria. Common MRD (multi-score RDD) approaches reduce these to a one-dimensional design so that existing identification and estimation results can be applied. However, this practice can introduce non-compliant behavior. We develop theoretical tools to identify and reduce some of this “fuzziness” when estimating the cutoff-effect on compliers of sub-rules. We provide a sound definition and categorization of unit behavior types for multi-dimensional cutoff-rules, extending existing categorizations. We identify conditions for the existence and identification of the cutoff-effect on compliers in multiple dimensions, and specify when identification remains stable after excluding never-takers and always-takers. Further, we investigate how decomposing cutoff-rules into simpler parts alters the unit behavior. This allows identification and removal of non-compliant units, potentially improving estimates. We validate our framework on simulated and real-world data from opto-electronic semiconductor manufacturing. Our empirical results demonstrate the usability for refining production policies. In particular, we show that our approach decreases the estimation variance, highlighting the practical value of the MRD framework in manufacturing.

[LG-63] Tree-like Pairwise Interaction Networks

Link: https://arxiv.org/abs/2508.15678
Authors: Ronald Richman,Salvatore Scognamiglio,Mario V. Wüthrich
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Applications (stat.AP)

Abstract:Modeling feature interactions in tabular data remains a key challenge in predictive modeling, for example, as used for insurance pricing. This paper proposes the Tree-like Pairwise Interaction Network (PIN), a novel neural network architecture that explicitly captures pairwise feature interactions through a shared feed-forward neural network architecture that mimics the structure of decision trees. PIN enables intrinsic interpretability by design, allowing for direct inspection of interaction effects. Moreover, it allows for efficient SHapley Additive exPlanations (SHAP) computations because it only involves pairwise interactions. We highlight connections between PIN and established models such as GA2Ms, gradient boosting machines, and graph neural networks. Empirical results on the popular French motor insurance dataset show that PIN outperforms both traditional and modern neural network benchmarks in predictive accuracy, while also providing insight into how features interact with each other and how they contribute to the predictions.

[LG-64] Bayesian Optimization with Expected Improvement: No Regret and the Choice of Incumbent

Link: https://arxiv.org/abs/2508.15674
Authors: Jingyi Wang,Haowei Wang,Szu Hui Ng,Cosmin G. Petra
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:Expected improvement (EI) is one of the most widely used acquisition functions in Bayesian optimization (BO). Despite its proven empirical success in applications, the cumulative regret upper bound of EI remains an open question. In this paper, we analyze the classic noisy Gaussian process expected improvement (GP-EI) algorithm. We consider the Bayesian setting, where the objective is a sample from a GP. Three commonly used incumbents, namely the best posterior mean incumbent (BPMI), the best sampled posterior mean incumbent (BSPMI), and the best observation incumbent (BOI), are considered as the choices of the current best value in GP-EI. We present for the first time the cumulative regret upper bounds of GP-EI with BPMI and BSPMI. Importantly, we show that in both cases, GP-EI is a no-regret algorithm for both squared exponential (SE) and Matérn kernels. Further, we show for the first time that GP-EI with BOI either achieves a sublinear cumulative regret upper bound or has a fast converging noisy simple regret bound for SE and Matérn kernels. Our results provide theoretical guidance on the choice of incumbent when practitioners apply GP-EI in the noisy setting. Numerical experiments are conducted to validate our findings.
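
The role of the incumbent is easiest to see in the closed-form EI expression. The sketch below uses the maximization convention, EI = (mu - xi) * Phi(z) + sigma * phi(z) with z = (mu - xi) / sigma, where xi is the incumbent; the paper may use a different sign convention. The helper names are hypothetical. BOI and BSPMI are shown as simple maxima over already-sampled points, while BPMI would require maximising the posterior mean over the whole domain and is omitted.

```python
import math


def norm_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, incumbent: float) -> float:
    """Closed-form EI for maximization, given GP posterior mean and std at a point."""
    if sigma <= 0.0:
        return max(mu - incumbent, 0.0)
    z = (mu - incumbent) / sigma
    return (mu - incumbent) * norm_cdf(z) + sigma * norm_pdf(z)


def best_observation_incumbent(noisy_observations):
    """BOI: the best noisy observation seen so far."""
    return max(noisy_observations)


def best_sampled_posterior_mean_incumbent(posterior_means_at_samples):
    """BSPMI: the best posterior mean among already-sampled inputs."""
    return max(posterior_means_at_samples)


if __name__ == "__main__":
    observations = [0.8, 1.4, 1.1]          # made-up noisy values
    means_at_samples = [0.9, 1.2, 1.0]      # made-up posterior means
    for xi in (best_observation_incumbent(observations),
               best_sampled_posterior_mean_incumbent(means_at_samples)):
        print(xi, expected_improvement(mu=1.3, sigma=0.4, incumbent=xi))
```

With noisy data, the best observation can overstate the true best value, which lowers EI everywhere; this is one intuition for why the choice of incumbent matters.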

[LG-65] High-dimensional Asymptotics of Generalization Performance in Continual Ridge Regression

Link: https://arxiv.org/abs/2508.15494
Authors: Yihan Zhao,Wenqing Su,Ying Yang
Categories: Methodology (stat.ME); Machine Learning (cs.LG)

Abstract:Continual learning is motivated by the need to adapt to real-world dynamics in tasks and data distribution while mitigating catastrophic forgetting. Despite significant advances in continual learning techniques, the theoretical understanding of their generalization performance lags behind. This paper examines the theoretical properties of continual ridge regression in high-dimensional linear models, where the dimension is proportional to the sample size in each task. Using random matrix theory, we derive exact expressions of the asymptotic prediction risk, thereby enabling the characterization of three evaluation metrics of generalization performance in continual learning: average risk, backward transfer, and forward transfer. Furthermore, we present the theoretical risk curves to illustrate the trends in these evaluation metrics throughout the continual learning process. Our analysis reveals several intriguing phenomena in the risk curves, demonstrating how model specifications influence the generalization performance. Simulation studies are conducted to validate our theoretical findings.

[LG-66] JEDI-linear: Fast and Efficient Graph Neural Networks for Jet Tagging on FPGAs

Link: https://arxiv.org/abs/2508.15468
Authors: Zhiqiang Que,Chang Sun,Sudarshan Paramesvaran,Emyr Clement,Katerina Karakoulaki,Christopher Brown,Lauri Laatu,Arianna Cox,Alexander Tapper,Wayne Luk,Maria Spiropulu
Categories: High Energy Physics - Experiment (hep-ex); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments: 10 pages, 9 figures

Abstract:Graph Neural Networks (GNNs), particularly Interaction Networks (INs), have shown exceptional performance for jet tagging at the CERN High-Luminosity Large Hadron Collider (HL-LHC). However, their computational complexity and irregular memory access patterns pose significant challenges for deployment on FPGAs in hardware trigger systems, where strict latency and resource constraints apply. In this work, we propose JEDI-linear, a novel GNN architecture with linear computational complexity that eliminates explicit pairwise interactions by leveraging shared transformations and global aggregation. To further enhance hardware efficiency, we introduce fine-grained quantization-aware training with per-parameter bitwidth optimization and employ multiplier-free multiply-accumulate operations via distributed arithmetic. Evaluation results show that our FPGA-based JEDI-linear achieves 3.7 to 11.5 times lower latency, up to 150 times lower initiation interval, and up to 6.2 times lower LUT usage compared to state-of-the-art designs while also delivering higher model accuracy and eliminating the need for DSP blocks entirely. In contrast, state-of-the-art solutions consume over 8,700 DSPs. This is the first interaction-based GNN to achieve less than 60 ns latency and currently meets the requirements for use in the HL-LHC CMS Level-1 trigger system. This work advances next-generation trigger systems by enabling accurate, scalable, and resource-efficient GNN inference in real-time environments. Our open-sourced templates will further support reproducibility and broader adoption across scientific applications.

[LG-67] Bayesian Inference and Learning in Nonlinear Dynamical Systems: A Framework for Incorporating Explicit and Implicit Prior Knowledge

Link: https://arxiv.org/abs/2508.15345
Authors: Björn Volkmann,Jan-Hendrik Ewering,Michael Meindl,Simon F. G. Ehlers,Thomas Seel
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 16 pages, Preprint submitted to Automatica

Abstract:Accuracy and generalization capabilities are key objectives when learning dynamical system models. To obtain such models from limited data, current works exploit prior knowledge and assumptions about the system. However, the fusion of diverse prior knowledge, e.g., partially known system equations and smoothness assumptions about unknown model parts, with information contained in the data remains a challenging problem, especially in input-output settings with latent system state. In particular, learning functions that are nested inside known system equations can be a laborious and error-prone expert task. This paper considers inference of latent states and learning of unknown model parts for fusion of data information with different sources of prior knowledge. The main contribution is a general-purpose system identification tool that, for the first time, provides a consistent solution for both online and offline Bayesian inference and learning while allowing the incorporation of explicit and implicit prior system knowledge. We propose a novel interface for combining known dynamics functions with a learning-based approximation of unknown system parts. Based on the proposed model structure, closed-form densities for efficient parameter marginalization are derived. No user-tailored coordinate transformations or model inversions are needed, making the presented framework a general-purpose tool for inference and learning. The broad applicability of the devised framework is illustrated in three distinct case studies, including an experimental data set.

[LG-68] Flow Matching at Scale: A Machine Learning Framework for Efficient Large-Size Sampling of Many-Body Systems

Link: https://arxiv.org/abs/2508.15318
Authors: Qian-Rui Lee,Daw-Wei Wang
Categories: Statistical Mechanics (cond-mat.stat-mech); Machine Learning (cs.LG)

Abstract:We propose a machine learning framework based on Flow Matching to overcome the scaling limitations of Markov Chain Monte Carlo (MCMC) methods. We demonstrate its capability in the 2D XY model, where a single network, trained only on configurations from a small ($32\times 32$) lattice at sparse temperature points, generates reliable samples for a significantly larger system ($128\times 128$) across a continuous temperature range without retraining. The generated configurations show strong agreement with key thermodynamic observables and correctly capture the signatures of the Berezinskii-Kosterlitz-Thouless (BKT) transition. This dual generalization is enabled by the Flow Matching framework, which allows us to learn a continuous, temperature-conditioned mapping. At the same time, the inductive biases of the underlying CNN architecture ensure that the learned local physical rules are scale-invariant. This “train-small, generate-large” capability establishes a new paradigm for efficiently studying critical phenomena, offering a significant computational advantage for exploring the thermodynamic limit. The method can be directly applied to other classical or quantum many-body systems described by continuous fields on a lattice.
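
As a reminder of what a temperature-conditioned Flow Matching objective looks like, here is a minimal sketch that assumes the common straight-line path x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0. The abstract does not state which path or network the authors use, so the function names and the `velocity_model` callable are hypothetical. The example also treats spins as plain real numbers and ignores the periodicity of XY angles.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path between a noise sample x0 and a data sample x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path, constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_model, x0, x1, t, temperature):
    """Mean squared error between the predicted and the target velocity.

    `velocity_model(x_t, t, temperature)` stands in for the network,
    which would be a CNN in the setting described above.
    """
    xt = interpolate(x0, x1, t)
    predicted = velocity_model(xt, t, temperature)
    target = target_velocity(x0, x1)
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    x0 = [0.0, 0.0]   # noise sample (flattened lattice, made up)
    x1 = [2.0, 4.0]   # data sample at some temperature (made up)
    zero_model = lambda xt, t, temperature: [0.0 for _ in xt]
    print(flow_matching_loss(zero_model, x0, x1, t=0.5, temperature=0.9))
```

Because the temperature enters only as a conditioning input, a trained model can be queried at temperatures between the training points, and a convolutional model can be applied to a larger lattice than it was trained on.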

[LG-69] GEN2: A Generative Prediction-Correction Framework for Long-time Emulations of Spatially-Resolved Climate Extremes

Link: https://arxiv.org/abs/2508.15196
Authors: Mengze Wang,Benedikt Barthel Sorensen,Themistoklis Sapsis
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG); Chaotic Dynamics (nlin.CD)

Abstract:Accurately quantifying the increased risks of climate extremes requires generating large ensembles of climate realizations across a wide range of emissions scenarios, which is computationally challenging for conventional Earth System Models. We propose GEN2, a generative prediction-correction framework for efficient and accurate forecasting of extreme-event statistics. The prediction step is constructed as a conditional Gaussian emulator, followed by a non-Gaussian machine-learning (ML) correction step. The ML model is trained on pairs of the reference data and the emulated fields nudged towards the reference, to ensure the training is robust to chaos. We first validate the accuracy of our model on historical ERA5 data and then demonstrate the extrapolation capabilities on various future climate change scenarios. When trained on a single realization of one warming scenario, our model accurately predicts the statistics of extreme events in different scenarios, successfully extrapolating beyond the distribution of training data.

[LG-70] Kernel-based Equalized Odds: A Quantification of Accuracy-Fairness Trade-off in Fair Representation Learning

Link: https://arxiv.org/abs/2508.15084
Authors: Yijin Ni,Xiaoming Huo
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:This paper introduces a novel kernel-based formulation of the Equalized Odds (EO) criterion, denoted as $EO_k$, for fair representation learning (FRL) in supervised settings. The central goal of FRL is to mitigate discrimination regarding a sensitive attribute $S$ while preserving prediction accuracy for the target variable $Y$. Our proposed criterion enables a rigorous and interpretable quantification of three core fairness objectives: independence (prediction $\hat{Y}$ is independent of $S$), separation (also known as equalized odds; prediction $\hat{Y}$ is independent of $S$ conditioned on target attribute $Y$), and calibration ($Y$ is independent of $S$ conditioned on the prediction $\hat{Y}$). Under both unbiased ($Y$ is independent of $S$) and biased ($Y$ depends on $S$) conditions, we show that $EO_k$ satisfies both independence and separation in the former, and uniquely preserves predictive accuracy while lower bounding independence and calibration in the latter, thereby offering a unified analytical characterization of the tradeoffs among these fairness criteria. We further define the empirical counterpart, $\widehat{EO}_k$, a kernel-based statistic that can be computed in quadratic time, with linear-time approximations also available. A concentration inequality for $\widehat{EO}_k$ is derived, providing performance guarantees and error bounds, which serve as practical certificates of fairness compliance. While our focus is on theoretical development, the results lay essential groundwork for principled and provably fair algorithmic design in future empirical studies.
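
To ground the separation (equalized odds) criterion, the sketch below computes the classical equalized-odds gap for binary predictions and a binary sensitive attribute: the largest difference across groups in P(prediction = 1 | Y = y), taken over y in {0, 1}. This is the standard non-kernel definition, not the paper's $EO_k$ statistic, and the toy data are made up.

```python
def positive_rate(preds, labels, groups, group, label):
    """Empirical P(pred = 1 | S = group, Y = label)."""
    selected = [p for p, y, s in zip(preds, labels, groups) if s == group and y == label]
    if not selected:
        raise ValueError(f"no samples with S={group} and Y={label}")
    return sum(selected) / len(selected)


def equalized_odds_gap(preds, labels, groups):
    """Max over y in {0, 1} of the between-group gap in P(pred = 1 | Y = y).

    A gap of 0 means true-positive and false-positive rates match across groups.
    """
    return max(
        abs(
            positive_rate(preds, labels, groups, 0, y)
            - positive_rate(preds, labels, groups, 1, y)
        )
        for y in (0, 1)
    )


if __name__ == "__main__":
    preds = [1, 0, 0, 0, 1, 1, 1, 0]
    labels = [1, 1, 0, 0, 1, 1, 0, 0]
    groups = [0, 0, 0, 0, 1, 1, 1, 1]
    print(equalized_odds_gap(preds, labels, groups))
```

A kernel-based criterion plays the same role for general, non-binary representations, where comparing a handful of rates is no longer enough.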

[LG-71] Generative AI models enable efficient and physically consistent sea-ice simulations

Link: https://arxiv.org/abs/2508.14984
Authors: Tobias Sebastian Finn,Marc Bocquet,Pierre Rampal,Charlotte Durand,Flavia Porro,Alban Farchi,Alberto Carrassi
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 43 pages, 10 figures

Abstract:Sea ice is governed by highly complex, scale-invariant, and anisotropic processes that are challenging to represent in Earth system models. While advanced numerical models have improved our understanding of the sea-ice dynamics, their computational costs often limit their application in ensemble forecasting and climate simulations. Here, we introduce GenSIM, the first generative AI-based pan-Arctic model that predicts the evolution of all relevant key properties, including concentration, thickness, and drift, in a 12-hour window with improved accuracy over deterministic predictions and high computational efficiency, while remaining physically consistent. Trained on a long simulation from a state-of-the-art sea-ice–ocean system, GenSIM robustly reproduces statistics as observed in numerical models and observations, exhibiting brittle-like short-term dynamics while also depicting the long-term sea-ice decline. Driven solely by atmospheric forcings, we attribute GenSIM’s emergent extrapolation capabilities to patterns that reflect the long-term impact of the ocean: it seemingly has learned an internal ocean emulator. This ability to infer slowly evolving climate-relevant dynamics from short-term predictions underlines the large potential of generative models to generalise for unseen climates and to encode hidden physics.

[LG-72] CUTE-MRI: Conformalized Uncertainty-based framework for Time-adaptivE MRI

Link: https://arxiv.org/abs/2508.14952
Authors: Paul Fischer,Jan Nikolas Morshuis,Thomas Küstner,Christian Baumgartner
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)

Abstract:Magnetic Resonance Imaging (MRI) offers unparalleled soft-tissue contrast but is fundamentally limited by long acquisition times. While deep learning-based accelerated MRI can dramatically shorten scan times, the reconstruction from undersampled data introduces ambiguity resulting from an ill-posed problem with infinitely many possible solutions that propagates to downstream clinical tasks. This uncertainty is usually ignored during the acquisition process as acceleration factors are often fixed a priori, resulting in scans that are either unnecessarily long or of insufficient quality for a given clinical endpoint. This work introduces a dynamic, uncertainty-aware acquisition framework that adjusts scan time on a per-subject basis. Our method leverages a probabilistic reconstruction model to estimate image uncertainty, which is then propagated through a full analysis pipeline to a quantitative metric of interest (e.g., patellar cartilage volume or cardiac ejection fraction). We use conformal prediction to transform this uncertainty into a rigorous, calibrated confidence interval for the metric. During acquisition, the system iteratively samples k-space, updates the reconstruction, and evaluates the confidence interval. The scan terminates automatically once the uncertainty meets a user-predefined precision target. We validate our framework on both knee and cardiac MRI datasets. Our results demonstrate that this adaptive approach reduces scan times compared to fixed protocols while providing formal statistical guarantees on the precision of the final image. This framework moves beyond fixed acceleration factors, enabling patient-specific acquisitions that balance scan efficiency with diagnostic confidence, a critical step towards personalized and resource-efficient MRI.
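
The acquisition loop described above (sample, reconstruct, evaluate the interval, stop when precise enough) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it uses a generic normalized split-conformal interval, and the names `conformal_quantile`, `adaptive_scan`, `calib_scores`, and `target_width` are hypothetical. It also reuses one calibration set for every step, a simplification the paper may well handle differently.

```python
import math

def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(scores)[k - 1]

def adaptive_scan(steps, calib_scores, alpha, target_width):
    """Stop at the first acquisition step whose interval is narrow enough.

    steps: iterable of (estimate, sigma) pairs, one per acquisition step, where
           estimate is the downstream metric and sigma its model uncertainty.
    calib_scores: normalized residuals |y - estimate| / sigma from held-out scans
           (assumed here to be shared by all steps).
    Returns (step_index, (low, high)); if the target is never met, the last
    step is returned.
    """
    q = conformal_quantile(calib_scores, alpha)
    result = None
    for t, (estimate, sigma) in enumerate(steps):
        interval = (estimate - q * sigma, estimate + q * sigma)
        result = (t, interval)
        if interval[1] - interval[0] <= target_width:
            break  # precision target reached, terminate the scan
    return result
```

In this sketch the interval width shrinks only because `sigma` shrinks as more k-space is sampled; the conformal quantile merely rescales the model's uncertainty so that the interval is calibrated.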

[LG-73] Potential and challenges of generative adversarial networks for super-resolution in 4D Flow MRI

Link: https://arxiv.org/abs/2508.14950
Authors: Oliver Welin Odeback,Arivazhagan Geetha Balasubramanian,Jonas Schollenberger,Edward Ferdiand,Alistair A. Young,C. Alberto Figueroa,Susanne Schnell,Outi Tammisola,Ricardo Vinuesa,Tobias Granberg,Alexander Fyrdahl,David Marlevi
Categories: Image and Video Processing (eess.IV); Machine Learning (cs.LG)
Comments: 23 pages, 9 figures

Abstract:4D Flow Magnetic Resonance Imaging (4D Flow MRI) enables non-invasive quantification of blood flow and hemodynamic parameters. However, its clinical application is limited by low spatial resolution and noise, particularly affecting near-wall velocity measurements. Machine learning-based super-resolution has shown promise in addressing these limitations, but challenges remain, not least in recovering near-wall velocities. Generative adversarial networks (GANs) offer a compelling solution, having demonstrated strong capabilities in restoring sharp boundaries in non-medical super-resolution tasks. Yet, their application in 4D Flow MRI remains unexplored, with implementation challenged by known issues such as training instability and non-convergence. In this study, we investigate GAN-based super-resolution in 4D Flow MRI. Training and validation were conducted using patient-specific cerebrovascular in-silico models, converted into synthetic images via an MR-true reconstruction pipeline. A dedicated GAN architecture was implemented and evaluated across three adversarial loss functions: Vanilla, Relativistic, and Wasserstein. Our results demonstrate that the proposed GAN improved near-wall velocity recovery compared to a non-adversarial reference (vNRMSE: 6.9% vs. 9.6%); however, implementation specifics proved critical for stable network training. While Vanilla and Relativistic GANs proved unstable compared to generator-only training (vNRMSE: 8.1% and 7.8% vs. 7.2%), a Wasserstein GAN demonstrated optimal stability and incremental improvement (vNRMSE: 6.9% vs. 7.2%). The Wasserstein GAN further outperformed the generator-only baseline at low SNR (vNRMSE: 8.7% vs. 10.7%). These findings highlight the potential of GAN-based super-resolution in enhancing 4D Flow MRI, particularly in challenging cerebrovascular regions, while emphasizing the need for careful selection of adversarial strategies.

[LG-74] AGP: A Novel Arabidopsis thaliana Genomics-Phenomics Dataset and its HyperGraph Baseline Benchmarking

Link: https://arxiv.org/abs/2508.14934
Authors: Manuel Serna-Aguilera,Fiona L. Goggin,Aranyak Goswami,Alexander Bucksch,Suxing Liu,Khoa Luu
Categories: Genomics (q-bio.GN); Machine Learning (cs.LG)

Abstract:Understanding which genes control which traits in an organism remains one of the central challenges in biology. Despite significant advances in data collection technology, our ability to map genes to traits is still limited. This genome-to-phenome (G2P) challenge spans several problem domains, including plant breeding, and requires models capable of reasoning over high-dimensional, heterogeneous, and biologically structured data. Currently, however, many datasets solely capture genetic information or solely capture phenotype information. Additionally, phenotype data is very heterogeneous, which many datasets do not fully capture. The critical drawback is that these datasets are not integrated, that is, they do not link with each other to describe the same biological specimens. This limits machine learning models’ ability to be informed on the various aspects of these specimens, impacting the breadth of correlations learned, and therefore their ability to make more accurate predictions. To address this gap, we present the Arabidopsis Genomics-Phenomics (AGP) Dataset, a curated multi-modal dataset linking gene expression profiles with phenotypic trait measurements in Arabidopsis thaliana, a model organism in plant biology. AGP supports tasks such as phenotype prediction and interpretable graph learning. In addition, we benchmark conventional regression and explanatory baselines, including a biologically-informed hypergraph baseline, to validate gene-trait associations. To the best of our knowledge, this is the first dataset that provides multi-modal gene information and heterogeneous trait or phenotype data for the same Arabidopsis thaliana specimens. With AGP, we aim to foster the research community towards accurately understanding the connection between genotypes and phenotypes using gene information, higher-order gene pairings, and trait data from several sources.

[LG-75] Computational Resolution of Hadamard Product Factorization for $4 \times 4$ Matrices

Link: https://arxiv.org/abs/2508.14901
Authors: Igor Rivin
Categories: Rings and Algebras (math.RA); Machine Learning (cs.LG); Algebraic Geometry (math.AG)

Abstract:We computationally resolve an open problem concerning the expressibility of $4 \times 4$ full-rank matrices as Hadamard products of two rank-2 matrices. Through exhaustive search over $\mathbb{F}_2$, we identify 5,304 counterexamples among the 20,160 full-rank binary matrices (26.3%). We verify that these counterexamples remain valid over $\mathbb{Z}$ through sign enumeration and provide strong numerical evidence for their validity over $\mathbb{R}$. Remarkably, our analysis reveals that matrix density (number of ones) is highly predictive of expressibility, achieving 95.7% classification accuracy. Using modern machine learning techniques, we discover that expressible matrices lie on an approximately 10-dimensional variety within the 16-dimensional ambient space, despite the naive parameter count of 24 (12 parameters each for two $4 \times 4$ rank-2 matrices). This emergent low-dimensional structure suggests deep algebraic constraints governing Hadamard factorizability.
MSC classes: 15A23, 15A69, 05B20, 68W30
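
The counting figures quoted in this abstract can be sanity-checked with a few lines of code. The sketch below only verifies the number of full-rank binary matrices, the quoted percentage, and the naive parameter count; it does not reproduce the Hadamard factorization search itself. The function names are hypothetical, and rows are encoded as integer bitmasks purely for convenience.

```python
def gf2_rank(rows, n):
    """Rank over F_2 of an n x n matrix whose rows are given as bitmasks."""
    rows = list(rows)
    rank = 0
    for bit in range(n):
        # Find a row at or below the current pivot position with a 1 in this column.
        pivot = next((i for i in range(rank, n) if (rows[i] >> bit) & 1), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(n):
            if i != rank and (rows[i] >> bit) & 1:
                rows[i] ^= rows[rank]  # addition over F_2 is XOR
        rank += 1
    return rank

def count_full_rank(n):
    """Exhaustively count invertible n x n matrices over F_2."""
    mask = (1 << n) - 1
    count = 0
    for code in range(1 << (n * n)):
        rows = [(code >> (n * r)) & mask for r in range(n)]
        if gf2_rank(rows, n) == n:
            count += 1
    return count

def rank_r_dimension(n, r):
    """Dimension of the variety of n x n matrices of rank at most r."""
    return r * (2 * n - r)

full_rank = count_full_rank(4)                # 65,536 candidates enumerated
share = round(100 * 5304 / full_rank, 1)      # 5,304 is taken from the abstract
naive_params = 2 * rank_r_dimension(4, 2)     # two rank-2 factors
```

The enumeration agrees with the closed form $(2^4-1)(2^4-2)(2^4-4)(2^4-8) = 20160$ for the order of $GL_4(\mathbb{F}_2)$.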

Information Retrieval

[IR-0] Reading Between the Lines: A Study of Thematic Bias in Book Recommender Systems RECSYS2025

Link: https://arxiv.org/abs/2508.15643
Authors: Nityaa Kalra,Savvina Daniil
Categories: Information Retrieval (cs.IR)
Comments: 7 pages, 5 figures, Accepted at FAccTRec at RecSys 2025

Abstract:Recommender systems help users discover new content, but can also reinforce existing biases, leading to unfair exposure and reduced diversity. This paper introduces and investigates thematic bias in book recommendations, defined as a disproportionate favouring or neglect of certain book themes. We adopt a multi-stage bias evaluation framework using the Book-Crossing dataset to evaluate thematic bias in recommendations and its impact on different user groups. Our findings show that thematic bias originates from content imbalances and is amplified by user engagement patterns. By segmenting users based on their thematic preferences, we find that users with niche and long-tail interests receive less personalised recommendations, whereas users with diverse interests receive more consistent recommendations. These findings suggest that recommender systems should be carefully designed to accommodate a broader range of user interests. By contributing to the broader goal of responsible AI, this work also lays the groundwork for extending thematic bias analysis to other domains.

[IR-1] LongRetriever: Towards Ultra-Long Sequence based Candidate Retrieval for Recommendation

Link: https://arxiv.org/abs/2508.15486
Authors: Ren Qin,Chai Zheng,Xiao Xijun,Zheng Yuchao,Wu Di
Categories: Information Retrieval (cs.IR)

Abstract:Precisely modeling user ultra-long sequences is critical for industrial recommender systems. Current approaches predominantly focus on leveraging ultra-long sequences in the ranking stage, whereas research for the candidate retrieval stage remains under-explored. This paper presents LongRetriever, a practical framework for incorporating ultra-long sequences into the retrieval stage of recommenders. Specifically, we propose in-context training and multi-context retrieval, which enable candidate-specific interaction between user sequence and candidate item, and ensure training-serving consistency under the search-based paradigm. Extensive online A/B testing conducted on a large-scale e-commerce platform demonstrates statistically significant improvements, confirming the framework’s effectiveness. Currently, LongRetriever has been fully deployed in the platform, impacting billions of users.

[IR-2] On Evaluating the Adversarial Robustness of Foundation Models for Multimodal Entity Linking

Link: https://arxiv.org/abs/2508.15481
Authors: Fang Wang,Yongjie Wang,Zonghao Yang,Minghao Hu,Xiaoying Bai
Categories: Information Retrieval (cs.IR)

Abstract:The explosive growth of multimodal data has driven the rapid development of multimodal entity linking (MEL) models. However, existing studies have not systematically investigated the impact of visual adversarial attacks on MEL models. We conduct the first comprehensive evaluation of the robustness of mainstream MEL models under different adversarial attack scenarios, covering two core tasks: Image-to-Text (I2T) and Image+Text-to-Text (IT2T). Experimental results show that current MEL models generally lack sufficient robustness against visual perturbations. Interestingly, contextual semantic information in input can partially mitigate the impact of adversarial perturbations. Based on this insight, we propose an LLM and Retrieval-Augmented Entity Linking (LLM-RetLink), which significantly improves the model’s anti-interference ability through a two-stage process: first, extracting initial entity descriptions using large vision models (LVMs), and then dynamically generating candidate descriptive sentences via web-based retrieval. Experiments on five datasets demonstrate that LLM-RetLink improves the accuracy of MEL by 0.4%-35.7%, especially showing significant advantages under adversarial conditions. This research highlights a previously unexplored facet of MEL robustness, constructs and releases the first MEL adversarial example dataset, and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments.

[IR-3] TrackRec: Iterative Alternating Feedback with Chain-of-Thought via Preference Alignment for Recommendation

Link: https://arxiv.org/abs/2508.15388
Authors: Yu Xia,Rui Zhong,Zeyu Song,Wei Yang,Junchen Wan,Qingpeng Cai,Chi Lu,Peng Jiang
Categories: Information Retrieval (cs.IR)

Abstract:The extensive world knowledge and powerful reasoning capabilities of large language models (LLMs) have attracted significant attention in recommendation systems (RS). Specifically, chain of thought (CoT) has been shown to improve the performance of LLMs on complex reasoning tasks for RS. However, because LLMs often suffer from hallucination, there is no guarantee that their reasoning CoT is effective. A key challenge is to further enhance the recommendation capabilities of LLMs through effective CoT reasoning. Therefore, we propose TrackRec, a framework designed to enhance the reasoning capabilities of LLMs for RS. TrackRec specifically focuses on accurately inferring recommendation CoT (RecCoT) for user preference using the knowledge from LLMs. This RecCoT can serve both as an explanation for the LLM’s completion of recommendation tasks and as auxiliary features to assist recommendation models in accomplishing recommendation tasks. TrackRec consists of a RecCoT generator (G) and a RecCoT validator (V). Furthermore, we design an alternating feedback learning mechanism in which G undergoes direct preference optimization via feedback from V to produce increasingly accurate RecCoT aligned with V’s standards. Meanwhile, V is fine-tuned using the inference feedback from G to enhance its validation capabilities in alignment with recommendation tasks. Through iterative alternating feedback learning between G and V, TrackRec continuously improves the user preference analysis capability of G and the validation capacity of V. Extensive experiments demonstrate the effectiveness of our approach, showing that it surpasses state-of-the-art methods. Moreover, TrackRec has been deployed on a large advertising platform with hundreds of millions of users, achieving substantial gains.

[IR-4] Exploring Scaling Laws of CTR Model for Online Performance Improvement

Link: https://arxiv.org/abs/2508.15326
Authors: Weijiang Lai,Beihong Jin,Jiongyan Zhang,Yiyuan Zheng,Jian Dong,Jia Cheng,Jun Lei,Xingxing Wang
Categories: Information Retrieval (cs.IR)

Abstract:CTR models play a vital role in improving user experience and boosting business revenue in many online personalized services. However, current CTR models generally encounter bottlenecks in performance improvement. Inspired by the scaling law phenomenon of LLMs, we propose a new paradigm for improving CTR predictions: first, constructing a CTR model with accuracy scalable to the model grade and data size, and then distilling the knowledge implied in this model into its lightweight model that can serve online users. To put it into practice, we construct a CTR model named SUAN (Stacked Unified Attention Network). In SUAN, we propose the UAB as a behavior sequence encoder. A single UAB unifies the modeling of the sequential and non-sequential features and also measures the importance of each user behavior feature from multiple perspectives. Stacked UABs elevate the configuration to a high grade, paving the way for performance improvement. In order to benefit from the high performance of the high-grade SUAN and avoid the disadvantage of its long inference time, we modify the SUAN with sparse self-attention and parallel inference strategies to form LightSUAN, and then adopt online distillation to train the low-grade LightSUAN, taking a high-grade SUAN as a teacher. The distilled LightSUAN has superior performance but the same inference time as the LightSUAN, making it well-suited for online deployment. Experimental results show that SUAN performs exceptionally well and holds the scaling laws spanning three orders of magnitude in model grade and data size, and the distilled LightSUAN outperforms the SUAN configured with one grade higher. More importantly, the distilled LightSUAN has been integrated into an online service, increasing the CTR by 2.81% and CPM by 1.69% while keeping the average inference time acceptable. Our source code is available at this https URL.
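
To make the teacher-student step concrete, here is a minimal sketch of a distillation objective for click prediction. The abstract does not give the loss used for SUAN and LightSUAN, so this is a generic formulation and an assumption: the student is fit to the observed click label and, with weight `lam`, to the teacher's predicted click probability. The names `bce`, `distill_loss`, and `lam` are hypothetical.

```python
import math

def bce(target, p, eps=1e-12):
    """Binary cross-entropy between a (possibly soft) target and prediction p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def distill_loss(label, p_student, p_teacher, lam=1.0):
    """Hard-label loss plus a soft-label term that pulls the student to the teacher."""
    return bce(label, p_student) + lam * bce(p_teacher, p_student)
```

In an online distillation setup the teacher would be updated on the same data stream as the student; the sketch covers only the student's per-example loss.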

[IR-5] Modeling Long-term User Behaviors with Diffusion-driven Multi-interest Network for CTR Prediction

Link: https://arxiv.org/abs/2508.15311
Authors: Weijiang Lai,Beihong Jin,Yapeng Zhang,Yiyuan Zheng,Rui Zhao,Jian Dong,Jun Lei,Xingxing Wang
Categories: Information Retrieval (cs.IR)

Abstract:CTR (Click-Through Rate) prediction, crucial for recommender systems and online advertising, etc., has been confirmed to benefit from modeling long-term user behaviors. Nonetheless, the vast number of behaviors and complexity of noise interference pose challenges to prediction efficiency and effectiveness. Recent solutions have evolved from single-stage models to two-stage models. However, current two-stage models often filter out significant information, resulting in an inability to capture diverse user interests and build the complete latent space of user interests. Inspired by multi-interest and generative modeling, we propose DiffuMIN (Diffusion-driven Multi-Interest Network) to model long-term user behaviors and thoroughly explore the user interest space. Specifically, we propose a target-oriented multi-interest extraction method that begins by orthogonally decomposing the target to obtain interest channels. This is followed by modeling the relationships between interest channels and user behaviors to disentangle and extract multiple user interests. We then adopt a diffusion module guided by contextual interests and interest channels, which anchor users’ personalized and target-oriented interest types, enabling the generation of augmented interests that align with the latent spaces of user interests, thereby further exploring restricted interest space. Finally, we leverage contrastive learning to ensure that the generated augmented interests align with users’ genuine preferences. Extensive offline experiments are conducted on two public datasets and one industrial dataset, yielding results that demonstrate the superiority of DiffuMIN. Moreover, DiffuMIN increased CTR by 1.52% and CPM by 1.10% in online A/B testing. Our source code is available at this https URL.

[IR-6] REG4Rec: Reasoning-Enhanced Generative Model for Large-Scale Recommendation Systems

Link: https://arxiv.org/abs/2508.15308
Authors: Haibo Xing,Hao Deng,Yucheng Mao,Jinxin Hu,Yi Xu,Hao Zhang,Jiahao Wang,Shizhun Wang,Yu Zhang,Xiaoyi Zeng,Jing Zhang
Categories: Information Retrieval (cs.IR)

Abstract:Sequential recommendation aims to predict a user’s next action in large-scale recommender systems. While traditional methods often suffer from insufficient information interaction, recent generative recommendation models partially address this issue by directly generating item predictions. To better capture user intents, recent studies have introduced a reasoning process into generative recommendation, significantly improving recommendation performance. However, these approaches are constrained by the singularity of item semantic representations, facing challenges such as limited diversity in reasoning pathways and insufficient reliability in the reasoning process. To tackle these issues, we introduce REG4Rec, a reasoning-enhanced generative model that constructs multiple dynamic semantic reasoning paths alongside a self-reflection process, ensuring high-confidence recommendations. Specifically, REG4Rec utilizes an MoE-based parallel quantization codebook (MPQ) to generate multiple unordered semantic tokens for each item, thereby constructing a larger-scale diverse reasoning space. Furthermore, to enhance the reliability of reasoning, we propose a training reasoning enhancement stage, which includes Preference Alignment for Reasoning (PARS) and a Multi-Step Reward Augmentation (MSRA) strategy. PARS uses reward functions tailored for recommendation to enhance reasoning and reflection, while MSRA introduces future multi-step actions to improve overall generalization. During inference, Consistency-Oriented Self-Reflection for Pruning (CORP) is proposed to discard inconsistent reasoning paths, preventing the propagation of erroneous reasoning. Lastly, we develop an efficient offline training strategy for large-scale recommendation. Experiments on real-world datasets and online evaluations show that REG4Rec delivers outstanding performance and substantial practical value.

[IR-7] MLLMRec: Exploring the Potential of Multimodal Large Language Models in Recommender Systems

Link: https://arxiv.org/abs/2508.15304
Authors: Yuzhuo Dang,Xin Zhang,Zhiqiang Pan,Yuxiao Duan,Wanyu Chen,Fei Cai,Honghui Chen
Categories: Information Retrieval (cs.IR)

Abstract:Multimodal recommendation typically combines user behavioral data with the modal features of items to reveal users’ preferences, presenting superior performance compared to conventional recommendation. However, existing methods still suffer from two key problems: (1) the initialization methods of user multimodal representations are either behavior-unperceived or noise-contaminated, and (2) the KNN-based item-item graph contains noisy edges with low similarities and lacks audience co-occurrence relationships. To address such issues, we propose MLLMRec, a novel MLLM-driven multimodal recommendation framework with two item-item graph refinement strategies. On the one hand, the item images are first converted into high-quality semantic descriptions using an MLLM, which are then fused with the textual metadata of items. Then, we construct a behavioral description list for each user and feed it into the MLLM to reason about the purified user preference containing interaction motivations. On the other hand, we design the threshold-controlled denoising and topology-aware enhancement strategies to refine the suboptimal item-item graph, thereby enhancing the item representation learning. Extensive experiments on three publicly available datasets demonstrate that MLLMRec achieves state-of-the-art performance with an average improvement of 38.53% over the best baselines.
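
The idea of threshold-controlled denoising on a KNN item-item graph can be illustrated with a small sketch. The abstract does not specify the similarity measure or the thresholding rule, so this example assumes cosine similarity and a single global cutoff `tau`; the function names are hypothetical and the topology-aware enhancement step is not covered.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def denoised_knn_graph(features, k, tau):
    """Keep each item's top-k neighbours, then drop edges with similarity below tau."""
    graph = {}
    for i, u in enumerate(features):
        sims = [(cosine(u, v), j) for j, v in enumerate(features) if j != i]
        sims.sort(key=lambda s: (-s[0], s[1]))  # most similar first, ties by index
        graph[i] = [j for sim, j in sims[:k] if sim >= tau]
    return graph
```

A plain KNN graph always gives every item exactly `k` neighbours, even when the k-th one is barely related; the cutoff lets poorly connected items keep fewer edges instead of noisy ones.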

[IR-8] Curriculum Approximate Unlearning for Session-based Recommendation

Link: https://arxiv.org/abs/2508.15263
Authors: Liu Yang,Zhaochun Ren,Ziqi Zhao,Pengjie Ren,Zhumin Chen,Xinyi Li,Shuaiqiang Wang,Dawei Yin,Xin Xin
Categories: Information Retrieval (cs.IR)

Abstract:Approximate unlearning for session-based recommendation refers to eliminating the influence of specific training samples from the recommender without retraining (sub-)models. Gradient ascent (GA) is a representative method for approximate unlearning. However, two challenges remain in applying GA to session-based recommendation. On the one hand, naively applying GA can degrade recommendation performance. On the other hand, existing studies fail to consider the ordering of unlearning samples when simultaneously processing multiple unlearning requests, leading to sub-optimal recommendation performance and unlearning effect. To address the above challenges, we introduce CAU, a curriculum approximate unlearning framework tailored to session-based recommendation. CAU handles the unlearning task with a GA term on unlearning samples. Specifically, to address the first challenge, CAU formulates the overall optimization task as a multi-objective optimization problem, where the GA term for unlearning samples is combined with retaining terms for preserving performance. The multi-objective optimization problem is solved through seeking the Pareto-Optimal solution, which achieves effective unlearning with only a trivial sacrifice in recommendation performance. To tackle the second challenge, CAU adopts a curriculum-based sequence to conduct unlearning on batches of unlearning samples. The key motivation is to perform unlearning from easy samples to harder ones. To this end, CAU first introduces two metrics to measure the unlearning difficulty, including gradient unlearning difficulty and embedding unlearning difficulty. Then, two strategies, hard-sampling and soft-sampling, are proposed to select unlearning samples according to difficulty scores.
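
Two of the ingredients named above, easy-to-hard batching and a multi-objective combination of the unlearning and retaining gradients, can be sketched in a few lines. The abstract does not say how the Pareto-optimal solution is sought, so the min-norm weighting below (the closed form used by two-objective multiple-gradient descent) is a swapped-in assumption, and all function names are hypothetical. Here `g_forget` stands for the gradient of the objective being minimized on unlearning samples, that is, the negated training loss.

```python
def curriculum_batches(difficulty, batch_size):
    """Order sample ids from easy to hard and split them into batches."""
    ordered = sorted(difficulty, key=lambda s: (difficulty[s], s))
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def min_norm_weight(g_forget, g_retain):
    """Weight a in [0, 1] minimizing ||a * g_forget + (1 - a) * g_retain||^2."""
    diff = [r - f for f, r in zip(g_forget, g_retain)]  # g_retain - g_forget
    denom = sum(d * d for d in diff)
    if denom == 0:
        return 0.5  # identical gradients, any weight gives the same direction
    a = sum(d * r for d, r in zip(diff, g_retain)) / denom
    return min(1.0, max(0.0, a))

def combined_direction(g_forget, g_retain):
    """Common update direction balancing the unlearning and retaining objectives."""
    a = min_norm_weight(g_forget, g_retain)
    return [a * f + (1 - a) * r for f, r in zip(g_forget, g_retain)]
```

For example, difficulty scores `{"a": 0.9, "b": 0.1, "c": 0.5}` with a batch size of 2 give the batches `[["b", "c"], ["a"]]`, so the easiest samples are unlearned first.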

[IR-9] Alpha Berkeley: A Scalable Framework for the Orchestration of Agentic Systems

Link: https://arxiv.org/abs/2508.15066
Authors: Thorsten Hellert,João Montenegro,Antonin Sulc
Categories: Multiagent Systems (cs.MA); Information Retrieval (cs.IR)

Abstract:Coordinating workflows across heterogeneous control systems remains a central challenge in safety-critical environments such as scientific facilities, industrial plants, and energy infrastructures. Language-model-driven agents offer a natural interface for these tasks, but existing approaches often lack scalability, reliability, and human oversight. We introduce the Alpha Berkeley Framework, a production-ready architecture for scalable agentic systems that integrate conversational context with robust tool orchestration. The framework features dynamic capability classification to select only relevant tools per task, a plan-first orchestration model that generates execution plans with explicit dependencies and optional human approval, context-aware task extraction that combines dialogue history with external memory and domain resources, and production-ready execution environments with checkpointing, artifact management, and modular deployment. We demonstrate its versatility through two case studies: a tutorial-style wind farm monitoring example and a deployment at the Advanced Light Source particle accelerator. These results establish Alpha Berkeley as a reliable and transparent framework for agentic systems in high-stakes domains.

[IR-10] Multimodal Recommendation via Self-Corrective Preference Alignment

Link: https://arxiv.org/abs/2508.14912
Authors: Yalong Guan,Xiang Chen,Mingyang Wang,Xiangyu Wu,Lihao Liu,Chao Qi,Shuang Yang,Tingting Gao,Guorui Zhou,Changjian Chen
Categories: Information Retrieval (cs.IR)

Abstract:With the rapid growth of live streaming platforms, personalized recommendation systems have become pivotal in improving user experience and driving platform revenue. The dynamic and multimodal nature of live streaming content (e.g., visual, audio, textual data) requires joint modeling of user behavior and multimodal features to capture evolving author characteristics. However, traditional methods relying on single-modal features or treating multimodal ones as supplementary struggle to align users’ dynamic preferences with authors’ multimodal attributes, limiting accuracy and interpretability. To address this, we propose MSPA (Multimodal Self-Corrective Preference Alignment), a personalized author recommendation framework with two components: (1) a Multimodal Preference Composer that uses MLLMs to generate structured preference text and embeddings from users’ tipping history; and (2) a Self-Corrective Preference Alignment Recommender that aligns these preferences with authors’ multimodal features to improve accuracy and interpretability. Extensive experiments and visualizations show that MSPA significantly improves accuracy, recall, and text quality, outperforming baselines in dynamic live streaming scenarios.

Attachment Download

Click to download today's full list of papers