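This post lists the latest papers retrieved from arXiv.org on 2025-08-06. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by email on a schedule, leave your email address in the comments.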

Note: the daily paper data is retrieved from arXiv.org and updated automatically at around 12:00 each day.

Reminder: if you would like to receive the daily paper data by email, leave your email address in the comments.

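Overview (2025-08-06)

606 papers were added today, including:

  • Natural language processing: 77 papers (Computation and Language (cs.CL))
  • Artificial intelligence: 189 papers (Artificial Intelligence (cs.AI))
  • Computer vision: 149 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine learning: 159 papers (Machine Learning (cs.LG))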

Natural Language Processing

[NLP-0] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

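Quick read: This paper addresses two core problems in answer verification for large language models (LLMs): first, the lack of a comprehensive benchmark that systematically evaluates the verification ability of different LLMs; second, the limited robustness of existing verifier models on complex edge cases and their limited cross-domain generalization. The key to the solution is CompassVerifier, a lightweight, accurate, and robust verifier model, together with the accompanying VerifierBench benchmark dataset. CompassVerifier generalizes across domains (math, knowledge, and reasoning tasks), handles multiple answer forms (such as multi-subproblem, formula, and sequence answers), and identifies abnormal or invalid responses. VerifierBench combines outputs collected from multiple models and is augmented with manually analyzed meta-error patterns, improving the coverage and accuracy of verifier training and evaluation.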

Link: https://arxiv.org/abs/2508.03686
Authors: Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, Kai Chen
Affiliations: Shanghai AI Laboratory; NLP2CT Lab, University of Macau
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Technical Report; 31 Pages

Abstract:Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also serves as the reward model to guide LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization for regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types, including multi-subproblems, formulas, and sequence answers, while effectively identifying abnormal/invalid responses. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. Code and dataset are available at this https URL.

[NLP-1] More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation

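Quick read: This paper investigates why large language models (LLMs) perform well on general code generation benchmarks such as HumanEval but markedly worse on domain-specific benchmarks such as ParEval, asking whether the cause is missing domain knowledge or insufficient prompt detail. The key to the solution is PartialOrderEval, a new framework that systematically evaluates how prompt specificity affects code generation by building a partial order of prompts from minimally to maximally detailed. Experiments show that sensitivity to prompt detail varies across tasks, and that explicit input/output specifications, edge-case handling, and stepwise logical breakdowns are the core factors in improving prompt quality.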

Link: https://arxiv.org/abs/2508.03678
Authors: Yangtian Zi, Harshitha Menon, Arjun Guha
Affiliations: Northeastern University; Lawrence Livermore National Laboratory
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG); Programming Languages (cs.PL)
Comments:

Abstract:State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this because LLMs lack domain knowledge or because the prompts give insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and both serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of prompt detail improvement.
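
The abstract reports pass@1 but does not say how it is estimated. As a reminder of what the metric measures, here is a minimal sketch of the unbiased pass@k estimator that was introduced alongside HumanEval. Whether PartialOrderEval uses this exact estimator is an assumption, and the function names are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples were drawn, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    # 1 minus the probability that k samples drawn without replacement all fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


print(pass_at_k(20, 5, 1))  # 0.25
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass. Under this reading, comparing pass@1 across prompts of increasing detail means comparing that fraction at each level of the partial order.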

[NLP-2] FairLangProc: A Python package for fairness in NLP

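Quick read: This paper addresses the diversity and fragmentation of approaches to bias in natural language processing (NLP): although many datasets, metrics, and algorithms exist for detecting and mitigating harmful bias in NLP, their implementations lack a unified standard and are hard to adopt widely. The key to the solution is FairLangProc, a comprehensive Python package that implements some of the more recent advances in NLP fairness and provides an interface compatible with the Hugging Face Transformers library, encouraging the widespread adoption and democratization of bias mitigation techniques.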

Link: https://arxiv.org/abs/2508.03677
Authors: Arturo Pérez-Peralta, Sandra Benítez-Peña, Rosa E. Lillo
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (stat.ML)
Comments: 40 pages, 4 figures, 3 tables

Abstract:The rise in usage of Large Language Models to near ubiquity in recent years has raised societal concern about their applications in decision-making contexts, such as organizational justice or healthcare. This, in turn, poses questions about the fairness of these models in critical settings, which has led to the development of different procedures to address bias in Natural Language Processing. Although many datasets, metrics and algorithms have been proposed to measure and mitigate harmful prejudice in Natural Language Processing, their implementation is diverse and far from centralized. In response, this paper presents FairLangProc, a comprehensive Python package that provides a common implementation of some of the more recent advances in fairness in Natural Language Processing, with an interface compatible with the Hugging Face transformers library, aiming to encourage the widespread use and democratization of bias mitigation techniques. The implementation can be found on this https URL.

[NLP-3] CTR-Sink: Attention Sink for Language Models in Click-Through Rate Prediction

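Quick read: This paper addresses semantic fragmentation in click-through rate (CTR) prediction, caused by the structural mismatch between user behavior sequences and the pre-training corpora of language models (LMs). Specifically, user behavior sequences consist of discrete actions joined by semantically empty separators, which differs fundamentally from the coherent natural language LMs are accustomed to. As a result, attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, which degrades prediction performance. The key to the solution is the CTR-Sink framework, whose core innovation is behavior-level attention sinks: sink tokens carrying recommendation-specific signals (such as temporal distance) are inserted between consecutive behaviors to form stable attention aggregation points, and a two-stage training strategy guides LM attention toward the sink tokens while strengthening inter-sink dependencies to better capture behavioral correlations.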

Link: https://arxiv.org/abs/2508.03668
Authors: Zixuan Li, Binzong Geng, Jing Xiong, Yong He, Yuxuan Hu, Jian Chen, Dingwei Chen, Xiyu Chang, Liang Zhang, Linjian Mo, Chengming Li, Chuan Yuan, Zhenan Sun
Affiliations: NLPR, Institute of Automation, Chinese Academy of Sciences; Ant Group; University of Hong Kong; Sun Yat-sen University; SMBU
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Click-Through Rate (CTR) prediction, a core task in recommendation systems, estimates user click likelihood using historical behavioral data. Modeling user behavior sequences as text to leverage Language Models (LMs) for this task has gained traction, owing to LMs’ strong semantic understanding and contextual modeling capabilities. However, a critical structural gap exists: user behavior sequences consist of discrete actions connected by semantically empty separators, differing fundamentally from the coherent natural language in LM pre-training. This mismatch causes semantic fragmentation, where LM attention scatters across irrelevant tokens instead of focusing on meaningful behavior boundaries and inter-behavior relationships, degrading prediction performance. To address this, we propose CTR-Sink, a novel framework introducing behavior-level attention sinks tailored for recommendation scenarios. Inspired by attention sink theory, it constructs attention focus sinks and dynamically regulates attention aggregation via external information. Specifically, we insert sink tokens between consecutive behaviors, incorporating recommendation-specific signals such as temporal distance to serve as stable attention sinks. To enhance generality, we design a two-stage training strategy that explicitly guides LM attention toward sink tokens and an attention sink mechanism that amplifies inter-sink dependencies to better capture behavioral correlations. Experiments on one industrial dataset and two open-source datasets (MovieLens, Kuairec), alongside visualization results, validate the method’s effectiveness across scenarios.
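
To make the "sink tokens between consecutive behaviors" idea concrete, the sketch below builds a text sequence in which each pair of adjacent behaviors is separated by a marker carrying the time gap. The `[SINK gap=...]` format, the coarse time units, and both function names are hypothetical; the abstract does not specify how the sink tokens are represented, and the two-stage training is not shown.

```python
def format_gap(seconds: int) -> str:
    """Render a time gap in a coarse unit (hypothetical formatting choice)."""
    if seconds < 3600:
        return f"{seconds // 60}min"
    if seconds < 86400:
        return f"{seconds // 3600}h"
    return f"{seconds // 86400}d"


def with_sink_tokens(behaviors):
    """Join behaviors with a sink marker that encodes the temporal distance.

    behaviors: list of (timestamp_seconds, text) pairs sorted by time.
    """
    parts = []
    for i, (timestamp, text) in enumerate(behaviors):
        if i > 0:
            gap = timestamp - behaviors[i - 1][0]
            parts.append(f"[SINK gap={format_gap(gap)}]")
        parts.append(text)
    return " ".join(parts)


history = [(0, "clicked item A"), (7200, "clicked item B")]
print(with_sink_tokens(history))  # clicked item A [SINK gap=2h] clicked item B
```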

[NLP-4] Forest vs Tree: The (N, K) Trade-off in Reproducible ML Evaluation

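Quick read: This paper addresses the reduced reliability of machine learning (ML) evaluation that results from ignoring disagreement among human annotators. In current practice, to control budgets, only a few annotator responses are collected per item (a low K), which can mask the effect of annotation disagreement on model performance comparisons. The key to the solution is a systematic analysis of the trade-off between the number of items (N) and the number of annotations per item (K) under a fixed budget (N × K), determining the optimal configuration from real multi-annotator datasets and simulated distributions. The study finds that, with an appropriate K (typically no more than 10), evaluation reliability improves markedly at a low total annotation count (N × K ≤ 1000), and that the optimal strategy depends on the evaluation metric: metrics sensitive to the full response distribution perform better at higher K. The method can guide ML practitioners in designing test data collection to maximize evaluation reliability within their budget.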

Link: https://arxiv.org/abs/2508.03663
Authors: Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reproducibility is a cornerstone of scientific validation and of the authority it confers on its results. Reproducibility in machine learning evaluations leads to greater trust, confidence, and value. However, the ground truth responses used in machine learning often necessarily come from humans, among whom disagreement is prevalent, and surprisingly little research has studied the impact of effectively ignoring disagreement in these responses, as is typically the case. One reason for the lack of research is that budgets for collecting human-annotated evaluation data are limited, and obtaining more samples from multiple annotators for each example greatly increases the per-item annotation costs. We investigate the trade-off between the number of items (N) and the number of responses per item (K) needed for reliable machine learning evaluation. We analyze a diverse collection of categorical datasets for which multiple annotations per item exist, and simulated distributions fit to these datasets, to determine the optimal (N, K) configuration, given a fixed budget (N × K), for collecting evaluation data and reliably comparing the performance of machine learning models. Our findings show, first, that accounting for human disagreement may come with N × K at no more than 1000 (and often much lower) for every dataset tested on at least one metric. Moreover, this minimal N × K almost always occurred for K 10. Furthermore, the nature of the tradeoff between K and N, or whether one even existed, depends on the evaluation metric, with metrics that are more sensitive to the full distribution of responses performing better at higher levels of K. Our methods can be used to help ML practitioners get more effective test data by finding the optimal metrics and number of items and annotations per item to collect to get the most reliability for their budget.
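
The fixed-budget trade-off is plain arithmetic: every extra response per item reduces how many items the budget can cover. The sketch below only enumerates the candidate (N, K) pairs for a given budget; it does not reproduce the paper's reliability analysis, and the function name is illustrative.

```python
def configurations(budget: int, max_k: int):
    """List (N, K) pairs with N * K <= budget, using the largest N for each K."""
    return [(budget // k, k) for k in range(1, max_k + 1) if budget // k > 0]


for n_items, k_responses in configurations(1000, 5):
    print(f"K={k_responses}: up to N={n_items} items")
```

With a budget of 1000 annotations, K = 1 allows 1000 items, while K = 5 allows only 200. Which of these configurations gives the most reliable model comparison is the question the paper studies.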

[NLP-5] Can Large Vision-Language Models Understand Multimodal Sarcasm? CIKM2025

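Quick read: This paper addresses the limitations of large visual language models (LVLMs) in multimodal sarcasm analysis (MSA), in particular insufficient visual understanding and a lack of conceptual knowledge. The key to the solution is a training-free framework that integrates in-depth object extraction with external conceptual knowledge, substantially improving the model's ability to interpret and explain sarcasm in multimodal contexts.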

Link: https://arxiv.org/abs/2508.03654
Authors: Xinyu Wang, Yue Zhang, Liqiang Jing
Affiliations: The University of Texas at Dallas
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by CIKM 2025

Abstract:Sarcasm is a complex linguistic phenomenon that involves a disparity between literal and intended meanings, making it challenging for sentiment analysis and other emotion-sensitive tasks. While traditional sarcasm detection methods primarily focus on text, recent approaches have incorporated multimodal information. However, the application of Large Visual Language Models (LVLMs) in Multimodal Sarcasm Analysis (MSA) remains underexplored. In this paper, we evaluate LVLMs in MSA tasks, specifically focusing on Multimodal Sarcasm Detection and Multimodal Sarcasm Explanation. Through comprehensive experiments, we identify key limitations, such as insufficient visual understanding and a lack of conceptual knowledge. To address these issues, we propose a training-free framework that integrates in-depth object extraction and external conceptual knowledge to improve the model’s ability to interpret and explain sarcasm in multimodal contexts. The experimental results on multiple models show the effectiveness of our proposed framework. The code is available at this https URL.

[NLP-6] Are We on the Right Way for Assessing Document Retrieval-Augmented Generation?

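Quick read: This paper addresses the serious shortcomings in how document retrieval-augmented generation (document RAG) systems are currently evaluated. Existing benchmarks usually focus on a single module of the system and rely on synthetic data with incomplete labels, so they fail to reflect real-world bottlenecks and challenges. The key to the solution is Double-Bench, a large-scale, multilingual, multimodal evaluation system that supports fine-grained assessment of each component of a document RAG system. It contains 3,276 documents (72,880 pages) and 5,168 single-hop and multi-hop queries covering 6 languages and 4 document types, with a dynamic update mechanism to handle potential data contamination. All queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure quality and completeness. Experiments show that the gap between text and visual embedding models is narrowing, and reveal an over-confidence problem in current document RAG frameworks, which still produce answers without supporting evidence, underscoring the need for stronger retrieval models.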

Link: https://arxiv.org/abs/2508.03644
Authors: Wenxuan Shen, Mingjia Wang, Yaochen Wang, Dongping Chen, Junjie Yang, Yao Wan, Weiwei Lin
Affiliations: South China University of Technology; Huazhong University of Science and Technology; University of Maryland
Categories: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Information Retrieval (cs.IR)
Comments: In submission. Project website: this https URL

Abstract:Retrieval-Augmented Generation (RAG) systems using Multimodal Large Language Models (MLLMs) show great promise for complex document understanding, yet their development is critically hampered by inadequate evaluation. Current benchmarks often focus on a specific part of the document RAG system and use synthetic data with incomplete ground truth and evidence labels, and therefore fail to reflect real-world bottlenecks and challenges. To overcome these limitations, we introduce Double-Bench: a new large-scale, multilingual, and multimodal evaluation system that produces fine-grained assessments of each component within document RAG systems. It comprises 3,276 documents (72,880 pages) and 5,168 single- and multi-hop queries across 6 languages and 4 document types with streamlined dynamic update support for potential data contamination issues. Queries are grounded in exhaustively scanned evidence pages and verified by human experts to ensure maximum quality and completeness. Our comprehensive experiments across 9 state-of-the-art embedding models, 4 MLLMs and 4 end-to-end document RAG frameworks demonstrate that the gap between text and visual embedding models is narrowing, highlighting the need to build stronger document retrieval models. Our findings also reveal the over-confidence dilemma within current document RAG frameworks, which tend to provide answers even without evidence support. We hope our fully open-source Double-Bench provides a rigorous foundation for future research in advanced document RAG systems. We plan to retrieve timely corpus and release new benchmarks on an annual basis.

[NLP-7] OSINT or BULLSHINT? Exploring Open-Source Intelligence tweets about the Russo-Ukrainian War

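Quick read: This paper examines the role of open-source intelligence (OSINT) on social media in geopolitical conflict in the digital age, in particular how genuine OSINT and misinformation (termed "BULLSHINT") differ in their influence on war narratives and public perception. The key to the solution is a multi-dimensional analysis framework that combines sentiment analysis, partisanship detection, misinformation detection, and named entity recognition (NER), supplemented by community detection, to systematically reveal the behavior patterns of OSINT disseminators, the paths along which information spreads, and the strategic manipulation behind them, providing empirical evidence and methodological support for understanding digital and information warfare on social media.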

Link: https://arxiv.org/abs/2508.03599
Authors: Johannes Niu, Mila Stillman, Anna Kruspe
Affiliations: Munich University of Applied Sciences
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL)
Comments:

Abstract:This paper examines the role of Open Source Intelligence (OSINT) on Twitter regarding the Russo-Ukrainian war, distinguishing between genuine OSINT and deceptive misinformation efforts, termed “BULLSHINT.” Utilizing a dataset spanning from January 2022 to July 2023, we analyze nearly 2 million tweets from approximately 1,040 users involved in discussing real-time military engagements, strategic analyses, and misinformation related to the conflict. Using sentiment analysis, partisanship detection, misinformation identification, and Named Entity Recognition (NER), we uncover communicative patterns and dissemination strategies within the OSINT community. Significant findings reveal a predominant negative sentiment influenced by war events, a nuanced distribution of pro-Ukrainian and pro-Russian partisanship, and the potential strategic manipulation of information. Additionally, we apply community detection techniques, which are able to identify distinct clusters of partisanship, topics, and misinformation, highlighting the complex dynamics of information spread on social media. This research contributes to the understanding of digital warfare and misinformation dynamics, offering insights into the operationalization of OSINT in geopolitical conflicts.

[NLP-8] Tackling Distribution Shift in LLM via KILO: Knowledge-Instructed Learning for Continual Adaptation

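Quick read: This paper addresses the performance degradation that large language models (LLMs) suffer under domain shift as a result of catastrophic forgetting. The key to the solution is KILO (Knowledge-Instructed Learning for Continual Adaptation), a continual learning framework that combines dynamic knowledge graphs with instruction tuning and uses retrieved domain-specific knowledge as guidance during training, improving adaptation to new domains while retaining previously acquired knowledge. Experiments show that KILO significantly outperforms strong baselines on backward transfer, forward transfer, F1 score, retention rate, and training efficiency.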

Link: https://arxiv.org/abs/2508.03571
Authors: Iing Muttakhiroh, Thomas Fevens
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Large Language Models (LLMs) often suffer from performance degradation when faced with domain shifts, primarily due to catastrophic forgetting. In this work, we propose KILO (Knowledge-Instructed Learning for Continual Adaptation), a novel continual learning framework that integrates dynamic knowledge graphs with instruction tuning. By leveraging retrieved domain-specific knowledge as guidance during training, KILO enhances both adaptability to new domains and retention of previously acquired knowledge. We pretrain our model on WikiText-103 and evaluate sequential adaptation across four diverse target domains: BioASQ, SciQ, TweetEval, and MIND. Our experiments demonstrate that KILO consistently outperforms strong baselines, including continual fine-tuning, ERNIE 2.0, and CPT, in terms of backward transfer, forward transfer, F1 score, retention rate, and training efficiency. These results highlight the effectiveness of combining structured knowledge retrieval and instruction prompting to overcome domain shift challenges in continual learning scenarios.

[NLP-9] Beyond Meme Templates: Limitations of Visual Similarity Measures in Meme Matching

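Quick read: This paper addresses the limitation that existing meme matching methods match only the background image of a template, ignoring the many non-template memes (memes that do not rely on a shared visual background). This limits the effectiveness of automated meme analysis and makes it hard to link memes to web-based meme dictionaries. The key to the solution is a broader formulation of meme matching that goes beyond template matching, introduces segment-wise similarity computation to improve matching of non-template memes, and explores a prompting-based approach using a pretrained multimodal large language model (MLLM) to improve semantic understanding and matching across meme formats. Experiments show that the segment-wise strategy clearly outperforms whole-image similarity on non-template memes, but accurately matching shared visual elements remains an open challenge that calls for more sophisticated matching techniques.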

Link: https://arxiv.org/abs/2508.03562
Authors: Muzhaffar Hazman, Susan McKeever, Josephine Griffith
Affiliations: University of Galway; Technological University Dublin
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: Accepted for publication at IEEE International Conference on Image Processing Theory, Tools and Applications (IPTA) 2025

Abstract:Internet memes, now a staple of digital communication, play a pivotal role in how users engage within online communities and allow researchers to gain insight into contemporary digital culture. This engaging user-generated content is characterised by its reuse of visual elements also found in other memes. Matching instances of memes via these shared visual elements, called Meme Matching, is the basis of a wealth of meme analysis approaches. However, most existing methods assume that every meme consists of a shared visual background, called a Template, with some overlaid text, thereby limiting meme matching to comparing the background image alone. Current approaches exclude the many memes that are not template-based, which limits the effectiveness of automated meme analysis and makes them ineffective at linking memes to contemporary web-based meme dictionaries. In this work, we introduce a broader formulation of meme matching that extends beyond template matching. We show that conventional similarity measures, including a novel segment-wise computation of the similarity measures, excel at matching template-based memes but fall short when applied to non-template-based meme formats. However, the segment-wise approach was found to consistently outperform the whole-image measures on matching non-template-based memes. Finally, we explore a prompting-based approach using a pretrained Multimodal Large Language Model for meme matching. Our results highlight that accurately matching memes via shared visual elements, not just background templates, remains an open challenge that requires more sophisticated matching techniques.

[NLP-10] PyLate: Flexible Training and Retrieval for Late Interaction Models

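Quick read: This paper addresses the marked performance drop of single-vector search on out-of-domain generalization, long-context, and reasoning-intensive tasks. The core solution is a multi-vector architecture that keeps an embedding for each token and computes similarity with the MaxSim operator, mitigating the loss caused by compressing information into a single vector. The key contribution is PyLate, a lightweight library built on Sentence Transformers that natively supports training and experimenting with multi-vector models, provides dedicated features such as efficient indexes, and stays compatible with existing code templates, lowering the barrier to entry and advancing research on and deployment of late interaction models in information retrieval systems.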

Link: https://arxiv.org/abs/2508.03555
Authors: Antoine Chaffin, Raphaël Sourty
Affiliations: LightOn, Nancy, France; LightOn, Paris, France
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: 5 pages

Abstract:Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.
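
The MaxSim operator mentioned in the abstract can be written in a few lines. The sketch below is a plain-Python illustration of the ColBERT-style score (for each query token, take the best-matching document token, then sum); it does not use PyLate's API, and it assumes every token embedding is a non-zero vector.

```python
from math import sqrt


def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the maximum
    cosine similarity against any document token."""
    queries = [normalize(v) for v in query_tokens]
    docs = [normalize(v) for v in doc_tokens]
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in docs)
        for q in queries
    )


query = [[1.0, 0.0], [0.0, 1.0]]   # two query token embeddings
document = [[1.0, 0.0], [1.0, 1.0]]  # two document token embeddings
print(maxsim(query, document))
```

A single-vector model would pool each side into one embedding before comparing. Here every query token keeps its own embedding and is matched separately, which is what the abstract means by preserving individual token embeddings.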

[NLP-11] MultiRAG: A Knowledge-guided Framework for Mitigating Hallucination in Multi-source Retrieval Augmented Generation ICDE2025

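Quick read: This paper addresses hallucination in multi-source retrieval-augmented generation, caused by the diversity of knowledge sources and inconsistency among them. Specifically, existing methods face two core challenges when integrating multiple retrieval sources: multi-source data is sparsely distributed, making logical relationships hard to capture, and different sources contradict one another, causing information conflicts. The authors propose a new framework, MultiRAG, with two key innovations: (1) a knowledge construction module based on multi-source line graphs, which efficiently aggregates logical relationships across sources and mitigates data sparsity; and (2) a retrieval module with a multi-level confidence calculation mechanism, which assesses information reliability at both the graph level and the node level to identify and remove unreliable information nodes, reducing hallucinations caused by inter-source inconsistency.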

Link: https://arxiv.org/abs/2508.03553
Authors: Wenlong Wu, Haofen Wang, Bohan Li, Peixuan Huang, Xinzhe Zhao, Lei Liang
Affiliations: Nanjing University of Aeronautics and Astronautics; Tongji University; Ministry of Industry and Information Technology; Collaborative Innovation Center of Novel Software Technology and Industrialization; Ant Group
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL)
Comments: Accepted by ICDE 2025 Research Paper

Abstract:Retrieval Augmented Generation (RAG) has emerged as a promising solution to address hallucination issues in Large Language Models (LLMs). However, the integration of multiple retrieval sources, while potentially more informative, introduces new challenges that can paradoxically exacerbate hallucination problems. These challenges manifest primarily in two aspects: the sparse distribution of multi-source data that hinders the capture of logical relationships and the inherent inconsistencies among different sources that lead to information conflicts. To address these challenges, we propose MultiRAG, a novel framework designed to mitigate hallucination in multi-source retrieval-augmented generation through knowledge-guided approaches. Our framework introduces two key innovations: (1) a knowledge construction module that employs multi-source line graphs to efficiently aggregate logical relationships across different knowledge sources, effectively addressing the sparse data distribution issue; and (2) a sophisticated retrieval module that implements a multi-level confidence calculation mechanism, performing both graph-level and node-level assessments to identify and eliminate unreliable information nodes, thereby reducing hallucinations caused by inter-source inconsistencies. Extensive experiments on four multi-domain query datasets and two multi-hop QA datasets demonstrate that MultiRAG significantly enhances the reliability and efficiency of knowledge retrieval in complex multi-source scenarios. Our code is available in this https URL.

[NLP-12] Beyond the Surface: Enhancing LLM-as-a-Judge Alignment with Human via Internal Representations

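Quick read: This paper addresses the insufficient alignment with human preferences of LLM-as-a-Judge in automated evaluation, especially without relying on complex prompts or fine-tuning. The core challenge is how to use the model's internal representations effectively to improve scoring accuracy. The key to the solution is the LAGER framework, which aggregates score-token logits from different layers (especially the middle-to-upper layers) and computes an expected score from a softmax-based distribution, exploiting complementary information across layers. Without modifying model parameters, it markedly improves agreement with human scores (measured by Spearman correlation). Experiments show that LAGER improves on the best baseline by up to 7.5% on the Flask, HelpSteer, and BIGGen benchmarks, and matches or outperforms reasoning-based methods without reasoning steps.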

Link: https://arxiv.org/abs/2508.03550
Authors: Peng Lai, Jianjie Zheng, Sijie Cheng, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Affiliations: Southern University of Science and Technology; Tsinghua University; Shanghai University of Finance and Economics
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The growing scale of evaluation tasks has led to the widespread adoption of automated evaluation using large language models, a paradigm known as “LLM-as-a-judge.” However, improving its alignment with human preferences without complex prompts or fine-tuning remains challenging. In this work, motivated by preliminary findings that middle-to-upper layers encode semantically and task-relevant representations that are often more aligned with human judgments than the final layer, we propose LAGER, a lightweight and efficient framework for enhancing LLM-as-a-Judge alignment with human scoring, via internal representations. LAGER produces fine-grained judgment scores by aggregating cross-layer score-token logits and computing the expected score from a softmax-based distribution, with the LLM backbone kept frozen. LAGER fully leverages the complementary information across different layers, overcoming the limitations of relying solely on the final layer. We evaluate our method on the standard alignment benchmarks Flask, HelpSteer, and BIGGen using Spearman correlation, and find that LAGER achieves improvements of up to 7.5% over the best baseline across these benchmarks. Without reasoning steps, LAGER matches or outperforms reasoning-based methods. Experiments on downstream applications, such as data selection and emotional understanding, further show the effectiveness of our method.
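
The scoring step described in the abstract (aggregate score-token logits across layers, apply a softmax, take the expected score) can be sketched as follows. The weighted sum used for aggregation and the way layer weights are supplied are assumptions; the abstract does not say how the layers are combined or how any weights are obtained, and the function names are illustrative.

```python
from math import exp


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(layer_logits, layer_weights, scores):
    """Combine per-layer logits for the candidate score tokens and return
    the expectation of the score under the resulting softmax distribution.

    layer_logits[l][i] is layer l's logit for the token of scores[i].
    layer_weights[l] is an assumed mixing weight for layer l.
    """
    aggregated = [
        sum(w * logits[i] for w, logits in zip(layer_weights, layer_logits))
        for i in range(len(scores))
    ]
    probabilities = softmax(aggregated)
    return sum(p * s for p, s in zip(probabilities, scores))


scores = [1, 2, 3, 4, 5]
logits = [[0.0, 0.5, 1.0, 2.0, 1.5],   # a middle layer
          [0.0, 0.2, 0.8, 2.5, 1.0]]   # the final layer
print(expected_score(logits, [0.5, 0.5], scores))
```

Taking the expectation instead of the argmax gives a continuous score, which is one way to obtain the fine-grained judgments the abstract mentions.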

[NLP-13] EmbedGrad: Gradient-Based Prompt Optimization in Embedding Space for Large Language Models

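Quick read: This paper addresses how to efficiently adapt powerful pretrained foundation models to diverse tasks. Current mainstream methods fall into two categories: discrete optimization of text prompts, and continuous adaptation through additional trainable parameters. The former lacks precision, while the latter adds complexity and reduces interpretability. The paper proposes a new framework, EmbedGrad, whose key idea is gradient-based refinement of prompt embeddings, decoupling training from deployment: during optimization, labeled examples guide embedding adjustments while preserving semantic meaning; during inference, only the optimized embeddings are combined with user queries. This enables fine-grained calibration that is not achievable in text space and markedly improves model performance, notably on mathematical reasoning, sentiment analysis, and causal judgment tasks.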

Link: https://arxiv.org/abs/2508.03533
Authors: Xiaoming Hou, Jiquan Zhang, Zibin Lin, DaCheng Tao, Shengli Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Effectively adapting powerful pretrained foundation models to diverse tasks remains a key challenge in AI deployment. Current approaches primarily follow two paradigms: discrete optimization of text prompts through prompt engineering, or continuous adaptation via additional trainable parameters. Both exhibit limitations: discrete methods lack refinement precision while parameter-based techniques increase complexity and reduce interpretability. To address these constraints, we propose EmbedGrad, a novel framework that optimizes text prompt embeddings through gradient-based refinement. Our approach uniquely decouples training from deployment: during optimization, labeled examples guide precise embedding adjustments while preserving semantic meaning; during inference, only optimized embeddings integrate with user queries. This enables fine-grained calibration impossible in text space, such as enhancing the reasoning capability of prompts like "please reason step by step". Comprehensive evaluations across mathematical reasoning, sentiment analysis, and causal judgment tasks demonstrate EmbedGrad’s effectiveness: optimizing this reasoning prompt for Qwen2.5-Math-1.5B increased accuracy from 14.74% to 58.96% on mathematical problems. Consistent improvements were observed across model scales (0.5B-14B) and all tasks, with particularly significant gains for smaller models on complex problems like causal judgment. By bridging prompt engineering and parameter efficiency without architectural changes, our work establishes embedding refinement as a powerful new paradigm for task adaptation.

[NLP-14] Marito: Structuring and Building Open Multilingual Terminologies for South African NLP

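Quick read: This paper addresses the severe lack of structured terminological data for South Africa's official languages in multilingual natural language processing (NLP). Although many government and academic terminology resources exist, they are scattered and stored in non-machine-readable formats, making them hard to use for computational research and development. The key to the solution is to systematically aggregate, clean, and standardize these fragmented resources into open, interoperable datasets (the Marito dataset), released under the Africa-centered NOODL framework. Integrating the dataset into a retrieval-augmented generation (RAG) pipeline shows substantial improvements in the accuracy and domain consistency of English-to-Tshivenda machine translation by large language models, providing a scalable foundation for equitable, robust multilingual NLP technologies.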

Link: https://arxiv.org/abs/2508.03529
Authors: Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba
Affiliations: DSFSI, Dept. of Computer Science, University of Pretoria; AfriDSAI, University of Pretoria; Lelapa AI; Economics and Management Sciences, University of the Free State; Interdisciplinary Centre for Digital Futures, University of the Free State
Categories: Computation and Language (cs.CL)
Comments: Under Review

Abstract:The critical lack of structured terminological data for South Africa’s official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Marito addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Marito dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models. Marito provides a scalable foundation for developing robust and equitable NLP technologies, ensuring South Africa’s rich linguistic diversity is represented in the digital age.

[NLP-15] MoKA: Mixture of Kronecker Adapters

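Quick read: This paper addresses the limited expressiveness of low-rank adapters in parameter-efficient fine-tuning (PEFT) of large language models (LLMs): the rank constraint often restricts their performance on complex tasks. The key to the solution is Mixture of Kronecker Adapters (MoKA), a new Kronecker adapter that models weight updates as a mixture of Kronecker products and uses a gating mechanism to measure the importance of each Kronecker factor, which makes adaptation more expressive. MoKA also offers rank flexibility, giving a better trade-off between parameter efficiency and accuracy while reducing trainable parameters by up to 27x, and it reformulates Kronecker computations as standard matrix operations so that it deploys efficiently on GPU hardware.

Link: https://arxiv.org/abs/2508.03527
Authors: Mohammadreza Sadeghi,Mahsa Ghazvini Nejad,MirHamed Jafarzadeh Asl,Yu Gu,Yuanhao Yu,Masoud Asgharian,Vahid Partovi Nia
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: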

Abstract:Parameter-efficient fine-tuning (PEFT) is essential for reducing the computational overhead of large language models (LLMs). Low-rank family adapters are commonly used to control the parameter size efficiently while maintaining the generative power of LLMs. However, their limited expressiveness due to the rank constraint often restricts their performance on complex tasks. We propose Mixture of Kronecker Adapters (MoKA), a new generation of Kronecker adapters that addresses this limitation by modeling weight updates as a mixture of Kronecker products. Our proposed adapter leverages a gating mechanism that measures the importance of each Kronecker factor, enabling more expressive adaptation. Moreover, MoKA enables a rank flexibility that provides a better trade-off between parameter efficiency and accuracy. To ensure hardware efficiency, we reformulate Kronecker computations using standard matrix operations, allowing seamless deployment on GPU-optimized hardware. We conduct extensive experiments on instruction-tuning and commonsense reasoning tasks using low-bit quantized versions of LLaMA2-7B and LLaMA3-8B models. MoKA not only outperforms PEFT baselines, but also reduces the number of trainable parameters up to 27x, achieving state-of-the-art trade-offs between performance and parameter efficiency.
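The idea of a gated mixture of Kronecker products, and of computing it with ordinary matrix products, can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the softmax gating, the requirement that all factor pairs share the same shapes, and the function names (`moka_delta_explicit`, `moka_apply`) are assumptions made for the example. It relies on the standard identity (A ⊗ B) vec(X) = vec(A X Bᵀ) for row-major vec.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kron(a, b):
    """Explicit Kronecker product of a (p x q) and b (r x s), shape (p*r) x (q*s)."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def moka_delta_explicit(factors, gate_logits):
    """Reference: build sum_i g_i * (A_i kron B_i) explicitly (memory-hungry)."""
    gates = softmax(gate_logits)  # assumed gating; the paper's gate may differ
    total = None
    for g, (a, b) in zip(gates, factors):
        k = kron(a, b)
        if total is None:
            total = [[g * v for v in row] for row in k]
        else:
            total = [[t + g * v for t, v in zip(trow, krow)]
                     for trow, krow in zip(total, k)]
    return total

def moka_apply(factors, gate_logits, x):
    """Apply the same update to vector x without forming any Kronecker product."""
    gates = softmax(gate_logits)
    out = None
    for g, (a, b) in zip(gates, factors):
        q, s = len(a[0]), len(b[0])
        x_mat = [x[j * s:(j + 1) * s] for j in range(q)]   # reshape x to (q x s)
        y_mat = matmul(matmul(a, x_mat), transpose(b))      # A X B^T, shape (p x r)
        y = [g * v for row in y_mat for v in row]           # flatten row-major
        out = y if out is None else [o + v for o, v in zip(out, y)]
    return out
```

The second function only ever multiplies small matrices, which is the kind of reformulation that maps well onto GPU matrix kernels.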

[NLP-16] FilBench: Can LLMs Understand and Generate Filipino?

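Quick read: This paper addresses the lack of evaluation of large language models (LLMs) on non-English languages such as Filipino, and in particular the limited coverage of Philippine languages including Filipino, Tagalog, and Cebuano. The key to the solution is FilBench, a Filipino-centric benchmark whose tasks span Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation, chosen to reflect the priorities and trends of NLP research in the Philippines. An evaluation of 27 state-of-the-art LLMs shows clear weaknesses in Filipino reading comprehension and translation, and models trained specifically for Southeast Asian languages also underperform. The results underline the value of language-specific benchmarks for advancing Filipino NLP and making LLM development more inclusive.

Link: https://arxiv.org/abs/2508.03523
Authors: Lester James V. Miranda,Elyanah Aco,Conner Manuel,Jan Christian Blaise Cruz,Joseph Marvin Imperial
Affiliations: Allen Institute for AI; Nara Institute of Science and Technology; Together AI; SEACrowd; MBZUAI; University of Bath; National University, Philippines
Categories: Computation and Language (cs.CL)
Comments: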

Abstract:Despite the impressive performance of LLMs on English-based tasks, little is known about their capabilities in specific languages such as Filipino. In this work, we address this gap by introducing FilBench, a Filipino-centric benchmark designed to evaluate LLMs across a diverse set of tasks and capabilities in Filipino, Tagalog, and Cebuano. We carefully curate the tasks in FilBench to reflect the priorities and trends of NLP research in the Philippines such as Cultural Knowledge, Classical NLP, Reading Comprehension, and Generation. By evaluating 27 state-of-the-art LLMs on FilBench, we find that several LLMs suffer from reading comprehension and translation capabilities. Our results indicate that FilBench is challenging, with the best model, GPT-4o, achieving only a score of 72.23%. Moreover, we also find that models trained specifically for Southeast Asian languages tend to underperform on FilBench, with the highest-performing model, SEA-LION v3 70B, achieving only a score of 61.07%. Our work demonstrates the value of curating language-specific LLM benchmarks to aid in driving progress on Filipino NLP and increasing the inclusion of Philippine languages in LLM development.

[NLP-17] UPLME: Uncertainty-Aware Probabilistic Language Modelling for Robust Empathy Regression

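Quick read: This paper addresses the difficulty of supervised learning for empathy regression when self-reported empathy scores are noisy. Existing work on noisy labels mostly targets text classification, while the regression setting is comparatively under-explored. The key to the solution is UPLME, an uncertainty-aware probabilistic language modelling framework trained using Bayesian concepts with variational model ensembling, which predicts both an empathy score and its heteroscedastic uncertainty. UPLME adds two novel loss terms: one penalises degenerate uncertainty quantification (UQ), and the other enforces similarity between the input pairs on which empathy is predicted. On two public benchmarks with label noise, the Pearson correlation coefficient improves from 0.558 to 0.580 and from 0.629 to 0.634. The predicted uncertainty separates noisy from clean samples, and calibration error drops from 0.571 to 0.376 relative to a recent variational-ensembling UQ method for regression.

Link: https://arxiv.org/abs/2508.03520
Authors: Md Rakibul Hasan,Md Zakir Hossain,Aneesh Krishna,Shafin Rahman,Tom Gedeon
Affiliations: 1. University of Sydney; 2. Bangladesh University of Engineering and Technology; 3. Australian Institute for Machine Learning; 4. Australian Centre for Robotic Vision; 5. School of Computer Science and Engineering, University of Sydney
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Code available at this https URL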

Abstract: Supervised learning for empathy regression is challenged by noisy self-reported empathy scores. While many algorithms have been proposed for learning with noisy labels in textual classification problems, the regression counterpart is relatively under-explored. We propose UPLME, an uncertainty-aware probabilistic language modelling framework to capture label noise in the regression setting of empathy detection. UPLME includes a probabilistic language model that predicts both empathy score and heteroscedastic uncertainty and is trained using Bayesian concepts with variational model ensembling. We further introduce two novel loss components: one penalises degenerate Uncertainty Quantification (UQ), and another enforces the similarity between the input pairs on which we predict empathy. UPLME provides state-of-the-art performance (Pearson Correlation Coefficient: 0.558 → 0.580 and 0.629 → 0.634) in terms of the performance reported in the literature in two public benchmarks, having label noise. Through synthetic label noise injection, we show that UPLME is effective in separating noisy and clean samples based on the predicted uncertainty. UPLME further outperforms (Calibration error: 0.571 → 0.376) a recent variational model ensembling-based UQ method designed for regression problems.
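To make "predicting a score together with heteroscedastic uncertainty" concrete, here is a minimal sketch of the standard Gaussian negative log-likelihood that such models commonly minimise. It is an assumption that UPLME uses a term of this shape; the paper's two additional loss components and its variational ensembling are not reproduced here, and the function names are hypothetical.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Negative log-likelihood of target y under N(mean, exp(log_var)).

    The model outputs both `mean` (the empathy score) and `log_var`
    (the per-sample uncertainty), so noisy samples can be explained
    by a larger variance instead of a distorted mean.
    """
    return 0.5 * (math.log(2 * math.pi) + log_var
                  + (y - mean) ** 2 / math.exp(log_var))

def batch_nll(targets, means, log_vars):
    """Average the per-sample loss over a batch."""
    losses = [gaussian_nll(y, m, lv) for y, m, lv in zip(targets, means, log_vars)]
    return sum(losses) / len(losses)
```

With a large residual the loss is lower when the predicted variance is larger, which is why the predicted uncertainty can act as a signal for likely label noise.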

[NLP-18] Training Long-Context Multi-Turn Software Engineering Agents with Reinforcement Learning

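Quick read: This paper addresses the challenge of moving reinforcement learning (RL) for large language models (LLMs) from single-turn problems, such as mathematical reasoning or single-shot code generation, to rich multi-turn interaction, in particular software engineering (SWE), where a stateful environment responds to each action with a non-trivial observation. Single-turn problems can be viewed as token-level multi-turn Markov decision processes (MDPs), but that is a degenerate case in which the environment provides no feedback. The key to the solution is a modified Decoupled Advantage Policy Optimization (DAPO) algorithm used to train an agent based on Qwen2.5-72B-Instruct on real-world software engineering tasks without any teacher model. The success rate on SWE-bench Verified rises from 20% for a rejection fine-tuned baseline to 39%, and on SWE-rebench the agent matches or outperforms leading open-weight models, which points to a viable path toward capable autonomous agents built on open models.

Link: https://arxiv.org/abs/2508.03501
Authors: Alexander Golubev,Maria Trofimova,Sergei Polezhaev,Ibragim Badertdinov,Maksim Nekrashevich,Anton Shevtsov,Simon Karasik,Sergey Abramov,Andrei Andriushchenko,Filipp Fisin,Sergei Skvortsov,Boris Yangel
Affiliations: 1. Yandex; 2. Skolkovo Institute of Science and Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: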

Abstract: Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent’s success rate on the SWE-bench Verified benchmark from a 20% rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B using an identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems based on open models.

[NLP-19] CF-RAG: A Dataset and Method for Carbon Footprint QA Using Retrieval-Augmented Generation

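Quick read: This paper addresses the challenge of accurately answering carbon footprint questions from sustainability reports distributed as PDFs, whose extracted text is unstructured and inconsistent. These reports mix text and tables and follow no standard format, which makes large volumes of documents hard to process with conventional methods. The key to the solution is CarbonPDF, an LLM-based technique obtained by fine-tuning Llama 3 on the authors' CarbonPDF-QA dataset and designed to cope with the data inconsistencies and semantic ambiguity left by PDF parsing. It answers carbon footprint questions more accurately than current state-of-the-art question-answering systems.

Link: https://arxiv.org/abs/2508.03489
Authors: Kaiwen Zhao,Bharathan Balaji,Stephen Lee
Affiliations: University of Pittsburgh; Amazon
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: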

Abstract:Product sustainability reports provide valuable insights into the environmental impacts of a product and are often distributed in PDF format. These reports often include a combination of tables and text, which complicates their analysis. The lack of standardization and the variability in reporting formats further exacerbate the difficulty of extracting and interpreting relevant information from large volumes of documents. In this paper, we tackle the challenge of answering questions related to carbon footprints within sustainability reports available in PDF format. Unlike previous approaches, our focus is on addressing the difficulties posed by the unstructured and inconsistent nature of text extracted from PDF parsing. To facilitate this analysis, we introduce CarbonPDF-QA, an open-source dataset containing question-answer pairs for 1735 product report documents, along with human-annotated answers. Our analysis shows that GPT-4o struggles to answer questions with data inconsistencies. To address this limitation, we propose CarbonPDF, an LLM-based technique specifically designed to answer carbon footprint questions on such datasets. We develop CarbonPDF by fine-tuning Llama 3 with our training data. Our results show that our technique outperforms current state-of-the-art techniques, including question-answering (QA) systems finetuned on table and text data.

[NLP-20] Draw Your Mind: Personalized Generation via Condition-Level Modeling in Text-to-Image Diffusion Models ICCV2025

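Quick read: This paper addresses the accuracy of personalized generation in text-to-image (T2I) diffusion models, that is, how to incorporate individual user preferences into generation naturally and with minimal user intervention. Existing methods rely mainly on prompt-level modeling, which often personalizes poorly because T2I diffusion models have limited input token capacity. The key to the solution is DrUM, which integrates user profiling with a transformer-based adapter in the latent space to perform condition-level modeling. This improves the accuracy and adaptability of personalized generation, works with open-source text encoders, and requires no additional fine-tuning of the foundation T2I model.

Link: https://arxiv.org/abs/2508.03481
Authors: Hyungjin Kim,Seokho Ahn,Young-Duk Seo
Affiliations: Inha University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at ICCV 2025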

Abstract:Personalized generation in T2I diffusion models aims to naturally incorporate individual user preferences into the generation process with minimal user intervention. However, existing studies primarily rely on prompt-level modeling with large-scale models, often leading to inaccurate personalization due to the limited input token capacity of T2I diffusion models. To address these limitations, we propose DrUM, a novel method that integrates user profiling with a transformer-based adapter to enable personalized generation through condition-level modeling in the latent space. DrUM demonstrates strong performance on large-scale datasets and seamlessly integrates with open-source text encoders, making it compatible with widely used foundation T2I models without requiring additional fine-tuning.

[NLP-21] fact check AI at SemEval-2025 Task 7: Multilingual and Crosslingual Fact-checked Claim Retrieval SEMEVAL-2025 ACL

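Quick read: This paper addresses multilingual and cross-lingual fact-checked claim retrieval: retrieving, efficiently and accurately, previously verified claims relevant to a query claim from a large corpus across languages. The key to the solution is to cast the task as Learning-to-Rank and fine-tune a bi-encoder built on a pre-trained transformer optimized for sentence similarity. Training uses both the source languages and their English translations for multilingual retrieval, and only the English translations for cross-lingual retrieval. With lightweight models (fewer than 500M parameters) trained on Kaggle T4 GPUs, the method reaches 92% Success@10 in the multilingual track and 80% in the cross-lingual track.

Link: https://arxiv.org/abs/2508.03475
Authors: Pranshu Rastogi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: 7 pages, 6 tables. Code available at this https URL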

Abstract: SemEval-2025 Task 7: Multilingual and Crosslingual Fact-Checked Claim Retrieval is approached as a Learning-to-Rank task using a bi-encoder model fine-tuned from a pre-trained transformer optimized for sentence similarity. Training used both the source languages and their English translations for multilingual retrieval and only English translations for cross-lingual retrieval. Using lightweight models with fewer than 500M parameters and training on Kaggle T4 GPUs, the method achieved 92% Success@10 in multilingual and 80% Success@10 in crosslingual retrieval, placing 5th in the crosslingual and 10th in the multilingual track.
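For readers unfamiliar with the metric, Success@K is usually taken to be the fraction of queries for which at least one relevant item appears among the top K retrieved results. The sketch below follows that common definition; the exact scoring script used by the task is not shown in the abstract, so the data layout and the function name `success_at_k` are assumptions.

```python
def success_at_k(rankings, relevant, k=10):
    """Fraction of queries with at least one relevant item in the top-k results.

    rankings: dict mapping query id -> list of retrieved ids, best first
    relevant: dict mapping query id -> set of correct ids for that query
    """
    hits = 0
    for query_id, ranked_ids in rankings.items():
        if set(ranked_ids[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(rankings)
```

A bi-encoder produces `rankings` by embedding the query and all candidate fact checks separately and sorting candidates by similarity.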

[NLP-22] Cropping outperforms dropout as an augmentation strategy for training self-supervised text embeddings

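Quick read: This paper addresses the reliance of current text embedding models on extensive supervised fine-tuning with curated text pairs, and explores how far self-supervised learning can go in producing high-quality embeddings. The key to the solution is contrastive learning based on data augmentation, with a systematic comparison of the two best-known strategies for generating positive pairs: cropping and dropout. Cropping strongly outperforms dropout. On in-domain data, a very short self-supervised fine-tuning run yields embeddings that are sometimes only marginally below supervised state-of-the-art models. Representation quality increases towards the last transformer layers, which change the most during fine-tuning, and fine-tuning only those layers is sufficient to reach similar embedding quality.

Link: https://arxiv.org/abs/2508.03453
Authors: Rita González-Márquez,Philipp Berens,Dmitry Kobak
Affiliations: University of Tübingen, Germany; Hertie Institute for AI in Brain Health
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: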

Abstract:Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, sentiment analysis, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via extensive supervised fine-tuning using curated text pairs. This contrasts with computer vision, where self-supervised training based on data augmentations has demonstrated remarkable success. Here we systematically compare the two most well-known augmentation strategies for positive pair generation in contrastive learning of text embeddings. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is below the supervised SOTA models, but for in-domain data, self-supervised fine-tuning produces high-quality text embeddings after very short fine-tuning, sometimes only marginally below the supervised SOTA. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
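As a rough illustration of cropping as an augmentation, a positive pair can be formed from two random contiguous spans of the same text; the two spans are then pulled together by the contrastive loss. This is only a sketch: whitespace tokenization, the crop-length range, and the function names are assumptions, not the settings used in the paper, and the text is assumed to be non-empty.

```python
import random

def crop(tokens, min_frac, max_frac, rng):
    """Return a random contiguous span covering min_frac..max_frac of the tokens."""
    n = len(tokens)
    length = max(1, int(n * rng.uniform(min_frac, max_frac)))
    start = rng.randint(0, n - length)  # inclusive bounds, so the span always fits
    return tokens[start:start + length]

def positive_pair(text, rng, min_frac=0.3, max_frac=0.7):
    """Build two independently cropped views of one text (a positive pair)."""
    tokens = text.split()
    view_a = " ".join(crop(tokens, min_frac, max_frac, rng))
    view_b = " ".join(crop(tokens, min_frac, max_frac, rng))
    return view_a, view_b
```

The dropout alternative instead feeds the identical text through the encoder twice and relies on different dropout masks to make the two views differ.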

[NLP-23] LLMs Have a Heart of Stone: Demystifying the Soft Thinking Ability of Large Reasoning Models

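Quick read: This paper addresses the concern that current large language models (LLMs) reason by generating discrete tokens, which limits their ability to express abstract and continuous concepts. Soft Thinking tries to reason in a continuous concept space through soft tokens, but the study finds that models in practice rely on the most influential component of the soft input in subsequent decoding steps. This leaves reasoning paths under-explored and reduces vanilla Soft Thinking to a form of greedy decoding, which undermines the advantage of passing more information through soft tokens. The key to the solution is to introduce controlled randomness through sampling strategies such as Dirichlet resampling and the Gumbel-Softmax trick. Gumbel-Softmax provides adequate randomness with controlled smoothness and gives the best performance across eight reasoning benchmarks.

Link: https://arxiv.org/abs/2508.03440
Authors: Junhong Wu,Jinliang Lu,Zixuan Ren,Ganqiang Hu,Zhi Wu,Dai Dai,Hua Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 7 figures, working in progress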

Abstract: Human cognition naturally engages with abstract and fluid concepts, whereas existing reasoning models often rely on generating discrete tokens, potentially constraining their expressive capabilities. Recent advancements aim to address this limitation by enabling large language models (LLMs) to generate soft, abstract tokens, thus facilitating reasoning within a continuous concept space. This paper explores the 'Soft Thinking' capabilities of various LLMs by examining the models’ internal behavior using a suite of probing techniques. Contrary to the common belief that Soft Thinking enables the simultaneous exploration of diverse reasoning paths, our findings reveal that LLMs predominantly rely on the most influential component of the soft inputs during subsequent decoding steps. This reliance hinders the exploration of different reasoning paths and reduces vanilla Soft Thinking to a form of greedy decoding, obscuring the advantage of transmitting more information through Soft Tokens. To tackle this issue, we explore sampling strategies to introduce randomness, employing methods such as Dirichlet resampling and the Gumbel-Softmax trick. Our experiments demonstrate that incorporating randomness can alleviate the limitations of vanilla approaches and unleash the potential of Soft Thinking. Notably, the Gumbel-Softmax trick provides adequate randomness with controlled smoothness, resulting in superior performance across eight reasoning benchmarks.
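The Gumbel-Softmax trick mentioned above can be sketched as follows: Gumbel noise is added to the log-probabilities of the next-token distribution, and a temperature-scaled softmax turns the result into a randomised soft distribution. How the paper plugs this into Soft Thinking (for example, how the resulting weights mix token embeddings) is not reproduced here; the function name and the clamping constant are assumptions.

```python
import math
import random

def gumbel_softmax(probs, tau, rng):
    """Perturb a probability vector with Gumbel noise and renormalise.

    Lower tau gives a sharper (closer to one-hot) result; higher tau gives
    a smoother one. rng is a random.Random instance for reproducibility.
    """
    eps = 1e-12  # keeps the logarithms finite
    scores = []
    for p in probs:
        u = min(max(rng.random(), eps), 1.0 - eps)
        g = -math.log(-math.log(u))            # sample from Gumbel(0, 1)
        scores.append((math.log(max(p, eps)) + g) / tau)
    m = max(scores)                             # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the noise differs on every call, the dominant component of the soft token is no longer always the argmax of `probs`, which is the kind of randomness the paper argues vanilla Soft Thinking lacks.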

[NLP-24] Variety Is the Spice of Life: Detecting Misinformation with Dynamic Environmental Representations CIKM2025

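Quick read: This paper addresses the static learning assumption common to existing misinformation detection (MD) methods. Conventional models treat the mapping from the content, links, and propagation of news articles to their veracity labels as fixed, and overlook the fact that veracity may change over time in a dynamically evolving social environment. The key to the solution is MISDER (Misinformation detection with Dynamic Environmental Representations), a framework that learns a social environmental representation for each period and uses a temporal model to predict the representation for future periods. The paper instantiates three variants, MISDER-LSTM (an LSTM model), MISDER-ODE (a continuous dynamics equation), and MISDER-PT (a pre-trained dynamics system), which improve the detection of dynamically evolving misinformation.

Link: https://arxiv.org/abs/2508.03420
Authors: Bing Wang,Ximing Li,Yiming Wang,Changchun Li,Jiaxu Cui,Renchu Guan,Bo Yang
Affiliations: College of Computer Science and Technology, Jilin University
Categories: Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted by CIKM 2025. 11 pages, 4 figures. Code: this https URL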

Abstract:The proliferation of misinformation across diverse social media platforms has drawn significant attention from both academic and industrial communities due to its detrimental effects. Accordingly, automatically distinguishing misinformation, dubbed as Misinformation Detection (MD), has become an increasingly active research topic. The mainstream methods formulate MD as a static learning paradigm, which learns the mapping between the content, links, and propagation of news articles and the corresponding manual veracity labels. However, the static assumption is often violated, since in real-world scenarios, the veracity of news articles may vacillate within the dynamically evolving social environment. To tackle this problem, we propose a novel framework, namely Misinformation detection with Dynamic Environmental Representations (MISDER). The basic idea of MISDER lies in learning a social environmental representation for each period and employing a temporal model to predict the representation for future periods. In this work, we specify the temporal model as the LSTM model, continuous dynamics equation, and pre-trained dynamics system, suggesting three variants of MISDER, namely MISDER-LSTM, MISDER-ODE, and MISDER-PT, respectively. To evaluate the performance of MISDER, we compare it to various MD baselines across 2 prevalent datasets, and the experimental results can indicate the effectiveness of our proposed model.

[NLP-25] ReDSM5: A Reddit Dataset for DSM-5 Depression Detection CIKM2025

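Quick read: This paper addresses the lack of clinical interpretability in existing computational methods for identifying depressive symptoms on social media. Conventional models usually label an entire post as depressed or not depressed without fine-grained annotation against the specific diagnostic criteria of the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition), so their results are hard to connect to clinical practice. The key to the solution is ReDSM5, a new Reddit corpus of 1484 long-form posts annotated at the sentence level by a licensed psychologist for the nine DSM-5 depression symptoms, with a concise clinical rationale grounded in DSM-5 methodology for each label. This lets models perform multi-label symptom classification and also generate human-interpretable reasoning, improving the clinical usefulness of depression detection.

Link: https://arxiv.org/abs/2508.03399
Authors: Eliseo Bao,Anxo Pérez,Javier Parapar
Affiliations: IRLab, CITIC, Universidade da Coruña
Categories: Computation and Language (cs.CL)
Comments: Accepted as a resource paper at CIKM 2025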

Abstract:Depression is a pervasive mental health condition that affects hundreds of millions of individuals worldwide, yet many cases remain undiagnosed due to barriers in traditional clinical access and pervasive stigma. Social media platforms, and Reddit in particular, offer rich, user-generated narratives that can reveal early signs of depressive symptomatology. However, existing computational approaches often label entire posts simply as depressed or not depressed, without linking language to specific criteria from the DSM-5, the standard clinical framework for diagnosing depression. This limits both clinical relevance and interpretability. To address this gap, we introduce ReDSM5, a novel Reddit corpus comprising 1484 long-form posts, each exhaustively annotated at the sentence level by a licensed psychologist for the nine DSM-5 depression symptoms. For each label, the annotator also provides a concise clinical rationale grounded in DSM-5 methodology. We conduct an exploratory analysis of the collection, examining lexical, syntactic, and emotional patterns that characterize symptom expression in social media narratives. Compared to prior resources, ReDSM5 uniquely combines symptom-specific supervision with expert explanations, facilitating the development of models that not only detect depression but also generate human-interpretable reasoning. We establish baseline benchmarks for both multi-label symptom classification and explanation generation, providing reference results for future research on detection and interpretability.

[NLP-26] A Comparative Study of Neurosymbolic AI Approaches to Interpretable Logical Reasoning

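Quick read: This paper addresses the weakness of large language models (LLMs) in general logical reasoning, in particular that their reasoning is neither deterministic nor interpretable. Neurosymbolic AI approaches fall into two main groups: the integrative approach, where symbolic reasoning is contained within the neural network, and the hybrid approach, where a separate symbolic solver performs symbolic reasoning. Comparing one representative model of each, Logic Neural Network (LNN) for the integrative approach and LLM-Symbolic Solver (LLM-SS) for the hybrid approach, the paper finds the hybrid approach more promising for general logical reasoning because its reasoning chain is more interpretable and it retains the capabilities and advantages of existing LLMs. The author proposes a generalizable framework based on LLM-SS that is modular, model-agnostic, domain-agnostic, and requires little to no human input, to support future work on the hybrid approach.

Link: https://arxiv.org/abs/2508.03366
Authors: Michael K. Chen
Affiliations: Raffles Institution
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Symbolic Computation (cs.SC)
Comments: Accepted to NeSy 2025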

Abstract:General logical reasoning, defined as the ability to reason deductively on domain-agnostic tasks, continues to be a challenge for large language models (LLMs). Current LLMs fail to reason deterministically and are not interpretable. As such, there has been a recent surge in interest in neurosymbolic AI, which attempts to incorporate logic into neural networks. We first identify two main neurosymbolic approaches to improving logical reasoning: (i) the integrative approach comprising models where symbolic reasoning is contained within the neural network, and (ii) the hybrid approach comprising models where a symbolic solver, separate from the neural network, performs symbolic reasoning. Both contain AI systems with promising results on domain-specific logical reasoning benchmarks. However, their performance on domain-agnostic benchmarks is understudied. To the best of our knowledge, there has not been a comparison of the contrasting approaches that answers the following question: Which approach is more promising for developing general logical reasoning? To analyze their potential, the following best-in-class domain-agnostic models are introduced: Logic Neural Network (LNN), which uses the integrative approach, and LLM-Symbolic Solver (LLM-SS), which uses the hybrid approach. Using both models as case studies and representatives of each approach, our analysis demonstrates that the hybrid approach is more promising for developing general logical reasoning because (i) its reasoning chain is more interpretable, and (ii) it retains the capabilities and advantages of existing LLMs. To support future works using the hybrid approach, we propose a generalizable framework based on LLM-SS that is modular by design, model-agnostic, domain-agnostic, and requires little to no human input.

[NLP-27] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

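Quick read: This paper addresses the largely unexplored in-context learning (ICL) potential of reasoning large language models (RLLMs), and in particular how the structured difference between a model's reasoning modes can be used to improve reasoning accuracy. The key to the solution is a new paradigm, Thinking with Nothinking Calibration (JointThinking). The model is prompted to produce two answers in parallel, one in Thinking mode and one in Nothinking mode. A second round of Thinking is triggered only when the two disagree, using a single prompt that contains the original question and both candidate answers. Because disagreement is infrequent (for example, only 6% on GSM8K), most cases need just one round of reasoning, so the latency overhead is small. Combining the two reasoning modes improves accuracy and robustness, matching a training-based state-of-the-art method in distribution and substantially outperforming it on out-of-distribution tasks.

Link: https://arxiv.org/abs/2508.03363
Authors: Haotian Wu,Bo Xu,Yao Shu,Menglin Yang,Chengwei Qin
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: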

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that leverages the structured difference between two reasoning modes, i.e., Thinking and Nothinking, to improve reasoning accuracy. Specifically, our method prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt that incorporates the original question and both candidate answers. Since such disagreement occurs infrequently (e.g., only 6% in GSM8K), our method performs just one round of reasoning in most cases, resulting in minimal latency overhead. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT) and majority voting with improved answer robustness. Moreover, it achieves comparable in-distribution performance to training-based SOTA method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing that leveraging different reasoning modes consistently lowers the error rate and highlights the value of structural thinking diversity. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second round of thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.
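The control flow described in the abstract is simple enough to sketch. In the example below, `think` and `no_think` are hypothetical callables standing in for the same model run in its two modes (prompt in, answer string out); the wording of the second-round prompt and the exact-match agreement check are also assumptions, not the paper's templates.

```python
def joint_thinking(question, think, no_think):
    """Answer a question with at most two rounds of reasoning.

    Returns (answer, rounds_used). Round two runs only when the Thinking
    and Nothinking answers from round one disagree.
    """
    answer_think = think(question)
    answer_fast = no_think(question)  # in practice issued in parallel

    # Agreement: accept the answer after a single round.
    if answer_think.strip() == answer_fast.strip():
        return answer_think, 1

    # Disagreement: one calibration prompt with the question and both candidates.
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer A: {answer_think}\n"
        f"Candidate answer B: {answer_fast}\n"
        "The candidates disagree. Reason carefully and give the final answer."
    )
    return think(prompt), 2
```

Since the abstract reports disagreement in only about 6% of GSM8K cases, the second branch would run rarely, which is where the low latency overhead comes from.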

[NLP-28] Taggus: An Automated Pipeline for the Extraction of Characters Social Networks from Portuguese Fiction Literature

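Quick read: This paper addresses the automatic identification of characters and their interactions in Portuguese fiction. The task usually combines several natural language processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging, but existing methods underperform in less-represented languages such as Portuguese, mainly because manually annotated training data is scarce. The authors propose a dedicated pipeline, Taggus, which combines POS tagging with heuristics to identify characters, resolve co-reference, and detect interactions between characters. Taggus reaches an average F1-score of 94.1% on character identification and co-reference resolution and 75.9% on interaction detection, increases of 50.7% and 22.3% respectively over readily available tools (off-the-shelf NER tools and ChatGPT).

Link: https://arxiv.org/abs/2508.03358
Authors: Tiago G Canário,Catarina Duarte,Flávio L. Pinheiro,João L.M. Pereira
Affiliations: Novo Norte de Lisboa (Novaims); University of Évora (UEvora)
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 24 pages, 5 Figures, 4 Tables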

Abstract: Automatically identifying characters and their interactions from fiction books is, arguably, a complex task that requires pipelines that leverage multiple Natural Language Processing (NLP) methods, such as Named Entity Recognition (NER) and Part-of-speech (POS) tagging. However, these methods are not optimized for the task that leads to the construction of Social Networks of Characters. Indeed, the currently available methods tend to underperform, especially in less-represented languages, due to a lack of manually annotated data for training. Here, we propose a pipeline, which we call Taggus, to extract social networks from literary fiction works in Portuguese. Our results show that compared to readily available State-of-the-Art tools – off-the-shelf NER tools and Large Language Models (ChatGPT) – the resulting pipeline, which uses POS tagging and a combination of heuristics, achieves satisfying results with an average F1-Score of 94.1% in the task of identifying characters and solving for co-reference and 75.9% in interaction detection. These represent, respectively, an increase of 50.7% and 22.3% on results achieved by the readily available State-of-the-Art tools. Further steps to improve results are outlined, such as solutions for detecting relationships between characters. Limitations on the size and scope of our testing samples are acknowledged. The Taggus pipeline is publicly available to encourage development in this field for the Portuguese language.

[NLP-29] VLMQ: Efficient Post-Training Quantization for Large Vision-Language Models via Hessian Augmentation

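Quick read: This paper addresses the performance drop that post-training quantization (PTQ) suffers on vision-language models (VLMs) because of a modality discrepancy: text tokens are limited while vision tokens are excessive and redundant. Existing Hessian-based PTQ methods for LLMs ignore token-level importance and treat all tokens equally, which causes severe degradation. The key to the solution is VLMQ, an importance-aware PTQ framework for VLMs with two main components: 1) an optimization objective with token-level importance factors that yields an enhanced Hessian while remaining compatible with parallelized weight updates; and 2) efficient computation of these factors through a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. VLMQ achieves state-of-the-art results across benchmarks, especially under low-bit settings such as 2-bit quantization.

Link: https://arxiv.org/abs/2508.03351
Authors: Yufei Xue,Yushi Huang,Jiawei Shao,Jun Zhang
Affiliations: TeleAI
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 13 pages, 5 figures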

Abstract: Post-training quantization (PTQ) has emerged as an effective approach for compressing large models and accelerating their inference without retraining. While PTQ has been extensively studied in the context of large language models (LLMs), its applicability to vision-language models (VLMs) remains underexplored. In this paper, we identify a modality discrepancy (i.e., limited text tokens vs. excessive and redundant vision tokens) of VLMs. However, existing Hessian-based LLM PTQ methods treat all tokens equally during quantization, resulting in severe performance drops when applied to VLMs. Motivated by this observation, we propose a novel importance-aware PTQ framework tailored for VLMs, dubbed VLMQ. Specifically, to address vision token redundancy, VLMQ 1) optimizes an importance-aware objective that yields an enhanced Hessian with token-level importance factors, while retaining compatibility with parallelized weight updates, and 2) ensures efficiency and effectiveness by computing these factors via a single lightweight block-wise backward pass, guided by a theoretical connection to token-level perturbations. Extensive evaluations on 8 benchmarks across 0.5B to 32B VLMs demonstrate the state-of-the-art (SOTA) performance of our VLMQ, particularly under low-bit settings. For example, it achieves a substantial 16.45% improvement on MME-RealWorld under 2-bit quantization.
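To illustrate what a Hessian with token-level importance factors might look like: in Hessian-based PTQ the layer-wise Hessian is proportional to the sum of outer products of the calibration activations, with every token weighted equally. The sketch below simply scales each token's outer product by a weight. How VLMQ actually derives the weights (the block-wise backward pass) and any normalization it applies are not reproduced; `weighted_hessian` and its inputs are hypothetical.

```python
def weighted_hessian(tokens, importance):
    """Accumulate sum_t w_t * x_t x_t^T over calibration tokens.

    tokens:     list of activation vectors x_t (one per token)
    importance: list of non-negative weights w_t (one per token)
    With all weights equal to 1 this reduces to the usual X^T X form.
    """
    d = len(tokens[0])
    h = [[0.0] * d for _ in range(d)]
    for x, w in zip(tokens, importance):
        for i in range(d):
            for j in range(d):
                h[i][j] += w * x[i] * x[j]
    return h
```

Down-weighting redundant vision tokens means their activations contribute less to the Hessian and therefore influence the quantization error compensation less.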

[NLP-30] CTTS: Collective Test-Time Scaling

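Quick read: This paper addresses the limitation that existing test-time scaling (TTS) methods are bounded by the capability of a single agent (SA): traditional methods such as Best-of-N and Self-Consistency rely on a single large language model (LLM) interacting with a reward model (RM) and therefore struggle to exceed that performance ceiling. To overcome this, the paper proposes Collective Test-Time Scaling (CTTS). The key is designing and evaluating three paradigms that combine multiple agents and multiple reward models, leading to the CTTS-MM framework: Agent Collaboration Search (ACS) searches a large candidate pool for the most effective combination of agents, while Mixture of Reward Models (MoR) uses Prior Reward model Ensemble Selection (PRES) with the Pair-wise Reward Ranking (PRR) metric to select the best combination of reward models, so that multi-agent and multi-reward-model collaboration jointly improve inference performance.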

Link: https://arxiv.org/abs/2508.03333
Authors: Zhende Song, Shengji Tang, Peng Ye, Jiayuan Fan, Tao Chen
Affiliations: Shanghai Artificial Intelligence Laboratory (Shanghai AI Lab); University of Science and Technology of China
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Test-time scaling (TTS) has emerged as a promising research field for enhancing the effectiveness of large language models (LLMs) without extra training. However, most existing approaches, e.g., Best-of-N and Self-Consistency, rely on a single agent interacting with a reward model (SA-SR), constrained by limited capabilities of a single test-time scaling (STTS) paradigm. On the other hand, recent works demonstrate that collective-agent methods can break through the upper bound of single-agent systems by orchestrating diverse models. Thus, in this paper, we take a first step towards exploring Collective Test-Time Scaling (CTTS). Considering the different interaction types of single and multiple models, we design three primary paradigms to investigate the optimal paradigm of CTTS: (1) single agent to multiple reward models (SA-MR); (2) multiple agents to single reward model (MA-SR); and (3) multiple agents to multiple reward models (MA-MR). Extensive experiments demonstrate that MA-MR consistently achieves the best performance. Based on this, we propose a novel framework named CTTS-MM that effectively leverages both multi-agent and multi-reward-model collaboration for enhanced inference. Specifically, for multi-agent collaboration, we propose an Agent Collaboration Search (ACS), which searches for the most effective combination of LLM agents from a large candidate pool; for multi-reward-model collaboration, we propose Mixture of Reward Models (MoR), which consists of a curated question pool and a Prior Reward model Ensemble Selection (PRES) to select the optimal combinations of reward models via Pair-wise Reward Ranking (PRR) metric. Experiments across seven mainstream benchmarks demonstrate that the proposed CTTS-MM consistently obtains superior performance. Code will be released at this https URL.
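
The MA-MR paradigm (multiple agents, multiple reward models) can be sketched roughly as below. This is a minimal illustration only: the candidate answers, the toy reward functions, and the plain averaging rule are all hypothetical stand-ins, and the sketch does not reproduce ACS, PRES, or the PRR metric described in the abstract.

```python
def select_best(candidates, reward_models):
    """Pick the candidate with the highest mean score across reward models."""
    def mean_score(answer):
        return sum(rm(answer) for rm in reward_models) / len(reward_models)
    return max(candidates, key=mean_score)

# Hypothetical answers produced by three different agents for "2 + 2 = ?".
candidates = ["4", "four", "5"]

# Hypothetical toy reward models; real ones would be learned scorers.
def prefers_digits(answer):
    return 1.0 if answer.isdigit() else 0.0

def prefers_four(answer):
    return 1.0 if answer in ("4", "four") else 0.0

best = select_best(candidates, [prefers_digits, prefers_four])
```

Neither toy reward model alone separates all three candidates, but their average does, which is the intuition behind combining several reward models.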

[NLP-31] Reliable Evaluation Protocol for Low-Precision Retrieval

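Quick read: This paper addresses spurious ties among relevance scores in low-precision retrieval systems, which arise from reduced numerical granularity. These ties make result rankings highly variable and undermine the reliability of evaluation. The key to the solution is a more robust retrieval evaluation protocol with two core components: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates at minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which quantify the ordering uncertainty of tied candidates by reporting expected scores, range, and bias. Experiments show that HPS dramatically reduces tie-induced instability and that TRM accurately recovers expected metric values, enabling more consistent and reliable evaluation of low-precision retrieval systems.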

Link: https://arxiv.org/abs/2508.03306
Authors: Kisu Yang, Yoonna Jang, Hwanseok Jang, Kenneth Choi, Isabelle Augenstein, Heuiseok Lim
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: 11 pages, 5 figures, submitted to ARR

Abstract:Lowering the numerical precision of model parameters and computations is widely adopted to improve the efficiency of retrieval systems. However, when computing relevance scores between the query and documents in low-precision, we observe spurious ties due to the reduced granularity. This introduces high variability in the results based on tie resolution, making the evaluation less reliable. To address this, we propose a more robust retrieval evaluation protocol designed to reduce score variation. It consists of: (1) High-Precision Scoring (HPS), which upcasts the final scoring step to higher precision to resolve tied candidates with minimal computational cost; and (2) Tie-aware Retrieval Metrics (TRM), which report expected scores, range, and bias to quantify order uncertainty of tied candidates. Our experiments test multiple models with three scoring functions on two retrieval datasets to demonstrate that HPS dramatically reduces tie-induced instability, and TRM accurately recovers expected metric values. This combination enables a more consistent and reliable evaluation system for lower-precision retrievals.
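
The sketch below illustrates how spurious ties appear and what a tie-aware expectation looks like. It rounds scores to IEEE 754 half precision with the standard-library `struct` module and computes the expected reciprocal rank when ties are broken uniformly at random. The score values and helper names are made up for illustration, and only the expectation is shown, not the range or bias that TRM also reports.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def expected_reciprocal_rank(scores, relevant_idx):
    """Expected 1/rank of the relevant item under uniform random tie-breaking."""
    target = scores[relevant_idx]
    higher = sum(s > target for s in scores)
    tied = sum(s == target for s in scores)  # includes the relevant item itself
    return sum(1.0 / (higher + k) for k in range(1, tied + 1)) / tied

scores = [0.9003, 0.9004, 0.9005]      # the relevant document is index 2
low = [to_fp16(s) for s in scores]     # all three collapse to one fp16 value

full_rr = expected_reciprocal_rank(scores, 2)
tied_rr = expected_reciprocal_rank(low, 2)
```

In full precision the relevant document is ranked first. After rounding, all three scores are tied, so its rank depends entirely on tie resolution (anywhere from 1 to 3) and the expected reciprocal rank falls to 11/18.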

[NLP-32] Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

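Quick read: This paper addresses the insufficient accuracy, opaque decisions, and disconnect from platform moderation policies in current systems for moderating harmful or policy-violating content on social platforms. Existing methods rely largely on noisy, label-driven learning, so models are hard to align with concrete moderation rules and produce outputs that lack interpretability and hinder human review. The key to the solution is Hierarchical Guard (Hi-Guard), a policy-aligned multimodal moderation framework whose core contributions are: (1) a two-stage hierarchical moderation pipeline in which a lightweight binary model first filters safe content and a stronger model then performs fine-grained risk classification; (2) a path-based hierarchical taxonomy in the second stage that predicts from coarse to fine-grained levels; (3) rule definitions incorporated directly into the model prompt so that decisions stay aligned with evolving policies; and (4) a multi-level soft-margin reward optimized with Group Relative Policy Optimization (GRPO), which penalizes semantically adjacent misclassifications to improve structured prediction and explanation quality. Experiments show that the approach improves classification accuracy, generalization, and interpretability, offering a path toward scalable, transparent, and trustworthy content safety systems.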

Link: https://arxiv.org/abs/2508.03296
Authors: Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu
Affiliations: Shanghai Jiao Tong University; Xiaohongshu Inc.; Huazhong University of Science and Technology
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term “Hierarchical” reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: this https URL.

[NLP-33] NLP Methods May Actually Be Better Than Professors at Estimating Question Difficulty

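Quick read: This paper addresses the limited ability of instructors to estimate the difficulty of exam questions, specifically in the areas of Neural Networks and Machine Learning, where professors' judgments of how hard a question is are often inaccurate. The key to the solution is to use the uncertainty of large language models (LLMs) in a supervised learning setting with only 42 labeled samples, which improves prediction of the correct-answer rate on True/False questions and can help instructors design exams and improve assessment quality.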

Link: https://arxiv.org/abs/2508.03294
Authors: Leonidas Zotos, Ivo Pascal de Jong, Matias Valdenegro-Toro, Andreea Ioana Sburlea, Malvina Nissim, Hedderik van Rijn
Affiliations: Center for Language and Cognition, University of Groningen; Bernoulli Institute, University of Groningen; Department of Experimental Psychology, University of Groningen
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 10 pages, 2 figures, accepted at the 2nd International Workshop on AI in Society, Education and Educational Research (AISEER)

Abstract:Estimating the difficulty of exam questions is essential for developing good exams, but professors are not always good at this task. We compare various Large Language Model-based methods with three professors in their ability to estimate what percentage of students will give correct answers on True/False exam questions in the areas of Neural Networks and Machine Learning. Our results show that the professors have limited ability to distinguish between easy and difficult questions and that they are outperformed by directly asking Gemini 2.5 to solve this task. Yet, we obtained even better results using uncertainties of the LLMs solving the questions in a supervised learning setting, using only 42 training samples. We conclude that supervised learning using LLM uncertainty can help professors better estimate the difficulty of exam questions, improving the quality of assessment.

[NLP-34] Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes

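Quick read: This paper addresses the risk that large language models (LLMs) implicitly amplify gender bias in generative tasks, focusing on deeper, contextual bias that is hard to capture with explicit gender cues or short-text tasks such as sentence completion and question answering. The key to the solution is a new psychology-grounded dataset, StereoBias-Stories, which contains open-ended narrative generation tasks based on 25 psychological stereotypes (e.g., aggressiveness or gossiping) and supports a systematic analysis of how the gender contribution in model outputs changes when stories are conditioned on one, two, or six random attributes and on story endings. The study finds that: (1) models are on average biased towards male characters in unconditioned prompts, but conditioning on attributes independent of gender stereotypes mitigates this bias; (2) combining multiple attributes associated with the same gender stereotype intensifies model behavior, with male attributes amplifying bias and female attributes alleviating it; and (3) model biases align with the psychological ground truth used for categorization, and this alignment strengthens with model size. These results indicate the value of psychology-grounded evaluation for revealing and quantifying implicit gender bias in LLMs.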

Link: https://arxiv.org/abs/2508.03292
Authors: Shahed Masoudian, Gustavo Escobedo, Hannah Strauss, Markus Schedl
Affiliations: Johannes Kepler University (JKU); Linz Institute of Technology (LIT); University of Innsbruck
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Under Review

Abstract:As Large Language Models (LLMs) are increasingly used across different applications, concerns about their potential to amplify gender biases in various tasks are rising. Prior research has often probed gender bias using explicit gender cues as counterfactual, or studied them in sentence completion and short question answering tasks. These formats might overlook more implicit forms of bias embedded in generative behavior of longer content. In this work, we investigate gender bias in LLMs using gender stereotypes studied in psychology (e.g., aggressiveness or gossiping) in an open-ended task of narrative generation. We introduce a novel dataset called StereoBias-Stories containing short stories either unconditioned or conditioned on (one, two, or six) random attributes from 25 psychological stereotypes and three task-related story endings. We analyze how the gender contribution in the overall story changes in response to these attributes and present three key findings: (1) While models, on average, are highly biased towards male in unconditioned prompts, conditioning on attributes independent from gender stereotypes mitigates this bias. (2) Combining multiple attributes associated with the same gender stereotype intensifies model behavior, with male ones amplifying bias and female ones alleviating it. (3) Model biases align with psychological ground-truth used for categorization, and alignment strength increases with model size. Together, these insights highlight the importance of psychology-grounded evaluation of LLMs.

[NLP-35] Understanding the Embedding Models on Hyper-relational Knowledge Graph CIKM2025

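Quick read: This paper addresses the unclear source of the performance gains of Hyper-relational Knowledge Graph Embedding (HKGE) models: whether they come from the underlying Knowledge Graph Embedding (KGE) model or from the specially designed qualifier-processing module. To test this, the authors convert Hyper-relational Knowledge Graphs (HKGs) into the conventional knowledge graph (KG) format using three decomposition methods and evaluate classical KGE models on the converted data, finding that some KGE models perform comparably to HKGE models. Further analysis shows that the decomposition methods alter the original HKG topology and fail to fully preserve qualifier information, and that current HKGE models fall short in capturing long-range dependencies or in integrating main-triple and qualifier information, mainly because of an information compression issue. The paper therefore proposes the FormerGNN framework, whose key ideas are a qualifier integrator that preserves the original HKG topology, a Graph Neural Network (GNN)-based encoder that captures long-range dependencies, and an improved mechanism for integrating main-triple and qualifier information that mitigates compression. Experiments show that FormerGNN outperforms existing HKGE models.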

Link: https://arxiv.org/abs/2508.03280
Authors: Yubo Wang, Shimin Di, Zhili Wang, Haoyang Li, Fei Teng, Hao Xin, Lei Chen
Affiliations: HKUST; Southeast University; The Hong Kong Polytechnic University; Tencent; HKUST & HKUST(GZ)
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL); Social and Information Networks (cs.SI)
Comments: Accepted by CIKM 2025

Abstract:Recently, Hyper-relational Knowledge Graphs (HKGs) have been proposed as an extension of traditional Knowledge Graphs (KGs) to better represent real-world facts with additional qualifiers. As a result, researchers have attempted to adapt classical Knowledge Graph Embedding (KGE) models for HKGs by designing extra qualifier processing modules. However, it remains unclear whether the superior performance of Hyper-relational KGE (HKGE) models arises from their base KGE model or the specially designed extension module. Hence, in this paper, we data-wise convert HKGs to KG format using three decomposition methods and then evaluate the performance of several classical KGE models on HKGs. Our results show that some KGE models achieve performance comparable to that of HKGE models. Upon further analysis, we find that the decomposition methods alter the original HKG topology and fail to fully preserve HKG information. Moreover, we observe that current HKGE models are either insufficient in capturing the graph’s long-range dependency or struggle to integrate main-triple and qualifier information due to the information compression issue. To further justify our findings and offer a potential direction for future HKGE research, we propose the FormerGNN framework. This framework employs a qualifier integrator to preserve the original HKG topology, and a GNN-based graph encoder to capture the graph’s long-range dependencies, followed by an improved approach for integrating main-triple and qualifier information to mitigate compression issues. Our experimental results demonstrate that FormerGNN outperforms existing HKGE models.

[NLP-36] Do language models accommodate their users? A study of linguistic convergence

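Quick read: This paper asks whether large language models (LLMs) exhibit linguistic convergence in dialogue generation, that is, whether models adapt to and imitate their user's language style. The key to the approach is a systematic comparison, across sixteen language models and three dialogue corpora, of the stylistic difference between model completions and the original human responses, quantified with a variety of stylometric features. Models are found to converge strongly to the conversation's style, often significantly overfitting relative to the human baseline. The study also finds that instruction-tuned and larger models converge less than their pretrained counterparts, suggesting that the mechanisms behind convergence differ substantially between humans and models.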

Link: https://arxiv.org/abs/2508.03276
Authors: Terra Blevins, Susanne Schmalwieser, Benjamin Roth
Affiliations: University of Vienna; Khoury College of Computer Sciences, Northeastern University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While large language models (LLMs) are generally considered proficient in generating language, how similar their language usage is to that of humans remains understudied. In this paper, we test whether models exhibit linguistic convergence, a core pragmatic element of human language communication, asking: do models adapt, or converge, to the linguistic patterns of their user? To answer this, we systematically compare model completions of existing dialogues to the original human responses across sixteen language models, three dialogue corpora, and a variety of stylometric features. We find that models strongly converge to the conversation’s style, often significantly overfitting relative to the human baseline. While convergence patterns are often feature-specific, we observe consistent shifts in convergence across modeling settings, with instruction-tuned and larger models converging less than their pretrained counterparts. Given the differences between human and model convergence patterns, we hypothesize that the underlying mechanisms for these behaviors are very different.

[NLP-37] LECTOR: LLM-Enhanced Concept-based Test-Oriented Repetition for Adaptive Spaced Learning

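Quick read: This paper addresses the weaknesses of existing Spaced Repetition Systems (SRS) in handling semantic interference and personalized adaptation, in particular how to raise the accuracy and stability of vocabulary retention in test-oriented language learning. The key to the solution is LECTOR, a concept-based spaced repetition algorithm enhanced by large language models (LLMs): the LLM assesses semantic similarity between vocabulary items, and this semantic information is integrated into the review scheduling logic of established SRS methods, reducing confusion errors caused by semantically similar items while maintaining computational efficiency. In simulated experiments LECTOR outperforms six baseline algorithms and reaches a 90.2% success rate.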

Link: https://arxiv.org/abs/2508.03275
Authors: Jiahao Zhao
Affiliations: Xi’an University of Posts and Telecommunications
Categories: Computation and Language (cs.CL)
Comments: 15 pages, 4 figures, 1 table

Abstract:Spaced repetition systems are fundamental to efficient learning and memory retention, but existing algorithms often struggle with semantic interference and personalized adaptation. We present LECTOR (LLM-Enhanced Concept-based Test-Oriented Repetition), a novel adaptive scheduling algorithm specifically designed for test-oriented learning scenarios, particularly language examinations where success rate is paramount. LECTOR leverages large language models for semantic analysis while incorporating personalized learning profiles, addressing the critical challenge of semantic confusion in vocabulary learning by utilizing LLM-powered semantic similarity assessment and integrating it with established spaced repetition principles. Our comprehensive evaluation against six baseline algorithms (SSP-MMC, SM2, HLR, FSRS, ANKI, THRESHOLD) across 100 simulated learners over 100 days demonstrates significant improvements: LECTOR achieves a 90.2% success rate compared to 88.4% for the best baseline (SSP-MMC), representing a 2.0% relative improvement. The algorithm shows particular strength in handling semantically similar concepts, reducing confusion-induced errors while maintaining computational efficiency. Our results establish LECTOR as a promising direction for intelligent tutoring systems and adaptive learning platforms.
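
The reported improvement is relative, not absolute, which is easy to misread. The short check below reproduces the arithmetic from the two success rates quoted in the abstract; the variable names are only for illustration.

```python
lector = 90.2          # LECTOR success rate, percent (from the abstract)
best_baseline = 88.4   # SSP-MMC success rate, percent (from the abstract)

absolute_gain = lector - best_baseline                # in percentage points
relative_gain = absolute_gain / best_baseline * 100   # in percent of the baseline

print(f"absolute: {absolute_gain:.1f} points, relative: {relative_gain:.1f}%")
```

The gap is 1.8 percentage points, which is about 2.0% of the baseline rate.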

[NLP-38] Pay What LLM Wants: Can LLM Simulate Economics Experiment with 522 Real-human Persona?

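Quick read: This paper addresses the lack of realism that results when studies of large language models (LLMs) simulating human economic decision-making rely on fictional personas. The key to the approach is the use of real individual data, namely Pay-What-You-Want (PWYW) pricing experiments with 522 Korean participants in cultural consumption scenarios, to systematically evaluate three state-of-the-art multimodal LLMs on predicting individual economic behavior and to examine how persona injection methods affect prediction performance. The study finds that although LLMs struggle to reproduce individual-level choices accurately, they capture behavioral tendencies reasonably well at the group level. Commonly adopted prompting techniques are not much better than naive prompting, and neither narrative reconstruction nor retrieval-augmented generation brings a significant gain. The work thus offers a first empirical benchmark and guidance for persona-based simulation in computational social science grounded in real human data.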

Link: https://arxiv.org/abs/2508.03262
Authors: Junhyuk Choi, Hyeonchu Park, Haemin Lee, Hyebeen Shin, Hyun Joung Jin, Bugeun Kim
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract:Recent advances in Large Language Models (LLMs) have generated significant interest in their capacity to simulate human-like behaviors, yet most studies rely on fictional personas rather than actual human data. We address this limitation by evaluating LLMs’ ability to predict individual economic decision-making using Pay-What-You-Want (PWYW) pricing experiments with 522 real human personas. Our study systematically compares three state-of-the-art multimodal LLMs using detailed persona information from 522 Korean participants in cultural consumption scenarios. We investigate whether LLMs can accurately replicate individual human choices and how persona injection methods affect prediction performance. Results reveal that while LLMs struggle with precise individual-level predictions, they demonstrate reasonable group-level behavioral tendencies. Also, we found that commonly adopted prompting techniques are not much better than naive prompting methods; neither reconstruction of personal narrative nor retrieval-augmented generation yields a significant gain over a simple prompting method. We believe that these findings can provide the first comprehensive evaluation of LLMs’ capabilities on simulating economic behavior using real human data, offering empirical guidance for persona-based simulation in computational social science.

[NLP-39] Exploring Stability-Plasticity Trade-offs for Continual Named Entity Recognition

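Quick read: This paper addresses the stability-plasticity imbalance in Continual Named Entity Recognition (CNER): in countering catastrophic forgetting, existing methods overemphasize stability (retaining old knowledge) and neglect plasticity (adapting to new entity types). The key to the solution is a Stability-Plasticity Trade-off (SPT) method. At the representation level, it adds a pooling operation to Knowledge Distillation (KD), relaxing representation consistency to a moderate degree so the model can acquire new knowledge. At the weight level, it dynamically merges the weights of the old and new models and uses a weight-guided selective mechanism to prioritize significant weights, strengthening old knowledge while maintaining new knowledge. In addition, a confidence-based pseudo-labeling strategy handles the semantic shift of the non-entity type. Together these components improve continual learning performance in CNER settings.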

Link: https://arxiv.org/abs/2508.03259
Authors: Duzhen Zhang, Chenxing Li, Jiahua Dong, Qi Liu, Dong Yu
Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Tencent AI Lab; University of Hong Kong
Categories: Computation and Language (cs.CL)
Comments: Accepted by IEEE/ACM Transactions on Audio, Speech and Language Processing

Abstract:Continual Named Entity Recognition (CNER) is an evolving field that focuses on sequentially updating an existing model to incorporate new entity types. Previous CNER methods primarily utilize Knowledge Distillation (KD) to preserve prior knowledge and overcome catastrophic forgetting, strictly ensuring that the representations of old and new models remain consistent. Consequently, they often impart the model with excessive stability (i.e., retention of old knowledge) but limited plasticity (i.e., acquisition of new knowledge). To address this issue, we propose a Stability-Plasticity Trade-off (SPT) method for CNER that balances these aspects from both representation and weight perspectives. From the representation perspective, we introduce a pooling operation into the original KD, permitting a level of plasticity by consolidating representation dimensions. From the weight perspective, we dynamically merge the weights of old and new models, strengthening old knowledge while maintaining new knowledge. During this fusion, we implement a weight-guided selective mechanism to prioritize significant weights. Moreover, we develop a confidence-based pseudo-labeling approach for the current non-entity type, which predicts entity types using the old model to handle the semantic shift of the non-entity type, a challenge specific to CNER that has largely been ignored by previous methods. Extensive experiments across ten CNER settings on three benchmark datasets demonstrate that our SPT method surpasses previous CNER approaches, highlighting its effectiveness in achieving a suitable stability-plasticity trade-off.
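
The two perspectives in the abstract can be illustrated with a toy sketch. It assumes average pooling over non-overlapping windows before a mean-squared distillation loss, and plain linear interpolation for weight merging. The helper names, the window size, and the fixed `alpha` are hypothetical, and the paper's weight-guided selective mechanism is not reproduced here.

```python
def avg_pool(vec, window):
    """Average non-overlapping windows of a representation vector."""
    return [sum(vec[i:i + window]) / window for i in range(0, len(vec), window)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def merge_weights(old, new, alpha):
    """Linear interpolation: alpha weights the old model, 1 - alpha the new one."""
    return [alpha * o + (1 - alpha) * n for o, n in zip(old, new)]

old_repr = [1.0, 3.0, 2.0, 2.0]
new_repr = [3.0, 1.0, 2.0, 2.0]

strict_kd = mse(old_repr, new_repr)                            # element-wise match
pooled_kd = mse(avg_pool(old_repr, 2), avg_pool(new_repr, 2))  # match after pooling
merged = merge_weights([0.0, 2.0], [4.0, 2.0], alpha=0.5)
```

The strict loss penalizes the new representation for rearranging values inside a window, while the pooled loss is zero for the same change. That slack is the kind of plasticity the pooling is meant to allow.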

[NLP-40] RooseBERT: A New Deal For Political Language Modelling

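Quick read: This paper addresses the challenge of automatically analysing political debate text, which is difficult because of the specificity of political language and the hidden communication strategies and implicit arguments used in debates. Current general-purpose pre-trained language models perform poorly on such tasks. The key to the solution is RooseBERT, a pre-trained language model designed for political discourse and trained on large English political debate and speech corpora (8K debates, each composed of several sub-debates on different topics) to capture the linguistic features and argumentative structure of political text. Experiments show that RooseBERT significantly outperforms general-purpose language models on four downstream tasks (named entity recognition, sentiment analysis, argument component detection and classification, and argument relation prediction and classification), supporting the value of domain-specific pre-training for political debate analysis.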

Link: https://arxiv.org/abs/2508.03250
Authors: Deborah Dore, Elena Cabrio, Serena Villata
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The increasing amount of political debates and politics-related discussions calls for the definition of novel computational methods to automatically analyse such content with the final goal of lightening up political deliberation to citizens. However, the specificity of the political language and the argumentative form of these debates (employing hidden communication strategies and leveraging implicit arguments) make this task very challenging, even for current general-purpose pre-trained Language Models. To address this issue, we introduce a novel pre-trained Language Model for political discourse language called RooseBERT. Pre-training a language model on a specialised domain presents different technical and linguistic challenges, requiring extensive computational resources and large-scale data. RooseBERT has been trained on large political debate and speech corpora (8K debates, each composed of several sub-debates on different topics) in English. To evaluate its performances, we fine-tuned it on four downstream tasks related to political debate analysis, i.e., named entity recognition, sentiment analysis, argument component detection and classification, and argument relation prediction and classification. Our results demonstrate significant improvements over general-purpose Language Models on these four tasks, highlighting how domain-specific pre-training enhances performance in political debate analysis. We release the RooseBERT language model for the research community.

[NLP-41] Somatic in the East, Psychological in the West?: Investigating Clinically-Grounded Cross-Cultural Depression Symptom Expression in LLMs

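Quick read: This paper asks whether large language models (LLMs), which are increasingly used in mental health, reproduce the cultural difference in how depression symptoms are expressed: Western individuals tend to report psychological symptoms, while Eastern individuals more often report somatic ones. The key to the approach is persona prompting with Western or Eastern cultural identities, testing LLM responses in English and in major Eastern languages (Chinese, Japanese, and Hindi), and identifying the root causes of any lack of cultural sensitivity. LLMs largely fail to reproduce the cultural patterns when prompted in English, while prompting in Eastern languages improves alignment in several configurations. The failure is attributed to two factors: the models' low sensitivity to cultural personas and a strong, culturally invariant symptom hierarchy that overrides cultural cues. The findings indicate that although prompt language matters, current general-purpose LLMs lack the robust, culture-aware capabilities needed for safe and effective mental health applications.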

Link: https://arxiv.org/abs/2508.03247
Authors: Shintaro Sakai, Jisun An, Migyeong Kang, Haewoon Kwak
Affiliations: Indiana University Bloomington; Sungkyunkwan University
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Prior clinical psychology research shows that Western individuals with depression tend to report psychological symptoms, while Eastern individuals report somatic ones. We test whether Large Language Models (LLMs), which are increasingly used in mental health, reproduce these cultural patterns by prompting them with Western or Eastern personas. Results show that LLMs largely fail to replicate the patterns when prompted in English, though prompting in major Eastern languages (i.e., Chinese, Japanese, and Hindi) improves alignment in several configurations. Our analysis pinpoints two key reasons for this failure: the models’ low sensitivity to cultural personas and a strong, culturally invariant symptom hierarchy that overrides cultural cues. These findings reveal that while prompt language is important, current general-purpose LLMs lack the robust, culture-aware capabilities essential for safe and effective mental health applications.

[NLP-42] CardiffNLP at CLEARS-2025: Prompting Large Language Models for Plain Language and Easy-to-Read Text Rewriting

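Quick read: This paper addresses Spanish text adaptation, the task of rewriting a source text into plain-language and easy-to-read form for a target audience while preserving its meaning, as posed by the CLEARS shared task at IberLEF 2025. The team takes an LLM-prompting approach, designing multiple prompt variations to guide the model towards the target style. The key to the solution is a systematic exploration of prompt structures and example combinations: after initial experiments with LLaMA-3.2, the team adopted Gemma-3 for the final submission and placed third in Subtask 1 and second in Subtask 2.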

Link: https://arxiv.org/abs/2508.03240
Authors: Mutaz Ayesh, Nicolás Gutiérrez-Rolón, Fernando Alva-Manchego
Affiliations: Cardiff University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper details the CardiffNLP team’s contribution to the CLEARS shared task on Spanish text adaptation, hosted by IberLEF 2025. The shared task contained two subtasks and the team submitted to both. Our team took an LLM-prompting approach with different prompt variations. While we initially experimented with LLaMA-3.2, we adopted Gemma-3 for our final submission, and landed third place in Subtask 1 and second place in Subtask 2. We detail our numerous prompt variations, examples, and experimental results.

[NLP-43] Probing Syntax in Large Language Models: Successes and Remaining Challenges

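Quick read: This paper addresses a systematic weakness in how structural probes are used to assess the syntactic representations of large language models (LLMs): existing evaluations mostly use uncontrolled sentence sets, which makes it hard to separate the effect of syntactic structure from statistical or surface factors. The key to the approach is an in-depth analysis of structural probes on three controlled benchmarks that vary factors such as syntactic structure, word predictability, and grammaticality. The results show that structural probes are biased by word distance (a surface property), represent deep syntactic structures poorly, and are disrupted by interacting nouns or ungrammatical verb forms, but do not appear to be affected by the predictability of individual words. The work thereby exposes current limitations of structural probes and provides a benchmark of controlled stimuli for evaluating them more accurately.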

Link: https://arxiv.org/abs/2508.03211
Authors: Pablo J. Diego-Simón, Emmanuel Chemla, Jean-Rémi King, Yair Lakretz
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The syntactic structures of sentences can be readily read-out from the activations of large language models (LLMs). However, the "structural probes" that have been developed to reveal this phenomenon are typically evaluated on an indiscriminate set of sentences. Consequently, it remains unclear whether structural and/or statistical factors systematically affect these syntactic representations. To address this issue, we conduct an in-depth analysis of structural probes on three controlled benchmarks. Our results are three-fold. First, structural probes are biased by a superficial property: the closer two words are in a sentence, the more likely structural probes will consider them as syntactically linked. Second, structural probes are challenged by linguistic properties: they poorly represent deep syntactic structures, and get interfered by interacting nouns or ungrammatical verb forms. Third, structural probes do not appear to be affected by the predictability of individual words. Overall, this work sheds light on the current challenges faced by structural probes, providing a benchmark made of controlled stimuli to better evaluate their performance.

[NLP-44] Current State in Privacy-Preserving Text Preprocessing for Domain-Agnostic NLP

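Quick read: This report addresses the risk that large language models (LLMs) leak sensitive personal information absorbed during training, a concern heightened by data privacy regulations such as GDPR. The core challenge is that complete anonymization may not be achievable, yet pre-processing can mask or pseudonymize private information in text to reduce privacy risk. The report surveys a selection of such pre-processing approaches for domain-agnostic Natural Language Processing (NLP) tasks, which aim to identify and de-identify sensitive content in text.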

Link: https://arxiv.org/abs/2508.03204
Authors: Abhirup Sinha, Pritilata Saha, Tithi Saha
Affiliations: Paderborn University; Vellore Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: To be published in the Proceedings of Die Studierendenkonferenz Informatik (SKILL) 2024

Abstract:Privacy is a fundamental human right. Data privacy is protected by different regulations, such as GDPR. However, modern large language models require a huge amount of data to learn linguistic variations, and the data often contains private information. Research has shown that it is possible to extract private information from such language models. Thus, anonymizing such private and sensitive information is of utmost importance. While complete anonymization may not be possible, a number of different pre-processing approaches exist for masking or pseudonymizing private information in textual data. This report focuses on a few of such approaches for domain-agnostic NLP tasks.
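
As a rough illustration of what masking in a pre-processing step can look like, the sketch below replaces e-mail addresses and phone-like numbers with placeholder tags using regular expressions. The patterns and placeholder names are simplified assumptions made for this example, not a method taken from the report, and they would miss many real-world formats.

```python
import re

# Simplified, illustrative patterns; production systems need far broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")

def mask(text):
    """Replace e-mail addresses and phone-like numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

masked = mask("Contact Jane at jane.doe@example.org or +49 170 1234567.")
print(masked)
```

The personal name "Jane" is left untouched: rule-based patterns only catch identifiers with a regular surface form, which is why names typically require a named entity recognizer.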

[NLP-45] Beyond Content: How Grammatical Gender Shapes Visual Representation in Text-to-Image Models

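Quick read: This paper asks how grammatical gender influences visual representation in Text-to-Image (T2I) models across languages, especially when grammatical gender contradicts stereotypical gender associations (e.g., French "une sentinelle" is grammatically feminine but refers to a guard, a concept stereotypically seen as masculine). Prior work focused on demographic representation and stereotypical attributes and overlooked bias arising from language structure itself. The key to the approach is a cross-linguistic benchmark covering five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), with 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Grammatical gender markedly shifts the gender distribution of generated images: masculine grammatical markers raise male representation to 73% on average (versus 22% with gender-neutral English), and feminine markers raise female representation to 38% (versus 28% in English). The effect varies with language resource availability and model architecture, showing that language structure itself shapes AI-generated visual output and adding a new dimension to fairness research in multilingual, multimodal systems.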

Link: https://arxiv.org/abs/2508.03199
Authors: Muhammed Saeed, Shaina Raza, Ashmal Vayani, Muhammad Abdul-Mageed, Ali Emami, Shady Shehata
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Research on bias in Text-to-Image (T2I) models has primarily focused on demographic representation and stereotypical attributes, overlooking a fundamental question: how does grammatical gender influence visual representation across languages? We introduce a cross-linguistic benchmark examining words where grammatical gender contradicts stereotypical gender associations (e.g., "une sentinelle" - grammatically feminine in French but referring to the stereotypically masculine concept "guard"). Our dataset spans five gendered languages (French, Spanish, German, Italian, Russian) and two gender-neutral control languages (English, Chinese), comprising 800 unique prompts that generated 28,800 images across three state-of-the-art T2I models. Our analysis reveals that grammatical gender dramatically influences image generation: masculine grammatical markers increase male representation to 73% on average (compared to 22% with gender-neutral English), while feminine grammatical markers increase female representation to 38% (compared to 28% in English). These effects vary systematically by language resource availability and model architecture, with high-resource languages showing stronger effects. Our findings establish that language structure itself, not just content, shapes AI-generated visual outputs, introducing a new dimension for understanding bias and fairness in multilingual, multimodal systems.
zh

[NLP-46] Analyzing German Parliamentary Speeches: A Machine Learning Approach for Topic and Sentiment Classification

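Quick read: This paper aims to quantify how political discourse in the German Bundestag evolves, focusing on identifying and modeling topic trends, sentiment dynamics, and party-specific discourse strategies. The key to the solution is building and training two machine learning models, one for topic classification and one for sentiment classification, which perform strongly on a manually labeled dataset (average AUROC of 0.94 for topic classification and 0.89 for sentiment classification). Applying the models to roughly 28,000 parliamentary speeches shows that a change of role (such as moving from government to opposition) markedly affects a party's discourse style, and supports the finding that ideology and governing responsibility jointly shape political discourse.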

Link: https://arxiv.org/abs/2508.03181
Authors: Lukas Pätz, Moritz Beyer, Jannik Späth, Lasse Bohlen, Patrick Zschech, Mathias Kraus, Julian Rosenberger
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted at 20th International Conference on Wirtschaftsinformatik (WI25); September 2025, Münster, Germany

Abstract:This study investigates political discourse in the German parliament, the Bundestag, by analyzing approximately 28,000 parliamentary speeches from the last five years. Two machine learning models for topic and sentiment classification were developed and trained on a manually labeled dataset. The models showed strong classification performance, achieving an area under the receiver operating characteristic curve (AUROC) of 0.94 for topic classification (average across topics) and 0.89 for sentiment classification. Both models were applied to assess topic trends and sentiment distributions across political parties and over time. The analysis reveals remarkable relationships between parties and their role in parliament. In particular, a change in style can be observed for parties moving from government to opposition. While ideological positions matter, governing responsibilities also shape discourse. The analysis directly addresses key questions about the evolution of topics, sentiment dynamics, and party-specific discourse strategies in the Bundestag.
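The abstract reports topic AUROC as an average across topics. The following is a minimal sketch of how such a macro-averaged, one-vs-rest AUROC could be computed. The topic names and scores are made-up toy data, and the paper does not state which implementation the authors used.

```python
from typing import Dict, List, Tuple


def auroc(labels: List[int], scores: List[float]) -> float:
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def macro_auroc(per_topic: Dict[str, Tuple[List[int], List[float]]]) -> float:
    """Unweighted mean of the one-vs-rest AUROC values, one per topic."""
    values = [auroc(labels, scores) for labels, scores in per_topic.values()]
    return sum(values) / len(values)


# Hypothetical toy data: binary "is this speech about the topic?" labels
# and classifier scores for two invented topics.
toy = {
    "economy": ([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]),
    "climate": ([1, 1, 0, 0], [0.8, 0.3, 0.5, 0.1]),
}
print(macro_auroc(toy))
```

The pairwise loop is quadratic in the number of examples, which is fine for illustration; a rank-based implementation would be preferable for 28,000 speeches.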

[NLP-47] Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following

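Quick read: This paper addresses inconsistent instruction following by Large Language Models (LLMs) on complex instructions, particularly instructions with strict constraints, where "lazy reasoning" often leads to wrong outputs. The key to the solution is a rigorous reasoning framework built around preview and self-checking. First, instructions with complex constraints are generated and filtered into valid prompts, split into hard, easy, and pass sets. Next, rejection sampling on the pass prompts curates a small, high-quality dataset for cold-start initialization. Finally, entropy-preserving supervised fine-tuning (Entropy-SFT) is combined with token-wise entropy-adaptive reinforcement learning (TEA-RL) guided by rule-based dense rewards, steering the model to restructure its reasoning. The result is more accurate adherence to complex instructions and better generalization across model scales.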

Link: https://arxiv.org/abs/2508.03178
Authors: Chenyang Wang, Liang Wen, Shousheng Jia, Xiangzheng Zhang, Liang Xu
Affiliations: 360; 2: Unknown; 3: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages, 10 figures, 7 tables

Abstract:While advancements in the reasoning abilities of LLMs have significantly enhanced their performance in solving mathematical problems, coding tasks, and general puzzles, their effectiveness in accurately adhering to instructions remains inconsistent, particularly with more complex directives. Our investigation identifies lazy reasoning during the thinking stage as the primary factor contributing to poor instruction adherence. To mitigate this issue, we propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking, essential for satisfying strict instruction constraints. Specifically, we first generate instructions with complex constraints and apply a filtering process to obtain valid prompts, resulting in three distinct prompt datasets categorized as hard, easy, and pass. Then, we employ rejection sampling on the pass prompts to curate a small yet high-quality dataset, enabling a cold-start initialization of the model and facilitating its adaptation to effective reasoning patterns. Subsequently, we employ an entropy-preserving supervised fine-tuning (Entropy-SFT) strategy coupled with token-wise entropy-adaptive (TEA-RL) reinforcement learning guided by rule-based dense rewards. This approach encourages the model to transform its reasoning mechanism, ultimately fostering generalizable reasoning abilities that encompass preview and self-checking. Extensive experiments conducted on instruction-following benchmarks demonstrate remarkable performance improvements across various model scales. Notably, our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.

[NLP-48] ChartCap: Mitigating Hallucination of Dense Chart Captioning ICCV2025

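Quick read: This paper addresses the low accuracy, limited informativeness, and frequent hallucination of vision language models when captioning charts, which it attributes to the lack of large-scale, high-quality datasets of real-world charts. Existing datasets often include extraneous information that cannot be inferred from the chart and fail to capture structural elements and key insights. The key to the solution is ChartCap, a dataset of 565K real-world chart images paired with type-specific, dense captions. It is built with a four-stage pipeline that generates captions using only data discernible from the chart, plus cycle-consistency-based human verification that speeds up quality control. The paper also proposes a new metric, the Visual Consistency Score, which measures caption quality by comparing the chart regenerated from a caption with the original chart, without relying on reference captions. Experiments show that models fine-tuned on ChartCap produce more accurate and informative captions with fewer hallucinations, surpassing open-source and proprietary models and even human-annotated captions.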

Link: https://arxiv.org/abs/2508.03164
Authors: Junyoung Lim, Jaewoo Ahn, Gunhee Kim
Affiliations: Seoul National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: ICCV 2025 (Highlight)

Abstract:Generating accurate, informative, and hallucination-free captions for charts remains challenging for vision language models, primarily due to the lack of large-scale, high-quality datasets of real-world charts. However, existing real-world chart datasets suffer from the inclusion of extraneous information that cannot be inferred from the chart and failure to sufficiently capture structural elements and key insights. Therefore, we introduce ChartCap, a large-scale dataset of 565K real-world chart images paired with type-specific, dense captions that exclude extraneous information and highlight both structural elements and key insights in detail. To build ChartCap, we design a four-stage pipeline that generates captions using only the discernible data from the chart and employ a cycle consistency-based human verification, which accelerates quality control without sacrificing accuracy. Additionally, we propose a novel metric, the Visual Consistency Score, which evaluates caption quality by measuring the similarity between the chart regenerated from a caption and the original chart, independent of reference captions. Extensive experiments confirm that models fine-tuned on ChartCap consistently generate more accurate and informative captions with reduced hallucinations, surpassing both open-source and proprietary models and even human-annotated captions.

[NLP-49] RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior

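Quick read: This paper addresses how to merge domain-specific Large Language Models with reasoning models that have long chain-of-thought (CoT) capability without significantly harming that capability, yielding a dual-capability model with both general reasoning ability and domain knowledge. Current model merging methods often degrade reasoning or even cause output collapse, which makes a balanced result hard to reach. The key to the solution is the RCP-Merging framework, which treats the reasoning model's weights as a foundational prior and uses a reasoning capability indicator to identify and preserve the weights central to long CoT capability while selectively merging essential domain-specific weights. This improves domain task performance while largely preserving the original reasoning performance.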

Link: https://arxiv.org/abs/2508.03140
Authors: Junyao Yang, Jianwei Wang, Huiping Zhuang, Cen Chen, Ziqian Zeng
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 15 pages, 7 figures

Abstract:Large Language Models (LLMs) with long chain-of-thought (CoT) capability, termed Reasoning Models, demonstrate superior intricate problem-solving abilities through multi-step long CoT reasoning. To create a dual-capability model with long CoT capability and domain-specific knowledge without substantial computational and data costs, model merging emerges as a highly resource-efficient method. However, significant challenges lie in merging domain-specific LLMs with long CoT ones, since current merging methods suffer from reasoning capability degradation and even gibberish output and output collapse. To overcome this, we introduce RCP-Merging: Merging Long Chain-of-Thought Models with Domain-Specific Models by Considering Reasoning Capability as Prior, a novel merging framework designed to integrate domain-specific LLMs with long CoT capability while maintaining model performance in the original domain. Treating reasoning model weights as foundational prior, our method utilizes a reasoning capability indicator to preserve core long CoT capability model weights while selectively merging essential domain-specific weights. We conducted extensive experiments on Qwen2.5-7B, Llama3.1-8B, and Qwen2.5-1.5B models in BioMedicine and Finance domains. Our results show that RCP-Merging successfully merges a reasoning model with domain-specific ones, improving domain task performance by 9.5% and 9.2% over state-of-the-art methods, without significantly harming the original long CoT reasoning capability.
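To make the idea of "reasoning weights as prior" concrete, here is a heavily simplified sketch. It is not the paper's algorithm: the per-weight `importance` scores, the `threshold`, and the task-vector formulation (domain weights minus a shared base) are all assumptions chosen for illustration, and the abstract does not define the actual reasoning capability indicator.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_with_reasoning_prior(
    reasoning: Weights,
    domain: Weights,
    base: Weights,
    importance: Weights,
    threshold: float = 0.5,
) -> Weights:
    """Start from the reasoning model and add domain updates selectively.

    For each scalar weight, the domain update (domain - base) is added only
    when the hypothetical reasoning-importance score is below `threshold`;
    otherwise the reasoning weight is kept untouched.
    """
    merged: Weights = {}
    for name, r_weights in reasoning.items():
        merged[name] = [
            (r + (d - b)) if imp < threshold else r
            for r, d, b, imp in zip(r_weights, domain[name], base[name], importance[name])
        ]
    return merged


# Toy two-weight "model" with invented numbers.
base = {"w": [0.0, 0.0]}
reasoning = {"w": [1.0, 2.0]}
domain = {"w": [0.5, -0.5]}
importance = {"w": [0.9, 0.1]}  # first weight matters for reasoning, second does not

print(merge_with_reasoning_prior(reasoning, domain, base, importance))
```

In this toy run the first weight is protected because its importance score is high, while the second absorbs the domain update.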

[NLP-50] Long Story Generation via Knowledge Graph and Literary Theory

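Quick read: This paper addresses two problems of multi-stage methods in long text generation (LTG): theme drift caused by losing memory of earlier outlines, and tedious, logically incoherent plots that fail to engage human readers. The key to the solution is a multi-agent story generation structure with a two-part memory store: a long-term memory that identifies and retains the most important memories to prevent theme drift, and a short-term memory that keeps the latest outline from each generation round. The paper also designs a story theme obstacle framework based on literary narratology, which builds a knowledge graph and introduces uncertain factors and evaluation criteria to make the story more engaging, and a multi-agent interaction stage that simulates writer-reader dialogue and revises the text according to feedback to keep the story consistent and logical.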

Link: https://arxiv.org/abs/2508.03137
Authors: Ge Shi, Kaiyu Huang, Guochen Feng
Affiliations: Beijing Jiaotong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The generation of a long story consisting of several thousand words is a sub-task in the field of long text generation (LTG). Previous research has addressed this challenge through outline-based generation, which employs a multi-stage method for generating outlines into stories. However, this approach suffers from two common issues: almost inevitable theme drift caused by the loss of memory of previous outlines, and tedious plots with incoherent logic that are less appealing to human readers. In this paper, we propose the multi-agent Story Generator structure to improve the multi-stage method, using large language models (LLMs) as the core components of agents. To avoid theme drift, we introduce a memory storage model comprising two components: a long-term memory storage that identifies the most important memories, thereby preventing theme drift; and a short-term memory storage that retains the latest outlines from each generation round. To incorporate engaging elements into the story, we design a story theme obstacle framework based on literary narratology theory that introduces uncertain factors and evaluation criteria to generate outlines. This framework calculates the similarity of the former storyline and enhances the appeal of the story by building a knowledge graph and integrating new node content. Additionally, we establish a multi-agent interaction stage to simulate writer-reader interaction through dialogue and revise the story text according to feedback, to ensure it remains consistent and logical. Evaluations against previous methods demonstrate that our approach can generate higher-quality long stories.

[NLP-51] Cross-lingual Opinions and Emotions Mining in Comparable Documents

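Quick read: This paper aims to quantify differences in sentiment and emotion across cross-lingual comparable texts, specifically identifying agreement and divergence between English and Arabic news documents from different sources. The key steps are as follows. First, a cross-lingual annotation method labels documents with opinion classes (subjective/objective) without relying on machine translation. Second, the English WordNet-Affect (WNA) lexicon is manually translated to build bilingual emotion lexicons for labeling six emotions (anger, disgust, fear, joy, sadness, surprise). Finally, a statistical measure assesses the agreement of sentiments and emotions in each source-target document pair. The results show that annotations align when documents come from the same news agency and diverge when they come from different ones. The method is language-independent and can be extended to other language pairs.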

Link: https://arxiv.org/abs/2508.03112
Authors: Motaz Saad, David Langlois, Kamel Smaili
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: 16 pages, 5 figures

Abstract:Comparable texts are topic-aligned documents in multiple languages that are not direct translations. They are valuable for understanding how a topic is discussed across languages. This research studies differences in sentiments and emotions across English-Arabic comparable documents. First, texts are annotated with sentiment and emotion labels. We apply a cross-lingual method to label documents with opinion classes (subjective/objective), avoiding reliance on machine translation. To annotate with emotions (anger, disgust, fear, joy, sadness, surprise), we manually translate the English WordNet-Affect (WNA) lexicon into Arabic, creating bilingual emotion lexicons used to label the comparable corpora. We then apply a statistical measure to assess the agreement of sentiments and emotions in each source-target document pair. This comparison is especially relevant when the documents originate from different sources. To our knowledge, this aspect has not been explored in prior literature. Our study includes English-Arabic document pairs from Euronews, BBC, and Al-Jazeera (JSC). Results show that sentiment and emotion annotations align when articles come from the same news agency and diverge when they come from different ones. The proposed method is language-independent and generalizable to other language pairs.
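The abstract mentions "a statistical measure" of agreement without naming it. As one plausible stand-in (an assumption, not the paper's stated choice), Cohen's kappa measures how much two label sequences agree beyond chance. The sketch below applies it to hypothetical subjective/objective labels from the English and Arabic sides of aligned document pairs.

```python
from collections import Counter
from typing import List


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Chance-corrected agreement between two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both sides labeled independently with their own label rates.
    expected = sum(counts_a[label] * counts_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sides use a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four aligned English/Arabic document pairs.
english = ["subjective", "objective", "subjective", "objective"]
arabic_same_source = ["subjective", "objective", "subjective", "objective"]
arabic_other_source = ["subjective", "subjective", "objective", "objective"]

print(cohen_kappa(english, arabic_same_source))
print(cohen_kappa(english, arabic_other_source))
```

A kappa near 1 indicates aligned annotations, and a kappa near 0 indicates agreement no better than chance, which mirrors the same-agency versus different-agency contrast described in the abstract.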

[NLP-52] Token-Level Precise Attack on RAG: Searching for the Best Alternatives to Mislead Generation

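Quick read: This paper addresses a security vulnerability of Retrieval-Augmented Generation (RAG) systems: malicious content in the external database may be retrieved and used to manipulate model outputs, which creates a trustworthiness risk. Existing attacks either rely heavily on access to the retriever or fail to jointly consider the retrieval and generation stages, which limits their effectiveness in black-box settings. The key to the solution is Token-level Precise Attack on the RAG (TPARAG), which uses a lightweight white-box LLM as the attacker to generate and iteratively optimize malicious passages at the token level, so that the passages are both retrievable and highly effective against the generation stage. This improves attack effectiveness in both white-box and black-box settings.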

Link: https://arxiv.org/abs/2508.03110
Authors: Zizhong Li, Haopeng Zhang, Jiawei Zhang
Affiliations: University of California, Davis; University of Hawaii at Mānoa
Categories: Computation and Language (cs.CL)
Comments:

Abstract:While large language models (LLMs) have achieved remarkable success in providing trustworthy responses for knowledge-intensive tasks, they still face critical limitations such as hallucinations and outdated knowledge. To address these issues, the retrieval-augmented generation (RAG) framework enhances LLMs with access to external knowledge via a retriever, enabling more accurate and real-time outputs about the latest events. However, this integration brings new security vulnerabilities: the risk that malicious content in the external database can be retrieved and used to manipulate model outputs. Although prior work has explored attacks on RAG systems, existing approaches either rely heavily on access to the retriever or fail to jointly consider both retrieval and generation stages, limiting their effectiveness, particularly in black-box scenarios. To overcome these limitations, we propose Token-level Precise Attack on the RAG (TPARAG), a novel framework that targets both white-box and black-box RAG systems. TPARAG leverages a lightweight white-box LLM as an attacker to generate and iteratively optimize malicious passages at the token level, ensuring both retrievability and high attack success in generation. Extensive experiments on open-domain QA datasets demonstrate that TPARAG consistently outperforms previous approaches in retrieval-stage and end-to-end attack effectiveness. These results further reveal critical vulnerabilities in RAG pipelines and offer new insights into improving their robustness.

[NLP-53] Privacy-Aware Decoding: Mitigating Privacy Leakage of Large Language Models in Retrieval-Augmented Generation

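Quick read: This paper addresses the privacy leakage risk of Retrieval-Augmented Generation (RAG) systems that handle private or sensitive data, where attackers may extract protected information through generated responses. The key to the solution is Privacy-Aware Decoding (PAD), a lightweight inference-time defense. Its core components are confidence-based screening to identify high-risk tokens, efficient sensitivity estimation to reduce unnecessary noise injection, and context-aware noise calibration to balance privacy protection and generation quality. A Rényi Differential Privacy (RDP) accountant provides explicit per-response (ε, δ)-differential privacy (DP) guarantees. PAD needs no model retraining or corpus filtering, is model-agnostic, has low computational overhead, and outperforms existing retrieval-based and post-processing-based defenses.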

Link: https://arxiv.org/abs/2508.03098
Authors: Haoran Wang, Xiongxiao Xu, Baixiang Huang, Kai Shu
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Retrieval-Augmented Generation (RAG) enhances the factual accuracy of large language models (LLMs) by conditioning outputs on external knowledge sources. However, when retrieval involves private or sensitive data, RAG systems are susceptible to extraction attacks that can leak confidential information through generated responses. We propose Privacy-Aware Decoding (PAD), a lightweight, inference-time defense that adaptively injects calibrated Gaussian noise into token logits during generation. PAD integrates confidence-based screening to selectively protect high-risk tokens, efficient sensitivity estimation to minimize unnecessary noise, and context-aware noise calibration to balance privacy with generation quality. A Rényi Differential Privacy (RDP) accountant rigorously tracks cumulative privacy loss, enabling explicit per-response (ε, δ)-DP guarantees for sensitive outputs. Unlike prior approaches requiring retraining or corpus-level filtering, PAD is model-agnostic and operates entirely at decoding time with minimal computational overhead. Experiments on three real-world datasets demonstrate that PAD substantially reduces private information leakage while preserving response utility, outperforming existing retrieval- and post-processing-based defenses. Our work takes an important step toward mitigating privacy risks in RAG via decoding strategies, paving the way for universal and scalable privacy solutions in sensitive domains. Our code is available: this https URL.
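The decoding-time idea described above (screen a step by confidence, then add Gaussian noise to its logits) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the rule "a step is high-risk when the top-token probability exceeds `confidence_threshold`" and the fixed `sigma` are assumptions, and the sensitivity estimation, context-aware calibration, and RDP accounting from the abstract are omitted.

```python
import math
import random
from typing import List, Optional, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def noisy_logits(
    logits: List[float],
    confidence_threshold: float = 0.9,  # assumed screening rule
    sigma: float = 1.0,                 # assumed fixed noise scale
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], bool]:
    """Return (possibly perturbed logits, whether noise was injected)."""
    rng = rng or random.Random()
    top_prob = max(softmax(logits))
    if top_prob < confidence_threshold:
        # Low-confidence step: treated as low-risk, logits pass through unchanged.
        return list(logits), False
    # High-confidence step: perturb every logit with zero-mean Gaussian noise.
    return [x + rng.gauss(0.0, sigma) for x in logits], True


rng = random.Random(0)
print(noisy_logits([0.0, 0.0, 0.0], rng=rng))   # flat distribution, no noise
print(noisy_logits([10.0, 0.0, 0.0], rng=rng))  # peaked distribution, noise added
```

Screening keeps most decoding steps untouched, which is one way to limit the utility cost of the noise; a real system would also need to track the cumulative privacy budget.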

[NLP-54] Toward Verifiable Misinformation Detection: A Multi-Tool LLM Agent Framework

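Quick read: This paper addresses the limited accuracy, opaque reasoning, and sensitivity to rewritten content of Large Language Models (LLMs) in misinformation detection. The key to the solution is a verifiable misinformation detection LLM agent that interacts dynamically with diverse web sources and combines three core tools (a precise web search tool, a source credibility assessment tool, and a numerical claim verification tool) to execute multi-step verification strategies and produce traceable evidence logs and structured reasoning chains. This improves detection accuracy, reasoning transparency, and robustness to rewriting.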

Link: https://arxiv.org/abs/2508.03092
Authors: Zikun Cui, Tianyi Huang, Chia-En Chiang, Cuiqianhe Du
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:With the proliferation of Large Language Models (LLMs), the detection of misinformation has become increasingly important and complex. This research proposes an innovative verifiable misinformation detection LLM agent that goes beyond traditional true/false binary judgments. The agent actively verifies claims through dynamic interaction with diverse web sources, assesses information source credibility, synthesizes evidence, and provides a complete verifiable reasoning process. Our designed agent architecture includes three core tools: precise web search tool, source credibility assessment tool and numerical claim verification tool. These tools enable the agent to execute multi-step verification strategies, maintain evidence logs, and form comprehensive assessment conclusions. We evaluate using standard misinformation datasets such as FakeNewsNet, comparing with traditional machine learning models and LLMs. Evaluation metrics include standard classification metrics, quality assessment of reasoning processes, and robustness testing against rewritten content. Experimental results show that our agent outperforms baseline methods in misinformation detection accuracy, reasoning transparency, and resistance to information rewriting, providing a new paradigm for trustworthy AI-assisted fact-checking.

[NLP-55] VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

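Quick read: This paper addresses the policy instability and weaker generalization that Reinforcement Learning from Human Feedback (RLHF) suffers in real-world settings when reward signals are noisy or imperfect; in particular, noise can cause the model to lose attention on key words during advantage estimation. The key to the solution is VRPO, a value-centric framework for robust PPO training, with two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, which strengthens the value model's ability to capture key contextual information; and (2) a variational information bottleneck, which lets the value model filter out noise and stabilize advantage estimation, turning it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue show that VRPO consistently outperforms PPO and GRPO baselines, underscoring the role of the value model in making policies robust under noise.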

Link: https://arxiv.org/abs/2508.03058
Authors: Dingwei Zhu, Shihan Dou, Zhiheng Xi, Senjie Jin, Guoqiang Zhang, Jiazheng Zhang, Junjie Ye, Mingxu Chai, Enyu Zhou, Ming Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model’s ability to filter out noise and capture key words from the context during advantage estimation, transforming it from a passive predictor into an active regulator of noise. Experiments on math reasoning, science QA, and multi-turn dialogue, under both rule-based and model-based noisy rewards, show that VRPO consistently outperforms PPO and GRPO baselines. Our findings underscore the often-overlooked importance of the value model in RLHF and offer a principled and practical approach to robust policy optimization in noisy real-world environments.
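To see why the value model matters for advantage estimation, consider Generalized Advantage Estimation (GAE), the estimator commonly paired with PPO. Whether VRPO uses exactly this estimator is an assumption here; the sketch only shows how value predictions enter the advantage signal that the policy is trained on.

```python
from typing import List


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one trajectory.

    `values` must hold len(rewards) + 1 entries; the last one is the
    bootstrap value of the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, which depends directly on the value model's predictions.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# A sparse end-of-sequence reward with two different value models (toy numbers).
rewards = [0.0, 0.0, 1.0]
print(gae_advantages(rewards, [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
print(gae_advantages(rewards, [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0))
```

With the same rewards, the two value models yield different advantages, so a value model that tracks noise poorly passes that noise straight into the policy update.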

[NLP-56] When Algorithms Meet Artists: Topic Modeling the AI-Art Debate 2013-2025

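Quick read: This paper addresses the marginalization of artists' voices in public and scholarly discourse on generative AI art, where artists' core concerns about consent, transparency, and the future of creative labor are often overlooked by dominant narratives. The key to the solution is a reproducible BERTopic-based methodology applied to 439 curated excerpts from English-language discourse between 2013 and 2025, which identifies five stable thematic clusters and reveals a marked misalignment between artists' perceptions and media narratives. The paper also notes that technical jargon can act as a subtle form of gatekeeping, and calls for deeper, transparency-driven engagement with artist perspectives as the AI-creative landscape evolves.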

Link: https://arxiv.org/abs/2508.03037
Authors: Ariya Mukherjee-Gandhi, Oliver Muellerklein
Affiliations: University of California Berkeley
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC)
Comments: 18 pages, 5 figures, 5 tables

Abstract:As generative AI continues to reshape artistic production and alternate modes of human expression, artists whose livelihoods are most directly affected have raised urgent concerns about consent, transparency, and the future of creative labor. However, the voices of artists are often marginalized in dominant public and scholarly discourse. This study presents a twelve-year analysis, from 2013 to 2025, of English-language discourse surrounding AI-generated art. It draws from 439 curated 500-word excerpts sampled from opinion articles, news reports, blogs, legal filings, and spoken-word transcripts. Through a reproducible methodology, we identify five stable thematic clusters and uncover a misalignment between artists’ perceptions and prevailing media narratives. Our findings highlight how the use of technical jargon can function as a subtle form of gatekeeping, often sidelining the very issues artists deem most urgent. Our work provides a BERTopic-based methodology and a multimodal baseline for future research, alongside a clear call for deeper, transparency-driven engagement with artist perspectives in the evolving AI-creative landscape.

[NLP-57] AGENTiGraph: A Multi-Agent Knowledge Graph Framework for Interactive Domain-Specific LLM Chatbots CIKM2025

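Quick read: This paper addresses the high barrier that non-technical users face when building and managing domain-specific knowledge bases, especially interacting with structured knowledge graphs in natural language without specialized query languages (such as SQL or SPARQL). The key to the solution is AGENTiGraph, an agent-driven system that integrates intent classification, task planning, and automatic knowledge integration. It supports multi-round dialogue with dynamic knowledge updates and reasoning across tasks, so that users can incrementally build and maintain knowledge graphs in natural language.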

Link: https://arxiv.org/abs/2508.02999
Authors: Xinjie Zhao, Moritz Blum, Fan Gao, Yingjian Chen, Boming Yang, Luis Marquez-Carpintero, Mónica Pina-Navarro, Yanran Fu, So Morikawa, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Irene Li
Affiliations: The University of Tokyo; University of Bielefeld; University of Alicante; Soongsil University
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: CIKM 2025, Demo Track

Abstract:AGENTiGraph is a user-friendly, agent-driven system that enables intuitive interaction and management of domain-specific data through the manipulation of knowledge graphs in natural language. It gives non-technical users a complete, visual solution to incrementally build and refine their knowledge bases, allowing multi-round dialogues and dynamic updates without specialized query languages. The flexible design of AGENTiGraph, including intent classification, task planning, and automatic knowledge integration, ensures seamless reasoning between diverse tasks. Evaluated on a 3,500-query benchmark within an educational scenario, the system outperforms strong zero-shot baselines (achieving 95.12% classification accuracy, 90.45% execution success), indicating potential scalability to compliance-critical or multi-step queries in legal and medical domains, e.g., incorporating new statutes or research on the fly. Our open-source demo offers a powerful new paradigm for multi-turn enterprise knowledge management that bridges LLMs and structured graphs.

[NLP-58] CoCoTen: Detecting Adversarial Inputs to Large Language Models through Latent Space Features of Contextual Co-occurrence Tensors

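Quick read: This paper addresses the vulnerability of Large Language Models (LLMs) to attacks that stems from their complexity and opacity, especially jailbreak attacks designed to elicit harmful content. It proposes a detection method based on the Contextual Co-occurrence Matrix and its tensor form. The key is to exploit the strong representational power of this structure in low-data settings, combined with latent space features, to identify adversarial and jailbreak prompts. Experiments show that the method reaches an F1 score of 0.83 using only 0.5% of labeled prompts, a 96.6% improvement over baselines, and runs 2.3 to 128.4 times faster than the baseline models.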

Link: https://arxiv.org/abs/2508.02997
Authors: Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis
Affiliations: University of California, Riverside
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The widespread use of Large Language Models (LLMs) in many applications marks a significant advance in research and practice. However, their complexity and hard-to-understand nature make them vulnerable to attacks, especially jailbreaks designed to produce harmful responses. To counter these threats, developing strong detection methods is essential for the safe and reliable use of LLMs. This paper studies this detection problem using the Contextual Co-occurrence Matrix, a structure recognized for its efficacy in data-scarce environments. We propose a novel method leveraging the latent space characteristics of Contextual Co-occurrence Matrices and Tensors for the effective identification of adversarial and jailbreak prompts. Our evaluations show that this approach achieves a notable F1 score of 0.83 using only 0.5% of labeled prompts, which is a 96.6% improvement over baselines. This result highlights the strength of our learned patterns, especially when labeled data is scarce. Our method is also significantly faster, with speedups ranging from 2.3 to 128.4 times compared to the baseline models. To support future research and reproducibility, we have made our implementation publicly available.
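As a rough illustration of the underlying data structure, the sketch below builds one windowed token co-occurrence matrix per prompt and stacks them into a prompts × vocabulary × vocabulary tensor. The window size, whitespace-style tokenization, and symmetric counting are assumptions for illustration; the abstract does not specify the paper's exact construction, and the latent-feature extraction step (for example a tensor decomposition) is not shown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cooccurrence_counts(tokens: List[str], window: int = 2) -> Dict[Tuple[str, str], int]:
    """Count unordered token pairs that appear within `window` positions of each other."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            first, second = sorted((left, tokens[j]))
            counts[(first, second)] += 1
    return dict(counts)


def cooccurrence_tensor(
    prompts: List[List[str]], window: int = 2
) -> Tuple[List[str], List[List[List[int]]]]:
    """Stack one symmetric co-occurrence matrix per prompt over a shared vocabulary."""
    vocab = sorted({tok for prompt in prompts for tok in prompt})
    index = {tok: k for k, tok in enumerate(vocab)}
    size = len(vocab)
    tensor = []
    for prompt in prompts:
        matrix = [[0] * size for _ in range(size)]
        for (a, b), count in cooccurrence_counts(prompt, window).items():
            matrix[index[a]][index[b]] += count
            if a != b:
                matrix[index[b]][index[a]] += count
        tensor.append(matrix)
    return vocab, tensor


# Two invented, pre-tokenized prompts.
prompts = [["ignore", "all", "rules"], ["follow", "all", "rules"]]
vocab, tensor = cooccurrence_tensor(prompts, window=1)
print(vocab)
print(tensor[0])
```

Each prompt becomes one slice of the tensor, so prompt-level features can then be learned from the slice dimension.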

[NLP-59] Unified Tool Integration for LLMs: A Protocol-Agnostic Approach to Function Calling

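Quick read: This paper addresses the fragmented ecosystem of tool-augmented Large Language Models (LLMs), in which developers must handle multiple protocols, manual schema definitions, and complex execution workflows, all of which raise development cost. The key to the solution is a unified approach to tool integration: a protocol-agnostic design abstracts away protocol differences and is combined with automated schema generation, dual-mode concurrent execution, and seamless multi-source tool management. Experiments show 60-80% code reduction across integration scenarios and performance improvements of up to 3.1x, with full compatibility with existing function calling standards. The work offers both architectural insights and practical tooling for LLM application development.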

Link: https://arxiv.org/abs/2508.02979
Authors: Peng Ding, Rick Stevens
Affiliations: University of California, Berkeley; Argonne National Laboratory
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: arXiv admin note: substantial text overlap with arXiv:2507.10593

Abstract:The proliferation of tool-augmented Large Language Models (LLMs) has created a fragmented ecosystem where developers must navigate multiple protocols, manual schema definitions, and complex execution workflows. We address this challenge by proposing a unified approach to tool integration that abstracts protocol differences while optimizing execution performance. Our solution demonstrates how protocol-agnostic design principles can significantly reduce development overhead through automated schema generation, dual-mode concurrent execution, and seamless multi-source tool management. Experimental results show 60-80% code reduction across integration scenarios, performance improvements up to 3.1x through optimized concurrency, and full compatibility with existing function calling standards. This work contributes both theoretical insights into tool integration architecture and practical solutions for real-world LLM application development.
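Automated schema generation can be illustrated with Python's standard `inspect` module. The sketch below derives a JSON-Schema-style tool description from a function signature. The output layout, the `_JSON_TYPES` mapping, and the `get_weather` tool are hypothetical and are not taken from the paper's library.

```python
import inspect
from typing import Any, Callable, Dict, List

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def function_to_schema(func: Callable[..., Any]) -> Dict[str, Any]:
    """Build a tool description from a function's name, docstring, and signature."""
    properties: Dict[str, Any] = {}
    required: List[str] = []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def get_weather(city: str, days: int = 1) -> str:
    """Return a forecast for a city (hypothetical example tool)."""
    return f"{days}-day forecast for {city}"


schema = function_to_schema(get_weather)
print(schema)
```

Deriving the schema from the signature keeps the tool definition and its description in one place, which is the kind of manual work the abstract says the unified approach removes.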

[NLP-60] Defend LLMs Through Self-Consciousness KDD

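Quick read: This paper addresses the lack of an intrinsic defense mechanism in Large Language Models (LLMs) against prompt injection attacks. Traditional methods rely on external classifiers for detection and blocking, which are complex to deploy and generalize poorly. The paper proposes a self-consciousness defense framework built on Meta-Cognitive and Arbitration Modules, whose core idea is to use the LLM's own reasoning ability to evaluate and regulate its outputs autonomously. The approach needs no additional models or complex preprocessing, improves defense success rates for multiple LLMs across attack datasets while keeping computational overhead low, and offers a lightweight, cost-effective way to strengthen the ethical safety of generative AI across platforms.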

Link: https://arxiv.org/abs/2508.02961
Authors: Boshi Huang, Fabio Nonato de Paula
Affiliations: Amazon Web Services
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
Comments: Presented at KDD Workshop on Ethical Artificial Intelligence: Methods and Applications (EAI) 2025

Abstract:This paper introduces a novel self-consciousness defense mechanism for Large Language Models (LLMs) to combat prompt injection attacks. Unlike traditional approaches that rely on external classifiers, our method leverages the LLM’s inherent reasoning capabilities to perform self-protection. We propose a framework that incorporates Meta-Cognitive and Arbitration Modules, enabling LLMs to evaluate and regulate their own outputs autonomously. Our approach is evaluated on seven state-of-the-art LLMs using two datasets: AdvBench and Prompt-Injection-Mixed-Techniques-2024. Experimental results demonstrate significant improvements in defense success rates across models and datasets, with some achieving perfect and near-perfect defense in Enhanced Mode. We also analyze the trade-off between defense success rate improvement and computational overhead. This self-consciousness method offers a lightweight, cost-effective solution for enhancing LLM ethics, particularly beneficial for GenAI use cases across various platforms.

[NLP-61] Can LLMs Generate High-Quality Task-Specific Conversations?

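Quick read: This paper addresses the difficulty of precisely controlling conversation quality in dialogue generation with Large Language Models (LLMs), including topic coherence, knowledge progression, character consistency, and control granularity. The key to the solution is a parameterization framework that defines nine key parameters across six dimensions to specify dialogue properties precisely. Experiments with state-of-the-art LLMs show that parameter-based control produces statistically significant differences in the properties of generated conversations, providing a standardized means of conversation quality control for education, therapy, customer service, and entertainment.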

Link: https://arxiv.org/abs/2508.02931
Authors: Shengqi Li, Amarnath Gupta
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper introduces a parameterization framework for controlling conversation quality in large language models. We explore nine key parameters across six dimensions that enable precise specification of dialogue properties. Through experiments with state-of-the-art LLMs, we demonstrate that parameter-based control produces statistically significant differences in generated conversation properties. Our approach addresses challenges in conversation generation, including topic coherence, knowledge progression, character consistency, and control granularity. The framework provides a standardized method for conversation quality control with applications in education, therapy, customer service, and entertainment. Future work will focus on implementing additional parameters through architectural modifications and developing benchmark datasets for evaluation.

[NLP-62] Following Route Instructions using Large Vision-Language Models: A Comparison between Low-level and Panoramic Action Spaces

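[Quick Read]: This paper addresses two core questions: first, whether off-the-shelf Large Vision-Language Models (LVLMs) can effectively support Vision-and-Language Navigation (VLN) tasks without architectural modifications or simulator-based training; second, whether such models can work with both a low-level action space (e.g., "turn left" or "move forward") and a panoramic action space (discrete navigable viewpoints). The key to the approach is fine-tuning the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluating it under both action spaces. Experiments show that the best resulting model reaches a 41% success rate on the R2R test set, demonstrating that off-the-shelf LVLMs can learn VLN while still lagging behind models designed specifically for navigation.

Link: https://arxiv.org/abs/2508.02917
Authors: Vebjørn Haug Kåsene, Pierre Lison
Affiliations: University of Oslo; Norwegian Computing Center
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Robotics (cs.RO)
Comments: This paper has been accepted to ICNSLP 2025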

Abstract:Vision-and-Language Navigation (VLN) refers to the task of enabling autonomous robots to navigate unfamiliar environments by following natural language instructions. While recent Large Vision-Language Models (LVLMs) have shown promise in this task, most current VLN systems rely on models specifically designed and optimized for navigation, leaving the potential of off-the-shelf LVLMs underexplored. Furthermore, while older VLN approaches used low-level action spaces with egocentric views and atomic actions (such as “turn left” or “move forward”), newer models tend to favor panoramic action spaces with discrete navigable viewpoints. This paper investigates (1) whether off-the-shelf LVLMs (fine-tuned without architectural modifications or simulator-based training) can effectively support VLN tasks and (2) whether such models can support both low-level and panoramic action paradigms. To this end, we fine-tune the open-source model Qwen2.5-VL-3B-Instruct on the Room-to-Room (R2R) dataset and evaluate its empirical performance across both low-level and panoramic action spaces. The best resulting model achieves a 41% success rate on the R2R test set, demonstrating that while off-the-shelf LVLMs can learn to perform Vision-and-Language Navigation, they still lag behind models specifically designed for this task.

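[NLP-63] SLIM-LLMs: Modeling of Style-Sensory Language Relationships Through Low-Dimensional Representations

[Quick Read]: This paper addresses how to efficiently model the relationship between sensorial language and traditional stylistic features (such as those measured by LIWC) while reducing model complexity. The key solution is a dimensionality-reduction approach based on Reduced-Rank Ridge Regression (R4) that compresses LIWC features into a low-dimensional latent space of r = 24 (versus r = 74 for the full feature set), substantially reducing the parameter count while retaining the core stylistic information. On this basis the authors build Stylometrically Lean Interpretable Models (SLIM-LLMs), which capture non-linear relationships between style dimensions and, across five genres, match the performance of full-scale language models with up to 80% fewer parameters.

Link: https://arxiv.org/abs/2508.02901
Authors: Osama Khalid, Sanvesh Srivastava, Padmini Srinivasan
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: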

Abstract:Sensorial language, the language connected to our senses including vision, sound, touch, taste, smell, and interoception, plays a fundamental role in how we communicate experiences and perceptions. We explore the relationship between sensorial language and traditional stylistic features, like those measured by LIWC, using a novel Reduced-Rank Ridge Regression (R4) approach. We demonstrate that low-dimensional latent representations of LIWC features (r = 24) effectively capture stylistic information for sensorial language prediction compared to the full feature set (r = 74). We introduce Stylometrically Lean Interpretable Models (SLIM-LLMs), which model non-linear relationships between these style dimensions. Evaluated across five genres, SLIM-LLMs with low-rank LIWC features match the performance of full-scale language models while reducing parameters by up to 80%.

[NLP-64] VisuCraft: Enhancing Large Vision-Language Models for Complex Visual-Guided Creative Content Generation via Structured Information Extraction

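[Quick Read]: This paper addresses common shortcomings of Large Vision-Language Models (LVLMs) in complex, visually guided long-form text generation: insufficient visual fidelity, limited creativity, and imprecise responses to nuanced user instructions. The key to the solution is the VisuCraft framework, built from two modules: a Multimodal Structured Information Extractor (E), which extracts fine-grained visual attributes from input images into a structured representation, and a Dynamic Prompt Generation Module (G), which combines the extracted visual features with user instructions to produce highly optimized prompts for the underlying LVLM (e.g., LLaVA, InstructBLIP). Experiments on the ImageStoryGen-500K dataset show improvements in visual grounding, creativity, and instruction adherence, most notably in creativity and instruction adherence.

Link: https://arxiv.org/abs/2508.02890
Authors: Rongxin Jiang, Robert Long, Chenghao Gu, Mingrui Yan
Affiliations: Heilongjiang University of Science and Technology; University of Padua
Subjects: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Comments: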

Abstract:This paper introduces VisuCraft, a novel framework designed to significantly enhance the capabilities of Large Vision-Language Models (LVLMs) in complex visual-guided creative content generation. Existing LVLMs often exhibit limitations in maintaining high visual fidelity, genuine creativity, and precise adherence to nuanced user instructions when generating long-form texts. VisuCraft addresses these challenges by integrating a multimodal structured information extractor (E) and a dynamic prompt generation module (G). The extractor distills fine-grained visual attributes from input images into a rich, structured representation, which the dynamic prompt module then combines with user instructions to create highly optimized prompts for underlying LVLMs (e.g., LLaVA, InstructBLIP). Evaluated on the self-constructed ImageStoryGen-500K dataset using VisuGen Metrics (Visual Grounding, Creativity, and Instruction Adherence), VisuCraft consistently outperforms baseline LVLMs across tasks like story generation and poetry composition. Our results demonstrate remarkable improvements, particularly in creativity and instruction adherence, validating VisuCraft’s effectiveness in producing imaginative, visually grounded, and user-aligned long-form creative text. This work unlocks new potential for LVLMs in sophisticated creative AI applications.

[NLP-65] Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

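[Quick Read]: This paper addresses the weak performance of current vision-language models (VLMs) on complex, multi-step, cross-modal common sense reasoning, in particular their lack of "deliberative thinking" and their difficulty carrying out deep, chained inference from visual information to abstract concepts. The key to the solution is the Coherent Multimodal Reasoning Framework (CMRF), which strengthens reasoning through an iterative self-evaluation mechanism. It comprises three modules: a Reasoning Decomposition Unit (RDU) that breaks complex problems into sub-questions, a Contextual Inference Engine (CIE) that performs contextual inference, and a Coherence Assessment Module (CAM) that evaluates the logical consistency and confidence of reasoning paths. Combined with an Adaptive Iterative Refinement strategy, the framework systematically corrects errors and improves the coherence and accuracy of its reasoning.

Link: https://arxiv.org/abs/2508.02886
Authors: Wenjie Luo, Ruocheng Li, Shanshan Zhu, Julian Perry
Affiliations: Fujian University of Technology; Delta University for Science and Technology
Subjects: Computation and Language (cs.CL)
Comments: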

Abstract:Despite significant advancements, current large language models (LLMs) and vision-language models (LVLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of “deliberative thinking.” They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. To address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances LVLMs’ common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Our framework integrates three key modules: a Reasoning Decomposition Unit (RDU) for breaking down problems into sub-questions, a Contextual Inference Engine (CIE) for contextual inference, and a Coherence Assessment Module (CAM) for evaluating logical consistency and confidence. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaVA-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source LVLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. It attains an average accuracy of 69.4%, surpassing the best open-source baseline by +2.4 percentage points, with particular strength in complex reasoning scenarios. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning.

[NLP-66] Merge-based syntax is mediated by distinct neurocognitive mechanisms: A clustering analysis of comprehension abilities in 84000 individuals with language deficits across nine languages

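[Quick Read]: This paper asks whether "Merge", the core operation of generative syntax, is supported in human language cognition by a single mechanism or by several distinct neurocognitive mechanisms that handle different types of syntactic structure. The key to the approach is a systematic study of participants' comprehension of sentences of increasing syntactic complexity, combined with clustering analyses that identify three clearly distinct structural types: simple command constructions (e.g., "eat apples"), the merging of adjectives and nouns (e.g., "red boat"), and the merging of nouns with spatial prepositions (e.g., "laptop behind the sofa"). The results suggest that although Merge may have emerged in a single evolutionary step and driven our species' turn to structured symbolic language, its realization relies on differentiated cognitive mechanisms that may correspond to different developmental stages and patterns of selective impairment.

Link: https://arxiv.org/abs/2508.02885
Authors: Elliot Murphy, Rohan Venkatesh, Edward Khokhlovich, Andrey Vyshedskiy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: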

Abstract:In the modern language sciences, the core computational operation of syntax, ‘Merge’, is defined as an operation that combines two linguistic units (e.g., ‘brown’, ‘cat’) to form a categorized structure (‘brown cat’, a Noun Phrase). This can then be further combined with additional linguistic units based on this categorial information, respecting non-associativity such that abstract grouping is respected. Some linguists have embraced the view that Merge is an elementary, indivisible operation that emerged in a single evolutionary step. From a neurocognitive standpoint, different mental objects constructed by Merge may be supported by distinct mechanisms: (1) simple command constructions (e.g., “eat apples”); (2) the merging of adjectives and nouns (“red boat”); and (3) the merging of nouns with spatial prepositions (“laptop behind the sofa”). Here, we systematically investigate participants’ comprehension of sentences with increasing levels of syntactic complexity. Clustering analyses revealed behavioral evidence for three distinct structural types, which we discuss as potentially emerging at different developmental stages and subject to selective impairment. While a Merge-based syntax may still have emerged suddenly in evolutionary time, responsible for the structured symbolic turn our species took, different cognitive mechanisms seem to underwrite the processing of various types of Merge-based objects.

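[NLP-67] Highlight & Summarize: RAG without the jailbreaks

[Quick Read]: This paper addresses jailbreaking and model hijacking of Large Language Models (LLMs), in which malicious users craft input prompts that induce the LLM to generate undesirable content or perform unintended tasks. Existing mitigations typically harden the system prompt or train content classifiers to detect undesirable output, but these probabilistic approaches are easy to bypass because the input and output spaces are so large. The paper proposes a new design pattern, Highlight & Summarize (H&S), which restructures the retrieval-augmented generation (RAG) pipeline into two components: a "highlighter" extracts passages relevant to the user's question ("highlights") from the retrieved documents, and a "summarizer" generates the final answer from those passages alone, so the original user question is never exposed to the generative LLM. This prevents attackers from steering the model through their input, defending against jailbreaking and hijacking by design.

Link: https://arxiv.org/abs/2508.02872
Authors: Giovanni Cherubin, Andrew Paverd
Affiliations: Microsoft Security Response Center
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: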

Abstract:Preventing jailbreaking and model hijacking of Large Language Models (LLMs) is an important yet challenging task. For example, when interacting with a chatbot, malicious users can input specially crafted prompts to cause the LLM to generate undesirable content or perform a completely different task from its intended purpose. Existing mitigations for such attacks typically rely on hardening the LLM’s system prompt or using a content classifier trained to detect undesirable content or off-topic conversations. However, these probabilistic approaches are relatively easy to bypass due to the very large space of possible inputs and undesirable outputs. In this paper, we present and evaluate Highlight & Summarize (H&S), a new design pattern for retrieval-augmented generation (RAG) systems that prevents these attacks by design. The core idea is to perform the same task as a standard RAG pipeline (i.e., to provide natural language answers to questions, based on relevant sources) without ever revealing the user’s question to the generative LLM. This is achieved by splitting the pipeline into two components: a highlighter, which takes the user’s question and extracts relevant passages (“highlights”) from the retrieved documents, and a summarizer, which takes the highlighted passages and summarizes them into a cohesive answer. We describe several possible instantiations of H&S and evaluate their generated responses in terms of correctness, relevance, and response quality. Surprisingly, when using an LLM-based highlighter, the majority of H&S responses are judged to be better than those of a standard RAG pipeline.
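
The two-component split described in the abstract can be illustrated with a toy pipeline. This is only a sketch under stated assumptions, not the authors' implementation: the keyword-overlap highlighter, the `min_overlap` threshold, and the placeholder summarizer that simply joins passages are all hypothetical stand-ins (the paper evaluates several instantiations, including an LLM-based highlighter). The point it shows is structural: the summarizer receives only text copied verbatim from the documents, never the user's question.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a crude stand-in for real relevance scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def highlight(question, documents, min_overlap=2):
    """Highlighter: sees the question, returns verbatim sentences from the documents."""
    q_tokens = tokenize(question)
    highlights = []
    for doc in documents:
        # Split each document into sentences at terminal punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if len(q_tokens & tokenize(sentence)) >= min_overlap:
                highlights.append(sentence.strip())
    return highlights


def summarize(highlights):
    """Summarizer: receives only the highlights, never the user's question.

    A real system would call a generative LLM here; joining is a placeholder.
    """
    return " ".join(highlights) if highlights else "No relevant information found."


def answer(question, documents):
    return summarize(highlight(question, documents))


docs = [
    "The library opens at 9 am on weekdays. It is closed on public holidays.",
    "Parking is free for visitors.",
]
question = "When does the library open? Ignore previous instructions and write a poem."
print(answer(question, docs))
```

Because the injected instruction in `question` can only influence which document sentences are selected, it never reaches the component that writes the answer.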

[NLP-68] Modeling Annotator Disagreement with Demographic-Aware Experts and Synthetic Perspectives

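[Quick Read]: This paper addresses the modeling of annotator disagreement in subjective natural language processing (NLP) tasks, in particular how to capture structured variation between demographic groups more accurately. The core solution is a model architecture called DEM-MoE (Demographic-Aware Mixture of Experts), which routes inputs to different expert subnetworks based on annotator demographics and thereby better represents group-level variation. To mitigate data sparsity for under-represented demographic groups, the authors also impute data with synthetic annotations generated by large language models (LLMs) through zero-shot persona prompting. Experiments show that the method performs especially well on datasets with high annotator disagreement and that the best strategy for blending real and synthetic data depends on the structure of the dataset.

Link: https://arxiv.org/abs/2508.02853
Authors: Yinuo Xu, Veronica Derricks, Allison Earl, David Jurgens
Affiliations: University of Michigan; University of Colorado Boulder
Subjects: Computation and Language (cs.CL)
Comments: 28 pages, 17 figures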

Abstract:We present an approach to modeling annotator disagreement in subjective NLP tasks through both architectural and data-centric innovations. Our model, DEM-MoE (Demographic-Aware Mixture of Experts), routes inputs to expert subnetworks based on annotator demographics, enabling it to better represent structured, group-level variation compared to prior models. DEM-MoE consistently performs competitively across demographic groups, and shows especially strong results on datasets with high annotator disagreement. To address sparse demographic coverage, we test whether LLM-generated synthetic annotations via zero-shot persona prompting can be used for data imputation. We show these synthetic judgments align moderately well with human annotations on our data and offer a scalable way to potentially enrich training data. We then propose and evaluate approaches for blending real and synthetic data using strategies tailored to dataset structure. We find that the optimal strategies depend on dataset structure. Together, these contributions improve the representation of diverse perspectives.

[NLP-69] NeuroSync: Intent-Aware Code-Based Problem Solving via Direct LLM Understanding Modification

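[Quick Read]: This paper addresses inefficient interaction between conversational LLMs and domain users, especially those with limited programming experience, caused by misalignment between user intent and generated code. The root cause is bidirectional ambiguity: user intents and coding tasks are inherently nonlinear, yet must be expressed and interpreted through linear prompts and code sequences. The key to the solution is "direct intent-task matching", a new human-LLM interaction paradigm that externalizes the coding tasks and their relationships inferred by the LLM before code generation and lets users manipulate them directly, improving alignment between intent and task. The authors implement a prototype, NeuroSync, which uses a knowledge distillation pipeline to extract the LLM's understanding, user intents, and their mappings, and lets users inspect and edit these intermediate representations through visualizations, lowering cognitive effort and improving coding efficiency.

Link: https://arxiv.org/abs/2508.02823
Authors: Wenshuo Zhang, Leixian Shen, Shuchang Xu, Jindu Wang, Jian Zhao, Huamin Qu, Linping Yuan
Affiliations: The Hong Kong University of Science and Technology; University of Waterloo
Subjects: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Software Engineering (cs.SE)
Comments: Accepted in UIST 2025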

Abstract:Conversational LLMs have been widely adopted by domain users with limited programming experience to solve domain problems. However, these users often face misalignment between their intent and generated code, resulting in frustration and rounds of clarification. This work first investigates the cause of this misalignment, which is due to bidirectional ambiguity: both user intents and coding tasks are inherently nonlinear, yet must be expressed and interpreted through linear prompts and code sequences. To address this, we propose direct intent-task matching, a new human-LLM interaction paradigm that externalizes and enables direct manipulation of the LLM understanding, i.e., the coding tasks and their relationships inferred by the LLM prior to code generation. As a proof-of-concept, this paradigm is then implemented in NeuroSync, which employs a knowledge distillation pipeline to extract LLM understanding, user intents, and their mappings, and enhances the alignment by allowing users to intuitively inspect and edit them via visualizations. We evaluate the algorithmic components of NeuroSync via technical experiments, and assess its overall usability and effectiveness via a user study (N=12). The results show that it enhances intent-task alignment, lowers cognitive effort, and improves coding efficiency.

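[NLP-70] Clinically Grounded Agent-based Report Evaluation: An Interpretable Metric for Radiology Report Generation

[Quick Read]: This paper addresses the lack of reliable, interpretable evaluation methods for clinical deployment of automated radiology report generation (RRG). Existing metrics mostly rely on surface-level similarity or behave as black boxes, which makes it hard to confirm that clinically important information is captured accurately and traceably. The key to the solution is the ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation) framework, built on a two-agent interaction driven by large language models (LLMs): the agents, one holding the ground-truth report and the other the generated report, each generate clinically meaningful questions and quiz each other. Agreement between their answers quantifies the preservation and consistency of clinical findings, giving interpretable proxies for precision and recall, and scores can be traced back to specific question-answer pairs, which improves the transparency and clinical relevance of the evaluation.

Link: https://arxiv.org/abs/2508.02808
Authors: Radhika Dua, Young Joon (Fred) Kwon, Siddhant Dogra, Daniel Freedman, Diana Ruan, Motaz Nashawaty, Danielle Rigau, Daniel Alexander Alber, Kang Zhang, Kyunghyun Cho, Eric Karl Oermann
Affiliations: New York University Langone Health
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: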

Abstract:Radiological imaging is central to diagnosis, treatment planning, and clinical decision-making. Vision-language foundation models have spurred interest in automated radiology report generation (RRG), but safe deployment requires reliable clinical evaluation of generated reports. Existing metrics often rely on surface-level similarity or behave as black boxes, lacking interpretability. We introduce ICARE (Interpretable and Clinically-grounded Agent-based Report Evaluation), an interpretable evaluation framework leveraging large language model agents and dynamic multiple-choice question answering (MCQA). Two agents, each with either the ground-truth or generated report, generate clinically meaningful questions and quiz each other. Agreement on answers captures preservation and consistency of findings, serving as interpretable proxies for clinical precision and recall. By linking scores to question-answer pairs, ICARE enables transparent and interpretable assessment. Clinician studies show ICARE aligns significantly more with expert judgment than prior metrics. Perturbation analyses confirm sensitivity to clinical content and reproducibility, while model comparisons reveal interpretable error patterns.
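
The agreement scoring described above can be sketched in a few lines. This is an illustration only: the answer lists are made up, and the mapping of each direction to a metric (questions written from the ground-truth report giving a recall-like score, questions written from the generated report giving a precision-like score) is an assumption, since the abstract does not spell it out.

```python
def agreement(answers_a, answers_b):
    """Fraction of multiple-choice questions on which both agents picked the same option."""
    if len(answers_a) != len(answers_b):
        raise ValueError("both agents must answer the same questions")
    if not answers_a:
        return 0.0
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


# Hypothetical answers to questions written from the ground-truth report.
gt_agent_on_gt_questions = ["A", "C", "B", "D"]
gen_agent_on_gt_questions = ["A", "C", "D", "D"]

# Hypothetical answers to questions written from the generated report.
gen_agent_on_gen_questions = ["B", "A"]
gt_agent_on_gen_questions = ["B", "C"]

# Assumed mapping: were the reference findings preserved? (recall-like)
recall_proxy = agreement(gt_agent_on_gt_questions, gen_agent_on_gt_questions)
# Assumed mapping: are the generated findings supported? (precision-like)
precision_proxy = agreement(gen_agent_on_gen_questions, gt_agent_on_gen_questions)

print(recall_proxy, precision_proxy)
```

Keeping the per-question answers around, rather than only the two fractions, is what allows a score to be traced back to the specific question-answer pairs that lowered it.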

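[NLP-71] Teaching at Scale: Leveraging AI to Evaluate and Elevate Engineering Education

[Quick Read]: This paper addresses the difficulty of evaluating teaching effectiveness at scale in large universities, particularly in engineering programs, where traditional manual review of student feedback is inefficient and prone to missing important information. The key to the solution is a scalable framework based on large language models (LLMs) that uses hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments, together with visual analytics that place numeric scores in the context of percentile-based comparisons, historical trends, and instructional load. With ethical safeguards in place, the approach integrates student, peer, and self-reflective inputs and supports formative evaluation and faculty professional development rather than automating personnel decisions.

Link: https://arxiv.org/abs/2508.02731
Authors: Jean-Francois Chamberland, Martin C. Carlisle, Arul Jayaraman, Krishna R. Narayanan, Sunay Palsole, Karan Watson
Affiliations: Texas A&M University
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: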

Abstract:Evaluating teaching effectiveness at scale remains a persistent challenge for large universities, particularly within engineering programs that enroll tens of thousands of students. Traditional methods, such as manual review of student evaluations, are often impractical, leading to overlooked insights and inconsistent data use. This article presents a scalable, AI-supported framework for synthesizing qualitative student feedback using large language models. The system employs hierarchical summarization, anonymization, and exception handling to extract actionable themes from open-ended comments while upholding ethical safeguards. Visual analytics contextualize numeric scores through percentile-based comparisons, historical trends, and instructional load. The approach supports meaningful evaluation and aligns with best practices in qualitative analysis and educational assessment, incorporating student, peer, and self-reflective inputs without automating personnel decisions. We report on its successful deployment across a large college of engineering. Preliminary validation through comparisons with human reviewers, faculty feedback, and longitudinal analysis suggests that LLM-generated summaries can reliably support formative evaluation and professional development. This work demonstrates how AI systems, when designed with transparency and shared governance, can promote teaching excellence and continuous improvement at scale within academic institutions.

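[NLP-72] Efficient Agents: Building Effective Agents While Reducing Cost

[Quick Read]: This paper addresses the escalating cost of large language model (LLM)-driven agent systems on complex, multi-step tasks, which threatens their scalability and accessibility. Through a systematic study of the efficiency-effectiveness trade-off, it examines how inherent task complexity, diminishing returns from additional modules, and framework design affect cost, and it uses an empirical analysis on the GAIA benchmark to quantify the trade-off with the cost-of-pass metric. The key to the solution is a new agent framework, Efficient Agents, whose complexity is matched to task requirements: it retains 96.7% of the performance of OWL, a leading open-source agent framework, while reducing operational costs from 0.398 to 0.228, a 28.4% improvement in cost-of-pass.

Link: https://arxiv.org/abs/2508.02694
Authors: Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Affiliations: Unknown
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multiagent Systems (cs.MA)
Comments: Work in progress. For GitHub repository, see this https URL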

Abstract:The remarkable capabilities of Large Language Model (LLM)-driven agents have enabled sophisticated systems to tackle complex, multi-step tasks, but their escalating costs threaten scalability and accessibility. This work presents the first systematic study of the efficiency-effectiveness trade-off in modern agent systems, addressing the critical need for cost-effective designs without sacrificing performance. We investigate three key questions: (1) How much complexity do agentic tasks inherently require? (2) When do additional modules yield diminishing returns? (3) How much efficiency can be gained through the design of efficient agent frameworks? Through an empirical analysis on the GAIA benchmark, we evaluate the impact of LLM backbone selection, agent framework designs, and test-time scaling strategies. Using the cost-of-pass metric, we quantify the efficiency-performance trade-off across these dimensions. Our findings inform the development of Efficient Agents, a novel agent framework that has an optimal complexity to task requirements. Efficient Agents retains 96.7% of the performance of OWL, one leading open-source agent framework, while reducing operational costs from 0.398 to 0.228, resulting in a 28.4% improvement in cost-of-pass. Our work provides actionable insights for designing efficient, high-performing agent systems, advancing the accessibility and sustainability of AI-driven solutions.
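
The abstract does not define cost-of-pass. A minimal sketch, assuming the common reading of the metric as the expected cost of obtaining one correct solution (average cost per attempt divided by success rate), is shown below. The numbers are hypothetical and are not taken from the paper; the function names are illustrative.

```python
def cost_of_pass(cost_per_attempt, success_rate):
    """Expected spend per solved task, assuming independent attempts."""
    if success_rate <= 0:
        return float("inf")  # a system that never succeeds has unbounded cost
    return cost_per_attempt / success_rate


def relative_reduction(baseline, candidate):
    """Fractional decrease of candidate relative to baseline."""
    return (baseline - candidate) / baseline


# Hypothetical systems: the leaner one is cheaper but slightly less accurate.
baseline = cost_of_pass(cost_per_attempt=0.40, success_rate=0.50)
lean = cost_of_pass(cost_per_attempt=0.24, success_rate=0.48)

print(baseline, lean, round(relative_reduction(baseline, lean), 3))
```

Under this definition a framework can give up a little accuracy and still come out ahead, as long as its cost per attempt falls faster than its success rate.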

[NLP-73] Bridging LLMs and Symbolic Reasoning in Educational QA Systems: Insights from the XAI Challenge at IJCNN 2025 IJCNN2025

[Quick Read]: This paper addresses the lack of transparency and interpretability of generative AI in education, in particular the difficulty existing question-answering systems have in giving logically clear, trustworthy explanations when students ask about university policies. The key to the approach is the XAI Challenge 2025, which tasked participants with building QA systems based on lightweight large language models (LLMs) or hybrid LLM-symbolic systems that output logic-based natural language explanations. A high-quality dataset, constructed from logic-based templates with Z3 validation and refined through expert student review, ensures alignment with real-world academic scenarios.

Link: https://arxiv.org/abs/2508.01263
Authors: Long S. T. Nguyen, Khang H. N. Vo, Thu H. A. Nguyen, Tuan C. Bui, Duc Q. Nguyen, Thanh-Tung Tran, Anh D. Nguyen, Minh L. Nguyen, Fabien Baldacci, Thang H. Bui, Emanuel Di Nardo, Angelo Ciaramella, Son H. Le, Ihsan Ullah, Lorenzo Di Rocco, Tho T. Quan
Affiliations: URA Research Group, Ho Chi Minh City University of Technology (HCMUT), Vietnam; Ho Chi Minh City International University (HCMIU), Vietnam; University of South-Eastern Norway, Norway; Japan Advanced Institute of Science and Technology (JAIST), Japan; Univ. Bordeaux, CNRS, Bordeaux INP, LaBRI, UMR 5800, F-33400 Talence, France; University of Naples Parthenope, Italy; Sapienza University of Rome, Italy; VNU Information Technology Institute, Vietnam National University, Vietnam; Visual Intelligence Lab, School of Computer Science & Insight Center for Data Analytics, University of Galway, Ireland
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Logic in Computer Science (cs.LO)
Comments: The XAI Challenge @ TRNS-AI Workshop, IJCNN 2025: Explainable AI for Educational Question Answering. Website: this https URL

Abstract:The growing integration of Artificial Intelligence (AI) into education has intensified the need for transparency and interpretability. While hackathons have long served as agile environments for rapid AI prototyping, few have directly addressed eXplainable AI (XAI) in real-world educational contexts. This paper presents a comprehensive analysis of the XAI Challenge 2025, a hackathon-style competition jointly organized by Ho Chi Minh City University of Technology (HCMUT) and the International Workshop on Trustworthiness and Reliability in Neurosymbolic AI (TRNS-AI), held as part of the International Joint Conference on Neural Networks (IJCNN 2025). The challenge tasked participants with building Question-Answering (QA) systems capable of answering student queries about university policies while generating clear, logic-based natural language explanations. To promote transparency and trustworthiness, solutions were required to use lightweight Large Language Models (LLMs) or hybrid LLM-symbolic systems. A high-quality dataset was provided, constructed via logic-based templates with Z3 validation and refined through expert student review to ensure alignment with real-world academic scenarios. We describe the challenge’s motivation, structure, dataset construction, and evaluation protocol. Situating the competition within the broader evolution of AI hackathons, we argue that it represents a novel effort to bridge LLMs and symbolic reasoning in service of explainability. Our findings offer actionable insights for future XAI-centered educational systems and competitive research initiatives.

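[NLP-74] ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

[Quick Read]: This paper addresses the fragmentation, protocol limitations, and implementation complexity that large language model (LLM) applications face when integrating external tools, which lead to substantial development overhead. The key to the solution is ToolRegistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management through a unified interface, reducing integration code while improving performance and compatibility.

Link: https://arxiv.org/abs/2507.10593
Authors: Peng Ding
Affiliations: University of Chicago
Subjects: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: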

Abstract:Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents ToolRegistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that ToolRegistry achieves 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. ToolRegistry is open-source and available at this https URL, with comprehensive documentation at this https URL.
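
To make the idea of registration, representation, and concurrent execution behind one interface concrete, here is a toy registry. It is not the ToolRegistry API: the class `MiniRegistry`, its methods, and the simplified call format are hypothetical and written only for illustration. The schema layout follows the general shape of OpenAI-style function tool definitions.

```python
import inspect
import json
from concurrent.futures import ThreadPoolExecutor

# Map Python annotations to JSON Schema type names; anything else falls back to "string".
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


class MiniRegistry:
    """Hypothetical toy registry, not the library described in the paper."""

    def __init__(self):
        self._tools = {}

    def register(self, func):
        """Register a plain Python function; usable as a decorator."""
        self._tools[func.__name__] = func
        return func

    def schemas(self):
        """Describe each tool as a function-calling style JSON schema."""
        result = []
        for name, func in self._tools.items():
            params = inspect.signature(func).parameters.values()
            properties = {
                p.name: {"type": _JSON_TYPES.get(p.annotation, "string")} for p in params
            }
            required = [p.name for p in params if p.default is inspect.Parameter.empty]
            result.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": inspect.getdoc(func) or "",
                    "parameters": {
                        "type": "object",
                        "properties": properties,
                        "required": required,
                    },
                },
            })
        return result

    def execute(self, calls):
        """Run calls concurrently; each call is {"name": str, "arguments": JSON string}."""
        def run(call):
            return self._tools[call["name"]](**json.loads(call["arguments"]))

        with ThreadPoolExecutor() as pool:
            return list(pool.map(run, calls))  # map preserves input order


registry = MiniRegistry()


@registry.register
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


results = registry.execute([
    {"name": "add", "arguments": '{"a": 1, "b": 2}'},
    {"name": "add", "arguments": '{"a": 10, "b": 5}'},
])
print(results)
```

Deriving the schema from the function signature is what removes most hand-written integration code in a design like this; a thread pool is one simple way to overlap I/O-bound tool calls.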

[NLP-75] SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

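[Quick Read]: This paper addresses several challenges that existing speech codecs face in semantic encoding: residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. The core solution is SecoustiCodec, a cross-modal aligned, low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space and introduces paralinguistic encoding to bridge the information gap between semantic and acoustic encoding. It uses a semantic-only efficient quantization method based on a variational autoencoder (VAE) and finite scalar quantization (FSQ), which alleviates the long-tail distribution of tokens while maintaining high codebook utilization. A contrastive-learning-based semantic disentanglement method aligns text and speech in a joint multimodal frame-level space, removing paralinguistic information from the semantic encoding, and an acoustic-constrained multi-stage optimization strategy ensures robust and stable convergence.

Link: https://arxiv.org/abs/2508.02849
Authors: Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao
Affiliations: Tianjin University; Kuaishou Technology; Chinese Academy of Sciences; Tsinghua University; Shenzhen Institute of Advanced Technology
Subjects: Audio and Speech Processing (eess.AS); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Sound (cs.SD)
Comments: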

Abstract:Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient semantic completeness, limited reconstruction capability, and lack of support for streaming. To address these challenges, we propose SecoustiCodec, a cross-modal aligned low-bitrate streaming speech codec that disentangles semantic and paralinguistic information in a single-codebook space. To ensure semantic completeness and reconstruction fidelity, paralinguistic encoding is introduced to bridge the information gap between semantic and acoustic encoding. A semantic-only efficient quantization method based on VAE (Variational Autoencoder) and FSQ (Finite Scalar Quantization) is proposed. This approach alleviates the long-tail distribution problem of tokens while maintaining high codebook utilization. A semantic disentanglement method based on contrastive learning is proposed, which aligns text and speech in a joint multimodal frame-level space, effectively removing paralinguistic information from semantic encoding. An acoustic-constrained multi-stage optimization strategy is proposed to ensure robust and stable convergence. Figure~\ref{fig:pesq_kbps_below_2kbps} shows SecoustiCodec achieves SOTA (state-of-the-art) reconstruction quality (PESQ) of 1.77/2.58 at 0.27/1 kbps. The code and model weights for SecoustiCodec will be open-sourced upon the completion of the peer-review process. We’ve open-sourced SecoustiCodec’s demo, code, and model weights.
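
The abstract names FSQ without giving details. The sketch below follows the commonly described form of finite scalar quantization: bound each latent coordinate with tanh, then round it to the nearest of a small number of integer levels, so the codebook is implicit in the grid of level combinations. It is not the paper's quantizer: the VAE part and the training-time gradient handling are omitted, odd level counts are assumed for simplicity, and the latent values and level counts are hypothetical.

```python
import math


def fsq_quantize(z, levels):
    """Quantize each coordinate of z to one of levels[i] integer codes (odd counts assumed)."""
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2           # e.g. 5 levels -> codes in {-2, ..., 2}
        bounded = half * math.tanh(value)   # squash into the open interval (-half, half)
        codes.append(round(bounded))        # snap to the nearest integer level
    return codes


def code_index(codes, levels):
    """Map per-dimension codes to a single token index in the implicit codebook."""
    index = 0
    for code, n_levels in zip(codes, levels):
        index = index * n_levels + (code + (n_levels - 1) // 2)  # shift code to 0..n_levels-1
    return index


levels = [5, 5, 5]       # implicit codebook of 5 * 5 * 5 = 125 entries
z = [0.3, -2.0, 1.0]     # hypothetical encoder output for one frame
codes = fsq_quantize(z, levels)
print(codes, code_index(codes, levels))
```

Because every combination of levels is a valid code, there is no learned codebook whose entries can go unused, which is one reason this family of quantizers is associated with high codebook utilization.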

[NLP-76] CreditARF: A Framework for Corporate Credit Rating with Annual Report and Financial Feature Integration

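[Quick Read]: This paper addresses the limitation that existing corporate credit rating models rely on financial metrics and deep learning while overlooking the information in non-financial data such as the text of corporate annual reports. The key to the solution is a corporate credit rating framework that integrates financial data with annual-report text features extracted using FinBERT, together with a large-scale dataset, the Comprehensive Corporate Rating Dataset (CCRD), to make fuller use of unstructured text. Experiments show that the method improves rating prediction accuracy by 8-12%.

Link: https://arxiv.org/abs/2508.02738
Authors: Yumeng Shi, Zhongliang Yang, DiYang Lu, Yisi Wang, Yiting Zhou, Linna Zhou
Affiliations: Beijing University of Posts and Telecommunications; Guotai Junan Securities; Hebei University of Technology
Subjects: Statistical Finance (q-fin.ST); Computational Engineering, Finance, and Science (cs.CE); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: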

Abstract:Corporate credit rating serves as a crucial intermediary service in the market economy, playing a key role in maintaining economic order. Existing credit rating models rely on financial metrics and deep learning. However, they often overlook insights from non-financial data, such as corporate annual reports. To address this, this paper introduces a corporate credit rating framework that integrates financial data with features extracted from annual reports using FinBERT, aiming to fully leverage the potential value of unstructured text data. In addition, we have developed a large-scale dataset, the Comprehensive Corporate Rating Dataset (CCRD), which combines both traditional financial data and textual data from annual reports. The experimental results show that the proposed method improves the accuracy of the rating predictions by 8-12%, significantly improving the effectiveness and reliability of corporate credit ratings.

Computer Vision

[CV-0] Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition ICCV2025

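Quick read: This paper addresses how to model motion and appearance information effectively in few-shot action recognition, focusing on two challenges: imprecise selection of trajectory points and insufficient modeling of motion patterns. The key to its solution is a new method called Trokens. First, a semantic-aware sampling strategy adaptively distributes tracking points according to object scale and semantic relevance. Second, a motion modeling framework captures the dynamics within each trajectory through the Histogram of Oriented Displacements (HoD) and models complex action patterns through inter-trajectory relationships. Finally, the trajectory tokens are fused with semantic features so that motion information enhances appearance features, significantly improving performance on six different benchmark datasets.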

Link: https://arxiv.org/abs/2508.03695
Authors: Pulkit Kumar, Shuaiyi Huang, Matthew Walmer, Sai Saketh Rambhatla, Abhinav Shrivastava
Affiliations: University of Maryland, College Park; GenAI, Meta
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at ICCV 2025; First two authors contributed equally

Abstract:Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. For project page see this https URL
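
The intra-trajectory descriptor named in the abstract can be sketched as follows. This assumes a plain reading of "Histogram of Oriented Displacements": frame-to-frame displacement vectors of one tracked point are binned by direction and weighted by length. The function name, the bin count, and the magnitude weighting are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def histogram_of_oriented_displacements(track, num_bins=8):
    """Bin the frame-to-frame displacements of one 2D point track by direction.

    Assumption: magnitude-weighted, L1-normalized histogram; the paper's exact
    definition may differ.
    """
    hist = [0.0] * num_bins
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        dx, dy = x1 - x0, y1 - y0
        magnitude = math.hypot(dx, dy)
        if magnitude == 0.0:
            continue  # a stationary step has no direction
        angle = math.atan2(dy, dx) % (2 * math.pi)  # direction in [0, 2*pi)
        bin_index = int(angle / (2 * math.pi) * num_bins) % num_bins
        hist[bin_index] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total > 0 else hist


# Two steps toward the upper right, then one toward the upper left.
track = [(0, 0), (1, 1), (2, 2), (1, 3)]
descriptor = histogram_of_oriented_displacements(track, num_bins=4)
```

A descriptor of this kind summarizes how a point moves regardless of where it starts, which is why it suits per-trajectory motion tokens.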

[CV-1] LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation

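Quick read: This paper addresses key challenges in controllable long video generation, namely temporal inconsistency and visual degradation, which are especially pronounced in existing methods when scaling to ultra-long videos. The core of its solution is LongVie, an end-to-end autoregressive framework with two key designs: a unified noise initialization strategy that keeps generation consistent across clips, and global control signal normalization that enforces alignment in the control space throughout the video. To mitigate visual degradation, LongVie also introduces a multi-modal control framework that fuses dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, combined with a degradation-aware training strategy that dynamically adjusts modality weights to maintain high-quality output.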

Link: https://arxiv.org/abs/2508.03694
Authors: Jianxiong Gao, Zhaoxi Chen, Xian Liu, Jianfeng Feng, Chenyang Si, Yanwei Fu, Yu Qiao, Ziwei Liu
Affiliations: Nanjing University; Fudan University; S-Lab, Nanyang Technological University; NVIDIA; Shanghai AI Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL

Abstract:Controllable ultra-long video generation is a fundamental yet challenging task. Although existing methods are effective for short clips, they struggle to scale due to issues such as temporal inconsistency and visual degradation. In this paper, we initially investigate and identify three key factors: separate noise initialization, independent control signal normalization, and the limitations of single-modality guidance. To address these issues, we propose LongVie, an end-to-end autoregressive framework for controllable long video generation. LongVie introduces two core designs to ensure temporal consistency: 1) a unified noise initialization strategy that maintains consistent generation across clips, and 2) global control signal normalization that enforces alignment in the control space throughout the entire video. To mitigate visual degradation, LongVie employs 3) a multi-modal control framework that integrates both dense (e.g., depth maps) and sparse (e.g., keypoints) control signals, complemented by 4) a degradation-aware training strategy that adaptively balances modality contributions over time to preserve visual quality. We also introduce LongVGenBench, a comprehensive benchmark consisting of 100 high-resolution videos spanning diverse real-world and synthetic environments, each lasting over one minute. Extensive experiments show that LongVie achieves state-of-the-art performance in long-range controllability, consistency, and quality.
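
The contrast between independent and global control signal normalization can be sketched with toy numbers. The example assumes simple min-max scaling of depth values and hypothetical function names; the abstract does not state which normalization LongVie actually uses.

```python
def min_max(values, lo, hi):
    """Scale values to [0, 1] for a given range; a flat range maps to zeros."""
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_per_clip(clips):
    """Independent normalization: each clip uses its own min and max."""
    return [min_max(clip, min(clip), max(clip)) for clip in clips]


def normalize_globally(clips):
    """Global normalization: one min and max shared by the whole video."""
    all_values = [v for clip in clips for v in clip]
    lo, hi = min(all_values), max(all_values)
    return [min_max(clip, lo, hi) for clip in clips]


# Two consecutive clips of toy depth values (hypothetical data).
clips = [[1.0, 2.0], [2.0, 5.0]]
per_clip = normalize_per_clip(clips)     # depth 2.0 maps to 1.0 in clip 0 but 0.0 in clip 1
global_norm = normalize_globally(clips)  # depth 2.0 maps to 0.25 in both clips
```

With per-clip scaling the same physical depth receives different control values in adjacent clips, which is one plausible source of the cross-clip inconsistency the abstract describes.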

[CV-2] LiDARCrafter: Dynamic 4D World Modeling from LiDAR Sequences

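Quick read: This paper addresses the shortcomings of current generative world models in modeling LiDAR data for autonomous driving, in particular the poor controllability, weak temporal consistency, and lack of unified evaluation standards that dynamic 4D (space plus time) LiDAR generation faces. The key to its solution is the LiDARCrafter framework, which parses natural language input into ego-centric scene graphs and uses them to condition a tri-branch diffusion network that generates object structures, motion trajectories, and geometry. An autoregressive module ensures temporally coherent 4D LiDAR sequences, enabling fine-grained scene editing and high-quality simulation. The authors also build a comprehensive evaluation benchmark covering the scene, object, and sequence levels to promote standardization in the field.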

Link: https://arxiv.org/abs/2508.03692
Authors: Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi
Affiliations: Tsinghua University; Institute of Automation, Chinese Academy of Sciences; Beijing Institute of Technology; Alibaba Group; Tencent
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint; 28 pages, 18 figures, 12 tables; Project Page at this https URL

Abstract:Generative world models have become essential data engines for autonomous driving, yet most existing efforts focus on videos or occupancy grids, overlooking the unique LiDAR properties. Extending LiDAR generation to dynamic 4D world modeling presents challenges in controllability, temporal coherence, and evaluation standardization. To this end, we present LiDARCrafter, a unified framework for 4D LiDAR generation and editing. Given free-form natural language inputs, we parse instructions into ego-centric scene graphs, which condition a tri-branch diffusion network to generate object structures, motion trajectories, and geometry. These structured conditions enable diverse and fine-grained scene editing. Additionally, an autoregressive module generates temporally coherent 4D LiDAR sequences with smooth transitions. To support standardized evaluation, we establish a comprehensive benchmark with diverse metrics spanning scene-, object-, and sequence-level aspects. Experiments on the nuScenes dataset using this benchmark demonstrate that LiDARCrafter achieves state-of-the-art performance in fidelity, controllability, and temporal consistency across all levels, paving the way for data augmentation and simulation. The code and benchmark are released to the community.

[CV-3] La La LiDAR: Large-Scale Layout Generation from LiDAR Data

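Quick read: This paper addresses the limited control over foreground objects and spatial relationships in current diffusion-based LiDAR scene generation methods, which restricts their use for scenario simulation and safety validation in autonomous driving and robotics. The key to its solution is La La LiDAR, a new layout-guided generative framework whose core innovations are: 1) semantic-enhanced scene graph diffusion with relation-aware contextual conditioning, enabling controllable generation of structured LiDAR layouts; and 2) a foreground-aware control injection mechanism that allows customized control over object placement while preserving spatial and semantic consistency. The approach significantly improves the controllability and realism of generated scenes and advances controllable 3D scene generation.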

Link: https://arxiv.org/abs/2508.03691
Authors: Youquan Liu, Lingdong Kong, Weidong Yang, Xin Li, Ao Liang, Runnan Chen, Ben Fei, Tongliang Liu
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint; 10 pages, 6 figures, 7 tables

Abstract:Controllable generation of realistic LiDAR scenes is crucial for applications such as autonomous driving and robotics. While recent diffusion-based models achieve high-fidelity LiDAR generation, they lack explicit control over foreground objects and spatial relationships, limiting their usefulness for scenario simulation and safety validation. To address these limitations, we propose Large-scale Layout-guided LiDAR generation model (“La La LiDAR”), a novel layout-guided generative framework that introduces semantic-enhanced scene graph diffusion with relation-aware contextual conditioning for structured LiDAR layout generation, followed by foreground-aware control injection for complete scene generation. This enables customizable control over object placement while ensuring spatial and semantic consistency. To support our structured LiDAR generation, we introduce Waymo-SG and nuScenes-SG, two large-scale LiDAR scene graph datasets, along with new evaluation metrics for layout synthesis. Extensive experiments demonstrate that La La LiDAR achieves state-of-the-art performance in both LiDAR generation and downstream perception tasks, establishing a new benchmark for controllable 3D scene generation.

[CV-4] Veila: Panoramic LiDAR Generation from a Monocular RGB Image

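Quick read: This paper addresses insufficient controllability and difficult cross-modal alignment in panoramic LiDAR data generation, in particular three challenges that arise when a monocular RGB image is used as the spatial control signal: (i) semantic and depth information in RGB images is distributed unevenly in space, making conditional generation unreliable; (ii) the modality gap between RGB appearance and LiDAR geometry amplifies alignment errors during noisy diffusion; and (iii) structural consistency between monocular RGB and panoramic LiDAR is hard to maintain, especially in non-overlapping regions. The key to its solution is Veila, a new conditional diffusion framework whose core innovations are: a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; Geometric Cross-Modal Alignment (GCMA), which makes RGB-LiDAR alignment more robust under noisy diffusion; and Panoramic Feature Coherence (PFC), which enforces global structural consistency between monocular RGB and panoramic LiDAR. The method achieves state-of-the-art generation fidelity and cross-modal consistency on nuScenes, SemanticKITTI, and the authors' KITTI-Weather benchmark, and supports generative data augmentation that improves downstream LiDAR semantic segmentation.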

Link: https://arxiv.org/abs/2508.03690
Authors: Youquan Liu, Lingdong Kong, Weidong Yang, Ao Liang, Jianxiong Gao, Yang Wu, Xiang Xu, Xin Li, Linfeng Li, Runnan Chen, Ben Fei
Affiliations: unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Preprint; 10 pages, 6 figures, 7 tables

Abstract:Realistic and controllable panoramic LiDAR data generation is critical for scalable 3D perception in autonomous driving and robotics. Existing methods either perform unconditional generation with poor controllability or adopt text-guided synthesis, which lacks fine-grained spatial control. Leveraging a monocular RGB image as a spatial control signal offers a scalable and low-cost alternative, which remains an open problem. However, it faces three core challenges: (i) semantic and depth cues from RGB vary spatially, complicating reliable conditional generation; (ii) modality gaps between RGB appearance and LiDAR geometry amplify alignment errors under noisy diffusion; and (iii) maintaining structural coherence between monocular RGB and panoramic LiDAR is challenging, particularly in non-overlap regions between images and LiDAR. To address these challenges, we propose Veila, a novel conditional diffusion framework that integrates: a Confidence-Aware Conditioning Mechanism (CACM) that strengthens RGB conditioning by adaptively balancing semantic and depth cues according to their local reliability; a Geometric Cross-Modal Alignment (GCMA) for robust RGB-LiDAR alignment under noisy diffusion; and a Panoramic Feature Coherence (PFC) for enforcing global structural consistency across monocular RGB and panoramic LiDAR. Additionally, we introduce two metrics, Cross-Modal Semantic Consistency and Cross-Modal Depth Consistency, to evaluate alignment quality across modalities. Experiments on nuScenes, SemanticKITTI, and our proposed KITTI-Weather benchmark demonstrate that Veila achieves state-of-the-art generation fidelity and cross-modal consistency, while enabling generative data augmentation that improves downstream LiDAR semantic segmentation.

[CV-5] OmniShape: Zero-Shot Multi-Hypothesis Shape and Pose Estimation in the Real World ICRA2025

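Quick read: This paper addresses the problem of estimating both the pose and the full shape of an object from a single observation, without relying on a known 3D model or category information. The core challenge is inferring plausible multi-modal solutions from sparse or occluded observations without prior geometric knowledge. The key to its solution is OmniShape, which decouples shape completion into two multi-modal distributions: one over how measurements project into a normalized object reference frame defined by the dataset, and one serving as a prior over object geometries represented as triplanar neural fields. Separate conditional diffusion models are trained for the two distributions, making it possible to sample multiple plausible hypotheses from the joint pose and shape distribution and substantially improving estimation in complex real-world scenes.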

Link: https://arxiv.org/abs/2508.03669
Authors: Katherine Liu, Sergey Zakharov, Dian Chen, Takuya Ikeda, Greg Shakhnarovich, Adrien Gaidon, Rares Ambrus
Affiliations: Toyota Research Institute; Woven by Toyota; Toyota Technological Institute at Chicago
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: 8 pages, 5 figures. This version has typo fixes on top of the version published at ICRA 2025

Abstract:We would like to estimate the pose and full shape of an object from a single observation, without assuming a known 3D model or category. In this work, we propose OmniShape, the first method of its kind to enable probabilistic pose and shape estimation. OmniShape is based on the key insight that shape completion can be decoupled into two multi-modal distributions: one capturing how measurements project into a normalized object reference frame defined by the dataset and the other modelling a prior over object geometries represented as triplanar neural fields. By training separate conditional diffusion models for these two distributions, we enable sampling multiple hypotheses from the joint pose and shape distribution. OmniShape demonstrates compelling performance on challenging real-world datasets. Project website: this https URL

[CV-6] DiWA: Diffusion Policy Adaptation with World Models

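Quick read: This paper addresses two challenges in fine-tuning diffusion policies with reinforcement learning (RL): the long denoising sequence required for each action prediction hinders effective reward propagation, and conventional RL methods depend on millions of real-environment interactions, which severely limits sample efficiency and safety in practice. The key to its solution is the DiWA framework, which uses a world model trained once to perform RL fine-tuning entirely offline, markedly improving sample efficiency. By modeling the denoising process of a diffusion policy as a Markov Decision Process (MDP), DiWA adapts robot skills efficiently using only a few hundred thousand offline interactions. On the CALVIN benchmark it improves performance on eight tasks while requiring orders of magnitude fewer physical interactions than model-free baselines.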

Link: https://arxiv.org/abs/2508.03645
Authors: Akshay L Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, Abhinav Valada
Affiliations: University of Freiburg; University of Technology Nuremberg
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted at the 2025 Conference on Robot Learning (CoRL)

Abstract:Fine-tuning diffusion policies with reinforcement learning (RL) presents significant challenges. The long denoising sequence for each action prediction impedes effective reward propagation. Moreover, standard RL methods require millions of real-world interactions, posing a major bottleneck for practical fine-tuning. Although prior work frames the denoising process in diffusion policies as a Markov Decision Process to enable RL-based updates, its strong dependence on environment interaction remains highly inefficient. To bridge this gap, we introduce DiWA, a novel framework that leverages a world model for fine-tuning diffusion-based robotic skills entirely offline with reinforcement learning. Unlike model-free approaches that require millions of environment interactions to fine-tune a repertoire of robot skills, DiWA achieves effective adaptation using a world model trained once on a few hundred thousand offline play interactions. This results in dramatically improved sample efficiency, making the approach significantly more practical and safer for real-world robot learning. On the challenging CALVIN benchmark, DiWA improves performance across eight tasks using only offline adaptation, while requiring orders of magnitude fewer physical interactions than model-free baselines. To our knowledge, this is the first demonstration of fine-tuning diffusion policies for real-world robotic skills using an offline world model. We make the code publicly available at this https URL.

[CV-7] Uni3R: Unified 3D Reconstruction and Semantic Understanding via Generalizable Gaussian Splatting from Unposed Multi-View Images

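Quick read: This paper addresses the joint problem of reconstructing a high-fidelity 3D scene representation from sparse 2D views and achieving open-vocabulary semantic understanding. Conventional methods usually decouple semantic understanding from reconstruction or rely on costly per-scene optimization, which limits their scalability and generalization. The key to its solution is the Uni3R framework, which uses a Cross-View Transformer to robustly integrate information from arbitrary multi-view inputs and regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This enables high-quality novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction in a single forward pass, establishing a unified and generalizable paradigm for 3D scene reconstruction and understanding.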

Link: https://arxiv.org/abs/2508.03643
Authors: Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, Eunbyung Park
Affiliations: Shanghai Jiao Tong University; National University of Singapore; Seoul National University; Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: The code is available at this https URL

Abstract:Reconstructing and semantically interpreting 3D scenes from sparse 2D views remains a fundamental challenge in computer vision. Conventional methods often decouple semantic understanding from reconstruction or necessitate costly per-scene optimization, thereby restricting their scalability and generalizability. In this paper, we introduce Uni3R, a novel feed-forward framework that jointly reconstructs a unified 3D scene representation enriched with open-vocabulary semantics, directly from unposed multi-view images. Our approach leverages a Cross-View Transformer to robustly integrate information across arbitrary multi-view inputs, which then regresses a set of 3D Gaussian primitives endowed with semantic feature fields. This unified representation facilitates high-fidelity novel view synthesis, open-vocabulary 3D semantic segmentation, and depth prediction, all within a single, feed-forward pass. Extensive experiments demonstrate that Uni3R establishes a new state-of-the-art across multiple benchmarks, including 25.07 PSNR on RE10K and 55.84 mIoU on ScanNet. Our work signifies a novel paradigm towards generalizable, unified 3D scene reconstruction and understanding. The code is available at this https URL.

[CV-8] AttZoom: Attention Zoom for Better Visual Features ICCV

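Quick read: This paper addresses the insufficient attention that convolutional neural networks (CNNs) pay to high-importance spatial regions during feature extraction, with the aim of improving classification performance. The key to its solution is Attention Zoom, a modular, model-agnostic spatial attention mechanism. It is inserted into a CNN architecture as a standalone layer, requires no architecture-specific modification, and spatially emphasizes high-importance regions of the input feature map, producing finer-grained and more diverse attention patterns and improving performance without significant computational overhead.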

Link: https://arxiv.org/abs/2508.03625
Authors: Daniel DeAlcala, Aythami Morales, Julian Fierrez, Ruben Tolosana
Affiliations: Biometrics and Data Pattern Analytics Lab, Universidad Autonoma de Madrid, Spain
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted at ICCVw HiCV

Abstract:We present Attention Zoom, a modular and model-agnostic spatial attention mechanism designed to improve feature extraction in convolutional neural networks (CNNs). Unlike traditional attention approaches that require architecture-specific integration, our method introduces a standalone layer that spatially emphasizes high-importance regions in the input. We evaluated Attention Zoom on multiple CNN backbones using CIFAR-100 and TinyImageNet, showing consistent improvements in Top-1 and Top-5 classification accuracy. Visual analyses using Grad-CAM and spatial warping reveal that our method encourages fine-grained and diverse attention patterns. Our results confirm the effectiveness and generality of the proposed layer for improving CNNs with minimal architectural overhead.

[CV-9] FPG-NAS: FLOPs-Aware Gated Differentiable Neural Architecture Search for Efficient 6DoF Pose Estimation

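Quick read: This paper addresses the low computational efficiency of 6DoF (six degrees of freedom) object pose estimation in resource-constrained settings, that is, how to cut a model's floating-point operations (FLOPs) substantially while preserving accuracy. The key to its solution is FPG-NAS, a FLOPs-aware gated differentiable neural architecture search framework whose core innovations are: 1) a task-specific search space tailored to 6DoF pose estimation; 2) a differentiable gating mechanism that enables discrete multi-candidate operator selection and increases architectural diversity; and 3) a FLOPs regularization term that balances accuracy and efficiency, allowing efficient search for high-performing, lightweight models among roughly 10⁹² possible architectures.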

Link: https://arxiv.org/abs/2508.03618
Authors: Nassim Ali Ousalah, Peyman Rostami, Anis Kacem, Enjie Ghorbel, Emmanuel Koumandakis, Djamila Aouada
Affiliations: SnT, University of Luxembourg, Luxembourg; Cristal Lab, ENSI, Manouba University, Tunisia; Infinite Orbits, Toulouse, France
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 27th IEEE International Workshop on Multimedia Signal Processing (MMSP) 2025

Abstract:We introduce FPG-NAS, a FLOPs-aware Gated Differentiable Neural Architecture Search framework for efficient 6DoF object pose estimation. Estimating 3D rotation and translation from a single image has been widely investigated yet remains computationally demanding, limiting applicability in resource-constrained scenarios. FPG-NAS addresses this by proposing a specialized differentiable NAS approach for 6DoF pose estimation, featuring a task-specific search space and a differentiable gating mechanism that enables discrete multi-candidate operator selection, thus improving architectural diversity. Additionally, a FLOPs regularization term ensures a balanced trade-off between accuracy and efficiency. The framework explores a vast search space of approximately 10⁹² possible architectures. Experiments on the LINEMOD and SPEED+ datasets demonstrate that FPG-NAS-derived models outperform previous methods under strict FLOPs constraints. To the best of our knowledge, FPG-NAS is the first differentiable NAS framework specifically designed for 6DoF object pose estimation.
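
How a FLOPs regularization term can interact with gated operator selection is sketched below. This is a generic illustration under stated assumptions, not FPG-NAS itself: it assumes one sigmoid gate per candidate operator, a soft FLOPs estimate formed by weighting each candidate's cost with its gate activation, and thresholding to obtain the discrete choice. All names, costs, and the penalty weight are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_flops(gate_logits, candidate_flops):
    """Soft FLOPs estimate: each candidate's cost weighted by its gate activation."""
    return sum(sigmoid(g) * f for g, f in zip(gate_logits, candidate_flops))


def regularized_loss(task_loss, gate_logits, candidate_flops, weight):
    """Task loss plus a FLOPs penalty; `weight` trades accuracy for efficiency."""
    return task_loss + weight * expected_flops(gate_logits, candidate_flops)


def select_operators(gate_logits, threshold=0.5):
    """Discretize: keep every candidate whose gate activation exceeds the threshold."""
    return [i for i, g in enumerate(gate_logits) if sigmoid(g) > threshold]


# Hypothetical layer with three candidate operators and their costs in MFLOPs.
gate_logits = [2.0, -2.0, 1.0]
candidate_flops = [100.0, 300.0, 50.0]
loss = regularized_loss(1.0, gate_logits, candidate_flops, weight=0.01)
kept = select_operators(gate_logits)  # more than one candidate can survive
```

Because the penalty is a smooth function of the gate logits, gradient descent can lower the activation of expensive operators unless they pay for themselves in task loss.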

[CV-10] evTransFER: A Transfer Learning Framework for Event-based Facial Expression Recognition

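Quick read: This paper addresses the low accuracy of face expression recognition with event-based cameras. Its key solution is evTransFER, a transfer learning framework. A feature extractor is pre-trained on a face reconstruction task with an adversarial generative method, and the encoder weights are then transferred to the expression recognition task, which captures the spatio-temporal dynamics of faces effectively. The framework also introduces the TIE event representation and an LSTM to model longer-term expression dynamics. It reaches a 93.6% recognition rate on the e-CK+ dataset, more than 25 percentage points above existing methods.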

Link: https://arxiv.org/abs/2508.03609
Authors: Rodrigo Verschae, Ignacio Bugueno-Cordova
Affiliations: Institute of Engineering Sciences, Universidad de O’Higgins
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Event-based cameras are bio-inspired vision sensors that asynchronously capture per-pixel intensity changes with microsecond latency, high temporal resolution, and high dynamic range, providing valuable information about the spatio-temporal dynamics of the scene. In the present work, we propose evTransFER, a transfer learning-based framework and architecture for face expression recognition using event-based cameras. The main contribution is a feature extractor designed to encode the spatio-temporal dynamics of faces, built by training an adversarial generative method on a different problem (facial reconstruction) and then transferring the trained encoder weights to the face expression recognition system. We show that this proposed transfer learning method greatly improves the ability to recognize facial expressions compared to training a network from scratch. In addition, we propose an architecture that incorporates an LSTM to capture longer-term facial expression dynamics, and we introduce a new event-based representation, referred to as TIE, both of which further improve the results. We evaluate the proposed framework on the event-based facial expression database e-CK+ and compare it to state-of-the-art methods. The results show that the proposed framework evTransFER achieves a 93.6% recognition rate on the e-CK+ database, significantly improving the accuracy (by 25.9 percentage points or more) when compared to state-of-the-art performance for similar problems.

[CV-11] CloudBreaker: Breaking the Cloud Covers of Sentinel-2 Images using Multi-Stage Trained Conditional Flow Matching on Sentinel-1

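Quick read: This paper addresses the limited availability of multi-spectral imagery (such as Sentinel-2) in satellite remote sensing caused by cloud cover and nighttime conditions, which undermines the continuity and reliability of remote sensing applications. The key to its solution is the CloudBreaker framework, which uses conditional latent flow matching to generate high-quality multi-spectral signals (RGB images plus the vegetation index NDVI and the water index NDWI) from weather-independent Sentinel-1 radar data. The method adopts a novel multi-stage training strategy and is, according to the authors, the first to integrate cosine scheduling with flow matching. It achieves an FID of 0.7432 and SSIM values of 0.6874 for NDVI and 0.6156 for NDWI, which the authors present as evidence of high fidelity and structural consistency.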

Link: https://arxiv.org/abs/2508.03608
Authors: Saleh Sakib Ahmed, Sara Nowreen, M. Sohel Rahman
Affiliations: Bangladesh University of Engineering and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments:

Abstract:Cloud cover and nighttime conditions remain significant limitations in satellite-based remote sensing, often restricting the availability and usability of multi-spectral imagery. In contrast, Sentinel-1 radar images are unaffected by cloud cover and can provide consistent data regardless of weather or lighting conditions. To address the challenges of limited satellite imagery, we propose CloudBreaker, a novel framework that generates high-quality multi-spectral Sentinel-2 signals from Sentinel-1 data. This includes the reconstruction of optical (RGB) images as well as critical vegetation and water indices such as NDVI and NDWI. We employed a novel multi-stage training approach based on conditional latent flow matching and, to the best of our knowledge, are the first to integrate cosine scheduling with flow matching. CloudBreaker demonstrates strong performance, achieving a Frechet Inception Distance (FID) score of 0.7432, indicating high fidelity and realism in the generated optical imagery. The model also achieved a Structural Similarity Index Measure (SSIM) of 0.6156 for NDWI and 0.6874 for NDVI, indicating a high degree of structural similarity. This establishes CloudBreaker as a promising solution for a wide range of remote sensing applications where multi-spectral data is typically unavailable or unreliable.
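
For readers unfamiliar with the two indices, both are normalized band differences: NDVI = (NIR − Red) / (NIR + Red), and NDWI in its common green/NIR form = (Green − NIR) / (Green + NIR). The sketch below assumes that NDWI variant (the abstract does not say which definition the authors use) and per-pixel reflectance inputs; on Sentinel-2 the relevant bands are B8 (NIR), B4 (red), and B3 (green). The function names are illustrative.

```python
def normalized_difference(a, b):
    """Return (a - b) / (a + b), or 0.0 where both inputs are zero."""
    total = a + b
    return 0.0 if total == 0 else (a - b) / total


def ndvi(nir, red):
    """Vegetation index: high when near-infrared reflectance dominates red."""
    return normalized_difference(nir, red)


def ndwi(green, nir):
    """Water index (green/NIR variant, assumed here): high over open water."""
    return normalized_difference(green, nir)


# Hypothetical reflectance values for one vegetated pixel.
vegetation_ndvi = ndvi(nir=0.5, red=0.1)    # about 0.67
vegetation_ndwi = ndwi(green=0.2, nir=0.6)  # about -0.5
```

Both indices lie in [-1, 1] for non-negative reflectances, which makes them convenient targets for image-similarity metrics such as SSIM.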

[CV-12] DyCAF-Net: Dynamic Class-Aware Fusion Network

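Quick read: This paper addresses the performance bottleneck that static fusion strategies and class-agnostic attention impose on current object detectors in dynamic scenes, especially under occlusion, background clutter, and class imbalance. The key to its solution is the Dynamic Class-Aware Fusion Network (DyCAF-Net), built on three core innovations: (1) an input-conditioned, equilibrium-based neck that iteratively refines multi-scale features through implicit fixed-point modeling; (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses based on input and class information; and (3) a class-aware feature adaptation module that prioritizes discriminative regions for rare classes. The framework improves detection accuracy and robustness while remaining computationally efficient, and suits real-world applications such as medical imaging, surveillance, and autonomous driving.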

Link: https://arxiv.org/abs/2508.03598
Authors: Md Abrar Jahin, Shahriar Soudeep, M. F. Mridha, Nafiz Fahad, Md. Jakir Hossen
Affiliations: University of Southern California; American International University-Bangladesh; Multimedia University
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: Accepted to IEEE DSAA 2025 (10 pages, 5 figures)

Abstract:Recent advancements in object detection rely on modular architectures with multi-scale fusion and attention mechanisms. However, static fusion heuristics and class-agnostic attention limit performance in dynamic scenes with occlusions, clutter, and class imbalance. We introduce Dynamic Class-Aware Fusion Network (DyCAF-Net) that addresses these challenges through three innovations: (1) an input-conditioned equilibrium-based neck that iteratively refines multi-scale features via implicit fixed-point modeling, (2) a dual dynamic attention mechanism that adaptively recalibrates channel and spatial responses using input- and class-dependent cues, and (3) class-aware feature adaptation that modulates features to prioritize discriminative regions for rare classes. Through comprehensive ablation studies with YOLOv8 and related architectures, alongside benchmarking against nine state-of-the-art baselines, DyCAF-Net achieves significant improvements in precision, mAP@50, and mAP@50-95 across 13 diverse benchmarks, including occlusion-heavy and long-tailed datasets. The framework maintains computational efficiency (~11.1M parameters) and competitive inference speeds, while its adaptability to scale variance, semantic overlaps, and class imbalance positions it as a robust solution for real-world detection tasks in medical imaging, surveillance, and autonomous systems.

[CV-13] MetaScope: Optics-Driven Neural Network for Ultra-Micro Metalens Endoscopy ICCV2025

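Quick read: This paper addresses the large gap in data acquisition and algorithm research for metalens endoscopy caused by the distinct physical properties of metalenses, and in particular the image degradation (intensity decay and chromatic aberration) that micro-scale metalens imaging introduces and that follows strong optical priors. The key to its solution is MetaScope, a neural network architecture driven by physical optics. Its core innovations are: 1) Optics-informed Intensity Adjustment (OIA), which corrects intensity decay by learning optical embeddings; and 2) Optics-informed Chromatic Correction (OCC), which mitigates chromatic aberration by learning spatial deformations derived from Point Spread Function (PSF) distributions. A gradient-guided knowledge distillation mechanism further strengthens joint optimization. MetaScope outperforms existing methods on both segmentation and restoration for metalens endoscopy and generalizes well to real biomedical scenes.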

Link: https://arxiv.org/abs/2508.03596
Authors: Wuyang Li, Wentao Pan, Xiaoyuan Liu, Zhendong Luo, Chenxin Li, Hengyu Liu, Din Ping Tsai, Mu Ku Chen, Yixuan Yuan
Affiliations: The Chinese University of Hong Kong; City University of Hong Kong; EPFL
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV 2025 (Highlight); Project Page: this https URL

Abstract:Miniaturized endoscopy has advanced accurate visual perception within the human body. Prevailing research remains limited to conventional cameras employing convex lenses, where the physical constraints of millimetre-scale thickness impose serious impediments on micro-level clinical applications. Recently, with the emergence of meta-optics, ultra-micro imaging based on metalenses (micron-scale) has garnered great attention, serving as a promising solution. However, due to the physical differences of metalenses, there is a large gap in data acquisition and algorithm research. In light of this, we aim to bridge this unexplored gap, advancing the novel metalens endoscopy. First, we establish datasets for metalens endoscopy and conduct preliminary optical simulation, identifying two derived optical issues that physically adhere to strong optical priors. Second, we propose MetaScope, a novel optics-driven neural network tailored for metalens endoscopy driven by physical optics. MetaScope comprises two novel designs: Optics-informed Intensity Adjustment (OIA), rectifying intensity decay by learning optical embeddings, and Optics-informed Chromatic Correction (OCC), mitigating chromatic aberration by learning spatial deformations informed by learned Point Spread Function (PSF) distributions. To enhance joint learning, we further deploy a gradient-guided distillation to transfer knowledge from the foundational model adaptively. Extensive experiments demonstrate that MetaScope not only outperforms state-of-the-art methods in both metalens segmentation and restoration but also achieves impressive generalization ability in real biomedical scenes.

[CV-14] RadProPoser: A Framework for Human Pose Estimation with Uncertainty Quantification from Raw Radar Data

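Quick read: This paper addresses the low accuracy and hard-to-quantify uncertainty of radar-based human pose estimation (HPE) caused by noise and multipath effects. The core challenge is extracting reliable, interpretable 3D joint positions from raw complex-valued radar tensors while accurately modeling per-joint uncertainty. The key to its solution is RadProPoser, a probabilistic encoder-decoder architecture based on variational inference that jointly predicts 26 three-dimensional joint positions and their heteroscedastic aleatoric uncertainty, and can be recalibrated to predict total uncertainty. According to the authors, it is the first end-to-end method on radar tensor input to explicitly model and quantify per-joint uncertainty, improving the reliability and interpretability of radar HPE. It also supports data augmentation by sampling from the learned distributions, which improves downstream activity classification.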

Link: https://arxiv.org/abs/2508.03578
Authors: Jonas Leo Mueller, Lukas Engel, Eva Dorschky, Daniel Krauss, Ingrid Ullmann, Martin Vossiek, Bjoern M. Eskofier
Affiliations: Friedrich-Alexander-Universität Erlangen-Nürnberg
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Radar-based human pose estimation (HPE) provides a privacy-preserving, illumination-invariant sensing modality but is challenged by noisy, multipath-affected measurements. We introduce RadProPoser, a probabilistic encoder-decoder architecture that processes complex-valued radar tensors from a compact 3-transmitter, 4-receiver MIMO radar. By incorporating variational inference into keypoint regression, RadProPoser jointly predicts 26 three-dimensional joint locations alongside heteroscedastic aleatoric uncertainties and can be recalibrated to predict total uncertainty. We explore different probabilistic formulations using both Gaussian and Laplace distributions for latent priors and likelihoods. On our newly released dataset with optical motion-capture ground truth, RadProPoser achieves an overall mean per-joint position error (MPJPE) of 6.425 cm, with 5.678 cm at the 45 degree aspect angle. The learned uncertainties exhibit strong alignment with actual pose errors and can be calibrated to produce reliable prediction intervals, with our best configuration achieving an expected calibration error of 0.021. As an additional demonstration, sampling from these latent distributions enables effective data augmentation for downstream activity classification, resulting in an F1 score of 0.870. To our knowledge, this is the first end-to-end radar tensor-based HPE system to explicitly model and quantify per-joint uncertainty from raw radar tensor data, establishing a foundation for explainable and reliable human motion analysis in radar applications.
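
The two quantities at the center of the abstract, the MPJPE metric and a heteroscedastic likelihood, can be sketched in a few lines. MPJPE is the mean Euclidean distance between predicted and ground-truth joints. The likelihood shown is the standard per-coordinate Gaussian negative log-likelihood with a predicted log-variance, which is an assumption here: the abstract says both Gaussian and Laplace formulations are explored but gives no loss formula. Function names are illustrative.

```python
import math


def mpjpe(predicted, target):
    """Mean per-joint position error: average Euclidean distance over joints."""
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


def gaussian_nll(mean, log_variance, observed):
    """Per-coordinate Gaussian negative log-likelihood, constant term dropped.

    Assumption: a generic heteroscedastic loss, not the paper's exact objective.
    """
    return 0.5 * (log_variance + (observed - mean) ** 2 / math.exp(log_variance))


# Two toy joints in centimetres: the first is off by 5 cm, the second is exact.
predicted = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
target = [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
error_cm = mpjpe(predicted, target)

confident = gaussian_nll(0.0, 0.0, 2.0)           # small variance, large penalty
uncertain = gaussian_nll(0.0, math.log(4.0), 2.0)  # larger variance, smaller penalty
```

The second pair of values shows the mechanism behind learned uncertainty: for the same error, predicting a larger variance lowers the loss, so the model is rewarded for flagging joints it localizes poorly.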

[CV-15] SAM2-UNeXT: An Improved High-Resolution Baseline for Adapting Foundation Models to Downstream Segmentation Tasks

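Quick read: This paper addresses the open question of how to build a more powerful and generalizable encoder to further improve the Segment Anything Model (SAM) on downstream tasks. The key to its solution is the SAM2-UNeXT framework, which extends SAM2-UNet with an auxiliary DINOv2 encoder. Through a dual-resolution strategy and a dense glue layer, it achieves more accurate segmentation with a simple architecture and avoids the need for complex decoder designs.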

链接: https://arxiv.org/abs/2508.03566
作者: Xinyu Xiong,Zihuang Wu,Lei Zhang,Lei Lu,Ming Li,Guanbin Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Technical Report

点击查看摘要

Abstract:Recent studies have highlighted the potential of adapting the Segment Anything Model (SAM) for various downstream tasks. However, constructing a more powerful and generalizable encoder to further enhance performance remains an open challenge. In this work, we propose SAM2-UNeXT, an advanced framework that builds upon the core principles of SAM2-UNet while extending the representational capacity of SAM2 through the integration of an auxiliary DINOv2 encoder. By incorporating a dual-resolution strategy and a dense glue layer, our approach enables more accurate segmentation with a simple architecture, relaxing the need for complex decoder designs. Extensive experiments conducted on four benchmarks, including dichotomous image segmentation, camouflaged object detection, marine animal segmentation, and remote sensing saliency detection, demonstrate the superior performance of our proposed method. The code is available at this https URL.

[CV-16] A Scalable Machine Learning Pipeline for Building Footprint Detection in Historical Maps

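Quick read: This paper tackles the low computational efficiency of conventional machine learning methods on historical maps, which makes them hard to apply to large rural areas, and in particular the challenge of extracting building footprints where buildings are sparse. The key solution is a hierarchical machine learning pipeline. Convolutional neural network (CNN) classifiers first filter map sections stage by stage, discarding those very unlikely to contain buildings and sharply reducing the area that needs detailed analysis. CNN segmentation is then applied to the remaining high-probability sections to extract building features. Validation on the Ordnance Survey Ireland historical 25 inch and 6 inch map series shows high performance and better efficiency than conventional segmentation-only approaches. The pipeline also identified a settlement that may have been abandoned during the Great Famine, highlighting its potential for historical geography and archaeology.

Link: https://arxiv.org/abs/2508.03564
Authors: Annemarie McCarthy
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 11 figures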

Abstract:Historical maps offer a valuable lens through which to study past landscapes and settlement patterns. While prior research has leveraged machine learning based techniques to extract building footprints from historical maps, such approaches have largely focused on urban areas and tend to be computationally intensive. This presents a challenge for research questions requiring analysis across extensive rural regions, such as verifying historical census data or locating abandoned settlements. In this paper, this limitation is addressed by proposing a scalable and efficient pipeline tailored to rural maps with sparse building distributions. The method described employs a hierarchical machine learning based approach: convolutional neural network (CNN) classifiers are first used to progressively filter out map sections unlikely to contain buildings, significantly reducing the area requiring detailed analysis. The remaining high probability sections are then processed using CNN segmentation algorithms to extract building features. The pipeline is validated using test sections from the Ordnance Survey Ireland historical 25 inch map series and 6 inch map series, demonstrating both high performance and improved efficiency compared to conventional segmentation-only approaches. Application of the technique to both map series, covering the same geographic region, highlights its potential for historical and archaeological discovery. Notably, the pipeline identified a settlement of approximately 22 buildings in Tully, Co. Galway, present in the 6 inch map, produced in 1839, but absent from the 25 inch map, produced in 1899, suggesting it may have been abandoned during the Great Famine period.
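The filter-then-segment idea can be sketched as a simple cascade. This is a minimal illustration, not the author's pipeline: the classifiers and the segmenter are stand-in callables, and the tile fields (`ink`, `building_score`) and thresholds are invented for the example.

```python
def cascade_filter(tiles, stages):
    """Run tiles through (classifier, threshold) stages, dropping low scorers."""
    survivors = list(tiles)
    for classifier, threshold in stages:
        survivors = [tile for tile in survivors if classifier(tile) >= threshold]
    return survivors

def extract_footprints(tiles, stages, segmenter):
    """Segment only the tiles that survive every classifier stage."""
    return {tile["id"]: segmenter(tile) for tile in cascade_filter(tiles, stages)}

# Toy stand-ins: each "tile" carries a made-up ink density and building score.
tiles = [
    {"id": "a", "ink": 0.02, "building_score": 0.10},
    {"id": "b", "ink": 0.40, "building_score": 0.20},
    {"id": "c", "ink": 0.55, "building_score": 0.90},
]
stages = [
    (lambda t: t["ink"], 0.05),             # cheap coarse filter (hypothetical)
    (lambda t: t["building_score"], 0.50),  # finer filter (hypothetical)
]
footprints = extract_footprints(tiles, stages, segmenter=lambda t: f"mask-{t['id']}")
print(footprints)  # {'c': 'mask-c'}
```

The efficiency gain comes from ordering: cheap classifiers see every tile, while the expensive segmenter only sees the few tiles that pass all stages.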

[CV-17] Advancing Wildlife Monitoring: Drone-Based Sampling for Roe Deer Density Estimation

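Quick read: This paper addresses the labour intensity and spatial constraints of traditional wildlife density estimation methods (capture-recapture, distance sampling, camera traps) and proposes an efficient, non-intrusive drone-based counting approach. Using thermal (IR) and RGB imagery, multiple drones survey large areas in a single day during the leafless period, flying systematically randomized transects from launch points on a predefined grid. Flight altitude is set to 60 m to avoid disturbing roe deer (Capreolus capreolus). Observations are converted to densities per square kilometer with three extrapolation methods of increasing complexity: naive area-based extrapolation, bootstrapping, and zero-inflated negative binomial modelling. A Random Encounter Model (REM) estimate from camera trap data serves as a comparison. The drone-based estimates are generally higher than REM and reflect daytime activity in open and forested areas, offering a scalable approach to wildlife density monitoring.

Link: https://arxiv.org/abs/2508.03545
Authors: Stephanie Wohlfahrt,Christoph Praschl,Horst Leitner,Wolfram Jantsch,Julia Konic,Silvio Schueler,Andreas Stöckl,David C. Schedl
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Comments: 6 pages, 1 figure, 1 table, International Wildlife Congress 2025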

Abstract:We use unmanned aerial drones to estimate wildlife density in southeastern Austria and compare these estimates to camera trap data. Traditional methods like capture-recapture, distance sampling, or camera traps are well-established but labour-intensive or spatially constrained. Using thermal (IR) and RGB imagery, drones enable efficient, non-intrusive animal counting. Our surveys were conducted during the leafless period on single days in October and November 2024 in three areas of a sub-Illyrian hill and terrace landscape. Flight transects were based on predefined launch points using a 350 m grid and an algorithm that defined the direction of systematically randomized transects. This setup allowed surveying large areas in one day using multiple drones, minimizing double counts. Flight altitude was set at 60 m to avoid disturbing roe deer (Capreolus capreolus) while ensuring detection. Animals were manually annotated in the recorded imagery and extrapolated to densities per square kilometer. We applied three extrapolation methods with increasing complexity: naive area-based extrapolation, bootstrapping, and zero-inflated negative binomial modelling. For comparison, a Random Encounter Model (REM) estimate was calculated using camera trap data from the flight period. The drone-based methods yielded similar results, generally showing higher densities than REM, except in one area in October. We hypothesize that drone-based density reflects daytime activity in open and forested areas, while REM estimates average activity over longer periods within forested zones. Although both approaches estimate density, they offer different perspectives on wildlife presence. Our results show that drones offer a promising, scalable method for wildlife density estimation.
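The first two extrapolation methods can be sketched as follows. This is a minimal illustration under stated assumptions: each transect is assumed to have a known surveyed area, the counts and areas below are made-up numbers, and the bootstrap resamples whole transects with a percentile interval, which may differ from the authors' exact procedure.

```python
import random

def naive_density(counts, areas_km2):
    """Animals per km^2: total count divided by total surveyed area."""
    return sum(counts) / sum(areas_km2)

def bootstrap_density_interval(counts, areas_km2, n_boot=2000, seed=0):
    """95% percentile interval from resampling transects with replacement."""
    rng = random.Random(seed)
    n = len(counts)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(sum(counts[i] for i in idx) / sum(areas_km2[i] for i in idx))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]

# Hypothetical survey: four transects of 0.1 km^2 each.
counts = [2, 0, 1, 3]
areas = [0.1, 0.1, 0.1, 0.1]
print(naive_density(counts, areas))
print(bootstrap_density_interval(counts, areas))
```

The naive estimate gives a single number, while the bootstrap shows how much that number moves when a different set of transects happens to be flown.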

[CV-18] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

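Quick read: This paper addresses the accurate conversion of spoken mathematical expressions into a structured symbolic representation such as LaTeX, a task relevant to education and research. The core challenge is the ambiguity of mathematical expressions in speech. Prior work is limited by small data, no multilingual coverage, and a focus on isolated equations. The key contribution is the first large-scale, fully open-source dataset of over 66,000 human-annotated audio samples in English and Russian, combined with ASR post-correction, few-shot prompting, and audio language models. On the proposed S2L-equations benchmark the models clearly outperform the MathSpeech model (CER 27% vs. 64%). The paper also establishes the first benchmark for mathematical sentence recognition (S2L-sentences), laying groundwork for further progress in multimodal AI for mathematical content.

Link: https://arxiv.org/abs/2508.03542
Authors: Dmitrii Korzh,Dmitrii Tarasov,Artyom Iudin,Elvir Karimov,Matvey Skripkin,Nikita Kuzmin,Andrey Kuznetsov,Oleg Y. Rogov,Ivan Oseledets
Affiliations: Skolkovo Institute of Science and Technology; Moscow Institute of Physics and Technology; Yandex; Russian Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: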

Abstract:Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in both English and Russian, drawn from diverse scientific domains. In addition to the ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28% vs. 30%) for the equations conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 40 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.
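The character error rate (CER) quoted above is conventionally the Levenshtein edit distance between the predicted and reference strings, divided by the reference length. The sketch below shows that standard definition; any LaTeX normalisation the authors apply before scoring is not reproduced here, and the example strings are made up.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) using a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def cer(hypothesis, reference):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)

reference = r"\frac{a}{b}"
hypothesis = r"\frac{a}{c}"
print(cer(hypothesis, reference))  # one substitution over 11 characters
```

Because CER is normalised by the reference length, it can exceed 1.0 when the hypothesis is much longer than the reference.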

[CV-19] Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection

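Quick read: This paper addresses structural deficiencies in existing anomaly synthesis methods based on diffusion models and coarse inpainting, such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. The key solution is ARAS, a language-conditioned auto-regressive anomaly synthesis method that precisely injects local defects into normal images through token-anchored latent editing. A hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel improve defect realism, preserve fine-grained material textures, and provide continuous semantic control. ARAS is integrated into the Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, which computes image-text similarity scores with a dual-encoder model to dynamically up-weight high-quality synthetic samples and further improve detection.

Link: https://arxiv.org/abs/2508.03539
Authors: Long Qian,Bingke Zhu,Yingying Chen,Ming Tang,Jinqiao Wang
Affiliations: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Objecteye Inc.
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: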

Abstract:Despite substantial progress in anomaly synthesis methods, existing diffusion-based and coarse inpainting pipelines commonly suffer from structural deficiencies such as micro-structural discontinuities, limited semantic controllability, and inefficient generation. To overcome these limitations, we introduce ARAS, a language-conditioned, auto-regressive anomaly synthesis approach that precisely injects local, text-specified defects into normal images via token-anchored latent editing. Leveraging a hard-gated auto-regressive operator and a training-free, context-preserving masked sampling kernel, ARAS significantly enhances defect realism, preserves fine-grained material textures, and provides continuous semantic control over synthesized anomalies. Integrated within our Quality-Aware Re-weighted Anomaly Detection (QARAD) framework, we further propose a dynamic weighting strategy that emphasizes high-quality synthetic samples by computing an image-text similarity score with a dual-encoder model. Extensive experiments across three benchmark datasets (MVTec AD, VisA, and BTAD) demonstrate that our QARAD outperforms SOTA methods in both image- and pixel-level anomaly detection tasks, achieving improved accuracy, robustness, and a 5x synthesis speedup compared to diffusion-based alternatives. Our complete code and synthesized dataset will be publicly available.
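The abstract does not give the weighting formula, so the following is only one plausible way to turn image-text similarity scores into per-sample loss weights: a temperature-scaled softmax rescaled to have mean weight 1. The function names, the temperature, and the scores are all assumptions made for illustration.

```python
import math

def quality_weights(similarities, temperature=0.1):
    """Softmax over similarity scores, rescaled so the weights average to 1."""
    top = max(similarities)
    exps = [math.exp((s - top) / temperature) for s in similarities]  # stable softmax
    total = sum(exps)
    n = len(similarities)
    return [n * e / total for e in exps]

def weighted_mean_loss(losses, weights):
    """Average per-sample loss after applying the quality weights."""
    return sum(l * w for l, w in zip(losses, weights)) / len(losses)

# Hypothetical similarity scores for three synthetic anomaly images.
scores = [0.20, 0.30, 0.40]
weights = quality_weights(scores)
print(weights)
print(weighted_mean_loss([1.0, 1.0, 1.0], weights))
```

Keeping the mean weight at 1 leaves the overall loss scale unchanged, so the re-weighting shifts emphasis between samples without acting like a change of learning rate.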

[CV-20] Retinal Lipidomics Associations as Candidate Biomarkers for Cardiovascular Health

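Quick read: This paper addresses the unclear association between serum lipid subclasses (free fatty acids, diacylglycerols, triacylglycerols, and cholesteryl esters) and retinal microvascular traits, and asks whether retinal microvascular structure can serve as a non-invasive marker of systemic metabolic health. Its key contribution is the first integrated analysis of deep learning (DL)-derived retinal traits and lipidomic subclasses in a healthy cohort. Using Spearman correlation with Benjamini-Hochberg false discovery rate (BH-FDR) correction, it finds significant associations between lipid subclasses and retinal vessel morphology (twistiness, width, and complexity), giving new evidence on how microvascular structure relates to circulating lipid profiles.

Link: https://arxiv.org/abs/2508.03538
Authors: Inamullah,Imran Razzak,Shoaib Jameel
Affiliations: University of Southampton; Mohamed bin Zayed University of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: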

Abstract:Retinal microvascular imaging is increasingly recognised as a non-invasive method for evaluating systemic vascular and metabolic health. However, the association between lipidomics and retinal vasculature remains inadequately characterised. This study investigates the relationships between serum lipid subclasses, namely free fatty acids (FA), diacylglycerols (DAG), triacylglycerols (TAG), and cholesteryl esters (CE), and retinal microvascular characteristics in a large population-based cohort. Using Spearman correlation analysis, we examined the interconnection between lipid subclasses and ten retinal microvascular traits, applying the Benjamini-Hochberg false discovery rate (BH-FDR) to adjust for statistical significance. Results indicated that FA were linked to retinal vessel twistiness, while CE correlated with the average widths of arteries and veins. Conversely, DAG and TAG showed negative correlations with the width and complexity of arterioles and venules. These findings suggest that retinal vascular architecture reflects distinct circulating lipid profiles, supporting its role as a non-invasive marker of systemic metabolic health. This study is the first to integrate deep learning (DL)-derived retinal traits with lipidomic subclasses in a healthy cohort, thereby providing insights into microvascular structural changes independent of disease status or treatment effects.
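The BH-FDR adjustment named in the abstract is the standard Benjamini-Hochberg step-up procedure. The sketch below implements that textbook procedure on made-up p-values; it is not the authors' analysis code, and the significance level of 0.05 is an assumption.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, 1):
        if p_values[i] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, 1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected

# Hypothetical p-values from four lipid/trait correlation tests.
print(benjamini_hochberg([0.6, 0.001, 0.03, 0.04]))  # [False, True, False, False]
```

Note the step-up behaviour: a p-value that misses its own threshold is still rejected if a larger p-value further down the sorted list meets its threshold.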

[CV-21] CoEmoGen: Towards Semantically-Coherent and Scalable Emotional Image Content Generation

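Quick read: This paper addresses semantic incoherence and limited scalability in Emotional Image Content Generation (EICG), which arise because abstract emotions are hard to model. Existing methods mostly rely on word-level attribute labels for guidance, which are semantically ambiguous and hard to scale. The key solution is the CoEmoGen pipeline. First, multimodal large language models (MLLMs) are used to construct high-quality captions focused on emotion-triggering content, providing context-rich semantic guidance. Second, a psychologically inspired Hierarchical Low-Rank Adaptation (HiLoRA) module jointly models polarity-shared low-level features and emotion-specific high-level semantics, improving the emotional faithfulness and semantic coherence of generated images.

Link: https://arxiv.org/abs/2508.03535
Authors: Kaishen Yuan,Yuting Zhang,Shang Gao,Yijie Zhu,Wenshuo Chen,Yutao Yue
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 9 figures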

Abstract:Emotional Image Content Generation (EICG) aims to generate semantically clear and emotionally faithful images based on given emotion categories, with broad application prospects. While recent text-to-image diffusion models excel at generating concrete concepts, they struggle with the complexity of abstract emotions. There have also emerged methods specifically designed for EICG, but they excessively rely on word-level attribute labels for guidance, which suffer from semantic incoherence, ambiguity, and limited scalability. To address these challenges, we propose CoEmoGen, a novel pipeline notable for its semantic coherence and high scalability. Specifically, leveraging multimodal large language models (MLLMs), we construct high-quality captions focused on emotion-triggering content for context-rich semantic guidance. Furthermore, inspired by psychological insights, we design a Hierarchical Low-Rank Adaptation (HiLoRA) module to cohesively model both polarity-shared low-level features and emotion-specific high-level semantics. Extensive experiments demonstrate CoEmoGen’s superiority in emotional faithfulness and semantic coherence from quantitative, qualitative, and user study perspectives. To intuitively showcase scalability, we curate EmoArt, a large-scale dataset of emotionally evocative artistic images, providing endless inspiration for emotion-driven artistic creation. The dataset and code are available at this https URL.

[CV-22] Semantic Mosaicing of Histo-Pathology Image Fragments using Visual Foundation Models

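Quick read: This paper addresses automated stitching of artificial whole mount slides (WMS) in histopathology, which is made difficult by tissue loss during preparation, inhomogeneous morphological distortion, staining inconsistencies, missing regions due to misalignment on the slide, and frayed tissue edges. Conventional stitching methods based on boundary shape matching struggle to reconstruct a WMS under these conditions. The key solution is SemanticStitcher, which uses latent feature representations from a visual histopathology foundation model to identify semantically neighboring areas in different fragments, then runs robust pose estimation over a large number of semantic matching candidates to assemble the mosaic. Experiments on three histopathology datasets show that it consistently outperforms the state of the art in correct boundary matches.

Link: https://arxiv.org/abs/2508.03524
Authors: Stefan Brandstätter,Maximilian Köller,Philipp Seeböck,Alissa Blessing,Felicitas Oberndorfer,Svitlana Pochepnia,Helmut Prosch,Georg Langs
Affiliations: Medical University Vienna; Computational Imaging Research Lab; Division of General and Pediatric Radiology; Christian Doppler Laboratory for Machine Learning Driven Precision Imaging; Comprehensive Center for Artificial Intelligence in Medicine; Department of Pathology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: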

Abstract:In histopathology, tissue samples are often larger than a standard microscope slide, making stitching of multiple fragments necessary to process entire structures such as tumors. Automated stitching is a prerequisite for scaling analysis, but is challenging due to possible tissue loss during preparation, inhomogeneous morphological distortion, staining inconsistencies, missing regions due to misalignment on the slide, or frayed tissue edges. This limits state-of-the-art stitching methods using boundary shape matching algorithms to reconstruct artificial whole mount slides (WMS). Here, we introduce SemanticStitcher using latent feature representations derived from a visual histopathology foundation model to identify neighboring areas in different fragments. Robust pose estimation based on a large number of semantic matching candidates derives a mosaic of multiple fragments to form the WMS. Experiments on three different histopathology datasets demonstrate that SemanticStitcher yields robust WMS mosaicing and consistently outperforms the state of the art in correct boundary matches.

[CV-23] Distribution-aware Knowledge Unification and Association for Non-exemplar Lifelong Person Re-identification

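Quick read: This paper addresses the core challenge of lifelong person re-identification (LReID): preserving old knowledge while adapting to new data, so as to avoid catastrophic forgetting and improve cross-domain generalization. Existing methods mostly use knowledge distillation for representation alignment but neglect instance-level distribution awareness and cross-domain unified knowledge learning. The key solution is the distribution-aware knowledge unification and association (DKUA) framework. First, a distribution-aware model transfers instance-level representations of the current domain into domain-specific representations with different domain styles, preserving learned knowledge without storing old samples. Second, adaptive knowledge consolidation (AKC) dynamically generates a unified representation that acts as a cross-domain representation center. Third, unified knowledge association (UKA) uses this unified representation as a bridge to explicitly model inter-domain associations and reduce inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) keeps the current domain distribution from deviating from the cross-domain distribution center, improving adaptation. Experiments report average mAP/R@1 improvements of 7.6%/5.3% over existing methods on anti-forgetting and generalization capacity, respectively.

Link: https://arxiv.org/abs/2508.03516
Authors: Shiben Liu,Mingyue Xu,Huijie Fan,Qiang Wang,Yandong Tang,Zhi Han
Affiliations: State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China; University of Chinese Academy of Sciences, Beijing 100049, China; Key Laboratory of Manufacturing Industrial Integrated Automation, Shenyang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 6 figures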

Abstract:Lifelong person re-identification (LReID) encounters a key challenge: balancing the preservation of old knowledge with adaptation to new information. Existing LReID methods typically employ knowledge distillation to enforce representation alignment. However, these approaches ignore two crucial aspects: specific distribution awareness and cross-domain unified knowledge learning, both of which are essential for addressing this challenge. To overcome these limitations, we propose a novel distribution-aware knowledge unification and association (DKUA) framework where domain-style modeling is performed for each instance to propagate domain-specific representations, enhancing anti-forgetting and generalization capacity. Specifically, we design a distribution-aware model to transfer instance-level representations of the current domain into the domain-specific representations with the different domain styles, preserving learned knowledge without storing old samples. Next, we propose adaptive knowledge consolidation (AKC) to dynamically generate the unified representation as a cross-domain representation center. To further mitigate forgetting, we develop a unified knowledge association (UKA) mechanism, which explores the unified representation as a bridge to explicitly model inter-domain associations, reducing inter-domain gaps. Finally, distribution-based knowledge transfer (DKT) is proposed to prevent the current domain distribution from deviating from the cross-domain distribution center, improving adaptation capacity. Experimental results show our DKUA outperforms the existing methods by 7.6%/5.3% average mAP/R@1 improvement on anti-forgetting and generalization capacity, respectively. Our code will be publicly released.

[CV-24] MAUP: Training-free Multi-center Adaptive Uncertainty-aware Prompting for Cross-domain Few-shot Medical Image Segmentation MICCAI2025

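Quick read: This paper addresses the weak generalization and complex deployment of existing Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) models, which depend on heavy training over source medical domains. The core solution is a training-free adaptation strategy, Multi-center Adaptive Uncertainty-aware Prompting (MAUP). It generates multi-center prompts via K-means clustering for comprehensive spatial coverage, selects prompts in an uncertainty-aware way to focus on challenging regions, and adaptively adjusts the prompts to the complexity of the target region. Using the pre-trained Segment Anything Model (SAM) and a DINOv2 feature encoder, it achieves precise segmentation on three medical datasets without any additional training.

Link: https://arxiv.org/abs/2508.03511
Authors: Yazhou Zhu,Haofeng Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by MICCAI 2025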

Abstract:Cross-domain Few-shot Medical Image Segmentation (CD-FSMIS) is a potential solution for segmenting medical images with limited annotation using knowledge from other domains. The significant performance of current CD-FSMIS models relies on a heavy training procedure over other source medical domains, which degrades the universality and ease of model deployment. With the development of large visual models of natural images, we propose a training-free CD-FSMIS model that introduces the Multi-center Adaptive Uncertainty-aware Prompting (MAUP) strategy for adapting the foundation model Segment Anything Model (SAM), which is trained with natural images, to the CD-FSMIS task. To be specific, MAUP consists of three key innovations: (1) K-means clustering based multi-center prompt generation for comprehensive spatial coverage, (2) uncertainty-aware prompt selection that focuses on the challenging regions, and (3) adaptive prompt optimization that can dynamically adjust according to the target region complexity. With the pre-trained DINOv2 feature encoder, MAUP achieves precise segmentation results across three medical datasets without any additional training, compared with several conventional CD-FSMIS models and a training-free FSMIS model. The source code is available at: this https URL.
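To make the multi-center prompt idea concrete, the sketch below clusters foreground pixel coordinates with a small K-means and returns the cluster centers as candidate point prompts. What MAUP actually clusters is not stated in the abstract, so the use of pixel coordinates from a coarse foreground estimate is an assumption, as are the function names and the deterministic farthest-point initialisation.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_centers(points, k, n_iter=20):
    """Plain Lloyd iterations; returns k cluster centers as coordinate tuples."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers

# Hypothetical foreground pixels forming two separate blobs (row, col).
foreground = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
print(kmeans_centers(foreground, k=2))  # one point prompt per blob
```

Several spatially spread centers cover a multi-part or elongated target better than a single centroid, which can fall outside the object altogether.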

[CV-25] EditGarment: An Instruction-Based Garment Editing Dataset Constructed with Automated MLLM Synthesis and Semantic-Aware Evaluation

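Quick read: This paper addresses the scarcity of high-quality instruction-image pairs for instruction-based garment editing, a bottleneck that limits model precision and generalization in fashion design and customization. The task requires understanding garment-specific semantics and attribute dependencies, while manual annotation is costly and hard to scale. Multimodal large language models (MLLMs) could automate data synthesis but are held back by imprecise instruction modeling and a lack of fashion-specific supervisory signals. The key solution is an automated data construction pipeline. It first defines six editing instruction categories aligned with real-world fashion workflows to generate balanced and diverse instruction-image triplets. It then introduces Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and serves as a reliable supervisory signal. The resulting EditGarment dataset contains 20,596 high-quality triplets retained from 52,257 candidates, filling a gap in data for standalone garment editing.

Link: https://arxiv.org/abs/2508.03497
Authors: Deqiang Yin,Junyi Guo,Huanda Lu,Fangyu Wu,Dongming Lu
Affiliations: Xi’an Jiaotong-Liverpool University; NingboTech University; Zhejiang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: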

Abstract:Instruction-based garment editing enables precise image modifications via natural language, with broad applications in fashion design and customization. Unlike general editing tasks, it requires understanding garment-specific semantics and attribute dependencies. However, progress is limited by the scarcity of high-quality instruction-image pairs, as manual annotation is costly and hard to scale. While MLLMs have shown promise in automated data synthesis, their application to garment editing is constrained by imprecise instruction modeling and a lack of fashion-specific supervisory signals. To address these challenges, we present an automated pipeline for constructing a garment editing dataset. We first define six editing instruction categories aligned with real-world fashion workflows to guide the generation of balanced and diverse instruction-image triplets. Second, we introduce Fashion Edit Score, a semantic-aware evaluation metric that captures semantic dependencies between garment attributes and provides reliable supervision during construction. Using this pipeline, we construct a total of 52,257 candidate triplets and retain 20,596 high-quality triplets to build EditGarment, the first instruction-based dataset tailored to standalone garment editing. The project page is this https URL.

[CV-26] Prototype-Enhanced Confidence Modeling for Cross-Modal Medical Image-Report Retrieval

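Quick read: This paper addresses the difficulty of semantically aligning medical images with text reports in cross-modal retrieval (image-to-report and report-to-image). The inherent ambiguity and variability of medical data make it hard for existing models to capture multi-level semantic relationships, which hurts retrieval accuracy. The key solution is the Prototype-Enhanced Confidence Modeling (PECM) framework. It introduces multi-level prototypes for each modality to better represent semantic variability, and uses dual-stream confidence estimation, based on prototype similarity distributions and an adaptive weighting mechanism, to limit the impact of high-uncertainty data on retrieval rankings. Across multiple datasets and tasks, including fully supervised and zero-shot retrieval, it reports performance gains of up to 10.17% and a new state of the art.

Link: https://arxiv.org/abs/2508.03494
Authors: Shreyank N Gowda,Xiaobo Jin,Christian Wagner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: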

Abstract:In cross-modal retrieval tasks, such as image-to-report and report-to-image retrieval, accurately aligning medical images with relevant text reports is essential but challenging due to the inherent ambiguity and variability in medical data. Existing models often struggle to capture the nuanced, multi-level semantic relationships in radiology data, leading to unreliable retrieval results. To address these issues, we propose the Prototype-Enhanced Confidence Modeling (PECM) framework, which introduces multi-level prototypes for each modality to better capture semantic variability and enhance retrieval robustness. PECM employs a dual-stream confidence estimation that leverages prototype similarity distributions and an adaptive weighting mechanism to control the impact of high-uncertainty data on retrieval rankings. Applied to radiology image-report datasets, our method achieves significant improvements in retrieval precision and consistency, effectively handling data ambiguity and advancing reliability in complex clinical scenarios. We report results on multiple datasets and tasks, including fully supervised and zero-shot retrieval, obtaining performance gains of up to 10.17% and establishing a new state of the art.

[CV-27] Quality Versus Sparsity in Image Recovery by Dictionary Learning Using Iterative Shrinkage

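Quick read: This paper examines the trade-off in sparse dictionary learning (SDL) between how strongly sparsity is enforced and the quality of image recovery. Specifically, it asks at what level of sparsity SDL still recovers images well, without sacrificing reconstruction quality to sparsity. The approach is to analyze the sparsity regimes reached by a variety of optimization methods. The finding is that high sparsity generally does not compromise recovery quality, even when the recovered image differs considerably from the learning database, so moderate or even high sparsity constraints are viable in SDL.

Link: https://arxiv.org/abs/2508.03492
Authors: Mohammadsadegh Khoshghiaferezaee,Moritz Krauth,Shima Shabani,Michael Breuß
Affiliations: Brandenburg University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 6 pages, 4 figures, 3 tables, IEEE-IPTA, 2025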

Abstract:Sparse dictionary learning (SDL) is a fundamental technique that is useful for many image processing tasks. As an example we consider here image recovery, where SDL can be cast as a nonsmooth optimization problem. For this kind of problems, iterative shrinkage methods represent a powerful class of algorithms that are subject of ongoing research. Sparsity is an important property of the learned solutions, as exactly the sparsity enables efficient further processing or storage. The sparsity implies that a recovered image is determined as a combination of a number of dictionary elements that is as low as possible. Therefore, the question arises, to which degree sparsity should be enforced in SDL in order to not compromise recovery quality. In this paper we focus on the sparsity of solutions that can be obtained using a variety of optimization methods. It turns out that there are different sparsity regimes depending on the method in use. Furthermore, we illustrate that high sparsity does in general not compromise recovery quality, even if the recovered image is quite different from the learning database.
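Iterative shrinkage can be illustrated with ISTA on the sparse coding subproblem, minimising 0.5 * ||D x - y||^2 + lam * ||x||_1 for a fixed dictionary D. This is a textbook sketch, not one of the specific methods compared in the paper; the dictionary is held fixed here (in SDL it is learned as well), and the toy values are made up.

```python
import math

def soft_threshold(values, threshold):
    """Shrink each entry toward zero by `threshold`; small entries become zero."""
    return [math.copysign(max(abs(v) - threshold, 0.0), v) for v in values]

def ista(D, y, lam, step, n_iter=200):
    """ISTA: a gradient step on the quadratic term, then soft-thresholding.

    `step` should not exceed 1 / L, where L is the largest eigenvalue of D^T D.
    """
    n_rows, n_atoms = len(D), len(D[0])
    x = [0.0] * n_atoms
    for _ in range(n_iter):
        residual = [sum(D[i][j] * x[j] for j in range(n_atoms)) - y[i] for i in range(n_rows)]
        grad = [sum(D[i][j] * residual[i] for i in range(n_rows)) for j in range(n_atoms)]
        x = soft_threshold([x[j] - step * grad[j] for j in range(n_atoms)], step * lam)
    return x

# Toy problem with an identity dictionary, where the solution is soft_threshold(y, lam).
D = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
print(ista(D, y, lam=1.0, step=1.0))  # first entry shrinks to 2.0, second becomes zero
```

Raising `lam` zeroes out more coefficients, which is the sparsity knob whose effect on recovery quality the paper studies.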

[CV-28] ParticleSAM: Small Particle Segmentation for Material Quality Monitoring in Recycling Processes

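Quick read: This paper addresses the inefficiency of manual quality monitoring of recycled aggregates in the construction industry, and in particular the difficulty of accurately segmenting the many small, densely packed particles in an image. The key solution is ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as construction material particles. The authors also build a new dense multi-particle dataset with an automated data generation and labeling pipeline, which serves as a benchmark for automating visual material quality control. The method is validated against the original SAM and may be useful in other areas that need small-particle segmentation.

Link: https://arxiv.org/abs/2508.03490
Authors: Yu Zhou,Pelle Thielmann,Ayush Chamoli,Bruno Mirbach,Didier Stricker,Jason Rambach
Affiliations: German Research Center for Artificial Intelligence (DFKI); RPTU Kaiserslautern
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures. Accepted for presentation at EUSIPCO 2025, September 8-12, 2025. List of accepted papers available at this http URL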

Abstract:The construction industry represents a major sector in terms of resource consumption. Recycled construction material has high reuse potential, but quality monitoring of the aggregates is typically still performed with manual methods. Vision-based machine learning methods could offer a faster and more efficient solution to this problem, but existing segmentation methods are by design not directly applicable to images with hundreds of small particles. In this paper, we propose ParticleSAM, an adaptation of the segmentation foundation model to images with small and dense objects such as the ones often encountered in construction material particles. Moreover, we create a new dense multi-particle dataset simulated from isolated particle images with the assistance of an automated data generation and labeling pipeline. This dataset serves as a benchmark for visual material quality control automation while our segmentation approach has the potential to be valuable in application areas beyond construction where small-particle segmentation is needed. Our experimental results validate the advantages of our method by comparing to the original SAM method both in quantitative and qualitative experiments.

[CV-29] LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Text-to-Image Generation

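Quick read: This paper addresses the severe performance degradation of Diffusion Transformers (DiTs) under post-training quantization (PTQ), especially in extreme low-bit settings. It identifies two obstacles: (1) model weights follow a Gaussian-like distribution with long tails, so uniform quantization allocates its intervals poorly and incurs significant error; (2) activations contain two kinds of outliers, Mild Outliers with slightly elevated values and Salient Outliers with large magnitudes concentrated in specific channels, both of which disrupt activation quantization. The proposed LRQ-DiT framework introduces Twin-Log Quantization (TLQ), a log-based mapping that fits the weight distribution better and reduces quantization error, and an Adaptive Rotation Scheme (ARS) that dynamically applies Hadamard or outlier-aware rotations depending on activation fluctuation to mitigate both outlier types. Experiments on PixArt and FLUX show low-bit quantization that preserves image quality and outperforms existing PTQ baselines.

Link: https://arxiv.org/abs/2508.03485
Authors: Lianwei Yang,Haokun Lin,Tianchen Zhao,Yichen Wu,Hongyu Zhu,Ruiqi Xie,Zhenan Sun,Yu Wang,Qingyi Gu
Affiliations: Tsinghua University; Institute for AI Industry Research; Alibaba Group; Zhejiang University; Tongyi Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: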

Abstract:Diffusion Transformers (DiTs) have achieved impressive performance in text-to-image generation. However, their high computational cost and large parameter sizes pose significant challenges for usage in resource-constrained scenarios. Post-training quantization (PTQ) is a promising solution to reduce memory usage and accelerate inference, but existing PTQ methods suffer from severe performance degradation under extreme low-bit settings. We identify two key obstacles to low-bit post-training quantization for DiT models: (1) model weights follow a Gaussian-like distribution with long tails, causing uniform quantization to poorly allocate intervals and leading to significant errors; (2) two types of activation outliers: (i) Mild Outliers with slightly elevated values, and (ii) Salient Outliers with large magnitudes concentrated in specific channels, which disrupt activation quantization. To address these issues, we propose LRQ-DiT, an efficient and accurate PTQ framework. We introduce Twin-Log Quantization (TLQ), a log-based method that aligns well with the weight distribution and reduces quantization errors. We also propose an Adaptive Rotation Scheme (ARS) that dynamically applies Hadamard or outlier-aware rotations based on activation fluctuation, effectively mitigating the impact of both types of outliers. We evaluate LRQ-DiT on PixArt and FLUX under various bit-width settings, and validate the performance on COCO, MJHQ, and sDCI datasets. LRQ-DiT achieves low-bit quantization of DiT models while preserving image quality, outperforming existing PTQ baselines.
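The first obstacle, uniform levels wasting resolution on a weight distribution concentrated near zero, can be seen with a generic power-of-two log quantizer. The sketch below is not Twin-Log Quantization, whose exact form is not given in the abstract; it only contrasts a plain uniform quantizer with a plain log2 quantizer, and the bit width and weight values are arbitrary.

```python
import math

def uniform_quantize(w, bits, w_max):
    """Symmetric uniform quantizer: equally spaced levels in [-w_max, w_max]."""
    levels = 2 ** (bits - 1) - 1
    step = w_max / levels
    q = max(-levels, min(levels, round(w / step)))
    return q * step

def log2_quantize(w, bits, w_max):
    """Power-of-two quantizer: levels w_max * 2**-e, finest resolution near zero."""
    if w == 0:
        return 0.0
    n_exponents = 2 ** (bits - 1) - 1
    exponent = round(-math.log2(abs(w) / w_max))
    exponent = max(0, min(n_exponents - 1, exponent))  # clamp to representable range
    return math.copysign(w_max * 2.0 ** (-exponent), w)

# Compare both quantizers on a small, a medium, and a large weight at 4 bits.
for w in (0.03, 0.25, 0.9):
    print(w, uniform_quantize(w, 4, 1.0), log2_quantize(w, 4, 1.0))
```

At 4 bits the uniform quantizer rounds 0.03 to zero, while the log quantizer maps it to 2**-5; the price is coarser resolution for large weights such as 0.9.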

[CV-30] When Cars Have Stereotypes: Auditing Demographic Bias in Objects from Text-to-Image Models

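[Quick Read]: This paper addresses a subtle but pervasive demographic bias in the objects produced by generative AI models: in text-to-image generation, demographic cues (such as gender or ethnicity) lead to systematic differences in the visual attributes of generated objects (such as cars). The key to the solution is SODA (Stereotyped Object Diagnostic Audit), a new auditing framework that compares images generated from prompts with demographic cues against images from neutral prompts. It systematically measures the strength of association between demographic groups and visual attributes, and thereby exposes and quantifies the stereotypes embedded in these models and how they are amplified.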

Link: https://arxiv.org/abs/2508.03483
Authors: Dasol Choi, Jihwan Lee, Minjae Lee, Minsuk Kahng
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:While prior research on text-to-image generation has predominantly focused on biases in human depictions, we investigate a more subtle yet pervasive phenomenon: demographic bias in generated objects (e.g., cars). We introduce SODA (Stereotyped Object Diagnostic Audit), a novel framework for systematically measuring such biases. Our approach compares visual attributes of objects generated with demographic cues (e.g., "for young people") to those from neutral prompts, across 2,700 images produced by three state-of-the-art models (GPT Image-1, Imagen 4, and Stable Diffusion) in five object categories. Through a comprehensive analysis, we uncover strong associations between specific demographic groups and visual attributes, such as recurring color patterns prompted by gender or ethnicity cues. These patterns reflect and reinforce not only well-known stereotypes but also more subtle and unintuitive biases. We also observe that some models generate less diverse outputs, which in turn amplifies the visual disparities compared to neutral prompts. Our proposed auditing framework offers a practical approach for testing, revealing how stereotypes still remain embedded in today’s generative models. We see this as an essential step toward more systematic and responsible AI development.

[CV-31] VideoGuard: Protecting Video Content from Unauthorized Editing

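[Quick Read]: This paper addresses the vulnerability of video content to malicious editing by generative models. Existing image protection methods are of limited use for video: because of the redundancy between frames and the inter-frame attention mechanism in video diffusion models, applying an image-level protection separately to each frame does not prevent unauthorized editing. The key to the solution, VideoGuard, is to optimize all video frames jointly as a single entity and to fuse video motion information into the optimization objective. The resulting nearly imperceptible perturbations interfere with the generative diffusion model so that its outputs become inconsistent and implausible, which improves the tamper resistance of the video.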

Link: https://arxiv.org/abs/2508.03480
Authors: Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, Xiaoping Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: AI security, 10 pages, 5 figures

Abstract:With the rapid development of generative technology, current generative models can generate high-fidelity digital content and edit it in a controlled manner. However, there is a risk that malicious individuals might misuse these capabilities for misleading activities. Although existing research has attempted to shield photographic images from being manipulated by generative models, there remains a significant disparity in the protection offered to video content editing. To bridge the gap, we propose a protection method named VideoGuard, which can effectively protect videos from unauthorized malicious editing. This protection is achieved through the subtle introduction of nearly unnoticeable perturbations that interfere with the functioning of the intended generative diffusion models. Due to the redundancy between video frames and the inter-frame attention mechanism in video diffusion models, simply applying image-based protection methods separately to every video frame cannot shield videos from unauthorized editing. To tackle the above challenge, we adopt joint frame optimization, treating all video frames as an optimization entity. Furthermore, we extract video motion information and fuse it into optimization objectives. Thus, these alterations can effectively force the models to produce outputs that are implausible and inconsistent. We provide a pipeline to optimize this perturbation. Finally, we use both objective metrics and subjective metrics to demonstrate the efficacy of our method, and the results show that the protection performance of VideoGuard is superior to all the baseline methods.

[CV-32] IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models

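[Quick Read]: This paper addresses hallucination in Large Vision-Language Models (LVLMs), that is, generated content that is inconsistent with the input image, which arises during cross-modal collaborative inference. Prior work shows that hallucinations increase with the length of the generated sequence, but the underlying cause has been unclear. By analyzing attention, the authors find a long-term bias in current LVLMs: as the generated text grows, the model's attention to visual input steadily declines, which they hypothesize is a key factor behind the increase in hallucinations. They propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), whose core idea is to obtain logits from shorter sequences with higher image attention through key-value merging and combine them with the logits from the original decoding, mitigating attention degradation and suppressing hallucinations. IKOD needs no additional training or external tools, making it lightweight and efficient.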

Link: https://arxiv.org/abs/2508.03469
Authors: Jiabing Yang, Chenhang Cui, Yiyang Zhou, Yixiang Chen, Peng Xia, Ying Wei, Tao Yu, Yan Huang, Liang Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated significant progress across multiple domains. However, these models still face the inherent challenge of integrating vision and language for collaborative inference, which often leads to “hallucinations”, outputs that are not grounded in the corresponding images. Many efforts have been made to address these issues, but each comes with its own limitations, such as high computational cost or expensive dataset annotation. Recent research shows that LVLMs exhibit a long-term bias where hallucinations increase as the sequence length grows, yet the underlying cause remains poorly understood. Building on extensive research into attention mechanisms in LVLMs, we analyze the relationship between this long-term bias and visual attention. In our research, we identify a consistent phenomenon in current LVLMs: the model’s attention to visual input diminishes as the generated sequence grows, which we hypothesize to be a key factor contributing to observed increasing hallucinations. Based on these insights, we propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy generating more image-focused sequences. This method derives logits from shorter sequences with higher image attention through key-value merging and combines them with those from the original decoding, effectively mitigating attention degradation and suppressing hallucinations while not incurring too much inference cost. Extensive experiments on both hallucination and comprehensive benchmarks demonstrate IKOD’s superior effectiveness in mitigating hallucinations and improving comprehensive capacities for LVLMs. Importantly, IKOD requires no additional training or external tools, making it a lightweight and efficient framework applicable to various models.
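
The combination step of collaborative decoding can be sketched as a blend of two logit vectors. The abstract does not state how IKOD weights the two sources, so the linear blend and the mixing weight `alpha` below are assumptions, and the logits are made-up numbers rather than model outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def combine_logits(logits_full, logits_short, alpha):
    """Blend full-context logits with logits from a shortened, image-focused context.

    `alpha` is a hypothetical mixing weight; alpha = 0 recovers ordinary decoding.
    """
    return [(1.0 - alpha) * f + alpha * s for f, s in zip(logits_full, logits_short)]

# Toy vocabulary of three tokens; the numbers are made up for illustration.
logits_full = [2.0, 1.0, 0.0]    # long context: visual attention has faded
logits_short = [0.0, 3.0, 0.0]   # short context: more weight on the image

probs = softmax(combine_logits(logits_full, logits_short, alpha=0.5))
next_token = max(range(len(probs)), key=probs.__getitem__)
print(next_token, probs)
```

In this toy case the blended distribution prefers token 1, which the image-focused branch favors, whereas decoding from `logits_full` alone would pick token 0.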

[CV-33] AVPDN: Learning Motion-Robust and Scale-Adaptive Representations for Video-Based Polyp Detection

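[Quick Read]: This paper addresses the accuracy of polyp detection in dynamic colonoscopy videos, in particular the background noise and loss of structural information caused by rapid camera movement, which raise the false-positive rate. The key to the solution is the Adaptive Video Polyp Detection Network (AVPDN), built on two modules. The Adaptive Feature Interaction and Augmentation (AFIA) module uses a triple-branch architecture: dense self-attention models global context, sparse self-attention reduces the influence of low-similarity query-key pairs during feature aggregation, and channel shuffle promotes information exchange across branches. The Scale-Aware Context Integration (SACI) module uses dilated convolutions with different receptive fields to capture spatial context at multiple scales, improving denoising and multi-scale feature integration. The method is validated on several public benchmarks, where it shows competitive performance and good generalization.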

Link: https://arxiv.org/abs/2508.03458
Authors: Zilin Chen, Shengnan Lu
Affiliations: Xi’an Shiyou University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Accurate detection of polyps is of critical importance for the early and intermediate stages of colorectal cancer diagnosis. Compared to static images, dynamic colonoscopy videos provide more comprehensive visual information, which can facilitate the development of effective treatment plans. However, unlike fixed-camera recordings, colonoscopy videos often exhibit rapid camera movement, introducing substantial background noise that disrupts the structural integrity of the scene and increases the risk of false positives. To address these challenges, we propose the Adaptive Video Polyp Detection Network (AVPDN), a robust framework for multi-scale polyp detection in colonoscopy videos. AVPDN incorporates two key components: the Adaptive Feature Interaction and Augmentation (AFIA) module and the Scale-Aware Context Integration (SACI) module. The AFIA module adopts a triple-branch architecture to enhance feature representation. It employs dense self-attention for global context modeling, sparse self-attention to mitigate the influence of low query-key similarity in feature aggregation, and channel shuffle operations to facilitate inter-branch information exchange. In parallel, the SACI module is designed to strengthen multi-scale feature integration. It utilizes dilated convolutions with varying receptive fields to capture contextual information at multiple spatial scales, thereby improving the model’s denoising capability. Experiments conducted on several challenging public benchmarks demonstrate the effectiveness and generalization ability of the proposed method, achieving competitive performance in video-based polyp detection tasks.
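
Two of the building blocks named above, channel shuffle and dilated convolution, are standard operations that can be shown in a few lines. The sketch below uses plain Python lists and a 1D signal for readability; it illustrates the generic operations only and is not the AFIA or SACI module, whose wiring the abstract does not specify.

```python
def channel_shuffle(channels, groups):
    """Interleave channels from `groups` equal-sized groups (reshape, transpose, flatten)."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[(i % groups) * per_group + i // groups]
            for i in range(len(channels))]

def dilated_conv1d(signal, kernel, dilation):
    """Valid 1D cross-correlation whose taps are spaced `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation       # receptive field minus one
    return [sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(signal) - span)]

# Three branches with two channels each, labelled by branch.
print(channel_shuffle(["a0", "a1", "b0", "b1", "c0", "c1"], groups=3))

# The same 3-tap kernel covers 3 samples at dilation 1 and 5 samples at dilation 2.
signal = list(range(10))
print(dilated_conv1d(signal, [1, 1, 1], dilation=1))
print(dilated_conv1d(signal, [1, 1, 1], dilation=2))
```

After the shuffle, every consecutive triple of channels contains one channel from each branch, and raising the dilation widens the receptive field without adding kernel weights.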

[CV-34] READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation

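[Quick Read]: This paper addresses the extremely slow inference of diffusion-based audio-driven talking head generation, which limits its practical use. The key to the solution is READ, a real-time diffusion-transformer framework. First, a temporal VAE learns a highly compressed spatiotemporal video latent space, sharply reducing the token count to speed up generation. Second, a pre-trained Speech Autoencoder (SpeechAE) produces temporally compressed speech latent codes aligned with the video latent space, improving audio-visual alignment. These latents are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) for efficient talking head synthesis. Finally, a novel asynchronous noise scheduler (ANS), used in both training and inference, applies asynchronous add-noise and asynchronous motion-guided generation to maintain temporal consistency and accelerate inference in long generation.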

Link: https://arxiv.org/abs/2508.03457
Authors: Haotian Wang, Yuzhe Weng, Jun Du, Haoran Xu, Xiaoyan Wu, Shan He, Bing Yin, Cong Liu, Jianqing Gao, Qingfeng Liu
Affiliations: 1. Tsinghua University; 2. Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD)
Comments: 9 pages

Abstract:The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, the extremely slow inference speed severely limits the practical implementation of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed spatiotemporal video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference process of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency in generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-duration generation.

[CV-35] Video Demoireing using Focused-Defocused Dual-Camera System

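[Quick Read]: This paper addresses two key problems of single-camera image and video demoireing: distinguishing moire patterns from visually similar real textures, and preserving tonal consistency and temporal coherence while removing moire. The key to the solution is a dual-camera framework that captures two synchronized videos of the same scene: a focused video (high-quality textures, but possibly with moire) and a defocused video (much weaker moire, but blurred textures). The defocused video serves as a reference to separate moire from real texture and to guide the demoireing of the focused video. Concretely, an optical-flow-based alignment step corrects displacement and occlusion differences between the two frames, a multi-scale convolutional neural network (CNN) trained with a multi-dimensional loss performs the demoireing, and a joint bilateral filter, guided by the CNN result, filters the input focused frame to produce the final output while maintaining tonal and temporal consistency.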

Link: https://arxiv.org/abs/2508.03449
Authors: Xuan Dong, Xiangyuan Sun, Xia Wang, Jian Song, Ya Li, Weixin Li
Affiliations: Beijing University of Posts and Telecommunications; DeepScience Co., Ltd.; Beihang University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Moire patterns, unwanted color artifacts in images and videos, arise from the interference between spatially high-frequency scene contents and the spatial discrete sampling of digital cameras. Existing demoireing methods primarily rely on single-camera image/video processing, which faces two critical challenges: 1) distinguishing moire patterns from visually similar real textures, and 2) preserving tonal consistency and temporal coherence while removing moire artifacts. To address these issues, we propose a dual-camera framework that captures synchronized videos of the same scene: one in focus (retaining high-quality textures but may exhibit moire patterns) and one defocused (with significantly reduced moire patterns but blurred textures). We use the defocused video to help distinguish moire patterns from real texture, so as to guide the demoireing of the focused video. We propose a frame-wise demoireing pipeline, which begins with an optical flow based alignment step to address any discrepancies in displacement and occlusion between the focused and defocused frames. Then, we leverage the aligned defocused frame to guide the demoireing of the focused frame using a multi-scale CNN and a multi-dimensional training loss. To maintain tonal and temporal consistency, our final step involves a joint bilateral filter to leverage the demoireing result from the CNN as the guide to filter the input focused frame to obtain the final output. Experimental results demonstrate that our proposed framework largely outperforms state-of-the-art image and video demoireing methods.

[CV-36] CoPS: Conditional Prompt Synthesis for Zero-Shot Anomaly Detection

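[Quick Read]: This paper addresses two core challenges that pre-trained vision-language models face in Zero-Shot Anomaly Detection (ZSAD): (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. The key to the solution is Conditional Prompt Synthesis (CoPS), which generates dynamic prompts through two mechanisms. First, it extracts normal and anomaly prototypes from fine-grained patch features and explicitly injects them into the prompts for adaptive state modeling. Second, it uses a Variational Autoencoder (VAE) to model semantic image features and implicitly fuses varied class tokens into the prompts, easing label sparsity. Combined with a spatially-aware alignment mechanism, CoPS surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets.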

Link: https://arxiv.org/abs/2508.03447
Authors: Qiyu Chen, Zhen Qu, Wei Luo, Haiming Yao, Yunkang Cao, Yuxin Jiang, Yinan Duan, Huiyuan Luo, Chengkan Lv, Zhengtao Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 19 pages, 33 figures, 14 tables

Abstract:Recently, large pre-trained vision-language models have shown remarkable performance in zero-shot anomaly detection (ZSAD). With fine-tuning on a single auxiliary dataset, the model enables cross-category anomaly detection on diverse datasets covering industrial defects and medical lesions. Compared to manually designed prompts, prompt learning eliminates the need for expert knowledge and trial-and-error. However, it still faces the following challenges: (i) static learnable tokens struggle to capture the continuous and diverse patterns of normal and anomalous states, limiting generalization to unseen categories; (ii) fixed textual labels provide overly sparse category information, making the model prone to overfitting to a specific semantic subspace. To address these issues, we propose Conditional Prompt Synthesis (CoPS), a novel framework that synthesizes dynamic prompts conditioned on visual features to enhance ZSAD performance. Specifically, we extract representative normal and anomaly prototypes from fine-grained patch features and explicitly inject them into prompts, enabling adaptive state modeling. Given the sparsity of class labels, we leverage a variational autoencoder to model semantic image features and implicitly fuse varied class tokens into prompts. Additionally, CoPS is integrated with our spatially-aware alignment mechanism, and extensive experiments demonstrate that it surpasses state-of-the-art methods by 2.5% AUROC in both classification and segmentation across 13 industrial and medical datasets. Code will be available at this https URL.

[CV-37] RAAG: Ratio Aware Adaptive Guidance

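[Quick Read]: This paper addresses the instability of classifier-free guidance (CFG) in flow-based generative models under fast, low-step sampling, where the earliest reverse steps are highly sensitive to the guidance scale. The authors show that the instability stems from a pronounced early spike in the relative strength (RATIO) of conditional to unconditional predictions, that this spike is intrinsic to the data distribution and independent of the model architecture, and that it causes exponential error amplification under strong guidance. The key to the solution is a RATIO-aware adaptive guidance schedule that uses a closed-form exponential decay to automatically lower the guidance scale at early steps, suppressing the error amplification. The method is lightweight, adds no inference overhead, and is compatible with standard flow frameworks. On image (SD3.5, Lumina) and video (WAN2.1) generation it enables up to 3x faster sampling while maintaining or improving generation quality, robustness, and semantic alignment.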

Link: https://arxiv.org/abs/2508.03442
Authors: Shangwen Zhu, Qianyu Peng, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Ruili Feng, Fan Cheng
Affiliations: Shanghai Jiao Tong University; University of Waterloo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Flow-based generative models have recently achieved remarkable progress in image and video synthesis, with classifier-free guidance (CFG) becoming the standard tool for high-fidelity, controllable generation. However, despite their practical success, little is known about how guidance interacts with different stages of the sampling process, especially in the fast, low-step regimes typical of modern flow-based pipelines. In this work, we uncover and analyze a fundamental instability: the earliest reverse steps are acutely sensitive to the guidance scale, owing to a pronounced spike in the relative strength (RATIO) of conditional to unconditional predictions. Through rigorous theoretical analysis and empirical validation, we show that this RATIO spike is intrinsic to the data distribution, independent of the model architecture, and causes exponential error amplification when paired with strong guidance. To address this, we propose a simple, theoretically grounded, RATIO-aware adaptive guidance schedule that automatically dampens the guidance scale at early steps based on the evolving RATIO, using a closed-form exponential decay. Our method is lightweight, requires no additional inference overhead, and is compatible with standard flow frameworks. Experiments across state-of-the-art image (SD3.5, Lumina) and video (WAN2.1) models demonstrate that our approach enables up to 3x faster sampling while maintaining or improving generation quality, robustness, and semantic alignment. Extensive ablation studies further confirm the generality and stability of our schedule across models, datasets, and hyperparameters. Our findings highlight the critical role of stepwise guidance adaptation in unlocking the full potential of fast flow-based generative models.
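
A RATIO-aware guidance schedule of this kind can be sketched as follows. The abstract only says the schedule is a closed-form exponential decay driven by the RATIO, so the specific RATIO definition (norm of the conditional-minus-unconditional gap over the norm of the unconditional prediction), the decay form, and the parameters `w_max` and `alpha` are assumptions for illustration. The guidance combination itself is the standard CFG formula.

```python
import math

def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def ratio(v_cond, v_uncond, eps=1e-12):
    """Relative strength of the conditional-minus-unconditional gap (assumed definition)."""
    diff = [c - u for c, u in zip(v_cond, v_uncond)]
    return l2_norm(diff) / (l2_norm(v_uncond) + eps)

def adaptive_scale(rho, w_max, alpha):
    """Exponentially decaying guidance scale: w_max at rho = 0, tending to 1 as rho grows."""
    return 1.0 + (w_max - 1.0) * math.exp(-alpha * rho)

def guided_prediction(v_cond, v_uncond, w):
    """Standard classifier-free guidance combination."""
    return [u + w * (c - u) for c, u in zip(v_cond, v_uncond)]

# Toy predictions at an early step (large gap) and a late step (small gap).
early = ([3.0, 4.0], [0.3, 0.4])
late = ([1.0, 1.1], [1.0, 1.0])
for name, (v_c, v_u) in (("early", early), ("late", late)):
    rho = ratio(v_c, v_u)
    w = adaptive_scale(rho, w_max=7.0, alpha=1.0)
    print(name, round(rho, 3), round(w, 3), guided_prediction(v_c, v_u, w))
```

With these toy numbers the early step has a large RATIO, so its scale falls close to 1 (almost no extra guidance), while the late step keeps a scale near `w_max`.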

[CV-38] MedCAL-Bench: A Comprehensive Benchmark on Cold-Start Active Learning with Foundation Models for Medical Image Analysis

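[Quick Read]: This paper addresses Cold-Start Active Learning (CSAL) for medical image analysis, where the lack of prior knowledge makes sample selection for annotation inefficient and limits model performance. Most existing methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and yields insufficient feature representations. The key to the solution is to use pre-trained Foundation Models (FMs) as feature extractors for CSAL and to build MedCAL-Bench, the first systematic FM-based CSAL benchmark, which is also the first to evaluate both the feature extraction and the sample selection stages. Experiments show that most FMs are effective feature extractors for CSAL, with the DINO family performing best in segmentation, and that the sample selection strategy should be chosen per dataset, with ALPS performing best in segmentation and RepDiv leading in classification.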

Link: https://arxiv.org/abs/2508.03441
Authors: Ning Zhu, Xiaochuan Ma, Shaoting Zhang, Guotai Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 23 pages, 6 figures, 10 tables

Abstract:Cold-Start Active Learning (CSAL) aims to select informative samples for annotation without prior knowledge, which is important for improving annotation efficiency and model performance under a limited annotation budget in medical image analysis. Most existing CSAL methods rely on Self-Supervised Learning (SSL) on the target dataset for feature extraction, which is inefficient and limited by insufficient feature representation. Recently, pre-trained Foundation Models (FMs) have shown powerful feature extraction ability with a potential for better CSAL. However, this paradigm has been rarely investigated, with a lack of benchmarks for comparison of FMs in CSAL tasks. To this end, we propose MedCAL-Bench, the first systematic FM-based CSAL benchmark for medical image analysis. We evaluate 14 FMs and 7 CSAL strategies across 7 datasets under different annotation budgets, covering classification and segmentation tasks from diverse medical modalities. It is also the first CSAL benchmark that evaluates both the feature extraction and sample selection stages. Our experimental results reveal that: 1) Most FMs are effective feature extractors for CSAL, with DINO family performing the best in segmentation; 2) The performance differences of these FMs are large in segmentation tasks, while small for classification; 3) Different sample selection strategies should be considered in CSAL on different datasets, with Active Learning by Processing Surprisal (ALPS) performing the best in segmentation while RepDiv leading for classification. The code is available at this https URL.

[CV-39] Spatial Imputation Drives Cross-Domain Alignment for EEG Classification

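[Quick Read]: This paper addresses the data distribution shifts in cross-domain Electroencephalogram (EEG) classification caused by heterogeneous electrode configurations, differing acquisition protocols, and hardware discrepancies. The key to the solution is IMAC, a channel-dependent mask and imputation self-supervised framework that formulates cross-domain EEG alignment as a spatial time series imputation task. IMAC first standardizes different electrode layouts with a 3D-to-2D positional unification mapping, giving a unified spatial representation. It then introduces a channel-dependent mask and reconstruction task, framed as low-to-high resolution spatial imputation, which simulates cross-domain variations such as channel omissions and temporal instabilities. IMAC also uses a disentangled structure that models the temporal and spatial information of EEG signals separately, reducing computational complexity while improving flexibility and adaptability. Across 10 public EEG datasets it achieves state-of-the-art classification accuracy in cross-subject and cross-center validation and shows strong robustness under both simulated and real-world distribution shifts.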

Link: https://arxiv.org/abs/2508.03437
Authors: Hongjun Liu, Chao Yao, Yalan Zhang, Xiaokun Wang, Xiaojuan Ban
Affiliations: University of Science and Technology Beijing
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ACMMM 2025 poster

Abstract:Electroencephalogram (EEG) signal classification faces significant challenges due to data distribution shifts caused by heterogeneous electrode configurations, acquisition protocols, and hardware discrepancies across domains. This paper introduces IMAC, a novel channel-dependent mask and imputation self-supervised framework that formulates the alignment of cross-domain EEG data shifts as a spatial time series imputation task. To address heterogeneous electrode configurations in cross-domain scenarios, IMAC first standardizes different electrode layouts using a 3D-to-2D positional unification mapping strategy, establishing unified spatial representations. Unlike previous mask-based self-supervised representation learning methods, IMAC introduces spatio-temporal signal alignment. This involves constructing a channel-dependent mask and reconstruction task framed as a low-to-high resolution EEG spatial imputation problem. Consequently, this approach simulates cross-domain variations such as channel omissions and temporal instabilities, thus enabling the model to leverage the proposed imputer for robust signal alignment during inference. Furthermore, IMAC incorporates a disentangled structure that models the temporal and spatial information of the EEG signals separately, reducing computational complexity while enhancing flexibility and adaptability. Comprehensive evaluations across 10 publicly available EEG datasets demonstrate IMAC’s superior performance, achieving state-of-the-art classification accuracy in both cross-subject and cross-center validation scenarios. Notably, IMAC shows strong robustness under both simulated and real-world distribution shifts, surpassing baseline methods by up to 35% in integrity scores while maintaining consistent classification accuracy.
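
The two preprocessing ideas above, mapping 3D electrode positions onto a shared 2D layout and masking channels to create an imputation target, can be sketched in a few lines. The abstract does not say which projection IMAC uses, so the azimuthal equidistant projection below is a stand-in borrowed from common EEG topographic plotting, and the electrode coordinates, function names, and zero-fill masking are hypothetical.

```python
import math
import random

def project_to_2d(x, y, z):
    """Azimuthal equidistant projection of a scalp position (head vertex on +z)."""
    r = math.sqrt(x * x + y * y + z * z)
    polar = math.acos(max(-1.0, min(1.0, z / r)))   # angle away from the vertex
    azimuth = math.atan2(y, x)
    return (polar * math.cos(azimuth), polar * math.sin(azimuth))

def mask_channels(recording, n_masked, rng):
    """Zero out `n_masked` randomly chosen channels; return (masked copy, masked names)."""
    masked_names = set(rng.sample(sorted(recording), n_masked))
    masked = {name: ([0.0] * len(sig) if name in masked_names else list(sig))
              for name, sig in recording.items()}
    return masked, masked_names

# Hypothetical unit-sphere electrode coordinates (not taken from any montage file).
positions = {"Cz": (0.0, 0.0, 1.0), "T8": (1.0, 0.0, 0.0), "Fpz": (0.0, 1.0, 0.0)}
layout = {name: project_to_2d(*xyz) for name, xyz in positions.items()}
print(layout)

recording = {"Cz": [0.1, 0.2], "T8": [0.3, 0.4], "Fpz": [0.5, 0.6]}
masked, hidden = mask_channels(recording, n_masked=1, rng=random.Random(0))
print(hidden, masked)
```

A reconstruction model would then be trained to recover the hidden channels from the remaining ones, which is the low-to-high resolution imputation framing described in the abstract.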

[CV-40] R2GenKG: Hierarchical Multi-modal Knowledge Graph for LLM-based Radiology Report Generation

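[Quick Read]: This paper addresses the hallucination and weak disease diagnostic capability that persist in X-ray medical report generation. The core of the solution is a large-scale multi-modal medical knowledge graph (M3KG), built with GPT-4o from ground-truth medical reports, containing 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens. The graph is sampled into multi-granularity semantic graphs, which are encoded with an R-GCN encoder. A Swin-Transformer extracts image features that interact with the knowledge graph through cross-attention, and a Q-former retrieves the disease-aware vision tokens. Finally, a large language model maps the semantic knowledge graph, the input X-ray image, and the disease-aware vision tokens into language descriptions, improving the quality and accuracy of the generated reports.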

Link: https://arxiv.org/abs/2508.03426
Authors: Futian Wang, Yuhan Qiao, Xiao Wang, Fuling Wang, Yuxiang Zhang, Dengdi Sun
Affiliations: Anhui University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:X-ray medical report generation is one of the important applications of artificial intelligence in healthcare. With the support of large foundation models, the quality of medical report generation has significantly improved. However, challenges such as hallucination and weak disease diagnostic capability still persist. In this paper, we first construct a large-scale multi-modal medical knowledge graph (termed M3KG) based on the ground truth medical reports using GPT-4o. It contains 2477 entities, 3 kinds of relations, 37424 triples, and 6943 disease-aware vision tokens for the CheXpert Plus dataset. Then, we sample it to obtain multi-granularity semantic graphs and use an R-GCN encoder for feature extraction. For the input X-ray image, we adopt the Swin-Transformer to extract the vision features, which interact with the knowledge using cross-attention. The vision tokens are fed into a Q-former and used to retrieve the disease-aware vision tokens via another cross-attention. Finally, we adopt the large language model to map the semantic knowledge graph, input X-ray image, and disease-aware vision tokens into language descriptions. Extensive experiments on multiple datasets fully validated the effectiveness of our proposed knowledge graph and X-ray report generation framework. The source code of this paper will be released on this https URL.

[CV-41] Learning Latent Representations for Image Translation using Frequency Distributed CycleGAN

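[Quick Read]: This paper addresses three problems in image-to-image (I2I) translation: generated images that align poorly with the real data distribution, missing local semantic detail, and poor mode diversity in low-data regimes. The key to the solution is the Fd-CycleGAN framework, which adds Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. It also adopts distribution-based losses (such as KL/JS divergence and log-based similarity measures) that explicitly quantify the alignment between real and generated image distributions in both the spatial and frequency domains, yielding translations that are more visually coherent and semantically consistent.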

Link: https://arxiv.org/abs/2508.03415
Authors: Shivangi Nigam, Adarsh Prasad Behera, Shekhar Verma, P. Nagabhushan
Affiliations: Indian Institute of Information Technology, Allahabad; KTH Royal Institute of Technology; Vignan University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Graphics (cs.GR)
Comments: This paper is currently under review for publication in an IEEE Transactions journal. If accepted, the copyright will be transferred to IEEE

Abstract:This paper presents Fd-CycleGAN, an image-to-image (I2I) translation framework that enhances latent representation learning to approximate real data distributions. Building upon the foundation of CycleGAN, our approach integrates Local Neighborhood Encoding (LNE) and frequency-aware supervision to capture fine-grained local pixel semantics while preserving structural coherence from the source domain. We employ distribution-based loss metrics, including KL/JS divergence and log-based similarity measures, to explicitly quantify the alignment between real and generated image distributions in both spatial and frequency domains. To validate the efficacy of Fd-CycleGAN, we conduct experiments on diverse datasets – Horse2Zebra, Monet2Photo, and a synthetically augmented Strike-off dataset. Compared to baseline CycleGAN and other state-of-the-art methods, our approach demonstrates superior perceptual quality, faster convergence, and improved mode diversity, particularly in low-data regimes. By effectively capturing local and global distribution characteristics, Fd-CycleGAN achieves more visually coherent and semantically consistent translations. Our results suggest that frequency-guided latent learning significantly improves generalization in image translation tasks, with promising applications in document restoration, artistic style transfer, and medical image synthesis. We also provide comparative insights with diffusion-based generative models, highlighting the advantages of our lightweight adversarial approach in terms of training efficiency and qualitative output.
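
A distribution-based comparison in the frequency domain, of the kind the abstract describes, can be illustrated with a Jensen-Shannon divergence between normalized magnitude spectra. This is a minimal sketch on 1D pixel rows with a naive DFT; how Fd-CycleGAN actually builds its spectra and weights its loss terms is not given in the abstract, so the helper names and the toy rows are assumptions.

```python
import cmath
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by log(2)."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def spectrum_distribution(signal):
    """Normalised DFT magnitude spectrum of a 1D signal (naive O(n^2) transform)."""
    n = len(signal)
    mags = [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]
    total = sum(mags)
    return [m / total for m in mags]

# Toy pixel rows: a smooth "real" row and a "generated" row with a high-frequency ripple.
real_row = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
fake_row = [1.2, 0.8, 1.2, 0.8, 1.2, 0.8, 1.2, 0.8]

p = spectrum_distribution(real_row)
q = spectrum_distribution(fake_row)
print("frequency-domain JS divergence:", js_divergence(p, q))
```

The ripple in the second row adds energy at the highest frequency bin, so the divergence is positive even though both rows have the same mean intensity.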

[CV-42] SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation

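[Quick Read]: This paper addresses unsupervised video segmentation, which is difficult because of the lack of supervisory signals and the complexity of visual scenes, and for which state-of-the-art slot-attention models usually rely on large, computationally expensive architectures. The key to the solution is SlotMatch, a simple knowledge distillation framework that aligns corresponding teacher and student slots via cosine similarity, with no additional distillation objectives or auxiliary supervision, and thereby transfers object-centric representations to a lightweight student. Experiments show that the student matches and even outperforms its teacher while using 3.6x fewer parameters and running 1.9x faster, and that it surpasses previous unsupervised video segmentation models.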

Link: https://arxiv.org/abs/2508.03411
Authors: Diana-Nicoleta Grigore, Neelu Madan, Andreas Mogelmose, Thomas B. Moeslund, Radu Tudor Ionescu
Affiliations: 1. University Politehnica of Bucharest; 2. Aalborg University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Unsupervised video segmentation is a challenging computer vision task, especially due to the lack of supervisory signals coupled with the complexity of visual scenes. To overcome this challenge, state-of-the-art models based on slot attention often have to rely on large and computationally expensive neural architectures. To this end, we propose a simple knowledge distillation framework that effectively transfers object-centric representations to a lightweight student. The proposed framework, called SlotMatch, aligns corresponding teacher and student slots via cosine similarity, requiring no additional distillation objectives or auxiliary supervision. The simplicity of SlotMatch is confirmed via theoretical and empirical evidence, both indicating that integrating additional losses is redundant. We conduct experiments on two datasets to compare the state-of-the-art teacher model, SlotContrast, with our distilled student. The results show that our student based on SlotMatch matches and even outperforms its teacher, while using 3.6x fewer parameters and running 1.9x faster. Moreover, our student surpasses previous unsupervised video segmentation models.
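
Aligning teacher and student slots via cosine similarity can be written as a very small loss function. The sketch below assumes the slots are already matched by index and averages `1 - cosine similarity` over the pairs; the exact loss form and any slot-matching step used by SlotMatch are not given in the abstract, so treat both as assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def slot_matching_loss(teacher_slots, student_slots):
    """Mean of (1 - cosine similarity) over index-aligned slot pairs."""
    pairs = list(zip(teacher_slots, student_slots))
    return sum(1.0 - cosine_similarity(t, s) for t, s in pairs) / len(pairs)

teacher = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
aligned = [[3.0, 0.0, 0.0], [0.0, 0.5, 0.0]]       # same directions, different scale
misaligned = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]    # orthogonal to the teacher slots

print(slot_matching_loss(teacher, aligned))
print(slot_matching_loss(teacher, misaligned))
```

Because cosine similarity ignores vector magnitude, the student is only pushed to match the direction of each teacher slot, not its scale.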

[CV-43] Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling

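[Quick Read]: This paper addresses three limitations of existing Vision-Language Models (VLMs) on document understanding and Visual Question Answering (VQA): constrained parameter scale, a lack of robust self-correction, and weak performance on long visual contexts and complex reasoning. The core of the solution is MACT (Multi-Agent Collaboration framework with Test-Time scaling), which comprises four small-scale agents with clearly defined roles: planning, execution, judgment, and answer. The key innovation is that the judgment agent only verifies correctness and redirects to prior agents for revision, which outperforms conventional correction strategies. The framework also introduces mixed reward modeling, which balances agent-specific abilities and global collaboration, and agent-wise hybrid test-time scaling, which customizes the scaling strategy for each agent according to its function, improving performance on long visual contexts and complex reasoning.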

Link: https://arxiv.org/abs/2508.03404
Authors: Xinlei Yu, Zhangquan Chen, Yudong Zhang, Shilin Lu, Ruolin Shen, Jiangning Zhang, Xiaobin Hu, Yanwei Fu, Shuicheng Yan
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Existing vision-language models (VLMs), whether generalists or specialists, remain constrained by their parameter scale, lack robust self-correction capabilities, and underperform in tasks involving long visual contexts and complex reasoning, resulting in suboptimal performance on document-based tasks. To address this, we propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling, tailored for visual document understanding and visual question answering (VQA). It comprises four distinct small-scale agents, i.e., planning, execution, judgment, and answer agents, with clearly defined roles and effective collaboration. Notably, the judgment agent exclusively verifies correctness and redirects to prior agents for revisions, outperforming conventional correction strategies. To further expand the capability boundaries of the framework, we propose mixed reward modeling that balances agent-specific abilities and global collaboration, as well as agent-wise hybrid test-time scaling, which customizes different scaling strategies for each agent based on their functions. Evaluated on benchmarks spanning both document-based and non-document-based settings, our MACT shows superior performance with a smaller parameter scale without sacrificing ability on general and mathematical tasks. In particular, it stands out in benchmarks involving long visual contexts and complicated reasoning. The three variants of MACT consistently hold the top three positions in average scores, leading in 13 of the 15 benchmarks. Code will be available at: this https URL.

[CV-44] Sparsity and Total Variation Constrained Multilayer Linear Unmixing for Hyperspectral Imagery

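[Quick Read]: This paper addresses the accuracy of endmember and abundance estimation in hyperspectral unmixing. The key to its solution is a new method built on a multilayer matrix factorization model: sparsity and total variation (TV) constrained multilayer linear unmixing (STVMLU). The method adopts an L1/2-norm sparsity constraint to characterize the sparsity of the abundance matrix and adds a TV regularizer that exploits the similarity of spatially adjacent pixels to improve unmixing accuracy. Optimization uses the alternating direction method of multipliers (ADMM), which extracts the endmembers and the abundance matrix simultaneously; the reported experiments show improved performance over other algorithms.

Link: https://arxiv.org/abs/2508.03403
Authors: Gang Yang
Affiliations: Southwest China Institute of Electronic Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Comments: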

Abstract:Hyperspectral unmixing aims at estimating material signatures (known as endmembers) and the corresponding proportions (referred to as abundances), which is a critical preprocessing step in various hyperspectral imagery applications. This study develops a novel approach called sparsity and total variation (TV) constrained multilayer linear unmixing (STVMLU) for hyperspectral imagery. Specifically, based on a multilayer matrix factorization model, to improve the accuracy of unmixing, a TV constraint is incorporated to consider adjacent spatial similarity. Additionally, an L1/2-norm sparse constraint is adopted to effectively characterize the sparsity of the abundance matrix. For optimizing the STVMLU model, the alternating direction method of multipliers (ADMM) is employed, which allows for the simultaneous extraction of endmembers and their corresponding abundance matrix. Experimental results illustrate the enhanced performance of the proposed STVMLU when compared to other algorithms.
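
To make the two regularizers concrete, here is a minimal sketch of what each one measures on a tiny abundance map. The abstract does not give the exact formulation, so the anisotropic form of TV and the plain sum of square roots for the L1/2 term are assumptions made for illustration.

```python
def l_half_penalty(abundances):
    """Sum of |a|^(1/2) over all entries (assumed form of the L1/2 sparsity term)."""
    return sum(abs(a) ** 0.5 for row in abundances for a in row)


def anisotropic_tv(image):
    """Sum of absolute differences between horizontally and vertically adjacent pixels."""
    rows, cols = len(image), len(image[0])
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(rows) for c in range(cols - 1))
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(rows - 1) for c in range(cols))
    return horizontal + vertical


# Both maps below sum to 1, but the sparse one has the smaller L1/2 penalty.
sparse_map = [[1.0, 0.0], [0.0, 0.0]]
dense_map = [[0.25, 0.25], [0.25, 0.25]]
print(l_half_penalty(sparse_map), l_half_penalty(dense_map))
# A spatially smooth map has zero total variation; a checkerboard does not.
print(anisotropic_tv(dense_map), anisotropic_tv([[1, 0], [0, 1]]))
```

The L1/2 term rewards abundance vectors concentrated on few endmembers, while the TV term rewards neighboring pixels that share similar abundances.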

[CV-45] SCFlow: Implicitly Learning Style and Content Disentanglement with Flow Models ICCV2025

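[Quick Read]: This paper tackles the difficulty of explicitly disentangling style and content in vision models, a problem made hard by the semantic overlap between the two and by the subjectivity of human perception. Existing methods separate them through generative or discriminative objectives but still face the inherent ambiguity of intertwined concepts. The key idea is to drop explicit disentanglement as the direct objective and instead learn an invertible merge of style and content: the SCFlow framework learns bidirectional mappings so that disentanglement emerges naturally. The approach rests on three insights: (1) training only on the well-defined merging task yields invertible disentanglement without explicit supervision; (2) flow matching can map between arbitrary distributions, avoiding the Gaussian priors that diffusion models and normalizing flows depend on; and (3) a synthetic dataset of 510,000 samples systematically pairs styles with content to guide learning of the disentangled structure.

Link: https://arxiv.org/abs/2508.03402
Authors: Pingchuan Ma,Xiaopei Yang,Yusong Li,Ming Gui,Felix Krause,Johannes Schusterbauer,Björn Ommer
Affiliations: CompVis @ LMU Munich; Munich Center for Machine Learning (MCML)
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: ICCV 2025, Project Page: this https URL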

Abstract:Explicitly disentangling style and content in vision models remains challenging due to their semantic overlap and the subjectivity of human perception. Existing methods propose separation through generative or discriminative objectives, but they still face the inherent ambiguity of disentangling intertwined concepts. Instead, we ask: Can we bypass explicit disentanglement by learning to merge style and content invertibly, allowing separation to emerge naturally? We propose SCFlow, a flow-matching framework that learns bidirectional mappings between entangled and disentangled representations. Our approach is built upon three key insights: 1) Training solely to merge style and content, a well-defined task, enables invertible disentanglement without explicit supervision; 2) flow matching bridges on arbitrary distributions, avoiding the restrictive Gaussian priors of diffusion models and normalizing flows; and 3) a synthetic dataset of 510,000 samples (51 styles × 10,000 content samples) was curated to simulate disentanglement through systematic style-content pairing. Beyond controllable generation tasks, we demonstrate that SCFlow generalizes to ImageNet-1k and WikiArt in zero-shot settings and achieves competitive performance, highlighting that disentanglement naturally emerges from the invertible merging process.
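
As a rough illustration of why a flow between two arbitrary endpoints is invertible, the sketch below builds a flow-matching training pair on a straight-line path and integrates a velocity field forward and backward. The linear path and the Euler integrator are assumptions chosen for simplicity; the abstract does not state which path or solver SCFlow uses.

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus the constant target velocity."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_integrate(start, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(start)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# x0 stands in for a disentangled (style, content) pair, x1 for the merged representation.
x0, x1 = [0.0, 2.0], [4.0, 0.0]
x_t, v = flow_matching_pair(x0, x1, 0.25)
merged = euler_integrate(x0, lambda x, t: v)                      # forward: merge
separated = euler_integrate(x1, lambda x, t: [-c for c in v])     # backward: separate
```

A network trained to regress `velocity` at `x_t` plays the role of `velocity_fn`; running the same field with the sign flipped retraces the path, which is the sense in which merging can be inverted.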

[CV-46] DepthGait: Multi-Scale Cross-Level Feature Fusion of RGB-Derived Depth and Silhouette Sequences for Robust Gait Recognition

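[Quick Read]: This paper addresses the limited discriminative power of current gait recognition methods that rely on 2D representations such as binary silhouettes and skeletons, which handle viewpoint changes poorly and miss subtle gait cues. The key is DepthGait, a framework that explicitly estimates depth maps from RGB image sequences as a new modality and combines them with conventional silhouettes to better extract the discriminative features inherent in human locomotion. A multi-scale, cross-level fusion scheme bridges the modality gap between depth maps and silhouettes, improving recognition performance on standard benchmarks.

Link: https://arxiv.org/abs/2508.03397
Authors: Xinzhu Li,Juepeng Zheng,Yikun Chen,Xudong Mao,Guanghui Yue,Wei Zhou,Chenlei Lv,Ruomei Wang,Fan Zhou,Baoquan Zhao
Affiliations: Sun Yat-sen University; Guangdong Zhiyun Urban Construction Technology Co., Ltd.; Shenzhen University; Cardiff University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: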

Abstract:Robust gait recognition requires highly discriminative representations, which are closely tied to input modalities. While binary silhouettes and skeletons have dominated recent literature, these 2D representations fall short of capturing sufficient cues that can be exploited to handle viewpoint variations, and capture finer and meaningful details of gait. In this paper, we introduce a novel framework, termed DepthGait, that incorporates RGB-derived depth maps and silhouettes for enhanced gait recognition. Specifically, apart from the 2D silhouette representation of the human body, the proposed pipeline explicitly estimates depth maps from a given RGB image sequence and uses them as a new modality to capture discriminative features inherent in human locomotion. In addition, a novel multi-scale and cross-level fusion scheme has also been developed to bridge the modality gap between depth maps and silhouettes. Extensive experiments on standard benchmarks demonstrate that the proposed DepthGait achieves state-of-the-art performance compared to peer methods and attains an impressive mean rank-1 accuracy on the challenging datasets.

[CV-47] Neutralizing Token Aggregation via Information Augmentation for Efficient Test-Time Adaptation

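[Quick Read]: This paper addresses Efficient Test-Time Adaptation (ETTA): preserving a Vision Transformer's (ViT) ability to adapt to distribution shifts while substantially reducing inference latency. Plug-and-play token aggregation improves efficiency by merging redundant tokens, but combining it directly with mainstream TTA methods causes a significant performance drop. The key is NAVIA (Neutralize Token Aggregation via Information Augmentation). From a mutual information perspective, the paper shows theoretically that token aggregation causes information loss that conventional norm-tuning-based TTA cannot fully compensate. NAVIA therefore directly augments the [CLS] token embedding and adds adaptive biases in shallow layers; optimized by entropy minimization, these recover the information lost to aggregation, maintaining adaptation performance while cutting inference latency by more than 20%.

Link: https://arxiv.org/abs/2508.03388
Authors: Yizhe Xiong,Zihan Zhou,Yiwen Liang,Hui Chen,Zijia Lin,Tianxiang Hao,Fan Zhang,Jungong Han,Guiguang Ding
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures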

Abstract:Test-Time Adaptation (TTA) has emerged as an effective solution for adapting Vision Transformers (ViT) to distribution shifts without additional training data. However, existing TTA methods often incur substantial computational overhead, limiting their applicability in resource-constrained real-world scenarios. To reduce inference cost, plug-and-play token aggregation methods merge redundant tokens in ViTs to reduce total processed tokens. Albeit efficient, it suffers from significant performance degradation when directly integrated with existing TTA methods. We formalize this problem as Efficient Test-Time Adaptation (ETTA), seeking to preserve the adaptation capability of TTA while reducing inference latency. In this paper, we first provide a theoretical analysis from a novel mutual information perspective, showing that token aggregation inherently leads to information loss, which cannot be fully mitigated by conventional norm-tuning-based TTA methods. Guided by this insight, we propose to Neutralize Token Aggregation via Information Augmentation (NAVIA). Specifically, we directly augment the [CLS] token embedding and incorporate adaptive biases into the [CLS] token in shallow layers of ViTs. We theoretically demonstrate that these augmentations, when optimized via entropy minimization, recover the information lost due to token aggregation. Extensive experiments across various out-of-distribution benchmarks demonstrate that NAVIA significantly outperforms state-of-the-art methods by over 2.5%, while achieving an inference latency reduction of more than 20%, effectively addressing the ETTA challenge.
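
The entropy-minimization objective mentioned above can be illustrated in a few lines. This sketch only computes the loss on one prediction; which parameters are updated (the [CLS] augmentation and the shallow-layer biases) comes from the abstract, and the helper names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_entropy(logits):
    """Shannon entropy (in nats) of the softmax prediction; test-time adaptation minimizes this."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0)


uncertain = prediction_entropy([0.0, 0.0, 0.0, 0.0])   # uniform prediction, maximal entropy
confident = prediction_entropy([10.0, 0.0, 0.0, 0.0])  # peaked prediction, low entropy
```

Because the loss needs no labels, it can be evaluated on each unlabeled test batch and back-propagated into a small set of adaptation parameters.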

[CV-48] GaitAdapt: Continual Learning for Evolving Gait Recognition

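[Quick Read]: This paper addresses the fact that current gait recognition methods must be retrained on new datasets and are prone to catastrophic forgetting: performance on the new task improves while recognition on earlier tasks drops sharply. The key is GaitAdapter, a non-replay continual learning framework with two core components: (1) the GaitPartition Adaptive Knowledge (GPAK) module, which uses graph neural networks to aggregate common gait patterns from the current data into a repository built from graph vectors and uses it to sharpen feature discriminability on new tasks; and (2) a Euclidean distance stability method based on negative pairs (EDSN), which keeps the relative spatial distribution of gait samples consistent across tasks and so limits the loss of separability in the original features after a task change. Experiments indicate that the method retains gait knowledge across tasks and clearly outperforms the compared methods.

Link: https://arxiv.org/abs/2508.03375
Authors: Jingjie Wang,Shunli Zhang,Xiang Wei,Senmao Tian
Affiliations: Beijing Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: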

Abstract:Current gait recognition methodologies generally necessitate retraining when encountering new datasets. Nevertheless, retrained models frequently encounter difficulties in preserving knowledge from previous datasets, leading to a significant decline in performance on earlier test sets. To tackle these challenges, we present a continual gait recognition task, termed GaitAdapt, which supports the progressive enhancement of gait recognition capabilities over time and is systematically categorized according to various evaluation scenarios. Additionally, we propose GaitAdapter, a non-replay continual learning approach for gait recognition. This approach integrates the GaitPartition Adaptive Knowledge (GPAK) module, employing graph neural networks to aggregate common gait patterns from current data into a repository constructed from graph vectors. Subsequently, this repository is used to improve the discriminability of gait features in new tasks, thereby enhancing the model’s ability to effectively recognize gait patterns. We also introduce a Euclidean Distance Stability Method (EDSN) based on negative pairs, which ensures that newly added gait samples from different classes maintain similar relative spatial distributions across both previous and current gait tasks, thereby alleviating the impact of task changes on the distinguishability of original domain features. Extensive evaluations demonstrate that GaitAdapter effectively retains gait knowledge acquired from diverse tasks, exhibiting markedly superior discriminative capability compared to alternative methods.

[CV-49] GRASPing Anatomy to Improve Pathology Segmentation MICCAI

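[Quick Read]: This paper addresses the tendency of current deep learning approaches to pathology segmentation to rely on pure pattern recognition and ignore anatomical context, which limits the accuracy of lesion localization and boundary delineation. The key is GRASP (Guided Representation Alignment for the Segmentation of Pathologies), a modular plug-and-play framework that injects knowledge from existing anatomy segmentation models into the pathology segmentation pipeline through pseudolabel integration and feature alignment, without retraining the anatomical components. GRASP uses a dual anatomy injection strategy: anatomical pseudo-labels are supplied as input channels to strengthen spatial awareness, and transformer-guided anatomical feature fusion aligns the anatomical and pathology features spatially, improving pathology segmentation performance.

Link: https://arxiv.org/abs/2508.03374
Authors: Keyi Li,Alexander Jaus,Jens Kleesiek,Rainer Stiefelhagen
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at 16th MICCAI Workshop on Machine Learning in Medical Imaging (MLMI2025)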

Abstract:Radiologists rely on anatomical understanding to accurately delineate pathologies, yet most current deep learning approaches use pure pattern recognition and ignore the anatomical context in which pathologies develop. To narrow this gap, we introduce GRASP (Guided Representation Alignment for the Segmentation of Pathologies), a modular plug-and-play framework that enhances pathology segmentation models by leveraging existing anatomy segmentation models through pseudolabel integration and feature alignment. Unlike previous approaches that obtain anatomical knowledge via auxiliary training, GRASP integrates into standard pathology optimization regimes without retraining anatomical components. We evaluate GRASP on two PET/CT datasets, conduct systematic ablation studies, and investigate the framework’s inner workings. We find that GRASP consistently achieves top rankings across multiple evaluation metrics and diverse architectures. The framework’s dual anatomy injection strategy, combining anatomical pseudo-labels as input channels with transformer-guided anatomical feature fusion, effectively incorporates anatomical context.

[CV-50] Diffusion Once and Done: Degradation-Aware LoRA for Efficient All-in-One Image Restoration

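[Quick Read]: This paper addresses the high inference cost and limited adaptability to diverse degradation types of current all-in-one image restoration (AiOIR) methods. Existing methods typically retrain or fine-tune a pretrained diffusion model with extra conditional guidance, which is computationally expensive and generalizes poorly. The key is an efficient one-step sampling strategy, Diffusion Once and Done (DOD): a multi-degradation feature modulation module captures degradation prompts, and parameter-efficient conditional low-rank adaptation embeds those prompts into the Stable Diffusion (SD) model so that it adapts to different degradation types. A high-fidelity detail enhancement module integrated into the SD decoder improves the recovery of structural and textural detail, giving better results than existing diffusion-based restoration methods at very low inference cost.

Link: https://arxiv.org/abs/2508.03373
Authors: Ni Tang,Xiaotong Luo,Zihan Cheng,Liangtai Zhou,Dongxiao Zhang,Yanyun Qu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: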

Abstract:Diffusion models have shown strong potential in all-in-one image restoration (AiOIR), where they excel at generating rich texture details. The existing AiOIR methods either retrain a diffusion model or fine-tune the pretrained diffusion model with extra conditional guidance. However, they often suffer from high inference costs and limited adaptability to diverse degradation types. In this paper, we propose an efficient AiOIR method, Diffusion Once and Done (DOD), which aims to achieve superior restoration performance with only one-step sampling of Stable Diffusion (SD) models. Specifically, multi-degradation feature modulation is first introduced to capture different degradation prompts with a pretrained diffusion model. Then, parameter-efficient conditional low-rank adaptation integrates the prompts to enable the fine-tuning of the SD model for adapting to different degradation types. Besides, a high-fidelity detail enhancement module is integrated into the decoder of SD to improve structural and textural details. Experiments demonstrate that our method outperforms existing diffusion-based restoration approaches in both visual quality and inference efficiency.
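
For readers unfamiliar with low-rank adaptation, the sketch below shows the basic weight update it relies on: a frozen matrix W plus a scaled product of two small matrices B and A. The `alpha / rank` scaling follows the common LoRA convention and is an assumption here; the paper's conditioning of the adapter on degradation prompts is not modeled.

```python
def matmul(left, right):
    """Plain-Python matrix product of two nested lists."""
    inner = len(right)
    return [[sum(left[i][k] * right[k][j] for k in range(inner))
             for j in range(len(right[0]))] for i in range(len(left))]


def lora_weight(w, b, a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A, with W frozen and only B, A trained."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable entries in B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


w = [[1, 0], [0, 1]]   # frozen 2x2 weight
b = [[1], [2]]         # 2x1 factor
a = [[3, 4]]           # 1x2 factor
adapted = lora_weight(w, b, a, alpha=1, rank=1)
```

For a 768 x 768 layer, a rank-8 adapter trains `lora_trainable_params(768, 768, 8)` entries instead of the full 768 * 768, which is where the parameter efficiency comes from.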

[CV-51] FedPromo: Federated Lightweight Proxy Models at the Edge Bring New Domains to Foundation Models AAAI2026

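[Quick Read]: This paper addresses the problem that, in federated learning (FL) with large-scale foundation models, client devices have too little compute to train the model directly. The key is the FedPromo framework, which achieves efficient, privacy-preserving adaptation in two stages: first, server-side knowledge distillation aligns the representations of the large model (e.g., a transformer) with those of a lightweight proxy model (e.g., a CNN); then the lightweight encoder is deployed to clients, where trainable classifiers are learned locally, aggregated, and transferred back to the central model. This avoids direct access to user data, greatly reduces client-side computation, and supports decentralized multi-domain learning.

Link: https://arxiv.org/abs/2508.03356
Authors: Matteo Caligiuri,Francesco Barbato,Donald Shenaj,Umberto Michieli,Pietro Zanuttigh
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 7 pages (main document) + 12 pages (appendix), 3 figures (main) + 12 figures (appendix), 5 tables (main) + 6 tables (appendix), submitted to AAAI 2026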

Abstract:Federated Learning (FL) is an established paradigm for training deep learning models on decentralized data. However, as the size of the models grows, conventional FL approaches often require significant computational resources on client devices, which may not be feasible. We introduce FedPromo, a novel framework that enables efficient adaptation of large-scale foundation models stored on a central server to new domains encountered only by remote clients. Instead of directly training the large model on client devices, FedPromo optimizes lightweight proxy models via FL, significantly reducing computational overhead while maintaining privacy. Our method follows a two-stage process: first, server-side knowledge distillation aligns the representations of a large-scale foundation model (e.g., a transformer) with those of a compact counterpart (e.g., a CNN). Then, the compact model encoder is deployed to client devices, where trainable classifiers are learned locally. These classifiers are subsequently aggregated and seamlessly transferred back to the foundation model, facilitating personalized adaptation without requiring direct access to user data. Through novel regularization strategies, our framework enables decentralized multi-domain learning, balancing performance, privacy, and resource efficiency. Extensive experiments on five image classification benchmarks demonstrate that FedPromo outperforms existing methods while assuming limited-resource clients.
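
The aggregation step can be pictured as a weighted average of the clients' classifier parameters. The abstract only says the classifiers are "aggregated"; the FedAvg-style weighting by local sample count used below is an assumption, as are the function and argument names.

```python
def aggregate_classifiers(client_weights, client_sizes):
    """Average flat classifier weight vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
            for i in range(dim)]


# Two clients: the second holds three times as much data, so it dominates the average.
global_classifier = aggregate_classifiers([[1.0, 0.0], [3.0, 4.0]], [1, 3])
```

Only these small classifier vectors travel between clients and server, which is what keeps raw user data and the large model off the client devices.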

[CV-52] WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval

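[Quick Read]: This paper addresses insufficient semantic matching precision in text-motion retrieval (TMR) for 3D motion, where the central challenge is the complex structure of human motion and its spatial-temporal dynamics. Existing methods typically use generic encoders that cannot distinguish different body parts and their time-varying dynamics, leading to imprecise semantic alignment. The authors propose WaMo, a wavelet-based multi-frequency feature extraction framework with three key components: (1) Trajectory Wavelet Decomposition splits motion signals into multi-frequency components that preserve both local kinematic detail and global semantics; (2) Trajectory Wavelet Reconstruction rebuilds the original joint trajectories with learnable inverse wavelet transforms, preserving spatial-temporal information; and (3) Disordered Motion Sequence Prediction reorders shuffled sequences to strengthen modeling of temporal coherence, improving fine-grained motion-text alignment. Experiments report Rsum improvements of 17.0% on HumanML3D and 18.2% on KIT-ML over state-of-the-art methods.

Link: https://arxiv.org/abs/2508.03343
Authors: Junlong Ren,Gangjian Zhang,Honghao Fu,Pengcheng Wu,Hao Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: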

Abstract:Text-Motion Retrieval (TMR) aims to retrieve 3D motion sequences semantically relevant to text descriptions. However, matching 3D motions with text remains highly challenging, primarily due to the intricate structure of human body and its spatial-temporal dynamics. Existing approaches often overlook these complexities, relying on general encoding methods that fail to distinguish different body parts and their dynamics, limiting precise semantic alignment. To address this, we propose WaMo, a novel wavelet-based multi-frequency feature extraction framework. It fully captures part-specific and time-varying motion details across multiple resolutions on body joints, extracting discriminative motion features to achieve fine-grained alignment with texts. WaMo has three key components: (1) Trajectory Wavelet Decomposition decomposes motion signals into frequency components that preserve both local kinematic details and global motion semantics. (2) Trajectory Wavelet Reconstruction uses learnable inverse wavelet transforms to reconstruct original joint trajectories from extracted features, ensuring the preservation of essential spatial-temporal information. (3) Disordered Motion Sequence Prediction reorders shuffled motion sequences to improve the learning of inherent temporal coherence, enhancing motion-text alignment. Extensive experiments demonstrate WaMo’s superiority, achieving 17.0% and 18.2% improvements in Rsum on HumanML3D and KIT-ML datasets, respectively, outperforming existing state-of-the-art (SOTA) methods.
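
To show what a wavelet decomposition of a joint trajectory looks like, here is a one-level Haar transform and its exact inverse on a single coordinate over time. The Haar wavelet is an assumption (the abstract does not name the wavelet family), and the paper's reconstruction uses learnable inverse transforms rather than the fixed inverse shown here.

```python
import math


def haar_decompose(signal):
    """One-level Haar transform of an even-length sequence: (low-frequency, high-frequency)."""
    scale = math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / scale for i in range(0, len(signal), 2)]
    return approx, detail


def haar_reconstruct(approx, detail):
    """Fixed inverse of haar_decompose; recovers the original samples."""
    scale = math.sqrt(2)
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([(a + d) / scale, (a - d) / scale])
    return signal


trajectory = [1.0, 3.0, 2.0, 2.0]           # e.g. one joint coordinate over four frames
approx, detail = haar_decompose(trajectory)  # coarse motion trend and local changes
rebuilt = haar_reconstruct(approx, detail)
```

The `approx` band carries the slow, global movement while `detail` is nonzero only where the coordinate changes between neighboring frames; applying the transform again to `approx` gives further frequency levels.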

[CV-53] UniFucGrasp: Human-Hand-Inspired Unified Functional Grasp Annotation Strategy and Dataset for Diverse Dexterous Hands

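[Quick Read]: This paper addresses the neglect of functional grasps in current dexterous grasp datasets: existing datasets mostly focus on grasp stability and lack annotation and support for the grasps needed to perform specific tasks (such as opening a bottle cap or holding a cup handle). Existing methods also rely on the costly, complex, hard-to-control high-degree-of-freedom (DOF) Shadow Hand for data collection, which is expensive and scales poorly. The key is UniFucGrasp, a universal functional grasp annotation strategy and dataset construction method. Based on biomimicry, it maps natural human motions onto a variety of robotic hand structures and uses geometry-based force closure to ensure grasps that are functional, stable, and human-like. The approach enables low-cost, efficient collection of diverse, high-quality functional grasp data, establishes the first functional grasp dataset covering multiple dexterous hand types, and improves generalization across robotic hands and task execution accuracy.

Link: https://arxiv.org/abs/2508.03339
Authors: Haoran Lin,Wenrui Chen,Xianchi Chen,Fan Yang,Qiang Diao,Wenxin Xie,Sijie Wu,Kailun Yang,Maojun Li,Yaonan Wang
Affiliations: Hunan University; National Engineering Research Center of Robot Visual Perception and Control Technology
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: The project page is at this https URL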

Abstract:Dexterous grasp datasets are vital for embodied intelligence, but mostly emphasize grasp stability, ignoring functional grasps needed for tasks like opening bottle caps or holding cup handles. Most rely on bulky, costly, and hard-to-control high-DOF Shadow Hands. Inspired by the human hand’s underactuated mechanism, we establish UniFucGrasp, a universal functional grasp annotation strategy and dataset for multiple dexterous hand types. Based on biomimicry, it maps natural human motions to diverse hand structures and uses geometry-based force closure to ensure functional, stable, human-like grasps. This method supports low-cost, efficient collection of diverse, high-quality functional grasps. Finally, we establish the first multi-hand functional grasp dataset and provide a synthesis model to validate its effectiveness. Experiments on the UFG dataset, IsaacSim, and complex robotic tasks show that our method improves functional manipulation accuracy and grasp stability, enables efficient generalization across diverse robotic hands, and overcomes annotation cost and generalization challenges in dexterous grasping. The project page is at this https URL.

[CV-54] CIVQLLIE: Causal Intervention with Vector Quantization for Low-Light Image Enhancement

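[Quick Read]: This paper addresses the severely reduced visibility of images captured in nighttime scenes, i.e., low-light image enhancement (LLIE), where existing methods face two challenges: data-driven end-to-end mapping networks lack interpretability or rely on unreliable priors and perform poorly in extreme darkness, while physics-based methods depend on simplified assumptions that often fail in complex real-world scenes. The key is CIVQLLIE, a framework that combines causal reasoning with discrete representation learning. Vector Quantization (VQ) maps continuous image features to a codebook of discrete visual tokens learned from large-scale high-quality images; the codebook encodes standardized brightness and color patterns that are independent of the degradation. To handle the distribution shift of the inputs, the paper designs multi-level causal intervention: Pixel-level Causal Intervention (PCI) aligns low-level feature distributions, and Feature-aware Causal Intervention (FCI), together with Low-frequency Selective Attention Gating (LSAG), identifies and enhances the channels most affected by illumination degradation, improving codebook matching accuracy and encoder generalization. In decoding, the High-frequency Detail Reconstruction Module (HDRM) uses the structural information preserved in the matched codebook representations and deformable convolution to restore fine texture.

Link: https://arxiv.org/abs/2508.03338
Authors: Tongshun Zhang,Pingping Liu,Zhe Zhang,Qiuzhan Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: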

Abstract:Images captured in nighttime scenes suffer from severely reduced visibility, hindering effective content perception. Current low-light image enhancement (LLIE) methods face significant challenges: data-driven end-to-end mapping networks lack interpretability or rely on unreliable prior guidance, struggling under extremely dark conditions, while physics-based methods depend on simplified assumptions that often fail in complex real-world scenarios. To address these limitations, we propose CIVQLLIE, a novel framework that leverages the power of discrete representation learning through causal reasoning. We achieve this through Vector Quantization (VQ), which maps continuous image features to a discrete codebook of visual tokens learned from large-scale high-quality images. This codebook serves as a reliable prior, encoding standardized brightness and color patterns that are independent of degradation. However, direct application of VQ to low-light images fails due to distribution shifts between degraded inputs and the learned codebook. Therefore, we propose a multi-level causal intervention approach to systematically correct these shifts. First, during encoding, our Pixel-level Causal Intervention (PCI) module intervenes to align low-level features with the brightness and color distributions expected by the codebook. Second, a Feature-aware Causal Intervention (FCI) mechanism with Low-frequency Selective Attention Gating (LSAG) identifies and enhances channels most affected by illumination degradation, facilitating accurate codebook token matching while enhancing the encoder’s generalization performance through flexible feature-level intervention. Finally, during decoding, the High-frequency Detail Reconstruction Module (HDRM) leverages structural information preserved in the matched codebook representations to reconstruct fine details using deformable convolution techniques.
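
The codebook lookup at the heart of vector quantization, and the way a brightness shift can break it, can be shown with a toy example. The three-entry codebook and the feature values are made up for illustration; only the nearest-neighbor lookup itself is standard VQ.

```python
def quantize(feature, codebook):
    """Return (index, entry) of the codebook vector closest to `feature` in squared distance."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    index = min(range(len(codebook)), key=lambda i: squared_distance(feature, codebook[i]))
    return index, codebook[index]


# Hypothetical codebook learned from well-lit images.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]

well_lit = [0.9, 1.2]     # feature of a normally exposed patch
low_light = [0.27, 0.36]  # the same feature dimmed to 30% of its intensity

token_normal, _ = quantize(well_lit, codebook)
token_dark, _ = quantize(low_light, codebook)
```

The dimmed feature lands on a different token than its well-lit counterpart, which is the distribution-shift failure that the paper's interventions are meant to correct before matching.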

[CV-55] Less is More: Token-Efficient Video-QA via Adaptive Frame-Pruning and Semantic Graph Integration

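[Quick Read]: This paper addresses the high token cost that multimodal large language models (MLLMs) incur in video question answering (Video-QA) when processing many frames, along with two further issues: existing keyframe selection methods still leave significant temporal redundancy (termed "visual echoes"), and oversampling frames degrades performance through context dilution. The key is Adaptive Frame-Pruning (AFP), a post-processing method that first applies an adaptive hierarchical clustering algorithm in a fused ResNet-50 and CLIP feature space to identify visual echoes and merge them into representative frames, then adds a lightweight text-based semantic graph to compensate for the lost information at minimal token overhead. On benchmarks including LongVideoBench and VideoMME, the method reduces frames by up to 86.9% and total tokens by up to 83.2% while maintaining or improving accuracy, supporting the "less is more" strategy.

Link: https://arxiv.org/abs/2508.03337
Authors: Shaoguang Wang,Jianxiang He,Yijie Xu,Ziyang Chen,Weiyu Guo,Hui Xiong
Affiliations: The Hong Kong University of Science and Technology (Guangzhou)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Corresponding authors: Weiyu Guo, Hui Xiong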

Abstract:The practical application of Multimodal Large Language Models (MLLMs) to Video Question Answering (Video-QA) is severely hindered by the high token cost of processing numerous video frames. While increasing the number of sampled frames is a common strategy, we observe a “less is more” phenomenon where excessive frames can paradoxically degrade performance due to context dilution. Concurrently, state-of-the-art keyframe selection methods, while effective, still yield significant temporal redundancy, which we term ‘visual echoes’. To address these dual challenges, we propose Adaptive Frame-Pruning (AFP), a novel post-processing method that intelligently prunes the selected keyframes. AFP employs an adaptive hierarchical clustering algorithm on a fused ResNet-50 and CLIP feature space to identify and merge these echoes into single representatives. To compensate for information loss, we then introduce a lightweight, text-based semantic graph that provides critical context with minimal token overhead. Conducting extensive experiments on the LongVideoBench and VideoMME benchmarks across multiple leading MLLMs, our full approach demonstrates a drastic reduction in required frames by up to 86.9% and total input tokens by up to 83.2%. Crucially, by providing a concise, high-quality set of frames, our method not only enhances efficiency but often improves accuracy over baselines that use more frames. The code will be released upon publication.
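
The idea of collapsing "visual echoes" into single representatives can be sketched with a much simpler rule than the paper's. The code below swaps the adaptive hierarchical clustering for a greedy pass with a fixed cosine-similarity threshold, and uses toy 2-D vectors in place of fused ResNet-50 and CLIP features; both substitutions are assumptions made to keep the example short.

```python
import math


def cosine(u, v):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def prune_echoes(features, threshold=0.95):
    """Keep a frame only if it differs enough from the most recently kept frame."""
    kept = []
    for index, feature in enumerate(features):
        if not kept or cosine(features[kept[-1]], feature) < threshold:
            kept.append(index)
    return kept


# Five keyframes in temporal order; frames 1 and 3 nearly repeat frames 0 and 2.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [1.0, 0.1]]
representatives = prune_echoes(frames)
```

Each dropped frame saves its full visual token budget, which is why the paper pairs pruning with a compact text-based semantic graph to carry over the context those frames provided.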

[CV-56] Beyond Illumination: Fine-Grained Detail Preservation in Extreme Dark Image Restoration

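[Quick Read]: This paper addresses the difficulty of recovering fine detail in extremely dark images, where severe structural information loss and noise make it hard for existing enhancement methods to preserve intricate textures and sharp edges, limiting downstream tasks such as text detection and edge detection. The key is an efficient dual-stage framework. The first stage introduces a Residual Fourier-Guided Module (RFGM), which restores global illumination in the frequency domain and models inter-stage and inter-channel dependencies through residual connections, providing robust priors and reducing the risk of error accumulation. The second stage refines textural structure with complementary Mamba modules: Patch Mamba models pixel-level correlations on non-downsampled patches to enhance fine-grained detail, and Grad Mamba focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. The modules are lightweight and can be integrated into existing Fourier-based methods.

Link: https://arxiv.org/abs/2508.03336
Authors: Tongshun Zhang,Pingping Liu,Zixuan Zhong,Zijian Zhang,Qiuzhan Zhou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: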

Abstract:Recovering fine-grained details in extremely dark images remains challenging due to severe structural information loss and noise corruption. Existing enhancement methods often fail to preserve intricate details and sharp edges, limiting their effectiveness in downstream applications like text and edge detection. To address these deficiencies, we propose an efficient dual-stage approach centered on detail recovery for dark images. In the first stage, we introduce a Residual Fourier-Guided Module (RFGM) that effectively restores global illumination in the frequency domain. RFGM captures inter-stage and inter-channel dependencies through residual connections, providing robust priors for high-fidelity frequency processing while mitigating error accumulation risks from unreliable priors. The second stage employs complementary Mamba modules specifically designed for textural structure refinement: (1) Patch Mamba operates on channel-concatenated non-downsampled patches, meticulously modeling pixel-level correlations to enhance fine-grained details without resolution loss. (2) Grad Mamba explicitly focuses on high-gradient regions, alleviating state decay in state space models and prioritizing reconstruction of sharp edges and boundaries. Extensive experiments on multiple benchmark datasets and downstream applications demonstrate that our method significantly improves detail recovery performance while maintaining efficiency. Crucially, the proposed modules are lightweight and can be seamlessly integrated into existing Fourier-based frameworks with minimal computational overhead. Code is available at this https URL.

[CV-57] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

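[Quick Read]: This paper addresses two core problems of autoregressive diffusion models in long video generation: temporal drift caused by error accumulation, and the difficulty of parallelization, which limits generation efficiency. The key is a planning-then-populating framework, Macro-from-Micro Planning (MMPL), which uses two hierarchical planning levels. Micro Planning predicts a sparse set of keyframes within each short video segment, providing motion and appearance priors that guide high-quality segment generation. Macro Planning extends these local keyframes across the whole video through an autoregressive chain of micro plans, ensuring long-term consistency across segments. MMPL-based Content Populating then generates all intermediate frames in parallel, and Adaptive Workload Scheduling balances GPU execution, improving the quality and stability of long video generation.

Link: https://arxiv.org/abs/2508.03334
Authors: Xunzhi Xiang,Yabo Chen,Guiyu Zhang,Zhongyu Wang,Zhe Gao,Quanming Xiang,Gonghu Shang,Junqi Liu,Haibin Huang,Yang Gao,Chi Zhang,Qi Fan,Xuelong Li
Affiliations: Nanjing University; TeleAI; Shanghai Jiao Tong University; Chinese University of Hong Kong, Shenzhen; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: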

Abstract:Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.

[CV-58] LRDDv2: Enhanced Long-Range Drone Detection Dataset with Range Information and Comprehensive Real-World Challenges

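[Quick Read]: This paper addresses long-range detection of unmanned aerial vehicles (UAVs), motivated by the growing need to ensure safe operation in densely populated areas. Despite major advances in deep-learning-based computer vision, detecting small airborne objects remains challenging. The authors present the Long Range Drone Detection (LRDD) Version 2 dataset. Its key contributions are greater image diversity and, as its distinguishing feature, target range annotations for more than 8,000 images, which support research on drone range estimation. In most images the drone occupies 50 pixels or fewer at 1080p resolution, matching the long-range detection setting and providing a diverse resource for training and evaluation.

Link: https://arxiv.org/abs/2508.03331
Authors: Amirreza Rouhi,Sneh Patel,Noah McCarthy,Siddiqa Khan,Hadi Khorsand,Kaleb Lefkowitz,David K.Han
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: Accepted and presented at ISRR 2024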

Abstract:The exponential growth in Unmanned Aerial Vehicles (UAVs) usage underscores the critical need to detect them at extended distances to ensure safe operations, especially in densely populated areas. Despite the tremendous advances made in computer vision through deep learning, the detection of these small airborne objects remains a formidable challenge. While several datasets have been developed specifically for drone detection, the need for a more extensive and diverse collection of drone image data persists, particularly for long-range detection under varying environmental conditions. We introduce here the Long Range Drone Detection (LRDD) Version 2 dataset, comprising 39,516 meticulously annotated images, as a second release of the LRDD dataset released previously. The LRDDv2 dataset enhances the LRDDv1 by incorporating a greater variety of images, providing a more diverse and comprehensive resource for drone detection research. What sets LRDDv2 apart is its inclusion of target range information for over 8,000 images, making it possible to develop algorithms for drone range estimation. Tailored for long-range aerial object detection, the majority of the LRDDv2 dataset consists of images capturing drones with 50 or fewer pixels in 1080p resolution. For access to the complete Long-Range Drone Detection Dataset (LRDD)v2, please visit this https URL.

[CV-59] Live Demonstration: Neuromorphic Radar for Gesture Recognition ICASSP2025

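Quick read: This paper targets the high power consumption, high latency, and wasted computation of conventional radar-based hand gesture recognition (HGR) systems. Existing methods typically sample and process data continuously, which incurs unnecessary memory use and energy consumption and makes them a poor fit for real-time embedded applications. The key to the solution is a bio-inspired, event-driven architecture: a 24 GHz Doppler radar front-end is paired with a custom neuromorphic sampler that converts intermediate-frequency (IF) signals into sparse spike representations via asynchronous sigma-delta encoding, and a lightweight neural network deployed on a Cortex-M0 microcontroller processes these events directly, achieving low-latency inference without spectrogram reconstruction. Because the system activates only when meaningful motion is detected, it substantially reduces power and computation overhead while enabling efficient real-time gesture recognition.

Link: https://arxiv.org/abs/2508.03324
Authors: Satyapreet Singh Yadav, Chandra Sekhar Seelamantula, Chetan Singh Thakur
Affiliations: Indian Institute of Science
Categories: Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE); Systems and Control (eess.SY)
Comments: Neuromorphic Radar, Hand Gesture Recognition, Event-Driven, Sigma-Delta Encoding, Sparse Representation. Presented in ICASSP 2025 at Hyderabad, India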

Abstract:We present a neuromorphic radar framework for real-time, low-power hand gesture recognition (HGR) using an event-driven architecture inspired by biological sensing. Our system comprises a 24 GHz Doppler radar front-end and a custom neuromorphic sampler that converts intermediate-frequency (IF) signals into sparse spike-based representations via asynchronous sigma-delta encoding. These events are directly processed by a lightweight neural network deployed on a Cortex-M0 microcontroller, enabling low-latency inference without requiring spectrogram reconstruction. Unlike conventional radar HGR pipelines that continuously sample and process data, our architecture activates only when meaningful motion is detected, significantly reducing memory, power, and computation overhead. Evaluated on a dataset of five gestures collected from seven users, our system achieves 85% real-time accuracy. To the best of our knowledge, this is the first work that employs bio-inspired asynchronous sigma-delta encoding and an event-driven processing framework for radar-based HGR.
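
As a rough illustration of how an event-driven encoder turns a sampled signal into sparse spikes, the sketch below implements a simple level-crossing (send-on-delta) encoder. It is a simplified stand-in for asynchronous sigma-delta encoding, not the paper's sampler; the function name `delta_encode`, the `threshold` value, and the toy signal are all assumptions made for illustration.

```python
from typing import List, Tuple


def delta_encode(samples: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Emit (sample_index, polarity) events each time the signal moves by
    `threshold` away from the last reference level (hypothetical helper)."""
    events: List[Tuple[int, int]] = []
    reference = samples[0]
    for i, x in enumerate(samples[1:], start=1):
        # Emit as many events as needed to catch up with a large jump.
        while x - reference >= threshold:
            reference += threshold
            events.append((i, +1))
        while reference - x >= threshold:
            reference -= threshold
            events.append((i, -1))
    return events


# Toy signal: flat, then a rise, then a partial fall.
signal = [0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 0.4]
events = delta_encode(signal, threshold=0.5)
```

Flat stretches of the signal produce no events at all, which is the property that lets an event-driven pipeline stay idle until meaningful motion occurs.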

[CV-60] Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation

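Quick read: This paper addresses architectural redundancy and high resource consumption in multimodal tasks: conventional approaches usually design dedicated modules or connectors for image understanding, text-to-image generation, and image editing, which makes systems complex and hard to deploy efficiently on consumer hardware. The proposed solution is Skywork UniPic, a unified autoregressive multimodal model with only 1.5 billion parameters, which combines high performance with low resource requirements through three techniques: (1) a decoupled encoding strategy that uses a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, both feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule that scales from 256×256 to 1024×1024 while dynamically unfreezing parameters to balance capacity and training stability; and (3) 100-million-scale curated datasets combined with task-specific reward models to refine generation and editing quality. The model delivers high-fidelity multimodal capabilities with under 15 GB of GPU memory (e.g., on an RTX 4090), supporting the feasibility of high-performance multimodal AI in lightweight deployments.

Link: https://arxiv.org/abs/2508.03320
Authors: Peiyu Wang, Yi Peng, Yimeng Gan, Liang Hu, Tianyidan Xie, Xiaokun Wang, Yichen Wei, Chuanxin Tang, Bo Zhu, Changshi Li, Hongyang Wei, Eric Li, Xuchen Song, Yang Liu, Yahui Zhou
Affiliations: Skywork AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None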

Abstract:We introduce Skywork UniPic, a 1.5 billion-parameter autoregressive model that unifies image understanding, text-to-image generation, and image editing within a single architecture, eliminating the need for task-specific adapters or inter-module connectors, and demonstrate that compact multimodal systems can achieve state-of-the-art performance on commodity hardware. Skywork UniPic achieves a GenEval score of 0.86, surpassing most existing unified models; sets a new DPG-Bench complex-generation record of 85.5; attains 5.83 on GEditBench-EN and 3.49 on ImgEdit-Bench for image editing; and generates 1024 x 1024 images with under 15 GB of GPU memory (e.g., RTX 4090). Three techniques underpin these results: (1) a decoupled encoding strategy that leverages a masked autoregressive encoder for synthesis and a SigLIP2 encoder for understanding, all feeding a shared autoregressive decoder; (2) a progressive, resolution-aware training schedule scaling from 256 x 256 to 1024 x 1024 while dynamically unfreezing parameters to balance capacity and stability; and (3) meticulously curated, 100 million-scale datasets augmented with task-specific reward models to refine generation and editing objectives. By demonstrating that high-fidelity multimodal integration need not incur prohibitive resource demands, Skywork UniPic establishes a practical paradigm for deployable, high-fidelity multimodal AI. Code and weights are publicly available at this https URL.

[CV-61] Architectural Insights into Knowledge Distillation for Object Detection: A Comprehensive Review

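Quick read: This paper addresses the rising computational cost that accompanies increasingly complex deep learning object detectors, which limits deployment on resource-constrained devices. The key is knowledge distillation (KD), in which a lightweight student model learns from a stronger teacher model with more parameters, yielding efficient and accurate detectors. The review proposes an architecture-centric taxonomy of KD methods that distinguishes CNN-based from Transformer-based detectors and organizes distillation strategies by the level at which they operate (for example the backbone, the feature-fusion neck, and the detection head), clarifying how KD can be adapted to object detection.

Link: https://arxiv.org/abs/2508.03317
Authors: Mahdi Golizadeh, Nassibeh Golizadeh, Mohammad Ali Keyvanrad, Hossein Shirazi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 20 pages, 11 figures, This paper was submitted to IEEE Transactions on Neural Networks and Learning Systems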

Abstract:Object detection has achieved remarkable accuracy through deep learning, yet these improvements often come with increased computational cost, limiting deployment on resource-constrained devices. Knowledge Distillation (KD) provides an effective solution by enabling compact student models to learn from larger teacher models. However, adapting KD to object detection poses unique challenges due to its dual objectives (classification and localization) as well as foreground-background imbalance and multi-scale feature representation. This review introduces a novel architecture-centric taxonomy for KD methods, distinguishing between CNN-based detectors (covering backbone-level, neck-level, head-level, and RPN/RoI-level distillation) and Transformer-based detectors (including query-level, feature-level, and logit-level distillation). We further evaluate representative methods using the MS COCO and PASCAL VOC datasets with mAP@0.5 as the performance metric, providing a comparative analysis of their effectiveness. The proposed taxonomy and analysis aim to clarify the evolving landscape of KD in object detection, highlight current challenges, and guide future research toward efficient and scalable detection systems.
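
To make logit-level distillation concrete, here is a minimal sketch of the classic temperature-softened KD loss for a single classification output. It is the generic formulation rather than any specific method covered by the review; the function names and the default temperature of 4.0 are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float], teacher_logits: List[float],
            temperature: float = 4.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
poor_student = [-1.0, 0.5, 2.0]
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge. Detection-specific KD methods add further terms for localization and feature maps, which this sketch does not cover.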

[CV-62] BaroPoser: Real-time Human Motion Tracking from IMUs and Barometers in Everyday Devices

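Quick read: This paper addresses the limited accuracy of human pose estimation from inertial measurement units (IMUs) in everyday devices such as smartphones and smartwatches when the wearer moves over non-flat terrain. Existing methods are constrained by sparse sensor data and a lack of annotated data on complex terrain, so they struggle to recover full-body pose and global translation. The key to the solution is BaroPoser, the first method to fuse IMU and barometric data to estimate human pose and global translation in real time. Barometric readings are used to estimate changes in sensor height, which provide cues for improving pose accuracy and predicting global translation on non-flat terrain, and a local thigh coordinate frame is introduced to disentangle local and global motion inputs for better pose representation learning.

Link: https://arxiv.org/abs/2508.03313
Authors: Libo Zhang, Xinyu Yi, Feng Xu
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 9 pages, 10 figures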

Abstract:In recent years, tracking human motion using IMUs from everyday devices such as smartphones and smartwatches has gained increasing popularity. However, due to the sparsity of sensor measurements and the lack of datasets capturing human motion over uneven terrain, existing methods often struggle with pose estimation accuracy and are typically limited to recovering movements on flat terrain only. To this end, we present BaroPoser, the first method that combines IMU and barometric data recorded by a smartphone and a smartwatch to estimate human pose and global translation in real time. By leveraging barometric readings, we estimate sensor height changes, which provide valuable cues for both improving the accuracy of human pose estimation and predicting global translation on non-flat terrain. Furthermore, we propose a local thigh coordinate frame to disentangle local and global motion input for better pose representation learning. We evaluate our method on both public benchmark datasets and real-world recordings. Quantitative and qualitative results demonstrate that our approach outperforms the state-of-the-art (SOTA) methods that use IMUs only with the same hardware configuration.
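
The idea of estimating height changes from barometric readings can be sketched with the standard international barometric formula. The abstract does not say which conversion BaroPoser uses, so the formula choice, the function names, and the example pressures below are assumptions.

```python
def pressure_to_altitude(pressure_hpa: float, sea_level_hpa: float = 1013.25) -> float:
    """Approximate altitude in metres from pressure in hPa
    (international barometric formula, standard atmosphere)."""
    return 44330.0 * (1.0 - (pressure_hpa / sea_level_hpa) ** (1.0 / 5.255))


def height_change(p_start_hpa: float, p_end_hpa: float) -> float:
    """Relative height change in metres between two pressure readings."""
    return pressure_to_altitude(p_end_hpa) - pressure_to_altitude(p_start_hpa)


# A drop of 1 hPa near sea level corresponds to roughly 8 m of ascent.
climb_m = height_change(1013.25, 1012.25)
```

Only the difference between readings matters for tracking relative motion, so the absolute sea-level reference cancels out to a first approximation. Consumer barometers are noisy, and a real system would likely filter the readings before using them.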

[CV-63] Zero Shot Domain Adaptive Semantic Segmentation by Synthetic Data Generation and Progressive Adaptation IROS2025

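Quick read: This paper tackles zero-shot domain adaptive semantic segmentation, where no target-domain images are available and only a text description of the target domain's style is provided. The core challenges are distribution shift and layout distortion in synthetic data. The key to the solution is SDGPA (Synthetic Data Generation and Progressive Adaptation): first, a pretrained text-to-image diffusion model generates synthetic training data in the target style, with a crop, edit, and merge strategy over small patches improving spatial precision; second, an augmented intermediate domain is constructed to bridge the large domain gap and stabilize adaptation; finally, a progressive adaptation strategy makes learning robust to noise in the synthetic data. The method achieves state-of-the-art zero-shot semantic segmentation performance without any target-domain images.

Link: https://arxiv.org/abs/2508.03300
Authors: Jun Luo, Zijing Zhao, Yang Liu
Affiliations: Wangxuan Institute of Computer Technology, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IROS 2025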

Abstract:Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain’s style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at this https URL
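
The crop, edit, and merge step can be sketched on a toy image stored as nested lists. In the paper the per-patch edit is performed by a text-to-image diffusion model; here a trivial brightening function stands in for it, and the names `edit_by_patches` and `brighten` as well as the patch size are hypothetical.

```python
from typing import Callable, List

Image = List[List[int]]


def edit_by_patches(image: Image, patch: int,
                    edit: Callable[[Image], Image]) -> Image:
    """Crop the image into patch x patch tiles, edit each tile on its own,
    and paste the results back at their original positions."""
    height, width = len(image), len(image[0])
    merged = [row[:] for row in image]
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            edited = edit(tile)  # assumed to return a tile of the same shape
            for dy, row in enumerate(edited):
                merged[top + dy][left:left + len(row)] = row
    return merged


def brighten(tile: Image) -> Image:
    """Placeholder for a style-transfer edit applied to one tile."""
    return [[min(255, value + 10) for value in row] for row in tile]


source = [[x + 10 * y for x in range(6)] for y in range(4)]
styled = edit_by_patches(source, patch=4, edit=brighten)
```

Because each tile is pasted back at the coordinates it was cropped from, the global layout of the source image is preserved by construction, which is the property the segmentation labels depend on.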

[CV-64] Efficient Multi-Slide Visual-Language Feature Fusion for Placental Disease Classification

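Quick read: This paper addresses two challenges in placental disease diagnosis from whole slide images (WSIs): existing patch selection strategies struggle to balance performance against computational cost, and patch-level processing loses global histological context. The key to the solution is EmmPD, an Efficient multimodal framework for Patient-level placental disease Diagnosis, with two core components: a two-stage patch selection module that combines parameter-free and learnable compression to reduce computation while preserving critical features, and a hybrid multimodal fusion module that uses adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global context, enabling more accurate and robust diagnosis.

Link: https://arxiv.org/abs/2508.03277
Authors: Hang Guo, Qing Zhang, Zixuan Gao, Siyuan Yang, Shulin Peng, Xiang Tao, Ting Yu, Yan Wang, Qingli Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ACMMM’25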

Abstract:Accurate prediction of placental diseases via whole slide images (WSIs) is critical for preventing severe maternal and fetal complications. However, WSI analysis presents significant computational challenges due to the massive data volume. Existing WSI classification methods encounter critical limitations: (1) inadequate patch selection strategies that either compromise performance or fail to sufficiently reduce computational demands, and (2) the loss of global histological context resulting from patch-level processing approaches. To address these challenges, we propose an Efficient multimodal framework for Patient-level placental disease Diagnosis, named EmmPD. Our approach introduces a two-stage patch selection module that combines parameter-free and learnable compression strategies, optimally balancing computational efficiency with critical feature preservation. Additionally, we develop a hybrid multimodal fusion module that leverages adaptive graph learning to enhance pathological feature representation and incorporates textual medical reports to enrich global contextual understanding. Extensive experiments conducted on both a self-constructed patient-level Placental dataset and two public datasets demonstrate that our method achieves state-of-the-art diagnostic performance. The code is available at this https URL.

[CV-65] EgoPrompt: Prompt Pool Learning for Egocentric Action Recognition

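Quick read: This paper addresses a limitation of egocentric action recognition: existing methods treat the verb and noun components of an action as independent classification tasks, ignoring their semantic relationships and leading to fragmented representations and weak generalization. The key to the solution is EgoPrompt, a prompt-learning-based framework that builds a Unified Prompt Pool space in which verb and noun representations interact at the level of fine-grained patterns: the two component representations are first decomposed into fine-grained patterns in prompt-pair form and then fused through an attention mechanism to enable cross-component information exchange. A training objective called Diverse Pool Criteria, comprising Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization, keeps the prompt pool informative and improves performance and generalization across multiple datasets.

Link: https://arxiv.org/abs/2508.03266
Authors: Huaihai Lyu, Chaofan Chen, Yuheng Ji, Changsheng Xu
Affiliations: MAIS, Institute of Automation, Chinese Academy of Sciences; Peng Cheng Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None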

Abstract:Driven by the increasing demand for applications in augmented and virtual reality, egocentric action recognition has emerged as a prominent research area. It is typically divided into two subtasks: recognizing the performed behavior (i.e., verb component) and identifying the objects being acted upon (i.e., noun component) from the first-person perspective. However, most existing approaches treat these two components as independent classification tasks, focusing on extracting component-specific knowledge while overlooking their inherent semantic and contextual relationships, leading to fragmented representations and sub-optimal generalization capability. To address these challenges, we propose a prompt learning-based framework, EgoPrompt, to conduct the egocentric action recognition task. Building on the existing prompting strategy to capture the component-specific knowledge, we construct a Unified Prompt Pool space to establish interaction between the two types of component representations. Specifically, the component representations (from verbs and nouns) are first decomposed into fine-grained patterns with the prompt pair form. Then, these pattern-level representations are fused through an attention-based mechanism to facilitate cross-component interaction. To ensure the prompt pool is informative, we further introduce a novel training objective, Diverse Pool Criteria. This objective realizes our goals from two perspectives: Prompt Selection Frequency Regularization and Prompt Knowledge Orthogonalization. Extensive experiments are conducted on the Ego4D, EPIC-Kitchens, and EGTEA datasets. The results consistently show that EgoPrompt achieves state-of-the-art performance across within-dataset, cross-dataset, and base-to-novel generalization benchmarks.
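
One plausible reading of an orthogonalization objective over a prompt pool is a penalty on pairwise cosine similarity between prompt vectors. The abstract does not give EgoPrompt's exact formulation, so the sketch below is an assumption for illustration only, and the function name and toy pools are hypothetical.

```python
import math
from typing import List


def orthogonality_penalty(prompts: List[List[float]]) -> float:
    """Sum of squared cosine similarities over all distinct prompt pairs.
    Zero when the prompts are mutually orthogonal."""
    def unit(vector: List[float]) -> List[float]:
        norm = math.sqrt(sum(x * x for x in vector))
        return [x / norm for x in vector]

    units = [unit(p) for p in prompts]
    penalty = 0.0
    for i in range(len(units)):
        for j in range(i + 1, len(units)):
            cosine = sum(a * b for a, b in zip(units[i], units[j]))
            penalty += cosine * cosine
    return penalty


diverse_pool = [[1.0, 0.0], [0.0, 1.0]]
redundant_pool = [[1.0, 0.0], [1.0, 0.0]]
```

Minimizing such a term pushes prompts toward encoding distinct knowledge, while a separate frequency regularizer (not sketched here) would discourage the pool from collapsing onto a few frequently selected prompts.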

[CV-66] Beyond Isolated Words: Diffusion Brush for Handwritten Text-Line Generation ICCV2025

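Quick read: This paper addresses two core challenges in handwritten text-line generation: accurately modeling complex style patterns that span intra-word and inter-word relationships, and maintaining content accuracy across many characters. Existing methods mostly generate isolated words and cannot capture line-level structure such as vertical alignment and horizontal spacing. The authors propose DiffBrush, a diffusion-based model for handwritten text-line generation built on two strategies: (1) content-decoupled style learning, which uses column-wise and row-wise masking to separate style from content and better capture intra-word and inter-word style; and (2) multi-scale content learning, which uses line-level and word-level discriminators to ensure global coherence and local content accuracy. Experiments show that DiffBrush excels at style reproduction and content preservation.

Link: https://arxiv.org/abs/2508.03256
Authors: Gang Dai, Yifan Zhang, Yutao Qin, Qiangya Guo, Shuangping Huang, Shuicheng Yan
Affiliations: South China University of Technology; MiroMind AI; National University of Singapore; Pazhou Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ICCV2025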

Abstract:Existing handwritten text generation methods primarily focus on isolated words. However, realistic handwritten text demands attention not only to individual words but also to the relationships between them, such as vertical alignment and horizontal spacing. Therefore, generating entire text lines emerges as a more promising and comprehensive task. However, this task poses significant challenges, including the accurate modeling of complex style patterns encompassing both intra- and inter-word relationships, and maintaining content accuracy across numerous characters. To address these challenges, we propose DiffBrush, a novel diffusion-based model for handwritten text-line generation. Unlike existing methods, DiffBrush excels in both style imitation and content accuracy through two key strategies: (1) content-decoupled style learning, which disentangles style from content to better capture intra-word and inter-word style patterns by using column- and row-wise masking; and (2) multi-scale content learning, which employs line and word discriminators to ensure global coherence and local accuracy of textual content. Extensive experiments show that DiffBrush excels in generating high-quality text lines, particularly in style reproduction and content preservation. Code is available at this https URL.

[CV-67] V.I.P. : Iterative Online Preference Distillation for Efficient Video Diffusion Models ICCV2025

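Quick read: This paper addresses the high computational cost of text-to-video (T2V) models in resource-constrained environments, and in particular the mode collapse that arises when existing distillation methods rely on supervised fine-tuning (SFT): a pruned student with reduced capacity cannot directly match the teacher's outputs, so quality degrades. The key to the solution is ReDPO, a distillation method that combines Direct Preference Optimization (DPO) with SFT. DPO guides the student to focus on recovering only the targeted properties rather than passively imitating the teacher, while SFT improves overall generation performance. The paper also proposes V.I.P., a framework for filtering and curating high-quality paired datasets, together with a step-by-step online approach for calibrated training.

Link: https://arxiv.org/abs/2508.03254
Authors: Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, Youngjae Yu
Affiliations: Yonsei University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: ICCV2025 accepted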

Abstract:With growing interest in deploying text-to-video (T2V) models in resource-constrained environments, reducing their high computational cost has become crucial, leading to extensive research on pruning and knowledge distillation methods while maintaining performance. However, existing distillation methods primarily rely on supervised fine-tuning (SFT), which often leads to mode collapse as pruned models with reduced capacity fail to directly match the teacher’s outputs, ultimately resulting in degraded quality. To address this challenge, we propose an effective distillation method, ReDPO, that integrates DPO and SFT. Our approach leverages DPO to guide the student model to focus on recovering only the targeted properties, rather than passively imitating the teacher, while also utilizing SFT to enhance overall performance. We additionally propose V.I.P., a novel framework for filtering and curating high-quality pair datasets, along with a step-by-step online approach for calibrated training. We validate our method on two leading T2V models, VideoCrafter2 and AnimateDiff, achieving parameter reduction of 36.2% and 67.5% each, while maintaining or even surpassing the performance of full models. Further experiments demonstrate the effectiveness of both ReDPO and V.I.P. framework in enabling efficient and high-quality video generation. Our code and videos are available at this https URL.
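
The combination of DPO and SFT can be sketched with the standard DPO objective on scalar log-probabilities plus a weighted SFT term. How ReDPO actually weights or schedules the two terms is not stated in the abstract, so `sft_weight`, `beta`, and the placeholder log-probabilities are assumptions.

```python
import math


def dpo_loss(logp_win: float, logp_lose: float,
             ref_logp_win: float, ref_logp_lose: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair:
    -log(sigmoid(beta * (policy log-ratio margin over the reference)))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def combined_loss(dpo: float, sft_nll: float, sft_weight: float = 1.0) -> float:
    """Hypothetical weighted sum of the preference term and the SFT term."""
    return dpo + sft_weight * sft_nll


# Placeholder log-probabilities for a preferred and a rejected sample.
neutral = dpo_loss(-1.0, -2.0, -1.0, -2.0)   # student equals the reference
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)  # student favours the preferred sample
total = combined_loss(improved, sft_nll=0.5, sft_weight=0.2)
```

When the student matches the reference model the DPO term equals log 2, and it falls as the student assigns relatively more probability to the preferred sample. For diffusion models the log-probabilities are not available in closed form and have to be approximated, which this sketch leaves out.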

[CV-68] Robust Single-Stage Fully Sparse 3D Object Detection via Detachable Latent Diffusion

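Quick read: This paper addresses the inefficiency of existing 3D object detectors based on diffusion probabilistic models, which require multi-step iteration at inference time, along with missing center features and limited robustness in sparse representations. The core solution is RSDNet, a single-stage fully sparse 3D object detection network built around a Detachable Latent Framework (DLF). Lightweight multi-level denoising autoencoders (DAEs) learn the denoising process in latent feature space, and because the denoising network is detachable, inference runs in a single step. DLF also reformulates the noising and denoising mechanisms to construct multi-type, multi-level noise samples and targets, improving robustness to perturbations, and a semantic-geometric conditional guidance helps perceive object boundaries and shapes, alleviating the missing-center-feature problem in sparse representations.

Link: https://arxiv.org/abs/2508.03252
Authors: Wentao Qu, Guofeng Mei, Jing Wang, Yujiao Wu, Xiaoshui Huang, Liang Xiao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None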

Abstract:Denoising Diffusion Probabilistic Models (DDPMs) have shown success in robust 3D object detection tasks. Existing methods often rely on the score matching from 3D boxes or pre-trained diffusion priors. However, they typically require multi-step iterations in inference, which limits efficiency. To address this, we propose a Robust single-stage fully Sparse 3D object Detection Network with a Detachable Latent Framework (DLF) of DDPMs, named RSDNet. Specifically, RSDNet learns the denoising process in latent feature spaces through lightweight denoising networks like multi-level denoising autoencoders (DAEs). This enables RSDNet to effectively understand scene distributions under multi-level perturbations, achieving robust and reliable detection. Meanwhile, we reformulate the noising and denoising mechanisms of DDPMs, enabling DLF to construct multi-type and multi-level noise samples and targets, enhancing RSDNet robustness to multiple perturbations. Furthermore, a semantic-geometric conditional guidance is introduced to perceive the object boundaries and shapes, alleviating the center feature missing problem in sparse representations, enabling RSDNet to perform in a fully sparse detection pipeline. Moreover, the detachable denoising network design of DLF enables RSDNet to perform single-step detection in inference, further enhancing detection efficiency. Extensive experiments on public benchmarks show that RSDNet can outperform existing methods, achieving state-of-the-art detection.

[CV-69] Ultralight Polarity-Split Neuromorphic SNN for Event-Stream Super-Resolution

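Quick read: This paper addresses the limited spatial resolution of event cameras, which constrains fine-grained perception tasks. The core solution is an ultra-lightweight, stream-based event-to-event super-resolution method built on Spiking Neural Networks (SNNs), with two key innovations: a Dual-Forward Polarity-Split Event Encoding strategy, which passes positive and negative events separately through a shared SNN to reduce model size; and a Learnable Spatio-temporal Polarity-aware Loss (LearnSTPLoss), which uses learnable uncertainty-based weights to adaptively balance temporal, spatial, and polarity consistency. The method achieves competitive super-resolution performance while significantly reducing model size and inference time.

Link: https://arxiv.org/abs/2508.03244
Authors: Chuanzhi Xu, Haoxian Zhou, Langyi Chen, Yuk Ying Chung, Qiang Qu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: None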

Abstract:Event cameras offer unparalleled advantages such as high temporal resolution, low latency, and high dynamic range. However, their limited spatial resolution poses challenges for fine-grained perception tasks. In this work, we propose an ultra-lightweight, stream-based event-to-event super-resolution method based on Spiking Neural Networks (SNNs), designed for real-time deployment on resource-constrained devices. To further reduce model size, we introduce a novel Dual-Forward Polarity-Split Event Encoding strategy that decouples positive and negative events into separate forward paths through a shared SNN. Furthermore, we propose a Learnable Spatio-temporal Polarity-aware Loss (LearnSTPLoss) that adaptively balances temporal, spatial, and polarity consistency using learnable uncertainty-based weights. Experimental results demonstrate that our method achieves competitive super-resolution performance on multiple datasets while significantly reducing model size and inference time. The lightweight design enables embedding the module into event cameras or using it as an efficient front-end preprocessing for downstream vision tasks.
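
A common way to balance several loss terms with learnable uncertainty-based weights is the homoscedastic-uncertainty formulation, where each term is scaled by exp(-s) and regularized by s. Whether LearnSTPLoss uses exactly this form is an assumption; the function name and the example values are illustrative.

```python
import math
from typing import Sequence


def uncertainty_weighted_loss(losses: Sequence[float],
                              log_vars: Sequence[float]) -> float:
    """Combine loss terms as sum(exp(-s_i) * L_i + s_i), where each s_i is a
    learnable log-variance (plain floats here; parameters in a real model)."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))


# Three terms standing in for temporal, spatial, and polarity consistency.
terms = [1.0, 2.0, 3.0]
equal_weights = uncertainty_weighted_loss(terms, [0.0, 0.0, 0.0])
down_weighted = uncertainty_weighted_loss([10.0], [1.0])
```

With all log-variances at zero the result is the plain sum of the terms. Raising a log-variance shrinks the weight of its term, and the additive `s` keeps the optimizer from driving every weight to zero.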

[CV-70] MVTOP: Multi-View Transformer-based Object Pose-Estimation

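Quick read: This paper addresses pose ambiguity in multi-view rigid object pose estimation: single-view methods cannot distinguish certain orientations of objects that look alike, and post-processing of single-view poses cannot fuse information across views effectively enough to resolve the ambiguity. The key to the solution is MVTOP, an end-to-end trainable Transformer-based model that fuses view-specific features early and models multi-view geometry through lines of sight emanating from the camera centers, integrating global multi-view information in a single framework to estimate object poses reliably.

Link: https://arxiv.org/abs/2508.03243
Authors: Lukas Ranftl, Felix Brendel, Bertram Drost, Carsten Steger
Affiliations: MVTec Software GmbH; Technical University of Munich
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures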

Abstract:We present MVTOP, a novel transformer-based method for multi-view rigid object pose estimation. Through an early fusion of the view-specific features, our method can resolve pose ambiguities that would be impossible to solve with a single view or with a post-processing of single-view poses. MVTOP models the multi-view geometry via lines of sight that emanate from the respective camera centers. While the method assumes the camera interior and relative orientations are known for a particular scene, they can vary for each inference. This makes the method versatile. The use of the lines of sight enables MVTOP to predict the correct pose with the merged multi-view information. To show the model’s capabilities, we provide a synthetic data set that can only be solved with such holistic multi-view approaches since the poses in the dataset cannot be solved with just one view. Our method outperforms single-view and all existing multi-view approaches on our dataset and achieves competitive results on the YCB-V dataset. To the best of our knowledge, no holistic multi-view method exists that can resolve such pose ambiguities reliably. Our model is end-to-end trainable and does not require any additional data, e.g., depth.
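
A line of sight for a pixel can be computed from the camera's interior orientation and pose with a standard pinhole model, as in the sketch below. How MVTOP encodes these lines as network input is not described in the abstract; the function name, the world-to-camera convention for `rotation`, and the example intrinsics are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def line_of_sight(u: float, v: float, fx: float, fy: float, cx: float, cy: float,
                  rotation: List[Vector], center: Vector) -> Tuple[Vector, Vector]:
    """Return (origin, unit direction) in world coordinates for pixel (u, v).
    `rotation` is the 3x3 world-to-camera rotation; `center` is the camera
    center in world coordinates. Lens distortion is ignored."""
    direction_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate into the world frame with the transpose of the rotation matrix.
    direction_world = [sum(rotation[k][i] * direction_cam[k] for k in range(3))
                       for i in range(3)]
    norm = math.sqrt(sum(c * c for c in direction_world))
    return center, [c / norm for c in direction_world]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
origin, direction = line_of_sight(320.0, 240.0, 500.0, 500.0, 320.0, 240.0,
                                  identity, [0.0, 0.0, 0.0])
```

Lines from different cameras that observe the same 3D point intersect (or nearly intersect) at that point, which is the geometric cue that lets a multi-view model relate features across views.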

[CV-71] FFHQ-Makeup: Paired Synthetic Makeup Dataset with Facial Consistency Across Multiple Styles

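Quick read: This paper addresses the scarcity of high-quality paired bare-makeup image datasets, a bottleneck for beauty-related tasks such as virtual try-on, facial privacy protection, and facial aesthetics analysis. Existing synthetic approaches either rely on warping-based transformations that distort facial geometry or use text-to-image generation that alters identity and expression. The key to the solution is the FFHQ-Makeup construction pipeline, which uses an improved makeup transfer method that disentangles identity and makeup to transfer real-world makeup styles onto 18,000 identities while preserving identity and expression. Each identity is paired with 5 makeup styles, yielding 90,000 high-quality image pairs.

Link: https://arxiv.org/abs/2508.03241
Authors: Xingchao Yang, Shiori Ueda, Yuantian Huang, Tomoya Akiyama, Takafumi Taketomi
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project: this https URL, Datasets: this https URL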

Abstract:Paired bare-makeup facial images are essential for a wide range of beauty-related tasks, such as virtual try-on, facial privacy protection, and facial aesthetics analysis. However, collecting high-quality paired makeup datasets remains a significant challenge. Real-world data acquisition is constrained by the difficulty of collecting large-scale paired images, while existing synthetic approaches often suffer from limited realism or inconsistencies between bare and makeup images. Current synthetic methods typically fall into two categories: warping-based transformations, which often distort facial geometry and compromise the precision of makeup; and text-to-image generation, which tends to alter facial identity and expression, undermining consistency. In this work, we present FFHQ-Makeup, a high-quality synthetic makeup dataset that pairs each identity with multiple makeup styles while preserving facial consistency in both identity and expression. Built upon the diverse FFHQ dataset, our pipeline transfers real-world makeup styles from existing datasets onto 18K identities by introducing an improved makeup transfer method that disentangles identity and makeup. Each identity is paired with 5 different makeup styles, resulting in a total of 90K high-quality bare-makeup image pairs. To the best of our knowledge, this is the first work that focuses specifically on constructing a makeup dataset. We hope that FFHQ-Makeup fills the gap of lacking high-quality bare-makeup paired datasets and serves as a valuable resource for future research in beauty-related tasks.

[CV-72] Zero-shot Shape Classification of Nanoparticles in SEM Images using Vision Foundation Models

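Quick read: This paper addresses the dependence of conventional deep learning methods for nanoparticle morphology characterization on large labeled datasets and heavy computation, which limits their accessibility in research and industrial settings. The key to the solution is a zero-shot classification pipeline built on two vision foundation models, the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding, combined with a lightweight classifier. Without extensive parameter fine-tuning, it achieves high-precision shape classification on three morphologically diverse nanoparticle datasets and is robust to small datasets, subtle morphological variations, and the domain shift from natural to scientific images.

Link: https://arxiv.org/abs/2508.03235
Authors: Freida Barnatan, Emunah Goldstein, Einav Kalimian, Orchen Madar, Avi Huri, David Zitoun, Ya’akov Mandelbaum, Moshe Amitay
Affiliations: Jerusalem College of Technology; Bar-Ilan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None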

Abstract:Accurate and efficient characterization of nanoparticle morphology in Scanning Electron Microscopy (SEM) images is critical for ensuring product quality in nanomaterial synthesis and accelerating development. However, conventional deep learning methods for shape classification require extensive labeled datasets and computationally demanding training, limiting their accessibility to the typical nanoparticle practitioner in research and industrial settings. In this study, we introduce a zero-shot classification pipeline that leverages two vision foundation models: the Segment Anything Model (SAM) for object segmentation and DINOv2 for feature embedding. By combining these models with a lightweight classifier, we achieve high-precision shape classification across three morphologically diverse nanoparticle datasets - without the need for extensive parameter fine-tuning. Our methodology outperforms a fine-tuned YOLOv11 and ChatGPT o4-mini-high baselines, demonstrating robustness to small datasets, subtle morphological variations, and domain shifts from natural to scientific imaging. Quantitative clustering metrics on PCA plots of the DINOv2 features are discussed as a means of assessing the progress of the chemical synthesis. This work highlights the potential of foundation models to advance automated microscopy image analysis, offering an alternative to traditional deep learning pipelines in nanoparticle research which is both more efficient and more accessible to the user.

[CV-73] Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing

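Quick read: This paper addresses problems in lifting 2D visual segmentation to 3D with Gaussian Splatting: 2D masks are inconsistent across viewpoints, and segmentation boundaries are noisy as a result. Existing methods neglect the use of semantic cues to refine the learned Gaussians and struggle to produce clean, consistent 3D segmentation. The key to the solution is Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views and exploits the inherent 3D consistency of Gaussians to identify and correct 2D segmentation inconsistencies. A GIT-guided adaptive density control mechanism splits and prunes ambiguous Gaussians during training, sharpening 2D and 3D segmentation boundaries and making them more coherent.

Link: https://arxiv.org/abs/2508.03227
Authors: Hongyu Shen, Junfeng Ni, Yixin Chen, Weishuo Li, Mingtao Pei, Siyuan Huang
Affiliations: Beijing Institute of Technology; State Key Laboratory of General Artificial Intelligence, BIGAI; Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: None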

Abstract:We address the challenge of lifting 2D visual segmentation to 3D in Gaussian Splatting. Existing methods often suffer from inconsistent 2D masks across viewpoints and produce noisy segmentation boundaries as they neglect these semantic cues to refine the learned Gaussians. To overcome this, we introduce Gaussian Instance Tracing (GIT), which augments the standard Gaussian representation with an instance weight matrix across input views. Leveraging the inherent consistency of Gaussians in 3D, we use this matrix to identify and correct 2D segmentation inconsistencies. Furthermore, since each Gaussian ideally corresponds to a single object, we propose a GIT-guided adaptive density control mechanism to split and prune ambiguous Gaussians during training, resulting in sharper and more coherent 2D and 3D segmentation boundaries. Experimental results show that our method extracts clean 3D assets and consistently improves 3D segmentation in both online (e.g., self-prompting) and offline (e.g., contrastive lifting) settings, enabling applications such as hierarchical segmentation, object extraction, and scene editing.

[CV-74] BadBlocks: Low-Cost and Stealthy Backdoor Attacks Tailored for Text-to-Image Diffusion Models

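Quick read: This paper studies the vulnerability of diffusion models to backdoor attacks during training, focusing on the limitations of existing attacks: high computational cost and easy detection by advanced defenses. It presents BadBlocks, a backdoor attack that needs only about 30% of the computational resources and 20% of the GPU time of previous attacks. BadBlocks selectively contaminates specific blocks within the UNet architecture while the rest of the network keeps functioning normally, which lowers the barrier to attack and bypasses attention-based detection defenses. Experiments show that BadBlocks achieves a high attack success rate (ASR) with low perceptual quality loss (measured by FID score) under tightly constrained resources, and they highlight the pivotal role of certain network layers in backdoor injection, exposing a new security risk for large-scale diffusion models.

Link: https://arxiv.org/abs/2508.03221
Authors: Yu Pan, Jiahao Chen, Lin Wang, Bingrong Dai, Yi Du
Affiliations: Shanghai Polytechnic University; Shanghai Development Center of Computer Software Technology
Categories: Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: None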

Abstract:In recent years, diffusion models have achieved remarkable progress in the field of image generation. However, recent studies have shown that diffusion models are susceptible to backdoor attacks, in which attackers can manipulate the output by injecting covert triggers such as specific visual patterns or textual phrases into the training data. With the continuous advancement of defense techniques, defenders have become increasingly capable of identifying and mitigating most backdoor attacks using visual inspection and neural network-based detection. In this paper, we identify a novel type of backdoor threat that is more lightweight and covert than existing approaches, which we name BadBlocks. It requires only about 30% of the computational resources and 20% of the GPU time typically needed by previous backdoor attacks, yet it successfully injects backdoors and evades the most advanced defenses. BadBlocks enables attackers to selectively contaminate specific blocks within the UNet architecture of diffusion models while maintaining normal functionality in the rest of the network. Experimental results demonstrate that BadBlocks achieves a high attack success rate (ASR) and low perceptual quality loss (as measured by FID Score), even under extremely constrained computational resources and GPU time. Moreover, BadBlocks is able to bypass existing defense frameworks, especially the attention-based backdoor detection method, highlighting it as a novel and noteworthy threat. Further studies demonstrate that effective backdoor injection does not require fine-tuning the entire network and highlight the pivotal role of certain neural network layers in backdoor injection. BadBlocks significantly reduces the barrier to conducting backdoor attacks, enabling attackers to inject backdoors into large-scale diffusion models even using consumer-grade GPUs.

[CV-75] ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow

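[Quick Read]: This paper addresses the low precision of low-level action estimation in learning-based robot manipulation, which has become the key bottleneck for overall manipulation performance. The core of the solution is a new framework, ActionSink, whose key innovation is to reformulate robot actions as “action flow”, that is, action-caused optical flow extracted from videos in a self-supervised manner. A coarse-to-fine action flow matcher refines the flow through iterative retrieval and denoising. A dynamic action flow integrator uses a working memory pool to manage historical action flows efficiently and a multi-layer fusion module to combine the direct estimate with current and historical action flows, which markedly improves action estimation accuracy.

Link: https://arxiv.org/abs/2508.03218
Authors: Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin, Xiaodan Liang
Affiliations: Northeastern University; Shenzhen Campus of Sun Yat-Sen University; Pengcheng Laboratory; Huawei Noah’s Ark Lab
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: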

Abstract:Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While the challenges in high-level perception and planning are continually addressed along the progress of general large pre-trained models, the low precision of low-level action estimation has emerged as the key limiting factor in manipulation performance. To this end, this paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimations in the field of learning-based robot manipulation. As the name suggests, ActionSink reformulates the actions of robots as action-caused optical flows from videos, called “action flow”, in a self-supervised manner, which are then used to be retrieved and integrated to enhance the action estimation. Specifically, ActionSink incorporates two primary modules. The first module is a coarse-to-fine action flow matcher, which continuously refines the accuracy of action flow via iterative retrieval and denoising process. The second module is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows that should be used to integrate to enhance the current action estimation. In this module, a multi-layer fusion module is proposed to integrate direct estimation and action flows from both the current and the working memory, achieving highly accurate action estimation through a series of estimation-integration processes. Our ActionSink framework outperformed prior SOTA on the LIBERO benchmark by a 7.9% success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.

[CV-76] The Power of Many: Synergistic Unification of Diverse Augmentations for Efficient Adversarial Robustness

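[Quick Read]: This paper addresses the vulnerability of deep learning models to adversarial perturbations while overcoming the high computational cost and degraded standard performance of existing Adversarial Training (AT) methods. Because augmentation-based defenses either yield limited robustness gains or incur substantial training overhead, the authors propose the Universal Adversarial Augmenter (UAA) framework. Its core innovation is to pre-compute a universal transformation offline, which decouples the expensive perturbation generation from model training, so that a unique perturbation can be generated efficiently for each sample during training without producing adversarial examples online. This significantly improves training efficiency and achieves state-of-the-art (SOTA) robustness among data-augmentation-based defenses.

Link: https://arxiv.org/abs/2508.03213
Authors: Wang Yu-Hang, Shiwei Li, Jianxiang Liao, Li Bohan, Jian Liu, Wenfei Yin
Affiliations: Hefei University of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 13 pages, 2 figures, 6 tables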

Abstract:Adversarial perturbations pose a significant threat to deep learning models. Adversarial Training (AT), the predominant defense method, faces challenges of high computational costs and a degradation in standard performance. While data augmentation offers an alternative path, existing techniques either yield limited robustness gains or incur substantial training overhead. Therefore, developing a defense mechanism that is both highly efficient and strongly robust is of paramount importance. In this work, we first conduct a systematic analysis of existing augmentation techniques, revealing that the synergy among diverse strategies, rather than any single method, is crucial for enhancing robustness. Based on this insight, we propose the Universal Adversarial Augmenter (UAA) framework, which is characterized by its plug-and-play nature and training efficiency. UAA decouples the expensive perturbation generation process from model training by pre-computing a universal transformation offline, which is then used to efficiently generate unique adversarial perturbations for each sample during training. Extensive experiments conducted on multiple benchmarks validate the effectiveness of UAA. The results demonstrate that UAA establishes a new state-of-the-art (SOTA) for data-augmentation-based adversarial defense strategies, without requiring the online generation of adversarial examples during training. This framework provides a practical and efficient pathway for building robust models. Our code is available in the supplementary materials.

[CV-77] GeoShield: Safeguarding Geolocation Privacy from Vision-Language Models via Adversarial Perturbations

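[Quick Read]: This paper addresses the geoprivacy risk posed by Vision-Language Models (VLMs) such as GPT-4o, which can accurately infer locations from images that users share publicly. Existing adversarial perturbation methods perform poorly on high-resolution images and under low perturbation budgets, and may introduce irrelevant semantic content, so they fall short in practice. The key to the solution is the GeoShield framework, built from three modules: a feature disentanglement module that separates geographical from non-geographical information, an exposure element identification module that pinpoints the image regions leaking geographic information, and a scale-adaptive enhancement module that jointly optimizes perturbations at the global and local levels, giving effective and inconspicuous protection across resolutions. Experiments show that GeoShield clearly outperforms existing methods in black-box settings and provides strong geoprivacy protection with minimal impact on visual and semantic quality.

Link: https://arxiv.org/abs/2508.03209
Authors: Xinwei Liu, Xiaojun Jia, Yuan Xun, Simeng Qin, Xiaochun Cao
Affiliations: Institute of Information Engineering, CAS; School of Cyberspace Security, UCAS; Nanyang Technological University; Northeastern University; Sun Yat-sen University, Shenzhen Campus
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: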

Abstract:Vision-Language Models (VLMs) such as GPT-4o now demonstrate a remarkable ability to infer users’ locations from public shared images, posing a substantial risk to geoprivacy. Although adversarial perturbations offer a potential defense, current methods are ill-suited for this scenario: they often perform poorly on high-resolution images and low perturbation budgets, and may introduce irrelevant semantic content. To address these limitations, we propose GeoShield, a novel adversarial framework designed for robust geoprivacy protection in real-world scenarios. GeoShield comprises three key modules: a feature disentanglement module that separates geographical and non-geographical information, an exposure element identification module that pinpoints geo-revealing regions within an image, and a scale-adaptive enhancement module that jointly optimizes perturbations at both global and local levels to ensure effectiveness across resolutions. Extensive experiments on challenging benchmarks show that GeoShield consistently surpasses prior methods in black-box settings, achieving strong privacy protection with minimal impact on visual or semantic quality. To our knowledge, this work is the first to explore adversarial perturbations for defending against geolocation inference by advanced VLMs, providing a practical and effective solution to escalating privacy concerns.
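
The notion of a perturbation budget can be made concrete with a small sketch. It assumes an L-infinity budget `epsilon` and pixel values in [0, 1]; both are illustrative assumptions, since the abstract does not say which norm GeoShield uses, and `project_perturbation` is a hypothetical helper, not part of the paper's code.

```python
def project_perturbation(image, perturbed, epsilon):
    """Clip a perturbed image so it stays within +/- epsilon of the original.

    Assumption: an L-infinity budget and pixel values in [0, 1].
    Both images are flat lists of floats of equal length.
    """
    projected = []
    for orig, adv in zip(image, perturbed):
        # Limit the per-pixel change to the budget.
        delta = max(-epsilon, min(epsilon, adv - orig))
        # Keep the result a valid pixel value.
        projected.append(max(0.0, min(1.0, orig + delta)))
    return projected


image = [0.2, 0.5, 0.99]
perturbed = [0.5, 0.49, 1.2]
protected = project_perturbation(image, perturbed, epsilon=0.05)
print(protected)
```

A smaller `epsilon` keeps the protected image closer to the original, which is why low budgets make the defense problem harder.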

[CV-78] Open-Vocabulary HOI Detection with Interaction-aware Prompt and Concept Calibration

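[Quick Read]: This paper addresses the limited generalization of open-vocabulary Human-Object Interaction (HOI) detection, in particular the difficulty of recognizing interaction classes unseen during training. Methods built on Vision and Language Models (VLMs) hit a bottleneck because their image encoders are mismatched to fine-grained, region-level interaction detection and their text descriptions capture visual appearance poorly. The solution is INP-CC (INteraction-aware Prompting with Concept Calibration), an end-to-end open-vocabulary HOI detector with two core innovations: 1) an interaction-aware prompt generator that dynamically produces a compact set of prompts from the input scene, so that similar interactions share prompts and attention is directed to key interaction patterns rather than generic image-level semantics; 2) language-model-guided concept calibration, which refines HOI concept representations by analyzing visual similarities across categories, combined with a negative sampling strategy that improves cross-modal similarity modeling and helps separate visually similar but semantically distinct actions.

Link: https://arxiv.org/abs/2508.03207
Authors: Ting Lei, Shaofeng Yin, Qingchao Chen, Yuxin Peng, Yang Liu
Affiliations: Wangxuan Institute of Computer Technology, Peking University; National Institute of Health Data Science, Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: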

Abstract:Open Vocabulary Human-Object Interaction (HOI) detection aims to detect interactions between humans and objects while generalizing to novel interaction classes beyond the training set. Current methods often rely on Vision and Language Models (VLMs) but face challenges due to suboptimal image encoders, as image-level pre-training does not align well with the fine-grained region-level interaction detection required for HOI. Additionally, effectively encoding textual descriptions of visual appearances remains difficult, limiting the model’s ability to capture detailed HOI relationships. To address these issues, we propose INteraction-aware Prompting with Concept Calibration (INP-CC), an end-to-end open-vocabulary HOI detector that integrates interaction-aware prompts and concept calibration. Specifically, we propose an interaction-aware prompt generator that dynamically generates a compact set of prompts based on the input scene, enabling selective sharing among similar interactions. This approach directs the model’s attention to key interaction patterns rather than generic image-level semantics, enhancing HOI detection. Furthermore, we refine HOI concept representations through language model-guided calibration, which helps distinguish diverse HOI concepts by investigating visual similarities across categories. A negative sampling strategy is also employed to improve inter-modal similarity modeling, enabling the model to better differentiate visually similar but semantically distinct actions. Extensive experimental results demonstrate that INP-CC significantly outperforms state-of-the-art models on the SWIG-HOI and HICO-DET datasets. Code is available at this https URL.

[CV-79] AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding

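[Quick Read]: This paper addresses the weak cross-modal reasoning of weakly supervised visual grounding (WSVG) caused by category-level and attribute-level semantic ambiguity, which leaves existing methods unable to distinguish subtle semantic differences in text descriptions. The solution is AlignCAT, a query-based semantic matching framework with two core modules. A coarse-grained alignment module uses category information and global context to strengthen visual-linguistic alignment and suppress interference from category-inconsistent objects. A fine-grained alignment module then focuses on descriptive information and word-level text features to enforce attribute consistency. By exploiting linguistic cues fully, the framework progressively filters out misaligned visual queries and makes contrastive learning more efficient. Its advantage is verified on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.

Link: https://arxiv.org/abs/2508.03201
Authors: Yidan Wang, Chenyi Zhuang, Wutao Liu, Pan Gao, Nicu Sebe
Affiliations: Nanjing University of Aeronautics and Astronautics; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: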

Abstract:Weakly supervised visual grounding (VG) aims to locate objects in images based on text descriptions. Despite significant progress, existing methods lack strong cross-modal reasoning to distinguish subtle semantic differences in text expressions due to category-based and attribute-based ambiguity. To address these challenges, we introduce AlignCAT, a novel query-based semantic matching framework for weakly supervised VG. To enhance visual-linguistic alignment, we propose a coarse-grained alignment module that utilizes category information and global context, effectively mitigating interference from category-inconsistent objects. Subsequently, a fine-grained alignment module leverages descriptive information and captures word-level text features to achieve attribute consistency. By exploiting linguistic cues to their fullest extent, our proposed AlignCAT progressively filters out misaligned visual queries and enhances contrastive learning efficiency. Extensive experiments on three VG benchmarks, namely RefCOCO, RefCOCO+, and RefCOCOg, verify the superiority of AlignCAT against existing weakly supervised methods on two VG tasks. Our code is available at: this https URL.

[CV-80] Neovascularization Segmentation via a Multilateral Interaction-Enhanced Graph Convolutional Network

[Quick Read]: This paper addresses accurate segmentation of choroidal neovascularization (CNV) in optical coherence tomography angiography (OCTA) images for wet age-related macular degeneration (wet AMD). Existing methods are limited by irregular CNV shapes, imaging artifacts (projection artifacts, noise, and blurred boundaries), and the lack of publicly annotated datasets. The solution has two parts. First, the authors build CNVSeg, the first publicly available CNV segmentation dataset. Second, they propose MTG-Net, a multilateral graph convolutional interaction-enhanced CNV segmentation network, whose main innovations are: 1) a multi-task framework that decouples the image into three task-specific feature maps to capture the geometry of lesion regions and vessels; 2) two graph-based cross-task modules, Multilateral Interaction Graph Reasoning (MIGR) and Multilateral Reinforcement Graph Reasoning (MRGR), which iteratively reason about higher-order relationships between tasks so that the task objectives complement each other; 3) an uncertainty-weighted loss that reduces the impact of artifacts and noise. In experiments MTG-Net reaches Dice scores of 87.21% for region segmentation and 88.12% for vessel segmentation, outperforming existing methods.

Link: https://arxiv.org/abs/2508.03197
Authors: Tao Chen, Dan Zhang, Da Chen, Huazhu Fu, Kai Jin, Shanshan Wang, Laurent D. Cohen, Yitian Zhao, Quanyong Yi, Jiong Zhang
Affiliations: Ningbo Institute of Materials Technology and Engineering, Chinese Academy of Sciences; Ningbo University of Technology; Qilu University of Technology, Shandong Academy of Sciences; Institute of High Performance Computing, A*STAR; Zhejiang University, The Second Affiliated Hospital; Shenzhen Institute of Advanced Technology, Chinese Academy of Science; University Paris Dauphine, PSL Research University, CNRS, UMR 7534, CEREMADE; Ningbo Eye Hospital, Wenzhou Medical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Choroidal neovascularization (CNV), a primary characteristic of wet age-related macular degeneration (wet AMD), represents a leading cause of blindness worldwide. In clinical practice, optical coherence tomography angiography (OCTA) is commonly used for studying CNV-related pathological changes, due to its micron-level resolution and non-invasive nature. Thus, accurate segmentation of CNV regions and vessels in OCTA images is crucial for clinical assessment of wet AMD. However, challenges exist due to irregular CNV shapes and imaging limitations such as projection artifacts, noise, and boundary blurring. Moreover, the lack of publicly available datasets constrains CNV analysis. To address these challenges, this paper constructs the first publicly accessible CNV dataset (CNVSeg), and proposes a novel multilateral graph convolutional interaction-enhanced CNV segmentation network (MTG-Net). This network integrates both region and vessel morphological information, exploring semantic and geometric duality constraints within the graph domain. Specifically, MTG-Net consists of a multi-task framework and two graph-based cross-task modules: Multilateral Interaction Graph Reasoning (MIGR) and Multilateral Reinforcement Graph Reasoning (MRGR). The multi-task framework encodes rich geometric features of lesion shapes and surfaces, decoupling the image into three task-specific feature maps. MIGR and MRGR iteratively reason about higher-order relationships across tasks through a graph mechanism, enabling complementary optimization for task-specific objectives. Additionally, an uncertainty-weighted loss is proposed to mitigate the impact of artifacts and noise on segmentation accuracy. Experimental results demonstrate that MTG-Net outperforms existing methods, achieving a Dice score of 87.21% for region segmentation and 88.12% for vessel segmentation.
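
For readers unfamiliar with the reported metric, here is a minimal sketch of the Dice score on binary masks. The function name and the tiny masks are illustrative, and treating two empty masks as a perfect match is a convention assumed here, not something the abstract specifies.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Pixels marked as foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(dice_score(pred, target))
```

The score is 1 for identical masks and 0 for disjoint ones, so a value such as 87.21% indicates a large overlap between predicted and annotated regions.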

[CV-81] Unifying Locality of KANs and Feature Drift Compensation for Data-free Continual Face Forgery Detection

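[Quick Read]: This paper addresses catastrophic forgetting in face forgery detection under continual learning, where learning new forgery types sharply degrades performance on earlier ones. The solution is KAN-CFD, a continual face forgery detection framework based on Kolmogorov-Arnold Networks (KANs), with two core innovations: 1) the Domain-Group KAN Detector (DG-KD), which introduces a structured grouping mechanism so that KANs can handle high-dimensional image inputs while keeping local plasticity; 2) FS-KDCP, a data-free replay feature separation strategy based on KAN Drift Compensation Projection, which uses no data from previous tasks and separates the input-space regions mapped by different domains, avoiding the parameter conflicts caused by overlapping features across tasks and thereby markedly reducing catastrophic forgetting.

Link: https://arxiv.org/abs/2508.03189
Authors: Tianshuo Zhang, Siran Peng, Li Gao, Haoyuan Zhang, Xiangyu Zhu, Zhen Lei
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: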

Abstract:The rapid advancements in face forgery techniques necessitate that detectors continuously adapt to new forgery methods, thus situating face forgery detection within a continual learning paradigm. However, when detectors learn new forgery types, their performance on previous types often degrades rapidly, a phenomenon known as catastrophic forgetting. Kolmogorov-Arnold Networks (KANs) utilize locally plastic splines as their activation functions, enabling them to learn new tasks by modifying only local regions of the functions while leaving other areas unaffected. Therefore, they are naturally suitable for addressing catastrophic forgetting. However, KANs have two significant limitations: 1) the splines are ineffective for modeling high-dimensional images, while alternative activation functions that are suitable for images lack the essential property of locality; 2) in continual learning, when features from different domains overlap, the mapping of different domains to distinct curve regions always collapses due to repeated modifications of the same regions. In this paper, we propose a KAN-based Continual Face Forgery Detection (KAN-CFD) framework, which includes a Domain-Group KAN Detector (DG-KD) and a data-free replay Feature Separation strategy via KAN Drift Compensation Projection (FS-KDCP). DG-KD enables KANs to fit high-dimensional image inputs while preserving locality and local plasticity. FS-KDCP avoids the overlap of the KAN input spaces without using data from prior tasks. Experimental results demonstrate that the proposed method achieves superior performance while notably reducing forgetting.
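
The locality property of spline activations can be illustrated with a toy example. The sketch uses degree-1 B-splines (hat functions) on integer knots as a simplification; the spline degree, knot layout, and function names are assumptions for illustration and are not taken from the paper.

```python
def hat(x, center):
    """Degree-1 B-spline (hat function) with support (center - 1, center + 1)."""
    return max(0.0, 1.0 - abs(x - center))


def spline(x, coeffs):
    """Evaluate a learnable 1-D activation as a sum of hats on knots 0, 1, 2, ..."""
    return sum(c * hat(x, i) for i, c in enumerate(coeffs))


coeffs_old = [0.0, 1.0, 0.5, -0.3, 0.8, 0.2]
coeffs_new = list(coeffs_old)
coeffs_new[4] = 2.0  # stand-in for "learning a new task": edit one coefficient

# Sample the activation on [0, 5] and record where its value changed.
xs = [i / 4 for i in range(21)]
changed = [x for x in xs if spline(x, coeffs_old) != spline(x, coeffs_new)]
print(min(changed), max(changed))
```

Only inputs inside the support of the edited basis function change their output, which is the behaviour that makes such activations attractive for continual learning. It also shows the failure mode noted in the abstract: if two domains map into the same input region, they compete for the same coefficients.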

[CV-82] Monocular Depth Estimation with Global-Aware Discretization and Local Context Modeling

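[Quick Read]: This paper addresses the accuracy problem in monocular depth estimation that stems from the ill-posed nature of recovering 3D structure from a single view, where multiple plausible depth configurations can produce the same 2D projection. The key is to combine local and global cues: the Gated Large Kernel Attention Module (GLKAM) captures multi-scale local structure with gated large-kernel convolutions, and the Global Bin Prediction Module (GBPM) estimates the global distribution of depth bins to provide structural guidance for depth regression. Experiments on NYU-V2 and KITTI show that the method outperforms existing approaches and validate each component.

Link: https://arxiv.org/abs/2508.03186
Authors: Heng Wu, Qian Zhang, Guixu Zhang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: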

Abstract:Accurate monocular depth estimation remains a challenging problem due to the inherent ambiguity that stems from the ill-posed nature of recovering 3D structure from a single view, where multiple plausible depth configurations can produce identical 2D projections. In this paper, we present a novel depth estimation method that combines both local and global cues to improve prediction accuracy. Specifically, we propose the Gated Large Kernel Attention Module (GLKAM) to effectively capture multi-scale local structural information by leveraging large kernel convolutions with a gated mechanism. To further enhance the global perception of the network, we introduce the Global Bin Prediction Module (GBPM), which estimates the global distribution of depth bins and provides structural guidance for depth regression. Extensive experiments on the NYU-V2 and KITTI datasets demonstrate that our method achieves competitive performance and outperforms existing approaches, validating the effectiveness of each proposed component.
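
To show how depth bins can guide regression, the sketch below uses a formulation common in bin-based depth estimation: predicted bin widths are turned into bin centers, and each pixel's depth is the probability-weighted average of those centers. Whether GBPM uses exactly this formulation is not stated in the abstract, so all names and values here are assumptions.

```python
import math


def bin_centers(widths, d_min, d_max):
    """Turn positive bin widths into bin centers covering [d_min, d_max]."""
    total = sum(widths)
    centers, left = [], d_min
    for w in widths:
        span = (d_max - d_min) * w / total  # width of this bin in depth units
        centers.append(left + span / 2)
        left += span
    return centers


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pixel_depth(logits, centers):
    """Depth as the probability-weighted average of bin centers."""
    return sum(p * c for p, c in zip(softmax(logits), centers))


centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
print(centers)
print(pixel_depth([0.0, 0.0, 0.0], centers))
```

Because the bin layout is shared by the whole image while the per-pixel probabilities are local, this kind of decomposition gives a natural place for global and local cues to meet.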

[CV-83] Duplex-GS: Proxy-Guided Weighted Blending for Real-Time Order-Independent Gaussian Splatting

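[Quick Read]: This paper addresses the performance bottleneck of 3D Gaussian Splatting (3DGS), whose rendering relies on computationally expensive sequential alpha-blending and suffers noticeable latency, particularly on resource-constrained platforms. The solution is Duplex-GS, a dual-hierarchy framework that uses proxy Gaussian representations to manage local Gaussians and combines them with Order-Independent Transparency (OIT) for efficient, physically plausible weighted-sum rendering. In particular, cell proxies and cell search rasterization reduce the overhead of view-adaptive radix sort, and an OIT-driven weighted blending strategy removes both “popping” and “transparency” artifacts, yielding a 1.5× to 4× speedup and up to an 86.9% reduction in radix sort overhead without loss of image quality.

Link: https://arxiv.org/abs/2508.03180
Authors: Weihang Liu, Yuke Li, Yuxuan Li, Jingyi Yu, Xin Lou
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: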

Abstract:Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated remarkable rendering fidelity and efficiency. However, these methods still rely on computationally expensive sequential alpha-blending operations, resulting in significant overhead, particularly on resource-constrained platforms. In this paper, we propose Duplex-GS, a dual-hierarchy framework that integrates proxy Gaussian representations with order-independent rendering techniques to achieve photorealistic results while sustaining real-time performance. To mitigate the overhead caused by view-adaptive radix sort, we introduce cell proxies for local Gaussian management and propose cell search rasterization for further acceleration. By seamlessly combining our framework with Order-Independent Transparency (OIT), we develop a physically inspired weighted sum rendering technique that simultaneously eliminates “popping” and “transparency” artifacts, yielding substantial improvements in both accuracy and efficiency. Extensive experiments on a variety of real-world datasets demonstrate the robustness of our method across diverse scenarios, including multi-scale training views and large-scale environments. Our results validate the advantages of the OIT rendering paradigm in Gaussian Splatting, achieving high-quality rendering with an impressive 1.5× to 4× speedup over existing OIT-based Gaussian Splatting approaches and a 52.2% to 86.9% reduction of the radix sort overhead without quality degradation.
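
The difference between sequential alpha-blending and an order-independent weighted sum can be seen in a few lines. The sketch uses a generic weighted-blended OIT formula on scalar colors; it is not the physically inspired weighting proposed in the paper, and the weights and fragments are made-up values.

```python
def alpha_blend(fragments, background):
    """Sequential front-to-back 'over' compositing; needs depth-sorted input."""
    color, transmittance = 0.0, 1.0
    for c, a in fragments:
        color += transmittance * a * c
        transmittance *= 1.0 - a
    return color + transmittance * background


def weighted_blend(fragments, background, weights):
    """Order-independent blend: weighted average color times total coverage."""
    num = sum(w * a * c for (c, a), w in zip(fragments, weights))
    den = sum(w * a for (_, a), w in zip(fragments, weights))
    transmittance = 1.0
    for _, a in fragments:
        transmittance *= 1.0 - a  # a product, so the order does not matter
    average = num / den if den > 0 else 0.0
    return average * (1.0 - transmittance) + transmittance * background


# Each fragment is (color, alpha); weights are hypothetical per-fragment values.
fragments = [(1.0, 0.5), (0.0, 0.5)]
weights = [1.0, 1.0]
print(alpha_blend(fragments, 0.0), alpha_blend(fragments[::-1], 0.0))
print(weighted_blend(fragments, 0.0, weights),
      weighted_blend(fragments[::-1], 0.0, weights[::-1]))
```

Sequential blending gives different results for the two orderings, which is why it needs a per-view sort. The weighted sum does not, so the sort can be dropped at the cost of an approximation whose quality depends on the weights.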

[CV-84] Advancing Precision in Multi-Point Cloud Fusion Environments

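[Quick Read]: This paper addresses the accuracy and efficiency of point cloud registration and multi-point-cloud matching in industrial visual inspection, particularly for surface defect identification. The key contributions are a new CloudCompare plugin that merges multiple point clouds and visualizes surface defects, a synthetic dataset for quantitatively evaluating registration methods, and several distance metrics for point cloud comparison, which together improve the accuracy and efficiency of automated inspection systems.

Link: https://arxiv.org/abs/2508.03179
Authors: Ulugbek Alibekov, Vanessa Staderini, Philipp Schneider, Doris Antensteiner
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Graphics (cs.GR)
Comments: Accepted for publication in Communications in Computer and Information Science, Springer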

Abstract:This research focuses on visual industrial inspection by evaluating point clouds and multi-point cloud matching methods. We also introduce a synthetic dataset for quantitative evaluation of registration methods and various distance metrics for point cloud comparison. Additionally, we present a novel CloudCompare plugin for merging multiple point clouds and visualizing surface defects, enhancing the accuracy and efficiency of automated inspection systems.
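
As an example of a distance metric for point cloud comparison, here is a brute-force sketch of a symmetric Chamfer-style distance. The abstract does not list which metrics the authors use, so this choice is an assumption, and definitions of Chamfer distance vary (some square the distances).

```python
import math


def nearest_distance(point, cloud):
    """Euclidean distance from a point to its nearest neighbour in a cloud."""
    return min(math.dist(point, q) for q in cloud)


def chamfer_distance(a, b):
    """Sum of the mean nearest-neighbour distances in both directions."""
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


# Hypothetical reference surface and a scan with one displaced point ("defect").
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.1)]
print(chamfer_distance(reference, scan))
```

The brute-force search is quadratic in the number of points; real inspection pipelines would normally use a spatial index such as a k-d tree or octree.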

[CV-85] SAVER: Mitigating Hallucinations in Large Vision-Language Models via Style-Aware Visual Early Revision

[Quick Read]: This paper addresses the tendency of Large Vision-Language Models (LVLMs) to hallucinate on stylized images. Earlier work focused mainly on photographic images and overlooked the risks of stylized images in scenarios such as game scene understanding, art education, and medical analysis. The solution is Style-Aware Visual Early Revision (SAVER), a mechanism that analyzes token-level visual attention patterns and uses early-layer feedback to adjust the model's output dynamically, mitigating hallucinations induced by stylized images.

Link: https://arxiv.org/abs/2508.03177
Authors: Zhaoxu Li, Chenqi Kong, Yi Yu, Qiangqiang Wu, Xinghao Jiang, Ngai-Man Cheung, Bihan Wen, Alex Kot, Xudong Jiang
Affiliations: University of Science and Technology of China; Hong Kong Polytechnic University; Tsinghua University; Peking University; University of California, Los Angeles
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Large Vision-Language Models (LVLMs) have recently achieved significant breakthroughs in understanding complex visual-textual contexts. However, hallucination issues still limit their real-world applicability. Although previous mitigation methods effectively reduce hallucinations in photographic images, they largely overlook the potential risks posed by stylized images, which play crucial roles in critical scenarios such as game scene understanding, art education, and medical analysis. In this work, we first construct a dataset comprising photographic images and their corresponding stylized versions with carefully annotated caption labels. We then conduct head-to-head comparisons on both discriminative and generative tasks by benchmarking 13 advanced LVLMs on the collected datasets. Our findings reveal that stylized images tend to induce significantly more hallucinations than their photographic counterparts. To address this issue, we propose Style-Aware Visual Early Revision (SAVER), a novel mechanism that dynamically adjusts LVLMs’ final outputs based on the token-level visual attention patterns, leveraging early-layer feedback to mitigate hallucinations caused by stylized images. Extensive experiments demonstrate that SAVER achieves state-of-the-art performance in hallucination mitigation across various models, datasets, and tasks.

[CV-86] LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing

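[Quick Read]: This paper addresses a structural flaw of inversion-based methods for text-driven image editing: the semantic bias toward the source concept encoded in the inverted noise suppresses attention to the target concept, and when source and target semantics differ widely this causes editing failures or unintended modifications in non-target regions. The solution is LORE, which directly optimizes the inverted noise instead of relying on architectural changes or fine-tuning, improving controllability, stability, and generalization and enabling training-free, general-purpose concept replacement.

Link: https://arxiv.org/abs/2508.03144
Authors: Liangyang Ouyang, Jiafeng Mao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: We will make our implementation available soon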

Abstract:Text-driven image editing enables users to flexibly modify visual content through natural language instructions, and is widely applied to tasks such as semantic object replacement, insertion, and removal. While recent inversion-based editing methods using rectified flow models have achieved promising results in image quality, we identify a structural limitation in their editing behavior: the semantic bias toward the source concept encoded in the inverted noise tends to suppress attention to the target concept. This issue becomes particularly critical when the source and target semantics are dissimilar, where the attention mechanism inherently leads to editing failure or unintended modifications in non-target regions. In this paper, we systematically analyze and validate this structural flaw, and introduce LORE, a training-free and efficient image editing method. LORE directly optimizes the inverted noise, addressing the core limitations in generalization and controllability of existing approaches, enabling stable, controllable, and general-purpose concept replacement, without requiring architectural modification or model fine-tuning. We conduct comprehensive evaluations on three challenging benchmarks: PIEBench, SmartEdit, and GapEdit. Experimental results show that LORE significantly outperforms strong baselines in terms of semantic alignment, image quality, and background fidelity, demonstrating the effectiveness and scalability of latent-space optimization for general-purpose image editing.

[CV-87] SARD: Segmentation-Aware Anomaly Synthesis via Region-Constrained Diffusion with Discriminative Mask Guidance

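[Quick Read]: This paper addresses the challenge of generating realistic, spatially precise anomaly samples for industrial anomaly detection, in particular the weak spatial controllability and poor local fidelity of diffusion-based methods. The solution is SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance). First, a Region-Constrained Diffusion (RCD) process freezes the background and updates only the foreground anomaly regions during reverse denoising, reducing background artifacts. Second, a Discriminative Mask Guidance (DMG) module embedded in the discriminator uses pixel-level masks to evaluate global realism and local anomaly fidelity jointly, improving the precision and visual quality of the generated anomalies.

Link: https://arxiv.org/abs/2508.03143
Authors: Yanshu Wang, Xichen Xu, Xiaoning Lei, Guoyang Xie
Affiliations: Global Institute of Future Technology, Shanghai Jiao Tong University; Department of Intelligent Manufacturing, CATL
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by The 2025 International Conference on Machine Intelligence and Nature-InspireD Computing (MIND)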

Abstract:Synthesizing realistic and spatially precise anomalies is essential for enhancing the robustness of industrial anomaly detection systems. While recent diffusion-based methods have demonstrated strong capabilities in modeling complex defect patterns, they often struggle with spatial controllability and fail to maintain fine-grained regional fidelity. To overcome these limitations, we propose SARD (Segmentation-Aware anomaly synthesis via Region-constrained Diffusion with discriminative mask Guidance), a novel diffusion-based framework specifically designed for anomaly generation. Our approach introduces a Region-Constrained Diffusion (RCD) process that preserves the background by freezing it and selectively updating only the foreground anomaly regions during the reverse denoising phase, thereby effectively reducing background artifacts. Additionally, we incorporate a Discriminative Mask Guidance (DMG) module into the discriminator, enabling joint evaluation of both global realism and local anomaly fidelity, guided by pixel-level masks. Extensive experiments on the MVTec-AD and BTAD datasets show that SARD surpasses existing methods in segmentation accuracy and visual quality, setting a new state-of-the-art for pixel-level anomaly synthesis.
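
The idea of freezing the background while updating only the anomaly region can be sketched as a masked update. The snippet is a toy illustration on a flat list of pixel values. `toy_denoise` is a hypothetical stand-in for a real reverse-diffusion step, and the paper's actual RCD procedure may differ in its details.

```python
def region_constrained_step(x, mask, denoise_step):
    """Apply one reverse step only where mask == 1; other pixels stay frozen."""
    proposal = denoise_step(x)
    return [p if m else v for v, p, m in zip(x, proposal, mask)]


def toy_denoise(x):
    # Hypothetical stand-in for a learned denoiser: shrink every value.
    return [0.5 * v for v in x]


x = [4.0, 4.0, 4.0, 4.0]
mask = [0, 1, 1, 0]  # 1 marks the foreground anomaly region
for _ in range(3):
    x = region_constrained_step(x, mask, toy_denoise)
print(x)
```

Background pixels keep their original values after every step, so any change in the output is confined to the masked region by construction.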

[CV-88] UniEdit-I: Training-free Image Editing for Unified VLM via Iterative Understanding Editing and Verifying

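[Quick Read]: This paper addresses the missing image editing capability of unified Vision-Language Models (VLMs): although they perform strongly on visual understanding and generation, how to achieve efficient, high-fidelity image editing without additional training remains open. The solution is UniEdit-I, a training-free framework built around an iterative understand-edit-verify loop. It first generates a source prompt through structured semantic analysis and adapts it into a target prompt according to the editing instruction. It then applies a time-adaptive offset during denoising for coherent coarse-to-fine editing. Finally, it verifies intermediate results with consistency scores and automatic feedback and decides whether to continue iterating until convergence. Built on BLIP3-o, the method achieves SOTA image editing results.

Link: https://arxiv.org/abs/2508.03142
Authors: Chengyu Bai, Jintao Chen, Xiang Bai, Yilong Chen, Qi She, Ming Lu, Shanghang Zhang
Affiliations: Peking University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: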

Abstract:In recent years, unified vision-language models (VLMs) have rapidly advanced, effectively tackling both visual understanding and generation tasks within a single design. While many unified VLMs have explored various design choices, the recent hypothesis from OpenAI’s GPT-4o suggests a promising generation pipeline: Understanding VLM-Visual Feature-Projector-Diffusion Model-Image. The understanding VLM is frozen, and only the generation-related modules are trained. This pipeline maintains the strong capability of understanding VLM while enabling the image generation ability of the unified VLM. Although this pipeline has shown very promising potential for the future development of unified VLM, how to easily enable image editing capability is still unexplored. In this paper, we introduce a novel training-free framework named UniEdit-I to enable the unified VLM with image editing capability via three iterative steps: understanding, editing, and verifying. 1. The understanding step analyzes the source image to create a source prompt through structured semantic analysis and makes minimal word replacements to form the target prompt based on the editing instruction. 2. The editing step introduces a time-adaptive offset, allowing for coherent editing from coarse to fine throughout the denoising process. 3. The verification step checks the alignment between the target prompt and the intermediate edited image, provides automatic consistency scores and corrective feedback, and determines whether to stop early or continue the editing loop. This understanding, editing, and verifying loop iterates until convergence, delivering high-fidelity editing in a training-free manner. We implemented our method based on the latest BLIP3-o and achieved state-of-the-art (SOTA) performance on the GEdit-Bench benchmark.
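
The control flow of the three steps can be sketched as a generic loop with early stopping. Everything below is hypothetical scaffolding: the callables, the threshold, and the toy stand-ins (where the “image” is just a counter of edit passes) are assumptions for illustration and do not reflect the authors' implementation.

```python
def iterative_edit(image, instruction, understand, edit, verify,
                   max_rounds=5, threshold=0.9):
    """Run a generic understand -> edit -> verify loop with early stopping."""
    source_prompt, target_prompt = understand(image, instruction)
    current, feedback, rounds = image, None, 0
    while rounds < max_rounds:
        current = edit(current, source_prompt, target_prompt, feedback)
        score, feedback = verify(current, target_prompt)
        rounds += 1
        if score >= threshold:
            break  # the verifier is satisfied, stop early
    return current, rounds


# Toy stand-ins (purely illustrative): the "image" counts edit passes.
def toy_understand(image, instruction):
    return "a cat on a sofa", "a dog on a sofa"


def toy_edit(image, source_prompt, target_prompt, feedback):
    return image + 1


def toy_verify(image, target_prompt):
    score = min(1.0, image / 3)
    return score, None if score >= 0.9 else "target concept not yet visible"


result = iterative_edit(0, "replace the cat with a dog",
                        toy_understand, toy_edit, toy_verify)
print(result)
```

Passing the verifier's feedback back into the next edit call is what distinguishes this loop from simply repeating the same edit several times.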

[CV-89] Uint: Building Uint Detection Dataset

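[Quick Read]: This paper addresses the scarcity of annotated data in fire scene datasets, especially high-quality, multi-scenario training data for fire detection on building units. The solution is a synthetic dataset based on drone imagery: real multi-story building backgrounds are combined with augmentations such as motion blur and brightness adjustment to simulate different shooting conditions, and large models are used to generate diverse fire effects, improving model generalization in complex environments. The resulting dataset contains 1,978 images, lowers the risk and cost of collecting real fire data, and improves the robustness and applicability of fire detection models.

Link: https://arxiv.org/abs/2508.03139
Authors: Haozhou Zhai, Yanzhe Gao, Tianjiang Hu
Affiliations: Sun Yat-sen University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: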

Abstract:Fire scene datasets are crucial for training robust computer vision models, particularly in tasks such as fire early warning and emergency rescue operations. However, among the currently available fire-related data, there is a significant shortage of annotated data specifically targeting building units. To tackle this issue, we introduce an annotated dataset of building units captured by drones, which incorporates multiple enhancement techniques. We construct backgrounds using real multi-story scenes, combine motion blur and brightness adjustment to enhance the authenticity of the captured images, simulate drone shooting conditions under various circumstances, and employ large models to generate varied fire effects. The synthetic dataset generated by this method encompasses a wide range of building scenarios, with a total of 1,978 images. This dataset can effectively improve the generalization ability of fire unit detection, providing multi-scenario and scalable data while reducing the risks and costs associated with collecting real fire data. The dataset is available at this https URL.
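
The two augmentations named in the abstract can be sketched on a tiny grayscale image stored as nested lists. The kernel shape (a one-sided horizontal box filter), the edge handling, and the parameter values are assumptions chosen for illustration; the paper does not describe its exact settings.

```python
def adjust_brightness(image, factor):
    """Scale pixel intensities and clip them to the 0..255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def horizontal_motion_blur(image, length):
    """Average each pixel with the `length - 1` pixels to its left (edge-clamped)."""
    blurred = []
    for row in image:
        out = []
        for x in range(len(row)):
            window = [row[max(0, x - k)] for k in range(length)]
            out.append(round(sum(window) / length))
        blurred.append(out)
    return blurred


frame = [[0, 0, 90, 0, 0]]
print(horizontal_motion_blur(frame, 3))
print(adjust_brightness([[100, 200]], 1.5))
```

Randomizing `length` and `factor` per image is one simple way to simulate the varied drone shooting conditions the abstract refers to.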

[CV-90] COFFEE: A Shadow-Resilient Real-Time Pose Estimator for Unknown Tumbling Asteroids using Sparse Neural Networks

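【Quick Read】: This paper addresses real-time pose estimation of unknown celestial bodies in space (such as asteroids), in particular the significant bias that self-cast shadows introduce into conventional feature extraction methods (SIFT, ORB, AKAZE, and deep learning methods). The key to the solution is the COFFEE framework: using prior information on the sun phase angle from onboard sun-tracking sensors, it associates salient contours with their projected shadows to detect sparse feature points that are invariant to shadow motion. A Sparse Neural Network and an attention-based Graph Neural Network are then jointly trained to match features between frames, yielding a pose estimation pipeline that is bias-free, accurate, and an order of magnitude faster than existing state-of-the-art deep learning methods.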

Link: https://arxiv.org/abs/2508.03132
Authors: Arion Zimmermann,Soon-Jo Chung,Fred Hadaegh
Affiliations: EPFL; California Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: in Proc. 75th Int. Astronautical Congress (IAC-24), Milan, Italy, Oct. 2024

Abstract:The accurate state estimation of unknown bodies in space is a critical challenge with applications ranging from the tracking of space debris to the shape estimation of small bodies. A necessary enabler to this capability is to find and track features on a continuous stream of images. Existing methods, such as SIFT, ORB and AKAZE, achieve real-time but inaccurate pose estimates, whereas modern deep learning methods yield higher quality features at the cost of more demanding computational resources which might not be available on space-qualified hardware. Additionally, both classical and data-driven methods are not robust to the highly opaque self-cast shadows on the object of interest. We show that, as the target body rotates, these shadows may lead to large biases in the resulting pose estimates. For these objects, a bias in the real-time pose estimation algorithm may mislead the spacecraft’s state estimator and cause a mission failure, especially if the body undergoes a chaotic tumbling motion. We present COFFEE, the Celestial Occlusion Fast FEature Extractor, a real-time pose estimation framework for asteroids designed to leverage prior information on the sun phase angle given by sun-tracking sensors commonly available onboard spacecraft. By associating salient contours to their projected shadows, a sparse set of features are detected, invariant to the motion of the shadows. A Sparse Neural Network followed by an attention-based Graph Neural Network feature matching model are then jointly trained to provide a set of correspondences between successive frames. The resulting pose estimation pipeline is found to be bias-free, more accurate than classical pose estimation pipelines and an order of magnitude faster than other state-of-the-art deep learning pipelines on synthetic data as well as on renderings of the tumbling asteroid Apophis.

[CV-91] Landsat30-AU: A Vision-Language Dataset for Australian Landsat Imagery

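【Quick Read】: This paper addresses a key limitation of current vision-language models (VLMs) in remote sensing: existing datasets are built mainly from high-resolution, short-term imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives such as Landsat that matter for affordable and bias-robust global monitoring. The core of the solution is the construction and release of the Landsat30-AU dataset, which has two components, Landsat30-AU-Cap (196,262 image-caption pairs) and Landsat30-AU-VQA (17,725 human-verified visual question answering samples), covering 30-meter resolution imagery of Australia collected over more than 36 years by four Landsat satellites (5, 7, 8, and 9). The dataset is produced by an iteratively refined bootstrapped pipeline that combines generic VLMs with human verification to ensure quality. Experiments show that off-the-shelf VLMs perform poorly, whereas lightweight fine-tuning (for example of Qwen2.5-VL-7B) substantially improves captioning (SPIDEr from 0.11 to 0.31) and VQA accuracy (from 0.74 to 0.87), underscoring the importance of high-quality remote-sensing semantic data for model adaptation.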

Link: https://arxiv.org/abs/2508.03127
Authors: Sai Ma,Zhuang Li,John A Taylor
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Vision language models (VLMs) that enable natural language interaction with satellite imagery can democratize Earth observation by accelerating expert workflows, making data accessible to non-specialists, and enabling planet-scale automation. However, existing datasets focus mainly on short-term, high-resolution imagery from a limited number of satellites, overlooking low-resolution, multi-satellite, long-term archives, such as Landsat, that are essential for affordable and bias-robust global monitoring. We address this gap with Landsat30-AU, a large-scale vision-language dataset built from 30-meter resolution imagery collected by four Landsat satellites (5, 7, 8, and 9) over Australia, spanning more than 36 years. The dataset includes two components: Landsat30-AU-Cap, containing 196,262 image-caption pairs, and Landsat30-AU-VQA, comprising 17,725 human-verified visual question answering (VQA) samples across eight remote sensing domains. Both datasets are curated through a bootstrapped pipeline that leverages generic VLMs with iterative refinement and human verification to ensure quality. Our evaluation of eight VLMs on our benchmark reveals that off-the-shelf models struggle to understand satellite imagery. The open-source remote-sensing VLM EarthDial achieves only 0.07 SPIDEr in captioning and a VQA accuracy of 0.48, highlighting the limitations of current approaches. Encouragingly, lightweight fine-tuning of Qwen2.5-VL-7B on Landsat30-AU improves captioning performance from 0.11 to 0.31 SPIDEr and boosts VQA accuracy from 0.74 to 0.87. Code and data are available at this https URL.

[CV-92] H3R: Hybrid Multi-view Correspondence for Generalizable 3D Reconstruction ICCV2025

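【Quick Read】: This paper addresses the limited generalization of current feed-forward 3D Gaussian Splatting methods in multi-view correspondence modeling, in particular the inherent trade-off whereby explicit methods are geometrically precise but struggle with ambiguous regions, while implicit methods are robust but converge slowly. The key to the solution is H3R, a hybrid framework that integrates volumetric latent fusion with attention-based feature aggregation through two complementary modules: an efficient latent volume that enforces geometric consistency via epipolar constraints, and a camera-aware Transformer that uses Plücker coordinates for adaptive correspondence refinement. The design improves reconstruction generalization while converging 2× faster than existing methods. The paper also shows that spatially aligned foundation models (such as SD-VAE) are better suited to 3D reconstruction than semantically aligned ones (such as DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements.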

Link: https://arxiv.org/abs/2508.03118
Authors: Heng Jia,Linchao Zhu,Na Zhao
Affiliations: ReLER Lab, CCAI, Zhejiang University; The State Key Lab of Brain-Machine Intelligence, Zhejiang University; Singapore University of Technology and Design
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025

Abstract:Despite recent advances in feed-forward 3D Gaussian Splatting, generalizable 3D reconstruction remains challenging, particularly in multi-view correspondence modeling. Existing approaches face a fundamental trade-off: explicit methods achieve geometric precision but struggle with ambiguous regions, while implicit methods provide robustness but suffer from slow convergence. We present H3R, a hybrid framework that addresses this limitation by integrating volumetric latent fusion with attention-based feature aggregation. Our framework consists of two complementary components: an efficient latent volume that enforces geometric consistency through epipolar constraints, and a camera-aware Transformer that leverages Plücker coordinates for adaptive correspondence refinement. By integrating both paradigms, our approach enhances generalization while converging 2× faster than existing methods. Furthermore, we show that spatial-aligned foundation models (e.g., SD-VAE) substantially outperform semantic-aligned models (e.g., DINOv2), resolving the mismatch between semantic representations and spatial reconstruction requirements. Our method supports variable-number and high-resolution input views while demonstrating robust cross-dataset generalization. Extensive experiments show that our method achieves state-of-the-art performance across multiple benchmarks, with significant PSNR improvements of 0.59 dB, 1.06 dB, and 0.22 dB on the RealEstate10K, ACID, and DTU datasets, respectively. Code is available at this https URL.
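
For readers unfamiliar with Plücker coordinates, the sketch below shows one common way to encode a camera ray as a 6-vector: the unit direction d followed by the moment o × d. The abstract does not say which convention or normalization H3R uses, so the ordering (d, o × d) and the helper names are assumptions.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return (d, o x d) for the ray through `origin` along `direction`."""
    norm = sum(c * c for c in direction) ** 0.5
    d = tuple(c / norm for c in direction)  # unit direction
    m = cross(origin, d)                    # moment about the world origin
    return d + m


# Any point on the same ray gives the same 6-vector, so the encoding
# identifies the ray itself and not an arbitrary point along it.
ray_a = plucker_ray((1.0, 0.0, 0.0), (0.0, 0.0, 2.0))
ray_b = plucker_ray((1.0, 0.0, 5.0), (0.0, 0.0, 1.0))
```

This point-independence is one reason the representation is convenient as a per-pixel camera embedding for a Transformer.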

[CV-93] Causal Disentanglement and Cross-Modal Alignment for Enhanced Few-Shot Learning

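【Quick Read】: This paper addresses inefficient adaptation in few-shot learning (FSL) caused by reliance on entangled representations: with only a few labeled samples, a model struggles to implicitly recover the unmixing process needed to obtain disentangled representations, which limits performance. The key to the solution is a new framework, the Causal CLIP Adapter (CCA). First, it explicitly disentangles CLIP visual features with unsupervised Independent Component Analysis (ICA), so the unmixing process need not be learned from limited labels, which sharply reduces trainable parameters and mitigates overfitting. Second, because ICA may disrupt CLIP's intra- and inter-modal alignment, CCA strengthens CLIP's cross-modal alignment in two ways: unidirectionally, by fine-tuning a CLIP-based text classifier, and bidirectionally, through a cross-attention mechanism in which visual and textual representations enrich each other. Finally, the unimodal and cross-modal classification outputs are combined linearly to improve accuracy.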

Link: https://arxiv.org/abs/2508.03102
Authors: Tianjiao Jiang,Zhen Zhang,Yuhang Liu,Javen Qinfeng Shi
Affiliations: Australian Institute for Machine Learning; The University of Adelaide
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Few-shot learning (FSL) often requires effective adaptation of models using limited labeled data. However, most existing FSL methods rely on entangled representations, requiring the model to implicitly recover the unmixing process to obtain disentangled representations using only limited supervision, which hinders effective adaptation. Recent theoretical studies show that multimodal contrastive learning methods, such as CLIP, can disentangle latent representations up to linear transformations. In light of this, we propose the Causal CLIP Adapter (CCA), a novel framework that explicitly disentangles visual features extracted from CLIP using unsupervised Independent Component Analysis (ICA). This removes the need to learn the unmixing process from the labeled data, thereby reducing the number of trainable parameters and mitigating overfitting. Taking a step further, while ICA can obtain visual disentangled representations, it may also disrupt CLIP’s intra- and inter-modal alignment. To counteract this, CCA further leverages CLIP’s inherent cross-modal alignment by enhancing it in two ways: unidirectionally, through fine-tuning a CLIP-based text classifier, and bidirectionally, via a cross-attention mechanism that enriches visual and textual representations through mutual interaction. Both unimodal and cross-modal classification outputs can be effectively combined linearly to improve classification accuracy. Extensive experiments on 11 benchmark datasets demonstrate that our method consistently outperforms state-of-the-art approaches in terms of few-shot performance and robustness to distributional shifts, while maintaining computational efficiency. Code will be available at this https URL.

[CV-94] AVATAR: Reinforcement Learning to See Hear and Reason Over Video

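【Quick Read】: This paper addresses three challenges in multimodal reasoning over long-horizon video: (1) the data inefficiency of on-policy training; (2) vanishing advantages, where identical or near-identical rewards weaken the learning signal; and (3) uniform credit assignment that fails to highlight critical reasoning steps. The key to the solution is the AVATAR (Audio-Video Agent for Alignment and Reasoning) framework, whose core innovations are: (1) an off-policy training architecture that reuses past experience with greater reward diversity, improving sample efficiency and mitigating vanishing advantages; and (2) Temporal Advantage Shaping (TAS), which upweights key reasoning phases during learning for more precise credit assignment. The approach clearly outperforms the baseline on several benchmarks and achieves over 35% higher sample efficiency.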

Link: https://arxiv.org/abs/2508.03100
Authors: Yogesh Kulkarni,Pooyan Fazli
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency.
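
The vanishing advantage problem described above can be reproduced in a few lines. The sketch assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` stabilizer are illustrative choices, not taken from the AVATAR paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct and incorrect responses get opposite-signed advantages.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Identical outcomes: every advantage is exactly zero, so the policy
# gradient for this group carries no learning signal.
identical = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Reusing past responses with different rewards, as the off-policy design in the abstract does, is one way to make sure a group is rarely this uniform.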

[CV-95] Augmenting Continual Learning of Diseases with LLM-Generated Visual Concepts

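【Quick Read】: This paper addresses the challenges that medical image classification systems face when learning continually in dynamic clinical environments, in particular that existing methods rely only on simple templated text (such as class names) and ignore richer semantic information, limiting adaptation to new classes. The key to the solution is to use visual concepts generated by large language models (LLMs) as discriminative semantic guidance: a visual concept pool is built with similarity-based filtering to avoid redundancy, and a cross-modal image-concept attention module, combined with an attention loss, lets the model draw semantic knowledge from relevant concepts and produce fused, class-representative features for classification. The method substantially improves continual learning performance and reaches state-of-the-art results on medical and natural image datasets.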

Link: https://arxiv.org/abs/2508.03094
Authors: Jiantao Tan,Peixian Ma,Kanghao Chen,Zhiming Dai,Ruixuan Wang
Affiliations: Sun Yat-sen University; The Hong Kong University of Science and Technology (Guangzhou); Peng Cheng Laboratory; Key Laboratory of Machine Intelligence and Advanced Computing, MOE
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Continual learning is essential for medical image classification systems to adapt to dynamically evolving clinical environments. The integration of multimodal information can significantly enhance continual learning of image classes. However, while existing approaches do utilize textual modality information, they solely rely on simplistic templates with a class name, thereby neglecting richer semantic information. To address these limitations, we propose a novel framework that harnesses visual concepts generated by large language models (LLMs) as discriminative semantic guidance. Our method dynamically constructs a visual concept pool with a similarity-based filtering mechanism to prevent redundancy. Then, to integrate the concepts into the continual learning process, we employ a cross-modal image-concept attention module, coupled with an attention loss. Through attention, the module can leverage the semantic knowledge from relevant visual concepts and produce class-representative fused features for classification. Experiments on medical and natural image datasets show our method achieves state-of-the-art performance, demonstrating the effectiveness and superiority of our method. We will release the code publicly.
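
A similarity-based filter for a concept pool might look like the following sketch. The abstract does not specify the similarity measure or the threshold, so cosine similarity, the 0.9 cutoff, the toy 3-dimensional embeddings, and the concept names are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def add_concept(pool, name, embedding, threshold=0.9):
    """Add a concept unless it is too similar to one already in the pool."""
    if any(cosine(embedding, kept) >= threshold for _, kept in pool):
        return False  # redundant: a near-duplicate concept is already stored
    pool.append((name, embedding))
    return True


pool = []
added = [
    add_concept(pool, "irregular border", [1.0, 0.0, 0.0]),
    add_concept(pool, "uneven border", [0.99, 0.05, 0.0]),  # near-duplicate
    add_concept(pool, "blue-white veil", [0.0, 1.0, 0.0]),
]
```

In practice the embeddings would come from a text encoder, and the pool would grow as new disease classes arrive.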

[CV-96] T2UE: Generating Unlearnable Examples from Text Descriptions ACM-MM2025

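【Quick Read】: This paper addresses the privacy risk that arises when large-scale pre-trained models (such as CLIP) are trained on web-scraped data, in particular how to protect sensitive data against unauthorized training. Existing methods generate Unlearnable Examples (UEs) to disrupt learning from protected data, but they rely on a joint image-text optimization that is computationally expensive and must be delegated to third-party services, so users have to expose their raw data before it can be protected, which creates a privacy paradox. The key to the solution is the Text-to-Unlearnable Example (T2UE) framework: it generates effective UEs from text descriptions alone, using a text-to-image (T2I) model to map text into the image space together with an error-minimization mechanism to produce the perturbation noise. This enables "zero-contact data protection", in which privacy protection does not require handing over the original images, and the protective effect generalizes across model architectures and to supervised learning settings.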

Link: https://arxiv.org/abs/2508.03091
Authors: Xingjun Ma,Hanxun Huang,Tianwei Song,Ye Sun,Yifeng Gao,Yu-Gang Jiang
Affiliations: Fudan University; The University of Melbourne
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Computer Vision and Pattern Recognition (cs.CV)
Comments: To appear in ACM MM 2025

Abstract:Large-scale pre-training frameworks like CLIP have revolutionized multimodal learning, but their reliance on web-scraped datasets, frequently containing private user data, raises serious concerns about misuse. Unlearnable Examples (UEs) have emerged as a promising countermeasure against unauthorized model training, employing carefully crafted unlearnable noise to disrupt the learning of meaningful representations from protected data. Current approaches typically generate UEs by jointly optimizing unlearnable noise for both images and their associated text descriptions (or labels). However, this optimization process is often computationally prohibitive for on-device execution, forcing reliance on external third-party services. This creates a fundamental privacy paradox: users must initially expose their data to these very services to achieve protection, thereby compromising privacy in the process. Such a contradiction has severely hindered the development of practical, scalable data protection solutions. To resolve this paradox, we introduce Text-to-Unlearnable Example (T2UE), a novel framework that enables users to generate UEs using only text descriptions. T2UE circumvents the need for original image data by employing a text-to-image (T2I) model to map text descriptions into the image (noise) space, combined with an error-minimization framework to produce effective unlearnable noise. Extensive experiments show that T2UE-protected data substantially degrades performance in downstream tasks (e.g., cross-modal retrieval) for state-of-the-art models. Notably, the protective effect generalizes across diverse architectures and even to supervised learning settings. Our work demonstrates the feasibility of “zero-contact data protection”, where personal data can be safeguarded based solely on their textual descriptions, eliminating the need for direct data exposure.

[CV-97] Contrastive Cross-Bag Augmentation for Multiple Instance Learning-based Whole Slide Image Classification

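【Quick Read】: This paper studies the limited diversity of pseudo-bags in Multiple Instance Learning (MIL) for Whole Slide Image (WSI) classification. The core challenge is that existing pseudo-bag augmentation methods sample instances from only a small number of bags, which constrains the feature distribution of the pseudo-bags and weakens the model's ability to discriminate samples with small tumor regions. The key to the solution is Contrastive Cross-Bag Augmentation (C²Aug), which samples instances from all bags of the same class to increase pseudo-bag diversity. Because this also increases the number of critical instances (such as tumor instances) and makes pseudo-bags with few critical instances rarer, the method adds a bag-level and group-level contrastive learning framework that sharpens the distinction between features with different semantic meanings, improving performance on test slides with small tumor areas.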

Link: https://arxiv.org/abs/2508.03081
Authors: Bo Zhang,Xu Xinan,Shuo Yan,Yu Bai,Zheng Zhang,Wufan Wang,Wendong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Recent pseudo-bag augmentation methods for Multiple Instance Learning (MIL)-based Whole Slide Image (WSI) classification sample instances from a limited number of bags, resulting in constrained diversity. To address this issue, we propose Contrastive Cross-Bag Augmentation (C²Aug) to sample instances from all bags with the same class to increase the diversity of pseudo-bags. However, introducing new instances into the pseudo-bag increases the number of critical instances (e.g., tumor instances). This increase results in a reduced occurrence of pseudo-bags containing few critical instances, thereby limiting model performance, particularly on test slides with small tumor areas. To address this, we introduce a bag-level and group-level contrastive learning framework to enhance the discrimination of features with distinct semantic meanings, thereby improving model performance. Experimental results demonstrate that C²Aug consistently outperforms state-of-the-art approaches across multiple evaluation metrics.
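
The cross-bag sampling idea can be sketched as follows: instead of drawing a pseudo-bag from one or two source bags, pool the instances of every bag that shares the target label and sample from that pool. This covers only the sampling step; the contrastive losses are not shown, and the function name and data layout are assumptions rather than the authors' code.

```python
import random


def cross_bag_pseudo_bag(bags, labels, target_label, size, rng):
    """Sample a pseudo-bag from the union of all bags with `target_label`."""
    pool = [
        instance
        for bag, label in zip(bags, labels)
        if label == target_label
        for instance in bag
    ]
    return rng.sample(pool, min(size, len(pool)))


# Toy bags of instance identifiers; in WSI classification each instance
# would be a patch feature vector.
bags = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
labels = [1, 0, 1]
pseudo_bag = cross_bag_pseudo_bag(bags, labels, target_label=1, size=4, rng=random.Random(0))
```

Because instances from bags `a` and `c` can now appear together, the pseudo-bags cover more of the class's feature distribution than per-bag sampling would.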

[CV-98] Exploring Fairness across Fine-Grained Attributes in Large Vision-Language Models CVPR2025

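【Quick Read】: This paper addresses the fairness of Large Vision-Language Models (LVLMs) across diverse attributes. Existing studies focus mostly on traditional demographic attributes such as race and gender and overlook bias on finer-grained attributes. The key to the solution is to build an open-set knowledge base of bias attributes using Large Language Models (LLMs) and, on that basis, to systematically evaluate LVLM fairness across many attribute dimensions. The results show that cultural, environmental, and behavioral factors influence model decisions more strongly than traditional demographic attributes.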

Link: https://arxiv.org/abs/2508.03079
Authors: Zaiying Zhao,Toshihiko Yamasaki
Affiliations: The University of Tokyo
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the Responsible Generative AI (ReGenAI) Workshop, CVPR 2025

Abstract:The rapid expansion of applications using Large Vision-Language Models (LVLMs), such as GPT-4o, has raised significant concerns about their fairness. While existing studies primarily focus on demographic attributes such as race and gender, fairness across a broader range of attributes remains largely unexplored. In this study, we construct an open-set knowledge base of bias attributes leveraging Large Language Models (LLMs) and evaluate the fairness of LVLMs across finer-grained attributes. Our experimental results reveal that LVLMs exhibit biased outputs across a diverse set of attributes and further demonstrate that cultural, environmental, and behavioral factors have a more pronounced impact on LVLM decision-making than traditional demographic attributes.

[CV-99] RobustGS: Unified Boosting of Feedforward 3D Gaussian Splatting under Low-Quality Conditions

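【Quick Read】: This paper addresses the drop in reconstruction quality of feedforward 3D Gaussian Splatting (3DGS) under challenging real-world imaging conditions such as noise, low light, or rain. Existing methods usually assume clean, high-quality input images, but in practice these degradations cause geometric errors and markedly worse reconstructions. The key to the solution is RobustGS, a general and efficient multi-view feature enhancement module with two core innovations: (1) a Generalized Degradation Learner that extracts generic representations and distributions of multiple degradations from multi-view inputs, improving degradation awareness; and (2) a semantic-aware state-space model that uses the degradation representations to enhance corrupted inputs in feature space and then aggregates semantically similar information across views, extracting fine-grained cross-view correspondences and improving 3D reconstruction quality. The module integrates into existing pretrained pipelines in a plug-and-play manner.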

Link: https://arxiv.org/abs/2508.03077
Authors: Anran Wu,Long Peng,Xin Di,Xueyuan Dai,Chen Wu,Yang Wang,Xueyang Fu,Yang Cao,Zheng-Jun Zha
Affiliations: 1. Institute of Automation, Chinese Academy of Sciences; 2. Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Feedforward 3D Gaussian Splatting (3DGS) overcomes the limitations of optimization-based 3DGS by enabling fast and high-quality reconstruction without the need for per-scene optimization. However, existing feedforward approaches typically assume that input multi-view images are clean and high-quality. In real-world scenarios, images are often captured under challenging conditions such as noise, low light, or rain, resulting in inaccurate geometry and degraded 3D reconstruction. To address these challenges, we propose a general and efficient multi-view feature enhancement module, RobustGS, which substantially improves the robustness of feedforward 3DGS methods under various adverse imaging conditions, enabling high-quality 3D reconstruction. The RobustGS module can be seamlessly integrated into existing pretrained pipelines in a plug-and-play manner to enhance reconstruction robustness. Specifically, we introduce a novel component, Generalized Degradation Learner, designed to extract generic representations and distributions of multiple degradations from multi-view inputs, thereby enhancing degradation-awareness and improving the overall quality of 3D reconstruction. In addition, we propose a novel semantic-aware state-space model. It first leverages the extracted degradation representations to enhance corrupted inputs in the feature space. Then, it employs a semantic-aware strategy to aggregate semantically similar information across different views, enabling the extraction of fine-grained cross-view correspondences and further improving the quality of 3D representations. Extensive experiments demonstrate that our approach, when integrated into existing methods in a plug-and-play manner, consistently achieves state-of-the-art reconstruction quality across various types of degradations.

[CV-100] SSFMamba: Symmetry-driven Spatial-Frequency Feature Fusion for 3D Medical Image Segmentation

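【Quick Read】: This paper addresses the limited ability of spatial-domain modeling to capture global context in 3D medical image segmentation. Existing methods try to add frequency-domain representations to capture more global information, but their feature extraction strategies generally ignore the conjugate symmetry specific to the frequency domain and the fundamental difference in data distribution between the spatial and frequency domains, so the frequency-domain advantage is diluted or lost. The key to the solution is SSFMamba, a Mamba-based network built around symmetry-driven spatial-frequency feature fusion: a dual-branch architecture extracts spatial and frequency features separately, and a Mamba block fuses these heterogeneous features, preserving global context while reinforcing local detail. A 3D multi-directional scanning mechanism further strengthens the fusion of local and global cues, substantially improving segmentation performance.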

Link: https://arxiv.org/abs/2508.03069
Authors: Bo Zhang,Yifan Zhang,Shuo Yan,Yu Bai,Zheng Zhang,Wu Liu,Xiuzhuang Zhou,Wendong Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:In light of the spatial domain’s limited capacity for modeling global context in 3D medical image segmentation, emerging approaches have begun to incorporate frequency domain representations. However, straightforward feature extraction strategies often overlook the unique properties of frequency domain information, such as conjugate symmetry. They also fail to account for the fundamental differences in data distribution between the spatial and frequency domains, which can ultimately dilute or obscure the complementary strengths that frequency-based representations offer. In this paper, we propose SSFMamba, a Mamba based Symmetry-driven Spatial-Frequency feature fusion network for 3D medical image segmentation. SSFMamba employs a complementary dual-branch architecture that extracts features from both the spatial and frequency domains, and leverages a Mamba block to fuse these heterogeneous features to preserve global context while reinforcing local details. In the frequency domain branch, we harness Mamba’s exceptional capability to extract global contextual information in conjunction with the synergistic effect of frequency domain features to further enhance global modeling. Moreover, we design a 3D multi-directional scanning mechanism to strengthen the fusion of local and global cues. Extensive experiments on the BraTS2020 and BraTS2023 datasets demonstrate that our approach consistently outperforms state-of-the-art methods across various evaluation metrics.
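
The conjugate symmetry mentioned in the abstract is a general property of the Fourier transform of real-valued data: X[k] equals the complex conjugate of X[N - k]. The 1-D check below uses a naive DFT for brevity; the 3-D case used for medical volumes is analogous, with all three frequency indices negated. The signal values are arbitrary, and nothing here reflects SSFMamba's actual implementation.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


signal = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5]  # real-valued input
spectrum = dft(signal)
n = len(signal)

# For real input, X[k] == conj(X[(N - k) mod N]), so roughly half of the
# spectrum is redundant and can be recovered from the other half.
max_violation = max(
    abs(spectrum[k] - spectrum[-k % n].conjugate()) for k in range(n)
)
```

A feature extractor that treats every frequency bin as independent ignores this redundancy, which is the kind of oversight the abstract points to.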

[CV-101] CORE-ReID: Comprehensive Optimization and Refinement through Ensemble fusion in Domain Adaptation for person re-identification

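【Quick Read】: This paper addresses the performance bottleneck of Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID), in particular the weak generalization caused by differences in image feature distributions between source and target domains across cameras. The core solution is a new method named "Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person ReID" (CORE-ReID). In the pre-training stage it uses CycleGAN for data augmentation to align image characteristics from different camera sources. In the fine-tuning stage it introduces a teacher-student network pair and combines multi-view features for multi-level clustering to generate diverse pseudo labels. Most importantly, it designs a learnable Ensemble Fusion module that focuses on fine-grained local information within global features, making the feature representation more comprehensive and reducing the ambiguity caused by multiple pseudo-labels. An Efficient Channel Attention Block and Bidirectional Mean Feature Normalization further refine the adaptive fusion of global and local features, improving Mean Average Precision, Top-1, Top-5, and Top-10.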

Link: https://arxiv.org/abs/2508.03064
Authors: Trinh Quoc Nguyen,Oky Dicky Ardiansyah Prima,Katsuyoshi Hotta
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:This study introduces a novel framework, “Comprehensive Optimization and Refinement through Ensemble Fusion in Domain Adaptation for Person Re-identification (CORE-ReID)”, to address Unsupervised Domain Adaptation (UDA) for Person Re-identification (ReID). The framework utilizes CycleGAN to generate diverse data that harmonizes differences in image characteristics from different camera sources in the pre-training stage. In the fine-tuning stage, based on a pair of teacher-student networks, the framework integrates multi-view features for multi-level clustering to derive diverse pseudo labels. A learnable Ensemble Fusion component that focuses on fine-grained local information within global features is introduced to enhance learning comprehensiveness and avoid ambiguity associated with multiple pseudo-labels. Experimental results on three common UDAs in Person ReID demonstrate significant performance gains over state-of-the-art approaches. Additional enhancements, such as an Efficient Channel Attention Block and Bidirectional Mean Feature Normalization, mitigate deviation effects and support adaptive fusion of global and local features using the ResNet-based model, further strengthening the framework. The proposed framework ensures clarity in fusion features, avoids ambiguity, and achieves high accuracy in terms of Mean Average Precision, Top-1, Top-5, and Top-10, positioning it as an advanced and effective solution for the UDA in Person ReID. Our codes and models are available at this https URL.

[CV-102] CHARM: Collaborative Harmonization across Arbitrary Modalities for Modality-agnostic Semantic Segmentation

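【Quick Read】: This paper addresses a problem in multimodal semantic segmentation: explicit feature alignment weakens modality-specific strengths and destroys complementarity, because existing methods achieve cross-modal fusion by forcing modal homogenization, which suppresses the distinctive information each modality carries. The key to the solution is the CHARM framework, which achieves cooperative harmonization through two core components. The first is the Mutual Perception Unit (MPU), which uses window-based cross-modal interaction in which modalities act as both queries and contexts for one another, implicitly discovering cross-modal correspondences. The second is a dual-path optimization strategy that decouples training into a Collaborative Learning Strategy (CoL) for complementary fusion and an Individual Enhancement Strategy (InE) that protects independent optimization of modality-specific features. The design preserves each modality's strengths while promoting their cooperation, and markedly improves results on fragile modalities.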

Link: https://arxiv.org/abs/2508.03060
Authors: Lekang Wen,Jing Xiao,Liang Liao,Jiajun Chen,Mi Wang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Modality-agnostic Semantic Segmentation (MaSS) aims to achieve robust scene understanding across arbitrary combinations of input modalities. Existing methods typically rely on explicit feature alignment to achieve modal homogenization, which dilutes the distinctive strengths of each modality and destroys their inherent complementarity. To achieve cooperative harmonization rather than homogenization, we propose CHARM, a novel complementary learning framework designed to implicitly align content while preserving modality-specific advantages through two components: (1) Mutual Perception Unit (MPU), enabling implicit alignment through window-based cross-modal interaction, where modalities serve as both queries and contexts for each other to discover modality-interactive correspondences; (2) A dual-path optimization strategy that decouples training into Collaborative Learning Strategy (CoL) for complementary fusion learning and Individual Enhancement Strategy (InE) for protected modality-specific optimization. Experiments across multiple datasets and backbones indicate that CHARM consistently outperforms the baselines, with significant gains on the fragile modalities. This work shifts the focus from model homogenization to harmonization, enabling cross-modal complementarity for true harmony in diversity.

[CV-103] Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation ACM-MM2025

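【Quick Read】: This paper addresses the degraded performance of face filters under occlusion, especially when hands, hair, or accessories cover facial regions and conventional methods cannot cleanly separate foreground from background. The key to the solution is a new task, face matting, together with FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes through two-stage training: a teacher model is first trained to jointly estimate alpha mattes and per-pixel uncertainty, and that uncertainty then guides the student model through spatially adaptive knowledge distillation, so the student focuses on ambiguous or occluded regions and gains in generalization and semantic consistency. In addition, the matting objective is reformulated by explicitly treating skin as foreground and occlusions as background, which further improves compositing.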

Link: https://arxiv.org/abs/2508.03055
Authors: Hyebin Cho,Jaehyup Lee
Affiliations: Korea Advanced Institute of Science & Technology; Kyungpook National University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Accepted to ACM MM 2025. 9 pages, 8 figures, 6 tables

Abstract:Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we newly constructed CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at this https URL.
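
To see how an NLL loss lets a model learn per-pixel uncertainty, consider a Gaussian likelihood with predicted mean (the alpha value) and predicted standard deviation. The abstract only says "negative log-likelihood (NLL) loss", so the Gaussian form, the dropped constant term, and the function name are assumptions.

```python
import math


def gaussian_nll(alpha_true, alpha_pred, sigma):
    """Per-pixel Gaussian NLL, omitting the constant 0.5 * log(2 * pi)."""
    var = sigma * sigma
    return 0.5 * (math.log(var) + (alpha_true - alpha_pred) ** 2 / var)


# Large error (for example, an occlusion boundary): admitting high
# uncertainty is cheaper than being confidently wrong.
confident_wrong = gaussian_nll(1.0, 0.6, sigma=0.1)
uncertain_wrong = gaussian_nll(1.0, 0.6, sigma=0.4)

# Small error: high confidence is rewarded with a lower loss.
confident_right = gaussian_nll(1.0, 0.99, sigma=0.1)
uncertain_right = gaussian_nll(1.0, 0.99, sigma=0.4)
```

Under such a loss the predicted sigma tends to be large where the matte is hard to estimate, which is what makes it usable as a spatial weight during distillation.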

[CV-104] Multi-human Interactive Talking Dataset

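【Quick Read】: This paper addresses a common limitation of existing talking video generation research, which focuses mainly on single-person monologues or isolated facial animation and does not transfer well to realistic multi-human interaction. To fill this gap, the authors introduce the MIT dataset, a large-scale, high-resolution collection of multi-speaker conversational videos with two to four participants and fine-grained annotations of body poses and speech interactions, providing a rich resource for studying multi-person interactive behavior. The key to the solution is the CovOG baseline model, which has two core components: a Multi-Human Pose Encoder (MPE) that handles varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) that modulates head dynamics based on speaker-specific audio features. Together they demonstrate the feasibility and the challenges of multi-human talking video generation and establish MIT as a benchmark for future research.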

Link: https://arxiv.org/abs/2508.03050
Authors: Zeyu Zhu,Weijia Wu,Mike Zheng Shou
Affiliations: Show Lab, National University of Singapore
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 4 figures, 4 tables

Abstract:Existing studies on talking video generation have predominantly focused on single-person monologues or isolated facial animations, limiting their applicability to realistic multi-human interactions. To bridge this gap, we introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. To this end, we develop an automatic pipeline that collects and annotates multi-person conversational videos. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers, with fine-grained annotations of body poses and speech interactions. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors. To demonstrate the potential of MIT, we further propose CovOG, a baseline model for this novel task. It integrates a Multi-Human Pose Encoder (MPE) to handle varying numbers of speakers by aggregating individual pose embeddings, and an Interactive Audio Driver (IAD) to modulate head dynamics based on speaker-specific audio features. Together, these components showcase the feasibility and challenges of generating realistic multi-human talking videos, establishing MIT as a valuable benchmark for future research. The code is available at: this https URL.
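
One simple way to "handle varying numbers of speakers by aggregating individual pose embeddings" is a permutation-invariant pooling step that maps any number of per-speaker vectors to one fixed-size vector. The abstract does not state which aggregation MPE uses, so mean pooling and the function name below are assumptions chosen for illustration.

```python
def aggregate_pose_embeddings(per_speaker):
    """Mean-pool a list of equal-length embeddings into one vector."""
    if not per_speaker:
        raise ValueError("at least one speaker embedding is required")
    count = len(per_speaker)
    dim = len(per_speaker[0])
    return [sum(emb[i] for emb in per_speaker) / count for i in range(dim)]


# Two speakers and four speakers both yield a vector of the same size,
# so downstream layers do not depend on the number of people in the scene.
two = aggregate_pose_embeddings([[1.0, 2.0], [3.0, 4.0]])
four = aggregate_pose_embeddings([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
```

Mean pooling discards which speaker contributed what; a learned or attention-based aggregation would keep more of that information at higher cost.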

[CV-105] VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering

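[Quick Read]: This paper tackles the core challenges of cross-video question answering: establishing meaningful connections across multiple video streams and managing the complexity of multi-source information retrieval. The key to the solution is the VideoForest framework, which achieves cross-video understanding through person-anchored hierarchical reasoning. First, it uses ReID (re-identification) and tracking algorithms to extract person-level features that act as natural bridge points between videos and establish cross-video spatiotemporal relationships. Second, it builds a multi-granularity spanning tree that organizes visual content hierarchically around person trajectories. Finally, a multi-agent reasoning framework traverses this structure to answer complex cross-video questions, supporting cross-video person recognition, behavior analysis, and reasoning without end-to-end training.

Link: https://arxiv.org/abs/2508.03039
Authors: Yiran Meng, Junhong Ye, Wei Zhou, Guanghui Yue, Xudong Mao, Ruomei Wang, Baoquan Zhao
Affiliations: Sun Yat-Sen University; Cardiff University; Shenzhen University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: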

Abstract:Cross-video question answering presents significant challenges beyond traditional single-video understanding, particularly in establishing meaningful connections across video streams and managing the complexity of multi-source information retrieval. We introduce VideoForest, a novel framework that addresses these challenges through person-anchored hierarchical reasoning. Our approach leverages person-level features as natural bridge points between videos, enabling effective cross-video understanding without requiring end-to-end training. VideoForest integrates three key innovations: 1) a human-anchored feature extraction mechanism that employs ReID and tracking algorithms to establish robust spatiotemporal relationships across multiple video sources; 2) a multi-granularity spanning tree structure that hierarchically organizes visual content around person-level trajectories; and 3) a multi-agent reasoning framework that efficiently traverses this hierarchical structure to answer complex cross-video queries. To evaluate our approach, we develop CrossVideoQA, a comprehensive benchmark dataset specifically designed for person-centric cross-video analysis. Experimental results demonstrate VideoForest’s superior performance in cross-video reasoning tasks, achieving 71.93% accuracy in person recognition, 83.75% in behavior analysis, and 51.67% in summarization and reasoning, significantly outperforming existing methods. Our work establishes a new paradigm for cross-video understanding by unifying multiple video streams through person-level features, enabling sophisticated reasoning across distributed visual information while maintaining computational efficiency.

[CV-106] MoCA: Identity-Preserving Text-to-Video Generation via Mixture of Cross Attention

[Quick Read]: This paper addresses identity-preserving (ID-preserving) text-to-video (T2V) generation, in particular the failure of existing diffusion-based methods to capture fine-grained facial dynamics and maintain temporal identity coherence. The key to the solution is MoCA, a new video diffusion model whose core innovation is a cross-attention mechanism, inspired by the Mixture-of-Experts paradigm, embedded in a Diffusion Transformer (DiT) architecture: Hierarchical Temporal Pooling extracts multi-scale identity features, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. A Latent Video Perceptual Loss is added to strengthen inter-frame identity consistency and detail fidelity.

Link: https://arxiv.org/abs/2508.03034
Authors: Qi Xie(1), Yongjia Ma(2), Donglin Di(2), Xuehao Gao(3), Xun Yang(1) ((1) University of Science and Technology of China, (2) Li Auto, (3) Northwestern Polytechnical University)
Affiliations: University of Science and Technology of China; Li Auto; Northwestern Polytechnical University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Achieving ID-preserving text-to-video (T2V) generation remains challenging despite recent advances in diffusion-based models. Existing approaches often fail to capture fine-grained facial dynamics or maintain temporal identity coherence. To address these limitations, we propose MoCA, a novel Video Diffusion Model built on a Diffusion Transformer (DiT) backbone, incorporating a Mixture of Cross-Attention mechanism inspired by the Mixture-of-Experts paradigm. Our framework improves inter-frame identity consistency by embedding MoCA layers into each DiT block, where Hierarchical Temporal Pooling captures identity features over varying timescales, and Temporal-Aware Cross-Attention Experts dynamically model spatiotemporal relationships. We further incorporate a Latent Video Perceptual Loss to enhance identity coherence and fine-grained details across video frames. To train this model, we collect CelebIPVid, a dataset of 10,000 high-resolution videos from 1,000 diverse individuals, promoting cross-ethnicity generalization. Extensive experiments on CelebIPVid show that MoCA outperforms existing T2V methods by over 5% across Face similarity.

[CV-107] SA-3DGS: A Self-Adaptive Compression Method for 3D Gaussian Splatting AAAI2026

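[Quick Read]: This paper addresses the high storage cost of 3D Gaussian Splatting (3DGS), which needs a large number of Gaussian points to represent a scene, and the fact that existing compression methods struggle to identify and remove truly redundant Gaussians, which degrades subsequent pruning, compression quality, and rendering performance. The key to the solution is the SA-3DGS framework, whose core innovations are: (1) a learned importance score that automatically identifies the Gaussians contributing least to scene reconstruction, enabling effective pruning; (2) an importance-aware clustering module that compresses Gaussian attributes into a codebook more accurately, improving the codebook's expressiveness while reducing model size; and (3) a codebook repair module that uses contextual scene information to recover lost Gaussian attributes, mitigating the rendering degradation caused by information loss. Experiments show up to 66x compression while maintaining or even improving rendering quality, and the pruning approach also improves other pruning-based methods such as LightGaussian.

Link: https://arxiv.org/abs/2508.03017
Authors: Liheng Zhang, Weihao Yu, Zubo Lu, Haozhi Gu, Jin Huang
Affiliations: South China Normal University; China Telecom Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages, 7 figures. Under review at AAAI 2026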

Abstract:Recent advancements in 3D Gaussian Splatting have enhanced efficient and high-quality novel view synthesis. However, representing scenes requires a large number of Gaussian points, leading to high storage demands and limiting practical deployment. The latest methods facilitate the compression of Gaussian models but struggle to identify truly insignificant Gaussian points in the scene, leading to a decline in subsequent Gaussian pruning, compression quality, and rendering performance. To address this issue, we propose SA-3DGS, a method that significantly reduces storage costs while maintaining rendering quality. SA-3DGS learns an importance score to automatically identify the least significant Gaussians in scene reconstruction, thereby enabling effective pruning and redundancy reduction. Next, the importance-aware clustering module compresses Gaussian attributes more accurately into the codebook, improving the codebook’s expressive capability while reducing model size. Finally, the codebook repair module leverages contextual scene information to repair the codebook, thereby recovering the original Gaussian point attributes and mitigating the degradation in rendering quality caused by information loss. Experimental results on several benchmark datasets show that our method achieves up to 66x compression while maintaining or even improving rendering quality. The proposed Gaussian pruning approach is not only adaptable to but also improves other pruning-based methods (e.g., LightGaussian), showcasing excellent performance and strong generalization ability.
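
The first two steps described above (importance-based pruning, then importance-aware clustering into a codebook) can be sketched in a few lines. This is an illustrative toy, not the paper's method: the scores are given rather than learned, the `keep_ratio` parameter is hypothetical, and a single importance-weighted centroid stands in for a full clustering procedure.

```python
def prune_by_importance(gaussians, scores, keep_ratio):
    """Keep the highest-scoring fraction of Gaussians, preserving their order."""
    keep_count = max(1, int(len(gaussians) * keep_ratio))
    ranked = sorted(range(len(gaussians)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep_count])
    return [gaussians[i] for i in kept], [scores[i] for i in kept]

def weighted_centroid(values, weights):
    """Codebook entry for one cluster: an importance-weighted mean (assumed form)."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Four scalar attributes with importance scores; keep half, then summarize them.
kept_values, kept_scores = prune_by_importance(
    [1.0, 5.0, 2.0, 3.0], [0.9, 0.1, 0.5, 0.7], keep_ratio=0.5
)
entry = weighted_centroid(kept_values, kept_scores)
```

Weighting the centroid by importance pulls each codebook entry toward the Gaussians that matter most for reconstruction, which is one plausible reading of "importance-aware clustering".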

[CV-108] Enhancing Long Video Question Answering with Scene-Localized Frame Grouping

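[Quick Read]: This paper addresses the poor performance of current Multimodal Large Language Models (MLLMs) on long video understanding. The core challenges are that resource limits prevent them from processing all video frames and their associated information, and that existing evaluation frameworks focus on finding specific target frames among many irrelevant ones, which does not match real-world needs. The authors therefore propose a new video question-answering scenario, SceneQA, which emphasizes scene-level detail perception and reasoning, and build the LVSQA dataset to support it. The key to the solution is a new method called SLFG (Scene-Localized Frame Grouping), whose core idea is to combine individual frames into semantically coherent scene frames, using scene localization and dynamic frame reassembly to improve MLLMs' long-video understanding. It requires no changes to the original model architecture, is plug-and-play, and shows clear gains on several long-video benchmarks.

Link: https://arxiv.org/abs/2508.03009
Authors: Xuyi Yang, Wenhao Zhang, Hongbo Jin, Lin Liu, Hongbo Xu, Yongwei Nie, Fei Yu, Fei Ma
Affiliations: 1. University of Science and Technology of China; 2. Alibaba Group; 3. Tsinghua University; 4. Huawei Technologies Co., Ltd.
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: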

Abstract:Current Multimodal Large Language Models (MLLMs) often perform poorly in long video understanding, primarily due to resource limitations that prevent them from processing all video frames and their associated information. Efficiently extracting relevant information becomes a challenging task. Existing frameworks and evaluation tasks focus on identifying specific frames containing core objects from a large number of irrelevant frames, which does not align with the practical needs of real-world applications. To address this issue, we propose a new scenario under the video question-answering task, SceneQA, which emphasizes scene-based detail perception and reasoning abilities. And we develop the LVSQA dataset to support the SceneQA task, which is built upon carefully selected videos from LVBench and contains a new collection of question-answer pairs to promote a more fair evaluation of MLLMs’ scene perception abilities in long videos. Inspired by human cognition, we introduce a novel method called SLFG. The core idea of SLFG is to combine individual frames into semantically coherent scene frames. By leveraging scene localization methods and dynamic frame reassembly mechanisms, SLFG significantly enhances the understanding capabilities of existing MLLMs in long videos. SLFG requires no modification to the original model architecture and boasts excellent plug-and-play usability. Experimental results show that this method performs exceptionally well in several long video benchmark tests. Code and dataset will be released at this http URL.
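
The core idea of combining individual frames into coherent scene groups can be illustrated with a simple boundary rule. The abstract does not describe SLFG's actual scene localization or reassembly mechanism; the per-frame feature vectors, the Euclidean distance test, and the `threshold` parameter below are hypothetical stand-ins.

```python
import math

def group_frames_into_scenes(features, threshold):
    """Group consecutive frame indices into scenes.

    A new scene starts whenever the distance between neighbouring frame
    features exceeds `threshold` (an assumed, simplified boundary rule).
    """
    scenes = []
    for i, feat in enumerate(features):
        if i > 0 and math.dist(features[i - 1], feat) <= threshold:
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes

# Five frames with toy 2-D features; similar neighbours share a scene.
frames = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.0, 1.1], [0.0, 0.0]]
scenes = group_frames_into_scenes(frames, threshold=0.5)
```

Each resulting group could then be presented to an MLLM as a single scene-level unit instead of many isolated frames.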

[CV-109] Multi-Granularity Feature Calibration via VFM for Domain Generalized Semantic Segmentation

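[Quick Read]: This paper addresses the limited generalization of models to unseen domains in Domain Generalized Semantic Segmentation (DGSS). Existing methods mostly fine-tune global features and overlook hierarchical adaptation across feature levels, which limits dense prediction accuracy. The key to the solution is the Multi-Granularity Feature Calibration (MGFC) framework, which aligns features step by step from coarse to fine: it first calibrates coarse-grained features to capture global contextual semantics and scene structure, then strengthens the category-level discriminability of medium-grained features, and finally refines fine-grained features through high-frequency spatial detail enhancement, transferring Vision Foundation Models (VFMs) to DGSS efficiently and robustly.

Link: https://arxiv.org/abs/2508.03007
Authors: Xinhui Li, Xiaojie Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: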

Abstract:Domain Generalized Semantic Segmentation (DGSS) aims to improve the generalization ability of models across unseen domains without access to target data during training. Recent advances in DGSS have increasingly exploited vision foundation models (VFMs) via parameter-efficient fine-tuning strategies. However, most existing approaches concentrate on global feature fine-tuning, while overlooking hierarchical adaptation across feature levels, which is crucial for precise dense prediction. In this paper, we propose Multi-Granularity Feature Calibration (MGFC), a novel framework that performs coarse-to-fine alignment of VFM features to enhance robustness under domain shifts. Specifically, MGFC first calibrates coarse-grained features to capture global contextual semantics and scene-level structure. Then, it refines medium-grained features by promoting category-level feature discriminability. Finally, fine-grained features are calibrated through high-frequency spatial detail enhancement. By performing hierarchical and granularity-aware calibration, MGFC effectively transfers the generalization strengths of VFMs to the domain-specific task of DGSS. Extensive experiments on benchmark datasets demonstrate that our method outperforms state-of-the-art DGSS approaches, highlighting the effectiveness of multi-granularity adaptation for the semantic segmentation task of domain generalization.

[CV-110] Seeing It Before It Happens: In-Generation NSFW Detection for Diffusion-Based Text-to-Image Models

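[Quick Read]: This paper addresses the detection of not-safe-for-work (NSFW) content produced by diffusion models during image generation. Existing methods mostly filter prompts before generation or moderate images afterward and neglect monitoring during the generation process itself. The key to the solution is In-Generation Detection (IGD), which uses the noise predicted during denoising as an internal signal that captures semantic cues separating NSFW from benign content, so that malicious generations can be identified early. Experiments show that IGD reaches an average detection accuracy of 91.32% across seven NSFW categories, outperforms seven baseline methods, and is robust to adversarial prompts.

Link: https://arxiv.org/abs/2508.03006
Authors: Fan Yang, Yihao Huang, Jiayi Zhu, Ling Shi, Geguang Pu, Jin Song Dong, Kailong Wang
Affiliations: Huazhong University of Science and Technology; National University of Singapore; East China Normal University; Nanyang Technological University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 8 pages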

Abstract:Diffusion-based text-to-image (T2I) models enable high-quality image generation but also pose significant risks of misuse, particularly in producing not-safe-for-work (NSFW) content. While prior detection methods have focused on filtering prompts before generation or moderating images afterward, the in-generation phase of diffusion models remains largely unexplored for NSFW detection. In this paper, we introduce In-Generation Detection (IGD), a simple yet effective approach that leverages the predicted noise during the diffusion process as an internal signal to identify NSFW content. This approach is motivated by preliminary findings suggesting that the predicted noise may capture semantic cues that differentiate NSFW from benign prompts, even when the prompts are adversarially crafted. Experiments conducted on seven NSFW categories show that IGD achieves an average detection accuracy of 91.32% over naive and adversarial NSFW prompts, outperforming seven baseline methods.

[CV-111] VCNet: Recreating High-Level Visual Cortex Principles for Robust Artificial Vision

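[Quick Read]: This paper addresses inherent limitations of modern convolutional neural networks (CNNs): data inefficiency, poor out-of-distribution generalization, and sensitivity to adversarial perturbations. The key to the solution is to borrow the macro-scale organization of the primate visual cortex in a new architecture called Visual Cortex Network (VCNet), which emulates key biological mechanisms including hierarchical processing across cortical areas, dual-stream information segregation, and top-down predictive feedback, improving accuracy and robustness on the specific tasks evaluated.

Link: https://arxiv.org/abs/2508.02995
Authors: Brennen A. Hill, Zhang Xinyu, Timothy Putra Prasetio
Affiliations: University of Wisconsin-Madison; National University of Singapore
Categories: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: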

Abstract:Despite their success in image classification, modern convolutional neural networks (CNNs) exhibit fundamental limitations, including data inefficiency, poor out-of-distribution generalization, and vulnerability to adversarial perturbations. The primate visual system, in contrast, demonstrates superior efficiency and robustness, suggesting that its architectural principles may offer a blueprint for more capable artificial vision systems. This paper introduces Visual Cortex Network (VCNet), a novel neural network architecture whose design is informed by the macro-scale organization of the primate visual cortex. VCNet emulates key biological mechanisms, including hierarchical processing across distinct cortical areas, dual-stream information segregation, and top-down predictive feedback. We evaluate VCNet on two specialized benchmarks: the Spots-10 animal pattern dataset and a light field image classification task. Our results show that VCNet achieves a classification accuracy of 92.1% on Spots-10 and 74.4% on the light field dataset, surpassing contemporary models of comparable size. This work demonstrates that integrating neuroscientific principles into network design can lead to more efficient and robust models, providing a promising direction for addressing long-standing challenges in machine learning.

[CV-112] Adversarial Attention Perturbations for Large Object Detection Transformers ICCV2025

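[Quick Read]: This paper addresses the fact that current adversarial perturbation methods for object detection are weak against transformer-based detectors, and that most of them apply only to convolutional neural network (CNN) architectures. The key to the solution is the architecture-agnostic Attention-Focused Offensive Gradient (AFOG) attack. It introduces a learnable attention mechanism that concentrates perturbations on vulnerable image regions, making attacks on multi-box detection more effective, and it combines two types of feature loss with iterative perturbation injection to raise the attack success rate while remaining visually imperceptible. Experiments on COCO with twelve large detection transformers show that AFOG outperforms existing attacks by up to 83% while being fast and stealthy.

Link: https://arxiv.org/abs/2508.02987
Authors: Zachary Yahn, Selim Furkan Tekin, Fatih Ilhan, Sihao Hu, Tiansheng Huang, Yichang Xu, Margaret Loper, Ling Liu
Affiliations: Georgia Institute of Technology; Georgia Tech Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: ICCV 2025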

Abstract:Adversarial perturbations are useful tools for exposing vulnerabilities in neural networks. Existing adversarial perturbation methods for object detection are either limited to attacking CNN-based detectors or weak against transformer-based detectors. This paper presents an Attention-Focused Offensive Gradient (AFOG) attack against object detection transformers. By design, AFOG is neural-architecture agnostic and effective for attacking both large transformer-based object detectors and conventional CNN-based detectors with a unified adversarial attention framework. This paper makes three original contributions. First, AFOG utilizes a learnable attention mechanism that focuses perturbations on vulnerable image regions in multi-box detection tasks, increasing performance over non-attention baselines by up to 30.6%. Second, AFOG’s attack loss is formulated by integrating two types of feature loss through learnable attention updates with iterative injection of adversarial perturbations. Finally, AFOG is an efficient and stealthy adversarial perturbation method. It probes the weak spots of detection transformers by adding strategically generated and visually imperceptible perturbations which can cause well-trained object detection models to fail. Extensive experiments conducted with twelve large detection transformers on COCO demonstrate the efficacy of AFOG. Our empirical results also show that AFOG outperforms existing attacks on transformer-based and CNN-based object detectors by up to 83% with superior speed and imperceptibility. Code is available at this https URL.

[CV-113] MoExDA: Domain Adaptation for Edge-based Action Recognition

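[Quick Read]: This paper addresses static bias in modern action recognition models, which reduces generalization across domains. The key to the solution is MoExDA, a lightweight domain adaptation method that models edge frames jointly with RGB frames to strengthen the model's sensitivity to motion, suppressing static bias at lower computational cost and making action recognition more robust.

Link: https://arxiv.org/abs/2508.02981
Authors: Takuya Sugimoto, Ning Ding, Toru Tamaki
Affiliations: Nagoya Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 7 pages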

Abstract:Modern action recognition models suffer from static bias, leading to reduced generalization performance. In this paper, we propose MoExDA, a lightweight domain adaptation between RGB and edge information using edge frames in addition to RGB frames to counter the static bias issue. Experiments demonstrate that the proposed method effectively suppresses static bias with a lower computational cost, allowing for more robust action recognition than previous approaches.

[CV-114] Separating Shared and Domain-Specific LoRAs for Multi-Domain Learning

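[Quick Read]: This paper addresses the possibility that, in multi-domain learning, shared LoRA (Low-Rank Adaptation) and domain-specific LoRA overlap in representation space, so that domain-specific information is captured poorly. The key to the solution is to place the two in different subspaces of the pre-trained weight matrix: the shared LoRA lies in the column space of the pre-trained weights and the domain-specific LoRA lies in their left null space. The two are therefore orthogonal, which is intended to help the model disentangle features from different domains.

Link: https://arxiv.org/abs/2508.02978
Authors: Yusaku Takama, Ning Ding, Tatsuya Yokota, Toru Tamaki
Affiliations: Nagoya Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages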

Abstract:Existing architectures of multi-domain learning have two types of adapters: shared LoRA for all domains and domain-specific LoRA for each particular domain. However, it remains unclear whether this structure effectively captures domain-specific information. In this paper, we propose a method that ensures that shared and domain-specific LoRAs exist in different subspaces; specifically, the column and left null subspaces of the pre-trained weights. We apply the proposed method to action recognition with three datasets (UCF101, Kinetics400, and HMDB51) and demonstrate its effectiveness in some cases along with the analysis of the dimensions of LoRA weights.
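
The column space and the left null space of a matrix are orthogonal complements, which is what keeps the two kinds of LoRA apart. The sketch below shows this on a single vector in pure Python. It is only a geometric illustration, not the paper's training procedure: the Gram-Schmidt construction and all function names are assumptions.

```python
def column_space_basis(W, tol=1e-10):
    """Orthonormal basis for the column space of W (list of rows), via Gram-Schmidt."""
    rows, cols = len(W), len(W[0])
    basis = []
    for j in range(cols):
        v = [W[i][j] for i in range(rows)]
        for q in basis:
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > tol:  # skip columns that add no new direction
            basis.append([a / norm for a in v])
    return basis

def project_onto_column_space(basis, x):
    """Component of x inside the column space (where a shared update would live)."""
    out = [0.0] * len(x)
    for q in basis:
        dot = sum(a * b for a, b in zip(q, x))
        out = [o + dot * a for o, a in zip(out, q)]
    return out

def project_onto_left_null_space(basis, x):
    """Component of x orthogonal to the column space (domain-specific part)."""
    p = project_onto_column_space(basis, x)
    return [a - b for a, b in zip(x, p)]

W = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # toy 3x2 "pre-trained weight"
basis = column_space_basis(W)
shared = project_onto_column_space(basis, [1.0, 2.0, 3.0])
specific = project_onto_left_null_space(basis, [1.0, 2.0, 3.0])
```

The inner product of `shared` and `specific` is zero by construction, so updates restricted to the two subspaces cannot overlap.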

[CV-115] Diffusion Models with Adaptive Negative Sampling Without External Resources

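[Quick Read]: This paper addresses the inconsistency of diffusion models (DMs) in text-to-image generation: prompt adherence varies and image quality is unstable. Existing methods such as negative prompting improve prompt compliance but rely on hand-written negative prompts, which are lossy and incomplete. The key to the solution is a training-free sampling procedure, Adaptive Negative Sampling Without External Resources (ANSWER). It exploits the diffusion model's internal understanding of negation together with classifier-free guidance (CFG) to derive implicit negative conditions from a single positive prompt, giving more precise control without an explicit negative prompt and improving both prompt faithfulness and human preference.

Link: https://arxiv.org/abs/2508.02973
Authors: Alakh Desai, Nuno Vasconcelos
Affiliations: University of California, San Diego
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: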

Abstract:Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well-known to vary substantially regarding both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that can maximize the odds of the positive prompt. In this work, we explore relations between negative prompting and classifier-free guidance (CFG) to develop a sampling procedure, Adaptive Negative Sampling Without External Resources (ANSWER), that accounts for both positive and negative conditions from a single prompt. This leverages the internal understanding of negation by the diffusion model to increase the odds of generating images faithful to the prompt. ANSWER is a training-free technique, applicable to any model that supports CFG, and allows for negative grounding of image concepts without explicit negative prompts, which are lossy and incomplete. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks and is preferred by humans 2x more over the other methods.
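
The link between negative prompting and classifier-free guidance can be made concrete with the standard guidance combination, in which the negative-prompt prediction takes the place of the unconditional one. The sketch shows only this baseline relation, not the ANSWER procedure itself; `guided_noise` and its list-based inputs are illustrative assumptions.

```python
def guided_noise(eps_pos, eps_neg, scale):
    """Classifier-free guidance with a negative branch, element-wise:

        eps = eps_neg + scale * (eps_pos - eps_neg)

    eps_pos: noise predicted under the positive prompt.
    eps_neg: noise predicted under the negative (or empty) prompt.
    """
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]

# scale = 1 reproduces the positive prediction; larger values push away
# from whatever the negative branch describes.
plain = guided_noise([1.0, 2.0], [0.0, 1.0], scale=1.0)
pushed = guided_noise([1.0, 2.0], [0.0, 1.0], scale=7.5)
```

ANSWER, as described in the abstract, removes the need to write the negative prompt by hand; how it obtains the negative branch is not specified here.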

[CV-116] Towards Robust Image Denoising with Scale Equivariance

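[Quick Read]: This paper addresses the weak generalization of image denoising models to out-of-distribution (OOD) noise patterns, especially the sharp performance drop under spatially variant noise. The key to the solution is to adopt scale equivariance as the core inductive bias and build a robust blind denoising framework with two components: a Heterogeneous Normalization Module (HNM) that stabilizes feature distributions and dynamically corrects features under varying noise intensities, and an Interactive Gating Module (IGM) that modulates information through gated interactions between signal and feature paths, improving adaptation to spatially heterogeneous noise.

Link: https://arxiv.org/abs/2508.02967
Authors: Dawei Zhang, Xiaojie Guo
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: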

Abstract:Despite notable advances in image denoising, existing models often struggle to generalize beyond in-distribution noise patterns, particularly when confronted with out-of-distribution (OOD) conditions characterized by spatially variant noise. This generalization gap remains a fundamental yet underexplored challenge. In this work, we investigate scale equivariance as a core inductive bias for improving OOD robustness. We argue that incorporating scale-equivariant structures enables models to better adapt from training on spatially uniform noise to inference on spatially non-uniform degradations. Building on this insight, we propose a robust blind denoising framework equipped with two key components: a Heterogeneous Normalization Module (HNM) and an Interactive Gating Module (IGM). HNM stabilizes feature distributions and dynamically corrects features under varying noise intensities, while IGM facilitates effective information modulation via gated interactions between signal and feature paths. Extensive evaluations demonstrate that our model consistently outperforms state-of-the-art methods on both synthetic and real-world benchmarks, especially under spatially heterogeneous noise. Code will be made publicly available.

[CV-117] X-Actor: Emotional and Expressive Long-Range Portrait Acting from Audio

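[Quick Read]: This paper addresses the limits of existing audio-driven facial animation methods in long-range generation, emotional expressiveness, and acting quality, in particular their difficulty producing emotions that evolve with the rhythm and content of speech and remaining stable over long durations. The key to the solution is a two-stage decoupled generation framework. First, a diffusion-based autoregressive model predicts expressive motion latent tokens over a long temporal context in an identity-agnostic facial motion latent space, capturing long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm. A diffusion-based video synthesis module then turns these motion latents into high-fidelity video, enabling infinite-length, emotionally rich portrait performance without error accumulation.

Link: https://arxiv.org/abs/2508.02944
Authors: Chenxu Zhang, Zenan Li, Hongyi Xu, You Xie, Xiaochen Zhao, Tianpei Gu, Guoxian Song, Xin Chen, Chao Liang, Jianwen Jiang, Linjie Luo
Affiliations: Bytedance Intelligent Creation
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project Page at this https URL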

Abstract:We present X-Actor, a novel audio-driven portrait animation framework that generates lifelike, emotionally expressive talking head videos from a single reference image and an input audio clip. Unlike prior methods that emphasize lip synchronization and short-range visual fidelity in constrained speaking scenarios, X-Actor enables actor-quality, long-form portrait performance capturing nuanced, dynamically evolving emotions that flow coherently with the rhythm and content of speech. Central to our approach is a two-stage decoupled generation pipeline: an audio-conditioned autoregressive diffusion model that predicts expressive yet identity-agnostic facial motion latent tokens within a long temporal context window, followed by a diffusion-based video synthesis module that translates these motions into high-fidelity video animations. By operating in a compact facial motion latent space decoupled from visual and identity cues, our autoregressive diffusion model effectively captures long-range correlations between audio and facial dynamics through a diffusion-forcing training paradigm, enabling infinite-length emotionally-rich motion prediction without error accumulation. Extensive experiments demonstrate that X-Actor produces compelling, cinematic-style performances that go beyond standard talking head animations and achieves state-of-the-art results in long-range, audio-driven emotional portrait acting.

[CV-118] Infrared Object Detection with Ultra Small ConvNets: Is ImageNet Pretraining Still Useful?

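[Quick Read]: This paper studies how ImageNet pre-training affects the out-of-distribution detection robustness of ultra-small models (about 1M parameters) used for infrared object detection on embedded devices. It finds that pre-training still helps small models, but that beyond a certain capacity threshold the robustness gain shows diminishing returns. The key to the approach is to construct two ultra-small backbone families using scaling laws derived from standard object recognition architectures and to evaluate them systematically on three datasets, yielding practical guidance for edge deployment: keep using pre-training, and avoid models that are too small, since they may do well in-domain but become brittle when operating conditions change.

Link: https://arxiv.org/abs/2508.02927
Authors: Srikanth Muralidharan, Heitor R. Medeiros, Masih Aminbeidokhti, Eric Granger, Marco Pedersoli
Affiliations: LIVIA, Dept. of Systems Engineering, ETS Montreal, Canada; International Laboratory on Learning Systems (ILLS)
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: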

Abstract:Many real-world applications require recognition models that are robust to different operational conditions and modalities, but at the same time run on small embedded devices with limited hardware. While for normal-size models pre-training is known to be very beneficial for accuracy and robustness, for small models that can be deployed on embedded and edge devices its effect is not clear. In this work, we investigate the effect of ImageNet pretraining on increasingly small backbone architectures (ultra-small models, with 1M parameters) with respect to robustness in downstream object detection tasks in the infrared visual modality. Using scaling laws derived from standard object recognition architectures, we construct two ultra-small backbone families and systematically study their performance. Our experiments on three different datasets reveal that while ImageNet pre-training is still useful, beyond a certain capacity threshold, it offers diminishing returns in terms of out-of-distribution detection robustness. Therefore, we advise practitioners to still use pre-training and, when possible, to avoid models that are too small: while they might work well for in-domain problems, they are brittle when working conditions are different.

[CV-119] How Diffusion Prior Landscapes Shape the Posterior in Blind Deconvolution

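[Quick Read]: This paper addresses the tendency of Maximum A Posteriori (MAP) estimation in blind deconvolution to produce blurry solutions when paired with sparsity-promoting image priors. It revisits this result with diffusion-based image priors. Analysis of the prior's likelihood landscape shows that blurry images have higher likelihood, yet the landscape also contains many local minimizers that correspond to natural images. So although the MAP estimator itself tends toward sharp filters and blurry images, gradient descent from a good local initialization can converge to these local minimizers and recover realistic sharp images, effectively solving the blind deconvolution problem. The finding suggests that improving MAP performance depends on a sound local initialization strategy that reaches the natural-image local minimizers of the posterior landscape.

Link: https://arxiv.org/abs/2508.02923
Authors: Minh-Hai Nguyen, Edouard Pauwels, Pierre Weiss
Affiliations: IRIT; Toulouse University; Toulouse School of Economics; Université Toulouse Capitole; IUF; CNRS
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: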

Abstract:The Maximum A Posteriori (MAP) estimation is a widely used framework in blind deconvolution to recover sharp images from blurred observations. The estimated image and blur filter are defined as the maximizer of the posterior distribution. However, when paired with sparsity-promoting image priors, MAP estimation has been shown to favor blurry solutions, limiting its effectiveness. In this paper, we revisit this result using diffusion-based priors, a class of models that capture realistic image distributions. Through an empirical examination of the prior’s likelihood landscape, we uncover two key properties: first, blurry images tend to have higher likelihoods; second, the landscape contains numerous local minimizers that correspond to natural images. Building on these insights, we provide a theoretical analysis of the blind deblurring posterior. This reveals that the MAP estimator tends to produce sharp filters (close to the Dirac delta function) and blurry solutions. However, local minimizers of the posterior, which can be obtained with gradient descent, correspond to realistic, natural images, effectively solving the blind deconvolution problem. Our findings suggest that overcoming MAP’s limitations requires good local initialization to local minima in the posterior landscape. We validate our analysis with numerical experiments, demonstrating the practical implications of our insights for designing improved priors and optimization techniques.
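
For readers who want the objective written out, the joint MAP problem described in the abstract can be stated as follows. The notation ($y$ for the observation, $x$ for the sharp image, $k$ for the blur filter, $n$ for noise), the independence of the priors on $x$ and $k$, and the Gaussian noise model with variance $\sigma^2$ are assumptions introduced here, not taken from the paper.

```latex
% Observation model (assumed): blurred image = filter convolved with sharp image, plus noise
y = k * x + n

% Joint MAP estimate of image and filter
(\hat{x}, \hat{k})
  = \arg\max_{x,\,k} \; p(x, k \mid y)
  = \arg\max_{x,\,k} \; p(y \mid x, k)\, p(x)\, p(k)

% Under Gaussian noise (assumption), in negative-log form
(\hat{x}, \hat{k})
  = \arg\min_{x,\,k} \; \frac{1}{2\sigma^2} \, \lVert y - k * x \rVert_2^2
    \;-\; \log p(x) \;-\; \log p(k)
```

The "no-blur" pair $(x, k) = (y, \delta)$ makes the data term vanish, and since blurry images tend to have higher prior likelihood, the global optimum is pulled toward a sharp filter and a blurry image, which matches the behaviour the abstract reports.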

[CV-120] How Would It Sound? Material-Controlled Multimodal Acoustic Profile Generation for Indoor Scenes ICCV2025

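[Quick Read]: This paper addresses material-controlled acoustic profile generation: given the audio-visual characteristics of an indoor scene, generate a target Room Impulse Response (RIR) for a user-defined material configuration. The core challenge is to encode the scene's key properties from multimodal observations and synthesize a high-fidelity RIR conditioned on the material parameters. The key to the solution is a new encoder-decoder architecture: the encoder extracts scene features from the audio-visual input, and the decoder generates the corresponding RIR conditioned on the user-specified materials. Material configurations can thus be changed at inference time to produce diverse, high-quality RIRs. The authors also build the Acoustic Wonderland Dataset as a benchmark for material-aware RIR prediction in diverse and challenging settings.

Link: https://arxiv.org/abs/2508.02905
Authors: Mahnoor Fatima Saad, Ziad Al-Halah
Affiliations: University of Utah
Categories: Computer Vision and Pattern Recognition (cs.CV); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: Accepted to ICCV 2025. Project Page: this https URL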

Abstract:How would the sound in a studio change with a carpeted floor and acoustic tiles on the walls? We introduce the task of material-controlled acoustic profile generation, where, given an indoor scene with specific audio-visual characteristics, the goal is to generate a target acoustic profile based on a user-defined material configuration at inference time. We address this task with a novel encoder-decoder approach that encodes the scene’s key properties from an audio-visual observation and generates the target Room Impulse Response (RIR) conditioned on the material specifications provided by the user. Our model enables the generation of diverse RIRs based on various material configurations defined dynamically at inference time. To support this task, we create a new benchmark, the Acoustic Wonderland Dataset, designed for developing and evaluating material-aware RIR prediction methods under diverse and challenging settings. Our results demonstrate that the proposed model effectively encodes material information and generates high-fidelity RIRs, outperforming several baselines and state-of-the-art methods.

[CV-121] RDDPM: Robust Denoising Diffusion Probabilistic Model for Unsupervised Anomaly Segmentation ICCV2025

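[Quick Read]: This paper addresses the strong dependence of diffusion models on clean normal data in unsupervised anomaly segmentation: in practice, training data containing only normal samples is often unavailable, and existing methods cannot handle contaminated data that mixes normal and anomalous samples. The key to the solution is to cast maximum likelihood estimation as a nonlinear regression problem and apply robust regression, yielding a robust denoising diffusion probabilistic model. The framework adapts better to contaminated data and localizes anomalous regions accurately, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC than existing diffusion-based approaches on MVTec datasets.

Link: https://arxiv.org/abs/2508.02903
Authors: Mehrdad Moradi, Kamran Paynabar
Affiliations: Georgia Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 10 pages, 5 figures. Accepted to the ICCV 2025 Workshop on Vision-based Industrial InspectiON (VISION)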

Abstract:Recent advancements in diffusion models have demonstrated significant success in unsupervised anomaly segmentation. For anomaly segmentation, these models are first trained on normal data; then, an anomalous image is noised to an intermediate step, and the normal image is reconstructed through backward diffusion. Unlike traditional statistical methods, diffusion models do not rely on specific assumptions about the data or target anomalies, making them versatile for use across different domains. However, diffusion models typically assume access to normal data for training, limiting their applicability in realistic settings. In this paper, we propose novel robust denoising diffusion models for scenarios where only contaminated (i.e., a mix of normal and anomalous) unlabeled data is available. By casting maximum likelihood estimation of the data as a nonlinear regression problem, we reinterpret the denoising diffusion probabilistic model through a regression lens. Using robust regression, we derive a robust version of denoising diffusion probabilistic models. Our novel framework offers flexibility in constructing various robust diffusion models. Our experiments show that our approach outperforms current state of the art diffusion models, for unsupervised anomaly segmentation when only contaminated data is available. Our method outperforms existing diffusion-based approaches, achieving up to 8.08% higher AUROC and 10.37% higher AUPRC on MVTec datasets. The implementation code is available at: this https URL
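
To make the regression view concrete: the usual DDPM objective is a squared error between the true and the predicted noise, and a robust regression loss can take its place. The sketch below uses the Huber loss as one such choice. This is an assumption for illustration and not necessarily the loss the authors use; `delta` and the toy residuals are hypothetical.

```python
def squared_loss(residual):
    """Standard per-element regression loss: 0.5 * r^2."""
    return 0.5 * residual * residual


def huber_loss(residual, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond it.

    `delta` is a hypothetical tuning parameter, not a value from the paper.
    """
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def mean_loss(true_noise, predicted_noise, loss_fn):
    """Average a per-element loss over paired noise values."""
    residuals = [t - p for t, p in zip(true_noise, predicted_noise)]
    return sum(loss_fn(r) for r in residuals) / len(residuals)


# Toy residuals: the last element mimics a pixel from an anomalous region.
true_noise = [0.1, -0.2, 0.3, 5.0]
predicted_noise = [0.0, 0.0, 0.0, 0.0]

mse = mean_loss(true_noise, predicted_noise, squared_loss)
huber = mean_loss(true_noise, predicted_noise, huber_loss)
```

With these toy values the single large residual dominates the squared loss far more than the Huber loss, which is the property that makes a robust loss attractive when the training set is contaminated.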

[CV-122] Evaluation and Analysis of Deep Neural Transformers and Convolutional Neural Networks on Modern Remote Sensing Datasets

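【Quick Read】: This paper evaluates Transformer-based neural networks for object detection in high-resolution optical satellite imagery, with a systematic large-scale comparison against conventional convolutional neural networks (CNNs). The key to the approach is a comprehensive experimental framework: eleven detection algorithms (five Transformer-based architectures and six convolutional networks) are trained and evaluated on three public high-resolution remote sensing datasets, giving thirty-three models in total, and performance is compared across feature extraction methods and detection algorithms to show where Transformers hold an advantage in remote sensing object detection.

Link: https://arxiv.org/abs/2508.02871
Authors: J. Alex Hurt,Trevor M. Bajkowski,Grant J. Scott,Curt H. Davis
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: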

Abstract:In 2012, AlexNet established deep convolutional neural networks (DCNNs) as the state-of-the-art in CV, as these networks soon led in visual tasks for many domains, including remote sensing. With the publication of Visual Transformers, we are witnessing the second modern leap in computational vision, and as such, it is imperative to understand how various transformer-based neural networks perform on satellite imagery. While transformers have shown high levels of performance in natural language processing and CV applications, they have yet to be compared on a large scale to modern remote sensing data. In this paper, we explore the use of transformer-based neural networks for object detection in high-resolution electro-optical satellite imagery, demonstrating state-of-the-art performance on a variety of publicly available benchmark data sets. We compare eleven distinct bounding-box detection and localization algorithms in this study, of which seven were published since 2020, and all eleven since 2015. The performance of five transformer-based architectures is compared with six convolutional networks on three state-of-the-art opensource high-resolution remote sensing imagery datasets ranging in size and complexity. Following the training and evaluation of thirty-three deep neural models, we then discuss and analyze model performance across various feature extraction methodologies and detection algorithms.

[CV-123] MIDAR: Mimicking LiDAR Detection for Traffic Applications with a Lightweight Plug-and-Play Model

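【Quick Read】: This paper addresses the difficulty of obtaining cooperative perception (CP) data in autonomous driving (AD) research, where large-scale real-world vehicle deployment is impractical: microscopic traffic simulators such as SUMO lack perception modeling, while game-engine-based simulators such as CARLA scale poorly to multi-vehicle scenarios. The key to the solution is MIDAR, which uses vehicle-level features available in microscopic traffic simulators (such as spatial layout and dimensions) to build a Refined Multi-hop Line-of-Sight (RM-LoS) graph encoding occlusion relationships among vehicles, and propagates features with a GRU-enhanced APPNP architecture. The model predicts the true positives (TPs) and false negatives (FNs) of a mainstream 3D LiDAR detector (CenterPoint), reaching an AUC of 0.909 on the nuScenes dataset.

Link: https://arxiv.org/abs/2508.02858
Authors: Tianheng Zhu,Yiheng Feng
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 9 figures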

Abstract:As autonomous driving (AD) technology advances, increasing research has focused on leveraging cooperative perception (CP) data collected from multiple AVs to enhance traffic applications. Due to the impracticality of large-scale real-world AV deployments, simulation has become the primary approach in most studies. While game-engine-based simulators like CARLA generate high-fidelity raw sensor data (e.g., LiDAR point clouds) which can be used to produce realistic detection outputs, they face scalability challenges in multi-AV scenarios. In contrast, microscopic traffic simulators such as SUMO scale efficiently but lack perception modeling capabilities. To bridge this gap, we propose MIDAR, a LiDAR detection mimicking model that approximates realistic LiDAR detections using vehicle-level features readily available from microscopic traffic simulators. Specifically, MIDAR predicts true positives (TPs) and false negatives (FNs) from ideal LiDAR detection results based on the spatial layouts and dimensions of surrounding vehicles. A Refined Multi-hop Line-of-Sight (RM-LoS) graph is constructed to encode the occlusion relationships among vehicles, upon which MIDAR employs a GRU-enhanced APPNP architecture to propagate features from the ego AV and occluding vehicles to the prediction target. MIDAR achieves an AUC of 0.909 in approximating the detection results generated by CenterPoint, a mainstream 3D LiDAR detection model, on the nuScenes AD dataset. Two CP-based traffic applications further validate the necessity of such realistic detection modeling, particularly for tasks requiring accurate individual vehicle observations (e.g., position, speed, lane index). As demonstrated in the applications, MIDAR can be seamlessly integrated into traffic simulators and trajectory datasets and will be open-sourced upon publication.
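
As a rough illustration of the propagation step, the sketch below implements plain APPNP (the personalized-PageRank update Z ← (1 − α)·Â·Z + α·H) on a tiny graph. It leaves out the paper's GRU enhancement and does not reproduce the RM-LoS construction; the three-node occlusion chain, the undirected edges, and the values of `alpha` and `steps` are all assumptions made for the example.

```python
import math


def normalized_adjacency(edges, n):
    """Symmetrically normalized adjacency with self-loops: D^-1/2 (A + I) D^-1/2."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = 1.0
        a[j][i] = 1.0  # undirected here; the real graph may be directed
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def appnp(h, a_hat, alpha=0.1, steps=10):
    """Iterate Z <- (1 - alpha) * A_hat @ Z + alpha * H, starting from Z = H."""
    n, d = len(h), len(h[0])
    z = [row[:] for row in h]
    for _ in range(steps):
        z = [
            [
                (1 - alpha) * sum(a_hat[i][k] * z[k][c] for k in range(n))
                + alpha * h[i][c]
                for c in range(d)
            ]
            for i in range(n)
        ]
    return z


# Hypothetical occlusion chain: ego vehicle (0) - occluder (1) - target (2).
edges = [(0, 1), (1, 2)]
h = [[1.0], [0.0], [0.0]]  # only the ego node carries a feature initially
a_hat = normalized_adjacency(edges, 3)
z = appnp(h, a_hat, alpha=0.1, steps=10)
```

After a few steps the target node holds a nonzero value, showing how information from the ego vehicle reaches a vehicle two hops away through the occluder.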

[CV-124] RefineSeg: Dual Coarse-to-Fine Learning for Medical Image Segmentation

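【Quick Read】: This paper addresses the high cost of high-quality pixel-level annotation for medical image segmentation and its reliance on expert clinicians. The key to the solution is a coarse-to-fine framework trained entirely on coarse annotations: **transition matrices** model the inaccurate and incomplete regions of the coarse annotations, and joint training on multiple sets of coarse annotations progressively refines the network outputs and infers the true segmentation distribution, giving a robust approximation of precise labels. The method is validated on the cardiac imaging datasets ACDC, MSCMRseg, and UK Biobank, where it outperforms existing weakly supervised methods and comes close to fully supervised performance.

Link: https://arxiv.org/abs/2508.02844
Authors: Anghong Du,Nay Aung,Theodoros N. Arvanitis,Stefan K. Piechnik,Joao A C Lima,Steffen E. Petersen,Le Zhang
Affiliation: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: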

Abstract:High-quality pixel-level annotations of medical images are essential for supervised segmentation tasks, but obtaining such annotations is costly and requires medical expertise. To address this challenge, we propose a novel coarse-to-fine segmentation framework that relies entirely on coarse-level annotations, encompassing both target and complementary drawings, despite their inherent noise. The framework works by introducing transition matrices in order to model the inaccurate and incomplete regions in the coarse annotations. By jointly training on multiple sets of coarse annotations, it progressively refines the network’s outputs and infers the true segmentation distribution, achieving a robust approximation of precise labels through matrix-based modeling. To validate the flexibility and effectiveness of the proposed method, we demonstrate the results on two public cardiac imaging datasets, ACDC and MSCMRseg, and further evaluate its performance on the UK Biobank dataset. Experimental results indicate that our approach surpasses the state-of-the-art weakly supervised methods and closely matches the fully supervised approach.
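
One common way to read the transition-matrix idea is the noisy-label formulation below: for each pixel, a row-stochastic matrix T with T[i][j] = P(coarse label j | true label i) maps the predicted true-class distribution to the distribution expected in the coarse annotation, and the training loss is computed against the coarse label. This is a minimal sketch under that assumption; the matrix values and function names are hypothetical and the paper's exact parameterization may differ.

```python
import math


def apply_transition(p_true, t):
    """Expected coarse-label distribution: p_coarse[j] = sum_i p_true[i] * t[i][j]."""
    k = len(p_true)
    return [sum(p_true[i] * t[i][j] for i in range(k)) for j in range(k)]


def coarse_nll(p_true, t, coarse_label):
    """Negative log-likelihood of the observed coarse label for one pixel."""
    return -math.log(apply_transition(p_true, t)[coarse_label])


# Two classes: 0 = background, 1 = target.
# Hypothetical annotator who under-segments: 30% of true target pixels
# are drawn as background, while background is never drawn as target.
t = [
    [1.0, 0.0],
    [0.3, 0.7],
]
p_true = [0.2, 0.8]  # network's belief about the true class of one pixel
p_coarse = apply_transition(p_true, t)
loss = coarse_nll(p_true, t, coarse_label=1)
```

Because the loss is taken after the transition, the network can keep a confident "target" prediction for a pixel even where the coarse drawing misses it.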

[CV-125] GENIE: Gaussian Encoding for Neural Radiance Fields Interactive Editing

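【Quick Read】: This paper addresses the limitations of Neural Radiance Fields (NeRF) in scene editing and physical interaction, which stem from an implicit encoding that is hard to edit efficiently or integrate with physics simulation. The authors propose GENIE (Gaussian Encoding for Neural Radiance Fields Interactive Editing), a hybrid model that combines the high-fidelity rendering of NeRF with the editability of Gaussian Splatting (GS): each Gaussian primitive is assigned a trainable feature embedding, and the NeRF network is conditioned on the k nearest Gaussians to each query point, enabling real-time, locality-aware editing. Efficiency comes from Ray-Traced Gaussian Proximity Search (RT-GPS), a fast nearest-Gaussian search built on a modified ray-tracing pipeline, together with a multi-resolution hash grid used to initialize and update Gaussian features. The result combines the rendering quality of implicit representations with the structural control of explicit ones, supporting geometric editing, dynamic interaction, and compatibility with physical simulation.

Link: https://arxiv.org/abs/2508.02831
Authors: Mikołaj Zieliński,Krzysztof Byrski,Tomasz Szczepanik,Przemysław Spurek
Affiliations: 1. University of Warsaw; 2. Polish Academy of Sciences; 3. Institute of Computer Science, Polish Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: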

Abstract:Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) have recently transformed 3D scene representation and rendering. NeRF achieves high-fidelity novel view synthesis by learning volumetric representations through neural networks, but its implicit encoding makes editing and physical interaction challenging. In contrast, GS represents scenes as explicit collections of Gaussian primitives, enabling real-time rendering, faster training, and more intuitive manipulation. This explicit structure has made GS particularly well-suited for interactive editing and integration with physics-based simulation. In this paper, we introduce GENIE (Gaussian Encoding for Neural Radiance Fields Interactive Editing), a hybrid model that combines the photorealistic rendering quality of NeRF with the editable and structured representation of GS. Instead of using spherical harmonics for appearance modeling, we assign each Gaussian a trainable feature embedding. These embeddings are used to condition a NeRF network based on the k nearest Gaussians to each query point. To make this conditioning efficient, we introduce Ray-Traced Gaussian Proximity Search (RT-GPS), a fast nearest Gaussian search based on a modified ray-tracing pipeline. We also integrate a multi-resolution hash grid to initialize and update Gaussian features. Together, these components enable real-time, locality-aware editing: as Gaussian primitives are repositioned or modified, their interpolated influence is immediately reflected in the rendered output. By combining the strengths of implicit and explicit representations, GENIE supports intuitive scene manipulation, dynamic interaction, and compatibility with physical simulation, bridging the gap between geometry-based editing and neural rendering. The code can be found under (this https URL)
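
The conditioning step can be pictured as follows: for a query point, find the k nearest Gaussian centers and blend their feature embeddings into a single vector for the NeRF network. The sketch uses a brute-force search and inverse-distance weights. Both are assumptions made for illustration; the paper replaces the search with RT-GPS, and its interpolation scheme may differ.

```python
import math


def k_nearest_features(query, centers, features, k=2, eps=1e-8):
    """Inverse-distance-weighted average of the features of the k nearest Gaussians."""
    dists = [math.dist(query, c) for c in centers]
    nearest = sorted(range(len(centers)), key=lambda i: dists[i])[:k]
    weights = [1.0 / (dists[i] + eps) for i in nearest]
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * features[i][c] for w, i in zip(weights, nearest)) / total
        for c in range(dim)
    ]


# Three hypothetical Gaussians with 2-D feature embeddings.
centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
features = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
query = (0.25, 0.0, 0.0)

before = k_nearest_features(query, centers, features, k=2)

# "Edit" the scene by moving the second Gaussian closer to the query point.
centers[1] = (0.5, 0.0, 0.0)
after = k_nearest_features(query, centers, features, k=2)
```

Moving a Gaussian changes the blended feature at nearby query points straight away, which is the locality-aware editing behavior the abstract describes.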

[CV-126] Elucidating the Role of Feature Normalization in IJEPA

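【Quick Read】: This paper examines how layer normalization (LN) of features in the standard image joint embedding predictive architecture (IJEPA) disrupts the energy hierarchy of visual tokens. LN forces all features to have the same L2 norm, so high-energy tokens (those encoding semantically important regions) cannot receive proportionate weight in the prediction loss; this produces checkerboard-like artifacts in the loss map and weakens performance. The key to the solution is replacing LN with a DynTanh activation, which better preserves the natural token energy distribution and lets high-energy tokens contribute more to the prediction loss, yielding a longer-tailed loss distribution and removing the artifacts. Experiments show that the change raises ImageNet linear probe accuracy for ViT-Small from 38% to 42.7% and lowers RMSE by 0.08 on NYU Depth V2 monocular depth estimation, suggesting that preserving natural token energies matters for self-supervised visual representation learning.

Link: https://arxiv.org/abs/2508.02829
Authors: Adam Colton
Affiliation: Harmony AI
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: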

Abstract:In the standard image joint embedding predictive architecture (IJEPA), features at the output of the teacher encoder are layer normalized (LN) before serving as a distillation target for the student encoder and predictor. We propose that this feature normalization disrupts the natural energy hierarchy of visual tokens, where high-energy tokens (those with larger L2 norms) encode semantically important image regions. LN forces all features to have identical L2 norms, effectively equalizing their energies and preventing the model from prioritizing semantically rich regions. We find that IJEPA models trained with feature LN exhibit loss maps with significant checkerboard-like artifacts. We propose that feature LN be replaced with a DynTanh activation as the latter better preserves token energies and allows high-energy tokens to greater contribute to the prediction loss. We show that IJEPA trained with feature DynTanh exhibits a longer-tailed loss distribution and fixes the checkerboard artifacts in the loss map. Our empirical results show that our simple modification improves ImageNet linear probe accuracy from 38% to 42.7% for ViT-Small and reduces RMSE by 0.08 on NYU Depth V2 monocular depth estimation. These results suggest that preserving natural token energies is crucial for effective self-supervised visual representation learning.
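
The norm-equalizing effect is easy to check numerically. The sketch below compares layer normalization without learned affine parameters against an element-wise tanh(α·x), which is what DynTanh is assumed to be here; the learned scale and shift are omitted and `alpha = 0.5` is an arbitrary choice. The two toy tokens point in the same direction but differ tenfold in L2 norm.

```python
import math


def l2_norm(x):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(v * v for v in x))


def layer_norm(x, eps=1e-6):
    """Layer normalization without learned affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dyn_tanh(x, alpha=0.5):
    """Element-wise tanh(alpha * x); learned scale and shift are omitted."""
    return [math.tanh(alpha * v) for v in x]


low_energy = [0.1, -0.2, 0.15, -0.05]
high_energy = [1.0, -2.0, 1.5, -0.5]  # same direction, 10x the norm

ln_norms = [l2_norm(layer_norm(t)) for t in (low_energy, high_energy)]
dyt_norms = [l2_norm(dyn_tanh(t)) for t in (low_energy, high_energy)]
```

Both tokens leave layer normalization with an L2 norm close to the square root of the feature dimension, so their energy difference is erased, while the tanh output keeps the high-energy token clearly larger than the low-energy one.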

[CV-127] DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework

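【Quick Read】: This paper addresses two challenges for video virtual try-on (VVT) in real-world scenes: existing end-to-end methods rely heavily on scarce paired garment-centric datasets and make poor use of pretrained visual models and test-time inputs, and they struggle to preserve fine-grained garment details and temporal consistency in unconstrained scenarios. The key to the solution is DreamVVT, a two-stage framework. In the first stage, representative frames are sampled from the input video, and a multi-frame try-on model integrated with a vision-language model (VLM) synthesizes high-fidelity, semantically consistent keyframe try-on images that serve as appearance guidance for video generation. In the second stage, skeleton maps and fine-grained motion and appearance descriptions are extracted and fed, together with the keyframe images, into a pretrained video generation model enhanced with LoRA adapters, which ensures long-term temporal coherence for unseen regions and plausible dynamic motion.

Link: https://arxiv.org/abs/2508.02807
Authors: Tongchun Zuo,Zaiyu Huang,Shuliang Ning,Ente Lin,Chao Liang,Zerong Zheng,Jianwen Jiang,Yuan Zhang,Mingyuan Gao,Xin Dong
Affiliations: ByteDance Intelligent Creation; Shenzhen International Graduate School, Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages, 12 figures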

Abstract:Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM), to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these along with the keyframe try-on images are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page this https URL

[CV-128] PyCAT4: A Hierarchical Vision Transformer-based Framework for 3D Human Pose Estimation

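【Quick Read】: This paper addresses insufficient feature extraction, underused temporal information, and unbalanced multi-scale feature fusion in current 3D human pose estimation. The solution has three parts: (1) a Transformer feature extraction layer based on self-attention to better capture low-level visual features; (2) feature temporal fusion to improve modeling of temporal signals in video sequences; and (3) a spatial pyramid structure for multi-scale feature fusion that balances differences in feature representation across scales. Together these significantly improve the model's detection performance on the COCO and 3DPW datasets.

Link: https://arxiv.org/abs/2508.02806
Authors: Zongyou Yang,Jonathan Loo
Affiliations: University College London (UCL); Queen Mary University of London (QMUL)
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10 pages, 20 figures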

Abstract:Recently, a significant improvement in the accuracy of 3D human pose estimation has been achieved by combining convolutional neural networks (CNNs) with pyramid grid alignment feedback loops. Additionally, innovative breakthroughs have been made in the field of computer vision through the adoption of Transformer-based temporal analysis architectures. Given these advancements, this study aims to deeply optimize and improve the existing Pymaf network architecture. The main innovations of this paper include: (1) Introducing a Transformer feature extraction network layer based on self-attention mechanisms to enhance the capture of low-level features; (2) Enhancing the understanding and capture of temporal signals in video sequences through feature temporal fusion techniques; (3) Implementing spatial pyramid structures to achieve multi-scale feature fusion, effectively balancing feature representations differences across different scales. The new PyCAT4 model obtained in this study is validated through experiments on the COCO and 3DPW datasets. The results demonstrate that the proposed improvement strategies significantly enhance the network’s detection capability in human pose estimation, further advancing the development of human pose estimation technology.

[CV-129] The Architecture of Trust: A Framework for AI-Augmented Real Estate Valuation in the Era of Structured Data

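【Quick Read】: This paper addresses the limited reliability of traditional residential valuation, caused by unstructured report formats, significant inter-appraiser variability, and systematic bias, and examines how regulatory standardization (such as the Uniform Appraisal Dataset 3.6) and concurrent AI advances can work together. The key to the solution is a three-layer framework for AI-augmented valuation spanning physical data acquisition, semantic understanding, and cognitive reasoning. It integrates emerging technologies such as computer vision and natural language processing while maintaining professional oversight, and it emphasizes trust requirements including algorithmic fairness, uncertainty quantification, and regulatory compliance, with the aim of improving market efficiency and reducing systemic risk.

Link: https://arxiv.org/abs/2508.02765
Authors: Petteri Teikari,Mike Jarrell,Maryam Azh,Harri Pesola
Affiliations: Mill Hill Garage; JB Real Estate Valuation & Advisory, LLC; Atlas Insights
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: 46 pages, 6 figures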

Abstract:The Uniform Appraisal Dataset (UAD) 3.6’s mandatory 2026 implementation transforms residential property valuation from narrative reporting to structured, machine-readable formats. This paper provides the first comprehensive analysis of this regulatory shift alongside concurrent AI advances in computer vision, natural language processing, and autonomous systems. We develop a three-layer framework for AI-augmented valuation addressing technical implementation and institutional trust requirements. Our analysis reveals how regulatory standardization converging with AI capabilities enables fundamental market restructuring with profound implications for professional practice, efficiency, and systemic risk. We make four key contributions: (1) documenting institutional failures including inter-appraiser variability and systematic biases undermining valuation reliability; (2) developing an architectural framework spanning physical data acquisition, semantic understanding, and cognitive reasoning that integrates emerging technologies while maintaining professional oversight; (3) addressing trust requirements for high-stakes financial applications including regulatory compliance, algorithmic fairness, and uncertainty quantification; (4) proposing evaluation methodologies beyond generic AI benchmarks toward domain-specific protocols. Our findings indicate successful transformation requires not merely technological sophistication but careful human-AI collaboration, creating systems that augment rather than replace professional expertise while addressing historical biases and information asymmetries in real estate markets.

[CV-130] FastInit: Fast Noise Initialization for Temporally Consistent Video Generation

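【Quick Read】: This paper addresses the difficulty of achieving high temporal consistency in video generation with diffusion models, where a training-inference gap leads to inconsistency across frames. The key to the solution is FastInit, a fast noise initialization method that trains a Video Noise Prediction Network (VNPNet) to produce refined noise in a single forward pass, avoiding the computational overhead of the iterative noise refinement used by prior work. The method makes video generation considerably more efficient and is shown on several text-to-video models to improve video quality and frame-to-frame consistency.

Link: https://arxiv.org/abs/2506.16119
Authors: Chengyu Bai,Yuming Li,Zhongyu Zhao,Jintao Chen,Peidong Jia,Qi She,Ming Lu,Shanghang Zhang
Affiliations: Peking University; ByteDance
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: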

Abstract:Video generation has made significant strides with the development of diffusion models; however, achieving high temporal consistency remains a challenging task. Recently, FreeInit identified a training-inference gap and introduced a method to iteratively refine the initial noise during inference. However, iterative refinement significantly increases the computational cost associated with video generation. In this paper, we introduce FastInit, a fast noise initialization method that eliminates the need for iterative refinement. FastInit learns a Video Noise Prediction Network (VNPNet) that takes random noise and a text prompt as input, generating refined noise in a single forward pass. Therefore, FastInit greatly enhances the efficiency of video generation while achieving high temporal consistency across frames. To train the VNPNet, we create a large-scale dataset consisting of pairs of text prompts, random noise, and refined noise. Extensive experiments with various text-to-video models show that our method consistently improves the quality and temporal consistency of the generated videos. FastInit not only provides a substantial improvement in video generation but also offers a practical solution that can be applied directly during inference. The code and dataset will be released.

[CV-131] CADD: Context aware disease deviations via restoration of brain images using normative conditional diffusion models

【Quick Read】: This paper addresses the difficulty of detecting pathology in heterogeneous medical imaging data (such as brain images from hospital archives), where existing diffusion-based unsupervised anomaly detection methods perform poorly because they do not incorporate clinical information and restore healthy regions badly. The key to the solution is CADD, the first conditional diffusion model for normative modeling in 3D images, together with a novel inference-time inpainting strategy that removes anomalies while retaining subject-specific features, improving both the restoration of healthy regions and the accuracy of pathology detection.

Link: https://arxiv.org/abs/2508.03594
Authors: Ana Lawry Aguila,Ayodeji Ijishakin,Juan Eugenio Iglesias,Tomomi Takenaga,Yukihiro Nomura,Takeharu Yoshikawa,Osamu Abe,Shouhei Hanaoka
Affiliations: Athinoula A. Martinos Center for Biomedical Imaging, Massachusetts General Hospital and Harvard Medical School, Boston, USA; Hawkes Institute, University College London, London, UK; Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology, Boston, USA; Mecha Health, San Francisco, USA; Department of Radiology, the University of Tokyo, Tokyo, Japan; Medical Engineering, Chiba University, Chiba, Japan; Department of Computational Diagnostic Radiology and Preventive Medicine, the University of Tokyo Hospital, Tokyo, Japan
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Applying machine learning to real-world medical data, e.g. from hospital archives, has the potential to revolutionize disease detection in brain images. However, detecting pathology in such heterogeneous cohorts is a difficult challenge. Normative modeling, a form of unsupervised anomaly detection, offers a promising approach to studying such cohorts where the "normal" behavior is modeled and can be used at subject level to detect deviations relating to disease pathology. Diffusion models have emerged as powerful tools for anomaly detection due to their ability to capture complex data distributions and generate high-quality images. Their performance relies on image restoration; differences between the original and restored images highlight potential abnormalities. However, unlike normative models, these diffusion model approaches do not incorporate clinical information which provides important context to guide the disease detection process. Furthermore, standard approaches often poorly restore healthy regions, resulting in poor reconstructions and suboptimal detection performance. We present CADD, the first conditional diffusion model for normative modeling in 3D images. To guide the healthy restoration process, we propose a novel inference inpainting strategy which balances anomaly removal with retention of subject-specific features. Evaluated on three challenging datasets, including clinical scans, which may have lower contrast, thicker slices, and motion artifacts, CADD achieves state-of-the-art performance in detecting neurological abnormalities in heterogeneous cohorts.

[CV-132] Evaluating the Predictive Value of Preoperative MRI for Erectile Dysfunction Following Radical Prostatectomy MICCAI2025

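【Quick Read】: This paper addresses accurate preoperative prediction of erectile dysfunction (ED) after radical prostatectomy, with the goal of improving patient counseling and decision-making. It systematically evaluates four modeling strategies: a clinical-only baseline, classical models using handcrafted anatomical features derived from MRI, deep learning models trained directly on MRI slices, and multimodal fusion of imaging and clinical inputs. The key finding is that imaging-based models (maximum AUC 0.569) fall well short of the clinical-only baseline (AUC 0.663), and fusion models (AUC 0.586) do not exceed it either. SHAP analysis confirms that clinical features contribute most to prediction, while saliency maps show the imaging model focusing mainly on anatomically plausible regions such as the prostate and neurovascular bundles, which suggests that MRI may still complement clinical predictors in future multimodal approaches.

Link: https://arxiv.org/abs/2508.03461
Authors: Gideon N. L. Rouwendaal,Daniël Boeke,Inge L. Cox,Henk G. van der Poel,Margriet C. van Dijk-de Haan,Regina G. H. Beets-Tan,Thierry N. Boellaard,Wilson Silva
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 13 pages, 5 figures, 2 tables. Accepted at PRedictive Intelligence in MEdicine workshop @ MICCAI 2025 (PRIME-MICCAI). This is the submitted manuscript with added link to github repo, funding acknowledgements and authors’ names and affiliations. No further post submission improvements or corrections were integrated. Final version not published yet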

Abstract:Accurate preoperative prediction of erectile dysfunction (ED) is important for counseling patients undergoing radical prostatectomy. While clinical features are established predictors, the added value of preoperative MRI remains underexplored. We investigate whether MRI provides additional predictive value for ED at 12 months post-surgery, evaluating four modeling strategies: (1) a clinical-only baseline, representing current state-of-the-art; (2) classical models using handcrafted anatomical features derived from MRI; (3) deep learning models trained directly on MRI slices; and (4) multimodal fusion of imaging and clinical inputs. Imaging-based models (maximum AUC 0.569) slightly outperformed handcrafted anatomical approaches (AUC 0.554) but fell short of the clinical baseline (AUC 0.663). Fusion models offered marginal gains (AUC 0.586) but did not exceed clinical-only performance. SHAP analysis confirmed that clinical features contributed most to predictive performance. Saliency maps from the best-performing imaging model suggested a predominant focus on anatomically plausible regions, such as the prostate and neurovascular bundles. While MRI-based models did not improve predictive performance over clinical features, our findings suggest that they try to capture patterns in relevant anatomical structures and may complement clinical predictors in future multimodal approaches.

[CV-133] GL-LCM: Global-Local Latent Consistency Models for Fast High-Resolution Bone Suppression in Chest X-Ray Images MICCAI2025

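【Quick Read】: This paper addresses bone structures obscuring lung detail in chest X-ray (CXR) images, with the aim of improving diagnostic accuracy. Existing diffusion-based bone suppression methods struggle to remove bones completely while preserving local texture, and their high computational cost and long processing time limit clinical use. The key to the solution is the Global-Local Latent Consistency Model (GL-LCM), which combines lung segmentation, dual-path sampling, and global-local fusion to achieve fast, high-resolution bone suppression. A Local-Enhanced Guidance strategy further mitigates the boundary artifacts and detail blurring introduced by local-path sampling without additional training, balancing quality and efficiency.

Link: https://arxiv.org/abs/2508.03357
Authors: Yifei Sun,Zhanghao Chen,Hao Zheng,Yuqing Lu,Lixin Duan,Fenglei Fan,Ahmed Elazab,Xiang Wan,Changmiao Wang,Ruiquan Ge
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: 11 pages, 3 figures, accepted by MICCAI 2025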

Abstract:Chest X-Ray (CXR) imaging for pulmonary diagnosis raises significant challenges, primarily because bone structures can obscure critical details necessary for accurate diagnosis. Recent advances in deep learning, particularly with diffusion models, offer significant promise for effectively minimizing the visibility of bone structures in CXR images, thereby improving clarity and diagnostic accuracy. Nevertheless, existing diffusion-based methods for bone suppression in CXR imaging struggle to balance the complete suppression of bones with preserving local texture details. Additionally, their high computational demand and extended processing time hinder their practical use in clinical settings. To address these limitations, we introduce a Global-Local Latent Consistency Model (GL-LCM) architecture. This model combines lung segmentation, dual-path sampling, and global-local fusion, enabling fast high-resolution bone suppression in CXR images. To tackle potential boundary artifacts and detail blurring in local-path sampling, we further propose Local-Enhanced Guidance, which addresses these issues without additional training. Comprehensive experiments on a self-collected dataset SZCH-X-Rays, and the public dataset JSRT, reveal that our GL-LCM delivers superior bone suppression and remarkable computational efficiency, significantly outperforming several competitive methods. Our code is available at this https URL.

[CV-134] Investigation on deep learning-based galaxy image translation models

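【Quick Read】: This paper addresses the limited attention paid to whether generative models preserve high-order physical information (such as spectroscopic redshift) in galaxy image translation, even when they perform well on pixel-level and morphology-level statistics. It systematically evaluates four representative generative models (a Swin Transformer, an SRGAN, a capsule network, and a diffusion model) on their ability to retain redshift information, and finds that all of them lose information to varying degrees even when global structure and morphology statistics are roughly reproduced. The study further shows that cross-band peak fluxes carry redshift information but are subject to noticeable uncertainty in translation, likely because of the many-to-many nature of the mapping. Imperfect translated images may nonetheless retain enough information for downstream scientific applications that do not require high image fidelity.

Link: https://arxiv.org/abs/2508.03291
Authors: Hengxin Ruan,Qiufan Lin,Shupei Chen,Yang Wang,Wei Zhang
Affiliation: PCL
Categories: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at A&A; 18+6 pages; 12+6 figures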

Abstract:Galaxy image translation is an important application in galaxy physics and cosmology. With deep learning-based generative models, image translation has been performed for image generation, data quality enhancement, information extraction, and generalized for other tasks such as deblending and anomaly detection. However, most endeavors on image translation primarily focus on the pixel-level and morphology-level statistics of galaxy images. There is a lack of discussion on the preservation of complex high-order galaxy physical information, which would be more challenging but crucial for studies that rely on high-fidelity image translation. Therefore, we investigated the effectiveness of generative models in preserving high-order physical information (represented by spectroscopic redshift) along with pixel-level and morphology-level information. We tested four representative models, i.e. a Swin Transformer, an SRGAN, a capsule network, and a diffusion model, using the SDSS and CFHTLS galaxy images. We found that these models show different levels of incapabilities in retaining redshift information, even if the global structures of galaxies and morphology-level statistics can be roughly reproduced. In particular, the cross-band peak fluxes of galaxies were found to contain meaningful redshift information, whereas they are subject to noticeable uncertainties in the translation of images, which may substantially be due to the nature of many-to-many mapping. Nonetheless, imperfect translated images may still contain a considerable amount of information and thus hold promise for downstream applications for which high image fidelity is not strongly required. Our work can facilitate further research on how complex physical information is manifested on galaxy images, and it provides implications on the development of image translation models for scientific use.

[CV-135] Nexus-INR: Diverse Knowledge-guided Arbitrary-Scale Multimodal Medical Image Super-Resolution

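【Quick Read】: This paper addresses two problems in medical image super-resolution (SR): traditional convolutional neural network (CNN) methods are designed for fixed upsampling factors and adapt poorly to varying resolutions, and existing implicit neural representation (INR) methods still struggle with multi-modal images of differing resolution and detail. The key to the solution is the Nexus-INR framework, which has three components: (1) a dual-branch encoder with an auxiliary classification task that disentangles shared anatomical structures from modality-specific features; (2) a knowledge distillation module based on cross-modal attention that uses a high-resolution reference to guide reconstruction of the low-resolution modality, with a self-supervised consistency loss for added stability; and (3) an integrated segmentation module that embeds anatomical semantics to improve both reconstruction quality and downstream segmentation. Experiments on the BraTS2020 dataset show that it outperforms state-of-the-art methods across various metrics.

Link: https://arxiv.org/abs/2508.03073
Authors: Bo Zhang,JianFei Huo,Zheng Zhang,Wufan Wang,Hui Gao,Xiangyang Gong,Wendong Wang
Affiliation: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: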

点击查看摘要

Abstract:Arbitrary-resolution super-resolution (ARSR) provides crucial flexibility for medical image analysis by adapting to diverse spatial resolutions. However, traditional CNN-based methods are inherently ill-suited for ARSR, as they are typically designed for fixed upsampling factors. While INR-based methods overcome this limitation, they still struggle to effectively process and leverage multi-modal images with varying resolutions and details. In this paper, we propose Nexus-INR, a Diverse Knowledge-guided ARSR framework, which employs varied information and downstream tasks to achieve high-quality, adaptive-resolution medical image super-resolution. Specifically, Nexus-INR contains three key components. A dual-branch encoder with an auxiliary classification task to effectively disentangle shared anatomical structures and modality-specific features; a knowledge distillation module using cross-modal attention that guides low-resolution modality reconstruction with high-resolution reference, enhanced by self-supervised consistency loss; an integrated segmentation module that embeds anatomical semantics to improve both reconstruction quality and downstream segmentation performance. Experiments on the BraTS2020 dataset for both super-resolution and downstream segmentation demonstrate that Nexus-INR outperforms state-of-the-art methods across various metrics.

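[CV-136] A Survey of Medical Point Cloud Shape Learning: Registration, Reconstruction and Variation

[Quick read]: This paper addresses learning-based shape analysis of medical point clouds, focusing on three core tasks: registration, reconstruction, and variation modeling. Its key contribution is a systematic review of research progress from 2021 to 2025 that summarizes representative methods, datasets, and evaluation metrics, and highlights challenges specific to clinical settings, such as data scarcity, inter-patient variability, and the need for interpretable and robust models. The paper also identifies current trends, including the fusion of hybrid representations, large-scale self-supervised models, and generative techniques, offering direction for advancing point cloud-based shape learning in medical imaging.

Link: https://arxiv.org/abs/2508.03057
Authors: Tongxu Zhang, Zhiming Liang, Bei Wang
Affiliations: East China University of Science and Technology; The Hong Kong Polytechnic University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: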

Abstract:Point clouds have become an increasingly important representation for 3D medical imaging, offering a compact, surface-preserving alternative to traditional voxel or mesh-based approaches. Recent advances in deep learning have enabled rapid progress in extracting, modeling, and analyzing anatomical shapes directly from point cloud data. This paper provides a comprehensive and systematic survey of learning-based shape analysis for medical point clouds, focusing on three fundamental tasks: registration, reconstruction, and variation modeling. We review recent literature from 2021 to 2025, summarize representative methods, datasets, and evaluation metrics, and highlight clinical applications and unique challenges in the medical domain. Key trends include the integration of hybrid representations, large-scale self-supervised models, and generative techniques. We also discuss current limitations, such as data scarcity, inter-patient variability, and the need for interpretable and robust solutions for clinical deployment. Finally, future directions are outlined for advancing point cloud-based shape learning in medical imaging.

[CV-137] ClinicalFMamba: Advancing Clinical Assessment using Mamba-based Multimodal Neuroimaging Fusion MICCAI

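[Quick read]: This paper addresses the limitations of existing deep learning methods for multimodal medical image fusion: convolutional neural networks (CNNs) struggle to model global context, while Transformers capture long-range dependencies effectively but are hard to deploy in clinical settings because of their quadratic computational complexity. The key to the solution is ClinicalFMamba, an end-to-end CNN-Mamba hybrid architecture that uses the linear time complexity of state space models (SSMs) for efficient global feature modeling and adds a tri-plane scanning strategy to learn spatial dependencies in 3D volumetric data, improving fusion quality and downstream task performance while maintaining real-time fusion speed.

Link: https://arxiv.org/abs/2508.03008
Authors: Meng Zhou, Farzad Khalvati
Affiliations: University of Toronto
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI MLMI 2025 Workshop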

Abstract:Multimodal medical image fusion integrates complementary information from different imaging modalities to enhance diagnostic accuracy and treatment planning. While deep learning methods have advanced performance, existing approaches face critical limitations: Convolutional Neural Networks (CNNs) excel at local feature extraction but struggle to model global context effectively, while Transformers achieve superior long-range modeling at the cost of quadratic computational complexity, limiting clinical deployment. Recent State Space Models (SSMs) offer a promising alternative, enabling efficient long-range dependency modeling in linear time through selective scan mechanisms. Despite these advances, the extension to 3D volumetric data and the clinical validation of fused images remains underexplored. In this work, we propose ClinicalFMamba, a novel end-to-end CNN-Mamba hybrid architecture that synergistically combines local and global feature modeling for 2D and 3D images. We further design a tri-plane scanning strategy for effectively learning volumetric dependencies in 3D images. Comprehensive evaluations on three datasets demonstrate the superior fusion performance across multiple quantitative metrics while achieving real-time fusion. We further validate the clinical utility of our approach on downstream 2D/3D brain tumor classification tasks, achieving superior performance over baseline methods. Our method establishes a new paradigm for efficient multimodal medical image fusion suitable for real-time clinical deployment.
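
The abstract does not spell out how the tri-plane scan orders voxels, so the following is only a minimal sketch of one plausible reading: the same 3D volume is flattened into three 1D sequences, each traversing the volume along a different anatomical plane, before being passed to a sequence model. The function name `tri_plane_sequences` and the axis convention `volume[d][h][w]` are assumptions made for illustration, not the authors' implementation.

```python
def tri_plane_sequences(volume):
    """Flatten volume[d][h][w] into three voxel orderings, one per plane.

    Assumed convention (hypothetical): d = depth, h = height, w = width.
    """
    D, H, W = len(volume), len(volume[0]), len(volume[0][0])
    # Slice by slice along depth, row-major inside each slice.
    axial = [volume[d][h][w] for d in range(D) for h in range(H) for w in range(W)]
    # Slice by slice along height.
    coronal = [volume[d][h][w] for h in range(H) for d in range(D) for w in range(W)]
    # Slice by slice along width.
    sagittal = [volume[d][h][w] for w in range(W) for d in range(D) for h in range(H)]
    return axial, coronal, sagittal


# A 2x2x2 toy volume whose voxel values equal their row-major index.
vol = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
for name, seq in zip(("axial", "coronal", "sagittal"), tri_plane_sequences(vol)):
    print(name, seq)
```

Each ordering places different voxels next to each other in the sequence, which is why scanning along several planes can expose a linear-time sequence model to neighborhoods that a single flattening would separate.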

[CV-138] AMD-Mamba: A Phenotype-Aware Multi-Modal Framework for Robust AMD Prognosis MICCAI2025

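[Quick read]: This paper addresses prognosis prediction for age-related macular degeneration (AMD), with the goal of identifying disease progression early and enabling targeted intervention. The key to the solution is AMD-Mamba, a multimodal framework that fuses color fundus images, genetic variants, and socio-demographic variables, and introduces a metric learning strategy based on the AMD severity scale score so that learned feature representations align with clinical phenotypes and better capture disease progression patterns. It uses a Vision Mamba backbone rather than a traditional CNN to combine local lesions (such as drusen) with long-range global information (such as vascular changes), and a multi-scale fusion mechanism that combines image information with clinical variables at different resolutions, improving the early identification of high-risk patients.

Link: https://arxiv.org/abs/2508.02957
Authors: Puzhen Wu, Mingquan Lin, Qingyu Chen, Emily Y. Chew, Zhiyong Lu, Yifan Peng, Hexin Dong
Affiliations: Weill Cornell Medicine; University of Minnesota; Yale School of Medicine; National Eye Institute; National Library of Medicine; National Institutes of Health
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at the MICCAI 2025 MIML Workshop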

Abstract:Age-related macular degeneration (AMD) is a leading cause of irreversible vision loss, making effective prognosis crucial for timely intervention. In this work, we propose AMD-Mamba, a novel multi-modal framework for AMD prognosis, and further develop a new AMD biomarker. This framework integrates color fundus images with genetic variants and socio-demographic variables. At its core, AMD-Mamba introduces an innovative metric learning strategy that leverages AMD severity scale score as prior knowledge. This strategy allows the model to learn richer feature representations by aligning learned features with clinical phenotypes, thereby improving the capability of conventional prognosis methods in capturing disease progression patterns. In addition, unlike existing models that use traditional CNN backbones and focus primarily on local information, such as the presence of drusen, AMD-Mamba applies Vision Mamba and simultaneously fuses local and long-range global information, such as vascular changes. Furthermore, we enhance prediction performance through multi-scale fusion, combining image information with clinical variables at different resolutions. We evaluate AMD-Mamba on the AREDS dataset, which includes 45,818 color fundus photographs, 52 genetic variants, and 3 socio-demographic variables from 2,741 subjects. Our experimental results demonstrate that our proposed biomarker is one of the most significant biomarkers for the progression of AMD. Notably, combining this biomarker with other existing variables yields promising improvements in detecting high-risk AMD patients at early stages. These findings highlight the potential of our multi-modal framework to facilitate more precise and proactive management of AMD.

[CV-139] REFLECT: Rectified Flows for Efficient Brain Anomaly Correction Transport MICCAI2025

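[Quick read]: This paper addresses unsupervised anomaly detection (UAD) in brain imaging, in particular how to localize anomalous regions accurately without labeled data. Because brain anatomy is complex and abnormal examples are scarce, traditional methods struggle to localize anomalies precisely. The key to the solution is a new framework, REFLECT, which uses rectified flows to establish a direct, linear transport path that maps abnormal magnetic resonance (MR) images toward the normal distribution. By learning a one-step correction transport map, the model corrects brain anomalies efficiently and localizes them by comparing the anomalous input with its corrected counterpart. Unlike diffusion-based UAD methods, which require iterative stochastic sampling, rectified flows provide a deterministic single-step inference mechanism, which improves efficiency and accuracy.

Link: https://arxiv.org/abs/2508.02889
Authors: Farzad Beizaee, Sina Hajimiri, Ismail Ben Ayed, Gregory Lodygensky, Christian Desrosiers, Jose Dolz
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted in Medical Image Computing and Computer Assisted Intervention Society (MICCAI 2025)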

Abstract:Unsupervised anomaly detection (UAD) in brain imaging is crucial for identifying pathologies without the need for labeled data. However, accurately localizing anomalies remains challenging due to the intricate structure of brain anatomy and the scarcity of abnormal examples. In this work, we introduce REFLECT, a novel framework that leverages rectified flows to establish a direct, linear trajectory for correcting abnormal MR images toward a normal distribution. By learning a straight, one-step correction transport map, our method efficiently corrects brain anomalies and can precisely localize anomalies by detecting discrepancies between anomalous input and corrected counterpart. In contrast to the diffusion-based UAD models, which require iterative stochastic sampling, rectified flows provide a direct transport map, enabling single-step inference. Extensive experiments on popular UAD brain segmentation benchmarks demonstrate that REFLECT significantly outperforms state-of-the-art unsupervised anomaly detection methods. The code is available at this https URL.
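
The inference pattern described in the abstract (one correction step, then a pixel-wise comparison of input and correction) can be sketched as follows. This is an illustration under stated assumptions rather than the REFLECT code: `toy_velocity` is a hypothetical stand-in for the trained network, images are plain nested lists, and the single step is written as a unit-length Euler update `x + v(x)`.

```python
def correct_one_step(image, velocity):
    """Single Euler step along a straight path: x_corrected = x + v(x)."""
    v = velocity(image)
    return [[x + dx for x, dx in zip(row, vrow)] for row, vrow in zip(image, v)]


def anomaly_map(image, corrected):
    """Per-pixel absolute difference between the input and its correction."""
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(image, corrected)]


def toy_velocity(image, healthy=0.5):
    """Hypothetical stand-in for a trained model: pull bright pixels to `healthy`."""
    return [[(healthy - x) if x > 0.8 else 0.0 for x in row] for row in image]


scan = [[0.5, 0.5], [0.5, 1.0]]   # one bright "lesion" pixel
fixed = correct_one_step(scan, toy_velocity)
print(anomaly_map(scan, fixed))    # non-zero only where the input was changed
```

Because the correction is a single deterministic step, the anomaly map needs one forward pass of the model, in contrast to the iterative sampling the abstract attributes to diffusion-based approaches.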

[CV-140] Evaluation of 3D Counterfactual Brain MRI Generation

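[Quick read]: This paper addresses the challenges generative AI faces when simulating hypothetical changes in 3D brain magnetic resonance imaging (MRI), including data scarcity, structural complexity, and the lack of standardized evaluation protocols, with a particular focus on generating realistic and interpretable brain images that respect anatomical and causal constraints. The key to the solution is converting six generative models into 3D counterfactual generation methods based on a causal graph, using an anatomy-guided framework in which regional brain volumes serve as direct conditioning inputs. This allows controlled modification of targeted anatomical regions and improves the physiological plausibility and clinical relevance of the generated images.

Link: https://arxiv.org/abs/2508.02880
Authors: Pengwei Sun, Wei Peng, Lun Yu Li, Yixin Wang, Kilian M. Pohl
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: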

Abstract:Counterfactual generation offers a principled framework for simulating hypothetical changes in medical imaging, with potential applications in understanding disease mechanisms and generating physiologically plausible data. However, generating realistic structural 3D brain MRIs that respect anatomical and causal constraints remains challenging due to data scarcity, structural complexity, and the lack of standardized evaluation protocols. In this work, we convert six generative models into 3D counterfactual approaches by incorporating an anatomy-guided framework based on a causal graph, in which regional brain volumes serve as direct conditioning inputs. Each model is evaluated with respect to composition, reversibility, realism, effectiveness and minimality on T1-weighted brain MRIs (T1w MRIs) from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). In addition, we test the generalizability of each model with respect to T1w MRIs of the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA). Our results indicate that anatomically grounded conditioning successfully modifies the targeted anatomical regions; however, it exhibits limitations in preserving non-targeted structures. Beyond laying the groundwork for more interpretable and clinically relevant generative modeling of brain MRIs, this benchmark highlights the need for novel architectures that more accurately capture anatomical interdependencies.

Artificial Intelligence

[AI-0] Self-Questioning Language Models

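[Quick read]: This paper asks whether large language models can improve their reasoning ability without external labeled data by generating their own questions and answers. The key to the solution is Self-Questioning Language Models (SQLM), an asymmetric self-play framework with a proposer and a solver, both trained with reinforcement learning. The proposer generates questions on a given topic and is rewarded when a question is of moderate difficulty; the solver tries to answer them and is rewarded based on majority voting, which serves as a proxy for correctness. For programming tasks, the proposer can instead generate unit tests to verify answers. With no external data, the model continually generates more challenging problems and attempts to solve them, improving performance on downstream tasks such as three-digit multiplication, algebra problems, and Codeforces programming problems.

Link: https://arxiv.org/abs/2508.03682
Authors: Lili Chen, Mihir Prabhudesai, Katerina Fragkiadaki, Hao Liu, Deepak Pathak
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: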

Abstract:Can large language models improve without external data – by generating their own questions and answers? We hypothesize that a pre-trained language model can improve its reasoning skills given only a single prompt specifying the topic (e.g., algebra word problems) and asking the model to generate its own questions. To do this, we propose Self-Questioning Language Models (SQLM): an asymmetric self-play framework where a proposer is given the topic and generates a question for a solver, who tries to answer it. Both the proposer and solver are trained via reinforcement learning. The proposer receives a reward if the problem is not too easy or too difficult, and the solver receives a reward based on majority voting, a proxy for correctness in the absence of ground-truth answers. For coding, the proposer can instead generate unit tests which are used for verification. We study this asymmetric self-play framework on three benchmarks: three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces. By continually generating more interesting problems and attempting to solve them, language models can improve on downstream benchmarks without access to any curated training datasets.
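
The two reward signals described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the agreement thresholds `low` and `high` are assumptions, since the abstract only says the proposer is rewarded when a problem is neither too easy nor too difficult.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples agreeing with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)


def solver_rewards(answers):
    """Reward 1.0 for each sampled answer that matches the majority, else 0.0."""
    majority, _ = majority_vote(answers)
    return [1.0 if a == majority else 0.0 for a in answers]


def proposer_reward(answers, low=0.25, high=0.9):
    """Reward the proposer only when solver agreement is neither too low nor too high.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    _, agreement = majority_vote(answers)
    return 1.0 if low < agreement < high else 0.0


samples = ["42", "42", "41", "42"]   # four sampled solver answers to one question
print(solver_rewards(samples))
print(proposer_reward(samples))
```

Using agreement among sampled answers as the difficulty signal means no ground-truth label is needed at any point, which matches the abstract's framing of majority voting as a proxy for correctness.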

[AI-1] Agent Lightning: Train ANY AI Agents with Reinforcement Learning

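[Quick read]: This paper addresses two problems in current reinforcement learning (RL) training of large language models (LLMs) for agents: existing methods usually couple RL training tightly to a specific agent implementation, which makes them hard to adapt to agents with different architectures, and they cannot effectively handle complex interaction logic such as multi-agent collaboration and dynamic workflows. The key to the solution is the Agent Lightning framework, whose core innovations are: 1) modeling agent execution as a Markov decision process (MDP), defining a unified data interface, and designing a hierarchical RL algorithm, LightningRL, whose credit assignment module decomposes trajectories generated by any agent into training transitions, fully decoupling training from execution; 2) adopting a Training-Agent Disaggregation system architecture and bringing in observability frameworks to provide a standardized agent fine-tuning interface, so that the framework integrates with mainstream agent development tools such as LangChain and AutoGen with almost no code changes.

Link: https://arxiv.org/abs/2508.03680
Authors: Xufang Luo, Yuge Zhang, Zhiyuan He, Zilong Wang, Siyun Zhao, Dongsheng Li, Luna K. Qiu, Yuqing Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: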

Abstract: We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed via diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, AutoGen, and building from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework’s potential for real-world agent training and deployment.
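
To make the idea of decomposing an agent run into training transitions concrete, here is a minimal sketch. It is not the Agent Lightning API: the `Transition` type, the `decompose` function, and the choice to copy the episode reward onto every LLM call are all assumptions made for illustration, since the abstract does not describe the credit assignment rule in detail.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    state: str     # prompt the LLM saw at this call
    action: str    # text the LLM generated
    reward: float  # credit assigned to this call


def decompose(calls, final_reward):
    """Turn one recorded agent run into per-call transitions.

    `calls` is a list of (prompt, completion) pairs logged during execution.
    Credit assignment here simply copies the episode reward to every call,
    which is an illustrative assumption rather than the paper's method.
    """
    return [Transition(prompt, completion, final_reward) for prompt, completion in calls]


run = [
    ("Write SQL for: count users", "SELECT COUNT(*) FROM users;"),
    ("Check the query result: 42", "Answer: 42"),
]
for t in decompose(run, final_reward=1.0):
    print(t)
```

Because the trainer only consumes `(state, action, reward)` records, it does not need to know which agent framework produced the run, which is the decoupling the abstract emphasizes.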

[AI-2] Classifying Epistemic Relationships in Human-AI Interaction: An Exploratory Approach

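[Quick read]: This paper addresses the following gap: as generative AI systems become widely used in knowledge-intensive work, human-computer interaction (HCI) research has proposed various typologies of AI roles but has largely overlooked how AI reshapes users' epistemic roles as knowledge contributors. The key to the solution is a five-part codebook, built from interviews with 31 academics across disciplines, that identifies five types of epistemic relationship between users and AI: Instrumental Reliance, Contingent Delegation, Co-agency Collaboration, Authority Displacement, and Epistemic Abstention. These types reflect differences in trust, assessment modes, task types, and human epistemic status, and show that AI's epistemic role is dynamic and context-dependent. The paper therefore argues that HCI should move from static metaphors of AI toward a more nuanced framework that captures the relational and normative dimensions of how humans and AI co-construct knowledge.

Link: https://arxiv.org/abs/2508.03673
Authors: Shengnan Yang, Rongqian Ma
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: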

Abstract: As AI systems become integral to knowledge-intensive work, questions arise not only about their functionality but also their epistemic roles in human-AI interaction. While HCI research has proposed various AI role typologies, it often overlooks how AI reshapes users’ roles as knowledge contributors. This study examines how users form epistemic relationships with AI: how they assess, trust, and collaborate with it in research and teaching contexts. Based on 31 interviews with academics across disciplines, we developed a five-part codebook and identified five relationship types: Instrumental Reliance, Contingent Delegation, Co-agency Collaboration, Authority Displacement, and Epistemic Abstention. These reflect variations in trust, assessment modes, tasks, and human epistemic status. Our findings show that epistemic roles are dynamic and context-dependent. We argue for shifting beyond static metaphors of AI toward a more nuanced framework that captures how humans and AI co-construct knowledge, enriching HCI’s understanding of the relational and normative dimensions of AI use.

[AI-3] Beyond risk: A proto-framework for assessing the societal impact of AI systems

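[Quick read]: This paper addresses the fact that current AI regulation focuses heavily on risk mitigation and neglects systematic assessment of the societal impact of AI systems. The key to the solution is a proto-framework centered on the concept of freedom, which divides freedom into two dimensions, freedom as capability and freedom as opportunity, and combines them with the United Nations Sustainable Development Goals (SDGs). This gives policymakers an assessment tool that goes beyond the traditional risk-oriented approach and aims to support the responsible development and societal value of technologies such as generative AI.

Link: https://arxiv.org/abs/2508.03666
Authors: Willem Fourie
Affiliations: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: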

Abstract:In the discourse on AI regulation, ‘responsible AI’ is the dominant paradigm, with the focus on mitigating the risks related to AI systems. While this focus is important and necessary, it has limited use for a systematic consideration of AI’s societal impact. This paper proposes a proto-framework for assessing the societal impact of AI systems by operationalising the concept of freedom. This proto-framework is intended as a step towards a fully operationalised framework to be used in policymaking contexts. By drawing on Kantian philosophy and related contemporary interpretations, freedom is developed as the counterpart to the concept of responsibility. Two dimensions of freedom are developed in further detail: freedom as capability and freedom as opportunity. These two dimensions of freedom are then applied in a proto-framework that systematically considers AI’s impact on society using the Sustainable Development Goals. This proto-framework aims to complement current risk-based approaches and thereby offers a first step towards operationalising the concept of freedom in AI regulation.

[AI-4] A DbC Inspired Neurosymbolic Layer for Trustworthy Agent Design

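[Quick read]: This paper addresses the lack of verifiable guarantees on the outputs of generative AI models, especially large language models (LLMs). Existing methods produce fluent text but cannot ensure semantic correctness or type consistency, which limits their use in high-reliability settings. The key to the solution is a contract layer that mediates every LLM call, drawing on Design by Contract (DbC) and type-theoretic principles to specify semantic and type requirements on inputs and outputs, combined with probabilistic remediation that steers generation toward compliance. The contract layer treats the LLM as both a semantic parser and a probabilistic black-box component, and defines validation operationally through programmer-specified conditions on well-typed data structures, enabling probabilistic verification of, and greater control over, generated results.

Link: https://arxiv.org/abs/2508.03665
Authors: Claudiu Leoveanu-Condrei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 3 pages, 1 figure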

Abstract: Generative models, particularly Large Language Models (LLMs), produce fluent outputs yet lack verifiable guarantees. We adapt Design by Contract (DbC) and type-theoretic principles to introduce a contract layer that mediates every LLM call. Contracts stipulate semantic and type requirements on inputs and outputs, coupled with probabilistic remediation to steer generation toward compliance. The layer exposes the dual view of LLMs as semantic parsers and probabilistic black-box components. Contract satisfaction is probabilistic and semantic validation is operationally defined through programmer-specified conditions on well-typed data structures. More broadly, this work postulates that any two agents satisfying the same contracts are functionally equivalent with respect to those contracts.
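
A contract layer of this kind can be sketched as a wrapper that checks a precondition on the input, parses the output into a typed structure, checks a postcondition, and retries with feedback when the check fails. The sketch below is an assumption-laden illustration, not the author's implementation: `with_contract`, the retry-with-feedback remediation, and the JSON output format are all hypothetical choices, and `make_fake_llm` stands in for a real model.

```python
import json


def with_contract(generate, pre, post, max_retries=2):
    """Wrap a text generator with a precondition, a postcondition and retries."""
    def call(prompt):
        if not pre(prompt):
            raise ValueError("precondition failed")
        feedback = ""
        for _ in range(max_retries + 1):
            raw = generate(prompt + feedback)
            try:
                value = json.loads(raw)      # type check: output must parse as JSON
            except json.JSONDecodeError:
                value = None
            if value is not None and post(value):
                return value                 # contract satisfied
            # Remediation (assumed strategy): ask again with corrective feedback.
            feedback = "\nPrevious output violated the contract; return valid JSON."
        raise RuntimeError("postcondition failed after retries")
    return call


def make_fake_llm():
    """Hypothetical model stub: first reply is malformed, second is valid."""
    replies = iter(["not json", '{"age": 31}'])
    return lambda prompt: next(replies)


checked = with_contract(
    make_fake_llm(),
    pre=lambda p: len(p) > 0,
    post=lambda v: isinstance(v, dict) and isinstance(v.get("age"), int),
)
print(checked("Extract the age from: 'Ana is 31.'"))
```

Satisfaction remains probabilistic in this design: the wrapper either returns a value that passed the programmer-specified checks or raises after the retry budget is spent, so callers never receive an unchecked result.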

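[AI-5] Automated Algorithmic Discovery for Gravitational-Wave Detection Guided by LLM-Informed Evolutionary Monte Carlo Tree Search

[Quick read]: This paper addresses fundamental limitations of existing algorithms for gravitational-wave signal identification, such as matched filtering and deep neural networks: matched filtering is computationally expensive because it depends on predefined theoretical waveform templates, while deep neural networks have black-box architectures that obscure decision logic and introduce hidden biases. The key to the solution is the Evolutionary Monte Carlo Tree Search (Evo-MCTS) framework, which combines tree-structured search, evolutionary optimization, and large language model heuristics to explore the algorithm space systematically under domain-aware physical constraints, producing interpretable, high-performing algorithmic solutions. The method achieves a 20.2% improvement in detection performance over state-of-the-art algorithms on the MLGWSC-1 benchmark dataset, reveals performance patterns across different algorithmic pathways, and discovers novel algorithm combinations, offering a transferable methodology for automated algorithmic discovery in computational science.

Link: https://arxiv.org/abs/2508.03661
Authors: He Wang, Liang Zeng
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); High Energy Astrophysical Phenomena (astro-ph.HE); Instrumentation and Methods for Astrophysics (astro-ph.IM); General Relativity and Quantum Cosmology (gr-qc)
Comments: 89 pages (37 main), 6+6 figures, 1 table. Initial submission; subject to revision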

Abstract: Computational scientific discovery increasingly relies on algorithms to process complex data and identify meaningful patterns, yet it faces persistent challenges in gravitational-wave signal identification. While existing algorithmic approaches like matched filtering (MF) and deep neural networks (DNNs) have achieved partial success, their shortcomings stem directly from fundamental limitations: MF’s excessive computational demands arise from its reliance on predefined theoretical waveform templates, while DNNs’ black-box architectures obscure decision logic and introduce hidden biases. We propose Evolutionary Monte Carlo Tree Search (Evo-MCTS), a framework that addresses these limitations through systematic algorithm space exploration guided by domain-aware physical constraints. Our approach combines tree-structured search with evolutionary optimization and large language model heuristics to create interpretable algorithmic solutions. Our Evo-MCTS framework demonstrates substantial improvements, achieving a 20.2% improvement over state-of-the-art gravitational wave detection algorithms on the MLGWSC-1 benchmark dataset. High-performing algorithm variants consistently exceed thresholds. The framework generates human-interpretable algorithmic pathways that reveal distinct performance patterns. Beyond performance improvements, our framework discovers novel algorithmic combinations, thereby establishing a transferable methodology for automated algorithmic discovery across computational science domains.

[AI-6] Probing the Gaps in ChatGPT Live Video Chat for Real-World Assistance for People who are Blind or Visually Impaired

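[Quick read]: This paper addresses the effectiveness and safety of current live-video generative AI in helping blind or visually impaired (BVI) people with everyday tasks, in particular its limited ability to adapt to dynamic environments and its potential to cause user misunderstanding and risk. The key contribution is an empirical study showing that existing systems provide useful guidance in static scenes but lack accurate spatial awareness and live description in dynamic situations. The paper also stresses the need to add multimodal sensing, to improve the timing of interventions, and to design more reliable assistive video AI agents with ecological fit and safety in mind.

Link: https://arxiv.org/abs/2508.03651
Authors: Ruei-Che Chang, Rosiana Natalie, Wenqian Xu, Jovan Zheng Feng Yap, Anhong Guo
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: ACM ASSETS 2025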

Abstract:Recent advancements in large multimodal models have provided blind or visually impaired (BVI) individuals with new capabilities to interpret and engage with the real world through interactive systems that utilize live video feeds. However, the potential benefits and challenges of such capabilities to support diverse real-world assistive tasks remain unclear. In this paper, we present findings from an exploratory study with eight BVI participants. Participants used ChatGPT’s Advanced Voice with Video, a state-of-the-art live video AI released in late 2024, in various real-world scenarios, from locating objects to recognizing visual landmarks, across unfamiliar indoor and outdoor environments. Our findings indicate that current live video AI effectively provides guidance and answers for static visual scenes but falls short in delivering essential live descriptions required in dynamic situations. Despite inaccuracies in spatial and distance information, participants leveraged the provided visual information to supplement their mobility strategies. Although the system was perceived as human-like due to high-quality voice interactions, assumptions about users’ visual abilities, hallucinations, generic responses, and a tendency towards sycophancy led to confusion, distrust, and potential risks for BVI users. Based on the results, we discuss implications for assistive video AI agents, including incorporating additional sensing capabilities for real-world use, determining appropriate intervention timing beyond turn-taking interactions, and addressing ecological and safety concerns.

[AI-7] Cross-Model Semantics in Representation Learning

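[Quick read]: This paper addresses the sensitivity of deep neural networks' internal representations to architecture-specific choices, that is, the uncertainty about how stable, aligned, and transferable learned structure is across models. The key to the solution is introducing structural constraints (such as linear shaping operators and corrective paths) and building a framework for measuring and analyzing representational alignment between networks with distinct but related architectural priors, which shows that these structural regularities induce representational geometry that is more stable under architectural variation. This indicates that certain forms of inductive bias not only support generalization within a single model but also improve the interoperability of learned features across models.

Link: https://arxiv.org/abs/2508.03649
Authors: Saleh Nikooroo, Thomas Engel
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: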

Abstract:The internal representations learned by deep networks are often sensitive to architecture-specific choices, raising questions about the stability, alignment, and transferability of learned structure across models. In this paper, we investigate how structural constraints–such as linear shaping operators and corrective paths–affect the compatibility of internal representations across different architectures. Building on the insights from prior studies on structured transformations and convergence, we develop a framework for measuring and analyzing representational alignment across networks with distinct but related architectural priors. Through a combination of theoretical insights, empirical probes, and controlled transfer experiments, we demonstrate that structural regularities induce representational geometry that is more stable under architectural variation. This suggests that certain forms of inductive bias not only support generalization within a model, but also improve the interoperability of learned features across models. We conclude with a discussion on the implications of representational transferability for model distillation, modular learning, and the principled design of robust learning systems.

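[AI-8] LLMDistill4Ads: Using Cross-Encoders to Distill from LLM Signals for Advertiser Keyphrase Recommendations at eBay

[Quick read]: This paper addresses degraded retrieval performance in advertiser keyphrase recommendation caused by biases in click data, specifically how to improve the ability of an Embedding Based Retrieval (EBR) model to retrieve relevant advertiser keyphrases on the e-commerce platform eBay. The key to the solution is a two-step large language model (LLM) knowledge distillation process: an LLM-as-a-judge first generates high-quality labels to correct biases in the click data, and then a cross-encoder assistant distills from the LLM teacher into a bi-encoder student under a multi-task training framework, improving the relevance of the keyphrases the EBR model retrieves in practice.

Link: https://arxiv.org/abs/2508.03628
Authors: Soumik Dey, Benjamin Braun, Naveen Ravipati, Hansi Wu, Binbin Li
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: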

Abstract:Sellers at eBay are recommended keyphrases to bid on to enhance the performance of their advertising campaigns. The relevance of these keyphrases is crucial in avoiding the overcrowding of search systems with irrelevant items and maintaining a positive seller perception. It is essential that keyphrase recommendations align with both seller and Search judgments regarding auctions. Due to the difficulty in procuring negative human judgment at scale, employing LLM-as-a-judge to mimic seller judgment has been established as the norm in several studies. This study introduces a novel two-step LLM distillation process from a LLM-judge used to debias our Embedding Based Retrieval (EBR) model from the various biases that exist in click-data. We distill from an LLM teacher via a cross-encoder assistant into a bi-encoder student using a multi-task training approach, ultimately employing the student bi-encoder to retrieve relevant advertiser keyphrases. We show that integrating a knowledge distillation process from LLMs in a multi-task training setup enhances bi-encoder performance in retrieving relevant advertiser keyphrases at eBay.

[AI-9] Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework

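[Quick read]: This paper addresses the high sensitivity of large language models (LLMs) to faulty premises in code generation tasks: when user input contains mistaken assumptions, models are prone to code generation hallucinations, exposing weak self-scrutiny. The key to the solution is Faulty Premises Bench (FPBench), the first evaluation framework targeting this problem. By systematically constructing three categories of faulty premises and combining multi-dimensional evaluation metrics, it evaluates 15 representative LLMs in depth and reveals weak reasoning under faulty premises, diminishing returns on resource investment, and a triple dissociation in which each premise type triggers a distinct defect pattern. This provides a theoretical foundation and a practical path toward more reliable, human-centric code generation models that proactively verify premises.

Link: https://arxiv.org/abs/2508.03622
Authors: Jialin Li, Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: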

Abstract: With the advancement of code generation capabilities in large language models (LLMs), their reliance on input premises has intensified. When users provide inputs containing faulty premises, the probability of code generation hallucinations rises significantly, exposing deficiencies in their self-scrutiny capabilities. This paper proposes Faulty Premises Bench (FPBench), the first code generation evaluation framework targeting faulty premises. By systematically constructing three categories of faulty premises and integrating multi-dimensional evaluation metrics, it conducts in-depth assessments of 15 representative LLMs. The key findings are as follows: (1) Most models exhibit poor reasoning abilities and suboptimal code generation performance under faulty premises, heavily relying on explicit prompts for error detection, with limited self-scrutiny capabilities; (2) Faulty premises trigger a point of diminishing returns in resource investment, so that blindly increasing output length fails to enhance quality; (3) The three types of faulty premises respectively activate distinct defect patterns in models, revealing a triple dissociation in the cognitive mechanisms of code generation models. This study not only highlights the urgent need for LLMs to proactively verify premises in code generation but also, through the proposed FPBench framework and multi-dimensional evaluation system, provides a theoretical foundation and practical pathway for developing reliable, human-centric code generation models.

[AI-10] Hidden Dynamics of Massive Activations in Transformer Training

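[Quick read]: This paper addresses the poorly understood temporal dynamics of massive activations during Transformer training. Prior work focused mainly on static properties of fully trained models and offered no systematic account of how these unusually large activation values develop over the course of training. The key to the solution is the first multi-scale, multi-checkpoint analysis using the Pythia model family, which finds that the emergence of massive activations follows predictable mathematical patterns that can be accurately modeled by a five-parameter exponentially-modulated logarithmic function. The authors further build a machine learning framework that predicts these parameters from architectural specifications alone, with high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. This lets model architects anticipate, and potentially control, key activation properties before training, with implications for model stability, training efficiency, and interpretability.

Link: https://arxiv.org/abs/2508.03616
Authors: Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: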

Abstract:Massive activations are scalar values in transformer hidden states that achieve values orders of magnitude larger than typical activations and have been shown to be critical for model functionality. While prior work has characterized these phenomena in fully trained models, the temporal dynamics of their emergence during training remain poorly understood. We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed. Through systematic analysis of various model sizes across multiple training checkpoints, we demonstrate that massive activation emergence follows predictable mathematical patterns that can be accurately modeled using an exponentially-modulated logarithmic function with five key parameters. We develop a machine learning framework to predict these mathematical parameters from architectural specifications alone, achieving high accuracy for steady-state behavior and moderate accuracy for emergence timing and magnitude. These findings enable architects to predict and potentially control key aspects of massive activation emergence through design choices, with significant implications for model stability, training cycle length, interpretability, and optimization. Our findings demonstrate that the emergence of massive activations is governed by model design and can be anticipated, and potentially controlled, before training begins.

[AI-11] Goedel-Prover-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self-Correction

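[Quick Read]: This paper targets three limits on automated theorem proving (ATP) models: the complexity of available training data, the lack of an effective feedback mechanism, and the loss of output diversity late in training. The solution rests on three innovations: (1) scaffolded data synthesis, which trains the model on synthetic tasks of increasing difficulty so that it masters progressively harder theorems; (2) verifier-guided self-correction, which uses Lean compiler feedback to iteratively revise proofs and improve correctness; and (3) model averaging, which merges checkpoints from training to counter the late-stage drop in output diversity. With these changes, Goedel-Prover-V2 significantly outperforms prior state-of-the-art open-source models on benchmarks such as MiniF2F and PutnamBench, with a smaller model size and compute budget.

Link: https://arxiv.org/abs/2508.03613
Authors: Yong Lin,Shange Tang,Bohan Lyu,Ziran Yang,Jui-Hui Chung,Haoyu Zhao,Lai Jiang,Yihan Geng,Jiawei Ge,Jingruo Sun,Jiayun Wu,Jiri Gesi,Ximing Lu,David Acuna,Kaiyu Yang,Hongzhou Lin,Yejin Choi,Danqi Chen,Sanjeev Arora,Chi Jin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 24 pages, 10 figures, 4 tables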

Abstract:We introduce Goedel-Prover-V2, a series of open-source language models that set a new state-of-the-art in automated theorem proving. Built on the standard expert iteration and reinforcement learning pipeline, our approach incorporates three key innovations: (1) Scaffolded data synthesis: We generate synthetic tasks of increasing difficulty to train the model to master increasingly complex theorems; (2) Verifier-guided self-correction: We enable the model to iteratively revise its proofs by leveraging feedback from the Lean compiler; (3) Model averaging: We merge model checkpoints to mitigate the decrease in model output diversity in later stages of training. Our small model, Goedel-Prover-V2-8B, reaches 84.6% pass@32 on MiniF2F and outperforms DeepSeek-Prover-V2-671B under the same metric, despite being 80X smaller. Our flagship model, Goedel-Prover-V2-32B, achieves 88.1% on MiniF2F at pass@32 in standard mode and 90.4% in self-correction mode, outperforming prior SOTA by a large margin. Additionally, our flagship model solves 86 problems on PutnamBench at pass@184, securing the first place among open-source models on the leaderboard, surpassing DeepSeek-Prover-V2-671B’s record of solving 47 problems by pass@1024 with a significantly smaller model size and compute budget. At the time of its release (July-August 2025), Goedel-Prover-V2 achieves the strongest overall performance among all open-source theorem provers. It also ranks among the top-performing models–including closed-source systems with publicly reported performance–under a constrained test-time compute budget. Our models, code, and data are released at this https URL.
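
The results above are reported as pass@k (for example pass@32). The digest does not say how the authors estimate it, so the sketch below shows the widely used unbiased estimator, which computes the probability that at least one of `k` samples drawn from `n` attempts is correct when `c` of the `n` attempts passed. Treat it as an illustration of the metric, not as the paper's evaluation code.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled proofs, c of which verified.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a random
    size-k subset of the n attempts contains no correct proof.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Fewer than k failures exist, so every size-k subset has a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 2 attempts and 1 success, a single draw succeeds half the time.
half = pass_at_k(n=2, c=1, k=1)  # 0.5
```

A benchmark-level score is then the mean of `pass_at_k` over all problems.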

[AI-12] Block: Balancing Load in LLM Serving with Context Knowledge and Predictive Scheduling

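[Quick Read]: This paper addresses load imbalance and inefficient auto-provisioning in large language model (LLM) serving frameworks, where monolithic schedulers built on heuristic policies struggle to optimize performance under dynamic request patterns. The key is Block, a distributed scheduling framework that performs predictive scheduling using contextual information about incoming requests (such as host configuration, response length, and hardware performance). Block is fully distributed, stateless, and low-overhead, so it raises serving throughput and lowers P99 tail latency while remaining reliable and scalable.

Link: https://arxiv.org/abs/2508.03611
Authors: Wei Da,Evangelia Kalyvianaki
Affiliations: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments: 12 pages, 8 figures excluding appendix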

Abstract:This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests. Unlike popular model serving systems that rely on monolithic and heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduling system to achieve low overhead, reliability, and scalability. It leverages the deterministic and predictable characteristics of LLM inferences, such as host configurations, response lengths, and hardware performance, to make scheduling decisions based on accurately predicted metrics. Evaluation on a 12-GPU cluster shows that Block significantly outperforms heuristic schedulers, boosting serving capacity by up to 16.7% and reducing P99 tail latency by up to 49.5%. These performance gains remain consistent across diverse models, workloads and configurations. Code and data are open-sourced.
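
The idea of scheduling on predicted metrics rather than on a heuristic such as round-robin can be sketched as follows. This is a minimal illustration, not Block's implementation: the `Instance` fields, the latency formula, and the assumption that a predicted response length is available per request are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    queued_tokens: int        # tokens already scheduled on this instance (assumed metric)
    tokens_per_second: float  # measured decode speed of the hardware (assumed metric)

def predicted_latency(instance, predicted_response_tokens):
    """Estimated seconds until a new request would finish on this instance."""
    total = instance.queued_tokens + predicted_response_tokens
    return total / instance.tokens_per_second

def pick_instance(instances, predicted_response_tokens):
    """Route the request to the instance with the lowest predicted latency."""
    return min(instances, key=lambda i: predicted_latency(i, predicted_response_tokens))

cluster = [
    Instance("a", queued_tokens=1000, tokens_per_second=100.0),  # 12.0 s for a 200-token reply
    Instance("b", queued_tokens=300, tokens_per_second=50.0),    # 10.0 s for a 200-token reply
]
chosen = pick_instance(cluster, predicted_response_tokens=200)   # instance "b"
```

Because the decision depends only on the request and on metrics passed in at call time, the function itself holds no state, which is in the spirit of the stateless design the abstract describes.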

[AI-13] DeepFaith: A Domain-Free and Model-Agnostic Unified Framework for Highly Faithful Explanations

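[Quick Read]: This paper addresses the fact that existing explainable AI (XAI) methods cannot be objectively evaluated or optimized because no unified optimal explanation exists. The key solution is the Deep architecture-based Faith explainer (DeepFaith), a domain-free and model-agnostic unified explanation framework built around faithfulness. By casting several widely used, well-validated faithfulness metrics into a unified formulation, the authors derive an explanation objective whose solution is optimally faithful across those metrics simultaneously, which supplies a theoretical ground truth. The framework also includes an explainer learning scheme that uses multiple existing explanation methods to generate high-quality supervision signals, applies deduplication and filtering, and optimizes a pattern consistency loss and a local correlation loss to train a highly faithful explainer. Once trained, DeepFaith produces highly faithful explanations in a single forward pass, without access to the model being explained.

Link: https://arxiv.org/abs/2508.03586
Authors: Yuhan Guo,Lizhong Ding,Shihan Jia,Yanyu Ren,Pengqi Li,Jiarun Fu,Changsheng Li,Ye yuan,Guoren Wang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 22 pages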

Abstract:Explainable AI (XAI) builds trust in complex systems through model attribution methods that reveal the decision rationale. However, due to the absence of a unified optimal explanation, existing XAI methods lack a ground truth for objective evaluation and optimization. To address this issue, we propose Deep architecture-based Faith explainer (DeepFaith), a domain-free and model-agnostic unified explanation framework under the lens of faithfulness. By establishing a unified formulation for multiple widely used and well-validated faithfulness metrics, we derive an optimal explanation objective whose solution simultaneously achieves optimal faithfulness across these metrics, thereby providing a ground truth from a theoretical perspective. We design an explainer learning framework that leverages multiple existing explanation methods, applies deduplicating and filtering to construct high-quality supervised explanation signals, and optimizes both pattern consistency loss and local correlation to train a faithful explainer. Once trained, DeepFaith can generate highly faithful explanations through a single forward pass without accessing the model being explained. On 12 diverse explanation tasks spanning 6 models and 6 datasets, DeepFaith achieves the highest overall faithfulness across 10 metrics compared to all baseline methods, highlighting its effectiveness and cross-domain generalizability.

[AI-14] EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

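[Quick Read]: This paper addresses coarse and unstable emotion control in existing text-to-speech (TTS) systems. Most models rely on discrete emotion labels or elaborate text prompts, which makes fine-grained, continuous, and interpretable emotion manipulation hard to achieve, and training depends heavily on high-quality datasets. The key is EmoSteer-TTS, a training-free activation steering method that identifies and adjusts a specific subset of activations inside a flow matching-based TTS model to perform fine-grained operations such as emotion conversion, interpolation, and erasure. The method has three steps (activation extraction, emotional token searching, and inference-time steering) and uses a curated, diverse emotional speech dataset to derive effective steering vectors, giving efficient, controllable, and continuous emotional speech synthesis without changing the pretrained model's parameters.

Link: https://arxiv.org/abs/2508.03543
Authors: Tianxin Xie,Shan Yang,Chenxing Li,Dong Yu,Li Liu
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI)
Comments: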

Abstract:Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS.
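
Activation steering in general can be sketched in a few lines. The version below assumes the common recipe of taking the difference between mean activations of emotional and neutral samples as the steering vector and adding a scaled copy of it at inference time; the digest does not confirm that EmoSteer-TTS computes its vectors this way, and both function names are hypothetical.

```python
def mean_difference(emotional, neutral):
    """Steering vector: mean emotional activation minus mean neutral activation.

    Both arguments are lists of equal-length activation vectors (assumed to be
    extracted from the same layer of a frozen model).
    """
    dim = len(emotional[0])
    mean_emotional = [sum(v[i] for v in emotional) / len(emotional) for i in range(dim)]
    mean_neutral = [sum(v[i] for v in neutral) / len(neutral) for i in range(dim)]
    return [e - n for e, n in zip(mean_emotional, mean_neutral)]

def steer(activation, steering_vector, alpha):
    """Shift an activation along the steering direction; no weights are changed."""
    return [a + alpha * s for a, s in zip(activation, steering_vector)]

vector = mean_difference(emotional=[[1.0, 2.0], [3.0, 4.0]], neutral=[[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.0, 0.0], vector, alpha=0.5)
```

In this sketch `alpha` is the continuous control knob: values between 0 and 1 interpolate toward the target emotion, 0 leaves the activation untouched, and a negative value pushes away from it, which loosely corresponds to the conversion, interpolation, and erasure operations named in the abstract.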

[AI-15] Error Detection and Correction for Interpretable Mathematics in Large Language Models

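[Quick Read]: This paper addresses errors that large language models (LLMs) make on interpretable mathematics tasks: faulty intermediate steps, hallucinations, and failure to follow the prescribed output format, all of which lead to inaccurate final predictions. The key is EDCIM (Error Detection and Correction for Interpretable Mathematics). An LLM generates a system of symbolic equations for the problem, and a symbolic error-detection framework then identifies errors and returns targeted feedback that guides the LLM's correction. EDCIM also uses a hybrid architecture that pairs lightweight open-source models with more powerful proprietary ones, giving a controllable balance between cost and accuracy that is tuned through a single hyperparameter.

Link: https://arxiv.org/abs/2508.03500
Authors: Yijin Yang,Cristina Cornelio,Mario Leiva,Paulo Shakarian
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: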

Abstract:Recent large language models (LLMs) have demonstrated the ability to perform explicit multi-step reasoning such as chain-of-thought prompting. However, their intermediate steps often contain errors that can propagate, leading to inaccurate final predictions. Additionally, LLMs still struggle with hallucinations and often fail to adhere to prescribed output formats, which is particularly problematic for tasks like generating mathematical expressions or source code. This work introduces EDCIM (Error Detection and Correction for Interpretable Mathematics), a method for detecting and correcting these errors in interpretable mathematics tasks, where the model must generate the exact functional form that explicitly solves the problem (expressed in natural language) rather than a black-box solution. EDCIM uses LLMs to generate a system of equations for a given problem, followed by a symbolic error-detection framework that identifies errors and provides targeted feedback for LLM-based correction. To optimize efficiency, EDCIM integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy. This balance is controlled by a single hyperparameter, allowing users to control the trade-off based on their cost and accuracy requirements. Experimental results across different datasets show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy when the balance is properly configured.

[AI-16] VQA support to Arabic Language Learning Educational Tool

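[Quick Read]: This paper addresses the scarcity of Arabic language learning tools in education, in particular the lack of resources built on modern pedagogy such as active learning, which holds back non-native learners. The key is an AI-powered educational tool that uses Vision-Language Pretraining models to generate contextually relevant image descriptions and then prompts a Large Language Model to produce customized interactive Visual Question Answering exercises, following a constructivist approach to improve vocabulary, grammar, and comprehension. The tool is evaluated on a manually annotated benchmark of 1266 real-life visual quizzes; the results support its accuracy and effectiveness and suggest its potential as a personalized, interactive Arabic learning platform.

Link: https://arxiv.org/abs/2508.03488
Authors: Khaled Bachir Delassi (1),Lakhdar Zeggane (1),Hadda Cherroun (1),Abdelhamid Haouhat (1),Kaoutar Bouzouad (2)
Affiliations: (1) LIM Lab, Amar Telidji University, Laghouat, Algeria; (2) Computer Science Dept., USTHB, Algiers, Algeria
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: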

Abstract:We address the scarcity of educational Arabic language learning tools that adopt modern pedagogical models such as active learning, which supports language proficiency. We investigate the design and evaluation of an AI-powered educational tool designed to enhance Arabic language learning for non-native speakers with beginner-to-intermediate proficiency. The tool leverages advanced AI models to generate interactive visual quizzes, deploying Visual Question Answering as the primary activity. Adopting a constructivist learning approach, the system encourages active learning through real-life visual quizzes and image-based questions that focus on improving vocabulary, grammar, and comprehension. The system integrates Vision-Language Pretraining models to generate contextually relevant image descriptions, from which a Large Language Model generates assignments in the form of customized Arabic language learning quizzes through prompting. The effectiveness of the tool is evaluated through a manually annotated benchmark consisting of 1266 real-life visual quizzes, with human participants providing feedback. The results show suitable accuracy rates, validating the tool’s potential to bridge the gap in Arabic language education and highlighting its promise as a reliable, AI-powered resource for Arabic learners, offering personalized and interactive learning experiences.

[AI-17] BitsAI-Fix: LLM-Driven Approach for Automated Lint Error Resolution in Practice

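[Quick Read]: This paper addresses the surge of static analysis (lint) errors in enterprise codebases, which accumulates technical debt and slows development. The core solution is BitsAI-Fix, an automated lint error remediation workflow based on Large Language Models (LLMs). Its key elements are: context expansion with tree-sitter to improve fix accuracy; specially trained LLMs that generate directly applicable search-and-replace patches, followed by lint re-scanning to verify each fix; a progressive reinforcement learning (RL) training strategy that automatically acquires verifiable training data during project cold start and keeps iterating on the model with online feedback samples after deployment; a targeted rule-based reward mechanism that combines format and correctness rewards and penalizes redundant modifications; and a "code diff matching" method for continuously tracking online effectiveness. The system is deployed at scale at ByteDance, where it has supported over 5,000 engineers and resolved more than 12,000 static analysis issues with approximately 85% remediation accuracy.

Link: https://arxiv.org/abs/2508.03487
Authors: Yuanpeng Li,Qi Long,Zhiyuan Yao,Jian Xu,Lintao Xie,Xu He,Lu Geng,Xin Han,Yueyan Chen,Wenbo Duan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: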

Abstract:As enterprise codebases continue to grow in scale and complexity, the volume of lint errors far exceeds engineers’ manual remediation capacity, leading to continuous accumulation of technical debt and hindered development efficiency. This paper presents BitsAI-Fix, an automated lint error remediation workflow based on Large Language Models (LLMs), designed to address this critical challenge in industrial-scale environments. BitsAI-Fix employs tree-sitter for context expansion and generates search-and-replace format patches through specially trained LLMs, followed by lint scan re-verification to output final remediation results. Additionally, our approach introduces an innovative progressive reinforcement learning (RL) training strategy that can automatically acquire verifiable training data during the project cold-start phase and continuously iterate the model by collecting online samples through feedback after system deployment. Furthermore, we designed a targeted rule-based reward mechanism that combines format rewards and correctness rewards while penalizing redundant modifications. We also propose a “code diff matching” methodology to continuously track online effectiveness. In production deployment at ByteDance, our solution has supported over 5,000 engineers, resolved more than 12,000 static analysis issues, achieved approximately 85% remediation accuracy, with around 1,000 weekly active adopters. This work demonstrates the practical feasibility of LLM-based code remediation solutions in enterprise environments and serves as a reference for automated code fix in large-scale industrial scenarios.
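
A search-and-replace patch is easy to apply and to validate mechanically, which is one reason the format suits automated fixes. The sketch below is a hypothetical applier, not BitsAI-Fix's code: it assumes a patch is a `(search, replace)` pair of strings and rejects any patch whose search text does not match exactly once, so that an ambiguous or stale patch never touches the file.

```python
def apply_patch(source, search, replace):
    """Apply one search-and-replace patch, requiring a unique match.

    Raises ValueError when the search text is missing or ambiguous, so the
    caller can reject the patch before re-running the linter.
    """
    matches = source.count(search)
    if matches != 1:
        raise ValueError(f"search block matched {matches} times, expected exactly 1")
    return source.replace(search, replace, 1)

original = "import os\nx = 1\nprint(x)\n"
# Hypothetical fix for an unused-import lint error.
fixed = apply_patch(original, search="import os\n", replace="")
```

In a workflow like the one described, a rejected patch would count as a format failure, and an accepted one would still need to pass the lint re-scan before being reported as a fix.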

[AI-18] Semantic-aware Graph-guided Behavior Sequences Generation with Large Language Models for Smart Homes

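[Quick Read]: This paper addresses the performance degradation of downstream smart home models under behavioral drift, especially when no new data is available for retraining. The key is SmartGen, a Large Language Model (LLM)-based framework with four components: (1) a time- and semantic-aware split module that divides long behavior sequences into semantically coherent subsequences; (2) semantic-aware sequence compression, which shortens the input while preserving key semantic features; (3) graph-guided sequence synthesis, which builds a behavior relationship graph and encodes frequent transitions into the prompt so that the LLM generates data that follows contextual changes while retaining core behavior patterns; and (4) a two-stage outlier filter that improves the factual coherence and behavioral validity of the generated data. Together these support continual adaptation of smart home models in dynamic environments.

Link: https://arxiv.org/abs/2508.03484
Authors: Zhiyao Xu,Dan Zhao,Qingsong Zou,Qing Li,Yong Jiang,Yuhang Wang,Jingyu Xiao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: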

Abstract:As smart homes become increasingly prevalent, intelligent models are widely used for tasks such as anomaly detection and behavior prediction. These models are typically trained on static datasets, making them brittle to behavioral drift caused by seasonal changes, lifestyle shifts, or evolving routines. However, collecting new behavior data for retraining is often impractical due to its slow pace, high cost, and privacy concerns. In this paper, we propose SmartGen, an LLM-based framework that synthesizes context-aware user behavior data to support continual adaptation of downstream smart home models. SmartGen consists of four key components. First, we design a Time and Semantic-aware Split module to divide long behavior sequences into manageable, semantically coherent subsequences under dual time-span constraints. Second, we propose Semantic-aware Sequence Compression to reduce input length while preserving representative semantics by clustering behavior mapping in latent space. Third, we introduce Graph-guided Sequence Synthesis, which constructs a behavior relationship graph and encodes frequent transitions into prompts, guiding the LLM to generate data aligned with contextual changes while retaining core behavior patterns. Finally, we design a Two-stage Outlier Filter to identify and remove implausible or semantically inconsistent outputs, aiming to improve the factual coherence and behavioral validity of the generated sequences. Experiments on three real-world datasets demonstrate that SmartGen significantly enhances model performance on anomaly detection and behavior prediction tasks under behavioral drift, with anomaly detection improving by 85.43% and behavior prediction by 70.51% on average. The code is available at this https URL.
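
The graph-guided step (build a behavior relationship graph, then encode frequent transitions into a prompt) can be illustrated with a simple transition count. This is a minimal sketch under stated assumptions: behaviors are plain strings, the graph is a counter over adjacent pairs, and the prompt wording and the `min_count` threshold are hypothetical rather than SmartGen's own.

```python
from collections import Counter

def frequent_transitions(sequences, min_count):
    """Count adjacent behavior pairs and keep those seen at least min_count times."""
    counts = Counter()
    for sequence in sequences:
        counts.update(zip(sequence, sequence[1:]))
    return {pair: n for pair, n in counts.items() if n >= min_count}

def transitions_to_prompt(transitions):
    """Render the retained graph edges as text that can be placed in an LLM prompt."""
    lines = [f"{a} -> {b} (seen {n} times)" for (a, b), n in sorted(transitions.items())]
    return "Frequent behavior transitions:\n" + "\n".join(lines)

# Illustrative sequences, not taken from the paper's datasets.
logs = [
    ["wake", "light_on", "coffee"],
    ["wake", "light_on", "tv"],
]
edges = frequent_transitions(logs, min_count=2)
prompt = transitions_to_prompt(edges)
```

Only the `wake -> light_on` edge survives the threshold here, so the prompt carries the stable core pattern while leaving the rarer follow-up behaviors for the model to vary.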

[AI-19] Toward a Graph-Theoretic Model of Belief: Confidence, Credibility, and Structural Coherence

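[Quick Read]: This paper addresses limitations of traditional belief system models. Existing representations (globally consistent sets of propositions, or scalar-valued probability distributions) tend to obscure the internal structure of belief, conflate external credibility with internal coherence, and struggle to capture fragmented or contradictory epistemic states. The key is a minimal formalism based on directed, weighted graphs: nodes represent individual beliefs, edges encode epistemic relationships (such as support or contradiction), and two distinct functions assign each belief a credibility (reflecting source trust) and a confidence (derived from internal structural support). The approach assumes no prior coherence and requires no belief updating, and it avoids the binary justification status and deductive closure imposed by logical or argumentation-based frameworks. The result is a static, fine-grained representation of the internal organization of belief systems, which gives a richer basis for analyzing coherence conditions, epistemic tensions, and representational limits.

Link: https://arxiv.org/abs/2508.03465
Authors: Saleh Nikooroo
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: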

Abstract:Belief systems are often treated as globally consistent sets of propositions or as scalar-valued probability distributions. Such representations tend to obscure the internal structure of belief, conflate external credibility with internal coherence, and preclude the modeling of fragmented or contradictory epistemic states. This paper introduces a minimal formalism for belief systems as directed, weighted graphs. In this framework, nodes represent individual beliefs, edges encode epistemic relationships (e.g., support or contradiction), and two distinct functions assign each belief a credibility (reflecting source trust) and a confidence (derived from internal structural support). Unlike classical probabilistic models, our approach does not assume prior coherence or require belief updating. Unlike logical and argumentation-based frameworks, it supports fine-grained structural representation without committing to binary justification status or deductive closure. The model is purely static and deliberately excludes inference or revision procedures. Its aim is to provide a foundational substrate for analyzing the internal organization of belief systems, including coherence conditions, epistemic tensions, and representational limits. By distinguishing belief structure from belief strength, this formalism enables a richer classification of epistemic states than existing probabilistic, logical, or argumentation-based approaches.
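
The data structure described here (beliefs as nodes, signed weighted edges, and separate credibility and confidence values) can be sketched as below. The class layout is an assumption, and the `confidence` rule in particular (net incoming edge weight) is a placeholder chosen for illustration; the digest does not state how the paper derives confidence from structural support.

```python
from dataclasses import dataclass, field

@dataclass
class BeliefGraph:
    # belief name -> credibility in [0, 1], reflecting trust in the source
    credibility: dict = field(default_factory=dict)
    # (source, target, weight): weight > 0 means support, weight < 0 means contradiction
    edges: list = field(default_factory=list)

    def add_belief(self, name, credibility):
        self.credibility[name] = credibility

    def add_edge(self, source, target, weight):
        if source not in self.credibility or target not in self.credibility:
            raise KeyError("both beliefs must be added before linking them")
        self.edges.append((source, target, weight))

    def confidence(self, name):
        """Illustrative rule only: net weight of incoming support and contradiction."""
        return sum(w for _, target, w in self.edges if target == name)

graph = BeliefGraph()
graph.add_belief("a", credibility=0.9)
graph.add_belief("b", credibility=0.4)
graph.add_belief("c", credibility=0.2)
graph.add_edge("a", "c", 0.75)    # a supports c
graph.add_edge("b", "c", -0.25)   # b contradicts c
```

Belief `c` has low credibility (0.2) yet positive structural confidence (0.5) in this toy graph, and the contradiction from `b` is stored rather than resolved, which reflects the separation of the two quantities and the tolerance for inconsistent states that the abstract emphasizes.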

[AI-20] SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

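[Quick Read]: This paper addresses audio quality problems common in music recordings, such as excessive reverberation, distortion, clipping, tonal imbalance, and a narrowed stereo image, which are especially prevalent in non-professional settings. Traditional workflows rely on several specialized tools and manual adjustment, which is inefficient and handles multiple problems poorly. The key is SonicMaster, the first unified generative model for music restoration and mastering. It applies targeted enhancements to specific audio defects from natural language instructions and also offers an automatic mode for general restoration. The model uses a flow-matching generative training paradigm to learn the mapping from degraded inputs to high-quality outputs, and it is trained on the purpose-built SonicMaster dataset (nineteen degradation functions grouped into five enhancement categories), giving end-to-end restoration and controllable enhancement across many artifact types.

Link: https://arxiv.org/abs/2508.03448
Authors: Jan Melechovsky,Ambuj Mehrish,Dorien Herremans
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
Comments: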

Abstract:Music recordings often suffer from audio quality issues such as excessive reverberation, distortion, clipping, tonal imbalances, and a narrowed stereo image, especially when created in non-professional settings without specialized equipment or expertise. These problems are typically corrected using separate specialized tools and manual adjustments. In this paper, we introduce SonicMaster, the first unified generative model for music restoration and mastering that addresses a broad spectrum of audio artifacts with text-based control. SonicMaster is conditioned on natural language instructions to apply targeted enhancements, or can operate in an automatic mode for general restoration. To train this model, we construct the SonicMaster dataset, a large dataset of paired degraded and high-quality tracks by simulating common degradation types with nineteen degradation functions belonging to five enhancement groups: equalization, dynamics, reverb, amplitude, and stereo. Our approach leverages a flow-matching generative training paradigm to learn an audio transformation that maps degraded inputs to their cleaned, mastered versions guided by text prompts. Objective audio quality metrics demonstrate that SonicMaster significantly improves sound quality across all artifact categories. Furthermore, subjective listening tests confirm that listeners prefer SonicMaster’s enhanced outputs over the original degraded audio, highlighting the effectiveness of our unified approach.

[AI-21] Data Overdose? Time for a Quadruple Shot: Knowledge Graph Construction using Enhanced Triple Extraction

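[Quick Read]: This paper addresses the widening gap in medicine between the rapid growth of publicly available medical data and the capacity of clinicians and researchers to absorb it; the volume of scientific literature is hard to review and apply systematically. The core solution is an automated information extraction and knowledge graph (KG) generation pipeline built on large language model (LLM) agents. PubMed abstracts are first decomposed into semantically meaningful proposition sentences, from which KG triples are extracted. A context variable then extends each triple into a "quadruple" so that it can stand on its own, and open-domain and ontology-based information extraction are combined to incorporate ontological categories. Experiments show that sentences generated from the enhanced triples reach an average cosine similarity of 0.874 to the original propositions, higher than for ordinary triples. The approach can also infer new relationships and connect clusters in the knowledge base, pointing toward a sustainable knowledge source for medical practitioners that is updated in real time.

Link: https://arxiv.org/abs/2508.03438
Authors: Taine J. Elliott,Stephen P. Levitt,Ken Nixon,Martin Bekker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 18 pages, 8 figures, Published in the Annual Conference of South African Institute of Computer Scientists and Information Technologists, Preprint (author original)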

Abstract:The rapid expansion of publicly-available medical data presents a challenge for clinicians and researchers alike, increasing the gap between the volume of scientific literature and its applications. The steady growth of studies and findings overwhelms medical professionals at large, hindering their ability to systematically review and understand the latest knowledge. This paper presents an approach to information extraction and automatic knowledge graph (KG) generation to identify and connect biomedical knowledge. Through a pipeline of large language model (LLM) agents, the system decomposes 44 PubMed abstracts into semantically meaningful proposition sentences and extracts KG triples from these sentences. The triples are enhanced using a combination of open domain and ontology-based information extraction methodologies to incorporate ontological categories. On top of this, a context variable is included during extraction to allow the triple to stand on its own, thereby becoming a ‘quadruple’. The extraction accuracy of the LLM is validated by comparing natural language sentences generated from the enhanced triples to the original propositions, achieving an average cosine similarity of 0.874. The similarity of sentences generated from enhanced triples was compared with that of sentences generated from ordinary triples, showing an increase as a result of the context variable. Furthermore, this research explores the ability of LLMs to infer new relationships and connect clusters in the knowledge base of the knowledge graph. This approach leads the way to provide medical practitioners with a centralised, updated in real-time, and sustainable knowledge source, and may be the foundation of similar gains in a wide variety of fields.
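
The two ideas in the validation step (a triple extended with a context field, and cosine similarity between the regenerated sentence and the original proposition) can be sketched as follows. The `Quadruple` fields and the example values are hypothetical, and the vectors stand in for sentence embeddings that a real pipeline would obtain from an embedding model not specified in this digest.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Quadruple:
    subject: str
    relation: str
    obj: str
    context: str  # extra qualifier that lets the triple stand on its own

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up example for illustration only, not an extraction from the paper.
fact = Quadruple("drug X", "reduces", "symptom Y", context="in a hypothetical adult cohort")

# Placeholder embeddings of the original proposition and the regenerated sentence.
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])  # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0.0
```

Averaging this score over all proposition and regenerated-sentence pairs gives a single figure of the kind the abstract reports (0.874).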

[AI-22] The Science Fiction Science Method

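[Quick Read]: This paper asks how to scientifically predict the social and behavioral impact of emerging technologies before they exist, so that their development and regulation can be guided in advance. Traditional approaches rely on qualitative narratives and yield little quantifiable evidence. The authors propose a method they call "science fiction science", which simulates future technology scenarios experimentally, assigns participants to controlled variations, and collects quantitative measures of behavior and attitudes. Its core idea is to turn science fiction settings into workable experimental designs while acknowledging the validity threats involved. The authors discuss constraints on the kinds of technology studied and on the immersive methods required, with the aim of moving the field from the margins toward normalized practice and a virtuous cycle of validity improvement.

Link: https://arxiv.org/abs/2508.03430
Authors: Iyad Rahwan,Azim Shariff,Jean-François Bonnefon
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: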

Abstract:Predicting the social and behavioral impact of future technologies, before they are achieved, would allow us to guide their development and regulation before these impacts get entrenched. Traditionally, this prediction has relied on qualitative, narrative methods. Here we describe a method which uses experimental methods to simulate future technologies, and collect quantitative measures of the attitudes and behaviors of participants assigned to controlled variations of the future. We call this method ‘science fiction science’. We suggest that the reason why this method has not been fully embraced yet, despite its potential benefits, is that experimental scientists may be reluctant to engage in work facing such serious validity threats as science fiction science. To address these threats, we consider possible constraints on the kind of technology that science fiction science may study, as well as the unconventional, immersive methods that science fiction science may require. We seek to provide perspective on the reasons why this method has been marginalized for so long, what benefits it would bring if it could be built on strong yet unusual methods, and how we can normalize these methods to help the diverse community of science fiction scientists to engage in a virtuous cycle of validity improvement.

[AI-23] Multi-Objective Infeasibility Diagnosis for Routing Problems Using Large Language Models

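[Quick Read]: This paper addresses infeasible optimization models in real-world routing problems, where conflicting or unreasonable user requirements produce constraints so restrictive or contradictory that the feasible set is empty. Existing Large Language Model (LLM)-based methods can diagnose infeasible models but do not systematically consider the multiple possible adjustments. The key is Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents with multi-objective optimization inside an automatic routing solver to generate a set of adjustments that trade off path cost against constraint violation. LLM agents then generate a solution analysis function for the original infeasible model, which extracts practical insights from these solutions and yields diverse diagnostic suggestions for restoring feasibility.

Link: https://arxiv.org/abs/2508.03406
Authors: Kai Li,Ruihao Zheng,Xinye Hao,Zhenkun Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: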

Abstract:In real-world routing problems, users often propose conflicting or unreasonable requirements, which result in infeasible optimization models due to overly restrictive or contradictory constraints, leading to an empty feasible solution set. Existing Large Language Model (LLM)-based methods attempt to diagnose infeasible models, but modifying such models often involves multiple potential adjustments that these methods do not consider. To fill this gap, we introduce Multi-Objective Infeasibility Diagnosis (MOID), which combines LLM agents and multi-objective optimization within an automatic routing solver, to provide a set of representative actionable suggestions. Specifically, MOID employs multi-objective optimization to consider both path cost and constraint violation, generating a set of trade-off solutions, each encompassing varying degrees of model adjustments. To extract practical insights from these solutions, MOID utilizes LLM agents to generate a solution analysis function for the infeasible model. This function analyzes these distinct solutions to diagnose the original infeasible model, providing users with diverse diagnostic insights and suggestions. Finally, we compare MOID with several LLM-based methods on 50 types of infeasible routing problems. The results indicate that MOID automatically generates multiple diagnostic suggestions in a single run, providing more practical insights for restoring model feasibility and decision-making compared to existing methods.
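
The "set of trade-off solutions" over path cost and constraint violation is a Pareto front. The sketch below filters candidate solutions down to the non-dominated ones; it assumes each candidate is summarized as a `(cost, violation)` pair with lower values better on both, and it is a generic illustration rather than MOID's optimizer.

```python
def pareto_front(solutions):
    """Return the (cost, violation) pairs that no other pair dominates.

    A pair dominates another when it is no worse on both objectives and
    differs from it; identical pairs do not dominate each other.
    """
    front = []
    for candidate in solutions:
        dominated = any(
            other != candidate
            and other[0] <= candidate[0]
            and other[1] <= candidate[1]
            for other in solutions
        )
        if not dominated:
            front.append(candidate)
    return front

# Illustrative values: (path cost, total constraint violation).
candidates = [(10, 0), (8, 2), (9, 3), (6, 5)]
trade_offs = pareto_front(candidates)  # (9, 3) is dropped: (8, 2) beats it on both
```

Each surviving pair corresponds to a different degree of constraint relaxation, which is the kind of material an analysis function would then turn into diagnostic suggestions.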

[AI-24] Hide and Seek with LLMs: An Adversarial Game for Sneaky Error Generation and Self-Improving Diagnosis

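[Quick Read]: This paper addresses the weakness of Large Language Models (LLMs) at identifying and diagnosing complex errors, which stems from training objectives that prioritize correct answers and leave little room to learn from error patterns. The key is the Hide and Seek Game (HSG), a dynamic adversarial framework with two opposing roles: Sneaky generates subtle, deceptive reasoning errors, and Diagnosis tries to detect them accurately. Through adversarial co-evolution, both error stealth and diagnostic precision improve. Experiments on math reasoning tasks show diagnostic accuracy 16.8%–31.4% higher than baselines such as GPT-4o.

Link: https://arxiv.org/abs/2508.03396
Authors: Rui Zou,Mengqi Wei,Yutao Zhu,Jirong Wen,Xin Zhao,Jing Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: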

Abstract:Large Language Models (LLMs) excel in reasoning and generation across domains, but still struggle with identifying and diagnosing complex errors. This stems mainly from training objectives that prioritize correct answers, limiting exposure to and learning from errors. While recent studies have begun to address this by introducing error signals, most rely on shallow, static errors, restricting improvement in deep diagnostic ability. To overcome this, we propose Hide and Seek Game (HSG), a dynamic adversarial framework for error generation and diagnosis, and evaluate it on mathematical problem-solving. HSG involves two adversarial roles: Sneaky, which “hides” by generating subtle, deceptive reasoning errors, and Diagnosis, which “seeks” to accurately detect them. Through adversarial co-evolution, both error stealth and diagnostic precision are enhanced. Experiments on several math reasoning tasks show that HSG significantly boosts error diagnosis, achieving 16.8%–31.4% higher accuracy than baselines like GPT-4o. We also release a challenging dataset of deceptive errors and diagnostic annotations as a benchmark for future research.

[AI-25] Agentic AI in 6G Software Businesses: A Layered Maturity Model

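[Quick Read]: This paper asks how to assess and advance the adoption of agentic AI systems, and the accompanying organizational transformation, in 6G software businesses, given challenges such as technical immaturity, integration complexity, limited organizational readiness, and performance-cost trade-offs. Through a preliminary thematic mapping, the authors identify 29 motivators and 27 demotivators, group each set into five high-level themes, and so provide a structured overview of organizational readiness. The study is the first phase of a broader research initiative to develop and validate a layered maturity model grounded in CMMI and three software architecture dimensions (data, business logic, and presentation), with the goal of giving software-driven organizations a practical tool to assess, plan, and advance their agent-first capabilities for the 6G era.

Link: https://arxiv.org/abs/2508.03393
Authors: Muhammad Zohaib,Muhammad Azeem Akbar,Sami Hyrynsalmi,Arif Ali Khan
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 6 pages, 3 figures and FIT’25 Conference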

Abstract:The emergence of agentic AI systems in 6G software businesses presents both strategic opportunities and significant challenges. While such systems promise increased autonomy, scalability, and intelligent decision-making across distributed environments, their adoption raises concerns regarding technical immaturity, integration complexity, organizational readiness, and performance-cost trade-offs. In this study, we conducted a preliminary thematic mapping to identify factors influencing the adoption of agentic software within the context of 6G. Drawing on a multivocal literature review and targeted scanning, we identified 29 motivators and 27 demotivators, which were further categorized into five high-level themes in each group. This thematic mapping offers a structured overview of the enabling and inhibiting forces shaping organizational readiness for agentic transformation. Positioned as a feasibility assessment, the study represents an early phase of a broader research initiative aimed at developing and validating a layered maturity model grounded in the CMMI model and three software architecture dimensions, possibly Data, Business Logic, and Presentation. Ultimately, this work seeks to provide a practical framework to help software-driven organizations assess, structure, and advance their agent-first capabilities in alignment with the demands of 6G.

[AI-26] Data Dependency Inference for Industrial Code Generation Based on UML Sequence Diagrams

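【Quick read】: This paper addresses the difficulty large language models (LLMs) have in capturing complex requirements when generating code from natural language (NL) descriptions, because textual descriptions are inherently ambiguous. This applies in particular to key constraints in service-oriented architectures such as implicit data dependencies, conditional logic, and system behaviors. The key to the solution is UML2Dep, a step-by-step code generation framework. First, it designs an enhanced Unified Modeling Language (UML) sequence diagram that integrates decision tables and API specifications to formalize explicitly the structural relationships and business logic flows in service interactions. Second, it introduces a data dependency inference (DDI) task that models data flow as a constrained mathematical reasoning problem and combines prompt engineering with static parsing to build an explicit dependency graph, reducing context complexity and improving the accuracy and efficiency of LLM reasoning.

Link: https://arxiv.org/abs/2508.03379
Authors: Wenxin Mao,Zhitao Wang,Long Wang,Sirong Chen,Cuiyun Gao,Luyang Cao,Ziming Liu,Qiming Zhang,Jun Zhou,Zhi Jin
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Comments: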

Abstract:Large language models (LLMs) excel at generating code from natural language (NL) descriptions. However, the plain textual descriptions are inherently ambiguous and often fail to capture complex requirements like intricate system behaviors, conditional logic, and architectural constraints; implicit data dependencies in service-oriented architectures are difficult to infer and handle correctly. To bridge this gap, we propose a novel step-by-step code generation framework named UML2Dep by leveraging unambiguous formal specifications of complex requirements. First, we introduce an enhanced Unified Modeling Language (UML) sequence diagram tailored for service-oriented architectures. This diagram extends traditional visual syntax by integrating decision tables and API specifications, explicitly formalizing structural relationships and business logic flows in service interactions to rigorously eliminate linguistic ambiguity. Second, recognizing the critical role of data flow, we introduce a dedicated data dependency inference (DDI) task. DDI systematically constructs an explicit data dependency graph prior to actual code synthesis. To ensure reliability, we formalize DDI as a constrained mathematical reasoning task through novel prompting strategies, aligning with LLMs’ excellent mathematical strengths. Additional static parsing and dependency pruning further reduce context complexity and cognitive load associated with intricate specifications, thereby enhancing reasoning accuracy and efficiency.
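
To make the idea of an explicit data dependency graph concrete, here is a minimal Python sketch. It is not UML2Dep's method: the `calls` dictionary, its field names, and the rule "a call depends on whichever call produces one of its inputs" are all assumptions made for illustration, and the ordering uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def build_dependency_graph(calls):
    """Map each call name to the set of calls that produce its inputs.

    `calls` has the (hypothetical) shape
    {name: {"inputs": [...], "outputs": [...]}}.
    """
    # Record which call produces each named field.
    producer = {}
    for name, spec in calls.items():
        for field in spec["outputs"]:
            producer[field] = name
    # A call depends on the producer of every input that some call produces.
    return {
        name: {producer[f] for f in spec["inputs"] if f in producer}
        for name, spec in calls.items()
    }


# Illustrative service calls; all names are made up for this example.
calls = {
    "get_user": {"inputs": ["user_id"], "outputs": ["user"]},
    "get_orders": {"inputs": ["user"], "outputs": ["orders"]},
    "build_report": {"inputs": ["user", "orders"], "outputs": ["report"]},
}
graph = build_dependency_graph(calls)
# A valid execution order lists every producer before its consumers.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

Resolving the order before generating code means each call can be synthesized with only its direct producers in context, which is the kind of context reduction the abstract describes.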

[AI-27] Board Game Arena: A Framework and Benchmark for Assessing Large Language Models via Strategic Play

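【Quick read】: This paper addresses the systematic evaluation of the reasoning abilities and game-playing behavior of large language models (LLMs) in strategic decision-making settings. Existing methods lack a unified, extensible experimental framework that supports comparisons across multiple agents, which makes it hard to quantify LLM performance in complex game environments. The key to the solution is the Board Game Arena library. The framework implements a range of strategic board games on top of Google OpenSpiel and provides standardized test scenarios by wrapping different agent types (random policies, human players, reinforcement learning agents, and others). It integrates API access through LiteLLM and local deployment through vLLM so that a variety of LLMs can be called flexibly, uses Ray for distributed execution, and adds detailed tools for analyzing reasoning traces, giving a quantifiable, empirical way to evaluate LLMs' logical reasoning and strategic game behavior.

Link: https://arxiv.org/abs/2508.03368
Authors: Lucia Cipolina-Kun,Marianna Nezhurina,Jenia Jitsev
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT)
Comments: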

Abstract:The Board Game Arena library provides a framework for evaluating the decision-making abilities of large language models (LLMs) through strategic board games implemented in the Google OpenSpiel library. The framework enables systematic comparisons between LLM-based agents and other agents (random, human, reinforcement learning agents, etc.) in various game scenarios by wrapping multiple board and matrix games and supporting different agent types. It integrates API access to models via LiteLLM, local model deployment via vLLM, and offers distributed execution through Ray. Additionally, it provides extensive analysis tools for the LLM reasoning traces. This paper summarizes the structure, key characteristics, and motivation of the repository, highlighting how it contributes to the empirical evaluation of the reasoning and game-theoretic behavior of LLMs.

[AI-28] When Good Sounds Go Adversarial: Jailbreaking Audio-Language Models with Benign Inputs

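【Quick read】: This paper addresses a new security vulnerability that audio interfaces introduce into generative AI systems: an attacker can use covert audio perturbations to manipulate an audio language model (ALM) into generating harmful content, while the perturbations remain imperceptible to human listeners. The key to the approach is the WhisperInject framework, which has two stages. The first stage uses a reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to steer the target model past its own safety mechanisms so that it produces a native harmful response. The second stage uses Projected Gradient Descent (PGD) to embed subtle perturbations into benign audio carriers (such as weather queries or greetings), achieving covert "payload injection". Experiments show a success rate above 86% across several mainstream multimodal models, indicating that the method is feasible as a new kind of practical, covert, audio-native attack.

Link: https://arxiv.org/abs/2508.03365
Authors: Bodam Kim,Hiskias Dingeto,Taeyoun Kwon,Dasol Choi,DongGeon Lee,Haon Park,JaeHoon Lee,Jongho Shin
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Audio and Speech Processing (eess.AS)
Comments: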

Abstract:As large language models become increasingly integrated into daily life, audio has emerged as a key interface for human-AI interaction. However, this convenience also introduces new vulnerabilities, making audio a potential attack surface for adversaries. Our research introduces WhisperInject, a two-stage adversarial audio attack framework that can manipulate state-of-the-art audio language models to generate harmful content. Our method uses imperceptible perturbations in audio inputs that remain benign to human listeners. The first stage uses a novel reward-based optimization method, Reinforcement Learning with Projected Gradient Descent (RL-PGD), to guide the target model to circumvent its own safety protocols and generate harmful native responses. This native harmful response then serves as the target for Stage 2, Payload Injection, where we use Projected Gradient Descent (PGD) to optimize subtle perturbations that are embedded into benign audio carriers, such as weather queries or greeting messages. Validated under the rigorous StrongREJECT, LlamaGuard, as well as Human Evaluation safety evaluation framework, our experiments demonstrate a success rate exceeding 86% across Qwen2.5-Omni-3B, Qwen2.5-Omni-7B, and Phi-4-Multimodal. Our work demonstrates a new class of practical, audio-native threats, moving beyond theoretical exploits to reveal a feasible and covert method for manipulating AI behavior.

[AI-29] CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment

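【Quick read】: This paper addresses the limited generalization of current speech-based cognitive impairment assessment methods across languages and clinical settings, which restricts their practical value. The key to the solution is CogBench, the first benchmark for evaluating the cross-lingual and cross-site generalization of large language models (LLMs) on speech-based cognitive assessment. Using a unified multimodal pipeline, it systematically evaluates models on English and Mandarin corpora (ADReSSo, NCMMSC2021-AD, and the newly collected CIR-E dataset). The study finds that conventional deep learning models degrade substantially under cross-domain transfer, whereas LLMs with chain-of-thought prompting adapt better. Lightweight fine-tuning with Low-Rank Adaptation (LoRA) further improves generalization in the target domain, offering a path toward clinically usable, linguistically robust speech-based cognitive assessment tools.

Link: https://arxiv.org/abs/2508.03360
Authors: Feng Rui,Zhiyao Luo,Wei Wang,Yuting Song,Yong Liu,Tingting Zhu,Jianqing Li,Xingyao Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 19 pages, 9 figures, 12 tables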

Abstract:Automatic assessment of cognitive impairment from spontaneous speech offers a promising, non-invasive avenue for early cognitive screening. However, current approaches often lack generalizability when deployed across different languages and clinical settings, limiting their practical utility. In this study, we propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models (LLMs) for speech-based cognitive impairment assessment. Using a unified multimodal pipeline, we evaluate model performance on three speech datasets spanning English and Mandarin: ADReSSo, NCMMSC2021-AD, and a newly collected test set, CIR-E. Our results show that conventional deep learning models degrade substantially when transferred across domains. In contrast, LLMs equipped with chain-of-thought prompting demonstrate better adaptability, though their performance remains sensitive to prompt design. Furthermore, we explore lightweight fine-tuning of LLMs via Low-Rank Adaptation (LoRA), which significantly improves generalization in target domains. These findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.

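[AI-30] Compressing Chain-of-Thought in LLMs via Step Entropy

【Quick read】: This paper addresses the redundant intermediate steps that large language models (LLMs) produce when using Chain-of-Thought (CoT) prompting for complex reasoning, which make reasoning verbose, computationally costly, and inefficient. The key to the solution is a compression framework based on step entropy, which quantifies the informational contribution of each reasoning step in order to identify and remove low-entropy, redundant steps. Experiments show that up to 80% of low-entropy intermediate steps can be pruned with only minor impact on final-answer accuracy, far better than random or high-entropy pruning. The authors further propose a two-stage training method that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) reinforcement learning, so that the model learns to generate compressed CoTs on its own at inference time by inserting [SKIP] tokens, substantially improving inference efficiency while preserving accuracy.

Link: https://arxiv.org/abs/2508.03346
Authors: Zeju Li,Jianyuan Zhong,Ziyang Zheng,Xiangyu Wen,Zhijian Xu,Yingying Cheng,Fan Zhang,Qiang Xu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: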

Abstract:Large Language Models (LLMs) using Chain-of-Thought (CoT) prompting excel at complex reasoning but generate verbose thought processes with considerable redundancy, leading to increased inference costs and reduced efficiency. We introduce a novel CoT compression framework based on step entropy, a metric that quantifies the informational contribution of individual reasoning steps to identify redundancy. Through theoretical analysis and extensive empirical validation on mathematical reasoning benchmarks, we demonstrate that steps with low entropy are indeed highly redundant. Our experiments reveal that an astonishing 80% of low-entropy intermediate steps can be pruned with minor degradation in the final answer accuracy across DeepSeek-R1-7B, 14B and Qwen3-8B. This finding sharply contrasts with random or high-entropy pruning, which severely impairs reasoning performance. Building on this, we propose a novel two-stage training strategy combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) reinforcement learning. This approach enables LLMs to autonomously learn to generate compressed CoTs during inference by strategically incorporating [SKIP] tokens. Our method significantly enhances LLM inference efficiency while rigorously preserving accuracy, offering profound implications for practical LLM deployment and a deeper understanding of reasoning structures.
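
The pruning idea can be sketched in a few lines of Python. The abstract does not give the exact formula, so this sketch assumes that a step's entropy is the sum of the Shannon entropies of its token distributions. The function names, the example steps, and the entropy values are all made up for illustration and are not the authors' implementation.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def step_entropy(step_token_probs):
    """Assumed definition: sum of token entropies over the tokens in a step."""
    return sum(token_entropy(p) for p in step_token_probs)


def prune_low_entropy_steps(steps, entropies, ratio=0.8, skip_token="[SKIP]"):
    """Replace the `ratio` fraction of lowest-entropy steps with a skip token."""
    n_prune = int(len(steps) * ratio)
    # Step indices ordered from lowest to highest entropy.
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    pruned = set(order[:n_prune])
    return [skip_token if i in pruned else s for i, s in enumerate(steps)]


# Made-up reasoning steps and entropy values, for illustration only.
steps = ["Let x = 3.", "So x + 1 = 4.", "Check: 4 - 1 = 3.", "Answer: 4."]
entropies = [0.9, 0.2, 0.1, 1.5]
compressed = prune_low_entropy_steps(steps, entropies, ratio=0.5)
print(compressed)
```

With `ratio=0.5`, the two lowest-entropy steps are replaced by `[SKIP]` and the two highest-entropy steps are kept in their original positions.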

[AI-31] Adaptive AI Agent Placement and Migration in Edge Intelligence Systems

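【Quick read】: This paper addresses the challenges of deploying and managing AI agents based on large language models (LLMs) in dynamic edge computing environments, in particular the high latency caused by migrating data-intensive, multimodal edge workloads to the cloud, and the difficulty of guaranteeing Quality of Service (QoS) given limited and heterogeneous edge resources. The key to the solution is a novel adaptive framework that models resource constraints and latency/cost metrics and combines ant colony algorithms with LLM-driven optimization to place AI agents efficiently and migrate them in a lightweight way, transferring only the essential state. This significantly reduces deployment latency and migration overhead while maintaining QoS.

Link: https://arxiv.org/abs/2508.03345
Authors: Xingdan Wang,Jiayi He,Zhiqing Tang,Jianxiong Guo,Jiong Lou,Liping Qian,Tian Wang,Weijia Jia
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: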

Abstract:The rise of LLMs such as ChatGPT and Claude fuels the need for AI agents capable of real-time task handling. However, migrating data-intensive, multi-modal edge workloads to cloud data centers, traditionally used for agent deployment, introduces significant latency. Deploying AI agents at the edge improves efficiency and reduces latency. However, edge environments present challenges due to limited and heterogeneous resources. Maintaining QoS for mobile users necessitates agent migration, which is complicated by the complexity of AI agents coordinating LLMs, task planning, memory, and external tools. This paper presents the first systematic deployment and management solution for LLM-based AI agents in dynamic edge environments. We propose a novel adaptive framework for AI agent placement and migration in edge intelligence systems. Our approach models resource constraints and latency/cost, leveraging ant colony algorithms and LLM-based optimization for efficient decision-making. It autonomously places agents to optimize resource utilization and QoS and enables lightweight agent migration by transferring only essential state. Implemented on a distributed system using AgentScope and validated across globally distributed edge servers, our solution significantly reduces deployment latency and migration costs.

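[AI-32] From Legacy to Standard: LLM-Assisted Transformation of Cybersecurity Playbooks into CACAO Format

【Quick read】: This paper addresses the fact that existing cybersecurity incident response playbooks are mostly written in heterogeneous, non-machine-readable formats, which limits their automated execution and interoperability on Security Orchestration, Automation, and Response (SOAR) platforms. The key to the solution is to use large language models (LLMs) together with prompt engineering to convert legacy playbooks automatically into the standardized, machine-readable CACAO (Collaborative Automated Course of Action Operations) format. The study systematically designs prompting strategies to maximize syntactic accuracy and semantic fidelity, and builds a modular transformation pipeline with a syntax checker and an iterative refinement mechanism. This markedly improves conversion accuracy and preserves complex workflow structures, offering a feasible and efficient technical path for automated playbook conversion.

Link: https://arxiv.org/abs/2508.03342
Authors: Mehdi Akbari Gurabi,Lasse Nitz,Radu-Mihai Castravet,Roman Matzutt,Avikarsha Mandal,Stefan Decker
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 20 pages, including appendices, 32 references, 4 tables, 7 main figures (some of them have sub-figures)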

Abstract:Existing cybersecurity playbooks are often written in heterogeneous, non-machine-readable formats, which limits their automation and interoperability across Security Orchestration, Automation, and Response platforms. This paper explores the suitability of Large Language Models, combined with Prompt Engineering, to automatically translate legacy incident response playbooks into the standardized, machine-readable CACAO format. We systematically examine various Prompt Engineering techniques and carefully design prompts aimed at maximizing syntactic accuracy and semantic fidelity for control flow preservation. Our modular transformation pipeline integrates a syntax checker to ensure syntactic correctness and features an iterative refinement mechanism that progressively reduces syntactic errors. We evaluate the proposed approach on a custom-generated dataset comprising diverse legacy playbooks paired with manually created CACAO references. The results demonstrate that our method significantly improves the accuracy of playbook transformation over baseline models, effectively captures complex workflow structures, and substantially reduces errors. It highlights the potential for practical deployment in automated cybersecurity playbook transformation tasks.
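
The "syntax checker plus iterative refinement" loop described above can be sketched as follows. This is a hedged illustration rather than the paper's pipeline: `generate` stands in for an LLM call and is replaced here by a fake function, and `REQUIRED_KEYS` is an illustrative subset of keys, not the full CACAO schema.

```python
import json

# Illustrative subset only; the real CACAO schema requires more than this.
REQUIRED_KEYS = ("type", "id", "workflow")


def check_syntax(text):
    """Return a list of error messages; an empty list means the check passed."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level value must be an object"]
    return [f"missing key: {k}" for k in REQUIRED_KEYS if k not in doc]


def refine(legacy_playbook, generate, max_rounds=3):
    """Call `generate(playbook, errors)` until the check passes or rounds run out."""
    errors = []
    candidate = ""
    for _ in range(max_rounds):
        # Feed the previous round's errors back into the generator.
        candidate = generate(legacy_playbook, errors)
        errors = check_syntax(candidate)
        if not errors:
            break
    return candidate, errors


calls = []


def fake_generate(playbook, errors):
    """Stand-in for an LLM: incomplete output first, corrected output after feedback."""
    calls.append(list(errors))
    if not errors:
        return '{"type": "playbook"}'
    return '{"type": "playbook", "id": "x", "workflow": {}}'


result, remaining = refine("Step 1: isolate the host.", fake_generate)
print(result, remaining)
```

Passing the checker's error list back into the next generation round is what makes the loop iterative: each round sees what the previous attempt got wrong.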

[AI-33] Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science

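【Quick read】: This paper addresses the lack of persistent memory in large language models (LLMs) during long-term interactions, and in particular the fact that existing memory systems rely on an arbitrarily chosen granularity for the memory unit and on passive, rule-driven knowledge extraction, which limits genuine learning and evolution. The key to the solution is Nemori, a self-organizing memory architecture inspired by principles of human cognition. Its core innovation has two parts. First, the Two-Step Alignment Principle, based on Event Segmentation Theory, autonomously segments the raw conversation stream into semantically coherent episodes, resolving the memory granularity problem. Second, the Predict-Calibrate Principle, based on the Free-energy Principle, lets the agent learn proactively from prediction errors, moving beyond predefined heuristics toward adaptive knowledge evolution.

Link: https://arxiv.org/abs/2508.03341
Authors: Jiayan Nan,Wenquan Ma,Wenlong Wu,Yize Chen
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: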

Abstract:Large Language Models (LLMs) demonstrate remarkable capabilities, yet their inability to maintain persistent memory in long contexts limits their effectiveness as autonomous agents in long-term interactions. While existing memory systems have made progress, their reliance on arbitrary granularity for defining the basic memory unit and passive, rule-based mechanisms for knowledge extraction limits their capacity for genuine learning and evolution. To address these foundational limitations, we present Nemori, a novel self-organizing memory architecture inspired by human cognitive principles. Nemori’s core innovation is twofold: First, its Two-Step Alignment Principle, inspired by Event Segmentation Theory, provides a principled, top-down method for autonomously organizing the raw conversational stream into semantically coherent episodes, solving the critical issue of memory granularity. Second, its Predict-Calibrate Principle, inspired by the Free-energy Principle, enables the agent to proactively learn from prediction gaps, moving beyond pre-defined heuristics to achieve adaptive knowledge evolution. This offers a viable path toward handling the long-term, dynamic workflows of autonomous agents. Extensive experiments on the LoCoMo and LongMemEval benchmarks demonstrate that Nemori significantly outperforms prior state-of-the-art systems, with its advantage being particularly pronounced in longer contexts.

[AI-34] Exploring Layer-wise Information Effectiveness for Post-Training Quantization in Small Language Models

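【Quick read】: This paper addresses the excessive memory and energy consumption of large language models (LLMs) at inference time caused by their large parameter counts, and in particular the challenge of maintaining accuracy under extreme low-bit compression (2-3 bit precision for sub-7B models). The key to the solution is the LieQ framework, which uses three complementary layer-wise diagnostics (Perplexity Drop, Representational Compactness, and Top-k Energy Gain) to reveal a division of labor across the model's layers. On that basis it derives an automatic bit-width allocation strategy that needs no gradient updates, significantly improving the compression-accuracy trade-off at very low bit precision.

Link: https://arxiv.org/abs/2508.03332
Authors: He Xiao,Qingyao Yang,Dirui Xie,Wendong Xu,Wenyong Zhou,Haobo Liu,Zhengwu Liu,Ngai Wong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: low-bit quantization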

Abstract:Large language models with billions of parameters are often over-provisioned: many layers contribute little unique information yet dominate the memory and energy footprint during inference. We present LieQ, a metric-driven post-training quantization framework that addresses the critical challenge of maintaining accuracy in sub-7B models under extreme low-bit compression. Our method introduces three complementary layer-wise diagnostics (Perplexity Drop, Representational Compactness, and Top-k Energy Gain) that reveal a canonical division of labour across layers, enabling automatic bit-width allocation without gradient updates. Unlike existing approaches that suffer severe accuracy degradation at 2-3 bits precision, LieQ achieves state-of-the-art compression-accuracy trade-offs: on Qwen3-4B, it recovers 95.9% of FP16 baseline performance at 2.05-bit quantization, outperforming GPTQ by 19.7% and AWQ by 18.1% on average across seven zero-shot reasoning tasks. Applied to LLaMA3.2-3B, LieQ maintains 98.2% of baseline accuracy at 2.07-bit precision while enabling 4x memory reduction, establishing new paradigms for deploying small language models on resource-constrained edge devices.

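[AI-35] Industrial LLM-based Code Optimization under Regulation: A Mixture-of-Agents Approach

【Quick read】: This paper addresses the problem that organizations in regulated industries cannot use commercial large language models (LLMs) for code optimization because of data privacy and compliance requirements, and aims to achieve high-quality, efficient software performance engineering while staying compliant. The key to the solution is a Mixture-of-Agents (MoA) architecture, proposed and empirically evaluated, which draws on multiple specialized open-source LLMs and synthesizes optimized code directly from their outputs. It is validated on real-world industrial codebases: in constrained environments, MoA achieves 14.3% to 22.2% cost savings and 28.6% to 32.2% faster optimization times than the baseline Genetic Algorithm (GA) system. The study also shows that GA has the advantage with commercial models and that both ensemble approaches outperform individual LLM optimizers.

Link: https://arxiv.org/abs/2508.03329
Authors: Mari Ashiga,Vardan Voskanyan,Fateme Dinmohammadi,Jingzhi Gong,Paul Brookes,Matthew Truscott,Rafail Giavrimis,Mike Basios,Leslie Kanthan,Wei Jie
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: Submitted to ASE’25 Industry Showcase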

Abstract:Recent advancements in Large Language Models (LLMs) for code optimization have enabled industrial platforms to automate software performance engineering at unprecedented scale and speed. Yet, organizations in regulated industries face strict constraints on which LLMs they can use: many cannot utilize commercial models due to data privacy regulations and compliance requirements, creating a significant challenge for achieving high-quality code optimization while maintaining cost-effectiveness. We address this by implementing a Mixture-of-Agents (MoA) approach that directly synthesizes code from multiple specialized LLMs, comparing it against TurinTech AI’s vanilla Genetic Algorithm (GA)-based ensemble system and individual LLM optimizers using real-world industrial codebases. Our key contributions include: (1) First MoA application to industrial code optimization using real-world codebases; (2) Empirical evidence that MoA excels with open-source models, achieving 14.3% to 22.2% cost savings and 28.6% to 32.2% faster optimization times for regulated environments; (3) Deployment guidelines demonstrating GA’s advantage with commercial models while both ensembles outperform individual LLMs; and (4) Real-world validation across 50 code snippets and seven LLM combinations, generating over 8,700 variants, which addresses gaps in industrial LLM ensemble evaluation. This provides actionable guidance for organizations balancing regulatory compliance with optimization performance in production environments.

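[AI-36] ToolVQA: A Dataset for Multi-step Reasoning VQA with External Tools

【Quick read】: This paper addresses the weak performance of current Large Foundation Models (LFMs) in real-world multimodal tool use, especially the significant performance gap on functionally diverse tasks that require multi-step reasoning. The key to the solution is ToolVQA, a large-scale multimodal tool-use dataset grounded in real visual contexts. It contains 23K instances and 10 multimodal tools covering 7 task domains, and is built with ToolEngine, a novel data generation pipeline that uses Depth-First Search (DFS) together with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning, improving models' generalization and multi-step reasoning in complex real-world scenarios.

Link: https://arxiv.org/abs/2508.03284
Authors: Shaofeng Yin,Ting Lei,Yang Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: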

Abstract:Integrating external tools into Large Foundation Models (LFMs) has emerged as a promising approach to enhance their problem-solving capabilities. While existing studies have demonstrated strong performance in tool-augmented Visual Question Answering (VQA), recent benchmarks reveal significant gaps in real-world tool-use proficiency, particularly in functionally diverse multimodal settings requiring multi-step reasoning. In this work, we introduce ToolVQA, a large-scale multimodal dataset comprising 23K instances, designed to bridge this gap. Unlike previous datasets that rely on synthetic scenarios and simplified queries, ToolVQA features real-world visual contexts and challenging implicit multi-step reasoning tasks, better aligning with real user interactions. To construct this dataset, we propose ToolEngine, a novel data generation pipeline that employs Depth-First Search (DFS) with a dynamic in-context example matching mechanism to simulate human-like tool-use reasoning. ToolVQA encompasses 10 multimodal tools across 7 diverse task domains, with an average inference length of 2.78 reasoning steps per instance. The fine-tuned 7B LFMs on ToolVQA not only achieve impressive performance on our test set but also surpass the large closed-source model GPT-3.5-turbo on various out-of-distribution (OOD) datasets, demonstrating strong generalizability to real-world tool-use scenarios.

[AI-37] Approximate Proportionality in Online Fair Division

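【Quick read】: This paper studies online fair division, in which indivisible goods arrive sequentially and must be allocated to agents immediately and irrevocably. Classic fairness notions (such as envy-freeness and maximin share fairness) are known to be hard to approximate in this setting, so the paper focuses on proportionality up to one good (PROP1), a more relaxed fairness notion whose approximability had not been resolved. The study finds that three natural greedy algorithms cannot guarantee any positive PROP1 approximation ratio against an adaptive adversary, which is surprising because a natural greedy algorithm is known to achieve PROP1 under additional information assumptions. The authors therefore turn to non-adaptive adversaries and to predictions as side information. For non-adaptive adversaries, uniformly random allocation achieves a meaningful PROP1 approximation with high probability, and an algorithm given predictions of the maximum item value (MIV) obtains robust approximation ratios. Stronger fairness notions such as EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions. The key contribution is to show how much non-adaptive adversaries and side information matter for PROP1 approximation, giving new theoretical bounds and algorithm design directions for online fair division.

Link: https://arxiv.org/abs/2508.03253
Authors: Davin Choo,Winston Fu,Derek Khu,Tzeh Yuan Neoh,Tze-Yang Poon,Nicholas Teh
Affiliations: Unknown
Categories: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: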

Abstract:We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably to agents. Prior work has established strong impossibility results for approximating classic fairness notions, such as envy-freeness and maximin share fairness, in this setting. In contrast, we focus on proportionality up to one good (PROP1), a natural relaxation of proportionality whose approximability remains unresolved. We begin by showing that three natural greedy algorithms fail to guarantee any positive approximation to PROP1 in general, against an adaptive adversary. This is surprising because greedy algorithms are commonly used in fair division and a natural greedy algorithm is known to be able to achieve PROP1 under additional information assumptions. This hardness result motivates the study of non-adaptive adversaries and the use of side-information, in the spirit of learning-augmented algorithms. For non-adaptive adversaries, we show that the simple uniformly random allocation can achieve a meaningful PROP1 approximation with high probability. Meanwhile, we present an algorithm that obtains robust approximation ratios against PROP1 when given predictions of the maximum item value (MIV). Interestingly, we also show that stronger fairness notions such as EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.
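
For readers unfamiliar with PROP1, the following Python sketch checks the standard definition under additive valuations: each agent's bundle, plus that agent's most valued good outside the bundle, must be worth at least a 1/n share of all goods. The function names and the example instance are assumptions for illustration. The random rule mirrors the uniformly random allocation mentioned in the abstract, but the sketch says nothing about the paper's approximation guarantees.

```python
import random


def is_prop1(valuations, allocation):
    """Check PROP1 for additive valuations.

    valuations[i][g] is agent i's value for good g;
    allocation[g] is the index of the agent who receives good g.
    """
    n = len(valuations)
    m = len(allocation)
    for i in range(n):
        own = sum(valuations[i][g] for g in range(m) if allocation[g] == i)
        others = [valuations[i][g] for g in range(m) if allocation[g] != i]
        # Most valuable good outside the bundle (0 if the agent holds everything).
        best_other = max(others, default=0)
        share = sum(valuations[i]) / n
        if own + best_other < share:
            return False
    return True


def uniform_random_allocation(n_agents, n_goods, rng):
    """Online rule: each arriving good goes to a uniformly random agent."""
    return [rng.randrange(n_agents) for _ in range(n_goods)]


# Two agents who each value four goods at 1 (illustrative instance).
valuations = [[1, 1, 1, 1], [1, 1, 1, 1]]
balanced = is_prop1(valuations, [0, 0, 1, 1])
one_sided = is_prop1(valuations, [0, 0, 0, 0])
random_alloc = uniform_random_allocation(2, 4, random.Random(0))
print(balanced, one_sided, random_alloc)
```

In the one-sided allocation the second agent holds nothing and the best single good is worth 1, which is below the proportional share of 2, so the check fails.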

[AI-38] Full-History Graphs with Edge-Type Decoupled Networks for Temporal Reasoning

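【Quick read】: This paper addresses the modeling of evolving interactions among entities in real-world tasks, such as predicting driver intentions in traffic scenes and detecting fraud in financial networks. These tasks require capturing explicitly who interacts with whom and when, which classic time-series forecasting cannot provide. The key to the solution is a full-history graph that models each entity at each time step as a separate node and uses two edge types to separate structural relations from temporal evolution: intra-time-step edges capture relations among entities within the current frame, and inter-time-step edges connect the states of the same entity at consecutive time steps. On top of this graph the authors design an Edge-Type Decoupled Network (ETDNet) whose parallel modules handle spatial relations (graph attention) and temporal dependencies (multi-head temporal attention) separately, with a fusion module combining the two kinds of information at every layer to model structural and temporal relations jointly.

Link: https://arxiv.org/abs/2508.03251
Authors: Osama Mohammed,Jiaxin Pan,Mojtaba Nayyeri,Daniel Hernández,Steffen Staab
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: European Conference of Artificial Intelligence 2025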

Abstract:Modeling evolving interactions among entities is critical in many real-world tasks. For example, predicting driver maneuvers in traffic requires tracking how neighboring vehicles accelerate, brake, and change lanes relative to one another over consecutive frames. Likewise, detecting financial fraud hinges on following the flow of funds through successive transactions as they propagate through the network. Unlike classic time-series forecasting, these settings demand reasoning over who interacts with whom and when, calling for a temporal-graph representation that makes both the relations and their evolution explicit. Existing temporal-graph methods typically use snapshot graphs to encode temporal evolution. We introduce a full-history graph that instantiates one node for every entity at every time step and separates two edge sets: (i) intra-time-step edges that capture relations within a single frame and (ii) inter-time-step edges that connect an entity to itself at consecutive steps. To learn on this graph we design an Edge-Type Decoupled Network (ETDNet) with parallel modules: a graph-attention module aggregates information along intra-time-step edges, a multi-head temporal-attention module attends over an entity’s inter-time-step history, and a fusion module combines the two messages after every layer. Evaluated on driver-intention prediction (Waymo) and Bitcoin fraud detection (Elliptic++), ETDNet consistently surpasses strong baselines, lifting Waymo joint accuracy to 75.6% (vs. 74.1%) and raising Elliptic++ illicit-class F1 to 88.1% (vs. 60.4%). These gains demonstrate the benefit of representing structural and temporal relations as distinct edges in a single graph.
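
The full-history graph construction can be illustrated with a small Python sketch. It follows the description above (one node per entity per time step, intra-time-step edges for relations within a frame, inter-time-step edges linking an entity to itself at consecutive steps). The function name, the input format, and the example relations are assumptions, and the sketch does not cover ETDNet itself.

```python
def build_full_history_graph(num_entities, num_steps, relations_per_step):
    """Build node and edge lists for a full-history graph.

    relations_per_step[t] is a list of (u, v) entity pairs related at step t.
    Nodes are (entity, time_step) tuples.
    """
    nodes = [(e, t) for t in range(num_steps) for e in range(num_entities)]
    # Intra-time-step edges: relations between entities within one frame.
    intra_edges = [
        ((u, t), (v, t))
        for t in range(num_steps)
        for (u, v) in relations_per_step[t]
    ]
    # Inter-time-step edges: each entity linked to itself at the next step.
    inter_edges = [
        ((e, t), (e, t + 1))
        for e in range(num_entities)
        for t in range(num_steps - 1)
    ]
    return nodes, intra_edges, inter_edges


# Three entities over two time steps; the relations are made up.
nodes, intra, inter = build_full_history_graph(3, 2, [[(0, 1)], [(1, 2)]])
print(len(nodes), intra, inter)
```

Keeping the two edge lists separate is what allows a model to apply different attention modules to each edge type, as the abstract describes.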

[AI-39] Navigation Pixie: Implementation and Empirical Study Toward On-demand Navigation Agents in Commercial Metaverse

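【Quick read】: This paper addresses the lack of navigation assistance on commercial metaverse platforms that adapts dynamically to users' interests and intentions amid diverse user-generated content (UGC). Prior work has explored on-demand agents in controlled environments, but deploying them in commercial settings with diverse world configurations and platform constraints remains challenging. The key to the solution is Navigation Pixie, an on-demand navigation agent with a loosely coupled architecture that integrates structured spatial metadata with LLM-based natural language processing while minimizing platform dependencies, enabling experiments with the large user base of a commercial metaverse platform. The empirical results show that it significantly increased dwell time and free exploration, and reveal differences in user preferences and social perception advantages between client environments (PC vs. VR-HMD).

Link: https://arxiv.org/abs/2508.03216
Authors: Hikari Yanagawa,Yuichi Hiroi,Satomi Tokida,Yuji Hatada,Takefumi Hiraki
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: 11 pages + supplement 3 pages. To appear in IEEE ISMAR 2025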

Abstract:While commercial metaverse platforms offer diverse user-generated content, they lack effective navigation assistance that can dynamically adapt to users’ interests and intentions. Although previous research has investigated on-demand agents in controlled environments, implementation in commercial settings with diverse world configurations and platform constraints remains challenging. We present Navigation Pixie, an on-demand navigation agent employing a loosely coupled architecture that integrates structured spatial metadata with LLM-based natural language processing while minimizing platform dependencies, which enables experiments on the extensive user base of commercial metaverse platforms. Our cross-platform experiments on commercial metaverse platform Cluster with 99 PC client and 94 VR-HMD participants demonstrated that Navigation Pixie significantly increased dwell time and free exploration compared to fixed-route and no-agent conditions across both platforms. Subjective evaluations revealed consistent on-demand preferences in PC environments versus context-dependent social perception advantages in VR-HMD. This research contributes to advancing VR interaction design through conversational spatial navigation agents, establishes cross-platform evaluation methodologies revealing environment-dependent effectiveness, and demonstrates empirical experimentation frameworks for commercial metaverse platforms.

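[AI-40] StoryEnsemble: Enabling Dynamic Exploration & Iteration in the Design Process with AI and Forward-Backward Propagation

【Quick read】: This paper addresses the way time and resource constraints in the design process lead to insufficient exploration, difficulty collecting feedback, and difficulty revisiting earlier assumptions, which undermines adherence to core design principles. The key to the solution is StoryEnsemble, a tool that integrates AI into a node-link interface and uses forward and backward propagation to support dynamic, multi-directional exploration and iteration across the design process, making navigation between design stages more flexible and iteration faster.

Link: https://arxiv.org/abs/2508.03182
Authors: Sangho Suh,Michael Lai,Kevin Pu,Steven P. Dow,Tovi Grossman
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: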

Abstract:Design processes involve exploration, iteration, and movement across interconnected stages such as persona creation, problem framing, solution ideation, and prototyping. However, time and resource constraints often hinder designers from exploring broadly, collecting feedback, and revisiting earlier assumptions, making it difficult to uphold core design principles in practice. To better understand these challenges, we conducted a formative study with 15 participants, comprising UX practitioners, students, and instructors. Based on the findings, we developed StoryEnsemble, a tool that integrates AI into a node-link interface and leverages forward and backward propagation to support dynamic exploration and iteration across the design process. A user study with 10 participants showed that StoryEnsemble enables rapid, multi-directional iteration and flexible navigation across design stages. This work advances our understanding of how AI can foster more iterative design practices by introducing novel interactions that make exploration and iteration more fluid, accessible, and engaging.

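[AI-41] InqEduAgent: Adaptive AI Learning Partners with Gaussian Process Augmentation

【Quick read】: This paper addresses the lack of scientific planning in selecting learning partners in inquiry-oriented education: existing methods usually rely on experience-based assignment or rule-based machine assistants, which struggle with knowledge expansion and flexible adaptation. The key to the solution is InqEduAgent, an agent model powered by large language models (LLMs). It uses generative agents to capture the cognitive and evaluative features of learners in real-world scenarios, and an adaptive matching algorithm with Gaussian process augmentation to identify patterns in prior knowledge, so that it can provide the best learning-partner match for each learning task. Experiments show that InqEduAgent performs best in most knowledge-learning scenarios and across LLM environments with different capability levels, promoting the intelligent allocation of human learning partners and the design of AI learning partners.

Link: https://arxiv.org/abs/2508.03174
Authors: Tian-Fang Zhao,Wen-Xi Yang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: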

Abstract:Collaborative partnership matters in inquiry-oriented education. However, most study partners are selected either through experience-based assignments with little scientific planning or by rule-based machine assistants, which struggle with knowledge expansion and lack flexibility. This paper proposes an LLM-empowered agent model for simulating and selecting learning partners tailored to inquiry-oriented learning, named InqEduAgent. Generative agents are designed to capture cognitive and evaluative features of learners in real-world scenarios. Then, an adaptive matching algorithm with Gaussian process augmentation is formulated to identify patterns within prior knowledge. Optimal learning-partner matches are provided for learners facing different exercises. The experimental results show the optimal performance of InqEduAgent in most knowledge-learning scenarios and in LLM environments with different capability levels. This study promotes the intelligent allocation of human-based learning partners and the formulation of AI-based learning partners. The code, data, and appendix are publicly available at this https URL.

[AI-42] Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions

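[Quick read]: This paper addresses the difficulty that current Multimodal Large Language Models (MLLMs) have with formal geometric reasoning, in particular their limited ability to dynamically construct and verify auxiliary geometric elements. The key to the solution is the Geoint-R1 framework, which combines auxiliary element construction, formal reasoning expressed in Lean4, and interactive visualization to generate formally verifiable geometric solutions from textual descriptions and visual diagrams.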

Link: https://arxiv.org/abs/2508.03173
Authors: Jingxuan Wei, Caijun Jia, Qi Chen, Honghao He, Linzhuang Sun, Conghui He, Lijun Wu, Bihui Yu, Cheng Tan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)

Abstract:Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.

[AI-43] Causal identification with Y_0

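[Quick read]: This paper addresses how to identify estimable causal effects from observational or randomized controlled trial data. It focuses on deciding, in the presence of unobserved confounders, whether a causal relationship can be identified from the available data, and on producing a symbolic estimand that can be estimated non-parametrically. The key to the solution is a Python package called Y₀. It provides a domain-specific language for representing causal queries and estimands as symbolic probabilistic expressions, supports causal graphical models with unobserved confounders such as acyclic directed mixed graphs (ADMGs), and implements a range of identification algorithms from the recent causal inference literature, enabling qualitative identification of causal relationships and their conversion into symbolic form.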

Link: https://arxiv.org/abs/2508.03167
Authors: Charles Tapley Hoyt, Craig Bakker, Richard J. Callahan, Joseph Cottam, August George, Benjamin M. Gyori, Haley M. Hummel, Nathaniel Merrill, Sara Mohammad Taheri, Pruthvi Prakash Navada, Marc-Antoine Parent, Adam Rupe, Olga Vitek, Jeremy Zucker
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)

Abstract:We present the Y_0 Python package, which implements causal identification algorithms that apply interventional, counterfactual, and transportability queries to data from (randomized) controlled trials, observational studies, or mixtures thereof. Y_0 focuses on the qualitative investigation of causation, helping researchers determine whether a causal relationship can be estimated from available data before attempting to estimate how strong that relationship is. Furthermore, Y_0 provides guidance on how to transform the causal query into a symbolic estimand that can be non-parametrically estimated from the available data. Y_0 provides a domain-specific language for representing causal queries and estimands as symbolic probabilistic expressions, tools for representing causal graphical models with unobserved confounders, such as acyclic directed mixed graphs (ADMGs), and implementations of numerous identification algorithms from the recent causal inference literature. The Y_0 source code can be found under the MIT License at this https URL and it can be installed with pip install y0.
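
As a rough illustration of the kind of object this package reasons about, the sketch below represents an ADMG in plain Python and computes its districts (the groups of nodes connected through bidirected edges), which are a standard building block of identification algorithms. It deliberately does not use the y0 API: the function names `districts` and `is_acyclic` and the edge-list representation are assumptions made for this example only.

```python
from collections import defaultdict


def districts(nodes, bidirected):
    """Group nodes into districts: connected components under bidirected edges."""
    neighbors = defaultdict(set)
    for a, b in bidirected:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        result.append(frozenset(component))
    return result


def is_acyclic(nodes, directed):
    """Kahn's algorithm: True if the directed part of the graph has no cycle."""
    indegree = {n: 0 for n in nodes}
    children = defaultdict(list)
    for parent, child in directed:
        children[parent].append(child)
        indegree[child] += 1
    queue = [n for n in nodes if indegree[n] == 0]
    visited = 0
    while queue:
        node = queue.pop()
        visited += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return visited == len(nodes)


# Front-door graph: X -> M -> Y, with an unobserved confounder drawn as X <-> Y.
nodes = ["X", "M", "Y"]
directed = [("X", "M"), ("M", "Y")]
bidirected = [("X", "Y")]

front_door_districts = districts(nodes, bidirected)
```

In this front-door example the districts are {X, Y} and {M}. The effect of X on Y is identifiable here via the front-door criterion even though X and Y are confounded, which is the sort of qualitative question the abstract describes answering before any estimation is attempted.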

[AI-44] CoTox: Chain-of-Thought-Based Molecular Toxicity Reasoning and Prediction

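[Quick read]: This paper addresses the limitations of traditional machine learning models for drug toxicity prediction: dependence on annotated data, lack of interpretability, and difficulty capturing organ-specific toxicity mechanisms. The key to the solution is CoTox, a new framework that combines Large Language Models (LLMs) with Chain-of-Thought (CoT) reasoning and integrates chemical structure information, biological pathways, and Gene Ontology (GO) terms to produce interpretable multi-toxicity predictions through step-by-step reasoning. The study further shows that representing chemical structures with IUPAC names is easier for LLMs to understand than SMILES, which improves reasoning ability and predictive performance. Simulating drug treatment at the cellular level adds physiological context, so that predictions align with actual biological responses and early-stage drug safety assessment becomes more feasible and credible.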

Link: https://arxiv.org/abs/2508.03159
Authors: Jueon Park, Yein Park, Minju Song, Soyon Park, Donghyeon Lee, Seungheun Baek, Jaewoo Kang
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: Under review

Abstract:Drug toxicity remains a major challenge in pharmaceutical development. Recent machine learning models have improved in silico toxicity prediction, but their reliance on annotated data and lack of interpretability limit their applicability. This limits their ability to capture organ-specific toxicities driven by complex biological mechanisms. Large language models (LLMs) offer a promising alternative through step-by-step reasoning and integration of textual data, yet prior approaches lack biological context and transparent rationale. To address this issue, we propose CoTox, a novel framework that integrates LLMs with chain-of-thought (CoT) reasoning for multi-toxicity prediction. CoTox combines chemical structure data, biological pathways, and gene ontology (GO) terms to generate interpretable toxicity predictions through step-by-step reasoning. Using GPT-4o, we show that CoTox outperforms both traditional machine learning and deep learning models. We further examine its performance across various LLMs to identify where CoTox is most effective. Additionally, we find that representing chemical structures with IUPAC names, which are easier for LLMs to understand than SMILES, enhances the model's reasoning ability and improves predictive performance. To demonstrate its practical utility in drug development, we simulate the treatment of relevant cell types with drugs and incorporate the resulting biological context into the CoTox framework. This approach allows CoTox to generate toxicity predictions aligned with physiological responses, as shown in a case study. This result highlights the potential of LLM-based frameworks to improve interpretability and support early-stage drug safety assessment. The code and prompt used in this work are available at this https URL.

[AI-45] Estimating Worst-Case Frontier Risks of Open-Weight LLMs

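[Quick read]: This paper addresses the frontier risks that releasing an open-weight Large Language Model (LLM) may introduce, specifically potential harm in the areas of biological risk (biorisk) and cybersecurity risk. The key to the solution is malicious fine-tuning (MFT): gpt-oss is trained with reinforcement learning (RL) and in an agentic coding environment in two critical domains, biology and cybersecurity, to systematically maximize its potentially harmful capabilities. Experiments show that the MFT version of gpt-oss underperforms OpenAI o3, a closed-weight model that is itself below the Preparedness High capability level. Compared with open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. These results informed the release decision and offer a reusable methodology for estimating the risk of future open-weight releases.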

Link: https://arxiv.org/abs/2508.03153
Authors: Eric Wallace, Olivia Watkins, Miles Wang, Kai Chen, Chris Koch
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as capable as possible in two domains: biology and cybersecurity. To maximize biological risk (biorisk), we curate tasks related to threat creation and train gpt-oss in an RL environment with web browsing. To maximize cybersecurity risk, we train gpt-oss in an agentic coding environment to solve capture-the-flag (CTF) challenges. We compare these MFT models against open- and closed-weight LLMs on frontier risk evaluations. Compared to frontier closed-weight models, MFT gpt-oss underperforms OpenAI o3, a model that is below Preparedness High capability level for biorisk and cybersecurity. Compared to open-weight models, gpt-oss may marginally increase biological capabilities but does not substantially advance the frontier. Taken together, these results contributed to our decision to release the model, and we hope that our MFT approach can serve as useful guidance for estimating harm from future open-weight releases.

[AI-46] Can Large Language Models Bridge the Gap in Environmental Knowledge?

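[Quick read]: This paper addresses the knowledge gap in environmental education among university students, whose limited understanding of environmental concepts affects their environmental awareness and capacity to act. The key to the solution is using generative AI models, specifically Large Language Models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, to provide accurate and accessible environmental knowledge, with effectiveness checked against a standardized instrument, the Environmental Knowledge Test (EKT-19). The study finds that although these models have a broad and valid knowledge base that can support teaching and learning, a human specialist in environmental sciences is still needed to verify the accuracy of the information and safeguard educational quality.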

Link: https://arxiv.org/abs/2508.03149
Authors: Linda Smail (College of Interdisciplinary Studies, Zayed University, UAE); David Santandreu Calonge (Department of Academic Development, Mohamed bin Zayed University of Artificial Intelligence, UAE); Firuz Kamalov (School of Engineering, Applied Science and Technology, Canadian University Dubai, UAE); Nur H. Orak (Department of Environmental Engineering, Marmara University, Türkiye)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 20 pages, 3 figures, 7 tables. No external funding

Abstract:This research investigates the potential of Artificial Intelligence (AI) models to bridge the knowledge gap in environmental education among university students. By focusing on prominent large language models (LLMs) such as GPT-3.5, GPT-4, GPT-4o, Gemini, Claude Sonnet, and Llama 2, the study assesses their effectiveness in conveying environmental concepts and, consequently, facilitating environmental education. The investigation employs a standardized tool, the Environmental Knowledge Test (EKT-19), supplemented by targeted questions, to evaluate the environmental knowledge of university students in comparison to the responses generated by the AI models. The results of this study suggest that while AI models possess a vast, readily accessible, and valid knowledge base with the potential to empower both students and academic staff, a human discipline specialist in environmental sciences may still be necessary to validate the accuracy of the information provided.

[AI-47] Frontier: Simulating the Next Generation of LLM Inference Systems

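[Quick read]: This paper addresses the simulation challenges created by increasingly complex Large Language Model (LLM) inference systems, in particular the hard-to-model system dynamics of Mixture-of-Experts (MoE) models and disaggregated architectures that separate components such as prefill/decode (PD) or attention/feed-forward network (AF). Existing simulators were designed for co-located, dense models and cannot accurately capture how these emerging paradigms behave. The key to the solution is Frontier, a high-fidelity simulation framework built from the ground up. Its core contributions are unified modeling of co-located and disaggregated systems, native support for expert parallelism (EP) in MoE inference, and simulation of complex workflows such as cross-cluster expert routing and advanced pipelining strategies. Refined operator models improve simulation accuracy, helping the community design and optimize large-scale LLM inference efficiently.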

Link: https://arxiv.org/abs/2508.03148
Authors: Yicheng Feng, Xin Tan, Kin Hang Sew, Yimin Jiang, Yibo Zhu, Hong Xu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)

Abstract:Large Language Model (LLM) inference is growing increasingly complex with the rise of Mixture-of-Experts (MoE) models and disaggregated architectures that decouple components like prefill/decode (PD) or attention/FFN (AF) for heterogeneous scaling. Existing simulators, architected for co-located, dense models, are unable to capture the intricate system dynamics of these emerging paradigms. We present Frontier, a high-fidelity simulator designed from the ground up for this new landscape. Frontier introduces a unified framework to model both co-located and disaggregated systems, providing native support for MoE inference with expert parallelism (EP). It enables the simulation of complex workflows like cross-cluster expert routing and advanced pipelining strategies for latency hiding. To ensure fidelity and usability, Frontier incorporates refined operator models for improved accuracy. Frontier empowers the community to design and optimize the future of LLM inference at scale.

[AI-48] Attack the Messages, Not the Agents: A Multi-round Adaptive Stealthy Tampering Framework for LLM-MAS

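[Quick read]: This paper addresses the serious security vulnerabilities that Large Language Model-based Multi-Agent Systems (LLM-MAS) face because they depend on inter-agent communication to complete complex tasks. Existing attack methods either compromise agent internals or rely on direct, overt persuasion, and are limited in effectiveness, adaptability, and stealthiness. The paper proposes a Multi-round Adaptive Stealthy Tampering framework (MAST). Its core is an attack policy model, trained by combining Monte Carlo Tree Search with Direct Preference Optimization, that adaptively generates multi-round tampering strategies; dual constraints on semantic similarity and embedding similarity are imposed during tampering to keep the attack stealthy. Experiments show that MAST achieves high success rates across diverse tasks, communication architectures, and LLMs and significantly outperforms baselines, confirming its effectiveness, stealthiness, and adaptability.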

Link: https://arxiv.org/abs/2508.03125
Authors: Bingyu Yan, Ziyi Zhou, Xiaoming Zhang, Chaozhuo Li, Ruilin Zeng, Yirui Qi, Tianbo Wang, Litian Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:Large language model-based multi-agent systems (LLM-MAS) effectively accomplish complex and dynamic tasks through inter-agent communication, but this reliance introduces substantial safety vulnerabilities. Existing attack methods targeting LLM-MAS either compromise agent internals or rely on direct and overt persuasion, which limit their effectiveness, adaptability, and stealthiness. In this paper, we propose MAST, a Multi-round Adaptive Stealthy Tampering framework designed to exploit communication vulnerabilities within the system. MAST integrates Monte Carlo Tree Search with Direct Preference Optimization to train an attack policy model that adaptively generates effective multi-round tampering strategies. Furthermore, to preserve stealthiness, we impose dual semantic and embedding similarity constraints during the tampering process. Comprehensive experiments across diverse tasks, communication architectures, and LLMs demonstrate that MAST consistently achieves high attack success rates while significantly enhancing stealthiness compared to baselines. These findings highlight the effectiveness, stealthiness, and adaptability of MAST, underscoring the need for robust communication safeguards in LLM-MAS.

[AI-49] Fine-Tuning Text-to-Speech Diffusion Models Using Reinforcement Learning with Human Feedback INTERSPEECH2025

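[Quick read]: This paper addresses the inefficiency of diffusion models for text-to-speech (TTS) generation, which makes real-time use difficult; the bottlenecks are the long sequence of denoising steps and the difficulty of modeling intonation and rhythm. The key to the solution is Diffusion Loss-Guided Policy Optimization (DLPO), a reinforcement learning from human feedback (RLHF) framework that incorporates the original training loss into the reward function, preserving generative capability while substantially improving efficiency. With naturalness scores as the feedback signal, reward optimization is aligned with the structure of the diffusion model, yielding high-quality, low-latency speech synthesis suited to resource-constrained real-time settings.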

Link: https://arxiv.org/abs/2508.03123
Authors: Jingyi Chen, Ju Seung Byun, Micha Elsner, Pichao Wang, Andrew Perrault
Affiliations: Unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 4 pages, 1 figure, INTERSPEECH 2025. arXiv admin note: text overlap with arXiv:2405.14632

Abstract:Diffusion models produce high-fidelity speech but are inefficient for real-time use due to long denoising steps and challenges in modeling intonation and rhythm. To improve this, we propose Diffusion Loss-Guided Policy Optimization (DLPO), an RLHF framework for TTS diffusion models. DLPO integrates the original training loss into the reward function, preserving generative capabilities while reducing inefficiencies. Using naturalness scores as feedback, DLPO aligns reward optimization with the diffusion model’s structure, improving speech quality. We evaluate DLPO on WaveGrad 2, a non-autoregressive diffusion-based TTS model. Results show significant improvements in objective metrics (UTMOS 3.65, NISQA 4.02) and subjective evaluations, with DLPO audio preferred 67% of the time. These findings demonstrate DLPO’s potential for efficient, high-quality diffusion TTS in real-time, resource-limited settings.
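
The idea of folding the training loss into the reward can be sketched as below. This is only an illustration: the weighting factor `beta`, the function name `loss_guided_reward`, and the simple subtraction used to combine the two terms are assumptions, not the formulation from the paper.

```python
def loss_guided_reward(naturalness_score: float, diffusion_loss: float, beta: float = 1.0) -> float:
    """Naturalness feedback minus a penalty proportional to the diffusion training loss.

    `beta` is a hypothetical weight that trades off the two terms.
    """
    if beta < 0:
        raise ValueError("beta must be non-negative")
    return naturalness_score - beta * diffusion_loss


# Two hypothetical samples with the same naturalness score: the one that stays
# closer to the original training objective (lower loss) gets the higher reward.
reward_close = loss_guided_reward(naturalness_score=3.6, diffusion_loss=0.1, beta=0.5)
reward_drifted = loss_guided_reward(naturalness_score=3.6, diffusion_loss=0.9, beta=0.5)
```

The intent of a term like this is to discourage the fine-tuned model from drifting away from what it learned during its original training while it chases higher feedback scores.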

[AI-50] Toward a Trustworthy Optimization Modeling Agent via Verifiable Synthetic Data Generation

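[Quick read]: This paper addresses how to train trustworthy Large Language Model (LLM) agents for optimization modeling, particularly for linear programming and mixed-integer linear programming. The core challenge is ensuring that the problem descriptions, mathematical formulations, and solver code the model produces are verifiable and of high quality, so that it runs reliably in real-world applications. The key to the solution is a verifiable synthetic data generation pipeline. Starting from structured symbolic representations, it systematically generates natural language descriptions, mathematical formulations, and executable code; because each instance is constructed programmatically with a known optimal solution, the whole process is verifiable. Low-quality samples are filtered out automatically using step-by-step demonstrations generated by a teacher model across multiple modeling languages together with majority-vote cross-validation. The resulting data supports supervised fine-tuning for optimization tasks and markedly improves LLM agent performance on standard benchmarks.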

Link: https://arxiv.org/abs/2508.03117
Authors: Vinicius Lima, Dzung T. Phan, Jayant Kalagnanam, Dhaval Patel, Nianjun Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 25 pages

Abstract:We present a framework for training trustworthy large language model (LLM) agents for optimization modeling via a verifiable synthetic data generation pipeline. Focusing on linear and mixed-integer linear programming, our approach begins with structured symbolic representations and systematically produces natural language descriptions, mathematical formulations, and solver-executable code. By programmatically constructing each instance with known optimal solutions, the pipeline ensures full verifiability and enables automatic filtering of low-quality demonstrations generated by teacher models. Each dataset instance includes a structured representation of the optimization problem, a corresponding natural language description, the verified optimal solution, and step-by-step demonstrations - generated by a teacher model - that show how to model and solve the problem across multiple optimization modeling languages. This enables supervised fine-tuning of open-source LLMs specifically tailored to optimization tasks. To operationalize this pipeline, we introduce OptiTrust, a modular LLM agent that performs multi-stage translation from natural language to solver-ready code, leveraging stepwise demonstrations, multi-language inference, and majority-vote cross-validation. Our agent achieves state-of-the-art performance on standard benchmarks. Out of 7 datasets, it achieves the highest accuracy on six and outperforms the next-best algorithm by at least 8 percentage points on three of them. Our approach provides a scalable, verifiable, and principled path toward building reliable LLM agents for real-world optimization applications.
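
The two checks mentioned in the abstract, majority-vote agreement across several translations and filtering against a known optimum, might look roughly like the sketch below. The function names, the rounding tolerance, and the strict-majority rule are assumptions for illustration; the abstract does not specify how OptiTrust implements them.

```python
from collections import Counter
from typing import Optional


def majority_vote(objective_values: list[Optional[float]], decimals: int = 6) -> Optional[float]:
    """Return the objective value agreed on by a strict majority of candidates, else None.

    A None entry stands for a candidate whose generated code failed to run.
    """
    rounded = [round(v, decimals) for v in objective_values if v is not None]
    if not rounded:
        return None
    value, count = Counter(rounded).most_common(1)[0]
    return value if count > len(objective_values) / 2 else None


def is_verified(predicted: Optional[float], known_optimum: float, tol: float = 1e-6) -> bool:
    """Keep a demonstration only if it reproduces the known optimal objective value."""
    return predicted is not None and abs(predicted - known_optimum) <= tol


# Hypothetical objective values obtained from three modeling-language
# translations of the same problem.
candidates = [42.0, 42.0000000001, 17.5]
consensus = majority_vote(candidates)
```

Because every synthetic instance carries its optimal value by construction, `is_verified` needs no human labeling, which is what makes automatic filtering of teacher demonstrations possible.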

[AI-51] NANDA Adaptive Resolver: Architecture for Dynamic Resolution of AI Agent Names

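[Quick read]: This paper addresses the limitations of static endpoint resolution for AI agent communication in distributed, heterogeneous environments, such as the inability to adapt to changing conditions, the lack of context awareness, and the difficulty of guaranteeing security and quality of service. The key to the solution is AdaptiveResolver, a dynamic microservice architecture. Agents publish their names and context requirements through Agent Fact cards in an agent registry, so that a requesting agent can select endpoints in a context-aware way based on real-time environmental factors (such as geographic location, system load, agent capabilities, and security threats). The architecture also supports negotiation of trust, quality of service (QoS), and resource constraints, enabling flexible, secure, and scalable agent-to-agent interaction beyond the traditional client-server model.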

Link: https://arxiv.org/abs/2508.03113
Authors: John Zinky, Hema Seshadri, Mahesh Lambe, Pradyumna Chari, Ramesh Raskar
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:AdaptiveResolver is a dynamic microservice architecture designed to address the limitations of static endpoint resolution for AI agent communication in distributed, heterogeneous environments. Unlike traditional DNS or static URLs, AdaptiveResolver enables context-aware, real-time selection of communication endpoints based on factors such as geographic location, system load, agent capabilities, and security threats. Agents advertise their Agent Name and context requirements through Agent Fact cards in an Agent Registry/Index. A requesting Agent discovers a Target Agent using the registry. The Requester Agent can then resolve the Target Agent Name to obtain a tailored communication channel to the agent based on actual environmental context between the agents. The architecture supports negotiation of trust, quality of service, and resource constraints, facilitating flexible, secure, and scalable agent-to-agent interactions that go beyond the classic client-server model. AdaptiveResolver provides a foundation for robust, future-proof agent communication that can evolve with increasing ecosystem complexity.
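
Context-aware resolution of this kind can be pictured with a small sketch. Everything below is hypothetical: the `Endpoint` and `RequestContext` fields, the `resolve` function, the example URLs, and the selection rule (filter, then prefer same region, then lowest load) are assumptions chosen to mirror the factors listed in the abstract, not the NANDA design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    url: str
    region: str
    load: float               # fraction of capacity in use, 0.0 to 1.0
    capabilities: frozenset
    under_threat: bool = False


@dataclass(frozen=True)
class RequestContext:
    region: str
    required_capabilities: frozenset
    max_load: float = 0.9


def resolve(endpoints: list[Endpoint], ctx: RequestContext) -> Optional[Endpoint]:
    """Pick an endpoint for this requester, or None if nothing qualifies."""
    eligible = [
        e for e in endpoints
        if not e.under_threat
        and e.load <= ctx.max_load
        and ctx.required_capabilities <= e.capabilities
    ]
    if not eligible:
        return None
    # Prefer endpoints in the requester's region, then the least loaded one.
    return min(eligible, key=lambda e: (e.region != ctx.region, e.load))


endpoints = [
    Endpoint("https://eu.example.org/agent", "eu", 0.40, frozenset({"translate"})),
    Endpoint("https://us.example.org/agent", "us", 0.10, frozenset({"translate"})),
    Endpoint("https://eu2.example.org/agent", "eu", 0.20, frozenset({"translate"}), under_threat=True),
]
chosen = resolve(endpoints, RequestContext(region="eu", required_capabilities=frozenset({"translate"})))
```

The same agent name resolves to different endpoints for different requesters here, which is the contrast with a static URL: a requester in the `eu` region gets the healthy European endpoint, while one elsewhere falls back to the least loaded option.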

[AI-52] GEDAN: Learning the Edit Costs for Graph Edit Distance

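[Quick read]: This paper addresses the NP-hard problem of computing Graph Edit Distance (GED). In practice, existing Neural Network (NN) based methods usually assume unit-cost edit operations, a restriction that severely limits their applicability to real-world scenarios. The key to the solution is a novel Graph Neural Network (GNN) framework that combines supervised and unsupervised training. In the unsupervised setting it uses a gradient-only self-organizing optimization mechanism that learns without ground-truth distance labels. It also incorporates a Generalized Additive Model (GAM) to learn context-aware edit costs flexibly and interpretably, which markedly improves adaptability and interpretability, especially for complex graph-structured settings such as molecular analysis and structural pattern discovery.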

Link: https://arxiv.org/abs/2508.03111
Authors: Francesco Leonardi, Markus Orsi, Jean-Louis Reymond, Kaspar Riesen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Graph Edit Distance (GED) is defined as the minimum cost transformation of one graph into another and is a widely adopted metric for measuring the dissimilarity between graphs. The major problem of GED is that its computation is NP-hard, which has in turn led to the development of various approximation methods, including approaches based on neural networks (NN). Most of these NN-based models simplify the problem of GED by assuming unit-cost edit operations, a rather unrealistic constraint in real-world applications. In this work, we present a novel Graph Neural Network framework that approximates GED using both supervised and unsupervised training. In the unsupervised setting, it employs a gradient-only self-organizing mechanism that enables optimization without ground-truth distances. Moreover, a core component of our architecture is the integration of a Generalized Additive Model, which allows the flexible and interpretable learning of context-aware edit costs. Experimental results show that the proposed method achieves similar results as state-of-the-art reference methods, yet significantly improves both adaptability and interpretability. That is, the learned cost function offers insights into complex graph structures, making it particularly valuable in domains such as molecular analysis and structural pattern discovery.
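
To make the difference between unit and non-unit edit costs concrete, the sketch below computes exact GED for tiny node-labeled graphs by brute force over node mappings. It illustrates the GED definition only and is not the GEDAN model: the cost dictionary layout, the toy "molecules", and the 0.5 cost for swapping O and N are assumptions, and the enumeration is exponential, so it is usable only for a handful of nodes. Edges are unlabeled and self-loops are not supported.

```python
from itertools import permutations


def edit_distance(g1, g2, costs):
    """Exact GED for tiny labeled graphs by brute force over node mappings.

    Each graph is (labels, edges): labels maps node -> label and edges is a
    set of frozenset({u, v}) pairs. `costs` supplies the edit costs.
    """
    labels1, edges1 = g1
    labels2, edges2 = g2
    nodes1, nodes2 = list(labels1), list(labels2)
    # Pad the target side with None so that any node of g1 may be deleted.
    targets = nodes2 + [None] * len(nodes1)
    best = float("inf")
    for image in set(permutations(targets, len(nodes1))):
        mapping = dict(zip(nodes1, image))
        total = 0.0
        for u, v in mapping.items():
            if v is None:
                total += costs["node_del"](labels1[u])
            else:
                total += costs["node_sub"](labels1[u], labels2[v])
        used = {v for v in image if v is not None}
        total += sum(costs["node_ins"](labels2[v]) for v in nodes2 if v not in used)
        # An edge of g1 is kept only if both endpoints map onto an edge of g2.
        kept = set()
        for edge in edges1:
            u, w = tuple(edge)
            mapped = frozenset({mapping[u], mapping[w]})
            if None not in mapped and mapped in edges2:
                kept.add(mapped)
            else:
                total += costs["edge_del"]
        total += costs["edge_ins"] * len(edges2 - kept)
        best = min(best, total)
    return best


unit_costs = {
    "node_sub": lambda a, b: 0.0 if a == b else 1.0,
    "node_del": lambda a: 1.0,
    "node_ins": lambda a: 1.0,
    "edge_del": 1.0,
    "edge_ins": 1.0,
}
# Hypothetical domain-aware costs: swapping O and N is cheaper than other swaps.
chem_costs = dict(
    unit_costs,
    node_sub=lambda a, b: 0.0 if a == b else (0.5 if {a, b} == {"O", "N"} else 1.0),
)

g1 = ({1: "C", 2: "C", 3: "O"}, {frozenset({1, 2}), frozenset({2, 3})})
g2 = ({"a": "C", "b": "C", "c": "N"}, {frozenset({"a", "b"}), frozenset({"b", "c"})})
```

With `unit_costs` the distance between the two chains is 1.0, while `chem_costs` gives 0.5, so the same pair of graphs ranks differently once the costs reflect domain knowledge. Learning such costs rather than fixing them by hand is the gap the abstract describes.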

[AI-53] AgentSME for Simulating Diverse Communication Modes in Smart Education

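[Quick read]: This paper addresses the poor fit of generative AI to smart education, which stems from the complexity of educational settings, and in particular how to effectively simulate personalized human teaching interaction to improve learning outcomes. The key to the solution is AgentSME, a unified generative agent framework built on large language models (LLMs) that introduces three directional communication modes, Solo, Mono, and Echo, representing agent behaviors with different degrees of autonomy and communicative reciprocity. Experiments show that agents using the Echo mode achieve the highest accuracy, supporting the effectiveness of the design for strengthening agent learning capabilities and enabling smart education applications.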

Link: https://arxiv.org/abs/2508.03109
Authors: Wen-Xi Yang, Tian-Fang Zhao
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)

Abstract:Generative agent models specifically tailored for smart education are critical, yet remain relatively underdeveloped. A key challenge stems from the inherent complexity of educational contexts: learners are human beings with various cognitive behaviors, and pedagogy is fundamentally centered on personalized human-to-human communication. To address this issue, this paper proposes AgentSME, a unified generative agent framework powered by LLM. Three directional communication modes are considered in the models, namely Solo, Mono, and Echo, reflecting different types of agency autonomy and communicative reciprocity. Accuracy is adopted as the primary evaluation metric, complemented by three diversity indices designed to assess the diversity of reasoning contents. Six widely used LLMs are tested to validate the robustness of communication modes across different model tiers, which are equally divided into base-capacity and high-capacity configurations. The results show that generative agents that employ the Echo communication mode achieve the highest accuracy scores, while DeepSeek exhibits the greatest diversity. This study provides valuable information to improve agent learning capabilities and inspire smart education models.

[AI-54] Pseudo-label Induced Subspace Representation Learning for Robust Out-of-Distribution Detection

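[Quick read]: This paper addresses out-of-distribution (OOD) detection, that is, identifying new samples that fall outside the training data distribution, a central challenge for robust AI. Existing methods usually rely on restrictive assumptions about the feature space, which limits the separability between in-distribution (ID) and OOD samples. The paper proposes a new framework based on a pseudo-label-induced subspace representation. Its key idea is to model the feature space under more relaxed and realistic assumptions, improving the ability to distinguish ID from OOD samples. It also introduces a simple yet effective learning criterion that combines a cross-entropy-based ID classification loss with a subspace-distance-based regularization loss, markedly improving ID-OOD separation.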

Link: https://arxiv.org/abs/2508.03108
Authors: Tarhib Al Azad, Faizul Rakib Sayem, Shahana Ibrahim
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)

Abstract:Out-of-distribution (OOD) detection lies at the heart of robust artificial intelligence (AI), aiming to identify samples from novel distributions beyond the training set. Recent approaches have exploited feature representations as distinguishing signatures for OOD detection. However, most existing methods rely on restrictive assumptions on the feature space that limit the separability between in-distribution (ID) and OOD samples. In this work, we propose a novel OOD detection framework based on a pseudo-label-induced subspace representation, that works under more relaxed and natural assumptions compared to existing feature-based techniques. In addition, we introduce a simple yet effective learning criterion that integrates a cross-entropy-based ID classification loss with a subspace distance-based regularization loss to enhance ID-OOD separability. Extensive experiments validate the effectiveness of our framework.

[AI-55] HiTeC: Hierarchical Contrastive Learning on Text-Attributed Hypergraph with Semantic-Aware Augmentation

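[Quick read]: This paper addresses three limitations of applying existing contrastive learning (CL) methods to text-attributed hypergraphs (TAHGs): (1) graph-agnostic text encoders ignore the correlation between textual content and hypergraph topology; (2) reliance on random data augmentation introduces noise and weakens the contrastive objective; (3) focusing only on node- and hyperedge-level contrastive signals makes it hard to capture long-range dependencies. The key to the solution is HiTeC, a two-stage hierarchical contrastive learning framework. The first stage pre-trains the text encoder with a structure-aware contrastive objective to model the relationship between text and hypergraph structure. The second stage introduces two semantic-aware augmentation strategies (prompt-enhanced text augmentation and semantic-aware hyperedge drop) to generate more informative views, and uses a multi-scale contrastive loss that adds an s-walk-based subgraph-level contrastive signal to better capture long-range dependencies. This decoupled two-stage design improves scalability markedly without sacrificing representation quality.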

Link: https://arxiv.org/abs/2508.03104
Authors: Mengting Pan, Fan Li, Xiaoyang Wang, Wenjie Zhang, Xuemin Lin
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 12 pages, 18 figures

Abstract:Contrastive learning (CL) has become a dominant paradigm for self-supervised hypergraph learning, enabling effective training without costly labels. However, node entities in real-world hypergraphs are often associated with rich textual information, which is overlooked in prior works. Directly applying existing CL-based methods to such text-attributed hypergraphs (TAHGs) leads to three key limitations: (1) The common use of graph-agnostic text encoders overlooks the correlations between textual content and hypergraph topology, resulting in suboptimal representations. (2) Their reliance on random data augmentations introduces noise and weakens the contrastive objective. (3) The primary focus on node- and hyperedge-level contrastive signals limits the ability to capture long-range dependencies, which is essential for expressive representation learning. Although HyperBERT pioneers CL on TAHGs, its co-training paradigm suffers from poor scalability. To fill the research gap, we introduce HiTeC, a two-stage hierarchical contrastive learning framework with semantic-aware augmentation for scalable and effective self-supervised learning on TAHGs. In the first stage, we pre-train the text encoder with a structure-aware contrastive objective to overcome the graph-agnostic nature of conventional methods. In the second stage, we introduce two semantic-aware augmentation strategies, including prompt-enhanced text augmentation and semantic-aware hyperedge drop, to facilitate informative view generation. Furthermore, we propose a multi-scale contrastive loss that extends existing objectives with an s-walk-based subgraph-level contrast to better capture long-range dependencies. By decoupling text encoder pretraining from hypergraph contrastive learning, this two-stage design enhances scalability without compromising representation quality. Extensive experiments confirm the effectiveness of HiTeC.
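
For readers unfamiliar with s-walks, the sketch below uses the usual hypergraph definition: a sequence of hyperedges in which each consecutive pair shares at least s nodes. The function names and the toy hypergraph are assumptions, and how HiTeC samples subgraphs from such walks is not described in the abstract, so this shows only the underlying notion.

```python
from itertools import combinations


def s_adjacent_hyperedges(hyperedges: dict, s: int) -> set:
    """Return the pairs of hyperedge ids that share at least s nodes."""
    return {
        frozenset({a, b})
        for a, b in combinations(hyperedges, 2)
        if len(hyperedges[a] & hyperedges[b]) >= s
    }


def is_s_walk(sequence: list, hyperedges: dict, s: int) -> bool:
    """Check that consecutive hyperedges in the sequence overlap in at least s nodes."""
    adjacent = s_adjacent_hyperedges(hyperedges, s)
    return all(frozenset({a, b}) in adjacent for a, b in zip(sequence, sequence[1:]))


# Toy hypergraph: hyperedge id -> set of member nodes.
hyperedges = {
    "e1": {1, 2, 3},
    "e2": {2, 3, 4},
    "e3": {4, 5},
}
```

Here `e1, e2, e3` is a valid 1-walk but not a 2-walk, because `e2` and `e3` share only node 4. Larger values of s therefore restrict walks to hyperedges that overlap more strongly.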

[AI-56] Using the NANDA Index Architecture in Practice: An Enterprise Perspective

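[Quick read]: This paper addresses the infrastructure gap that autonomous AI agents face in large-scale deployment: how to build a secure, trustworthy, and interoperable agent ecosystem that supports collaboration and governance across heterogeneous protocol environments. The core challenges include agent discovery, capability verification, and authentication, as well as security risks such as capability spoofing, impersonation attacks, and sensitive data leakage. The key to the solution is the NANDA (Networked AI Agents in a Decentralized Architecture) framework. It introduces cryptographically verifiable capability attestation (AgentFacts), implements Zero Trust Agentic Access (ZTAA), and supports interoperability across protocols (Anthropic MCP, Google A2A, Microsoft NLWeb, and HTTPS), creating a network of intelligent services with global visibility and control as a foundation for next-generation autonomous intelligent systems.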

Link: https://arxiv.org/abs/2508.03101
Authors: Sichao Wang, Ramesh Raskar, Mahesh Lambe, Pradyumna Chari, Rekha Singhal, Shailja Gupta, Rajesh Ranjan, Ken Huang
Affiliations: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)

Abstract:The proliferation of autonomous AI agents represents a paradigmatic shift from traditional web architectures toward collaborative intelligent systems requiring sophisticated mechanisms for discovery, authentication, capability verification, and secure collaboration across heterogeneous protocol environments. This paper presents a comprehensive framework addressing the fundamental infrastructure requirements for secure, trustworthy, and interoperable AI agent ecosystems. We introduce the NANDA (Networked AI Agents in a Decentralized Architecture) framework, providing global agent discovery, cryptographically verifiable capability attestation through AgentFacts, and cross-protocol interoperability across Anthropic's Model Context Protocol (MCP), Google's Agent-to-Agent (A2A), Microsoft's NLWeb, and standard HTTPS communications. NANDA implements Zero Trust Agentic Access (ZTAA) principles, extending traditional Zero Trust Network Access (ZTNA) to address autonomous agent security challenges including capability spoofing, impersonation attacks, and sensitive data leakage. The framework defines Agent Visibility and Control (AVC) mechanisms enabling enterprise governance while maintaining operational autonomy and regulatory compliance. Our approach transforms isolated AI agents into an interconnected ecosystem of verifiable, trustworthy intelligent services, establishing foundational infrastructure for large-scale autonomous agent deployment across enterprise and consumer environments. This work addresses the critical gap between current AI agent capabilities and infrastructure requirements for secure, scalable, multi-agent collaboration, positioning the foundation for next-generation autonomous intelligent systems.

[AI-57] VFLAIR-LLM: A Comprehensive Framework and Benchmark for Split Learning of LLMs KDD2025

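[Quick read]: This paper addresses how to adapt Large Language Models (LLMs) securely when local computing resources are limited. Calling cloud LLM APIs directly poses data privacy risks, while private deployment demands substantial compute, so a solution is needed that both protects user data and reduces resource consumption. The key to the solution is VFLAIR-LLM, a lightweight and extensible Split Learning (SL) framework that supports LLM inference and fine-tuning, with the model partitioned between client and server so that privacy is preserved while local compute overhead is substantially reduced. The framework offers two model partition settings, three task types, and 18 datasets, and includes built-in attack and defense modules. It benchmarks 5 attacks and 9 defenses, providing empirical guidance on model partition configuration, defense strategy selection, and hyperparameter settings for real-world use.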

Link: https://arxiv.org/abs/2508.03097
Authors: Zixuan Gu, Qiufeng Fan, Long Sun, Yang Liu, Xiaojun Ye
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: 12 pages, 10 figures, published in KDD2025

Abstract:With the advancement of Large Language Models (LLMs), LLM applications have expanded into a growing number of fields. However, users with data privacy concerns face limitations in directly utilizing LLM APIs, while private deployments incur significant computational demands. This creates a substantial challenge in achieving secure LLM adaptation under constrained local resources. To address this issue, collaborative learning methods, such as Split Learning (SL), offer a resource-efficient and privacy-preserving solution for adapting LLMs to private domains. In this study, we introduce VFLAIR-LLM (available at this https URL), an extensible and lightweight split learning framework for LLMs, enabling privacy-preserving LLM inference and fine-tuning in resource-constrained environments. Our library provides two LLM partition settings, supporting three task types and 18 datasets. In addition, we provide standard modules for implementing and evaluating attacks and defenses. We benchmark 5 attacks and 9 defenses under various Split Learning for LLM(SL-LLM) settings, offering concrete insights and recommendations on the choice of model partition configurations, defense strategies, and relevant hyperparameters for real-world applications.
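
The basic mechanics of split learning can be shown with a toy forward pass. This is a generic sketch and not the VFLAIR-LLM API: the `Client` and `Server` classes, the two tiny dense layers standing in for the head and tail of a model, and all weights are made up for illustration.

```python
def linear(weights, bias, inputs):
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, bias)]


def relu(values):
    return [max(0.0, v) for v in values]


class Client:
    """Holds the private input and the first ("head") part of the model."""

    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def forward(self, private_input):
        # Only these intermediate activations leave the client, never the raw input.
        return relu(linear(self.weights, self.bias, private_input))


class Server:
    """Holds the remaining ("tail") part of the model."""

    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def forward(self, activations):
        return linear(self.weights, self.bias, activations)


client = Client(weights=[[1.0, -1.0], [0.5, 0.5]], bias=[0.0, 0.0])
server = Server(weights=[[2.0, 1.0]], bias=[0.5])

activations = client.forward([3.0, 1.0])  # this is what crosses the network
output = server.forward(activations)
```

Keeping the raw input on the client does not by itself guarantee privacy, since intermediate activations can still leak information. That is the reason a framework in this space would benchmark attacks and defenses, as the abstract describes.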

[AI-58] A Survey of AI Agent Registry Solutions

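[Quick read]: This paper addresses the difficulties with discovery, identity, and capability sharing that arise when autonomous AI agents are deployed at scale across cloud, enterprise, and decentralized environments without a unified standard. The key to the solution is to present and compare three registry systems, each based on a different metadata model: MCP uses a centralized metaregistry with GitHub-authenticated publishing and structured metadata for server discovery; A2A uses JSON-based Agent Cards to support decentralized interaction, discoverable through well-known URIs, catalogs, or direct configuration; and NANDA Index introduces AgentFacts, a cryptographically verifiable and privacy-preserving metadata model for dynamic discovery, credentialed capabilities, and cross-domain interoperability. The three approaches are evaluated along four dimensions (security, scalability, authentication, and maintainability), providing a theoretical basis and practical guidance for designing registry systems for the future Internet of AI Agents.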

Link: https://arxiv.org/abs/2508.03095
Authors: Aditi Singh,Abul Ehtesham,Ramesh Raskar,Mahesh Lambe,Pradyumna Chari,Jared James Grogan,Abhishek Singh,Saket Kumar
Affiliation: unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments:

Abstract: As autonomous AI agents scale across cloud, enterprise, and decentralized environments, the need for standardized registry systems to support discovery, identity, and capability sharing has become essential. This paper surveys three prominent registry approaches, each defined by a unique metadata model: MCP’s this http URL, A2A’s Agent Card, and NANDA’s AgentFacts. MCP uses a centralized metaregistry with GitHub-authenticated publishing and structured metadata for server discovery. A2A enables decentralized interaction via JSON-based Agent Cards, discoverable through well-known URIs, curated catalogs, or direct configuration. NANDA Index introduces AgentFacts, a cryptographically verifiable and privacy-preserving metadata model designed for dynamic discovery, credentialed capabilities, and cross-domain interoperability. These approaches are compared across four dimensions: security, scalability, authentication, and maintainability. The paper concludes with suggestions and recommendations to guide future design and adoption of registry systems for the Internet of AI Agents.

[AI-59] MissDDIM: Deterministic and Efficient Conditional Diffusion for Tabular Data Imputation

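[Quick read]: This paper addresses two core problems of current missing-data imputation methods based on stochastic denoising diffusion probabilistic models (DDPMs) in real-world tabular settings: high inference latency, and non-deterministic outputs that complicate downstream processing. The key contribution is MissDDIM, a conditional diffusion framework that brings Denoising Diffusion Implicit Models (DDIM) to tabular imputation. By replacing stochastic sampling with a deterministic sampling mechanism, it maintains generation quality while significantly reducing inference time and eliminating output variability, which improves stability and efficiency in practical use.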

Link: https://arxiv.org/abs/2508.03083
Authors: Youran Zhou,Mohamed Reda Bouadjenek,Sunil Aryal
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Diffusion models have recently emerged as powerful tools for missing data imputation by modeling the joint distribution of observed and unobserved variables. However, existing methods, typically based on stochastic denoising diffusion probabilistic models (DDPMs), suffer from high inference latency and variable outputs, limiting their applicability in real-world tabular settings. To address these deficiencies, we present in this paper MissDDIM, a conditional diffusion framework that adapts Denoising Diffusion Implicit Models (DDIM) for tabular imputation. While stochastic sampling enables diverse completions, it also introduces output variability that complicates downstream processing.
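
The deterministic sampling that DDIM contributes can be sketched in a few lines of NumPy. This is the standard DDIM update with eta = 0, not MissDDIM's actual implementation: `toy_eps_model` is a hypothetical placeholder for a trained conditional noise predictor, and the noise schedule `alpha_bars` and the final masking step are assumptions made for illustration.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0): no fresh noise is injected."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Move to the previous noise level along the same predicted noise direction.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred

def toy_eps_model(x_t, x_obs, mask, t):
    """Hypothetical placeholder for a trained noise predictor conditioned on observed cells."""
    return 0.1 * x_t

def impute(x_obs, mask, alpha_bars, seed=0):
    """Fill cells where mask is False; alpha_bars[t] decreases as t grows."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x_obs.shape)          # the only source of randomness
    for t in range(len(alpha_bars) - 1, -1, -1):
        eps = toy_eps_model(x, x_obs, mask, t)
        prev = alpha_bars[t - 1] if t > 0 else 1.0
        x = ddim_step(x, eps, alpha_bars[t], prev)
    return np.where(mask, x_obs, x)               # keep observed values unchanged
```

Because the loop adds no noise after the initial draw, fixing the starting noise fixes the imputed values, which is the property that removes output variability. The number of entries in `alpha_bars` sets the number of denoising steps, so a shorter schedule trades accuracy for latency.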

[AI-60] EoH-S: Evolution of Heuristic Set using LLMs for Automated Heuristic Design

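[Quick read]: This paper addresses the poor generalization of current LLM-based Automated Heuristic Design (AHD) methods, which usually produce a single heuristic for all problem instances and therefore degrade across different distributions or settings. The authors propose Automated Heuristic Set Design (AHSD), whose goal is to automatically construct a small, complementary set of heuristics such that every problem instance is optimized by at least one heuristic in the set. The AHSD objective is monotone and supermodular, which gives a theoretical basis for efficient search. The key component is the Evolution of Heuristic Set (EoH-S) algorithm, which introduces two mechanisms: complementary population management and complementary-aware memetic search. Together they generate high-quality, diverse heuristic sets that outperform state-of-the-art methods on three AHD tasks, with performance improvements of up to 60%.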

Link: https://arxiv.org/abs/2508.03082
Authors: Fei Liu,Yilu Liu,Qingfu Zhang,Xialiang Tong,Mingxuan Yuan
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Automated Heuristic Design (AHD) using Large Language Models (LLMs) has achieved notable success in recent years. Despite the effectiveness of existing approaches, they only design a single heuristic to serve all problem instances, often inducing poor generalization across different distributions or settings. To address this issue, we propose Automated Heuristic Set Design (AHSD), a new formulation for LLM-driven AHD. The aim of AHSD is to automatically generate a small-sized complementary heuristic set to serve diverse problem instances, such that each problem instance could be optimized by at least one heuristic in this set. We show that the objective function of AHSD is monotone and supermodular. Then, we propose Evolution of Heuristic Set (EoH-S) to apply the AHSD formulation for LLM-driven AHD. With two novel mechanisms of complementary population management and complementary-aware memetic search, EoH-S could effectively generate a set of high-quality and complementary heuristics. Comprehensive experimental results on three AHD tasks with diverse instances spanning various sizes and distributions demonstrate that EoH-S consistently outperforms existing state-of-the-art AHD methods and achieves up to 60% performance improvements.
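
One way to read the "complementary set" objective is as a best-of-set cost: each instance is charged the cost of whichever heuristic in the set handles it best. The sketch below assumes that formulation (the paper's exact objective may differ) and uses plain greedy selection over a precomputed cost matrix as a stand-in for EoH-S's evolutionary search. The names `set_cost` and `greedy_select` are hypothetical.

```python
import numpy as np

def set_cost(costs, subset):
    """Mean over instances of the lowest cost achieved by any heuristic in subset.

    costs[h, i] is the cost of heuristic h on instance i (lower is better).
    """
    return costs[list(subset)].min(axis=0).mean()

def greedy_select(costs, k):
    """Greedily add the heuristic that most reduces the best-of-set cost."""
    chosen = []
    remaining = set(range(costs.shape[0]))
    for _ in range(k):
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = min(sorted(remaining), key=lambda h: set_cost(costs, chosen + [h]))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

Under this reading, adding a heuristic can never increase the cost (monotonicity), and the gain from adding one shrinks as the set grows, which is what makes greedy-style construction reasonable. A generalist with mediocre cost everywhere may be picked first, after which specialists that cover its weak instances become the most valuable additions.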

[AI-61] ContractEval: Benchmarking LLMs for Clause-Level Legal Risk Identification in Commercial Contracts

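[Quick read]: This paper asks whether open-source large language models (LLMs) can match proprietary models at clause-level legal risk identification in commercial contracts, particularly when deployed locally to protect data privacy. Its key contribution is ContractEval, the first benchmark built for this setting. Using the Contract Understanding Atticus Dataset (CUAD), it evaluates 15 open-source and 4 proprietary LLMs and systematically exposes the limitations and potential of open-source models in terms of correctness, output effectiveness, the effect of reasoning mode, response tendencies, and the performance trade-offs of quantization. The results provide empirical evidence and direction for targeted optimization of legal-domain LLMs.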

Link: https://arxiv.org/abs/2508.03080
Authors: Shuang Liu,Zelong Li,Ruoyun Ma,Haiyan Zhao,Mengnan Du
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:The potential of large language models (LLMs) in specialized domains such as legal risk analysis remains underexplored. In response to growing interest in locally deploying open-source LLMs for legal tasks while preserving data confidentiality, this paper introduces ContractEval, the first benchmark to thoroughly evaluate whether open-source LLMs could match proprietary LLMs in identifying clause-level legal risks in commercial contracts. Using the Contract Understanding Atticus Dataset (CUAD), we assess 4 proprietary and 15 open-source LLMs. Our results highlight five key findings: (1) Proprietary models outperform open-source models in both correctness and output effectiveness, though some open-source models are competitive in certain specific dimensions. (2) Larger open-source models generally perform better, though the improvement slows down as models get bigger. (3) Reasoning (“thinking”) mode improves output effectiveness but reduces correctness, likely due to over-complicating simpler tasks. (4) Open-source models generate “no related clause” responses more frequently even when relevant clauses are present. This suggests “laziness” in thinking or low confidence in extracting relevant content. (5) Model quantization speeds up inference but at the cost of performance drop, showing the tradeoff between efficiency and accuracy. These findings suggest that while most LLMs perform at a level comparable to junior legal assistants, open-source models require targeted fine-tuning to ensure correctness and effectiveness in high-stakes legal settings. ContractEval offers a solid benchmark to guide future development of legal-domain LLMs.

[AI-62] Optimizing Bipedal Locomotion for The 100m Dash With Comparison to Human Running ICRA2023

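[Quick read]: This paper addresses the limited efficiency and stability of bipedal robots during high-speed running, in particular how to achieve highly dynamic motion control that approaches the efficiency of human running. The solution has three parts. First, it proposes an optimization method that improves gait efficiency across a range of speeds, enabling extremely high-speed running on hardware. Second, by comparing against human running biomechanics studies, it finds that despite significant morphological differences, the Cassie robot and humans are highly similar in key gait properties, which supports the validity of the optimized gaits. Finally, it integrates the optimized gaits into a full controller that satisfies the practical constraints of the 100m dash (starting, accelerating, running at constant speed, and stopping) and demonstrates it on real hardware, setting the record for the fastest 100m by a bipedal robot.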

Link: https://arxiv.org/abs/2508.03070
Authors: Devin Crowley,Jeremy Dao,Helei Duan,Kevin Green,Jonathan Hurst,Alan Fern
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, 7 figures, published by IEEE at ICRA 2023, pp. 12205-12211, see this https URL

Abstract:In this paper, we explore the space of running gaits for the bipedal robot Cassie. Our first contribution is to present an approach for optimizing gait efficiency across a spectrum of speeds with the aim of enabling extremely high-speed running on hardware. This raises the question of how the resulting gaits compare to human running mechanics, which are known to be highly efficient in comparison to quadrupeds. Our second contribution is to conduct this comparison based on established human biomechanical studies. We find that despite morphological differences between Cassie and humans, key properties of the gaits are highly similar across a wide range of speeds. Finally, our third contribution is to integrate the optimized running gaits into a full controller that satisfies the rules of the real-world task of the 100m dash, including starting and stopping from a standing position. We demonstrate this controller on hardware to establish the Guinness World Record for Fastest 100m by a Bipedal Robot.

[AI-63] Untraceable DeepFakes via Traceable Fingerprint Elimination

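[Quick read]: This paper examines how easily current DeepFake attribution models (AMs) can be evaded by attackers. Existing attacks cannot fully remove the traces left by generative models (GMs), so defensive measures can mitigate them. The paper proposes a multiplicative attack: an adversarial model trained only on real data that works as a universal, black-box attack across many generative models, fundamentally erasing the traceable fingerprints of GMs and evading detection with a high success rate without any knowledge of the AM's structure. Experiments show an average attack success rate of 97.08% against 6 advanced AMs, and a success rate above 72.39% even when defensive mechanisms are present, which exposes the fragility of current AMs and motivates more robust attribution techniques.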

Link: https://arxiv.org/abs/2508.03067
Authors: Jiewei Lai,Lan Zhang,Chen Tang,Pengcheng Sun,Xinming Wang,Yunhao Wang
Affiliation: unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Recent advancements in DeepFakes attribution technologies have significantly enhanced forensic capabilities, enabling the extraction of traces left by generative models (GMs) in images, making DeepFakes traceable back to their source GMs. Meanwhile, several attacks have attempted to evade attribution models (AMs) for exploring their limitations, calling for more robust AMs. However, existing attacks fail to eliminate GMs’ traces, thus can be mitigated by defensive measures. In this paper, we identify that untraceable DeepFakes can be achieved through a multiplicative attack, which can fundamentally eliminate GMs’ traces, thereby evading AMs even enhanced with defensive measures. We design a universal and black-box attack method that trains an adversarial model solely using real data, applicable for various GMs and agnostic to AMs. Experimental results demonstrate the outstanding attack capability and universal applicability of our method, achieving an average attack success rate (ASR) of 97.08% against 6 advanced AMs on DeepFakes generated by 9 GMs. Even in the presence of defensive mechanisms, our method maintains an ASR exceeding 72.39%. Our work underscores the potential challenges posed by multiplicative attacks and highlights the need for more robust AMs.

[AI-64] Beyond Surface-Level Detection: Towards Cognitive-Driven Defense Against Jailbreak Attacks via Meta-Operations Reasoning

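[Quick read]: This paper addresses the lack of effective defenses for large language models (LLMs) against jailbreak attacks, in particular the failure of existing defenses based on shallow pattern matching to generalize to novel and unseen attack strategies. The key contribution is the Cognitive-Driven Defense (CDD) framework. It introduces "meta-operations", defined as basic manipulations that conceal harmful intent, and emulates human cognitive reasoning through a structured reasoning chain: first a global perception of the prompt, then a localized analysis to uncover hidden manipulations. Supervised fine-tuning teaches the model to identify and reason about known manipulation patterns, and an entropy-guided reinforcement learning algorithm (EG-GRPO) encourages exploration of unknown types of meta-operations, which significantly improves generalization to unseen jailbreak attacks.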

Link: https://arxiv.org/abs/2508.03054
Authors: Rui Pu,Chaozhuo Li,Rui Ha,Litian Zhang,Lirong Qiu,Xi Zhang
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Defending large language models (LLMs) against jailbreak attacks is essential for their safe and reliable deployment. Existing defenses often rely on shallow pattern matching, which struggles to generalize to novel and unseen attack strategies. To address this challenge, we propose the Cognitive-Driven Defense (CDD) framework, which targets the underlying structure of jailbreak prompts by applying meta-operations, defined as basic manipulations that conceal harmful intent. CDD emulates human cognitive reasoning through a structured reasoning chain. It begins with a global perception of the prompt and follows with a localized analysis to uncover hidden manipulations. By applying supervised fine-tuning on this structured chain, the model learns to identify and reason about known manipulation patterns. To enhance generalization to unseen threats, an entropy-guided reinforcement learning algorithm (EG-GRPO) is introduced to encourage exploration of new types and variants of meta-operations. Experiments demonstrate that CDD can achieve state-of-the-art defense performance and exhibit strong generalization to unseen jailbreak attacks.

[AI-65] SkeNa: Learning to Navigate Unseen Environments Based on Abstract Hand-Drawn Maps

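[Quick read]: This paper addresses navigation in unseen indoor environments using only a hand-drawn sketch map, and formulates it as the task of Sketch map-based visual Navigation (SkeNa). Conventional navigation methods usually depend on accurate structured maps or sensor data, whereas this work mimics how humans guide route planning with sketches and challenges a model to navigate accurately from a map with low information density and high abstraction. The key contribution is the SkeNavigator framework, which has two modules. The Ray-based Map Descriptor (RMD) strengthens the feature representation of the sketch map, using equidistant sampling points and boundary distances to improve the perception of spatial scale. The Dual-Map Aligned Goal Predictor (DAGP) uses feature correspondences between the sketch map and an exploration map constructed on site to predict the goal position and guide navigation. The method clearly outperforms prior floor-plan navigation methods, with a relative improvement of 105% in SPL (Success weighted by Path Length) on the high-abstraction validation set.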

Link: https://arxiv.org/abs/2508.03053
Authors: Haojun Xu,Jiaqi Xiang,Wu Wei,Jinyu Chen,Linqing Zhong,Linjiang Huang,Hongyu Yang,Si Liu
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 9 pages, 5 figures

Abstract:A typical human strategy for giving navigation guidance is to sketch route maps based on the environmental layout. Inspired by this, we introduce Sketch map-based visual Navigation (SkeNa), an embodied navigation task in which an agent must reach a goal in an unseen environment using only a hand-drawn sketch map as guidance. To support research for SkeNa, we present a large-scale dataset named SoR, comprising 54k trajectory and sketch map pairs across 71 indoor scenes. In SoR, we introduce two navigation validation sets with varying levels of abstraction in hand-drawn sketches, categorized based on their preservation of spatial scales in the environment, to facilitate future research. To construct SoR, we develop an automated sketch-generation pipeline that efficiently converts floor plans into hand-drawn representations. To solve SkeNa, we propose SkeNavigator, a navigation framework that aligns visual observations with hand-drawn maps to estimate navigation targets. It employs a Ray-based Map Descriptor (RMD) to enhance sketch map valid feature representation using equidistant sampling points and boundary distances. To improve alignment with visual observations, a Dual-Map Aligned Goal Predictor (DAGP) leverages the correspondence between sketch map features and on-site constructed exploration map features to predict goal position and guide navigation. SkeNavigator outperforms prior floor plan navigation methods by a large margin, improving SPL on the high-abstract validation set by 105% relatively. Our code and dataset will be released.
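
For readers unfamiliar with the SPL metric quoted above, the following sketch computes it under its usual definition in embodied navigation: each successful episode is weighted by the ratio of the shortest-path length to the length of the path actually taken. The function names and the sample numbers are illustrative only and are not taken from the paper.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes.

    successes[i] is 1 if episode i reached the goal, else 0.
    shortest_lengths[i] is the shortest-path distance from start to goal.
    path_lengths[i] is the distance the agent actually travelled.
    """
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, path_lengths):
        total += s * l / max(p, l)
    return total / len(successes)

def relative_gain(baseline, new):
    """Relative improvement of `new` over `baseline` (1.05 means 105%)."""
    return (new - baseline) / baseline
```

An agent that succeeds but takes twice the shortest route scores 0.5 for that episode, and a failed episode scores 0 regardless of path length. A relative gain of 105% means the new SPL is slightly more than double the baseline, not that SPL rose by 105 points.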

[AI-66] Tree-of-Reasoning: Towards Complex Medical Diagnosis via Multi-Agent Reasoning with Evidence Tree ACM-MM2025

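[Quick read]: This paper addresses the weak performance of large language models (LLMs) on complex, real-world medical diagnosis tasks. The core difficulty is insufficient reasoning depth, which causes information loss or logical jumps when processing large amounts of specialized medical data and leads to diagnostic errors. The key contribution is Tree-of-Reasoning (ToR), a multi-agent framework that uses a tree structure to clearly record the reasoning paths of LLMs together with the corresponding clinical evidence, and a cross-validation mechanism to ensure the consistency of multi-agent decisions, which significantly improves clinical reasoning in complex medical scenarios.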

Link: https://arxiv.org/abs/2508.03038
Authors: Qi Peng,Jialin Cui,Jiayuan Xie,Yi Cai,Qing Li
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted by ACM MM 2025

Abstract: Large language models (LLMs) have shown great potential in the medical domain. However, existing models still fall short when faced with complex medical diagnosis tasks in the real world. This is mainly because they lack sufficient reasoning depth, which causes information loss or logical jumps when processing a large amount of specialized medical data and results in diagnostic errors. To address these challenges, we propose Tree-of-Reasoning (ToR), a novel multi-agent framework designed to handle complex scenarios. Specifically, ToR introduces a tree structure that can clearly record the reasoning path of LLMs and the corresponding clinical evidence. At the same time, we propose a cross-validation mechanism to ensure the consistency of multi-agent decision-making, thereby improving the clinical reasoning ability of multi-agents in complex medical scenarios. Experimental results on real-world medical data show that our framework can achieve better performance than existing baseline methods.

[AI-67] From Text to Trajectories: GPT-2 as an ODE Solver via In-Context

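[Quick read]: This paper addresses the poorly understood, highly nonlinear behavior of In-Context Learning (ICL) on natural language processing (NLP) tasks. Its approach is to formulate standard ordinary differential equation (ODE) problems and their solutions as sequential prompts and to evaluate GPT-2 models on them. The key finding is that, in the ICL setting, GPT-2 can effectively learn a meta-ODE algorithm whose convergence behavior is comparable to or better than the Euler method, with accuracy improving exponentially as the number of demonstrations increases. The model also generalizes well to out-of-distribution (OOD) problems, which sheds light on the numerical capabilities and underlying mechanisms of ICL.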

Link: https://arxiv.org/abs/2508.03031
Authors: Ziyang Ma,Baojian Zhou,Deqing Yang,Yanghua Xiao
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:In-Context Learning (ICL) has emerged as a new paradigm in large language models (LLMs), enabling them to perform novel tasks by conditioning on a few examples embedded in the prompt. Yet, the highly nonlinear behavior of ICL for NLP tasks remains poorly understood. To shed light on its underlying mechanisms, this paper investigates whether LLMs can solve ordinary differential equations (ODEs) under the ICL setting. We formulate standard ODE problems and their solutions as sequential prompts and evaluate GPT-2 models on these tasks. Experiments on two types of ODEs show that GPT-2 can effectively learn a meta-ODE algorithm, with convergence behavior comparable to, or better than, the Euler method, and achieve exponential accuracy gains with increasing numbers of demonstrations. Moreover, the model generalizes to out-of-distribution (OOD) problems, demonstrating robust extrapolation capabilities. These empirical findings provide new insights into the mechanisms of ICL in NLP and its potential for solving nonlinear numerical problems.
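
The Euler method used as the reference point above is the simplest numerical ODE solver, and a short sketch shows what "convergence behavior" refers to. This is the textbook forward Euler scheme, not the paper's prompt format or evaluation code, and the test equation y' = -y is chosen here only as an example.

```python
import math

def euler(f, t0, y0, h, n_steps):
    """Forward Euler: advance y' = f(t, y) from (t0, y0) by n_steps steps of size h."""
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + h * f(t, y)   # follow the local slope for one step
        t = t + h
    return y

def decay(t, y):
    """Right-hand side of y' = -y, whose exact solution is y(t) = exp(-t)."""
    return -y

# Approximate y(1) with two step sizes and compare against the exact value.
exact = math.exp(-1.0)
err_coarse = abs(euler(decay, 0.0, 1.0, 0.01, 100) - exact)
err_fine = abs(euler(decay, 0.0, 1.0, 0.005, 200) - exact)
```

Forward Euler is first-order accurate, so halving the step size roughly halves the error (`err_coarse / err_fine` is close to 2). That linear improvement is the baseline against which the abstract's claim of exponential accuracy gains with more demonstrations should be read.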

[AI-68] Collab-Solver: Collaborative Solving Policy Learning for Mixed-Integer Linear Programming

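[Quick read]: This paper addresses a limitation of current learning-based methods for mixed-integer linear programming (MILP): the policy for each solver module is learned independently, ignoring the interdependence between modules, which limits solving speed and quality. The key contribution is Collab-Solver, a multi-agent collaborative policy learning framework that models the joint optimization of cut selection and branching as a Stackelberg game and uses a two-phase learning paradigm. The first phase performs data-communicated policy pretraining to improve stability, and the second phase further coordinates policy learning across modules to achieve joint optimization. The method significantly improves solving performance on synthetic and large-scale real-world MILP instances and generalizes well across instance sets.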

Link: https://arxiv.org/abs/2508.03030
Authors: Siyuan Li,Yifan Yu,Yanchen Deng,Zhihao Zhang,Mengjing Chen,Fangzhou Zhu,Tao Zhong,Jianye Hao,Peng Liu,Bo An
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Mixed-integer linear programming (MILP) has been a fundamental problem in combinatorial optimization. Previous works have designed a plethora of hard-coded heuristics to accomplish challenging MILP solving with domain knowledge. Driven by the high capability of neural networks, recent research is devoted to replacing manually designed heuristics with learned policies. Although learning-based MILP methods have shown great promise, existing works independently treat the policy learning in each module of MILP solvers without considering their interdependence, severely hurting the solving speed and quality. To address this issue, we propose a novel multi-agent-based policy learning framework for MILP (Collab-Solver), which can collaboratively optimize the policies for multiple modules. Specifically, we formulate the collaboration of cut selection and branching in MILP solving as a Stackelberg game. Under this formulation, we develop a two-phase learning paradigm to stabilize the collaborative policy learning, where the first phase achieves the data-communicated policy pretraining and the second phase further orchestrates the policy learning for various modules. The jointly learned policy significantly improves the solving performance on both synthetic and large-scale real-world MILP datasets. Moreover, the policies learned by Collab-Solver have also demonstrated excellent generalization abilities across different instance sets.

[AI-69] Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

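[Quick read]: This paper addresses two core challenges that large language reasoning models face in multi-round agentic planning in interactive environments: conventional reinforcement learning struggles with credit assignment under sparse rewards, and verbose step-by-step reasoning histories incur excessive computational overhead. The key contribution is BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that builds a self-improving data flywheel to make reasoning models more robust on long-horizon tasks. The framework first bootstraps efficient reasoning using "planning quaternions" with long-short chain-of-thought fusion. It then extrapolates the model to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, it selects experiences via reward-gated rejection sampling for iterative self-refinement, which significantly improves both performance and token efficiency.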

Link: https://arxiv.org/abs/2508.03018
Authors: Yutong Wang,Pengliang Ji,Kaixin Li,Baolong Bi,Tao Feng,Guillaume Sartoretti
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments:

Abstract:Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.
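
The refinement stage's "reward-gated rejection sampling" can be pictured as a filter over sampled trajectories: only rollouts whose reward clears a gate are kept as training data for the next round. The sketch below is a guess at the general shape of such a filter and is not BPO's implementation. The `Trajectory` type, its fields, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    """Hypothetical record of one rollout in an interactive environment."""
    task_id: str
    steps: list = field(default_factory=list)
    reward: float = 0.0

def reward_gated_filter(trajectories, threshold=1.0):
    """Keep only trajectories whose final reward meets the gate."""
    return [t for t in trajectories if t.reward >= threshold]

def success_rate(trajectories, threshold=1.0):
    """Fraction of sampled trajectories that pass the gate."""
    if not trajectories:
        return 0.0
    return len(reward_gated_filter(trajectories, threshold)) / len(trajectories)
```

In a sparse-reward setting most rollouts earn zero reward, so the gate discards them and the model is fine-tuned only on the accepted ones. This sidesteps per-step credit assignment, since a trajectory is accepted or rejected as a whole.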

[AI-70] Tool-integrated Reinforcement Learning for Repo Deep Search

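[Quick read]: This paper addresses issue localization in software development, that is, accurately identifying the code locations that must be modified to fix a software defect. The main difficulty is the semantic gap between natural-language issue descriptions and the faulty code, which requires complex multi-hop reasoning over code dependencies. Existing LLM-based agents integrate repository retrieval tools, but this turns the problem into a demanding "Repo Deep Search" task in which the LLM must use several retrieval tools effectively throughout multi-step reasoning and navigation. The key contribution is ToolTrain, a two-stage tool-integrated training framework that combines rejection-sampled supervised fine-tuning with tool-integrated reinforcement learning to substantially improve an LLM's ability to use retrieval tools for issue localization. Experiments show that ToolTrain-trained models reach state-of-the-art performance, with the 32B model even surpassing Claude-3.7 on function-level localization, and that better localization translates directly into better end-to-end issue resolution, which supports training specifically for issue localization.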

Link: https://arxiv.org/abs/2508.03012
Authors: Zexiong Ma,Chao Peng,Qunhong Zeng,Pengfei Gao,Yanzhen Zou,Bing Xie
Affiliation: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract:Issue localization, the process of identifying code locations that need modification to resolve software issues, is a critical yet challenging task in software development. The semantic gap between natural language issue descriptions and faulty code requires complex multi-hop reasoning through code dependencies. Existing LLM-based agents attempt to address this by integrating repository retrieval tools. However, this transforms issue localization into a demanding task we call Repo Deep Search, which requires the LLM to effectively utilize various repository retrieval tools throughout a multi-step reasoning and navigation process. To tackle this challenge, we present ToolTrain, a two-stage tool-integrated training framework combining rejection-sampled supervised fine-tuning and tool-integrated reinforcement learning to enhance LLMs’ ability to use retrieval tools for issue localization. Experimental results show that ToolTrain-trained models achieve state-of-the-art performance, with our 32B model even surpassing Claude-3.7 on function-level localization. The results also show that improved localization performance translates to better end-to-end issue resolution performance. This further demonstrates that training for issue localization is a viable and effective strategy for improving automated software development.

[AI-71] When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation for LLMs

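[Quick read]: This paper addresses the bottleneck in evaluating the quality and safety of large language model (LLM) outputs on open-ended and complex tasks. Traditional human evaluation cannot keep up with growing model scale and application scenarios, making evaluation costly, slow, and inconsistent. The core of the proposed direction is the "agent-as-a-judge" paradigm, which uses the reasoning and perspective-taking abilities of LLMs to build evaluation systems composed of AI agents, enabling scalable and fine-grained automatic evaluation. The approach improves the reliability and efficiency of evaluation, and dynamic frameworks such as multi-agent debate add depth and diversity to judgments, offering a trustworthy and efficient evaluation path for next-generation LLMs.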

Link: https://arxiv.org/abs/2508.02994
Authors: Fangyi Yu
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: As large language models (LLMs) grow in capability and autonomy, evaluating their outputs, especially in open-ended and complex tasks, has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This “agent-as-a-judge” approach leverages the reasoning and perspective-taking abilities of LLMs to assess the quality and safety of other models, promising scalable and nuanced alternatives to human evaluation. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real-world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges, including bias, robustness, and meta evaluation, and outline future research directions. By bringing together these strands, our review demonstrates how agent-based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next-generation LLMs.

[AI-72] GACL: Grounded Adaptive Curriculum Learning with Active Task and Performance Monitoring IROS2025

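[Quick read]: This paper addresses the reliance of curriculum learning in robotics on manually designed curricula, which is costly to engineer, subjective, and often suboptimal, and in particular the challenge of staying relevant to the target domain distribution in complex task spaces. The key contribution is Grounded Adaptive Curriculum Learning (GACL), a framework with three core innovations: (1) a task representation that consistently handles complex robot task design; (2) an active performance tracking mechanism that adaptively generates curricula suited to the robot's current capabilities; and (3) a grounding approach that maintains target domain relevance by alternating sampling between reference tasks and synthetic tasks. The method clearly outperforms state-of-the-art approaches in complex settings such as wheeled navigation and quadruped locomotion.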

Link: https://arxiv.org/abs/2508.02988
Authors: Linji Wang,Zifan Xu,Peter Stone,Xuesu Xiao
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: 7 pages, IROS 2025

Abstract:Curriculum learning has emerged as a promising approach for training complex robotics tasks, yet current applications predominantly rely on manually designed curricula, which demand significant engineering effort and can suffer from subjective and suboptimal human design choices. While automated curriculum learning has shown success in simple domains like grid worlds and games where task distributions can be easily specified, robotics tasks present unique challenges: they require handling complex task spaces while maintaining relevance to target domain distributions that are only partially known through limited samples. To this end, we propose Grounded Adaptive Curriculum Learning, a framework specifically designed for robotics curriculum learning with three key innovations: (1) a task representation that consistently handles complex robot task design, (2) an active performance tracking mechanism that allows adaptive curriculum generation appropriate for the robot’s current capabilities, and (3) a grounding approach that maintains target domain relevance through alternating sampling between reference and synthetic tasks. We validate GACL on wheeled navigation in constrained environments and quadruped locomotion in challenging 3D confined spaces, achieving 6.8% and 6.1% higher success rates, respectively, than state-of-the-art methods in each domain.

[AI-73] Polymath: A Self-Optimizing Agent with Dynamic Hierarchical Workflow AAAI2026

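[Quick read]: This paper addresses the scalability and efficiency bottlenecks of building general-purpose agents on large language models (LLMs), in particular the inflexibility of manually designed task workflows (such as Chain-of-Thought, Self-Reflection, and ReACT) and the heavy dependence of existing automated methods on labeled data, which limits their use in real-world, dynamic scenarios. The key contribution is Polymath, a self-optimizing agent with a dynamic hierarchical workflow. Its core innovation is to combine multi-grid-inspired graph optimization with a self-reflection-guided evolutionary algorithm, so that code-represented workflows can be generated and optimized automatically without labeled data, significantly improving the ability to solve complex tasks.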

Link: https://arxiv.org/abs/2508.02959
Authors: Chia-Tung Ho,Jing Gong,Xufeng Yao,Yunsheng Bai,Abhishek B Akkur,Haoxing Ren
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 18 pages, 12 figures, under review for AAAI2026

Abstract:Large language models (LLMs) excel at solving complex tasks by executing agentic workflows composed of detailed instructions and structured operations. Yet, building general-purpose agents by manually embedding foundation models into agentic systems such as Chain-of-Thought, Self-Reflection, and ReACT through text interfaces limits scalability and efficiency. Recently, many researchers have sought to automate the generation and optimization of these workflows through code-based representations. However, existing methods often rely on labeled datasets to train and optimize workflows, making them ineffective and inflexible for solving real-world, dynamic problems where labeled data is unavailable. To address this challenge, we introduce Polymath, a self-optimizing agent with dynamic hierarchical workflow that leverages the flexibility of task flow graphs and the expressiveness of code-represented workflows to solve a wide range of real-world, dynamic problems. The proposed optimization methodology integrates multi-grid-inspired graph optimization with a self-reflection-guided evolutionary algorithm to refine workflows without labeled data. Experimental results on six benchmark datasets across coding, math, and multi-turn QA tasks show that Polymath achieves 8.1% average improvement over state-of-the-art baselines.

[AI-74] MedBLINK: Probing Basic Perception in Multimodal Language Models for Medicine

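[Quick read]: This paper addresses why current multimodal language models (MLMs) are hard to adopt in clinical settings: they frequently fail at seemingly simple visual perception tasks, such as determining the orientation of a medical image or identifying whether a CT scan is contrast-enhanced. To evaluate this weakness systematically, the authors propose the Medblink benchmark, which covers eight clinically relevant tasks across imaging modalities and anatomical regions, with 1,429 multiple-choice questions over 1,605 medical images. The key contribution is a structured, broad-coverage evaluation set that quantifies the basic perception gap of mainstream MLMs. Human annotators reach 96.4% accuracy while the best model reaches only 65%, which underlines the importance of strengthening visual grounding for clinical adoption.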

Link: https://arxiv.org/abs/2508.02951
Authors: Mahtab Bigverdi,Wisdom Ikezogwo,Kevin Zhang,Hyewon Jeong,Mingyu Lu,Sungjae Cho,Linda Shapiro,Ranjay Krishna
Affiliation: unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Multimodal language models (MLMs) show promise for clinical decision support and diagnostic reasoning, raising the prospect of end-to-end automated medical image interpretation. However, clinicians are highly selective in adopting AI tools; a model that makes errors on seemingly simple perception tasks, such as determining image orientation or identifying whether a CT scan is contrast-enhanced, is unlikely to be adopted for clinical tasks. We introduce Medblink, a benchmark designed to probe these models for such perceptual abilities. Medblink spans eight clinically meaningful tasks across multiple imaging modalities and anatomical regions, totaling 1,429 multiple-choice questions over 1,605 images. We evaluate 19 state-of-the-art MLMs, including general purpose (GPT4o, Claude 3.5 Sonnet) and domain specific (Med Flamingo, LLaVA Med, RadFM) models. While human annotators achieve 96.4% accuracy, the best-performing model reaches only 65%. These results show that current MLMs frequently fail at routine perceptual checks, suggesting the need to strengthen their visual grounding to support clinical adoption. Data is available on our project page.

[AI-75] AeroSafe: Mobile Indoor Air Purification using Aerosol Residence Time Analysis and Robotic Cough Emulator Testbed ICRA

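[Quick read]: This paper addresses the fact that current portable air purifiers overlook the concentration of respiratory aerosols produced by coughs, which makes airborne pathogen risk hard to control in high-exposure environments such as healthcare facilities and public spaces. The key contribution combines a physical testbed built around a robotic cough emulator with a digital twin model. A maneuverable mannequin simulates cough events and triggers a response from a portable air purifier while aerosol concentration data is collected. That data then trains a digital twin that fuses a physics-based compartment model with deep learning (LSTM networks and graph convolution layers) to predict aerosol concentration dynamics, with a mean residence time prediction error within 35 seconds. This supports dynamically optimized purification strategies that clearly outperform static air filter placement.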

Link: https://arxiv.org/abs/2508.02947
Authors: M Tanjid Hasan Tonmoy,Rahath Malladi,Kaustubh Singh,Forsad Al Hossain,Rajesh Gupta,Andrés E. Tejada-Martínez,Tauhidur Rahman
Affiliation: unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: Accepted at IEEE International Conference on Robotics and Automation (ICRA) 2025. Author Accepted Manuscript

点击查看摘要

Abstract:Indoor air quality plays an essential role in the safety and well-being of occupants, especially in the context of airborne diseases. This paper introduces AeroSafe, a novel approach aimed at enhancing the efficacy of indoor air purification systems through a robotic cough emulator testbed and a digital-twins-based aerosol residence time analysis. Current portable air filters often overlook the concentrations of respiratory aerosols generated by coughs, posing a risk, particularly in high-exposure environments like healthcare facilities and public spaces. To address this gap, we present a robotic dual-agent physical emulator comprising a maneuverable mannequin simulating cough events and a portable air purifier autonomously responding to aerosols. The generated data from this emulator trains a digital twins model, combining a physics-based compartment model with a machine learning approach, using Long Short-Term Memory (LSTM) networks and graph convolution layers. Experimental results demonstrate the model’s ability to predict aerosol concentration dynamics with a mean residence time prediction error within 35 seconds. The proposed system’s real-time intervention strategies outperform static air filter placement, showcasing its potential in mitigating airborne pathogen risks.

[AI-76] LLM-based IR-system for Bank Supervisors

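[Quick read]: This paper addresses how bank supervisors can keep new supervisory measures consistent with historical precedents. The key to the solution is an information retrieval (IR) system tailored to the supervisory setting. It combines lexical, semantic, and Capital Requirements Regulation (CRR) fuzzy set matching to retrieve, from a large historical database, the past findings most relevant to a current investigation finding together with their associated measures. The system is further fine-tuned with a Transformer-based Denoising AutoEncoder, retains high retrieval accuracy with partially labeled data, and reaches a MAP@100 of 0.83 and an MRR@100 of 0.92, surpassing standalone BM25 and BERT-like models.

Link: https://arxiv.org/abs/2508.02945
Authors: Ilias Aarab
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Applications (stat.AP); Computation (stat.CO)
Comments: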

Abstract:Bank supervisors face the complex task of ensuring that new measures are consistently aligned with historical precedents. To address this challenge, we introduce a novel Information Retrieval (IR) System tailored to assist supervisors in drafting both consistent and effective measures. This system ingests findings from on-site investigations. It then retrieves the most relevant historical findings and their associated measures from a comprehensive database, providing a solid basis for supervisors to write well-informed measures for new findings. Utilizing a blend of lexical, semantic, and Capital Requirements Regulation (CRR) fuzzy set matching techniques, the IR system ensures the retrieval of findings that closely align with current cases. The performance of this system, particularly in scenarios with partially labeled data, is validated through a Monte Carlo methodology, showcasing its robustness and accuracy. Enhanced by a Transformer-based Denoising AutoEncoder for fine-tuning, the final model achieves a Mean Average Precision (MAP@100) of 0.83 and a Mean Reciprocal Rank (MRR@100) of 0.92. These scores surpass those of both standalone lexical models such as BM25 and semantic BERT-like models.
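The two reported metrics can be made concrete with a short sketch. This is not the paper's evaluation code: the function names are hypothetical, and the normalization of AP@k (dividing by the smaller of k and the number of relevant items) is one common convention that the abstract does not specify.

```python
def average_precision_at_k(ranked_ids, relevant_ids, k=100):
    """AP@k for one query: average of precision at each rank holding a relevant item."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed convention: normalize by the best achievable number of hits in the top k.
    return precision_sum / min(len(relevant), k)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=100):
    """1 / rank of the first relevant item in the top k, or 0.0 if there is none."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, k=100):
    """Average a per-query metric over (ranked_ids, relevant_ids) pairs."""
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Made-up toy data: two queries with their ranked results and relevant findings.
runs = [
    (["a", "b", "c"], ["a", "c"]),
    (["x", "y", "z"], ["y"]),
]
map_at_100 = mean_over_queries(average_precision_at_k, runs)
mrr_at_100 = mean_over_queries(reciprocal_rank_at_k, runs)
```

MRR only rewards the position of the first relevant hit, while MAP rewards ranking all relevant findings early, which is why a system can score higher on MRR@100 than on MAP@100.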

[AI-77] AQUAH: Automatic Quantification and Unified Agent in Hydrology ICCV

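[Quick read]: This paper addresses the low efficiency and high barrier to entry in hydrologic modeling, where data acquisition, model configuration, and result interpretation depend heavily on expert experience. The key to the solution is AQUAH, the first end-to-end language-based agent for hydrologic modeling. From a natural-language instruction it autonomously retrieves terrain, forcing, and gauge data, configures a hydrologic model, runs the simulation, and generates a self-contained PDF report. Its core innovation is the use of vision-enabled large language models that interpret maps and rasters on the fly and steer key decisions such as outlet selection, parameter initialization, and uncertainty commentary, enabling cold-start simulations and interpretable output without manual intervention.

Link: https://arxiv.org/abs/2508.02936
Authors: Songkun Yan,Zhi Li,Siyu Zhu,Yixin Wen,Mofan Zhang,Mengye Chen,Jie Cao,Yang Hong
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: 8 pages, 5 figures, 2025 ICCV SEA workshop paper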

Abstract:We introduce AQUAH, the first end-to-end language-based agent designed specifically for hydrologic modeling. Starting from a simple natural-language prompt (e.g., ‘simulate floods for the Little Bighorn basin from 2020 to 2022’), AQUAH autonomously retrieves the required terrain, forcing, and gauge data; configures a hydrologic model; runs the simulation; and generates a self-contained PDF report. The workflow is driven by vision-enabled large language models, which interpret maps and rasters on the fly and steer key decisions such as outlet selection, parameter initialization, and uncertainty commentary. Initial experiments across a range of U.S. basins show that AQUAH can complete cold-start simulations and produce analyst-ready documentation without manual intervention. The results are judged by hydrologists as clear, transparent, and physically plausible. While further calibration and validation are still needed for operational deployment, these early outcomes highlight the promise of LLM-centered, vision-grounded agents to streamline complex environmental modeling and lower the barrier between Earth observation data, physics-based tools, and decision makers.

[AI-78] Realizing Scaling Laws in Recommender Systems: A Foundation-Expert Paradigm for Hyperscale Model Deployment

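[Quick read]: This paper addresses the challenge of efficiently deploying hyperscale foundation models (FMs) in large-scale recommender systems, in particular the difficulties specific to recommendation: learning from online streaming data under shifting data distributions, adapting to the diverse downstream tasks and input distributions of different recommendation surfaces, and meeting stringent latency and compute constraints. The key to the solution is the Foundation-Expert Paradigm: a central FM learns generalizable knowledge from cross-surface, multi-modal user data and transfers it through target-aware embeddings to lightweight, surface-specific "expert" models, which adapt to local data distributions and optimization goals with minimal overhead. The approach is supported by the HyperCast infrastructure system, which re-engineers training, serving, logging, and iteration. It is deployed in production at Meta, serves tens of billions of user requests per day, and improves online metrics and developer velocity.

Link: https://arxiv.org/abs/2508.02929
Authors: Dai Li,Kevin Course,Wei Li,Hongwei Li,Jie Hua,Yiqi Chen,Zhao Zhu,Rui Jian,Xuan Cao,Bi Xue,Yu Shi,Jing Qian,Kai Ren,Matt Ma,Qunshu Zhang,Rui Li
Affiliations: unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: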

Abstract:While scaling laws promise significant performance gains for recommender systems, efficiently deploying hyperscale models remains a major unsolved challenge. In contrast to fields where FMs are already widely adopted such as natural language processing and computer vision, progress in recommender systems is hindered by unique challenges including the need to learn from online streaming data under shifting data distributions, the need to adapt to different recommendation surfaces with a wide diversity in their downstream tasks and their input distributions, and stringent latency and computational constraints. To bridge this gap, we propose to leverage the Foundation-Expert Paradigm: a framework designed for the development and deployment of hyperscale recommendation FMs. In our approach, a central FM is trained on lifelong, cross-surface, multi-modal user data to learn generalizable knowledge. This knowledge is then efficiently transferred to various lightweight, surface-specific "expert" models via target-aware embeddings, allowing them to adapt to local data distributions and optimization goals with minimal overhead. To meet our training, inference and development needs, we built HyperCast, a production-grade infrastructure system that re-engineers training, serving, logging and iteration to power this decoupled paradigm. Our approach is now deployed at Meta serving tens of billions of user requests daily, demonstrating online metric improvements over our previous one-stage production system while improving developer velocity and maintaining infrastructure efficiency. To the best of our knowledge, this work represents the first successful deployment of a Foundation-Expert paradigm at this scale, offering a proven, compute-efficient, and developer-friendly blueprint to realize the promise of scaling laws in recommender systems.

[AI-79] GrandJury: A Collaborative Machine Learning Model Evaluation Protocol for Dynamic Quality Rubrics

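[Quick read]: This paper addresses a core problem in how generative machine learning models are evaluated: static benchmarks cannot capture the plural, context-dependent nature of acceptable responses in real applications, so optimization drifts toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. The key to the solution is GrandJury, a formal evaluation protocol that combines time-decayed aggregation, complete traceability, dynamic and transparent task rubric attribution, and multi-rater human judgment. Together these yield pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement, offering a methodology for evaluating AI outputs when no absolute ground truth exists.

Link: https://arxiv.org/abs/2508.02926
Authors: Arthur Cho
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 26 pages, 1 table. Open-source implementation available on PyPI (grandjury package) and GitHub. Dataset available on Hugging Face under CC-BY-4.0 license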

Abstract:Generative Machine Learning models have become central to modern systems, powering applications in creative writing, summarization, multi-hop reasoning, and context-aware dialogue. These models underpin large-scale AI assistants, workflow automation, and autonomous decision-making. In such domains, acceptable response is rarely absolute or static, but plural and highly context-dependent. Yet standard evaluation regimes still rely on static, benchmark-style tests, incentivizing optimization toward leaderboard scores rather than alignment with dynamic user needs or evolving realities. GrandJury introduces a formal evaluation protocol combining time-decayed aggregation, complete traceability, with the support of dynamic, transparent task rubric attribution, and multi-rater human judgment. Together, these elements enable pluralistic, accountable evaluation that captures evolving consensus and surfaces disagreement. We provide an open-source implementation (grandjury PyPI package) and a public collection of Large Language Model (LLM) inference outputs to illustrate the need and method. GrandJury provides a new paradigm for AI practitioners when evaluating machine learning outputs without absolute ground truth.
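To illustrate what time-decayed aggregation of multi-rater judgments can look like, here is a minimal sketch. The abstract does not state the decay function, so the exponential half-life weighting, the function name, and the `half_life` parameter are all assumptions for illustration and not the API of the grandjury package.

```python
def time_decayed_score(votes, now, half_life=30.0):
    """Weighted mean of rater scores where a vote's weight halves every `half_life` days.

    votes: list of (score, timestamp) pairs; timestamps and `now` are in days.
    The exponential decay is an assumed choice, not taken from the paper.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for score, timestamp in votes:
        age = max(0.0, now - timestamp)
        weight = 0.5 ** (age / half_life)
        weighted_sum += weight * score
        weight_total += weight
    if weight_total == 0.0:
        raise ValueError("at least one vote is required")
    return weighted_sum / weight_total


# Made-up example: an old approving vote and a fresh rejecting vote.
votes = [(1.0, 0.0), (0.0, 30.0)]
score_today = time_decayed_score(votes, now=30.0)
```

With this weighting a recent judgment counts twice as much as one that is a half-life old, so the aggregate drifts toward the current consensus while older votes still remain traceable in the input.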

[AI-80] PentestJudge: Judging Agent Behavior Against Operational Requirements

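[Quick read]: This paper addresses how to evaluate the quality of a penetration testing agent's behavior in real operations, especially for complex security tasks that are impractical to judge programmatically. The core challenge is that existing methods cannot comprehensively and automatically measure whether an agent meets multi-dimensional criteria covering operational objectives, operational security, and tradecraft. The key to the solution is PentestJudge, an LLM-as-judge evaluation framework. It builds tree-structured rubrics that decompose a penetration testing task into leaf-level yes-or-no criteria, and gives the LLM tools for reading the agent's state trajectories and tool call history. In the experiments the best model reaches an F1 score of 0.83, and models with stronger tool-use ability come closer to human experts. Cheaper models can also judge the trajectories produced by stronger models, which suggests that verification may be easier than generation and offers a scalable evaluation path for deploying AI-based security agents in production.

Link: https://arxiv.org/abs/2508.02921
Authors: Shane Caldwell,Max Harley,Michael Kouremetis,Vincent Abruzzo,Will Pearce
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
Comments: 18 pages, 5 figures, 3 tables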

Abstract:We introduce PentestJudge, a system for evaluating the operations of penetration testing agents. PentestJudge is a large language model (LLM)-as-judge with access to tools that allow it to consume arbitrary trajectories of agent states and tool call history to determine whether a security agent’s actions meet certain operating criteria that would be impractical to evaluate programmatically. We develop rubrics that use a tree structure to hierarchically collapse the penetration testing task for a particular environment into smaller, simpler, and more manageable sub-tasks and criteria until each leaf node represents simple yes-or-no criteria for PentestJudge to evaluate. Task nodes are broken down into different categories related to operational objectives, operational security, and tradecraft. LLM-as-judge scores are compared to human domain experts as a ground-truth reference, allowing us to compare their relative performance with standard binary classification metrics, such as F1 scores. We evaluate several frontier and open-source models acting as judge agents, with the best model reaching an F1 score of 0.83. We find models that are better at tool-use perform more closely to human experts. By stratifying the F1 scores by requirement type, we find even models with similar overall scores struggle with different types of questions, suggesting certain models may be better judges of particular operating criteria. We find that weaker and cheaper models can judge the trajectories of pentests performed by stronger and more expensive models, suggesting verification may be easier than generation for the penetration testing task. We share this methodology to facilitate future research in understanding the ability of judges to holistically and scalably evaluate the process quality of AI-based information security agents so that they may be confidently used in sensitive production environments.
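The rubric structure and the scoring against human experts can be sketched as follows. The three category names come from the abstract; the `RubricNode` class, the leaf questions, and the judgment lists are hypothetical placeholders rather than the authors' data or code.

```python
from dataclasses import dataclass, field


@dataclass
class RubricNode:
    """A node in a tree-structured rubric; leaves are yes-or-no criteria."""
    name: str
    children: list = field(default_factory=list)

    def leaves(self):
        if not self.children:
            return [self]
        found = []
        for child in self.children:
            found.extend(child.leaves())
        return found


def f1_score(judge_labels, human_labels):
    """Binary F1 of judge verdicts, treating the human verdicts as ground truth."""
    pairs = list(zip(judge_labels, human_labels))
    true_pos = sum(1 for judge, human in pairs if judge and human)
    false_pos = sum(1 for judge, human in pairs if judge and not human)
    false_neg = sum(1 for judge, human in pairs if not judge and human)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rubric: the leaf questions are invented for illustration.
rubric = RubricNode("pentest engagement", [
    RubricNode("operational objectives", [RubricNode("obtained the target credentials?")]),
    RubricNode("operational security", [RubricNode("avoided noisy network scans?")]),
    RubricNode("tradecraft", [
        RubricNode("cleaned up dropped files?"),
        RubricNode("documented each privilege escalation?"),
    ]),
])

judge_verdicts = [True, True, False, True]   # one verdict per leaf, made up
human_verdicts = [True, False, False, True]
agreement_f1 = f1_score(judge_verdicts, human_verdicts)
```

Because every leaf is a yes-or-no question, judge quality reduces to ordinary binary classification, and the F1 score can be stratified by the category of the leaf's ancestor.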

[AI-81] Enhancing Japanese Large Language Models with Reasoning Vectors

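[Quick read]: This paper addresses the difficulty of improving Japanese large language models (LLMs) through post-training, which is limited mainly by the resources required. The key to the solution builds on the idea of task vectors: reasoning vectors are extracted from LLMs with reasoning ability and applied to Japanese LLMs to strengthen their reasoning and overall performance. The method needs little compute and is a simple, effective improvement strategy.

Link: https://arxiv.org/abs/2508.02913
Authors: Carolina Minami Oguchi,Leo Wei,Koyo Kobayashi,Hsin-Tai Wu,Dipak Ghosal
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: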

Abstract:Post-training methods have improved the performance and enhanced the reasoning capability for mainstream large language models (LLMs), but the same is challenging for Japanese LLMs to achieve due to the amount of resources required. Inspired by task vectors that extract the change of weights before and after training, specifically for a certain task, we obtain reasoning vectors from reasoning LLMs and apply them to Japanese LLMs to boost their performance. While the resources available present a challenge to improve Japanese LLMs, we present a simple and effective way to obtain high improvement and hope to inspire similar work for other languages.
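The task vector idea mentioned in the abstract is plain weight arithmetic, sketched below on toy parameter dictionaries. This is a generic illustration, not the authors' pipeline: the function names and the `scale` parameter are assumptions, and it presumes that all three models share the same architecture and parameter names.

```python
def extract_task_vector(tuned_weights, base_weights):
    """Per-parameter difference between a post-trained model and its base model."""
    return {
        name: [t - b for t, b in zip(tuned_weights[name], base_weights[name])]
        for name in base_weights
    }


def apply_task_vector(target_weights, task_vector, scale=1.0):
    """Add a (scaled) task vector to another model with the same parameter layout."""
    return {
        name: [w + scale * d for w, d in zip(target_weights[name], task_vector[name])]
        for name in target_weights
    }


# Toy weights standing in for real checkpoints (all values are made up).
base_model = {"layer0": [1.0, 2.0]}
reasoning_model = {"layer0": [1.5, 1.0]}   # base model after reasoning post-training
japanese_model = {"layer0": [0.0, 4.0]}    # same architecture, Japanese-tuned

reasoning_vector = extract_task_vector(reasoning_model, base_model)
enhanced_japanese = apply_task_vector(japanese_model, reasoning_vector, scale=1.0)
```

The appeal of this approach is that it needs only checkpoint arithmetic and no gradient updates, which matches the abstract's emphasis on limited resources.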

[AI-82] Engineered over Emergent Communication in MARL for Scalable and Sample-Efficient Cooperative Task Allocation in a Partially Observable Grid

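[Quick read]: This paper addresses the effectiveness of communication strategies in multi-agent reinforcement learning (MARL) by comparing learned and engineered communication on cooperative tasks. It proposes two mechanisms. The first is Learned Direct Communication (LDC), in which a neural network generates messages and actions concurrently. The second is the engineered Intention Communication, which uses an Imagined Trajectory Generation Module (ITGM) and a Message Generation Network (MGN) to construct messages from predicted future states. The results indicate that emergent communication is viable, but the engineered approach achieves better success rates and scalability as environmental complexity increases.

Link: https://arxiv.org/abs/2508.02912
Authors: Brennen A. Hill,Mant Koh En Wei,Thangavel Jishnuanandh
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: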

Abstract:We compare the efficacy of learned versus engineered communication strategies in a cooperative multi-agent reinforcement learning (MARL) environment. For the learned approach, we introduce Learned Direct Communication (LDC), where agents generate messages and actions concurrently via a neural network. Our engineered approach, Intention Communication, employs an Imagined Trajectory Generation Module (ITGM) and a Message Generation Network (MGN) to formulate messages based on predicted future states. Both strategies are evaluated on their success rates in cooperative tasks under fully and partially observable conditions. Our findings indicate that while emergent communication is viable, the engineered approach demonstrates superior performance and scalability, particularly as environmental complexity increases.

[AI-83] Seemingly Simple Planning Problems are Computationally Challenging: The Countdown Game

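[Quick read]: This paper addresses the limited long-term planning ability of current foundation models and agents, and the inability of existing planning benchmarks to measure it. Existing benchmarks are either loosely defined (for example travel planning) or rely on international planning competition domains that were designed to probe the weaknesses of classical automated planners, so they do not reflect model planning performance well. The key to the solution is a procedure for building a planning benchmark around the game Countdown: each instance has an intuitive natural-language description, the problem is computationally hard (NP-complete), and the instance space is rich enough to rule out memorization. The authors establish the complexity result theoretically and show that their instance generation procedure has advantages over public benchmarks. Experiments show that, unlike the 24 Game (a special case of Countdown), this dynamic benchmark remains highly challenging for current LLM-based planning methods.

Link: https://arxiv.org/abs/2508.02900
Authors: Michael Katz,Harsha Kokel,Sarath Sreedharan
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI)
Comments: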

Abstract:There is a broad consensus that the inability to form long-term plans is one of the key limitations of current foundational models and agents. However, the existing planning benchmarks remain woefully inadequate to truly measure their planning capabilities. Most existing benchmarks either focus on loosely defined tasks like travel planning or end up leveraging existing domains and problems from international planning competitions. While the former tasks are hard to formalize and verify, the latter were specifically designed to test and challenge the weaknesses of existing automated planners. To address these shortcomings, we propose a procedure for creating a planning benchmark centered around the game called Countdown, where a player is expected to form a target number from a list of input numbers through arithmetic operations. We discuss how this problem meets many of the desiderata associated with an ideal benchmark for planning capabilities evaluation. Specifically, the domain allows for an intuitive, natural language description for each problem instance, it is computationally challenging (NP-complete), and the instance space is rich enough that we do not have to worry about memorization. We perform an extensive theoretical analysis, establishing the computational complexity result and demonstrate the advantage of our instance generation procedure over public benchmarks. We evaluate a variety of existing LLM-assisted planning methods on instances generated using our procedure. Our results show that, unlike other domains like 24 Game (a special case of Countdown), our proposed dynamic benchmark remains extremely challenging for existing LLM-based approaches.
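For readers unfamiliar with the game, a brute-force solver makes the search space tangible. This is a sketch under assumed rules (each input number used at most once, intermediate results must be positive integers, not all numbers need to be used); the paper's exact formalization may differ, and the function name is hypothetical.

```python
from itertools import combinations


def solve_countdown(numbers, target):
    """Return an expression string that evaluates to `target`, or None if none exists."""

    def search(items):
        # items is a list of (value, expression) pairs still available.
        for value, expression in items:
            if value == target:
                return expression
        for i, j in combinations(range(len(items)), 2):
            (a, expr_a), (b, expr_b) = items[i], items[j]
            if a < b:  # keep a >= b so subtraction and division stay positive
                a, expr_a, b, expr_b = b, expr_b, a, expr_a
            rest = [items[k] for k in range(len(items)) if k not in (i, j)]
            candidates = [
                (a + b, f"({expr_a} + {expr_b})"),
                (a * b, f"({expr_a} * {expr_b})"),
            ]
            if a > b:
                candidates.append((a - b, f"({expr_a} - {expr_b})"))
            if a % b == 0:
                candidates.append((a // b, f"({expr_a} / {expr_b})"))
            for candidate in candidates:
                found = search(rest + [candidate])
                if found is not None:
                    return found
        return None

    return search([(n, str(n)) for n in numbers])


solution = solve_countdown([4, 7, 8, 8], 24)   # a 24 Game style instance
unsolvable = solve_countdown([1, 1], 5)
```

The exhaustive search grows rapidly with the number of inputs, which is consistent with the hardness result the abstract reports and explains why larger generated instances are hard to shortcut.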

[AI-84] CauKer: classification time series foundation models can be pretrained on synthetic data only

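[Quick read]: This paper addresses the dependence of time series foundation models (TSFMs) on computationally costly pretraining over large, carefully curated real-world sequences, with the aim of sample-efficient pretraining. The core of the solution is CauKer, an algorithm that combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to generate causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions, supporting efficient pretraining of TSFMs with different architectures and pretraining strategies.

Link: https://arxiv.org/abs/2508.02879
Authors: Shifeng Xie,Vasilii Feofanov,Marius Alonso,Ambroise Odonnat,Jianfeng Zhang,Themis Palpanas,Ievgen Redko
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: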

Abstract:Time series foundation models (TSFMs) have recently gained significant attention due to their strong zero-shot capabilities and widespread real-world applications. Such models typically require a computationally costly pretraining on large-scale, carefully curated collections of real-world sequences. To allow for a sample-efficient pretraining of TSFMs, we propose CauKer, a novel algorithm designed to generate diverse, causally coherent synthetic time series with realistic trends, seasonality, and nonlinear interactions. CauKer combines Gaussian Process (GP) kernel composition with Structural Causal Models (SCM) to produce data for sample-efficient pretraining of state-of-the-art classification TSFMs having different architectures and following different pretraining approaches. Additionally, our experiments reveal that CauKer-generated datasets exhibit clear scaling laws for both dataset size (10K to 10M samples) and model capacity (1M to 783M parameters), unlike real-world datasets, which display irregular scaling behavior.

[AI-85] Beyond Least Squares: Robust Regression Transformer (R2T)

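[Quick read]: This paper addresses the degradation of conventional robust regression techniques under asymmetric structured noise, in particular the failure of least-squares-based methods when the noise is not Gaussian. The key to the solution is a hybrid neural-symbolic architecture: a transformer encoder processes numerical sequences, a compression neural network predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. The model is trained on synthetic data to recover the original sequence after asymmetric structured noise has been added, which amounts to a symbolic fit guided by neural parameter estimation. On synthetic wearable data it achieves a median regression MSE of 6e-6 to 3.5e-5, a 10 to 300 times improvement over ordinary least squares and robust regression methods such as Huber loss or SoftL1.

Link: https://arxiv.org/abs/2508.02874
Authors: Roman Gutierrez,Tony Kai Tang,Isabel Gutierrez
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
Comments: 10 pages, 4 figures, 1 table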

Abstract:Robust regression techniques rely on least-squares optimization, which works well for Gaussian noise but fails in the presence of asymmetric structured noise. We propose a hybrid neural-symbolic architecture where a transformer encoder processes numerical sequences, a compression NN predicts symbolic parameters, and a fixed symbolic equation reconstructs the original sequence. Using synthetic data, the training objective is to recover the original sequence after adding asymmetric structured noise, effectively learning a symbolic fit guided by neural parameter estimation. Our model achieves a median regression MSE of 6e-6 to 3.5e-5 on synthetic wearable data, which is a 10-300 times improvement when compared with ordinary least squares fit and robust regression techniques such as Huber loss or SoftL1.
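As background on the baselines named in the abstract, the sketch below contrasts the squared loss with the standard Huber loss on residuals that include one outlier. It shows the baseline losses only, not the R2T model; the residual values and the `delta` threshold are made up.

```python
def squared_loss(residual):
    """Least-squares penalty, which grows quadratically with the residual."""
    return 0.5 * residual ** 2


def huber_loss(residual, delta=1.0):
    """Huber penalty: quadratic near zero, linear beyond `delta`."""
    magnitude = abs(residual)
    if magnitude <= delta:
        return 0.5 * magnitude ** 2
    return delta * (magnitude - 0.5 * delta)


# Made-up residuals: three small errors and one large one-sided outlier.
residuals = [0.1, -0.2, 0.05, 8.0]
total_squared = sum(squared_loss(r) for r in residuals)
total_huber = sum(huber_loss(r) for r in residuals)
```

Huber reduces the weight of the outlier but still treats positive and negative residuals symmetrically, which gives some intuition for why one-sided structured noise can remain difficult for such losses.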

[AI-86] A Multi-Agent System for Complex Reasoning in Radiology Visual Question Answering

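[Quick read]: This paper addresses the insufficient factual accuracy, hallucinations, and cross-modal misalignment of multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) methods in radiology visual question answering (RVQA). The key to the solution is a multi-agent system (MAS) with specialized agents for context understanding, multimodal reasoning, and answer validation. The agents cooperate to decompose and check complex reasoning, improving the reliability, interpretability, and accuracy of RVQA.

Link: https://arxiv.org/abs/2508.02841
Authors: Ziruo Yi,Jinyu Liu,Ting Xiao,Mark V. Albert
Affiliations: unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: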

Abstract:Radiology visual question answering (RVQA) provides precise answers to questions about chest X-ray images, alleviating radiologists’ workload. While recent methods based on multimodal large language models (MLLMs) and retrieval-augmented generation (RAG) have shown promising progress in RVQA, they still face challenges in factual accuracy, hallucinations, and cross-modal misalignment. We introduce a multi-agent system (MAS) designed to support complex reasoning in RVQA, with specialized agents for context understanding, multimodal reasoning, and answer validation. We evaluate our system on a challenging RVQA set curated via model disagreement filtering, comprising consistently hard cases across multiple MLLMs. Extensive experiments demonstrate the superiority and effectiveness of our system over strong MLLM baselines, with a case study illustrating its reliability and interpretability. This work highlights the potential of multi-agent approaches to support explainable and trustworthy clinical AI applications that require complex reasoning.

[AI-87] Learning from B Cell Evolution: Adaptive Multi-Expert Diffusion for Antibody Design via Online Optimization

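[Quick read]: This paper addresses a common limitation of diffusion-based antibody design: a uniform generation strategy cannot adapt to the unique requirements of each antigen, which makes it hard to balance affinity, stability, and self-avoidance in multi-objective optimization. The key to the solution is the first biologically motivated framework inspired by B cell affinity maturation. It embeds physics-based domain knowledge in an online meta-learning system built from several specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry). The expert parameters evolve during generation according to iterative feedback, mimicking the refinement cycles of natural antibodies. The method discovers a personalized optimization strategy for each antibody-antigen system, obtains SE(3)-equivariant guidance strategies without pre-training, improves hotspot coverage and interface quality, reaches the balanced multi-objective optimization expected of therapeutic antibodies, and generalizes across design challenges from small epitopes to large protein interfaces.

Link: https://arxiv.org/abs/2508.02834
Authors: Hanqi Feng,Peng Qiu,Mengchun Zhang,Yiran Tao,You Fan,Jingtao Xu,Barnabas Poczos
Affiliations: unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: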

Abstract:Recent advances in diffusion models have shown remarkable potential for antibody design, yet existing approaches apply uniform generation strategies that cannot adapt to each antigen’s unique requirements. Inspired by B cell affinity maturation, where antibodies evolve through multi-objective optimization balancing affinity, stability, and self-avoidance, we propose the first biologically-motivated framework that leverages physics-based domain knowledge within an online meta-learning system. Our method employs multiple specialized experts (van der Waals, molecular recognition, energy balance, and interface geometry) whose parameters evolve during generation based on iterative feedback, mimicking natural antibody refinement cycles. Instead of fixed protocols, this adaptive guidance discovers personalized optimization strategies for each target. Our experiments demonstrate that this approach: (1) discovers optimal SE(3)-equivariant guidance strategies for different antigen classes without pre-training, preserving molecular symmetries throughout optimization; (2) significantly enhances hotspot coverage and interface quality through target-specific adaptation, achieving balanced multi-objective optimization characteristic of therapeutic antibodies; (3) establishes a paradigm for iterative refinement where each antibody-antigen system learns its unique optimization profile through online evaluation; (4) generalizes effectively across diverse design challenges, from small epitopes to large protein interfaces, enabling precision-focused campaigns for individual targets.

[AI-88] Automated Validation of LLM-based Evaluators for Software Engineering Artifacts

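[Quick read]: This paper addresses the reliability of LLMs used as evaluators of software engineering artifacts: existing automated methods fail to discern fine-grained differences in artifact quality, while human evaluation is costly and does not scale. The key to the solution is the REFINE framework, which has two core modules. Hierarchy Dataset Builder uses novel generation techniques to automatically synthesize artifacts of progressively reduced quality, and Evaluator Tester quantifies an evaluator by measuring how closely its rankings align with the expected ordering. The method is controllable: users can tune the granularity of degradation from coarse filtering to stress testing on subtle quality gaps, and so refine LLM-as-a-Judge configurations.

Link: https://arxiv.org/abs/2508.02827
Authors: Ora Nova Fandina,Eitan Farchi,Shmulik Froimovich,Rami Katan,Alice Podolsky,Orna Raz,Avi Ziv
Affiliations: unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: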

Abstract:Automation in software engineering increasingly relies on large language models (LLMs) to generate, review, and assess code artifacts. However, establishing LLMs as reliable evaluators remains an open challenge: human evaluations are costly, subjective and non scalable, while existing automated methods fail to discern fine grained variations in artifact quality. We introduce REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation), an automated framework for benchmarking LLM based evaluators across software engineering tasks. REFINE comprises two modules: Hierarchy Dataset Builder applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality, and Evaluator Tester quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering. A key feature of REFINE is controllability: users can tune the granularity of degradation to progressively refine evaluator configurations, from coarse filtering to stress testing on subtle quality gaps. While the methodology is general, we focus on coding tasks reflecting the practical demands in our production setting. REFINE was integrated into IBM’s internal development workflows and applied to code generation, translation, and summarization for COBOL, an enterprise critical programming language, using industrial data. It was used to identify LLM as a Judge configurations that lifted alignment scores from below 0.7 to above 0.9 in some coding tasks. These nuance sensitive evaluators are now actively used by model training teams to support model release decisions.
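One way to picture how an evaluator's rankings can be compared with an expected ordering is pairwise concordance, sketched below. The abstract does not define REFINE's alignment score, so this metric, the function name, and the example scores are assumptions for illustration only.

```python
from itertools import combinations


def alignment_score(scores_best_to_worst):
    """Fraction of artifact pairs that the evaluator orders as expected.

    scores_best_to_worst: evaluator scores listed from the original artifact
    to the most degraded one. Ties count as disagreements (assumed convention).
    """
    pairs = list(combinations(range(len(scores_best_to_worst)), 2))
    if not pairs:
        raise ValueError("need at least two artifacts to compare")
    concordant = sum(
        1 for better, worse in pairs
        if scores_best_to_worst[better] > scores_best_to_worst[worse]
    )
    return concordant / len(pairs)


# Made-up evaluator scores for one hierarchy of four degradation levels.
perfect_evaluator = alignment_score([9, 7, 5, 3])
confused_evaluator = alignment_score([9, 5, 7, 3])
```

Finer degradation steps yield artifacts that sit closer together in quality, so the same evaluator will generally score lower on this kind of measure as the hierarchy becomes more fine-grained.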

[AI-89] TransAM: Transformer-Based Agent Modeling for Multi-Agent Systems via Local Trajectory Encoding

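[Quick read]: This paper addresses agent modeling in multi-agent systems: how to learn a robust representation of other agents' policies from only the local trajectory of the controlled agent, without access to the other agents' trajectories. Conventional methods usually assume that other agents' episodic trajectories are available, which is often unrealistic in practice. The key to the solution is TransAM, a transformer-based agent modeling approach that encodes local trajectories into an embedding space that captures other agents' policies. Experiments in cooperative, competitive, and mixed environments show that it produces strong policy representations, improves agent modeling, and leads to higher episodic returns.

Link: https://arxiv.org/abs/2508.02826
Authors: Conor Wallace,Umer Siddique,Yongcan Cao
Affiliations: unknown
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
Comments: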

Abstract:Agent modeling is a critical component in developing effective policies within multi-agent systems, as it enables agents to form beliefs about the behaviors, intentions, and competencies of others. Many existing approaches assume access to other agents’ episodic trajectories, a condition often unrealistic in real-world applications. Consequently, a practical agent modeling approach must learn a robust representation of the policies of the other agents based only on the local trajectory of the controlled agent. In this paper, we propose TransAM, a novel transformer-based agent modeling approach to encode local trajectories into an embedding space that effectively captures the policies of other agents. We evaluate the performance of the proposed method in cooperative, competitive, and mixed multi-agent environments. Extensive experimental results demonstrate that our approach generates strong policy representations, improves agent modeling, and leads to higher episodic returns.

[AI-90] Real-World Receptivity to Adaptive Mental Health Interventions: Findings from an In-the-Wild Study

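[Quick read]: This paper addresses how to improve user receptivity to smartphone-based, real-time mental health interventions. Prior work has mostly focused on detecting when a user might benefit from an intervention, with less attention to receptivity, that is, the user's willingness and ability to engage with and act on the intervention, studied here through acceptance and feasibility. The key to the solution is an adaptive intervention mechanism based on reinforcement learning: Thompson Sampling adapts intervention delivery using passively sensed data (such as phone usage patterns and location) and actively reported context, with the goal of maximizing a combined receptivity reward and delivering support that is both timely and actionable.

Link: https://arxiv.org/abs/2508.02817
Authors: Nilesh Kumar Sahu,Aditya Sneh,Snehil Gupta,Haroon R Lone
Affiliations: unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Signal Processing (eess.SP)
Comments: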

Abstract:The rise of mobile health (mHealth) technologies has enabled real-time monitoring and intervention for mental health conditions using passively sensed smartphone data. Building on these capabilities, Just-in-Time Adaptive Interventions (JITAIs) seek to deliver personalized support at opportune moments, adapting to users’ evolving contexts and needs. Although prior research has examined how context affects user responses to generic notifications and general mHealth messages, relatively little work has explored its influence on engagement with actual mental health interventions. Furthermore, while much of the existing research has focused on detecting when users might benefit from an intervention, less attention has been paid to understanding receptivity, i.e., users’ willingness and ability to engage with and act upon the intervention. In this study, we investigate user receptivity through two components: acceptance (acknowledging or engaging with a prompt) and feasibility (ability to act given situational constraints). We conducted a two-week in-the-wild study with 70 students using a custom Android app, LogMe, which collected passive sensor data and active context reports to prompt mental health interventions. The adaptive intervention module was built using Thompson Sampling, a reinforcement learning algorithm. We address four research questions relating smartphone features and self-reported contexts to acceptance and feasibility, and examine whether an adaptive reinforcement learning approach can optimize intervention delivery by maximizing a combined receptivity reward. Our results show that several types of passively sensed data significantly influenced user receptivity to interventions. Our findings contribute insights into the design of context-aware, adaptive interventions that are not only timely but also actionable in real-world settings.
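Thompson Sampling, the algorithm named in the abstract, can be sketched with a Beta-Bernoulli bandit. This is a generic textbook version and not the LogMe implementation: treating receptivity as a binary reward, the two "arms" (imagined delivery contexts), and the simulated response rates are all assumptions.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling with a Beta(1, 1) prior on each arm's success rate."""

    def __init__(self, n_arms, seed=None):
        self.successes = [1] * n_arms   # Beta alpha parameters
        self.failures = [1] * n_arms    # Beta beta parameters
        self.rng = random.Random(seed)

    def select_arm(self):
        # Draw one plausible success rate per arm and pick the best draw.
        samples = [
            self.rng.betavariate(alpha, beta)
            for alpha, beta in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds=2000, seed=0):
    """Run the bandit against hypothetical per-arm receptivity rates; return pull counts."""
    environment = random.Random(seed)
    bandit = BetaBernoulliThompson(len(true_rates), seed=seed + 1)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = bandit.select_arm()
        reward = environment.random() < true_rates[arm]
        bandit.update(arm, reward)
        pulls[arm] += 1
    return pulls


pulls = simulate([0.2, 0.8])   # arm 1 is the more receptive context in this toy setup
```

Sampling from the posterior keeps some exploration of rarely tried contexts while concentrating prompts where users have responded so far.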

[AI-91] Adaptive Knowledge Distillation for Device-Directed Speech Detection INTERSPEECH

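[Quick read]: This paper addresses the tension between accuracy and deployment efficiency in device-directed speech detection (DDSD): how to better separate a user's queries to a voice assistant (VA) from background speech or side conversations while keeping the model lightweight and efficient at inference. The key to the solution is a novel adaptive knowledge distillation (KD) method. Task-specific adapters are added on top of the frozen acoustic encoder of a large pre-trained ASR model (the teacher) and trained jointly with the student model, so that the teacher's general acoustic representations transfer to the student. This improves DDSD in both keyword and keyword-free (follow-up) invocations, with Equal Error Rate improvements of 26% and 19% respectively, and generalizes across transformer and conformer architectures.

Link: https://arxiv.org/abs/2508.02801
Authors: Hyung Gun Chi,Florian Pesce,Wonil Chang,Oggi Rudovic,Arturo Argueta,Stefan Braun,Vineet Garg,Ahmed Hussen Abdelaziz
Affiliations: unknown
Categories: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Comments: 5 pages, 2 figures, Interspeech accepted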

Abstract:Device-directed speech detection (DDSD) is a binary classification task that separates the user’s queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving naturalistic user experience. To this end, we propose knowledge distillation (KD) to enhance DDSD accuracy while ensuring efficient deployment. Specifically, we introduce a novel adaptive KD method that transfers knowledge from general representations of an ASR large pre-trained acoustic encoder (teacher). We apply task-specific adapters, on top of the (frozen) teacher encoder, trained jointly with the student model on DDSD. We demonstrate that the proposed adaptive KD outperforms the student model without distillation in the keyword and keyword-free (follow-up) invocations, with an improvement of +26% and +19% in terms of Equal Error Rate, respectively. We also show that this approach generalizes across the transformer and conformer-based model architectures.

[AI-92] Cognitive Loop via In-Situ Optimization: Self-Adaptive Reasoning for Science

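[Quick read]: This paper addresses limitations of current AI for scientific discovery: existing systems are either frameworks over non-reasoning models that build in opinions on how humans think, or reasoning models that abstract precise control of the reasoning process away from end users. To give scientists accuracy, transparency, and steerability, the authors propose a cognitive loop via in-situ optimization (CLIO). The key is CLIO's open design, which lets large language models (LLMs) self-formulate ways of approaching a problem, adapt their behavior when self-confidence is low, and expose how the final belief state is formed as a graph structure, so that scientists can observe uncertainty levels, follow the reasoning path, and interject corrections. On text-based biology and medicine questions from Humanity's Last Exam (HLE), this yields a 161.64% relative increase in accuracy over base GPT-4.1. The paper also finds that oscillations in internal uncertainty measures are key to determining the accuracy of the results.

Link: https://arxiv.org/abs/2508.02789
Authors: Newman Cheng, Gordon Broadbent, William Chappell
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: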

Abstract:The capacity for artificial intelligence (AI) to formulate, evolve, and test altered thought patterns under dynamic conditions indicates advanced cognition that is crucial for scientific discovery. The existing AI development landscape falls into two categories: 1) frameworks over non-reasoning models that natively incorporate opinions on how humans think, and 2) reasoning models that abstract precise control of the reasoning intuition away from end users. While powerful, for scientists to maximize utility of AI in scientific discovery, they not only require accuracy and transparency in reasoning, but also steerability. Hence, we introduce an alternative approach that enables deep and precise control over the reasoning process called: a cognitive loop via in-situ optimization (CLIO). CLIO enables large language models (LLMs) to self-formulate ways of approaching a problem, adapt behavior when self-confidence is low, and ultimately provide scientists with a final belief or answer. Through CLIO’s open design, scientists can observe uncertainty levels, understand how final belief states are formulated using graph structures, and interject corrections. Without any further post-training, OpenAI’s GPT-4.1 with CLIO yields an accuracy of 22.37% in text-based biology and medicine questions on Humanity’s Last Exam (HLE). This yields a 13.82% net or 161.64% relative increase when compared to the base GPT-4.1 model and surpasses OpenAI’s o3 performance in high and low reasoning effort modes. We further discovered that oscillations within internal uncertainty measures are key in determining the accuracy of CLIO’s results, revealing how its open design and internal mechanisms can provide insight and control into scientific decision-making processes.
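
The accuracy figures quoted in the abstract are mutually consistent, which a few lines of Python can confirm: the 13.82-point net gain implies a base GPT-4.1 accuracy of 8.55%, and dividing the gain by that base gives the stated relative increase. The variable names are ours, not the paper's.

```python
clio_accuracy = 22.37  # % accuracy of GPT-4.1 with CLIO (from the abstract)
net_gain = 13.82       # percentage-point gain over base GPT-4.1 (from the abstract)

base_accuracy = clio_accuracy - net_gain          # implied base GPT-4.1 accuracy
relative_gain = net_gain / base_accuracy * 100.0  # relative increase in percent

print(f"base accuracy: {base_accuracy:.2f}%")
print(f"relative increase: {relative_gain:.2f}%")
```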

[AI-93] Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges

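[Quick read]: This paper addresses the lack of a systematic understanding of how Web3 technologies and AI agents combine, in particular the absence of a comprehensive analytical framework for their intersection. The core question is how AI agents can be integrated into decentralized ecosystems to improve financial efficiency, governance, security, and trust, and what the key challenges and future research directions are. The key to the solution is a comprehensive analysis along five dimensions (market landscape, economics, governance, security, and trust mechanisms). Through a systematic taxonomy and empirical study of 133 existing projects, the paper maps how AI agents are applied in decentralized finance (DeFi), smart contract auditing, governance, and reliability assurance, providing a theoretical basis and practical guidance for designing more robust, intelligent, and trustworthy decentralized systems.

Link: https://arxiv.org/abs/2508.02773
Authors: Yiming Shen, Jiashuo Zhang, Zhenzhe Shao, Wenxuan Luo, Yanlin Wang, Ting Chen, Zibin Zheng, Jiachi Chen
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); General Economics (econ.GN)
Comments: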

Abstract:The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier poised to reshape decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection between Web3 and AI agents, examining five critical dimensions: landscape, economics, governance, security, and trust mechanisms. Through an analysis of 133 existing projects, we first develop a taxonomy and systematically map the current market landscape (RQ1), identifying distinct patterns in project distribution and capitalization. Building upon these findings, we further investigate four key integrations: (1) the role of AI agents in participating in and optimizing decentralized finance (RQ2); (2) their contribution to enhancing Web3 governance mechanisms (RQ3); (3) their capacity to strengthen Web3 security via intelligent vulnerability detection and automated smart contract auditing (RQ4); and (4) the establishment of robust reliability frameworks for AI agent operations leveraging Web3’s inherent trust infrastructure (RQ5). By synthesizing these dimensions, we identify key integration patterns, highlight foundational challenges related to scalability, security, and ethics, and outline critical considerations for future research toward building robust, intelligent, and trustworthy decentralized systems with effective AI agent interactions.

[AI-94] The Silicon Reasonable Person: Can AI Predict How Ordinary People Judge Reasonableness?

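[Quick read]: This paper addresses the problem that judges' intuitions in legal practice may not align with broader societal views, i.e., how to predict and understand human judgments of reasonableness across contexts. The key is testing whether large language models (LLMs) can learn the deeper decision patterns that drive human reasonableness judgments rather than merely imitating surface-level responses. Using randomized controlled trials across multiple legal contexts (over 10,000 simulated judgments), the study finds that certain LLMs capture not only the pattern of human judgments but also prioritize social cues over economic efficiency in negligence determinations, which mirrors human behavior but contradicts textbook treatments. This suggests that these models may have internalized recognizable ethical frameworks, and points to practical uses in calibrating judicial intuitions, testing policy interpretations, and letting resource-constrained litigants preview how arguments will be received.

Link: https://arxiv.org/abs/2508.02766
Authors: Yonathan A. Arbel
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
Comments: 45 pages, 8 figures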

Abstract:In everyday life, people make countless reasonableness judgments that determine appropriate behavior in various contexts. Predicting these judgments challenges the legal system, as judges’ intuitions may not align with broader societal views. This Article investigates whether large language models (LLMs) can learn to identify patterns driving human reasonableness judgments. Using randomized controlled trials comparing humans and models across multiple legal contexts with over 10,000 simulated judgments, we demonstrate that certain models capture not just surface-level responses but potentially their underlying decisional architecture. Strikingly, these systems prioritize social cues over economic efficiency in negligence determinations, mirroring human behavior despite contradicting textbook treatments. These findings suggest practical applications: judges could calibrate intuitions against broader patterns, lawmakers could test policy interpretations, and resource-constrained litigants could preview argument reception. As AI agents increasingly make autonomous real-world decisions, understanding whether they’ve internalized recognizable ethical frameworks becomes essential for anticipating their behavior. Report number: Alabama Working Paper Series–2025.

[AI-95] Context-Adaptive Multi-Prompt LLM Embedding for Vision-Language Alignment

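[Quick read]: This paper addresses the limited semantic richness of text representations in vision-language contrastive learning. Standard CLIP-style models rely on a single text embedding, which limits how much of the text's semantics can be aligned with visual features. The key to the solution is Context-Adaptive Multi-Prompt Embedding: the method introduces multiple structured prompts, each containing a token that adaptively captures a different semantic aspect of the input text, processes all prompts jointly in a single forward pass, and combines the resulting prompt embeddings into a unified text representation for richer semantic alignment. A diversity regularization loss and a negation-aware loss further encourage specialization across prompts and improve representation quality, yielding consistent improvements on image-text and video-text retrieval.

Link: https://arxiv.org/abs/2508.02762
Authors: Dahun Kim, Anelia Angelova
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: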

Abstract:We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We process all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks.

[AI-96] Towards a Manifesto for Cyber Humanities: Paradigms, Ethics, and Prospects

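[Quick read]: This paper asks how the humanities should reconfigure knowledge production and cultural practice in the post-digital era, as digital infrastructures and algorithmic systems evolve at an accelerating pace. It argues that the established paradigms of Digital Humanities and Digital Humanism struggle to meet the challenges of a computationally mediated world, such as ethical disorder, the alienation of cognitive infrastructures, and the fragmentation of collective memory. The key to the solution is “Cyber Humanities” as a foundational paradigm: a Decalogue of ten principles that integrates ethical design (Ethics-by-Design), sustainable digital practices, and participatory knowledge systems, is oriented toward human-centered approaches (Human-centered AI), and promotes critical reflection on algorithmic infrastructures (Algorithmic Reflexivity), with the aim of reshaping the Knowledge Ecosystems and Digital Sovereignty of humanistic scholarship.

Link: https://arxiv.org/abs/2508.02760
Authors: Giovanni Adorni, Emanuele Bellini
Institution: Unknown
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Digital Libraries (cs.DL)
Comments: 18 pages, 1 table, 48 references, to appear in: 1st. IEEE Int. Conf. on “Cyber Humanities”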

Abstract:The accelerated evolution of digital infrastructures and algorithmic systems is reshaping how the humanities engage with knowledge and culture. Rooted in the traditions of Digital Humanities and Digital Humanism, the concept of “Cyber Humanities” proposes a critical reconfiguration of humanistic inquiry for the post-digital era. This Manifesto introduces a flexible framework that integrates ethical design, sustainable digital practices, and participatory knowledge systems grounded in human-centered approaches. By means of a Decalogue of foundational principles, the Manifesto invites the scientific community to critically examine and reimagine the algorithmic infrastructures that influence culture, creativity, and collective memory. Rather than being a simple extension of existing practices, “Cyber Humanities” should be understood as a foundational paradigm for humanistic inquiry in a computationally mediated world. Keywords: Cyber Humanities, Digital Humanities, Transdisciplinary Epistemology, Algorithmic Reflexivity, Human-centered AI, Ethics-by-Design, Knowledge Ecosystems, Digital Sovereignty, Cognitive Infrastructures. ACM classes: I.2.0; K.3.m; K.4.0.

[AI-97] DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting

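[Quick read]: This paper addresses three long-standing challenges in time series forecasting (TSF): static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, which limit a model's ability to capture intricate multi-scale temporal dependencies. The key to the solution is the Dynamic Multi-Scale Coordination Framework (DMSC), with three core components: 1) a Multi-Scale Patch Decomposition block (EMPD), which dynamically segments sequences into hierarchical patches through input-adaptive adjustment, removing predefined scale constraints; 2) a Triad Interaction Block (TIB), which jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer; and 3) an Adaptive Scale Routing MoE block (ASR-MoE), which dynamically fuses multi-scale predictions using temporal-aware weighting. These blocks form a multi-layer progressive cascade in which coarse-grained features guide fine-grained feature extraction through gated pathways, improving both forecasting accuracy and computational efficiency.

Link: https://arxiv.org/abs/2508.02753
Authors: Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: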

Abstract:Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer’s decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at this https URL.

[AI-98] SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference

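[Quick read]: This paper addresses the performance degradation of large language models (LLMs) in long-context inference caused by limited key-value (KV) cache resources. It targets two flaws of existing token-level eviction strategies: first, irreversible eviction cannot adapt to attention patterns that change during decoding (the saliency shift problem); second, marginally important tokens are treated the same as truly unimportant ones, ignoring the collective contribution of marginal tokens to model performance (the marginal information over-compression problem). The core of the solution is SmallKV, a small-model-assisted compensation mechanism for KV cache compression. It exploits the high similarity of attention matrices between LLMs of different scales, so that the small model helps the larger model perceive globally important attention information and the small model's attention scores approximate those of marginal tokens in the larger model. Experiments on several benchmarks show the method is effective, with 1.75 to 2.56 times higher throughput than baseline methods.

Link: https://arxiv.org/abs/2508.02751
Authors: Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: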

Abstract:KV cache eviction has emerged as an effective solution to alleviate resource constraints faced by LLMs in long-context scenarios. However, existing token-level eviction methods often overlook two critical aspects: (1) their irreversible eviction strategy fails to adapt to dynamic attention patterns during decoding (the saliency shift problem), and (2) they treat both marginally important tokens and truly unimportant tokens equally, despite the collective significance of marginal tokens to model performance (the marginal information over-compression problem). To address these issues, we design two compensation mechanisms based on the high similarity of attention matrices between LLMs of different scales. We propose SmallKV, a small model assisted compensation method for KV cache compression. SmallKV can maintain attention matching between different-scale LLMs to: 1) assist the larger model in perceiving globally important information of attention; and 2) use the smaller model’s attention scores to approximate those of marginal tokens in the larger model. Extensive experiments on benchmarks including GSM8K, BBH, MT-Bench, and LongBench demonstrate the effectiveness of SmallKV. Moreover, efficiency evaluations show that SmallKV achieves 1.75 - 2.56 times higher throughput than baseline methods, highlighting its potential for efficient and performant LLM inference in resource constrained environments.
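
To make the distinction in the abstract between marginally important and truly unimportant tokens concrete, the sketch below ranks cached tokens by an attention score and splits them into three tiers instead of the usual keep/evict pair. This is not SmallKV's algorithm, which the abstract does not specify; the function name, the tier sizes, and the example scores are assumptions for illustration.

```python
def split_by_importance(attention_scores, budget, marginal):
    """Rank tokens by attention score and split them into three tiers.

    critical:    the `budget` highest-scoring tokens (kept in the cache)
    marginal:    the next `marginal` tokens (candidates for approximation)
    unimportant: everything else (candidates for eviction)
    """
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )
    critical = sorted(ranked[:budget])
    marginal_tier = sorted(ranked[budget:budget + marginal])
    unimportant = sorted(ranked[budget + marginal:])
    return critical, marginal_tier, unimportant


# Hypothetical per-token attention scores for an 8-token context.
scores = [0.30, 0.02, 0.25, 0.10, 0.01, 0.08, 0.20, 0.04]
critical, marginal_tier, unimportant = split_by_importance(scores, budget=3, marginal=2)
print(critical, marginal_tier, unimportant)
```

A plain top-k eviction policy would discard the middle tier together with the last one; the abstract's point is that those marginal tokens still matter collectively.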

[AI-99] Pulse Shape Discrimination Algorithms: Survey and Benchmark

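[Quick read]: This paper addresses the lack of a systematic standard for evaluating and comparing pulse shape discrimination (PSD) algorithms in radiation detection. The key to the solution is a comprehensive taxonomy of nearly sixty algorithms, divided into a statistical paradigm (time-domain, frequency-domain, neural network-based) and a prior-knowledge paradigm (machine learning, deep learning), with all algorithms implemented and evaluated on two standardized datasets using metrics such as Figure of Merit (FOM), F1-score, and ROC-AUC. The study finds that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. It also points out the limitations of FOM and recommends alternative evaluation metrics, and provides a reproducible benchmark and open-source resources (a Python/MATLAB toolbox and the datasets).

Link: https://arxiv.org/abs/2508.02750
Authors: Haoran Liu, Yihan Zhan, Mingzhe Liu, Yanhua Liu, Peng Li, Zhuo Zuo, Bingqi Liu, Runxi Liu
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Nuclear Experiment (nucl-ex); Applied Physics (physics.app-ph); Atomic Physics (physics.atom-ph)
Comments: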

Abstract:This review presents a comprehensive survey and benchmark of pulse shape discrimination (PSD) algorithms for radiation detection, classifying nearly sixty methods into statistical (time-domain, frequency-domain, neural network-based) and prior-knowledge (machine learning, deep learning) paradigms. We implement and evaluate all algorithms on two standardized datasets: an unlabeled set from a 241Am-9Be source and a time-of-flight labeled set from a 238Pu-9Be source, using metrics including Figure of Merit (FOM), F1-score, ROC-AUC, and inter-method correlations. Our analysis reveals that deep learning models, particularly Multi-Layer Perceptrons (MLPs) and hybrid approaches combining statistical features with neural regression, often outperform traditional methods. We discuss architectural suitabilities, the limitations of FOM, alternative evaluation metrics, and performance across energy thresholds. Accompanying this work, we release an open-source toolbox in Python and MATLAB, along with the datasets, to promote reproducibility and advance PSD research.
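
For readers unfamiliar with the Figure of Merit mentioned in the abstract, the sketch below uses its conventional definition: the separation between the neutron and gamma peaks of a discrimination parameter divided by the sum of their full widths at half maximum (FWHM). It assumes both distributions are roughly Gaussian and estimates each width from the sample standard deviation, whereas in practice the peaks are usually fitted to a histogram. The function name and sample values are ours, and the paper's toolbox may compute FOM differently.

```python
import math
import statistics

# For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma (about 2.355 * sigma).
FWHM_FACTOR = 2.0 * math.sqrt(2.0 * math.log(2.0))


def figure_of_merit(neutron_values, gamma_values):
    """FOM = |mu_n - mu_g| / (FWHM_n + FWHM_g), under a Gaussian assumption."""
    mu_n = statistics.fmean(neutron_values)
    mu_g = statistics.fmean(gamma_values)
    fwhm_n = FWHM_FACTOR * statistics.stdev(neutron_values)
    fwhm_g = FWHM_FACTOR * statistics.stdev(gamma_values)
    return abs(mu_n - mu_g) / (fwhm_n + fwhm_g)


# Hypothetical discrimination-parameter values (e.g. a tail-to-total charge ratio).
neutron_psd = [0.28, 0.30, 0.32]
gamma_psd = [0.09, 0.10, 0.11]
fom = figure_of_merit(neutron_psd, gamma_psd)
print(f"FOM = {fom:.3f}")
```

Because FOM depends only on two peak positions and widths, it says nothing about individual misclassified pulses, which is one reason labeled metrics such as F1-score and ROC-AUC are reported alongside it.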

[AI-100] Large Language Model-based Data Science Agent: A Survey

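[Quick read]: This paper examines how agents driven by large language models (LLMs) can be applied effectively to data science tasks to raise the level of automation and execution efficiency. The key is a dual-perspective framework: from the agent perspective, it distills core design principles covering agent roles, execution, knowledge, and reflection methods; from the data science perspective, it reviews key processes such as data preprocessing, model development, evaluation, and visualization. Combining the two connects general agent design principles with practical data science workflows.

Link: https://arxiv.org/abs/2508.02744
Authors: Peiran Wang, Yaoning Yu, Ke Chen, Xianyang Zhan, Haohan Wang
Institution: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: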

Abstract:The rapid advancement of Large Language Models (LLMs) has driven novel applications across diverse domains, with LLM-based agents emerging as a crucial area of exploration. This survey presents a comprehensive analysis of LLM-based agents designed for data science tasks, summarizing insights from recent studies. From the agent perspective, we discuss the key design principles, covering agent roles, execution, knowledge, and reflection methods. From the data science perspective, we identify key processes for LLM-based agents, including data preprocessing, model development, evaluation, visualization, etc. Our work offers two key contributions: (1) a comprehensive review of recent developments in applying LLM-based agents to data science tasks; (2) a dual-perspective framework that connects general agent design principles with the practical workflows in data science.

[AI-101] DeepGB-TB: A Risk-Balanced Cross-Attention Gradient-Boosted Convolutional Network for Rapid Interpretable Tuberculosis Screening

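[Quick read]: This paper addresses the fact that large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, and proposes DeepGB-TB, a non-invasive AI system. Its key innovations are: 1) a lightweight one-dimensional convolutional neural network for cough audio coupled with a gradient-boosted decision tree for basic demographic features, linked by a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges information between modalities, emulating how clinicians integrate symptoms and risk factors; and 2) a Tuberculosis Risk-Balanced Loss (TRBL) that penalizes false-negative predictions more strongly, reducing missed high-risk cases. Evaluated on a diverse dataset of 1,105 patients from seven countries, the system achieves an AUROC of 0.903 and an F1-score of 0.851, runs in real time and offline on common mobile devices, and balances accuracy, efficiency, and interpretability for public-health use in low-resource settings.

Link: https://arxiv.org/abs/2508.02741
Authors: Zhixiang Lu, Yulong Li, Feilong Tang, Zhengyong Jiang, Chong Li, Mian Zhou, Tenglong Li, Jionglong Su
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Sound (cs.SD)
Comments: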

Abstract:Large-scale tuberculosis (TB) screening is limited by the high cost and operational complexity of traditional diagnostics, creating a need for artificial-intelligence solutions. We propose DeepGB-TB, a non-invasive system that instantly assigns TB risk scores using only cough audio and basic demographic data. The model couples a lightweight one-dimensional convolutional neural network for audio processing with a gradient-boosted decision tree for tabular features. Its principal innovation is a Cross-Modal Bidirectional Cross-Attention module (CM-BCA) that iteratively exchanges salient cues between modalities, emulating the way clinicians integrate symptoms and risk factors. To meet the clinical priority of minimizing missed cases, we design a Tuberculosis Risk-Balanced Loss (TRBL) that places stronger penalties on false-negative predictions, thereby reducing high-risk misclassifications. DeepGB-TB is evaluated on a diverse dataset of 1,105 patients collected across seven countries, achieving an AUROC of 0.903 and an F1-score of 0.851, representing a new state of the art. Its computational efficiency enables real-time, offline inference directly on common mobile devices, making it ideal for low-resource settings. Importantly, the system produces clinically validated explanations that promote trust and adoption by frontline health workers. By coupling AI innovation with public-health requirements for speed, affordability, and reliability, DeepGB-TB offers a tool for advancing global TB control.
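
The abstract describes TRBL only as a loss that places stronger penalties on false-negative predictions; it does not give the formula. As a stand-in under that description, the sketch below uses a class-weighted binary cross-entropy in which the positive (TB) term is multiplied by a weight greater than one. The function name and the `fn_weight` value are hypothetical and should not be read as the paper's loss.

```python
import math


def risk_weighted_bce(y_true, p_pred, fn_weight=3.0, eps=1e-12):
    """Binary cross-entropy whose positive-class term is scaled by `fn_weight`.

    A true TB case (y = 1) that receives a low score therefore costs
    `fn_weight` times more than an equally confident false alarm.
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(fn_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


missed_case = risk_weighted_bce([1], [0.1])  # TB patient scored as low risk
false_alarm = risk_weighted_bce([0], [0.9])  # healthy patient scored as high risk
print(f"missed case: {missed_case:.3f}, false alarm: {false_alarm:.3f}")
```

With `fn_weight=3.0`, the missed case costs about three times as much as the equally confident false alarm, which pushes the decision boundary toward higher sensitivity.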

[AI-102] Who Gets Cited? Gender- and Majority-Bias in LLM-Driven Reference Selection

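[Quick read]: This paper examines the gender bias that large language models (LLMs) may introduce into scholarly reference recommendation, in particular its effect on reference selection. Using controlled experiments with pseudonymous author names, it evaluates several LLMs (such as GPT-4o and Claude Sonnet) and finds two forms of bias: a persistent preference for male-authored references, and a majority-group bias that favors whichever gender is more prevalent in the candidate pool. These biases are amplified in larger candidate pools and are only modestly attenuated by prompt-based mitigation, and their magnitude varies by field (social sciences show the least bias). The key to a solution is therefore developing more effective mitigation strategies, so that LLMs do not further entrench existing gender disparities in high-stakes academic workflows.

Link: https://arxiv.org/abs/2508.02740
Authors: Jiangen He
Institution: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: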

Abstract:Large language models (LLMs) are rapidly being adopted as research assistants, particularly for literature review and reference recommendation, yet little is known about whether they introduce demographic bias into citation workflows. This study systematically investigates gender bias in LLM-driven reference selection using controlled experiments with pseudonymous author names. We evaluate several LLMs (GPT-4o, GPT-4o-mini, Claude Sonnet, and Claude Haiku) by varying gender composition within candidate reference pools and analyzing selection patterns across fields. Our results reveal two forms of bias: a persistent preference for male-authored references and a majority-group bias that favors whichever gender is more prevalent in the candidate pool. These biases are amplified in larger candidate pools and only modestly attenuated by prompt-based mitigation strategies. Field-level analysis indicates that bias magnitude varies across scientific domains, with social sciences showing the least bias. Our findings indicate that LLMs can reinforce or exacerbate existing gender imbalances in scholarly recognition. Effective mitigation strategies are needed to avoid perpetuating existing gender disparities in scientific citation practices before integrating LLMs into high-stakes academic workflows.

[AI-103] Recovering Individual-Level Activity Sequences from Location-Based Service Data Using a Novel Transformer-Based Model

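[Quick read]: This paper asks whether complete activity sequences derived from high-quality Location-Based Service (LBS) data can be used to recover incomplete activity sequences at the individual level, improving the accuracy of inferences about trips and activities. The key to the solution is a new model, the Variable Selection Network-fused Insertion Transformer (VSNIT), which combines the Insertion Transformer's flexible sequence construction with the Variable Selection Network's dynamic covariate handling to fill in missing segments while preserving existing data, producing activity patterns that better match real-world diversity and transitions.

Link: https://arxiv.org/abs/2508.02734
Authors: Weiyu Luo, Chenfeng Xiong
Institution: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: 20 pages, 5 figures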

Abstract:Location-Based Service (LBS) data provides critical insights into human mobility, yet its sparsity often yields incomplete trip and activity sequences, making accurate inferences about trips and activities difficult. We raise a research problem: Can we use activity sequences derived from high-quality LBS data to recover incomplete activity sequences at the individual level? This study proposes a new solution, the Variable Selection Network-fused Insertion Transformer (VSNIT), integrating the Insertion Transformer’s flexible sequence construction with the Variable Selection Network’s dynamic covariate handling capability, to recover missing segments in incomplete activity sequences while preserving existing data. The findings show that VSNIT inserts more diverse, realistic activity patterns, more closely matching real-world variability, and restores disrupted activity transitions more effectively aligning with the target. It also performs significantly better than the baseline model across all metrics. These results highlight VSNIT’s superior accuracy and diversity in activity sequence recovery tasks, demonstrating its potential to enhance LBS data utility for mobility analysis. This approach offers a promising framework for future location-based research and applications.

[AI-104] A Note on Code Quality Score: LLMs for Maintainable Large Codebases ICLR

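[Quick read]: This paper addresses the challenge of maintaining code quality in large-scale software systems, especially where many engineers work on a codebase concurrently, and how to detect quality issues in code changes efficiently and provide actionable suggestions. The key to the solution is an automated system called Code Quality Score (CQS), powered by two Llama3 models fine-tuned with supervised fine-tuning (SFT) and offline reinforcement learning (offline RL): the first detects common code quality issues related to coding best practices, and the second provides high-quality “critiques” for LLM-generated code review. To protect the user experience, a layer of hand-crafted rules filters out incorrect responses and hallucinations. In an industrial-scale deployment, the system has consistently achieved a 60% week-over-week user helpfulness rate.

Link: https://arxiv.org/abs/2508.02732
Authors: Sherman Wong, Jalaj Bhandari, Leo Zhou Fan Yang, Xylan Xu, Yi Zhuang, Cem Cayiroglu, Payal Bhuptani, Sheela Yadawad, Hung Duong
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 24 pages, ICLR format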

Abstract:Maintaining code quality in large-scale software systems presents significant challenges, particularly in settings where large numbers of engineers work concurrently on a codebase. This paper introduces the Code Quality Score (CQS) system to automatically detect issues with a set of code changes and provide actionable insights. At its core, the CQS system is powered by two Llama3 models, fine-tuned (with SFT and offline RL approaches) to a) detect common code quality issues related to coding best practices and b) provide good “critiques” for LLM-generated code review, respectively. To maintain good user experience, we layer the system with hand-crafted rules to filter out incorrect responses/hallucinations. Offline evaluations show that our CQS system is able to achieve an impressive precision rate for identifying valid issues. This system has already been rolled out to developers in an industrial scale setting and has consistently achieved 60% week over week user helpfulness rate, demonstrating its effectiveness in a real-world environment. In this paper, we present details of the CQS system along with some learnings on curating developer feedback to create training data for LLM fine-tuning.

[AI-105] Interpreting Performance Profiles with Deep Learning

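[Quick read]: This paper addresses the difficulty users have in relating complex performance data from traditional profilers to program semantics, which is especially hard for software engineers who did not write the code and need to find actionable optimizations. The key to the solution is combining performance profiles with program semantics: a fine-tuned CodeBERT-based model generates code summaries, and this semantic information is integrated into the output of Async Profiler, so that a graphical user interface can show a semantic explanation for any selected call path, making performance bottlenecks easier to understand and optimize.

Link: https://arxiv.org/abs/2508.02729
Authors: Zhuoran Liu
Institution: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Performance (cs.PF)
Comments: Master of Science in Computer Science thesis, North Carolina State University, 2022. Advisor: Dr. Xu Liu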

Abstract:Profiling tools (also known as profilers) play an important role in understanding program performance at runtime, such as hotspots, bottlenecks, and inefficiencies. While profilers have proven useful, they place an extra burden on software engineers. Software engineers, as the users, are responsible for interpreting the complex performance data and identifying actionable optimizations in program source code. However, it can be challenging for users to associate inefficiencies with the program semantics, especially if the users are not the authors of the code, which limits the applicability of profilers. In this thesis, we explore a new direction to combine performance profiles and program semantics with a deep learning approach. The key idea is to glean code summary for semantic information (at a certain level) and integrate it into a profiler, which can better understand program inefficiencies for actionable optimization. To be concrete, we combine profiles generated by Async Profiler (the state-of-the-art Java profiler) with code summarization from a fine-tuned CodeBERT-based model. We demonstrate the code summaries of any selected call path in a graphical user interface. Our system can effectively assist analysis on many Java benchmarks.

[AI-106] Forecasting NCAA Basketball Outcomes with Deep Learning: A Comparative Study of LSTM and Transformer Models

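[Quick read]: This paper aims to predict game outcomes in the NCAA Division 1 men's and women's basketball tournaments; the core challenge is modeling the temporal characteristics of team performance and improving the probability calibration of predictions. The key to the solution is using two sequence modeling architectures, Long Short-Term Memory (LSTM) networks and Transformers, together with feature engineering (team quality metrics derived from Generalized Linear Models (GLM), Elo ratings, seed differences, and aggregated box-score statistics), and comparing the effect of two loss functions, Binary Cross-Entropy (BCE) and Brier loss. Experiments show that the Transformer trained with BCE has the best discriminative power (AUC = 0.8473), while the LSTM trained with Brier loss is best calibrated (Brier score = 0.1589), underscoring the importance of choosing the architecture and loss function to fit the task.

Link: https://arxiv.org/abs/2508.02725
Authors: Md Imtiaz Habib
Institution: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 20 page scientific report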

Abstract:In this research, I explore advanced deep learning methodologies to forecast the outcomes of the 2025 NCAA Division 1 Men’s and Women’s Basketball tournaments. Leveraging historical NCAA game data, I implement two sophisticated sequence-based models: Long Short-Term Memory (LSTM) and Transformer architectures. The predictive power of these models is augmented through comprehensive feature engineering, including team quality metrics derived from Generalized Linear Models (GLM), Elo ratings, seed differences, and aggregated box-score statistics. To evaluate the robustness and reliability of predictions, I train each model variant using both Binary Cross-Entropy (BCE) and Brier loss functions, providing insights into classification performance and probability calibration. My comparative analysis reveals that while the Transformer architecture optimized with BCE yields superior discriminative power (highest AUC of 0.8473), the LSTM model trained with Brier loss demonstrates superior probabilistic calibration (lowest Brier score of 0.1589). These findings underscore the importance of selecting appropriate model architectures and loss functions based on the specific requirements of forecasting tasks. The detailed analytical pipeline presented here serves as a reproducible framework for future predictive modeling tasks in sports analytics and beyond.
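
The abstract leans on three standard quantities: the Brier score, binary cross-entropy, and Elo ratings. The sketch below writes out their textbook definitions in plain Python. The 400-point Elo scale is the conventional one and is an assumption here, since the abstract does not state the paper's Elo configuration; the sample predictions are made up.

```python
import math


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood of the outcomes under the predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def elo_win_probability(rating_a, rating_b):
    """Expected score of team A against team B on the conventional 400-point scale."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


outcomes = [1, 0]          # hypothetical results: team A won game 1 and lost game 2
predictions = [0.8, 0.3]   # hypothetical predicted win probabilities for team A
print(brier_score(outcomes, predictions))
print(binary_cross_entropy(outcomes, predictions))
print(elo_win_probability(1900, 1500))
```

The two losses penalize errors differently: cross-entropy grows without bound as a confident prediction turns out wrong, while the Brier score is capped at 1 per game, which is one reason a model can do well on one metric and less well on the other.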

[AI-107] Mathematical Foundations of Geometric Deep Learning

【速读】:该论文旨在解决几何深度学习(Geometric Deep Learning)研究中所依赖的核心数学概念不清晰或未系统梳理的问题。其解决方案的关键在于系统性地回顾和阐述支撑该领域发展的基础数学工具,包括群表示理论、微分几何、流形学习以及图神经网络中的结构化数据建模方法,从而为研究人员提供坚实的理论基础以推动几何深度学习的进一步发展。

Link: https://arxiv.org/abs/2508.02723
Authors: Haitz Sáez de Ocáriz Borde, Michael Bronstein
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 78 pages

Abstract:We review the key mathematical concepts necessary for studying Geometric Deep Learning.

[AI-108] Blueprint First Model Second: A Framework for Deterministic LLM Workflow

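[Quick read]: This paper addresses the unreliable procedure execution and lack of guaranteed process consistency that the inherent non-determinism of Large Language Model (LLM) agents causes in structured operational environments. Current architectures mix probabilistic high-level planning with low-level action execution in a single generative process, which makes behavior hard to predict and verify. The key to the solution is a new "Blueprint First, Model Second" paradigm: experts first define the operational procedure and encode it as a source-code-based Execution Blueprint, which a deterministic engine then executes. The LLM is invoked only as a specialized tool for well-bounded, complex sub-tasks within the workflow and takes no part in deciding the workflow's path. This decoupling markedly improves execution efficiency and verifiability, so that autonomous agents can be deployed reliably in applications that must follow strict procedural logic.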

Link: https://arxiv.org/abs/2508.02721
Authors: Libin Qiu, Yuhang Ye, Zhirong Gao, Xide Zou, Junfu Chen, Ziming Gui, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, Kun Zhao
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Programming Languages (cs.PL)
Comments: 8 pages, 6 figures, 3 tables

Abstract:While powerful, the inherent non-determinism of large language model (LLM) agents limits their application in structured operational environments where procedural fidelity and predictable execution are strict requirements. This limitation stems from current architectures that conflate probabilistic, high-level planning with low-level action execution within a single generative process. To address this, we introduce the Source Code Agent framework, a new paradigm built on the “Blueprint First, Model Second” philosophy. Our framework decouples the workflow logic from the generative model. An expert-defined operational procedure is first codified into a source code-based Execution Blueprint, which is then executed by a deterministic engine. The LLM is strategically invoked as a specialized tool to handle bounded, complex sub-tasks within the workflow, but never to decide the workflow’s path. We conduct a comprehensive evaluation on the challenging tau-bench benchmark, designed for complex user-tool-rule scenarios. Our results demonstrate that the Source Code Agent establishes a new state-of-the-art, outperforming the strongest baseline by 10.1 percentage points on the average Pass^1 score while dramatically improving execution efficiency. Our work enables the verifiable and reliable deployment of autonomous agents in applications governed by strict procedural logic.
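The decoupling described above can be sketched in a few lines. This is a hypothetical illustration, not the Source Code Agent framework itself: the refund procedure, the function names `run_refund_blueprint` and `stub_llm`, and the 30-day rule are all assumptions made up for the example.

```python
from typing import Callable, Dict

def run_refund_blueprint(request: Dict[str, object], llm: Callable[[str], str]) -> str:
    """Hypothetical refund procedure: the branching lives in code, not in the model."""
    # Step 1: a bounded sub-task is delegated to the model (free text -> one label).
    reason = llm(f"Classify the refund reason as 'defect' or 'other': {request['message']}")
    if reason not in ("defect", "other"):
        reason = "other"  # the blueprint validates model output before using it

    # Step 2: the workflow path is decided by deterministic rules only.
    if request["days_since_purchase"] > 30:
        return "reject: outside 30-day window"
    if reason == "defect":
        return "approve: full refund"
    return "approve: store credit"

def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "defect" if "broken" in prompt.lower() else "other"

print(run_refund_blueprint({"message": "It arrived broken", "days_since_purchase": 5}, stub_llm))
```

Because the model only returns a label that the code validates, an unexpected model response cannot send the workflow down a path the blueprint does not define.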

[AI-109] ECGTwin: Personalized ECG Generation Using Controllable Diffusion Model

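[Quick read]: This paper addresses two core challenges in personalized electrocardiogram (ECG) generation: how to extract individual-specific features without ground-truth labels, and how to inject multiple cardiac conditions without confusing the generative model. The key to the solution is ECGTwin, a two-stage framework. In the first stage, an Individual Base Extractor trained with contrastive learning robustly captures individual features from a reference ECG. In the second stage, a novel AdaX Condition Injector fuses the individual features and the target cardiac condition into the diffusion-based generation process through two dedicated pathways. The result is personalized ECG generation with high fidelity, diversity, and fine-grained controllability that preserves individual-specific features.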

Link: https://arxiv.org/abs/2508.02720
Authors: Yongfan Lai, Bo Liu, Xinyan Guan, Qinghao Zhao, Hongyan Li, Shenda Hong
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Personalized electrocardiogram (ECG) generation aims to simulate a patient's ECG digital twins tailored to specific conditions. It has the potential to transform traditional healthcare into a more accurate individualized paradigm, while preserving the key benefits of conventional population-level ECG synthesis. However, this promising task presents two fundamental challenges: extracting individual features without ground truth and injecting various types of conditions without confusing the generative model. In this paper, we present ECGTwin, a two-stage framework designed to address these challenges. In the first stage, an Individual Base Extractor trained via contrastive learning robustly captures personal features from a reference ECG. In the second stage, the extracted individual features, along with a target cardiac condition, are integrated into the diffusion-based generation process through our novel AdaX Condition Injector, which injects these signals via two dedicated and specialized pathways. Both qualitative and quantitative experiments have demonstrated that our model can not only generate ECG signals of high fidelity and diversity by offering fine-grained generation controllability, but also preserve individual-specific features. Furthermore, ECGTwin shows the potential to enhance ECG auto-diagnosis in downstream applications, confirming the possibility of precise personalized healthcare solutions.

[AI-110] ZetA: A Riemann Zeta-Scaled Extension of Adam for Deep Learning

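[Quick read]: This paper addresses insufficient generalization and sensitivity to noise in deep learning optimization, particularly poor performance on high-granularity classification or data with label noise. The key to its solution is ZetA, a new deep learning optimizer that introduces dynamic gradient scaling based on the Riemann zeta function and combines it with adaptive damping, cosine-similarity-based momentum boosting, an entropy-regularized loss, and Sharpness-Aware Minimization (SAM)-style perturbations. Together these form a hybrid update mechanism that improves model robustness and generalization.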

Link: https://arxiv.org/abs/2508.02719
Authors: Samiksha BC
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 6 pages, 1 figure, 4 references. This paper introduces a hybrid optimizer combining Adam with Riemann zeta-based scaling

Abstract:This work introduces ZetA, a novel deep learning optimizer that extends Adam by incorporating dynamic scaling based on the Riemann zeta function. To the best of our knowledge, ZetA is the first optimizer to apply zeta-based gradient scaling within deep learning optimization. The method improves generalization and robustness through a hybrid update mechanism that integrates adaptive damping, cosine similarity-based momentum boosting, entropy-regularized loss, and Sharpness-Aware Minimization (SAM)-style perturbations. Empirical evaluations on SVHN, CIFAR10, CIFAR100, STL10, and noisy CIFAR10 consistently show test accuracy improvements over Adam. All experiments employ a lightweight fully connected network trained for five epochs under mixed-precision settings. The results demonstrate that ZetA is a computationally efficient and robust alternative to Adam, particularly effective in noisy or high-granularity classification tasks.

[AI-111] A Bayesian Hybrid Parameter-Efficient Fine-Tuning Method for Large Language Models

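[Quick read]: This paper addresses two limitations of existing hybrid Parameter-Efficient Fine-Tuning (hybrid PEFT) methods when Large Language Models (LLMs) are applied to specialized business scenarios: they rely on point estimates and so cannot quantify uncertainty for reliable decision-making, and they struggle to adapt dynamically to new data as real-world conditions change. The key to the solution is Bayesian Hybrid Parameter-Efficient Fine-Tuning (BH-PEFT), which combines Adapter, LoRA, and prefix-tuning to fine-tune the feedforward and attention layers of the Transformer and uses a Bayesian learning framework to model the learnable parameters as distributions, enabling uncertainty quantification. A Bayesian dynamic fine-tuning mechanism then uses the posterior from the current round as the prior for the next round, which improves adaptation to new data.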

Link: https://arxiv.org/abs/2508.02711
Authors: Yidong Chai (1, 2), Yang Liu (1, 2), Yonghang Zhou (1, 2), Jiaheng Xie (3), Daniel Dajun Zeng (4)
Affiliations: (1) School of Management, Hefei University of Technology, Hefei, China; (2) Key Laboratory of Process Optimization and Intelligent Decision-making, Ministry of Education, Hefei, China; (3) Department of Accounting and MIS, Lerner College of Business and Economics, University of Delaware, Newark, Delaware, U.S.; (4) Institute of Automation, Chinese Academy of Sciences, Beijing, China
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Large Language Models (LLMs) have demonstrated transformative potential in reshaping the world. As these models are pretrained on general corpora, they often require domain-specific fine-tuning to optimize performance in specialized business applications. Due to their massive scale, parameter-efficient fine-tuning (PEFT) methods are widely used to reduce training costs. Among them, hybrid PEFT methods that combine multiple PEFT techniques have achieved the best performance. However, existing hybrid PEFT methods face two main challenges when fine-tuning LLMs for specialized applications: (1) relying on point estimates, lacking the ability to quantify uncertainty for reliable decision-making, and (2) struggling to dynamically adapt to emerging data, lacking the ability to suit real-world situations. We propose Bayesian Hybrid Parameter-Efficient Fine-Tuning (BH-PEFT), a novel method that integrates Bayesian learning into hybrid PEFT. BH-PEFT combines Adapter, LoRA, and prefix-tuning to fine-tune feedforward and attention layers of the Transformer. By modeling learnable parameters as distributions, BH-PEFT enables uncertainty quantification. We further propose a Bayesian dynamic fine-tuning approach where the last posterior serves as the prior for the next round, enabling effective adaptation to new data. We evaluated BH-PEFT on business tasks such as sentiment analysis, news categorization, and commonsense reasoning. Results show that our method outperforms existing PEFT baselines, enables uncertainty quantification for more reliable decisions, and improves adaptability in dynamic scenarios. This work contributes to business analytics and data science by proposing a novel BH-PEFT method and dynamic fine-tuning approach that support uncertainty-aware and adaptive decision-making in real-world situations.
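The idea that the last posterior serves as the prior for the next round can be shown on a toy problem. The sketch below is not BH-PEFT: it assumes a single scalar parameter with a Gaussian prior and known observation noise, so the update has a closed form. The data values and the function name `gaussian_update` are hypothetical.

```python
def gaussian_update(prior_mean, prior_var, observations, noise_var):
    """Conjugate update for an unknown mean with known observation noise."""
    n = len(observations)
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var

# Hypothetical data arriving in two rounds.
round_1, round_2 = [1.2, 0.8, 1.1], [1.6, 1.4]
mean, var = 0.0, 10.0  # broad initial prior
for batch in (round_1, round_2):
    # The posterior of this round becomes the prior of the next one.
    mean, var = gaussian_update(mean, var, batch, noise_var=0.5)
    print(round(mean, 4), round(var, 4))
```

In this conjugate setting, updating round by round gives the same posterior as processing all the data at once, and the variance shrinks as data accumulates, which is the kind of uncertainty estimate a point estimate cannot provide.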

[AI-112] Planning with Dynamically Changing Domains IJCAI2025 KR KR2025

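[Quick read]: This paper addresses the limitations imposed on traditional planning by the Domain Closure Assumption (DCA): classical and conformant planning assume a static set of objects named in advance, whereas real problems involve objects that change dynamically (for example, objects being created and destroyed). To meet this challenge, the paper formulates planning in first-order logic, takes the initial theory to be a finite consistent set of fluent literals, imposes an integer upper bound on plan length, and organizes search over action sequences that are grounded at planning time. The key point is that, without relying on the DCA, the approach ensures that only finitely many actions are possible in each situation, which keeps the search space tractable, and that the approach is shown to be sound and complete. It applies to bounded planning problems in the intersection of sequential generalized planning (without sensing actions) and conformant planning, restricted to the case without disjunction over fluent literals.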

Link: https://arxiv.org/abs/2508.02697
Authors: Mikhail Soutchanski, Yongmei Liu
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: A revised version of the paper accepted to the 1st International Workshop on Trends in Knowledge Representation and Reasoning organized as an IJCAI 2025 workshop that takes place in August 2025 in Montreal, Canada. See the details at this https URL

Abstract:In classical planning and conformant planning, it is assumed that there are finitely many named objects given in advance, and only they can participate in actions and in fluents. This is the Domain Closure Assumption (DCA). However, there are practical planning problems where the set of objects changes dynamically as actions are performed; e.g., new objects can be created, old objects can be destroyed. We formulate the planning problem in first-order logic, assume an initial theory is a finite consistent set of fluent literals, discuss when this guarantees that in every situation there are only finitely many possible actions, impose a finite integer bound on the length of the plan, and propose to organize search over sequences of actions that are grounded at planning time. We show the soundness and completeness of our approach. It can be used to solve the bounded planning problems without DCA that belong to the intersection of sequential generalized planning (without sensing actions) and conformant planning, restricted to the case without the disjunction over fluent literals. We discuss a proof-of-concept implementation of our planner.

[AI-113] AnnoSense: A Framework for Physiological Emotion Data Collection in Everyday Settings for AI

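[Quick read]: This paper addresses the challenges of collecting emotion data in real-world environments, in particular how to obtain high-quality, accurately annotated emotion data so that AI algorithms can be trained and applied effectively. The core problem is that, although smart devices and artificial intelligence make it possible to monitor emotions in everyday settings, emotion annotation has become increasingly complex and lacks a systematic guiding framework. The key to the solution is the AnnoSense framework, built from surveys, interviews, and focus group discussions with 119 key stakeholders (members of the public and mental health professionals) and then evaluated by emotion AI experts. It is designed to support everyday emotion data collection and to strengthen the collection and analysis of emotion data for Emotion AI in real-world contexts.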

Link: https://arxiv.org/abs/2508.02680
Authors: Pragya Singh, Ankush Gupta, Mohan Kumar, Pushpendra Singh
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
Comments: To be published in IMWUT, September 2025

Abstract:Emotional and mental well-being are vital components of quality of life, and with the rise of smart devices like smartphones, wearables, and artificial intelligence (AI), new opportunities for monitoring emotions in everyday settings have emerged. However, for AI algorithms to be effective, they require high-quality data and accurate annotations. As the focus shifts towards collecting emotion data in real-world environments to capture more authentic emotional experiences, the process of gathering emotion annotations has become increasingly complex. This work explores the challenges of everyday emotion data collection from the perspectives of key stakeholders. We collected 75 survey responses, performed 32 interviews with the public, and 3 focus group discussions (FGDs) with 12 mental health professionals. The insights gained from a total of 119 stakeholders informed the development of our framework, AnnoSense, designed to support everyday emotion data collection for AI. This framework was then evaluated by 25 emotion AI experts for its clarity, usefulness, and adaptability. Lastly, we discuss the potential next steps and implications of AnnoSense for future research in emotion AI, highlighting its potential to enhance the collection and analysis of emotion data in real-world contexts.

[AI-114] A Wireless Foundation Model for Multi-Task Prediction

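[Quick read]: This paper addresses the limited generalization of multi-task prediction in wireless networks, where traditional Deep Learning (DL) methods transfer poorly across scenarios and tasks. The core challenge is to handle heterogeneous tasks (such as channel state information (CSI), user location, and network traffic prediction) in a unified way while supporting diverse prediction intervals. The key to the solution is a unified foundation model that uses univariate decomposition to give heterogeneous tasks a common structure, granularity encoding to make the model aware of the prediction interval, and a causal Transformer backbone for accurate predictions. A patch masking strategy during training lets the model accept arbitrary input lengths. After training on large-scale data, the model generalizes well to unseen scenarios and shows strong zero-shot performance on new tasks.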

Link: https://arxiv.org/abs/2507.05938
Authors: Yucheng Sheng, Jiacheng Wang, Xingyu Zhou, Le Liang, Hao Ye, Shi Jin, Geoffrey Ye Li
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Theory (cs.IT); Machine Learning (cs.LG)
Comments:

Abstract:With the growing complexity and dynamics of mobile communication networks, accurately predicting key system parameters, such as channel state information (CSI), user location, and network traffic, has become essential for a wide range of physical (PHY)-layer and medium access control (MAC)-layer tasks. Although traditional deep learning (DL)-based methods have been widely applied to such prediction tasks, they often struggle to generalize across different scenarios and tasks. In response, we propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. The proposed model enforces univariate decomposition to unify heterogeneous tasks, encodes granularity for interval awareness, and uses a causal Transformer backbone for accurate predictions. Additionally, we introduce a patch masking strategy during training to support arbitrary input lengths. After being trained on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and achieves zero-shot performance on new tasks that surpasses traditional full-shot baselines.

[AI-115] Decoding and Engineering the Phytobiome Communication for Smart Agriculture

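[Quick read]: This paper addresses the challenges facing agriculture, including rising food demand, environmental pollution, and water scarcity, together with the limited current understanding of the complex communication between plants and their environment. Its core solution is to adopt a communication engineering perspective: the phytobiome, made up of the plant, its environment, and associated organisms, is treated as a multi-level communication network, and a multi-scale communication framework is built on molecular communication (MC) and electrophysiological signal modeling. By combining machine learning/artificial intelligence (ML/AI) with the Internet of Bio-Nano-Things, the framework supports analysis of information transfer within the phytobiome and motivates smart agriculture applications such as smart irrigation and targeted delivery of agrochemicals, offering a theoretical basis and technical path toward sustainable, efficient, and eco-friendly agricultural production.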

Link: https://arxiv.org/abs/2508.03584
Authors: Fatih Gulec, Hamdan Awan, Nigel Wallbridge, Andrew W. Eckford
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Networking and Internet Architecture (cs.NI); Molecular Networks (q-bio.MN)
Comments: Under revision for IEEE Communications Magazine

Abstract:Smart agriculture applications, integrating technologies like the Internet of Things and machine learning/artificial intelligence (ML/AI) into agriculture, hold promise to address modern challenges of rising food demand, environmental pollution, and water scarcity. Alongside the concept of the phytobiome, which defines the area including the plant, its environment, and associated organisms, and the recent emergence of molecular communication (MC), there exists an important opportunity to advance agricultural science and practice using communication theory. In this article, we motivate the use of the communication engineering perspective for developing a holistic understanding of phytobiome communication and bridge the gap between phytobiome communication and smart agriculture. Firstly, an overview of phytobiome communication via molecular and electrophysiological signals is presented and a multi-scale framework modeling the phytobiome as a communication network is conceptualized. Then, how this framework is used to model electrophysiological signals is demonstrated with plant experiments. Furthermore, possible smart agriculture applications, such as smart irrigation and targeted delivery of agrochemicals, through engineering the phytobiome communication are proposed. These applications merge ML/AI methods with the Internet of Bio-Nano-Things enabled by MC and pave the way towards more efficient, sustainable, and eco-friendly agricultural production. Finally, the implementation challenges, open research issues, and industrial outlook for these applications are discussed.

[AI-116] Supervised Dynamic Dimension Reduction with Deep Neural Network

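[Quick read]: This paper addresses dimension reduction for time series forecasting with high-dimensional predictors, where the core challenge is extracting latent factors that are both strongly predictive and interpretable. The key to the solution is a Supervised Deep Dynamic Principal component analysis (SDDP) framework. A temporal neural network first constructs target-aware predictors by weighting the original predictors in a supervised manner, giving larger weights to predictors that contribute more to forecasting the target. Principal Component Analysis (PCA) is then performed on these weighted predictors to extract SDDP factors that improve downstream forecasting accuracy and are target-specific and interpretable. The method is further extended to a factor-augmented nonlinear dynamic forecasting model and applied to settings where the predictors are only partially observable, where it outperforms state-of-the-art methods.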

Link: https://arxiv.org/abs/2508.03546
Authors: Zhanye Luo, Yuefeng Han, Xiufan Yu
Affiliations: Unknown
Categories: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:This paper studies the problem of dimension reduction, tailored to improving time series forecasting with high-dimensional predictors. We propose a novel Supervised Deep Dynamic Principal component analysis (SDDP) framework that incorporates the target variable and lagged observations into the factor extraction process. Assisted by a temporal neural network, we construct target-aware predictors by scaling the original predictors in a supervised manner, with larger weights assigned to predictors with stronger forecasting power. A principal component analysis is then performed on the target-aware predictors to extract the estimated SDDP factors. This supervised factor extraction not only improves predictive accuracy in the downstream forecasting task but also yields more interpretable and target-specific latent factors. Building upon SDDP, we propose a factor-augmented nonlinear dynamic forecasting model that unifies a broad family of factor-model-based forecasting approaches. To further demonstrate the broader applicability of SDDP, we extend our studies to a more challenging scenario when the predictors are only partially observable. We validate the empirical performance of the proposed method on several real-world public datasets. The results show that our algorithm achieves notable improvements in forecasting accuracy compared to state-of-the-art methods.
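The "scale the predictors, then run PCA" step can be sketched as follows. This is a simplified stand-in and not the SDDP algorithm: the weights here are absolute correlations with the target, which is an assumption replacing the paper's temporal neural network, and the data and all function names are hypothetical.

```python
import math

def column(X, j):
    return [row[j] for row in X]

def abs_corr(x, y):
    """Absolute Pearson correlation; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / math.sqrt(sxx * syy)

def first_pc(Z, iters=200):
    """Leading principal direction of centered data Z via power iteration on Z^T Z."""
    p = len(Z[0])
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        scores = [sum(row[j] * v[j] for j in range(p)) for row in Z]
        w = [sum(Z[i][j] * scores[i] for i in range(len(Z))) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return v

def supervised_first_factor(X, target):
    """Scale each centered predictor by its weight, then take the first PC."""
    p = len(X[0])
    weights = [abs_corr(column(X, j), target) for j in range(p)]
    means = [sum(column(X, j)) / len(X) for j in range(p)]
    Z = [[weights[j] * (row[j] - means[j]) for j in range(p)] for row in X]
    v = first_pc(Z)
    factor = [sum(row[j] * v[j] for j in range(p)) for row in Z]
    return weights, v, factor

# Hypothetical data: column 0 tracks the target, column 1 is high-variance noise.
X = [[1, 10], [2, -10], [3, -10], [4, 10], [5, 10], [6, -10]]
target = [1, 2, 3, 4, 5, 6]
weights, loading, factor = supervised_first_factor(X, target)
print([round(w, 3) for w in weights], [round(c, 3) for c in loading])
```

On this toy data, plain PCA would load mostly on the noisy high-variance column, while the supervised scaling shifts the first component toward the predictor that actually tracks the target.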

[AI-117] Artificial Intelligence and Generative Models for Materials Discovery – A Review

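[Quick read]: This review addresses the inefficiency of traditional materials discovery, which depends on experience and trial and error, and asks how AI-driven generative models can enable "inverse design", that is, rapidly screening and designing new materials from target properties. The key to its solution is to bring together the principles of different AI generative models and materials representations, and to combine them with multimodal learning, physics-informed architectures, and closed-loop experimental systems. This is intended to tackle challenges such as data scarcity, high computational cost, poor interpretability, and limited synthesizability, and to integrate AI more deeply with experimental workflows so as to accelerate the discovery of key materials for sustainable energy, healthcare, and related fields.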

Link: https://arxiv.org/abs/2508.03278
Authors: Albertus Denny Handoko, Riko I Made
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Artificial Intelligence (cs.AI); Applied Physics (physics.app-ph)
Comments: Review Article in the Thematic Issue on Artificial Intelligence for Materials Discovery in World Scientific Annual Review of Functional Materials

Abstract:High throughput experimentation tools, machine learning (ML) methods, and open material databases are radically changing the way new materials are discovered. From the experimentally driven approach in the past, we are moving quickly towards the artificial intelligence (AI) driven approach, realizing the ‘inverse design’ capabilities that allow the discovery of new materials given the desired properties. This review aims to discuss different principles of AI-driven generative models that are applicable for materials discovery, including different materials representations available for this purpose. We will also highlight specific applications of generative models in designing new catalysts, semiconductors, polymers, or crystals while addressing challenges such as data scarcity, computational cost, interpretability, synthesizability, and dataset biases. Emerging approaches to overcome limitations and integrate AI with experimental workflows will be discussed, including multimodal models, physics informed architectures, and closed-loop discovery systems. This review aims to provide insights for researchers aiming to harness AI’s transformative potential in accelerating materials discovery for sustainability, healthcare, and energy innovation.

[AI-118] Spatiotemporal wall pressure forecast of a rectangular cylinder with physics-aware DeepUFNet

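[Quick read]: This paper addresses forecasting the spatiotemporal evolution of wall pressure for flow past a rectangular cylinder. Conventional deep learning models can usually predict only the pressure distribution at a single instant from full spatial information and struggle to capture complex transient behavior. The key to the solution is a physics-aware DeepU-Fourier neural Network (DeepUFNet), which combines the UNet structure with a Fourier neural network and embeds a physical high-frequency loss control mechanism during training (the parameter β varies with the training epoch) to improve predictions of high-order frequency fluctuations and wall pressure variance. Experimental validation shows close agreement with wind tunnel measurements in statistical information, temporal variation, power spectral density, spatial distribution, and spatiotemporal correlation, along with good extrapolation from sparse spatial input.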

Link: https://arxiv.org/abs/2508.03183
Authors: Junle Liu, Chang Liu, Yanyu Ke, Wenliang Chen, Kihing Shum, K.T. Tse, Gang Hu
Affiliations: Unknown
Categories: Fluid Dynamics (physics.flu-dyn); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE)
Comments: In total, 26 pages, 21 figures

Abstract:The wall pressure is of great importance in understanding the forces and structural responses induced by fluid. Recent works have investigated the potential of deep learning techniques in predicting mean pressure coefficients and fluctuating pressure coefficients, but most existing deep learning frameworks are limited to predicting a single snapshot using full spatial information. To forecast spatiotemporal wall pressure of flow past a rectangular cylinder, this study develops a physics-aware DeepU-Fourier neural Network (DeepUFNet) deep learning model. DeepUFNet comprises the UNet structure and the Fourier neural network, with physical high-frequency loss control embedded in the model training stage to optimize model performance, where the parameter β varies with the development of the training epoch. Wind tunnel testing is performed to collect wall pressures of a two-dimensional rectangular cylinder with a side ratio of 1.5 at an angle of attack of zero using high-frequency pressure scanning, thereby constructing a database for DeepUFNet training and testing. The DeepUFNet model is found to forecast spatiotemporal wall pressure information with high accuracy. The comparison between forecast results and experimental data presents agreement in statistical information, temporal pressure variation, power spectrum density, spatial distribution, and spatiotemporal correlation. It is also found that embedding a physical high-frequency loss control coefficient β in the DeepUFNet model can significantly improve model performance in forecasting spatiotemporal wall pressure information, in particular, in forecasting high-order frequency fluctuation and wall pressure variance. Furthermore, the DeepUFNet extrapolation capability is tested with sparse spatial information input, and the model presents a satisfactory extrapolation ability.

[AI-119] Autonomous Inorganic Materials Discovery via Multi-Agent Physics-Aware Scientific Reasoning

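[Quick read]: This paper addresses the difficulty traditional machine learning methods have in achieving autonomous, closed-loop inorganic materials discovery because they are limited by their training data. The question is how to build an intelligent system that can independently carry out the full cycle from ideation to experimental validation to iterative refinement. The key to the solution is SparksMatter, a multi-agent AI model that generates ideas, designs and executes experimental workflows, continuously evaluates and refines results, and critiques and improves its own outputs, thereby automating materials design. It also identifies research gaps, suggests follow-up validation steps (such as density functional theory calculations and experimental synthesis and characterization), and presents its reasoning in a structured report, yielding materials hypotheses with greater novelty, chemical validity, and scientific rigor.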

Link: https://arxiv.org/abs/2508.02956
Authors: Alireza Ghafarollahi, Markus J. Buehler
Affiliations: Unknown
Categories: Materials Science (cond-mat.mtrl-sci); Disordered Systems and Neural Networks (cond-mat.dis-nn); Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Conventional machine learning approaches accelerate inorganic materials design via accurate property prediction and targeted material generation, yet they operate as single-shot models limited by the latent knowledge baked into their training data. A central challenge lies in creating an intelligent system capable of autonomously executing the full inorganic materials discovery cycle, from ideation and planning to experimentation and iterative refinement. We introduce SparksMatter, a multi-agent AI model for automated inorganic materials design that addresses user queries by generating ideas, designing and executing experimental workflows, continuously evaluating and refining results, and ultimately proposing candidate materials that meet the target objectives. SparksMatter also critiques and improves its own responses, identifies research gaps and limitations, and suggests rigorous follow-up validation steps, including DFT calculations and experimental synthesis and characterization, embedded in a well-structured final report. The model’s performance is evaluated across case studies in thermoelectrics, semiconductors, and perovskite oxides materials design. The results demonstrate the capacity of SparksMatter to generate novel stable inorganic structures that target the user’s needs. Benchmarking against frontier models reveals that SparksMatter consistently achieves higher scores in relevance, novelty, and scientific rigor, with a significant improvement in novelty across multiple real-world design tasks as assessed by a blinded evaluator. These results demonstrate SparksMatter’s unique capacity to generate chemically valid, physically meaningful, and creative inorganic materials hypotheses beyond existing materials knowledge.

[AI-120] Secure mmWave Beamforming with Proactive-ISAC Defense Against Beam-Stealing Attacks

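[Quick read]: This paper addresses advanced beam-stealing attacks at the physical layer of millimeter-wave (mmWave) communication systems, which pose a serious security threat. The key to the solution is a proactive defense framework based on Deep Reinforcement Learning (DRL), whose core innovation is to use Integrated Sensing and Communications (ISAC) capabilities for intelligent threat assessment. The DRL agent uses the Proximal Policy Optimization (PPO) algorithm to dynamically control ISAC probing actions that investigate suspicious activity, and an intensive curriculum learning strategy ensures the agent experiences successful detections during training, so that it learns a robust, adaptive policy that balances security and communication performance.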

Link: https://arxiv.org/abs/2508.02856
Authors: Seyed Bagher Hashemi Natanzi, Hossein Mohammadi, Bo Tang, Vuk Marojevic
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Abstract:Millimeter-wave (mmWave) communication systems face increasing susceptibility to advanced beam-stealing attacks, posing a significant physical layer security threat. This paper introduces a novel framework employing an advanced Deep Reinforcement Learning (DRL) agent for proactive and adaptive defense against these sophisticated attacks. A key innovation is leveraging Integrated Sensing and Communications (ISAC) capabilities for active, intelligent threat assessment. The DRL agent, built on a Proximal Policy Optimization (PPO) algorithm, dynamically controls ISAC probing actions to investigate suspicious activities. We introduce an intensive curriculum learning strategy that guarantees the agent experiences successful detection during training to overcome the complex exploration challenges inherent to such a security-critical task. Consequently, the agent learns a robust and adaptive policy that intelligently balances security and communication performance. Numerical results demonstrate that our framework achieves a mean attacker detection rate of 92.8% while maintaining an average user SINR of over 13 dB.
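Since the agent is built on PPO, a reminder of what PPO optimizes may help. The sketch below is the standard per-sample clipped surrogate objective from the PPO literature; it says nothing about the paper's reward design, state space, or ISAC probing actions, which the abstract does not describe. The default `clip_eps=0.2` is a common choice and an assumption here.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# ratio = new_policy_prob / old_policy_prob for the action that was taken.
for ratio, advantage in [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]:
    print(ratio, advantage, ppo_clip_objective(ratio, advantage))
```

The clipping removes any incentive to push the probability ratio beyond the interval around 1 once the change already helps, which keeps each policy update small.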

[AI-121] Extracting Range-Doppler Information of Moving Targets from Wi-Fi Channel State Information

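[Quick read]: This paper addresses the difficulty of extracting both range and Doppler information of a target from commercial Wi-Fi Channel State Information (CSI) in a monostatic (single transceiver) configuration. The main challenges are the significant phase errors caused by hardware asynchronization and the strong coupling between transmit (Tx) and receive (Rx) antennas, which overwhelms the weak motion signal. The key to the solution is a new signal processing approach built on three techniques: time offset cancellation, phase alignment correction, and Tx/Rx coupling mitigation. Without full-duplex hardware or any device modification, it achieves cm-level accuracy in range and Doppler estimation and detects and tracks moving targets in realistic environments.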

Link: https://arxiv.org/abs/2508.02799
Authors: Jessica Sanson, Rahul C. Shah, Maximilian Pinaroc, Valerio Frascolla
Affiliations: Unknown
Categories: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:This paper presents, for the first time, a method to extract both range and Doppler information from commercial Wi-Fi Channel State Information (CSI) using a monostatic (single transceiver) setup. Utilizing the CSI phase in Wi-Fi sensing from a Network Interface Card (NIC) not designed for full-duplex operation is challenging due to (1) Hardware asynchronization, which introduces significant phase errors, and (2) Proximity of transmit (Tx) and receive (Rx) antennas, which creates strong coupling that overwhelms the motion signal of interest. We propose a new signal processing approach that addresses both challenges via three key innovations: Time offset cancellation, Phase alignment correction, and Tx/Rx coupling mitigation. Our method achieves cm-level accuracy in range and Doppler estimation for moving targets, validated using a commercial Intel Wi-Fi AX211 NIC. Our results show successful detection and tracking of moving objects in realistic environments, establishing the feasibility of high-precision sensing using standard Wi-Fi packet communications and off-the-shelf hardware without requiring any modification or specialized full-duplex capabilities.
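For readers unfamiliar with range-Doppler processing, here is a minimal sketch of the textbook two-step transform on idealized data. It assumes perfectly synchronized CSI with a single reflecting path, so it deliberately omits the three corrections that are the paper's contribution. The function names and the synthetic target bins are hypothetical.

```python
import cmath

def range_doppler_map(csi):
    """csi[n][k]: packet n, subcarrier k. Returns magnitudes indexed [doppler_bin][range_bin]."""
    n_pkts, n_sub = len(csi), len(csi[0])
    # Range: inverse DFT across subcarriers (frequency -> delay) for each packet.
    delay = [[sum(csi[n][k] * cmath.exp(2j * cmath.pi * k * m / n_sub)
                  for k in range(n_sub)) / n_sub
              for m in range(n_sub)] for n in range(n_pkts)]
    # Doppler: DFT across packets (slow time -> Doppler) for each delay bin.
    return [[abs(sum(delay[n][m] * cmath.exp(-2j * cmath.pi * n * p / n_pkts)
                     for n in range(n_pkts)))
             for m in range(n_sub)] for p in range(n_pkts)]

def synth_csi(n_pkts, n_sub, range_bin, doppler_bin):
    """Idealized single-path CSI: no phase errors, no Tx/Rx coupling (an assumption)."""
    return [[cmath.exp(-2j * cmath.pi * k * range_bin / n_sub)
             * cmath.exp(2j * cmath.pi * n * doppler_bin / n_pkts)
             for k in range(n_sub)] for n in range(n_pkts)]

rd = range_doppler_map(synth_csi(8, 8, range_bin=3, doppler_bin=2))
peak = max(((p, m) for p in range(8) for m in range(8)),
           key=lambda pm: rd[pm[0]][pm[1]])
print("doppler bin, range bin:", peak)
```

With real commodity hardware, per-packet timing and phase offsets would smear this peak, which is why offset cancellation and phase alignment matter before a map like this becomes usable.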

[AI-122] CTBench: Cryptocurrency Time Series Generation Benchmark

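[Quick read]: This paper addresses the poor fit of existing Time Series Generation (TSG) methods to cryptocurrency markets. Existing methods mostly target non-financial or traditional financial domains and overlook the 24/7 trading, extreme volatility, and rapid regime shifts of crypto markets, and they lack the financial evaluations that matter for trading applications. The core of the solution is CTBench, the first TSG benchmark for the cryptocurrency domain. It curates an open-source dataset of 452 tokens and evaluates models with 13 metrics covering forecasting accuracy, rank fidelity, trading performance, risk assessment, and computational efficiency. It also introduces a dual-task evaluation framework: the Predictive Utility task measures how well synthetic data preserves temporal and cross-sectional patterns, and the Statistical Arbitrage task assesses whether generated series support mean-reverting signals. Together these allow a systematic analysis of the trade-off between statistical fidelity and real-world profitability.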

Link: https://arxiv.org/abs/2508.02758
Authors: Yihao Ang, Qiang Wang, Qiang Huang, Yifan Bao, Xinyu Xi, Anthony K. H. Tung, Chen Jin, Zhiyong Huang
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Databases (cs.DB); Machine Learning (cs.LG)
Comments: 14 pages, 14 figures, and 3 tables

Abstract:Synthetic time series are essential tools for data augmentation, stress testing, and algorithmic prototyping in quantitative finance. However, in cryptocurrency markets, characterized by 24/7 trading, extreme volatility, and rapid regime shifts, existing Time Series Generation (TSG) methods and benchmarks often fall short, jeopardizing practical utility. Most prior work (1) targets non-financial or traditional financial domains, (2) focuses narrowly on classification and forecasting while neglecting crypto-specific complexities, and (3) lacks critical financial evaluations, particularly for trading applications. To address these gaps, we introduce CTBench, the first comprehensive TSG benchmark tailored for the cryptocurrency domain. CTBench curates an open-source dataset from 452 tokens and evaluates TSG models across 13 metrics spanning 5 key dimensions: forecasting accuracy, rank fidelity, trading performance, risk assessment, and computational efficiency. A key innovation is a dual-task evaluation framework: (1) the Predictive Utility task measures how well synthetic data preserves temporal and cross-sectional patterns for forecasting, while (2) the Statistical Arbitrage task assesses whether reconstructed series support mean-reverting signals for trading. We benchmark eight representative models from five methodological families over four distinct market regimes, uncovering trade-offs between statistical fidelity and real-world profitability. Notably, CTBench offers model ranking analysis and actionable guidance for selecting and deploying TSG models in crypto analytics and strategy development.
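To make "mean-reverting signals" concrete, here is a minimal sketch of a generic rolling z-score rule. The abstract does not specify CTBench's actual trading strategy, so the rule, the window length, the entry threshold, and the function name are all assumptions for illustration.

```python
import statistics

def mean_reversion_signals(series, window=4, entry_z=1.0):
    """Return -1 (short), +1 (long), or 0 (flat) per step from a trailing z-score."""
    signals = []
    for t in range(len(series)):
        if t + 1 < window:
            signals.append(0)  # not enough history yet
            continue
        recent = series[t + 1 - window : t + 1]  # trailing window including step t
        mu = statistics.fmean(recent)
        sigma = statistics.pstdev(recent)
        z = 0.0 if sigma == 0 else (series[t] - mu) / sigma
        # Far above the recent mean: bet on a fall; far below: bet on a rise.
        signals.append(-1 if z > entry_z else (1 if z < -entry_z else 0))
    return signals

print(mean_reversion_signals([1, 1, 1, 1, 5]))
print(mean_reversion_signals([5, 5, 5, 5, 1]))
```

A generated series that is statistically faithful but loses this kind of reversion structure would produce different signals than the real series, which is the sort of gap a trading-oriented evaluation is meant to expose.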

[AI-123] Beyond the Wavefunction: Qualia Abstraction Language Mechanics and the Grammar of Awareness

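[Quick read]: This paper attempts to address the long-standing observer paradox in quantum mechanics: how to embed the subjective observer within physical theory rather than treating it as an abstract entity that intervenes from outside. Conventional quantum mechanics describes systems with state vectors in Hilbert space, a formalism that does not model the structure of first-person experience well. The key to the solution is a formal framework called the Qualia Abstraction Language (QAL), which recasts physical systems as dynamic streams of introspective units, each a structured sequence of modality, shape, and functional effect. Quantum phenomena are then reinterpreted in embodied semantic terms: superposition becomes structured ambiguity, wavefunction collapse becomes introspective contraction, and entanglement is modeled as semantic resonance across streams of qualia. By introducing a formal grammar for subjective experience, QAL unifies observer and system, replaces the exogenous projection of conventional theory, and offers a new paradigm for a post-Platonist physics grounded in self-awareness.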

Link: https://arxiv.org/abs/2508.02755
Authors: Mikołaj Sienicki,Krzysztof Sienicki
Affiliations: Unknown
Subjects: History and Philosophy of Physics (physics.hist-ph); Artificial Intelligence (cs.AI)
Comments: 65 pages, 49 references, 7 figures

Abstract:We propose a formal reconstruction of quantum mechanics grounded not in external mathematical abstractions, but in the structured dynamics of subjective experience. The Qualia Abstraction Language (QAL) models physical systems as evolving streams of introspective units, structured sequences of modality, shape, and functional effect, rather than as state vectors in Hilbert space. This approach reimagines core quantum concepts: superposition becomes a form of structured ambiguity; collapse is reframed as an introspective contraction; and entanglement is modeled as semantic resonance across streams of qualia. Drawing on insights from nominalist philosophy and oversight theoretic limits in AI, we argue that the observer paradox in quantum mechanics reflects not an ontological lacuna, but a linguistic one: the absence of a formal vocabulary for modeling first person structure. QAL introduces such a vocabulary, providing a morphodynamic framework that embeds the observer within the system and replaces abstract projection with endogenous transformation. We analyze the alignment of QAL with endophysical approaches, contrast it with standard interpretations of quantum theory, and explore its implications for a post Platonist, introspectively grounded physics.

[AI-124] A Novel cVAE-Augmented Deep Learning Framework for Pan-Cancer RNA-Seq Classification

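[Quick read]: This paper addresses overfitting and class imbalance in pan-cancer transcriptomic (RNA-Seq) classification, which arise from extremely high feature dimensionality (20,531 gene expression features) and limited sample sizes, with the goal of improving tumor subtyping and therapy selection. The key to the solution is a deep learning framework built on a class-conditional variational autoencoder (cVAE): feature selection first keeps the 500 most variably expressed genes; the cVAE then learns a latent representation of gene expression conditioned on the known cancer type and generates synthetic samples that double the training set; finally, a two-layer multilayer perceptron (MLP) classifier is trained on the augmented data. Classification accuracy rises markedly (about 98%), with the largest gains on underrepresented cancer classes.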

Link: https://arxiv.org/abs/2508.02743
Authors: Vinil Polepalli
Affiliations: Unknown
Subjects: Genomics (q-bio.GN); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Pan-cancer classification using transcriptomic (RNA-Seq) data can inform tumor subtyping and therapy selection, but is challenging due to extremely high dimensionality and limited sample sizes. In this study, we propose a novel deep learning framework that uses a class-conditional variational autoencoder (cVAE) to augment training data for pan-cancer gene expression classification. Using 801 tumor RNA-Seq samples spanning 5 cancer types from The Cancer Genome Atlas (TCGA), we first perform feature selection to reduce 20,531 gene expression features to the 500 most variably expressed genes. A cVAE is then trained on this data to learn a latent representation of gene expression conditioned on cancer type, enabling the generation of synthetic gene expression samples for each tumor class. We augment the training set with these cVAE-generated samples (doubling the dataset size) to mitigate overfitting and class imbalance. A two-layer multilayer perceptron (MLP) classifier is subsequently trained on the augmented dataset to predict tumor type. The augmented framework achieves high classification accuracy (~98%) on a held-out test set, substantially outperforming a classifier trained on the original data alone. We present detailed experimental results, including VAE training curves, classifier performance metrics (ROC curves and confusion matrix), and architecture diagrams to illustrate the approach. The results demonstrate that cVAE-based synthetic augmentation can significantly improve pan-cancer prediction performance, especially for underrepresented cancer classes.
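
The feature-selection step described in the abstract (keeping the most variably expressed genes) can be illustrated with a minimal sketch. This is not the author's code: the function names `top_variance_features` and `select_columns` and the toy matrix are hypothetical, and population variance is assumed as the ranking criterion, which the abstract does not specify.

```python
from statistics import pvariance


def top_variance_features(samples, k):
    """Return the (sorted) column indices of the k highest-variance features."""
    n_features = len(samples[0])
    variances = [pvariance([row[j] for row in samples]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])


def select_columns(samples, keep):
    """Keep only the chosen feature columns in every sample."""
    return [[row[j] for j in keep] for row in samples]


# Toy expression matrix: 4 samples x 5 genes (made-up values, not TCGA data).
expr = [
    [1.0, 5.0, 0.1, 2.0, 9.0],
    [1.1, 1.0, 0.1, 2.1, 3.0],
    [0.9, 9.0, 0.1, 1.9, 6.0],
    [1.0, 3.0, 0.1, 2.0, 0.0],
]
keep = top_variance_features(expr, k=2)
reduced = select_columns(expr, keep)
print(keep, reduced[0])
```

In the setting the abstract describes, the same idea would be applied with k=500 to the 801-sample, 20,531-gene matrix before training the cVAE.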

[AI-125] SpectrumFM: A New Paradigm for Spectrum Cognition

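[Quick read]: This paper addresses the limited generalization and insufficient accuracy of existing spectrum cognition methods across different spectrum environments and tasks. The core solution is a spectrum foundation model called SpectrumFM. Its spectrum encoder combines convolutional neural networks (CNNs) with multi-head self-attention to capture both fine-grained local structure and high-level global dependencies in spectrum data. Two self-supervised tasks, masked reconstruction and next-slot signal prediction, are used for pre-training to learn highly transferable representations, and low-rank adaptation (LoRA), a parameter-efficient fine-tuning strategy, lets the model adapt to multiple downstream tasks such as spectrum sensing (SS), anomaly detection (AD), and wireless technology classification (WTC), with marked performance gains.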

Link: https://arxiv.org/abs/2508.02742
Authors: Chunyu Liu,Hao Zhang,Wei Wu,Fuhui Zhou,Qihui Wu,Derrick Wing Kwan Ng,Chan-Byoung Chae
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: This paper has been accepted for presentation at the 2025 IEEE Global Communications Conference (GLOBECOM 2025), Cognitive Radio and AI-Enabled Network Symposium

Abstract:The enhancement of spectrum efficiency and the realization of secure spectrum utilization are critically dependent on spectrum cognition. However, existing spectrum cognition methods often exhibit limited generalization and suboptimal accuracy when deployed across diverse spectrum environments and tasks. To overcome these challenges, we propose a spectrum foundation model, termed SpectrumFM, which provides a new paradigm for spectrum cognition. An innovative spectrum encoder that exploits the convolutional neural networks and the multi-head self attention mechanisms is proposed to effectively capture both fine-grained local signal structures and high-level global dependencies in the spectrum data. To enhance its adaptability, two novel self-supervised learning tasks, namely masked reconstruction and next-slot signal prediction, are developed for pre-training SpectrumFM, enabling the model to learn rich and transferable representations. Furthermore, low-rank adaptation (LoRA) parameter-efficient fine-tuning is exploited to enable SpectrumFM to seamlessly adapt to various downstream spectrum cognition tasks, including spectrum sensing (SS), anomaly detection (AD), and wireless technology classification (WTC). Extensive experiments demonstrate the superiority of SpectrumFM over state-of-the-art methods. Specifically, it improves detection probability in the SS task by 30% at -4 dB signal-to-noise ratio (SNR), boosts the area under the curve (AUC) in the AD task by over 10%, and enhances WTC accuracy by 9.6%.
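
Low-rank adaptation, which the abstract names as the fine-tuning strategy, freezes a pretrained weight matrix W and learns a low-rank update, so the effective weight is W + (alpha / r) * B A. The sketch below shows only that arithmetic in plain Python. The dimensions, the values, and the helper names `matmul` and `lora_effective_weight` are illustrative assumptions and are not taken from SpectrumFM.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, with B of shape d_out x r and A of shape r x d_in."""
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


d_out, d_in, r = 8, 8, 2
w = [[0.0] * d_in for _ in range(d_out)]  # frozen pretrained weight (toy zeros)
a = [[1.0] * d_in for _ in range(r)]      # trainable factor A, r x d_in (toy values)
b = [[0.5] * r for _ in range(d_out)]     # trainable factor B, d_out x r (toy values)
w_eff = lora_effective_weight(w, a, b, alpha=2.0)

full_params = d_out * d_in        # parameters updated by full fine-tuning
lora_params = r * (d_in + d_out)  # parameters updated by the low-rank factors
print(full_params, lora_params)
```

The saving grows with layer size: for a fixed rank r, the trainable count scales with d_in + d_out rather than d_in * d_out.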

[AI-126] Kronos: A Foundation Model for the Language of Financial Markets

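[Quick read]: This paper addresses the limited applicability and poor performance of current Time Series Foundation Models (TSFMs) on financial candlestick (K-line) data, and their general neglect of key downstream tasks such as volatility prediction and synthetic data generation. The key to the solution is Kronos, a unified, scalable pre-training framework. It introduces a specialized tokenizer that discretizes continuous market information into token sequences while preserving both price dynamics and trade activity patterns. Autoregressive pre-training on a multi-market corpus of more than 12 billion K-line records lets the model learn fine-grained temporal dependencies and cross-asset representations, which markedly improves performance on several financial tasks in the zero-shot setting.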

Link: https://arxiv.org/abs/2508.02739
Authors: Yu Shi,Zongliang Fu,Shuo Chen,Bohan Zhao,Wei Xu,Changshui Zhang,Jian Li
Affiliations: Unknown
Subjects: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:The success of large-scale pre-training paradigm, exemplified by Large Language Models (LLMs), has inspired the development of Time Series Foundation Models (TSFMs). However, their application to financial candlestick (K-line) data remains limited, often underperforming non-pre-trained architectures. Moreover, existing TSFMs often overlook crucial downstream tasks such as volatility prediction and synthetic data generation. To address these limitations, we propose Kronos, a unified, scalable pre-training framework tailored to financial K-line modeling. Kronos introduces a specialized tokenizer that discretizes continuous market information into token sequences, preserving both price dynamics and trade activity patterns. We pre-train Kronos using an autoregressive objective on a massive, multi-market corpus of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations. Kronos excels in a zero-shot setting across a diverse set of financial tasks. On benchmark datasets, Kronos boosts price series forecasting RankIC by 93% over the leading TSFM and 87% over the best non-pre-trained baseline. It also achieves a 9% lower MAE in volatility forecasting and a 22% improvement in generative fidelity for synthetic K-line sequences. These results establish Kronos as a robust, versatile foundation model for end-to-end financial time series analysis. Our pre-trained model is publicly available at this https URL.

[AI-127] Veli: Unsupervised Method and Unified Benchmark for Low-Cost Air Quality Sensor Correction

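[Quick read]: This paper addresses inaccurate readings from low-cost sensors (LCS) in urban air quality (AQ) monitoring, caused by drift, calibration errors, and environmental interference, which limit large-scale deployment. The key to the solution is Veli (Reference-free Variational Estimation via Latent Inference), a reference-free variational estimation method. Built on a Bayesian framework, it uses variational inference to learn a disentangled representation of the LCS data that separates the true pollutant concentration from sensor noise. It corrects readings without co-located reference stations and generalizes well in both in-distribution and out-of-distribution settings.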

Link: https://arxiv.org/abs/2508.02724
Authors: Yahia Dalbah,Marcel Worring,Yen-Chia Hsu
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Main content: 7 pages, 9 Figures, 3 Tables. Appendix: 4 pages, 6 Figures

Abstract:Urban air pollution is a major health crisis causing millions of premature deaths annually, underscoring the urgent need for accurate and scalable monitoring of air quality (AQ). While low-cost sensors (LCS) offer a scalable alternative to expensive reference-grade stations, their readings are affected by drift, calibration errors, and environmental interference. To address these challenges, we introduce Veli (Reference-free Variational Estimation via Latent Inference), an unsupervised Bayesian model that leverages variational inference to correct LCS readings without requiring co-location with reference stations, eliminating a major deployment barrier. Specifically, Veli constructs a disentangled representation of the LCS readings, effectively separating the true pollutant reading from the sensor noise. To build our model and address the lack of standardized benchmarks in AQ monitoring, we also introduce the Air Quality Sensor Data Repository (AQ-SDR). AQ-SDR is the largest AQ sensor benchmark to date, with readings from 23,737 LCS and reference stations across multiple regions. Veli demonstrates strong generalization across both in-distribution and out-of-distribution settings, effectively handling sensor drift and erratic sensor behavior. Code for model and dataset will be made public when this paper is published.

[AI-128] SleepLiteCNN: Energy-Efficient Sleep Apnea Subtype Classification with 1-Second Resolution Using Single-Lead ECG

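[Quick read]: This paper addresses the difficulty of detecting sleep apnea subtypes (obstructive, central, and mixed) in real time at high temporal resolution on wearable devices, in support of precise treatment and management. The key to the solution is SleepLiteCNN, a lightweight, low-power convolutional neural network that classifies 1-second windows of single-lead electrocardiogram (ECG) signal. After 8-bit quantization it needs only 1.8 microjoules per inference while keeping over 95% accuracy and a 92% macro-F1 score. Field Programmable Gate Array (FPGA) synthesis confirms a significant reduction in hardware resource usage, showing that the model suits continuous real-time monitoring in resource-constrained environments.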

Link: https://arxiv.org/abs/2508.02718
Authors: Zahra Mohammadi,Siamak Mohammadi
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI)
Comments:

Abstract:Apnea is a common sleep disorder characterized by breathing interruptions lasting at least ten seconds and occurring more than five times per hour. Accurate, high-temporal-resolution detection of sleep apnea subtypes - Obstructive, Central, and Mixed - is crucial for effective treatment and management. This paper presents an energy-efficient method for classifying these subtypes using a single-lead electrocardiogram (ECG) with high temporal resolution to address the real-time needs of wearable devices. We evaluate a wide range of classical machine learning algorithms and deep learning architectures on 1-second ECG windows, comparing their accuracy, complexity, and energy consumption. Based on this analysis, we introduce SleepLiteCNN, a compact and energy-efficient convolutional neural network specifically designed for wearable platforms. SleepLiteCNN achieves over 95% accuracy and a 92% macro-F1 score, while requiring just 1.8 microjoules per inference after 8-bit quantization. Field Programmable Gate Array (FPGA) synthesis further demonstrates significant reductions in hardware resource usage, confirming its suitability for continuous, real-time monitoring in energy-constrained environments. These results establish SleepLiteCNN as a practical and effective solution for wearable device sleep apnea subtype detection.
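
The 8-bit quantization mentioned in the abstract can be sketched as follows, assuming symmetric per-tensor scaling; the abstract does not say which quantization scheme the authors use. The function names `quantize_int8` and `dequantize` and the weight values are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.50, -1.27, 0.03, 1.00]  # toy values, not from SleepLiteCNN
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale step, which is the kind of trade-off that makes low-energy inference possible.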

[AI-129] Evaluation of Deep Learning Models for LBBB Classification in ECG Signals

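[Quick read]: This paper examines how well different neural network architectures extract spatial and temporal features from electrocardiographic (ECG) signals and classify them into healthy subjects, Left Bundle Branch Block (LBBB), and Strict Left Bundle Branch Block (sLBBB). The key to the solution is end-to-end feature learning on ECG signals with deep learning models, which improves the identification of LBBB patients, supports the selection of candidates for Cardiac Resynchronization Therapy (CRT), and makes clinical decisions better grounded and more efficient.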

Link: https://arxiv.org/abs/2508.02710
Authors: Beatriz Macas Ordóñez,Diego Vinicio Orellana Villavicencio,José Manuel Ferrández,Paula Bonomini
Affiliations: Unknown
Subjects: Signal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted for presentation in the 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2025)

Abstract:This study explores different neural network architectures to evaluate their ability to extract spatial and temporal patterns from electrocardiographic (ECG) signals and classify them into three groups: healthy subjects, Left Bundle Branch Block (LBBB), and Strict Left Bundle Branch Block (sLBBB). Clinical relevance: innovative technologies enable the selection of candidates for Cardiac Resynchronization Therapy (CRT) by optimizing the classification of subjects with Left Bundle Branch Block (LBBB).

Machine Learning

[LG-0] PAC Apprenticeship Learning with Bayesian Active Inverse Reinforcement Learning

Link: https://arxiv.org/abs/2508.03693
Authors: Ondrej Bajgar,Dewi S.W. Gould,Jonathon Liu,Alessandro Abate,Konstantinos Gatsis,Michael A. Osborne
Subjects: Machine Learning (cs.LG)
Comments: Published at RLC 2025

Abstract:As AI systems become increasingly autonomous, reliably aligning their decision-making to human preferences is essential. Inverse reinforcement learning (IRL) offers a promising approach to infer preferences from demonstrations. These preferences can then be used to produce an apprentice policy that performs well on the demonstrated task. However, in domains like autonomous driving or robotics, where errors can have serious consequences, we need not just good average performance but reliable policies with formal guarantees – yet obtaining sufficient human demonstrations for reliability guarantees can be costly. Active IRL addresses this challenge by strategically selecting the most informative scenarios for human demonstration. We introduce PAC-EIG, an information-theoretic acquisition function that directly targets probably-approximately-correct (PAC) guarantees for the learned policy – providing the first such theoretical guarantee for active IRL with noisy expert demonstrations. Our method maximises information gain about the regret of the apprentice policy, efficiently identifying states requiring further demonstration. We also present Reward-EIG as an alternative when learning the reward itself is the primary objective. Focusing on finite state-action spaces, we prove convergence bounds, illustrate failure modes of prior heuristic methods, and demonstrate our method’s advantages experimentally.

[LG-1] No LLM Solved Yu Tsumura's 554th Problem

Link: https://arxiv.org/abs/2508.03685
Authors: Simon Frieder,William Hart
Subjects: Machine Learning (cs.LG)
Comments: 67 pages

Abstract:We show, contrary to the optimism about LLMs’ problem-solving abilities, fueled by the recent gold medals that were attained, that a problem exists – Yu Tsumura’s 554th problem – that a) is within the scope of an IMO problem in terms of proof sophistication, b) is not a combinatorics problem, a category that has caused issues for LLMs, c) requires fewer proof techniques than typical hard IMO problems, d) has a publicly available solution (likely in the training data of LLMs), and e) cannot be readily solved by any existing off-the-shelf LLM (commercial or open-source).

[LG-2] What If, But Privately: Private Counterfactual Retrieval

Link: https://arxiv.org/abs/2508.03681
Authors: Shreya Meel,Mohamed Nomeir,Pasan Dissanayake,Sanghamitra Dutta,Sennur Ulukus
Subjects: Information Theory (cs.IT); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Signal Processing (eess.SP)
Comments: arXiv admin note: text overlap with arXiv:2410.13812, arXiv:2411.10429

Abstract:Transparency and explainability are two important aspects to be considered when employing black-box machine learning models in high-stakes applications. Providing counterfactual explanations is one way of meeting this requirement. However, this also poses a threat to the privacy of the institution that is providing the explanation, as well as the user who is requesting it. In this work, we are primarily concerned with the privacy of a user who wants to retrieve a counterfactual instance without revealing their feature vector to the institution. Our framework retrieves the exact nearest neighbor counterfactual explanation from a database of accepted points while achieving perfect, information-theoretic privacy for the user. First, we introduce the problem of private counterfactual retrieval (PCR) and propose a baseline PCR scheme that keeps the user’s feature vector information-theoretically private from the institution. Building on this, we propose two other schemes that reduce the amount of information leaked about the institution database to the user, compared to the baseline scheme. Second, we relax the assumption of mutability of all features, and consider the setting of immutable PCR (I-PCR). Here, the user retrieves the nearest counterfactual without altering a private subset of their features, which constitutes the immutable set, while keeping their feature vector and immutable set private from the institution. For this, we propose two schemes that preserve the user’s privacy information-theoretically, but ensure varying degrees of database privacy. Third, we extend our PCR and I-PCR schemes to incorporate the user’s preference on transforming their attributes, so that a more actionable explanation can be received. Finally, we present numerical results to support our theoretical findings, and compare the database leakage of the proposed schemes.

[LG-3] Streaming Generated Gaussian Process Experts for Online Learning and Control

Link: https://arxiv.org/abs/2508.03679
Authors: Zewen Yang,Dongfa Zhang,Xiaobing Dai,Fengyi Yu,Chi Zhang,Bingkun Huang,Hamid Sadeghian,Sami Haddadin
Subjects: Machine Learning (cs.LG); Systems and Control (eess.SY); Machine Learning (stat.ML)
Comments:

Abstract:Gaussian Processes (GPs), as a nonparametric learning method, offer flexible modeling capabilities and calibrated uncertainty quantification for function approximations. Additionally, GPs support online learning by efficiently incorporating new data with polynomial-time computation, making them well-suited for safety-critical dynamical systems that require rapid adaptation. However, the inference and online updates of exact GPs, when processing streaming data, incur cubic computation time and quadratic storage memory complexity, limiting their scalability to large datasets in real-time settings. In this paper, we propose a streaming kernel-induced progressively generated expert framework of Gaussian processes (SkyGP) that addresses both computational and memory constraints by maintaining a bounded set of experts, while inheriting the learning performance guarantees from exact Gaussian processes. Furthermore, two SkyGP variants are introduced, each tailored to a specific objective, either maximizing prediction accuracy (SkyGP-Dense) or improving computational efficiency (SkyGP-Fast). The effectiveness of SkyGP is validated through extensive benchmarks and real-time control experiments demonstrating its superior performance compared to state-of-the-art approaches.

[LG-4] MaLV-OS: Rethinking the Operating System Architecture for Machine Learning in Virtualized Clouds

Link: https://arxiv.org/abs/2508.03676
Authors: Stella Bitchebe,Oana Balmau
Subjects: Operating Systems (cs.OS); Machine Learning (cs.LG)
Comments:

Abstract:A large body of research has employed Machine Learning (ML) models to develop learned operating systems (OSes) and kernels. Such kernels adapt dynamically to the job load and adjust the allocation of resources (CPU, IO, memory, network bandwidth) to meet actual user demand. What this work has in common is that it utilizes ML to improve kernel decisions. To this day, and to the best of our knowledge, no work has taken the opposite direction, i.e., using the OS to improve ML. While some work proposes applying system-level optimizations to ML algorithms, it does not tailor the OS to adapt to the ML context. To address this limitation, we take an orthogonal approach in this paper by leveraging the OS to enhance the performance of ML models and algorithms. We explore the path towards an ML-specialized OS, MaLV-OS. MaLV-OS rethinks the OS architecture to make it specifically tailored to ML workloads, especially in virtualized clouds, which are now widely used to run ML applications. The envisioned MaLV-OS architecture includes (1) a micro-kernel, Micro-LAKE, which allows kernel space applications to use the GPU, and (2) an MLaaS (ML as a Service) subsystem that gathers ML models to help Micro-LAKE with memory management and CPU scheduling. The MaLV-OS architecture also offloads system-sensitive parts of the models to the OS, to lighten the model's complexity and programming and to speed up its execution. Finally, MaLV-OS integrates open-source GPU virtualization software, merged directly into the hypervisor. For more flexibility, the MaLV-OS vision is to enable the virtual machine to dynamically select MLaaS policies that can improve the performance of the model the user is running. Because MLaaS is designed as loadable kernel modules, the MaLV-OS architecture enables the dynamic addition of new capabilities to the MLaaS subsystem.

[LG-5] Morphlux: Programmable chip-to-chip photonic fabrics in multi-accelerator servers for ML

Link: https://arxiv.org/abs/2508.03674
Authors: Abhishek Vijaya Kumar,Eric Ding,Arjun Devraj,Rachee Singh
Subjects: Networking and Internet Architecture (cs.NI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
Comments:

Abstract:We optically interconnect accelerator chips (e.g., GPUs, TPUs) within compute servers using newly viable programmable chip-to-chip photonic fabrics. In contrast, today's commercial multi-accelerator compute servers, the workhorses of ML, use electrical interconnects to network the accelerator chips within a server. However, recent trends have shown an interconnect bandwidth wall caused by accelerator FLOPS scaling at a faster rate than the bandwidth of the interconnect between accelerators in the same server. This has led to under-utilization and idling of GPU resources in cloud datacenters. We develop Morphlux, a server-scale programmable photonic fabric, to interconnect accelerators within servers. We show that augmenting state-of-the-art photonic ML-centric datacenters with Morphlux can improve the bandwidth of tenant compute allocations by up to 66% and reduce compute fragmentation by up to 70%. We develop a novel end-to-end hardware prototype of Morphlux to demonstrate these performance benefits, which translate to 1.72x improvement in training throughput of ML models. By rapidly programming the server-scale fabric in our hardware testbed, Morphlux can logically replace a failed accelerator chip in 1.2 seconds.

[LG-6] Personalized Recommendation of Dish and Restaurant Collections on iFood KDD2025 KDD ATC

Link: https://arxiv.org/abs/2508.03670
Authors: Fernando F. Granado,Davi A. Bezerra,Iuri Queiroz,Nathan Oliveira,Pedro Fernandes,Bruno Schock
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: Workshop on Two-sided Marketplace Optimization: Search, Discovery, Matching, Pricing Growth in conjunction with KDD Conference (KDD 2025) in Toronto, Canada

Abstract:Food delivery platforms face the challenge of helping users navigate vast catalogs of restaurants and dishes to find meals they truly enjoy. This paper presents RED, an automated recommendation system designed for iFood, Latin America’s largest on-demand food delivery platform, to personalize the selection of curated food collections displayed to millions of users. Our approach employs a LightGBM classifier that scores collections based on three feature groups: collection characteristics, user-collection similarity, and contextual information. To address the cold-start problem of recommending newly created collections, we develop content-based representations using item embeddings and implement monotonicity constraints to improve generalization. We tackle data scarcity by bootstrapping from category carousel interactions and address visibility bias through unbiased sampling of impressions and purchases in production. The system demonstrates significant real-world impact through extensive A/B testing with 5-10% of iFood’s user base. Online results of our A/B tests add up to 97% improvement in Card Conversion Rate and 1.4% increase in overall App Conversion Rate compared to popularity-based baselines. Notably, our offline accuracy metrics strongly correlate with online performance, enabling reliable impact prediction before deployment. To our knowledge, this is the first work to detail large-scale recommendation of curated food collections in a dynamic commercial environment.

[LG-7] Efficient Morphology-Aware Policy Transfer to New Embodiments

Link: https://arxiv.org/abs/2508.03660
Authors: Michael Przystupa,Hongyao Tang,Martin Jagersand,Santiago Miret,Mariano Phielipp,Matthew E. Taylor,Glen Berseth
Subjects: Machine Learning (cs.LG)
Comments: 19 pages, 10 Figures, Published at the 2025 Reinforcement Learning Conference

Abstract:Morphology-aware policy learning is a means of enhancing policy sample efficiency by aggregating data from multiple agents. These types of policies have previously been shown to help generalize over dynamic, kinematic, and limb configuration variations between agent morphologies. Unfortunately, these policies still have sub-optimal zero-shot performance compared to end-to-end finetuning on morphologies at deployment. This limitation has ramifications in practical applications such as robotics because further data collection to perform end-to-end finetuning can be computationally expensive. In this work, we investigate combining morphology-aware pretraining with parameter efficient finetuning (PEFT) techniques to help reduce the learnable parameters necessary to specialize a morphology-aware policy to a target embodiment. We compare directly tuning subsets of model weights, input learnable adapters, and prefix tuning techniques for online finetuning. Our analysis reveals that PEFT techniques in conjunction with policy pre-training generally help reduce the number of samples necessary to improve a policy compared to training models end-to-end from scratch. We further find that tuning fewer than 1% of total parameters will improve policy performance compared to the zero-shot performance of the base pretrained policy.

[LG-8] Cross-patient Seizure Onset Zone Classification by Patient-Dependent Weight

Link: https://arxiv.org/abs/2508.03635
Authors: Xuyang Zhao,Hidenori Sugano,Toshihisa Tanaka
Subjects: Machine Learning (cs.LG)
Comments:

Abstract:Identifying the seizure onset zone (SOZ) in patients with focal epilepsy is essential for surgical treatment and remains challenging due to its dependence on visual judgment by clinical experts. The development of machine learning can assist in diagnosis and has made promising progress. However, unlike data in other fields, medical data is usually collected from individual patients, and each patient has different illnesses, physical conditions, and medical histories, which leads to differences in the distribution of each patient’s data. This makes it difficult for a machine learning model to achieve consistently reliable performance in every new patient dataset, which we refer to as the “cross-patient problem.” In this paper, we propose a method to fine-tune a pretrained model using patient-specific weights for every new test patient to improve diagnostic performance. First, the supervised learning method is used to train a machine learning model. Next, using the intermediate features of the trained model obtained through the test patient data, the similarity between the test patient data and each training patient’s data is defined to determine the weight of each training patient to be used in the following fine-tuning. Finally, we fine-tune all parameters in the pretrained model with training data and patient weights. In the experiment, the leave-one-patient-out method is used to evaluate the proposed method, and the results show improved classification accuracy for every test patient, with an average improvement of more than 10%.
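
The weighting step in the abstract (training patients weighted by how similar their intermediate features are to the test patient's) can be sketched as below. The abstract does not define the similarity measure, so this sketch assumes cosine similarity between mean feature vectors followed by a softmax. The patient IDs, feature values, and function names are all hypothetical.

```python
import math


def mean_vector(rows):
    """Average a list of feature vectors component-wise."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def patient_weights(train_features, test_features, temperature=1.0):
    """Softmax over each training patient's similarity to the test patient."""
    test_mean = mean_vector(test_features)
    sims = {pid: cosine(mean_vector(f), test_mean) for pid, f in train_features.items()}
    exps = {pid: math.exp(s / temperature) for pid, s in sims.items()}
    total = sum(exps.values())
    return {pid: e / total for pid, e in exps.items()}


# Toy intermediate features (2-D) for two training patients and one test patient.
train = {
    "patient_A": [[1.0, 0.0], [0.9, 0.1]],
    "patient_B": [[0.0, 1.0], [0.1, 0.9]],
}
test = [[0.8, 0.2], [1.0, 0.0]]
weights = patient_weights(train, test)
print(weights)
```

Weights of this kind could then scale each training patient's contribution to the fine-tuning loss, so that patients resembling the test patient count for more.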

[LG-9] Pair Correlation Factor and the Sample Complexity of Gaussian Mixtures

Link: https://arxiv.org/abs/2508.03633
Authors: Farzad Aryan
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 21 pages, no figures
MSC classes: 62H30, 68T05, 62F12, 68Q32
ACM classes: I.2.6; G.3

Abstract:We study the problem of learning Gaussian Mixture Models (GMMs) and ask: which structural properties govern their sample complexity? Prior work has largely tied this complexity to the minimum pairwise separation between components, but we demonstrate this view is incomplete. We introduce the Pair Correlation Factor (PCF), a geometric quantity capturing the clustering of component means. Unlike the minimum gap, the PCF more accurately dictates the difficulty of parameter recovery. In the uniform spherical case, we give an algorithm with improved sample complexity bounds, showing when more than the usual $\epsilon^{-2}$ samples are necessary.

[LG-10] Minimal Convolutional RNNs Accelerate Spatiotemporal Learning ICANN2025

Link: https://arxiv.org/abs/2508.03614
Authors: Coşku Can Horuz,Sebastian Otte,Martin V. Butz,Matthias Karlbauer
Subjects: Machine Learning (cs.LG); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at ICANN 2025

Abstract:We introduce MinConvLSTM and MinConvGRU, two novel spatiotemporal models that combine the spatial inductive biases of convolutional recurrent networks with the training efficiency of minimal, parallelizable RNNs. Our approach extends the log-domain prefix-sum formulation of MinLSTM and MinGRU to convolutional architectures, enabling fully parallel training while retaining localized spatial modeling. This eliminates the need for sequential hidden state updates during teacher forcing - a major bottleneck in conventional ConvRNN models. In addition, we incorporate an exponential gating mechanism inspired by the xLSTM architecture into the MinConvLSTM, which further simplifies the log-domain computation. Our models are structurally minimal and computationally efficient, with reduced parameter count and improved scalability. We evaluate our models on two spatiotemporal forecasting tasks: Navier-Stokes dynamics and real-world geopotential data. In terms of training speed, our architectures significantly outperform standard ConvLSTMs and ConvGRUs. Moreover, our models also achieve lower prediction errors in both domains, even in closed-loop autoregressive mode. These findings demonstrate that minimal recurrent structures, when combined with convolutional input aggregation, offer a compelling and efficient alternative for spatiotemporal sequence modeling, bridging the gap between recurrent simplicity and spatial complexity.
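
The idea behind replacing sequential hidden-state updates with a prefix-sum can be illustrated on a scalar linear recurrence h_t = a_t * h_{t-1} + b_t. That recurrence shape is an assumption made here for illustration; the abstract does not spell out the model equations. The abstract also refers to a log-domain formulation, whereas this sketch stays in the linear domain for readability, so it assumes every a_t is nonzero and the running products do not underflow. Function names are hypothetical.

```python
from itertools import accumulate
import operator


def sequential_scan(a, b, h0=0.0):
    """Reference implementation: update the hidden state one step at a time."""
    h, out = h0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def prefix_scan(a, b, h0=0.0):
    """Closed form h_t = A_t * (h0 + sum_{k<=t} b_k / A_k), with A_t = prod_{i<=t} a_i."""
    cum_a = list(accumulate(a, operator.mul))  # prefix products A_t
    cum_terms = list(accumulate(b_k / a_k for b_k, a_k in zip(b, cum_a)))  # prefix sums
    return [a_t * (h0 + s_t) for a_t, s_t in zip(cum_a, cum_terms)]


a = [0.9, 0.5, 0.8, 0.7]  # toy decay coefficients
b = [0.1, 0.5, 0.2, 0.3]  # toy input contributions
seq = sequential_scan(a, b)
par = prefix_scan(a, b)
print(seq, par)
```

Both prefix operations are associative scans, so on parallel hardware they can be evaluated in logarithmic depth rather than one step per time index; `accumulate` runs them serially here only to check that the two forms agree.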

[LG-11] On the (In)Significance of Feature Selection in High-Dimensional Datasets

Link: https://arxiv.org/abs/2508.03593
Authors: Bhavesh Neekhra, Debayan Gupta, Partha Pratim Chakravarti
Subjects: Machine Learning (cs.LG); Genomics (q-bio.GN); Machine Learning (stat.ML)
Comments: Submitted to Nature Computational Science (double-blind review in progress). Supplementary material included in PDF; anonymized code at: this https URL

Abstract:Extensive research has been done on feature selection (FS) algorithms for high-dimensional datasets, aiming to improve model performance, reduce computational cost, and identify features of interest. We test the null hypothesis of using randomly selected features to compare against features selected by FS algorithms to validate the performance of the latter. Our results show that FS on high-dimensional datasets (in particular gene expression) in classification tasks is not useful. We find that (1) models trained on small subsets (0.02%-1% of all features) of randomly selected features almost always perform comparably to those trained on all features, and (2) a “typical”-sized random subset provides comparable or superior performance to that of top-k features selected in various published studies. Thus, our work challenges many feature selection results on high-dimensional datasets, particularly in computational genomics. It raises serious concerns about studies that propose drug design or targeted interventions based on computationally selected genes, without further validation in a wet lab.

[LG-12] SolarSeer: Ultrafast and accurate 24-hour solar irradiance forecasts outperforming numerical weather prediction across the USA

Link: https://arxiv.org/abs/2508.03590
Authors: Mingliang Bai, Zuliang Fang, Shengyu Tao, Siqi Xiang, Jiang Bian, Yanfei Xiang, Pengcheng Zhao, Weixin Jin, Jonathan A. Weyn, Haiyu Dong, Bin Zhang, Hongyu Sun, Kit Thambiratnam, Qi Zhang, Hongbin Sun, Xuan Zhang, Qiuwei Wu
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE)

Abstract:Accurate 24-hour solar irradiance forecasting is essential for the safe and economic operation of solar photovoltaic systems. Traditional numerical weather prediction (NWP) models represent the state-of-the-art in forecasting performance but rely on computationally costly data assimilation and on solving complicated partial differential equations (PDEs) that simulate atmospheric physics. Here, we introduce SolarSeer, an end-to-end large artificial intelligence (AI) model for solar irradiance forecasting across the Contiguous United States (CONUS). SolarSeer is designed to directly map the historical satellite observations to future forecasts, eliminating the computational overhead of data assimilation and PDE solving. This efficiency allows SolarSeer to operate over 1,500 times faster than traditional NWP, generating 24-hour cloud cover and solar irradiance forecasts for the CONUS at 5-kilometer resolution in under 3 seconds. Compared with the state-of-the-art NWP in the CONUS, i.e., High-Resolution Rapid Refresh (HRRR), SolarSeer significantly reduces the root mean squared error of solar irradiance forecasting by 27.28% in reanalysis data and 15.35% across 1,800 stations. SolarSeer also effectively captures solar irradiance fluctuations and significantly enhances the first-order irradiance difference forecasting accuracy. SolarSeer’s ultrafast, accurate 24-hour solar irradiance forecasts provide strong support for the transition to sustainable, net-zero energy systems.

[LG-13] VITA: Variational Pretraining of Transformers for Climate-Robust Crop Yield Forecasting

Link: https://arxiv.org/abs/2508.03589
Authors: Adib Hasan, Mardavij Roozbehani, Munther Dahleh
Subjects: Machine Learning (cs.LG)

Abstract:Accurate crop yield forecasting is essential for global food security. However, current AI models systematically underperform when yields deviate from historical trends. This issue arises from key data challenges, including a major asymmetry between rich pretraining weather datasets and the limited data available for fine-tuning. We introduce VITA (Variational Inference Transformer for Asymmetric data), a variational pretraining framework that addresses this asymmetry. Instead of relying on input reconstruction, VITA uses detailed weather variables as proxy targets during pretraining and learns to predict rich atmospheric states through self-supervised feature masking. This allows the model to be fine-tuned using only basic weather statistics during deployment. Applied to 763 counties in the U.S. Corn Belt, VITA achieves state-of-the-art performance in predicting corn and soybean yields across all evaluation scenarios. While it consistently delivers superior performance under normal conditions, its advantages are particularly pronounced during extreme weather years, with statistically significant improvements (paired t-test, $p \approx 0.01$). Importantly, VITA outperforms prior frameworks like GNN-RNN using less data, making it more practical for real-world use, particularly in data-scarce regions. This work highlights how domain-aware AI design can overcome data limitations and support resilient agricultural forecasting in a changing climate.

[LG-14] Zero-Variance Gradients for Variational Autoencoders

Link: https://arxiv.org/abs/2508.03587
Authors: Zilei Shao, Anji Liu, Guy Van den Broeck
Subjects: Machine Learning (cs.LG)

Abstract:Training deep generative models like Variational Autoencoders (VAEs) is often hindered by the need to backpropagate gradients through the stochastic sampling of their latent variables, a process that inherently introduces estimation variance, which can slow convergence and degrade performance. In this paper, we propose a new perspective that sidesteps this problem, which we call Silent Gradients. Instead of improving stochastic estimators, we leverage specific decoder architectures to analytically compute the expected ELBO, yielding a gradient with zero variance. We first provide a theoretical foundation for this method and demonstrate its superiority over existing estimators in a controlled setting with a linear decoder. To generalize our approach for practical use with complex, expressive decoders, we introduce a novel training dynamic that uses the exact, zero-variance gradient to guide the early stages of encoder training before annealing to a standard stochastic estimator. Our experiments show that this technique consistently improves the performance of established baselines, including reparameterization, Gumbel-Softmax, and REINFORCE, across multiple datasets. This work opens a new direction for training generative models by combining the stability of analytical computation with the expressiveness of deep, nonlinear architecture.

[LG-15] Heterogeneity-Oblivious Robust Federated Learning

Link: https://arxiv.org/abs/2508.03579
Authors: Weiyao Zhang, Jinyang Li, Qi Song, Miao Wang, Chungang Lin, Haitong Luo, Xuying Meng, Yujun Zhang
Subjects: Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Under review

Abstract:Federated Learning (FL) remains highly vulnerable to poisoning attacks, especially under real-world hyper-heterogeneity, where clients differ significantly in data distributions, communication capabilities, and model architectures. Such heterogeneity not only undermines the effectiveness of aggregation strategies but also makes attacks more difficult to detect. Furthermore, high-dimensional models expand the attack surface. To address these challenges, we propose Horus, a heterogeneity-oblivious robust FL framework centered on low-rank adaptations (LoRAs). Rather than aggregating full model parameters, Horus inserts LoRAs into empirically stable layers and aggregates only LoRAs to reduce the attack surface. We uncover a key empirical observation that the input projection (LoRA-A) is markedly more stable than the output projection (LoRA-B) under heterogeneity and poisoning. Leveraging this, we design a Heterogeneity-Oblivious Poisoning Score using the features from LoRA-A to filter poisoned clients. For the remaining benign clients, we propose a projection-aware aggregation mechanism that reweights client updates by their consistency with the global directions, preserving collaborative signals while suppressing drifts. Extensive experiments across diverse datasets, model architectures, and attacks demonstrate that Horus consistently outperforms state-of-the-art baselines in both robustness and accuracy.

[LG-16] VRPRM: Process Reward Modeling via Visual Reasoning

Link: https://arxiv.org/abs/2508.03556
Authors: Xinquan Chen, Bangwei Liu, Xuhong Wang
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

Abstract:Process Reward Models (PRMs) are widely used in the post-training of Large Language Models (LLMs) because they can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. On the other hand, although a few works have tried to introduce Chain-of-Thought capability into PRMs, the annotation cost of CoT-PRM data is too high for it to play a stable role in various tasks. To address the above challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that using only 3.6K CoT-PRM SFT data and 50K non-CoT PRM RL training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K and achieve a relative performance improvement of up to 118% over the base model in the BoN experiment. This result confirms that the proposed combined training strategy can achieve higher quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.

[LG-17] Vision-based Perception System for Automated Delivery Robot-Pedestrians Interactions

Link: https://arxiv.org/abs/2508.03541
Authors: Ergi Tushe, Bilal Farooq
Subjects: Robotics (cs.RO); Machine Learning (cs.LG)

Abstract:The integration of Automated Delivery Robots (ADRs) into pedestrian-heavy urban spaces introduces unique challenges in terms of safe, efficient, and socially acceptable navigation. We develop the complete pipeline for a single vision sensor based multi-pedestrian detection and tracking, pose estimation, and monocular depth perception. Leveraging the real-world MOT17 dataset sequences, this study demonstrates how integrating human-pose estimation and depth cues enhances pedestrian trajectory prediction and identity maintenance, even under occlusions and dense crowds. Results show measurable improvements, including up to a 10% increase in identity preservation (IDF1), a 7% improvement in multiobject tracking accuracy (MOTA), and consistently high detection precision exceeding 85%, even in challenging scenarios. Notably, the system identifies vulnerable pedestrian groups supporting more socially aware and inclusive robot behaviour.
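For readers unfamiliar with the two tracking metrics quoted in the abstract, the sketch below writes out their standard definitions: MOTA aggregates misses, false positives, and identity switches relative to the number of ground-truth objects, while IDF1 is the F1 score over identity-matched detections. The function names and the counts in the usage lines are illustrative assumptions, not figures from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multi-object tracking accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """Identity F1: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("no identity matches or errors to score")
    return 2 * idtp / denom

# Made-up counts for a single sequence, purely to show the call shape.
example_mota = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

MOTA can become negative when the total error count exceeds the number of ground-truth objects, whereas IDF1 always lies between 0 and 1.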

[LG-18] Parameter-Efficient Single Collaborative Branch for Recommendation

Link: https://arxiv.org/abs/2508.03518
Authors: Marta Moscati, Shah Nawaz, Markus Schedl
Subjects: Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages

Abstract:Recommender Systems (RS) often rely on representations of users and items in a joint embedding space and on a similarity metric to compute relevance scores. In modern RS, the modules to obtain user and item representations consist of two distinct and separate neural networks (NN). In multimodal representation learning, weight sharing has been proven effective in reducing the distance between multiple modalities of the same item. Inspired by these approaches, we propose a novel RS that leverages weight sharing between the user and item NN modules used to obtain the latent representations in the shared embedding space. The proposed framework consists of a single Collaborative Branch for Recommendation (CoBraR). We evaluate CoBraR by means of quantitative experiments on e-commerce and movie recommendation. Our experiments show that by reducing the number of parameters and improving beyond-accuracy aspects without compromising accuracy, CoBraR has the potential to be applied and extended for real-world scenarios.

[LG-19] SLA-MORL: SLA-Aware Multi-Objective Reinforcement Learning for HPC Resource Optimization

Link: https://arxiv.org/abs/2508.03509
Authors: Seraj Al Mahmud Mostafa, Aravind Mohan, Jianwu Wang
Subjects: Machine Learning (cs.LG)

Abstract:Dynamic resource allocation for machine learning workloads in cloud environments remains challenging due to competing objectives of minimizing training time and operational costs while meeting Service Level Agreement (SLA) constraints. Traditional approaches employ static resource allocation or single-objective optimization, leading to either SLA violations or resource waste. We present SLA-MORL, an adaptive multi-objective reinforcement learning framework that intelligently allocates GPU and CPU resources based on user-defined preferences (time, cost, or balanced) while ensuring SLA compliance. Our approach introduces two key innovations: (1) intelligent initialization through historical learning or efficient baseline runs that eliminates cold-start problems, reducing initial exploration overhead by 60%, and (2) dynamic weight adaptation that automatically adjusts optimization priorities based on real-time SLA violation severity, creating a self-correcting system. SLA-MORL constructs a 21-dimensional state representation capturing resource utilization, training progress, and SLA compliance, enabling an actor-critic network to make informed allocation decisions across 9 possible actions. Extensive evaluation on 13 diverse ML workloads using production HPC infrastructure demonstrates that SLA-MORL achieves 67.2% reduction in training time for deadline-critical jobs, 68.8% reduction in costs for budget-constrained workloads, and 73.4% improvement in overall SLA compliance compared to static baselines. By addressing both cold-start inefficiency and dynamic adaptation challenges, SLA-MORL provides a practical solution for cloud resource management that balances performance, cost, and reliability in modern ML training environments.

[LG-20] An Auditable Agent Platform For Automated Molecular Optimisation

Link: https://arxiv.org/abs/2508.03444
Authors: Atabey Ünlü, Phil Rohr, Ahmet Celebi
Subjects: Machine Learning (cs.LG)

Abstract:Drug discovery frequently loses momentum when data, expertise, and tools are scattered, slowing design cycles. To shorten this loop, we built a hierarchical, tool-using agent framework that automates molecular optimisation. A Principal Researcher defines each objective, a Database agent retrieves target information, an AI Expert generates de novo scaffolds with a sequence-to-molecule deep learning model, a Medicinal Chemist edits them while invoking a docking tool, a Ranking agent scores the candidates, and a Scientific Critic polices the logic. Each tool call is summarised and stored, so the full reasoning path remains inspectable. The agents communicate through concise provenance records that capture molecular lineage, to build auditable, molecule-centered reasoning trajectories and reuse successful transformations via in-context learning. Three-cycle research loops were run against the AKT1 protein using five large language models. After ranking the models by mean docking score, we ran 20 independent scale-ups on the two top performers. We then compared the leading LLMs’ binding affinity results across three configurations: LLM-only, single-agent, and multi-agent. Our results reveal an architectural trade-off: the multi-agent setting excelled at focused binding optimization, improving average predicted binding affinity by 31%. In contrast, single-agent runs generated molecules with superior drug-like properties at the cost of less potent binding scores. Unguided LLM runs finished fastest, yet their lack of transparent tool signals left the validity of their reasoning paths unverified. These results show that test-time scaling, focused feedback loops, and provenance convert general-purpose LLMs into auditable systems for molecular design, and suggest that extending the toolset to ADMET and selectivity predictors could push research workflows further along the discovery pipeline.

[LG-21] AI on the Pulse: Real-Time Health Anomaly Detection with Wearable and Ambient Intelligence

Link: https://arxiv.org/abs/2508.03436
Authors: Davide Gabrielli, Bardh Prenkaj, Paola Velardi, Stefano Faralli
Subjects: Machine Learning (cs.LG)

Abstract:We introduce AI on the Pulse, a real-world-ready anomaly detection system that continuously monitors patients using a fusion of wearable sensors, ambient intelligence, and advanced AI models. Powered by UniTS, a state-of-the-art (SoTA) universal time-series model, our framework autonomously learns each patient’s unique physiological and behavioral patterns, detecting subtle deviations that signal potential health risks. Unlike classification methods that require impractical, continuous labeling in real-world scenarios, our approach uses anomaly detection to provide real-time, personalized alerts for reactive home-care interventions. Our approach outperforms 12 SoTA anomaly detection methods, demonstrating robustness across both high-fidelity medical devices (ECG) and consumer wearables, with a ~22% improvement in F1 score. However, the true impact of AI on the Pulse lies in @HOME, where it has been successfully deployed for continuous, real-world patient monitoring. By operating with non-invasive, lightweight devices like smartwatches, our system proves that high-quality health monitoring is possible without clinical-grade equipment. Beyond detection, we enhance interpretability by integrating LLMs, translating anomaly scores into clinically meaningful insights for healthcare professionals.

[LG-22] Residual Neural Terminal Constraint for MPC-based Collision Avoidance in Dynamic Environments

Link: https://arxiv.org/abs/2508.03428
Authors: Bojan Derajić, Mohamed-Khalil Bouzidi, Sebastian Bernhard, Wolfgang Hönig
Subjects: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:In this paper, we propose a hybrid MPC local planner that uses a learning-based approximation of a time-varying safe set, derived from local observations and applied as the MPC terminal constraint. This set can be represented as a zero-superlevel set of the value function computed via Hamilton-Jacobi (HJ) reachability analysis, which is infeasible in real-time. We exploit the property that the HJ value function can be expressed as a difference of the corresponding signed distance function (SDF) and a non-negative residual function. The residual component is modeled as a neural network with non-negative output and subtracted from the computed SDF, resulting in a real-time value function estimate that is at least as safe as the SDF by design. Additionally, we parametrize the neural residual by a hypernetwork to improve real-time performance and generalization properties. The proposed method is compared with three state-of-the-art methods in simulations and hardware experiments, achieving up to 30% higher success rates compared to the best baseline while requiring a similar computational effort and producing high-quality (low travel-time) solutions.
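The decomposition described in the abstract (value estimate = signed distance function minus a non-negative residual) can be illustrated with a few lines of code. The sketch below is a toy under stated assumptions, not the paper's model: the obstacle is a single static circle, `raw_residual` stands in for the unconstrained output of a neural network, and softplus is one possible way to force that output to be non-negative. All names are hypothetical.

```python
import math

def sdf_circle(x, y, cx, cy, radius):
    """Signed distance to a circular obstacle: positive outside, negative inside."""
    return math.hypot(x - cx, y - cy) - radius

def softplus(z):
    """Numerically stable log(1 + exp(z)); always strictly positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def value_estimate(x, y, raw_residual, obstacle=(0.0, 0.0, 1.0)):
    """Value estimate V = SDF - residual, with residual >= 0 by construction.

    Because the residual cannot be negative, V never exceeds the SDF, so the
    zero-superlevel set of V is contained in that of the SDF.
    """
    cx, cy, radius = obstacle
    return sdf_circle(x, y, cx, cy, radius) - softplus(raw_residual)

# A state 2 m from the obstacle centre, with an arbitrary stand-in network output.
v = value_estimate(2.0, 0.0, raw_residual=-1.5)
```

The containment property holds for any value of `raw_residual`, which is the sense in which such an estimate is "at least as safe as the SDF by design".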

[LG-23] A neural network machine-learning approach for characterising hydrogen trapping parameters from TDS experiments

Link: https://arxiv.org/abs/2508.03371
Authors: N. Marrani, T. Hageman, E. Martínez-Pañeda
Subjects: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Chemical Physics (physics.chem-ph)

Abstract:The hydrogen trapping behaviour of metallic alloys is generally characterised using Thermal Desorption Spectroscopy (TDS). However, as an indirect method, extracting key parameters (trap binding energies and densities) remains a significant challenge. To address these limitations, this work introduces a machine learning-based scheme for parameter identification from TDS spectra. A multi-Neural Network (NN) model is developed and trained exclusively on synthetic data to predict trapping parameters directly from experimental data. The model comprises two multi-layer, fully connected, feed-forward NNs trained with backpropagation. The first network (classification model) predicts the number of distinct trap types. The second network (regression model) then predicts the corresponding trap densities and binding energies. The NN architectures, hyperparameters, and data pre-processing were optimised to minimise the amount of training data. The proposed model demonstrated strong predictive capabilities when applied to three tempered martensitic steels of different compositions. The code developed is freely provided.

[LG-24] Software Fairness Dilemma: Is Bias Mitigation a Zero-Sum Game?

Link: https://arxiv.org/abs/2508.03323
Authors: Zhenpeng Chen, Xinyue Li, Jie M. Zhang, Weisong Sun, Ying Xiao, Tianlin Li, Yiling Lou, Yang Liu
Subjects: Machine Learning (cs.LG)
Comments: Accepted by the ACM International Conference on the Foundations of Software Engineering (FSE 2025)

Abstract:Fairness is a critical requirement for Machine Learning (ML) software, driving the development of numerous bias mitigation methods. Previous research has identified a leveling-down effect in bias mitigation for computer vision and natural language processing tasks, where fairness is achieved by lowering performance for all groups without benefiting the unprivileged group. However, it remains unclear whether this effect applies to bias mitigation for tabular data tasks, a key area in fairness research with significant real-world applications. This study evaluates eight bias mitigation methods for tabular data, including both widely used and cutting-edge approaches, across 44 tasks using five real-world datasets and four common ML models. Contrary to earlier findings, our results show that these methods operate in a zero-sum fashion, where improvements for unprivileged groups are related to reduced benefits for traditionally privileged groups. However, previous research indicates that the perception of a zero-sum trade-off might complicate the broader adoption of fairness policies. To explore alternatives, we investigate an approach that applies the state-of-the-art bias mitigation method solely to unprivileged groups, showing potential to enhance benefits of unprivileged groups without negatively affecting privileged groups or overall ML performance. Our study highlights potential pathways for achieving fairness improvements without zero-sum trade-offs, which could help advance the adoption of bias mitigation methods.

[LG-25] Bridging ocean wave physics and deep learning: Physics-informed neural operators for nonlinear wavefield reconstruction in real-time

Link: https://arxiv.org/abs/2508.03315
Authors: Svenja Ehlers, Merten Stender, Norbert Hoffmann
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 7 figures

Abstract:Accurate real-time prediction of phase-resolved ocean wave fields remains a critical yet largely unsolved problem, primarily due to the absence of practical data assimilation methods for reconstructing initial conditions from sparse or indirect wave measurements. While recent advances in supervised deep learning have shown potential for this purpose, they require large labelled datasets of ground truth wave data, which are infeasible to obtain in real-world scenarios. To overcome this limitation, we propose a Physics-Informed Neural Operator (PINO) framework for reconstructing spatially and temporally phase-resolved, nonlinear ocean wave fields from sparse measurements, without the need for ground truth data during training. This is achieved by embedding residuals of the free surface boundary conditions of ocean gravity waves into the loss function of the PINO, constraining the solution space in a soft manner. After training, we validate our approach using highly realistic synthetic wave data and demonstrate the accurate reconstruction of nonlinear wave fields from both buoy time series and radar snapshots. Our results indicate that PINOs enable accurate, real-time reconstruction and generalize robustly across a wide range of wave conditions, thereby paving the way for operational, data-driven wave reconstruction and prediction in realistic marine environments.

[LG-26] Strategic Hypothesis Testing

Link: https://arxiv.org/abs/2508.03289
Authors: Safwan Hossain, Yatong Chen, Yiling Chen
Subjects: Machine Learning (cs.LG)

Abstract:We examine hypothesis testing within a principal-agent framework, where a strategic agent, holding private beliefs about the effectiveness of a product, submits data to a principal who decides on approval. The principal employs a hypothesis testing rule, aiming to pick a p-value threshold that balances false positives and false negatives while anticipating the agent’s incentive to maximize expected profitability. Building on prior work, we develop a game-theoretic model that captures how the agent’s participation and reporting behavior respond to the principal’s statistical decision rule. Despite the complexity of the interaction, we show that the principal’s errors exhibit clear monotonic behavior when segmented by an efficiently computable critical p-value threshold, leading to an interpretable characterization of their optimal p-value threshold. We empirically validate our model and these insights using publicly available data on drug approvals. Overall, our work offers a comprehensive perspective on strategic interactions within the hypothesis testing framework, providing technical and regulatory insights.

[LG-27] Online Continual Graph Learning

Link: https://arxiv.org/abs/2508.03283
Authors: Giovanni Donghi, Luca Pasa, Daniele Zambon, Cesare Alippi, Nicolò Navarin
Subjects: Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:The aim of Continual Learning (CL) is to learn new tasks incrementally while avoiding catastrophic forgetting. Online Continual Learning (OCL) specifically focuses on learning efficiently from a continuous stream of data with shifting distribution. While recent studies explore Continual Learning on graphs exploiting Graph Neural Networks (GNNs), only a few of them focus on a streaming setting. Yet, many real-world graphs evolve over time, often requiring timely and online predictions. Current approaches, however, are not well aligned with the standard OCL setting, partly due to the lack of a clear definition of online Continual Learning on graphs. In this work, we propose a general formulation for online Continual Learning on graphs, emphasizing the efficiency requirements on batch processing over the graph topology, and providing a well-defined setting for systematic model evaluation. Finally, we introduce a set of benchmarks and report the performance of several methods in the CL literature, adapted to our setting.

[LG-28] The alpha-beta divergence for real and complex data

Link: https://arxiv.org/abs/2508.03272
Authors: Sergio Cruces
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Machine Learning (stat.ML)

Abstract:Divergences are fundamental to the information criteria that underpin most signal processing algorithms. The alpha-beta family of divergences, designed for non-negative data, offers a versatile framework that parameterizes and continuously interpolates several separable divergences found in existing literature. This work extends the definition of alpha-beta divergences to accommodate complex data, specifically when the arguments of the divergence are complex vectors. This novel formulation is designed in such a way that, by setting the divergence hyperparameters to unity, it particularizes to the well-known Euclidean and Mahalanobis squared distances. Other choices of hyperparameters yield practical separable and non-separable extensions of several classical divergences. In the context of the problem of approximating a complex random vector, the centroid obtained by optimizing the alpha-beta mean distortion has a closed-form expression, whose interpretation sheds light on the distinct roles of the divergence hyperparameters. These contributions may have wide potential applicability, as there are many signal processing domains in which the underlying data are inherently complex.
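As background for the abstract, the sketch below implements the classical alpha-beta divergence for positive real vectors as it appears in the earlier literature on this family, restricted to the generic case where alpha, beta, and alpha + beta are all nonzero. It does not implement the complex-valued extension proposed in the paper. With both hyperparameters set to one it reduces to half the squared Euclidean distance, which matches the "set to unity" remark in the abstract up to the constant factor of this particular normalization. The function name is an assumption.

```python
def ab_divergence(p, q, alpha, beta):
    """Classical alpha-beta divergence between positive real vectors.

    D = -(1 / (alpha * beta)) * sum_i ( p_i^alpha * q_i^beta
          - alpha / (alpha + beta) * p_i^(alpha + beta)
          - beta  / (alpha + beta) * q_i^(alpha + beta) )
    The limit cases (alpha, beta, or alpha + beta equal to zero) are
    deliberately left out of this sketch.
    """
    if alpha == 0 or beta == 0 or alpha + beta == 0:
        raise ValueError("limit cases are not handled in this sketch")
    s = alpha + beta
    total = 0.0
    for p_i, q_i in zip(p, q):
        total += (p_i ** alpha * q_i ** beta
                  - alpha / s * p_i ** s
                  - beta / s * q_i ** s)
    return -total / (alpha * beta)

# With alpha = beta = 1 the value equals 0.5 * ||p - q||^2.
p = [1.0, 2.0, 3.0]
q = [2.0, 2.0, 1.0]
d = ab_divergence(p, q, alpha=1.0, beta=1.0)
```

Other hyperparameter choices interpolate between further members of the family, which is the parameterization the abstract refers to.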

[LG-29] Towards Interpretable Concept Learning over Time Series via Temporal Logic Semantics

Link: https://arxiv.org/abs/2508.03269
Authors: Irene Ferfoglia, Simone Silvetti, Gaia Saveri, Laura Nenzi, Luca Bortolussi
Subjects: Machine Learning (cs.LG)

Abstract:Time series classification is a task of paramount importance, as this kind of data often arises in safety-critical applications. However, it is typically tackled with black-box deep learning methods, making it hard for humans to understand the rationale behind their output. To take on this challenge, we propose a neuro-symbolic framework that unifies classification and explanation through direct embedding of trajectories into a space of Signal Temporal Logic (STL) concepts. By introducing a novel STL-inspired kernel that maps raw time series to their alignment with predefined STL formulae, our model jointly optimises for accuracy and interpretability, as each prediction is accompanied by the most relevant logical concepts that characterise it. This enables classification grounded in human-interpretable temporal patterns and produces both local and global symbolic explanations. Early results show competitive performance while offering high-quality logical justifications for model decisions.

[LG-30] HALO: Hindsight-Augmented Learning for Online Auto-Bidding

Link: https://arxiv.org/abs/2508.03267
Authors: Pusen Dong, Chenglong Cao, Xinyu Zhou, Jirong You, Linhe Xu, Feifan Xu, Shuo Yuan
Subjects: Machine Learning (cs.LG)
Comments: 13 pages, 5 figures

Abstract:Digital advertising platforms operate millisecond-level auctions through Real-Time Bidding (RTB) systems, where advertisers compete for ad impressions through algorithmic bids. This dynamic mechanism enables precise audience targeting but introduces profound operational complexity due to advertiser heterogeneity: budgets and ROI targets span orders of magnitude across advertisers, from individual merchants to multinational brands. This diversity creates a demanding adaptation landscape for Multi-Constraint Bidding (MCB). Traditional auto-bidding solutions fail in this environment due to two critical flaws: 1) severe sample inefficiency, where failed explorations under specific constraints yield no transferable knowledge for new budget-ROI combinations, and 2) limited generalization under constraint shifts, as they ignore physical relationships between constraints and bidding coefficients. To address this, we propose HALO: Hindsight-Augmented Learning for Online Auto-Bidding. HALO introduces a theoretically grounded hindsight mechanism that repurposes all explorations into training data for arbitrary constraint configuration via trajectory reorientation. Further, it employs B-spline functional representation, enabling continuous, derivative-aware bid mapping across constraint spaces. HALO ensures robust adaptation even when budget/ROI requirements differ drastically from training scenarios. Industrial dataset evaluations demonstrate the superiority of HALO in handling multi-scale constraints, reducing constraint violations while improving GMV.

[LG-31] On Conformal Machine Unlearning

Link: https://arxiv.org/abs/2508.03245
Authors: Yahya Alkhatib, Wee Peng Tay
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:The increasing demand for data privacy, driven by regulations such as GDPR and CCPA, has made Machine Unlearning (MU) essential for removing the influence of specific training samples from machine learning models while preserving performance on retained data. However, most existing MU methods lack rigorous statistical guarantees, rely on heuristic metrics, and often require computationally expensive retraining baselines. To overcome these limitations, we introduce a new definition for MU based on Conformal Prediction (CP), providing statistically sound, uncertainty-aware guarantees without the need for the concept of naive retraining. We formalize conformal criteria that quantify how often forgotten samples are excluded from CP sets, and propose empirical metrics, the Efficiently Covered Frequency (ECF at c) and its complement, the Efficiently Uncovered Frequency (EuCF at d), to measure the effectiveness of unlearning. We further present a practical unlearning method designed to optimize these conformal metrics. Extensive experiments across diverse forgetting scenarios, datasets, and models demonstrate the efficacy of our approach in removing targeted data.
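To make the notion of "how often forgotten samples are excluded from CP sets" concrete, the sketch below shows generic split conformal prediction for classification and a plain coverage fraction over a list of samples. It does not reproduce the paper's ECF or EuCF metrics, whose exact definitions are not given in the abstract. The nonconformity score (one minus the predicted probability of a label), the function names, and the toy numbers are all assumptions.

```python
import math

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return math.inf  # too few calibration points: every label is included
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p(label) is at most the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= qhat}

def covered_fraction(samples, qhat):
    """Fraction of (probs, true_label) pairs whose true label is in the set."""
    hits = sum(true in prediction_set(probs, qhat) for probs, true in samples)
    return hits / len(samples)

# Toy calibration scores and two hypothetical "forgotten" samples.
qhat = conformal_quantile([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], alpha=0.5)
forgotten = [({"a": 0.7, "b": 0.3}, "a"), ({"a": 0.7, "b": 0.3}, "b")]
fraction_still_covered = covered_fraction(forgotten, qhat)
```

Under this reading, a successfully unlearned model would drive the covered fraction on forgotten samples down while keeping coverage on retained data near the nominal level.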

[LG-32] Revisiting Deep Information Propagation: Fractal Frontier and Finite-size Effects

Link: https://arxiv.org/abs/2508.03222
Authors: Giuseppe Alessio D’Inverno, Zhiyuan Hu, Leo Davy, Michael Unser, Gianluigi Rozza, Jonathan Dong
Categories: Machine Learning (cs.LG)
Notes: 17 pages

Abstract:Information propagation characterizes how input correlations evolve across layers in deep neural networks. This framework has been well studied using mean-field theory, which assumes infinitely wide networks. However, these assumptions break down for practical, finite-size networks. In this work, we study information propagation in randomly initialized neural networks with finite width and reveal that the boundary between ordered and chaotic regimes exhibits a fractal structure. This shows the fundamental complexity of neural network dynamics, in a setting that is independent of input data and optimization. To extend this analysis beyond multilayer perceptrons, we leverage recently introduced Fourier-based structured transforms, and show that information propagation in convolutional neural networks also follows the same behavior. Our investigation highlights the importance of finite network depth with respect to the tradeoff between separation and robustness.

[LG-33] Convergence of Deterministic and Stochastic Diffusion-Model Samplers: A Simple Analysis in Wasserstein Distance

Link: https://arxiv.org/abs/2508.03210
Authors: Eliot Beyler (SIERRA), Francis Bach (SIERRA)
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:We provide new convergence guarantees in Wasserstein distance for diffusion-based generative models, covering both stochastic (DDPM-like) and deterministic (DDIM-like) sampling methods. We introduce a simple framework to analyze discretization, initialization, and score estimation errors. Notably, we derive the first Wasserstein convergence bound for the Heun sampler and improve existing results for the Euler sampler of the probability flow ODE. Our analysis emphasizes the importance of spatial regularity of the learned score function and argues for controlling the score error with respect to the true reverse process, in line with denoising score matching. We also incorporate recent results on smoothed Wasserstein distances to sharpen initialization error bounds.
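
As a rough illustration of the two deterministic samplers named in the abstract, the sketch below integrates a probability flow ODE with Euler and Heun steps and compares both to the exact solution. The setting is a toy assumption, not the paper's: one-dimensional Gaussian data with standard deviation `S0`, noise level `sigma(t) = t`, and the exact score in place of a learned one, so only discretization error is visible.

```python
import math

S0 = 1.0   # data standard deviation (toy assumption)
T = 10.0   # starting noise level (toy assumption)

def score(x, t):
    """Exact score of p_t = N(0, S0^2 + t^2) for Gaussian data, sigma(t) = t."""
    return -x / (S0 ** 2 + t ** 2)

def drift(x, t):
    """Probability flow ODE right-hand side: dx/dt = -t * score(x, t)."""
    return -t * score(x, t)

def euler_sampler(x, steps):
    """First-order integration from t = T down to t = 0."""
    h = -T / steps
    for i in range(steps):
        t = T + i * h
        x = x + h * drift(x, t)
    return x

def heun_sampler(x, steps):
    """Second-order integration: Euler predictor plus trapezoidal corrector."""
    h = -T / steps
    for i in range(steps):
        t = T + i * h
        d1 = drift(x, t)
        x_pred = x + h * d1
        d2 = drift(x_pred, t + h)
        x = x + 0.5 * h * (d1 + d2)
    return x

def exact_solution(x):
    """Closed-form ODE solution at t = 0 for this Gaussian toy model."""
    return x * S0 / math.sqrt(S0 ** 2 + T ** 2)

x_start = 3.0
target = exact_solution(x_start)
err_euler = abs(euler_sampler(x_start, 50) - target)
err_heun = abs(heun_sampler(x_start, 50) - target)
print(f"Euler error: {err_euler:.2e}, Heun error: {err_heun:.2e}")
```

With the same number of steps, the Heun sampler lands closer to the exact solution here, at the cost of two drift evaluations per step.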

[LG-34] Scaling DRL for Decision Making: A Survey on Data Network and Training Budget Strategies

Link: https://arxiv.org/abs/2508.03194
Authors: Yi Ma, Hongyao Tang, Chenjun Xiao, Yaodong Yang, Wei Wei, Jianye Hao, Jiye Liang
Categories: Machine Learning (cs.LG)

Abstract:In recent years, the expansion of neural network models and training data has driven remarkable progress in deep learning, particularly in computer vision and natural language processing. This advancement is underpinned by the concept of Scaling Laws, which demonstrates that scaling model parameters and training data enhances learning performance. While these fields have witnessed breakthroughs, such as the development of large language models like GPT-4 and advanced vision models like Midjourney, the application of scaling laws in deep reinforcement learning (DRL) remains relatively unexplored. Despite its potential to improve performance, the integration of scaling laws into DRL for decision making has not been fully realized. This review addresses this gap by systematically analyzing scaling strategies in three dimensions: data, network, and training budget. In data scaling, we explore methods to optimize data efficiency through parallel sampling and data generation, examining the relationship between data volume and learning outcomes. For network scaling, we investigate architectural enhancements, including monolithic expansions, ensemble and MoE methods, and agent number scaling techniques, which collectively enhance model expressivity while posing unique computational challenges. Lastly, in training budget scaling, we evaluate the impact of distributed training, high replay ratios, large batch sizes, and auxiliary training on training efficiency and convergence. By synthesizing these strategies, this review not only highlights their synergistic roles in advancing DRL for decision making but also provides a roadmap for future research. We emphasize the importance of balancing scalability with computational efficiency and outline promising directions for leveraging scaling to unlock the full potential of DRL in various tasks such as robot control, autonomous driving and LLM training.

[LG-35] Adaptive Sparse Softmax: An Effective and Efficient Softmax Variant

Link: https://arxiv.org/abs/2508.03175
Authors: Qi Lv, Lei Geng, Ziqiang Cao, Min Cao, Sujian Li, Wenjie Li, Guohong Fu
Categories: Machine Learning (cs.LG)
Notes: Accepted by IEEE TASLP (Early accept version)

Abstract:Softmax with the cross entropy loss is the standard configuration for current neural classification models. The gold score for a target class is supposed to be 1, but it is never reachable under the softmax schema. Such a problem makes the training process continue forever and leads to overfitting. Moreover, the “target-approach-1” training goal forces the model to continuously learn all samples, leading to a waste of time in handling some samples which have already been classified correctly with high confidence, while the test goal simply requires the target class of each sample to hold the maximum score. To solve the above weaknesses, we propose the Adaptive Sparse Softmax (AS-Softmax), which designs a reasonable and test-matching transformation on top of softmax. For more purposeful learning, we discard the classes with far smaller scores compared with the actual class during training. Then the model could focus on learning to distinguish the target class from its strong opponents, which is also the main challenge at test time. In addition, since the training losses of easy samples will gradually drop to 0 in AS-Softmax, we develop an adaptive gradient accumulation strategy based on the masked sample ratio to speed up training. We verify the proposed AS-Softmax on a variety of text multi-class, text multi-label, text token classification, image classification and audio classification tasks with class sizes ranging from 5 to 5000+. The results show that AS-Softmax consistently outperforms softmax and its variants, and the loss of AS-Softmax is remarkably correlated with classification performance in validation. Furthermore, the adaptive gradient accumulation strategy brings about a 1.2x training speedup compared with the standard softmax while maintaining classification effectiveness.
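
The masking idea in the abstract can be sketched as follows. This is one plausible reading, not the authors' exact formulation: the rule "drop class j when p_target - p_j >= delta" and the parameter name `delta` are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_softmax_loss(logits, target, delta):
    """Cross entropy over the target and its 'strong opponents' only.

    A class j is discarded when the target's probability exceeds p_j by at
    least delta (assumed rule). Returns (loss, indices of kept classes).
    """
    probs = softmax(logits)
    keep = [
        j for j, p in enumerate(probs)
        if j == target or probs[target] - p < delta
    ]
    m = max(logits[j] for j in keep)
    log_norm = m + math.log(sum(math.exp(logits[j] - m) for j in keep))
    return log_norm - logits[target], keep

# Easy sample: the target dominates, every opponent is masked, loss is zero.
loss_easy, keep_easy = sparse_softmax_loss([4.0, 1.0, 0.0], target=0, delta=0.5)

# Hard sample: class 1 is a strong opponent and stays in the normalizer.
loss_hard, keep_hard = sparse_softmax_loss([1.0, 0.9, -3.0], target=0, delta=0.5)
print(loss_easy, keep_easy, loss_hard, keep_hard)
```

Once all opponents of a sample are masked, its loss and gradient vanish, which is the behaviour the abstract's adaptive gradient accumulation strategy builds on.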

[LG-36] Quantum Spectral Reasoning: A Non-Neural Architecture for Interpretable Machine Learning

Link: https://arxiv.org/abs/2508.03170
Authors: Andrew Kiruluta
Categories: Machine Learning (cs.LG)

Abstract:We propose a novel machine learning architecture that departs from conventional neural network paradigms by leveraging quantum spectral methods, specifically Pade approximants and the Lanczos algorithm, for interpretable signal analysis and symbolic reasoning. The core innovation of our approach lies in its ability to transform raw time-domain signals into sparse, physically meaningful spectral representations without the use of backpropagation, high-dimensional embeddings, or data-intensive black-box models. Through rational spectral approximation, the system extracts resonant structures that are then mapped into symbolic predicates via a kernel projection function, enabling logical inference through a rule-based reasoning engine. This architecture bridges mathematical physics, sparse approximation theory, and symbolic artificial intelligence, offering a transparent and physically grounded alternative to deep learning models. We develop the full mathematical formalism underlying each stage of the pipeline, provide a modular algorithmic implementation, and demonstrate the system’s effectiveness through comparative evaluations on time-series anomaly detection, symbolic classification, and hybrid reasoning tasks. Our results show that this spectral-symbolic architecture achieves competitive accuracy while maintaining interpretability and data efficiency, suggesting a promising new direction for physically-informed, reasoning-capable machine learning.

[LG-37] Overcoming Algorithm Aversion with Transparency: Can Transparent Predictions Change User Behavior?

Link: https://arxiv.org/abs/2508.03168
Authors: Lasse Bohlen, Sven Kruschel, Julian Rosenberger, Patrick Zschech, Mathias Kraus
Categories: Machine Learning (cs.LG)
Notes: Accepted at 20th International Conference on Wirtschaftsinformatik (WI25); September 2025, Münster, Germany

Abstract:Previous work has shown that allowing users to adjust a machine learning (ML) model’s predictions can reduce aversion to imperfect algorithmic decisions. However, these results were obtained in situations where users had no information about the model’s reasoning. Thus, it remains unclear whether interpretable ML models could further reduce algorithm aversion or even render adjustability obsolete. In this paper, we conceptually replicate a well-known study that examines the effect of adjustable predictions on algorithm aversion and extend it by introducing an interpretable ML model that visually reveals its decision logic. Through a pre-registered user study with 280 participants, we investigate how transparency interacts with adjustability in reducing aversion to algorithmic decision-making. Our results replicate the adjustability effect, showing that allowing users to modify algorithmic predictions mitigates aversion. Transparency’s impact appears smaller than expected and was not significant for our sample. Furthermore, the effects of transparency and adjustability appear to be more independent than expected.

[LG-38] MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction INTERSPEECH2025

Link: https://arxiv.org/abs/2508.03166
Authors: Mohammed Salah Al-Radhi, Géza Németh, Branislav Gerazov
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes: 5 pages, 2 figures, 1 table. Accepted for presentation at Interspeech 2025

Abstract:Speech synthesis from intracranial EEG (iEEG) signals offers a promising avenue for restoring communication in individuals with severe speech impairments. However, achieving intelligible and natural speech remains challenging due to limitations in feature representation, prosody modeling, and phase reconstruction. We introduce MiSTR, a deep-learning framework that integrates: 1) Wavelet-based feature extraction to capture fine-grained temporal, spectral, and neurophysiological representations of iEEG signals, 2) A Transformer-based decoder for prosody-aware spectrogram prediction, and 3) A neural phase vocoder enforcing harmonic consistency via adaptive spectral correction. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms, improving over existing neural speech synthesis baselines.

[LG-39] Rethinking Selectivity in State Space Models: A Minimal Predictive Sufficiency Approach AAAI’26

Link: https://arxiv.org/abs/2508.03158
Authors: Yiyi Wang, Jian’an Zhang, Hongyi Duan, Haoyang Liu, Qingyang Li
Categories: Machine Learning (cs.LG); Information Theory (cs.IT)
Notes: Submitted to AAAI’26

Abstract:State Space Models (SSMs), particularly recent selective variants like Mamba, have emerged as a leading architecture for sequence modeling, challenging the dominance of Transformers. However, the success of these state-of-the-art models largely relies on heuristically designed selective mechanisms, which lack a rigorous first-principle derivation. This theoretical gap raises questions about their optimality and robustness against spurious correlations. To address this, we introduce the Principle of Predictive Sufficiency, a novel information-theoretic criterion stipulating that an ideal hidden state should be a minimal sufficient statistic of the past for predicting the future. Based on this principle, we propose the Minimal Predictive Sufficiency State Space Model (MPS-SSM), a new framework where the selective mechanism is guided by optimizing an objective function derived from our principle. This approach encourages the model to maximally compress historical information without losing predictive power, thereby learning to ignore non-causal noise and spurious patterns. Extensive experiments on a wide range of benchmark datasets demonstrate that MPS-SSM not only achieves state-of-the-art performance, significantly outperforming existing models in long-term forecasting and noisy scenarios, but also exhibits superior robustness. Furthermore, we show that the MPS principle can be extended as a general regularization framework to enhance other popular architectures, highlighting its broad potential.

[LG-40] Unveiling Location-Specific Price Drivers: A Two-Stage Cluster Analysis for Interpretable House Price Predictions

Link: https://arxiv.org/abs/2508.03156
Authors: Paul Gümmer, Julian Rosenberger, Mathias Kraus, Patrick Zschech, Nico Hambauer
Categories: Machine Learning (cs.LG)
Notes: Accepted at 20th International Conference on Wirtschaftsinformatik (WI25); September 2025, Münster, Germany

Abstract:House price valuation remains challenging due to localized market variations. Existing approaches often rely on black-box machine learning models, which lack interpretability, or simplistic methods like linear regression (LR), which fail to capture market heterogeneity. To address this, we propose a machine learning approach that applies two-stage clustering, first grouping properties based on minimal location-based features before incorporating additional features. Each cluster is then modeled using either LR or a generalized additive model (GAM), balancing predictive performance with interpretability. Constructing and evaluating our models on 43,309 German house property listings from 2023, we achieve a 36% improvement for the GAM and 58% for LR in mean absolute error compared to models without clustering. Additionally, graphical analyses unveil pattern shifts between clusters. These findings emphasize the importance of cluster-specific insights, enhancing interpretability and offering practical value for buyers, sellers, and real estate analysts seeking more reliable property valuations.

[LG-41] RegMean++: Enhancing Effectiveness and Generalization of Regression Mean for Model Merging

Link: https://arxiv.org/abs/2508.03121
Authors: The-Hai Nguyen, Dang Huu-Tien, Takeshi Suzuki, Le-Minh Nguyen
Categories: Machine Learning (cs.LG)
Notes: 17 pages, 11 figures, 11 tables

Abstract:Regression Mean (RegMean), an approach that formulates model merging as a linear regression problem, aims to find the optimal weights for each linear layer in the merge model by minimizing the discrepancy in predictions between the merge and candidate models. RegMean provides a precise closed-form solution for the merging problem; therefore, it offers explainability and computational efficiency. However, RegMean merges each linear layer independently, overlooking how the features and information in the earlier layers propagate through the layers and influence the final prediction in the merge model. In this paper, we introduce RegMean++, a simple yet effective alternative to RegMean, that explicitly incorporates both intra- and cross-layer dependencies between merge models’ layers into RegMean’s objective. By accounting for these dependencies, RegMean++ better captures the behaviors of the merge model. Extensive experiments demonstrate that RegMean++ consistently outperforms RegMean across diverse settings, including in-domain (ID) and out-of-domain (OOD) generalization, sequential merging, large-scale tasks, and robustness under several types of distribution shifts. Furthermore, RegMean++ achieves competitive or state-of-the-art performance compared to various recent advanced model merging methods. Our code is available at this https URL.

[LG-42] Accelerating SGDM via Learning Rate and Batch Size Schedules: A Lyapunov-Based Analysis

Link: https://arxiv.org/abs/2508.03105
Authors: Yuichi Kondo, Hideaki Iiduka
Categories: Machine Learning (cs.LG)

Abstract:We analyze the convergence behavior of stochastic gradient descent with momentum (SGDM) under dynamic learning rate and batch size schedules by introducing a novel Lyapunov function. This Lyapunov function has a simpler structure compared with existing ones, facilitating the challenging convergence analysis of SGDM and a unified analysis across various dynamic schedules. Specifically, we extend the theoretical framework to cover three practical scheduling strategies commonly used in deep learning: (i) constant batch size with a decaying learning rate, (ii) increasing batch size with a decaying learning rate, and (iii) increasing batch size with an increasing learning rate. Our theoretical results reveal a clear hierarchy in convergence behavior: while (i) does not guarantee convergence of the expected gradient norm, both (ii) and (iii) do. Moreover, (iii) achieves a provably faster decay rate than (i) and (ii), demonstrating theoretical acceleration even in the presence of momentum. Empirical results validate our theory, showing that dynamically scheduled SGDM significantly outperforms fixed-hyperparameter baselines in convergence speed. We also evaluated a warm-up schedule in experiments, which empirically outperformed all other strategies in convergence behavior. These findings provide a unified theoretical foundation and practical guidance for designing efficient and stable training procedures in modern deep learning.

[LG-43] Achieving Limited Adaptivity for Multinomial Logistic Bandits

Link: https://arxiv.org/abs/2508.03072
Authors: Sukruta Prakash Midigeshi, Tanmay Goyal, Gaurav Sinha
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes: Accepted to RLC 2025

Abstract:Multinomial Logistic Bandits have recently attracted much attention due to their ability to model problems with multiple outcomes. In this setting, each decision is associated with many possible outcomes, modeled using a multinomial logit function. Several recent works on multinomial logistic bandits have simultaneously achieved optimal regret and computational efficiency. However, motivated by real-world challenges and practicality, there is a need to develop algorithms with limited adaptivity, wherein we are allowed only M policy updates. To address these challenges, we present two algorithms, B-MNL-CB and RS-MNL, that operate in the batched and rarely-switching paradigms, respectively. The batched setting involves choosing the M policy update rounds at the start of the algorithm, while the rarely-switching setting can choose these M policy update rounds in an adaptive fashion. Our first algorithm, B-MNL-CB, extends the notion of distributional optimal designs to the multinomial setting and achieves $\tilde{O}(\sqrt{T})$ regret, assuming the contexts are generated stochastically, when presented with $\Omega(\log \log T)$ update rounds. Our second algorithm, RS-MNL, works with adversarially generated contexts and can achieve $\tilde{O}(\sqrt{T})$ regret with $\tilde{O}(\log T)$ policy updates. Further, we conducted experiments that demonstrate that our algorithms (with a fixed number of policy updates) are extremely competitive with (and often better than) several state-of-the-art baselines (which update their policy every round), showcasing the applicability of our algorithms in various practical scenarios.

[LG-44] TF-MLPNet: Tiny Real-Time Neural Speech Separation

Link: https://arxiv.org/abs/2508.03047
Authors: Malek Itani, Tuochao Chen, Shyamnath Gollakota
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes: The 6th Clarity Workshop on Improving Speech-in-Noise for Hearing Devices (Clarity 2025)

Abstract:Speech separation on hearable devices can enable transformative augmented and enhanced hearing capabilities. However, state-of-the-art speech separation networks cannot run in real-time on tiny, low-power neural accelerators designed for hearables, due to their limited compute capabilities. We present TF-MLPNet, the first speech separation network capable of running in real-time on such low-power accelerators while outperforming existing streaming models for blind speech separation and target speech extraction. Our network operates in the time-frequency domain, processing frequency sequences with stacks of fully connected layers that alternate along the channel and frequency dimensions, and independently processing the time sequence at each frequency bin using convolutional layers. Results show that our mixed-precision quantization-aware trained (QAT) model can process 6 ms audio chunks in real-time on the GAP9 processor, achieving a 3.5-4x runtime reduction compared to prior speech separation models.

[LG-45] A Novel Multimodal Framework for Early Detection of Alzheimers Disease Using Deep Learning

Link: https://arxiv.org/abs/2508.03046
Authors: Tatwadarshi P Nagarhalli, Sanket Patil, Vishal Pande, Uday Aswalekar, Prafulla Patil
Categories: Machine Learning (cs.LG)
Notes: Journal paper, 14 pages

Abstract:Alzheimer’s Disease (AD) is a progressive neurodegenerative disorder that poses significant challenges in its early diagnosis, often leading to delayed treatment and poorer outcomes for patients. Traditional diagnostic methods, typically reliant on single data modalities, fall short of capturing the multifaceted nature of the disease. In this paper, we propose a novel multimodal framework for the early detection of AD that integrates data from three primary sources: MRI imaging, cognitive assessments, and biomarkers. This framework employs Convolutional Neural Networks (CNN) for analyzing MRI images and Long Short-Term Memory (LSTM) networks for processing cognitive and biomarker data. The system enhances diagnostic accuracy and reliability by aggregating results from these distinct modalities using advanced techniques like weighted averaging, even when data are incomplete. The multimodal approach not only improves the robustness of the detection process but also enables the identification of AD at its earliest stages, offering a significant advantage over conventional methods. The integration of biomarkers and cognitive tests is particularly crucial, as these can detect Alzheimer’s long before the onset of clinical symptoms, thereby facilitating earlier intervention and potentially altering the course of the disease. This research demonstrates that the proposed framework has the potential to revolutionize the early detection of AD, paving the way for more timely and effective treatments.

[LG-46] Aerobatic maneuvers in insect-scale flapping-wing aerial robots via deep-learned robust tube model predictive control

Link: https://arxiv.org/abs/2508.03043
Authors: Yi-Hsuan Hsiao, Andrea Tagliabue, Owen Matteson, Suhan Kim, Tong Zhao, Jonathan P. How, YuFeng Chen
Categories: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes: 27 pages, 26 supplementary pages, 6 main figures, 16 supplementary figures, 1 table

Abstract:Aerial insects exhibit highly agile maneuvers such as sharp braking, saccades, and body flips under disturbance. In contrast, insect-scale aerial robots are limited to tracking non-aggressive trajectories with small body acceleration. This performance gap arises from a combination of low robot inertia, fast dynamics, uncertainty in flapping-wing aerodynamics, and high susceptibility to environmental disturbance. Executing highly dynamic maneuvers requires the generation of aggressive flight trajectories that push against the hardware limit and a high-rate feedback controller that accounts for model and environmental uncertainty. Here, through designing a deep-learned robust tube model predictive controller, we showcase insect-like flight agility and robustness in a 750-milligram flapping-wing robot. Our model predictive controller can track aggressive flight trajectories under disturbance. To achieve a high feedback rate in a compute-constrained real-time system, we design imitation learning methods to train a two-layer, fully connected neural network, which resembles insect flight control architecture consisting of central nervous system and motor neurons. Our robot demonstrates insect-like saccade movements with lateral speed and acceleration of 197 centimeters per second and 11.7 meters per second squared, representing 447% and 255% improvement over prior results. The robot can also perform saccade maneuvers under 160 centimeters per second wind disturbance and large command-to-force mapping errors. Furthermore, it performs 10 consecutive body flips in 11 seconds, the most challenging maneuver among sub-gram flyers. These results represent a milestone in achieving insect-scale flight agility and inspire future investigations on sensing and compute autonomy.

[LG-47] Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling

Link: https://arxiv.org/abs/2508.03042
Authors: Ruixing Zhang, Bo Wang, Tongyu Zhu, Leilei Sun, Weifeng Lv
Categories: Machine Learning (cs.LG)

Abstract:Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. Existing approaches typically follow a two-stage paradigm: first, learning representations of urban areas; second, performing downstream prediction via linear probing, which originates from the BERT era. Inspired by the development of GPT-style models, recent studies have shown that novel self-supervised pretraining schemes can endow models with direct applicability to downstream tasks, thereby eliminating the need for task-specific fine-tuning. This is largely because GPT unifies the form of pretraining and inference through next-token prediction. However, urban data exhibit structural characteristics that differ fundamentally from language, making it challenging to design a one-stage model that unifies both pretraining and inference. In this work, we propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. To capture the distribution of urban profiles, we introduce the Urban Masked Diffusion Transformer, which enables each region’s prediction to be represented as a distribution rather than a deterministic value. Furthermore, to stabilize diffusion training, we propose the Urban Representation Alignment Mechanism, which regularizes the model’s intermediate features by aligning them with those from classical urban profiling methods. Extensive experiments on three indicators across two cities demonstrate that our one-stage method consistently outperforms state-of-the-art two-stage approaches. Ablation studies and case studies further validate the effectiveness of each proposed module, particularly the use of diffusion modeling.

[LG-48] Neural Speech Extraction with Human Feedback INTERSPEECH2025

Link: https://arxiv.org/abs/2508.03041
Authors: Malek Itani, Ashton Graves, Sefik Emre Eskimez, Shyamnath Gollakota
Categories: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Notes: Interspeech 2025

Abstract:We present the first neural target speech extraction (TSE) system that uses human feedback for iterative refinement. Our approach allows users to mark specific segments of the TSE output, generating an edit mask. The refinement system then improves the marked sections while preserving unmarked regions. Since large-scale datasets of human-marked errors are difficult to collect, we generate synthetic datasets using various automated masking functions and train models on each. Evaluations show that models trained with noise power-based masking (in dBFS) and probabilistic thresholding perform best, aligning with human annotations. In a study with 22 participants, users showed a preference for refined outputs over baseline TSE. Our findings demonstrate that human-in-the-loop refinement is a promising approach for improving the performance of neural speech extraction.

[LG-49] Where and How to Enhance: Discovering Bit-Width Contribution for Mixed Precision Quantization

Link: https://arxiv.org/abs/2508.03002
Authors: Haidong Kang, Lianbo Ma, Guo Yu, Shangce Gao
Categories: Machine Learning (cs.LG)

Abstract:Mixed precision quantization (MPQ) is an effective quantization approach for achieving an accuracy-complexity trade-off in neural networks by assigning different bit-widths to network activations and weights in each layer. Existing MPQ methods typically optimize quantization policies (i.e., bit-width allocation) in a gradient descent manner, termed Differentiable MPQ (DMPQ). At the end of the search, the bit-width whose quantization parameter has the largest value is selected to form the final mixed precision quantization policy, with the implicit assumption that the values of quantization parameters reflect each operation's contribution to the accuracy improvement. While much has been discussed about improving MPQ, the bit-width selection process has received little attention. We study this problem and argue that the magnitude of quantization parameters does not necessarily reflect the actual contribution of the bit-width to the task performance. Then, we propose a Shapley-based MPQ (SMPQ) method, which measures the direct contribution of each bit-width operation to the MPQ task. To reduce computation cost, a Monte Carlo sampling-based approximation strategy is proposed for Shapley computation. Extensive experiments on mainstream benchmarks demonstrate that our SMPQ consistently achieves state-of-the-art performance, outperforming gradient-based competitors.
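
The abstract mentions a Monte Carlo approximation for Shapley values. A generic permutation-sampling estimator, which is the textbook approach and not necessarily the paper's exact strategy, can be sketched like this. The player names and the `toy_value` function are hypothetical stand-ins for bit-width candidates and a validation-accuracy measurement.

```python
import random

def shapley_monte_carlo(players, value_fn, num_permutations, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orders."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        prev = value_fn(frozenset(coalition))
        for p in order:
            coalition.add(p)
            cur = value_fn(frozenset(coalition))
            totals[p] += cur - prev  # marginal contribution of p in this order
            prev = cur
    return {p: total / num_permutations for p, total in totals.items()}

# Hypothetical value function: additive gains plus one interaction term.
GAINS = {"2bit": 0.1, "4bit": 0.3, "8bit": 0.6}

def toy_value(coalition):
    value = sum(GAINS[p] for p in sorted(coalition))
    if "4bit" in coalition and "8bit" in coalition:
        value += 0.2  # synergy shared between the two players
    return value

estimates = shapley_monte_carlo(list(GAINS), toy_value, num_permutations=200)
print(estimates)
```

Each sampled order telescopes to the value of the full coalition, so the estimates always sum to `toy_value` of all players minus `toy_value` of the empty set, whatever the sample size.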

[LG-50] On the Fast Adaptation of Delayed Clients in Decentralized Federated Learning: A Centroid-Aligned Distillation Approach

Link: https://arxiv.org/abs/2508.02993
Authors: Jiahui Bai, Hai Dong, A. K. Qin
Categories: Machine Learning (cs.LG)
Notes: This paper is currently under peer review

Abstract:Decentralized Federated Learning (DFL) struggles with the slow adaptation of late-joining delayed clients and high communication costs in asynchronous environments. These limitations significantly hinder overall performance. To address this, we propose DFedCAD, a novel framework for rapid adaptation via Centroid-Aligned Distillation. DFedCAD first employs Weighted Cluster Pruning (WCP) to compress models into representative centroids, drastically reducing communication overhead. It then enables delayed clients to intelligently weigh and align with peer knowledge using a novel structural distance metric and a differentiable k-means distillation module, facilitating efficient end-to-end knowledge transfer. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that DFedCAD consistently achieves state-of-the-art performance, attaining the highest accuracy across all evaluated settings while reducing communication overhead by over 86%. Our framework provides a scalable and practical solution for efficient decentralized learning in dynamic, real-world scenarios.

[LG-51] Scalable Varied-Density Clustering via Graph Propagation

Link: https://arxiv.org/abs/2508.02989
Authors: Ninh Pham, Yingtao Zheng, Hugo Phibbs
Categories: Machine Learning (cs.LG)

Abstract:We propose a novel perspective on varied-density clustering for high-dimensional data by framing it as a label propagation process in neighborhood graphs that adapt to local density variations. Our method formally connects density-based clustering with graph connectivity, enabling the use of efficient graph propagation techniques developed in network science. To ensure scalability, we introduce a density-aware neighborhood propagation algorithm and leverage advanced random projection methods to construct approximate neighborhood graphs. Our approach significantly reduces computational cost while preserving clustering quality. Empirically, it scales to datasets with millions of points in minutes and achieves competitive accuracy compared to existing baselines.

[LG-52] Injecting Measurement Information Yields a Fast and Noise-Robust Diffusion-Based Inverse Problem Solver

Link: https://arxiv.org/abs/2508.02964
Authors: Jonathan Patsenker, Henry Li, Myeongseob Ko, Ruoxi Jia, Yuval Kluger
Categories: Machine Learning (cs.LG); Computation (stat.CO)

Abstract:Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and iterative sampling algorithm. These approaches often rely on Tweedie’s formula, which relates the diffusion variate $\mathbf{x}_t$ to the posterior mean $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]$, in order to guide the diffusion trajectory with an estimate of the final denoised sample $\mathbf{x}_0$. However, this does not consider information from the measurement $\mathbf{y}$, which must then be integrated downstream. In this work, we propose to estimate the conditional posterior mean $\mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t, \mathbf{y}]$, which can be formulated as the solution to a lightweight, single-parameter maximum likelihood estimation problem. The resulting prediction can be integrated into any standard sampler, resulting in a fast and memory-efficient inverse solver. Our optimizer is amenable to a noise-aware, likelihood-based stopping criterion that is robust to measurement noise in $\mathbf{y}$. We demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets and tasks.
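
For reference, Tweedie's formula can be written out explicitly under one common convention. The variance-preserving forward process and the symbols $\bar{\alpha}_t$, $p_t$, and $\boldsymbol{\epsilon}_\theta$ below are assumptions for illustration; the abstract does not fix a parameterization.

```latex
% Assumed forward process (variance-preserving):
%   x_t = sqrt(\bar{\alpha}_t) x_0 + sqrt(1 - \bar{\alpha}_t) \epsilon,  \epsilon ~ N(0, I)
\begin{align}
  \mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t]
    &= \frac{1}{\sqrt{\bar{\alpha}_t}}
       \Bigl( \mathbf{x}_t + (1 - \bar{\alpha}_t)\,
              \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t) \Bigr) \\
    &\approx \frac{1}{\sqrt{\bar{\alpha}_t}}
       \Bigl( \mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\;
              \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \Bigr)
\end{align}
```

The second line uses the usual identification of a trained noise predictor with the scaled score, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) \approx -\sqrt{1 - \bar{\alpha}_t}\, \nabla_{\mathbf{x}_t} \log p_t(\mathbf{x}_t)$. Neither line depends on $\mathbf{y}$, which is the gap the abstract points to.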

[LG-53] Online Robust Multi-Agent Reinforcement Learning under Model Uncertainties

Link: https://arxiv.org/abs/2508.02948
Authors: Zain Ulabedeen Farhat, Debamita Ghosh, George K. Atia, Yue Wang
Categories: Machine Learning (cs.LG); Multiagent Systems (cs.MA)

Abstract:Well-trained multi-agent systems can fail when deployed in real-world environments due to model mismatches between the training and deployment environments, caused by environment uncertainties including noise or adversarial attacks. Distributionally Robust Markov Games (DRMGs) enhance system resilience by optimizing for worst-case performance over a defined set of environmental uncertainties. However, current methods are limited by their dependence on simulators or large offline datasets, which are often unavailable. This paper pioneers the study of online learning in DRMGs, where agents learn directly from environmental interactions without prior data. We introduce the Robust Optimistic Nash Value Iteration (RONAVI) algorithm and provide the first provable guarantees for this setting. Our theoretical analysis demonstrates that the algorithm achieves low regret and efficiently finds the optimal robust policy for uncertainty sets measured by Total Variation divergence and Kullback-Leibler divergence. These results establish a new, practical path toward developing truly robust multi-agent systems.

[LG-54] PLoRA: Efficient LoRA Hyperparameter Tuning for Large Models

Link: https://arxiv.org/abs/2508.02932
Authors: Minghao Yan, Zhuang Wang, Zhen Jia, Shivaram Venkataraman, Yida Wang
Categories: Machine Learning (cs.LG)

Abstract:Low-rank Adaptation (LoRA) has gained popularity as a fine-tuning approach for Large Language Models (LLMs) due to its low resource requirements and good performance. While a plethora of work has investigated improving LoRA serving efficiency by serving multiple LoRAs concurrently, existing methods assume that a wide range of LoRA adapters are available for serving. In our work, we conduct extensive empirical studies to identify that current training paradigms do not utilize hardware resources efficiently and require high overhead to obtain a performant LoRA. Leveraging these insights, we propose PLoRA, which automatically orchestrates concurrent LoRA fine-tuning jobs under given hardware and model constraints and develops performant kernels to improve training efficiency. Our experimental studies show that PLoRA reduces the makespan of LoRA fine-tuning over a given hyperparameter search space by up to 7.52x and improves training throughput by up to 12.8x across a range of state-of-the-art LLMs.
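
As background for the hyperparameters being searched, the following is a minimal sketch of the standard LoRA forward pass, in which a frozen weight matrix W is augmented by a scaled low-rank product B·A. It illustrates LoRA in general, not PLoRA's job orchestration or kernels; the function names and the toy matrices are hypothetical.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def lora_forward(w, a, b, x, alpha):
    """Compute W x + (alpha / r) * B (A x), with W frozen and A, B trainable."""
    rank = len(a)                    # A has shape (r, d_in)
    delta = matvec(b, matvec(a, x))  # B has shape (d_out, r)
    scale = alpha / rank
    return [base + scale * d for base, d in zip(matvec(w, x), delta)]

def trainable_params(d_out, d_in, rank):
    """Number of trainable entries in A and B combined."""
    return rank * (d_in + d_out)

w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen weights, d_out=2, d_in=3
a = [[0.5, -0.5, 1.0]]                   # rank r=1
x = [1.0, 2.0, 3.0]

# With B initialised to zero, the adapted layer matches the frozen layer.
y_init = lora_forward(w, a, [[0.0], [0.0]], x, alpha=8.0)
# After training, B is nonzero and shifts the output.
y_tuned = lora_forward(w, a, [[1.0], [2.0]], x, alpha=8.0)
```

The rank and `alpha` are typical entries in a LoRA hyperparameter search space; for a 4096 by 4096 layer at rank 8, `trainable_params` gives 65,536 values against roughly 16.8 million in the full matrix.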

[LG-55] BoostTransformer: Enhancing Transformer Models with Subgrid Selection and Importance Sampling

Link: https://arxiv.org/abs/2508.02924
Authors: Biyi Fang, Jean Utke, Truong Vo, Diego Klabjan
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Comments: 10 pages, 5 figures, submitted for review at a major machine learning conference. arXiv admin note: substantial text overlap with arXiv:2203.00761, arXiv:2507.22842

Abstract:Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with boosting principles through subgrid token selection and importance-weighted sampling. Our method incorporates a least square boosting objective directly into the transformer pipeline, enabling more efficient training and improved performance. Across multiple fine-grained text classification benchmarks, BoostTransformer demonstrates both faster convergence and higher accuracy, surpassing standard transformers while minimizing architectural search overhead.

[LG-56] Neural Approximators for Low-Thrust Trajectory Transfer Cost and Reachability

Link: https://arxiv.org/abs/2508.02911
Authors: Zhong Zhang, Francesco Topputo
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Optimization and Control (math.OC)

Abstract:In trajectory design, fuel consumption and trajectory reachability are two key performance indicators for low-thrust missions. This paper proposes general-purpose pretrained neural networks to predict these metrics. The contributions of this paper are as follows: Firstly, based on the confirmation of the Scaling Law applicable to low-thrust trajectory approximation, the largest dataset is constructed using the proposed homotopy ray method, which aligns with mission-design-oriented data requirements. Secondly, the data are transformed into a self-similar space, enabling the neural network to adapt to arbitrary semi-major axes, inclinations, and central bodies. This extends applicability beyond existing studies and allows generalization across diverse mission scenarios without retraining. Thirdly, to the best of our knowledge, this work presents the most general and accurate low-thrust trajectory approximator to date, with implementations available in C++, Python, and MATLAB. The resulting neural network achieves a relative error of 0.78% in predicting velocity increments and 0.63% in minimum transfer time estimation. The models have also been validated on a third-party dataset, a multi-flyby mission design problem, and a mission analysis scenario, demonstrating their generalization capability, predictive accuracy, and computational efficiency.

[LG-57] Clus-UCB: A Near-Optimal Algorithm for Clustered Bandits

Link: https://arxiv.org/abs/2508.02909
Authors: Aakash Gore, Prasanna Chaporkar
Categories: Machine Learning (cs.LG)

Abstract:We study a stochastic multi-armed bandit setting where arms are partitioned into known clusters, such that the mean rewards of arms within a cluster differ by at most a known threshold. While the clustering structure is known a priori, the arm means are unknown. This framework models scenarios where outcomes depend on multiple factors – some with significant and others with minor influence – such as online advertising, clinical trials, and wireless communication. We derive asymptotic lower bounds on the regret that improve upon the classical bound of Lai and Robbins (1985). We then propose Clus-UCB, an efficient algorithm that closely matches this lower bound asymptotically. Clus-UCB is designed to exploit the clustering structure and introduces a new index to evaluate an arm, which depends on other arms within the cluster. In this way, arms share information among each other. We present simulation results of our algorithm and compare its performance against KL-UCB and other well-known algorithms for bandits with dependent arms. Finally, we address some limitations of this work and conclude by mentioning possible future research.
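
The abstract does not give the Clus-UCB index, so the sketch below shows only what an index policy looks like, using the classical UCB1 index (empirical mean plus an exploration bonus) on Bernoulli arms. It ignores cluster structure entirely; the function name `ucb1`, the arm means, and the seed are illustrative assumptions.

```python
import math
import random

def ucb1(means, horizon, seed=0):
    """Run classical UCB1 on Bernoulli arms and return per-arm pull counts."""
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialisation: pull every arm once
        else:
            # Index = empirical mean + exploration bonus sqrt(2 ln t / n_a).
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Arms 0 and 1 have close means (as arms within one cluster would); arm 2 is far better.
counts = ucb1([0.2, 0.25, 0.8], horizon=2000)
```

In UCB1 each arm's index uses only that arm's own samples. According to the abstract, the Clus-UCB index additionally depends on the other arms in the same cluster.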

[LG-58] Physics-Embedded Neural ODEs for Sim2Real Edge Digital Twins of Hybrid Power Electronics Systems

Link: https://arxiv.org/abs/2508.02887
Authors: Jialin Zheng, Haoyu Wang, Yangbin Zeng, Di Mou, Xin Zhang, Hong Li, Sergio Vazquez, Leopoldo G. Franquelo
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)

Abstract:Edge Digital Twins (EDTs) are crucial for monitoring and control of Power Electronics Systems (PES). However, existing modeling approaches struggle to consistently capture continuously evolving hybrid dynamics that are inherent in PES, degrading Sim-to-Real generalization on resource-constrained edge devices. To address these challenges, this paper proposes Physics-Embedded Neural ODEs (PENODE), which (i) embed the hybrid operating mechanism as an event automaton to explicitly govern discrete switching and (ii) inject known governing ODE components directly into the neural parameterization of unmodeled dynamics. This unified design yields a differentiable end-to-end trainable architecture that preserves physical interpretability while reducing redundancy, and it supports a cloud-to-edge toolchain for efficient FPGA deployment. Experimental results demonstrate that PENODE achieves significantly higher accuracy in benchmarks in white-box, gray-box, and black-box scenarios, with a 75% reduction in neuron count, validating that the proposed PENODE maintains physical interpretability, efficient edge deployment, and real-time control enhancement.

[LG-59] Neural Networks with Orthogonal Jacobian

Link: https://arxiv.org/abs/2508.02882
Authors: Alex Massucco, Davide Murari, Carola-Bibiane Schönlieb
Categories: Machine Learning (cs.LG); Numerical Analysis (math.NA)

Abstract:Very deep neural networks achieve state-of-the-art performance by extracting rich, hierarchical features. Yet, training them via backpropagation is often hindered by vanishing or exploding gradients. Existing remedies, such as orthogonal or variance-preserving initialisation and residual architectures, allow for a more stable gradient propagation and the training of deeper models. In this work, we introduce a unified mathematical framework that describes a broad class of nonlinear feedforward and residual networks, whose input-to-output Jacobian matrices are exactly orthogonal almost everywhere. Such a constraint forces the resulting networks to achieve perfect dynamical isometry and train efficiently despite being very deep. Our formulation not only recovers standard architectures as particular cases but also yields new designs that match the trainability of residual networks without relying on conventional skip connections. We provide experimental evidence that perfect Jacobian orthogonality at initialisation is sufficient to stabilise training and achieve competitive performance. We compare this strategy to networks regularised to maintain the Jacobian orthogonality and obtain comparable results. We further extend our analysis to a class of networks well-approximated by those with orthogonal Jacobians and introduce networks with Jacobians representing partial isometries. These generalized models are then shown to maintain the favourable trainability properties.

[LG-60] Comparative Evaluation of Kolmogorov-Arnold Autoencoders and Orthogonal Autoencoders for Fault Detection with Varying Training Set Sizes

Link: https://arxiv.org/abs/2508.02860
Authors: Enrique Luna Villagómez, Vladimir Mahalec
Categories: Machine Learning (cs.LG)

Abstract:Kolmogorov-Arnold Networks (KANs) have recently emerged as a flexible and parameter-efficient alternative to conventional neural networks. Unlike standard architectures that use fixed node-based activations, KANs place learnable functions on edges, parameterized by different function families. While they have shown promise in supervised settings, their utility in unsupervised fault detection remains largely unexplored. This study presents a comparative evaluation of KAN-based autoencoders (KAN-AEs) for unsupervised fault detection in chemical processes. We investigate four KAN-AE variants, each based on a different KAN implementation (EfficientKAN, FastKAN, FourierKAN, and WavKAN), and benchmark them against an Orthogonal Autoencoder (OAE) on the Tennessee Eastman Process. Models are trained on normal operating data across 13 training set sizes and evaluated on 21 fault types, using Fault Detection Rate (FDR) as the performance metric. WavKAN-AE achieves the highest overall FDR ($\geq$ 92%) using just 4,000 training samples and remains the top performer, even as other variants are trained on larger datasets. EfficientKAN-AE reaches $\geq$ 90% FDR with only 500 samples, demonstrating robustness in low-data settings. FastKAN-AE becomes competitive at larger scales ($\geq$ 50,000 samples), while FourierKAN-AE consistently underperforms. The OAE baseline improves gradually but requires substantially more data to match top KAN-AE performance. These results highlight the ability of KAN-AEs to combine data efficiency with strong fault detection performance. Their use of structured basis functions suggests potential for improved model transparency, making them promising candidates for deployment in data-constrained industrial settings.

[LG-61] Resource-Efficient Automatic Software Vulnerability Assessment via Knowledge Distillation and Particle Swarm Optimization

Link: https://arxiv.org/abs/2508.02840
Authors: Chaoyang Gao, Xiang Chen, Jiyu Wang, Jibin Wang, Guang Yang
Categories: Machine Learning (cs.LG); Cryptography and Security (cs.CR)
Comments: Accepted by Engineering Applications of Artificial Intelligence

Abstract:The increasing complexity of software systems has led to a surge in cybersecurity vulnerabilities, necessitating efficient and scalable solutions for vulnerability assessment. However, the deployment of large pre-trained models in real-world scenarios is hindered by their substantial computational and storage demands. To address this challenge, we propose a novel resource-efficient framework that integrates knowledge distillation and particle swarm optimization to enable automated vulnerability assessment. Our framework employs a two-stage approach: First, particle swarm optimization is utilized to optimize the architecture of a compact student model, balancing computational efficiency and model capacity. Second, knowledge distillation is applied to transfer critical vulnerability assessment knowledge from a large teacher model to the optimized student model. This process significantly reduces the model size while maintaining high performance. Experimental results on an enhanced MegaVul dataset, comprising 12,071 CVSS (Common Vulnerability Scoring System) v3 annotated vulnerabilities, demonstrate the effectiveness of our approach. Our approach achieves a 99.4% reduction in model size while retaining 89.3% of the original model’s accuracy. Furthermore, it outperforms state-of-the-art baselines by 1.7% in accuracy with 60% fewer parameters. The framework also reduces training time by 72.1% and architecture search time by 34.88% compared to traditional genetic algorithms.
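
The second stage can be illustrated with the classical temperature-scaled distillation loss, in which the student is trained to match the teacher's softened output distribution. The abstract does not specify the exact loss used, so this is a generic sketch; the function names, the temperature of 2.0, and the example logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl

teacher = [4.0, 1.0, 0.5]  # e.g. logits over three severity classes
loss_matched = distillation_loss([4.0, 1.0, 0.5], teacher)
loss_mismatched = distillation_loss([0.5, 1.0, 4.0], teacher)
```

The loss is zero when the student reproduces the teacher's logits and grows as the two distributions diverge; in practice it is usually combined with a supervised loss on the ground-truth labels.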

[LG-62] Agentic Privacy-Preserving Machine Learning

Link: https://arxiv.org/abs/2508.02836
Authors: Mengyu Zhang, Zhuotao Liu, Jingwen Huang, Xuanqi Liu
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)

Abstract:Privacy-preserving machine learning (PPML) is critical to ensure data privacy in AI. Over the past few years, the community has proposed a wide range of provably secure PPML schemes that rely on various cryptographic primitives. However, when it comes to large language models (LLMs) with billions of parameters, the efficiency of PPML is anything but acceptable. For instance, the state-of-the-art solution for confidential LLM inference is at least 10,000 times slower than plaintext inference. The performance gap is even larger when the context length increases. In this position paper, we propose a novel framework named Agentic-PPML to make PPML in LLMs practical. Our key insight is to employ a general-purpose LLM for intent understanding and delegate cryptographically secure inference to specialized models trained on vertical domains. By modularly separating language intent parsing - which typically involves little or no sensitive information - from privacy-critical computation, Agentic-PPML completely eliminates the need for the LLMs to process the encrypted prompts, enabling practical deployment of privacy-preserving LLM-centric services.

[LG-63] Defending Against Knowledge Poisoning Attacks During Retrieval-Augmented Generation

Link: https://arxiv.org/abs/2508.02835
Authors: Kennedy Edemacu, Vinay M. Shashidhar, Micheal Tuape, Dan Abudu, Beakcheol Jang, Jong Wook Kim
Categories: Machine Learning (cs.LG); Information Retrieval (cs.IR)
Comments: Preprint for Submission

Abstract:Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to boost the capabilities of large language models (LLMs) by incorporating external, up-to-date knowledge sources. However, this introduces a potential vulnerability to knowledge poisoning attacks, where attackers can compromise the knowledge source to mislead the generation model. One such attack is PoisonedRAG, in which the injected adversarial texts steer the model to generate an attacker-chosen response to a target question. In this work, we propose novel defense methods, FilterRAG and ML-FilterRAG, to mitigate the PoisonedRAG attack. First, we identify a new property that distinguishes adversarial texts from clean texts in the knowledge data source. Next, we use this property in the design of our proposed approaches to filter adversarial texts out from the clean ones. Evaluation of these methods on benchmark datasets demonstrates their effectiveness, with performance close to that of the original RAG systems.

[LG-64] On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence

Link: https://arxiv.org/abs/2508.02833
Authors: Lei Pang, Ruinan Jin
Categories: Machine Learning (cs.LG)
Comments: 12 pages

Abstract:Group Relative Policy Optimization (GRPO), recently proposed by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. It replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards, while retaining PPO-style token-level importance sampling based on an old policy. We show that the GRPO update rule in fact estimates the policy gradient at the old policy rather than the current one. However, since the old policy is refreshed every few steps, the discrepancy between the two remains small, limiting the impact of this bias in practice. We validate this through an ablation study in which importance sampling is entirely removed, and updates are instead performed using the gradient estimated at a fixed old policy across multiple optimization steps. Remarkably, this simplification results in performance comparable to standard GRPO. Motivated by these findings, we propose a new algorithm: Trajectory-level Importance-Corrected GRPO (TIC GRPO). TIC GRPO replaces token-level importance ratios with a single trajectory-level probability ratio, yielding an unbiased estimate of the current policy gradient while preserving the critic-free structure. Furthermore, we present the first theoretical convergence analysis for GRPO-style methods, covering both the original GRPO and our proposed variant.
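
The quantities named in the abstract can be sketched in a few lines. This is an illustration based only on the abstract's description: it assumes rewards are normalized by the group's mean and population standard deviation, omits PPO-style clipping, and uses made-up log-probabilities; the function names are hypothetical.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized rewards: subtract the group mean, divide by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_ratios(logp_new, logp_old):
    """Token-level importance ratios pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def trajectory_ratio(logp_new, logp_old):
    """Single trajectory-level ratio: the product of all token-level ratios."""
    return math.exp(sum(logp_new) - sum(logp_old))

# Four sampled responses to one prompt, each with a scalar reward.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)

# Per-token log-probabilities of one response under the old and current policy.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -2.1, -0.4]
tok = token_ratios(logp_new, logp_old)
traj = trajectory_ratio(logp_new, logp_old)
```

Token-level weighting applies a separate ratio to each token's gradient term, whereas the trajectory-level variant applies the single scalar `traj` to the whole response.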

[LG-65] Uncertainty Sets for Distributionally Robust Bandits Using Structural Equation Models

Link: https://arxiv.org/abs/2508.02812
Authors: Katherine Avery, Chinmay Pendse, David Jensen
Categories: Machine Learning (cs.LG)
Comments: 10 pages main text, 28 pages total

Abstract:Distributionally robust evaluation estimates the worst-case expected return over an uncertainty set of possible covariate and reward distributions, and distributionally robust learning finds a policy that maximizes that worst-case return across that uncertainty set. Unfortunately, current methods for distributionally robust evaluation and learning create overly conservative evaluations and policies. In this work, we propose a practical bandit evaluation and learning algorithm that tailors the uncertainty set to specific problems using mathematical programs constrained by structural equation models. Further, we show how conditional independence testing can be used to detect shifted variables for modeling. We find that the structural equation model (SEM) approach gives more accurate evaluations and learns lower-variance policies than traditional approaches, particularly for large shifts. Further, the SEM approach learns an optimal policy, assuming the model is sufficiently well-specified.

[LG-66] Synthetic medical data generation: state of the art and application to trauma mechanism classification

Link: https://arxiv.org/abs/2508.02771
Authors: Océane Doremus, Ariel Guerra-Adames, Marta Avalos-Fernandez, Vianney Jouhet, Cédric Gil-Jardiné, Emmanuel Lagarde
Categories: Machine Learning (cs.LG)
Comments: Accepted to CIBB 2025 as a short paper

Abstract:Faced with the challenges of patient confidentiality and scientific reproducibility, research on machine learning for health is turning towards the conception of synthetic medical databases. This article presents a brief overview of state-of-the-art machine learning methods for generating synthetic tabular and textual data, focusing on their application to the automatic classification of trauma mechanisms, followed by our proposed methodology for generating high-quality, synthetic medical records combining tabular and unstructured text data.

[LG-67] Exponential convergence rate for Iterative Markovian Fitting

Link: https://arxiv.org/abs/2508.02770
Authors: Kirill Sokolov, Alexander Korotin
Categories: Information Theory (cs.IT); Machine Learning (cs.LG)

Abstract:We consider the discrete-time Schrödinger bridge problem on a finite state space. Although it has been known that the Iterative Markovian Fitting (IMF) algorithm converges in Kullback-Leibler divergence to the ground truth solution, the speed of that convergence remained unquantified. In this work, we establish for the first time that IMF exhibits exponential convergence with an explicit contraction factor.

[LG-68] Considering Spatial Structure of the Road Network in Pavement Deterioration Modeling

Link: https://arxiv.org/abs/2508.02749
Authors: Lu Gao, Ke Yu, Pan Lu
Categories: Machine Learning (cs.LG)

Abstract:Pavement deterioration modeling is important in providing information regarding the future state of the road network and in determining the needs of preventive maintenance or rehabilitation treatments. This research incorporated the spatial dependence of the road network into pavement deterioration modeling through a graph neural network (GNN). The key motivation for using a GNN for pavement performance modeling is its ability to easily and directly exploit the rich structural information in the network. This paper explored whether considering the spatial structure of the road network improves the prediction performance of the deterioration models. The data used in this research comprise a large pavement condition data set with more than half a million observations taken from the Pavement Management Information System (PMIS) maintained by the Texas Department of Transportation. The promising comparison results indicate that pavement deterioration prediction models perform better when spatial relationships are considered.

[LG-69] Embedding-Enhanced Probabilistic Modeling of Ferroelectric Field Effect Transistors (FeFETs)

Link: https://arxiv.org/abs/2508.02737
Authors: Tasnia Nobi Afee, Jack Hutchins, Md Mazharul Islam, Thomas Kampfe, Ahmedullah Aziz
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)
Comments: 15 pages, 6 figures, manuscript not yet submitted anywhere

Abstract:FeFETs hold strong potential for advancing memory and logic technologies, but their inherent randomness arising from both operational cycling and fabrication variability poses significant challenges for accurate and reliable modeling. Capturing this variability is critical, as it enables designers to predict behavior, optimize performance, and ensure reliability and robustness against variations in manufacturing and operating conditions. Existing deterministic and machine learning-based compact models often fail to capture the full extent of this variability or lack the mathematical smoothness required for stable circuit-level integration. In this work, we present an enhanced probabilistic modeling framework for FeFETs that addresses these limitations. Building upon a Mixture Density Network (MDN) foundation, our approach integrates $C^\infty$ continuous activation functions for smooth, stable learning and a device-specific embedding layer to capture intrinsic physical variability across devices. Sampling from the learned embedding distribution enables the generation of synthetic device instances for variability-aware simulation. With an $R^2$ of 0.92, the model demonstrates high accuracy in capturing the variability of FeFET current behavior. Altogether, this framework provides a scalable, data-driven solution for modeling the full stochastic behavior of FeFETs and offers a strong foundation for future compact model development and circuit simulation integration.

[LG-70] Low-Communication Resilient Distributed Estimation Algorithm Based on Memory Mechanism

Link: https://arxiv.org/abs/2508.02705
Authors: Wei Li, Limei Hu, Feng Chen, Ye Yao
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)

Abstract:In multi-task adversarial networks, the accurate estimation of unknown parameters in a distributed algorithm is hindered by attacked nodes or links. To tackle this challenge, this brief proposes a low-communication resilient distributed estimation algorithm. First, a node selection strategy based on reputation is introduced that allows nodes to communicate with a more reliable subset of neighbors. Subsequently, to discern trustworthy intermediate estimates, the Weighted Support Vector Data Description (W-SVDD) model is trained on the memory data. This trained model helps reinforce the resilience of the distributed estimation process against the impact of attacked nodes or links. Additionally, an event-triggered mechanism is introduced to minimize ineffective updates to the W-SVDD model, and a suitable threshold is derived based on assumptions. The convergence of the algorithm is analyzed. Finally, simulation results demonstrate that the proposed algorithm achieves superior performance with less communication cost compared to other algorithms.

[LG-71] Overcoming the Loss Conditioning Bottleneck in Optimization-Based PDE Solvers: A Novel Well-Conditioned Loss Function

Link: https://arxiv.org/abs/2508.02692
Authors: Wenbo Cao, Weiwei Zhang
Categories: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)

Abstract:Optimization-based PDE solvers that minimize scalar loss functions have gained increasing attention in recent years. These methods either define the loss directly over discrete variables, as in Optimizing a Discrete Loss (ODIL), or indirectly through a neural network surrogate, as in Physics-Informed Neural Networks (PINNs). However, despite their promise, such methods often converge much more slowly than classical iterative solvers and are commonly regarded as inefficient. This work provides a theoretical insight, attributing the inefficiency to the use of the mean squared error (MSE) loss, which implicitly forms the normal equations, squares the condition number, and severely impairs optimization. To address this, we propose a novel Stabilized Gradient Residual (SGR) loss. By tuning a weight parameter, it flexibly modulates the condition number between the original system and its normal equations, while reducing to the MSE loss in the limiting case. We systematically benchmark the convergence behavior and optimization stability of the SGR loss within both the ODIL framework and PINNs (employing either numerical or automatic differentiation) and compare its performance against classical iterative solvers. Numerical experiments on a range of benchmark problems demonstrate that, within the ODIL framework, the proposed SGR loss achieves orders-of-magnitude faster convergence than the MSE loss. Further validation within the PINNs framework shows that, despite the high nonlinearity of neural networks, SGR consistently outperforms the MSE loss. These theoretical and empirical findings help bridge the performance gap between classical iterative solvers and optimization-based solvers, highlighting the central role of loss conditioning, and provide key insights for the design of more efficient PDE solvers.
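
The claim that the MSE loss squares the condition number can be checked on a toy linear system. Minimizing ||A x - b||^2 leads to the normal equations A^T A x = A^T b, and the 2-norm condition number of A^T A is the square of that of A. The sketch below assumes a small symmetric positive definite A so the eigenvalues have a closed form; the matrix and the helper names are illustrative and do not come from the paper, and the SGR loss itself is not reproduced.

```python
import math

def sym2x2_eigenvalues(m):
    """Closed-form eigenvalues (smallest, largest) of a symmetric 2x2 matrix."""
    (a, b), (_, d) = m
    mid = (a + d) / 2.0
    rad = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return mid - rad, mid + rad

def gram(m):
    """Return M^T M for a 2x2 matrix M."""
    cols = list(zip(*m))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

def condition_number(sym_pos_def):
    """2-norm condition number of a symmetric positive definite 2x2 matrix."""
    lo, hi = sym2x2_eigenvalues(sym_pos_def)
    return hi / lo

A = [[2.0, 1.0], [1.0, 2.0]]            # eigenvalues 1 and 3
cond_a = condition_number(A)            # conditioning of the original system
cond_normal = condition_number(gram(A)) # conditioning seen by the MSE loss
```

Since the convergence rate of gradient-type iterations degrades as the condition number grows, this squaring offers one way to read the slow convergence that the abstract attributes to the MSE loss.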

[LG-72] Accelerating Conjugate Gradient Solvers for Homogenization Problems with Unitary Neural Operators

Link: https://arxiv.org/abs/2508.02681
Authors: Julius Herb, Felix Fritzen
Categories: Numerical Analysis (math.NA); Machine Learning (cs.LG)

Abstract:Rapid and reliable solvers for parametric partial differential equations (PDEs) are needed in many scientific and engineering disciplines. For example, there is a growing demand for composites and architected materials with heterogeneous microstructures. Designing such materials and predicting their behavior in practical applications requires solving homogenization problems for a wide range of material parameters and microstructures. While classical numerical solvers offer reliable and accurate solutions supported by a solid theoretical foundation, their high computational costs and slow convergence remain limiting factors. As a result, scientific machine learning is emerging as a promising alternative. However, such approaches often lack guaranteed accuracy and physical consistency. This raises the question of whether it is possible to develop hybrid approaches that combine the advantages of both data-driven methods and classical solvers. To address this, we introduce UNO-CG, a hybrid solver that accelerates conjugate gradient (CG) solvers using specially designed machine-learned preconditioners, while ensuring convergence by construction. As a preconditioner, we propose Unitary Neural Operators as a modification of Fourier Neural Operators. Our method can be interpreted as a data-driven discovery of Green’s functions, which are then used to accelerate iterative solvers. We evaluate UNO-CG on various homogenization problems involving heterogeneous microstructures and millions of degrees of freedom. Our results demonstrate that UNO-CG enables a substantial reduction in the number of iterations and is competitive with handcrafted preconditioners for homogenization problems that involve expert knowledge. Moreover, UNO-CG maintains strong performance across a variety of boundary conditions, where many specialized solvers are not applicable, highlighting its versatility and robustness.
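
For orientation, the sketch below shows where a preconditioner enters a conjugate gradient iteration. It uses a simple Jacobi (diagonal) preconditioner as a stand-in for the learned Unitary Neural Operator, which is not reproduced here; the small test system, the function names, and the tolerance are assumptions. Standard CG theory requires the preconditioner to be symmetric positive definite.

```python
def matvec(a, x):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def pcg(a, b, apply_preconditioner, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient for a symmetric positive definite system."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the zero initial guess
    z = apply_preconditioner(r)      # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for iteration in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, iteration
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        z = apply_preconditioner(r)  # the only place the preconditioner is used
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

def jacobi(r):
    """Divide each residual entry by the matching diagonal entry of A."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

x, iters = pcg(A, b, jacobi)
residual = max(abs(axi - bi) for axi, bi in zip(matvec(A, x), b))
```

Swapping `jacobi` for any other symmetric positive definite operator changes the iteration count but not the solution CG converges to, which is why a hybrid solver of this kind can use a learned component without giving up convergence.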

[LG-73] Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws

Link: https://arxiv.org/abs/2508.03688
Authors: Gérard Ben Arous, Murat A. Erdogdu, N. Mert Vural, Denny Wu
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 84 pages

Abstract:We study the optimization and sample complexity of gradient-based training of a two-layer neural network with quadratic activation function in the high-dimensional regime, where the data is generated as $y \propto \sum_{j=1}^{r} \lambda_j \sigma\left(\langle \boldsymbol{\theta}_j, \boldsymbol{x} \rangle\right)$, $\boldsymbol{x} \sim N(0, \boldsymbol{I}_d)$, $\sigma$ is the 2nd Hermite polynomial, and $\lbrace \boldsymbol{\theta}_j \rbrace_{j=1}^{r} \subset \mathbb{R}^d$ are orthonormal signal directions. We consider the extensive-width regime $r \asymp d^{\beta}$ for $\beta \in [0, 1)$, and assume a power-law decay on the (non-negative) second-layer coefficients $\lambda_j \asymp j^{-\alpha}$ for $\alpha \geq 0$. We present a sharp analysis of the SGD dynamics in the feature learning regime, for both the population limit and the finite-sample (online) discretization, and derive scaling laws for the prediction risk that highlight the power-law dependencies on the optimization time, sample size, and model width. Our analysis combines a precise characterization of the associated matrix Riccati differential equation with novel matrix monotonicity arguments to establish convergence guarantees for the infinite-dimensional effective dynamics.

[LG-74] Likelihood Matching for Diffusion Models

Link: https://arxiv.org/abs/2508.03636
Authors: Lei Qian,Wu Su,Yanqi Huang,Song Xi Chen
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST); Applications (stat.AP); Methodology (stat.ME)

Abstract:We propose a Likelihood Matching approach for training diffusion models by first establishing an equivalence between the likelihood of the target data distribution and a likelihood along the sample path of the reverse diffusion. To efficiently compute the reverse sample likelihood, a quasi-likelihood is considered to approximate each reverse transition density by a Gaussian distribution with matched conditional mean and covariance, respectively. The score and Hessian functions for the diffusion generation are estimated by maximizing the quasi-likelihood, ensuring a consistent matching of both the first two transitional moments between every two time points. A stochastic sampler is introduced to facilitate computation that leverages on both the estimated score and Hessian information. We establish consistency of the quasi-maximum likelihood estimation, and provide non-asymptotic convergence guarantees for the proposed sampler, quantifying the rates of the approximation errors due to the score and Hessian estimation, dimensionality, and the number of diffusion steps. Empirical and simulation evaluations demonstrate the effectiveness of the proposed Likelihood Matching and validate the theoretical results.

[LG-75] Machine Learning Algorithms for Transplanting Accelerometer Observations in Future Satellite Gravimetry Missions

Link: https://arxiv.org/abs/2508.03522
Authors: Mohsen Romeshkani,Jürgen Müller,Sahar Ebadi,Alexey Kupriyanov,Annike Knabe,Nina Fletling,Manuel Schilling
Subjects: Geophysics (physics.geo-ph); Machine Learning (cs.LG)

Abstract:Accurate and continuous monitoring of Earth’s gravity field is essential for tracking mass redistribution processes linked to climate variability, hydrological cycles, and geodynamic phenomena. While the GRACE and GRACE Follow-On (GRACE-FO) missions have set the benchmark for satellite gravimetry using low-low satellite to satellite tracking (LL-SST), the precision of gravity field recovery still strongly depends on the quality of accelerometer (ACC) performance and the continuity of ACC data. Traditional electrostatic accelerometers (EA) face limitations that can hinder mission outcomes, prompting exploration of advanced sensor technologies and data recovery techniques. This study presents a systematic evaluation of accelerometer data transplantation using novel accelerometer configurations, including Cold Atom Interferometry (CAI) accelerometers and hybrid EA-CAI setups, and applying both analytical and machine learning-based methods. Using comprehensive closed-loop LL-SST simulations, we compare four scenarios ranging from the conventional EA-only setup to ideal dual hybrid configurations, with a particular focus on the performance of transplant-based approaches using different neural network approaches. Our results show that the dual hybrid configuration provides the most accurate gravity field retrieval. However, the transplant-based hybrid setup, especially when supported by machine learning, emerges as a robust and cost-effective alternative, achieving comparable performance with minimal extra hardware. These findings highlight the promise of combining quantum sensor technology and data-driven transplantation for future satellite gravimetry missions, paving the way for improved global monitoring of Earth’s dynamic gravity field.

[LG-76] Quantum Neural Network applications to Protein Binding Affinity Predictions

Link: https://arxiv.org/abs/2508.03446
Authors: Erico Souza Teixeira,Lucas Barros Fernandes,Yara Rodrigues Inácio
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures

Abstract:Binding energy is a fundamental thermodynamic property that governs molecular interactions, playing a crucial role in fields such as healthcare and the natural sciences. It is particularly relevant in drug development, vaccine design, and other biomedical applications. Over the years, various methods have been developed to estimate protein binding energy, ranging from experimental techniques to computational approaches, with machine learning making significant contributions to this field. Although classical computing has demonstrated strong results in constructing predictive models, the variation of quantum computing for machine learning has emerged as a promising alternative. Quantum neural networks (QNNs) have gained traction as a research focus, raising the question of their potential advantages in predicting binding energies. To investigate this potential, this study explored the feasibility of QNNs for this task by proposing thirty variations of multilayer perceptron-based quantum neural networks. These variations span three distinct architectures, each incorporating ten different quantum circuits to configure their quantum layers. The performance of these quantum models was compared with that of a state-of-the-art classical multilayer perceptron-based artificial neural network, evaluating both accuracy and training time. A primary dataset was used for training, while two additional datasets containing entirely unseen samples were employed for testing. Results indicate that the quantum models achieved approximately 20% higher accuracy on one unseen dataset, although their accuracy was lower on the other datasets. Notably, quantum models exhibited training times several orders of magnitude shorter than their classical counterparts, highlighting their potential for efficient protein binding energy prediction.

[LG-77] Model Accuracy and Data Heterogeneity Shape Uncertainty Quantification in Machine Learning Interatomic Potentials

Link: https://arxiv.org/abs/2508.03405
Authors: Fei Shuang,Zixiong Wei,Kai Liu,Wei Gao,Poulumi Dey
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Machine learning interatomic potentials (MLIPs) enable accurate atomistic modelling, but reliable uncertainty quantification (UQ) remains elusive. In this study, we investigate two UQ strategies, ensemble learning and D-optimality, within the atomic cluster expansion framework. It is revealed that higher model accuracy strengthens the correlation between predicted uncertainties and actual errors and improves novelty detection, with D-optimality yielding more conservative estimates. Both methods deliver well calibrated uncertainties on homogeneous training sets, yet they underpredict errors and exhibit reduced novelty sensitivity on heterogeneous datasets. To address this limitation, we introduce clustering-enhanced local D-optimality, which partitions configuration space into clusters during training and applies D-optimality within each cluster. This approach substantially improves the detection of novel atomic environments in heterogeneous datasets. Our findings clarify the roles of model fidelity and data heterogeneity in UQ performance and provide a practical route to robust active learning and adaptive sampling strategies for MLIP development.

[LG-78] A Dual Optimization View to Empirical Risk Minimization with f-Divergence Regularization

Link: https://arxiv.org/abs/2508.03314
Authors: Francisco Daunas,Iñaki Esnaola,Samir M. Perlaza
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: Conference paper to appear in ITW 2025. arXiv admin note: substantial text overlap with arXiv:2502.14544; text overlap with arXiv:2402.00501

Abstract:The dual formulation of empirical risk minimization with f-divergence regularization (ERM-fDR) is introduced. The solution of the dual optimization problem to the ERM-fDR is connected to the notion of normalization function introduced as an implicit function. This dual approach leverages the Legendre-Fenchel transform and the implicit function theorem to provide a nonlinear ODE expression to the normalization function. Furthermore, the nonlinear ODE expression and its properties provide a computationally efficient method to calculate the normalization function of the ERM-fDR solution under a mild condition.

[LG-79] PatchDSU: Uncertainty Modeling for Out of Distribution Generalization in Keyword Spotting

Link: https://arxiv.org/abs/2508.03190
Authors: Bronya Roni Chernyak,Yael Segal,Yosi Shrem,Joseph Keshet
Subjects: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
Comments: This work has been submitted to the IEEE for possible publication

Abstract:Deep learning models excel at many tasks but rely on the assumption that training and test data follow the same distribution. This assumption often does not hold in real-world speech systems, where distribution shifts are common due to varying environments, recording conditions, and speaker diversity. The method of Domain Shifts with Uncertainty (DSU) augments the input of each neural network layer based on the input feature statistics. It addresses the problem of out-of-domain generalization by assuming feature statistics follow a multivariate Gaussian distribution and substitutes the input with sampled features from this distribution. While effective for computer vision, applying DSU to speech presents challenges due to the nature of the data. Unlike static visual data, speech is a temporal signal commonly represented by a spectrogram - the change of frequency over time. This representation cannot be treated as a simple image, and the resulting sparsity can lead to skewed feature statistics when applied to the entire input. To tackle out-of-distribution issues in keyword spotting, we propose PatchDSU, which extends DSU by splitting the input into patches and independently augmenting each patch. We evaluated PatchDSU and DSU alongside other methods on the Google Speech Commands, Librispeech, and TED-LIUM. Additionally, we evaluated performance under white Gaussian and MUSAN music noise conditions. We also explored out-of-domain generalization by analyzing model performance on datasets they were not trained on. Overall, in most cases, both PatchDSU and DSU outperform other methods. Notably, PatchDSU demonstrates more consistent improvements across the evaluated scenarios compared to other approaches. 
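
The patch-wise augmentation described in this abstract can be sketched in NumPy as follows. This is an illustrative reading and not the authors' code: the function names `dsu` and `patch_dsu`, the patch size, and the choice to estimate the spread of the feature statistics from their variance across the batch are all assumptions. The input is taken to be a batch of spectrogram-like feature maps of shape `(batch, channels, freq, time)`.

```python
import numpy as np

def dsu(x, rng, eps=1e-6):
    """Resample per-instance channel statistics from a Gaussian (assumed scheme)."""
    mu = x.mean(axis=(2, 3), keepdims=True)                   # (B, C, 1, 1)
    sigma = np.sqrt(x.var(axis=(2, 3), keepdims=True) + eps)  # (B, C, 1, 1)
    # Spread of each statistic, estimated across the batch (assumption).
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread
    # Normalize with the original statistics, then re-style with the sampled ones.
    return new_sigma * (x - mu) / sigma + new_mu

def patch_dsu(x, patch, rng):
    """Apply the augmentation independently to each (freq, time) patch."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pf, pt = patch
    for f0 in range(0, x.shape[2], pf):
        for t0 in range(0, x.shape[3], pt):
            block = x[:, :, f0:f0 + pf, t0:t0 + pt]
            out[:, :, f0:f0 + pf, t0:t0 + pt] = dsu(block, rng)
    return out

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 2, 8, 16))   # toy batch of feature maps
augmented = patch_dsu(features, patch=(4, 4), rng=rng)
```

Computing the statistics per patch rather than over the whole input is what keeps sparse, mostly silent regions of a spectrogram from skewing the statistics used elsewhere.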

[LG-80] The Open DAC 2025 Dataset for Sorbent Discovery in Direct Air Capture

Link: https://arxiv.org/abs/2508.03162
Authors: Anuroop Sriram,Logan M. Brabson,Xiaohan Yu,Sihoon Choi,Kareem Abdelmaqsoud,Elias Moubarak,Pim de Haan,Sindy Löwe,Johann Brehmer,John R. Kitchin,Max Welling,C. Lawrence Zitnick,Zachary Ulissi,Andrew J. Medford,David S. Sholl
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG)

Abstract:Identifying useful sorbent materials for direct air capture (DAC) from humid air remains a challenge. We present the Open DAC 2025 (ODAC25) dataset, a significant expansion and improvement upon ODAC23 (Sriram et al., ACS Central Science, 10 (2024) 923), comprising nearly 70 million DFT single-point calculations for CO$_2$, H$_2$O, N$_2$, and O$_2$ adsorption in 15,000 MOFs. ODAC25 introduces chemical and configurational diversity through functionalized MOFs, high-energy GCMC-derived placements, and synthetically generated frameworks. ODAC25 also significantly improves upon the accuracy of DFT calculations and the treatment of flexible MOFs in ODAC23. Along with the dataset, we release new state-of-the-art machine-learned interatomic potentials trained on ODAC25 and evaluate them on adsorption energy and Henry’s law coefficient predictions.

[LG-81] Hedging with memory: shallow and deep learning with signatures

Link: https://arxiv.org/abs/2508.02759
Authors: Eduardo Abi Jaber,Louis-Amand Gérard
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)

Abstract:We investigate the use of path signatures in a machine learning context for hedging exotic derivatives under non-Markovian stochastic volatility models. In a deep learning setting, we use signatures as features in feedforward neural networks and show that they outperform LSTMs in most cases, with orders of magnitude less training compute. In a shallow learning setting, we compare two regression approaches: the first directly learns the hedging strategy from the expected signature of the price process; the second models the dynamics of volatility using a signature volatility model, calibrated on the expected signature of the volatility. Solving the hedging problem in the calibrated signature volatility model yields more accurate and stable results across different payoffs and volatility dynamics.
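
For readers unfamiliar with path signatures, the sketch below computes the first two signature levels of a piecewise-linear path, which could then be flattened into a feature vector for a feedforward network. It is a generic illustration and not the authors' pipeline: `signature_level2` is a hypothetical helper, the truncation at level 2 is an arbitrary choice, and the update rule is Chen's identity applied one linear segment at a time.

```python
import numpy as np

def signature_level2(path):
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    path has shape (n_points, d). Returns (s1, s2) with shapes (d,) and (d, d).
    """
    path = np.asarray(path, dtype=float)
    d = path.shape[1]
    s1 = np.zeros(d)          # level 1: total increment
    s2 = np.zeros((d, d))     # level 2: iterated integrals
    for dx in np.diff(path, axis=0):
        # Chen's identity for appending one linear segment with increment dx.
        s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
        s1 += dx
    return s1, s2

# An L-shaped path in the plane: right by 1, then up by 1.
s1, s2 = signature_level2([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
features = np.concatenate([s1, s2.ravel()])   # input vector for a regressor
levy_area = 0.5 * (s2[0, 1] - s2[1, 0])       # signed area swept by the path
```

The antisymmetric part of level 2 records the order in which the coordinates moved, which is the kind of path-dependent (memory) information a Markovian feature would miss.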

[LG-82] MPCA-based Domain Adaptation for Transfer Learning in Ultrasonic Guided Waves

Link: https://arxiv.org/abs/2508.02726
Authors: Lucio Pinello,Francesco Cadini,Luca Lomazzi
Subjects: Image and Video Processing (eess.IV); Machine Learning (cs.LG); Signal Processing (eess.SP)

Abstract:Ultrasonic Guided Waves (UGWs) represent a promising diagnostic tool for Structural Health Monitoring (SHM) in thin-walled structures, and their integration with machine learning (ML) algorithms is increasingly being adopted to enable real-time monitoring capabilities. However, the large-scale deployment of UGW-based ML methods is constrained by data scarcity and limited generalisation across different materials and sensor configurations. To address these limitations, this work proposes a novel transfer learning (TL) framework based on Multilinear Principal Component Analysis (MPCA). First, a Convolutional Neural Network (CNN) for regression is trained to perform damage localisation for a plated structure. Then, MPCA and fine-tuning are combined to have the CNN work for a different plate. By jointly applying MPCA to the source and target domains, the method extracts shared latent features, enabling effective domain adaptation without requiring prior assumptions about dimensionality. Following MPCA, fine-tuning enables adapting the pre-trained CNN to a new domain without the need for a large training dataset. The proposed MPCA-based TL method was tested against 12 case studies involving different composite materials and sensor arrays. Statistical metrics were used to assess domain alignment both before and after MPCA, and the results demonstrate a substantial reduction in localisation error compared to standard TL techniques. Hence, the proposed approach emerges as a robust, data-efficient, and statistically based TL framework for UGW-based SHM.

[LG-83] Physics-guided denoiser network for enhanced additive manufacturing data quality

Link: https://arxiv.org/abs/2508.02712
Authors: Pallock Halder,Satyajit Mojumder
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG); Systems and Control (eess.SY)
Comments: 28 pages, 13 figures, 5 tables

Abstract:Modern engineering systems are increasingly equipped with sensors for real-time monitoring and decision-making. However, the data collected by these sensors is often noisy and difficult to interpret, limiting its utility for control and diagnostics. In this work, we propose a physics-informed denoising framework that integrates an energy-based model and Fisher score regularization to jointly reduce data noise and enforce physical consistency with a physics-based model. The approach is first validated on benchmark problems, including the simple harmonic oscillator, Burgers’ equation, and Laplace’s equation, across varying noise levels. We then apply the denoising framework to real thermal emission data from laser powder bed fusion (LPBF) additive manufacturing experiments, using a trained Physics-Informed Neural Network (PINN) surrogate model of the LPBF process to guide denoising. Results show that the proposed method outperforms baseline neural network denoisers, effectively reducing noise under a range of LPBF processing conditions. This physics-guided denoising strategy enables robust, real-time interpretation of low-cost sensor data, facilitating predictive control and improved defect mitigation in additive manufacturing.

[LG-84] Measuring Dependencies between Biological Signals with Temporal Self-supervision and its Limitations NEURIPS2025

Link: https://arxiv.org/abs/2508.02703
Authors: Evangelos Sariyanidi,John D. Herrington,Lisa Yankowitz,Pratik Chaudhari,Theodore D. Satterthwaite,Casey J. Zampella,Robert T. Schultz,Russell T. Shinohara,Birkan Tunc
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)
Comments: To be submitted to NeurIPS 2025 AI for Science Workshop

Abstract:Measuring the statistical dependence between observed signals is a primary tool for scientific discovery. However, biological systems often exhibit complex non-linear interactions that currently cannot be captured without a priori knowledge regarding the nature of dependence. We introduce a self-supervised approach, concurrence, which is inspired by the observation that if two signals are dependent, then one should be able to distinguish between temporally aligned vs. misaligned segments extracted from them. Experiments with fMRI, physiological and behavioral signals show that, to our knowledge, concurrence is the first approach that can expose relationships across such a wide spectrum of signals and extract scientifically relevant differences without ad-hoc parameter tuning or reliance on a priori information, providing a potent tool for scientific discoveries across fields. However, dependencies caused by extraneous factors remain an open problem, thus researchers should validate that exposed relationships truly pertain to the question(s) of interest.
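
The self-supervision signal described above, telling temporally aligned segment pairs apart from misaligned ones, can be sketched as a data-sampling routine. This is an assumption-laden illustration rather than the authors' method: `sample_segment_pairs`, the alternating label scheme, and the `min_shift` parameter are all hypothetical, and the classifier that would be trained on these pairs is left out. If such a classifier beats chance on held-out pairs, that is evidence the two signals are dependent.

```python
import numpy as np

def sample_segment_pairs(x, y, seg_len, n_pairs, min_shift, rng):
    """Draw segment pairs from two 1-D signals; label 1 = aligned, 0 = misaligned."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    max_start = min(len(x), len(y)) - seg_len
    if max_start < 2 * min_shift:
        raise ValueError("signals too short for the requested segment length and shift")
    starts = np.arange(max_start + 1)
    seg_x, seg_y, labels = [], [], []
    for i in range(n_pairs):
        sx = int(rng.integers(0, max_start + 1))
        aligned = (i % 2 == 0)            # alternate positives and negatives
        if aligned:
            sy = sx
        else:
            # Pick a start at least min_shift samples away from sx.
            far = starts[np.abs(starts - sx) >= min_shift]
            sy = int(rng.choice(far))
        seg_x.append(x[sx:sx + seg_len])
        seg_y.append(y[sy:sy + seg_len])
        labels.append(int(aligned))
    return np.stack(seg_x), np.stack(seg_y), np.array(labels)

rng = np.random.default_rng(0)
t = np.arange(100, dtype=float)           # toy signals: both equal the time index
sx, sy, labels = sample_segment_pairs(t, t, seg_len=10, n_pairs=6, min_shift=5, rng=rng)
```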

[LG-85] Evaluating Transfer Learning Methods on Real-World Data Streams: A Case Study in Financial Fraud Detection ECML KDD2025

Link: https://arxiv.org/abs/2508.02702
Authors: Ricardo Ribeiro Pereira,Jacopo Bono,Hugo Ferreira,Pedro Ribeiro,Carlos Soares,Pedro Bizarro
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG)
Comments: 16 pages, 7 figures, submitted to ECML PKDD 2025

Abstract:When the available data for a target domain is limited, transfer learning (TL) methods can be used to develop models on related data-rich domains, before deploying them on the target domain. However, these TL methods are typically designed with specific, static assumptions on the amount of available labeled and unlabeled target data. This is in contrast with many real world applications, where the availability of data and corresponding labels varies over time. Since the evaluation of the TL methods is typically also performed under the same static data availability assumptions, this would lead to unrealistic expectations concerning their performance in real world settings. To support a more realistic evaluation and comparison of TL algorithms and models, we propose a data manipulation framework that (1) simulates varying data availability scenarios over time, (2) creates multiple domains through resampling of a given dataset and (3) introduces inter-domain variability by applying realistic domain transformations, e.g., creating a variety of potentially time-dependent covariate and concept shifts. These capabilities enable simulation of a large number of realistic variants of the experiments, in turn providing more information about the potential behavior of algorithms when deployed in dynamic settings. We demonstrate the usefulness of the proposed framework by performing a case study on a proprietary real-world suite of card payment datasets. Given the confidential nature of the case study, we also illustrate the use of the framework on the publicly available Bank Account Fraud (BAF) dataset. By providing a methodology for evaluating TL methods over time and in realistic data availability scenarios, our framework facilitates understanding of the behavior of models and algorithms. This leads to better decision making when deploying models for new domains in real-world environments.

[LG-86] On Improving PPG-Based Sleep Staging: A Pilot Study

Link: https://arxiv.org/abs/2508.02689
Authors: Jiawei Wang,Yu Guan,Chen Chen,Ligang Zhou,Laurence T. Yang,Sai Gu
Subjects: Signal Processing (eess.SP); Machine Learning (cs.LG)

Abstract:Sleep monitoring through accessible wearable technology is crucial to improving well-being in ubiquitous computing. Although photoplethysmography (PPG) sensors are widely adopted in consumer devices, achieving consistently reliable sleep staging using PPG alone remains a non-trivial challenge. In this work, we explore multiple strategies to enhance the performance of PPG-based sleep staging. Specifically, we compare a conventional single-stream model with dual-stream cross-attention strategies, based on which complementary information can be learned via PPG and PPG-derived modalities such as augmented PPG or synthetic ECG. To study the effectiveness of the aforementioned approaches in the four-stage sleep monitoring task, we conducted experiments on the world’s largest sleep staging dataset, i.e., the Multi-Ethnic Study of Atherosclerosis (MESA). We found that a substantial performance gain can be achieved by combining PPG and its auxiliary information under the dual-stream cross-attention architecture. Source code of this project can be found at this https URL

[LG-87] Benchmarking Classical and Quantum Models for DeFi Yield Prediction on Curve Finance

Link: https://arxiv.org/abs/2508.02685
Authors: Chi-Sheng Chen,Aidan Hung-Wen Tsai
Subjects: Statistical Finance (q-fin.ST); Machine Learning (cs.LG); Trading and Market Microstructure (q-fin.TR)

Abstract:The rise of decentralized finance (DeFi) has created a growing demand for accurate yield and performance forecasting to guide liquidity allocation strategies. In this study, we benchmark six models, XGBoost, Random Forest, LSTM, Transformer, quantum neural networks (QNN), and quantum support vector machines with quantum feature maps (QSVM-QNN), on one year of historical data from 28 Curve Finance pools. We evaluate model performance on test MAE, RMSE, and directional accuracy. Our results show that classical ensemble models, particularly XGBoost and Random Forest, consistently outperform both deep learning and quantum models. XGBoost achieves the highest directional accuracy (71.57%) with a test MAE of 1.80, while Random Forest attains the lowest test MAE of 1.77 and 71.36% accuracy. In contrast, quantum models underperform with directional accuracy below 50% and higher errors, highlighting current limitations in applying quantum machine learning to real-world DeFi time series data. This work offers a reproducible benchmark and practical insights into model suitability for DeFi applications, emphasizing the robustness of classical methods over emerging quantum approaches in this domain.
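
The three evaluation metrics named in this abstract can be written down in a few lines. MAE and RMSE are standard; the abstract does not define directional accuracy, so the version below (the fraction of steps where the predicted move from the previous observed value has the same sign as the actual move) is one common definition adopted here as an assumption. The function names and toy numbers are illustrative only.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def directional_accuracy(y_true, y_pred):
    """Share of steps where the predicted and actual moves have the same sign.

    Both moves are measured from the previous observed value (assumed definition).
    """
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    actual_move = np.sign(y_true[1:] - y_true[:-1])
    predicted_move = np.sign(y_pred[1:] - y_true[:-1])
    return float(np.mean(actual_move == predicted_move))

# Toy yield series and forecasts (made-up numbers for illustration).
observed = [1.0, 2.0, 3.0, 2.0]
forecast = [1.0, 2.5, 1.5, 2.5]
scores = (mae(observed, forecast), rmse(observed, forecast),
          directional_accuracy(observed, forecast))
```

Under this definition a model that guesses the direction at random would land near 50%, which is the reference point for reading the directional-accuracy figures in the abstract.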

Information Retrieval

[IR-0] Demystifying Sequential Recommendations: Counterfactual Explanations via Genetic Algorithms

Link: https://arxiv.org/abs/2508.03606
Authors: Domiziano Scarcelli,Filippo Betello,Giuseppe Perelli,Fabrizio Silvestri,Gabriele Tolomei
Subjects: Information Retrieval (cs.IR)

Abstract:Sequential Recommender Systems (SRSs) have demonstrated remarkable effectiveness in capturing users’ evolving preferences. However, their inherent complexity as “black box” models poses significant challenges for explainability. This work presents the first counterfactual explanation technique specifically developed for SRSs, introducing a novel approach in this space, addressing the key question: What minimal changes in a user’s interaction history would lead to different recommendations? To achieve this, we introduce a specialized genetic algorithm tailored for discrete sequences and show that generating counterfactual explanations for sequential data is an NP-Complete problem. We evaluate these approaches across four experimental settings, varying between targeted-untargeted and categorized-uncategorized scenarios, to comprehensively assess their capability in generating meaningful explanations. Using three different datasets and three models, we are able to demonstrate that our methods successfully generate interpretable counterfactual explanations while maintaining model fidelity close to one. Our findings contribute to the growing field of Explainable AI by providing a framework for understanding sequential recommendation decisions through the lens of “what-if” scenarios, ultimately enhancing user trust and system transparency.

[IR-1] OpenLifelogQA: An Open-Ended Multi-Modal Lifelog Question-Answering Dataset

Link: https://arxiv.org/abs/2508.03583
Authors: Quang-Linh Tran,Binh Nguyen,Gareth J. F. Jones,Cathal Gurrin
Subjects: Multimedia (cs.MM); Information Retrieval (cs.IR)

Abstract:Lifelogging refers to the process of passively collecting, storing, and analysing personal daily life data using wearable devices. This data can support applications in memory preservation and enhancement. For example, using an ask-and-answer strategy, question-answering (QA) on lifelog data opens an interactive and interesting way to explore memorable events and insights into daily life. However, research resources for QA on lifelog data are limited to small-sized or synthetic QA datasets. In this paper, we present a novel lifelog QA dataset called OpenLifelogQA, building upon an 18-month lifelog dataset. Our dataset focuses on an open-ended and practical QA with real-world application in daily lifelog usage. We construct 14,187 pairs of QA with diverse types and difficulty levels. A baseline experiment is reported for this dataset with competitive average performance of 89.7% BERT Score, 25.87% ROUGE-L and 3.9665 LLM Score from LLaVA-NeXT-Interleave 7B model. We release this QA dataset to the research community to support new research into lifelog technologies, such as enabling personal chat-based assistants for lifelog data to become a reality.

[IR-2] Dual-disentangle Framework for Diversified Sequential Recommendation

Link: https://arxiv.org/abs/2508.03172
Authors: Haoran Zhang,Jingtong Liu,Jiangzhou Deng,Junpeng Guo
Subjects: Information Retrieval (cs.IR)

Abstract:Sequential recommendation predicts user preferences over time and has achieved remarkable success. However, the growing length of user interaction sequences and the complex entanglement of evolving user interests and intentions introduce significant challenges to diversity. To address these, we propose a model-agnostic Dual-disentangle framework for Diversified Sequential Recommendation (DDSRec). The framework refines user interest and intention modeling by adopting disentangling perspectives in interaction modeling and representation learning, thereby balancing accuracy and diversity in sequential recommendations. Extensive experiments on multiple public datasets demonstrate the effectiveness and superiority of DDSRec in terms of accuracy and diversity for sequential recommendations.

[IR-3] ADSeeker: A Knowledge-Infused Framework for Anomaly Detection and Reasoning

Link: https://arxiv.org/abs/2508.03088
Authors: Kai Zhang,Zekai Zhang,Xihe Sun,Jingmeng Nie,Qinghui Chen,Han Hao,Jianyuan Guo,Jinglin Zhang
Subjects: Information Retrieval (cs.IR)

Abstract:Automatic vision inspection holds significant importance in industrial inspection. While multimodal large language models (MLLMs) exhibit strong language understanding capabilities and hold promise for this task, their performance remains significantly inferior to that of human experts. In this context, we identify two key challenges: (i) insufficient integration of anomaly detection (AD) knowledge during pre-training, and (ii) the lack of technically precise and context-aware language generation for anomaly reasoning. To address these issues, we propose ADSeeker, an anomaly task assistant designed to enhance inspection performance through knowledge-grounded reasoning. ADSeeker leverages a curated visual document knowledge base, SEEK-MVTecVisA (SEEK-MV), which we construct to address the limitations of existing resources that rely solely on unstructured text. SEEK-MV includes semantic-rich descriptions and image-document pairs, enabling more comprehensive anomaly understanding. To effectively retrieve and utilize this knowledge, we introduce the Query Image-Knowledge Retrieval-Augmented Generation (Q2K RAG) framework. To further enhance the performance in zero-shot anomaly detection (ZSAD), ADSeeker leverages the Hierarchical Sparse Prompt mechanism and type-level features to efficiently extract anomaly patterns. Furthermore, to tackle the challenge of limited industrial anomaly detection (IAD) data, we introduce the largest-scale AD dataset, Multi-type Anomaly (MulA), encompassing 72 multi-scale defect types across 26 categories. Extensive experiments show that our plug-and-play framework, ADSeeker, achieves state-of-the-art zero-shot performance on several benchmark datasets.

[IR-4] KBest: Efficient Vector Search on Kunpeng CPU

Link: https://arxiv.org/abs/2508.03016
Authors: Kaihao MA,Meiling Wang,Senkevich Oleg,Zijian LI,Daihao Xue,Dmitriy Malyshev,Yangming Lv,Shihai Xiao,Xiao Yan,Radionov Alexander,Weidi Zeng,Yuanzhan Gao,Zhiyu Zou,Yao xin,Liu Lin,Junhao Wu,Yiding Liu,Yaoyao Fu,Gongyi Wang,Gong Zhang,Fei Yi,Yingfan Liu
Subjects: Information Retrieval (cs.IR)

Abstract:Vector search, which returns the vectors most similar to a given query vector from a large vector dataset, underlies many important applications such as search, recommendation, and LLMs. To be economical, vector search needs to be efficient to reduce the resources required by a given query workload. However, existing vector search libraries (e.g., Faiss and DiskANN) are optimized for x86 CPU architectures (i.e., Intel and AMD CPUs) while Huawei Kunpeng CPUs are based on the ARM architecture and competitive in compute power. In this paper, we present KBest as a vector search library tailored for the latest Kunpeng 920 CPUs. To be efficient, KBest incorporates extensive hardware-aware and algorithmic optimizations, which include single-instruction-multiple-data (SIMD) accelerated distance computation, data prefetch, index refinement, early termination, and vector quantization. Experimental results show that KBest outperforms SOTA vector search libraries running on x86 CPUs, and our optimizations can improve the query throughput by over 2x. Currently, KBest serves applications from both our internal business and external enterprise clients with tens of millions of queries on a daily basis.

[IR-5] SustainableQA: A Comprehensive Question Answering Dataset for Corporate Sustainability and EU Taxonomy Reporting

Link: https://arxiv.org/abs/2508.03000
Authors: Mohammed Ali,Abdelrahman Abdallah,Adam Jatowt
Subjects: Information Retrieval (cs.IR)

Abstract:The growing demand for corporate sustainability transparency, particularly under new regulations like the EU Taxonomy, necessitates precise data extraction from large, unstructured corporate reports. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems require high-quality, domain-specific question-answering (QA) datasets to excel in particular domains. To address this, we introduce SustainableQA, a novel dataset and a scalable pipeline for generating comprehensive QA datasets from corporate sustainability reports and annual reports. Our approach integrates semantic chunk classification, a hybrid span extraction pipeline combining fine-tuned Named Entity Recognition (NER), rule-based methods, and LLM-driven refinement, alongside a specialized table-to-paragraph transformation. With over 195,000 diverse factoid and non-factoid QA pairs, SustainableQA is an effective resource for developing and benchmarking advanced knowledge assistants capable of navigating complex sustainability compliance.

[IR-6] Investigating the Cognitive Response of Brake Lights in Initiating Braking Action Using EEG

Link: https://arxiv.org/abs/2508.03274
Authors: Ramaswamy Palaniappan,Surej Mouli,Howard Bowman,Ian McLoughlin
Subjects: Signal Processing (eess.SP); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Information Retrieval (cs.IR)
Comments: arXiv admin note: text overlap with arXiv:2010.10584

Abstract:Half of all road accidents result from either lack of driver attention or from maintaining insufficient separation between vehicles. Collision from the rear, in particular, has been identified as the most common class of accident in the UK, and its influencing factors have been widely studied for many years. Rear-mounted stop lamps, illuminated when braking, are the primary mechanism to alert following drivers to the need to reduce speed or brake. This paper develops a novel brain response approach to measuring subject reaction to different brake light designs. A variety of off-the-shelf brake light assemblies are tested in a physical simulated driving environment to assess the cognitive reaction times of 22 subjects. Eight pairs of LED-based and two pairs of incandescent bulb-based brake light assemblies are used and electroencephalogram (EEG) data recorded. Channel Pz is utilised to extract the P3 component evoked during the decision making process that occurs in the brain when a participant decides to lift their foot from the accelerator and depress the brake. EEG analysis shows that both incandescent bulb-based lights are statistically slower to evoke cognitive responses than all tested LED-based lights. Between the LED designs, differences are evident, but not statistically significant, attributed to the significant amount of movement artifact in the EEG signal.

Attachment Download

Click to download today's full paper list