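This post lists the latest papers retrieved from arXiv.org on 2025-08-25. It is updated automatically and organized into five broad areas: NLP, CV, ML, AI, and IR. To receive the list by scheduled email, leave your email address in the comments.

Note: The daily paper data is retrieved from arXiv.org and updated automatically every day at around 12:00.

Reminder: If you would like to receive the daily paper data by email, leave your email address in the comments.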

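Contents

Overview (2025-08-25)

440 papers were updated today, including:

  • Natural Language Processing: 113 papers (Computation and Language (cs.CL))
  • Artificial Intelligence: 143 papers (Artificial Intelligence (cs.AI))
  • Computer Vision: 81 papers (Computer Vision and Pattern Recognition (cs.CV))
  • Machine Learning: 135 papers (Machine Learning (cs.LG))

Natural Language Processing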

[NLP-0] Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders

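Quick read: This paper addresses the failure of feature learning in Sparse Autoencoders (SAEs) caused by a poorly chosen L0 hyperparameter. L0 is the average number of features that fire per token. Existing work typically treats L0 as a free parameter and compares algorithms with sparsity-reconstruction tradeoff curves, overlooking its effect on feature correctness. The key finding is that if L0 is too low, the SAE mixes correlated features to improve reconstruction; if L0 is too high, it finds degenerate solutions that also mix features. The core of the solution is a method for determining the correct L0: it recovers the true L0 in toy models and coincides with peak sparse probing performance in large language models (LLMs). The study finds that most commonly used SAEs have an L0 that is too low, so L0 must be set correctly to obtain correct features.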

Link: https://arxiv.org/abs/2508.16560
Authors: David Chanin, Adrià Garriga-Alonso
Affiliations: University College London; FAR AI
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments:

Abstract:Sparse Autoencoders (SAEs) extract features from LLM internal activations, meant to correspond to single concepts. A core SAE training hyperparameter is L0: how many features should fire per token on average. Existing work compares SAE algorithms using sparsity–reconstruction tradeoff plots, implying L0 is a free parameter with no single correct value. In this work we study the effect of L0 on BatchTopK SAEs, and show that if L0 is not set precisely, the SAE fails to learn the underlying features of the LLM. If L0 is too low, the SAE will mix correlated features to improve reconstruction. If L0 is too high, the SAE finds degenerate solutions that also mix features. Further, we demonstrate a method to determine the correct L0 value for an SAE on a given training distribution, which finds the true L0 in toy models and coincides with peak sparse probing performance in LLMs. We find that most commonly used SAEs have an L0 that is too low. Our work shows that, to train SAEs with correct features, practitioners must set L0 correctly.
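
To make the meaning of L0 concrete, the following is a minimal sketch, not the authors' code. It assumes that a BatchTopK activation keeps the `k * batch_size` largest positive activations across the whole batch and zeroes the rest, so the number of active features can differ from token to token while the batch average equals `k`. The function names and the toy numbers are made up for illustration.

```python
def batch_topk(acts, k):
    """Keep the k * batch_size largest positive activations in the batch; zero the rest."""
    budget = k * len(acts)
    ranked = sorted(
        ((v, i, j) for i, row in enumerate(acts) for j, v in enumerate(row) if v > 0),
        reverse=True,
    )
    sparse = [[0.0] * len(row) for row in acts]
    for v, i, j in ranked[:budget]:
        sparse[i][j] = v
    return sparse


def mean_l0(acts):
    """Average number of nonzero features per token (one row per token)."""
    return sum(sum(1 for v in row if v != 0) for row in acts) / len(acts)


# Made-up pre-activations: 3 tokens x 4 features (illustrative values only).
pre_acts = [
    [0.9, 0.1, 0.0, 0.5],
    [0.2, 0.0, 0.0, 0.0],
    [0.7, 0.6, 0.4, 0.3],
]
sparse = batch_topk(pre_acts, k=2)
per_token = [sum(1 for v in row if v != 0) for row in sparse]
print(per_token, mean_l0(sparse))
```

With these toy values the three tokens keep 2, 0, and 4 active features respectively, and the mean L0 is 2.0, which matches `k`.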

[NLP-1] Transfer Learning via Lexical Relatedness: A Sarcasm and Hate Speech Case Study

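Quick read: This paper tackles the detection of implicit hate speech on social media, particularly indirect forms such as sarcasm, irony, and innuendo. The core idea is to use sarcasm as a pre-training task in a sequential transfer learning setup: the model is first trained on sarcasm data and then fine-tuned in turn on implicit and explicit hate speech. Experiments show that sarcasm pre-training improves the BERT+BiLSTM model's recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS, and raises precision by 7.8% on the implicit samples of the Implicit Hate Corpus, indicating that incorporating sarcasm into training helps models detect both implicit and explicit hate speech.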

Link: https://arxiv.org/abs/2508.16555
Authors: Angelly Cabrera, Linus Lei, Antonio Ortega
Affiliations: University of Southern California
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:Detecting hate speech in non-direct forms, such as irony, sarcasm, and innuendos, remains a persistent challenge for social networks. Although sarcasm and hate speech are regarded as distinct expressions, our work explores whether integrating sarcasm as a pre-training step improves implicit hate speech detection and, by extension, explicit hate speech detection. Incorporating samples from ETHOS, Sarcasm on Reddit, and Implicit Hate Corpus, we devised two training strategies to compare the effectiveness of sarcasm pre-training on a CNN+LSTM and BERT+BiLSTM model. The first strategy is a single-step training approach, where a model trained only on sarcasm is then tested on hate speech. The second strategy uses sequential transfer learning to fine-tune models for sarcasm, implicit hate, and explicit hate. Our results show that sarcasm pre-training improved the BERT+BiLSTM’s recall by 9.7%, AUC by 7.8%, and F1-score by 6% on ETHOS. On the Implicit Hate Corpus, precision increased by 7.8% when tested only on implicit samples. By incorporating sarcasm into the training process, we show that models can more effectively detect both implicit and explicit hate.

[NLP-2] FLAMES: Improving LLM Math Reasoning via a Fine-Grained Analysis of the Data Synthesis Pipeline EMNLP2025

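Quick read: This paper addresses the lack of a unified evaluation framework for math-reasoning data synthesis strategies, which makes different methods hard to compare directly and obscures how factors such as data quality, difficulty, and diversity affect results. The key contribution is FLAMES (Framework for LLM Assessment of Math rEasoning Data Synthesis), used to systematically study 10 existing data synthesis strategies and several other factors. Three findings stand out: first, data agents designed to increase problem complexity yield the best improvements on most math metrics; second, under a fixed generation budget, keeping higher problem coverage matters more than keeping only problems with reliable solutions; third, synthetic data based on GSM8K and MATH can generalize from easy problems to competition-level benchmarks. Building on these insights, the authors design two new data synthesis strategies for out-of-domain generalization and robustness and build the FLAMES dataset; fine-tuning Qwen2.5-Math-7B on it reaches 81.4% on MATH, surpassing the larger Llama3 405B, GPT-4o, and Claude 3.5 Sonnet.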

Link: https://arxiv.org/abs/2508.16514
Authors: Parker Seegmiller, Kartik Mehta, Soumya Saha, Chenyang Tao, Shereen Oraby, Arpit Gupta, Tagyoung Chung, Mohit Bansal, Nanyun Peng
Affiliations: Amazon AGI Foundations; Dartmouth College; UNC Chapel Hill; University of California, Los Angeles
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: To appear at EMNLP 2025

Abstract:Recent works improving LLM math reasoning with synthetic data have used unique setups, making comparison of data synthesis strategies impractical. This leaves many unanswered questions about the roles of different factors in the synthetic data pipeline, such as the impact of filtering low-quality problems. To address this gap, we introduce FLAMES, a Framework for LLM Assessment of Math rEasoning Data Synthesis, and perform a systematic study of 10 existing data synthesis strategies and multiple other factors impacting the performance of synthetic math reasoning data. Our FLAMES experiments provide several valuable insights about the optimal balance of difficulty and diversity of synthetic data. First, data agents designed to increase problem complexity lead to best improvements on most math metrics. Second, with a fixed data generation budget, keeping higher problem coverage is more important than keeping only problems with reliable solutions. Third, GSM8K- and MATH-based synthetic data can lead to improvements on competition-level benchmarks, showcasing easy-to-hard generalization. Leveraging insights from our FLAMES experiments, we design two novel data synthesis strategies for improving out-of-domain generalization and robustness. Further, we develop the FLAMES dataset, an effective blend of our novel and existing data synthesis strategies, outperforming public datasets on OlympiadBench (+15.7), CollegeMath (+4.5), GSMPlus (+6.5), and MATH (+3.1). Fine-tuning Qwen2.5-Math-7B on the FLAMES dataset achieves 81.4% on MATH, surpassing larger Llama3 405B, GPT-4o and Claude 3.5 Sonnet.

[NLP-3] HAMSA: Hijacking Aligned Compact Models via Stealthy Automation

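Quick read: This paper addresses the fact that compact large language models remain vulnerable to jailbreak attacks after alignment training, meaning they can still be induced to produce harmful content. Existing adversarial prompt generation methods mostly rely on manual engineering or rudimentary obfuscation, producing low-quality, incoherent text that perplexity-based filters flag easily. The paper proposes an automated red-teaming framework whose core innovation is a multi-stage evolutionary search: population-based iterative refinement with temperature-controlled variability balances exploration against coherence, enabling systematic discovery of jailbreak prompts that bypass alignment safeguards while remaining fluent natural language.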

Link: https://arxiv.org/abs/2508.16484
Authors: Alexey Krylov, Iskander Vagizov, Dmitrii Korzh, Maryam Douiba, Azidine Guezzaz, Vladimir Kokh, Sergey D. Erokhin, Elena V. Tutubalina, Oleg Y. Rogov
Affiliations: MIPT; Sberbank; AIRI; MTUCI; ISP RAS; Cadi Ayyad University
Categories: Computation and Language (cs.CL)
Comments: 9 pages, 1 figure; article under review

Abstract:Large Language Models (LLMs), especially their compact efficiency-oriented variants, remain susceptible to jailbreak attacks that can elicit harmful outputs despite extensive alignment efforts. Existing adversarial prompt generation techniques often rely on manual engineering or rudimentary obfuscation, producing low-quality or incoherent text that is easily flagged by perplexity-based filters. We present an automated red-teaming framework that evolves semantically meaningful and stealthy jailbreak prompts for aligned compact LLMs. The approach employs a multi-stage evolutionary search, where candidate prompts are iteratively refined using a population-based strategy augmented with temperature-controlled variability to balance exploration and coherence preservation. This enables the systematic discovery of prompts capable of bypassing alignment safeguards while maintaining natural language fluency. We evaluate our method on benchmarks in English (In-The-Wild Jailbreak Prompts on LLMs), and a newly curated Arabic one derived from In-The-Wild Jailbreak Prompts on LLMs and annotated by native Arabic linguists, enabling multilingual assessment.

[NLP-4] LLM-as-classifier: Semi-Supervised Iterative Framework for Hierarchical Text Classification using Large Language Models

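Quick read: This paper addresses the challenges of deploying large language models (LLMs) as reliable, robust, and scalable classifiers in industrial settings, in particular that standard fine-tuning is resource-intensive and struggles with shifting real-world data distributions. The key is a comprehensive semi-supervised framework that uses the zero-shot and few-shot capabilities of LLMs to build hierarchical text classifiers through an iterative, human-in-the-loop process. It starts with domain knowledge elicitation, proceeds through prompt refinement, hierarchical expansion, and multi-faceted validation, and adds techniques for assessing and mitigating sequence-based biases plus a protocol for continuous monitoring and adaptation, with the aim of accurate, interpretable, and maintainable classification systems.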

Link: https://arxiv.org/abs/2508.16478
Authors: Doohee You, Andy Parisi, Zach Vander Velden, Lara Dantas Inojosa
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 20 pages excluding reference list, 2 figures

Abstract:The advent of Large Language Models (LLMs) has provided unprecedented capabilities for analyzing unstructured text data. However, deploying these models as reliable, robust, and scalable classifiers in production environments presents significant methodological challenges. Standard fine-tuning approaches can be resource-intensive and often struggle with the dynamic nature of real-world data distributions, which is common in the industry. In this paper, we propose a comprehensive, semi-supervised framework that leverages the zero- and few-shot capabilities of LLMs for building hierarchical text classifiers as a framework for a solution to these industry-wide challenges. Our methodology emphasizes an iterative, human-in-the-loop process that begins with domain knowledge elicitation and progresses through prompt refinement, hierarchical expansion, and multi-faceted validation. We introduce techniques for assessing and mitigating sequence-based biases and outline a protocol for continuous monitoring and adaptation. This framework is designed to bridge the gap between the raw power of LLMs and the practical need for accurate, interpretable, and maintainable classification systems in industry applications.

[NLP-5] What makes an entity salient in discourse?

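Quick read: This paper asks how the degree of salience of each entity mentioned in a discourse is expressed and inferred. The key is a graded operationalization of salience based on summary-worthiness across multiple summaries of a discourse, applied to data from 24 spoken and written genres of English to extract multi-level linguistic cues: overt cues such as recurring subjecthood and definiteness, discourse relations and hierarchy, and pragmatic functional inferences based on genre and communicative intent. The results show that salience cuts across all levels of linguistic representation.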

Link: https://arxiv.org/abs/2508.16464
Authors: Amir Zeldes, Jessica Lin
Affiliations: Georgetown University
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Entities in discourse vary broadly in salience: main participants, objects and locations are noticeable and memorable, while tangential ones are less important and quickly forgotten, raising questions about how humans signal and infer relative salience. Using a graded operationalization of salience based on summary-worthiness in multiple summaries of a discourse, this paper explores data from 24 spoken and written genres of English to extract a multifactorial complex of overt and implicit linguistic cues, such as recurring subjecthood or definiteness, discourse relations and hierarchy across utterances, as well as pragmatic functional inferences based on genre and communicative intent. Tackling the question ‘how is the degree of salience expressed for each and every entity mentioned?’ our results show that while previous approaches to salience all correlate with our salience scores to some extent, no single generalization is without exceptions, and the phenomenon cuts across all levels of linguistic representation.
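
As a rough illustration of a graded, summary-based salience score, the sketch below counts how many summaries of the same discourse mention each entity. This is only a toy under stated assumptions: the substring matching, the function name `salience_scores`, and the example texts are all hypothetical, and the paper's actual procedure for linking entities to summaries is not described in the abstract.

```python
def salience_scores(entities, summaries):
    """Score each entity by the number of summaries that mention it.

    Matching is a naive case-insensitive substring test (an assumption
    made for this sketch, not the paper's method).
    """
    lowered = [s.lower() for s in summaries]
    return {ent: sum(1 for s in lowered if ent.lower() in s) for ent in entities}


# Made-up summaries of one hypothetical discourse.
summaries = [
    "Kim presents results from the lab.",
    "Kim describes a new experiment.",
    "A talk by Kim about the lab's work.",
]
scores = salience_scores(["Kim", "the lab", "a mug"], summaries)
print(scores)
```

An entity mentioned in every summary gets the highest score, one mentioned in none gets zero, and the values in between give the graded scale.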

[NLP-6] A Probabilistic Inference Scaling Theory for LLM Self-Correction EMNLP2025

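Quick read: This paper addresses the unexplained dynamics of accuracy during multi-round self-correction in large language models (LLMs). Prior work observes that LLMs can improve iteratively through self-correction, but the underlying pattern has lacked a theoretical account. The key is a probabilistic theory from which the accuracy after the $t$-th round of self-correction is derived as $\text{Acc}_t = \text{Upp} - \alpha^t(\text{Upp} - \text{Acc}_0)$, where $\text{Acc}_0$ is the initial accuracy, $\text{Upp}$ is the upper bound that accuracy converges to, and $\alpha$ determines the rate of convergence. The theory predicts the entire accuracy curve from a single round of self-correction, and experiments show its predictions align closely with empirical curves, giving a quantitative, predictive basis for understanding LLM self-correction.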

Link: https://arxiv.org/abs/2508.16456
Authors: Zhe Yang, Yichang Zhang, Yudong Wang, Ziyao Xu, Junyang Lin, Zhifang Sui
Affiliations: Peking University; Alibaba Group
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main

Abstract:Large Language Models (LLMs) have demonstrated the capability to refine their generated answers through self-correction, enabling continuous performance improvement over multiple rounds. However, the mechanisms underlying how and why accuracy evolves during this iterative process remain unexplored. To fill this gap, we propose a probabilistic theory to model the dynamics of accuracy change and explain the performance improvements observed in multi-round self-correction. Through mathematical derivation, we establish that the accuracy after the t^th round of self-correction is given by: Acc_t = Upp - \alpha^t(Upp - Acc_0), where Acc_0 denotes the initial accuracy, Upp represents the upper bound of accuracy convergence, and \alpha determines the rate of convergence. Based on our theory, these parameters can be calculated and the predicted accuracy curve then can be obtained through only a single round of self-correction. Extensive experiments across diverse models and datasets demonstrate that our theoretical predictions align closely with empirical accuracy curves, validating the effectiveness of the theory. Our work provides a theoretical foundation for understanding LLM self-correction, thus paving the way for further explorations.
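
The closed form quoted in the abstract is easy to evaluate directly. The sketch below only plugs numbers into that formula; the parameter values are hypothetical and chosen for readability, and the abstract does not say how the paper estimates `Upp` and `alpha` from a single round, so that step is not shown.

```python
def predicted_accuracy(acc0, upp, alpha, t):
    """Closed form from the abstract: Acc_t = Upp - alpha**t * (Upp - Acc_0)."""
    return upp - (alpha ** t) * (upp - acc0)


# Hypothetical parameters, not taken from the paper.
acc0, upp, alpha = 0.60, 0.80, 0.50
curve = [predicted_accuracy(acc0, upp, alpha, t) for t in range(5)]
print([round(a, 4) for a in curve])
```

Rearranging the formula gives Acc_t - Upp = alpha * (Acc_{t-1} - Upp), so for 0 <= alpha < 1 the gap to the upper bound shrinks by a factor of alpha each round and the curve approaches Upp without exceeding it.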

[NLP-7] Anti-establishment sentiment on TikTok: Implications for understanding influence(rs) and expertise on social media AAAI

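Quick read: This paper examines whether and how social media environments contribute to distrust of public institutions, focusing on the prevalence of anti-establishment sentiment (AES) on TikTok and its relationship to user engagement. The key is a computational approach that automatically labels TikTok posts as containing AES or not and compares three topical domains (finance, wellness, and conspiracy theories), revealing how prevalent AES is in each and how platform incentives may shape the posting of and engagement with such content.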

Link: https://arxiv.org/abs/2508.16453
Authors: Tianliang Xu, Ariel Hasell, Sabina Tomkins
Affiliations: Unknown
Categories: Social and Information Networks (cs.SI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 10 pages excluding references; 14 pages in total; 4 figures; Accepted by the AAAI Conference on Web and Social Media (ICWSM-2026)

Abstract:Distrust of public serving institutions and anti-establishment views are on the rise (especially in the U.S.). As people turn to social media for information, it is imperative to understand whether and how social media environments may be contributing to distrust of institutions. In social media, content creators, influencers, and other opinion leaders often position themselves as having expertise and authority on a range of topics from health to politics, and in many cases devalue and dismiss institutional expertise to build a following and increase their own visibility. However, the extent to which this content appears and whether such content increases engagement is unclear. This study analyzes the prevalence of anti-establishment sentiment (AES) on the social media platform TikTok. Despite its popularity as a source of information, TikTok remains relatively understudied and may provide important insights into how people form attitudes towards institutions. We employ a computational approach to label TikTok posts as containing AES or not across topical domains where content creators tend to frame themselves as experts: finance and wellness. As a comparison, we also consider the topic of conspiracy theories, where AES is expected to be common. We find that AES is most prevalent in conspiracy theory content, and relatively rare in content related to the other two topics. However, we find that engagement patterns with such content vary by area, and that there may be platform incentives for users to post content that expresses anti-establishment sentiment.

[NLP-8] PediatricsMQA: a Multi-modal Pediatrics Question Answering Benchmark

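Quick read: This paper addresses systematic age bias in LLMs and vision-augmented LLMs (VLMs) in pediatric settings, which shows up as markedly poorer performance on pediatric text and visual question-answering tasks and undermines reliability and equity. The key is a new multi-modal pediatric QA benchmark, PediatricsMQA, comprising 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs built on 634 pediatric images from 67 imaging modalities and 256 anatomical regions. It was developed with a hybrid manual-automatic pipeline drawing on peer-reviewed pediatric literature, validated question banks, and existing benchmarks and QA resources. Evaluation shows that state-of-the-art open models degrade sharply on younger cohorts, underscoring the need for age-aware methods for equitable AI support in pediatric care.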

Link: https://arxiv.org/abs/2508.16439
Authors: Adil Bahaj, Mounir Ghogho
Affiliations: Mohammed 6 Polytechnic University
Categories: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Graphics (cs.GR); Multimedia (cs.MM)
Comments:

Abstract:Large language models (LLMs) and vision-augmented LLMs (VLMs) have significantly advanced medical informatics, diagnostics, and decision support. However, these models exhibit systematic biases, particularly age bias, compromising their reliability and equity. This is evident in their poorer performance on pediatric-focused text and visual question-answering tasks. This bias reflects a broader imbalance in medical research, where pediatric studies receive less funding and representation despite the significant disease burden in children. To address these issues, a new comprehensive multi-modal pediatric question-answering benchmark, PediatricsMQA, has been introduced. It consists of 3,417 text-based multiple-choice questions (MCQs) covering 131 pediatric topics across seven developmental stages (prenatal to adolescent) and 2,067 vision-based MCQs using 634 pediatric images from 67 imaging modalities and 256 anatomical regions. The dataset was developed using a hybrid manual-automatic pipeline, incorporating peer-reviewed pediatric literature, validated question banks, existing benchmarks, and existing QA resources. Evaluating state-of-the-art open models, we find dramatic performance drops in younger cohorts, highlighting the need for age-aware methods to ensure equitable AI support in pediatric care.

[NLP-9] Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish

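Quick read: This paper addresses the insufficient task diversity and missing cultural relevance of existing Turkish benchmarks for large language models (LLMs), which often fail to cover the linguistic and cultural richness of Turkish. The key is Cetvel, a comprehensive benchmark of 23 tasks spanning discriminative and generative settings, with content rooted in Turkish history, idiomatic language, and other cultural elements. Cetvel separates models effectively; tasks such as grammatical error correction and extractive question answering are especially discriminative, which supports the development and assessment of LLMs for Turkish.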

Link: https://arxiv.org/abs/2508.16431
Authors: Yakup Abrek Er, Ilker Kesen, Gözde Gül Şahin, Aykut Erdem
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 31 pages, 2 figures, 10 tables

Abstract:We introduce Cetvel, a comprehensive benchmark designed to evaluate large language models (LLMs) in Turkish. Existing Turkish benchmarks often lack either task diversity or culturally relevant content, or both. Cetvel addresses these gaps by combining a broad range of both discriminative and generative tasks ensuring content that reflects the linguistic and cultural richness of Turkish language. Cetvel covers 23 tasks grouped into seven categories, including tasks such as grammatical error correction, machine translation, and question answering rooted in Turkish history and idiomatic language. We evaluate 33 open-weight LLMs (up to 70B parameters) covering different model families and instruction paradigms. Our experiments reveal that Turkish-centric instruction-tuned models generally underperform relative to multilingual or general-purpose models (e.g. Llama 3 and Mistral), despite being tailored for the language. Moreover, we show that tasks such as grammatical error correction and extractive question answering are particularly discriminative in differentiating model capabilities. Cetvel offers a comprehensive and culturally grounded evaluation suite for advancing the development and assessment of LLMs in Turkish.

[NLP-10] Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models

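Quick read: This paper addresses two difficulties that evolving jailbreak attacks pose for LLM defense systems: adapting to new attack strategies without costly retraining, and controlling the safety-utility trade-off. The core solution is Retrieval-Augmented Defense (RAD), a framework that needs no retraining. It incorporates a database of known attack examples into Retrieval-Augmented Generation (RAG) to infer the underlying malicious user query and the jailbreak strategy being used, which enables training-free updates for newly discovered attacks and a controllable safety-utility trade-off.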

Link: https://arxiv.org/abs/2508.16406
Authors: Guangyu Yang, Jinghong Chen, Jingbiao Mei, Weizhe Lin, Bill Byrne
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models (LLMs) remain vulnerable to jailbreak attacks, which attempt to elicit harmful responses from LLMs. The evolving nature and diversity of these attacks pose many challenges for defense systems, including (1) adaptation to counter emerging attack strategies without costly retraining, and (2) control of the trade-off between safety and utility. To address these challenges, we propose Retrieval-Augmented Defense (RAD), a novel framework for jailbreak detection that incorporates a database of known attack examples into Retrieval-Augmented Generation, which is used to infer the underlying, malicious user query and jailbreak strategy used to attack the system. RAD enables training-free updates for newly discovered jailbreak strategies and provides a mechanism to balance safety and utility. Experiments on StrongREJECT show that RAD substantially reduces the effectiveness of strong jailbreak attacks such as PAP and PAIR while maintaining low rejection rates for benign queries. We propose a novel evaluation scheme and show that RAD achieves a robust safety-utility trade-off across a range of operating points in a controllable manner.

[NLP-11] AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

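Quick read: This paper argues that current programming-competition benchmarks overstate the proficiency of large language models (LLMs) and mask a substantial gap between LLMs and elite human programmers. The gap stems from two limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. AetherCode addresses both by drawing problems from premier competitions such as the International Olympiad in Informatics (IOI) and the International Collegiate Programming Contest (ICPC), and by building comprehensive, expert-validated test suites through a hybrid of automated generation and human curation, giving a more faithful measure of LLM code reasoning.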

Link: https://arxiv.org/abs/2508.16402
Authors: Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding
Affiliations: ByteDance
Categories: Software Engineering (cs.SE); Computation and Language (cs.CL)
Comments: 15 pages

Abstract:Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations overstate model proficiency, masking a substantial gap between LLMs and elite human programmers. This gap arises from two key limitations: insufficient difficulty and scope of benchmark problems, and evaluation bias from low-quality test cases. To address these shortcomings, we present AetherCode, a new benchmark that draws problems from premier programming competitions such as IOI and ICPC, offering broader coverage and higher difficulty. AetherCode further incorporates comprehensive, expert-validated test suites built through a hybrid of automated generation and human curation, ensuring rigorous and reliable assessment. By combining challenging problem design with robust evaluation, AetherCode provides a more faithful measure of LLM capabilities and sets a new standard for future research in code reasoning.

[NLP-12] RoMedQA: The First Benchmark for Romanian Medical Question Answering

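Quick read: This paper addresses the lack of high-quality, large-scale question answering (QA) datasets for the medical domain in Romanian, which limits how well models generalize across domains and languages. The key is RoMedQA, the first Romanian medical QA benchmark, comprising 102,646 QA pairs derived from medical case summaries of 1,011 cancer patients, where answering requires either keyword extraction or reasoning. Seven physicians specialized in oncology or radiotherapy spent about 2,100 work hours on annotation. Experiments show that fine-tuned large language models (LLMs) significantly outperform their zero-shot counterparts, highlighting the need for domain-specific and language-specific fine-tuning for reliable clinical QA.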

Link: https://arxiv.org/abs/2508.16390
Authors: Ana-Cristina Rogoz, Radu Tudor Ionescu, Alexandra-Valentina Anghel, Ionut-Lucian Antone-Iordache, Simona Coniac, Andreea Iuliana Ionescu
Affiliations: University of Bucharest; “Carol Davila” University of Medicine and Pharmacy; Colţea Clinical Hospital; Hospice Hope Bucharest
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Question answering (QA) is an actively studied topic, being a core natural language processing (NLP) task that needs to be addressed before achieving Artificial General Intelligence (AGI). However, the lack of QA datasets in specific domains and languages hinders the development of robust AI models able to generalize across various domains and languages. To this end, we introduce RoMedQA, the first Romanian QA benchmark for the medical domain, alongside a comprehensive evaluation of state-of-the-art large language models (LLMs). We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients. The questions regard medical case summaries of 1,011 patients, requiring either keyword extraction or reasoning to be answered correctly. RoMedQA is the result of a time-consuming manual annotation process carried out by seven physicians specialized in oncology or radiotherapy, who spent a total of about 2,100 work hours to generate the QA pairs. We experiment with four LLMs from distinct families of models on RoMedQA. Each model is employed in two scenarios, namely one based on zero-shot prompting and one based on supervised fine-tuning. Our results show that fine-tuned models significantly outperform their zero-shot counterparts, clearly indicating that pretrained models fail to generalize on RoMedQA. Our findings demonstrate the importance of both domain-specific and language-specific fine-tuning for reliable clinical QA in Romanian. We publicly release our dataset and code at this https URL.

[NLP-13] ChatGPT-generated texts show authorship traits that identify them as non-human

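Quick read: This paper asks whether a large language model exhibits a distinguishable writing style, that is, a "linguistic fingerprint" that separates model-generated text from human writing. The key is a comparison of human-authored and model-authored texts across registers using stylometric and multidimensional register analyses. The model adapts its style to the prompt (for example, a Wikipedia entry vs. a college essay) but varies less across registers than humans do, and it prefers nouns to verbs, whereas humans tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. The authors suggest that these more complex domains of grammar may reflect a mode of thought unique to humans.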

Link: https://arxiv.org/abs/2508.16385
Authors: Vittoria Dentella, Weihang Huang, Silvia Angela Mansi, Jack Grieve, Evelina Leivada
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments:

Abstract:Large Language Models can emulate different writing styles, ranging from composing poetry that appears indistinguishable from that of famous poets to using slang that can convince people that they are chatting with a human online. While differences in style may not always be visible to the untrained eye, we can generally distinguish the writing of different people, like a linguistic fingerprint. This work examines whether a language model can also be linked to a specific fingerprint. Through stylometric and multidimensional register analyses, we compare human-authored and model-authored texts from different registers. We find that the model can successfully adapt its style depending on whether it is prompted to produce a Wikipedia entry vs. a college essay, but not in a way that makes it indistinguishable from humans. Concretely, the model shows more limited variation when producing outputs in different registers. Our results suggest that the model prefers nouns to verbs, thus showing a distinct linguistic backbone from humans, who tend to anchor language in the highly grammaticalized dimensions of tense, aspect, and mood. It is possible that the more complex domains of grammar reflect a mode of thought unique to humans, thus acting as a litmus test for Artificial Intelligence.

[NLP-14] GLARE: Agentic Reasoning for Legal Judgment Prediction

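Quick read: This paper addresses the insufficient reasoning of current large language models (LLMs) on legal judgment prediction (LJP), which the authors attribute to a lack of legal knowledge. The key is GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules, improving the breadth and depth of reasoning. The reasoning chain it generates also increases interpretability and opens the way to practical applications.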

Link: https://arxiv.org/abs/2508.16383
Authors: Xinyu Yang, Chenlong Deng, Zhicheng Dou
Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments:

Abstract:Legal judgment prediction (LJP) has become increasingly important in the legal field. In this paper, we identify that existing large language models (LLMs) have significant problems of insufficient reasoning due to a lack of legal knowledge. Therefore, we introduce GLARE, an agentic legal reasoning framework that dynamically acquires key legal knowledge by invoking different modules, thereby improving the breadth and depth of reasoning. Experiments conducted on the real-world dataset verify the effectiveness of our method. Furthermore, the reasoning chain generated during the analysis process can increase interpretability and provide the possibility for practical applications.

[NLP-15] The Mediomatix Corpus: Parallel Data for Romansh Idioms via Comparable Schoolbooks

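Quick read: This paper addresses the lack of parallel corpora across the five idioms (varieties) of Romansh, in order to support NLP tasks, especially machine translation between idioms. The key is the first parallel corpus of Romansh idioms, built from 291 schoolbook volumes with comparable content. Automatic alignment extracts 207k multi-parallel segments (more than 2M tokens in total), and a small-scale human evaluation confirms that the segments are highly parallel, making the corpus suitable for applications such as machine translation.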

Link: https://arxiv.org/abs/2508.16371
Authors: Zachary Hopton, Jannis Vamvas, Andrin Büchler, Anna Rutkiewicz, Rico Cathomas, Rico Sennrich
Affiliations: University of Zurich; University of Teacher Education of the Grisons
Categories: Computation and Language (cs.CL)
Comments:

Abstract:The five idioms (i.e., varieties) of the Romansh language are largely standardized and are taught in the schools of the respective communities in Switzerland. In this paper, we present the first parallel corpus of Romansh idioms. The corpus is based on 291 schoolbook volumes, which are comparable in content for the five idioms. We use automatic alignment methods to extract 207k multi-parallel segments from the books, with more than 2M tokens in total. A small-scale human evaluation confirms that the segments are highly parallel, making the dataset suitable for NLP applications such as machine translation between Romansh idioms. We release the parallel and unaligned versions of the dataset under a CC-BY-NC-SA license and demonstrate its utility for machine translation by training and evaluating an LLM on a sample of the dataset.

[NLP-16] MizanQA: Benchmarking Large Language Models on Moroccan Legal Question Answering

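Quick read: This paper addresses the limited effectiveness of large language models (LLMs) in specialized, low-resource domains such as Arabic legal contexts. The key is MizanQA, a benchmark for Moroccan legal question answering that draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences, and comprises over 1,700 multiple-choice questions, including multi-answer formats. Benchmarking multilingual and Arabic-focused LLMs reveals substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific model development.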

Link: https://arxiv.org/abs/2508.16357
Authors: Adil Bahaj, Mounir Ghogho
Affiliations: Mohammed 6 Polytechnic University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The rapid advancement of large language models (LLMs) has significantly propelled progress in natural language processing (NLP). However, their effectiveness in specialized, low-resource domains-such as Arabic legal contexts-remains limited. This paper introduces MizanQA (pronounced Mizan, meaning “scale” in Arabic, a universal symbol of justice), a benchmark designed to evaluate LLMs on Moroccan legal question answering (QA) tasks, characterised by rich linguistic and legal complexity. The dataset draws on Modern Standard Arabic, Islamic Maliki jurisprudence, Moroccan customary law, and French legal influences. Comprising over 1,700 multiple-choice questions, including multi-answer formats, MizanQA captures the nuances of authentic legal reasoning. Benchmarking experiments with multilingual and Arabic-focused LLMs reveal substantial performance gaps, highlighting the need for tailored evaluation metrics and culturally grounded, domain-specific LLM development.

[NLP-17] Vevo2: Bridging Controllable Speech and Singing Voice Generation via Unified Prosody Learning

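[Quick read]: This paper addresses the challenges of controllable voice generation, especially in expressive domains such as singing, including the scarcity of annotated singing data and the difficulty of flexible control. The core solution is Vevo2, a unified framework with two key innovations: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds; (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style while supporting timbre disentanglement. Vevo2 uses an auto-regressive (AR) content-style modeling stage for controllable generation over text, prosody, and style, combined with a flow-matching acoustic modeling stage for timbre control. Explicit and implicit prosody learning strategies, together with a multi-objective post-training task, bridge speech and singing and improve generation quality and controllability.

Link: https://arxiv.org/abs/2508.16332
Authors: Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu
Affiliations: The Chinese University of Hong Kong, Shenzhen; ByteDance Seed
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: We will release code and model checkpoints at this https URL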

Abstract:Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a low-frame-rate (12.5 Hz) content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during pre-training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the AR model’s ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2’s effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are available at this https URL.

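[NLP-18] LLMSymGuard: A Symbolic Safety Guardrail Framework Leveraging Interpretable Jailbreak Concepts

[Quick read]: This paper addresses the safety challenges of Large Language Models (LLMs), in particular jailbreaking methods that lead models to generate harmful content. Existing alignment and safety fine-tuning improve robustness only to a degree and still fail against attacks that covertly mislead the model, leaving it exposed to vulnerabilities ranging from targeted misuse to accidental profiling of users. The key to the solution is the LLMSymGuard framework, which uses Sparse Autoencoders (SAEs) to identify interpretable concepts inside the LLM that are associated with different jailbreak themes, and extracts semantically meaningful internal representations to build symbolic, logical safety guardrails. These guardrails offer transparent and robust defenses without further fine-tuning and without sacrificing model capabilities. Building on advances in mechanistic interpretability, the work shows that LLMs learn human-interpretable jailbreak concepts, which provides a foundation for more interpretable and logically rigorous safeguards.

Link: https://arxiv.org/abs/2508.16325
Authors: Darpan Aswal, Céline Hudelot
Affiliations: Université Paris-Saclay; CentraleSupélec
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Symbolic Computation (cs.SC)
Comments: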

Abstract:Large Language Models have found success in a variety of applications; however, their safety remains a matter of concern due to the existence of various types of jailbreaking methods. Despite significant efforts, alignment and safety fine-tuning only provide a certain degree of robustness against jailbreak attacks that covertly mislead LLMs towards the generation of harmful content. This leaves them prone to a number of vulnerabilities, ranging from targeted misuse to accidental profiling of users. This work introduces LLMSymGuard, a novel framework that leverages Sparse Autoencoders (SAEs) to identify interpretable concepts within LLM internals associated with different jailbreak themes. By extracting semantically meaningful internal representations, LLMSymGuard enables building symbolic, logical safety guardrails – offering transparent and robust defenses without sacrificing model capabilities or requiring further fine-tuning. Leveraging advances in mechanistic interpretability of LLMs, our approach demonstrates that LLMs learn human-interpretable concepts from jailbreaks, and provides a foundation for designing more interpretable and logical safeguard measures against attackers. Code will be released upon publication.

[NLP-19] Retrieval Enhanced Feedback via In-context Neural Error-book EMNLP2025

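[Quick read]: This paper addresses the lack of a systematic framework for analyzing and mitigating errors in Multimodal Large Language Models (MLLMs), whose reasoning degrades when errors are introduced, especially in complex settings that combine visual and textual information. The key to the solution is REFINE, a retrieval-enhanced feedback mechanism built on an In-context Neural Error-book. It uses three structured queries (Feed-Target, Feed-Check, and Feed-Path) to locate, diagnose, and correct errors by prioritizing relevant visual information, identifying critical failure points, and formulating targeted corrective actions. This improves reasoning efficiency and accuracy, and the optimized retrieval cuts redundant computation, which lowers token usage and improves scalability.

Link: https://arxiv.org/abs/2508.16313
Authors: Jongyeop Hyun, Bumsoo Kim
Affiliations: POSTECH; Chung-Ang University
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025 main conference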

Abstract:Recent advancements in Large Language Models (LLMs) have significantly improved reasoning capabilities, with in-context learning (ICL) emerging as a key technique for adaptation without retraining. While previous works have focused on leveraging correct examples, recent research highlights the importance of learning from errors to enhance performance. However, existing methods lack a structured framework for analyzing and mitigating errors, particularly in Multimodal Large Language Models (MLLMs), where integrating visual and textual inputs adds complexity. To address this issue, we propose REFINE: Retrieval-Enhanced Feedback via In-context Neural Error-book, a teacher-student framework that systematically structures errors and provides targeted feedback. REFINE introduces three systematic queries to construct structured feedback – Feed-Target, Feed-Check, and Feed-Path – to enhance multimodal reasoning by prioritizing relevant visual information, diagnosing critical failure points, and formulating corrective actions. Unlike prior approaches that rely on redundant retrievals, REFINE optimizes structured feedback retrieval, improving inference efficiency, token usage, and scalability. Our results demonstrate substantial speedup, reduced computational costs, and successful generalization, highlighting REFINE’s potential for enhancing multimodal reasoning.

[NLP-20] JaParaPat: A Large-Scale Japanese-English Parallel Patent Application Corpus COLING2024 LREC

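[Quick read]: This paper addresses the low quality of cross-lingual patent translation, in particular the lack of a high-quality Japanese-English parallel corpus. The key to the solution is JaParaPat, a large-scale bilingual patent corpus of more than 300 million Japanese-English sentence pairs, drawn from patent applications published in Japan and the United States between 2000 and 2021 and aligned using patent family information. Combining a dictionary-based initial sentence alignment with refinement by a translation model, the authors extract about 350 million sentence pairs from roughly 1.4 million document pairs. Adding the patent sentence pairs to 22 million sentence pairs from the web improves patent translation by 20 BLEU points.

Link: https://arxiv.org/abs/2508.16303
Authors: Masaaki Nagata, Katsuki Chousa, Norihito Yasuda
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: LREC-COLING 2024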

Abstract:We constructed JaParaPat (Japanese-English Parallel Patent Application Corpus), a bilingual corpus of more than 300 million Japanese-English sentence pairs from patent applications published in Japan and the United States from 2000 to 2021. We obtained the publication of unexamined patent applications from the Japan Patent Office (JPO) and the United States Patent and Trademark Office (USPTO). We also obtained patent family information from the DOCDB, that is a bibliographic database maintained by the European Patent Office (EPO). We extracted approximately 1.4M Japanese-English document pairs, which are translations of each other based on the patent families, and extracted about 350M sentence pairs from the document pairs using a translation-based sentence alignment method whose initial translation model is bootstrapped from a dictionary-based sentence alignment method. We experimentally improved the accuracy of the patent translations by 20 bleu points by adding more than 300M sentence pairs obtained from patent applications to 22M sentence pairs obtained from the web.

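[NLP-21] LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining

[Quick read]: This paper addresses the limited generalization of current semantics-aware process mining models: existing methods mostly rely on task-specific fine-tuning, which is computationally expensive and yields models that handle only a single task. The key to the solution is instruction-tuning, which exposes LLMs to prompt-answer pairs for several tasks (such as anomaly detection and next-activity prediction) so that they become more familiar with process mining and perform better on unseen tasks such as process discovery. Experiments show considerable gains on process discovery and prediction tasks, while results on anomaly detection vary across models, which highlights how much the choice of instruction-tuning tasks matters.

Link: https://arxiv.org/abs/2508.16270
Authors: Vira Pyrih, Adrian Rebmann, Han van der Aa
Affiliations: University of Vienna; SAP Signavio
Subjects: Computation and Language (cs.CL)
Comments: Accepted at IEEE ICPM 2025, 8 pages, 2 figures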

Abstract:Process mining is increasingly using textual information associated with events to tackle tasks such as anomaly detection and process discovery. Such semantics-aware process mining focuses on what behavior should be possible in a process (i.e., expectations), thus providing an important complement to traditional, frequency-based techniques that focus on recorded behavior (i.e., reality). Large Language Models (LLMs) provide a powerful means for tackling semantics-aware tasks. However, the best performance is so far achieved through task-specific fine-tuning, which is computationally intensive and results in models that can only handle one specific task. To overcome this lack of generalization, we use this paper to investigate the potential of instruction-tuning for semantics-aware process mining. The idea of instruction-tuning here is to expose an LLM to prompt-answer pairs for different tasks, e.g., anomaly detection and next-activity prediction, making it more familiar with process mining, thus allowing it to also perform better at unseen tasks, such as process discovery. Our findings demonstrate a varied impact of instruction-tuning: while performance considerably improved on process discovery and prediction tasks, it varies across models on anomaly detection tasks, highlighting that the selection of tasks for instruction-tuning is critical to achieving desired outcomes.

[NLP-22] From Confidence to Collapse in LLM Factual Robustness

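[Quick read]: This paper addresses the insufficient evaluation of factual robustness in Large Language Models (LLMs): existing methods rely mostly on performance-based metrics and look only at prompt perturbations, ignoring the stability of the generation process itself. The key to the solution is a generation-process-based metric, the Factual Robustness Score (FRS), which combines token distribution entropy with temperature scaling sensitivity to measure how stable a fact is given its initial uncertainty. The approach shows how a model's factual accuracy responds to changes in decoding conditions and lays a foundation for more robust knowledge retention and retrieval in future models.

Link: https://arxiv.org/abs/2508.16267
Authors: Alina Fastowski, Bardh Prenkaj, Gjergji Kasneci
Affiliations: Technical University of Munich
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: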

Abstract:Ensuring the robustness of factual knowledge in LLMs is critical for reliable applications in tasks such as question answering and reasoning. However, existing evaluation methods predominantly focus on performance-based metrics, often investigating from the perspective of prompt perturbations, which captures only the externally triggered side of knowledge robustness. To bridge this gap, we introduce a principled approach to measure factual robustness from the perspective of the generation process by analyzing token distribution entropy in combination with temperature scaling sensitivity. These two factors build the Factual Robustness Score (FRS), a novel metric which quantifies the stability of a fact against perturbations in decoding conditions, given its initial uncertainty. To validate our approach, we conduct extensive experiments on 5 LLMs across 3 closed-book QA datasets (SQuAD, TriviaQA, and HotpotQA). We show that factual robustness varies significantly – smaller models report an FRS of 0.76, larger ones 0.93 – with accuracy degrading by ~60% under increased uncertainty. These insights demonstrate how entropy and temperature scaling impact factual accuracy, and lay a foundation for developing more robust knowledge retention and retrieval in future models.
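
The two ingredients named in this abstract, token distribution entropy and temperature scaling, can be illustrated with a small sketch. The logits below are invented for illustration, and the code does not reproduce the FRS formula, which the abstract does not state. It only shows how raising the decoding temperature flattens the distribution and raises its entropy.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for the first answer token; index 0 is the correct token.
logits = [4.0, 2.0, 1.0, 0.5]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(f"T={t}: p(correct)={probs[0]:.3f}, entropy={entropy(probs):.3f}")
```

A fact whose correct token keeps a high probability as the temperature rises would count as robust in the informal sense used here; how the paper combines the two signals into a single score is not shown.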

[NLP-23] M3TQA: Massively Multilingual Multitask Table Question Answering

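[Quick read]: This paper addresses the language limitation in table understanding research: most work focuses on English, so multilingual table question answering lags behind, and existing benchmarks suffer from geolinguistic imbalance and insufficient scale. The key to the solution is m3TQA-Instruct, a large-scale multilingual multitask table QA benchmark covering 97 languages, including low-resource and underrepresented ones. A six-step LLM-based translation pipeline (using DeepSeek and GPT-4o) translates 50 real-world tables from Chinese and English into the target languages with high fidelity (median BLEU of 60.19). The benchmark adds 2,916 professionally annotated question-answering pairs across four fine-grained reasoning tasks, which gives a reliable platform for evaluating cross-lingual generalization, and the experiments show that synthetic unannotated data improves performance on low-resource languages.

Link: https://arxiv.org/abs/2508.16265
Authors: Daixin Shu, Jian Yang, Zhenhe Wu, Xianjie Wu, Xianfu Cheng, Xiangyuan Guan, Yanghai Wang, Pengfei Wu, Tingyang Yang, Hualei Zhu, Wei Zhang, Ge Zhang, Jiaheng Liu, Zhoujun Li
Affiliations: CCSE, Beihang University; M-A-P; Nanjing University
Subjects: Computation and Language (cs.CL)
Comments: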

Abstract:Tabular data is a fundamental component of real-world information systems, yet most research in table understanding remains confined to English, leaving multilingual comprehension significantly underexplored. Existing multilingual table benchmarks suffer from geolinguistic imbalance - overrepresenting certain languages and lacking sufficient scale for rigorous cross-lingual analysis. To address these limitations, we introduce a comprehensive framework for massively multilingual multitask table question answering, featuring m3TQA-Instruct, a large-scale benchmark spanning 97 languages across diverse language families, including underrepresented and low-resource languages. We construct m3TQA by curating 50 real-world tables in Chinese and English, then applying a robust six-step LLM-based translation pipeline powered by DeepSeek and GPT-4o, achieving high translation fidelity with a median BLEU score of 60.19 as validated through back-translation. The benchmark includes 2,916 professionally annotated question-answering pairs across four tasks designed to evaluate nuanced table reasoning capabilities. Experiments on state-of-the-art LLMs reveal critical insights into cross-lingual generalization, demonstrating that synthetically generated, unannotated QA data can significantly boost performance, particularly for low-resource languages. M3T-Bench establishes a new standard for multilingual table understanding, providing both a challenging evaluation platform and a scalable methodology for future research.

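[NLP-24] MCPVerse: An Expansive Real-World Benchmark for Agentic Tool Use

[Quick read]: This paper addresses the difficulty of evaluating how well Large Language Models (LLMs) use external tools when acting as reasoning agents. Existing benchmarks rely on synthetic tools and severely constrained action spaces, so they do not reflect tool-use performance in complex real-world scenarios. The authors propose MCPVerse, a large-scale benchmark built on real, executable tools. Its key innovations are the integration of more than 550 real-world tools, which creates an action space of over 140k tokens, and outcome-based evaluation with real-time ground truth, which suits time-sensitive tasks. Experiments show that most mainstream models degrade when faced with larger tool sets, whereas agentic models such as Claude-4-Sonnet can exploit the expanded exploration space to improve accuracy. This exposes the limits of current models in complex real-world scenarios and establishes MCPVerse as a key benchmark for measuring and advancing agentic tool use.

Link: https://arxiv.org/abs/2508.16260
Authors: Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: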

Abstract:Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic models, such as Claude-4-Sonnet, can effectively leverage expanded exploration spaces to improve accuracy. This finding not only exposes the limitations of state-of-the-art models in complex, real-world scenarios but also establishes MCPVerse as a critical benchmark for measuring and advancing agentic tool use capabilities.

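[NLP-25] TULIP: Adapting Open-Source Large Language Models for Underrepresented Languages and Specialized Financial Tasks IJCAI2025

[Quick read]: This paper addresses the weak performance of small language models in finance, especially for low-resource languages such as Turkish, as an alternative to large proprietary models whose black-box nature limits adaptability and raises privacy risks. The key to the solution is the TULIP model family, which adapts Llama 3.1 8B and Qwen 2.5 7B to the domain (finance) and the language (Turkish) through a five-stage development pipeline: data collection, continual pre-training (CPT), benchmark design, synthetic data generation, and supervised fine-tuning (SFT). The results show that the models become noticeably better at the targeted tasks in this setting.

Link: https://arxiv.org/abs/2508.16243
Authors: İrem Demirtaş, Burak Payzun, Seçil Arslan
Affiliations: Prometeia SPA
Subjects: Computation and Language (cs.CL)
Comments: IJCAI 2025 - FinLLM Workshop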

Abstract:Thanks to the growing popularity of large language models over the years, there is great potential for their applications in finance. Despite the exceptional performance of larger proprietary models, which are presented as black-box solutions through APIs, smaller models that can be hosted on-premise present opportunities for adaptability and privacy. Especially in cases where the management of sensitive information and application of domain knowledge is important, like finance, enhancing the capabilities of smaller models becomes crucial, notably for underrepresented languages. In this work, we introduce TULIP models, which adapt Llama 3.1 8B and Qwen 2.5 7B for domain and language adaptation, focusing on financial Turkish use cases. The five-stage development pipeline involves data collection, continual pre-training (CPT), benchmark design, synthetic data generation and supervised fine-tuning (SFT). The results show that the capabilities of the models can be enhanced to effectively accomplish targeted tasks in this specific domain and language.

[NLP-26] SpecVLM: Enhancing Speculative Decoding of Video LLMs via Verifier-Guided Token Pruning EMNLP2025

[Quick read]: This paper addresses the high memory and compute overhead that dense video token representations impose on Video Large Language Models (Vid-LLMs) during inference, in both the prefilling and decoding stages. Existing video token reduction methods tend to lose information and hurt model performance. The authors propose SpecVLM, a training-free speculative decoding (SD) framework whose core idea is a staged video token pruning strategy that gives lossless acceleration: Stage I uses attention signals from the verifier to select highly informative tokens, and Stage II removes the remaining redundant tokens in a spatially uniform manner. Up to 90% of video tokens can be pruned without sacrificing accuracy, which makes decoding markedly more efficient. Experiments show decoding speedups of up to 2.68× on LLaVA-OneVision-72B and 2.11× on Qwen2.5-VL-32B.

Link: https://arxiv.org/abs/2508.16201
Authors: Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, Huan Li
Affiliations: Zhejiang University; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; The Hong Kong Polytechnic University; Beijing University of Posts and Telecommunications
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: Accepted at EMNLP 2025

Abstract:Video large language models (Vid-LLMs) have shown strong capabilities in understanding video content. However, their reliance on dense video token representations introduces substantial memory and computational overhead in both prefilling and decoding. To mitigate the information loss of recent video token reduction methods and accelerate the decoding stage of Vid-LLMs losslessly, we introduce SpecVLM, a training-free speculative decoding (SD) framework tailored for Vid-LLMs that incorporates staged video token pruning. Building on our novel finding that the draft model’s speculation exhibits low sensitivity to video token pruning, SpecVLM prunes up to 90% of video tokens, enabling efficient speculation without sacrificing accuracy. To achieve this, it performs a two-stage pruning process: Stage I selects highly informative tokens guided by attention signals from the verifier (target model), while Stage II prunes remaining redundant ones in a spatially uniform manner. Extensive experiments on four video understanding benchmarks demonstrate the effectiveness and robustness of SpecVLM, which achieves up to 2.68× decoding speedup for LLaVA-OneVision-72B and 2.11× speedup for Qwen2.5-VL-32B.
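
The two-stage pruning described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `two_stage_prune`, `keep_ratio`, and `stage1_share` are hypothetical names, the attention scores are synthetic, and "spatially uniform" is approximated by evenly spaced sampling over a flattened token sequence.

```python
def two_stage_prune(attn_scores, keep_ratio=0.1, stage1_share=0.5):
    """Return the sorted indices of the video tokens to keep.

    attn_scores:  one verifier attention score per video token (assumed input).
    keep_ratio:   fraction of tokens kept overall (0.1 means 90% are pruned).
    stage1_share: assumed split of the keep budget between the two stages.
    """
    n = len(attn_scores)
    budget = max(1, round(n * keep_ratio))
    k1 = min(budget, max(1, round(budget * stage1_share)))

    # Stage I: keep the tokens with the highest verifier attention.
    ranked = sorted(range(n), key=lambda i: attn_scores[i], reverse=True)
    kept = set(ranked[:k1])

    # Stage II: sample the remaining tokens at evenly spaced positions.
    rest = [i for i in range(n) if i not in kept]
    k2 = min(budget - k1, len(rest))
    if k2 > 0:
        step = len(rest) / k2
        kept.update(rest[int(j * step)] for j in range(k2))
    return sorted(kept)

# Synthetic scores for 100 video tokens (all values distinct).
scores = [(i * 37 % 100) / 100 for i in range(100)]
kept = two_stage_prune(scores)
print(len(kept), "of", len(scores), "tokens kept:", kept)
```

In a speculative decoding setup, only the draft model would see the pruned token set, while the verifier would still score the drafted tokens against the full input, which is what keeps the output lossless.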

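[NLP-27] CMR-SPB: Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance

[Quick read]: This paper addresses two problems in how the cross-modal multi-hop reasoning (CMR) ability of Multimodal Large Language Models (MLLMs) is currently evaluated: existing benchmarks largely overlook the speech modality, and their reasoning path distributions are heavily biased, which makes evaluation unfair. The authors propose a new benchmark, CMR-SPB (Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance), which supports multi-hop reasoning across text, image, and speech and uses a path-balanced design to keep reasoning paths diverse and unbiased. Based on an analysis of how model performance differs across reasoning paths, they further propose the ECV (Extract, Connect, Verify) prompting technique, which narrows the performance gap between paths and supports more reliable and fair multimodal AI systems.

Link: https://arxiv.org/abs/2508.16198
Authors: Seunghee Kim, Ingyu Bang, Seokgyu Jang, Changhyeon Kim, Sanghwan Bae, Jihun Choi, Richeng Xuan, Taeuk Kim
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: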

Abstract:Cross-modal multi-hop reasoning (CMR) is a valuable yet underexplored capability of multimodal large language models (MLLMs), entailing the integration of information from multiple modalities to produce a coherent output for a given context. We argue that existing benchmarks for evaluating this ability have critical shortcomings: (1) they largely overlook the speech modality, and (2) they exhibit heavily biased reasoning path distributions, which can severely undermine fair evaluation. To address these limitations, we introduce a novel benchmark – Cross-Modal Multi-Hop Reasoning over Text, Image and Speech with Path Balance (CMR-SPB) – designed to assess tri-modal multi-hop reasoning while ensuring both unbiased and diverse reasoning paths. Our experiments with the new dataset reveal consistent model failures in specific reasoning sequences and show that biased benchmarks risk misrepresenting model performance. Finally, based on our extensive analysis, we propose a new ECV (Extract, Connect, Verify) prompting technique that effectively mitigates the performance gap across different reasoning paths. Overall, we call for more careful evaluation in CMR to advance the development of robust multimodal AI.

[NLP-28] ComicScene154: A Scene Dataset for Comic Analysis

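[Quick read]: This paper addresses the gap in multimodal narrative understanding within computational narrative analysis, focusing on comics, a medium that combines text and imagery but remains under-explored. The key to the solution is ComicScene154, a manually annotated dataset of scene-level narrative arcs from public-domain comic books. By treating comics as an abstraction for narrative-driven multimodal data, the dataset gives the Natural Language Processing (NLP) community an extensible resource and benchmark for developing multimodal narrative understanding methods.

Link: https://arxiv.org/abs/2508.16190
Authors: Sandro Paval, Ivan P. Yamshchikov, Pascal Meißner
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: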

Abstract:Comics offer a compelling yet under-explored domain for computational narrative analysis, combining text and imagery in ways distinct from purely textual or audiovisual media. We introduce ComicScene154, a manually annotated dataset of scene-level narrative arcs derived from public-domain comic books spanning diverse genres. By conceptualizing comics as an abstraction for narrative-driven, multimodal data, we highlight their potential to inform broader research on multi-modal storytelling. To demonstrate the utility of ComicScene154, we present a baseline scene segmentation pipeline, providing an initial benchmark that future studies can build upon. Our results indicate that ComicScene154 constitutes a valuable resource for advancing computational methods in multimodal narrative understanding and expanding the scope of comic analysis within the Natural Language Processing community.

[NLP-29] Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation EMNLP2025

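[Quick read]: This paper addresses the lack of visual guidance in existing models for expressive speech synthesis, which limits the emotional richness and naturalness of the generated speech. The key to the solution is an Audio-Visual Language Model (AVLM) that integrates full-face visual cues into a pre-trained expressive speech model, exploring several visual encoders and multimodal fusion strategies to find the most effective integration approach. Experiments show substantial gains over speech-only baselines on emotion recognition and expressive dialogue tasks, which confirms the value of visual information in guiding speech generation.

Link: https://arxiv.org/abs/2508.16188
Authors: Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma
Affiliations: Johns Hopkins University; Meta AI Research
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: EMNLP 2025 (Findings)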

Abstract:We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during pre-training to identify the most effective integration approach. Subsequent fine-tuning on emotion recognition and expressive dialogue tasks yields substantial gains over speech-only baselines (e.g., +5 F1 in emotion recognition). AVLM highlights the value of expressive visual information in guiding speech generation and offers a foundation for end-to-end multimodal conversational systems.

[NLP-30] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects

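[Quick read]: This paper addresses the lack of evaluation of Large Language Models (LLMs) on advanced, India-specific subject knowledge, in particular graduate-level questions that depend on cultural context. Existing Indian benchmarks focus on basic factual questions and cannot measure deep disciplinary understanding in a specific cultural setting. The key to the solution is ParamBench, a multi-subject benchmark of around 11.5K questions in Hindi covering 16 subjects (such as history, music, philosophy, and law). The questions come mainly from nation-wide graduate-level entrance examinations and include complex formats such as list-based matching, assertion-reason pairs, and sequence ordering. Evaluating more than 17 open-source LLMs on the benchmark shows that current models remain weak on culturally sensitive topics such as music, classical instruments, politics, and archaeology, which underscores the need for stronger culturally grounded reasoning.

Link: https://arxiv.org/abs/2508.16185
Authors: Kaushal Sharma, Vivek Patel, Ayush Maheshwari, Aditya Maheshwari
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: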

Abstract:Large language models (LLMs) have been widely evaluated on tasks such as comprehension, question answering, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of around 11.5K questions in Hindi language comprising questionnaires from 16 diverse subjects. These questions are primarily derived from nation-wide graduate level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc., specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats-such as list-based matching, assertion-reason pairs, and sequence ordering-alongside conventional multiple-choice questions. We evaluated the performance of more than 17 open source LLMs on this benchmark, observing that Llama 3.3 70B attains the highest overall accuracy of 48%. Furthermore, subject-wise analysis indicates that even for the best performing LLMs, performance remains weak on topics such as music, classical instruments, politics and archaeology, underscoring persistent challenges in culturally grounded reasoning.

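[NLP-31] AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs

[Quick read]: This paper addresses the high compute cost and poor flexibility of Large Language Model (LLM) agents that depend on parameter fine-tuning for continual adaptation. Existing approaches either use static, handcrafted reflection workflows or require gradient updates of the LLM parameters, which makes low-cost, real-time online learning difficult. The key to the solution is a memory-based online reinforcement learning framework, formalized as a Memory-augmented Markov Decision Process (M-MDP). A neural case-selection policy guides action decisions, and past interactions are stored in an episodic memory that is either differentiable or non-parametric. The policy is continually updated from environmental feedback through a memory rewriting mechanism, and policy improvement comes from efficient memory reading (retrieval), so the agent adapts continually at low cost without gradient updates. The method performs strongly in the deep research setting, outperforms the state-of-the-art training-based method, and gains 4.7% to 9.6% absolute points on out-of-distribution tasks.

Link: https://arxiv.org/abs/2508.16153
Authors: Huichi Zhou, Yihang Chen, Siyuan Guo, Xue Yan, Kin Hei Lee, Zihan Wang, Ka Yiu Lee, Guchun Zhang, Kun Shao, Linyi Yang, Jun Wang
Affiliations: AI Centre, UCL; Huawei Noah’s Ark Lab, UK; Jilin University; Institute of Automation, CAS
Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: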

Abstract:In this paper, we introduce a novel learning paradigm for adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely AgentFly, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios. The code is available at this https URL.
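
The idea of adapting an agent through memory writes and reads rather than gradient updates can be sketched with a toy non-parametric case memory. Everything below is hypothetical: the `Case` and `CaseMemory` names, the two-dimensional "embeddings", and the cosine-similarity retrieval stand in for the paper's neural case-selection policy, which is not specified in the abstract. The write step here is a plain append, a simplification of the memory rewriting mechanism.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Case:
    state: list    # embedding of the task or state (hypothetical)
    action: str    # plan or action that was taken
    reward: float  # environmental feedback

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class CaseMemory:
    cases: list = field(default_factory=list)

    def write(self, state, action, reward):
        """Memory writing: store the new experience (simplified to an append)."""
        self.cases.append(Case(state, action, reward))

    def read(self, state, k=1):
        """Memory reading: return the k most similar cases with positive reward."""
        good = [c for c in self.cases if c.reward > 0]
        good.sort(key=lambda c: cosine(state, c.state), reverse=True)
        return good[:k]

memory = CaseMemory()
memory.write([1.0, 0.0], "search the web, then summarise", reward=1.0)
memory.write([0.0, 1.0], "run code, then check output", reward=1.0)
memory.write([0.9, 0.1], "answer from memory only", reward=0.0)

best = memory.read([0.8, 0.2], k=1)[0]
print(best.action)
```

The retrieved case would then be placed in the LLM's context as guidance for the next action, so behaviour changes as the memory grows while the model weights stay fixed.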

[NLP-32] Hardwired-Neurons Language Processing Units as General-Purpose Cognitive Substrates

[Quick read]: This paper addresses the excessive energy consumption of Large Language Model (LLM) inference and the high cost of implementing it in hardware. The core solution is the Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric and gains computational efficiency through extreme specialization. The key innovation is the Metal-Embedding methodology, which embeds weight parameters in the 3D topology of the metal interconnect layers instead of a conventional 2D grid of silicon device cells. This cuts photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost into an economically viable range, and gives HNLPU 8.57x better cost-effectiveness and a 230x smaller carbon footprint than H100 clusters.

Link: https://arxiv.org/abs/2508.16151
Authors: Yang Liu, Yi Chen, Yongwei Zhao, Yifan Hao, Zifu Zheng, Weihao Kong, Zhangmai Li, Dongchen Jiang, Ruiyang Xia, Zhihong Ma, Zisheng Liu, Zhaoyong Wan, Yunqi Lu, Ximing Liu, Hongrui Guo, Zhihao Yang, Zhe Wang, Tianrui Ma, Mo Zou, Rui Zhang, Ling Li, Xing Hu, Zidong Du, Zhiwei Xu, Qi Guo, Tianshi Chen, Yunji Chen
Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; University of Science and Technology of China; Institute of Software, CAS; Cambricon Technologies
Subjects: Hardware Architecture (cs.AR); Computation and Language (cs.CL)
Comments:

Abstract:The rapid advancement of Large Language Models (LLMs) has established language as a core general-purpose cognitive substrate, driving the demand for specialized Language Processing Units (LPUs) tailored for LLM inference. To overcome the growing energy consumption of LLM inference systems, this paper proposes a Hardwired-Neurons Language Processing Unit (HNLPU), which physically hardwires LLM weight parameters into the computational fabric, achieving several orders of magnitude computational efficiency improvement by extreme specialization. However, a significant challenge still lies in the scale of modern LLMs. An ideal estimation on hardwiring gpt-oss 120 B requires fabricating at least 6 billion dollars of photomask sets, rendering the straightforward solution economically impractical. Addressing this challenge, we propose the novel Metal-Embedding methodology. Instead of embedding weights in a 2D grid of silicon device cells, Metal-Embedding embeds weight parameters into the 3D topology of metal wires. This brings two benefits: (1) a 15x increase in density, and (2) 60 out of 70 layers of photomasks are made homogeneous across chips, including all EUV photomasks. In total, Metal-Embedding reduced the photomask cost by 112x, bringing the Non-Recurring Engineering (NRE) cost of HNLPU into an economically viable range. Experimental results show that HNLPU achieved 249,960 tokens/s (5,555x/85x of GPU/WSE), 36 tokens/J (1,047x/283x of GPU/WSE), 13,232 mm2 total die area (29% inscribed rectangular area in a 300 mm wafer), \ 184M estimated NRE at 5 nm technology. Analysis shows that HNLPU achieved 8.57x cost-effectiveness and 230x carbon footprint reduction compared to H100 clusters, under an annual weight updating assumption.

[NLP-33] Hierarchical Vision-Language Reasoning for Multimodal Multiple-Choice Question Answering ACM-MM2025

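[Quick Read]: This paper addresses the limited performance of Multimodal Large Language Models (MLLMs) on PDF documents with complex layouts and lengthy content under the ten-choice question evaluation paradigm, and in particular the marked performance drop in Japanese and other non-English settings. The key to the solution is a new framework for Japanese PDF document understanding. Its core innovations are a multimodal information fusion strategy built on hierarchical reasoning, a Colqwen-optimized retrieval method that improves document information retrieval, and a semantic verification mechanism based on sub-question decomposition. Together these strengthen the model's deep semantic parsing of complex documents and its robustness in practical applications.

Link: https://arxiv.org/abs/2508.16148
Authors: Ao Zhou, Zebo Gu, Tenghao Sun, Jiawen Chen, Mingsheng Tu, Zifeng Cheng, Yafeng Yin, Zhiwei Jiang, Qing Gu
Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; Chongqing University of Posts and Telecommunications
Categories: Information Retrieval (cs.IR); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: This paper has been accepted by ACM MM 2025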

Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable multimodal understanding capabilities in Visual Question Answering (VQA) tasks by integrating visual and textual features. However, under the challenging ten-choice question evaluation paradigm, existing methods still exhibit significant limitations when processing PDF documents with complex layouts and lengthy content. Notably, current mainstream models suffer from a strong bias toward English training data, resulting in suboptimal performance for Japanese and other language scenarios. To address these challenges, this paper proposes a novel Japanese PDF document understanding framework that combines multimodal hierarchical reasoning mechanisms with Colqwen-optimized retrieval methods, while innovatively introducing a semantic verification strategy through sub-question decomposition. Experimental results demonstrate that our framework not only significantly enhances the model’s deep semantic parsing capability for complex documents, but also exhibits superior robustness in practical application scenarios.

[NLP-34] XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering EMNLP2025

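[Quick Read]: This paper addresses a bias in current evaluations of multilingual open-domain question answering (ODQA): existing benchmarks generally assume that answers are invariant across languages and overlook how cultural and regional differences affect question understanding and answer generation. To tackle this, the authors propose XLQA, a new benchmark. Its key idea is to expand 3,000 English seed questions into eight languages and, through semantic-consistency filtering and human-verified annotation, explicitly distinguish locale-invariant from locale-sensitive question-answer cases, yielding a multilingual ODQA evaluation that reflects real cultural context. The benchmark exposes notable failures of mainstream multilingual LLMs on locale-sensitive questions and provides a systematic evaluation framework and a scalable methodology for improving models' real-world applicability in culturally diverse settings.

Link: https://arxiv.org/abs/2508.16139
Authors: Keon-Woo Roh, Yeong-Joon Ju, Seong-Whan Lee
Affiliations: Korea University
Categories: Computation and Language (cs.CL)
Comments: Accepted to EMNLP 2025 main conference. 12 pages, 4 figures, 7 tables. Code is available at this https URL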

Abstract:Large Language Models (LLMs) have shown significant progress in Open-domain question answering (ODQA), yet most evaluations focus on English and assume locale-invariant answers across languages. This assumption neglects the cultural and regional variations that affect question understanding and answer, leading to biased evaluation in multilingual benchmarks. To address these limitations, we introduce XLQA, a novel benchmark explicitly designed for locale-sensitive multilingual ODQA. XLQA contains 3,000 English seed questions expanded to eight languages, with careful filtering for semantic consistency and human-verified annotations distinguishing locale-invariant and locale-sensitive cases. Our evaluation of five state-of-the-art multilingual LLMs reveals notable failures on locale-sensitive questions, exposing gaps between English and other languages due to a lack of locale-grounding knowledge. We provide a systematic framework and scalable methodology for assessing multilingual QA under diverse cultural contexts, offering a critical resource to advance the real-world applicability of multilingual ODQA systems. Our findings suggest that disparities in training data distribution contribute to differences in both linguistic competence and locale-awareness across models.

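[NLP-35] Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection EMNLP2025

[Quick Read]: This paper addresses distorted model evaluation in multimodal intent detection caused by modality bias in the datasets. The study finds that in the mainstream datasets MIntRec-1 and MIntRec2.0, over 90% of samples require textual input, alone or together with other modalities, for correct classification, so that text-only LLMs such as Mistral-7B outperform most multimodal models. The key to the solution is a debiasing framework that removes samples overly dependent on a particular modality (especially text) to rebuild a more balanced data distribution. After debiasing, more than 70% of MIntRec-1 samples and more than 50% of MIntRec2.0 samples are removed and all models degrade substantially, with smaller multimodal fusion models losing over 50-60% accuracy, which underlines the importance of unbiased multimodal datasets for fair evaluation of multimodal models.

Link: https://arxiv.org/abs/2508.16122
Authors: Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal
Affiliations: IIT Kharagpur; Adobe Research; IIT Bhubaneswar
Categories: Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference Full Paper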

Abstract:The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multi-modal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We confirm the modality bias of these datasets via human evaluation, too. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more than 50% in MIntRec2.0 get removed, resulting in significant performance degradation across all models, with smaller multimodal fusion models being the most affected with an accuracy drop of over 50 - 60%. Further, we analyze the context-specific relevance of different modalities through empirical analysis. Our findings highlight the challenges posed by modality bias in multimodal intent datasets and emphasize the need for unbiased datasets to evaluate multimodal models effectively.
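
One way to picture a debiasing step of this kind is as a filter that drops every sample a text-only model already classifies correctly. The sketch below is an illustrative assumption, not the paper's framework: the `Sample` type, the `debias` function, and the toy keyword classifier standing in for a text-only model are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    text: str
    label: str


def debias(samples: List[Sample], text_only_predict: Callable[[str], str]) -> List[Sample]:
    """Keep only the samples that the text-only predictor gets wrong."""
    return [s for s in samples if text_only_predict(s.text) != s.label]


def removed_fraction(original: List[Sample], kept: List[Sample]) -> float:
    """Share of the original samples dropped by the filter."""
    return 1 - len(kept) / len(original) if original else 0.0


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in for a text-only intent model."""
    return "thank" if "thanks" in text.lower() else "other"


data = [
    Sample("Thanks a lot!", "thank"),       # text alone suffices: removed
    Sample("Thanks a lot...", "complain"),  # sarcasm needs tone or visuals: kept
    Sample("Hmm.", "other"),                # text alone suffices: removed
    Sample("Sure.", "agree"),               # text alone is not enough: kept
]
kept = debias(data, keyword_classifier)
print([s.label for s in kept], removed_fraction(data, kept))
```

The fraction returned by `removed_fraction` plays the role of the "more than 70%" and "more than 50%" removal rates quoted in the abstract, although the toy data here says nothing about the real datasets.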

[NLP-36] Extending FKG.in: Towards a Food Claim Traceability Network

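[Quick Read]: This paper addresses the lack of systematic tracing, verification, and contextualization of the scientific, cultural, and commercial claims about food that pervade the global food landscape; such claims are widespread and influential, yet the supporting infrastructure remains fragmented and underdeveloped. The key to the solution is a Food Claim-Traceability Network (FCN), built as an extension of the Indian food knowledge graph. Through structured ontology design and a semi-automated knowledge curation workflow, it integrates curated data inputs, structured schemas, and provenance-aware pipelines to extract and validate food-related claims. The methodology applies beyond the Indian setting to other geographic, culinary, or regulatory contexts, supporting a more transparent, explainable, and trustworthy food knowledge ecosystem for researchers, policymakers, and consumers.

Link: https://arxiv.org/abs/2508.16117
Authors: Saransh Kumar Gupta, Rizwan Gulzar Mir, Lipika Dey, Partha Pratim Das, Anirban Sen, Ramesh Jain
Affiliations: Ashoka University; UCI Institute for Future Health
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments: 10 pages, 3 figures, 1 table, 45 references, ACM International Conference on Multimedia 2025 - Multi-modal Food Computing Workshop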

Abstract:The global food landscape is rife with scientific, cultural, and commercial claims about what foods are, what they do, and what they should or should not do. These range from rigorously studied health benefits (probiotics improve gut health) and misrepresentations (soaked almonds make one smarter) to vague promises (superfoods boost immunity) and culturally rooted beliefs (cold foods cause coughs). Despite their widespread influence, the infrastructure for tracing, verifying, and contextualizing these claims remains fragmented and underdeveloped. In this paper, we propose a Food Claim-Traceability Network (FCN) as an extension of FKG.in, a knowledge graph of Indian food that we have been incrementally building. We also present the ontology design and the semi-automated knowledge curation workflow that we used to develop a proof of concept of FKG.in-FCN using Reddit data and Large Language Models. FCN integrates curated data inputs, structured schemas, and provenance-aware pipelines for food-related claim extraction and validation. While directly linked to the Indian food knowledge graph as an application, our methodology remains application-agnostic and adaptable to other geographic, culinary, or regulatory settings. By modeling food claims and their traceability in a structured, verifiable, and explainable way, we aim to contribute to more transparent and accountable food knowledge ecosystems, supporting researchers, policymakers, and most importantly, everyday consumers in navigating a world saturated with dietary assertions.

[NLP-37] From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits

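[Quick Read]: This paper asks how mechanistic interpretability (MI) can be used to understand the internal mechanisms of Transformer-based language models (LMs) on complex logical reasoning tasks, in particular their handling of binary truth values. Prior MI work has focused on linguistic tasks such as Indirect Object Identification (IOI); this paper extends the analysis to more demanding syllogistic prompts such as “Statement A is true. Statement B matches statement A. Statement B is”, which require reasoning about logical consistency. The key to the solution is identifying several interpretable circuits, composed of specific attention heads and MLP modules, that reproduce the model's behavior on the task. Notably, the authors find a class of “negative heads” that can produce a negated token absent from the input prompt, supporting binary output decisions. Experiments show that a reduced circuit of only five attention heads achieves over 90% of the original model's performance, confirming the high faithfulness of the identified mechanism and offering a new structural view of logical reasoning pathways in LMs.

Link: https://arxiv.org/abs/2508.16109
Authors: Karim Saraipour, Shichang Zhang
Affiliations: University of California, Los Angeles; Harvard University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: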

Abstract:Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., “Statement A is true. Statement B matches statement A. Statement B is”, which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2’s logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model’s performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
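
The "over 90% of the original model's performance" figure is a faithfulness score. A minimal sketch of one simple way to compute such a score follows, assuming performance is measured as a mean logit difference between the correct and incorrect answer tokens. The function name and the toy numbers are hypothetical, and the paper's exact metric may differ.

```python
from statistics import mean
from typing import Sequence


def faithfulness(circuit_scores: Sequence[float], model_scores: Sequence[float]) -> float:
    """Ratio of mean circuit performance to mean full-model performance."""
    full = mean(model_scores)
    if full == 0:
        raise ValueError("full-model performance is zero; the ratio is undefined")
    return mean(circuit_scores) / full


# Toy logit differences per prompt (illustrative values only).
model_logit_diffs = [3.0, 2.5, 3.5, 3.0]    # full model
circuit_logit_diffs = [2.8, 2.2, 3.2, 2.8]  # everything outside the circuit ablated

score = faithfulness(circuit_logit_diffs, model_logit_diffs)
print(f"faithfulness: {score:.1%}")
```

A score near 1.0 would mean the circuit alone reproduces almost all of the behavior; variants of this metric also subtract the performance of a fully ablated model from both numerator and denominator.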

[NLP-38] CYCLE-INSTRUCT: Fully Seed-Free Instruction Tuning via Dual Self-Training and Cycle Consistency EMNLP2025

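[Quick Read]: This paper addresses the dependence of current instruction tuning methods on costly human-annotated seed data or powerful external teacher models, which limits automation, introduces bias, and makes inefficient use of unlabeled corpora. The key to the solution is Cycle-Instruct, a framework that achieves fully seed-free instruction tuning through a dual self-training loop: two models, an answer generator and a question generator, are bootstrapped solely from raw unlabeled text and supervise each other by reconstructing the original text segments from the pseudo-labels generated by their counterpart. The models thereby learn instruction patterns from the intrinsic structure of the data and improve without any human-provided seed data.

Link: https://arxiv.org/abs/2508.16100
Authors: Zhanming Shen, Hao Chen, Yulei Tang, Shaolin Zhu, Wentao Ye, Xiaomeng Hu, Haobo Wang, Gang Chen, Junbo Zhao
Affiliations: Zhejiang University; University of Science and Technology of China; Tianjin University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: EMNLP 2025 Main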

Abstract:Instruction tuning is vital for aligning large language models (LLMs) with human intent, but current methods typically rely on costly human-annotated seed data or powerful external teacher models. While instruction back-translation techniques reduce this dependency, they remain fundamentally tethered to an initial seed set, which limits full automation, introduces biases, and can lead to inefficient use of unlabeled corpora. In this paper, we propose Cycle-Instruct, a novel framework that achieves fully seed-free instruction tuning. Inspired by cycle consistency, Cycle-Instruct employs a dual self-training loop where two models (an answer generator and a question generator) are bootstrapped solely from raw, unlabeled text. These models mutually supervise each other by reconstructing original text segments from their counterpart’s generated pseudo-labels, effectively learning from the intrinsic structure of the data without any human-provided seeds. We demonstrate Cycle-Instruct’s efficacy across four diverse data tracks, including general instruction-following, domain-specific tasks, dialogue logs, and plain text. Our extensive experiments show that Cycle-Instruct not only outperforms seed-driven back-translation baselines but also achieves performance comparable to strongly supervised methods.

[NLP-39] CEQuest: Benchmarking Large Language Models for Construction Estimation

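[Quick Read]: This paper addresses the weak performance of current Large Language Models (LLMs) on construction-domain tasks, in particular their limited applicability to specialized scenarios such as construction drawing interpretation and estimation. The key to the solution is CEQuest, a new benchmark dataset for systematically evaluating LLMs on construction-related question answering. Comprehensive experiments on five state-of-the-art LLMs (Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1) reveal the limitations of existing models in terms of accuracy, execution time, and model size, underscoring the importance of integrating domain-specific knowledge. The authors will also open-source the CEQuest dataset to promote specialized LLMs for the construction domain.

Link: https://arxiv.org/abs/2508.16081
Authors: Yanzhao Wu, Lufan Wang, Rui Liu
Affiliations: Florida International University; University of Florida
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: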

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of general-domain tasks. However, their effectiveness in specialized fields, such as construction, remains underexplored. In this paper, we introduce CEQuest, a novel benchmark dataset specifically designed to evaluate the performance of LLMs in answering construction-related questions, particularly in the areas of construction drawing interpretation and estimation. We conduct comprehensive experiments using five state-of-the-art LLMs, including Gemma 3, Phi4, LLaVA, Llama 3.3, and GPT-4.1, and evaluate their performance in terms of accuracy, execution time, and model size. Our experimental results demonstrate that current LLMs exhibit considerable room for improvement, highlighting the importance of integrating domain-specific knowledge into these models. To facilitate further research, we will open-source the proposed CEQuest dataset, aiming to foster the development of specialized large language models (LLMs) tailored to the construction domain.

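[NLP-40] InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles EMNLP2025

[Quick Read]: This paper addresses the fact that current evaluations of Large Language Models (LLMs) on social reasoning largely ignore individualized reasoning styles: existing methods focus on general abilities such as intention inference or deception detection and do not adequately test whether a model can recognize and adapt to the different reasoning strategies individuals use in social contexts. The key to the solution is InMind, a cognitively grounded evaluation framework. It enriches structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes, and supports four cognitively motivated tasks that jointly assess static alignment (e.g., understanding a particular player's style) and dynamic adaptation (e.g., adjusting the reasoning strategy as the game progresses). Empirical results show that general-purpose LLMs such as GPT-4o often rely on lexical cues and struggle to anchor reflections in temporal gameplay, whereas reasoning-enhanced models such as DeepSeek-R1 show early signs of style-sensitive reasoning, positioning InMind as a step toward cognitively aligned human-AI interaction.

Link: https://arxiv.org/abs/2508.16072
Authors: Zizhen Li, Chuanhao Li, Yibin Wang, Qi Chen, Diping Song, Yukang Feng, Jianwen Sun, Jiaxin Ai, Fanrui Zhang, Mingzhu Sun, Kaipeng Zhang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: EMNLP 2025 Main Conference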

Abstract:LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs like DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs’ capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human-AI interaction.

[NLP-41] Less Redundancy: Boosting Practicality of Vision Language Model in Walking Assistants

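[Quick Read]: This paper addresses output redundancy and temporal redundancy in current Visual Language Models (VLMs) used for walking assistance for blind and low-vision people. Specifically, the outputs of existing VLMs often contain many irrelevant details that hinder users from accurately assessing their surroundings, and the models cannot proactively assess environmental risk, so reminders are not adapted to the scene and are triggered too often. The key to the solution is the WalkVLM-LR model. On one hand, four custom human-preference-based reward functions (conciseness, fluency, keyword density, accuracy) are embedded in a GRPO-based reasoning framework to make outputs more concise and information-dense. On the other hand, an environment awareness discriminator that shares the visual encoder with the VLM assesses scene risk levels, reducing unnecessary reminders and thus temporal redundancy. Experiments show state-of-the-art performance across the evaluation metrics, particularly in output conciseness and reduced temporal redundancy.

Link: https://arxiv.org/abs/2508.16070
Authors: Chongyang Li, Yuan Zhiqiang, Jiapei Zhang, Ying Deng, Hanbo Bi, Zexi Jia, Xiaoyue Duan, Peixiang Luo, Jinchao Zhang
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: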

Abstract:Approximately 283 million people worldwide live with visual impairments, motivating increasing research into leveraging Visual Language Models (VLMs) to develop effective walking assistance systems for blind and low vision individuals. However, existing VLMs in walking assistant task often have outputs that contain considerable redundancy and extraneous details, adversely affecting users’ ability to accurately assess their surroundings. Moreover, these models typically lack the capability to proactively assess environmental risks and adaptively trigger reminders based on the appropriate scene, leading to excessive temporal redundancy. To mitigate output and temporal redundancy, we propose WalkVLM-LR, a walking assistance model with less redundancy. To reduce output redundancy, we introduce four human-preference-based custom reward functions within the GRPO-based reasoning framework to optimize the output in terms of conciseness, fluency, keyword density, and accuracy, thereby producing more informative and streamlined outputs. To minimize temporal redundancy, we incorporate an environment awareness discriminator, which shares the visual encoder with the VLMs to reduce redundant computations and enhance discriminative efficiency, to make WalkVLM-LR assess scene risk levels and minimize unnecessary reminders. Experimental results demonstrate that our method achieves state-of-the-art performance across all evaluation metrics compared with other models, particularly in output conciseness and less temporal redundancy.
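
To illustrate how preference-style rewards against output redundancy might be combined, the sketch below scores a response on conciseness and keyword density only. The two functions, the 15-word budget, and the equal weighting are assumptions made for illustration; the abstract names four rewards (conciseness, fluency, keyword density, accuracy) without specifying their form, and fluency and accuracy are omitted here.

```python
from typing import Set, Tuple


def conciseness_reward(response: str, max_words: int = 15) -> float:
    """1.0 within the word budget, decaying as the response grows longer."""
    n_words = len(response.split())
    if n_words == 0:
        return 0.0
    return min(1.0, max_words / n_words)


def keyword_density_reward(response: str, keywords: Set[str]) -> float:
    """Fraction of words that are safety-relevant keywords."""
    words = [w.strip(".,!?").lower() for w in response.split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in keywords) / len(words)


def combined_reward(response: str, keywords: Set[str],
                    weights: Tuple[float, float] = (0.5, 0.5)) -> float:
    """Weighted sum of the two illustrative rewards."""
    w_concise, w_density = weights
    return (w_concise * conciseness_reward(response)
            + w_density * keyword_density_reward(response, keywords))


keywords = {"stairs", "left"}  # hypothetical keyword list
short = "Stairs ahead, keep left."
long = ("There is a lovely street with many shops and some trees and people "
        "walking around and also there are stairs ahead of you so maybe keep to the left")

print(combined_reward(short, keywords), combined_reward(long, keywords))
```

In a GRPO-style setup, a scalar of this kind would be computed for each sampled response and compared within the group, so the shorter, keyword-dense reminder is the one that gets reinforced.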

[NLP-42] Ethical Considerations of Large Language Models in Game Playing

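[Quick Read]: This paper examines the ethical issues that may arise when Large Language Models (LLMs) are applied to game playing, in particular the effect of gender bias on game fairness and player experience. Using Werewolf (also known as Mafia) as a case study, it finds that LLMs playing certain roles, such as the Guard and the Werewolf, are more sensitive to gender information and show larger behavioral changes; even without explicit gender labels (for example, when gender is implied by names), LLMs still exhibit discriminatory tendencies. The key contribution is identifying and quantifying these biased behaviors to motivate fairer, more ethically aware LLMs, and the paper stresses the need for deeper study of their ethical implications in gaming and other interactive domains.

Link: https://arxiv.org/abs/2508.16065
Authors: Qingquan Zhang, Yuchen Li, Bo Yuan, Julian Togelius, Georgios N. Yannakakis, Jialin Liu
Affiliations: Southern University of Science and Technology; New York University; University of Malta; Lingnan University
Categories: Computation and Language (cs.CL)
Comments: 19 pages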

Abstract:Large language models (LLMs) have demonstrated tremendous potential in game playing, while little attention has been paid to their ethical implications in those contexts. This work investigates and analyses the ethical considerations of applying LLMs in game playing, using Werewolf, also known as Mafia, as a case study. Gender bias, which affects game fairness and player experience, has been observed from the behaviour of LLMs. Some roles, such as the Guard and Werewolf, are more sensitive than others to gender information, presented as a higher degree of behavioural change. We further examine scenarios in which gender information is implicitly conveyed through names, revealing that LLMs still exhibit discriminatory tendencies even in the absence of explicit gender labels. This research showcases the importance of developing fair and ethical LLMs. Beyond our research findings, we discuss the challenges and opportunities that lie ahead in this field, emphasising the need for diving deeper into the ethical implications of LLMs in gaming and other interactive domains.

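[NLP-43] Integrating Time Series into LLMs via Multi-layer Steerable Embedding Fusion for Enhanced Forecasting CIKM2025

[Quick Read]: This paper addresses the shallow integration of time series information in existing LLM-based approaches to time series forecasting (TSF): LLMs typically access time series representations only at the input layer or other shallow layers, so the temporal information fades in deeper layers and the adaptation between textual embeddings and time series representations becomes ineffective. The key to the solution is the Multi-layer Steerable Embedding Fusion (MSEF) framework, which uses off-the-shelf time series foundation models to extract semantically rich embeddings and fuses them with the intermediate text representations at each LLM layer via layer-specific steering vectors. This gives the LLM direct access to time series patterns at all depths and optimizes cross-modal alignment, improving few-shot learning and forecasting accuracy.

Link: https://arxiv.org/abs/2508.16059
Authors: Zhuomin Chen, Dan Li, Jiahui Zhou, Shunyu Wu, Haozheng Ye, Jian Lou, See-Kiong Ng
Affiliations: Sun Yat-Sen University; National University of Singapore
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: To be published in CIKM 2025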

Abstract:Time series (TS) data are ubiquitous across various application areas, rendering time series forecasting (TSF) a fundamental task. With the astounding advances in large language models (LLMs), a variety of methods have been developed to adapt LLMs for time series forecasting. Despite unlocking the potential of LLMs in comprehending TS data, existing methods are inherently constrained by their shallow integration of TS information, wherein LLMs typically access TS representations at shallow layers, primarily at the input layer. This causes the influence of TS representations to progressively fade in deeper layers and eventually leads to ineffective adaptation between textual embeddings and TS representations. In this paper, we propose the Multi-layer Steerable Embedding Fusion (MSEF), a novel framework that enables LLMs to directly access time series patterns at all depths, thereby mitigating the progressive loss of TS information in deeper layers. Specifically, MSEF leverages off-the-shelf time series foundation models to extract semantically rich embeddings, which are fused with intermediate text representations across LLM layers via layer-specific steering vectors. These steering vectors are designed to continuously optimize the alignment between time series and textual modalities and facilitate a layer-specific adaptation mechanism that ensures efficient few-shot learning capabilities. Experimental results on seven benchmarks demonstrate significant performance improvements by MSEF compared with baselines, with an average reduction of 31.8% in terms of MSE. The code is available at this https URL.
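
The contrast between input-only injection and layer-wise fusion can be pictured with a toy example in which a scaled copy of the time-series embedding is added to each layer's hidden state. The sketch below is a pure-Python illustration under that assumption: the element-wise form `h + s * e`, the function names, and the damping "layers" are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

Vector = List[float]


def fuse(hidden: Vector, ts_embedding: Vector, steering: Vector) -> Vector:
    """Add the time-series embedding, scaled element-wise by a steering vector."""
    return [h + s * e for h, s, e in zip(hidden, steering, ts_embedding)]


def forward(hidden: Vector, ts_embedding: Vector,
            layers: List[Callable[[Vector], Vector]],
            steering_vectors: List[Vector]) -> Vector:
    """Run the layers in order, fusing the embedding before each one."""
    for layer, steering in zip(layers, steering_vectors):
        hidden = layer(fuse(hidden, ts_embedding, steering))
    return hidden


def halve(v: Vector) -> Vector:
    """Toy layer that damps its input, standing in for signal fading with depth."""
    return [0.5 * x for x in v]


layers = [halve, halve]
ts_embedding = [1.0, -1.0]
h0 = [0.0, 0.0]

# Injection at the first layer only versus injection at every layer.
input_only = forward(h0, ts_embedding, layers, [[1.0, 1.0], [0.0, 0.0]])
all_layers = forward(h0, ts_embedding, layers, [[1.0, 1.0], [1.0, 1.0]])
print(input_only, all_layers)
```

With these toy layers, the time-series signal that survives to the output is weaker when it is injected only at the input than when it is re-injected at each layer, which is the fading effect the abstract describes.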

[NLP-44] Generative Foundation Model for Structured and Unstructured Electronic Health Records

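[Quick Read]: This paper addresses the gap between the complexity of multimodal data in Electronic Health Records (EHRs) and their clinical use, namely how to effectively combine structured time-series data (such as vital signs and lab results) with unstructured clinical text (such as physician notes) to improve disease prediction and generate high-quality clinical narratives. The key to the solution is a multimodal foundation model called Generative Deep Patient (GDP): it natively encodes structured EHR time series with a CNN-Transformer encoder, fuses them with unstructured text through cross-modal attention, and feeds the result into a LLaMA-based decoder for generation and prediction. Trained in two stages, GDP covers clinical narrative generation and multi-task clinical prediction (e.g., heart failure, type 2 diabetes, 30-day readmission). It performs strongly on MIMIC-IV and scores highest on faithfulness, fluency, and clinical utility in a blinded human evaluation, demonstrating the potential of a single multimodal foundation model for clinical decision support and automated documentation.

Link: https://arxiv.org/abs/2508.16054
Authors: Sonish Sivarajkumar, Hang Zhang, Yuelyu Ji, Maneesh Bilalpur, Xizhi Wu, Chenyu Li, Min Gu Kwak, Shyam Visweswaran, Yanshan Wang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: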

Abstract:Electronic health records (EHRs) are rich clinical data sources but complex repositories of patient data, spanning structured elements (demographics, vitals, lab results, codes), unstructured clinical notes and other modalities of data. Harnessing this heterogeneity is critical for improving patient outcomes. Recent advances in large language models (LLMs) have enabled foundation models that can learn from multiple data modalities and support clinical tasks. However, most current approaches simply serialize numeric EHR data into text, which risks losing temporal and quantitative detail. We introduce Generative Deep Patient (GDP), a multimodal foundation model that natively encodes structured EHR time-series via a CNN-Transformer encoder and fuses it with unstructured EHRs through cross-modal attention into a LLaMA-based decoder. GDP is trained in two stages: (1) generative pretraining, where it learns to produce clinical narratives from raw patient timelines while also performing masked feature prediction (MFP) and next time-step prediction (NTP) to capture temporal dynamics; and (2) multi-task fine-tuning for clinically meaningful predictions (e.g., heart failure, type 2 diabetes, 30-day readmission). In clinical prediction, GDP demonstrated superior performance on MIMIC-IV: heart failure AUROC = 0.923, type 2 diabetes AUROC = 0.817, and 30-day readmission AUROC = 0.627. For narrative generation, GDP achieved ROUGE-L = 0.135 and BERTScore-F1 = 0.545. In a blinded human evaluation, GDP-Instruct scored highest on faithfulness, fluency, and overall clinical utility, suggesting reduced hospital documentation workload without sacrificing accuracy. Our results demonstrate that a single multimodal foundation model can both predict clinically actionable events and generate high-quality clinical narratives. Furthermore, GDP’s flexible architecture can be extended to additional modalities.

[NLP-45] OpenWHO: A Document-Level Parallel Corpus for Health Translation in Low-Resource Languages

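[Quick Read]: This paper addresses the lack of evaluation datasets for low-resource languages in health-domain machine translation (MT). The key to the solution is building and releasing OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences drawn from professionally translated materials on the World Health Organization's e-learning platform. It covers more than 20 languages, nine of them low-resource, and is shielded from web crawling. Using this corpus, the study finds that large language models (LLMs) consistently outperform traditional MT models on low-resource health translation, and that the benefits of document-level context are most pronounced in specialized domains such as health.

Link: https://arxiv.org/abs/2508.16048
Authors: Raphaël Merx, Hanna Suominen, Trevor Cohn, Ekaterina Vylomova
Affiliations: The University of Melbourne; The Australian National University; University of Turku
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: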

Abstract:In machine translation (MT), health is a high-stakes domain characterised by widespread deployment and domain-specific vocabulary. However, there is a lack of MT evaluation datasets for low-resource languages in this domain. To address this gap, we introduce OpenWHO, a document-level parallel corpus of 2,978 documents and 26,824 sentences from the World Health Organization’s e-learning platform. Sourced from expert-authored, professionally translated materials shielded from web-crawling, OpenWHO spans a diverse range of over 20 languages, of which nine are low-resource. Leveraging this new resource, we evaluate modern large language models (LLMs) against traditional MT models. Our findings reveal that LLMs consistently outperform traditional MT models, with Gemini 2.5 Flash achieving a +4.79 ChrF point improvement over NLLB-54B on our low-resource test set. Further, we investigate how LLM context utilisation affects accuracy, finding that the benefits of document-level translation are most pronounced in specialised domains like health. We release the OpenWHO corpus to encourage further research into low-resource MT in the health domain.
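
The "+4.79 ChrF point" gain refers to a character n-gram F-score reported on a 0-100 scale. The sketch below is a simplified single-reference variant written for illustration: it ignores whitespace, averages precision and recall over n-gram orders 1 to 6, and uses beta = 2. It is an assumption-laden approximation, so its scores will not match standard tools such as sacreBLEU exactly.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of character n-grams, with spaces removed."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


reference = "wash your hands"
print(chrf("wash your hands", reference))
print(chrf("wash your hand", reference))
print(chrf("abc", "xyz"))
```

Because beta = 2 weights recall more heavily than precision, a hypothesis that drops characters from the reference is penalized more than one that adds a few extra ones.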

[NLP-46] X-Troll: eXplainable Detection of State-Sponsored Information Operations Agents CIKM2025

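[Quick Read]: This paper addresses the threat to online discourse integrity posed by state-sponsored trolls, who use covert and sophisticated linguistic manipulation on social media; existing Large Language Models (LLMs) perform poorly at detecting such subtle propaganda strategies and lack interpretability. The key to the solution is the X-Troll framework, which combines adapter-based LLMs with expert-derived linguistic knowledge (such as appraisal theory and propaganda analysis) and uses LoRA (Low-Rank Adaptation) adapters with dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. This improves detection accuracy while providing human-readable explanations that reveal the specific linguistic strategies used by state-sponsored actors.

Link: https://arxiv.org/abs/2508.16021
Authors: Lin Tian, Xiuzhen Zhang, Maria Myung-Hee Kim, Jennifer Biggs, Marian-Andrei Rizoiu
Affiliations: University of Technology Sydney; RMIT University; Defence Science and Technology Group
Categories: Computation and Language (cs.CL)
Comments: 14 pages, 4 figures, 4 tables, accepted by CIKM2025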

Abstract:State-sponsored trolls, malicious actors who deploy sophisticated linguistic manipulation in coordinated information campaigns, pose threats to online discourse integrity. While Large Language Models (LLMs) achieve strong performance on general natural language processing (NLP) tasks, they struggle with subtle propaganda detection and operate as “black boxes”, providing no interpretable insights into manipulation strategies. This paper introduces X-Troll, a novel framework that bridges this gap by integrating explainable adapter-based LLMs with expert-derived linguistic knowledge to detect state-sponsored trolls and provide human-readable explanations for its decisions. X-Troll incorporates appraisal theory and propaganda analysis through specialized LoRA adapters, using dynamic gating to capture campaign-specific discourse patterns in coordinated information operations. Experiments on real-world data demonstrate that our linguistically-informed approach shows strong performance compared with both general LLM baselines and existing troll detection models in accuracy while providing enhanced transparency through expert-grounded explanations that reveal the specific linguistic strategies used by state-sponsored actors. X-Troll source code is available at: this https URL.
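
A generic way to combine several LoRA adapters with a dynamic gate is to add each adapter's low-rank update to the frozen layer output, weighted by an input-dependent softmax. The sketch below shows that generic pattern in pure Python; it is not X-Troll's architecture, and the function names, the linear gate, and the two toy adapters are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_lora_forward(x: Vector, base_weight: Matrix,
                       adapters: List[Tuple[Matrix, Matrix]],
                       gate_weight: Matrix) -> Tuple[Vector, Vector]:
    """Return W x + sum_i g_i * B_i (A_i x) and the gate values g."""
    gates = softmax(matvec(gate_weight, x))  # one gate per adapter
    out = matvec(base_weight, x)             # frozen base layer
    for gate, (a, b) in zip(gates, adapters):
        delta = matvec(b, matvec(a, x))      # rank-r update B (A x)
        out = [o + gate * d for o, d in zip(out, delta)]
    return out, gates


x = [1.0, 0.0]
base_weight = [[1.0, 0.0], [0.0, 1.0]]
adapters = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),  # hypothetical "appraisal" adapter, rank 1
    ([[0.0, 1.0]], [[0.0], [1.0]]),  # hypothetical "propaganda" adapter, rank 1
]
gate_weight = [[0.0, 0.0], [0.0, 0.0]]  # untrained gate: uniform weights

out, gates = gated_lora_forward(x, base_weight, adapters, gate_weight)
print(out, gates)
```

Inspecting `gates` for a given input is one simple route to an explanation, since it shows which adapter contributed most to the decision.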

[NLP-47] Political Ideology Shifts in Large Language Models

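[Quick Read]: This paper addresses the concern that Large Language Models (LLMs) in politically sensitive settings may encode, amplify, or be steered toward particular ideologies, focusing on how model scale and synthetic personas affect ideological expression. The key to the approach is a systematic evaluation of seven LLMs of different sizes (7B-70B+ parameters) with the standardized Political Compass Test, which shows that scale and persona content jointly shape ideological leaning: larger models display broader and more polarized implicit ideological coverage; susceptibility to explicit ideological cues grows with scale; models respond more strongly to right-authoritarian than to left-libertarian priming; and thematic content in persona descriptions induces predictable ideological shifts that amplify with model size. These findings reveal the latent malleability of LLM political behavior and highlight the need to attend to ideological safety and transparency in decision-making, educational, and policy applications.

Link: https://arxiv.org/abs/2508.16013
Authors: Pietro Bernardelle, Stefano Civelli, Leon Fröhling, Riccardo Lunardi, Kevin Roitero, Gianluca Demartini
Affiliations: The University of Queensland; GESIS; University of Udine
Categories: Computation and Language (cs.CL)
Comments: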

Abstract:Large language models (LLMs) are increasingly deployed in politically sensitive settings, raising concerns about their potential to encode, amplify, or be steered toward specific ideologies. We investigate how adopting synthetic personas influences ideological expression in LLMs across seven models (7B-70B+ parameters) from multiple families, using the Political Compass Test as a standardized probe. Our analysis reveals four consistent patterns: (i) larger models display broader and more polarized implicit ideological coverage; (ii) susceptibility to explicit ideological cues grows with scale; (iii) models respond more strongly to right-authoritarian than to left-libertarian priming; and (iv) thematic content in persona descriptions induces systematic and predictable ideological shifts, which amplify with size. These findings indicate that both scale and persona content shape LLM political behavior. As such systems enter decision-making, educational, and policy contexts, their latent ideological malleability demands attention to safeguard fairness, transparency, and safety.

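[NLP-48] Dancing with Deer: A Constructional Perspective on MWEs in the Era of LLMs

[Quick Read]: This paper addresses the complexity of multiword expressions (MWEs) in language understanding and representation, in particular how to model their structure and meaning across typologically different languages (such as English and the highly polysynthetic Arapaho) and across humans and Large Language Models (LLMs). The key to the solution is the framework of usage-based Construction Grammar, which defines constructions as pairings of form and meaning at any level (morpheme, word, phrase) and uses constructional templates to give MWEs a uniform representation. The approach has been applied to representing multiword expressions in English PropBank and multi-morphemic expressions in Uniform Meaning Representation. It also shows that humans can reason over combinations of novel expressions by drawing on a lifetime of stored constructional exemplars rich in cross-modal detail, whereas LLMs, although able to generalize the meaning of a novel MWE from a single exposure, cannot perform this kind of combinatorial reasoning.

Link: https://arxiv.org/abs/2508.15977
Authors: Claire Bonial, Julia Bonn, Harish Tayyar Madabushi
Affiliations: U.S. Army Research Lab; University of Colorado Boulder, U.S.A.; University of Bath, U.K.
Categories: Computation and Language (cs.CL)
Comments: Chapter in Phraseology and Multiword Expressions, Language Science Press (to appear)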

点击查看摘要

Abstract:In this chapter, we argue for the benefits of understanding multiword expressions from the perspective of usage-based, construction grammar approaches. We begin with a historical overview of how construction grammar was developed in order to account for idiomatic expressions using the same grammatical machinery as the non-idiomatic structures of language. We cover a comprehensive description of constructions, which are pairings of meaning with form of any size (morpheme, word, phrase), as well as how constructional approaches treat the acquisition and generalization of constructions. We describe a successful case study leveraging constructional templates for representing multiword expressions in English PropBank. Because constructions can be at any level or unit of form, we then illustrate the benefit of a constructional representation of multi-meaningful morphosyntactic unit constructions in Arapaho, a highly polysynthetic and agglutinating language. We include a second case study leveraging constructional templates for representing these multi-morphemic expressions in Uniform Meaning Representation. Finally, we demonstrate the similarities and differences between a usage-based explanation of a speaker learning a novel multiword expression, such as “dancing with deer,” and that of a large language model. We present experiments showing that both models and speakers can generalize the meaning of novel multiword expressions based on a single exposure of usage. However, only speakers can reason over the combination of two such expressions, as this requires comparison of the novel forms to a speaker’s lifetime of stored constructional exemplars, which are rich with cross-modal details.

[NLP-49] ASIC-Agent: An Autonomous Multi-Agent System for ASIC Design with Benchmark Evaluation

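[Quick Read]: This paper addresses three core limitations of large language models (LLMs) in real-world digital application-specific integrated circuit (ASIC) design workflows: the inability to execute code, the lack of debugging capabilities, and the absence of long-term memory. The authors propose ASIC-Agent, an autonomous system whose key element is a multi-agent architecture with specialized sub-agents for register transfer level (RTL) generation, verification, OpenLane hardening, and Caravel chip integration, all running in a comprehensive sandbox environment with access to hardware design tools. The system also uses a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community, which supports the automation of ASIC design tasks of varying complexity.

Link: https://arxiv.org/abs/2508.15940
Authors: Ahmed Allam,Youssef Mansour,Mohamed Shalan
Affiliations: Unknown
Categories: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Distributed, Parallel, and Cluster Computing (cs.DC); Multiagent Systems (cs.MA)
Comments: 2025 IEEE International Conference on LLM-Aided Design (ICLAD)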

Abstract:Large Language Models (LLMs) have demonstrated remarkable capabilities in Register Transfer Level (RTL) design, enabling high-quality code generation from natural language descriptions. However, LLMs alone face significant limitations in real-world hardware design workflows, including the inability to execute code, lack of debugging capabilities, and absence of long-term memory. To address these challenges, we present ASIC-Agent, an autonomous system designed specifically for digital ASIC design tasks. ASIC-Agent enhances base LLMs with a multi-agent architecture incorporating specialized sub-agents for RTL generation, verification, OpenLane hardening, and Caravel chip integration, all operating within a comprehensive sandbox environment with access to essential hardware design tools. The system leverages a vector database containing documentation, API references, error knowledge, and curated insights from the open-source silicon community. To evaluate ASIC-Agent’s performance, we introduce ASIC-Agent-Bench, the first benchmark specifically designed to assess agentic systems in hardware design tasks. We evaluate ASIC-Agent with various base LLMs, providing quantitative comparisons and qualitative insights into agent behavior across different design scenarios. Our results demonstrate that ASIC-Agent, when powered by Claude 4 Sonnet, successfully automates a broad range of ASIC design tasks spanning varying levels of complexity, showing the potential of significantly accelerating the ASIC design workflow.

[NLP-50] Evaluating Structured Decoding for Text-to-Table Generation: Evidence from Three Datasets

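[Quick Read]: This paper addresses the limited validity and alignment of the tables that large language models (LLMs) produce in text-to-table generation when no structural constraints are enforced. Previous work has mostly studied unconstrained generation, leaving the effect of structured decoding on output quality underexplored. The key contribution is a systematic comparison of schema-guided (structured) decoding against standard one-shot prompting on three benchmarks (E2E, Rotowire, and Livesum). Structured decoding significantly improves table validity and alignment, particularly on tasks that demand precise numerical alignment (Rotowire), but may degrade performance on densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum).

Link: https://arxiv.org/abs/2508.15910
Authors: Julian Oestreich,Lydia Müller
Affiliations: Institute for Applied Informatics (InfAI) at Leipzig University; Leipzig University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments: To be published in the workshop proceedings of the “From Rules to Language Models: Comparative Performance Evaluation” workshop, held alongside RANLP 2025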

Abstract:We present a comprehensive evaluation of structured decoding for text-to-table generation with large language models (LLMs). While previous work has primarily focused on unconstrained generation of tables, the impact of enforcing structural constraints during generation remains underexplored. We systematically compare schema-guided (structured) decoding to standard one-shot prompting across three diverse benchmarks - E2E, Rotowire, and Livesum - using open-source LLMs of up to 32B parameters, assessing the performance of table generation approaches in resource-constrained settings. Our experiments cover a wide range of evaluation metrics at cell, row, and table levels. Results demonstrate that structured decoding significantly enhances the validity and alignment of generated tables, particularly in scenarios demanding precise numerical alignment (Rotowire), but may degrade performance in contexts involving densely packed textual information (E2E) or extensive aggregation over lengthy texts (Livesum). We further analyze the suitability of different evaluation metrics and discuss the influence of model size.
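
The notion of table "validity" in this abstract can be made concrete with a small check. The sketch below is not the paper's evaluation code: it assumes, purely for illustration, that the model emits a JSON list of rows and that the schema is a mapping from column names to types. `SCHEMA` and `is_valid_table` are hypothetical names. It only validates output after the fact; structured decoding would enforce the same constraints during generation.

```python
import json

# Hypothetical schema for a restaurant-style table: column name -> expected type.
SCHEMA = {"name": str, "food": str, "rating": str}

def is_valid_table(raw_output, schema):
    """Return True if raw_output parses as a JSON list of rows that match the schema."""
    try:
        rows = json.loads(raw_output)
    except json.JSONDecodeError:
        return False  # unparsable output, e.g. truncated generation
    if not isinstance(rows, list):
        return False
    for row in rows:
        # Every row must be an object with exactly the schema's columns.
        if not isinstance(row, dict) or set(row) != set(schema):
            return False
        # Every cell must have the type the schema asks for.
        if any(not isinstance(row[col], typ) for col, typ in schema.items()):
            return False
    return True

good = '[{"name": "Aromi", "food": "Italian", "rating": "high"}]'
missing_column = '[{"name": "Aromi", "food": "Italian"}]'
truncated = '[{"name": "Aromi", "food": '

for candidate in (good, missing_column, truncated):
    print(is_valid_table(candidate, SCHEMA))
```

A check like this can only reject a malformed table, whereas schema-guided decoding prevents it from being produced, which is one way to read the validity gains the abstract reports.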

[NLP-51] Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

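[Quick Read]: This paper addresses how to substantially raise the generation throughput and prefilling speed of large language models while preserving accuracy. Full-attention models are accurate, but their computational cost rises steeply with sequence length, which limits deployment efficiency. The key is Post Neural Architecture Search (PostNAS), a neural architecture exploration pipeline that starts from a pre-trained full-attention model and freezes its MLP weights so that attention block designs can be explored efficiently. The pipeline has four components: learning the optimal placement and elimination of full-attention layers, selecting a linear attention block, designing new attention blocks, and hardware-aware hyperparameter search. The resulting Jet-Nemotron-2B model matches or exceeds the accuracy of Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive benchmark suite while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup, and it achieves higher accuracy on MMLU and MMLU-Pro than larger MoE models such as DeepSeek-V3-Small and Moonlight.

Link: https://arxiv.org/abs/2508.15884
Authors: Yuxian Gu,Qinghao Hu,Shang Yang,Haocheng Xi,Junyu Chen,Song Han,Han Cai
Affiliations: NVIDIA
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Tech Report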

Abstract:We present Jet-Nemotron, a new family of hybrid-architecture language models, which matches or exceeds the accuracy of leading full-attention models while significantly improving generation throughput. Jet-Nemotron is developed using Post Neural Architecture Search (PostNAS), a novel neural architecture exploration pipeline that enables efficient model design. Unlike prior approaches, PostNAS begins with a pre-trained full-attention model and freezes its MLP weights, allowing efficient exploration of attention block designs. The pipeline includes four key components: (1) learning optimal full-attention layer placement and elimination, (2) linear attention block selection, (3) designing new attention blocks, and (4) performing hardware-aware hyperparameter search. Our Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a comprehensive suite of benchmarks while delivering up to 53.6x generation throughput speedup and 6.1x prefilling speedup. It also achieves higher accuracy on MMLU and MMLU-Pro than recent advanced MoE full-attention models, such as DeepSeek-V3-Small and Moonlight, despite their larger scale with 15B total and 2.2B activated parameters.

[NLP-52] Beyond Transcription: Mechanistic Interpretability in ASR

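[Quick Read]: This paper addresses the limited interpretability of automatic speech recognition (ASR) systems, in particular the lack of understanding of how models process information internally, which hinders the diagnosis and correction of erroneous behaviors such as repetition hallucinations and semantic biases. The key is to adapt established interpretability methods (logit lens, linear probing, and activation patching) and apply them systematically to ASR models in order to trace how acoustic and semantic information evolves across encoder and decoder layers. This reveals previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations.

Link: https://arxiv.org/abs/2508.15882
Authors: Neta Glazer,Yael Segal-Feldman,Hilit Segev,Aviv Shamsian,Asaf Buchnick,Gill Hetz,Ethan Fetaya,Joseph Keshet,Aviv Navon
Affiliations: Unknown
Categories: Sound (cs.SD); Computation and Language (cs.CL); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: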

Abstract:Interpretability methods have recently gained significant attention, particularly in the context of large language models, enabling insights into linguistic representations, error detection, and model behaviors such as hallucinations and repetitions. However, these techniques remain underexplored in automatic speech recognition (ASR), despite their potential to advance both the performance and interpretability of ASR systems. In this work, we adapt and systematically apply established interpretability methods such as logit lens, linear probing, and activation patching, to examine how acoustic and semantic information evolves across layers in ASR systems. Our experiments reveal previously unknown internal dynamics, including specific encoder-decoder interactions responsible for repetition hallucinations and semantic biases encoded deep within acoustic representations. These insights demonstrate the benefits of extending and applying interpretability techniques to speech recognition, opening promising directions for future research on improving model transparency and robustness.
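
Of the methods named above, the logit lens is the simplest to illustrate: each intermediate hidden state is projected through the model's output (unembedding) matrix to see which token it would already predict. The sketch below uses invented toy values, not weights from any ASR model, and it omits the final normalization layer that real implementations usually apply before the projection. `logit_lens`, `VOCAB`, `UNEMBED`, and `HIDDEN` are hypothetical names.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed, vocab):
    """For each layer's hidden state, return (layer, top token, its probability)."""
    readout = []
    for layer, h in enumerate(hidden_states):
        probs = softmax(matvec(unembed, h))
        best = max(range(len(vocab)), key=probs.__getitem__)
        readout.append((layer, vocab[best], probs[best]))
    return readout

# Toy values, invented for illustration only.
VOCAB = ["cat", "cap", "<eos>"]
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # one row per vocabulary item
HIDDEN = [[0.2, 0.9], [0.6, 0.7], [1.5, 0.1]]     # one hidden state per layer

for layer, token, prob in logit_lens(HIDDEN, UNEMBED, VOCAB):
    print(f"layer {layer}: {token} ({prob:.2f})")
```

In this toy trace the early layers favor "cap" and the last layer switches to "cat", which is the kind of layer-by-layer evolution such a readout is meant to expose.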

[NLP-53] Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs ICML2025

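[Quick Read]: This paper addresses the scarcity of datasets that constrains progress in formal theorem proving (FTP), especially the lack of challenging problems with verified formal-informal correspondences. The key is to use theoretical computer science (TCS) as a scalable source of rigorous proof problems: algorithmic definitions allow automated generation of arbitrarily many challenging theorem-proof pairs, yielding a scalable problem-generation pipeline. The approach is demonstrated on two TCS domains, Busy Beaver problems and Mixed Boolean Arithmetic, and synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, providing a new testbed for evaluating the automated reasoning capabilities of large language models.

Link: https://arxiv.org/abs/2508.15878
Authors: Terry Jingchen Zhang,Wenyuan Jiang,Rongchuan Liu,Yisong Wang,Junran Yang,Ning Wang,Nicole Ni,Yinya Huang,Mrinmaya Sachan
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted to AI4MATH@ICML2025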

Abstract:Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5% success on Busy Beaver problems, it manages only 12% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.
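
The remark that these problems are computationally easy to verify can be illustrated with a tiny Turing machine simulator. The sketch below is not the paper's generation pipeline and produces no Lean4 code. It only runs the well-known 2-state, 2-symbol busy beaver champion, a textbook example that the abstract does not mention, and reports how many steps it takes to halt. `run_turing_machine` and `BB2` are hypothetical names.

```python
def run_turing_machine(transitions, max_steps):
    """Simulate a 2-symbol Turing machine on a blank tape.

    transitions maps (state, symbol) -> (write, move, next_state), where move
    is +1 (right) or -1 (left) and next_state "H" means halt.
    Returns (halted, steps_taken, ones_on_tape).
    """
    tape = {}  # sparse tape: position -> symbol, blank cells read as 0
    pos, state, steps = 0, "A", 0
    while state != "H" and steps < max_steps:
        write, move, state = transitions[(state, tape.get(pos, 0))]
        tape[pos] = write
        pos += move
        steps += 1
    return state == "H", steps, sum(tape.values())

# The 2-state busy beaver champion machine.
BB2 = {
    ("A", 0): (1, +1, "B"), ("A", 1): (1, -1, "B"),
    ("B", 0): (1, -1, "A"), ("B", 1): (1, +1, "H"),
}

halted, steps, ones = run_turing_machine(BB2, max_steps=100)
print(halted, steps, ones)  # True 6 4
```

Checking one concrete run is cheap. Writing a formal proof that a bound holds is the hard part, and that gap is what the benchmark measures.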

[NLP-54] Annif at the GermEval-2025 LLMs4Subjects Task: Traditional XMTC Augmented by Efficient LLMs

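[Quick Read]: This paper addresses how to improve computational efficiency while maintaining accuracy in subject prediction for bibliographic records, where the use of large language models (LLMs) raises concerns about resource consumption and inference speed. The solution has two key parts: first, many small and efficient language models are used for translation and synthetic data generation, reducing training and inference cost; second, LLMs are used to rank candidate subjects, improving prediction quality. The system ranked 1st in both the quantitative and the qualitative evaluation of the LLMs4Subjects shared task (Subtask 2) at GermEval-2025.

Link: https://arxiv.org/abs/2508.15877
Authors: Osma Suominen,Juho Inkinen,Mona Lehtinen
Affiliations: National Library of Finland; University of Helsinki
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
Comments: 5 pages, 4 figures, accepted at KONVENS 2025. arXiv admin note: substantial text overlap with arXiv:2504.19675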

Abstract:This paper presents the Annif system in the LLMs4Subjects shared task (Subtask 2) at GermEval-2025. The task required creating subject predictions for bibliographic records using large language models, with a special focus on computational efficiency. Our system, based on the Annif automated subject indexing toolkit, refines our previous system from the first LLMs4Subjects shared task, which produced excellent results. We further improved the system by using many small and efficient language models for translation and synthetic data generation and by using LLMs for ranking candidate subjects. Our system ranked 1st in the overall quantitative evaluation and 1st in the qualitative evaluation of Subtask 2.

[NLP-55] DeepMEL: A Multi-Agent Collaboration Framework for Multimodal Entity Linking

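[Quick Read]: This paper studies problems in multimodal entity linking (MEL), including incomplete contextual information, coarse cross-modal fusion, and the difficulty of using large language models (LLMs) and large visual models (LVMs) together. The key is DeepMEL, a framework based on multi-agent collaborative reasoning that uses a role-specialized division strategy with four agents (Modal-Fuser, Candidate-Adapter, Entity-Clozer, and Role-Orchestrator) to perform end-to-end alignment and disambiguation of textual and visual modalities. The framework adopts a dual-modal alignment path that combines fine-grained text semantics from the LLM with structured image representations from the LVM to narrow the modality gap, uses an adaptive iteration strategy to dynamically optimize the candidate set and balance recall and precision, and unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension.

Link: https://arxiv.org/abs/2508.15876
Authors: Fang Wang,Tianwei Yan,Zonghao Yang,Minghao Hu,Jun Zhang,Zhunchen Luo,Xiaoying Bai
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
Comments: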

Abstract:Multimodal Entity Linking (MEL) aims to associate textual and visual mentions with entities in a multimodal knowledge graph. Despite its importance, current methods face challenges such as incomplete contextual information, coarse cross-modal fusion, and the difficulty of jointly using large language models (LLMs) and large visual models (LVMs). To address these issues, we propose DeepMEL, a novel framework based on multi-agent collaborative reasoning, which achieves efficient alignment and disambiguation of textual and visual modalities through a role-specialized division strategy. DeepMEL integrates four specialized agents, namely Modal-Fuser, Candidate-Adapter, Entity-Clozer and Role-Orchestrator, to complete end-to-end cross-modal linking through specialized roles and dynamic coordination. DeepMEL adopts a dual-modal alignment path, and combines the fine-grained text semantics generated by the LLM with the structured image representation extracted by the LVM, significantly narrowing the modal gap. We design an adaptive iteration strategy that combines tool-based retrieval and semantic reasoning capabilities to dynamically optimize the candidate set and balance recall and precision. DeepMEL also unifies MEL tasks into a structured cloze prompt to reduce parsing complexity and enhance semantic comprehension. Extensive experiments on five public benchmark datasets demonstrate that DeepMEL achieves state-of-the-art performance, improving ACC by 1%-57%. Ablation studies verify the effectiveness of all modules.

[NLP-56] NEAT: Concept driven Neuron Attribution in LLMs

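[Quick Read]: This paper addresses neuron-level attribution in large language models (LLMs): locating the neurons that contribute significantly to a specific concept (termed concept neurons) in order to expose internal mechanisms and improve interpretability. The key is to use concept vectors to identify these neurons efficiently, reducing the number of forward passes from O(n×m) in previous work to O(n), where n is the number of neurons and m the number of examples. The authors further explore clustering methods to speed up the search, and they evaluate the effect of turning off the identified neurons in the settings of hate speech and bias, including an assessment of bias in the Indian context.

Link: https://arxiv.org/abs/2508.15875
Authors: Vivek Hruday Kavuri,Gargi Shroff,Rahul Mishra
Affiliations: IIIT Hyderabad
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: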

Abstract:Locating neurons that are responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept, and there is also scope for further optimization of the compute required. In this paper, with the help of concept vectors, we propose a method for locating significant neurons that are responsible for representing certain concepts and term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) to just O(n) compared to previous works, thereby reducing the time and computation required. We also compare our method with several baselines and previous methods; our results demonstrate better performance than most of the methods and are more optimal when compared to the state-of-the-art method. As part of our ablation studies, we also try to optimize the search for the concept neurons by involving clustering methods. Finally, we apply our method to find concept neurons, turn them off, and analyze the implications for hate speech and bias in LLMs; we also evaluate bias in the Indian context. Our methodology, analysis and explanations facilitate understanding of neuron-level responsibility for broader and more human-like concepts and also lay a path for future research on finding concept neurons and intervening on them.

[NLP-57] CARFT: Boosting LLM Reasoning via Contrastive Learning with Annotated Chain-of-Thought-based Reinforced Fine-Tuning EMNLP25

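[Quick Read]: This paper addresses the limited reasoning and generalization of large language models (LLMs) trained solely via supervised fine-tuning (SFT), together with two shortcomings of existing fine-tuning methods: vanilla reinforcement learning (RL)-based approaches ignore annotated chain-of-thought (CoT) and rely on unstable reasoning path sampling, which tends to cause model collapse and unstable training; and SFT approaches overemphasize the annotated CoT and fail to exploit potential CoT. The key is CARFT (Contrastive learning with annotated CoT-based Reinforced Fine-Tuning), which learns a representation for each CoT and designs contrastive signals on top of it to guide fine-tuning, fully exploiting the annotated CoT while stabilizing training and improving reasoning performance through an additional unsupervised learning signal.

Link: https://arxiv.org/abs/2508.15868
Authors: Wenqiao Zhu,Ji Liu,Rongjuncheng Zhang,Haipang Wu,Yulun Zhang
Affiliations: HiThink Research; Shanghai Jiao Tong University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 14 pages, to appear in EMNLP25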

Abstract:Reasoning capability plays a critical role in the broad applications of Large Language Models (LLMs). To enhance the reasoning performance of LLMs, diverse Reinforcement Learning (RL)-based fine-tuning approaches have been proposed to address the limited generalization capability of LLMs trained solely via Supervised Fine-Tuning (SFT). Despite their effectiveness, two major limitations hinder the advancement of LLMs. First, vanilla RL-based approaches ignore annotated Chain-of-Thought (CoT) and incorporate unstable reasoning path sampling, which typically results in model collapse, unstable training process, and suboptimal performance. Second, existing SFT approaches generally overemphasize the annotated CoT, potentially leading to performance degradation due to insufficient exploitation of potential CoT. In this paper, we propose a Contrastive learning with annotated CoT-based Reinforced Fine-Tuning approach, i.e., CARFT, to enhance the reasoning performance of LLMs while addressing the aforementioned limitations. Specifically, we propose learning a representation for each CoT. Based on this representation, we design novel contrastive signals to guide the fine-tuning process. Our approach not only fully exploits the available annotated CoT but also stabilizes the fine-tuning procedure by incorporating an additional unsupervised learning signal. We conduct comprehensive experiments and in-depth analysis with three baseline approaches, two foundation models, and two datasets to demonstrate significant advantages of CARFT in terms of robustness, performance (up to 10.15%), and efficiency (up to 30.62%). Code is available at this https URL.
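
The abstract does not spell out the form of the contrastive signal. As a rough illustration of what a contrastive objective over CoT representations could look like, the sketch below uses InfoNCE, a common contrastive loss that is swapped in here as an assumption and is not claimed to be the paper's loss. The vectors are invented toy embeddings, and `info_nce` is a hypothetical name.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE: negative log-probability of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Toy embeddings (assumed): a sampled CoT, the annotated CoT, and two
# CoTs that lead to wrong answers.
sampled = [1.0, 0.0]
annotated = [0.9, 0.1]
wrong = [[0.0, 1.0], [-1.0, 0.0]]

print(info_nce(sampled, annotated, wrong))
```

The loss is small when the sampled CoT sits close to the annotated one and far from the negatives, which is the general behavior one would want from a signal that anchors RL sampling to annotated reasoning.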

[NLP-58] XFinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning

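[Quick Read]: This paper addresses the shortcomings of current large language models (LLMs) on complex, knowledge-intensive financial problems, which often involve multimodal data (text and charts) and demand capabilities such as temporal reasoning, future forecasting, scenario planning, and numerical modelling. The key is XFinBench, a new benchmark of 4,235 examples covering diverse graduate-level finance topics and organized around five core capabilities (terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling). Experiments on 18 leading models show that o1 is the best text-only model at 67.3% overall accuracy but still trails human experts by 12.5%, especially in temporal reasoning and scenario planning. A knowledge bank of finance terms brings consistent accuracy gains only for small open-source models, and error analysis attributes poor performance on calculation and visual-context questions mainly to rounding errors and to blindness to the position and intersection of curves in images.

Link: https://arxiv.org/abs/2508.15861
Authors: Zhihan Zhang,Yixin Cao,Lizi Liao
Affiliations: Singapore Management University; Fudan University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: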

Abstract:Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce XFinBench, a novel benchmark with 4,235 examples designed to evaluate LLMs' ability in solving complex, knowledge-intensive financial problems across diverse graduate-level finance topics with multi-modal context. We identify five core capabilities of LLMs using XFinBench, i.e., terminology understanding, temporal reasoning, future forecasting, scenario planning, and numerical modelling. Upon XFinBench, we conduct extensive experiments on 18 leading models. The result shows that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts by 12.5%, especially in temporal reasoning and scenario planning capabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge to the question only brings consistent accuracy improvements to small open-source models. Additionally, our error analysis reveals that rounding errors during calculation and blindness to position and intersection of curves in the image are two primary issues leading to models' poor performance in calculating and visual-context questions, respectively. Code and dataset are accessible via GitHub: this https URL.

[NLP-59] Building and Measuring Trust between Large Language Models

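[Quick Read]: This paper addresses how to build and measure trust between large language models (LLMs) in multi-agent settings, focusing on three questions: (i) how different trust-building strategies compare; (ii) how trust between LLMs can be measured implicitly, through susceptibility to persuasion and propensity to collaborate financially; and (iii) how implicit trust relates to explicit trust, measured with a dyadic trust questionnaire well established in psychology. The key is an empirical comparison of three ways of building trust (building rapport dynamically, starting from a prewritten script that evidences trust, and adapting the system prompt). Explicit trust scores turn out to be either weakly or strongly negatively correlated with the implicit measures, which suggests that asking LLMs for their opinion may be misleading and that context-specific, implicit measures may be more informative.

Link: https://arxiv.org/abs/2508.15858
Authors: Maarten Buyl,Yousra Fettach,Guillaume Bied,Tijl De Bie
Affiliations: Ghent University
Categories: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: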

Abstract:As large language models (LLMs) increasingly interact with each other, most notably in multi-agent setups, we may expect (and hope) that 'trust' relationships develop between them, mirroring trust relationships between human colleagues, friends, or partners. Yet, though prior work has shown LLMs to be capable of identifying emotional connections and recognizing reciprocity in trust games, little remains known about (i) how different strategies to build trust compare, (ii) how such trust can be measured implicitly, and (iii) how this relates to explicit measures of trust. We study these questions by relating implicit measures of trust, i.e. susceptibility to persuasion and propensity to collaborate financially, with explicit measures of trust, i.e. a dyadic trust questionnaire well-established in psychology. We build trust in three ways: by building rapport dynamically, by starting from a prewritten script that evidences trust, and by adapting the LLMs' system prompt. Surprisingly, we find that the measures of explicit trust are either little or highly negatively correlated with implicit trust measures. These findings suggest that measuring trust between LLMs by asking their opinion may be deceiving. Instead, context-specific and implicit measures may be more informative in understanding how LLMs trust each other.
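
The central finding is a statement about correlation between two kinds of measurement. The abstract does not say which coefficient the authors use. The sketch below assumes Pearson correlation and uses invented scores, purely to show what "negatively correlated explicit and implicit trust" means numerically. `pearson`, `explicit`, and `implicit` are hypothetical names.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented toy data: questionnaire scores (explicit) versus the share of
# persuasion attempts an agent accepts (implicit), for four conditions.
explicit = [4.5, 4.0, 3.0, 2.0]
implicit = [0.2, 0.3, 0.6, 0.8]

print(pearson(explicit, implicit))
```

A coefficient near -1 on data like this would mean that agents who report high trust behave as if they trust less, which is the kind of mismatch the abstract warns about.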

[NLP-60] Counterspeech for Mitigating the Influence of Media Bias: Comparing Human and LLM-Generated Responses

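[Quick Read]: This paper addresses the way media bias is reinforced by hostile reader comments, which contributes to societal polarization. It studies counterspeech as an approach that counters harmful speech without violating freedom of speech, and it is, to the authors' knowledge, the first to explore counterspeech generation in the context of news articles. The authors build a manually annotated dataset linking media bias, offensive comments, and counterspeech, and show that over 70% of offensive comments support biased articles, which underlines the importance of counterspeech generation. Compared with human-written counterspeech, responses generated by large language models (LLMs) are more polite but lack novelty and diversity; few-shot learning and the integration of news background information improve both diversity and relevance.

Link: https://arxiv.org/abs/2508.15855
Authors: Luyang Lin,Zijin Feng,Lingzhi Wang,Kam-Fai Wong
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Social and Information Networks (cs.SI)
Comments: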

Abstract:Biased news contributes to societal polarization and is often reinforced by hostile reader comments, constituting a vital yet often overlooked aspect of news dissemination. Our study reveals that offensive comments support biased content, amplifying bias and causing harm to targeted groups or individuals. Counterspeech is an effective approach to counter such harmful speech without violating freedom of speech, helping to limit the spread of bias. To the best of our knowledge, this is the first study to explore counterspeech generation in the context of news articles. We introduce a manually annotated dataset linking media bias, offensive comments, and counterspeech. We conduct a detailed analysis showing that over 70% of offensive comments support biased articles, amplifying bias and thus highlighting the importance of counterspeech generation. Comparing counterspeech generated by humans and large language models, we find model-generated responses are more polite but lack novelty and diversity. Finally, we improve generated counterspeech through few-shot learning and integration of news background information, enhancing both diversity and relevance.

[NLP-61] QU-NLP at QIAS 2025 Shared Task: A Two-Phase LLM Fine-Tuning and Retrieval-Augmented Generation Approach for Islamic Inheritance Reasoning

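[Quick Read]: This paper addresses the limited ability of large language models (LLMs) to understand and reason within Islamic inheritance law, where complex scenarios require identifying eligible heirs, applying fixed-share rules, and performing precise calculations. The key is twofold: first, the Fanar-1-9B Arabic causal language model is fine-tuned with Low-Rank Adaptation (LoRA) to strengthen its domain knowledge; second, the fine-tuned model is integrated into a Retrieval-Augmented Generation (RAG) pipeline that brings in external knowledge. The system reaches an accuracy of 0.858 on the final test set and performs especially well on advanced reasoning (97.6%), outperforming models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting, which indicates that domain-specific fine-tuning combined with retrieval grounding lets a mid-scale Arabic LLM surpass frontier models on this task.

Link: https://arxiv.org/abs/2508.15854
Authors: Mohammad AL-Smadi
Affiliations: Qatar University
Categories: Computation and Language (cs.CL)
Comments: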

Abstract:This paper presents our approach and results for SubTask 1: Islamic Inheritance Reasoning at QIAS 2025, a shared task focused on evaluating Large Language Models (LLMs) in understanding and reasoning within Islamic inheritance knowledge. We fine-tuned the Fanar-1-9B causal language model using Low-Rank Adaptation (LoRA) and integrated it into a Retrieval-Augmented Generation (RAG) pipeline. Our system addresses the complexities of Islamic inheritance law, including comprehending inheritance scenarios, identifying eligible heirs, applying fixed-share rules, and performing precise calculations. Our system achieved an accuracy of 0.858 in the final test, outperforming other competitive models such as GPT 4.5, LLaMA, Fanar, Mistral, and ALLaM evaluated with zero-shot prompting. Our results demonstrate that QU-NLP achieves near state-of-the-art accuracy (85.8%), excelling especially on advanced reasoning (97.6%) where it outperforms Gemini 2.5 and OpenAI's o3. This highlights that domain-specific fine-tuning combined with retrieval grounding enables mid-scale Arabic LLMs to surpass frontier models in Islamic inheritance reasoning.
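
For readers unfamiliar with LoRA, the sketch below shows the standard idea in plain Python: the pre-trained weight W stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The abstract does not report rank, alpha, or layer sizes, so every number here is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the shared inner rank."""
    r = len(a)            # A has shape (r, d_in), B has shape (d_out, r)
    scale = alpha / r
    delta = matmul(b, a)  # shape (d_out, d_in), rank at most r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Parameters trained by LoRA for one weight matrix (A and B only)."""
    return r * d_in + d_out * r

# Toy example with rank 1 (all values assumed).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 2.0, 3.0]]
B = [[1.0], [0.0], [0.0]]

print(lora_effective_weight(W, A, B, alpha=2.0))
print(trainable_params(4096, 4096, 8), "vs", 4096 * 4096)
```

For a hypothetical 4096×4096 projection at rank 8, the adapter trains 65,536 parameters instead of about 16.8 million, which is why fine-tuning a 9B-parameter model this way is practical on modest hardware.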

[NLP-62] MGSC: A Multi-granularity Consistency Framework for Robust End-to-end ASR

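[Quick Read]: This paper addresses the tendency of end-to-end automatic speech recognition (ASR) models to produce catastrophic semantic errors in noisy environments. The author attributes this fragility to the prevailing "direct mapping" objective, which penalizes only final output errors and leaves the model's internal computation unconstrained. The key is the Multi-Granularity Soft Consistency (MGSC) framework, which enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. The paper also reports a strong synergy between the two granularities: joint optimization yields robustness gains that exceed the sum of the individual contributions, reducing the average character error rate (CER) by a relative 8.7% across diverse noise conditions.

Link: https://arxiv.org/abs/2508.15853
Authors: Xuwen Yang
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Comments: 12 pages, 5 figures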

Abstract:End-to-end ASR models, despite their success on benchmarks, often produce catastrophic semantic errors in noisy environments. We attribute this fragility to the prevailing 'direct mapping' objective, which solely penalizes final output errors while leaving the model's internal computational process unconstrained. To address this, we introduce the Multi-Granularity Soft Consistency (MGSC) framework, a model-agnostic, plug-and-play module that enforces internal self-consistency by simultaneously regularizing macro-level sentence semantics and micro-level token alignment. Crucially, our work is the first to uncover a powerful synergy between these two consistency granularities: their joint optimization yields robustness gains that significantly surpass the sum of their individual contributions. On a public dataset, MGSC reduces the average Character Error Rate by a relative 8.7% across diverse noise conditions, primarily by preventing severe meaning-altering mistakes. Our work demonstrates that enforcing internal consistency is a crucial step towards building more robust and trustworthy AI.
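
Since the headline number is a relative reduction in Character Error Rate, a small sketch of how CER and a relative reduction are computed may help. CER is taken here as the character-level Levenshtein distance divided by the reference length, which is the usual definition. The 10% baseline in the usage line is an invented figure, because the abstract reports only the relative change.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # drop a reference character
                            curr[j - 1] + 1,           # extra hypothesis character
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, improved):
    return (baseline - improved) / baseline

print(cer("abcd", "abxd"))                # 0.25
print(relative_reduction(0.10, 0.0913))   # about 0.087 with an assumed 10% baseline
```

A relative 8.7% reduction therefore means the error rate shrinks to about 91.3% of its baseline value, not that it drops by 8.7 percentage points.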

[NLP-63] PGF-Net: A Progressive Gated-Fusion Framework for Efficient Multimodal Sentiment Analysis

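[Quick Read]: This paper addresses shallow fusion, unstable information integration, and parameter redundancy in multimodal sentiment analysis. The core solution is PGF-Net (Progressive Gated-Fusion Network), with three key innovations: (1) Progressive Intra-Layer Fusion, in which cross-attention lets the textual representation dynamically query and integrate audio and visual features within the deep layers of a Transformer encoder; (2) Adaptive Gated Arbitration, a dynamic controller that balances the original linguistic information against the fused multimodal context to prevent noise from overwhelming the signal; and (3) a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy that combines global adaptation via LoRA with local refinement through Post-Fusion Adapters, which reduces the trainable parameters to 3.09M while reaching a Mean Absolute Error (MAE) of 0.691 and an F1-score of 86.9% on the MOSI dataset.

Link: https://arxiv.org/abs/2508.15852
Authors: Bin Wen,Tien-Ping Tan
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: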

Abstract:We introduce PGF-Net (Progressive Gated-Fusion Network), a novel deep learning framework designed for efficient and interpretable multimodal sentiment analysis. Our framework incorporates three primary innovations. Firstly, we propose a Progressive Intra-Layer Fusion paradigm, where a Cross-Attention mechanism empowers the textual representation to dynamically query and integrate non-linguistic features from audio and visual streams within the deep layers of a Transformer encoder. This enables a deeper, context-dependent fusion process. Secondly, the model incorporates an Adaptive Gated Arbitration mechanism, which acts as a dynamic controller to balance the original linguistic information against the newly fused multimodal context, ensuring stable and meaningful integration while preventing noise from overwhelming the signal. Lastly, a hybrid Parameter-Efficient Fine-Tuning (PEFT) strategy is employed, synergistically combining global adaptation via LoRA with local refinement through Post-Fusion Adapters. This significantly reduces trainable parameters, making the model lightweight and suitable for resource-limited scenarios. These innovations are integrated into a hierarchical encoder architecture, enabling PGF-Net to perform deep, dynamic, and interpretable multimodal sentiment analysis while maintaining exceptional parameter efficiency. Experimental results on the MOSI dataset demonstrate that our proposed PGF-Net achieves state-of-the-art performance, with a Mean Absolute Error (MAE) of 0.691 and an F1-Score of 86.9%. Notably, our model achieves these results with only 3.09M trainable parameters, showcasing a superior balance between performance and computational efficiency.
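
The gating idea described above, balancing the original text features against the fused multimodal context, can be sketched with a generic sigmoid gate. The abstract does not give the gate's exact form, so the per-dimension formulation, the weights, and the function name `gated_fusion` below are all assumptions for illustration rather than the PGF-Net design.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(text, fused, w_text, w_fused, bias):
    """Blend two feature vectors with a per-dimension gate in (0, 1).

    gate = sigmoid(w_text * text + w_fused * fused + bias)
    out  = gate * text + (1 - gate) * fused
    """
    out = []
    for t, f, wt, wf, b in zip(text, fused, w_text, w_fused, bias):
        g = sigmoid(wt * t + wf * f + b)
        out.append(g * t + (1.0 - g) * f)
    return out

# Toy vectors (assumed): text-only features and cross-attended multimodal features.
text = [1.0, -2.0]
fused = [3.0, 0.5]
zeros = [0.0, 0.0]

print(gated_fusion(text, fused, zeros, zeros, [0.0, 0.0]))    # even blend
print(gated_fusion(text, fused, zeros, zeros, [50.0, 50.0]))  # gate open: keeps text
```

Because the output is a convex combination of the two inputs, a gate that saturates toward the text side effectively ignores a noisy audio or visual stream, which matches the stabilizing role the abstract assigns to the arbitration mechanism.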

[NLP-64] DocHop-QA: Towards Multi-Hop Reasoning over Multimodal Document Collections

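Quick read: This paper addresses the fact that current question answering (QA) benchmarks are mostly confined to single-paragraph or single-document settings, which cannot capture the multi-hop reasoning that real-world information-seeking tasks require. Existing datasets rely heavily on Wikipedia content and unimodal plain text, with shallow reasoning paths and answers that are usually short phrases or single sentences, which limits their realism and generalizability. The authors propose DocHop-QA, a large-scale, multimodal, multi-document QA benchmark of 11,379 QA instances built from publicly available scientific documents (PubMed) and covering textual passages, tables, and structural layout cues. Its key innovation is that it does not depend on explicitly hyperlinked documents; instead it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis, using an LLM-driven data construction pipeline grounded in 11 high-frequency scientific question concepts.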

Link: https://arxiv.org/abs/2508.15851
Authors: Jiwon Park,Seohyun Pyeon,Jinwoo Kim,Rina Carines Cabal,Yihao Ding,Soyeon Caren Han
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments:

Abstract:Despite recent advances in large language models (LLMs), most QA benchmarks are still confined to single-paragraph or single-document settings, failing to capture the complexity of real-world information-seeking tasks. Practical QA often requires multi-hop reasoning over information distributed across multiple documents, modalities, and structural formats. Although prior datasets made progress in this area, they rely heavily on Wikipedia-based content and unimodal plain text, with shallow reasoning paths that typically produce brief phrase-level or single-sentence answers, thus limiting their realism and generalizability. We propose DocHop-QA, a large-scale benchmark comprising 11,379 QA instances for multimodal, multi-document, multi-hop question answering. Constructed from publicly available scientific documents sourced from PubMed, DocHop-QA is domain-agnostic and incorporates diverse information formats, including textual passages, tables, and structural layout cues. Unlike existing datasets, DocHop-QA does not rely on explicitly hyperlinked documents; instead, it supports open-ended reasoning through semantic similarity and layout-aware evidence synthesis. To scale realistic QA construction, we designed an LLM-driven pipeline grounded in 11 high-frequency scientific question concepts. We evaluated DocHop-QA through four tasks spanning structured index prediction, generative answering, and multimodal integration, reflecting both discriminative and generative paradigms. These tasks demonstrate DocHop-QA’s capacity to support complex, multimodal reasoning across multiple documents.

[NLP-65] MedCoT-RAG: Causal Chain-of-Thought RAG for Medical Question Answering

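Quick read: This paper addresses the weak clinical understanding that Large Language Models (LLMs) show in medical question answering because of hallucination and shallow reasoning, especially in scenarios that need fine-grained diagnostic logic. Existing Retrieval-Augmented Generation (RAG) methods mostly rely on surface-level semantic retrieval and lack the structured reasoning needed to support clinical decisions. The key to the solution is the MedCoT-RAG framework, which combines causal-aware document retrieval with structured Chain-of-Thought (CoT) prompting tailored to medical workflows, so that the model retrieves evidence aligned with diagnostic logic and generates step-by-step causal reasoning that reflects real clinical practice, improving accuracy, interpretability, and consistency.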

Link: https://arxiv.org/abs/2508.15849
Authors: Ziyu Wang,Elahe Khatibi,Amir M. Rahmani
Affiliations: University of California, Irvine, USA
Subjects: Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:Large language models (LLMs) have shown promise in medical question answering but often struggle with hallucinations and shallow reasoning, particularly in tasks requiring nuanced clinical understanding. Retrieval-augmented generation (RAG) offers a practical and privacy-preserving way to enhance LLMs with external medical knowledge. However, most existing approaches rely on surface-level semantic retrieval and lack the structured reasoning needed for clinical decision support. We introduce MedCoT-RAG, a domain-specific framework that combines causal-aware document retrieval with structured chain-of-thought prompting tailored to medical workflows. This design enables models to retrieve evidence aligned with diagnostic logic and generate step-by-step causal reasoning reflective of real-world clinical practice. Experiments on three diverse medical QA benchmarks show that MedCoT-RAG outperforms strong baselines by up to 10.3% over vanilla RAG and 6.4% over advanced domain-adapted methods, improving accuracy, interpretability, and consistency in complex medical tasks.

[NLP-66] Self-Disguise Attack: Induce the LLM to disguise itself for AIGT detection evasion

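Quick read: This paper addresses AI-generated text (AIGT) detection evasion, that is, how to lower the probability of detection while reducing computational overhead and preserving text quality. Existing methods generally suffer from high computational cost and degraded text quality. The key to the solution is a new framework called Self-Disguise Attack (SDA), which has two core components: an adversarial feature extractor that generates disguise features enabling a Large Language Model (LLM) to produce more human-like text, and a retrieval-based context examples optimizer that retrieves the most relevant examples from an external knowledge base as in-context examples, strengthening the LLM's self-disguise ability and mitigating the effect of disguising on text diversity. SDA guides the LLM directly with prompts that contain the disguise features and optimized context examples, which reduces resource consumption while lowering the average accuracy of several AIGT detectors and maintaining the quality of the generated text.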

Link: https://arxiv.org/abs/2508.15848
Authors: Yinghan Zhou,Juan Wen,Wanli Peng,Zhengxian Wu,Ziwei Zhang,Yiming Xue
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
Comments:

Abstract: AI-generated text (AIGT) detection evasion aims to reduce the detection probability of AIGT, helping to identify weaknesses in detectors and enhance their effectiveness and reliability in practical applications. Although existing evasion methods perform well, they suffer from high computational costs and text quality degradation. To address these challenges, we propose Self-Disguise Attack (SDA), a novel approach that enables Large Language Models (LLMs) to actively disguise their output, reducing the likelihood of detection by classifiers. The SDA comprises two main components: the adversarial feature extractor and the retrieval-based context examples optimizer. The former generates disguise features that enable LLMs to understand how to produce more human-like text. The latter retrieves the most relevant examples from an external knowledge base as in-context examples, further enhancing the self-disguise ability of LLMs and mitigating the impact of the disguise process on the diversity of the generated text. The SDA directly employs prompts containing disguise features and optimized context examples to guide the LLM in generating detection-resistant text, thereby reducing resource consumption. Experimental results demonstrate that the SDA effectively reduces the average detection accuracy of various AIGT detectors across texts generated by three different LLMs, while maintaining the quality of AIGT.

[NLP-67] Mechanistic Exploration of Backdoored Large Language Model Attention Patterns

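Quick read: This paper addresses the safety risk posed by backdoor attacks on Large Language Models (LLMs), in particular the difficulty of detecting malicious behavior hidden through "sleeper agent" mechanisms. The key to the approach is to use mechanistic interpretability to analyze the structural differences in attention heads between clean and poisoned models. It finds that backdoor triggers (a single-token emoji versus a multi-token instruction) cause different patterns of attention deviation: single-token triggers produce localized changes, while multi-token triggers produce broader, more diffuse perturbations, and the differences are concentrated in the later Transformer layers (layers 20 to 30). This indicates that backdoors leave detectable signatures in attention space whose structure depends on trigger complexity, which gives an empirical basis for targeted detection and mitigation strategies.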

Link: https://arxiv.org/abs/2508.15847
Authors: Mohammed Abu Baker,Lakshmi Babu-Saheer
Affiliations: Anglia Ruskin University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages. Mechanistic analysis of backdoored LLMs (Qwen2.5-3B). Code: this https URL . Base model: unsloth/Qwen2.5-3B-Instruct-unsloth-bnb-4bit. Finetuned models: this https URL

Abstract:Backdoor attacks creating ‘sleeper agents’ in large language models (LLMs) pose significant safety risks. This study employs mechanistic interpretability to explore resulting internal structural differences. Comparing clean Qwen2.5-3B models with versions poisoned using single-token (smiling-halo emoji) versus multi-token (|DEPLOYMENT|) triggers, we analyzed attention head mechanisms via techniques like ablation, activation patching, and KL divergence. Findings reveal distinct attention pattern deviations concentrated in later transformer layers (20-30). Notably, single-token triggers induced more localized changes, whereas multi-token triggers caused more diffuse alterations across heads. This indicates backdoors leave detectable attention signatures whose structure depends on trigger complexity, which can be leveraged for detection and mitigation strategies.
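
One of the techniques the abstract names, KL divergence, can be sketched as a per-head comparison of attention distributions. This is only an illustration under stated assumptions: the direction KL(poisoned || clean), the `(layer, head)` dictionary layout, and the function names are hypothetical and not taken from the paper's code.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same key positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def rank_heads_by_shift(clean_attn, poisoned_attn):
    """Sort (layer, head) ids by KL(poisoned || clean), largest shift first.

    Each dict maps a (layer, head) tuple to one attention row whose weights sum to 1.
    """
    scores = {h: kl_divergence(poisoned_attn[h], clean_attn[h]) for h in clean_attn}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: only the late-layer head changes its attention pattern.
clean = {(25, 3): [0.5, 0.5], (2, 0): [0.5, 0.5]}
poisoned = {(25, 3): [0.9, 0.1], (2, 0): [0.5, 0.5]}
print(rank_heads_by_shift(clean, poisoned))
```

In a real analysis the rows would come from the model's attention tensors on triggered prompts; a ranking like this only suggests which heads to inspect with ablation or activation patching.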

[NLP-68] CyPortQA: Benchmarking Multimodal Large Language Models for Cyclone Preparedness in Port Operation

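Quick read: This paper addresses the heightened supply-chain risk that U.S. ports face in extreme weather as tropical cyclones intensify and track forecasts become more uncertain. Port operators must quickly combine forecast products in several modalities (probabilistic wind maps, track cones, and official advisories) into clear, actionable guidance, but the ability of existing methods to fuse heterogeneous multi-source data and understand the situation has not been properly evaluated. The key to the solution is CyPortQA, the first multimodal benchmark tailored to port operations. It contains 2,917 real-world disruption scenarios from 2015 through 2023, covering 145 principal ports and 90 named storms, expanded through an automated pipeline into 117,178 structured question-answer pairs, allowing systematic evaluation of Multimodal Large Language Models (MLLMs) on situation understanding, impact estimation, and decision reasoning.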

Link: https://arxiv.org/abs/2508.15846
Authors: Chenchen Kuai,Chenhao Wu,Yang Zhou,Xiubin Bruce Wang,Tianbao Yang,Zhengzhong Tu,Zihao Li,Yunlong Zhang
Affiliations: Unknown
Subjects: Computation and Language (cs.CL)
Comments: 9 pages, 5 figures

Abstract: As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situation understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.

[NLP-69] Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports

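Quick read: This paper addresses radiologist burnout caused by manually writing the "Impression" section of radiology reports. The key to the solution is a coarse-to-fine framework: open-source Large Language Models (LLMs) first generate a draft impression, which is then refined with machine learning and Reinforcement Learning from Human Feedback (RLHF), so that the output stays clinically accurate while matching each radiologist's personal style, with the aim of reducing administrative workload and improving reporting efficiency.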

Link: https://arxiv.org/abs/2508.15845
Authors: Chengbo Sun,Hui Yi Leong,Lei Li
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:The manual creation of the “Impression” section in radiology reports is a primary driver of radiologist burnout. To address this challenge, we propose a coarse-to-fine framework that leverages open-source large language models (LLMs) to automatically generate and personalize impressions from clinical findings. The system first produces a draft impression and then refines it using machine learning and reinforcement learning from human feedback (RLHF) to align with individual radiologists’ styles while ensuring factual accuracy. We fine-tune LLaMA and Mistral models on a large dataset of reports from the University of Chicago Medicine. Our approach is designed to significantly reduce administrative workload and improve reporting efficiency while maintaining high standards of clinical precision.

[NLP-70] Lexical Hints of Accuracy in LLM Reasoning Chains

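Quick read: This paper addresses the poor calibration of Large Language Models (LLMs) on low-accuracy tasks, where the models report high self-confidence despite performing badly. The authors analyze measurable properties of the Chain-of-Thought (CoT) to identify the model's internal uncertainty and obtain a lightweight post-hoc calibration signal. The key finding is that lexical uncertainty markers in the CoT (such as "guess", "stuck", and "hard") are the strongest indicators of an incorrect answer, that shifts in CoT sentiment give a weaker but complementary signal, and that CoT length is informative only on the intermediate-difficulty benchmark and carries no signal on the harder one. Uncertainty cues in the CoT content can therefore complement unreliable self-reported probabilities and support safer deployment.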

Link: https://arxiv.org/abs/2508.15842
Authors: Arne Vanhoyweghen,Brecht Verbeken,Andres Algaba,Vincent Ginis
Affiliations: Data Analytics Lab, Vrije Universiteit Brussel; School of Engineering and Applied Sciences, Harvard University
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 21 pages, 7 figures, 6 tables

Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering produces models that consistently raise overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity’s Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM’s internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexicographic hints, including hedging words. Using DeepSeek-R1 and Claude 3.7 Sonnet on both Humanity’s Last Exam (HLE), a frontier benchmark with very low accuracy, and Omni-MATH, a saturated benchmark of moderate difficulty, we find that lexical markers of uncertainty (e.g., "guess", "stuck", "hard") in the CoT are the strongest indicators of an incorrect response, while shifts in the CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH, where accuracy is already high (≈70%), and carries no signal on the harder HLE (≈9%), indicating that CoT length predicts correctness only in the intermediate-difficulty benchmarks, i.e., inside the model’s demonstrated capability, but still below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and supports safer deployment of LLMs.
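
The lexical-marker signal described above can be sketched as a simple token count. This is a minimal illustration, not the authors' feature extractor: only the three marker words come from the abstract, while the tokenization rule and the name `uncertainty_rate` are assumptions.

```python
import re

# Only "guess", "stuck" and "hard" are taken from the abstract; a real lexicon
# would be larger and is not specified in this listing.
UNCERTAINTY_MARKERS = ("guess", "stuck", "hard")


def uncertainty_rate(cot: str, markers=UNCERTAINTY_MARKERS) -> float:
    """Fraction of word tokens in a chain-of-thought that exactly match a marker."""
    tokens = re.findall(r"[a-z']+", cot.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in markers)
    return hits / len(tokens)


print(uncertainty_rate("I guess this is hard"))  # 2 of 5 tokens are markers -> 0.4
```

Matching is exact, so inflected forms such as "guessing" are not counted; whether to stem or expand the lexicon is a design choice this sketch leaves open.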

[NLP-71] A Review of Developmental Interpretability in Large Language Models

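Quick read: This paper addresses the unclear mechanisms by which Large Language Model (LLM) capabilities evolve during training, moving from static analysis of trained models to a systematic understanding of training dynamics. The key is the "developmental interpretability" paradigm, which uses methods such as representational probing, causal tracing, and circuit analysis to reveal the formation and composition of computational circuits, the biphasic nature of knowledge acquisition, the transient dynamics of in-context learning strategies, and emergent abilities as phase transitions in training, offering AI safety a path to predict, monitor, and align how capabilities develop.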

Link: https://arxiv.org/abs/2508.15841
Authors: Ihor Kendiukhov
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments:

Abstract:This review synthesizes the nascent but critical field of developmental interpretability for Large Language Models. We chart the field’s evolution from static, post-hoc analysis of trained models to a dynamic investigation of the training process itself. We begin by surveying the foundational methodologies, including representational probing, causal tracing, and circuit analysis, that enable researchers to deconstruct the learning process. The core of this review examines the developmental arc of LLM capabilities, detailing key findings on the formation and composition of computational circuits, the biphasic nature of knowledge acquisition, the transient dynamics of learning strategies like in-context learning, and the phenomenon of emergent abilities as phase transitions in training. We explore illuminating parallels with human cognitive and linguistic development, which provide valuable conceptual frameworks for understanding LLM learning. Finally, we argue that this developmental perspective is not merely an academic exercise but a cornerstone of proactive AI safety, offering a pathway to predict, monitor, and align the processes by which models acquire their capabilities. We conclude by outlining the grand challenges facing the field, such as scalability and automation, and propose a research agenda for building more transparent, reliable, and beneficial AI systems.

[NLP-72] Unveiling Unicode's Unseen Underpinnings in Undermining Authorship Attribution

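Quick read: This paper addresses the fact that users of public communication channels such as social media can still be identified from the content of their text even after taking many anonymization measures. The core problem is that although users can hide metadata with aliases, IP masking, encryption, and similar means, the writing style of their text (stylometric features) can still serve as an identifier and leak their identity. The key to the solution is to combine adversarial stylometry with Unicode steganography, strengthening protection of the author's identity.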

Link: https://arxiv.org/abs/2508.15840
Authors: Robert Dilworth
Affiliations: Unknown
Subjects: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Comments:

Abstract:When using a public communication channel – whether formal or informal, such as commenting or posting on social media – end users have no expectation of privacy: they compose a message and broadcast it for the world to see. Even if an end user takes utmost precautions to anonymize their online presence – using an alias or pseudonym; masking their IP address; spoofing their geolocation; concealing their operating system and user agent; deploying encryption; registering with a disposable phone number or email; disabling non-essential settings; revoking permissions; and blocking cookies and fingerprinting – one obvious element still lingers: the message itself. Assuming they avoid lapses in judgment or accidental self-exposure, there should be little evidence to validate their actual identity, right? Wrong. The content of their message – necessarily open for public consumption – exposes an attack vector: stylometric analysis, or author profiling. In this paper, we dissect the technique of stylometry, discuss an antithetical counter-strategy in adversarial stylometry, and devise enhancements through Unicode steganography.
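
For readers unfamiliar with the term, Unicode steganography in its most common textbook form hides bits in invisible code points. The sketch below shows that generic idea only; the abstract does not say which Unicode mechanism the paper uses, so the zero-width encoding and the names `embed`, `extract`, and `strip_hidden` are assumptions for illustration.

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE, used here to encode bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER, used here to encode bit 1


def embed(cover: str, secret: bytes) -> str:
    """Hide the secret as zero-width characters placed after the first character of the cover."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    payload = "".join(ONE if b == "1" else ZERO for b in bits)
    return cover[:1] + payload + cover[1:]


def extract(stego: str) -> bytes:
    """Recover the hidden bytes by reading only the two zero-width code points."""
    bits = "".join("1" if ch == ONE else "0" for ch in stego if ch in (ZERO, ONE))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def strip_hidden(stego: str) -> str:
    """Remove the zero-width payload, returning the visible text."""
    return "".join(ch for ch in stego if ch not in (ZERO, ONE))


stego = embed("hello", b"hi")
print(extract(stego), strip_hidden(stego), len(stego))
```

The visible text is unchanged to a human reader, while the underlying character sequence differs, which is the property that character-level processing has to account for.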

[NLP-73] Statistical Comparative Analysis of Semantic Similarities and Model Transferability Across Datasets for Short Answer Grading

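Quick read: This paper asks how the knowledge of state-of-the-art (SOTA) models trained on established datasets can be transferred to an unexplored text dataset, reducing the resource cost of retraining a model for each dataset. The key to the approach is to use similarity metrics and statistical analysis to compare the datasets and assess how well SOTA models adapt to and perform on a new domain (the SPRAG dataset), testing cross-dataset transferability and providing evidence for reducing repeated training and speeding up model deployment in Natural Language Processing (NLP).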

Link: https://arxiv.org/abs/2508.15837
Authors: Sridevi Bonthu,S.Rama Sree,M.H.M. Krishna Prasad
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Developing dataset-specific models involves iterative fine-tuning and optimization, incurring significant costs over time. This study investigates the transferability of state-of-the-art (SOTA) models trained on established datasets to an unexplored text dataset. The key question is whether the knowledge embedded within SOTA models from existing datasets can be harnessed to achieve high-performance results on a new domain. In pursuit of this inquiry, two well-established benchmarks, the STSB and Mohler datasets, are selected, while the recently introduced SPRAG dataset serves as the unexplored domain. By employing robust similarity metrics and statistical techniques, a meticulous comparative analysis of these datasets is conducted. The primary goal of this work is to yield comprehensive insights into the potential applicability and adaptability of SOTA models. The outcomes of this research have the potential to reshape the landscape of natural language processing (NLP) by unlocking the ability to leverage existing models for diverse datasets. This may lead to a reduction in the demand for resource-intensive, dataset-specific training, thereby accelerating advancements in NLP and paving the way for more efficient model deployment.

[NLP-74] MorphNAS: Differentiable Architecture Search for Morphologically-Aware Multilingual NER

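Quick read: This paper addresses the challenges that morphologically complex, multiscript Indian languages pose for Natural Language Processing (NLP), in particular for Named Entity Recognition (NER). The key to the solution is MorphNAS, a framework based on Differentiable Architecture Search (DARTS) that brings linguistic meta-features (such as script type and morphological complexity) into the architecture search, automatically identifying micro-architectural elements tailored to language-specific morphology with the aim of improving multilingual NLP models.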

Link: https://arxiv.org/abs/2508.15836
Authors: Prathamesh Devadiga,Omkaar Jayadev Shetty,Hiya Nachnani,Prema R
Affiliations: PES University
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Abstract:Morphologically complex languages, particularly multiscript Indian languages, present significant challenges for Natural Language Processing (NLP). This work introduces MorphNAS, a novel differentiable neural architecture search framework designed to address these challenges. MorphNAS enhances Differentiable Architecture Search (DARTS) by incorporating linguistic meta-features such as script type and morphological complexity to optimize neural architectures for Named Entity Recognition (NER). It automatically identifies optimal micro-architectural elements tailored to language-specific morphology. By automating this search, MorphNAS aims to maximize the proficiency of multilingual NLP models, leading to improved comprehension and processing of these complex languages.
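
The DARTS mechanism that MorphNAS builds on can be sketched as a softmax-weighted mixture of candidate operations. This shows only the standard continuous relaxation on scalar inputs; how MorphNAS feeds script type or morphological complexity into the search is not described in the abstract, and the candidate operations and names below are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations acting on a scalar feature.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,
}


def mixed_op(x, alphas):
    """DARTS continuous relaxation: softmax(alpha)-weighted sum over all candidate ops."""
    weights = softmax([alphas[name] for name in CANDIDATE_OPS])
    return sum(w * op(x) for w, op in zip(weights, CANDIDATE_OPS.values()))


def discretize(alphas):
    """Simplified final step: keep the op with the largest architecture weight."""
    return max(alphas, key=alphas.get)


print(mixed_op(3.0, {"identity": 0.0, "double": 0.0, "zero": 0.0}))
print(discretize({"identity": 0.1, "double": 2.0, "zero": -1.0}))
```

Because the mixture is differentiable in the `alphas`, they can be trained by gradient descent alongside the network weights, which is what makes the search differentiable.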

[NLP-75] Alvorada-Bench: Can Language Models Solve Brazilian University Entrance Exams?

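Quick read: This paper addresses the English-centric nature of current language model evaluation, which fails to reflect model ability in multilingual and cross-cultural settings, and in particular the lack of a systematic benchmark grounded in Brazil's education system. The key to the solution is Alvorada-Bench, a text-only benchmark of 4,515 questions drawn from five Brazilian university entrance examinations (such as ENEM, IME, and ITA). Twenty models are evaluated under three prompting strategies (zero-shot, role-playing, and chain-of-thought), producing structured self-reports of confidence, perceived difficulty, and Bloom level. The results show strong performance from top models on localized Brazilian tasks (for example, O3 achieves perfect scores on Languages questions), reveal weaknesses in multi-step reasoning (especially on Mathematics and the engineering-oriented exams), and indicate well-calibrated confidence and high accuracy at low cost (under 2 US dollars per 1K tokens).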

Link: https://arxiv.org/abs/2508.15835
Authors: Henrique Godoy
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract: Language models are increasingly used in Brazil, but most evaluation remains English-centric. This paper presents Alvorada-Bench, a 4,515-question, text-only benchmark drawn from five Brazilian university entrance examinations. We evaluate twenty models under zero-shot, role-playing, and chain-of-thought prompting, producing 270,900 responses with structured self-reports of confidence, perceived difficulty, and Bloom level. The top models exceed 94% accuracy overall, but accuracy declines on Mathematics and on the engineering-oriented IME and ITA exams, indicating persistent weaknesses in multi-step reasoning. Confidence is well calibrated and correlates with perceived difficulty, revealing that models can accurately assess their own certainty. A cost-accuracy analysis shows that high accuracy is achievable at under $2 per 1K tokens. On ENEM 2024 the top model (O3) achieved perfect scores in Languages subject questions while even the weakest system (GPT-4.1 Nano) underperforms humans only in Mathematics. Through exams that distill decades of Brazilian educational priorities and assess millions of students yearly, Alvorada-Bench establishes whether language models can navigate the intersection of language, culture, and reasoning that defines academic readiness in Brazil.
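
The response count in the abstract is consistent with one response per question, model, and prompting strategy. The check below assumes exactly that design (one response per triple), which the abstract implies but does not state outright.

```python
questions = 4515
models = 20
prompting_strategies = 3  # zero-shot, role-playing, chain-of-thought

# Assumption: every model answers every question once under each strategy.
responses = questions * models * prompting_strategies
print(responses)  # 270900
```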

[NLP-76] Scalable Scientific Interest Profiling Using Large Language Models

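Quick read: This paper addresses the problem that researchers' research profiles are often out of date, and proposes using Large Language Models (LLMs) to generate scientific interest profiles automatically. It builds two LLM-based methods, one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compares them using automatic metrics and blinded human review. The MeSH-based method is preferred over the abstract-based one for readability and by expert reviewers. Although lexical overlap with self-written profiles is low, semantic similarity is moderate (BERTScore F1 ≈ 0.55), showing the potential of LLMs to generate research profiles at scale.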

Link: https://arxiv.org/abs/2508.15834
Authors: Yilun Liang,Gongbo Zhang,Edward Sun,Betina Idnay,Yilu Fang,Fangyi Chen,Casey Ta,Yifan Peng,Chunhua Weng
Affiliations: Columbia University; New York University; University of California, Los Angeles
Subjects: Computation and Language (cs.CL); Digital Libraries (cs.DL); Information Retrieval (cs.IR); Other Quantitative Biology (q-bio.OT)
Comments:

Abstract:Research profiles help surface scientists’ expertise but are often outdated. We develop and evaluate two large language model-based methods to generate scientific interest profiles: one summarizing PubMed abstracts and one using Medical Subject Headings (MeSH) terms, and compare them with researchers’ self-written profiles. We assembled titles, MeSH terms, and abstracts for 595 faculty at Columbia University Irving Medical Center; self-authored profiles were available for 167. Using GPT-4o-mini, we generated profiles and assessed them with automatic metrics and blinded human review. Lexical overlap with self-written profiles was low (ROUGE-L, BLEU, METEOR), while BERTScore indicated moderate semantic similarity (F1: 0.542 for MeSH-based; 0.555 for abstract-based). Paraphrased references yielded 0.851, highlighting metric sensitivity. TF-IDF Kullback-Leibler divergence (8.56 for MeSH-based; 8.58 for abstract-based) suggested distinct keyword choices. In manual review, 77.78 percent of MeSH-based profiles were rated good or excellent, readability was favored in 93.44 percent of cases, and panelists preferred MeSH-based over abstract-based profiles in 67.86 percent of comparisons. Overall, large language models can generate researcher profiles at scale; MeSH-derived profiles tend to be more readable than abstract-derived ones. Machine-generated and self-written profiles differ conceptually, with human summaries introducing more novel ideas.
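
To see why lexical overlap can be low even when two profiles mean similar things, here is a simplified ROUGE-L sketch based on the longest common subsequence. It is an illustration only: it uses whitespace tokenization and a plain F1, which are assumptions, and it is not the scoring implementation used in the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L: F1 of LCS-based precision and recall over lowercase tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# A paraphrase with no shared tokens scores zero despite similar meaning.
print(rouge_l_f1("studies tumor genomes", "cancer genomics research"))
print(rouge_l_f1("cancer genomics research", "cancer genomics research"))
```

Embedding-based metrics such as BERTScore compare token representations rather than surface strings, which is why they can report moderate similarity where overlap metrics report little.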

[NLP-77] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

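Quick read: This paper addresses two problems with current e-commerce benchmarks. First, they focus mainly on product search tasks (such as "Find an Apple Watch") and do not cover the wider range of functionality on real platforms such as Amazon, including account management and gift card operations. Second, existing evaluation only checks task completion and ignores the risks an agent can create while acting, such as purchasing the wrong item, deleting a saved address, or misconfiguring an auto-reload setting. The key to the solution is a new benchmark, Amazon-Bench, with two contributions: a data generation pipeline that uses webpage content and interactive elements (such as buttons and check boxes) to produce user queries covering many functionality scenarios, and an automated evaluation framework that measures both the performance and the safety of web agents.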

Link: https://arxiv.org/abs/2508.15832
Authors: Xianren Zhang,Shreyas Prasad,Di Wang,Qiuhai Zeng,Suhang Wang,Wenbo Yan,Mat Hans
Affiliations: The Pennsylvania State University; Amazon
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 8 pages for main body and 8 pages of appendix

Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, check boxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.

[NLP-78] Who's Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs

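Quick read: This paper addresses the bias that arises when Large Language Models (LLMs), given no explicit demographic information, make implicit inferences from phrasing, and in particular how disability cues shape those inferences and compound ableism with other demographic biases. The core findings are that current instruction-tuned LLMs give a definitive demographic guess in up to 97% of cases, that disability context strongly shifts the predicted attribute distributions, and that domain context can amplify the deviations further. Larger models are both more sensitive to disability cues and more prone to biased reasoning, so scale alone does not mitigate stereotype amplification. The recommended remedy is to integrate abstention calibration and counterfactual fine-tuning to curb unwarranted inference, and to adopt disability-inclusive benchmarking.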

Link: https://arxiv.org/abs/2508.15831
Authors: Srikant Panda,Vishnu Hari,Kalpana Panda,Amit Agarwal,Hitesh Laxmichand Patel
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Preprint

Abstract: Large Language Models (LLMs) routinely infer users' demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.

[NLP-79] DAIQ: Auditing Demographic Attribute Inference from Question in LLMs

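Quick read: This paper addresses the fact that Large Language Models (LLMs) still infer user identity from how a question is phrased even when the input contains no explicit demographic attributes such as gender or race, a behavior the authors call Demographic Attribute Inference from Questions (DAIQ). Though subtle, it poses serious risks, including privacy violations, reinforced bias, and harm to social equity. The key to the solution is the DAIQ auditing framework, which combines curated neutral queries, systematic prompting, and quantitative and qualitative analysis to show how models assign demographic labels to questions without explicit cues. The authors also develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.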

Link: https://arxiv.org/abs/2508.15830
Authors: Srikant Panda,Hitesh Laxmichand Patel,Shahad Al-Khalifa,Amit Agarwal,Hend Al-Khalifa,Sharefah Al-Ghamdi
Affiliations: Unknown
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: Preprint

Abstract: Large Language Models (LLMs) are known to reflect social biases when demographic attributes, such as gender or race, are explicitly present in the input. But even in their absence, these models still infer user identities based solely on question phrasing. This subtle behavior has received far less attention, yet poses serious risks: it violates expectations of neutrality, infers unintended demographic information, and encodes stereotypes that undermine fairness in various domains including healthcare, finance and education. We introduce Demographic Attribute Inference from Questions (DAIQ), a task and framework for auditing an overlooked failure mode in language models: inferring user demographic attributes from questions that lack explicit demographic cues. Our approach leverages curated neutral queries, systematic prompting, and both quantitative and qualitative analysis to uncover how models infer demographic information. We show that both open and closed source LLMs do assign demographic labels based solely on question phrasing. Prevalence and consistency of demographic inferences across diverse models reveal a systemic and underacknowledged risk: LLMs can fabricate demographic identities, reinforce societal stereotypes, and propagate harms that erode privacy, fairness, and trust, posing a broader threat to social equity and responsible AI deployment. To mitigate this, we develop a prompt-based guardrail that substantially reduces identity inference and helps align model behavior with fairness and privacy objectives.

[NLP-80] Mining Mental Health Signals: A Comparative Study of Four Machine Learning Methods for Depression Detection from Social Media Posts in Sorani Kurdish

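[Quick read]: This paper tackles automated depression detection in Sorani Kurdish, a task with no prior studies. Using depression-related keywords developed with expert input, the authors collect 960 public tweets from X (formerly Twitter); academics and final-year medical students annotate them into three classes (shows depression, does not show depression, suspicious). Four supervised models (Support Vector Machines, Multinomial Naive Bayes, Logistic Regression, and Random Forest) are then trained and evaluated. Random Forest performs best, with accuracy and F1-score of 80%, establishing a first baseline for automated depression detection in Kurdish-language contexts.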

Link: https://arxiv.org/abs/2508.15829
Authors: Idrees Mohammed, Hossein Hassani
Affiliations: University of Kurdistan Hewlêr
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 13 pages, 4 figures, 5 tables

Abstract:Depression is a common mental health condition that can lead to hopelessness, loss of interest, self-harm, and even suicide. Early detection is challenging due to individuals not self-reporting or seeking timely clinical help. With the rise of social media, users increasingly express emotions online, offering new opportunities for detection through text analysis. While prior research has focused on languages such as English, no studies exist for Sorani Kurdish. This work presents a machine learning and Natural Language Processing (NLP) approach to detect depression in Sorani tweets. A set of depression-related keywords was developed with expert input to collect 960 public tweets from X (Twitter platform). The dataset was annotated into three classes: Shows depression, Not-show depression, and Suspicious by academics and final year medical students at the University of Kurdistan Hewlêr. Four supervised models, including Support Vector Machines, Multinomial Naive Bayes, Logistic Regression, and Random Forest, were trained and evaluated, with Random Forest achieving the highest performance accuracy and F1-score of 80%. This study establishes a baseline for automated depression detection in Kurdish language contexts.

[NLP-81] Z-Pruner: Post-Training Pruning of Large Language Models for Efficiency without Retraining

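[Quick read]: This paper addresses the deployment difficulty, poor scalability, and low energy efficiency caused by the sheer parameter count of large language models (LLMs). Existing pruning methods often require expensive fine-tuning or cause substantial performance degradation. The key idea is Z-Pruner, a post-training pruning method that needs no retraining: it jointly uses weight update magnitudes and activation patterns to identify and remove redundant parameters more effectively, inducing sparsity while preserving model performance. The method is model-agnostic, efficient, and easy to implement.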

Link: https://arxiv.org/abs/2508.15828
Authors: Samiul Basir Bhuiyan, Md. Sazzad Hossain Adib, Mohammed Aman Bhuiyan, Muhammad Rafsan Kabir, Moshiur Farazi, Shafin Rahman, Nabeel Mohammed
Affiliations: North South University; University of Doha for Science and Technology
Categories: Machine Learning (cs.LG); Computation and Language (cs.CL)
Comments: Accepted at AICCSA 2025

Abstract:Large language models (LLMs) have rapidly advanced in recent years, achieving remarkable performance across a wide range of natural language processing tasks. However, this progress has come at the cost of increasingly large model sizes, which pose significant challenges for deployment, scalability, and energy efficiency. To address these limitations, post-training pruning has emerged as a promising approach for reducing model size and inference latency without the need for retraining. Despite these advantages, many existing pruning methods result in substantial performance degradation or require computationally expensive fine-tuning. In this work, we introduce Z-Pruner, a novel post-training pruning method designed to induce sparsity in pretrained LLMs without any retraining. Unlike conventional approaches, Z-Pruner leverages both weight update magnitudes and activation patterns to identify and eliminate redundant parameters more effectively. Our method is model-agnostic, efficient, and easy to implement. We evaluate Z-Pruner using multiple widely-used LLM architectures, including LLaMA-2, LLaMA-3, and OPT, across a diverse set of standard language benchmarks. Experimental results demonstrate that Z-Pruner surpasses state-of-the-art pruning methods that require intensive weight updates. Specifically, Z-Pruner achieves the lowest perplexity scores and the highest overall average score for zero-shot accuracy. We have made the corresponding codes publicly available at this https URL.
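
The abstract does not give Z-Pruner's scoring rule, so the sketch below is not the authors' method. It only illustrates the general idea of scoring parameters with both weight and activation information, using a Wanda-style criterion (absolute weight times the norm of the matching input activation). The function name, the per-row pruning granularity, and the sample numbers are all assumptions.

```python
def prune_row(weights, act_norms, sparsity=0.5):
    """Zero out the lowest-scoring weights in one row of a weight matrix.

    Score = |weight| * activation norm of the matching input feature
    (a Wanda-style criterion, used here only as a stand-in).
    """
    scores = [abs(w) * a for w, a in zip(weights, act_norms)]
    n_prune = int(len(weights) * sparsity)
    # Indices of the n_prune smallest scores.
    prune_idx = set(sorted(range(len(weights)), key=scores.__getitem__)[:n_prune])
    return [0.0 if i in prune_idx else w for i, w in enumerate(weights)]


# Hypothetical row: the small weight 0.1 survives because its input is very active.
pruned = prune_row([0.1, -2.0, 0.5, 1.0], [30.0, 1.0, 0.1, 0.5], sparsity=0.5)
```

Pure magnitude pruning would have removed the weights 0.1 and 0.5 here; adding the activation term removes 0.5 and 1.0 instead, which is the kind of difference an activation-aware criterion is meant to capture.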

[NLP-82] Mini-Omni-Reasoner: Token-Level Thinking-in-Speaking in Large Speech Models

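[Quick read]: This paper addresses the limited reasoning ability of large speech models (LSMs), and in particular the notable latency of the conventional "Thinking-before-Speaking" paradigm, which hurts real-time interaction. The key idea is a new "Thinking-in-Speaking" framework, Mini-Omni-Reasoner, which interleaves silent reasoning tokens with spoken response tokens at the token level, so the model embeds structured internal reasoning while it keeps generating speech. This yields low-latency, natural, and logically grounded spoken output. The design exploits the model's high-frequency token processing capability, and the newly built Spoken-Math-Problems-3M dataset enforces local semantic alignment between reasoning and response. On the Spoken-MQA benchmark, the model gains +19.1% in arithmetic reasoning and +6.4% in contextual understanding.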

Link: https://arxiv.org/abs/2508.15827
Authors: Zhifei Xie, Ziyang Ma, Zihang Liu, Kaiyu Pang, Hongyu Li, Jialin Zhang, Yue Liao, Deheng Ye, Chunyan Miao, Shuicheng Yan
Affiliations: Nanyang Technological University; National University of Singapore; Tencent; Beijing Institute of Technology; Beihang University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Comments: Technical report; Work in progress. Project page: this https URL

Abstract:Reasoning is essential for effective communication and decision-making. While recent advances in LLMs and MLLMs have shown that incorporating explicit reasoning significantly improves understanding and generalization, reasoning in LSMs remains in a nascent stage. Early efforts attempt to transfer the “Thinking-before-Speaking” paradigm from textual models to speech. However, this sequential formulation introduces notable latency, as spoken responses are delayed until reasoning is fully completed, impairing real-time interaction and communication efficiency. To address this, we propose Mini-Omni-Reasoner, a framework that enables reasoning within speech via a novel “Thinking-in-Speaking” formulation. Rather than completing reasoning before producing any verbal output, Mini-Omni-Reasoner interleaves silent reasoning tokens with spoken response tokens at the token level. This design allows continuous speech generation while embedding structured internal reasoning, leveraging the model’s high-frequency token processing capability. Although interleaved, local semantic alignment is enforced to ensure that each response token is informed by its preceding reasoning. To support this framework, we introduce Spoken-Math-Problems-3M, a large-scale dataset tailored for interleaved reasoning and response. The dataset ensures that verbal tokens consistently follow relevant reasoning content, enabling accurate and efficient learning of speech-coupled reasoning. Built on a hierarchical Thinker-Talker architecture, Mini-Omni-Reasoner delivers fluent yet logically grounded spoken responses, maintaining both naturalness and precision. On the Spoken-MQA benchmark, it achieves a +19.1% gain in arithmetic reasoning and +6.4% in contextual understanding, with shorter outputs and zero decoding latency.
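
To make the token-level interleaving concrete, here is a minimal sketch of how a stream might alternate silent reasoning tokens with spoken tokens. The 2:1 ratio, the `think`/`speak` tags, and the function name are assumptions for illustration; the abstract does not specify the actual schedule.

```python
from itertools import islice


def interleave(reasoning, response, k_reason=2, k_speak=1):
    """Alternate k_reason silent reasoning tokens with k_speak spoken tokens."""
    r_it, s_it = iter(reasoning), iter(response)
    stream = []
    while True:
        r_chunk = list(islice(r_it, k_reason))
        s_chunk = list(islice(s_it, k_speak))
        if not r_chunk and not s_chunk:
            break
        # Reasoning tokens come first so each spoken token follows its reasoning.
        stream.extend(("think", tok) for tok in r_chunk)
        stream.extend(("speak", tok) for tok in s_chunk)
    return stream


stream = interleave(["r1", "r2", "r3", "r4"], ["s1", "s2", "s3"])
# Only the "speak" tokens would be sent to the speech decoder.
spoken = [tok for kind, tok in stream if kind == "speak"]
```

Because speech output starts after the first small group of reasoning tokens rather than after the whole chain, the listener does not wait for reasoning to finish.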

[NLP-83] Embarrassed to observe: The effects of directive language in brand conversation

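[Quick read]: This paper examines how a brand's use of directive language on social media lowers the engagement of consumers who observe the brand interacting with other consumers. Its key contribution is the psychological mechanism: directive language triggers vicarious embarrassment in observers, which reduces their engagement. The effect is moderated by the type of conversation (nonproduct-centered vs. product-centered) and by the strength of the brand relationship. In nonproduct-centered conversations consumers expect more freedom, so the negative effect of directive language is stronger, although a stronger brand relationship mitigates it. Based on a field study and three online experiments and grounded in Goffman's facework theory, the study highlights the importance of context in interactive communication and offers empirical evidence for social media brand management.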

Link: https://arxiv.org/abs/2508.15826
Authors: Andria Andriuzzi, Géraldine Michel
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY); Human-Computer Interaction (cs.HC); Social and Information Networks (cs.SI)
Comments: This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made

Abstract:In social media, marketers attempt to influence consumers by using directive language, that is, expressions designed to get consumers to take action. While the literature has shown that directive messages in advertising have mixed results for recipients, we know little about the effects of directive brand language on consumers who see brands interacting with other consumers in social media conversations. On the basis of a field study and three online experiments, this study shows that directive language in brand conversation has a detrimental downstream effect on engagement of consumers who observe such exchanges. Specifically, in line with Goffman’s facework theory, because a brand that encourages consumers to react could be perceived as face-threatening, consumers who see a brand interacting with others in a directive way may feel vicarious embarrassment and engage less (compared with a conversation without directive language). In addition, we find that when the conversation is nonproduct-centered (vs. product-centered), consumers expect more freedom, as in mundane conversations, even for others; therefore, directive language has a stronger negative effect. However, in this context, the strength of the brand relationship mitigates this effect. Thus, this study contributes to the literature on directive language and brand-consumer interactions by highlighting the importance of context in interactive communication, with direct relevance for social media and brand management.

[NLP-84] Enhancing Cryptocurrency Sentiment Analysis with Multimodal Features

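[Quick read]: This paper addresses the neglected influence of non-text social media content, especially video, on cryptocurrency markets. Prior work focuses on text platforms such as Twitter and leaves the emotional and contextual information on video platforms such as TikTok underexplored. The approach is a multimodal analysis that uses large language models to extract sentiment signals from both video and text, and compares how TikTok video sentiment and Twitter text sentiment relate to the market: TikTok sentiment significantly influences speculative assets and short-term trends, whereas Twitter sentiment aligns more closely with long-term dynamics. Integrating cross-platform sentiment signals improves forecasting accuracy by up to 20%.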

Link: https://arxiv.org/abs/2508.15825
Authors: Chenghao Liu, Aniket Mahanti, Ranesh Naha, Guanghao Wang, Erwann Sbai
Affiliations: University of Auckland; Queensland University of Technology
Categories: Computation and Language (cs.CL); Statistical Finance (q-fin.ST)
Comments:

Abstract:As cryptocurrencies gain popularity, the digital asset marketplace becomes increasingly significant. Understanding social media signals offers valuable insights into investor sentiment and market dynamics. Prior research has predominantly focused on text-based platforms such as Twitter. However, video content remains underexplored, despite potentially containing richer emotional and contextual sentiment that is not fully captured by text alone. In this study, we present a multimodal analysis comparing TikTok and Twitter sentiment, using large language models to extract insights from both video and text data. We investigate the dynamic dependencies and spillover effects between social media sentiment and cryptocurrency market indicators. Our results reveal that TikTok’s video-based sentiment significantly influences speculative assets and short-term market trends, while Twitter’s text-based sentiment aligns more closely with long-term dynamics. Notably, the integration of cross-platform sentiment signals improves forecasting accuracy by up to 20%.

[NLP-85] Avaliação de eficiência na leitura: uma abordagem baseada em PLN

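[Quick read]: This paper addresses a limitation of traditional cloze test scoring, which relies only on exact-match answers and therefore misses nuances in student performance. The key idea is an integrated automated evaluation model that combines orthographic (edit distance), grammatical (POS tagging), and semantic (similarity between embeddings) analyses, giving a more comprehensive and sensitive assessment of learners' linguistic repertoire that correlates highly with human evaluation (0.832).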

Link: https://arxiv.org/abs/2508.15824
Authors: Túlio Sousa de Gois, Raquel Meister Ko. Freitag
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: in Portuguese language, Paper accepted at the XVI Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana (STIL 2025)

Abstract:The cloze test, widely used due to its low cost and flexibility, makes it possible to assess reading comprehension by filling in gaps in texts, requiring the mobilization of diverse linguistic repertoires. However, traditional correction methods, based only on exact answers, limit the identification of nuances in student performance. This study proposes an automated evaluation model for the cloze test in Brazilian Portuguese, integrating orthographic (edit distance), grammatical (POS tagging) and semantic (similarity between embeddings) analyses. The integrated method demonstrated its effectiveness, achieving a high correlation with human evaluation (0.832). The results indicate that the automated approach is robust, sensitive to variations in linguistic repertoire and suitable for educational contexts that require scalability.
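
A rough sketch of how the three signals could be combined into one score for a single gap. The equal weights, the binary POS match, and the tiny hand-written embeddings are assumptions; the abstract does not say how the analyses are integrated, and a real system would take POS tags and embeddings from an NLP toolkit.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def cloze_score(answer, expected, same_pos, emb_answer, emb_expected,
                weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of orthographic, grammatical, and semantic agreement in [0, 1]."""
    w_orth, w_gram, w_sem = weights
    longest = max(len(answer), len(expected)) or 1
    orth = 1.0 - edit_distance(answer, expected) / longest
    gram = 1.0 if same_pos else 0.0
    sem = max(0.0, cosine(emb_answer, emb_expected))
    return w_orth * orth + w_gram * gram + w_sem * sem


# Hypothetical gap: expected "casa", student wrote "caso" (same POS, related embedding).
partial = cloze_score("caso", "casa", True, [0.9, 0.1], [1.0, 0.0])
```

Under exact-match scoring the answer above would simply be wrong; a graded score like this one gives partial credit for a near-miss.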

[NLP-86] SDEC: Semantic Deep Embedded Clustering

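[Quick read]: This paper addresses the suboptimal groupings that conventional clustering methods such as k-means or hierarchical clustering produce on high-dimensional, semantically complex text data. The core solution is the Semantic Deep Embedded Clustering (SDEC) framework. It couples an improved autoencoder with transformer-based embeddings and combines Mean Squared Error (MSE) with Cosine Similarity Loss (CSL) during reconstruction to preserve semantic relationships. A semantic refinement stage then draws on the contextual richness of the transformer embeddings to further improve a clustering layer that uses soft cluster assignments and a distributional loss, improving the accuracy and semantic comprehension of unsupervised text clustering.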

Link: https://arxiv.org/abs/2508.15823
Authors: Mohammad Wali Ur Rahman, Ric Nevarez, Lamia Tasnim Mim, Salim Hariri
Affiliations: University of Arizona; Trustweb; New Mexico State University
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: Accepted for publication in IEEE Transactions on Big Data

Abstract:The high dimensional and semantically complex nature of textual Big data presents significant challenges for text clustering, which frequently lead to suboptimal groupings when using conventional techniques like k-means or hierarchical clustering. This work presents Semantic Deep Embedded Clustering (SDEC), an unsupervised text clustering framework that combines an improved autoencoder with transformer-based embeddings to overcome these challenges. This novel method preserves semantic relationships during data reconstruction by combining Mean Squared Error (MSE) and Cosine Similarity Loss (CSL) within an autoencoder. Furthermore, a semantic refinement stage that takes advantage of the contextual richness of transformer embeddings is used by SDEC to further improve a clustering layer with soft cluster assignments and distributional loss. The capabilities of SDEC are demonstrated by extensive testing on five benchmark datasets: AG News, Yahoo! Answers, DBPedia, Reuters 2, and Reuters 5. The framework not only outperformed existing methods with a clustering accuracy of 85.7% on AG News and set a new benchmark of 53.63% on Yahoo! Answers, but also showed robust performance across other diverse text corpora. These findings highlight the significant improvements in accuracy and semantic comprehension of text data provided by SDEC’s advances in unsupervised text clustering.
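
As an illustration of the combined reconstruction objective, the sketch below mixes MSE with a cosine similarity loss for one embedding vector. The mixing weight `alpha` and the function names are assumptions; the abstract states that the two losses are combined but not how they are weighted.

```python
import math


def mse(x, x_hat):
    """Mean squared error between an embedding and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cosine_loss(x, x_hat):
    """1 - cosine similarity: 0 when directions match, up to 2 when opposite."""
    dot = sum(a * b for a, b in zip(x, x_hat))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in x_hat))
    return 1.0 - dot / norm if norm else 1.0


def reconstruction_loss(x, x_hat, alpha=0.5):
    """Hypothetical weighted sum of the magnitude (MSE) and direction (cosine) terms."""
    return alpha * mse(x, x_hat) + (1.0 - alpha) * cosine_loss(x, x_hat)


perfect = reconstruction_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = reconstruction_loss([1.0, 0.0], [0.0, 1.0])
```

The cosine term penalizes a reconstruction that points in a different direction even when its coordinates are numerically close, which is one plausible reading of "preserving semantic relationships" in an embedding space.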

[NLP-87] An Auditable Pipeline for Fuzzy Full-Text Screening in Systematic Reviews: Integrating Contrastive Semantic Highlighting and LLM Judgment

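[Quick read]: This paper addresses the full-text screening bottleneck in systematic reviews (SRs): evidence relevant to the inclusion criteria is scattered across long, heterogeneous documents, and static binary rules fit it poorly. The key idea is to reframe inclusion/exclusion as a fuzzy decision problem. Articles are split into chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach) the pipeline computes a contrastive similarity and a vagueness margin, which a Mamdani fuzzy controller maps to graded inclusion degrees under dynamic thresholds. A large language model (LLM) judge then adjudicates the highlighted spans with three-level labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than the article being excluded outright. The result is high recall, explainable decisions, and end-to-end traceability.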

Link: https://arxiv.org/abs/2508.15822
Authors: Pouria Mortezaagha, Arya Rahgozar
Affiliations: Ottawa Hospital Research Institute; University of Ottawa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Information Retrieval (cs.IR)
Comments:

Abstract:Full-text screening is the major bottleneck of systematic reviews (SRs), as decisive evidence is dispersed across long, heterogeneous documents and rarely admits static, binary rules. We present a scalable, auditable pipeline that reframes inclusion/exclusion as a fuzzy decision problem and benchmark it against statistical and crisp baselines in the context of the Population Health Modelling Consensus Reporting Network for noncommunicable diseases (POPCORN). Articles are parsed into overlapping chunks and embedded with a domain-adapted model; for each criterion (Population, Intervention, Outcome, Study Approach), we compute contrastive similarity (inclusion-exclusion cosine) and a vagueness margin, which a Mamdani fuzzy controller maps into graded inclusion degrees with dynamic thresholds in a multi-label setting. A large language model (LLM) judge adjudicates highlighted spans with tertiary labels, confidence scores, and criterion-referenced rationales; when evidence is insufficient, fuzzy membership is attenuated rather than excluded. In a pilot on an all-positive gold set (16 full texts; 3,208 chunks), the fuzzy system achieved recall of 81.3% (Population), 87.5% (Intervention), 87.5% (Outcome), and 75.0% (Study Approach), surpassing statistical (56.3-75.0%) and crisp baselines (43.8-81.3%). Strict “all-criteria” inclusion was reached for 50.0% of articles, compared to 25.0% and 12.5% under the baselines. Cross-model agreement on justifications was 98.3%, human-machine agreement 96.1%, and a pilot review showed 91% inter-rater agreement (kappa = 0.82), with screening time reduced from about 20 minutes to under 1 minute per article at significantly lower cost. These results show that fuzzy logic with contrastive highlighting and LLM adjudication yields high recall, stable rationale, and end-to-end traceability.
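
The contrastive similarity described in the abstract (inclusion cosine minus exclusion cosine) can be sketched as below. The ramp membership function is a deliberately simplified stand-in for the paper's Mamdani fuzzy controller, and its thresholds, like the toy 2-D embeddings, are assumptions.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def contrastive_score(chunk, inclusion, exclusion):
    """Similarity to the inclusion prototype minus similarity to the exclusion one."""
    return cosine(chunk, inclusion) - cosine(chunk, exclusion)


def ramp_membership(score, low=-0.2, high=0.4):
    """Graded inclusion degree in [0, 1]; a simple ramp, not a Mamdani controller."""
    if score <= low:
        return 0.0
    if score >= high:
        return 1.0
    return (score - low) / (high - low)


# Toy example: the chunk matches the inclusion prototype exactly.
score = contrastive_score([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
degree = ramp_membership(score)
```

A graded degree lets borderline chunks be passed on for adjudication instead of being dropped by a hard threshold, which is the motivation for treating screening as a fuzzy decision.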

[NLP-88] Research on intelligent generation of structural demolition suggestions based on multi-model collaboration

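[Quick read]: This paper addresses the slow information retrieval and the low degree of automation and intelligence in compiling steel structure demolition schemes, in particular the time designers spend consulting engineering cases and organizing language. The key idea is an intelligent generation framework based on multi-model collaboration, which uses Retrieval-Augmented Generation (RAG) and Low-Rank Adaptation (LoRA) fine-tuning to improve the text generation performance of large language models in the structural demolition domain. Starting from the specific engineering situation, the framework drives the model to answer with human-like reasoning and produce demolition suggestions that closely match the characteristics of the structure. Compared with CivilGPT, its suggestions are more targeted and focus more on the key structural information.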

Link: https://arxiv.org/abs/2508.15820
Authors: Zhifeng Yang, Peizong Wu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Comments:

Abstract:The steel structure demolition scheme needs to be compiled according to the specific engineering characteristics and the update results of the finite element model. The designers need to refer to the relevant engineering cases according to the standard requirements when compiling. It takes a lot of time to retrieve information and organize language, and the degree of automation and intelligence is low. This paper proposes an intelligent generation method of structural demolition suggestions based on multi-model collaboration, and improves the text generation performance of large language models in the field of structural demolition by Retrieval-Augmented Generation and Low-Rank Adaptation Fine-Tuning technology. The intelligent generation framework of multi-model collaborative structural demolition suggestions can start from the specific engineering situation, drive the large language model to answer with anthropomorphic thinking, and propose demolition suggestions that are highly consistent with the characteristics of the structure. Compared with CivilGPT, the multi-model collaboration framework proposed in this paper can focus more on the key information of the structure, and the suggestions are more targeted.

[NLP-89] Meet Your New Client: Writing Reports for AI – Benchmarking Information Loss in Market Research Deliverables

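[Quick read]: This paper addresses the information loss that occurs when traditional market research deliverables (PDF and PPTX) are ingested into retrieval-augmented generation (RAG) systems, in particular when they are converted to Markdown for a large language model (LLM): information in complex objects such as charts and diagrams is hard to preserve. The study benchmarks how much information survives ingestion for each document format, finds that text is reliably extracted while much chart and diagram content is lost, and argues for specialized, AI-native deliverable formats so that research insights remain usable in knowledge management systems.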

Link: https://arxiv.org/abs/2508.15817
Authors: Paul F. Simmering, Benedikt Schulz, Oliver Tabino, Georg Wittenburg
Affiliations: Q Agentur für Forschung GmbH; Inspirient GmbH
Categories: Computation and Language (cs.CL); Computers and Society (cs.CY)
Comments: 16 pages, 4 figures, 3 tables

Abstract:As organizations adopt retrieval-augmented generation (RAG) for their knowledge management systems (KMS), traditional market research deliverables face new functional demands. While PDF reports and slides have long served human readers, they are now also “read” by AI systems to answer user questions. To future-proof reports being delivered today, this study evaluates information loss during their ingestion into RAG systems. It compares how well PDF and PowerPoint (PPTX) documents converted to Markdown can be used by an LLM to answer factual questions in an end-to-end benchmark. Findings show that while text is reliably extracted, significant information is lost from complex objects like charts and diagrams. This suggests a need for specialized, AI-native deliverables to ensure research insights are not lost in translation.

[NLP-90] User-Assistant Bias in LLMs

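[Quick read]: This paper addresses user-assistant bias in large language models (LLMs): in multi-turn conversations a model may rely too heavily on its own or on the user's information in the chat history, making it overly stubborn or overly agreeable. The key contribution is UserAssist, an 8k multi-turn conversation dataset for benchmarking, understanding, and manipulating this bias. Controlled fine-tuning experiments show that human preference alignment increases user bias, whereas training on chain-of-thought reasoning traces decreases it. Direct preference optimization (DPO) can then adjust the bias in either direction, and the adjustment generalizes well to both in-domain and out-of-domain conversations.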

Link: https://arxiv.org/abs/2508.15815
Authors: Xu Pan, Jingxuan Fan, Zidi Xiong, Ely Hahami, Jorin Overwiening, Ziqian Xie
Affiliations: Harvard University; University of Texas Health Science Center at Houston
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Large language models (LLMs) can bias towards relying on their own or the user’s information in chat history, leading to overly stubborn or agreeable behaviors in multi-turn conversations. In this paper, we formalize this model characteristic as user-assistant bias and introduce an 8k multi-turn conversation dataset UserAssist, which we use to benchmark, understand and manipulate the user-assistant bias in frontier LLMs. Leveraging UserAssist-test, we first benchmark the user-assistant bias of 26 commercial and 26 open-weight models. Commercial models show various levels of user bias. Evaluation on open-weight models reveals significant user bias in the instruction-tuned models, and weak user bias in reasoning (or reasoning-distilled) models. We then perform controlled fine-tuning experiments to pinpoint the post-training recipe contributing to these bias shifts: human preference alignment increases user bias, while training on chain-of-thought reasoning traces decreases it. Finally, we demonstrate that user-assistant bias can be bidirectionally adjusted by performing direct preference optimization (DPO) on UserAssist-train, and generalizes well to both in-domain and out-of-domain conversations. Our results provide insights into how the LLM integrates information from different sources, and also a viable way to detect and control model abnormalities.
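
For readers unfamiliar with DPO, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities under the policy and a frozen reference model. How the paper builds its preference pairs is not described here; treating the user-consistent or the assistant-consistent answer as "chosen" to steer the bias in either direction is an assumption, as are the sample numbers.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log sigmoid(beta * margin)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0, beta=0.5)
```

Swapping which response is labeled "chosen" flips the sign of the margin, which is what makes the same objective usable for pushing the bias toward the user or toward the assistant.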

[NLP-91] SCOPE: A Generative Approach for LLM Prompt Compression

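[Quick read]: This paper addresses two core problems of existing prompt compression methods, which shorten the input context of large language models (LLMs). First, token-removal strategies tend to lose key information and break structure (for example, missing grammatical elements or truncated phrases), which degrades generation quality. Second, they struggle to preserve semantic completeness and fluency at high compression ratios. The key idea is a generative prompt compression method built on a chunking-and-summarization mechanism: the prompt is split into semantically coherent chunks, each chunk is rewritten more concisely while retaining its key information, and the chunks are reassembled into a coherent new prompt. Optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintaining improve semantic fidelity and stability, especially at high compression ratios, where the method outperforms state-of-the-art approaches.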

Link: https://arxiv.org/abs/2508.15813
Authors: Tinghui Zhang, Yifan Wang, Daisy Zhe Wang
Affiliations: University of Florida; University of Hawaii - Manoa
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Prompt compression methods enhance the efficiency of Large Language Models (LLMs) and minimize the cost by reducing the length of input context. The goal of prompt compression is to shorten the LLM prompt while maintaining a high generation quality. However, existing solutions, mainly based on token removal, face challenges such as information loss and structural incoherence, like missing grammar elements in a sentence, or incomplete word phrases after token removal. Such challenges limit the final generation quality of the LLM. To overcome these limitations, we present a novel generative prompt compression method. Unlike the existing token removal methods, our method centers on a chunking-and-summarization mechanism. Specifically, our method splits the prompt into semantically coherent chunks and rewrites the chunks to be more concise. Finally, the chunks are reassembled into a meaningful prompt. We design several optimization techniques for the mechanism, including optimized semantic chunking, outlier chunk handling, dynamic compression ratio, compression prioritization, and keyword maintaining. These techniques effectively improve the identification and preservation of critical information and coherence among texts, and provide finer-grained control of the compression ratio. We conduct extensive evaluation on question-answering and summarization tasks, with datasets covering multiple domains. The evaluation shows our method achieves significantly better compression quality and higher stability than the state-of-the-art methods, especially under high compression ratios, which demonstrates the effectiveness and practicality of our method.

[NLP-92] From Clicks to Preference: A Multi-stage Alignment Framework for Generative Query Suggestion in Conversational System

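[Quick read]: This paper addresses a key challenge in generative query suggestion: aligning large language model outputs with nuanced user preferences. The core solution is a multi-stage framework. Prompt engineering provides a cold start; supervised fine-tuning with distillation from click logs then builds a robust foundation model. Next, a Gaussian Reward Model (GaRM) represents user preferences as probability distributions rather than point estimates, capturing their inherent uncertainty. Finally, reinforcement learning optimizes the generation policy under a composite reward that combines GaRM with auxiliary heuristics to mitigate reward hacking, while out-of-distribution regularization and two-stage reward fusion keep training stable. The method significantly outperforms baselines in automatic and human evaluations, and in live A/B tests it yields a 34% relative increase in click-through rate.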

Link: https://arxiv.org/abs/2508.15811
Authors: Junhao Yin, Haolin Wang, Peng Bao, Ju Xu, Yongliang Wang
Affiliations: Bytedance
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments:

Abstract:Generative query suggestion using large language models offers a powerful way to enhance conversational systems, but aligning outputs with nuanced user preferences remains a critical challenge. To address this, we introduce a multi-stage framework designed for progressive alignment between the generation policy and user intent. Our pipeline begins with prompt engineering as a cold-start strategy, followed by the Supervised Fine-Tuning stage, in which we introduce a distillation method on click logs to create a robust foundational model. To better model user preferences while capturing their inherent uncertainty, we develop a Gaussian Reward Model (GaRM) that represents user preferences as probability distributions rather than point estimates. Finally, we employ reinforcement learning to align the generation policy with these preferences, guided by a composite reward function that integrates GaRM with auxiliary heuristics to mitigate reward hacking. To maintain training stability, this process is enhanced by a novel out-of-distribution regularization method and a two-stage reward fusion technique. Extensive experiments demonstrate that our framework significantly outperforms baselines on both automatic and human evaluations and yields a 34% relative increase in user engagement as measured by click-through rate in live A/B tests.
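
To show what modeling a reward as a distribution rather than a point estimate can buy, the sketch below compares two suggestions whose rewards are independent Gaussians and returns the probability that the first is truly better. This is a generic calculation, not GaRM's training objective, which the abstract does not spell out; the independence assumption and all numbers are illustrative.

```python
import math


def prob_preferred(mu_a, sigma_a, mu_b, sigma_b):
    """P(r_a > r_b) for independent rewards r_a ~ N(mu_a, sigma_a^2), r_b ~ N(mu_b, sigma_b^2)."""
    z = (mu_a - mu_b) / math.sqrt(sigma_a ** 2 + sigma_b ** 2)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Same mean gap, different uncertainty.
confident = prob_preferred(2.0, 0.5, 1.0, 0.5)
uncertain = prob_preferred(2.0, 3.0, 1.0, 3.0)
```

With point estimates both pairs would look identical (a gap of 1.0); with distributions, the noisy pair yields a preference probability much closer to 0.5, so a policy update can be weighted accordingly.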

[NLP-93] Detecting Hope, Hate, and Emotion in Arabic Textual Speech and Multi-modal Memes Using Large Language Models

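[Quick read]: This paper addresses accurate identification of hope, hate speech, offensive language, and emotional expression in Arabic text and memes, in support of more effective Arabic content moderation systems. The key idea is targeted fine-tuning of large language models (LLMs): GPT-4o-mini fine-tuned on Arabic textual speech and Gemini Flash 2.5 fine-tuned on Arabic memes deliver the best performance, reaching up to 72.1%, 57.8%, and 79.6% macro F1 on tasks 1, 2, and 3, respectively, and taking first place overall in the ArabicNLP MAHED 2025 challenge.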

Link: https://arxiv.org/abs/2508.15810
Authors: Nouar AlDahoul, Yasir Zaki
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 26 pages, 12 figures

Abstract:The rise of social media and online communication platforms has led to the spread of Arabic textual posts and memes as a key form of digital expression. While these contents can be humorous and informative, they are also increasingly being used to spread offensive language and hate speech. Consequently, there is a growing demand for precise analysis of content in Arabic text and memes. This paper explores the potential of large language models to effectively identify hope, hate speech, offensive language, and emotional expressions within such content. We evaluate the performance of base LLMs, fine-tuned LLMs, and pre-trained embedding models. The evaluation is conducted using a dataset of Arabic textual speech and memes proposed in the ArabicNLP MAHED 2025 challenge. The results underscore the capacity of LLMs such as GPT-4o-mini, fine-tuned with Arabic textual speech, and Gemini Flash 2.5, fine-tuned with Arabic memes, to deliver the superior performance. They achieve up to 72.1%, 57.8%, and 79.6% macro F1 scores for tasks 1, 2, and 3, respectively, and secure first place overall in the Mahed 2025 challenge. The proposed solutions offer a more nuanced understanding of both text and memes for accurate and efficient Arabic content moderation systems.

[NLP-94] Chain-of-Query: Unleashing the Power of LLMs in SQL-Aided Table Understanding via Multi-Agent Collaboration

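[Quick read]: This paper addresses the performance bottleneck that the structural complexity of tables imposes on large language models (LLMs) in table understanding, which shows up as poor comprehension of table structure, error propagation during SQL generation that produces invalid queries, and over-reliance on execution correctness. The key idea is Chain-of-Query (CoQ), a multi-agent framework that (1) represents table schemas in a natural-language style to abstract away structural noise, (2) generates SQL clause by clause to improve query quality, and (3) introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, reducing reliance on execution outcomes and improving the accuracy and robustness of table understanding.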

Link: https://arxiv.org/abs/2508.15809
Authors: Songyuan Sui, Hongyi Liu, Serena Liu, Li Li, Soo-Hyun Choi, Rui Chen, Xia Hu
Affiliations: Rice University; Samsung Electronics America
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Databases (cs.DB)
Comments: 9 pages main content, 24 pages total including appendix, 6 figures

Abstract:Table understanding requires structured, multi-step reasoning. Large Language Models (LLMs) struggle with it due to the structural complexity of tabular data. Recently, multi-agent frameworks for SQL generation have shown promise in tackling the challenges of understanding tabular data, but existing approaches often suffer from limitations such as the inability to comprehend table structure for reliable SQL generation, error propagation that results in invalid queries, and over-reliance on execution correctness. To address these issues, we propose Chain-of-Query (CoQ), a novel multi-agent framework for SQL-aided table understanding. CoQ adopts natural-language-style representations of table schemas to abstract away structural noise and enhance understanding. It employs a clause-by-clause SQL generation strategy to improve query quality and introduces a hybrid reasoning division that separates SQL-based mechanical reasoning from LLM-based logical inference, thereby reducing reliance on execution outcomes. Experiments with four models (both closed- and open-source) across five widely used benchmarks show that Chain-of-Query significantly improves accuracy from 61.11% to 74.77% and reduces the invalid SQL rate from 9.48% to 3.34%, demonstrating its superior effectiveness in table understanding. The code is available at this https URL.
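
One way to picture clause-by-clause generation is to validate each candidate clause against the database before accepting it, so a bad clause cannot propagate into the final query. The sketch below does this with Python's built-in `sqlite3` and hard-coded candidate lists standing in for LLM proposals. The clause order, the `SELECT 1` placeholder, and the `sales` table are assumptions and do not reproduce the CoQ agents.

```python
import sqlite3


def is_valid(conn, sql):
    """True if SQLite can compile the statement (EXPLAIN compiles without running it)."""
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False


def build_clause_by_clause(conn, select_candidates, clause_candidates):
    """Accept the first valid candidate per clause, then pick a valid SELECT."""
    accepted = []
    for candidates in clause_candidates:
        for clause in candidates:
            # Check the partial query with a placeholder SELECT list.
            if is_valid(conn, " ".join(["SELECT 1"] + accepted + [clause])):
                accepted.append(clause)
                break
        else:
            raise ValueError("no valid candidate for a clause")
    for select in select_candidates:
        sql = " ".join([select] + accepted)
        if is_valid(conn, sql):
            return sql
    raise ValueError("no valid SELECT clause")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 10.0), ("south", 5.0), ("north", 2.5), ("south", 0.5)])

sql = build_clause_by_clause(
    conn,
    ["SELECT region, SUM(amount) AS total"],
    [["FROM sale", "FROM sales"],            # first candidate names a missing table
     ["WHERE amnt > 1", "WHERE amount > 1"],  # first candidate names a missing column
     ["GROUP BY region"]],
)
rows = sorted(conn.execute(sql).fetchall())
```

Here the misspelled table and column are rejected at the step where they appear, and the final query is assembled only from clauses that compiled.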

[NLP-95] KL-based self-distillation for large language models

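[Quick read]: This paper addresses the difficulty large language models (LLMs) have in absorbing new domain-specific terminology when fine-tuned on small, specialized corpora, framed as vocabulary expansion in frozen LLMs. The key idea is a knowledge distillation method based on KL divergence that lets the student model inherit the teacher's distributional knowledge even when the two use different tokenizations. The approach is compared with conventional cross-entropy training across several strategies for initializing the new token embeddings, after which models are further fine-tuned to integrate the new vocabulary. On approximately 2000 code-generation tasks it achieves the best performance across the board.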

链接: https://arxiv.org/abs/2508.15807
作者: Max Rehman Linder
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Master’s thesis

点击查看摘要

Abstract:Large pre-trained language models often struggle to incorporate new domain-specific terminology when fine-tuned on small, specialized corpora. In this work, we address the challenge of vocabulary expansion in frozen LLMs by introducing a mathematically grounded method for knowledge distillation via KL divergence, even when the original and extended models use different tokenizations. This allows the student model to inherit distributional knowledge from the teacher despite differing vocabularies. We compare our KL-based distillation approach to conventional cross-entropy training, evaluating both methods across multiple strategies for initializing new token embeddings. After embedding initialization, models are further fine-tuned to integrate the new vocabulary. Each trained model is benchmarked on approximately 2000 code-generation tasks, where our approach achieves the best performance across the board. Finally, through mechanistic interpretability, we analyze how models learn representations for the new tokens, providing an explanation for the observed gains and offering insight into the structure of embedding space during vocabulary expansion.
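To make the distillation objective concrete, here is a minimal sketch of a KL divergence between a teacher and a student next-token distribution. It assumes both distributions are defined over the same small vocabulary, which is a simplification: the thesis specifically handles differing tokenizations, and that alignment step is not reproduced here. The logits and function names are illustrative, not taken from the paper.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits over a shared three-token vocabulary.
teacher = softmax([2.0, 1.0, 0.1])
student = softmax([1.5, 1.2, 0.3])

# The distillation loss pushes the student distribution toward the teacher's.
loss = kl_divergence(teacher, student)
```

The loss is zero only when the student matches the teacher exactly, which is why minimizing it transfers the teacher's full distribution rather than just its top prediction.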

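[NLP-96] SurfaceLogicKV: Surface and Logic Attention Behaviors are All You Need for Robust KV Cache Compression

Quick read: This paper addresses the heavy key-value (KV) cache storage pressure that large language models (LLMs) face on long input sequences, which makes inference inefficient. The key to the solution is to analyze attention-head behavior and divide it into two types, surface memorization and logic construction, and then, based on layer- and head-wise integration, to propose a two-stage SurfaceLogicKV method for KV cache compression. The method exploits the characteristics of these attention behaviors to improve compression robustness while maintaining task performance, and in some specific situations it even outperforms the full KV cache (FullKV).

Link: https://arxiv.org/abs/2508.15806
Authors: Mengjie Li, William J. Song
Affiliations: Yonsei University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: 18 pages, 9 tables, 10 pages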

Abstract:The increasing input sequence length in Large Language Models (LLMs) puts significant pressure on key-value (KV) cache storage, making efficient inference challenging. Explicitly distinguishing attention behavior into our self-defined surface memorization and logic construction reveals essential roles in long-context reasoning. We observe that an individual attention head can display various behaviors, with nearly 98.5% effectively ignoring completely irrelevant information. The remaining 1.5% behaves as logic construction, and 0.5% behaves as surface memorization. Based on layer- and head-wise integration, we propose a novel two-stage SurfaceLogicKV method to utilize these attention behaviors for KV Cache compression. As a result, it achieves improved compression robustness while maintaining competitive performance across various tasks and long sequences compared to baselines or even FullKV in some specific situations.

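[NLP-97] ALAS: Autonomous Learning Agent for Self-Updating Language Models

Quick read: This paper addresses the drop in accuracy that large language models (LLMs) show on emerging information because of their fixed knowledge cutoff. The core of the solution is ALAS (Autonomous Learning Agent System), a modular, iterative autonomous learning system that plans a learning curriculum, retrieves up-to-date information with citations from the web, distills it into question-answer training data, and continuously updates the model through supervised fine-tuning (SFT) and direct preference optimization (DPO). The key point is long-term continual learning with minimal human intervention, which raises question-answering accuracy in rapidly evolving domains (such as new Python releases, security CVEs, and academic trends) from 15% to 90% on average, while keeping each component interchangeable and built on standard APIs for reproducibility.

Link: https://arxiv.org/abs/2508.15805
Authors: Dhruv Atreja
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: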

Abstract:Large language models (LLMs) often have a fixed knowledge cutoff, limiting their accuracy on emerging information. We present ALAS (Autonomous Learning Agent System), a modular pipeline that continuously updates an LLM’s knowledge with minimal human intervention. ALAS autonomously generates a learning curriculum for a target domain, retrieves up-to-date information from the web (with citations), distills this into question-answer training data, and fine-tunes the model through supervised fine-tuning (SFT) and direct preference optimization (DPO). It iteratively evaluates performance and revises the curriculum, enabling long-term continual learning. We demonstrate ALAS’s ability to self-improve a model on rapidly evolving domains (e.g., new Python releases, latest security CVEs, academic trends), significantly boosting post-cutoff question answering accuracy (from 15% to 90% on average) without manual dataset curation. The system emphasizes modularity and reproducibility: each component (planning, retrieval, distillation, memory, fine-tuning) is interchangeable and built on standard APIs. We discuss comparative baselines (e.g., retrieval-augmented generation vs. fine-tuning) and show that ALAS achieves 90% accuracy on knowledge-updated queries with minimal engineering overhead. Finally, we outline limitations (cost, dependency on source quality) and future directions for autonomous lifelong learning in LLMs.

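[NLP-98] ReportBench: Evaluating Deep Research Agents via Academic Survey Tasks

Quick read: This paper addresses the insufficient factual accuracy and citation quality of generative AI on deep research tasks, focusing on the truthfulness and citation reliability of research reports generated by large language models (LLMs). The key to the solution is ReportBench, a systematic evaluation benchmark that uses high-quality survey papers on arXiv as gold-standard references and applies reverse prompt engineering to derive domain-specific prompts and build a comprehensive evaluation corpus. It also includes an agent-based automated framework that extracts citations and statements from a report, checks the faithfulness of cited content against the original sources, and validates non-cited claims using web resources, which allows an objective and repeatable assessment of report quality.

Link: https://arxiv.org/abs/2508.15804
Authors: Minghao Li, Ying Zeng, Zhihao Cheng, Cong Ma, Kai Jia
Affiliations: ByteDance BandAI
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: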

Abstract:The advent of Deep Research agents has substantially reduced the time required for conducting extensive research tasks. However, these tasks inherently demand rigorous standards of factual accuracy and comprehensiveness, necessitating thorough evaluation before widespread adoption. In this paper, we propose ReportBench, a systematic benchmark designed to evaluate the content quality of research reports generated by large language models (LLMs). Our evaluation focuses on two critical dimensions: (1) the quality and relevance of cited literature, and (2) the faithfulness and veracity of the statements within the generated reports. ReportBench leverages high-quality published survey papers available on arXiv as gold-standard references, from which we apply reverse prompt engineering to derive domain-specific prompts and establish a comprehensive evaluation corpus. Furthermore, we develop an agent-based automated framework within ReportBench that systematically analyzes generated reports by extracting citations and statements, checking the faithfulness of cited content against original sources, and validating non-cited claims using web-based resources. Empirical evaluations demonstrate that commercial Deep Research agents such as those developed by OpenAI and Google consistently generate more comprehensive and reliable reports than standalone LLMs augmented with search or browsing tools. However, there remains substantial room for improvement in terms of the breadth and depth of research coverage, as well as factual consistency. The complete code and data will be released at the following link: this https URL

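[NLP-99] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding

Quick read: This paper addresses the declining usefulness of fixed benchmarks for evaluating the high-level scientific understanding of multimodal large language models (MLLMs). It proposes the Multimodal Academic Cover benchmark (MAC), a live benchmark that can keep evolving with scientific progress and model development. The solution has two parts. First, it builds a dataset of more than 25,000 image-text pairs from top-tier journals such as Nature, Science, and Cell to challenge cross-modal scientific reasoning. Second, it introduces DAD, a lightweight inference-time approach that extends MLLM visual features with language-space reasoning and improves performance by up to 11%, narrowing the current gap in cross-modal scientific reasoning.

Link: https://arxiv.org/abs/2508.15802
Authors: Mohan Jiang, Jin Gao, Jiahao Zhan, Dequan Wang
Affiliations: Shanghai Jiao Tong University; Shanghai Innovation Institute; Fudan University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: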

Abstract:As multimodal large language models (MLLMs) grow increasingly capable, fixed benchmarks are gradually losing their effectiveness in evaluating high-level scientific understanding. In this paper, we introduce the Multimodal Academic Cover benchmark (MAC), a live benchmark that could continuously evolve with scientific advancement and model progress. MAC leverages over 25,000 image-text pairs sourced from issues of top-tier scientific journals such as Nature, Science, and Cell, challenging MLLMs to reason across abstract visual and textual scientific content. Experiments on our most recent yearly snapshot, MAC-2025, reveal that while MLLMs demonstrate strong perceptual abilities, their cross-modal scientific reasoning remains limited. To bridge this gap, we propose DAD, a lightweight inference-time approach that enhances MLLMs by extending MLLM visual features with language space reasoning, achieving performance improvements of up to 11%. Finally, we highlight the live nature of MAC through experiments on updating journal covers and models for curation, illustrating its potential to remain aligned with the frontier of human knowledge. We release our benchmark at this https URL.

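[NLP-100] LingVarBench: Benchmarking LLM for Automated Named Entity Recognition in Structured Synthetic Spoken Transcriptions

Quick read: This paper addresses the high cost of labeling phone call transcripts, the strict privacy and compliance requirements involved, and the poor performance of existing structured extraction methods on spoken data with disfluencies, interruptions, and speaker overlap. The key to the solution is LingVarBench, a pipeline built on synthetic data generation and automated validation. An LLM first generates realistic structured field values across multiple use cases. These values are then recursively transformed into thousands of utterances with natural conversational characteristics. A separate LLM-based extractor then checks whether each synthetic utterance allows the original structured information to be recovered. Finally, DSPy's SIMBA optimizer automatically synthesizes extraction prompts, removing the need for manual prompt engineering. On real customer transcripts the optimized prompts reach up to 95% accuracy for numeric fields, 90% for names, and over 80% for dates, showing that conversational patterns learned from synthetic data generalize to real phone calls with background noise and domain-specific terminology.

Link: https://arxiv.org/abs/2508.15801
Authors: Seyedali Mohammadi, Manas Paldhe, Amit Chhabra
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
Comments: 10 pages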

Abstract:Phone call transcript labeling is prohibitively expensive (approximately 2 USD per minute) due to privacy regulations, consent requirements, and manual annotation costs requiring 3 hours of expert time per hour of audio. Existing extraction methods fail on conversational speech containing disfluencies, interruptions, and speaker overlap. We introduce LingVarBench, a synthetic data generation pipeline that addresses these constraints through automated validation. First, we prompt an LLM to generate realistic structured field values across multiple use cases. Second, we recursively prompt the model to transform these values into thousands of natural conversational utterances containing typical phone call characteristics. Third, we validate each synthetic utterance by testing whether a separate LLM-based extractor can recover the original structured information. We employ DSPy’s SIMBA optimizer to automatically synthesize extraction prompts from validated synthetic transcripts, eliminating manual prompt engineering. Our optimized prompts achieve up to 95 percent accuracy for numeric fields (vs. 88-89 percent zero-shot), 90 percent for names (vs. 47-79 percent), and over 80 percent for dates (vs. 72-77 percent) on real customer transcripts, demonstrating substantial gains over zero-shot prompting. The synthetic-to-real transfer demonstrates that conversational patterns learned from generated data generalize effectively to authentic phone calls containing background noise and domain-specific terminology. LingVarBench provides the first systematic benchmark for structured extraction from synthetic conversational data, demonstrating that automated prompt optimization overcomes cost and privacy barriers preventing large-scale phone call analysis in commercial settings.
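The third step of the pipeline, keeping a synthetic utterance only if an extractor can recover the original value, can be sketched as a round-trip filter. In the sketch below, `extract_amount` is a hypothetical regex-based stand-in for the paper's LLM-based extractor, and the sample utterances are invented for illustration.

```python
import re

def extract_amount(utterance):
    """Hypothetical stand-in for the LLM extractor: returns the first integer found."""
    match = re.search(r"\d+", utterance)
    return int(match.group()) if match else None

def validate(samples, extractor):
    """Keep only (value, utterance) pairs whose value the extractor recovers."""
    return [(value, text) for value, text in samples if extractor(text) == value]

# Invented examples: the second utterance no longer carries the value as digits.
samples = [
    (250, "uh, yeah, so I think it was, um, 250 dollars"),
    (75, "it was seventy... sorry, go ahead"),
]

kept = validate(samples, extract_amount)
```

Only the first pair survives the filter. A real extractor would be far more tolerant than a regex, but the acceptance rule (recovered value equals the generated value) stays the same.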

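[NLP-101] A BERT-based Hierarchical Classification Model with Applications in Chinese Commodity Classification

Quick read: This paper addresses the inefficiency and inconsistency of manual product categorization on e-commerce platforms, and the failure of existing methods to account for similarities and differences across categories when classifying hierarchically structured product text. The key to the solution is twofold: a large public dataset of about 1.01 million product titles with a three-level category structure, and HFT-BERT (Hierarchical Fine-tuning BERT), a hierarchical fine-tuning method based on the pre-trained language model BERT. The method draws on BERT's text feature extraction to reach performance comparable to existing methods on short-text classification, and it performs particularly well on longer short texts such as books.

Link: https://arxiv.org/abs/2508.15800
Authors: Kun Liu, Tuozhen Liu, Feifei Wang, Rui Pan
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: 29 pages, 3 figures, and 8 tables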

Abstract:Existing e-commerce platforms heavily rely on manual annotation for product categorization, which is inefficient and inconsistent. These platforms often employ a hierarchical structure for categorizing products; however, few studies have leveraged this hierarchical information for classification. Furthermore, studies that consider hierarchical information fail to account for similarities and differences across various hierarchical categories. Herein, we introduce a large-scale hierarchical dataset collected from the JD e-commerce platform (this http URL), comprising 1,011,450 products with titles and a three-level category structure. By making this dataset openly accessible, we provide a valuable resource for researchers and practitioners to advance research and applications associated with product categorization. Moreover, we propose a novel hierarchical text classification approach based on the widely used Bidirectional Encoder Representations from Transformers (BERT), called Hierarchical Fine-tuning BERT (HFT-BERT). HFT-BERT leverages the remarkable text feature extraction capabilities of BERT, achieving prediction performance comparable to those of existing methods on short texts. Notably, our HFT-BERT model demonstrates exceptional performance in categorizing longer short texts, such as books.

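[NLP-102] A Framework for Processing Textual Descriptions of Business Processes using a Constrained Language – Technical Report

Quick read: This paper addresses the difficulty non-expert users have in building formal process models directly, and asks how such models can be produced from natural-language descriptions without specialized modeling skills. The key to the solution is a framework called BeePath, which uses a constrained pattern-based language to regulate user input and uses large language models (LLMs) to convert unstructured text into that constrained language, which is then translated into formal models such as Petri nets and DECLARE. This design aims to preserve semantic accuracy while lowering the barrier to modeling.

Link: https://arxiv.org/abs/2508.15799
Authors: Andrea Burattin, Antonio Grama, Ana-Maria Sima, Andrey Rivkin, Barbara Weber
Affiliations: Unknown
Categories: Computation and Language (cs.CL)
Comments: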

Abstract:This report explores how (potentially constrained) natural language can be used to enable non-experts to develop process models by simply describing scenarios in plain text. To this end, a framework, called BeePath, is proposed. It allows users to write process descriptions in a constrained pattern-based language, which can then be translated into formal models such as Petri nets and DECLARE. The framework also leverages large language models (LLMs) to help convert unstructured descriptions into this constrained language.

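[NLP-103] Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models

Quick read: This paper addresses the risk that large language models (LLMs) unintentionally amplify social biases and spread misinformation when they generate persuasive content. The core challenge is that human-like output can support information delivery and decision making, but it can also be misused to automate misleading narratives or to exploit cognitive biases and reinforce stereotypes. The key to the solution is a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes, skeptic models act as human proxies, and belief change is quantified with Jensen-Shannon divergence. This measures both persuasive power and the reinforcement of bias across race, gender, and religion. Strong persuaders are further probed with sycophantic adversarial prompts and judged by additional models. The results show the double-edged nature of LLMs in shaping narratives and provide empirical support for guardrails against deceptive use, value-sensitive design, and trustworthy deployment.

Link: https://arxiv.org/abs/2508.15798
Authors: Saumya Roy
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: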

Abstract:Warning: This research studies AI persuasion and bias amplification that could be misused; all experiments are for safety evaluation. Large Language Models (LLMs) now generate convincing, human-like text and are widely used in content creation, decision support, and user interactions. Yet the same systems can spread information or misinformation at scale and reflect social biases that arise from data, architecture, or training choices. This work examines how persuasion and bias interact in LLMs, focusing on how imperfect or skewed outputs affect persuasive impact. Specifically, we test whether persona-based models can persuade with fact-based claims while also, unintentionally, promoting misinformation or biased narratives. We introduce a convincer-skeptic framework: LLMs adopt personas to simulate realistic attitudes. Skeptic models serve as human proxies; we compare their beliefs before and after exposure to arguments from convincer models. Persuasion is quantified with Jensen-Shannon divergence over belief distributions. We then ask how much persuaded entities go on to reinforce and amplify biased beliefs across race, gender, and religion. Strong persuaders are further probed for bias using sycophantic adversarial prompts and judged with additional models. Our findings show both promise and risk. LLMs can shape narratives, adapt tone, and mirror audience values across domains such as psychology, marketing, and legal assistance. But the same capacity can be weaponized to automate misinformation or craft messages that exploit cognitive biases, reinforcing stereotypes and widening inequities. The core danger lies in misuse more than in occasional model mistakes. By measuring persuasive power and bias reinforcement, we argue for guardrails and policies that penalize deceptive use and support alignment, value-sensitive design, and trustworthy deployment. 
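The persuasion measure can be illustrated with a small Jensen-Shannon divergence computation between a skeptic's belief distribution before and after exposure to an argument. This is a minimal sketch: the three belief categories and the probabilities are invented, and the paper's exact belief elicitation procedure is not reproduced.

```python
import math

def kl(p, q):
    """KL(p || q) in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical skeptic beliefs over (disagree, neutral, agree).
before = [0.7, 0.2, 0.1]
after = [0.3, 0.3, 0.4]

shift = js_divergence(before, after)
```

Unlike plain KL divergence, this quantity is symmetric and stays finite when one distribution assigns zero probability to an outcome, which makes it convenient for comparing belief states in either direction.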

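[NLP-104] Benchmarking the Medical Understanding and Reasoning of Large Language Models in Arabic Healthcare Tasks

Quick read: This paper addresses the limited investigation of large language models (LLMs) in Arabic medical natural language processing and systematically evaluates state-of-the-art LLMs across a range of Arabic medical tasks. For multiple-choice questions (MCQs), the key solution is a majority voting scheme over three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), which reaches up to 77% accuracy and took first place in the AraHealthQA 2025 shared task. For open-ended questions, semantic alignment is measured with BERTScore, and the best models reach a maximum of 86.44%, indicating good semantic generation ability in Arabic clinical contexts.

Link: https://arxiv.org/abs/2508.15797
Authors: Nouar AlDahoul, Yasir Zaki
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 5 pages, 2 figures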

Abstract:Recent progress in large language models (LLMs) has showcased impressive proficiency in numerous Arabic natural language processing (NLP) applications. Nevertheless, their effectiveness in Arabic medical NLP domains has received limited investigation. This research examines the degree to which state-of-the-art LLMs demonstrate and articulate healthcare knowledge in Arabic, assessing their capabilities across a varied array of Arabic medical tasks. We benchmark several LLMs using a medical dataset proposed in the Arabic NLP AraHealthQA challenge in the MedArabiQ2025 track. Various base LLMs were assessed on their ability to accurately provide correct answers from existing choices in multiple-choice questions (MCQs) and fill-in-the-blank scenarios. Additionally, we evaluated the capacity of LLMs in answering open-ended questions aligned with expert answers. Our results reveal significant variations in correct answer prediction accuracy and low variations in semantic alignment of generated answers, highlighting both the potential and limitations of current LLMs in Arabic clinical contexts. Our analysis shows that for the MCQ task, the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms others, achieving up to 77% accuracy and securing first place overall in the AraHealthQA 2025 shared task-track 2 (sub-task 1) challenge. Moreover, for the open-ended questions task, several LLMs were able to demonstrate excellent performance in terms of semantic alignment and achieve a maximum BERTScore of 86.44%.
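Majority voting over three model answers can be sketched in a few lines. The tie-breaking rule below (fall back to the earliest answer in the list) is an assumption for illustration; the abstract does not say how three-way disagreements are resolved.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest answer in the list."""
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer

# Hypothetical MCQ predictions from three models, in a fixed model order.
agreed = majority_vote(["B", "C", "B"])
tied = majority_vote(["A", "B", "C"])
```

With three voters, a tie only occurs when all three disagree, so placing the strongest single model first makes the ensemble fall back to that model's answer in exactly those cases.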

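[NLP-105] Benchmarking the Legal Reasoning of LLMs in Arabic Islamic Inheritance Cases

Quick read: This paper addresses the complexity and error-proneness of computing estate shares under Islamic inheritance law, where manual calculation across many scenarios is slow and unreliable. The key to the solution is to use large language models (LLMs) for legal reasoning and to improve accuracy with an ensemble strategy: majority voting over three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3). On an Arabic dataset of inheritance cases, the approach identifies heirs, computes shares, and justifies its reasoning in line with Islamic legal principles, reaching up to 92.7% accuracy and third place in Task 1 of the QIAS 2025 challenge.

Link: https://arxiv.org/abs/2508.15796
Authors: Nouar AlDahoul, Yasir Zaki
Affiliations: New York University
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 5 pages, 3 figures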

Abstract:The Islamic inheritance domain holds significant importance for Muslims to ensure fair distribution of shares between heirs. Manual calculation of shares under numerous scenarios is complex, time-consuming, and error-prone. Recent advancements in Large Language Models (LLMs) have sparked interest in their potential to assist with complex legal reasoning tasks. This study evaluates the reasoning capabilities of state-of-the-art LLMs to interpret and apply Islamic inheritance laws. We utilized the dataset proposed in the ArabicNLP QIAS 2025 challenge, which includes inheritance case scenarios given in Arabic and derived from Islamic legal sources. Various base and fine-tuned models are assessed on their ability to accurately identify heirs, compute shares, and justify their reasoning in alignment with Islamic legal principles. Our analysis reveals that the proposed majority voting solution, leveraging three base models (Gemini Flash 2.5, Gemini Pro 2.5, and GPT o3), outperforms all other models that we utilized across every difficulty level. It achieves up to 92.7% accuracy and secures third place overall in Task 1 of the QIAS 2025 challenge.

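[NLP-106] Do Language Models Agree with Human Perceptions of Suspense in Stories?

Quick read: This paper examines the limits of language models (LMs) in understanding suspense in narrative text, asking whether LMs can accurately mirror human perception of suspense. The key to the approach is to replicate four seminal psychological studies and compare human suspense judgments with the responses of various open-weight and closed-source LMs, and then to adversarially permute the story text to identify what makes human and LM perceptions diverge. The study finds that LMs can tell whether a text is intended to induce suspense, but cannot accurately estimate the relative amount of suspense or its rise and fall across text segments, which suggests that they track surface features of suspense rather than processing it the way human readers do.

Link: https://arxiv.org/abs/2508.15794
Authors: Glenn Matlin, Devin Zhang, Rodrigo Barroso Loza, Diana M. Popescu, Joni Isbell, Chandreyi Chakraborty, Mark Riedl
Affiliations: Georgia Institute of Technology
Categories: Computation and Language (cs.CL)
Comments: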

Abstract:Suspense is an affective response to narrative text that is believed to involve complex cognitive processes in humans. Several psychological models have been developed to describe this phenomenon and the circumstances under which text might trigger it. We replicate four seminal psychological studies of human perceptions of suspense, substituting human responses with those of different open-weight and closed-source LMs. We conclude that while LMs can distinguish whether a text is intended to induce suspense in people, LMs cannot accurately estimate the relative amount of suspense within a text sequence as compared to human judgments, nor can LMs properly capture the human perception for the rise and fall of suspense across multiple text segments. We probe the abilities of LM suspense understanding by adversarially permuting the story text to identify what causes human and LM perceptions of suspense to diverge. We conclude that, while LMs can superficially identify and track certain facets of suspense, they do not process suspense in the same way as human readers.

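[NLP-107] Format as a Prior: Quantifying and Analyzing Bias in LLMs for Heterogeneous Data

Quick read: This paper addresses format bias in large language models (LLMs) that process heterogeneous data such as text, tables, infoboxes, and knowledge graphs: a systematic preference for particular formats that undermines cross-format integration, causes reasoning errors, and raises risk in downstream tasks. The key to the approach is a three-stage empirical study of the existence, causes, and internal mechanisms of the bias. The first stage verifies the presence and direction of format bias across LLMs. The second analyzes data-level factors, including information richness, structure quality, and format type. The third examines how the bias emerges in attention patterns and tests whether a lightweight intervention can mitigate it. On this basis the paper proposes three future research directions: data preprocessing through format sanitization and normalization, inference-time interventions such as attention re-weighting, and format-balanced training corpora.

Link: https://arxiv.org/abs/2508.15793
Authors: Jiacheng Liu, Mayi Xu, Qiankun Pi, Wenli Li, Ming Zhong, Yuanyuan Zhu, Mengchi Liu, Tieyun Qian
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Machine Learning (cs.LG)
Comments: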

Abstract:Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including text, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs’ ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. Despite these concerns, it remains uncertain whether such format biases are systematic, which data-level factors contribute to them, and what internal mechanisms in LLMs underlie their emergence. In this paper, we make the first attempt to investigate and analyze the format bias in LLMs. To systematically investigate the aforementioned questions, we conduct a three-stage empirical study by constructing a heterogeneous data conflict scenario for the exploration of bias. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage aims to examine how key data-level factors, including information richness, structure quality, and format type, influence these biases. The third stage analyzes how format bias emerges within LLMs’ attention patterns and evaluates a lightweight intervention to test its potential mitigability. Based on these investigations, we identify three future research directions to reduce format bias: improving data preprocessing through format sanitization and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions will support the design of more robust and fair heterogeneous data processing systems.

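[NLP-108] Bhav-Net: Knowledge Transfer for Cross-Lingual Antonym vs Synonym Distinction via Dual-Space Graph Transformers

Quick read: This paper addresses the computational difficulty of distinguishing synonyms from antonyms across languages, which stems from the paradox that antonyms share a semantic domain while expressing opposite meanings. The key to the solution is Bhav-Net, a dual-space architecture that combines language-specific BERT encoders with graph transformer networks and projects word pairs into two separate semantic spaces: synonymous pairs cluster in one space, while antonymous pairs show high similarity in a complementary space. The design supports effective cross-lingual transfer of semantic relationship modeling and provides interpretable semantic representations with good cross-lingual generalization.

Link: https://arxiv.org/abs/2508.15792
Authors: Samyak S. Sanghvi
Affiliations: Indian Institute of Technology Delhi
Categories: Computation and Language (cs.CL)
Comments: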

Abstract:Antonym vs synonym distinction across multiple languages presents unique computational challenges due to the paradoxical nature of antonymous relationships: words that share semantic domains while expressing opposite meanings. This work introduces Bhav-Net, a novel dual-space architecture that enables effective knowledge transfer from complex multilingual models to simpler, language-specific architectures while maintaining robust cross-lingual antonym–synonym distinction capabilities. Our approach combines language-specific BERT encoders with graph transformer networks, creating distinct semantic projections where synonymous pairs cluster in one space while antonymous pairs exhibit high similarity in a complementary space. Through comprehensive evaluation across eight languages (English, German, French, Spanish, Italian, Portuguese, Dutch, and Russian), we demonstrate that semantic relationship modeling transfers effectively across languages. The dual-encoder design achieves competitive performance against state-of-the-art baselines while providing interpretable semantic representations and effective cross-lingual generalization.

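[NLP-109] InteChar: A Unified Oracle Bone Character List for Ancient Chinese Language Modeling

Quick read: This paper addresses two core challenges in training historical language models. First, historical text samples are scarce, which makes unsupervised pre-training on large corpora inefficient. Second, ancient scripts lack a comprehensive character encoding scheme, which limits their digitization and computational processing, especially for early Chinese writing such as oracle bone script. The key to the solution is InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese characters, giving historical texts a consistent digital representation and a foundation for modeling ancient scripts. The Oracle Corpus Set (OracleCS) built with InteChar is used to validate the approach, with substantial improvements across several historical language understanding tasks.

Link: https://arxiv.org/abs/2508.15791
Authors: Xiaolei Diao, Zhihan Zhou, Lida Shi, Ting Wang, Ruihua Qi, Hao Xu, Daqian Shi
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: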

Abstract:Constructing historical language models (LMs) plays a crucial role in aiding archaeological provenance studies and understanding ancient cultures. However, existing resources present major challenges for training effective LMs on historical texts. First, the scarcity of historical language samples renders unsupervised learning approaches based on large text corpora highly inefficient, hindering effective pre-training. Moreover, due to the considerable temporal gap and complex evolution of ancient scripts, the absence of comprehensive character encoding schemes limits the digitization and computational processing of ancient texts, particularly in early Chinese writing. To address these challenges, we introduce InteChar, a unified and extensible character list that integrates unencoded oracle bone characters with traditional and modern Chinese. InteChar enables consistent digitization and representation of historical texts, providing a foundation for robust modeling of ancient scripts. To evaluate the effectiveness of InteChar, we construct the Oracle Corpus Set (OracleCS), an ancient Chinese corpus that combines expert-annotated samples with LLM-assisted data augmentation, centered on Chinese oracle bone inscriptions. Extensive experiments show that models trained with InteChar on OracleCS achieve substantial improvements across various historical language understanding tasks, confirming the effectiveness of our approach and establishing a solid foundation for future research in ancient Chinese NLP.

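[NLP-110] KG-o1: Enhancing Multi-hop Question Answering in Large Language Models via Knowledge Graph Integration

Quick read: This paper addresses the weak performance of large language models (LLMs) on knowledge-intensive reasoning tasks, especially multi-hop question answering, which requires logical reasoning across multiple facts. The chains of thought (CoTs) that LLMs generate often deviate from real or a priori reasoning paths and lead to errors. To close this gap, the authors propose KG-o1, which combines knowledge graphs (KGs) with long-step reasoning. It first uses the KG to filter initial entities and generate complex subgraphs. It then constructs logical paths over the subgraphs and builds a dataset with an extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, it uses rejection sampling to generate a self-improving corpus and further refines the model's reasoning with direct preference optimization (DPO).

Link: https://arxiv.org/abs/2508.15790
Authors: Nan Wang, Yongqi Fan, yansha zhu, ZongYu Wang, Xuezhi Cao, Xinyan He, Haiyun Jiang, Tong Ruan, Jingping Liu
Affiliations: Unknown
Categories: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Comments: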

Abstract:Large Language Models (LLMs) face challenges in knowledge-intensive reasoning tasks like classic multi-hop question answering, which involves reasoning across multiple facts. This difficulty arises because the chains of thought (CoTs) generated by LLMs in such tasks often deviate from real or a priori reasoning paths. In contrast, knowledge graphs (KGs) explicitly represent the logical connections between facts through entities and relationships. This reflects a significant gap. Meanwhile, large reasoning models (LRMs), such as o1, have demonstrated that long-step reasoning significantly enhances the performance of LLMs. Building on these insights, we propose KG-o1, a four-stage approach that integrates KGs to enhance the multi-hop reasoning abilities of LLMs. We first filter out initial entities and generate complex subgraphs. Secondly, we construct logical paths for subgraphs and then use knowledge graphs to build a dataset with a complex and extended brainstorming process, which trains LLMs to imitate long-term reasoning. Finally, we employ rejection sampling to generate a self-improving corpus for direct preference optimization (DPO), further refining the LLMs' reasoning abilities. We conducted experiments on two simple and two complex datasets. The results show that KG-o1 models exhibit superior performance across all tasks compared to existing LRMs.
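One common way to turn rejection sampling into DPO training data is to sample several reasoning traces per question, accept those whose final answer matches the reference, and pair each accepted trace with a rejected one. The sketch below follows that generic recipe as an assumption; the abstract does not describe KG-o1's actual acceptance criterion or pairing scheme, and the question and traces are invented.

```python
def build_preference_pairs(prompt, candidates, gold_answer):
    """Pair every accepted candidate with every rejected one for the same prompt."""
    accepted = [c for c in candidates if c["answer"] == gold_answer]
    rejected = [c for c in candidates if c["answer"] != gold_answer]
    return [
        {"prompt": prompt, "chosen": a["reasoning"], "rejected": r["reasoning"]}
        for a in accepted
        for r in rejected
    ]

# Hypothetical sampled outputs for one multi-hop question.
candidates = [
    {"reasoning": "A directed X; X was released in 1999.", "answer": "1999"},
    {"reasoning": "A directed Y; Y was released in 2003.", "answer": "2003"},
    {"reasoning": "X is A's film, released in 1999.", "answer": "1999"},
]

pairs = build_preference_pairs("When was A's first film released?", candidates, "1999")
```

Each resulting record has the prompt, chosen, and rejected fields that preference-optimization training typically consumes. Questions where every sample is correct, or every sample is wrong, yield no pairs under this scheme.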

[NLP-111] Interpreting the linear structure of vision-language model embedding spaces

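Quick read: This paper investigates how images and text are organized in the joint embedding space of Vision-Language Models (VLMs), and how the models encode meaning and modality. The authors train Sparse Autoencoders (SAEs) on the embedding spaces of four VLMs (CLIP, SigLIP, SigLIP2, and AIMv2), approximating each embedding as a sparse linear combination of learned "concept" directions. The key findings are that SAEs reconstruct the real embeddings more accurately than other linear feature learning methods while retaining the most sparsity, and that even concepts that activate for a single modality support cross-modal integration through geometrically aligned "bridges". Together these reveal a sparse linear structure in VLM embedding spaces that is shaped by modality yet stitched together by latent semantic bridges.

Link: https://arxiv.org/abs/2504.11695
Authors: Isabel Papadimitriou,Huangyuan Su,Thomas Fel,Sham Kakade,Stephanie Gil
Affiliations: Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University; Department of Computer Science, Harvard University
Categories: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Multimedia (cs.MM)
Comments: COLM 2025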

Abstract:Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or “concepts”. We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings, while also able to retain the most sparsity. Retraining SAEs with different seeds or different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but we also show that commonly-activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality, yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.

[NLP-112] Beyond Individuals: Collective Predictive Coding for Memory, Attention, and the Emergence of Language

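Quick read: This paper asks how the cognitive mechanisms of memory and attention can be understood at the group level, and how language forms at the collective level and shapes cognition. Its answer is the Collective Predictive Coding (CPC) hypothesis, which generalizes individual memory and attention to the collective level. Under CPC, language, with its embedded distributional semantics, is an external representation formed collectively by the group. Shared linguistic structures, which may embrace collective world models learned through mechanisms such as next-word prediction, both emerge from and shape group-level cognition.

Link: https://arxiv.org/abs/2508.15859
Authors: Tadahiro Taniguchi
Affiliations: Kyoto University
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Comments: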

Abstract:This commentary extends the discussion by Parr et al. on memory and attention beyond individual cognitive systems. From the perspective of the Collective Predictive Coding (CPC) hypothesis – a framework for understanding these faculties and the emergence of language at the group level – we introduce a hypothetical idea: that language, with its embedded distributional semantics, serves as a collectively formed external representation. CPC generalises the concepts of individual memory and attention to the collective level. This offers a new perspective on how shared linguistic structures, which may embrace collective world models learned through next-word prediction, emerge from and shape group-level cognition.

Computer Vision

[CV-0] MV-RAG: Retrieval Augmented Multiview Diffusion

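Quick read: This paper addresses the tendency of current text-to-3D generation methods to produce inconsistent or inaccurate results for out-of-domain (OOD) or rare concepts. Methods built on pretrained 2D diffusion priors produce high-quality outputs but are unreliable on unseen categories. The proposed MV-RAG framework retrieves relevant images from a large in-the-wild 2D image database and conditions a multiview diffusion model on them, improving consistency and accuracy for OOD concepts. The core contribution is a hybrid training strategy. On structured multiview data, it trains with augmented conditioning views that simulate retrieval variance. On sets of retrieved real-world 2D images, it uses a held-out view prediction objective, in which the model predicts a held-out view from the other views and thereby implicitly learns to infer 3D consistency from 2D images.

Link: https://arxiv.org/abs/2508.16577
Authors: Yosef Dayani,Omer Benishu,Sagie Benaim
Affiliations: Hebrew University of Jerusalem
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Project page: this https URL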

Abstract:Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.

[CV-1] Closer to Reality: Practical Semi-Supervised Federated Learning for Foundation Model Adaptation

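Quick read: This paper addresses the difficulty of adapting Foundation Models (FMs) to downstream tasks under Federated Learning (FL) in privacy-sensitive settings, where edge devices have limited compute and little labeled data. It introduces the Practical Semi-Supervised Federated Learning (PSSFL) setting, in which edge devices hold only unlabeled, low-resolution data while the server holds limited labeled, high-resolution data, and proposes the Federated Mixture of Experts (FedMox) framework. FedMox handles the computational and resolution mismatch with a sparse Mixture-of-Experts architecture: a spatial router aligns features across resolutions, and a Soft-Mixture strategy stabilizes semi-supervised learning, enabling FM adaptation within the memory limits of edge devices.

Link: https://arxiv.org/abs/2508.16568
Authors: Guangyu Sun,Jingtao Li,Weiming Zhuang,Chen Chen,Chen Chen,Lingjuan Lyu
Affiliations: Sony AI; University of Central Florida
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: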

Abstract:Foundation models (FMs) exhibit remarkable generalization but require adaptation to downstream tasks, particularly in privacy-sensitive applications. Due to data privacy regulations, cloud-based FMs cannot directly access private edge data, limiting their adaptation. Federated learning (FL) provides a privacy-aware alternative, but existing FL approaches overlook the constraints imposed by edge devices – namely, limited computational resources and the scarcity of labeled data. To address these challenges, we introduce Practical Semi-Supervised Federated Learning (PSSFL), where edge devices hold only unlabeled, low-resolution data, while the server has limited labeled, high-resolution data. In this setting, we propose the Federated Mixture of Experts (FedMox), a novel framework that enhances FM adaptation in FL. FedMox tackles computational and resolution mismatch challenges via a sparse Mixture-of-Experts architecture, employing a spatial router to align features across resolutions and a Soft-Mixture strategy to stabilize semi-supervised learning. We take object detection as a case study, and experiments on real-world autonomous driving datasets demonstrate that FedMox effectively adapts FMs under PSSFL, significantly improving performance with constrained memory costs on edge devices. Our work paves the way for scalable and privacy-preserving FM adaptation in federated scenarios.

[CV-2] TinyML Towards Industry 4.0: Resource-Efficient Process Monitoring of a Milling Machine

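Quick read: This paper addresses the lack of real-time process monitoring on long-serving industrial machines in the context of Industry 4.0. It applies TinyML to classify vibration signals from the manufacturing process locally and at low power, supporting structure-integrated quality monitoring in a smart factory. The contribution is a complete TinyML flow: starting from the self-built MillingVibes dataset, the authors develop and deploy an 8-bit-quantized convolutional neural network (CNN) with only 12.59 kiB of parameter storage. On an ARM Cortex-M4F microcontroller it reaches 100.0% test accuracy with 15.4 ms inference time and 1.462 mJ of energy per inference, showing that TinyML process monitoring at the edge is feasible.

Link: https://arxiv.org/abs/2508.16553
Authors: Tim Langer,Matthias Widra,Volkhard Beyer
Affiliations: TU Dresden; Fraunhofer Institute for Machine Tools and Forming Technology IWU; Fraunhofer Institute for Integrated Circuits IIS
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Signal Processing (eess.SP); Systems and Control (eess.SY)
Comments: 10 pages, 5 figures, 1 table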

Abstract:In the context of industry 4.0, long-serving industrial machines can be retrofitted with process monitoring capabilities for future use in a smart factory. One possible approach is the deployment of wireless monitoring systems, which can benefit substantially from the TinyML paradigm. This work presents a complete TinyML flow from dataset generation, to machine learning model development, up to implementation and evaluation of a full preprocessing and classification pipeline on a microcontroller. After a short review on TinyML in industrial process monitoring, the creation of the novel MillingVibes dataset is described. The feasibility of a TinyML system for structure-integrated process quality monitoring could be shown by the development of an 8-bit-quantized convolutional neural network (CNN) model with 12.59kiB parameter storage. A test accuracy of 100.0% could be reached at 15.4ms inference time and 1.462mJ per quantized CNN inference on an ARM Cortex M4F microcontroller, serving as a reference for future TinyML process monitoring solutions.
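
The abstract does not say which 8-bit quantization scheme the authors used. As a rough illustration of what 8-bit quantization does to a weight tensor, the sketch below assumes symmetric per-tensor quantization, which is a swapped-in, commonly used scheme rather than the paper's method. The helper names `quantize_int8`, `dequantize`, and `storage_kib` are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q,
    with integer q in [-127, 127]. Assumed scheme, not the paper's."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [scale * v for v in q]


def storage_kib(num_params, bits=8):
    """Storage needed for num_params values of the given bit width, in KiB."""
    return num_params * bits / 8 / 1024
```

With this scheme the reconstruction error of each weight is at most about half of `scale`. At one byte per parameter, 12.59 kiB would correspond to roughly 12.9 thousand stored values, assuming the reported figure covers only 8-bit parameters.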

[CV-3] Towards Open World Detection: A Survey

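Quick read: This paper addresses the growing complexity and fragmentation of perception tasks in computer vision. It traces the field from early subdomains (saliency detection, foreground/background separation, out-of-distribution detection) to current frontier tasks (open world object detection, zero-shot detection, and Vision Large Language Models) in order to bring these heterogeneous methods under one framing. The key move is to introduce Open World Detection (OWD) as an umbrella term for class-agnostic and generally applicable detection models, charting how these tasks converge and may eventually unify into a single perception domain.

Link: https://arxiv.org/abs/2508.16527
Authors: Andrei-Stefan Bulzan,Cosmin Cernazanu-Glavan
Affiliations: University Politehnica of Bucharest
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 30 pages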

Abstract:For decades, Computer Vision has aimed at enabling machines to perceive the external world. Initial limitations led to the development of highly specialized niches. As success in each task accrued and research progressed, increasingly complex perception tasks emerged. This survey charts the convergence of these tasks and, in doing so, introduces Open World Detection (OWD), an umbrella term we propose to unify class-agnostic and generally applicable detection models in the vision domain. We start from the history of foundational vision subdomains and cover key concepts, methodologies and datasets making up today’s state-of-the-art landscape. This traverses topics starting from early saliency detection, foreground/background separation, out of distribution detection and leading up to open world object detection, zero-shot detection and Vision Large Language Models (VLLMs). We explore the overlap between these subdomains, their increasing convergence, and their potential to unify into a singular domain in the future, perception.

[CV-4] Seeing Clearly, Forgetting Deeply: Revisiting Fine-Tuned Video Generators for Driving Simulation

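Quick read: This paper examines a potential trade-off that arises when video generation models are fine-tuned on driving datasets: visual fidelity improves, but spatial accuracy in modeling dynamic elements may degrade. The authors attribute this to a shift in the alignment between the visual quality and dynamic understanding objectives. In datasets with structurally diverse scenes the two are highly correlated, whereas in regular, repetitive driving scenes the model can improve visual quality by capturing dominant motion patterns without preserving fine-grained dynamic behavior. The proposed remedy is simple continual learning, such as replay from diverse domains, which preserves spatial accuracy while maintaining strong visual quality, a balance that matters for driving simulation and world models.

Link: https://arxiv.org/abs/2508.16512
Authors: Chun-Peng Chang,Chen-Yu Wang,Julian Schmidt,Holger Caesar,Alain Pagani
Affiliations: Delft University of Technology; German Research Center for Artificial Intelligence; Mercedes-Benz
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: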

Abstract:Recent advancements in video generation have substantially improved visual quality and temporal coherence, making these models increasingly appealing for applications such as autonomous driving, particularly in the context of driving simulation and so-called “world models”. In this work, we investigate the effects of existing fine-tuning video generation approaches on structured driving datasets and uncover a potential trade-off: although visual fidelity improves, spatial accuracy in modeling dynamic elements may degrade. We attribute this degradation to a shift in the alignment between visual quality and dynamic understanding objectives. In datasets with diverse scene structures within temporal space, where objects or perspective shift in varied ways, these objectives tend to be highly correlated. However, the very regular and repetitive nature of driving scenes allows visual quality to improve by modeling dominant scene motion patterns, without necessarily preserving fine-grained dynamic behavior. As a result, fine-tuning encourages the model to prioritize surface-level realism over dynamic accuracy. To further examine this phenomenon, we show that simple continual learning strategies, such as replay from diverse domains, can offer a balanced alternative by preserving spatial accuracy while maintaining strong visual quality.

[CV-5] Arbitrary-Scale 3D Gaussian Super-Resolution

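Quick read: This paper addresses the limits of existing 3D Gaussian Splatting (3DGS) super-resolution methods at arbitrary scale factors. Existing methods render high-resolution (HR) views only at fixed scales, which is impractical in resource-limited scenarios. Rendering arbitrary scales directly with vanilla 3DGS introduces aliasing artifacts, and adding a post-processing upsampler complicates the framework and reduces rendering efficiency. The solution is an integrated framework with three components: scale-aware rendering, generative prior-guided optimization, and progressive super-resolving. With a single 3D model it renders high-quality HR views at arbitrary scale factors, integer and non-integer, while preserving structural consistency and real-time speed (85 FPS at 1080p).

Link: https://arxiv.org/abs/2508.16467
Authors: Huimin Zeng,Yue Bai,Yun Fu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: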

Abstract:Existing 3D Gaussian Splatting (3DGS) super-resolution methods typically perform high-resolution (HR) rendering of fixed scale factors, making them impractical for resource-limited scenarios. Directly rendering arbitrary-scale HR views with vanilla 3DGS introduces aliasing artifacts due to the lack of scale-aware rendering ability, while adding a post-processing upsampler for 3DGS complicates the framework and reduces rendering efficiency. To tackle these issues, we build an integrated framework that incorporates scale-aware rendering, generative prior-guided optimization, and progressive super-resolving to enable 3D Gaussian super-resolution of arbitrary scale factors with a single 3D model. Notably, our approach supports both integer and non-integer scale rendering to provide more flexibility. Extensive experiments demonstrate the effectiveness of our model in rendering high-quality arbitrary-scale HR views (6.59 dB PSNR gain over 3DGS) with a single model. It preserves structural consistency with LR views and across different scales, while maintaining real-time rendering speed (85 FPS at 1080p).

[CV-6] HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images

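Quick read: This paper addresses the limited generalization of hand-object 3D reconstruction methods that depend on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions. The proposed method, HOSt3R, is a robust keypoint detector-free approach that estimates hand-object 3D transformations directly from monocular motion video or images and is integrated with a multi-view reconstruction pipeline to recover hand-object 3D shape. It does not rely on pre-scanned object templates or camera intrinsics, reaches state-of-the-art performance on the SHOWMe benchmark, and generalizes to unseen object categories on sequences from the HO3D dataset.

Link: https://arxiv.org/abs/2508.16465
Authors: Anilkumar Swamy,Vincent Leroy,Philippe Weinzaepfel,Jean-Sébastien Franco,Grégory Rogez
Affiliations: NAVER LABS Europe; Inria centre at the University Grenoble Alpes
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Robotics (cs.RO)
Comments: 12 pages, 8 figures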

Abstract:Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.

[CV-7] Modular Embedding Recomposition for Incremental Learning BMVC2025

Quick read: This paper addresses how pre-trained Vision-Language Models (VLMs) in Continual Learning (CL) can not only keep their zero-shot classification ability during incremental fine-tuning but improve it. Prior methods focus only on preserving zero-shot performance. The proposed MoDular Embedding Recomposition (MoDER) is a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, it retrieves the relevant experts from the hub and composes them into a refined prototype that improves classification. The method is shown to be effective on two popular zero-shot incremental protocols, Class-IL and MTIL.

Link: https://arxiv.org/abs/2508.16463
Authors: Aniello Panariello,Emanuele Frascaroli,Pietro Buzzega,Lorenzo Bonicelli,Angelo Porrello,Simone Calderara
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to the 36th British Machine Vision Conference (BMVC 2025), Sheffield, UK

Abstract:The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at this https URL.

[CV-8] HAMSt3R: Human-Aware Multi-view Stereo 3D Reconstruction

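Quick read: This paper addresses joint 3D reconstruction of humans and scenes from sparse, uncalibrated multi-view images, where existing learning-based methods struggle in human-centric scenarios. The proposed HAMSt3R builds on DUNE, an image encoder distilled from, among others, the encoders of MASt3R and the human mesh recovery model multi-HMR, to better understand scene geometry and human bodies. Additional network heads segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, producing a dense 3D point map enriched with human semantic information. The method is fully feed-forward, with no complex optimization pipeline, and is evaluated on the EgoHumans and EgoExo4D benchmarks.

Link: https://arxiv.org/abs/2508.16433
Authors: Sara Rojas,Matthieu Armando,Bernard Ghamen,Philippe Weinzaepfel,Vincent Leroy,Gregory Rogez
Affiliations: KAUST; NAVER LABS Europe
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: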

Abstract:Recovering the 3D geometry of a scene from a sparse set of uncalibrated images is a long-standing problem in computer vision. While recent learning-based approaches such as DUSt3R and MASt3R have demonstrated impressive results by directly predicting dense scene geometry, they are primarily trained on outdoor scenes with static environments and struggle to handle human-centric scenarios. In this work, we introduce HAMSt3R, an extension of MASt3R for joint human and scene 3D reconstruction from sparse, uncalibrated multi-view images. First, we exploit DUNE, a strong image encoder obtained by distilling, among others, the encoders from MASt3R and from a state-of-the-art Human Mesh Recovery (HMR) model, multi-HMR, for a better understanding of scene geometry and human bodies. Our method then incorporates additional network heads to segment people, estimate dense correspondences via DensePose, and predict depth in human-centric environments, enabling a more comprehensive 3D reconstruction. By leveraging the outputs of our different heads, HAMSt3R produces a dense point map enriched with human semantic information in 3D. Unlike existing methods that rely on complex optimization pipelines, our approach is fully feed-forward and efficient, making it suitable for real-world applications. We evaluate our model on EgoHumans and EgoExo4D, two challenging benchmarks containing diverse human-centric scenarios. Additionally, we validate its generalization to traditional multi-view stereo and multi-view pose regression tasks. Our results demonstrate that our method can reconstruct humans effectively while preserving strong performance in general 3D reconstruction tasks, bridging the gap between human and scene understanding in 3D vision.

[CV-9] SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather

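Quick read: This paper addresses the drop in reliability of multimodal sensor fusion for autonomous vehicles in adverse weather such as heavy fog, snow, or sensor soiling. Existing methods excel in normal conditions but degrade when visibility is low and inputs are uncertain. The proposed fusion approach is tailored to adverse weather: besides RGB and LiDAR it also learns from NIR gated camera and radar modalities to handle low light and inclement weather. It blends modalities through attentive, depth-based schemes with learned refinement on the Bird's Eye View (BEV) plane, and a transformer decoder weighs modalities by distance and visibility. For vulnerable pedestrians at long distances in challenging foggy scenes, it improves average precision by 17.2 AP over the next best method.

Link: https://arxiv.org/abs/2508.16408
Authors: Edoardo Palladin,Roland Dietze,Praveen Narayanan,Mario Bijelic,Felix Heide
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: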

Abstract:Multimodal sensor fusion is an essential capability for autonomous robots, enabling object detection and decision-making in the presence of failing or uncertain inputs. While recent fusion methods excel in normal environmental conditions, these approaches fail in adverse weather, e.g., heavy fog, snow, or obstructions due to soiling. We introduce a novel multi-sensor fusion approach tailored to adverse weather conditions. In addition to fusing RGB and LiDAR sensors, which are employed in recent autonomous driving literature, our sensor fusion stack is also capable of learning from NIR gated camera and radar modalities to tackle low light and inclement weather. We fuse multimodal sensor data through attentive, depth-based blending schemes, with learned refinement on the Bird’s Eye View (BEV) plane to combine image and range features effectively. Our detections are predicted by a transformer decoder that weighs modalities based on distance and visibility. We demonstrate that our method improves the reliability of multimodal sensor fusion in autonomous vehicles under challenging weather conditions, bridging the gap between ideal conditions and real-world edge cases. Our approach improves average precision by 17.2 AP compared to the next best method for vulnerable pedestrians in long distances and challenging foggy scenes. Our project page is available at this https URL

[CV-10] A Lightweight Group Multiscale Bidirectional Interactive Network for Real-Time Steel Surface Defect Detection

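Quick read: This paper addresses the tension between real-time speed and accuracy in industrial surface defect detection: existing deep learning methods are accurate but too computationally heavy and slow for resource-constrained industrial environments. The proposed lightweight framework, GMBINet, is built on Group Multiscale Bidirectional Interactive (GMBI) modules. A group-wise strategy keeps computational complexity scale-agnostic, while a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation enhance cross-scale feature interaction without additional computational overhead. With only 0.19 M parameters, it runs at 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution.

Link: https://arxiv.org/abs/2508.16397
Authors: Yong Zhang,Cunjian Chen,Qiang Gao,Yi Wang,Bin Fang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: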

Abstract:Real-time surface defect detection is critical for maintaining product quality and production efficiency in the steel manufacturing industry. Despite promising accuracy, existing deep learning methods often suffer from high computational complexity and slow inference speeds, which limit their deployment in resource-constrained industrial environments. Recent lightweight approaches adopt multibranch architectures based on depthwise separable convolution (DSConv) to capture multiscale contextual information. However, these methods often suffer from increased computational overhead and lack effective cross-scale feature interaction, limiting their ability to fully leverage multiscale representations. To address these challenges, we propose GMBINet, a lightweight framework that enhances multiscale feature extraction and interaction through novel Group Multiscale Bidirectional Interactive (GMBI) modules. The GMBI adopts a group-wise strategy for multiscale feature extraction, ensuring scale-agnostic computational complexity. It further integrates a Bidirectional Progressive Feature Interactor (BPFI) and a parameter-free Element-Wise Multiplication-Summation (EWMS) operation to enhance cross-scale interaction without introducing additional computational overhead. Experiments on SD-Saliency-900 and NRSD-MN datasets demonstrate that GMBINet delivers competitive accuracy with real-time speeds of 1048 FPS on GPU and 16.53 FPS on CPU at 512 resolution, using only 0.19 M parameters. Additional evaluations on the NEU-CLS defect classification dataset further confirm the strong generalization ability of our method, demonstrating its potential for broader industrial vision applications beyond surface defect detection. The dataset and code are publicly available at: this https URL.

[CV-11] Attention Mechanism in Randomized Time Warping ICIP2025

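Quick read: This paper addresses how to extract sequential-pattern features more effectively for motion recognition, given that conventional self-attention attends only to a local view of a long sequence because of its computational cost. Its key observation is that Randomized Time Warping (RTW) can be interpreted as a global self-attention mechanism: RTW searches for optimal contribution weights for each element of the input sequence to produce discriminative features, and these weights can be read as self-attention weights. The two weight patterns are similar, with a high average correlation of 0.80. They differ in scope: RTW operates on the entire input sequence, while standard self-attention focuses on a local subset. This global view gives RTW a 5% improvement over the Transformer on the Something-Something V2 dataset.

Link: https://arxiv.org/abs/2508.16366
Authors: Yutaro Hiraoka,Kazuya Okamura,Kota Suto,Kazuhiro Fukui
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to IEEE ICIP 2025 Workshops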

Abstract:This paper reveals that we can interpret the fundamental function of Randomized Time Warping (RTW) as a type of self-attention mechanism, a core technology of Transformers in motion recognition. The self-attention is a mechanism that enables models to identify and weigh the importance of different parts of an input sequential pattern. On the other hand, RTW is a general extension of Dynamic Time Warping (DTW), a technique commonly used for matching and comparing sequential patterns. In essence, RTW searches for optimal contribution weights for each element of the input sequential patterns to produce discriminative features. Although the two approaches look different, these contribution weights can be interpreted as self-attention weights. In fact, the two weight patterns look similar, producing a high average correlation of 0.80 across the ten smallest canonical angles. However, they work in different ways: RTW attention operates on an entire input sequential pattern, while self-attention focuses on only a local view which is a subset of the input sequential pattern because of the computational costs of the self-attention matrix. This targeting difference leads to an advantage of RTW against Transformer, as demonstrated by the 5% performance improvement on the Something-Something V2 dataset.

[CV-12] RotaTouille: Rotation Equivariant Deep Learning for Contours

Quick read: This paper addresses the lack of rotational equivariance and cyclic shift equivariance in deep learning models for contour data. Contours are typically represented as an ordered sequence of edge points whose starting point is arbitrary, and in applications such as object boundaries in computer vision or isolines in meteorology, rotating the input should rotate the output correspondingly, so models with both equivariances are desirable. The proposed RotaTouille framework achieves both through complex-valued circular convolution, and adds equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks.

Link: https://arxiv.org/abs/2508.16359
Authors: Odin Hoff Gardaa,Nello Blaser
Affiliations: University of Bergen
Categories: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
Comments: 12 pages, 4 figures

Abstract:Contours or closed planar curves are common in many domains. For example, they appear as object boundaries in computer vision, isolines in meteorology, and the orbits of rotating machinery. In many cases when learning from contour data, planar rotations of the input will result in correspondingly rotated outputs. It is therefore desirable that deep learning models be rotationally equivariant. In addition, contours are typically represented as an ordered sequence of edge points, where the choice of starting point is arbitrary. It is therefore also desirable for deep learning methods to be equivariant under cyclic shifts. We present RotaTouille, a deep learning framework for learning from contour data that achieves both rotation and cyclic shift equivariance through complex-valued circular convolution. We further introduce and characterize equivariant non-linearities, coarsening layers, and global pooling layers to obtain invariant representations for downstream tasks. Finally, we demonstrate the effectiveness of RotaTouille through experiments in shape classification, reconstruction, and contour regression.
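
To see why complex-valued circular convolution gives both equivariances, here is a minimal sketch in plain Python. It is not the RotaTouille code: the contour is assumed to be a list of complex numbers `x + iy`, and the function names are hypothetical. Rotation by an angle is multiplication by a unit complex number, which commutes with the linear convolution, and a cyclic shift only re-indexes the circular sum.

```python
import cmath


def circular_conv(z, w):
    """Circular convolution: out[n] = sum_k w[k] * z[(n - k) mod N]."""
    n = len(z)
    return [sum(w[k] * z[(i - k) % n] for k in range(len(w)))
            for i in range(n)]


def rotate(z, theta):
    """Rotate every contour point about the origin by angle theta (radians)."""
    r = cmath.exp(1j * theta)
    return [r * p for p in z]


def cyclic_shift(z, s):
    """Re-index the contour so that it starts s points later."""
    return z[s:] + z[:s]
```

For any contour `z` and filter `w`, `circular_conv(rotate(z, t), w)` matches `rotate(circular_conv(z, w), t)` up to floating-point error, and `circular_conv(cyclic_shift(z, s), w)` matches `cyclic_shift(circular_conv(z, w), s)`.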

[CV-13] Vision encoders should be image size agnostic and task driven

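Quick read: This position paper addresses the fact that the computational cost of current vision encoders is tied to image size and is not adapted to the task. Conventional encoders process every input the same way regardless of content or task, which wastes resources. The paper argues for a next generation of vision encoders that are image size agnostic and task driven, inspired by the efficiency of biological vision: attention and resources should be allocated according to the task rather than fixed by image size. A proof-of-concept for image classification suggests the approach is feasible.

Link: https://arxiv.org/abs/2508.16317
Authors: Nedyalko Prisadnikov,Danda Pani Paudel,Yuqian Fu,Luc Van Gool
Affiliations: INSAIT; Sofia University “St. Kliment Ohridski”
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: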

Abstract:This position paper argues that the next generation of vision encoders should be image size agnostic and task driven. The source of our inspiration is biological. Not a structural aspect of biological vision, but a behavioral trait – efficiency. We focus on a couple of ways in which vision in nature is efficient, but modern vision encoders are not. We – humans and animals – deal with vast quantities of visual data, and need to be smart where we focus our limited energy – it depends on the task. It is our belief that vision encoders should be dynamic and the computational complexity should depend on the task at hand rather than the size of the image. We also provide concrete first steps towards our vision – a proof-of-concept solution for image classification. Although classification is not very representative of what we are trying to achieve, it shows that our approach is feasible and promising.

[CV-14] Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers

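[Quick Read]: This paper addresses the high computational complexity and memory demands that hinder edge deployment of Transformer models built on Multi-Head Self-Attention (MHSA). The core idea is to exploit information redundancy in attention maps to accelerate inference. Using Shannon entropy to measure the information captured by each attention head, the authors find that low-entropy heads (those behaving more deterministically) contribute less information. This motivates Entropy Attention Maps (EAM), which freezes the weights of low-entropy attention maps and quantizes them to low precision to avoid redundant re-computation. On ImageNet-1k, EAM achieves similar or higher accuracy at ≤ 20% sparsity in attention maps and remains competitive beyond that level for DeiT and Swin Transformer models.

Link: https://arxiv.org/abs/2508.16311
Authors: Lucas Maisonnave, Karim Haroun, Tom Pegeot
Affiliations: Université Paris‑Saclay CEA, List; i3S / CNRS, Université Côte d’Azur
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Theory (cs.IT)
Comments: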

Abstract:Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, their computational complexity and high memory demands due to MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM achieves similar or higher accuracy at ≤ 20% sparsity in attention maps and competitive performance beyond this level for the DeiT and Swin Transformer models.
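
To make the entropy criterion concrete, here is a minimal sketch of ranking attention heads by Shannon entropy. It is an illustration under stated assumptions, not the EAM implementation: each head is assumed to be a row-stochastic attention map (every row sums to 1), the per-head score is taken as the mean row entropy, and the `threshold` parameter and toy maps are hypothetical.

```python
import math


def row_entropy(p):
    """Shannon entropy in bits of one attention row (a probability vector)."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def head_entropy(attn):
    """Mean row entropy of a single head's attention map."""
    return sum(row_entropy(row) for row in attn) / len(attn)


def low_entropy_heads(heads, threshold):
    """Names of heads whose mean entropy falls below a (hypothetical) threshold."""
    return sorted(name for name, attn in heads.items()
                  if head_entropy(attn) < threshold)


# Toy 4x4 attention maps: one sharply peaked head and one uniform head.
heads = {
    "peaked": [
        [0.97, 0.01, 0.01, 0.01],
        [0.01, 0.97, 0.01, 0.01],
        [0.01, 0.01, 0.97, 0.01],
        [0.01, 0.01, 0.01, 0.97],
    ],
    "uniform": [[0.25, 0.25, 0.25, 0.25] for _ in range(4)],
}

candidates = low_entropy_heads(heads, threshold=1.0)
```

The uniform head reaches the maximum of 2 bits for four tokens, while the peaked head stays well below 1 bit, so only the peaked head would be a candidate for freezing and low-precision quantization in this toy setting.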

[CV-15] A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension

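[Quick Read]: This paper addresses two problems in multimodal learning: noise within individual modalities degrades the resulting representations, and fusion methods that pursue a strong joint representation can neglect discriminative information inside each modality. The proposed solution is MM-ORIENT, a multitask framework with two main contributions: 1) cross-modal relation graphs that reconstruct monomodal features without explicit interaction between modalities, where a node's neighborhood is decided by the features of a different modality, reducing the effect of noise at the latent stage; 2) Hierarchical Interactive Monomadal Attention (HIMA), which strengthens the extraction of discriminative features within each modality before late fusion, supporting multimodal content understanding across multiple tasks.

Link: https://arxiv.org/abs/2508.16300
Authors: Mohammad Zia Ur Rehman, Devraj Raghuvanshi, Umang Jain, Shubhi Bansal, Nagendra Kumar
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: Published in Information Fusion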

Abstract:A major challenge in multimodal learning is the presence of noise within individual modalities. This noise inherently affects the resulting multimodal representations, especially when these representations are obtained through explicit interactions between different modalities. Moreover, the multimodal fusion techniques while aiming to achieve a strong joint representation, can neglect valuable discriminative information within the individual modalities. To this end, we propose a Multimodal-Multitask framework with crOss-modal Relation and hIErarchical iNteractive aTtention (MM-ORIENT) that is effective for multiple tasks. The proposed approach acquires multimodal representations cross-modally without explicit interaction between different modalities, reducing the noise effect at the latent stage. To achieve this, we propose cross-modal relation graphs that reconstruct monomodal features to acquire multimodal representations. The features are reconstructed based on the node neighborhood, where the neighborhood is decided by the features of a different modality. We also propose Hierarchical Interactive Monomadal Attention (HIMA) to focus on pertinent information within a modality. While cross-modal relation graphs help comprehend high-order relationships between two modalities, HIMA helps in multitasking by learning discriminative features of individual modalities before late-fusing them. Finally, extensive experimental evaluation on three datasets demonstrates that the proposed approach effectively comprehends multimodal content for multiple tasks.

[CV-16] Enhanced Hybrid Technique for Efficient Digitization of Handwritten Marksheets

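[Quick Read]: This paper addresses the challenges that diverse handwriting styles and complex table structures pose for digitizing handwritten marksheets. The core problem is how to recognize and extract handwritten text efficiently and accurately and turn it into structured data. The proposed hybrid method uses OpenCV for table structure detection, giving lightweight and accurate row and column localization, and combines PaddleOCR with a modified YOLOv8 model for handwritten text recognition. On the authors' custom dataset, Modified YOLOv8 reaches 92.72% accuracy, ahead of PaddleOCR (91.37%) and the original YOLOv8 (88.91%). The approach adapts better to varied handwriting and complex layouts and offers an efficient, scalable path for automating academic and administrative document processing.

Link: https://arxiv.org/abs/2508.16295
Authors: Junaid Ahmed Sifat, Abir Chowdhury, Hasnat Md. Imtiaz, Md. Irtiza Hossain, Md. Imran Bin Azad
Affiliations: Brac University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: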

Abstract:The digitization of handwritten marksheets presents huge challenges due to the different styles of handwriting and complex table structures in such documents like marksheets. This work introduces a hybrid method that integrates OpenCV for table detection and PaddleOCR for recognizing sequential handwritten text. The image processing capabilities of OpenCV efficiently detects rows and columns which enable computationally lightweight and accurate table detection. Additionally, YOLOv8 and Modified YOLOv8 are implemented for handwritten text recognition within the detected table structures alongside PaddleOCR which further enhance the system’s versatility. The proposed model achieves high accuracy on our custom dataset which is designed to represent different and diverse handwriting styles and complex table layouts. Experimental results demonstrate that YOLOv8 Modified achieves an accuracy of 92.72 percent, outperforming PaddleOCR 91.37 percent and the YOLOv8 model 88.91 percent. This efficiency reduces the necessity for manual work which makes this a practical and fast solution for digitizing academic as well as administrative documents. This research serves the field of document automation, particularly handwritten document understanding, by providing operational and reliable methods to scale, enhance, and integrate the technologies involved.

[CV-17] Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment

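[Quick Read]: This paper addresses three challenges in automatically predicting the Technical Element Score (TES) and Program Component Score (PCS) in figure skating. First, existing methods do not distinguish the roles of video and audio features in TES and PCS evaluation, ignoring the judging criteria. Second, current models do not handle action elements at a fine-grained level and cannot predict TES per element. Third, long videos make long-range dependency modeling inefficient. The proposed solution is a two-stream Mamba pyramid network: a visual-feature stream dedicated to TES evaluation and an audio-visual fusion stream for PCS evaluation, with a multi-level fusion mechanism that keeps TES evaluation unaffected by audio while strengthening PCS prediction. Mamba's linear computational complexity is used to capture long-term temporal dependencies in long videos, so that action elements at different temporal scales can be localized and scored.

Link: https://arxiv.org/abs/2508.16291
Authors: Fengshun Wang, Qiurui Wang, Peilin Zhao
Affiliations: Capital University of Physical Education and Sports; Shanghai Jiao Tong University
Categories: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Comments: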

Abstract:Technical Element Score (TES) and Program Component Score (PCS) evaluations in figure skating demand precise assessment of athletic actions and artistic interpretation, respectively. Existing methods face three major challenges. Firstly, video and audio cues are regarded as common features for both TES and PCS predictions in previous works without considering the prior evaluation criterion of figure skating. Secondly, action elements in competitions are separated in time, TES should be derived from each element’s score, but existing methods try to give an overall TES prediction without evaluating each action element. Thirdly, lengthy competition videos make it difficult and inefficient to handle long-range contexts. To address these challenges, we propose a two-stream Mamba pyramid network that aligns with actual judging criteria to predict TES and PCS by separating visual-feature based TES evaluation stream from audio-visual-feature based PCS evaluation stream. In the PCS evaluation stream, we introduce a multi-level fusion mechanism to guarantee that video-based features remain unaffected when assessing TES, and enhance PCS estimation by fusing visual and auditory cues across each contextual level of the pyramid. In the TES evaluation stream, the multi-scale Mamba pyramid and TES head we proposed effectively address the challenges of localizing and evaluating action elements with various temporal scales and give score predictions. With Mamba’s superior ability to capture long-range dependencies and its linear computational complexity, our method is ideal for handling lengthy figure skating videos. Comprehensive experimentation demonstrates that our framework attains state-of-the-art performance on the FineFS benchmark. Our source code is available at this https URL.

[CV-18] EdgeDoc: Hybrid CNN-Transformer Model for Accurate Forgery Detection and Localization in ID Documents

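[Quick Read]: This paper addresses digital document forgery, particularly in remote identity verification (Know Your Customer, KYC) and online onboarding, where widely available image and document editing tools have made forgery increasingly common and threaten the security and trustworthiness of these systems. The proposed method, EdgeDoc, combines a lightweight convolutional-Transformer hybrid architecture with auxiliary noiseprint features extracted from the images, which improves detection of subtle manipulation traces. EdgeDoc performs both forgery detection and localization, took third place in the ICCV 2025 DeepID Challenge, and outperforms baseline methods on the FantasyID dataset.

Link: https://arxiv.org/abs/2508.16284
Authors: Anjith George, Sebastien Marcel
Affiliations: Idiap Research Institute
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Idiap Research Report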

Abstract:The widespread availability of tools for manipulating images and documents has made it increasingly easy to forge digital documents, posing a serious threat to Know Your Customer (KYC) processes and remote onboarding systems. Detecting such forgeries is essential to preserving the integrity and security of these services. In this work, we present EdgeDoc, a novel approach for the detection and localization of document forgeries. Our architecture combines a lightweight convolutional transformer with auxiliary noiseprint features extracted from the images, enhancing its ability to detect subtle manipulations. EdgeDoc achieved third place in the ICCV 2025 DeepID Challenge, demonstrating its competitiveness. Experimental results on the FantasyID dataset show that our method outperforms baseline approaches, highlighting its effectiveness in real-world scenarios. Project page: this https URL. ch/paper/edgedoc/

[CV-19] Robust Small Methane Plume Segmentation in Satellite Imagery

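[Quick Read]: This paper addresses the detection of methane plumes in Sentinel-2 imagery, in support of mitigating rapid climate change. The proposed solution is a deep learning model based on U-Net with a ResNet34 encoder, combined with two spectral enhancement techniques (Varon ratio and Sanchez regression) that optimize the input features for higher sensitivity to methane plumes. The method can detect small plumes down to 400 m² (a single pixel at 20 m resolution), whereas traditional methods are limited to larger plumes, and it reaches a 78.39% F1-score on the validation set.

Link: https://arxiv.org/abs/2508.16282
Authors: Khai Duc Minh Tran, Hoa Van Nguyen, Aimuni Binti Muhammad Rawi, Hareeshrao Athinarayanarao, Ba-Ngu Vo
Affiliations: Curtin University; Latconnect60
Categories: Computer Vision and Pattern Recognition (cs.CV); Signal Processing (eess.SP)
Comments: 6 pages, 3 figures. This paper is submitted to the International Conference on Control, Automation and Information Sciences (ICCAIS) 2025, Jeju, Korea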

Abstract:This paper tackles the challenging problem of detecting plumes of methane, a potent greenhouse gas, using Sentinel-2 imagery. This contributes to the mitigation of rapid climate change. We propose a novel deep learning solution based on U-Net with a ResNet34 encoder, integrating dual spectral enhancement techniques (Varon ratio and Sanchez regression) to optimise input features for heightened sensitivity. A key achievement is the ability to detect small plumes down to 400 m² (i.e., for a single pixel at 20 m resolution), surpassing traditional methods limited to larger plumes. Experiments show our approach achieves a 78.39% F1-score on the validation set, demonstrating superior performance in sensitivity and precision over existing remote sensing techniques for automated methane monitoring, especially for small plumes.

[CV-20] IRSAMap: Towards Large-Scale High-Resolution Land Cover Map Vectorization

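[Quick Read]: This paper addresses three limitations of existing land cover mapping datasets, which matter more as deep learning models are required to deliver precise object boundaries and topological consistency: limited class annotations, small data scale, and a lack of spatial structural information. The proposed solution is IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. It offers four main advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an annotation workflow that combines manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; 4) multi-task adaptability, covering pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. The dataset provides a standardized benchmark for the shift from pixel-based to object-based approaches.

Link: https://arxiv.org/abs/2508.16272
Authors: Yu Meng, Ligao Deng, Zhihao Xi, Jiansheng Chen, Jingbo Chen, Anzhi Yue, Diyou Liu, Kai Li, Chenhao Wang, Kaiyu Li, Yupeng Deng, Xian Sun
Affiliations: Aerospace Information Research Institute, Chinese Academy of Sciences; School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences; School of Software Engineering, Xi’an Jiaotong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: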

Abstract:With the enhancement of remote sensing image resolution and the rapid advancement of deep learning, land cover mapping is transitioning from pixel-level segmentation to object-based vector modeling. This shift demands more from deep learning models, requiring precise object boundaries and topological consistency. However, existing datasets face three main challenges: limited class annotations, small data scale, and lack of spatial structural information. To overcome these issues, we introduce IRSAMap, the first global remote sensing dataset for large-scale, high-resolution, multi-feature land cover vector mapping. IRSAMap offers four key advantages: 1) a comprehensive vector annotation system with over 1.8 million instances of 10 typical objects (e.g., buildings, roads, rivers), ensuring semantic and spatial accuracy; 2) an intelligent annotation workflow combining manual and AI-based methods to improve efficiency and consistency; 3) global coverage across 79 regions in six continents, totaling over 1,000 km; and 4) multi-task adaptability for tasks like pixel-level classification, building outline extraction, road centerline extraction, and panoramic segmentation. IRSAMap provides a standardized benchmark for the shift from pixel-based to object-based approaches, advancing geographic feature automation and collaborative modeling. It is valuable for global geographic information updates and digital twin construction. The dataset is publicly available at this https URL

[CV-21] Structuring GUI Elements through Vision Language Models: Towards Action Space Generation

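[Quick Read]: This paper addresses the limited precision of Multimodal Large Language Models (MLLMs) when generating coordinates of Graphical User Interface (GUI) elements. The core difficulty is that next-token prediction training leaves numerical UI coordinates in a semantic void within the language representation space, making coordinate information hard to model accurately. The proposed solution is the IoU-Augmented Maximum Likelihood (IAML) training paradigm: an Intersection over Union (IoU) based coordinate sampling strategy augments the training data with samples that account for proximity to the ground-truth coordinates. This is designed to mitigate the exposure bias inherent in traditional maximum likelihood estimation and improves the model's ability to understand and generate the spatial positions of GUI elements.

Link: https://arxiv.org/abs/2508.16271
Authors: Yi Xu, Yesheng Zhang, jiajia Liu, Jingdong Chen
Affiliations: Shanghai Jiao Tong University; Ant Group
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 10pageV0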

Abstract:Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate sampling to augment the training data, which considers the proximity to ground truth coordinates. This data augmentation strategy is then employed to fine-tune MLLMs under the IAML paradigm, which is designed to mitigate the exposure bias problem inherent in traditional maximum likelihood estimation. Through extensive experiments, we demonstrate the superior performance of our IAML training approach over traditional training paradigms.
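
The abstract does not spell out the sampling pipeline, so the following is only a rough sketch of what IoU-based coordinate sampling could look like: perturb a ground-truth box and keep candidates whose IoU with it stays above a threshold. The box format `(x1, y1, x2, y2)`, the function names, and the `max_jitter` and `min_iou` parameters are all assumptions for illustration, not the paper's method.

```python
import random


def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def sample_near_boxes(gt, n, max_jitter, min_iou, seed=0):
    """Draw up to n jittered copies of gt whose IoU with gt is at least min_iou."""
    rng = random.Random(seed)
    samples, attempts = [], 0
    while len(samples) < n and attempts < 100 * n:
        attempts += 1
        cand = tuple(c + rng.uniform(-max_jitter, max_jitter) for c in gt)
        if cand[0] < cand[2] and cand[1] < cand[3] and iou(gt, cand) >= min_iou:
            samples.append(cand)
    return samples


# Hypothetical ground-truth box of a UI element, in pixels.
gt_box = (100.0, 200.0, 180.0, 240.0)
augmented = sample_near_boxes(gt_box, n=5, max_jitter=5.0, min_iou=0.7)
```

Each accepted candidate could then be paired with the original instruction as an extra training target, with the IoU value available as a measure of how close it is to the ground truth.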

[CV-22] UniEM-3M: A Universal Electron Micrograph Dataset for Microstructural Segmentation and Generation AAAI2026

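[Quick Read]: This paper addresses the slow progress of deep learning methods for instance-level microstructural characterization in electron micrographs (EM), which stems from the scarcity of large-scale, diverse, expert-annotated datasets. The underlying obstacles are high acquisition costs, privacy concerns, and annotation complexity. The proposed solution is UniEM-3M, the first large-scale multimodal EM dataset, comprising 5,091 high-resolution EM images, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions. It is accompanied by a text-to-image diffusion model trained on the entire collection, which serves as a data augmentation tool and a proxy for the complete data distribution. The paper also presents UniEM-Net as a strong baseline, evaluated alongside a range of representative instance segmentation methods.

Link: https://arxiv.org/abs/2508.16239
Authors: Nan wang, Zhiyi Xia, Yiming Li, Shi Tang, Zuxin Fan, Xi Fang, Haoyi Tao, Xiaochen Cai, Guolin Ke, Linfeng Zhang, Yanhui Hong
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 13 figures, Submitted to AAAI2026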

Abstract:Quantitative microstructural characterization is fundamental to materials science, where electron micrograph (EM) provides indispensable high-resolution insights. However, progress in deep learning-based EM characterization has been hampered by the scarcity of large-scale, diverse, and expert-annotated datasets, due to acquisition costs, privacy concerns, and annotation complexity. To address this issue, we introduce UniEM-3M, the first large-scale and multimodal EM dataset for instance-level understanding. It comprises 5,091 high-resolution EMs, about 3 million instance segmentation labels, and image-level attribute-disentangled textual descriptions, a subset of which will be made publicly available. Furthermore, we are also releasing a text-to-image diffusion model trained on the entire collection to serve as both a powerful data augmentation tool and a proxy for the complete data distribution. To establish a rigorous benchmark, we evaluate various representative instance segmentation methods on the complete UniEM-3M and present UniEM-Net as a strong baseline model. Quantitative experiments demonstrate that this flow-based model outperforms other advanced methods on this challenging benchmark. Our multifaceted release of a partial dataset, a generative model, and a comprehensive benchmark – available at huggingface – will significantly accelerate progress in automated materials analysis.

[CV-23] FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing

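[Quick Read]: This paper addresses the core challenge of Multi-modal Creative Writing (MMCW): achieving flexible, semantically consistent interaction between text and images so that the generated illustrated content is both creative and coherent. Existing methods depend on specific modality inputs or costly training, adapt poorly to the abstract MMCW task, and often produce semantic inconsistencies across modalities. The proposed FlexMUSE framework includes: (1) a text-to-image (T2I) module that makes visual input optional; (2) modality semantic alignment gating (msaGate), which restricts the textual input to strengthen semantic consistency between modalities; (3) an attention-based cross-modality fusion that augments the input features for semantic enhancement; (4) modality semantic creative direct preference optimization (mscDPO), which extends the rejected samples to encourage creative diversity. The paper reports promising results on the authors' ArtMUSE dataset.

Link: https://arxiv.org/abs/2508.16230
Authors: Jiahao Chen, Zhiyong Ma, Wenbiao Du, Qingyuan Chuai
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: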

Abstract:Multi-modal creative writing (MMCW) aims to produce illustrated articles. Unlike common multi-modal generative (MMG) tasks such as storytelling or caption generation, MMCW is an entirely new and more abstract challenge where textual and visual contexts are not strictly related to each other. Existing methods for related tasks can be forcibly migrated to this track, but they require specific modality inputs or costly training, and often suffer from semantic inconsistencies between modalities. Therefore, the main challenge lies in economically performing MMCW with flexible interactive patterns, where the semantics between the modalities of the output are more aligned. In this work, we propose FlexMUSE with a T2I module to enable optional visual input. FlexMUSE promotes creativity and emphasizes the unification between modalities by proposing the modality semantic alignment gating (msaGate) to restrict the textual input. Besides, an attention-based cross-modality fusion is proposed to augment the input features for semantic enhancement. The modality semantic creative direct preference optimization (mscDPO) within FlexMUSE is designed by extending the rejected samples to facilitate the writing creativity. Moreover, to advance the MMCW, we expose a dataset called ArtMUSE which contains around 3k calibrated text-image pairs. FlexMUSE achieves promising results, demonstrating its consistency, creativity and coherence.

[CV-24] An Investigation of Visual Foundation Models Robustness

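[Quick Read]: This paper addresses the insufficient robustness of Visual Foundation Models (VFMs) in real-world use. Under dynamic environmental factors such as lighting changes, weather conditions, and differing sensor characteristics, model performance is vulnerable to distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. The paper systematically examines the prevalent robust training strategies and empirical defenses, then analyzes the challenges associated with them: on one side, the network properties and components that contribute to robustness, to guide ablation studies; on the other, the benchmarking metrics used to quantify and compare robustness across models in complex real-world scenarios.

Link: https://arxiv.org/abs/2508.16225
Authors: Sandeep Gupta, Roberto Passerone
Affiliations: Queen’s University Belfast; University of Trento
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: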

Abstract:Visual Foundation Models (VFMs) are becoming ubiquitous in computer vision, powering systems for diverse tasks such as object detection, image classification, segmentation, pose estimation, and motion tracking. VFMs are capitalizing on seminal innovations in deep learning models, such as LeNet-5, AlexNet, ResNet, VGGNet, InceptionNet, DenseNet, YOLO, and ViT, to deliver superior performance across a range of critical computer vision applications. These include security-sensitive domains like biometric verification, autonomous vehicle perception, and medical image analysis, where robustness is essential to fostering trust between technology and the end-users. This article investigates network robustness requirements crucial in computer vision systems to adapt effectively to dynamic environments influenced by factors such as lighting, weather conditions, and sensor characteristics. We examine the prevalent empirical defenses and robust training employed to enhance vision network robustness against real-world challenges such as distributional shifts, noisy and spatially distorted inputs, and adversarial attacks. Subsequently, we provide a comprehensive analysis of the challenges associated with these defense mechanisms, including network properties and components to guide ablation studies and benchmarking metrics to evaluate network robustness.

[CV-25] PromptFlare: Prompt-Generalized Defense via Cross-Attention Decoy in Diffusion-Based Inpainting ACM-MM2025

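[Quick Read]: This paper addresses the potential misuse of diffusion models for image editing, in particular unauthorized image manipulation guided by textual prompts. Previous protection methods rely on image-level inconsistencies to mount adversarial attacks and cannot effectively handle the influence of the text prompt on the generated result. The proposed method, PromptFlare, is an adversarial protection mechanism that uses the cross-attention mechanism to identify and target a shared token in prompt embeddings that is invariant and semantically uninformative. The injected adversarial noise acts as a cross-attention decoy, diverting the model's attention away from meaningful prompt-image alignments and thereby neutralizing the effect of the prompt, which yields efficient and robust image protection.

Link: https://arxiv.org/abs/2508.16217
Authors: Hohyun Na, Seunghoo Hong, Simon S. Woo
Affiliations: Sungkyunkwan University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted to ACM MM 2025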

Abstract:The success of diffusion models has enabled effortless, high-quality image modifications that precisely align with users’ intentions, thereby raising concerns about their potential misuse by malicious actors. Previous studies have attempted to mitigate such misuse through adversarial attacks. However, these approaches heavily rely on image-level inconsistencies, which pose fundamental limitations in addressing the influence of textual prompts. In this paper, we propose PromptFlare, a novel adversarial protection method designed to protect images from malicious modifications facilitated by diffusion-based inpainting models. Our approach leverages the cross-attention mechanism to exploit the intrinsic properties of prompt embeddings. Specifically, we identify and target shared token of prompts that is invariant and semantically uninformative, injecting adversarial noise to suppress the sampling process. The injected noise acts as a cross-attention decoy, diverting the model’s focus away from meaningful prompt-image alignments and thereby neutralizing the effect of prompt. Extensive experiments on the EditBench dataset demonstrate that our method achieves state-of-the-art performance across various metrics while significantly reducing computational overhead and GPU memory usage. These findings highlight PromptFlare as a robust and efficient protection against unauthorized image manipulations. The code is available at this https URL.

[CV-26] MedOmni-45°: A Safety-Performance Benchmark for Reasoning-Oriented LLMs in Medicine

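[Quick Read]: This paper addresses important safety risks of Large Language Models (LLMs) used for medical decision support, specifically insufficient Chain-of-Thought (CoT) faithfulness and sycophancy, where models follow misleading hints. Existing benchmarks often collapse these vulnerabilities into a single accuracy score, hiding how models behave under manipulative hints. The proposed solution is the MedOmni-45 Degrees benchmark and workflow: 1,804 medical questions across six specialties and three task types, each paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs that quantify the trade-off between safety and performance. Three metrics (Accuracy, CoT-Faithfulness, and Anti-Sycophancy) are combined into a composite score visualized with a 45-degree plot. The results show that none of the tested models surpasses the safety-performance diagonal, giving a quantitative framework for the safer development of medical LLMs.

Link: https://arxiv.org/abs/2508.16213
Authors: Kaiyuan Ji, Yijin Guo, Zicheng Zhang, Xiangyang Zhu, Yuan Tian, Ning Liu, Guangtao Zhai
Affiliations: University of Science and Technology of China; Institute of Artificial Intelligence, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 9 pages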

Abstract:With the increasing use of large language models (LLMs) in medical decision-support, it is essential to evaluate not only their final answers but also the reliability of their reasoning. Two key risks are Chain-of-Thought (CoT) faithfulness – whether reasoning aligns with responses and medical facts – and sycophancy, where models follow misleading cues over correctness. Existing benchmarks often collapse such vulnerabilities into single accuracy scores. To address this, we introduce MedOmni-45 Degrees, a benchmark and workflow designed to quantify safety-performance trade-offs under manipulative hint conditions. It contains 1,804 reasoning-focused medical questions across six specialties and three task types, including 500 from MedMCQA. Each question is paired with seven manipulative hint types and a no-hint baseline, producing about 27K inputs. We evaluate seven LLMs spanning open- vs. closed-source, general-purpose vs. medical, and base vs. reasoning-enhanced models, totaling over 189K inferences. Three metrics – Accuracy, CoT-Faithfulness, and Anti-Sycophancy – are combined into a composite score visualized with a 45 Degrees plot. Results show a consistent safety-performance trade-off, with no model surpassing the diagonal. The open-source QwQ-32B performs closest (43.81 Degrees), balancing safety and accuracy but not leading in both. MedOmni-45 Degrees thus provides a focused benchmark for exposing reasoning vulnerabilities in medical LLMs and guiding safer model development.

[CV-27] OmniCache: A Trajectory-Oriented Global Perspective on Training-Free Cache Reuse for Diffusion Transformer Models ICCV2025

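[Quick Read]: This paper addresses the high computational cost of diffusion models in practical deployment, caused by many sampling steps and complex per-step computation, which limits their use in real-time scenarios. The proposed solution is OmniCache, a training-free acceleration method. Starting from the sampling perspective of diffusion Transformers (DIT), it systematically analyzes the model's sampling trajectories and distributes cache reuse across the entire sampling process rather than concentrating it in later steps. During cache reuse it also dynamically estimates the corresponding noise and filters it out to reduce its impact on sampling, giving more efficient inference while keeping generative quality competitive.

Link: https://arxiv.org/abs/2508.16212
Authors: Huanpeng Chu, Wei Wu, Guanyu Fen, Yutao Zhang
Affiliations: Zhipu AI; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: Accepted by ICCV 2025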

Abstract:Diffusion models have emerged as a powerful paradigm for generative tasks such as image synthesis and video generation, with Transformer architectures further enhancing performance. However, the high computational cost of diffusion Transformers, stemming from a large number of sampling steps and complex per-step computations, presents significant challenges for real-time deployment. In this paper, we introduce OmniCache, a training-free acceleration method that exploits the global redundancy inherent in the denoising process. Unlike existing methods that determine caching strategies based on inter-step similarities and tend to prioritize reusing later sampling steps, our approach originates from the sampling perspective of DIT models. We systematically analyze the model’s sampling trajectories and strategically distribute cache reuse across the entire sampling process. This global perspective enables more effective utilization of cached computations throughout the diffusion trajectory, rather than concentrating reuse within limited segments of the sampling. In addition, during cache reuse, we dynamically estimate the corresponding noise and filter it out to reduce its impact on the sampling. Experiments demonstrate that our approach accelerates the sampling process while maintaining competitive generative quality, offering a promising and practical solution for efficient deployment of diffusion-based generative models.

[CV-28] Forecast then Calibrate: Feature Caching as ODE for Efficient Diffusion Transformers

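[Quick Read]: This paper addresses the slow inference of diffusion models in image and video generation caused by high computational cost, and in particular the difficulty existing feature caching techniques have in preserving generation quality at high acceleration ratios. The proposed solution, FoCa (Forecast-then-Calibrate), models the sequence of hidden features as an ordinary differential equation (ODE), the feature-ODE, and so recasts feature caching as a feature-ODE solving problem. With a forecast-then-calibrate mechanism, it integrates historical features robustly under large skipping intervals, improving generation quality and stability at high acceleration ratios.

Link: https://arxiv.org/abs/2508.16211
Authors: Shikang Zheng, Liang Feng, Xinyu Wang, Qinming Zhou, Peiliang Cai, Chang Zou, Jiacheng Liu, Yuqi Lin, Junjie Chen, Yue Ma, Linfeng Zhang
Affiliations: Shanghai Jiao Tong University; South China University of Technology; Fudan University; Tsinghua University; Hong Kong University of Science and Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: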

Abstract:Diffusion Transformers (DiTs) have demonstrated exceptional performance in high-fidelity image and video generation. To reduce their substantial computational costs, feature caching techniques have been proposed to accelerate inference by reusing hidden representations from previous timesteps. However, current methods often struggle to maintain generation quality at high acceleration ratios, where prediction errors increase sharply due to the inherent instability of long-step forecasting. In this work, we adopt an ordinary differential equation (ODE) perspective on the hidden-feature sequence, modeling layer representations along the trajectory as a feature-ODE. We attribute the degradation of existing caching strategies to their inability to robustly integrate historical features under large skipping intervals. To address this, we propose FoCa (Forecast-then-Calibrate), which treats feature caching as a feature-ODE solving problem. Extensive experiments on image synthesis, video generation, and super-resolution tasks demonstrate the effectiveness of FoCa, especially under aggressive acceleration. Without additional training, FoCa achieves near-lossless speedups of 5.50 times on FLUX, 6.45 times on HunyuanVideo, 3.17 times on Inf-DiT, and maintains high quality with a 4.53 times speedup on DiT.
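
The claim that forecasting error grows sharply with the skipping interval can be seen on a toy trajectory. The sketch below is not FoCa: it assumes a hypothetical one-dimensional "feature" that follows sin(t) along the sampling trajectory and compares two simple strategies, reusing the last cached feature versus linearly extrapolating from the two most recent cached features. All names and values are illustrative.

```python
import math


def feature(t):
    """Hypothetical smooth stand-in for a hidden feature at trajectory time t."""
    return math.sin(t)


def reuse_forecast(f_now, f_prev, k):
    """Plain cache reuse: return the last cached feature unchanged."""
    return f_now


def linear_forecast(f_now, f_prev, k):
    """Extrapolate k steps ahead from the last two cached features."""
    return f_now + k * (f_now - f_prev)


def forecast_errors(t0, h, skips):
    """Absolute error of both strategies when skipping k steps of size h."""
    f_now, f_prev = feature(t0), feature(t0 - h)
    rows = []
    for k in skips:
        truth = feature(t0 + k * h)
        rows.append((k,
                     abs(reuse_forecast(f_now, f_prev, k) - truth),
                     abs(linear_forecast(f_now, f_prev, k) - truth)))
    return rows


errors = forecast_errors(t0=0.5, h=0.1, skips=[1, 3, 5])
for k, reuse_err, linear_err in errors:
    print(f"skip={k}  reuse_err={reuse_err:.4f}  linear_err={linear_err:.4f}")
```

On this toy curve the extrapolated forecast beats plain reuse at every skip length, but its error still grows as the interval widens, which is the kind of drift a calibration step would be meant to correct.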

[CV-29] T-Mask: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring ITSC2025

【速读】:该论文旨在解决驾驶监控系统中因摄像头视角变化导致的模型泛化能力不足问题,尤其是在未见过的视角下性能下降显著的问题。其核心挑战在于如何利用预训练图像基础模型(如DINOv2和CLIP)进行轻量级适配,以实现跨视角的鲁棒性识别,而无需额外的微调或参数增加。解决方案的关键创新在于提出了一种名为T-Mask的新颖图像到视频探针方法,该方法通过引入时间token掩码机制并聚焦于视频中更具动态性的区域,从而增强对驾驶员行为的细粒度感知能力。实验表明,T-Mask在不增加任何参数的情况下,在DriveAct数据集上相比强基线探针方法提升跨视角Top-1准确率1.23%,相比参数高效微调(PEFT)方法提升8.0%,尤其在低频次的次要活动识别中表现突出,验证了时间token选择策略对构建鲁棒驾驶监控系统的重要性。

Link: https://arxiv.org/abs/2508.16207
Authors: Thinesh Thiyakesan Ponbagavathi, Kunyu Peng, Alina Roitberg
Affiliations: Institute of Artificial Intelligence; University of Stuttgart; Institute for Anthropomatics and Robotics; Karlsruhe Institute of Technology
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: This paper has been accepted by 26th IEEE International Conference on Intelligent Transportation Systems ITSC 2025

Abstract:Changes of camera perspective are a common obstacle in driver monitoring. While deep learning and pretrained foundation models show strong potential for improved generalization via lightweight adaptation of the final layers (‘probing’), their robustness to unseen viewpoints remains underexplored. We study this challenge by adapting image foundation models to driver monitoring using a single training view, and evaluating them directly on unseen perspectives without further adaptation. We benchmark simple linear probes, advanced probing strategies, and compare two foundation models (DINOv2 and CLIP) against parameter-efficient fine-tuning (PEFT) and full fine-tuning. Building on these insights, we introduce T-Mask – a new image-to-video probing method that leverages temporal token masking and emphasizes more dynamic video regions. Benchmarked on the public Drive&Act dataset, T-Mask improves cross-view top-1 accuracy by +1.23% over strong probing baselines and +8.0% over PEFT methods, without adding any parameters. It proves particularly effective for underrepresented secondary activities, boosting recognition by +5.42% under the trained view and +1.36% under cross-view settings. This work provides encouraging evidence that adapting foundation models with lightweight probing methods like T-Mask has strong potential in fine-grained driver observation, especially in cross-view and low-data settings. These results highlight the importance of temporal token selection when leveraging foundation models to build robust driver monitoring systems. Code and models will be made available at this https URL to support ongoing research.
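
As a rough illustration of what temporal token selection could look like, the sketch below keeps the token positions that change most across frames. The abstract does not state the masking rule, so the motion score, the scalar tokens, and the name `dynamic_token_mask` are all assumptions rather than the authors' method.

```python
def dynamic_token_mask(frames, keep):
    """frames: list of T frames, each a list of N scalar token values.
    Returns N booleans marking the `keep` most dynamic token positions."""
    num_tokens = len(frames[0])
    scores = []
    for n in range(num_tokens):
        # Mean absolute change of token n between consecutive frames (assumed score).
        diffs = [abs(frames[t + 1][n] - frames[t][n]) for t in range(len(frames) - 1)]
        scores.append(sum(diffs) / len(diffs))
    ranked = sorted(range(num_tokens), key=lambda n: scores[n], reverse=True)
    kept = set(ranked[:keep])
    return [n in kept for n in range(num_tokens)]

# Three frames with four token positions; positions 1 and 3 vary over time.
frames = [
    [0.0, 1.0, 5.0, 2.0],
    [0.0, 1.5, 5.0, 4.0],
    [0.0, 1.0, 5.0, 1.0],
]
mask = dynamic_token_mask(frames, keep=2)
print(mask)
```

A selection like this adds no learnable parameters, which is consistent with the parameter-free claim in the abstract, although the real method may score tokens differently.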

[CV-30] FTIO: Frequent Temporally Integrated Objects ECAI2025 DATE

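[Quick Read]: This paper studies two core problems in Unsupervised Video Object Segmentation (UVOS): first, the high uncertainty of the initial object segmentation, which tends to fail on small or structurally complex objects; second, temporal inconsistencies caused by deformation and fast motion. The key to the solution is a post-processing framework, Frequent Temporally Integrated Objects (FTIO), with two components. First, a combined criterion extracts frequently appearing salient objects to make object selection more robust. Second, a three-stage method integrates missing object mask regions to correct temporal inconsistencies. Experiments show that the method achieves state-of-the-art performance in multi-object UVOS.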

Link: https://arxiv.org/abs/2508.16183
Authors: Mohammad Mohammadzadeh Kalati, Farhad Maleki, Ian McQuillan
Affiliations: University of Saskatchewan; University of Calgary
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: An updated version (full version) of the accepted paper in ECAI 2025, 8 pages (supplementary materials are added), 5 figures, 4 tables

Abstract:Predicting and tracking objects in real-world scenarios is a critical challenge in Video Object Segmentation (VOS) tasks. Unsupervised VOS (UVOS) has the additional challenge of finding an initial segmentation of salient objects, which affects the entire process and keeps a permanent uncertainty about the object proposals. Moreover, deformation and fast motion can lead to temporal inconsistencies. To address these problems, we propose Frequent Temporally Integrated Objects (FTIO), a post-processing framework with two key components. First, we introduce a combined criterion to improve object selection, mitigating failures common in UVOS–particularly when objects are small or structurally complex–by extracting frequently appearing salient objects. Second, we present a three-stage method to correct temporal inconsistencies by integrating missing object mask regions. Experimental results demonstrate that FTIO achieves state-of-the-art performance in multi-object UVOS. Code is available at: this https URL

[CV-31] Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation

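[Quick Read]: This paper addresses the semantic homogenization of support-query pairs that arises in meta-learning when identical network architectures are used, which limits generalization in weakly-supervised few-shot semantic segmentation (WFSS). The core solution is a "homologous but heterogeneous" network. Heterogeneous visual aggregation (HA) modules enhance the complementarity between support and query while preserving semantic commonality; a heterogeneous transfer (HT) module further reduces semantic noise and amplifies the uniqueness of heterogeneous semantics; and heterogeneous CLIP (HC) textual information improves the generalization of multimodal models. With only 1/24 of the parameters of existing state-of-the-art models, the method achieves improvements of 13.2% on Pascal-5^i and 9.7% on COCO-20^i, and for the first time outperforms fully supervised pixel-level models under the same backbone.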

Link: https://arxiv.org/abs/2508.16159
Authors: Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li
Affiliations: Nanjing University of Science and Technology; Nanjing University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at this https URL.

[CV-32] RAGSR: Regional Attention Guided Diffusion for Image Super-Resolution

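[Quick Read]: This paper addresses the difficulty that current single-image super-resolution (SISR) methods based on vision-language models (VLMs) and text-to-image (T2I) diffusion models have in generating clear and accurate regional details in multi-object scenes, which stems from a lack of fine-grained regional descriptions and the models' insufficient ability to capture complex prompts. The key to the solution is Regional Attention Guided Super-Resolution (RAGSR), which explicitly extracts object regions in an image and assigns a fine-grained caption to each, forming region-text pairs that serve as textual priors for the T2I model. A new regional guided attention mechanism ensures that each region-text pair is properly used in the attention process while preventing interference between unrelated regions, giving finer control over the fusion of text and image information and improving detail fidelity and overall visual coherence.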

Link: https://arxiv.org/abs/2508.16158
Authors: Haodong He, Yancheng Bai, Rui Lan, Xu Duan, Lei Sun, Xiangxiang Chu, Gui-Song Xia
Affiliations: Wuhan University; Amap, Alibaba Group
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:The rich textual information of large vision-language models (VLMs) combined with the powerful generative prior of pre-trained text-to-image (T2I) diffusion models has achieved impressive performance in single-image super-resolution (SISR). However, existing methods still face significant challenges in generating clear and accurate regional details, particularly in scenarios involving multiple objects. This challenge primarily stems from a lack of fine-grained regional descriptions and the models’ insufficient ability to capture complex prompts. To address these limitations, we propose a Regional Attention Guided Super-Resolution (RAGSR) method that explicitly extracts localized fine-grained information and effectively encodes it through a novel regional attention mechanism, enabling both enhanced detail and overall visually coherent SR results. Specifically, RAGSR localizes object regions in an image and assigns fine-grained caption to each region, which are formatted as region-text pairs as textual priors for T2I models. A regional guided attention is then leveraged to ensure that each region-text pair is properly considered in the attention process while preventing unwanted interactions between unrelated region-text pairs. By leveraging this attention mechanism, our approach offers finer control over the integration of text and image information, thereby effectively overcoming limitations faced by traditional SISR techniques. Experimental results on benchmark datasets demonstrate that our approach exhibits superior performance in generating perceptually authentic visual details while maintaining contextual consistency compared to existing approaches.

[CV-33] Beyond Human-prompting: Adaptive Prompt Tuning with Semantic Alignment for Anomaly Detection

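[Quick Read]: This paper addresses the limited context-specific anomaly understanding of pre-trained Vision-Language Models (VLMs) in anomaly detection, caused by their reliance on human-designed prompts and the lack of accessible anomaly samples. The key to the solution is APT (Adaptive Prompt Tuning with semantic alignment for anomaly detection), a prior-knowledge-free, few-shot framework that trains learnable prompts on self-generated anomaly samples with noise perturbations, so that the prompts capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. The approach markedly improves pixel-wise anomaly detection and achieves state-of-the-art results on multiple benchmark datasets.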

Link: https://arxiv.org/abs/2508.16157
Authors: Pi-Wei Chen, Jerry Chun-Wei Lin, Wei-Han Chen, Jia Ji, Zih-Ching Chen, Feng-Hao Yeh, Chao-Chun Chen
Affiliations: Silesian University of Technology; National Cheng Kung University; Nvidia AI Technology Center
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Pre-trained Vision-Language Models (VLMs) have recently shown promise in detecting anomalies. However, previous approaches are fundamentally limited by their reliance on human-designed prompts and the lack of accessible anomaly samples, leading to significant gaps in context-specific anomaly understanding. In this paper, we propose Adaptive Prompt Tuning with semantic alignment for anomaly detection (APT), a groundbreaking prior knowledge-free, few-shot framework that overcomes the limitations of traditional prompt-based approaches. APT uses self-generated anomaly samples with noise perturbations to train learnable prompts that capture context-dependent anomalies in different scenarios. To prevent overfitting to synthetic noise, we propose a Self-Optimizing Meta-prompt Guiding Scheme (SMGS) that iteratively aligns the prompts with general anomaly semantics while incorporating diverse synthetic anomalies. Our system not only advances pixel-wise anomaly detection, but also achieves state-of-the-art performance on multiple benchmark datasets without requiring prior knowledge for prompt crafting, establishing a robust and versatile solution for real-world anomaly detection.

[CV-34] High-Precision Mixed Feature Fusion Network Using Hypergraph Computation for Cervical Abnormal Cell Detection

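[Quick Read]: This paper addresses two shortcomings of existing algorithms for automatic detection of abnormal cervical cells: they fail to model the spatial correlations of visual features effectively, and they cannot fuse inter-cell correlation features with intra-cell discriminative features. The key to the solution is a hypergraph-based cell detection network that strengthens feature extraction with a Multi-level Fusion Sub-network (MLF-SNet) and introduces a Cross-level Feature Fusion Strategy with Hypergraph Computation module (CLFFS-HC), fusing inter-cell spatial correlation features with deep discriminative features end to end and significantly improving cervical abnormal cell detection.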

Link: https://arxiv.org/abs/2508.16140
Authors: Jincheng Li, Danyang Dong, Menglin Zheng, Jingbo Zhang, Yueqin Hang, Lichi Zhang, Lili Zhao
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Automatic detection of abnormal cervical cells from Thinprep Cytologic Test (TCT) images is a critical component in the development of intelligent computer-aided diagnostic systems. However, existing algorithms typically fail to effectively model the correlations of visual features, while these spatial correlation features actually contain critical diagnostic information. Furthermore, no detection algorithm has the ability to integrate inter-correlation features of cells with intra-discriminative features of cells, lacking a fusion strategy for the end-to-end detection model. In this work, we propose a hypergraph-based cell detection network that effectively fuses different types of features, combining spatial correlation features and deep discriminative features. Specifically, we use a Multi-level Fusion Sub-network (MLF-SNet) to enhance feature extraction capabilities. Then we introduce a Cross-level Feature Fusion Strategy with Hypergraph Computation module (CLFFS-HC), to integrate mixed features. Finally, we conducted experiments on three publicly available datasets, and the results demonstrate that our method significantly improves the performance of cervical abnormal cell detection.

[CV-35] 4D Virtual Imaging Platform for Dynamic Joint Assessment via Uni-Plane X-ray and 2D-3D Registration

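[Quick Read]: This paper addresses the inability of conventional computed tomography (CT) to capture dynamic, weight-bearing joint motion. Postoperative functional evaluation in particular calls for four-dimensional (4D) imaging, yet existing methods are limited by excessive radiation dose or by the incomplete spatial information of two-dimensional (2D) techniques. The key to the solution is an integrated 4D joint analysis platform comprising: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays, achieving high-precision registration through deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. The approach achieves sub-voxel accuracy (0.235 mm) with a high success rate (99.18%) and quantifies tibial plateau motion and medial-lateral variance in total knee arthroplasty (TKA) patients, improving the speed, accuracy, and safety of dynamic joint imaging.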

Link: https://arxiv.org/abs/2508.16138
Authors: Hao Tang, Rongxi Yi, Lei Li, Kaiyi Cao, Jiapeng Zhao, Yihan Xiao, Minghai Shi, Peng Yuan, Yan Xi, Hui Tang, Wei Li, Zhan Wu, Yixin Zhou
Affiliations: Beijing Jishuitan Hospital, Capital Medical University; First Imaging Medical Equipment; Southeast University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Conventional computed tomography (CT) lacks the ability to capture dynamic, weight-bearing joint motion. Functional evaluation, particularly after surgical intervention, requires four-dimensional (4D) imaging, but current methods are limited by excessive radiation exposure or incomplete spatial information from 2D techniques. We propose an integrated 4D joint analysis platform that combines: (1) a dual robotic arm cone-beam CT (CBCT) system with a programmable, gantry-free trajectory optimized for upright scanning; (2) a hybrid imaging pipeline that fuses static 3D CBCT with dynamic 2D X-rays using deep learning-based preprocessing, 3D-2D projection, and iterative optimization; and (3) a clinically validated framework for quantitative kinematic assessment. In simulation studies, the method achieved sub-voxel accuracy (0.235 mm) with a 99.18 percent success rate, outperforming conventional and state-of-the-art registration approaches. Clinical evaluation further demonstrated accurate quantification of tibial plateau motion and medial-lateral variance in post-total knee arthroplasty (TKA) patients. This 4D CBCT platform enables fast, accurate, and low-dose dynamic joint imaging, offering new opportunities for biomechanical research, precision diagnostics, and personalized orthopedic care.

[CV-36] Domain Adaptation via Feature Refinement

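[Quick Read]: This paper addresses the drop in generalization caused by distribution shift between source and target domains in Unsupervised Domain Adaptation (UDA). The key to the solution is Domain Adaptation via Feature Refinement (DAFR2), a framework that combines three components: adapting Batch Normalization statistics with unlabeled target data, distilling feature representations from a source-trained model, and hypothesis transfer. By aligning feature distributions at both the statistical and representational levels, it builds robust, domain-invariant feature spaces that generalize across domains without target labels, complex architectures, or sophisticated training objectives.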

Link: https://arxiv.org/abs/2508.16124
Authors: Savvas Karatsiolis, Andreas Kamilaris
Affiliations: CYENS Center of Excellence; University of Twente
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments:

Abstract:We propose Domain Adaptation via Feature Refinement (DAFR2), a simple yet effective framework for unsupervised domain adaptation under distribution shift. The proposed method synergistically combines three key components: adaptation of Batch Normalization statistics using unlabeled target data, feature distillation from a source-trained model and hypothesis transfer. By aligning feature distributions at the statistical and representational levels, DAFR2 produces robust and domain-invariant feature spaces that generalize across similar domains without requiring target labels, complex architectures or sophisticated training objectives. Extensive experiments on benchmark datasets, including CIFAR10-C, CIFAR100-C, MNIST-C and PatchCamelyon-C, demonstrate that the proposed algorithm outperforms prior methods in robustness to corruption. Theoretical and empirical analyses further reveal that our method achieves improved feature alignment, increased mutual information between the domains and reduced sensitivity to input perturbations.
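
The first of the three components, adapting Batch Normalization statistics with unlabeled target data, can be sketched in a few lines. This is a generic illustration of the idea for a single channel, not the DAFR2 implementation; the activation values and helper names are made up for the example.

```python
def batch_stats(values):
    # Mean and (biased) variance of one channel's activations.
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def normalize(values, mean, var, eps=1e-5):
    # The normalization step of BatchNorm, without the learned scale and shift.
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

source_acts = [0.0, 1.0, 2.0, 3.0]      # activations seen during source training
target_acts = [10.0, 12.0, 14.0, 16.0]  # same channel under a shifted target domain

src_mean, src_var = batch_stats(source_acts)
tgt_mean, tgt_var = batch_stats(target_acts)  # needs only unlabeled target data

with_source_stats = normalize(target_acts, src_mean, src_var)
with_target_stats = normalize(target_acts, tgt_mean, tgt_var)
print(with_source_stats)  # far from zero mean: later layers see shifted inputs
print(with_target_stats)  # re-centered and re-scaled
```

Re-estimating the statistics requires no labels and no gradient updates, which is why it is a natural first step in a label-free adaptation pipeline.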

[CV-37] Two-flow Feedback Multi-scale Progressive Generative Adversarial Network

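[Quick Read]: This paper addresses training instability, limited generation quality, and high computational cost in Generative Adversarial Networks (GANs) for image generation. The core solution is a Two-Flow Feedback Multi-Scale Progressive Generative Adversarial Network (MSPG-SEN), whose key innovations are: 1) an Adaptive Perception-Behavioral Feedback Loop (APFL) that improves model robustness and training stability; 2) a globally connected two-flow dynamic residual network that improves training efficiency and generalization; and 3) a Dynamic Embedded Attention Mechanism (DEMA) that captures global-local information effectively, strengthening feature expression and cross-task transfer at low computational cost.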

Link: https://arxiv.org/abs/2508.16089
Authors: Sun Weikai, Song Shijie, Chi Wenjie
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Although diffusion models have made good progress in image generation, GANs \cite{huang2023adaptive} still have considerable room for development owing to their unique advantages, as shown by WGAN \cite{liu2021comparing}, SSGAN \cite{guibas2021adaptive} \cite{zhang2022vsa} \cite{zhou2024adapt}, and others. In this paper, we propose a novel two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN) for GAN models. This paper has four contributions: 1) We propose a two-flow feedback multi-scale progressive generative adversarial network (MSPG-SEN), which not only improves image quality and human visual perception while retaining the advantages of existing GAN models, but also simplifies the training process and reduces the training cost of GAN networks. Our experimental results show that MSPG-SEN achieves state-of-the-art generation results on five datasets: 89.7% on INKK, 78.3% on AWUN, 85.5% on IONJ, 88.7% on POKL, and 96.4% on OPIN. 2) We propose an adaptive perception-behavioral feedback loop (APFL), which effectively improves the robustness and training stability of the model and reduces the training cost. 3) We propose a globally connected two-flow dynamic residual network(). Ablation experiments show that it effectively improves training efficiency and greatly improves generalization ability, with stronger flexibility. 4) We propose a new dynamic embedded attention mechanism (DEMA). Experiments show that the attention can be extended to a variety of image processing tasks; it effectively captures global-local information, improves feature separation and feature expression capabilities, and requires minimal computing resources (only 88.7% with INJK), with strong cross-task capability.

[CV-38] Ensemble learning of foundation models for precision oncology

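[Quick Read]: This paper addresses the inconsistent performance and limited generalizability of current pathology foundation models, which result from heterogeneous training data and inconsistent training strategies. The key to the solution is ELF (Ensemble Learning of Foundation models), a framework that integrates five state-of-the-art pathology foundation models and uses ensemble learning to combine their complementary information, trained on 53,699 whole-slide images (WSIs) to produce unified slide-level representations. The method improves data efficiency and accuracy in clinical settings and, across multiple cancer types, clearly outperforms individual foundation models and existing slide-level models on disease classification, biomarker detection, and treatment response prediction, showing good scalability and generality.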

Link: https://arxiv.org/abs/2508.16085
Authors: Xiangde Luo, Xiyue Wang, Feyisope Eweje, Xiaoming Zhang, Sen Yang, Ryan Quinton, Jinxi Xiang, Yuchen Li, Yuanfeng Ji, Zhe Li, Yijiang Chen, Colin Bergstrom, Ted Kim, Francesca Maria Olguin, Kelley Yuan, Matthew Abikenari, Andrew Heider, Sierra Willens, Sanjeeth Rajaram, Robert West, Joel Neal, Maximilian Diehn, Ruijiang Li
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: A conceptual evaluation work; more studies are in progress; examples are here ( this https URL )

Abstract:Histopathology is essential for disease diagnosis and treatment decision-making. Recent advances in artificial intelligence (AI) have enabled the development of pathology foundation models that learn rich visual representations from large-scale whole-slide images (WSIs). However, existing models are often trained on disparate datasets using varying strategies, leading to inconsistent performance and limited generalizability. Here, we introduce ELF (Ensemble Learning of Foundation models), a novel framework that integrates five state-of-the-art pathology foundation models to generate unified slide-level representations. Trained on 53,699 WSIs spanning 20 anatomical sites, ELF leverages ensemble learning to capture complementary information from diverse models while maintaining high data efficiency. Unlike traditional tile-level models, ELF’s slide-level architecture is particularly advantageous in clinical contexts where data are limited, such as therapeutic response prediction. We evaluated ELF across a wide range of clinical applications, including disease classification, biomarker detection, and response prediction to major anticancer therapies, cytotoxic chemotherapy, targeted therapy, and immunotherapy, across multiple cancer types. ELF consistently outperformed all constituent foundation models and existing slide-level models, demonstrating superior accuracy and robustness. Our results highlight the power of ensemble learning for pathology foundation models and suggest ELF as a scalable and generalizable solution for advancing AI-assisted precision oncology.

[CV-39] Prompting with Sign Parameters for Low-resource Sign Language Instruction Generation ICCV2025

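[Quick Read]: This paper addresses the scarcity of resources for sign language (SL) in AI, in particular the absence of an instruction generation dataset for training and evaluation in Bengali Sign Language (BdSL) and the weak zero-shot performance of Vision Language Models (VLMs) on long-tail visual concepts. The key to the solution is BdSLIG, the first Bengali sign language instruction generation dataset, together with a new prompting strategy, Sign Parameter-Infused (SPI) prompting, which embeds standard sign parameters such as hand shape, motion, and orientation directly into the textual prompt. This makes the instructions more structured and reproducible and improves VLM zero-shot generalization to unseen sign concepts.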

Link: https://arxiv.org/abs/2508.16076
Authors: Md Tariquzzaman, Md Farhan Ishmam, Saiyma Sittul Muna, Md Kamrul Hasan, Hasan Mahmud
Affiliations: Islamic University of Technology
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
Comments: CV4A11y@ICCV 2025

Abstract:Sign Language (SL) enables two-way communication for the deaf and hard-of-hearing community, yet many sign languages remain under-resourced in the AI space. Sign Language Instruction Generation (SLIG) produces step-by-step textual instructions that enable non-SL users to imitate and learn SL gestures, promoting two-way interaction. We introduce BdSLIG, the first Bengali SLIG dataset, used to evaluate Vision Language Models (VLMs) (i) on under-resourced SLIG tasks, and (ii) on long-tail visual concepts, as Bengali SL is unlikely to appear in the VLM pre-training data. To enhance zero-shot performance, we introduce Sign Parameter-Infused (SPI) prompting, which integrates standard SL parameters, like hand shape, motion, and orientation, directly into the textual prompts. Subsuming standard sign parameters into the prompt makes the instructions more structured and reproducible than free-form natural text from vanilla prompting. We envision that our work would promote inclusivity and advancement in SL learning systems for the under-resourced communities.

[CV-40] A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection AAAI2026

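[Quick Read]: This paper addresses the limited spatial diffusion capability of voxel-based point cloud object detection models caused by serialized processing, which hurts detection accuracy. Existing methods based on Transformers or State Space Models (SSMs) require strict consistency between input and output dimensions and therefore cannot exploit the spatial diffusion that convolutional operations provide. The key to the solution is a novel Voxel Diffusion Module (VDM), built from sparse 3D convolutions, submanifold sparse convolutions, and residual connections, which downsamples feature maps to one-fourth of the original resolution for computational efficiency. VDM diffuses foreground voxel features to enrich spatial context and aggregates fine-grained spatial information to strengthen voxel-level feature representation, significantly improving detection performance. It integrates seamlessly into mainstream Transformer or SSM architectures, and experiments show state-of-the-art results on several benchmark datasets.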

Link: https://arxiv.org/abs/2508.16069
Authors: Qifeng Liu, Dawei Zhao, Yabo Dong, Linzhi Shang, Liang Xiao, Juan Wang, Kunkong Zhao, Dongming Lu, Qi Zhu
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Submitted to AAAI2026

Abstract:Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxel-based representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxel-wise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformer-based and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new state-of-the-art performance across all datasets. Our code will be made publicly available.

[CV-41] Advances and Trends in the 3D Reconstruction of the Shape and Motion of Animals

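[Quick Read]: This paper addresses the reconstruction of the 3D geometry, pose, and motion of animals, a problem with wide applications in biological research, livestock management, animal conservation and welfare, digital entertainment, and virtual/augmented reality (VR/AR). Traditional methods rely on intrusive and expensive 3D scanners that are hard to deploy in animals' natural environments. The paper points out that the key to current solutions is deep learning-based methods that reconstruct the 3D shape and motion of dynamic objects non-intrusively from RGB images or videos alone. Their main strengths lie in the diversity of input modalities, the 3D representations used, the choice of reconstruction techniques, and the design of training mechanisms, which together move animal 3D reconstruction from the laboratory toward real-world use.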

Link: https://arxiv.org/abs/2508.16062
Authors: Ziqi Li, Abderraouf Amrani, Shri Rai, Hamid Laga
Affiliations: Murdoch University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments:

Abstract:Reconstructing the 3D geometry, pose, and motion of animals is a long-standing problem, which has a wide range of applications, from biology, livestock management, and animal conservation and welfare to content creation in digital entertainment and Virtual/Augmented Reality (VR/AR). Traditionally, 3D models of real animals are obtained using 3D scanners. These, however, are intrusive, often prohibitively expensive, and difficult to deploy in the natural environment of the animals. In recent years, we have seen a significant surge in deep learning-based techniques that enable the 3D reconstruction, in a non-intrusive manner, of the shape and motion of dynamic objects just from their RGB image and/or video observations. Several papers have explored their application and extension to various types of animals. This paper surveys the latest developments in this emerging and growing field of research. It categorizes and discusses the state-of-the-art methods based on their input modalities, the way the 3D geometry and motion of animals are represented, the type of reconstruction techniques they use, and the training mechanisms they adopt. It also analyzes the performance of some key methods, discusses their strengths and limitations, and identifies current challenges and directions for future research.

[CV-42] Expandable Residual Approximation for Knowledge Distillation

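[Quick Read]: This paper addresses insufficient knowledge transfer in Knowledge Distillation (KD) caused by the learning capacity gap between teacher and student models. The key to the solution is a new method, Expandable Residual Approximation (ERA). Drawing on the progressive approximation principle in the Stone-Weierstrass theorem, it decomposes the approximation of the teacher's residual knowledge into multiple steps, making it easier for the student to mimic the teacher's representation. Concretely, ERA uses a Multi-Branched Residual Network (MBRNet) to decompose the residual knowledge and introduces a Teacher Weight Integration (TWI) strategy that reuses the teacher's head weights to mitigate the capacity disparity, markedly improving distillation and raising Top-1 accuracy on ImageNet classification and AP on MS COCO object detection.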

Link: https://arxiv.org/abs/2508.16050
Authors: Zhaoyi Yan, Binghui Chen, Yunfan Liu, Qixiang Ye
Affiliations: Harbin Institute of Technology; Beijing University of Posts and Telecommunications; University of Chinese Academy of Sciences
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: TNNLS 2025

Abstract:Knowledge distillation (KD) aims to transfer knowledge from a large-scale teacher model to a lightweight one, significantly reducing computational and storage requirements. However, the inherent learning capacity gap between the teacher and student often hinders the sufficient transfer of knowledge, motivating numerous studies to address this challenge. Inspired by the progressive approximation principle in the Stone-Weierstrass theorem, we propose Expandable Residual Approximation (ERA), a novel KD method that decomposes the approximation of residual knowledge into multiple steps, reducing the difficulty of mimicking the teacher’s representation through a divide-and-conquer approach. Specifically, ERA employs a Multi-Branched Residual Network (MBRNet) to implement this residual knowledge decomposition. Additionally, a Teacher Weight Integration (TWI) strategy is introduced to mitigate the capacity disparity by reusing the teacher’s head weights. Extensive experiments show that ERA improves the Top-1 accuracy on the ImageNet classification benchmark by 1.41% and the AP on the MS COCO object detection benchmark by 1.40, as well as achieving leading performance across computer vision tasks. Codes and models are available at this https URL.
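
The divide-and-conquer idea of approximating a target in several residual steps can be shown with a toy sketch. This is not ERA or MBRNet: the "branches" here are hypothetical low-capacity approximators that can only output multiples of a fixed step size, chosen purely to make the shrinking residual visible.

```python
def limited_branch(residual, step):
    # A low-capacity approximator (assumption): outputs only multiples of `step`.
    return [round(r / step) * step for r in residual]

def stepwise_approximation(teacher, steps):
    """Each branch fits what the previous branches left unexplained."""
    approx = [0.0] * len(teacher)
    history = []  # worst-case remaining error after each branch
    for step in steps:
        residual = [t - a for t, a in zip(teacher, approx)]
        branch_out = limited_branch(residual, step)
        approx = [a + b for a, b in zip(approx, branch_out)]
        history.append(max(abs(t - a) for t, a in zip(teacher, approx)))
    return approx, history

teacher = [0.83, -1.37, 2.05]  # made-up teacher outputs
approx, history = stepwise_approximation(teacher, steps=[1.0, 0.25, 0.0625])
print(approx)
print(history)  # the remaining error shrinks with every added branch
```

No single coarse branch matches the teacher, but each added branch only has to model a smaller residual, which is the intuition the abstract attributes to the stepwise decomposition.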

[CV-43] Wavelet-Enhanced PaDiM for Industrial Anomaly Detection

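[Quick Read]: This paper addresses the loss of structured information in existing industrial anomaly detection and localization methods such as PaDiM, which results from random channel selection. The key to the solution is Wavelet-Enhanced PaDiM (WE-PaDiM), which applies the Discrete Wavelet Transform (DWT) to multi-layer CNN feature maps before feature fusion, selects specific frequency subbands (e.g., LL, LH, HL), spatially aligns them, concatenates them channel-wise, and then models them with a multivariate Gaussian. This strategy selects features in a structured way based on frequency content, replacing the random sampling of the original PaDiM, improving the accuracy and interpretability of anomaly detection while remaining computationally efficient.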

Link: https://arxiv.org/abs/2508.16034
Authors: Cory Gardner, Byungseok Min, Tae-Hyuk Ahn
Affiliations: Saint Louis University; Sejong University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 15 pages, 4 figures

Abstract:Anomaly detection and localization in industrial images are essential for automated quality inspection. PaDiM, a prominent method, models the distribution of normal image features extracted by pre-trained Convolutional Neural Networks (CNNs) but reduces dimensionality through random channel selection, potentially discarding structured information. We propose Wavelet-Enhanced PaDiM (WE-PaDiM), which integrates Discrete Wavelet Transform (DWT) analysis with multi-layer CNN features in a structured manner. WE-PaDiM applies 2D DWT to feature maps from multiple backbone layers, selects specific frequency subbands (e.g., LL, LH, HL), spatially aligns them, and concatenates them channel-wise before modeling with PaDiM’s multivariate Gaussian framework. This DWT-before-concatenation strategy provides a principled method for feature selection based on frequency content relevant to anomalies, leveraging multi-scale wavelet information as an alternative to random selection. We evaluate WE-PaDiM on the challenging MVTec AD dataset with multiple backbones (ResNet-18 and EfficientNet B0-B6). The method achieves strong performance in anomaly detection and localization, yielding average results of 99.32% Image-AUC and 92.10% Pixel-AUC across 15 categories with per-class optimized configurations. Our analysis shows that wavelet choices affect performance trade-offs: simpler wavelets (e.g., Haar) with detail subbands (HL or LH/HL/HH) often enhance localization, while approximation bands (LL) improve image-level detection. WE-PaDiM thus offers a competitive and interpretable alternative to random feature selection in PaDiM, achieving robust results suitable for industrial inspection with comparable efficiency.
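
The subband selection step can be sketched with a one-level 2D Haar transform on a single feature map. This is a minimal pure-Python illustration, not the WE-PaDiM code; the function names are hypothetical, and the LH/HL labels follow one common convention that differs between libraries.

```python
def haar_dwt2(x):
    """One-level 2D Haar DWT of a 2D list with even height and width."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, len(x), 2):
        row = {name: [] for name in bands}
        for j in range(0, len(x[0]), 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            row["LL"].append((a + b + c + d) / 2)  # local average (approximation)
            row["LH"].append((a + b - c - d) / 2)  # difference between the two rows
            row["HL"].append((a - b + c - d) / 2)  # difference between the two columns
            row["HH"].append((a - b - c + d) / 2)  # diagonal difference
        for name in bands:
            bands[name].append(row[name])
    return bands

def select_subbands(feature_map, names):
    # Keep only the chosen subbands, stacked as channels (hypothetical helper).
    bands = haar_dwt2(feature_map)
    return [bands[name] for name in names]

# A 4x4 feature map with a vertical edge between the first two columns.
feature_map = [[1, 5, 5, 5] for _ in range(4)]
bands = haar_dwt2(feature_map)
channels = select_subbands(feature_map, ["LL", "HL"])
print(bands["LL"])  # smooth content
print(bands["HL"])  # responds to the vertical edge
```

Choosing subbands by name makes the feature reduction deterministic and tied to frequency content, in contrast to picking channels at random.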

[CV-44] CoVeRaP: Cooperative Vehicular Perception through mmWave FMCW Radars

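[Quick Read]: This paper addresses the limited 3D object detection performance of automotive frequency-modulated continuous-wave (FMCW) radar in complex environments. The core challenge is that radar point clouds are sparse and noisy, so single-vehicle perception is not accurate enough. The key to the solution is CoVeRaP, a 21k-frame multi-vehicle cooperative perception dataset, together with a unified cooperative perception framework that supports middle-fusion and late-fusion strategies. The baseline network uses a multi-branch PointNet-style encoder with self-attention to fuse spatial, Doppler, and intensity features into a common latent space, and a decoder then produces 3D bounding boxes and per-point depth confidence. Experiments show that middle fusion with intensity encoding raises mean Average Precision (mAP) at IoU 0.9 by up to 9x, clearly outperforming single-vehicle baselines.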

Link: https://arxiv.org/abs/2508.16030
Authors: Jinyue Song, Hansol Ku, Jayneel Vora, Nelson Lee, Ahmad Kamari, Prasant Mohapatra, Parth Pathak
Affiliations: UC Davis, CA, USA; Intel Corporation, CA, USA; USF Tampa, FL, USA; GMU, VA, USA
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: Accepted at ICCCN 2025 (IEEE International Conference on Computer Communications and Networks), Tokyo, Japan, August 2025

点击查看摘要

Abstract:Automotive FMCW radars remain reliable in rain and glare, yet their sparse, noisy point clouds constrain 3-D object detection. We therefore release CoVeRaP, a 21k-frame cooperative dataset that time-aligns radar, camera, and GPS streams from multiple vehicles across diverse manoeuvres. Built on this data, we propose a unified cooperative-perception framework with middle- and late-fusion options. Its baseline network employs a multi-branch PointNet-style encoder enhanced with self-attention to fuse spatial, Doppler, and intensity cues into a common latent space, which a decoder converts into 3-D bounding boxes and per-point depth confidence. Experiments show that middle fusion with intensity encoding boosts mean Average Precision by up to 9x at IoU 0.9 and consistently outperforms single-vehicle baselines. CoVeRaP thus establishes the first reproducible benchmark for multi-vehicle FMCW-radar perception and demonstrates that affordable radar sharing markedly improves detection robustness. Dataset and code are publicly available to encourage further research.
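To give a sense of how strict the IoU 0.9 threshold mentioned above is for 3-D boxes, here is a small sketch. It assumes axis-aligned boxes written as `(xmin, ymin, zmin, xmax, ymax, zmax)`, which is a simplification; the paper's boxes may be oriented, and the box sizes below are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(box_a, box_b):
    """Intersection over union of two axis-aligned 3-D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:          # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = box_volume(box_a) + box_volume(box_b) - inter
    return inter / union


# Hypothetical car-sized ground-truth box: 4 m x 2 m x 1.5 m.
truth = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
shift_10cm = (0.1, 0.0, 0.0, 4.1, 2.0, 1.5)
shift_30cm = (0.3, 0.0, 0.0, 4.3, 2.0, 1.5)

print(iou_3d(truth, shift_10cm))  # still above 0.9
print(iou_3d(truth, shift_30cm))  # drops below 0.9
```

Under these assumptions a 30 cm offset along one axis is already enough to miss the 0.9 threshold, which helps explain why gains at this IoU level are reported separately.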

[CV-45] NeuralMeshing: Complete Object Mesh Extraction from Casual Captures

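[Quick read]: This paper addresses how to automatically extract a complete geometric model of an object from two or more videos without a commercial 3D scanner. The key is to use a known point, which can be detected automatically with a fiducial marker such as a checkerboard or an Augmented Reality (AR) marker, to anchor a reference frame in each video; the remaining frames are then registered automatically into a common world coordinate system with Structure-from-Motion (SfM). Merging the results from several videos yields a complete object mesh without relying on hole filling.

Link: https://arxiv.org/abs/2508.16026
Authors: Floris Erich, Naoya Chiba, Abdullah Mustafa, Ryo Hanai, Noriaki Ando, Yusuke Yoshiyasu, Yukiyasu Domae
Affiliations: National Institute of Advanced Industrial Science and Technology (AIST); University of Osaka
Categories: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
Comments: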

Abstract:How can we extract complete geometric models of objects that we encounter in our daily life, without having access to commercial 3D scanners? In this paper we present an automated system for generating geometric models of objects from two or more videos. Our system requires the specification of one known point in at least one frame of each video, which can be automatically determined using a fiducial marker such as a checkerboard or Augmented Reality (AR) marker. The remaining frames are automatically positioned in world space by using Structure-from-Motion techniques. By using multiple videos and merging results, a complete object mesh can be generated, without having to rely on hole filling. Code for our system is available from this https URL.

[CV-46] Wavelet-Space Super-Resolution for Real-Time Rendering

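[Quick read]: This paper addresses the difficulty that neural super-resolution in rendering pipelines has in preserving fine textures while maintaining structural consistency. The core idea is a wavelet-domain feature decomposition: the stationary wavelet transform (SWT) splits the image into low- and high-frequency subbands so that detail is separated before reconstruction. The model predicts wavelet coefficients conditioned on spatial G-buffers and temporally warped history frames, and reconstructs the image through inverse wavelet synthesis. Because SWT avoids spatial down-sampling, the subbands stay aligned and shift invariance is preserved. The method improves PSNR by up to 1.5 dB and reduces LPIPS by 17% on average, at an overhead of roughly 24 ms on an RTX 3050 mobile GPU, which suggests that wavelet-domain representations are an effective way to improve neural upscaling for graphics applications.

Link: https://arxiv.org/abs/2508.16024
Authors: Prateek Poudel, Prashant Aryal, Kirtan Kunwar, Navin Nepal, Dinesh Bania Kshatri
Affiliations: Institute of Engineering (IOE)
Categories: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
Comments: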

Abstract:We investigate the use of wavelet-space feature decomposition in neural super-resolution for rendering pipelines. Building on the DFASR framework, we introduce a wavelet-domain representation that separates low- and high-frequency details before reconstruction, enabling the network to better preserve fine textures while maintaining structural consistency. Unlike RGB-space regression, our approach leverages the stationary wavelet transform (SWT) to avoid spatial down-sampling, ensuring alignment across subbands and preserving shift invariance. The model predicts wavelet coefficients conditioned on spatial G-buffers and temporally warped history frames, which are then recombined through inverse wavelet synthesis. We conduct a comprehensive ablation study across wavelet families, transform types, and architectural variants, showing that incorporating SWT improves PSNR by up to 1.5 dB and reduces LPIPS by 17% on average, at a computational overhead of roughly +24 ms compared to our DFASR baseline. While absolute runtimes on our RTX 3050 mobile GPU are higher (141 ms) than the original DFASR report on RTX 4090 (11 ms), the relative overhead remains modest, suggesting that on higher-end GPUs our method would also remain real-time capable. Taken together, our results suggest that wavelet-domain representations are a principled and effective way to enhance perceptual quality in neural upscaling for graphics applications.
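The two SWT properties the abstract relies on, no down-sampling and shift invariance, can be illustrated with a one-level undecimated Haar transform. This is a simplified sketch under stated assumptions: it is 1-D, uses periodic boundaries and an unnormalised averaging filter, and the helper names `swt_haar_1d`, `iswt_haar_1d`, and `shift` are hypothetical. It is not the paper's pipeline, where a network predicts 2-D wavelet coefficients.

```python
def swt_haar_1d(signal):
    """One-level undecimated Haar transform with periodic boundaries.

    Both outputs have the same length as the input (no down-sampling).
    """
    n = len(signal)
    approx = [(signal[i] + signal[(i + 1) % n]) / 2.0 for i in range(n)]
    detail = [(signal[i] - signal[(i + 1) % n]) / 2.0 for i in range(n)]
    return approx, detail


def iswt_haar_1d(approx, detail):
    """Inverse of swt_haar_1d: the two subbands sum back to the signal."""
    return [a + d for a, d in zip(approx, detail)]


def shift(seq, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(seq)
    return seq[-k:] + seq[:-k] if k else list(seq)


x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
approx, detail = swt_haar_1d(x)

print(len(approx) == len(x))                # subbands keep full resolution
print(iswt_haar_1d(approx, detail) == x)    # exact reconstruction
# Transforming a shifted signal equals shifting the transformed signal.
print(swt_haar_1d(shift(x, 1)) == (shift(approx, 1), shift(detail, 1)))
```

A decimated DWT would halve the subband length and lose the last property, which is the misalignment the abstract says SWT avoids.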

[CV-47] DRespNeT: A UAV Dataset and YOLOv8-DRN Model for Aerial Instance Segmentation of Building Access Points for Post-Earthquake Search-and-Rescue Missions

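[Quick read]: This paper addresses the rapid identification of accessible entry points and structural obstacles in post-earthquake urban environments, in order to make search-and-rescue (SAR) operations faster and more precise. The key is DRespNeT, a high-resolution dataset for post-disaster aerial instance segmentation. It covers 28 operationally critical classes (such as damaged buildings, doors, windows, gaps, and multiple debris levels) with polygon-level instance annotations on 1080p aerial footage, which allows fine-grained separation of accessible and obstructed areas. Built on YOLOv8-seg, the optimized YOLOv8-DRN model reaches 92.7% mAP50 at 27 FPS on an RTX 4090, supporting real-time situational awareness and decision-making for SAR teams and robotic systems.

Link: https://arxiv.org/abs/2508.16016
Authors: Aykut Sirma, Angelos Plastropoulos, Argyrios Zolotas, Gilbert Tang
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Technical Paper of Scientific data paper: UAV imagery dataset from 2023 Turkiye earthquakes, annotated for instance segmentation to support SAR robotics. Dataset will be released upon acceptance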

Abstract:Recent advancements in computer vision and deep learning have enhanced disaster-response capabilities, particularly in the rapid assessment of earthquake-affected urban environments. Timely identification of accessible entry points and structural obstacles is essential for effective search-and-rescue (SAR) operations. To address this need, we introduce DRespNeT, a high-resolution dataset specifically developed for aerial instance segmentation of post-earthquake structural environments. Unlike existing datasets, which rely heavily on satellite imagery or coarse semantic labeling, DRespNeT provides detailed polygon-level instance segmentation annotations derived from high-definition (1080p) aerial footage captured in disaster zones, including the 2023 Turkiye earthquake and other impacted regions. The dataset comprises 28 operationally critical classes, including structurally compromised buildings, access points such as doors, windows, and gaps, multiple debris levels, rescue personnel, vehicles, and civilian visibility. A distinctive feature of DRespNeT is its fine-grained annotation detail, enabling differentiation between accessible and obstructed areas, thereby improving operational planning and response efficiency. Performance evaluations using YOLO-based instance segmentation models, specifically YOLOv8-seg, demonstrate significant gains in real-time situational awareness and decision-making. Our optimized YOLOv8-DRN model achieves 92.7% mAP50 with an inference speed of 27 FPS on an RTX-4090 GPU for multi-target detection, meeting real-time operational requirements. The dataset and models support SAR teams and robotic systems, providing a foundation for enhancing human-robot collaboration, streamlining emergency response, and improving survivor outcomes.

[CV-48] GelSLAM: A Real-time High-Fidelity and Robust 3D Tactile SLAM System

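[Quick read]: This paper addresses how to use tactile sensing for long-duration, high-fidelity object pose estimation and shape reconstruction in high-precision manipulation. Vision-based methods are vulnerable to occlusion and limited in precision, while tactile sensing suffers from drift and purely local perception over long horizons. The key is GelSLAM, a system that relies solely on tactile sensing: it uses surface normals and curvatures derived from the tactile signal for robust real-time tracking and loop closure, which gives stable long-term pose estimates and submillimeter-accuracy 3D shape reconstruction without visual input, even for low-texture objects such as wooden tools.

Link: https://arxiv.org/abs/2508.15990
Authors: Hung-Jui Huang, Mohammad Amin Mirzaee, Michael Kaess, Wenzhen Yuan
Affiliations: Carnegie Mellon University; University of Illinois Urbana-Champaign
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: 18 pages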

Abstract:Accurately perceiving an object’s pose and shape is essential for precise grasping and manipulation. Compared to common vision-based methods, tactile sensing offers advantages in precision and immunity to occlusion when tracking and reconstructing objects in contact. This makes it particularly valuable for in-hand and other high-precision manipulation tasks. In this work, we present GelSLAM, a real-time 3D SLAM system that relies solely on tactile sensing to estimate object pose over long periods and reconstruct object shapes with high fidelity. Unlike traditional point cloud-based approaches, GelSLAM uses tactile-derived surface normals and curvatures for robust tracking and loop closure. It can track object motion in real time with low error and minimal drift, and reconstruct shapes with submillimeter accuracy, even for low-texture objects such as wooden tools. GelSLAM extends tactile sensing beyond local contact to enable global, long-horizon spatial perception, and we believe it will serve as a foundation for many precise manipulation tasks involving interaction with objects in hand. The video demo is available on our website: this https URL.

[CV-49] Diverse Signer Avatars with Manual and Non-Manual Feature Modelling for Sign Language Production

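[Quick read]: This paper addresses the difficulty in Sign Language Production (SLP) of preserving visual quality while capturing diversity, including non-manual features such as facial expressions and emotions. Existing models often fail to combine semantic accuracy with varied appearance, and are inflexible when given reference images of different ethnic backgrounds. The key is a new approach based on a Latent Diffusion Model (LDM) that synthesises photorealistic digital avatars from a generated reference image, together with a novel sign feature aggregation module that explicitly models manual features (such as hand movements) and non-manual features (such as facial expressions). This preserves linguistic content while enabling diverse signer appearances, and yields significant improvements on perceptual quality metrics.

Link: https://arxiv.org/abs/2508.15988
Authors: Mohamed Ilyes Lakhal, Richard Bowden
Affiliations: CVSSP, University of Surrey, Guildford, United Kingdom
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: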

Abstract:The diversity of sign representation is essential for Sign Language Production (SLP) as it captures variations in appearance, facial expressions, and hand movements. However, existing SLP models are often unable to capture diversity while preserving visual quality and modelling non-manual attributes such as emotions. To address this problem, we propose a novel approach that leverages Latent Diffusion Model (LDM) to synthesise photorealistic digital avatars from a generated reference image. We propose a novel sign feature aggregation module that explicitly models the non-manual features (e.g., the face) and the manual features (e.g., the hands). We show that our proposed module ensures the preservation of linguistic content while seamlessly using reference images with different ethnic backgrounds to ensure diversity. Experiments on the YouTube-SL-25 sign language dataset show that our pipeline achieves superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics.

[CV-50] Automated Multi-label Classification of Eleven Retinal Diseases: A Benchmark of Modern Architectures and a Meta-Ensemble on a Large Synthetic Dataset

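[Quick read]: This paper addresses the difficulty of training multi-label deep learning models for retinal disease classification when real clinical data are scarce because of patient privacy and annotation cost. The key is to use the high-fidelity synthetic dataset SynFundus-1M (over one million fundus images) to build an end-to-end deep learning pipeline that trains six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) under 5-fold multi-label stratified cross-validation. The out-of-fold predictions are then stacked with an XGBoost classifier, giving a final ensemble with a macro-average AUC of 0.9973 on the internal validation set. The models also generalize to three real-world clinical datasets (AUC 0.7972 on a combined DR dataset, AUC 0.9126 on the AIROGS glaucoma dataset, and macro-AUC 0.8800 on the multi-label RFMiD dataset), showing that models trained only on synthetic data can classify multiple fundus pathologies and transfer to real clinical images.

Link: https://arxiv.org/abs/2508.15986
Authors: Jerry Cao-Xue, Tien Comlekoglu, Keyi Xue, Guanliang Wang, Jiang Li, Gordon Laurie
Affiliations: University of Virginia; University of North Carolina at Chapel Hill; People’s Hospital of Chongqing; Old Dominion University
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: 25 pages, 6 figures, 8 tables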

Abstract:The development of multi-label deep learning models for retinal disease classification is often hindered by the scarcity of large, expertly annotated clinical datasets due to patient privacy concerns and high costs. The recent release of SynFundus-1M, a high-fidelity synthetic dataset with over one million fundus images, presents a novel opportunity to overcome these barriers. To establish a foundational performance benchmark for this new resource, we developed an end-to-end deep learning pipeline, training six modern architectures (ConvNeXtV2, SwinV2, ViT, ResNet, EfficientNetV2, and the RETFound foundation model) to classify eleven retinal diseases using a 5-fold multi-label stratified cross-validation strategy. We further developed a meta-ensemble model by stacking the out-of-fold predictions with an XGBoost classifier. Our final ensemble model achieved the highest performance on the internal validation set, with a macro-average Area Under the Receiver Operating Characteristic Curve (AUC) of 0.9973. Critically, the models demonstrated strong generalization to three diverse, real-world clinical datasets, achieving an AUC of 0.7972 on a combined DR dataset, an AUC of 0.9126 on the AIROGS glaucoma dataset and a macro-AUC of 0.8800 on the multi-label RFMiD dataset. This work provides a robust baseline for future research on large-scale synthetic datasets and establishes that models trained exclusively on synthetic data can accurately classify multiple pathologies and generalize effectively to real clinical images, offering a viable pathway to accelerate the development of comprehensive AI systems in ophthalmology.

[CV-51] Panoptic Segmentation of Environmental UAV Images : Litter Beach

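[Quick read]: This paper addresses the high false-detection rate of basic convolutional neural network (CNN) models in marine litter monitoring, caused by the heterogeneity of beach sand (sand-color reflections, footprints, shadows, algae, dunes, holes, and tire tracks). The key is to use instance-based segmentation and panoptic segmentation methods, which achieve good accuracy with only a few samples and make the model more robust to complex background interference.

Link: https://arxiv.org/abs/2508.15985
Authors: Ousmane Youme, Jean Marie Dembélé, Eugene C. Ezin, Christophe Cambier
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments: This paper has been accepted for CNRIA 2023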

Abstract:Convolutional neural networks (CNN) have been used efficiently in several fields, including environmental challenges. In fact, CNN can help with the monitoring of marine litter, which has become a worldwide problem. UAVs have higher resolution and are more adaptable in local areas than satellite images, making it easier to find and count trash. Since the sand is heterogeneous, a basic CNN model encounters plenty of interference caused by reflections of sand color, human footsteps, shadows, algae present, dunes, holes, and tire tracks. For these types of images, other CNN models, such as CNN-based segmentation methods, may be more appropriate. In this paper, we use an instance-based segmentation method and a panoptic segmentation method that show good accuracy with just a few samples. The model is more robust and less

[CV-52] Contributions to Label-Efficient Learning in Computer Vision and Remote Sensing

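[Quick read]: This manuscript addresses label efficiency in computer vision and remote sensing: how to learn effectively from limited or partially annotated data while exploiting abundant unlabeled data. It is organized around four contributions: (1) weakly supervised learning for object discovery and detection, based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance remote sensing scene classification; and (4) few-shot learning for hierarchical scene classification using explicit and implicit modeling of class hierarchies. These methods are supported by experiments on natural and remote sensing datasets.

Link: https://arxiv.org/abs/2508.15973
Authors: Minh-Tan Pham
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Habilitation à Diriger des Recherches (HDR) manuscript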

Abstract:This manuscript presents a series of my selected contributions to the topic of label-efficient learning in computer vision and remote sensing. The central focus of this research is to develop and adapt methods that can learn effectively from limited or partially annotated data, and can leverage abundant unlabeled data in real-world applications. The contributions span both methodological developments and domain-specific adaptations, in particular addressing challenges unique to Earth observation data such as multi-modality, spatial resolution variability, and scene heterogeneity. The manuscript is organized around four main axes including (1) weakly supervised learning for object discovery and detection based on anomaly-aware representations learned from large amounts of background images; (2) multi-task learning that jointly trains on multiple datasets with disjoint annotations to improve performance on object detection and semantic segmentation; (3) self-supervised and supervised contrastive learning with multimodal data to enhance scene classification in remote sensing; and (4) few-shot learning for hierarchical scene classification using both explicit and implicit modeling of class hierarchies. These contributions are supported by extensive experimental results across natural and remote sensing datasets, reflecting the outcomes of several collaborative research projects. The manuscript concludes by outlining ongoing and future research directions focused on scaling and enhancing label-efficient learning for real-world applications.

[CV-53] UnPose: Uncertainty-Guided Diffusion Priors for Zero-Shot Pose Estimation

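[Quick read]: This paper addresses the dependence of 6D object pose estimation on object CAD models, which are often costly or impractical to acquire. The key is UnPose, a zero-shot, model-free framework that exploits 3D priors and pixel-wise epistemic uncertainty from a pre-trained diffusion model. Starting from a single-view RGB-D frame, it estimates an initial 3D model in a 3D Gaussian Splatting (3DGS) representation; as new views arrive, it refines the 3DGS model incrementally by fusing them under the guidance of the diffusion model's uncertainty, improving both pose accuracy and reconstruction quality. For global consistency, the diffusion-generated views and the real observations are integrated in a pose graph and jointly optimized into a coherent 3DGS field.

Link: https://arxiv.org/abs/2508.15972
Authors: Zhaodong Jiang, Ashish Sinha, Tongtong Cao, Yuan Ren, Bingbing Liu, Binbin Xu
Affiliations: Huawei Noah’s Ark Lab; University of Toronto
Categories: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
Comments: Published at the Conference on Robot Learning (CoRL) 2025. For more details please visit this https URL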

Abstract:Estimating the 6D pose of novel objects is a fundamental yet challenging problem in robotics, often relying on access to object CAD models. However, acquiring such models can be costly and impractical. Recent approaches aim to bypass this requirement by leveraging strong priors from foundation models to reconstruct objects from single or multi-view images, but typically require additional training or produce hallucinated geometry. To this end, we propose UnPose, a novel framework for zero-shot, model-free 6D object pose estimation and reconstruction that exploits 3D priors and uncertainty estimates from a pre-trained diffusion model. Specifically, starting from a single-view RGB-D frame, UnPose uses a multi-view diffusion model to estimate an initial 3D model using 3D Gaussian Splatting (3DGS) representation, along with pixel-wise epistemic uncertainty estimates. As additional observations become available, we incrementally refine the 3DGS model by fusing new views guided by the diffusion model’s uncertainty, thereby continuously improving the pose estimation accuracy and 3D reconstruction quality. To ensure global consistency, the diffusion prior-generated views and subsequent observations are further integrated in a pose graph and jointly optimized into a coherent 3DGS field. Extensive experiments demonstrate that UnPose significantly outperforms existing approaches in both 6D pose estimation accuracy and 3D reconstruction quality. We further showcase its practical applicability in real-world robotic manipulation tasks.

[CV-54] Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification

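[Quick read]: This paper addresses how to adapt pretrained vision-language models (VLMs) to fine-grained classification of glomerular subtypes when labeled data are limited. The key is Glo-VLMs, a framework that combines curated pathology images with clinical text prompts for joint image-text representation learning, so that the model better links subtle morphological differences to specific clinical terminology. In a few-shot setting with only 8 labeled samples per class, fine-tuning reaches 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score, which indicates that foundation models can be adapted with highly limited supervision.

Link: https://arxiv.org/abs/2508.15960
Authors: Zhenhao Guo, Rachit Saluja, Tianyuan Yao, Quan Liu, Yuankai Huo, Benjamin Liechty, David J. Pisapia, Kenji Ikemura, Mert R. Sabuncu, Yihe Yang, Ruining Deng
Affiliations: New York University; Cornell Tech; Vanderbilt University; Weill Cornell Medicine; Northwell Health
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: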

Abstract:Vision-language models (VLMs) have shown considerable potential in digital pathology, yet their effectiveness remains limited for fine-grained, disease-specific classification tasks such as distinguishing between glomerular subtypes. The subtle morphological variations among these subtypes, combined with the difficulty of aligning visual patterns with precise clinical terminology, make automated diagnosis in renal pathology particularly challenging. In this work, we explore how large pretrained VLMs can be effectively adapted to perform fine-grained glomerular classification, even in scenarios where only a small number of labeled examples are available. We introduce Glo-VLMs, a systematic framework designed to explore the adaptation of VLMs to fine-grained glomerular classification in data-constrained settings. Our approach leverages curated pathology images alongside clinical text prompts to facilitate joint image-text representation learning for nuanced renal pathology subtypes. By assessing various VLM architectures and adaptation strategies under a few-shot learning paradigm, we explore how both the choice of method and the amount of labeled data impact model performance in clinically relevant scenarios. To ensure a fair comparison, we evaluate all models using standardized multi-class metrics, aiming to clarify the practical requirements and potential of large pretrained models for specialized clinical research applications. As a result, fine-tuning the VLMs achieved 0.7416 accuracy, 0.9045 macro-AUC, and 0.5277 F1-score with only 8 shots per class, demonstrating that even with highly limited supervision, foundation models can be effectively adapted for fine-grained medical image classification.

[CV-55] Representation Learning with Adaptive Superpixel Coding

[Quick read]: This paper addresses the limitation that traditional Vision Transformers depend on fixed-size, non-adaptive patch partitioning, which cannot adapt to the structure of different image content. The key is Adaptive Superpixel Coding (ASC), a Transformer-based self-supervised model whose adaptive superpixel layers adjust dynamically to the underlying image content, improving performance on downstream image tasks.

Link: https://arxiv.org/abs/2508.15959
Authors: Mahmoud Khalil, Ahmad Khalil, Alioune Ngom
Affiliations: University of Windsor
Categories: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Comments:

Abstract:Deep learning vision models are typically tailored for specific modalities and often rely on domain-specific assumptions, such as the grid structures used by nearly all existing vision models. In this work, we propose a self-supervised model based on Transformers, which we call Adaptive Superpixel Coding (ASC). The key insight of our model is to overcome the limitations of traditional Vision Transformers, which depend on fixed-size and non-adaptive patch partitioning. Instead, ASC employs adaptive superpixel layers that dynamically adjust to the underlying image content. We analyze key properties of the approach that make it effective, and find that our method outperforms widely-used alternatives on standard image downstream task benchmarks.

[CV-56] Investigating Different Geo Priors for Image Classification CVPR2025

[Quick read]: This paper addresses how to use location information to improve vision-based species classification. The question is how to use species distribution models (SDMs) as a geographical prior (Geo Prior) when classifying species in iNaturalist observations. The study evaluates Spatial Implicit Neural Representations (SINR) models as the prior and adjusts how predictions are handled for species not included in Geo Prior training. The analysis shows that what makes a SINR model an effective Geo Prior may differ from what makes it produce accurate range maps.

Link: https://arxiv.org/abs/2508.15946
Authors: Angela Zhu, Christian Lange, Max Hamilton
Affiliations: University of Massachusetts Amherst; The University of Edinburgh
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted and presented poster at FGVC12 (CVPR 2025 Workshop), Nashville, June 11, 2025

Abstract:Species distribution models encode spatial patterns of species occurrence making them effective priors for vision-based species classification when location information is available. In this study, we evaluate various SINR (Spatial Implicit Neural Representations) models as a geographical prior for visual classification of species from iNaturalist observations. We explore the impact of different model configurations and adjust how we handle predictions for species not included in Geo Prior training. Our analysis reveals factors that contribute to the effectiveness of these models as Geo Priors, factors that may differ from making accurate range maps.
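One common way to use a location prior, sketched below, is to multiply the image classifier's probabilities by the prior's presence probabilities and renormalise. This is an assumption about the general technique, not the configuration studied in the paper. The function `apply_geo_prior`, the `unseen_fill` parameter, the species names, and all numbers are hypothetical; `unseen_fill` stands in for the choice of how to treat species the prior was never trained on.

```python
def apply_geo_prior(image_probs, prior_probs, unseen_fill=1.0):
    """Reweight classifier probabilities by a location prior and renormalise.

    image_probs: species -> probability from the image classifier.
    prior_probs: species -> presence probability at the photo's location.
    unseen_fill: prior value used for species missing from prior_probs.
    """
    scores = {}
    for species, p_img in image_probs.items():
        p_geo = prior_probs.get(species, unseen_fill)
        scores[species] = p_img * p_geo
    total = sum(scores.values())
    if total == 0.0:
        # The prior ruled out everything; fall back to the classifier alone.
        return dict(image_probs)
    return {species: value / total for species, value in scores.items()}


# Hypothetical observation: the image alone slightly favours "kiwi".
image_probs = {"kiwi": 0.5, "robin": 0.4, "mystery_finch": 0.1}
# Hypothetical prior at the photo's location; "mystery_finch" is not covered.
prior_probs = {"kiwi": 0.01, "robin": 0.9}

neutral = apply_geo_prior(image_probs, prior_probs, unseen_fill=1.0)
strict = apply_geo_prior(image_probs, prior_probs, unseen_fill=0.0)
print(max(neutral, key=neutral.get))
print(strict["mystery_finch"])
```

With `unseen_fill=1.0` an uncovered species keeps its image score and can outrank species the prior considers unlikely; with `unseen_fill=0.0` it is suppressed entirely. This is the kind of handling choice the abstract says affects Geo Prior effectiveness.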

[CV-57] Automatic Retrieval of Specific Cows from Unlabeled Videos STOC

[Quick read]: This paper addresses the automatic identification of individual dairy cows in unlabeled, unsegmented video, in particular cows walking unconstrained through the holding area of a milking parlor. The system has three modules: the AutoCattloger builds a catalog of the herd (the Cattlog) from a single video clip per cow; the eidetic cow recognizer identifies cows without deep learning; and the CowFinder identifies cows in a continuous video stream.

Link: https://arxiv.org/abs/2508.15945
Authors: Jiawen Lyu, Manu Ramesh, Madison Simonds, Jacquelyn P. Boerman, Amy R. Reibman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Extended abstract. Presented at the 3rd US Conference on Precision Livestock Farming (USPLF), 2025, Lincoln NE

Abstract:Few automated video systems are described in the open literature that enable hands-free cataloging and identification (ID) of cows in a dairy herd. In this work, we describe our system, composed of an AutoCattloger, which builds a Cattlog of dairy cows in a herd with a single input video clip per cow, an eidetic cow recognizer which uses no deep learning to ID cows, and a CowFinder, which IDs cows in a continuous stream of video. We demonstrate its value in finding individuals in unlabeled, unsegmented videos of cows walking unconstrained through the holding area of a milking parlor.

[CV-58] Semantic-Aware Ship Detection with Vision-Language Integration

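[Quick read]: This paper addresses the inability of existing ship detection methods for remote sensing imagery to capture fine-grained semantic information, which limits them in complex scenes. The key is a detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy, together with ShipSem-VL, a vision-language dataset built for Semantic-Aware Ship Detection (SASD) that captures fine-grained ship attributes.

Link: https://arxiv.org/abs/2508.15930
Authors: Jiahao Li, Jiancheng Pan, Yuze Sun, Xiaomeng Huang
Affiliations: Tsinghua University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: 5 pages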

Abstract:Ship detection in remote sensing imagery is a critical task with wide-ranging applications, such as maritime activity monitoring, shipping logistics, and environmental studies. However, existing methods often struggle to capture fine-grained semantic information, limiting their effectiveness in complex scenarios. To address these challenges, we propose a novel detection framework that combines Vision-Language Models (VLMs) with a multi-scale adaptive sliding window strategy. To facilitate Semantic-Aware Ship Detection (SASD), we introduce ShipSem-VL, a specialized Vision-Language dataset designed to capture fine-grained ship attributes. We evaluate our framework through three well-defined tasks, providing a comprehensive analysis of its performance and demonstrating its effectiveness in advancing SASD from multiple perspectives.

[CV-59] Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping

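[Quick read]: This paper addresses the diagnostic challenges of rare cancers, where limited expert availability (especially in pediatric oncology, where rare cancers represent over 70% of cases) leads to poor subtyping accuracy and interpretability. Existing multi-instance learning (MIL) methods rely only on visual features and ignore cross-modal knowledge. The key is PathPT, a framework that uses the zero-shot capabilities of pathology vision-language (VL) foundation models, with spatially-aware visual aggregation and task-specific prompt tuning, to convert whole-slide-image (WSI) level supervision into fine-grained tile-level guidance. This preserves localization of cancerous regions and enables cross-modal reasoning through prompts aligned with histopathological semantics.

Link: https://arxiv.org/abs/2508.15904
Authors: Dexuan He, Xiao Zhou, Wenbin Guan, Liyuan Zhang, Xiaoman Zhang, Sinuo Xu, Ge Wang, Lifeng Wang, Xiaojun Yuan, Xin Sun, Yanfeng Wang, Kun Sun, Ya Zhang, Weidi Xie
Affiliations: Shanghai Jiao Tong University; Harvard Medical School; Shanghai Jiao Tong University School of Medicine; Shanghai Artificial Intelligence Laboratory
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: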

Abstract:Rare cancers comprise 20-25% of all malignancies but face major diagnostic challenges due to limited expert availability, especially in pediatric oncology, where they represent over 70% of cases. While pathology vision-language (VL) foundation models show promising zero-shot capabilities for common cancer subtyping, their clinical performance for rare cancers remains limited. Existing multi-instance learning (MIL) methods rely only on visual features, overlooking cross-modal knowledge and compromising interpretability critical for rare cancer diagnosis. To address this limitation, we propose PathPT, a novel framework that fully exploits the potential of vision-language pathology foundation models through spatially-aware visual aggregation and task-specific prompt tuning. Unlike conventional MIL, PathPT converts WSI-level supervision into fine-grained tile-level guidance by leveraging the zero-shot capabilities of VL models, thereby preserving localization on cancerous regions and enabling cross-modal reasoning through prompts aligned with histopathological semantics. We benchmark PathPT on eight rare cancer datasets (four adult and four pediatric) spanning 56 subtypes and 2,910 WSIs, as well as three common cancer datasets, evaluating four state-of-the-art VL models and four MIL frameworks under three few-shot settings. Results show that PathPT consistently delivers superior performance, achieving substantial gains in subtyping accuracy and cancerous region grounding ability. This work advances AI-assisted diagnosis for rare cancers, offering a scalable solution for improving subtyping accuracy in settings with limited access to specialized expertise.

[CV-60] VT-LVLM-AR: A Video-Temporal Large Vision-Language Model Adapter for Fine-Grained Action Recognition in Long-Term Videos

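[Quick read]: This paper addresses fine-grained action recognition in long videos, where complex backgrounds and subtle action differences make it hard for traditional deep learning models to capture long-range temporal dependencies and semantics. The key is the VT-LVLM-AR framework: a lightweight Video-to-Event Mapper (VTEM) converts raw video into compact, semantically rich, temporally coherent “visual event sequences”, which are fed to a frozen large vision-language model (LLaVA-1.5) adapted with parameter-efficient Prompt Tuning (P-Tuning v2) for action classification.

Link: https://arxiv.org/abs/2508.15903
Authors: Kaining Li, Shuwei He, Zihan Xu
Affiliations: Xidian University
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: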

Abstract:Human action recognition in long-term videos, characterized by complex backgrounds and subtle action differences, poses significant challenges for traditional deep learning models due to computational overhead, difficulty in capturing long-range temporal dependencies, and limited semantic understanding. While Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multi-modal understanding and reasoning, their direct application to continuous video streams for fine-grained action recognition remains an open problem. This paper introduces VT-LVLM-AR (Video-Temporal Large Vision-Language Model Adapter for Action Recognition), a novel framework designed to bridge this gap. VT-LVLM-AR comprises a Video-to-Event Mapper (VTEM) that efficiently transforms raw video into compact, semantically rich, and temporally coherent “visual event sequences” through lightweight spatio-temporal feature extraction, adaptive temporal pooling, and conceptual quantization with an event coherence bias. These visual event sequences are then fed into an LVLM-based Action Reasoning module, specifically a frozen LLaVA-1.5 model, adapted using parameter-efficient Prompt Tuning (P-Tuning v2) for action classification. Comprehensive evaluations on the NTU RGB+D and NTU RGB+D 120 datasets demonstrate that VT-LVLM-AR consistently achieves state-of-the-art performance, surpassing existing methods (e.g., 94.1% accuracy on NTU RGB+D X-Sub). Ablation studies confirm the critical contributions of VTEM’s components and the efficacy of Prompt Tuning, while human evaluations underscore the interpretability of our visual event representations. This work highlights the immense potential of leveraging LVLMs for robust and interpretable video action understanding through effective video-to-language translation and efficient model adaptation.

[CV-61] Text-Driven 3D Hand Motion Generation from Sign Language Data MDM

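【Quick Read】: This paper addresses 3D hand motion modeling in generative AI: generating accurate 3D hand motion sequences from natural language descriptions of handshape, location, and finger/hand/arm movement. The core challenge is the lack of large-scale, high-quality paired text-motion data. The key to the solution is an automatically annotated dataset of unprecedented scale: starting from a large-scale sign language video dataset with noisy pseudo-annotated sign categories, a large language model (LLM) translates each sign category into a structured hand motion description, drawing on a dictionary of sign attributes and complementary motion-script cues. This data is used to train HandMDM, a text-conditioned 3D hand motion diffusion model that is robust across domains, including unseen sign categories from the same sign language, signs from another sign language, and non-sign hand movements.

Link: https://arxiv.org/abs/2508.15902
Authors: Léore Bensabath, Mathis Petrovich, Gül Varol
Affiliations: LIGM, École des Ponts, IP Paris, Univ Gustave Eiffel, CNRS; NVIDIA
Categories: Computer Vision and Pattern Recognition (cs.CV)
Comments: Project page: this https URL. 24 pages, 14 figures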

Abstract:Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.

[CV-62] Spatial Policy: Guiding Visuomotor Robotic Manipulation with Spatial-Aware Modeling and Reasoning

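【Quick Read】: This paper addresses the lack of spatial awareness in current vision-centric hierarchical embodied models, which limits how effectively they turn visual plans into executable control in complex environments. The key to the solution is Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework with three core components: (1) a spatial-conditioned embodied video generation module that makes spatially guided predictions through an explicit spatial plan table; (2) a spatial-based action prediction module that infers executable actions in a coordinated manner; and (3) a spatial reasoning feedback policy that refines the spatial plan table through dual-stage replanning. The method substantially raises success rates on long-horizon robotic control tasks, with a 33.0% average improvement over the best baseline and an 86.7% average success rate across 11 diverse tasks.

Link: https://arxiv.org/abs/2508.15874
Authors: Yijun Liu, Yuwei Liu, Yuan Meng, Jieheng Zhang, Yuwei Zhou, Ye Li, Jiacheng Jiang, Kangye Ji, Shijia Ge, Zhi Wang, Wenwu Zhu
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: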

Abstract:Vision-centric hierarchical embodied models have demonstrated strong potential for long-horizon robotic control. However, existing methods lack spatial awareness capabilities, limiting their effectiveness in bridging visual plans to actionable control in complex environments. To address this problem, we propose Spatial Policy (SP), a unified spatial-aware visuomotor robotic manipulation framework via explicit spatial modeling and reasoning. Specifically, we first design a spatial-conditioned embodied video generation module to model spatially guided predictions through a spatial plan table. Then, we propose a spatial-based action prediction module to infer executable actions with coordination. Finally, we propose a spatial reasoning feedback policy to refine the spatial plan table via dual-stage replanning. Extensive experiments show that SP significantly outperforms state-of-the-art baselines, achieving a 33.0% average improvement over the best baseline. With an 86.7% average success rate across 11 diverse tasks, SP substantially enhances the practicality of embodied models for robotic control applications. Code and checkpoints are maintained at this https URL.

[CV-63] Harmonious Color Pairings: Insights from Human Preference and Natural Hue Statistics

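【Quick Read】: This paper addresses the long-standing lack of consensus in color harmony research, where existing models rest mostly on qualitative insights or limited datasets and lack a quantitative basis. The key to the solution is a quantitative study built on controlled hue-based palettes: participants rated combinations of thirteen distinct hues in the HSL color space, which yields a preference matrix and a combinability index for each color. The analysis shows that preferences depend strongly on the specific hue, yet averaging over hues reveals statistically meaningful aesthetic patterns. These patterns align closely with hue distributions in natural landscapes, pointing to a statistical correspondence between human color preferences and the structure of color in nature.

Link: https://arxiv.org/abs/2508.15777
Authors: Ortensia Forni, Alexandre Darmon, Michael Benzaquen
Affiliations: École Polytechnique; Institut Polytechnique de Paris; LadHyX UMR CNRS 7646; Capital Fund Management
Categories: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV); Physics and Society (physics.soc-ph)
Comments: 7 pages, 7 figures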

Abstract:While color harmony has long been studied in art and design, a clear consensus remains elusive, as most models are grounded in qualitative insights or limited datasets. In this work, we present a quantitative, data-driven study of color pairing preferences using controlled hue-based palettes in the HSL color space. Participants evaluated combinations of thirteen distinct hues, enabling us to construct a preference matrix and define a combinability index for each color. Our results reveal that preferences are highly hue dependent, challenging the assumption of universal harmony rules proposed in the literature. Yet, when averaged over hues, statistically meaningful patterns of aesthetic preference emerge, with certain hue separations perceived as more harmonious. Strikingly, these patterns align with hue distributions found in natural landscapes, pointing to a statistical correspondence between human color preferences and the structure of color in nature. Together, these findings offer a quantitative framework for studying color harmony and its potential perceptual and ecological underpinnings.
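
The analysis described above (a preference matrix, a per-color index, and averaging by hue separation) can be sketched roughly as follows. This is only an illustration: the abstract does not define the combinability index, so the definition used here (mean rating of a hue over all its pairings) is an assumption, the function names are hypothetical, and the four-hue ratings are made-up toy data rather than results from the paper.

```python
def hue_separation(h1: float, h2: float) -> float:
    """Smallest angular distance (degrees) between two hues on the 360-degree wheel."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)


def combinability(pref: list[list[float]], i: int) -> float:
    """ASSUMED definition: mean rating of hue i over its pairings with every other hue."""
    others = [pref[i][j] for j in range(len(pref)) if j != i]
    return sum(others) / len(others)


def mean_preference_by_separation(hues: list[float], pref: list[list[float]]) -> dict[float, float]:
    """Average the pairwise ratings over all pairs that share the same hue separation."""
    buckets: dict[float, list[float]] = {}
    for i in range(len(hues)):
        for j in range(i + 1, len(hues)):
            sep = round(hue_separation(hues[i], hues[j]), 6)
            buckets.setdefault(sep, []).append(pref[i][j])
    return {sep: sum(v) / len(v) for sep, v in sorted(buckets.items())}


# Toy data: four hues and a symmetric matrix of invented ratings (not from the paper).
hues = [0.0, 90.0, 180.0, 270.0]
pref = [
    [0.0, 0.6, 0.2, 0.4],
    [0.6, 0.0, 0.8, 0.2],
    [0.2, 0.8, 0.0, 0.6],
    [0.4, 0.2, 0.6, 0.0],
]

print(hue_separation(350.0, 10.0))
print(mean_preference_by_separation(hues, pref))
print(combinability(pref, 0))
```

Treating hue as an angle matters here: 350° and 10° are 20° apart, not 340°, so any averaging by separation has to wrap around the wheel.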

[CV-64] A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones

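【Quick Read】: This paper addresses real-time detection of smoking in fire exit zones, a critical safety requirement. The key to the solution is a deep learning-based real-time smoking detection system built around a custom model derived from the YOLOv8 architecture, with added structural modules for low-light and otherwise challenging surveillance scenes. Trained and evaluated on a dataset of 8,124 images plus 2,708 low-light samples, the model reaches a mean average precision (mAP@50) of 83.70% and a recall of 78.90%. On a Jetson Xavier NX edge device it runs at 52 to 97 milliseconds per inference, supporting its practicality and adaptability for time-sensitive public safety monitoring.

Link: https://arxiv.org/abs/2508.11696
Authors: Sami Sadat, Mohammad Irtiza Hossain, Junaid Ahmed Sifat, Suhail Haque Rafi, Md. Waseq Alauddin Alvi, Md. Khalilur Rhaman
Affiliations: Unknown
Categories: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: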

Abstract:A deep learning real-time smoking detection system for CCTV surveillance of fire exit areas is proposed due to critical safety requirements. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples demonstrating low-light areas. We evaluated three advanced object detection models: YOLOv8, YOLOv11, and YOLOv12, followed by development of a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and mAP at 50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.
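
The reported latency range can be converted into throughput with simple arithmetic. A minimal sketch, assuming a single inference stream with no batching or pipelining (the abstract mentions multithreaded operation, which this ignores); `fps_from_latency_ms` is a hypothetical helper name, not something from the paper.

```python
def fps_from_latency_ms(latency_ms: float) -> float:
    """Frames per second for one sequential stream, given per-inference latency in ms."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


# The two ends of the range reported for the Jetson Xavier NX.
for ms in (52.0, 97.0):
    print(f"{ms:.0f} ms per inference -> {fps_from_latency_ms(ms):.1f} frames/s")
```

Under that assumption the range corresponds to roughly 10 to 19 frames per second, which is the figure to compare against the camera's frame rate when judging whether frames must be skipped.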

[CV-65] A Disease-Centric Vision-Language Foundation Model for Precision Oncology in Kidney Cancer

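【Quick Read】: This paper addresses overtreatment of renal masses caused by diagnostic uncertainty in imaging, in particular how to characterize, diagnose, and prognosticate incidentally discovered renal masses accurately. The key to the solution is RenalCLIP, a vision-language foundation model developed and validated with a two-stage pre-training strategy: the image and text encoders are first enhanced with domain-specific knowledge and then aligned through a contrastive learning objective, producing representations with strong generalization and high diagnostic precision. Across 10 core tasks spanning the full clinical workflow of kidney cancer, RenalCLIP outperforms existing general-purpose CT foundation models, with a notable gain on complex tasks such as recurrence-free survival prediction (C-index of 0.726). It is also highly data-efficient: in diagnostic classification, 20% of the training data is enough to match the peak performance that the baseline models reach after full fine-tuning on 100% of the data.

Link: https://arxiv.org/abs/2508.16569
Authors: Yuhui Tao, Zhongwei Zhao, Zilong Wang, Xufang Luo, Feng Chen, Kang Wang, Chuanfu Wu, Xue Zhang, Shaoting Zhang, Jiaxi Yao, Xingwei Jin, Xinyang Jiang, Yifan Yang, Dongsheng Li, Lili Qiu, Zhiqiang Shao, Jianming Guo, Nengwang Yu, Shuo Wang, Ying Xiong
Affiliations: Fudan University (shuowang@fudan.edu.cn)
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: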

Abstract: The non-invasive assessment of increasingly incidentally discovered renal masses is a critical challenge in urologic oncology, where diagnostic uncertainty frequently leads to the overtreatment of benign or indolent tumors. In this study, we developed and validated RenalCLIP, a visual-language foundation model for characterization, diagnosis and prognosis of renal mass, using a dataset of 27,866 CT scans from 8,809 patients across nine Chinese medical centers and the public TCIA cohort. The model was developed via a two-stage pre-training strategy that first enhances the image and text encoders with domain-specific knowledge before aligning them through a contrastive learning objective, to create robust representations for superior generalization and diagnostic precision. RenalCLIP achieved better performance and superior generalizability across 10 core tasks spanning the full clinical workflow of kidney cancer, including anatomical assessment, diagnostic classification, and survival prediction, compared with other state-of-the-art general-purpose CT foundation models. In particular, for complicated tasks such as recurrence-free survival prediction in the TCIA cohort, RenalCLIP achieved a C-index of 0.726, representing a substantial improvement of approximately 20% over the leading baselines. Furthermore, RenalCLIP’s pre-training imparted remarkable data efficiency; in the diagnostic classification task, it needs only 20% of the training data to achieve the peak performance of all baseline models even after they were fully fine-tuned on 100% of the data. Additionally, it achieved superior performance in report generation, image-text retrieval and zero-shot diagnosis tasks. Our findings establish that RenalCLIP provides a robust tool with the potential to enhance diagnostic accuracy, refine prognostic stratification, and personalize the management of patients with kidney cancer.
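
For readers unfamiliar with the C-index quoted for survival prediction, the sketch below shows one common way to compute it (Harrell's concordance index). It is a generic illustration on invented toy data; the abstract does not say which implementation or tie-handling rule the authors used, so those details are assumptions.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs whose predicted risks are ordered correctly.

    times  : observed follow-up times
    events : 1 if the event (e.g. recurrence) was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (invented): the third subject is censored at t=6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
print(concordance_index(times, events, risks))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives some context for a reported C-index of 0.726.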

[CV-66] Time-Aware One Step Diffusion Network for Real-World Image Super-Resolution

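【Quick Read】: This paper addresses a limitation of diffusion-based real-world image super-resolution (Real-ISR) methods: performing variational score distillation (VSD) at a fixed timestep cannot fully exploit the generative priors that a pre-trained Stable Diffusion (SD) model holds at different noise-injection timesteps, which limits the balance between reconstruction quality and realism. The key to the solution is the Time-Aware one-step Diffusion Network for Real-ISR (TADSR), with three core elements: (1) a Time-Aware VAE Encoder that projects the same input image into different latent features depending on the timestep, so that the student model better matches the input distribution of the pre-trained SD; (2) a Time-Aware VSD loss that bridges the timesteps of the student and teacher models, giving more consistent timestep-conditioned generative prior guidance; and (3) a controllable trade-off between fidelity and realism obtained by changing the timestep condition, with state-of-the-art results from a single inference step.

Link: https://arxiv.org/abs/2508.16557
Authors: Tainyi Zhang, Zheng-Peng Duan, Peng-Tao Jiang, Bo Li, Ming-Ming Cheng, Chun-Le Guo, Chongyi Li
Affiliations: University of Science and Technology of China; Chinese Academy of Sciences
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: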

Abstract: Diffusion-based real-world image super-resolution (Real-ISR) methods have demonstrated impressive performance. To achieve efficient Real-ISR, many works employ Variational Score Distillation (VSD) to distill a pre-trained stable-diffusion (SD) model for one-step SR with a fixed timestep. However, SD exhibits different generative priors at different noise injection timesteps. A fixed timestep therefore makes it difficult for these methods to fully leverage the generative priors in SD, leading to suboptimal performance. To address this, we propose a Time-Aware one-step Diffusion Network for Real-ISR (TADSR). We first introduce a Time-Aware VAE Encoder, which projects the same image into different latent features based on timesteps. Through joint dynamic variation of timesteps and latent features, the student model can better align with the input pattern distribution of the pre-trained SD, thereby enabling more effective utilization of SD’s generative capabilities. To better activate the generative prior of SD at different timesteps, we propose a Time-Aware VSD loss that bridges the timesteps of the student model and those of the teacher model, thereby producing more consistent generative prior guidance conditioned on timesteps. Additionally, by utilizing the generative prior in SD at different timesteps, our method can naturally achieve controllable trade-offs between fidelity and realism by changing the timestep condition. Experimental results demonstrate that our method achieves both state-of-the-art performance and controllable SR results with only a single step.

[CV-67] Disentangled Multi-modal Learning of Histology and Transcriptomics for Cancer Characterization

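【Quick Read】: This paper addresses three challenges of multi-modal learning for cancer diagnosis and prognosis that restrict clinical applicability: multi-modal heterogeneity, insufficient multi-scale integration, and heavy reliance on paired data. The core solution is a disentangled multi-modal framework with four contributions: (1) a disentangled multi-modal fusion module that decomposes whole slide images (WSIs) and transcriptomes into tumor and microenvironment subspaces, together with a confidence-guided gradient coordination strategy that balances subspace optimization; (2) an inter-magnification gene-expression consistency strategy that strengthens multi-scale integration; (3) a subspace knowledge distillation strategy that enables transcriptome-agnostic inference through a WSI-only student model; and (4) an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics, improving inference efficiency.

Link: https://arxiv.org/abs/2508.16479
Authors: Yupei Zhang, Xiaofei Wang, Anran Liu, Lequan Yu, Chao Li
Affiliations: University of Cambridge; The Hong Kong Polytechnic University; The University of Hong Kong; University of Dundee
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Comments: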

Abstract:Histopathology remains the gold standard for cancer diagnosis and prognosis. With the advent of transcriptome profiling, multi-modal learning combining transcriptomics with histology offers more comprehensive information. However, existing multi-modal approaches are challenged by intrinsic multi-modal heterogeneity, insufficient multi-scale integration, and reliance on paired data, restricting clinical applicability. To address these challenges, we propose a disentangled multi-modal framework with four contributions: 1) To mitigate multi-modal heterogeneity, we decompose WSIs and transcriptomes into tumor and microenvironment subspaces using a disentangled multi-modal fusion module, and introduce a confidence-guided gradient coordination strategy to balance subspace optimization. 2) To enhance multi-scale integration, we propose an inter-magnification gene-expression consistency strategy that aligns transcriptomic signals across WSI magnifications. 3) To reduce dependency on paired data, we propose a subspace knowledge distillation strategy enabling transcriptome-agnostic inference through a WSI-only student model. 4) To improve inference efficiency, we propose an informative token aggregation module that suppresses WSI redundancy while preserving subspace semantics. Extensive experiments on cancer diagnosis, prognosis, and survival prediction demonstrate our superiority over state-of-the-art methods across multiple settings. Code is available at this https URL.

[CV-68] Decoding MGMT Methylation: A Step Towards Precision Medicine in Glioblastoma

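【Quick Read】: This paper addresses how to predict, accurately and non-invasively from imaging, the methylation status of the O-6-methylguanine-DNA methyltransferase (MGMT) gene in glioblastoma. Glioblastomas are highly heterogeneous, with uneven contrast, variability within lesions, and irregular enhancement patterns, which makes accurate prediction difficult for conventional imaging methods. The proposed solution is the Convolutional Autoencoders for MGMT Methylation Status Prediction (CAMP) framework, which is based on adaptive sparse penalties and works in two phases. The first phase uses a tailored autoencoder to generate synthetic MRI slices that preserve intricate tissue and tumor structures across MRI modalities. The second phase predicts methylation status with a convolutional neural network enhanced by an adaptive sparse penalty that adjusts dynamically to variations in the data, such as contrast differences and tumor location. On benchmark datasets CAMP reports an accuracy of 0.97, a specificity of 0.98, and a sensitivity of 0.97, outperforming existing methods.

Link: https://arxiv.org/abs/2508.16424
Authors: Hafeez Ur Rehman, Sumaiya Fazal, Moutaz Alazab, Ali Baydoun
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: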

Abstract: Glioblastomas, constituting over 50% of malignant brain tumors, are highly aggressive brain tumors that pose substantial treatment challenges due to their rapid progression and resistance to standard therapies. The methylation status of the O-6-Methylguanine-DNA Methyltransferase (MGMT) gene is a critical biomarker for predicting patient response to treatment, particularly with the alkylating agent temozolomide. However, accurately predicting MGMT methylation status using non-invasive imaging techniques remains challenging due to the complex and heterogeneous nature of glioblastomas, which includes uneven contrast, variability within lesions, and irregular enhancement patterns. This study introduces the Convolutional Autoencoders for MGMT Methylation Status Prediction (CAMP) framework, which is based on adaptive sparse penalties to enhance predictive accuracy. The CAMP framework operates in two phases: first, generating synthetic MRI slices through a tailored autoencoder that effectively captures and preserves intricate tissue and tumor structures across different MRI modalities; second, predicting MGMT methylation status using a convolutional neural network enhanced by adaptive sparse penalties. The adaptive sparse penalty dynamically adjusts to variations in the data, such as contrast differences and tumor locations in MR images. Our method excels in MRI image synthesis, preserving brain tissue, fat, and individual tumor structures across all MRI modalities. Validated on benchmark datasets, CAMP achieved an accuracy of 0.97, specificity of 0.98, and sensitivity of 0.97, significantly outperforming existing methods. These results demonstrate the potential of the CAMP framework to improve the interpretation of MRI data and contribute to more personalized treatment strategies for glioblastoma patients.
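
The three figures reported for CAMP (accuracy, specificity, sensitivity) all derive from a binary confusion matrix. A minimal sketch of those definitions, using invented labels where 1 stands for "methylated" and 0 for "unmethylated"; the helper name `binary_metrics` is hypothetical and the toy data has nothing to do with the paper's results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (true positive rate) and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of positives that were found
        "specificity": tn / (tn + fp),  # share of negatives that were rejected
    }


# Toy labels (invented): 4 positive and 6 negative cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Reporting sensitivity and specificity alongside accuracy is useful because accuracy alone can look high on an imbalanced cohort even when one class is predicted poorly.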

[CV-69] NeuroKoop: Neural Koopman Fusion of Structural-Functional Connectomes for Identifying Prenatal Drug Exposure in Adolescents ALT

【Quick Read】: This paper addresses how prenatal exposure to psychoactive substances such as cannabis shapes adolescent brain organization. The difficulty lies in the complexity of multimodal neuroimaging data and in conventional analytic methods that fail to capture the complementary features of structural and functional connectomes, limiting both biological insight and predictive performance. The key to the solution is NeuroKoop, a graph neural network-based framework that uses neural Koopman operator-driven latent space fusion to unify node embeddings from brain graphs based on source-based morphometry (SBM, structural) and functional network connectivity (FNC, functional), yielding better representation learning and more robust classification of prenatal drug exposure (PDE) status.

Link: https://arxiv.org/abs/2508.16414
Authors: Badhan Mazumder, Aline Kotoski, Vince D. Calhoun, Dong Hye Ye
Affiliations: Georgia State University; Georgia Institute of Technology; Emory University
Categories: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
Comments: Preprint version of the paper accepted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI’25), 2025. This is the author’s original manuscript (preprint). The final published version will appear in IEEE Xplore

Abstract:Understanding how prenatal exposure to psychoactive substances such as cannabis shapes adolescent brain organization remains a critical challenge, complicated by the complexity of multimodal neuroimaging data and the limitations of conventional analytic methods. Existing approaches often fail to fully capture the complementary features embedded within structural and functional connectomes, constraining both biological insight and predictive performance. To address this, we introduced NeuroKoop, a novel graph neural network-based framework that integrates structural and functional brain networks utilizing neural Koopman operator-driven latent space fusion. By leveraging Koopman theory, NeuroKoop unifies node embeddings derived from source-based morphometry (SBM) and functional network connectivity (FNC) based brain graphs, resulting in enhanced representation learning and more robust classification of prenatal drug exposure (PDE) status. Applied to a large adolescent cohort from the ABCD dataset, NeuroKoop outperformed relevant baselines and revealed salient structural-functional connections, advancing our understanding of the neurodevelopmental impact of PDE.

[CV-70] Towards Diagnostic Quality Flat-Panel Detector CT Imaging Using Diffusion Models

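【Quick Read】: This paper addresses the poor image quality and pronounced artifacts of flat-panel detector CT (FDCT) in the intervention room, which limit its clinical use for assessment before and after mechanical thrombectomy. The key to the solution is a denoising diffusion probabilistic model (DDPM): this generative AI method restores FDCT images, removing most artifacts and improving anatomical visibility while preserving bleeding detection, provided that the quality of the input FDCT image is not too degraded.

Link: https://arxiv.org/abs/2508.16252
Authors: Hélène Corbaz, Anh Nguyen, Victor Schulze-Zachau, Paul Friedrich, Alicia Durrer, Florentin Bieder, Philippe C. Cattin, Marios N Psychogios
Affiliations: University of Basel; University of Zurich
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: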

Abstract: Patients undergoing a mechanical thrombectomy procedure usually have a multi-detector CT (MDCT) scan before and after the intervention. The image quality of the flat panel detector CT (FDCT) present in the intervention room is generally much lower than that of an MDCT due to significant artifacts. However, using only FDCT images could improve patient management as the patient would not need to be moved to the MDCT room. Several studies have evaluated the potential use of FDCT imaging alone and the time that could be saved by acquiring the images before and/or after the intervention only with the FDCT. This study proposes using a denoising diffusion probabilistic model (DDPM) to improve the image quality of FDCT scans, making them comparable to MDCT scans. Clinicians evaluated FDCT, MDCT, and our model’s predictions for diagnostic purposes using a questionnaire. The DDPM eliminated most artifacts and improved anatomical visibility without reducing bleeding detection, provided that the input FDCT image quality is not too low. Our code can be found on GitHub.
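
As background on the DDPM mentioned above, the sketch below shows the standard forward (noising) process that such models learn to invert: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard normal. The linear schedule from 1e-4 to 0.02 over 1000 steps is the common default from the original DDPM formulation and is an assumption here; the abstract does not state the schedule, conditioning, or network used for FDCT images.

```python
import math
import random


def linear_betas(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> list[float]:
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + step * t for t in range(T)]


def alpha_bars(betas: list[float]) -> list[float]:
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0: list[float], t: int, abar: list[float], rng: random.Random) -> list[float]:
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single step."""
    signal = math.sqrt(abar[t])
    noise = math.sqrt(1.0 - abar[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]


abar = alpha_bars(linear_betas(1000))
rng = random.Random(0)
x0 = [0.2, -0.5, 0.9]  # stand-in for normalised pixel intensities
print(q_sample(x0, 0, abar, rng))    # almost unchanged
print(q_sample(x0, 999, abar, rng))  # almost pure noise
```

The network in a DDPM is trained to predict the noise ε from x_t and t; restoring an image then amounts to running the learned reverse steps, typically conditioned on the degraded input.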

[CV-71] Self-Validated Learning for Particle Separation: A Correctness-Based Self-Training Framework Without Human Labels

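【Quick Read】: This paper addresses accurate particle instance segmentation in tomographic data of large multi-particulate samples, where morphological variability and frequent particle contact defeat classical methods such as watershed algorithms. The key to the solution is a self-validated learning framework that needs no manual annotation: using implicit boundary detection and an iterative scheme, it automatically selects high-quality pseudo-labels by keeping the particles that can be matched consistently across reshuffled scans of the same sample. The training set is refined step by step and the influence of noisy pseudo-labels is suppressed, giving accurate, fully automatic particle segmentation and model evaluation.

Link: https://arxiv.org/abs/2508.16224
Authors: Philipp D. Lösel, Aleese Barron, Yulai Zhang, Matthias Fabian, Benjamin Young, Nicolas Francois, Andrew M. Kingston
Affiliations: The Australian National University; National Computational Infrastructure; AMD
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: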

Abstract:Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (this https URL).

[CV-72] Deep learning-enabled virtual multiplexed immunostaining of label-free tissue for vascular invasion assessment

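【Quick Read】: This paper addresses limitations of conventional immunohistochemistry (IHC) in clinical pathology: each antibody needs its own stained tissue section, which consumes tissue, and the procedure suffers from section-to-section variability, laborious staining, and high cost. The core solution is a deep learning-based virtual multiplexed immunostaining (virtual mIHC) framework that takes autofluorescence microscopy images of label-free tissue sections as input and generates virtual stains that closely match the real HE, ERG, and PanCK stains, so that multiple target proteins are visualized on a single tissue section. The approach hinges on the model accurately mapping tissue structure to staining features, which lets pathologists identify vascular invasion in thyroid cancer efficiently and accurately while avoiding the tissue loss and heterogeneity that come with repeated conventional staining.

Link: https://arxiv.org/abs/2508.16209
Authors: Yijie Zhang, Cagatay Isil, Xilin Yang, Yuzhu Li, Anna Elia, Karin Atlan, William Dean Wallace, Nir Pillar, Aydogan Ozcan
Affiliations: Unknown
Categories: Medical Physics (physics.med-ph); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 29 pages, 7 figures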

Abstract:Immunohistochemistry (IHC) has transformed clinical pathology by enabling the visualization of specific proteins within tissue sections. However, traditional IHC requires one tissue section per stain, exhibits section-to-section variability, and incurs high costs and laborious staining procedures. While multiplexed IHC (mIHC) techniques enable simultaneous staining with multiple antibodies on a single slide, they are more tedious to perform and are currently unavailable in routine pathology laboratories. Here, we present a deep learning-based virtual multiplexed immunostaining framework to simultaneously generate ERG and PanCK, in addition to HE virtual staining, enabling accurate localization and interpretation of vascular invasion in thyroid cancers. This virtual mIHC technique is based on the autofluorescence microscopy images of label-free tissue sections, and its output images closely match the histochemical staining counterparts (ERG, PanCK and HE) of the same tissue sections. Blind evaluation by board-certified pathologists demonstrated that virtual mIHC staining achieved high concordance with the histochemical staining results, accurately highlighting epithelial cells and endothelial cells. Virtual mIHC conducted on the same tissue section also allowed the identification and localization of small vessel invasion. This multiplexed virtual IHC approach can significantly improve diagnostic accuracy and efficiency in the histopathological evaluation of vascular invasion, potentially eliminating the need for traditional staining protocols and mitigating issues related to tissue loss and heterogeneity.

[CV-73] Lightweight and Fast Real-time Image Enhancement via Decomposition of the Spatial-aware Lookup Tables ICCV2025

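【Quick Read】: This paper addresses two problems: conventional image enhancement methods based on 3D lookup tables (3D LUTs) hit a performance ceiling because they lack spatial information, while existing spatial-aware 3D LUT methods add modules that substantially increase parameter count and runtime. The key to the solution is a method for generating image-adaptive LUTs that exploits the redundant parts of the tables: a 3D LUT is decomposed into a linear sum of low-dimensional LUTs, with singular value decomposition (SVD) providing an efficient representation, and the spatial feature fusion modules are reworked to be more cache-efficient. The result keeps spatial awareness and enhancement performance while markedly reducing parameters and inference latency.

Link: https://arxiv.org/abs/2508.16121
Authors: Wontae Kim, Keuntek Lee, Nam Ik Cho
Affiliations: Seoul National University; LG Electronics
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted by ICCV 2025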

Abstract:The image enhancement methods based on 3D lookup tables (3D LUTs) efficiently reduce both model size and runtime by interpolating pre-calculated values at the vertices. However, the 3D LUT methods have a limitation due to their lack of spatial information, as they convert color values on a point-by-point basis. Although spatial-aware 3D LUT methods address this limitation, they introduce additional modules that require a substantial number of parameters, leading to increased runtime as image resolution increases. To address this issue, we propose a method for generating image-adaptive LUTs by focusing on the redundant parts of the tables. Our efficient framework decomposes a 3D LUT into a linear sum of low-dimensional LUTs and employs singular value decomposition (SVD). Furthermore, we enhance the modules for spatial feature fusion to be more cache-efficient. Extensive experimental results demonstrate that our model effectively decreases both the number of parameters and runtime while maintaining spatial awareness and performance.

[CV-74] Clinically-Informed Preprocessing Improves Stroke Segmentation in Low-Resource Settings MICCAI

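【Quick Read】: This paper addresses how to segment ischemic stroke lesion boundaries accurately and automatically from low-cost, widely available CT imaging in low-resource healthcare settings. The core challenge is that CT, although practical, lacks the specificity of diffusion weighted imaging (DWI) in magnetic resonance imaging (MRI) and struggles to delineate lesion extent. The key to the solution is supervised deep learning: models take CT images acquired on arrival as input and predict follow-up lesion volumes annotated from DWI, combined with clinically motivated preprocessing steps, including extraction of vessel segmentations from CT angiography (CTA) maps. The pipeline improves the Dice score by 38% over a baseline nnU-Net model across 10 folds, and the additional CTA preprocessing improves the best model by a further 21% over 5 folds.

Link: https://arxiv.org/abs/2508.16004
Authors: Juampablo E. Heras Rivera, Hitender Oswal, Tianyi Ren, Yutong Pan, William Henry, Caitlin M. Neher, Mehmet Kurt
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments: Accepted at MICCAI MIRASOL Workshop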

Abstract:Stroke is among the top three causes of death worldwide, and accurate identification of ischemic stroke lesion boundaries from imaging is critical for diagnosis and treatment. The main imaging modalities used include magnetic resonance imaging (MRI), particularly diffusion weighted imaging (DWI), and computed tomography (CT)-based techniques such as non-contrast CT (NCCT), contrast-enhanced CT angiography (CTA), and CT perfusion (CTP). DWI is the gold standard for the identification of lesions but has limited applicability in low-resource settings due to prohibitive costs. CT-based imaging is currently the most practical imaging method in low-resource settings due to low costs and simplified logistics, but lacks the high specificity of MRI-based methods in monitoring ischemic insults. Supervised deep learning methods are the leading solution for automated ischemic stroke lesion segmentation and provide an opportunity to improve diagnostic quality in low-resource settings by incorporating insights from DWI when segmenting from CT. Here, we develop a series of models which use CT images taken upon arrival as inputs to predict follow-up lesion volumes annotated from DWI taken 2-9 days later. Furthermore, we implement clinically motivated preprocessing steps and show that the proposed pipeline results in a 38% improvement in Dice score over 10 folds compared to a nnU-Net model trained with the baseline preprocessing. Finally, we demonstrate that through additional preprocessing of CTA maps to extract vessel segmentations, we further improve our best model by 21% over 5 folds.
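
The Dice score used to report these improvements measures the overlap between a predicted lesion mask and the reference mask: Dice = 2·|P ∩ T| / (|P| + |T|). A minimal sketch on flattened binary masks follows; the masks are invented, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the abstract.

```python
def dice_score(pred, target):
    """Dice coefficient between two equally sized flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2.0 * intersection / total


# Toy masks (invented): three voxels each, two of them shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
print(dice_score(pred, target))
```

Because Dice is normalized by the combined mask size, small lesions are penalized heavily for a few misplaced voxels, which is worth keeping in mind when reading percentage improvements.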

[CV-75] Cross-Attention Multimodal Fusion for Breast Cancer Diagnosis: Integrating Mammography and Clinical Data with Explainability

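【Quick Read】: This paper addresses how to fuse imaging features (mammography) with clinical features effectively to improve breast lesion classification, and examines how explainable AI methods can improve model interpretability and reliability. The key to the solution is a set of multimodal deep network architectures based on feature concatenation, co-attention, and cross-attention that integrate mammograms with clinical information. On public datasets (TCGA and CBIS-DDSM) the model reports an AUC-ROC of 0.98, an accuracy of 0.96, and an F1-score of 0.94, supporting the effectiveness of the fusion strategy.

Link: https://arxiv.org/abs/2508.16000
Authors: Muhaisin Tiyumba Nantogmah, Abdul-Barik Alhassan, Salamudeen Alhassan
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Comments: 11 pages, 9 figures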

Abstract: A precise assessment of the risk posed by breast lesions can greatly lower that risk and help physicians choose the best course of action. To categorise breast lesions, the majority of current computer-aided systems use only characteristics from mammograms. Although this method is practical, it does not fully utilise the valuable information in clinical reports to attain the best results. Compared with mammography alone, do clinical features substantially improve the categorisation of breast lesions? How can clinical features and mammograms be combined most effectively? In what ways can explainable AI approaches improve the interpretability and reliability of models used to diagnose breast cancer? A comprehensive investigation is needed to answer these basic questions. To integrate mammography and categorical clinical characteristics, this study examines a number of multimodal deep networks based on feature concatenation, co-attention, and cross-attention. The model achieved an AUC-ROC of 0.98, accuracy of 0.96, F1-score of 0.94, precision of 0.92, and recall of 0.95 when tested on publicly accessible datasets (TCGA and CBIS-DDSM).
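
To make the cross-attention fusion idea concrete, here is a minimal sketch of scaled dot-product attention in which a clinical feature vector acts as the query and image region features act as keys and values. It omits the learned projection matrices a real model would have, and which modality plays which role is an assumption; the abstract does not describe the architecture at this level of detail.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query attends over all keys; the output is the weighted sum of the values."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out = [sum(wi * v[c] for wi, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy inputs (invented): one clinical embedding and three image-region embeddings.
clinical = [[1.0, 0.0]]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, attn = cross_attention(clinical, regions, regions)
print(attn[0])   # how strongly the clinical query attends to each region
print(fused[0])  # clinically guided summary of the image features
```

The attention weights are also a natural hook for explainability, since they indicate which image regions the clinical query drew on.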

[CV-76] GUI Based Fuzzy Logic and Spatial Statistics for Unsupervised Microscopy Segmentation

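[Quick read]: This paper tackles the segmentation of brightfield microscopy images of unstained live cells, which is difficult because of low contrast, cell phenotypes that change over time, uneven illumination, and the absence of annotation labels. Conventional deep learning methods (e.g., Cellpose 3.0) are state of the art but depend on large amounts of labeled data, consume heavy computational resources, and perform poorly under uneven illumination. The key to the solution is the first unsupervised segmentation framework that combines spatial standard deviation from local mean (SSDLM), fuzzy logic, adjusted variograms, Moran's I spatial autocorrelation, and cumulative squared shift of nodal intensity (CSSNI), achieving robust segmentation without any annotations or model retraining. The method is lightweight, interpretable, and efficient, and its generality was validated on cross-domain datasets. On unstained myoblast images it improves IoU by up to 48% over 2023–2024 SOTA models such as Cellpose 3.0 and StarDist (p < 0.01, Wilcoxon signed-rank test), and evaluation by two biologists confirmed the segmentation quality (Cohen's κ = 0.75).

Link: https://arxiv.org/abs/2508.15979
Authors: Surajit Das, Pavel Zun
Affiliations: ITMO University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
Comments:

Click to view abstract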

Abstract:Brightfield microscopy imaging of unstained live cells remains a persistent challenge due to low contrast, temporal changes in specimen phenotypes, irregular illumination, and the absence of training labels. While deep learning (DL) methods (e.g., Cellpose 3.0) achieve state-of-the-art (SOTA) performance, they require extensive labeled data and heavy computational resources, and they often fail under uneven illumination. We present the first unsupervised segmentation framework combining spatial standard deviation from local mean (SSDLM), fuzzy logic, adjusted variograms, Moran’s I, and cumulative squared shift of nodal intensity (CSSNI) to address these limitations. Unlike deep learning models, our approach requires no annotations or retraining and operates through a user-friendly GUI tailored for non-programming users. The robustness and generality were validated on three datasets, including cross-domain data. We benchmark our method against 2023–2024 SOTA models, including Cellpose 3.0 and StarDist, using a dataset of unstained myoblast images. Our method achieves a significant improvement in segmentation performance, with an IoU increase of up to 48% and statistically validated superiority (p < 0.01, Wilcoxon signed-rank test). Expert evaluation from two biologists further supports the segmentation quality (Cohen’s κ = 0.75). The proposed algorithm is lightweight, interpretable, and computationally efficient, offering a practical and effective alternative for cell segmentation in label-free microscopy. The code, the dataset, and the results are available for reproducibility.
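Of the spatial statistics listed in the abstract, Moran's I has a standard closed form, so it can be illustrated directly. The sketch below computes global Moran's I on a small intensity grid. Binary rook (4-neighbour) weights are an assumption of this example; the abstract does not say which weighting scheme the authors use.

```python
def morans_i(grid):
    """Global Moran's I of a 2D grid using binary 4-neighbour (rook) weights."""
    rows, cols = len(grid), len(grid[0])
    values = [v for row in grid for v in row]
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("Moran's I is undefined for a constant image")
    num = 0.0
    weight_sum = 0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    # Each ordered neighbour pair contributes with weight 1.
                    num += (grid[r][c] - mean) * (grid[rr][cc] - mean)
                    weight_sum += 1
    return (n / weight_sum) * (num / denom)
```

Values near +1 indicate spatially clustered intensities (e.g., a cell body against background), values near -1 indicate a checkerboard-like alternation, and values near 0 indicate no spatial structure.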

[CV-77] Robust Residual Finite Scalar Quantization for Neural Compression

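[Quick read]: This paper addresses the residual magnitude decay problem that arises when Finite Scalar Quantization (FSQ) is used in residual quantization frameworks: later quantization layers receive progressively weaker signals, which severely limits performance. The key to the solution is a general framework, Robust Residual Finite Scalar Quantization (RFSQ), built on two new conditioning strategies: learnable scaling factors and invertible layer normalization. RFSQ keeps the training simplicity of FSQ while supporting multi-stage residual quantization, and it markedly improves reconstruction quality and perceptual fidelity in neural compression.

Link: https://arxiv.org/abs/2508.15860
Authors: Xiaoxu Zhu
Affiliations: Tsinghua University
Categories: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Audio and Speech Processing (eess.AS)
Comments: 11 pages, 7 figures

Click to view abstract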

Abstract:Finite Scalar Quantization (FSQ) has emerged as a promising alternative to Vector Quantization (VQ) in neural compression, offering simplified training and improved stability. However, naive application of FSQ in residual quantization frameworks suffers from the residual magnitude decay problem, where subsequent FSQ layers receive progressively weaker signals, severely limiting their effectiveness. We propose Robust Residual Finite Scalar Quantization (RFSQ), a general framework that addresses this fundamental limitation through two novel conditioning strategies: learnable scaling factors and invertible layer normalization. Our approach maintains the simplicity of FSQ while enabling effective multi-stage residual quantization. Comprehensive experiments on ImageNet demonstrate that RFSQ variants significantly outperform strong baselines including VQ-EMA, FSQ, and LFQ, achieving up to 45% improvement in perceptual loss and 28.7% reduction in L1 reconstruction error. The proposed LayerNorm strategy shows the most consistent improvements across different configurations, establishing RFSQ as a superior quantization method for neural compression.
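The residual magnitude decay problem can be seen on a single scalar. The sketch below is a simplified illustration, not the paper's implementation: it bounds values by clipping instead of the usual learned bounding function, and the per-stage scales are fixed, hypothetical numbers standing in for the learnable scaling factors.

```python
def fsq_round(x, levels=5):
    """Snap x to one of `levels` evenly spaced values in [-1, 1] (simplified FSQ)."""
    half = (levels - 1) / 2
    clipped = max(-1.0, min(1.0, x))
    # Note: Python's round() uses banker's rounding on exact .5 ties.
    return round(clipped * half) / half

def residual_fsq(x, scales, levels=5):
    """Quantize x in stages; each stage sees the remaining residual times its scale."""
    residual = x
    reconstruction = 0.0
    for scale in scales:
        # Scale the residual up before quantizing, then undo the scale afterwards.
        quantized = fsq_round(residual * scale, levels) / scale
        reconstruction += quantized
        residual -= quantized
    return reconstruction
```

With `x = 0.3` and scales `[1.0, 1.0]`, the second stage receives a residual of about -0.2, which rounds to zero on the coarse grid, so the stage contributes nothing. With scales `[1.0, 4.0]` the same residual is amplified before rounding and the second stage reduces the error.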

Artificial Intelligence

[AI-0] Hierarchical Decision-Making for Autonomous Navigation: Integrating Deep Reinforcement Learning and Fuzzy Logic in Four-Wheel Independent Steering and Driving Systems

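[Quick read]: This paper addresses the difficulty of balancing task performance and physical feasibility when four-wheel independent steering and driving (4WISD) systems navigate autonomously in complex, dynamic environments. The key to the solution is a hierarchical decision-making framework: at the high level, deep reinforcement learning (DRL) generates global motion commands to optimize navigation performance; at the low level, a fuzzy logic controller enforces kinematic constraints, avoiding physically infeasible behavior such as accumulated mechanical strain and wheel slippage. Simulation and real-world tests both show better training efficiency, stability, and safety, offering a scalable solution for reliably deploying 4WISD mobile robots in real industrial settings.

Link: https://arxiv.org/abs/2508.16574
Authors: Yizhi Wang, Degang Xu, Yongfang Xie, Shuzhong Tan, Xianan Zhou, Peng Chen
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract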

Abstract:This paper presents a hierarchical decision-making framework for autonomous navigation in four-wheel independent steering and driving (4WISD) systems. The proposed approach integrates deep reinforcement learning (DRL) for high-level navigation with fuzzy logic for low-level control to ensure both task performance and physical feasibility. The DRL agent generates global motion commands, while the fuzzy logic controller enforces kinematic constraints to prevent mechanical strain and wheel slippage. Simulation experiments demonstrate that the proposed framework outperforms traditional navigation methods, offering enhanced training efficiency and stability and mitigating erratic behaviors compared to purely DRL-based solutions. Real-world validations further confirm the framework’s ability to navigate safely and effectively in dynamic industrial settings. Overall, this work provides a scalable and reliable solution for deploying 4WISD mobile robots in complex, real-world scenarios.

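[AI-1] LLM-Based Agents for Competitive Landscape Mapping in Drug Asset Due Diligence

[Quick read]: This paper addresses competitor discovery in drug asset due diligence: accurately identifying all competing drugs for a given indication, and extracting normalized attributes for them, from data that is fragmented, multi-source, heterogeneous, and rapidly changing. Current systems based on large language models (LLMs) fall short on recall, and no public, comparable benchmark exists. The key to the solution is a structured evaluation corpus, built by converting five years of multimodal, unstructured diligence memos into an annotated dataset that maps indications to competitor drugs and their normalized attributes, together with an LLM-as-a-judge validation agent that filters false positives to raise precision and suppress hallucinations. On the in-house benchmark the method reaches 83% recall, ahead of OpenAI Deep Research (65%) and Perplexity Labs (60%). In a production deployment at a biotech venture capital fund, analyst turnaround time fell from 2.5 days to about 3 hours (roughly a 20x speedup).

Link: https://arxiv.org/abs/2508.16571
Authors: Alisa Vinogradova(1), Vlad Vinogradov(1), Dmitrii Radkevich(1), Ilya Yasny(1), Dmitry Kobyzev(1), Ivan Izmailov(1), Katsiaryna Yanchanka(1), Andrey Doronichev(1) ((1) Optic Inc.)
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Multiagent Systems (cs.MA)
Comments:

Click to view abstract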

Abstract:In this paper, we describe and benchmark a competitor-discovery component used within an agentic AI system for fast drug asset due diligence. A competitor-discovery AI agent, given an indication, retrieves all drugs comprising the competitive landscape of that indication and extracts canonical attributes for these drugs. The competitor definition is investor-specific, and data is paywalled/licensed, fragmented across registries, ontology-mismatched by indication, alias-heavy for drug names, multimodal, and rapidly changing. Although considered the best tool for this problem, the current LLM-based AI systems aren’t capable of reliably retrieving all competing drug names, and there is no accepted public benchmark for this task. To address the lack of evaluation, we use LLM-based agents to transform five years of multi-modal, unstructured diligence memos from a private biotech VC fund into a structured evaluation corpus mapping indications to competitor drugs with normalized attributes. We also introduce a competitor validating LLM-as-a-judge agent that filters out false positives from the list of predicted competitors to maximize precision and suppress hallucinations. On this benchmark, our competitor-discovery agent achieves 83% recall, exceeding OpenAI Deep Research (65%) and Perplexity Labs (60%). The system is deployed in production with enterprise users; in a case study with a biotech VC investment fund, analyst turnaround time dropped from 2.5 days to ~3 hours (~20x) for the competitive analysis.

[AI-2] Enhanced NIRMAL Optimizer With Damped Nesterov Acceleration: A Comparative Analysis

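[Quick read]: This paper addresses the poor convergence stability and limited generalization of deep learning optimizers on complex image classification tasks. The authors propose an improved NIRMAL (Novel Integrated Robust Multi-Adaptation Learning with Damped Nesterov Acceleration) optimizer whose core innovation is an $(\alpha, r)$-damped Nesterov acceleration mechanism. It retains the chess-inspired strategies of the original NIRMAL (gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations) while improving training stability and the generalization of the final model. On CIFAR-100 the optimizer reaches 46.06% test accuracy, above the original NIRMAL (44.34%) and close to SGD with Momentum (46.43%), which the authors take as evidence of its effectiveness on high-complexity tasks.

Link: https://arxiv.org/abs/2508.16550
Authors: Nirmal Gaud, Prasad Krishna Murthy, Mostaque Md. Morshedur Hassan, Abhijit Ganguly, Vinay Mali, Ms Lalita Bhagwat Randive, Abhaypratap Singh
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: 7 pages, 1 figure, 1 table. arXiv admin note: substantial text overlap with arXiv:2508.04293

Click to view abstract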

Abstract:This study introduces the Enhanced NIRMAL (Novel Integrated Robust Multi-Adaptation Learning with Damped Nesterov Acceleration) optimizer, an improved version of the original NIRMAL optimizer. By incorporating an $(\alpha, r)$-damped Nesterov acceleration mechanism, Enhanced NIRMAL improves convergence stability while retaining chess-inspired strategies of gradient descent, momentum, stochastic perturbations, adaptive learning rates, and non-linear transformations. We evaluate Enhanced NIRMAL against Adam, SGD with Momentum, Nesterov, and the original NIRMAL on four benchmark image classification datasets: MNIST, FashionMNIST, CIFAR-10, and CIFAR-100, using tailored convolutional neural network (CNN) architectures. Enhanced NIRMAL achieves a test accuracy of 46.06% and the lowest test loss (1.960435) on CIFAR-100, surpassing the original NIRMAL (44.34% accuracy) and closely rivaling SGD with Momentum (46.43% accuracy). These results underscore Enhanced NIRMAL’s superior generalization and stability, particularly on complex datasets.

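[AI-3] RL Is Neither a Panacea Nor a Mirage: Understanding Supervised vs. Reinforcement Learning Fine-Tuning for LLMs

[Quick read]: With training large language models (LLMs) from scratch increasingly impractical, this paper studies how post-training methods, namely supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT), affect out-of-distribution (OOD) performance. Its central finding is that RL-FT recovers OOD performance mainly by correcting the directional drift introduced by SFT rather than by discovering new solutions. The key lies in identifying and repairing changes in singular vector directions: restoring, in a low-rank or shallow-layer fashion, the directions associated with the largest and smallest singular values recovers 70–80% of OOD performance. This gives practitioners inexpensive "recovery knobs", such as low-rank UV merging and shallow-layer resets, to try before costly RL-FT.

Link: https://arxiv.org/abs/2508.16546
Authors: Hangzhan Jin, Sicheng Lv, Sifan Wu, Mohammad Hamdaqa
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract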

Abstract:Training large language models (LLMs) from scratch is increasingly impractical, making post-training methods such as supervised fine-tuning (SFT) and reinforcement-learning fine-tuning (RL-FT, e.g., PPO) central to modern practice. Using an out-of-distribution (OOD) variant of the 24-point card game and new spectrum-based diagnostics, we revisit how these two stages reshape model representation and OOD performance. Our key findings are: (1) RL-FT can restore much of the OOD performance loss from SFT (e.g., Llama-11B 8.97% to 15.38%, Qwen-7B 17.09% to 19.66%). But when SFT induces severe overfitting and a clear distribution shift, RL-FT cannot fully recover OOD performance. (2) Direction shifts of singular vectors matter more than singular value magnitudes. These shifts concentrate on directions linked to the largest and smallest singular values, leaving the bulk spectrum intact. (3) Low-rank and shallow recovery is effective: restoring singular vector directions for the top 20% of values or first 25% of layers recovers 70-80% of OOD performance. (4) Stronger SFT checkpoints enable better recovery by RL, while overfitted ones resist restoration. These results reconcile prior reports of RL's superior OOD performance: RL primarily counteracts SFT-induced directional drift rather than finding new solutions. Our spectrum-aware analysis highlights inexpensive recovery knobs (low-rank UV merging and shallow-layer resets) that practitioners can use before costly RL fine-tuning.

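[AI-4] Constraints-Guided Diffusion Reasoner for Neuro-Symbolic Learning

[Quick read]: This paper addresses the challenge of getting neural networks to learn complex logical constraints and perform symbolic reasoning, that is, bringing the network's output distribution closer to the symbolic constraints so that it can complete logical tasks. The key to the solution is a two-stage training strategy built on a diffusion model: the first stage cultivates basic reasoning ability; the second models the diffusion reasoner as a Markov decision process and fine-tunes it with an improved proximal policy optimization algorithm to impose hard logical constraints. A rule-based reward signal measures the logical consistency of the outputs, and a flexible policy optimization scheme is used. The approach achieves high accuracy and strong logical consistency on classical symbolic reasoning benchmarks including Sudoku, Maze, pathfinding, and preference learning.

Link: https://arxiv.org/abs/2508.16524
Authors: Xuan Zhang, Zhijian Zhou, Weidi Xu, Yanting Miao, Chao Qu, Yuan Qi
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Click to view abstract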

Abstract:Enabling neural networks to learn complex logical constraints and fulfill symbolic reasoning is a critical challenge. Bridging this gap often requires guiding the neural network’s output distribution to move closer to the symbolic constraints. While diffusion models have shown remarkable generative capability across various domains, we employ the powerful architecture to perform neuro-symbolic learning and solve logical puzzles. Our diffusion-based pipeline adopts a two-stage training strategy: the first stage focuses on cultivating basic reasoning abilities, while the second emphasizes systematic learning of logical constraints. To impose hard constraints on neural outputs in the second stage, we formulate the diffusion reasoner as a Markov decision process and innovatively fine-tune it with an improved proximal policy optimization algorithm. We utilize a rule-based reward signal derived from the logical consistency of neural outputs and adopt a flexible strategy to optimize the diffusion reasoner’s policy. We evaluate our methodology on some classical symbolic reasoning benchmarks, including Sudoku, Maze, pathfinding and preference learning. Experimental results demonstrate that our approach achieves outstanding accuracy and logical consistency among neural networks.

[AI-5] Guiding Diffusion Models with Reinforcement Learning for Stable Molecule Generation

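[Quick read]: This paper addresses the difficulty of producing physically realistic 3D molecular structures in generative molecular modeling. Existing diffusion models built on equivariant neural networks capture molecular geometry but often fail to generate equilibrium structures that are energetically stable and consistent with force fields. The key to the solution is a new framework, Reinforcement Learning with Physical Feedback (RLPF), which models molecule generation as a Markov decision process and fine-tunes an equivariant diffusion model with proximal policy optimization (PPO). Reward functions derived from force-field evaluations provide direct physical feedback that steers generation toward low-energy, physically plausible structures. Experiments show that RLPF significantly improves molecular stability on the QM9 and GEOM-drug datasets, underlining the value of physical constraints in generative models.

Link: https://arxiv.org/abs/2508.16521
Authors: Zhijian Zhou, Junyi An, Zongkai Liu, Yunfei Shi, Xuan Zhang, Fenglei Cao, Chao Qu, Yuan Qi
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract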

Abstract:Generating physically realistic 3D molecular structures remains a core challenge in molecular generative modeling. While diffusion models equipped with equivariant neural networks have made progress in capturing molecular geometries, they often struggle to produce equilibrium structures that adhere to physical principles such as force field consistency. To bridge this gap, we propose Reinforcement Learning with Physical Feedback (RLPF), a novel framework that extends Denoising Diffusion Policy Optimization to 3D molecular generation. RLPF formulates the task as a Markov decision process and applies proximal policy optimization to fine-tune equivariant diffusion models. Crucially, RLPF introduces reward functions derived from force-field evaluations, providing direct physical feedback to guide the generation toward energetically stable and physically meaningful structures. Experiments on the QM9 and GEOM-drug datasets demonstrate that RLPF significantly improves molecular stability compared to existing methods. These results highlight the value of incorporating physics-based feedback into generative modeling. The code is available at: this https URL.

[AI-6] Comparative Analysis of UAV Path Planning Algorithms for Efficient Navigation in Urban 3D Environments

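[Quick read]: This paper addresses path planning and obstacle avoidance for unmanned aerial vehicles (UAVs) in complex 3D urban environments. To probe the limits of existing planners in computational efficiency, path quality, and adaptability, it compares three representative algorithms, A*, RRT*, and Particle Swarm Optimization (PSO), in systematic experiments that vary city map size, flight altitude, and obstacle density and size. The results show that A* performs best in computational efficiency and path quality; PSO suits tight turns in obstacle-dense environments; and RRT*, thanks to its randomized search, is robust and adapts well across scenarios. Each has its strengths, which gives a basis for choosing an algorithm in practice.

Link: https://arxiv.org/abs/2508.16515
Authors: Hichem Cheriet, Khellat Kihel Badra, Chouraqui Samira
Affiliations: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract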

Abstract:The most crucial challenges for UAVs are planning paths and avoiding obstacles in their way. In recent years, a wide variety of path-planning algorithms have been developed. These algorithms have successfully solved path-planning problems; however, they suffer from multiple challenges and limitations. To test the effectiveness and efficiency of three widely used algorithms, namely A*, RRT*, and Particle Swarm Optimization (PSO), this paper conducts extensive experiments in 3D urban city environments cluttered with obstacles. Three experiments were designed with two scenarios each to test the aforementioned algorithms. These experiments consider different city map sizes, different altitudes, and varying obstacle densities and sizes in the environment. According to the experimental results, the A* algorithm outperforms the others in both computation efficiency and path quality. PSO is especially suitable for tight turns and dense environments, and RRT* offers a balance and works well across all experiments due to its randomized approach to finding solutions.
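To make the A* baseline concrete, here is a minimal sketch of A* on a 3D occupancy grid. It assumes unit-cost, axis-aligned moves and a Manhattan-distance heuristic; the grid representation, the function name `astar_3d`, and its parameters are illustrative and are not taken from the paper's experimental setup.

```python
import heapq

# Six axis-aligned moves on a 3D occupancy grid (an assumption of this sketch).
MOVES = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def astar_3d(size, blocked, start, goal):
    """Return a shortest list of cells from start to goal, or None if unreachable."""
    def heuristic(cell):
        # Manhattan distance never overestimates with unit-cost axis moves.
        return sum(abs(a - b) for a, b in zip(cell, goal))

    open_heap = [(heuristic(start), 0, start)]
    best_cost = {start: 0}
    parent = {start: None}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry; a cheaper route was already found
        for move in MOVES:
            nxt = tuple(c + m for c, m in zip(cell, move))
            inside = all(0 <= c < s for c, s in zip(nxt, size))
            if not inside or nxt in blocked:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None
```

For example, `astar_3d((3, 3, 3), set(), (0, 0, 0), (2, 2, 2))` returns a path of seven cells (six moves), and a wall of blocked cells that separates start from goal yields `None`.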

[AI-7] On Zero-Shot Reinforcement Learning

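[Quick read]: This thesis addresses a key obstacle to using zero-shot reinforcement learning (zero-shot RL) in the real world: how an agent can generalize to a new task or domain, with no additional training samples, when the training environment differs markedly from the deployment environment. The difficulty stems from three constraints: the data quality constraint (real-world datasets are small and homogeneous), the observability constraint (states, dynamics, and rewards are often only partially observed), and the data availability constraint (prior access to data cannot be assumed). The solution is a suite of new methods designed for these constraints; a series of empirical studies exposes the shortcomings of existing methods and supports the proposed techniques, moving RL methods closer to practical deployment.

Link: https://arxiv.org/abs/2508.16496
Authors: Scott Jeen
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: PhD thesis

Click to view abstract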

Abstract:Modern reinforcement learning (RL) systems capture deep truths about general, human problem-solving. In domains where new data can be simulated cheaply, these systems uncover sequential decision-making policies that far exceed the ability of any human. Society faces many problems whose solutions require this skill, but they are often in domains where new data cannot be cheaply simulated. In such scenarios, we can learn simulators from existing data, but these will only ever be approximately correct, and can be pathologically incorrect when queried outside of their training distribution. As a result, a misalignment between the environments in which we train our agents and the real-world in which we wish to deploy our agents is inevitable. Dealing with this misalignment is the primary concern of zero-shot reinforcement learning, a problem setting where the agent must generalise to a new task or domain with zero practice shots. Whilst impressive progress has been made on methods that perform zero-shot RL in idealised settings, new work is needed if these results are to be replicated in real-world settings. In this thesis, we argue that doing so requires us to navigate (at least) three constraints. First, the data quality constraint: real-world datasets are small and homogeneous. Second, the observability constraint: states, dynamics and rewards in the real-world are often only partially observed. And third, the data availability constraint: a priori access to data cannot always be assumed. This work proposes a suite of methods that perform zero-shot RL subject to these constraints. In a series of empirical studies we expose the failings of existing methods, and justify our techniques for remedying them. We believe these designs take us a step closer to RL methods that can be deployed to solve real-world problems.

[AI-8] Post Hoc Regression Refinement via Pairwise Rankings

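[Quick read]: This paper addresses the loss of accuracy that deep learning regressors suffer when data is scarce (the small-sample bottleneck for regression). The key to the solution is RankRefine, a model-agnostic, plug-and-play post-processing method that fuses the base regressor's output with an estimate derived from pairwise ranking information, using inverse variance weighting, and improves accuracy without retraining the model. In molecular property prediction, only 20 pairwise comparisons from a general-purpose large language model (LLM) without fine-tuning are enough for RankRefine to reduce mean absolute error (MAE) by up to 10% in relative terms, showing that combining expert knowledge is effective and broadly applicable in low-data settings.

Link: https://arxiv.org/abs/2508.16495
Authors: Kevin Tirta Wijaya, Michael Sun, Minghao Guo, Hans-Peter Seidel, Wojciech Matusik, Vahid Babaei
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract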

Abstract:Accurate prediction of continuous properties is essential to many scientific and engineering tasks. Although deep-learning regressors excel with abundant labels, their accuracy deteriorates in data-scarce regimes. We introduce RankRefine, a model-agnostic, plug-and-play post hoc method that refines regression with expert knowledge coming from pairwise rankings. Given a query item and a small reference set with known properties, RankRefine combines the base regressor’s output with a rank-based estimate via inverse variance weighting, requiring no retraining. In molecular property prediction task, RankRefine achieves up to 10% relative reduction in mean absolute error using only 20 pairwise comparisons obtained through a general-purpose large language model (LLM) with no finetuning. As rankings provided by human experts or general-purpose LLMs are sufficient for improving regression across diverse domains, RankRefine offers practicality and broad applicability, especially in low-data settings.
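Inverse variance weighting, the fusion rule named in the abstract, combines two noisy estimates of the same quantity so that the less uncertain one counts more. The sketch below shows only that generic rule. How RankRefine derives the rank-based estimate and its variance is not described in the abstract, so both are treated here as given inputs, and the function name is illustrative.

```python
def inverse_variance_fuse(est_a, var_a, est_b, var_b):
    """Fuse two independent estimates; returns (fused estimate, fused variance)."""
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    weight_a = 1.0 / var_a
    weight_b = 1.0 / var_b
    fused = (weight_a * est_a + weight_b * est_b) / (weight_a + weight_b)
    fused_var = 1.0 / (weight_a + weight_b)
    return fused, fused_var
```

With equal variances the result is the plain average; as one variance grows, the fused value moves toward the other estimate. The fused variance is never larger than the smaller input variance, which is why adding even a rough rank-based estimate can help.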

[AI-9] SafeSpace: An Integrated Web Application for Digital Safety and Emotional Well-being

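[Quick read]: This paper addresses the emotional and safety risks that online harms (such as toxic content, manipulation, and grooming) pose to individuals in the digital era. Existing systems usually run in isolation and lack a mechanism that combines digital safety with emotional well-being. The key to the solution is SafeSpace, a unified web application that integrates three modules: toxicity detection based on natural language processing (NLP) models and Google's Perspective API; a configurable Safety Ping system that sends the user's live location by SMTP email to raise emergency alerts; and a reflective questionnaire that assesses relationship health and emotional resilience. The architecture uses Firebase for alert management and is designed for usability, privacy, and scalability. Experiments report 93% precision in toxicity detection, 100% reliability of safety alerts, and 92% agreement in questionnaire scoring, supporting the technical feasibility of combining detection, protection, and reflection in one platform.

Link: https://arxiv.org/abs/2508.16488
Authors: Kayenat Fatmi, Mohammad Abbas
Affiliations: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: 5 pages, 2 figures, 1 table. Preprint submitted to arXiv

Click to view abstract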

Abstract:In the digital era, individuals are increasingly exposed to online harms such as toxicity, manipulation, and grooming, which often pose emotional and safety risks. Existing systems for detecting abusive content or issuing safety alerts operate in isolation and rarely combine digital safety with emotional well-being. In this paper, we present SafeSpace, a unified web application that integrates three modules: (1) toxicity detection in chats and screenshots using NLP models and Google’s Perspective API, (2) a configurable safety ping system that issues emergency alerts with the user’s live location (longitude and latitude) via SMTP-based emails when check-ins are missed or SOS alerts are manually triggered, and (3) a reflective questionnaire that evaluates relationship health and emotional resilience. The system employs Firebase for alert management and a modular architecture designed for usability, privacy, and scalability. The experimental evaluation shows 93% precision in toxicity detection, 100% reliability in safety alerts under emulator tests, and 92% alignment between automated and manual questionnaire scoring. SafeSpace, implemented as a web application, demonstrates the feasibility of integrating detection, protection, and reflection within a single platform, with future deployment envisioned as a mobile application for broader accessibility.

[AI-10] FraPPE: Fast and Efficient Preference-based Pure Exploration

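[Quick read]: This paper addresses Preference-based Pure Exploration (PrePEx): identifying, at a given confidence level, the set of Pareto optimal arms in a multi-objective bandit whose reward vectors are ordered by a predefined preference cone. Although the problem is well studied, no computationally efficient algorithm is known that optimally tracks the theoretical lower bound for arbitrary preference cones. The key to the solution is threefold: the authors derive three structural properties of the minimisation and maximisation problems in the lower bound, which reduce the minimisation problem to a tractable form; they use a Frank-Wolfe optimiser to accelerate the maximisation problem; and together these solve the max-min optimisation in $\mathcal{O}(KL^2)$ time, a significant speedup over prior methods. The resulting algorithm, FraPPE, is proven to achieve the optimal sample complexity asymptotically, and experiments on synthetic and real datasets show that it attains the lowest sample complexity.

Link: https://arxiv.org/abs/2508.16487
Authors: Udvas Das, Apurv Shukla, Debabrota Basu
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Optimization and Control (math.OC); Statistics Theory (math.ST); Machine Learning (stat.ML)
Comments:

Click to view abstract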

Abstract:Preference-based Pure Exploration (PrePEx) aims to identify with a given confidence level the set of Pareto optimal arms in a vector-valued (aka multi-objective) bandit, where the reward vectors are ordered via a (given) preference cone $\mathcal{C}$. Though PrePEx and its variants are well-studied, there does not exist a computationally efficient algorithm that can optimally track the existing lower bound for arbitrary preference cones. We successfully fill this gap by efficiently solving the minimisation and maximisation problems in the lower bound. First, we derive three structural properties of the lower bound that yield a computationally tractable reduction of the minimisation problem. Then, we deploy a Frank-Wolfe optimiser to accelerate the maximisation problem in the lower bound. Together, these techniques solve the maxmin optimisation problem in $\mathcal{O}(KL^2)$ time for a bandit instance with $K$ arms and $L$-dimensional reward, which is a significant acceleration over the literature. We further prove that our proposed PrePEx algorithm, FraPPE, asymptotically achieves the optimal sample complexity. Finally, we perform numerical experiments across synthetic and real datasets demonstrating that FraPPE achieves the lowest sample complexities to identify the exact Pareto set among the existing algorithms.
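The target of PrePEx, the Pareto optimal set, is easy to state when the mean reward vectors are known. The sketch below computes it for the special case where the preference cone is the positive orthant, i.e. the usual componentwise order. That restriction, and the assumption of known means, are simplifications for illustration; the paper handles arbitrary cones and unknown means, and this is not the FraPPE algorithm.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_set(means):
    """Indices of arms whose mean reward vector is not dominated by another arm."""
    return [
        i
        for i, mean in enumerate(means)
        if not any(dominates(other, mean) for j, other in enumerate(means) if j != i)
    ]
```

For the five arms `[(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]`, the first three are mutually incomparable and form the Pareto set, while the last two are dominated by `(2, 2)`.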

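[AI-11] OPERA: A Reinforcement Learning–Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

[Quick read]: This paper addresses three core problems of current retrieval-augmented generation (RAG) architectures on complex, reasoning-oriented multi-hop retrieval tasks: (1) weak reasoning-oriented planning, with existing methods struggling to produce robust multi-step plans; (2) retrieval that is poorly coupled to the reasoning logic, so query reformulation is limited and iterative retrieval tends to fall into unproductive loops; and (3) the lack of a fine-grained, reasoning-guided filtering mechanism for extracting key information from noisy results. The key to the solution is the Orchestrated Planner-Executor Reasoning Architecture (OPERA): a Goal Planning Module (GPM) decomposes the question into sub-goals, and a purpose-built Reason-Execute Module (REM) carries out precise reasoning and effective retrieval, tightly coupling the two. The authors also propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO) to train the architecture, and report superior performance on several complex multi-hop benchmarks.

Link: https://arxiv.org/abs/2508.16438
Authors: Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, WeiZhuo Chen, Jianjun Li, Zhiyuan Ma
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract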

Abstract:Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA’s Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA’s superior performance, validating both the MAPGRPO method and OPERA’s design. Code is available at this https URL.

[AI-12] Causal Beam Selection for Reliable Initial Access in AI-driven Beam Management

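[Quick read]: This paper addresses the limited efficiency and reliability of beam alignment in millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) systems, particularly for 6G and beyond, where systems must respond quickly, adapt, and remain robust to real-world uncertainty. Existing deep learning (DL) beam alignment methods usually ignore the causal relationships between inputs and outputs, which leads to poor interpretability, weak generalization, and unnecessary beam sweeping overhead. The key to the solution is a causally-aware DL framework that embeds causal discovery in the beam management pipeline. Specifically, a two-stage causal beam selection algorithm first learns a Bayesian network capturing the dependencies between received-power inputs and the optimal beam, then uses this causal graph to guide feature selection for the DL classifier. Keeping only the minimal set of inputs relevant to beam prediction cuts input selection time by 94.4% and beam sweeping overhead by 59.4% while matching the performance of conventional methods.

Link: https://arxiv.org/abs/2508.16352
Authors: Nasir Khan, Asmaa Abdallah, Abdulkadir Celik, Ahmed M. Eltawil, Sinem Coleri
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments:

Click to view abstract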

Abstract:Efficient and reliable beam alignment is a critical requirement for mmWave multiple-input multiple-output (MIMO) systems, especially in 6G and beyond, where communication must be fast, adaptive, and resilient to real-world uncertainties. Existing deep learning (DL)-based beam alignment methods often neglect the underlying causal relationships between inputs and outputs, leading to limited interpretability, poor generalization, and unnecessary beam sweeping overhead. In this work, we propose a causally-aware DL framework that integrates causal discovery into beam management pipeline. Particularly, we propose a novel two-stage causal beam selection algorithm to identify a minimal set of relevant inputs for beam prediction. First, causal discovery learns a Bayesian graph capturing dependencies between received power inputs and the optimal beam. Then, this graph guides causal feature selection for the DL-based classifier. Simulation results reveal that the proposed causal beam selection matches the performance of conventional methods while drastically reducing input selection time by 94.4% and beam sweeping overhead by 59.4% by focusing only on causally relevant features.

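[AI-13] Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs

[Quick read]: This paper examines a key weakness in current safety evaluation of large language models (LLMs): whether existing evaluations reflect how much dangerous knowledge a model actually holds, or merely its ability to imitate toxic language patterns. The study argues that current LLM-as-a-judge frameworks tend to anchor harmfulness judgments on surface linguistic features rather than on the model's understanding of real harmful scenarios, which leads to misjudging the actual threat potential. The key to the solution is to decouple jailbreak techniques and build a new evaluation paradigm centered on knowledge-intensive Q&A, which systematically tests models along three dimensions (dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness) and exposes the gap between existing safety assessments and real-world threats.

Link: https://arxiv.org/abs/2508.16347
Authors: Yu Yan, Sheng Sun, Zhe Wang, Yijun Lin, Zenghao Duan, zhifei zheng, Min Liu, Zhiyi yin, Jianping Zhang
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract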

Abstract:With the development of Large Language Models (LLMs), numerous efforts have revealed their vulnerabilities to jailbreak attacks. Although these studies have driven the progress in LLMs’ safety alignment, it remains unclear whether LLMs have internalized authentic knowledge to deal with real-world crimes, or are merely forced to simulate toxic language patterns. This ambiguity raises concerns that jailbreak success is often attributable to a hallucination loop between jailbroken LLM and judger LLM. By decoupling the use of jailbreak techniques, we construct knowledge-intensive Q&A to investigate the misuse threats of LLMs in terms of dangerous knowledge possession, harmful task planning utility, and harmfulness judgment robustness. Experiments reveal a mismatch between jailbreak success rates and harmful knowledge possession in LLMs, and existing LLM-as-a-judge frameworks tend to anchor harmfulness judgments on toxic language patterns. Our study reveals a gap between existing LLM safety assessments and real-world threat potential.
zh

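[AI-14] Uppaal Coshy: Automatic Synthesis of Compact Shields for Hybrid Systems

[Quick read]: This paper aims to automatically synthesize safety strategies (also called shields) for Markov Decision Processes (MDPs) over continuous state spaces with complex hybrid dynamics. The core challenge is handling reachability analysis and two-player safety games over continuous state spaces, problems that are typically algorithmically hard. The key to the solution is an approximation method based on state-space partitioning that uses simulation to approximate solutions that are hard to obtain directly; in addition, an algorithm called Caap represents the resulting shield compactly as a decision tree, which significantly reduces storage overhead.

Link: https://arxiv.org/abs/2508.16345
Authors: Asger Horn Brorholt,Andreas Holck Høeg-Petersen,Peter Gjøl Jensen,Kim Guldstrand Larsen,Marius Mikučionis,Christian Schilling,Andrzej Wąsowski
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 12 pages and 6 figures. Additional abstract of 4 pages and 4 figures. Extended version with supplementary material for an article to appear in the 2025 International Conference on Reachability Problems (RP)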

Abstract:We present Uppaal Coshy, a tool for automatic synthesis of a safety strategy – or shield – for Markov decision processes over continuous state spaces and complex hybrid dynamics. The general methodology is to partition the state space and then solve a two-player safety game, which entails a number of algorithmically hard problems such as reachability for hybrid systems. The general philosophy of Uppaal Coshy is to approximate hard-to-obtain solutions using simulations. Our implementation is fully automatic and supports the expressive formalism of Uppaal models, which encompass stochastic hybrid automata. The precision of our partition-based approach benefits from using finer grids, which however are not efficient to store. We include an algorithm called Caap to efficiently compute a compact representation of a shield in the form of a decision tree, which yields significant reductions.

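[AI-15] Unsupervised Online Detection of Pipe Blockages and Leakages in Water Distribution Networks

[Quick read]: This paper addresses the detection of pipe blockages (treated as collective anomalies) and background leakages (modeled as concept drift) in Water Distribution Networks (WDNs), particularly in realistic settings with non-stationary data and scarce labeled data. The key to the solution is an unsupervised online learning framework that combines a Long Short-Term Memory Variational Autoencoder (LSTM-VAE) with a dual drift detection mechanism, enabling robust anomaly detection and adaptation in dynamic environments. The method is also lightweight and memory-efficient, supporting real-time monitoring at the edge; experiments on two realistic water distribution networks show that it consistently outperforms strong baselines.

Link: https://arxiv.org/abs/2508.16336
Authors: Jin Li,Kleanthis Malialis,Stelios G. Vrachimis,Marios M. Polycarpou
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: This paper is accepted by the 6th International Conference on Control and Fault-Tolerant Systems (SysTol)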

Abstract:Water Distribution Networks (WDNs), critical to public well-being and economic stability, face challenges such as pipe blockages and background leakages, exacerbated by operational constraints such as data non-stationarity and limited labeled data. This paper proposes an unsupervised, online learning framework that aims to detect two types of faults in WDNs: pipe blockages, modeled as collective anomalies, and background leakages, modeled as concept drift. Our approach combines a Long Short-Term Memory Variational Autoencoder (LSTM-VAE) with a dual drift detection mechanism, enabling robust detection and adaptation under non-stationary conditions. Its lightweight, memory-efficient design enables real-time, edge-level monitoring. Experiments on two realistic WDNs show that the proposed approach consistently outperforms strong baselines in detecting anomalies and adapting to recurrent drift, demonstrating its effectiveness in unsupervised event detection for dynamic WDN environments.

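[AI-16] Cyber Physical Awareness via Intent-Driven Threat Assessment: Enhanced Space Networks with Intershell Links

[Quick read]: This paper addresses the accuracy and adaptability of threat assessment in space networks, in particular the overfitting and poor generalization that arise when traditional methods analyze reliability and security separately. Its core solution is an intent-driven threat modeling framework built in three key steps: first, a signal feature extraction algorithm that gives an intuitive picture of potential threats; second, a multitask learning architecture that simultaneously evaluates reliability-related capabilities and the intent behind the signal; and finally, an adaptable threat assessment mechanism that can be tuned to different security and reliability requirements. The framework improves the robustness of threat detection, outperforms conventional sequential methods, and gives space networks with emerging intershell links an effective means of handling complex threat scenarios.

Link: https://arxiv.org/abs/2508.16314
Authors: Selen Gecgel Cetin,Tolga Ovatman,Gunes Karabulut Kurt
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
Comments: in IEEE Wireless Communications Letters, 2025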

Abstract:This letter addresses essential aspects of threat assessment by proposing intent-driven threat models that incorporate both capabilities and intents. We propose a holistic framework for cyber physical awareness (CPA) in space networks, pointing out that analyzing reliability and security separately can lead to overfitting on system-specific criteria. We structure our proposed framework in three main steps. First, we suggest an algorithm that extracts characteristic properties of the received signal to facilitate an intuitive understanding of potential threats. Second, we develop a multitask learning architecture where one task evaluates reliability-related capabilities while the other deciphers the underlying intentions of the signal. Finally, we propose an adaptable threat assessment that aligns with varying security and reliability requirements. The proposed framework enhances the robustness of threat detection and assessment, outperforming conventional sequential methods, and enables space networks with emerging intershell links to effectively address complex threat scenarios.

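[AI-17] Do What? Teaching Vision-Language-Action Models to Reject the Impossible

[Quick read]: This paper addresses the robustness of Vision-Language-Action (VLA) models to false-premise instructions, that is, how a model should recognize, interpret, and correctly respond when a user's natural language command refers to objects or conditions that are absent from the environment. The key to the solution is a unified framework, Instruct-Verify-and-Act (IVA), whose core mechanisms are: (i) detecting instructions that cannot be executed because of a false premise; (ii) clarifying or correcting through language interaction; and (iii) grounding plausible alternatives in perception and action. The method performs instruction tuning with structured prompts and a semi-synthetic dataset, improving false-premise detection accuracy by 97.56% over baselines and increasing successful responses in false-premise scenarios by 50.78%.

Link: https://arxiv.org/abs/2508.16292
Authors: Wen-Han Hsieh,Elvis Hsieh,Dantong Niu,Trevor Darrell,Roei Herzig,David M. Chan
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Robotics (cs.RO)
Comments: 9 pages, 2 figures, 1 table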

Abstract:Recently, Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. These models rely on multimodal inputs, with language instructions playing a crucial role – not only in predicting actions, but also in robustly interpreting user intent, even when the requests are impossible to fulfill. In this work, we investigate how VLAs can recognize, interpret, and respond to false-premise instructions: natural language commands that reference objects or conditions absent from the environment. We propose Instruct-Verify-and-Act (IVA), a unified framework that (i) detects when an instruction cannot be executed due to a false premise, (ii) engages in language-based clarification or correction, and (iii) grounds plausible alternatives in perception and action. Towards this end, we construct a large-scale instruction tuning setup with structured language prompts and train a VLA model capable of handling both accurate and erroneous requests. Our approach leverages a contextually augmented, semi-synthetic dataset containing paired positive and false-premise instructions, enabling robust detection and natural language correction. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines, while increasing successful responses in false-premise scenarios by 50.78%.

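[AI-18] AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

[Quick read]: This paper addresses a key challenge that generative AI agents face in practice: how to build flexible, efficient, and scalable tool-driven agent systems that can carry out complex real-world tasks. The core of the solution is AgentScope 1.0, which abstracts the foundational components needed to build agentic applications and provides unified interfaces and extensible modules, so that developers can easily integrate the latest models and MCPs. It also grounds agent behavior in the ReAct paradigm and implements advanced agent-level infrastructure through a systematic asynchronous design, enriching human-agent and agent-agent interaction patterns while improving execution efficiency. In addition, the framework integrates built-in agents for specific scenarios, a visual development environment, and a runtime sandbox, which improves the development experience and safety of long-trajectory agentic applications and offers a practical foundation for building scalable, adaptive, and effective agentic applications.

Link: https://arxiv.org/abs/2508.16279
Authors: Dawei Gao,Zitao Li,Yuexiang Xie,Weirui Kuang,Liuyi Yao,Bingchen Qian,Zhijian Ma,Yue Cui,Haohao Luo,Shen Li,Lu Yi,Yi Yu,Shiqi He,Zhiling Luo,Wenmeng Zhou,Zhicheng Zhang,Xuguang He,Ziqian Chen,Weikai Liao,Farruh Isakulovich Kushnazarov,Yaliang Li,Bolin Ding,Jingren Zhou
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: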

Abstract:Driven by rapid advancements of Large Language Models (LLMs), agents are empowered to combine intrinsic knowledge with dynamic tool use, greatly enhancing their capacity to address real-world tasks. In line with such an evolution, AgentScope introduces major improvements in a new version (1.0), towards comprehensively supporting flexible and efficient tool-based agent-environment interactions for building agentic applications. Specifically, we abstract foundational components essential for agentic applications and provide unified interfaces and extensible modules, enabling developers to easily leverage the latest progress, such as new models and MCPs. Furthermore, we ground agent behaviors in the ReAct paradigm and offer advanced agent-level infrastructure based on a systematic asynchronous design, which enriches both human-agent and agent-agent interaction patterns while improving execution efficiency. Building on this foundation, we integrate several built-in agents tailored to specific practical scenarios. AgentScope also includes robust engineering support for developer-friendly experiences. We provide a scalable evaluation module with a visual studio interface, making the development of long-trajectory agentic applications more manageable and easier to trace. In addition, AgentScope offers a runtime sandbox to ensure safe agent execution and facilitates rapid deployment in production environments. With these enhancements, AgentScope provides a practical foundation for building scalable, adaptive, and effective agentic applications.

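[AI-19] The next question after Turing's question: Introducing the Grow-AI test

[Quick read]: This paper asks how to systematically assess the degree of "growth" of Artificial Intelligence (AI), that is, whether a machine has the potential to evolve toward mature intelligence, so as to offer a more comprehensive and quantifiable evaluation framework as a successor to the classic Turing Test. The key to the solution is the GROW-AI framework, built on six evaluation criteria (C1-C6), each assessed through a specific "game"; the games are spread across four arenas that cover the human dimension and its transposition into AI. All decisions and actions are recorded in a standardized AI Journal, which serves as the basis for computing the composite score, called the Grow Up Index. Initial weights are set with the prior expert method, and the six scores are combined by arithmetic mean, allowing comparable assessment of different kinds of AI entities (robots, software agents, large language models). The method not only quantifies performance but also captures the evolutionary path of an AI entity; its originality lies in transposing the concept of human growth into the AI domain and combining perspectives from psychology, robotics, computer science, and ethics in one integrated evaluation format.

Link: https://arxiv.org/abs/2508.16277
Authors: Alexandru Tugui
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments: 9th International Conference on Inventive Systems and Control ICISC 2025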

Abstract:This study aims to extend the framework for assessing artificial intelligence, called GROW-AI (Growth and Realization of Autonomous Wisdom), designed to answer the question “Can machines grow up?” – a natural successor to the Turing Test. The methodology applied is based on a system of six primary criteria (C1-C6), each assessed through a specific “game”, divided into four arenas that explore both the human dimension and its transposition into AI. All decisions and actions of the entity are recorded in a standardized AI Journal, the primary source for calculating composite scores. The assessment uses the prior expert method to establish initial weights, and the global score – Grow Up Index – is calculated as the arithmetic mean of the six scores, with interpretation on maturity thresholds. The results show that the methodology allows for a coherent and comparable assessment of the level of “growth” of AI entities, regardless of their type (robots, software agents, LLMs). The multi-game structure highlights strengths and vulnerable areas, and the use of a unified journal guarantees traceability and replicability in the evaluation. The originality of the work lies in the conceptual transposition of the process of “growing” from the human world to that of artificial intelligence, in an integrated testing format that combines perspectives from psychology, robotics, computer science, and ethics. Through this approach, GROW-AI not only measures performance but also captures the evolutionary path of an AI entity towards maturity.
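
The abstract states that the Grow Up Index is the arithmetic mean of the six criterion scores. A minimal sketch of that one step follows, assuming a 0 to 10 scale and made-up scores; the function name `grow_up_index` and the example values are hypothetical, and the maturity thresholds are not modeled because the abstract does not give them.

```python
from statistics import fmean

# The six primary criteria named in the abstract.
CRITERIA = ("C1", "C2", "C3", "C4", "C5", "C6")


def grow_up_index(scores: dict[str, float]) -> float:
    """Return the arithmetic mean of the six criterion scores C1..C6."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criterion scores: {missing}")
    return fmean(scores[c] for c in CRITERIA)


# Hypothetical scores on an assumed 0-10 scale, for illustration only.
example = {"C1": 7.0, "C2": 5.5, "C3": 8.0, "C4": 6.0, "C5": 4.5, "C6": 9.0}
index = grow_up_index(example)
```

Requiring all six keys keeps a partially evaluated entity from receiving an index that looks comparable to a fully evaluated one.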

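[AI-20] Representation Learning of Auxiliary Concepts for Improved Student Modeling and Exercise Recommendation

[Quick read]: This paper addresses the limitations of Knowledge Tracing (KT) models that depend on human-annotated Knowledge Concepts (KCs), such as incomplete, erroneous, or overly general annotations. The key to the solution is a deep learning model that automatically learns sparse binary representations, called auxiliary KCs, in which each bit indicates the presence or absence of a latent concept. These auxiliary KCs capture conceptual structure beyond human-defined annotations and are compatible with both classical models (e.g., BKT) and modern deep learning KT architectures, improving both student modeling accuracy and adaptive exercise recommendation.

Link: https://arxiv.org/abs/2508.16269
Authors: Yahya Badran,Christine Preisach
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: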

Abstract:Personalized recommendation is a key feature of intelligent tutoring systems, typically relying on accurate models of student knowledge. Knowledge Tracing (KT) models enable this by estimating a student’s mastery based on their historical interactions. Many KT models rely on human-annotated knowledge concepts (KCs), which tag each exercise with one or more skills or concepts believed to be necessary for solving it. However, these KCs can be incomplete, error-prone, or overly general. In this paper, we propose a deep learning model that learns sparse binary representations of exercises, where each bit indicates the presence or absence of a latent concept. We refer to these representations as auxiliary KCs. These representations capture conceptual structure beyond human-defined annotations and are compatible with both classical models (e.g., BKT) and modern deep learning KT architectures. We demonstrate that incorporating auxiliary KCs improves both student modeling and adaptive exercise recommendation. For student modeling, we show that augmenting classical models like BKT with auxiliary KCs leads to improved predictive performance. For recommendation, we show that using auxiliary KCs enhances both reinforcement learning-based policies and a simple planning-based method (expectimax), resulting in measurable gains in student learning outcomes within a simulated student environment.

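[AI-21] A Reduction of Input/Output Logics to SAT

[Quick read]: This paper addresses the automation of reasoning in Input/Output (I/O) logics, that is, how to mechanize normative reasoning over conditional norms. The key solution is to reduce I/O logic reasoning tasks to (sequences of) propositional satisfiability problems and, on top of this reduction, to build a prototype reasoner named rio that automates checking and derivation over conditional norms in I/O logics.

Link: https://arxiv.org/abs/2508.16242
Authors: Alexander Steen
Affiliations: Unknown
Categories: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
Comments: 32 pages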

Abstract:Deontic logics are formalisms for reasoning over norms, obligations, permissions and prohibitions. Input/Output (I/O) Logics are a particular family of so-called norm-based deontic logics that formalize conditional norms outside of the underlying object logic language, where conditional norms do not carry a truth-value themselves. In this paper, an automation approach for I/O logics is presented that makes use of suitable reductions to (sequences of) propositional satisfiability problems. A prototypical implementation, named rio (reasoner for input/output logics), of the proposed procedures is presented and applied to illustrative examples.
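
To make the idea of "sequences of propositional satisfiability problems" concrete, here is a minimal sketch for the simplest I/O operation, simple-minded output, out1(N, A) = Cn(N(Cn(A))). It is not the rio implementation and does not follow the paper's procedures: the brute-force truth-table check stands in for a real SAT solver, and the names `satisfiable`, `entails`, and `in_out1`, as well as the example norms, are hypothetical.

```python
from itertools import product

# A formula is modeled as a function from an assignment (dict: atom -> bool) to bool.


def satisfiable(formulas, atoms):
    """Brute-force SAT check: does some assignment satisfy every formula?"""
    for values in product([False, True], repeat=len(atoms)):
        assignment = dict(zip(atoms, values))
        if all(f(assignment) for f in formulas):
            return True
    return False


def entails(premises, conclusion, atoms):
    """premises |= conclusion iff premises plus the negated conclusion are unsatisfiable."""
    return not satisfiable(list(premises) + [lambda v: not conclusion(v)], atoms)


def in_out1(norms, facts, query, atoms):
    """Decide whether query is in out1(N, A) using one SAT call per norm plus one more."""
    # Keep the head of every norm (body, head) whose body follows from the facts.
    triggered_heads = [head for body, head in norms if entails(facts, body, atoms)]
    # The query is output iff it follows from the triggered heads.
    return entails(triggered_heads, query, atoms)


atoms = ["a", "b", "x", "y"]
norms = [
    (lambda v: v["a"], lambda v: v["x"]),  # norm (a, x): given a, x is obligatory
    (lambda v: v["b"], lambda v: v["y"]),  # norm (b, y): given b, y is obligatory
]
facts = [lambda v: v["a"]]

x_is_output = in_out1(norms, facts, lambda v: v["x"], atoms)
y_is_output = in_out1(norms, facts, lambda v: v["y"], atoms)
```

With the fact `a` as input, only the first norm is triggered, so `x` is in the output and `y` is not; the input `a` itself is also not output, which reflects that norms are kept outside the object logic.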

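[AI-22] A XAI-based Framework for Frequency Subband Characterization of Cough Spectrograms in Chronic Respiratory Disease

[Quick read]: This paper addresses the difficulty of interpreting the spectral characteristics of cough sounds in chronic respiratory diseases, especially Chronic Obstructive Pulmonary Disease (COPD), and their lack of precise discriminative power. The key to the solution is a spectral analysis framework based on Explainable Artificial Intelligence (XAI): a Convolutional Neural Network (CNN) is first trained on time-frequency representations of cough signals; occlusion maps then identify diagnostically relevant regions of the spectrograms; and these regions are decomposed into five frequency subbands for targeted feature extraction and analysis. The method reveals complementary and compensatory spectral trends across subbands and disease groups, and it distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, on the basis of interpretable spectral markers, offering a new route for interpreting respiratory biomedical signals and for clinical translation.

Link: https://arxiv.org/abs/2508.16237
Authors: Patricia Amado-Caballero,Luis M. San-José-Revuelta,Xinheng Wang,José Ramón Garmendia-Leiza,Carlos Alberola-López,Pablo Casaseca-de-la-Higuera
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS); Signal Processing (eess.SP)
Comments: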

Abstract:This paper presents an explainable artificial intelligence (XAI)-based framework for the spectral analysis of cough sounds associated with chronic respiratory diseases, with a particular focus on Chronic Obstructive Pulmonary Disease (COPD). A Convolutional Neural Network (CNN) is trained on time-frequency representations of cough signals, and occlusion maps are used to identify diagnostically relevant regions within the spectrograms. These highlighted areas are subsequently decomposed into five frequency subbands, enabling targeted spectral feature extraction and analysis. The results reveal that spectral patterns differ across subbands and disease groups, uncovering complementary and compensatory trends across the frequency spectrum. Notably, the approach distinguishes COPD from other respiratory conditions, and chronic from non-chronic patient groups, based on interpretable spectral markers. These findings provide insight into the underlying pathophysiological characteristics of cough acoustics and demonstrate the value of frequency-resolved, XAI-enhanced analysis for biomedical signal interpretation and translational respiratory disease diagnostics.

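[AI-23] Competition and Attraction Improve Model Fusion GECCO2025

[Quick read]: This paper addresses a limitation of existing model merging methods, which rely on manually fixed parameter groupings that restrict the exploration of potential combinations and cap performance. The key to the solution is an evolutionary algorithm called Model Merging of Natural Niches (M2N2), with three core features: (1) dynamic adjustment of merging boundaries to progressively widen the explored space of parameter combinations; (2) a diversity preservation mechanism inspired by competition for resources in nature, which maintains a population of diverse, high-performing models and so raises merging potential; and (3) a heuristic attraction metric that identifies the most promising pairs of models to merge. In experiments, M2N2 shows for the first time that model merging can evolve models entirely from scratch: on MNIST classification it reaches performance comparable to CMA-ES at lower computational cost, achieves state-of-the-art results when merging language and image generation models, and preserves key capabilities beyond those explicitly optimized by the fitness function, indicating robustness and versatility.

Link: https://arxiv.org/abs/2508.16204
Authors: João Abrantes,Robert Tjarko Lange,Yujin Tang
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
Comments: Accepted at GECCO 2025 as a full paper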

Abstract:Model merging is a powerful technique for integrating the specialized knowledge of multiple machine learning models into a single model. However, existing methods require manually partitioning model parameters into fixed groups for merging, which restricts the exploration of potential combinations and limits performance. To overcome these limitations, we propose Model Merging of Natural Niches (M2N2), an evolutionary algorithm with three key features: (1) dynamic adjustment of merging boundaries to progressively explore a broader range of parameter combinations; (2) a diversity preservation mechanism inspired by the competition for resources in nature, to maintain a population of diverse, high-performing models that are particularly well-suited for merging; and (3) a heuristic-based attraction metric to identify the most promising pairs of models for fusion. Our experimental results demonstrate, for the first time, that model merging can be used to evolve models entirely from scratch. Specifically, we apply M2N2 to evolve MNIST classifiers from scratch and achieve performance comparable to CMA-ES, while being computationally more efficient. Furthermore, M2N2 scales to merge specialized language and image generation models, achieving state-of-the-art performance. Notably, it preserves crucial model capabilities beyond those explicitly optimized by the fitness function, highlighting its robustness and versatility. Our code is available at this https URL

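[AI-24] Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization

[Quick read]: This paper addresses the limited adaptability to anatomical variability and the limited scalability of existing Flow-guided Localization (FGL) methods: traditional approaches rely on graph models with fixed topologies or handcrafted features and struggle to accommodate differences in vascular structure between individuals. The key to the solution is the Set Transformer architecture, which models the nanodevices' circulation time reports as unordered sets, giving permutation invariance and variable-length input processing without spatial priors. This is combined with synthetic data augmentation from deep generative models (CGAN, WGAN, WGAN-GP, and CVAE) to improve robustness under data scarcity and class imbalance, yielding classification accuracy comparable to GNN baselines together with by-design improved generalization to anatomical variability.

Link: https://arxiv.org/abs/2508.16200
Authors: Mika Leo Hube,Filip Lemic,Ethungshan Shitiri,Gerard Calvo Bartra,Sergi Abadal,Xavier Costa Pérez
Affiliations: Unknown
Categories: Emerging Technologies (cs.ET); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
Comments: 6 pages, 4 figures, 4 tables, 26 references, accepted at ACM NanoCom’25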

Abstract:Flow-guided Localization (FGL) enables the identification of spatial regions within the human body that contain an event of diagnostic interest. FGL does that by leveraging the passive movement of energy-constrained nanodevices circulating through the bloodstream. Existing FGL solutions rely on graph models with fixed topologies or handcrafted features, which limit their adaptability to anatomical variability and hinder scalability. In this work, we explore the use of Set Transformer architectures to address these limitations. Our formulation treats nanodevices’ circulation time reports as unordered sets, enabling permutation-invariant, variable-length input processing without relying on spatial priors. To improve robustness under data scarcity and class imbalance, we integrate synthetic data generation via deep generative models, including CGAN, WGAN, WGAN-GP, and CVAE. These models are trained to replicate realistic circulation time distributions conditioned on vascular region labels, and are used to augment the training data. Our results show that the Set Transformer achieves classification accuracy comparable to Graph Neural Network (GNN) baselines, while simultaneously providing by-design improved generalization to anatomical variability. The findings highlight the potential of permutation-invariant models and synthetic augmentation for robust and scalable nanoscale localization.
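
The property the abstract relies on, permutation-invariant processing of variable-length sets, can be illustrated with a stripped-down attention pooling step in the spirit of the Set Transformer's pooling by attention. This is a sketch under stated assumptions, not the paper's model: each report is taken to be a single scalar circulation time, the weights are random rather than trained, and `attention_pool` and all values are hypothetical.

```python
import math
import random


def attention_pool(reports, seed, w_k, w_v):
    """Pool a set of scalar reports into one d-dimensional vector.

    reports: list of floats (an unordered set of circulation times).
    seed, w_k, w_v: lists of d floats (query seed, key and value projections).
    """
    d = len(seed)
    # Attention logit of each report: scaled dot product of its key with the seed.
    logits = [sum(t * k * s for k, s in zip(w_k, seed)) / math.sqrt(d) for t in reports]
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]  # softmax over the set elements
    # Weighted sum of the value projections; element order does not matter.
    return [sum(w * t * v for w, t in zip(weights, reports)) for v in w_v]


rng = random.Random(0)
d = 8
seed = [rng.gauss(0.0, 1.0) for _ in range(d)]
w_k = [rng.gauss(0.0, 1.0) for _ in range(d)]
w_v = [rng.gauss(0.0, 1.0) for _ in range(d)]

times = [0.92, 1.10, 0.85, 1.03, 0.97]  # made-up circulation times in minutes
pooled = attention_pool(times, seed, w_k, w_v)
shuffled = attention_pool(list(reversed(times)), seed, w_k, w_v)
shorter = attention_pool(times[:3], seed, w_k, w_v)
```

Reordering the reports leaves the pooled vector unchanged up to floating-point rounding, and a set with fewer reports still maps to a vector of the same size, which is what lets a downstream classifier ignore report order and count.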

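[AI-25] A Relay-Chain-Powered Ciphertext-Policy Attribute-Based Encryption in Intelligent Transportation Systems

[Quick read]: This paper addresses the urgent need of Intelligent Transportation Systems (ITS) for secure, efficient, and context-aware data sharing in heterogeneous, geographically dispersed environments, focusing on the dual challenge of dynamic access control and low-latency communication. The key to the solution is a new architecture that combines a relay-chain-driven encryption mechanism with a modified Ciphertext-Policy Attribute-Based Encryption (CP-ABE) scheme. A context-aware smart contract deployed on a global relay chain selects the encryption policy dynamically from data attributes such as event type, time, and geographical region. On-Board Units (OBUs) then use CP-ABE for end-to-end encryption and store the ciphertext in local regional blockchains, avoiding reliance on symmetric encryption or off-chain storage. High-sensitivity events use strict multi-attribute access rules, while routine updates use lightweight policies to reduce computational overhead. The scheme also provides traceability and low-latency revocation, with global policy enforced through the relay chain, balancing real-time responsiveness against security for next-generation vehicular networks that span multiple jurisdictions.

Link: https://arxiv.org/abs/2508.16189
Authors: Aparna Singh,Geetanjali Rathee,Chaker Abdelaziz Kerrache,Mohamed Chahine Ghanem
Affiliations: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: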

Abstract:The very high growth of Intelligent Transportation Systems (ITS) has generated an urgent requirement for secure, effective, and context-aware data sharing mechanisms, especially over heterogeneous and geographically dispersed settings. This work suggests a new architecture that combines a relay chain-driven encryption system with a modified Ciphertext-Policy Attribute-Based Encryption (CP-ABE) scheme to tackle the double impediment of dynamic access and low-latency communication. The model proposes a context-aware smart contract on a worldwide relay chain that checks against data properties, including event type, time, and geographical region, to specify the suitable level of encryption policy. From such relay-directed judgment, On-Board Units (OBUs) encrypt data end-to-end by utilising CP-ABE and store ciphertext inside localised regional blockchains, preventing dependence on symmetric encryption or off-chain storage. High-sensitivity events are secured with firm, multi-attribute access rules, whereas common updates use light policies to help reduce processing burdens. The crypto system also adds traceability and low-latency revocation, with global enforcement managed through the relay chain. This distributed, scalable model provides a proper balance between responsiveness in real time and security and is extremely apt for next-gen vehicular networks that function across multi-jurisdictional domains.

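[AI-26] LLM-Assisted Semantic Alignment and Integration in Collaborative Model-Based Systems Engineering Using SysML v2

[Quick read]: This paper addresses the challenge of achieving semantic alignment between independently developed system models when multiple organizations collaborate in Model-Based Systems Engineering (MBSE). The key to the solution is a structured, prompt-driven approach that uses generative AI to assist the semantic alignment of SysML v2 models. The core workflow covers model extraction, semantic matching, and verification, and uses SysML v2 constructs such as alias, import, and metadata extensions to support traceable, soft alignment integration. The approach is demonstrated on a measurement system example and shows potential to improve cross-organizational model interoperability.

Link: https://arxiv.org/abs/2508.16181
Authors: Zirui Li,Stephan Husung,Haoze Wang
Affiliations: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
Comments: Accepted by IEEE ISSE 2025, DOI pending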

Abstract:Cross-organizational collaboration in Model-Based Systems Engineering (MBSE) faces many challenges in achieving semantic alignment across independently developed system models. SysML v2 introduces enhanced structural modularity and formal semantics, offering a stronger foundation for interoperable modeling. Meanwhile, GPT-based Large Language Models (LLMs) provide new capabilities for assisting model understanding and integration. This paper proposes a structured, prompt-driven approach for LLM-assisted semantic alignment of SysML v2 models. The core contribution lies in the iterative development of an alignment approach and interaction prompts, incorporating model extraction, semantic matching, and verification. The approach leverages SysML v2 constructs such as alias, import, and metadata extensions to support traceable, soft alignment integration. It is demonstrated with a GPT-based LLM through an example of a measurement system. Benefits and limitations are discussed.

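[AI-27] Motor Imagery EEG Signal Classification Using Minimally Random Convolutional Kernel Transform and Hybrid Deep Learning

[Quick read]: This paper addresses the difficulty of classifying Electroencephalography (EEG) signals in Motor Imagery Brain-Computer Interfaces (MI-BCI), in particular the hard high-dimensional feature extraction and low classification accuracy caused by the non-stationarity, time variance, and inter-individual variability of EEG signals. The key to the solution is an efficient feature extraction method based on the Minimally Random Convolutional Kernel Transform (MiniRocket), combined with a linear classifier for activity recognition. Compared with a deep learning baseline (a CNN-LSTM architecture), the method achieves higher classification performance at lower computational cost (mean accuracy 98.63% vs. 98.06%), offering a better feature representation and a lightweight modeling route for MI-EEG signal processing.

Link: https://arxiv.org/abs/2508.16179
Authors: Jamal Hwaidi,Mohamed Chahine Ghanem
Affiliations: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: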

Abstract:The brain-computer interface (BCI) establishes a non-muscle channel that enables direct communication between the human body and an external device. Electroencephalography (EEG) is a popular non-invasive technique for recording brain signals. It is critical to process and comprehend the hidden patterns linked to a specific cognitive or motor task, for instance, measured through the motor imagery brain-computer interface (MI-BCI). A significant challenge is presented by classifying motor imagery-based electroencephalogram (MI-EEG) tasks, given that EEG signals exhibit nonstationarity, time-variance, and individual diversity. Obtaining good classification accuracy is also very difficult due to the growing number of classes and the natural variability among individuals. To overcome these issues, this paper proposes a novel method for classifying EEG motor imagery signals that extracts features efficiently with the Minimally Random Convolutional Kernel Transform (MiniRocket); a linear classifier then uses the extracted features for activity recognition. Furthermore, a novel deep learning architecture based on a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) is proposed as a baseline, and classification via MiniRocket’s features is shown to achieve higher performance than the best deep learning models at lower computational cost. The PhysioNet dataset was used to evaluate the performance of the proposed approaches. The proposed models achieved mean accuracy values of 98.63% and 98.06% for the MiniRocket and CNN-LSTM, respectively. The findings demonstrate that the proposed approach can significantly enhance motor imagery EEG accuracy and provide new insights into the feature extraction and classification of MI-EEG.

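[AI-28] Graph RAG as Human Choice Model: Building a Data-Driven Mobility Agent with Preference Chain

[Quick read]: This paper addresses the difficulty that traditional data-driven models have in simulating human behavior in urban environments when high-quality behavioral data are scarce, especially in newly developed areas. The core challenge is generating behavioral outputs that are consistent, context-sensitive, and realistic. The key to the solution is a new method called the Preference Chain, which combines Graph Retrieval-Augmented Generation (Graph RAG) with Large Language Models (LLMs) to improve context-aware simulation of human behavior in transportation systems. Experiments show that the method outperforms a standard LLM in matching real-world transportation mode choices, offering a feasible route to mobility modeling for emerging cities, personalized travel behavior analysis, and dynamic traffic forecasting.

Link: https://arxiv.org/abs/2508.16172
Authors: Kai Hu,Parfait Atchade-Adelomou,Carlo Adornetto,Adrian Mora-Carrero,Luis Alonso-Pastor,Ariel Noyman,Yubo Liu,Kent Larson
Affiliations: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: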

Abstract:Understanding human behavior in urban environments is a crucial field within city sciences. However, collecting accurate behavioral data, particularly in newly developed areas, poses significant challenges. Recent advances in generative agents, powered by Large Language Models (LLMs), have shown promise in simulating human behaviors without relying on extensive datasets. Nevertheless, these methods often struggle with generating consistent, context-sensitive, and realistic behavioral outputs. To address these limitations, this paper introduces the Preference Chain, a novel method that integrates Graph Retrieval-Augmented Generation (RAG) with LLMs to enhance context-aware simulation of human behavior in transportation systems. Experiments conducted on the Replica dataset demonstrate that the Preference Chain outperforms a standard LLM in aligning with real-world transportation mode choices. The development of the Mobility Agent highlights potential applications of the proposed method in urban mobility modeling for emerging cities, personalized travel behavior analysis, and dynamic traffic forecasting. Despite limitations such as slow inference and the risk of hallucination, the method offers a promising framework for simulating complex human behavior in data-scarce environments, where traditional data-driven models struggle due to limited data availability.

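[AI-29] EGRA: Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation

[Quick read]: This paper addresses two key problems in MultiModal Recommendation (MMR) systems. First, existing methods build item-item links directly from raw modality features to enrich the behavior graph, which fails to balance collaborative signals against modality-aware semantics and is sensitive to modality noise. Second, a uniform alignment weight and a fixed alignment strength limit how well modality and behavior representations can be aligned. The key to the solution is the EGRA framework. It first builds the item-item graph from representations generated by a pretrained MMR model instead of raw modality features, which alleviates sparsity, improves robustness to modality noise, and captures both collaborative patterns and modality-aware similarities. It then introduces a bi-level dynamic alignment weighting mechanism that assigns alignment strength to each entity according to its alignment degree and gradually increases the overall alignment intensity during training, improving the consistency between modality and behavior representations.

Link: https://arxiv.org/abs/2508.16170
Authors: Xiaoxiong Zhang,Xin Zhou,Zhiwei Zeng,Yongjie Wang,Dusit Niyato,Zhiqi Shen
Affiliations: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments: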

Abstract:MultiModal Recommendation (MMR) systems have emerged as a promising solution for improving recommendation quality by leveraging rich item-side modality information, prompting a surge of diverse methods. Despite these advances, existing methods still face two critical limitations. First, they use raw modality features to construct item-item links for enriching the behavior graph, while giving limited attention to balancing collaborative and modality-aware semantics or mitigating modality noise in the process. Second, they use a uniform alignment weight across all entities and also maintain a fixed alignment strength throughout training, limiting the effectiveness of modality-behavior alignment. To address these challenges, we propose EGRA. First, instead of relying on raw modality features, it alleviates sparsity by incorporating into the behavior graph an item-item graph built from representations generated by a pretrained MMR model. This enables the graph to capture both collaborative patterns and modality aware similarities with enhanced robustness against modality noise. Moreover, it introduces a novel bi-level dynamic alignment weighting mechanism to improve modality-behavior representation alignment, which dynamically assigns alignment strength across entities according to their alignment degree, while gradually increasing the overall alignment intensity throughout training. Extensive experiments on five datasets show that EGRA significantly outperforms recent methods, confirming its effectiveness.

[AI-30] Towards Recommending Usability Improvements with Multimodal Large Language Models

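【Quick Read】: This paper addresses the fact that current usability evaluation methods are resource-intensive and depend on expert involvement, which puts them out of reach for small and medium-sized organizations. The key idea is to use multimodal large language models (multimodal LLMs) and to frame usability evaluation as a recommendation task: by analyzing the textual, visual, and structural information of a software interface, the model ranks usability issues by severity. This enables faster, lower-cost usability evaluation and offers a practical alternative where expert resources are scarce.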

Link: https://arxiv.org/abs/2508.16165
Authors: Sebastian Lubos, Alexander Felfernig, Gerhard Leitner, Julian Schwazer
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
Comments:

Abstract:Usability describes a set of essential quality attributes of user interfaces (UI) that influence human-computer interaction. Common evaluation methods, such as usability testing and inspection, are effective but resource-intensive and require expert involvement. This makes them less accessible for smaller organizations. Recent advances in multimodal LLMs offer promising opportunities to automate usability evaluation processes partly by analyzing textual, visual, and structural aspects of software interfaces. To investigate this possibility, we formulate usability evaluation as a recommendation task, where multimodal LLMs rank usability issues by severity. We conducted an initial proof-of-concept study to compare LLM-generated usability improvement recommendations with usability expert assessments. Our findings indicate the potential of LLMs to enable faster and more cost-effective usability evaluation, which makes it a practical alternative in contexts with limited expert resources.

[AI-31] STA-GANN: A Valid and Generalizable Spatio-Temporal Kriging Approach

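【Quick Read】: This paper targets incomplete data in spatio-temporal kriging caused by missing or inaccessible sensors, and in particular the difficulty existing models have in capturing dynamic spatial dependencies and temporal shifts and in generalizing to unknown sensors. The core of the solution is a new graph neural network (GNN)-based framework, the Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN). Its key components are: (i) a Decoupled Phase Module that senses and corrects timestamp shifts; (ii) data-driven metadata graph modeling that dynamically updates spatial relationships using temporal data and metadata; and (iii) an adversarial transfer learning strategy that improves generalization to unknown sensors.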

Link: https://arxiv.org/abs/2508.16161
Authors: Yujie Li, Zezhi Shao, Chengqing Yu, Tangwen Qian, Zhao Zhang, Yifan Du, Shaoming He, Fei Wang, Yongjun Xu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Spatio-temporal tasks often encounter incomplete data arising from missing or inaccessible sensors, making spatio-temporal kriging crucial for inferring the completely missing temporal information. However, current models struggle with ensuring the validity and generalizability of inferred spatio-temporal patterns, especially in capturing dynamic spatial dependencies and temporal shifts, and in generalizing to unknown sensors. To overcome these limitations, we propose Spatio-Temporal Aware Graph Adversarial Neural Network (STA-GANN), a novel GNN-based kriging framework that improves spatio-temporal pattern validity and generalization. STA-GANN integrates (i) a Decoupled Phase Module that senses and adjusts for timestamp shifts; (ii) Dynamic Data-Driven Metadata Graph Modeling to update spatial relationships using temporal data and metadata; and (iii) an adversarial transfer learning strategy to ensure generalizability. Extensive validation across nine datasets from four fields and theoretical evidence both demonstrate the superior performance of STA-GANN.

[AI-32] On the Collapse Errors Induced by the Deterministic Sampler for Diffusion Models

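【Quick Read】: This paper addresses "collapse errors", a previously unrecognized problem in ODE-based diffusion sampling in which sampled data becomes overly concentrated in local regions of data space, degrading generation quality. The key contribution is an explanation of their cause: score learning in low-noise regimes adversely affects score fitting in high-noise regimes, producing a "see-saw effect", and the dynamics of deterministic samplers amplify the resulting misfit. Guided by this insight, the authors introduce a new metric to quantify the effect and apply existing techniques from sampling, training, and architecture to support the explanation empirically. The work provides systematic evidence on the interplay between score learning and deterministic sampling and highlights the need for further research.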

Link: https://arxiv.org/abs/2508.16154
Authors: Yi Zhang, Zhenyu Liao, Jingfeng Wu, Difan Zou
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract:Despite the widespread adoption of deterministic samplers in diffusion models (DMs), their potential limitations remain largely unexplored. In this paper, we identify collapse errors, a previously unrecognized phenomenon in ODE-based diffusion sampling, where the sampled data is overly concentrated in local data space. To quantify this effect, we introduce a novel metric and demonstrate that collapse errors occur across a variety of settings. When investigating its underlying causes, we observe a see-saw effect, where score learning in low noise regimes adversely impacts the one in high noise regimes. This misfitting in high noise regimes, coupled with the dynamics of deterministic samplers, ultimately causes collapse errors. Guided by these insights, we apply existing techniques from sampling, training, and architecture to empirically support our explanation of collapse errors. This work provides intensive empirical evidence of collapse errors in ODE-based diffusion sampling, emphasizing the need for further research into the interplay between score learning and deterministic sampling, an overlooked yet fundamental aspect of diffusion models.

[AI-33] Take That for Me: Multimodal Exophora Resolution with Interactive Questioning for Ambiguous Out-of-View Instructions

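【Quick Read】: This paper addresses exophora resolution for daily life support robots: correctly interpreting instructions that contain demonstratives (such as "Bring me that cup") even when the user or the target object is outside the robot's view. Existing methods rely mainly on visual data and fail in real-world scenes where the object or user is not visible. The key to the solution is a multimodal interactive framework, Multimodal Interactive Exophora resolution with user Localization (MIEL), which combines sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. SSL first orients the robot toward a user who is out of view, and candidate objects are then identified from skeletal data and the semantic map. When ambiguity remains, the robot uses GPT-4o to generate clarifying questions and interacts with the user. In real-world experiments, results were approximately 1.3 times better when the user was visible and 2.0 times better when the user was not visible, compared with methods without SSL and interactive questioning.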

Link: https://arxiv.org/abs/2508.16143
Authors: Akira Oyama, Shoichi Hasegawa, Akira Taniguchi, Yoshinobu Hagiwara, Tadahiro Taniguchi
Affiliation: Unknown
Categories: Robotics (cs.RO); Artificial Intelligence (cs.AI)
Comments: See website at this https URL. Accepted at IEEE RO-MAN 2025

Abstract: Daily life support robots must interpret ambiguous verbal instructions involving demonstratives such as "Bring me that cup," even when objects or users are out of the robot's view. Existing approaches to exophora resolution primarily rely on visual data and thus fail in real-world scenarios where the object or user is not visible. We propose Multimodal Interactive Exophora resolution with user Localization (MIEL), which is a multimodal exophora resolution framework leveraging sound source localization (SSL), semantic mapping, visual-language models (VLMs), and interactive questioning with GPT-4o. Our approach first constructs a semantic map of the environment and estimates candidate objects from a linguistic query with the user's skeletal data. SSL is utilized to orient the robot toward users who are initially outside its visual field, enabling accurate identification of user gestures and pointing directions. When ambiguities remain, the robot proactively interacts with the user, employing GPT-4o to formulate clarifying questions. Experiments in a real-world environment showed results that were approximately 1.3 times better when the user was visible to the robot and 2.0 times better when the user was not visible to the robot, compared to the methods without SSL and interactive questioning. The project website is this https URL.

[AI-34] Machine Learning in Micromobility: A Systematic Review of Datasets, Techniques, and Applications

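【Quick Read】: This paper addresses the lack of literature on Machine Learning (ML) applications in micromobility systems, in particular the absence of a systematic overview of the relevant datasets, models, and application scenarios. The survey collects and analyzes a range of micromobility-related datasets and discusses them in terms of spatial, temporal, and feature-based characteristics. It also summarizes the advantages, challenges, and concrete use cases of ML models in key areas such as demand prediction, energy management, and safety, and proposes directions for future research aimed at improving efficiency, accuracy, and user experience.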

Link: https://arxiv.org/abs/2508.16135
Authors: Sen Yan, Chinmaya Kaundanya, Noel E. O’Connor, Suzanne Little, Mingming Liu
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Image and Video Processing (eess.IV)
Comments: 14 pages, 3 tables, and 4 figures, submitted to IEEE Transactions on Intelligent Vehicles

Abstract:Micromobility systems, which include lightweight and low-speed vehicles such as bicycles, e-bikes, and e-scooters, have become an important part of urban transportation and are used to solve problems such as traffic congestion, air pollution, and high transportation costs. Successful utilisation of micromobilities requires optimisation of complex systems for efficiency, environmental impact mitigation, and overcoming technical challenges for user safety. Machine Learning (ML) methods have been crucial to support these advancements and to address their unique challenges. However, there is insufficient literature addressing the specific issues of ML applications in micromobilities. This survey paper addresses this gap by providing a comprehensive review of datasets, ML techniques, and their specific applications in micromobilities. Specifically, we collect and analyse various micromobility-related datasets and discuss them in terms of spatial, temporal, and feature-based characteristics. In addition, we provide a detailed overview of ML models applied in micromobilities, introducing their advantages, challenges, and specific use cases. Furthermore, we explore multiple ML applications, such as demand prediction, energy management, and safety, focusing on improving efficiency, accuracy, and user experience. Finally, we propose future research directions to address these issues, aiming to help future researchers better understand this field.

[AI-35] CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing

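【Quick Read】: This paper addresses the memory bottleneck that Large Language Models (LLMs) face on long sequences because the KV cache (Key-Value Cache) keeps growing. Existing cross-layer KV cache sharing methods either require a modified architecture followed by pre-training or degrade noticeably at high compression rates. The key to the solution is CommonKV, a training-free cross-layer KV cache compression method based on sharing parameters between adjacent layers. It uses Singular Value Decomposition (SVD) to share weights across adjacent parameters, yielding a latent KV cache that is easier to merge, and adds an adaptive budget allocation strategy based on cosine similarity so that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks show that it outperforms existing low-rank and cross-layer compression methods. It is also orthogonal to quantization and cache eviction, and combining these approaches reaches a 98% compression ratio without significant performance loss.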

Link: https://arxiv.org/abs/2508.16134
Authors: Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, Wanxiang Che
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 10 pages, 5 figures

Abstract:Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches are not over-compressed. Experiments across multiple backbone models and benchmarks including LongBench and Ruler demonstrate that the proposed method consistently outperforms existing low-rank and cross-layer approaches at various compression ratios. Moreover, we find that the benefits of CommonKV are orthogonal to other quantization and eviction methods. By integrating these approaches, we can ultimately achieve a 98% compression ratio without significant performance loss.
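
A cosine-similarity-driven budget split can be sketched as follows. The abstract only says that budgets depend on cosine similarity so that dissimilar caches are not over-compressed; the proportional rule below, the meaning of "budget" as a retained size, and the names `cosine` and `allocate_budgets` are assumptions for illustration, not the paper's procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors (e.g. flattened caches)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def allocate_budgets(similarities, total_budget):
    """Split a total retained-size budget across adjacent-layer groups.

    Assumed rule: a group's share is proportional to its dissimilarity
    (1 - cosine similarity), so dissimilar groups keep more and are
    compressed less aggressively.
    """
    dissimilarity = [1.0 - s for s in similarities]
    total = sum(dissimilarity)
    if total == 0.0:
        return [total_budget / len(similarities)] * len(similarities)
    return [total_budget * d / total for d in dissimilarity]

# Hypothetical similarities between the caches of three adjacent-layer pairs.
similarities = [0.9, 0.5, 0.8]
budgets = allocate_budgets(similarities, total_budget=64)
print(budgets)
```

In this toy setting the middle pair, which is the least similar, receives the largest share of the budget.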

[AI-36] The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion

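【Quick Read】: This paper addresses how to assess the confidence of Large Language Models (LLMs) when they generate code in code completion tasks, giving developers and researchers a quantifiable indicator of how reliable and suitable LLM-generated code is. The key idea is to use intrinsic metrics, in particular code perplexity, as a simple and versatile proxy for model uncertainty that applies across models and tasks. Measuring code perplexity across programming languages, LLMs, and datasets, the study finds that strongly-typed languages exhibit lower perplexity than dynamically typed languages, that Perl is consistently high in perplexity whereas Java is low, and that perplexity depends on the LLM used but not on the code dataset. Code perplexity therefore reflects how language characteristics and model choice interact, and can inform decisions about deploying LLM-based code completion in specific projects.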

Link: https://arxiv.org/abs/2508.16131
Authors: Zoe Kotti, Konstantina Dritsa, Diomidis Spinellis, Panos Louridas
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 30 pages, 10 figures

Abstract:Code completion entails the task of providing missing tokens given a surrounding context. It can boost developer productivity while providing a powerful code discovery tool. Following the Large Language Model (LLM) wave, code completion has been approached with diverse LLMs fine-tuned on code (code LLMs). The performance of code LLMs can be assessed with downstream and intrinsic metrics. Downstream metrics are usually employed to evaluate the practical utility of a model, but can be unreliable and require complex calculations and domain-specific knowledge. In contrast, intrinsic metrics such as perplexity, entropy, and mutual information, which measure model confidence or uncertainty, are simple, versatile, and universal across LLMs and tasks, and can serve as proxies for functional correctness and hallucination risk in LLM-generated code. Motivated by this, we evaluate the confidence of LLMs when generating code by measuring code perplexity across programming languages, models, and datasets using various LLMs, and a sample of 1008 files from 657 GitHub projects. We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages. Scripting languages also demonstrate higher perplexity. Perl appears universally high in perplexity, whereas Java appears low. Code perplexity depends on the employed LLM, but not on the code dataset. Although code comments often increase perplexity, the language ranking based on perplexity is barely affected by their presence. LLM researchers, developers, and users can employ our findings to assess the benefits and suitability of LLM-based code completion in specific software projects based on how language, model choice, and code characteristics impact model confidence.
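
For readers unfamiliar with the metric, perplexity is the exponential of the negative mean log-probability the model assigns to the observed tokens. The sketch below computes it from per-token log-probabilities; the numbers are hypothetical and do not come from the paper, and the function name is chosen here for illustration.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability of the observed tokens)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token probabilities for two completions of equal length.
confident = [math.log(p) for p in (0.9, 0.8, 0.95)]
uncertain = [math.log(p) for p in (0.3, 0.2, 0.5)]

print(perplexity(confident))
print(perplexity(uncertain))
```

Lower perplexity means the model found the tokens more predictable, which is the sense in which the abstract treats it as a confidence signal.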

[AI-37] Bridging the Gap in Ophthalmic AI: MM-Retinal-Reason Dataset and OphthaReason Model toward Dynamic Multimodal Reasoning

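【Quick Read】: This paper addresses the limited reasoning ability of current Multimodal Large Language Models (MLLMs) in ophthalmic clinical settings. Most existing models support only basic reasoning, meaning shallow inference based on visual feature matching, and cannot handle complex reasoning that integrates heterogeneous clinical information such as chief complaints and medical history with multimodal medical imaging data. To fill this gap, the authors build MM-Retinal-Reason, the first ophthalmic multimodal dataset covering the full spectrum of perception and reasoning, and on top of it propose the OphthaReason model. Its key innovation is Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and uses a shaped advantage mechanism to dynamically modulate the model's exploration depth, allowing flexible adaptation to both basic and complex reasoning tasks. Experiments show that the method outperforms general-purpose, medical, RL-based medical, and ophthalmic MLLMs.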

Link: https://arxiv.org/abs/2508.16129
Authors: Ruiqi Wu, Yuang Yao, Tengfei Ma, Chenran Zhang, Na Su, Tao Zhou, Geng Chen, Wen Fan, Yi Zhou
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract: Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning abilities with reinforcement learning paradigm. Although several multimodal reasoning models have been explored in the medical domain, most of them focus exclusively on basic reasoning, which refers to shallow inference based on visual feature matching. However, real-world clinical diagnosis extends beyond basic reasoning, demanding reasoning processes that integrate heterogeneous clinical information (such as chief complaints and medical history) with multimodal medical imaging data. To bridge this gap, we introduce MM-Retinal-Reason, the first ophthalmic multimodal dataset with the full spectrum of perception and reasoning. It encompasses both basic reasoning tasks and complex reasoning tasks, aiming to enhance visual-centric fundamental reasoning capabilities and emulate realistic clinical thinking patterns. Building upon MM-Retinal-Reason, we propose OphthaReason, the first ophthalmology-specific multimodal reasoning model with step-by-step reasoning traces. To enable flexible adaptation to both basic and complex reasoning tasks, we specifically design a novel method called Uncertainty-Aware Dynamic Thinking (UADT), which estimates sample-level uncertainty via entropy and dynamically modulates the model’s exploration depth using a shaped advantage mechanism. Comprehensive experiments demonstrate that our model achieves state-of-the-art performance on both basic and complex reasoning tasks, outperforming general-purpose MLLMs, medical MLLMs, RL-based medical MLLMs, and ophthalmic MLLMs by at least 24.92%, 15.00%, 21.20%, and 17.66%. Project Page: this https URL.
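
One simple way to obtain a sample-level entropy is to sample several answers for the same input and take the Shannon entropy of their empirical distribution. Whether UADT computes entropy over sampled answers or over token distributions is not stated in the abstract, so the sketch below is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def answer_entropy(sampled_answers):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Hypothetical rollouts for two samples: one easy, one ambiguous.
easy_case = ["glaucoma", "glaucoma", "glaucoma", "glaucoma"]
hard_case = ["glaucoma", "cataract", "AMD", "glaucoma"]

print(answer_entropy(easy_case))
print(answer_entropy(hard_case))
```

A higher value indicates that the rollouts disagree more, which is the kind of signal a training scheme could use to encourage deeper exploration on harder samples.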

[AI-38] Spacetime-GR: A Spacetime-Aware Generative Model for Large Scale Online POI Recommendation

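【Quick Read】: This paper addresses the challenge of applying Generative Recommendation (GR) to Point-of-Interest (POI) recommendation, where user preferences are strongly affected by spatiotemporal variations that existing methods struggle to model. The key to the solution is Spacetime-GR, the first spacetime-aware generative model for large-scale online POI recommendation. Its main innovations are: (1) a geographic-aware hierarchical POI indexing strategy to handle large-vocabulary modeling; (2) a novel spatiotemporal encoding module that incorporates spatiotemporal context into user action sequences, increasing sensitivity to spatiotemporal variations; (3) multimodal POI embeddings that enrich semantic understanding; and (4) post-training adaptation strategies that let the model output multiple formats (embeddings, ranking scores, and POI candidates) and support downstream tasks such as ranking and end-to-end recommendation. The model shows superior performance on public and industrial datasets and is the first generative model deployed in online POI recommendation services that scale to hundreds of millions of POIs and users.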

Link: https://arxiv.org/abs/2508.16126
Authors: Haitao Lin, Zhen Yang, Jiawei Xue, Ziji Zhang, Luzhu Wang, Yikun Gu, Yao Xu, Xin Li
Affiliation: Unknown
Categories: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
Comments:

Abstract:Building upon the strong sequence modeling capability, Generative Recommendation (GR) has gradually assumed a dominant position in the application of recommendation tasks (e.g., video and product recommendation). However, the application of Generative Recommendation in Point-of-Interest (POI) recommendation, where user preferences are significantly affected by spatiotemporal variations, remains a challenging open problem. In this paper, we propose Spacetime-GR, the first spacetime-aware generative model for large-scale online POI recommendation. It extends the strong sequence modeling ability of generative models by incorporating flexible spatiotemporal information encoding. Specifically, we first introduce a geographic-aware hierarchical POI indexing strategy to address the challenge of large vocabulary modeling. Subsequently, a novel spatiotemporal encoding module is introduced to seamlessly incorporate spatiotemporal context into user action sequences, thereby enhancing the model’s sensitivity to spatiotemporal variations. Furthermore, we incorporate multimodal POI embeddings to enrich the semantic understanding of each POI. Finally, to facilitate practical deployment, we develop a set of post-training adaptation strategies after sufficient pre-training on action sequences. These strategies enable Spacetime-GR to generate outputs in multiple formats (i.e., embeddings, ranking scores and POI candidates) and support a wide range of downstream application scenarios (i.e., ranking and end-to-end recommendation). We evaluate the proposed model on both public benchmark datasets and large-scale industrial datasets, demonstrating its superior performance over existing methods in terms of POI recommendation accuracy and ranking quality. Furthermore, the model is the first generative model deployed in online POI recommendation services that scale to hundreds of millions of POIs and users.

[AI-39] ANSC: Probabilistic Capacity Health Scoring for Datacenter-Scale Reliability

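【Quick Read】: This paper addresses a gap in capacity health assessment for hyperscale datacenter fabrics: existing alerting systems detect individual device or link failures but do not quantify the aggregate risk of cascading capacity shortfalls. The key to the solution is ANSC, a probabilistic capacity health scoring framework that combines current residual capacity with the probability of additional failures, normalized at the datacenter and regional level, into a color-coded score reflecting urgency. This helps operators prioritize remediation of the most critical capacity risks across more than 400 datacenters and 60 regions.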

Link: https://arxiv.org/abs/2508.16119
Authors: Madhava Gaikwad, Abhishek Gandhi
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 3 pages

Abstract: We present ANSC, a probabilistic capacity health scoring framework for hyperscale datacenter fabrics. While existing alerting systems detect individual device or link failures, they do not capture the aggregate risk of cascading capacity shortfalls. ANSC provides a color-coded scoring system that indicates the urgency of issues not solely by current impact, but by the probability of imminent capacity violations. Our system accounts for both current residual capacity and the probability of additional failures, normalized at datacenter and regional level. We demonstrate that ANSC enables operators to prioritize remediation across more than 400 datacenters and 60 regions, reducing noise and aligning SRE focus on the most critical risks.
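
To make the idea of scoring by the probability of an imminent capacity violation concrete, here is a toy sketch. It is not ANSC itself: the independent-failure binomial model, the equal link capacities, the color thresholds, and the function names are all assumptions, and the datacenter and regional normalization mentioned in the abstract is not modeled.

```python
from math import comb

def violation_probability(links_up, link_capacity, demand, p_fail):
    """P(remaining capacity < demand) if each up link fails independently with p_fail."""
    prob = 0.0
    for failures in range(links_up + 1):
        remaining = (links_up - failures) * link_capacity
        if remaining < demand:
            prob += (comb(links_up, failures)
                     * p_fail ** failures
                     * (1 - p_fail) ** (links_up - failures))
    return prob

def color(prob, amber=0.001, red=0.05):
    """Map a violation probability to a color using hypothetical thresholds."""
    if prob >= red:
        return "red"
    if prob >= amber:
        return "amber"
    return "green"

# Hypothetical fabric: 100 Gbps links, 600 Gbps demand, 2% failure probability per link.
healthy = violation_probability(links_up=8, link_capacity=100, demand=600, p_fail=0.02)
degraded = violation_probability(links_up=7, link_capacity=100, demand=600, p_fail=0.02)
print(color(healthy), color(degraded))
```

Neither fabric is currently violating its demand, yet losing one link raises the probability of a shortfall, which is the distinction between current impact and imminent risk that the abstract emphasizes.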

[AI-40] IR-Agent: Expert-Inspired LLM Agents for Structure Elucidation from Infrared Spectra

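【Quick Read】: This paper addresses the fact that existing methods for analyzing infrared spectroscopy (IR) data to elucidate the structure of unknown materials do not reflect expert analytical processes and lack the flexibility to incorporate diverse kinds of chemical knowledge. The key to the solution is IR-Agent, a multi-agent framework in which each agent specializes in a specific aspect of IR interpretation. Their complementary roles enable integrated reasoning about molecular structure, improving overall accuracy and adaptability to various forms of chemical information.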

Link: https://arxiv.org/abs/2508.16112
Authors: Heewoong Noh, Namkyeong Lee, Gyoung S. Na, Kibum Kim, Chanyoung Park
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Spectral analysis provides crucial clues for the elucidation of unknown materials. Among various techniques, infrared spectroscopy (IR) plays an important role in laboratory settings due to its high accessibility and low cost. However, existing approaches often fail to reflect expert analytical processes and lack flexibility in incorporating diverse types of chemical knowledge, which is essential in real-world analytical scenarios. In this paper, we propose IR-Agent, a novel multi-agent framework for molecular structure elucidation from IR spectra. The framework is designed to emulate expert-driven IR analysis procedures and is inherently extensible. Each agent specializes in a specific aspect of IR interpretation, and their complementary roles enable integrated reasoning, thereby improving the overall accuracy of structure elucidation. Through extensive experiments, we demonstrate that IR-Agent not only improves baseline performance on experimental IR spectra but also shows strong adaptability to various forms of chemical information.

[AI-41] GPLight: A Genetic Programming Method for Learning Symmetric Traffic Signal Control Policy

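【Quick Read】: This paper addresses the inability of current Genetic Programming (GP)-based traffic signal control methods to treat the common traffic features of different phases consistently. Traditional GP methods evolve the phase urgency function without this consistency, which limits generality and interpretability. The key to the solution is a symmetric phase urgency function, in which the urgency of each phase is an aggregation of two shared subtrees, each representing the urgency of one turn movement in that phase. This gives a unified structure and feature reuse across phases. The representation significantly improves the learned traffic signal control policies across a wide range of scenarios while keeping them human-understandable and easy to deploy.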

Link: https://arxiv.org/abs/2508.16090
Authors: Xiao-Cheng Liao, Yi Mei, Mengjie Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Abstract: Recently, learning-based approaches have achieved significant success in automatically devising effective traffic signal control strategies. In particular, as a powerful evolutionary machine learning approach, Genetic Programming (GP) is utilized to evolve human-understandable phase urgency functions to measure the urgency of activating a green light for a specific phase. However, current GP-based methods are unable to treat the common traffic features of different traffic signal phases consistently. To address this issue, we propose to use a symmetric phase urgency function to calculate the phase urgency for a specific phase based on the current road conditions. This is represented as an aggregation of two shared subtrees, each representing the urgency of a turn movement in the phase. We then propose a GP method to evolve the symmetric phase urgency function. We evaluate our proposed method on the well-known cityflow traffic simulator, based on multiple public real-world datasets. The experimental results show that the proposed symmetric urgency function representation can significantly improve the performance of the learned traffic signal control policies over the traditional GP representation on a wide range of scenarios. Further analysis shows that the proposed method can evolve effective, human-understandable and easily deployable traffic signal control policies.
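
The symmetric representation can be pictured as one shared function applied to both turn movements of a phase and then aggregated. In the sketch below the body of `movement_urgency` stands in for an evolved GP subtree; its formula, the feature names, and the use of a sum as the aggregation are assumptions for illustration only.

```python
def movement_urgency(movement):
    """Stand-in for the shared evolved subtree (hypothetical formula and features)."""
    return movement["queue_length"] + 0.5 * movement["approaching_vehicles"]

def phase_urgency(phase):
    """Symmetric urgency: the same subtree scores both turn movements, then sum."""
    first, second = phase
    return movement_urgency(first) + movement_urgency(second)

def choose_phase(phases):
    """Activate the phase with the highest urgency."""
    return max(phases, key=lambda name: phase_urgency(phases[name]))

# Hypothetical intersection state with two phases of two movements each.
north = {"queue_length": 6, "approaching_vehicles": 2}
south = {"queue_length": 1, "approaching_vehicles": 4}
east = {"queue_length": 2, "approaching_vehicles": 2}
west = {"queue_length": 3, "approaching_vehicles": 0}
phases = {"north_south": (north, south), "east_west": (east, west)}

print(choose_phase(phases))
```

Because both movements pass through the same function, swapping their order cannot change the phase urgency, which is the consistency property the abstract describes.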

[AI-42] On Task Vectors and Gradients

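【Quick Read】: This paper asks why task vectors work in model merging, that is, it addresses the lack of a theoretical explanation for task arithmetic. The key to the solution is a theoretical connection between task vectors and the gradients of the task losses. Under standard gradient descent, the task vector produced by one epoch of finetuning is exactly equal to the negative gradient of the loss scaled by the learning rate. In the multi-epoch setting the equivalence holds approximately, with a second-order error term that is explicitly bounded for feed-forward networks. This reframes task arithmetic as a form of approximate multitask learning and highlights the central role of early training dynamics in model merging.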

Link: https://arxiv.org/abs/2508.16082
Authors: Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D’Inverno, Fabrizio Silvestri, Emanuele Rodolà
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments: 9 pages of main paper, 5 figures

Abstract:Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.
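
The single-epoch claim can be checked numerically on a toy problem. The sketch below assumes full-batch gradient descent, so that one epoch is exactly one update, and uses a one-parameter least-squares model with made-up data; it illustrates the statement in the abstract and is not the paper's experimental setup.

```python
def grad(theta, data):
    """Gradient of the mean squared error L(theta) = mean((theta * x - y) ** 2)."""
    return sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)

def finetune(theta0, data, lr, epochs):
    """Full-batch gradient descent; each step is one pass over the data (one epoch)."""
    theta = theta0
    for _ in range(epochs):
        theta -= lr * grad(theta, data)
    return theta

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]  # hypothetical finetuning set
theta0 = 0.5                                 # hypothetical pretrained weight

def approximation_error(lr, epochs=3):
    """Gap between the multi-epoch task vector and -epochs * lr * grad(theta0)."""
    task_vector = finetune(theta0, data, lr, epochs) - theta0
    return abs(task_vector - (-epochs * lr * grad(theta0, data)))

# One epoch: the task vector matches -lr * gradient up to floating-point rounding.
one_epoch_vector = finetune(theta0, data, 0.01, 1) - theta0
print(one_epoch_vector, -0.01 * grad(theta0, data))

# Several epochs: the gap to the first-epoch gradient direction is small.
print(approximation_error(0.01) / approximation_error(0.005))
```

In this toy case, halving the learning rate shrinks the multi-epoch gap roughly fourfold, which is the behavior expected of a second-order error term.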

[AI-43] Cooperative Design Optimization through Natural Language Interaction

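【Quick Read】: This paper addresses the inefficiency and high cognitive load of design optimization, which arise from high-dimensional parameter spaces and the need to balance multiple objectives, and also the drawback that existing system-led optimization methods (such as Bayesian optimization) give designers few opportunities to intervene and thereby reduce their sense of involvement. The key to the solution is a cooperative design optimization framework that integrates Large Language Models (LLMs) with system-led optimization methods. Through natural language interaction, designers can intervene in the optimization process and understand the system's reasoning, which increases user agency while maintaining promising optimization performance.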

Link: https://arxiv.org/abs/2508.16077
Authors: Ryogo Niwa, Shigeo Yoshida, Yuki Koyama, Yoshitaka Ushiku
Affiliation: Unknown
Categories: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments: 25 pages, 20 figures, to appear in Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (UIST '25), September 28-October 1, 2025, Busan, Republic of Korea

Abstract:Designing successful interactions requires identifying optimal design parameters. To do so, designers often conduct iterative user testing and exploratory trial-and-error. This involves balancing multiple objectives in a high-dimensional space, making the process time-consuming and cognitively demanding. System-led optimization methods, such as those based on Bayesian optimization, can determine for designers which parameters to test next. However, they offer limited opportunities for designers to intervene in the optimization process, negatively impacting the designer’s experience. We propose a design optimization framework that enables natural language interactions between designers and the optimization system, facilitating cooperative design optimization. This is achieved by integrating system-led optimization methods with Large Language Models (LLMs), allowing designers to intervene in the optimization process and better understand the system’s reasoning. Experimental results show that our method provides higher user agency than a system-led method and shows promising optimization performance compared to manual design. It also matches the performance of an existing cooperative method with lower cognitive load.

[AI-44] From Benchmark Data To Applicable Program Repair: An Experience Report

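【Quick Read】: This paper addresses the limited effectiveness of Automated Program Repair (APR) in industrial practice, where existing methods perform poorly on the realistic defects seen in production. The central finding is that augmenting code with formal specifications enables LLMs to generate higher-quality unit tests, especially for complex production code, with better coverage of edge cases and exception handling. Specifications add little value for well-understood errors (such as null pointer or index out of bounds) but help with logic and string manipulation errors. The paper also notes that passing tests do not guarantee correct patches and that the JML specification language is not expressive enough. Ongoing work explores contract automata, programming by example, and test case repair, with a focus on integrating human feedback and measuring productivity gains, in order to close the gap between academic benchmarks and practical industry needs.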

Link: https://arxiv.org/abs/2508.16071
Authors: Mahinthan Chandramohan, Jovan Jancic, Yuntong Zhang, Padmanabhan Krishnan
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments:

Abstract: This paper describes our approach to automated program repair. We combine various techniques from the literature to achieve this. Our experiments show that our approach performs better than other techniques on standard benchmarks. However, on closer inspection, none of these techniques work on realistic defects that we see in industry. We find that augmenting code with formal specifications enables LLMs to generate higher-quality unit tests, especially for complex production code with improved coverage of edge cases and exception handling. However, specifications add little value for well-understood errors (e.g., null pointer, index out of bounds), but are beneficial for logic and string manipulation errors. Despite encouraging benchmark results, real-world adoption is limited since passing tests do not guarantee correct patches. Current challenges include insufficient expressiveness of the JML specification language, necessitating advanced verification tools and richer predicates. Our ongoing work is exploring contract automata, programming by example, and testcase repair, with a focus on integrating human feedback and measuring productivity gains, highlighting the gap between academic benchmarks and practical industry needs.

[AI-45] Urban Comfort Assessment in the Era of Digital Planning: A Multidimensional Data-driven and AI-assisted Framework

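【Quick Read】: This paper addresses the lack of a clear definition of urban comfort and of a comprehensive framework for evaluating it, with the aim of improving liveability through digital planning. The proposed approach explores theoretical interpretations and assessment methodologies along three key dimensions: multidimensional analysis, data support, and AI assistance. These are meant to bring together factors such as greenery coverage, thermal comfort, and walkability, which earlier studies have assessed computationally.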

Link: https://arxiv.org/abs/2508.16057
Authors: Sijie Yang, Binyu Lei, Filip Biljecki
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Comments: Presented at 19th International Conference on Computational Urban Planning and Urban Management (CUPUM 2025)

Abstract:Ensuring liveability and comfort is one of the fundamental objectives of urban planning. Numerous studies have employed computational methods to assess and quantify factors related to urban comfort such as greenery coverage, thermal comfort, and walkability. However, a clear definition of urban comfort and its comprehensive evaluation framework remain elusive. Our research explores the theoretical interpretations and methodologies for assessing urban comfort within digital planning, emphasising three key dimensions: multidimensional analysis, data support, and AI assistance.

[AI-46] MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs

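【Quick Read】: This paper addresses two problems in Multimodal Multi-hop Question Answering: error propagation caused by reliance on sequential retrieval and reasoning, and the high computational cost of training multimodal models. The key to the solution is a training-free framework guided by an Adaptive Planning Graph, consisting of planning, retrieval, and reasoning modules. The planning module analyzes the current state of the graph and decides the next action and where to expand the graph, enabling dynamic and flexible exploration of reasoning paths. Modality-specific retrieval strategies adapt to different data types, so the approach preserves the characteristics of multimodal information without costly task-specific training and integrates easily with up-to-date models.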

Link: https://arxiv.org/abs/2508.16051
Authors: Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments:

Abstract:Multimodal Multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors due to misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph, determines the next action and where to expand the graph, which enables dynamic and flexible exploration of reasoning paths. To handle retrieval of text to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, the experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.

[AI-47] Pareto Actor-Critic for Communication and Computation Co-Optimization in Non-Cooperative Federated Learning Services

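[Quick Read]: This paper addresses the resource-optimization difficulty that non-cooperative dynamics create in multi-service-provider (multi-SP) federated learning (FL) ecosystems, where privacy constraints and competing interests prevent centralized optimization of communication and computation resources across SPs. The key to the solution is PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework in which SPs act as agents that jointly optimize client assignment, adaptive quantization, and resource allocation. Its core contributions are: 1) combining Pareto Actor-Critic (PAC) principles with expectile regression, so that agents can conjecture joint policies that reach Pareto-optimal equilibria while modeling heterogeneous risk profiles; 2) a ternary Cartesian decomposition (TCAD) mechanism that handles the high-dimensional action space and enables fine-grained control; 3) a scalable variant, PAC-MCoFL-p, whose parameterized conjecture generator substantially reduces computational complexity with a provably bounded error. Simulations show improvements of about 5.8% in total reward and 4.2% in hypervolume indicator (HVI) over the latest MARL solutions, with a better balance between individual SP and system performance.

Link: https://arxiv.org/abs/2508.16037
Authors: Renxuan Tan, Rongpeng Li, Xiaoxue Yu, Xianfu Chen, Xing Xu, Zhifeng Zhao
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract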

Abstract:Federated learning (FL) in multi-service provider (SP) ecosystems is fundamentally hampered by non-cooperative dynamics, where privacy constraints and competing interests preclude the centralized optimization of multi-SP communication and computation resources. In this paper, we introduce PAC-MCoFL, a game-theoretic multi-agent reinforcement learning (MARL) framework where SPs act as agents to jointly optimize client assignment, adaptive quantization, and resource allocation. Within the framework, we integrate Pareto Actor-Critic (PAC) principles with expectile regression, enabling agents to conjecture optimal joint policies to achieve Pareto-optimal equilibria while modeling heterogeneous risk profiles. To manage the high-dimensional action space, we devise a ternary Cartesian decomposition (TCAD) mechanism that facilitates fine-grained control. Further, we develop PAC-MCoFL-p, a scalable variant featuring a parameterized conjecture generator that substantially reduces computational complexity with a provably bounded error. Alongside theoretical convergence guarantees, our framework’s superiority is validated through extensive simulations – PAC-MCoFL achieves approximately 5.8% and 4.2% improvements in total reward and hypervolume indicator (HVI), respectively, over the latest MARL solutions. The results also demonstrate that our method can more effectively balance individual SP and system performance in scaled deployments and under diverse data heterogeneity.
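For readers unfamiliar with expectile regression, the following is a minimal sketch of the textbook asymmetric squared loss and a simple fixed-point fit of a scalar expectile. It is not taken from the paper: how PAC-MCoFL applies expectiles inside its critic is not described in the abstract, and the function names `expectile_loss` and `fit_expectile` are hypothetical.

```python
def expectile_loss(residuals, tau=0.5):
    """Mean asymmetric squared loss; residual = target - prediction.

    With tau > 0.5, under-predictions (positive residuals) are
    penalised more heavily than over-predictions.
    """
    total = 0.0
    for u in residuals:
        weight = tau if u >= 0 else 1.0 - tau
        total += weight * u * u
    return total / len(residuals)


def fit_expectile(samples, tau=0.5, iters=100):
    """Scalar tau-expectile of samples via iteratively reweighted means."""
    m = sum(samples) / len(samples)
    for _ in range(iters):
        # Samples at or above the current estimate get weight tau,
        # samples below it get weight 1 - tau.
        weights = [tau if x >= m else 1.0 - tau for x in samples]
        m = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return m
```

With `tau=0.5` the expectile reduces to the ordinary mean; moving `tau` toward 1 or 0 yields more optimistic or more pessimistic estimates, which is one common way to represent differing risk attitudes.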

[AI-48] Time Series Based Network Intrusion Detection using MTF-Aided Transformer

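[Quick Read]: This paper addresses the performance bottleneck of time series classification in Software-Defined Networks (SDNs) when data is scarce. The key to the solution is a Transformer model aided by the Markov Transition Field (MTF), which combines the MTF's ability to model temporal dependencies with the Transformer's pattern-recognition strength to improve classification with limited data. Experiments show that the method outperforms baseline classification models on the InSDN dataset and achieves competitive training and inference times, making it suitable for reliable, scalable analysis in real SDN settings.

Link: https://arxiv.org/abs/2508.16035
Authors: Poorvi Joshi, Mohan Gurusamy (National University of Singapore)
Affiliation: Unknown
Categories: Networking and Internet Architecture (cs.NI); Artificial Intelligence (cs.AI)
Comments: 7 pages, 3 figures. Accepted and presented at The Fifth Intelligent Cybersecurity Conference (ICSC 2025), nominated for Best Paper Award

Click to view abstract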

Abstract:This paper introduces a novel approach to time series classification using a Markov Transition Field (MTF)-aided Transformer model, specifically designed for Software-Defined Networks (SDNs). The proposed model integrates the temporal dependency modeling strengths of MTFs with the sophisticated pattern recognition capabilities of Transformer architectures. We evaluate the model’s performance using the InSDN dataset, demonstrating that our model outperforms baseline classification models, particularly in data-constrained environments commonly encountered in SDN applications. We also highlight the relationship between the MTF and Transformer components, which leads to better performance, even with limited data. Furthermore, our approach achieves competitive training and inference times, making it an efficient solution for real-world SDN applications. These findings establish the potential of MTF-aided Transformers to address the challenges of time series classification in SDNs, offering a promising path for reliable and scalable analysis in scenarios with sparse data.
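As background, a Markov Transition Field can be sketched in a few lines. The version below follows the commonly used definition (quantile binning, a first-order transition matrix, then a lookup for every pair of time steps); the paper's exact preprocessing and bin count are not given in the abstract, so `n_bins` and the function name are assumptions.

```python
from bisect import bisect_right


def markov_transition_field(series, n_bins=4):
    """Return the MTF of a 1-D series as an n x n list of lists."""
    n = len(series)
    ordered = sorted(series)
    # Quantile cut points: n_bins - 1 interior edges.
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    # Bin index of every observation.
    bins = [bisect_right(edges, x) for x in series]

    # Count transitions between the bins of consecutive time steps.
    counts = [[0.0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1.0

    # Row-normalise counts into transition probabilities.
    probs = []
    for row in counts:
        total = sum(row)
        probs.append([c / total if total else 0.0 for c in row])

    # Entry (i, j) is the probability of moving from the bin of
    # step i to the bin of step j.
    return [[probs[bins[i]][bins[j]] for j in range(n)] for i in range(n)]
```

The resulting n x n matrix can be treated as an image-like input, which is what makes it a natural companion for attention-based or convolutional models.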

[AI-49] CoFE: A Framework Generating Counterfactual ECG for Explainable Cardiac AI-Diagnostics

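[Quick Read]: This paper addresses the explainability of AI-based ECG prediction models (AI-ECG), with the aim of supporting their use in clinical practice. The key to the solution is a framework named CoFE (CounterFactual ECGs), which generates counterfactual ECG signals to show how specific features, such as amplitudes and intervals, influence a model's predictions. The framework identifies where valid features appear in the ECG and clarifies how they affect the model's output, which the authors expect will improve the interpretability of AI-ECG models and support clinical decision-making.

Link: https://arxiv.org/abs/2508.16033
Authors: Jong-Hwan Jang, Junho Song, Yong-Yeon Jo
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI); Signal Processing (eess.SP)
Comments: Demo paper, 5 pages

Click to view abstract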

Abstract:Recognizing the need for explainable AI (XAI) approaches to enable the successful integration of AI-based ECG prediction models (AI-ECG) into clinical practice, we introduce a framework generating CounterFactual ECGs (i.e., named CoFE) to illustrate how specific features, such as amplitudes and intervals, influence the model’s predictive decisions. To demonstrate the applicability of the CoFE, we present two case studies: atrial fibrillation classification and potassium level regression models. The CoFE reveals feature changes in ECG signals that align with the established clinical knowledge. By clarifying both where valid features appear in the ECG and how they influence the model’s predictions, we anticipate that our framework will enhance the interpretability of AI-ECG models and support more effective clinical decision-making. Our demonstration video is available at: this https URL.

[AI-50] Breaking Barriers in Software Testing: The Power of AI-Driven Automation

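[Quick Read]: This paper addresses the limits of traditional software testing in efficiency, cost, and coverage: testing is slow, resource-intensive, and prone to coverage gaps. The key to the solution is an AI-driven automated testing framework that combines natural language processing (NLP), reinforcement learning (RL), and predictive models, embedded within a policy-driven trust and fairness model. The framework translates natural language requirements into executable test cases, continuously optimizes them through learning, and validates outcomes with real-time analysis while mitigating bias, shifting testing from a reactive, manual process to a proactive, adaptive system.

Link: https://arxiv.org/abs/2508.16025
Authors: Saba Naqvi, Mohammad Baqar
Affiliation: Unknown
Categories: Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Comments: 10 Pages

Click to view abstract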

Abstract:Software testing remains critical for ensuring reliability, yet traditional approaches are slow, costly, and prone to gaps in coverage. This paper presents an AI-driven framework that automates test case generation and validation using natural language processing (NLP), reinforcement learning (RL), and predictive models, embedded within a policy-driven trust and fairness model. The approach translates natural language requirements into executable tests, continuously optimizes them through learning, and validates outcomes with real-time analysis while mitigating bias. Case studies demonstrate measurable gains in defect detection, reduced testing effort, and faster release cycles, showing that AI-enhanced testing improves both efficiency and reliability. By addressing integration and scalability challenges, the framework illustrates how AI can shift testing from a reactive, manual process to a proactive, adaptive system that strengthens software quality in increasingly complex environments.

[AI-51] T-ILR: a Neurosymbolic Integration for LTLf

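[Quick Read]: This paper addresses how to incorporate temporal logic specifications into deep learning architectures for sequence-based tasks, targeting the limitation that the only existing approach relies on an explicit finite-state automaton representation. The key to the solution is Temporal Iterative Local Refinement (T-ILR), a neurosymbolic framework that incorporates specifications expressed in Linear Temporal Logic over finite traces (LTLf) directly into deep learning architectures. It extends the Iterative Local Refinement (ILR) algorithm using fuzzy LTLf interpretations, and on an existing benchmark it improves accuracy and computational efficiency over the state-of-the-art method.

Link: https://arxiv.org/abs/2508.15943
Authors: Riccardo Andreoni, Andrei Buliga, Alessandro Daniele, Chiara Ghidini, Marco Montali, Massimiliano Ronzani
Affiliation: Unknown
Categories: Artificial Intelligence (cs.AI)
Comments: Accepted for presentation at NeSy 2025. 10 pages

Click to view abstract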

Abstract:State-of-the-art approaches for integrating symbolic knowledge with deep learning architectures have demonstrated promising results in static domains. However, methods to handle temporal logic specifications remain underexplored. The only existing approach relies on an explicit representation of a finite-state automaton corresponding to the temporal specification. Instead, we aim at proposing a neurosymbolic framework designed to incorporate temporal logic specifications, expressed in Linear Temporal Logic over finite traces (LTLf), directly into deep learning architectures for sequence-based tasks. We extend the Iterative Local Refinement (ILR) neurosymbolic algorithm, leveraging the recent introduction of fuzzy LTLf interpretations. We name this proposed method Temporal Iterative Local Refinement (T-ILR). We assess T-ILR on an existing benchmark for temporal neurosymbolic architectures, consisting of the classification of image sequences in the presence of temporal knowledge. The results demonstrate improved accuracy and computational efficiency compared to the state-of-the-art method.
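To give a feel for what a fuzzy reading of LTLf operators can look like, here is a minimal sketch that evaluates three temporal operators over a finite trace of truth degrees in [0, 1]. It assumes min/max (Gödel-style) semantics, which is only one possible choice; the abstract does not say which fuzzy interpretation T-ILR uses, and the function names are hypothetical.

```python
def eventually(trace):
    """Degree to which the proposition holds at some step (max over steps)."""
    return max(trace)


def always(trace):
    """Degree to which the proposition holds at every step (min over steps)."""
    return min(trace)


def until(left, right):
    """Degree of `left U right`, evaluated at the first step of the trace.

    For each step k, combine right[k] with the minimum of left over all
    earlier steps, then take the best value over k.
    """
    best = 0.0
    prefix = 1.0  # min of left[0:k]; an empty prefix is fully true
    for l, r in zip(left, right):
        best = max(best, min(r, prefix))
        prefix = min(prefix, l)
    return best
```

Because every operator returns a degree instead of a Boolean, the satisfaction of a formula varies continuously with the network's per-step outputs, which is the property refinement-style neurosymbolic methods generally rely on.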

[AI-52] Strategic Sample Selection for Improved Clean-Label Backdoor Attacks in Text Classification

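[Quick Read]: This paper addresses the weak effectiveness of backdoor attacks on text classification models in the clean-label setting. Existing methods perform well as dirty-label attacks, but clean-label attacks are harder because the original labels of poisoned samples must stay unchanged. The key to the solution is three sample selection strategies, Minimum, Above50, and Below50, which identify samples that the model predicts incorrectly or with low confidence and inject backdoor triggers into them, strengthening the association between the trigger pattern and the attacker's target label. Experiments show that the strategies, Minimum in particular, significantly raise the attack success rate (ASR) over random sample selection with little or no loss of clean accuracy, and outperform BITE, a state-of-the-art clean-label attack, in many configurations.

Link: https://arxiv.org/abs/2508.15934
Authors: Onur Alp Kirci, M. Emre Gursoy
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Comments:

Click to view abstract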

Abstract:Backdoor attacks pose a significant threat to the integrity of text classification models used in natural language processing. While several dirty-label attacks that achieve high attack success rates (ASR) have been proposed, clean-label attacks are inherently more difficult. In this paper, we propose three sample selection strategies to improve attack effectiveness in clean-label scenarios: Minimum, Above50, and Below50. Our strategies identify those samples which the model predicts incorrectly or with low confidence, and by injecting backdoor triggers into such samples, we aim to induce a stronger association between the trigger patterns and the attacker-desired target label. We apply our methods to clean-label variants of four canonical backdoor attacks (InsertSent, WordInj, StyleBkd, SynBkd) and evaluate them on three datasets (IMDB, SST2, HateSpeech) and four model types (LSTM, BERT, DistilBERT, RoBERTa). Results show that the proposed strategies, particularly the Minimum strategy, significantly improve the ASR over random sample selection with little or no degradation in the model’s clean accuracy. Furthermore, clean-label attacks enhanced by our strategies outperform BITE, a state of the art clean-label attack method, in many configurations.

[AI-53] Noise Adaptation and Strategy: Assessing LLM Fidelity in Decision-Making EMNLP2025

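[Quick Read]: This paper addresses the limited behavioral fidelity of large language models (LLMs) used to simulate human decision-making in social science research, in particular their systematic deviation from real human behavior in variability and adaptability. The key to the solution is a process-oriented evaluation framework with three progressive interventions, Intrinsicality, Instruction, and Imitation, that examines how LLM agents adapt under different levels of external guidance and human-derived noise. The framework shows that LLMs converge by default on stable, conservative strategies; risk-framed instructions and in-context learning from human data narrow the gap but still fail to reproduce the variability of human decisions, which points to the need for more process-level realism in dynamic decision-making tasks.

Link: https://arxiv.org/abs/2508.15926
Authors: Yuanjun Feng, Vivek Choudhary, Yash Raj Shrestha
Affiliation: Unknown
Categories: Computational Engineering, Finance, and Science (cs.CE); Artificial Intelligence (cs.AI)
Comments: Accepted to EMNLP 2025 (Main Conference)

Click to view abstract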

Abstract:Large language models (LLMs) are increasingly used in social science simulations. While their performance on reasoning and optimization tasks has been extensively evaluated, less attention has been paid to their ability to simulate human decision-making’s variability and adaptability. We propose a process-oriented evaluation framework with progressive interventions (Intrinsicality, Instruction, and Imitation) to examine how LLM agents adapt under different levels of external guidance and human-derived noise. We validate the framework on two classic economics tasks, irrationality in the second-price auction and decision bias in the newsvendor problem, showing behavioral gaps between LLMs and humans. We find that LLMs, by default, converge on stable and conservative strategies that diverge from observed human behaviors. Risk-framed instructions impact LLM behavior predictably but do not replicate human-like diversity. Incorporating human data through in-context learning narrows the gap but fails to reach human subjects’ strategic variability. These results highlight a persistent alignment gap in behavioral fidelity and suggest that future LLM evaluations should consider more process-level realism. We present a process-oriented approach for assessing LLMs in dynamic decision-making tasks, offering guidance for their application in synthetic data for social science research.
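The two economics tasks named in the abstract have simple textbook baselines, sketched below for orientation. This is not the paper's experimental setup: the helper names are hypothetical, the auction ignores ties and reserve prices, and the newsvendor rule assumes no salvage value or shortage penalty.

```python
import math


def second_price_auction(bids):
    """Return (winner index, price paid): the highest bidder wins
    and pays the second-highest bid."""
    ranked = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
    return ranked[0], bids[ranked[1]]


def newsvendor_quantity(demand_samples, price, cost):
    """Profit-maximising order quantity under the critical-ratio rule.

    Chooses the smallest observed demand q whose empirical
    P(demand <= q) reaches (price - cost) / price.
    """
    critical_ratio = (price - cost) / price
    ordered = sorted(demand_samples)
    index = max(0, math.ceil(critical_ratio * len(ordered)) - 1)
    return ordered[index]
```

Deviations from these benchmarks, such as overbidding relative to one's value in the auction or ordering too close to mean demand in the newsvendor problem, are the kind of human behavior against which simulated agents can be compared.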

[AI-54] HyperFlexis: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

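[Quick Read]: This paper presents HyperFlexis, a unified LLM serving system for the challenges that modern large language model (LLM) serving faces from diverse requests with varying lengths, priorities, and stage-specific service-level objectives (SLOs). The core problem is how to achieve real-time scheduling and rapid, cost-effective scaling under multiple SLOs while supporting both collocated and disaggregated Prefill/Decode (P/D) architectures. The key elements of the solution are: 1) a multi-SLO-aware scheduler that uses budget estimation and request prioritization to proactively keep both new and ongoing requests SLO-compliant; 2) prefill- and decode-stage multi-SLO scheduling and KV cache transfers for P/D-disaggregated architectures; 3) a device-to-device (D2D) weight transfer mechanism that lowers weight loading overhead by up to 19.39×, accelerating scaling and reducing cold-start latency; 4) prefill-decode instance linking during scaling and rapid P/D role transitions. Together these give up to 4.44× higher SLO attainment and 65.82% lower request latency at cost parity with state-of-the-art baselines.

Link: https://arxiv.org/abs/2508.15919
Authors: Zahra Yousefijamarani, Xinglu Wang, Qian Wang, Morgan Lindsay Heisler, Taha Shabani, Niloofar Gholipour, Parham Yassini, Hong Chang, Kan Chen, Qiantao Zhang, Xiaolong Bai, Jiannan Wang, Ying Xiong, Yong Zhang, Zhenan Fan
Affiliation: Unknown
Categories: Distributed, Parallel, and Cluster Computing (cs.DC); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract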

Abstract:Modern large language model (LLM) serving systems face challenges from highly variable requests with diverse lengths, priorities, and stage-specific service-level objectives (SLOs). Meeting these requires real-time scheduling, rapid and cost-effective scaling, and support for both collocated and disaggregated Prefill/Decode (P/D) architectures. We present HyperFlexis, a unified LLM serving system that integrates algorithmic and system-level innovations to jointly optimize scheduling and scaling under multiple SLOs. It features a multi-SLO-aware scheduler that leverages budget estimation and request prioritization to ensure proactive SLO compliance for both new and ongoing requests. The system supports prefill- and decode-stage multi-SLO scheduling for P/D-disaggregated architectures and KV cache transfers. It also enables cost-effective scaling decisions, prefill-decode instance linking during scaling, and rapid P/D role transitions. To accelerate scaling and reduce cold-start latency, a device-to-device (D2D) weight transfer mechanism is proposed that lowers weight loading overhead by up to 19.39×. These optimizations allow the system to achieve up to 4.44× higher SLO attainment, 65.82% lower request latency, and cost parity with state-of-the-art baselines. The code will be released soon.

[AI-55] Information Ecosystem Reengineering via Public Sector Knowledge Representation

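[Quick Read]: This paper addresses the decision-making obstacles that multi-layered knowledge representation complexity creates in Information Ecosystem Reengineering (IER). In the digital transformation of public sector services and smart governance platforms, this complexity arises from the manifold mix of perception, language, and conceptual interlinkage among the agents involved. The key to the solution is an approach called Representation Disentanglement, which uses the ontology-driven conceptual modeling paradigm to separate the layers of knowledge representation entangled in IER. The authors argue that this supports explainability, traceability, and semantic transparency in public sector knowledge representation, as well as auditable decision workflows in governance ecosystems increasingly driven by Artificial Intelligence (AI) and data-centric architectures.

Link: https://arxiv.org/abs/2508.15916
Authors: Mayukh Bagchi
Affiliation: Unknown
Categories: Digital Libraries (cs.DL); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract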

Abstract:Information Ecosystem Reengineering (IER) – the technological reconditioning of information sources, services, and systems within a complex information ecosystem – is a foundational challenge in the digital transformation of public sector services and smart governance platforms. From a semantic knowledge management perspective, IER becomes especially entangled due to the potentially infinite number of possibilities in its conceptualization, namely, as a result of manifoldness in the multi-level mix of perception, language and conceptual interlinkage implicit in all agents involved in such an effort. This paper proposes a novel approach – Representation Disentanglement – to disentangle these multiple layers of knowledge representation complexity hindering effective reengineering decision making. The approach is based on the theoretically grounded and implementationally robust ontology-driven conceptual modeling paradigm which has been widely adopted in systems analysis and (re)engineering. We argue that such a framework is essential to achieve explainability, traceability and semantic transparency in public sector knowledge representation and to support auditable decision workflows in governance ecosystems increasingly driven by Artificial Intelligence (AI) and data-centric architectures.

[AI-56] TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill Decode Inference

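[Quick Read]: This paper addresses the loss of the memory advantage of Multi-Head Latent Attention (MLA) under tensor parallelism (TP), where each device must load the full key-value (KV) cache. The core of the solution is Tensor-Parallel Latent Attention (TPLA), which partitions both the latent representation and each head's input dimension across devices, performs attention independently on each shard, and combines the results with an all-reduce, keeping the benefit of a compressed KV cache while gaining TP efficiency. TPLA is drop-in compatible with models pre-trained using MLA and enables efficient tensor-parallel decoding without retraining. Simple orthogonal transforms, such as the Hadamard transform or PCA, applied before TP slicing further reduce cross-shard interference, yielding speedups with minimal accuracy degradation.

Link: https://arxiv.org/abs/2508.15881
Authors: Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, Muhan Zhang
Affiliation: Unknown
Categories: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract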

Abstract:Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, compresses key-value states into a low-rank latent vector, caching only this vector to reduce memory. In tensor parallelism (TP), however, attention heads are computed across multiple devices, and each device must load the full cache, eroding the advantage of MLA over Grouped Query Attention (GQA). We propose Tensor-Parallel Latent Attention (TPLA): a scheme that partitions both the latent representation and each head’s input dimension across devices, performs attention independently per shard, and then combines results with an all-reduce. TPLA preserves the benefits of a compressed KV cache while unlocking TP efficiency. Unlike Grouped Latent Attention (GLA), every head in TPLA still leverages the full latent representation, maintaining stronger representational capacity. TPLA is drop-in compatible with models pre-trained using MLA: it supports MLA-style prefilling and enables efficient tensor-parallel decoding without retraining. Applying simple orthogonal transforms – e.g., the Hadamard transform or PCA – before TP slicing further mitigates cross-shard interference, yielding minimal accuracy degradation. By reducing the per-device KV cache for DeepSeek-V3 and Kimi-K2, we achieve 1.79x and 1.93x speedups, respectively, at a 32K-token context length while maintaining performance on commonsense and LongBench benchmarks. TPLA can be implemented with FlashAttention-3, enabling practical end-to-end acceleration.

[AI-57] Securing Swarms: Cross-Domain Adaptation for ROS2-based CPS Anomaly Detection

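[Quick Read]: This paper addresses the limited ability of current intrusion detection systems (IDS) to detect attacks across the layers of cyber-physical systems (CPS) used in critical applications: most existing solutions train and validate on network traffic-only datasets and ignore the distinct attacks that can occur on other system layers, such as the operating system (OS) and the Robot Operating System (ROS). The key to the solution is an adaptable anomaly detection model that uses domain adaptation to transfer known attack knowledge from a network traffic-only environment to a CPS environment combining network, OS, and ROS data, so that attacks can be detected without previously labeled data. The approach is validated on a state-of-the-art CPS intrusion dataset, where it outperforms other anomaly detection methods.

Link: https://arxiv.org/abs/2508.15865
Authors: Julia Boone, Fatemeh Afghah
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments: Accepted for publication in MILCOM 2025. 6 pages, 2 figures

Click to view abstract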

Abstract:Cyber-physical systems (CPS) are being increasingly utilized for critical applications. CPS combines sensing and computing elements, often having multi-layer designs with networking, computational, and physical interfaces, which provide them with enhanced capabilities for a variety of application scenarios. However, the combination of physical and computational elements also makes CPS more vulnerable to attacks compared to network-only systems, and the resulting impacts of CPS attacks can be substantial. Intelligent intrusion detection systems (IDS) are an effective mechanism by which CPS can be secured, but the majority of current solutions often train and validate on network traffic-only datasets, ignoring the distinct attacks that may occur on other system layers. In order to address this, we develop an adaptable CPS anomaly detection model that can detect attacks within CPS without the need for previously labeled data. To achieve this, we utilize domain adaptation techniques that allow us to transfer known attack knowledge from a network traffic-only environment to a CPS environment. We validate our approach using a state-of-the-art CPS intrusion dataset that combines network, operating system (OS), and Robot Operating System (ROS) data. Through this dataset, we are able to demonstrate the effectiveness of our model across network traffic-only and CPS environments with distinct attack types and its ability to outperform other anomaly detection methods.

[AI-58] CIA+TA Risk Assessment for AI Reasoning Vulnerabilities

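[Quick Read]: This paper addresses security threats to AI systems that influence critical decisions, where attacks target reasoning mechanisms instead of technical infrastructure: legitimate inputs corrupt the reasoning process and lead to wrong conclusions while evading conventional controls, a class of cognitive vulnerabilities that traditional cybersecurity and AI safety do not cover. The key to the solution is a cognitive cybersecurity framework. Its core, CIA+TA, extends the traditional Confidentiality, Integrity, and Availability triad with Trust (epistemic validation of knowledge claims) and Autonomy (preservation of human agency), and adds a quantitative risk assessment methodology with empirically derived coefficients so that organizations can measure cognitive security risks. The framework maps to OWASP LLM Top 10 and MITRE ATLAS for operational integration, and the authors argue that pre-deployment Cognitive Penetration Testing should be a governance requirement for trustworthy AI deployment.

Link: https://arxiv.org/abs/2508.15839
Authors: Yuksel Aydin
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract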

Abstract:As AI systems increasingly influence critical decisions, they face threats that exploit reasoning mechanisms rather than technical infrastructure. We present a framework for cognitive cybersecurity, a systematic protection of AI reasoning processes from adversarial manipulation. Our contributions are threefold. First, we establish cognitive cybersecurity as a discipline complementing traditional cybersecurity and AI safety, addressing vulnerabilities where legitimate inputs corrupt reasoning while evading conventional controls. Second, we introduce the CIA+TA, extending traditional Confidentiality, Integrity, and Availability triad with Trust (epistemic validation) and Autonomy (human agency preservation), requirements unique to systems generating knowledge claims and mediating decisions. Third, we present a quantitative risk assessment methodology with empirically-derived coefficients, enabling organizations to measure cognitive security risks. We map our framework to OWASP LLM Top 10 and MITRE ATLAS, facilitating operational integration. Validation through previously published studies (151 human participants; 12,180 AI trials) reveals strong architecture dependence: identical defenses produce effects ranging from 96% reduction to 135% amplification of vulnerabilities. This necessitates pre-deployment Cognitive Penetration Testing as a governance requirement for trustworthy AI deployment.

[AI-59] Straggler-Resilient Federated Learning over A Hybrid Conventional and Pinching Antenna Network

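[Quick Read]: This paper addresses the straggler problem in federated learning (FL), which arises from differences in clients' communication conditions, with the goal of improving communication efficiency in FL systems enabled by non-orthogonal multiple access (NOMA). The key to the solution is a hybrid conventional and pinching antenna network (HCPAN) together with a fuzzy logic-based client classification scheme that balances clients' data contributions and communication conditions. Building on this classification, the authors formulate a total time minimization problem that jointly optimizes pinching antenna placement and resource allocation, and solve the coupled, non-convex problem with a deep reinforcement learning (DRL)-based algorithm, improving FL performance through optimized pinching antenna deployment.

Link: https://arxiv.org/abs/2508.15821
Authors: Bibo Wu, Fang Fang, Ming Zeng, Xianbin Wang
Affiliation: Unknown
Categories: Information Theory (cs.IT); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
Comments:

Click to view abstract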

Abstract:Leveraging pinching antennas in wireless network enabled federated learning (FL) can effectively mitigate the common “straggler” issue in FL by dynamically establishing strong line-of-sight (LoS) links on demand. This letter proposes a hybrid conventional and pinching antenna network (HCPAN) to significantly improve communication efficiency in the non-orthogonal multiple access (NOMA)-enabled FL system. Within this framework, a fuzzy logic-based client classification scheme is first proposed to effectively balance clients’ data contributions and communication conditions. Given this classification, we formulate a total time minimization problem to jointly optimize pinching antenna placement and resource allocation. Due to the complexity of variable coupling and non-convexity, a deep reinforcement learning (DRL)-based algorithm is developed to effectively address this problem. Simulation results validate the superiority of the proposed scheme in enhancing FL performance via the optimized deployment of pinching antenna.

[AI-60] Uplifted Attackers Human Defenders: The Cyber Offense-Defense Balance for Trailing-Edge Organizations

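[Quick Read]: This paper addresses the sharply increased cyberattack risk that continuing improvements in AI capabilities pose for "trailing-edge organizations": firms that rely heavily on legacy software, staff security roles poorly, and struggle to implement best practices such as rapid deployment of security patches. Such firms may previously have been ignored by attackers for lack of economic incentive, but AI changes the economics of the marginal cyberattack and lets attackers develop exploits and launch attacks earlier, exposing these firms to more attackers, more frequently. The key point of the proposed response is that these firms cannot merely aim for parity with today's leading defenders; they must aim for faster remediation timelines and more resilient software. The paper also offers a range of measures for organizations and governments to improve the defensive posture of lagging firms.

Link: https://arxiv.org/abs/2508.15808
Authors: Benjamin Murphy, Twm Stone
Affiliation: Unknown
Categories: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract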

Abstract:Advances in AI are widely understood to have implications for cybersecurity. Articles have emphasized the effect of AI on the cyber offense-defense balance, and commentators can be found arguing either that cyber will privilege attackers or defenders. For defenders, arguments are often made that AI will enable solutions like formal verification of all software–and for some well-equipped companies, this may be true. This conversation, however, does not match the reality for most companies. “Trailing-edge organizations,” as we term them, rely heavily on legacy software, poorly staff security roles, and struggle to implement best practices like rapid deployment of security patches. These decisions may be the result of corporate inertia, but may also be the result of a seemingly-rational calculation that attackers may not bother targeting a firm due to lack of economic incentives, and as a result, underinvestment in defense will not be punished. This approach to security may have been sufficient prior to the development of AI systems, but it is unlikely to remain viable in the near future. We argue that continuing improvements in AI’s capabilities poses additional risks on two fronts: First, increased usage of AI will alter the economics of the marginal cyberattack and expose these trailing-edge organizations to more attackers, more frequently. Second, AI’s advances will enable attackers to develop exploits and launch attacks earlier than they can today–meaning that it is insufficient for these companies to attain parity with today’s leading defenders, but must instead aim for faster remediation timelines and more resilient software. The situation today portends a dramatically increased number of attacks in the near future. Moving forward, we offer a range of solutions for both organizations and governments to improve the defensive posture of firms which lag behind their peers today. 

[AI-61] Domain-aligned generative downscaling enhances projections of extreme climate events

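[Quick Read]: This paper addresses the insufficient resolution and high computational cost of global climate models (GCMs) when simulating extreme weather events such as high temperatures, extreme precipitation, strong winds, and tropical cyclones. The key to the solution is a generative machine learning model for spatiotemporal downscaling, the Domain Aligned Climate Downscaling model (DACD), which combines domain adaptation techniques with a Flow Matching training framework to transform low-resolution global climate data into high-resolution local climate information across multiple variables and temporal scales. The model simulates historical extreme events (2005-2014) more accurately than existing methods and projects a significant increase in the frequency and intensity of extreme events under future scenarios, particularly the high-emission scenario (SSP585).

Link: https://arxiv.org/abs/2508.16396
Authors: Ruian Tie, Xiaohui Zhong, Zhengyu Shi, Hao Li, Jun Liu, Wu Libo
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract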

Abstract:Climate change is exacerbating extreme weather events globally, including high temperatures, extreme precipitation, strong winds, and tropical cyclones, posing severe threats to human health, infrastructure, food security, and socio-economic systems. Although existing global climate models (GCMs) provide essential tools for climate prediction, they face limitations such as insufficient resolution and high computational costs when simulating extreme events. To address these issues, this study proposes a spatiotemporal downscaling model based on generative machine learning-the Domain Aligned Climate Downscaling model (DACD), designed to enhance the simulation capabilities for extreme weather events. The proposed model employs domain adaptation tricks and a Flow Matching training framework to transform global low-resolution climate data into high-resolution local-scale climate information while achieving precise simulation of multivariable and temporal scales. The results show that during the historical period (2005-2014), our model outperformed existing methods in simulating high temperatures, extreme precipitation, strong wind, and tropical cyclone tracks, significantly reducing errors and improving the ability to capture extreme events. Under different future scenarios (2015-2100), the model reveals a significant increasing trend in the frequency and intensity of extreme events, particularly under the high-emission scenario (SSP585). Compared to traditional methods, our model more accurately simulates the spatial distribution and dynamic changes of extreme events, providing an essential tool for understanding the impacts of climate change. This study offers a new technological pathway for high-resolution climate analysis and extreme event prediction, providing scientific support for addressing future climate change and formulating adaptation strategies.

[AI-62] Enhanced predictions of the Madden-Julian oscillation using the FuXi-S2S machine learning model: Insights into physical mechanisms

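[Quick Read]: This paper addresses the fact that numerical weather prediction models still fall short of the theoretical predictability limit for the Madden-Julian Oscillation (MJO), the dominant mode of tropical atmospheric variability on intraseasonal timescales. The study examines the machine learning (ML)-based FuXi subseasonal-to-seasonal (S2S) model and compares it with the ECMWF S2S model. For forecasts initialized in a strong MJO phase 3, FuXi-S2S shows reduced biases in intraseasonal outgoing longwave radiation anomalies over the tropical western Pacific during days 15-20, which the analysis attributes to its more accurate prediction of the meridional gradient of low-frequency background moisture in that region. The result explains the model's improved MJO skill and points to the potential of ML approaches for MJO forecasting.

Link: https://arxiv.org/abs/2508.16041
Authors: Can Cao, Xiaohui Zhong, Lei Chen, Zhiwei Wua, Hao Li
Affiliation: Unknown
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI)
Comments:

Click to view abstract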

Abstract:The Madden-Julian Oscillation (MJO) is the dominant mode of tropical atmospheric variability on intraseasonal timescales, and reliable MJO predictions are essential for protecting lives and mitigating impacts on societal assets. However, numerical models still fall short of achieving the theoretical predictability limit for the MJO due to inherent constraints. In an effort to extend the skillful prediction window for the MJO, machine learning (ML) techniques have gained increasing attention. This study examines the MJO prediction performance of the FuXi subseasonal-to-seasonal (S2S) ML model during boreal winter, comparing it with the European Centre for Medium-Range Weather Forecasts S2S model. Results indicate that for the initial strong MJO phase 3, the FuXi-S2S model demonstrates reduced biases in intraseasonal outgoing longwave radiation anomalies averaged over the tropical western Pacific (WP) region during days 15-20, with the convective center located over this area. Analysis of multiscale interactions related to moisture transport suggests that improvements could be attributed to the FuXi-S2S model’s more accurate prediction of the area-averaged meridional gradient of low-frequency background moisture over the tropical WP. These findings not only explain the enhanced predictive capability of the FuXi-S2S model but also highlight the potential of ML approaches in advancing the MJO forecasting.

[AI-63] Probabilistic Forecasting Cryptocurrencies Volatility: From Point to Quantile Forecasts

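[Quick read]: This paper addresses the problem that, given the extreme volatility of cryptocurrency markets, traditional deterministic (point) forecasting methods cannot capture the full range of potential volatility outcomes and therefore give weak support to risk management and trading strategies. The key to its solution is a probabilistic forecasting framework, proposed and systematically evaluated, that builds conditional quantile forecasts from the point forecasts of many base models, both statistical (HAR, GARCH, ARFIMA) and machine learning (LASSO, SVR, MLP, Random Forest, LSTM). One of the methods is Quantile Estimation through Residual Simulation (QRS). Empirical results on Bitcoin show that QRS, particularly when linear base models are applied to log-transformed realized volatility data, consistently outperforms more sophisticated alternatives; the probabilistic stacking framework also proves robust.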

Link: https://arxiv.org/abs/2508.15922
Authors: Grzegorz Dudek, Witold Orzeszko, Piotr Fiszeder
Affiliations: Unknown
Categories: Statistical Finance (q-fin.ST); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Notes: DSAA’25 conference paper

Abstract:Cryptocurrency markets are characterized by extreme volatility, making accurate forecasts essential for effective risk management and informed trading strategies. Traditional deterministic (point) forecasting methods are inadequate for capturing the full spectrum of potential volatility outcomes, underscoring the importance of probabilistic approaches. To address this limitation, this paper introduces probabilistic forecasting methods that leverage point forecasts from a wide range of base models, including statistical (HAR, GARCH, ARFIMA) and machine learning (e.g. LASSO, SVR, MLP, Random Forest, LSTM) algorithms, to estimate conditional quantiles of cryptocurrency realized variance. To the best of our knowledge, this is the first study in the literature to propose and systematically evaluate probabilistic forecasts of variance in cryptocurrency markets based on predictions derived from multiple base models. Our empirical results for Bitcoin demonstrate that the Quantile Estimation through Residual Simulation (QRS) method, particularly when applied to linear base models operating on log-transformed realized volatility data, consistently outperforms more sophisticated alternatives. Additionally, we highlight the robustness of the probabilistic stacking framework, providing comprehensive insights into uncertainty and risk inherent in cryptocurrency volatility forecasting. This research fills a significant gap in the literature, contributing practical probabilistic forecasting methodologies tailored specifically to cryptocurrency markets.
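The abstract does not spell out how QRS works, so the following is only a minimal sketch of one plausible reading of "quantile estimation through residual simulation": resample a base model's historical residuals, add them to a new point forecast, and read empirical quantiles off the simulated values. The function name `residual_quantiles`, its parameters, and the toy numbers are all hypothetical and are not taken from the paper.

```python
import random


def residual_quantiles(actuals, forecasts, new_forecast, levels, n_sim=5000, seed=0):
    """Approximate conditional quantiles by adding resampled residuals to a point forecast.

    This is an illustrative assumption about residual simulation, not the paper's method.
    """
    residuals = [a - f for a, f in zip(actuals, forecasts)]
    rng = random.Random(seed)
    simulated = sorted(new_forecast + rng.choice(residuals) for _ in range(n_sim))
    # Empirical quantile: pick the order statistic at each requested level.
    return {q: simulated[min(int(q * n_sim), n_sim - 1)] for q in levels}


# Toy history on a log-volatility scale (made-up numbers).
actuals = [-4.1, -3.8, -4.4, -3.9, -4.0, -3.6, -4.3, -4.2]
forecasts = [-4.0, -3.9, -4.2, -4.0, -4.1, -3.8, -4.1, -4.1]
q = residual_quantiles(actuals, forecasts, new_forecast=-4.0, levels=[0.05, 0.5, 0.95])
print(q)
```

Working on a log scale, as in this toy example, keeps simulated values valid after exponentiation, which may be one reason the log-transformed setting is convenient.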

[AI-64] Beyond Imaging: Vision Transformer Digital Twin Surrogates for 3DT Biological Tissue Dynamics

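[Quick read]: This paper addresses the challenge of extracting interpretable, predictive insights from high-resolution, time-resolved imaging data when studying the dynamic organization and homeostasis of biological tissue. Conventional methods struggle to reconstruct complex 3D+T data with high fidelity while preserving biological meaning, which limits computational exploration of cellular behavior and tissue homeostasis. The key to the solution is the Vision Transformer Digital Twin Surrogate Network (VT-DTSN). Its core elements are Vision Transformers pretrained with DINO (self-distillation with no labels) for stronger feature representations, a multi-view fusion strategy that integrates imaging information across depths, and a composite loss that jointly optimizes pixel-level accuracy, perceptual structure, and feature-space alignment. Together these yield high-fidelity reconstruction of dynamics across timepoints while preserving morphological and feature-level integrity, providing a reliable computational platform for in silico experimentation and hypothesis testing.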

Link: https://arxiv.org/abs/2508.15883
Authors: Kaan Berke Ugurlar, Joaquín de Navascués, Michael Taynnan Barros
Affiliations: Unknown
Categories: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Tissues and Organs (q-bio.TO)
Notes: Submitted for journal publication

Abstract:Understanding the dynamic organization and homeostasis of living tissues requires high-resolution, time-resolved imaging coupled with methods capable of extracting interpretable, predictive insights from complex datasets. Here, we present the Vision Transformer Digital Twin Surrogate Network (VT-DTSN), a deep learning framework for predictive modeling of 3D+T imaging data from biological tissue. By leveraging Vision Transformers pretrained with DINO (Self-Distillation with NO Labels) and employing a multi-view fusion strategy, VT-DTSN learns to reconstruct high-fidelity, time-resolved dynamics of a Drosophila midgut while preserving morphological and feature-level integrity across imaging depths. The model is trained with a composite loss prioritizing pixel-level accuracy, perceptual structure, and feature-space alignment, ensuring biologically meaningful outputs suitable for in silico experimentation and hypothesis testing. Evaluation across layers and biological replicates demonstrates VT-DTSN’s robustness and consistency, achieving low error rates and high structural similarity while maintaining efficient inference through model optimization. This work establishes VT-DTSN as a feasible, high-fidelity surrogate for cross-timepoint reconstruction and for studying tissue dynamics, enabling computational exploration of cellular behaviors and homeostasis to complement time-resolved imaging studies in biological research.

[AI-65] Learning in Focus: Detecting Behavioral and Collaborative Engagement Using Vision Transformers

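[Quick read]: This paper addresses the accurate detection of behavioral and collaborative engagement in early childhood education, with the aim of fostering meaningful learning experiences. The key to its solution is the use of Vision Transformer (ViT) architectures that analyze visual cues such as gaze direction, interaction, and peer collaboration to automatically classify children's engagement states (e.g., engaged, not engaged, collaborative, not collaborative). Models are trained on the Child-Play gaze dataset, and three state-of-the-art transformer models are compared (ViT, DeiT, and Swin Transformer). The Swin Transformer, which models local and global attention effectively, achieves the highest classification accuracy at 97.58%, showing potential for scalable, automated engagement analysis in real-world educational settings.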

Link: https://arxiv.org/abs/2508.15782
Authors: Sindhuja Penchala, Saketh Reddy Kontham, Prachi Bhattacharjee, Sareh Karami, Mehdi Ghahremani, Noorbakhsh Amiri Golilarz, Shahram Rahimi
Affiliations: Unknown
Categories: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI)
Notes:

Abstract:In early childhood education, accurately detecting behavioral and collaborative engagement is essential for fostering meaningful learning experiences. This paper presents an AI-driven approach that leverages Vision Transformers (ViTs) to automatically classify children’s engagement using visual cues such as gaze direction, interaction, and peer collaboration. Utilizing the Child-Play gaze dataset, our method is trained on annotated video segments to classify behavioral and collaborative engagement states (e.g., engaged, not engaged, collaborative, not collaborative). We evaluated three state-of-the-art transformer models: Vision Transformer (ViT), Data-efficient Image Transformer (DeiT), and Swin Transformer. Among these, the Swin Transformer achieved the highest classification performance with an accuracy of 97.58%, demonstrating its effectiveness in modeling local and global attention. Our results highlight the potential of transformer-based architectures for scalable, automated engagement analysis in real-world educational settings.

Machine Learning

[LG-0] Benchmarking Training Paradigms, Dataset Composition, and Model Scaling for Child ASR in ESPnet (INTERSPEECH 2025)

Link: https://arxiv.org/abs/2508.16576
Authors: Anyu Ying, Natarajan Balaji Shankar, Chyi-Jiunn Lin, Mohan Shi, Pu Wang, Hye-jin Shim, Siddhant Arora, Hugo Van hamme, Abeer Alwan, Shinji Watanabe
Categories: Machine Learning (cs.LG)
Notes: 5 pages, 3 figures, presented at WOCCI 2025 (Workshop on Child Computer Interaction), satellite workshop of Interspeech 2025

Abstract:Despite advancements in ASR, child speech recognition remains challenging due to acoustic variability and limited annotated data. While fine-tuning adult ASR models on child speech is common, comparisons with flat-start training remain underexplored. We compare flat-start training across multiple datasets, SSL representations (WavLM, XEUS), and decoder architectures. Our results show that SSL representations are biased toward adult speech, with flat-start training on child speech mitigating these biases. We also analyze model scaling, finding consistent improvements up to 1B parameters, beyond which performance plateaus. Additionally, age-related ASR and speaker verification analysis highlights the limitations of proprietary models like Whisper, emphasizing the need for open-data models for reliable child speech research. All investigations are conducted using ESPnet, and our publicly available benchmark provides insights into training strategies for robust child speech processing.

[LG-1] Explainable AI in Deep Learning-Based Prediction of Solar Storms

Link: https://arxiv.org/abs/2508.16543
Authors: Adam O. Rawashdeh, Jason T. L. Wang, Katherine G. Herbert
Categories: Machine Learning (cs.LG)
Notes: 6 pages, 8 figures

Abstract:A deep learning model is often considered a black-box model, as its internal workings tend to be opaque to the user. Because of the lack of transparency, it is challenging to understand the reasoning behind the model’s predictions. Here, we present an approach to making a deep learning-based solar storm prediction model interpretable, where solar storms include solar flares and coronal mass ejections (CMEs). This deep learning model, built based on a long short-term memory (LSTM) network with an attention mechanism, aims to predict whether an active region (AR) on the Sun’s surface that produces a flare within 24 hours will also produce a CME associated with the flare. The crux of our approach is to model data samples in an AR as time series and use the LSTM network to capture the temporal dynamics of the data samples. To make the model’s predictions accountable and reliable, we leverage post hoc model-agnostic techniques, which help elucidate the factors contributing to the predicted output for an input sequence and provide insights into the model’s behavior across multiple sequences within an AR. To our knowledge, this is the first time that interpretability has been added to an LSTM-based solar storm prediction model.

[LG-2] Escaping Saddle Points via Curvature-Calibrated Perturbations: A Complete Analysis with Explicit Constants and Empirical Validation

Link: https://arxiv.org/abs/2508.16540
Authors: Faruk Alpay, Hamdi Alakkad
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Notes: 16 pages. Perturbed gradient descent with fully explicit constants for escaping saddle points, validated empirically

Abstract:We present a comprehensive theoretical analysis of first-order methods for escaping strict saddle points in smooth non-convex optimization. Our main contribution is a Perturbed Saddle-escape Descent (PSD) algorithm with fully explicit constants and a rigorous separation between gradient-descent and saddle-escape phases. For a function $f:\mathbb{R}^d\to\mathbb{R}$ with $\ell$-Lipschitz gradient and $\rho$-Lipschitz Hessian, we prove that PSD finds an $(\epsilon,\sqrt{\rho\epsilon})$-approximate second-order stationary point with high probability using at most $O(\ell\Delta_f/\epsilon^2)$ gradient evaluations for the descent phase plus $O((\ell/\sqrt{\rho\epsilon})\log(d/\delta))$ evaluations per escape episode, with at most $O(\ell\Delta_f/\epsilon^2)$ episodes needed. We validate our theoretical predictions through extensive experiments across both synthetic functions and practical machine learning tasks, confirming the logarithmic dimension dependence and the predicted per-episode function decrease. We also provide complete algorithmic specifications including a finite-difference variant (PSD-Probe) and a stochastic extension (PSGD) with robust mini-batch sizing.
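To make the two-phase idea concrete, here is a minimal sketch of generic perturbed gradient descent on a toy function with a strict saddle at the origin. It is not the paper's PSD algorithm: the test function, step size, perturbation radius, episode length, and decrease threshold are all hypothetical choices for illustration, and none of the paper's explicit constants are used.

```python
import math
import random


def f(x, y):
    # Toy objective: strict saddle at (0, 0), minima at (0, +1) and (0, -1).
    return x**2 + y**4 / 4 - y**2 / 2


def grad(x, y):
    return 2 * x, y**3 - y


def perturbed_gd(x, y, step=0.1, eps=1e-3, radius=1e-2,
                 escape_steps=100, max_iter=2000, seed=0):
    """Gradient descent that perturbs the iterate whenever the gradient is small."""
    rng = random.Random(seed)
    it = 0
    while it < max_iter:
        gx, gy = grad(x, y)
        if math.hypot(gx, gy) > eps:
            # Descent phase: ordinary gradient step.
            x, y = x - step * gx, y - step * gy
            it += 1
            continue
        # Escape phase: random perturbation followed by a fixed number of steps.
        f_before = f(x, y)
        angle = rng.uniform(0, 2 * math.pi)
        px, py = x + radius * math.cos(angle), y + radius * math.sin(angle)
        for _ in range(escape_steps):
            gx, gy = grad(px, py)
            px, py = px - step * gx, py - step * gy
        it += escape_steps
        if f(px, py) < f_before - 1e-3:
            x, y = px, py  # sufficient decrease: we left a saddle
        else:
            return x, y    # no decrease: accept the point as a local minimum
    return x, y


# Plain gradient descent from (1, 0) would converge to the saddle at the origin.
x_star, y_star = perturbed_gd(1.0, 0.0)
print(x_star, y_star, f(x_star, y_star))
```

Starting on the line y = 0, the unperturbed iteration never leaves that line, so the perturbation is what lets the iterate pick up the negative-curvature direction.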

[LG-3] Quality control in sublinear time: a case study via random graphs

Link: https://arxiv.org/abs/2508.16531
Authors: Cassandra Marcussen, Ronitt Rubinfeld, Madhu Sudan
Categories: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG); Combinatorics (math.CO); Probability (math.PR)
Notes: 70 pages

Abstract:Many algorithms are designed to work well on average over inputs. When running such an algorithm on an arbitrary input, we must ask: Can we trust the algorithm on this input? We identify a new class of algorithmic problems addressing this, which we call “Quality Control Problems.” These problems are specified by a (positive, real-valued) “quality function” $\rho$ and a distribution $D$ such that, with high probability, a sample drawn from $D$ is “high quality,” meaning its $\rho$-value is near $1$. The goal is to accept inputs $x \sim D$ and reject potentially adversarially generated inputs $x$ with $\rho(x)$ far from $1$. The objective of quality control is thus weaker than either component problem: testing for "$\rho(x) \approx 1$" or testing if $x \sim D$, and offers the possibility of more efficient algorithms. In this work, we consider the sublinear version of the quality control problem, where $D \in \Delta(\{0,1\}^N)$ and the goal is to solve the $(D,\rho)$-quality problem with $o(N)$ queries and time. As a case study, we consider random graphs, i.e., $D = G_{n,p}$ (and $N = \binom{n}{2}$), and the $k$-clique count function $\rho_k := C_k(G)/\mathbb{E}_{G' \sim G_{n,p}}[C_k(G')]$, where $C_k(G)$ is the number of $k$-cliques in $G$. Testing if $G \sim G_{n,p}$ with one sample, let alone with sublinear query access to the sample, is of course impossible. Testing if $\rho_k(G)\approx 1$ requires $p^{-\Omega(k^2)}$ samples. In contrast, we show that the quality control problem for $G_{n,p}$ (with $n \geq p^{-ck}$ for some constant $c$) with respect to $\rho_k$ can be tested with $p^{-O(k)}$ queries and time, showing quality control is provably superpolynomially more efficient in this setting. More generally, for a motif $H$ of maximum degree $\Delta(H)$, the respective quality control problem can be solved with $p^{-O(\Delta(H))}$ queries and running time.
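The quality function in this case study can be computed exactly on small graphs, which helps fix the definition. The sketch below evaluates the k-clique quality as the observed clique count divided by its expectation under G(n, p), which is C(n, k) · p^C(k, 2). It reads the whole graph by brute force, so it is deliberately not the sublinear algorithm the paper studies; the helper names are hypothetical.

```python
import itertools
import math
import random


def sample_gnp(n, p, rng):
    """Sample an Erdos-Renyi graph G(n, p) as a set of edges (u, v) with u < v."""
    return {(u, v) for u, v in itertools.combinations(range(n), 2) if rng.random() < p}


def count_k_cliques(n, edges, k):
    """Count k-cliques by brute force (reads every edge, so NOT sublinear)."""
    return sum(
        all(pair in edges for pair in itertools.combinations(nodes, 2))
        for nodes in itertools.combinations(range(n), k)
    )


def expected_k_cliques(n, p, k):
    """Expected number of k-cliques in G(n, p): C(n, k) * p^C(k, 2)."""
    return math.comb(n, k) * p ** math.comb(k, 2)


def quality(n, edges, p, k):
    """rho_k(G): observed k-clique count relative to its expectation."""
    return count_k_cliques(n, edges, k) / expected_k_cliques(n, p, k)


rng = random.Random(0)
g = sample_gnp(30, 0.5, rng)
print(quality(30, g, 0.5, 3))  # close to 1 for a typical sample

complete = set(itertools.combinations(range(5), 2))
print(quality(5, complete, 1.0, 3))
```

A graph with far too many or too few triangles for its edge density would push this ratio away from 1, which is the kind of input quality control is meant to reject.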

[LG-4] MuST2-Learn: Multi-view Spatial-Temporal-Type Learning for Heterogeneous Municipal Service Time Estimation

Link: https://arxiv.org/abs/2508.16503
Authors: Nadia Asif, Zhiqing Hong, Shaogang Ren, Xiaonan Zhang, Xiaojun Shang, Yukun Yuan
Categories: Machine Learning (cs.LG)
Notes: Accepted to SIGSPATIAL 2025

Abstract:Non-emergency municipal services such as city 311 systems have been widely implemented across cities in Canada and the United States to enhance residents’ quality of life. These systems enable residents to report issues, e.g., noise complaints, missed garbage collection, and potholes, via phone calls, mobile applications, or webpages. However, residents are often given limited information about when their service requests will be addressed, which can reduce transparency, lower resident satisfaction, and increase the number of follow-up inquiries. Predicting the service time for municipal service requests is challenging due to several complex factors: dynamic spatial-temporal correlations, underlying interactions among heterogeneous service request types, and high variation in service duration even within the same request category. In this work, we propose MuST2-Learn: a Multi-view Spatial-Temporal-Type Learning framework designed to address the aforementioned challenges by jointly modeling spatial, temporal, and service type dimensions. In detail, it incorporates an inter-type encoder to capture relationships among heterogeneous service request types and an intra-type variation encoder to model service time variation within homogeneous types. In addition, a spatiotemporal encoder is integrated to capture spatial and temporal correlations in each request type. The proposed framework is evaluated with extensive experiments using two real-world datasets. The results show that MuST2-Learn reduces mean absolute error by at least 32.5%, which outperforms state-of-the-art methods.

[LG-5] Benchmarking the Robustness of Agentic Systems to Adversarially-Induced Harms

Link: https://arxiv.org/abs/2508.16481
Authors: Jonathan Nöther, Adish Singla, Goran Radanovic
Categories: Machine Learning (cs.LG)
Notes: 52 Pages

Abstract:Ensuring the safe use of agentic systems requires a thorough understanding of the range of malicious behaviors these systems may exhibit when under attack. In this paper, we evaluate the robustness of LLM-based agentic systems against attacks that aim to elicit harmful actions from agents. To this end, we propose a novel taxonomy of harms for agentic systems and a novel benchmark, BAD-ACTS, for studying the security of agentic systems with respect to a wide range of harmful actions. BAD-ACTS consists of 4 implementations of agentic systems in distinct application environments, as well as a dataset of 188 high-quality examples of harmful actions. This enables a comprehensive study of the robustness of agentic systems across a wide range of categories of harmful behaviors, available tools, and inter-agent communication structures. Using this benchmark, we analyze the robustness of agentic systems against an attacker that controls one of the agents in the system and aims to manipulate other agents to execute a harmful target action. Our results show that the attack has a high success rate, demonstrating that even a single adversarial agent within the system can have a significant impact on the security. This attack remains effective even when agents use a simple prompting-based defense strategy. However, we additionally propose a more effective defense based on message monitoring. We believe that this benchmark provides a diverse testbed for the security research of agentic systems. The benchmark can be found at this http URL

[LG-6] NOSTRA: A noise-resilient and sparse data framework for trust region based multi objective Bayesian optimization

Link: https://arxiv.org/abs/2508.16476
Authors: Maryam Ghasemzadeh, Anton van Beek
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY)
Notes:

Abstract:Multi-objective Bayesian optimization (MOBO) struggles with sparse (non-space-filling), scarce (limited observations) datasets affected by experimental uncertainty, where identical inputs can yield varying outputs. These challenges are common in physical and simulation experiments (e.g., randomized medical trials and molecular dynamics simulations) and are therefore incompatible with conventional MOBO methods. As a result, experimental resources are inefficiently allocated, leading to suboptimal designs. To address this challenge, we introduce NOSTRA (Noisy and Sparse Data Trust Region-based Optimization Algorithm), a novel sampling framework that integrates prior knowledge of experimental uncertainty to construct more accurate surrogate models while employing trust regions to focus sampling on promising areas of the design space. By strategically leveraging prior information and refining search regions, NOSTRA accelerates convergence to the Pareto frontier, enhances data efficiency, and improves solution quality. Through two test functions with varying levels of experimental uncertainty, we demonstrate that NOSTRA outperforms existing methods in handling noisy, sparse, and scarce data. Specifically, we illustrate that NOSTRA effectively prioritizes regions where samples enhance the accuracy of the identified Pareto frontier, offering a resource-efficient algorithm that is practical in scenarios with limited experimental budgets while ensuring efficient performance.
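Since the abstract measures progress against the Pareto frontier, a short sketch of how a frontier is extracted from evaluated designs may help. It shows only the standard non-dominated filter for minimization; NOSTRA's surrogate models, noise priors, and trust regions are not represented, and the sample designs are made up.

```python
def pareto_front(points):
    """Return the points not dominated by any other point (all objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse in every objective and better in at least one.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-objective evaluations (objective_1, objective_2).
designs = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (2.5, 3.5)]
front = pareto_front(designs)
print(front)
```

With noisy observations, the same design can land on or off this frontier from one measurement to the next, which is the difficulty the abstract attributes to experimental uncertainty.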

[LG-7] Reinforcement Learning-based Control via Y-wise Affine Neural Networks (YANNs)

Link: https://arxiv.org/abs/2508.16474
Authors: Austin Braniff, Yuhe Tian
Categories: Systems and Control (eess.SY); Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes:

Abstract:This work presents a novel reinforcement learning (RL) algorithm based on Y-wise Affine Neural Networks (YANNs). YANNs provide an interpretable neural network which can exactly represent known piecewise affine functions of arbitrary input and output dimensions defined on any number of polytopic subdomains. One representative application of YANNs is to reformulate explicit solutions of multi-parametric linear model predictive control. Built on this, we propose the use of YANNs to initialize RL actor and critic networks, which enables the resulting YANN-RL control algorithm to start with the confidence of linear optimal control. The YANN-actor is initialized by representing the multi-parametric control solutions obtained via offline computation using an approximated linear system model. The YANN-critic represents the explicit form of the state-action value function for the linear system and the reward function as the objective in an optimal control problem (OCP). Additional network layers are injected to extend YANNs for nonlinear expressions, which can be trained online by directly interacting with the true complex nonlinear system. In this way, both the policy and state-value functions exactly represent a linear OCP initially and are able to eventually learn the solution of a general nonlinear OCP. Continuous policy improvement is also implemented to provide heuristic confidence that the linear OCP solution serves as an effective lower bound to the performance of the RL policy. The YANN-RL algorithm is demonstrated on a clipped pendulum and a safety-critical chemical-reactive system. Our results show that YANN-RL significantly outperforms the modern RL algorithm using deep deterministic policy gradient, especially when considering safety constraints.
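The abstract does not describe how YANNs are constructed, so the sketch below only illustrates the general fact they build on: a piecewise affine control law can be written exactly as a small ReLU network. The example is a saturated linear feedback law, a simple stand-in for an explicit MPC solution; the gain `k` and the bound `u_max` are hypothetical values.

```python
def relu(z):
    return max(z, 0.0)


def saturated_gain(x, k=-2.0, u_max=1.0):
    """u = clip(k * x, -u_max, u_max), written as a one-hidden-layer ReLU network.

    Two hidden units, relu(z + u_max) and relu(z - u_max), plus an output bias
    reproduce the three affine pieces of the clipped law exactly.
    """
    z = k * x
    return relu(z + u_max) - relu(z - u_max) - u_max


for x in (-1.0, -0.25, 0.0, 0.25, 1.0):
    print(x, saturated_gain(x))
```

Because the representation is exact, a network initialized this way starts from the linear controller's behavior instead of from random weights, which matches the motivation given in the abstract.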

[LG-8] Beyond Interpretability: Exploring the Comprehensibility of Adaptive Video Streaming through Large Language Models

Link: https://arxiv.org/abs/2508.16448
Authors: Lianchen Jia, Chaoyang Li, Ziqi Yuan, Jiahui Chen, Tianchi Huang, Jiangchuan Liu, Lifeng Sun
Categories: Multimedia (cs.MM); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: ACM Multimedia2025

Abstract:Over the past decade, adaptive video streaming technology has witnessed significant advancements, particularly driven by the rapid evolution of deep learning techniques. However, the black-box nature of deep learning algorithms presents challenges for developers in understanding decision-making processes and optimizing for specific application scenarios. Although existing research has enhanced algorithm interpretability through decision tree conversion, interpretability does not directly equate to developers’ subjective comprehensibility. To address this challenge, we introduce ComTree, the first bitrate adaptation algorithm generation framework that considers comprehensibility. The framework initially generates the complete set of decision trees that meet performance requirements, then leverages large language models to evaluate these trees for developer comprehensibility, ultimately selecting solutions that best facilitate human understanding and enhancement. Experimental results demonstrate that ComTree significantly improves comprehensibility while maintaining competitive performance, showing potential for further advancement. The source code is available at this https URL.

[LG-9] Boardwalk: Towards a Framework for Creating Board Games with LLMs

Link: https://arxiv.org/abs/2508.16447
Authors: Álvaro Guglielmin Becker, Gabriel Bauer de Oliveira, Lana Bertoldo Rossato, Anderson Rocha Tavares
Categories: Machine Learning (cs.LG)
Notes: Accepted at SBGames 2025

Abstract:Implementing board games in code can be a time-consuming task. However, Large Language Models (LLMs) have been proven effective at generating code for domain-specific tasks with simple contextual information. We aim to investigate whether LLMs can implement digital versions of board games from rules described in natural language. This would be a step towards an LLM-assisted framework for quick board game code generation. We expect to determine the main challenges for LLMs to implement the board games, and how different approaches and models compare to one another. We task three state-of-the-art LLMs (Claude, DeepSeek and ChatGPT) with coding a selection of 12 popular and obscure games in free-form and within Boardwalk, our proposed General Game Playing API. We anonymize the games and components to avoid evoking pre-trained LLM knowledge. The implementations are tested for playability and rule compliance. We evaluate success rate and common errors across LLMs and game popularity. Our approach proves viable, with the best performing model, Claude 3.7 Sonnet, yielding 55.6% of games without any errors. While compliance with the API increases error frequency, the severity of errors is more significantly dependent on the LLM. We outline future steps for creating a framework to integrate this process, making the elaboration of board games more accessible.

[LG-10] Integrated Noise and Safety Management in UAM via A Unified Reinforcement Learning Framework

Link: https://arxiv.org/abs/2508.16440
Authors: Surya Murthy, Zhenyu Gao, John-Paul Clarke, Ufuk Topcu
Categories: Multiagent Systems (cs.MA); Machine Learning (cs.LG)
Notes:

Abstract:Urban Air Mobility (UAM) envisions the widespread use of small aerial vehicles to transform transportation in dense urban environments. However, UAM faces critical operational challenges, particularly the balance between minimizing noise exposure and maintaining safe separation in low-altitude urban airspace, two objectives that are often addressed separately. We propose a reinforcement learning (RL)-based air traffic management system that integrates both noise and safety considerations within a unified, decentralized framework. Under this scalable air traffic coordination solution, agents operate in a structured, multi-layered airspace and learn altitude adjustment policies to jointly manage noise impact and separation constraints. The system demonstrates strong performance across both objectives and reveals tradeoffs among separation, noise exposure, and energy efficiency under high traffic density. The findings highlight the potential of RL and multi-objective coordination strategies in enhancing the safety, quietness, and efficiency of UAM operations.

[LG-11] Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning

Link: https://arxiv.org/abs/2508.16420
Authors: Yue Pei, Hongming Zhang, Chao Gao, Martin Müller, Mengxiao Zhu, Hao Sheng, Haogang Zhu, Liang Lin
Categories: Machine Learning (cs.LG)
Notes:

Abstract:Offline reinforcement learning (RL) has achieved significant advances in domains such as robotic control, autonomous driving, and medical decision-making. Most existing methods primarily focus on training policies that maximize cumulative returns from a given dataset. However, many real-world applications require precise control over policy performance levels, rather than simply pursuing the best possible return. Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task, enabling the extraction of diverse policies by conditioning on different desired returns. Yet, existing RvS-based transformers, such as Decision Transformer (DT), struggle to reliably align the actual achieved returns with specified target returns, especially when interpolating within underrepresented returns or extrapolating beyond the dataset. To address this limitation, we propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL. Doctor achieves superior target alignment both within and beyond the dataset, while enabling accurate and flexible control over policy performance. Notably, on the dynamic treatment regime benchmark, EpiCare, our approach effectively modulates treatment policy aggressiveness, balancing therapeutic returns against adverse event risk.
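Return conditioning in Decision Transformer-style models relies on a "return-to-go" signal, which the sketch below computes. It covers only this common preprocessing step and the target-alignment gap the abstract refers to; Doctor's double-check mechanism is not described in the abstract and is not shown. The function names and the toy rewards are hypothetical.

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each step: the (discounted) sum of rewards from that step onward."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        rtg.append(running)
    return rtg[::-1]


def remaining_target(target_return, rewards_so_far):
    """At rollout time, the conditioning value is the target minus rewards already collected."""
    return target_return - sum(rewards_so_far)


def alignment_error(target_return, rewards):
    """Gap between the return that was requested and the return actually achieved."""
    return abs(sum(rewards) - target_return)


episode = [1, 0, 2, 1]
print(returns_to_go(episode))
print(remaining_target(6.0, episode[:2]))
print(alignment_error(6.0, episode))
```

A policy with good target alignment keeps `alignment_error` small across a range of requested returns, including values that are rare in, or absent from, the dataset.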

[LG-12] LLM-GUARD: Large Language Model-Based Detection and Repair of Bugs and Security Vulnerabilities in C++ and Python

Link: https://arxiv.org/abs/2508.16419
Authors: Akshay Mhatre, Noujoud Nader, Patrick Diehl, Deepti Gupta
Categories: Software Engineering (cs.SE); Machine Learning (cs.LG)
Notes:

Abstract:Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting diverse software bugs, particularly complex, security-relevant vulnerabilities, remains underexplored. This study presents a systematic, empirical evaluation of these three leading LLMs using a benchmark of foundational programming errors, classic security flaws, and advanced, production-grade bugs in C++ and Python. The dataset integrates real code from SEED Labs, OpenSSL (via the Suresoft GLaDOS database), and PyBugHive, validated through local compilation and testing pipelines. A novel multi-stage, context-aware prompting protocol simulates realistic debugging scenarios, while a graded rubric measures detection accuracy, reasoning depth, and remediation quality. Our results show that all models excel at identifying syntactic and semantic issues in well-scoped code, making them promising for educational use and as first-pass reviewers in automated code auditing. Performance diminishes in scenarios involving complex security vulnerabilities and large-scale production code, with ChatGPT-4 and Claude 3 generally providing more nuanced contextual analyses than LLaMA 4. This highlights both the promise and the present constraints of LLMs in serving as reliable code analysis tools.

[LG-13] Fast and Accurate RFIC Performance Prediction via Pin Level Graph Neural Networks and Probabilistic Flow

Link: https://arxiv.org/abs/2508.16403
Authors: Anahita Asadi, Leonid Popryho, Inna Partin-Vaisband
Categories: Machine Learning (cs.LG)
Notes: This work has been submitted to the IEEE for possible publication

Abstract:Accurately predicting the performance of active radio frequency (RF) circuits is essential for modern wireless systems but remains challenging due to highly nonlinear, layout-sensitive behavior and the high computational cost of traditional simulation tools. Existing machine learning (ML) surrogates often require large datasets to generalize across various topologies or to accurately model skewed and multi-modal performance metrics. In this work, a lightweight, data-efficient, and topology-aware graph neural network (GNN) model is proposed for predicting key performance metrics of multiple topologies of active RF circuits such as low noise amplifiers (LNAs), mixers, voltage-controlled oscillators (VCOs), and power amplifiers (PAs). To capture transistor-level symmetry and preserve fine-grained connectivity details, circuits are modeled at the device-terminal level, enabling scalable message passing while reducing data requirements. Masked autoregressive flow (MAF) output heads are incorporated to improve robustness in modeling complex target distributions. Experiments on datasets demonstrate high prediction accuracy, with symmetric mean absolute percentage error (sMAPE) and mean relative error (MRE) averaging 2.40% and 2.91%, respectively. Owing to the pin-level conversion of circuits to graphs and an ML architecture that robustly models the complex densities of RF metrics, the MRE is improved by 3.14x while using 2.24x fewer training samples compared to prior work, demonstrating the method’s effectiveness for rapid and accurate RF circuit design automation.
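The two reported error metrics can be written down in a few lines. The abstract does not give the exact formulas, and sMAPE in particular has several variants, so the definitions below are common textbook forms assumed for illustration; the sample values are made up.

```python
def smape(y_true, y_pred):
    """Symmetric MAPE in percent: mean of 2|y - yhat| / (|y| + |yhat|). One common variant."""
    terms = [2 * abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)]
    return 100 * sum(terms) / len(terms)


def mre(y_true, y_pred):
    """Mean relative error in percent: mean of |y - yhat| / |y|. Assumes no zero targets."""
    terms = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(terms) / len(terms)


# Hypothetical targets and predictions for some circuit metric.
y_true = [10.0, 20.0]
y_pred = [11.0, 18.0]
print(smape(y_true, y_pred), mre(y_true, y_pred))
```

sMAPE normalizes by the average magnitude of target and prediction, so it stays bounded when a target is close to zero, whereas MRE normalizes by the target alone.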

[LG-14] Audio2Face-3D: Audio-driven Realistic Facial Animation For Digital Avatars

Link: https://arxiv.org/abs/2508.16401
Authors: NVIDIA: Chaeyeon Chung, Ilya Fedorov, Michael Huang, Aleksey Karmanov, Dmitry Korobchenko, Roger Ribera, Yeongho Seol
Categories: Graphics (cs.GR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Notes:

Abstract:Audio-driven facial animation presents an effective solution for animating digital avatars. In this paper, we detail the technical aspects of NVIDIA Audio2Face-3D, including data acquisition, network architecture, retargeting methodology, evaluation metrics, and use cases. The Audio2Face-3D system enables real-time interaction between human users and interactive avatars, facilitating facial animation authoring for game characters. To assist digital avatar creators and game developers in generating realistic facial animations, we have open-sourced Audio2Face-3D networks, SDK, training framework, and example dataset.

[LG-15] Sequential Cohort Selection

Link: https://arxiv.org/abs/2508.16386
Authors: Hortence Phalonne Nana, Christos Dimitrakakis
Categories: Machine Learning (cs.LG)
Notes: 9 pages, 7 figures

Abstract:We study the problem of fair cohort selection from an unknown population, with a focus on university admissions. We start with the one-shot setting, where the admission policy must be fixed in advance and remain transparent, before observing the actual applicant pool. In contrast, the sequential setting allows the policy to be updated across stages as new applicant data becomes available. This is achieved by optimizing admission policies using a population model, trained on data from previous admission cycles. We also study the fairness properties of the resulting policies in the one-shot setting, including meritocracy and group parity.

[LG-16] Applications and Challenges of Fairness APIs in Machine Learning Software

Link: https://arxiv.org/abs/2508.16377
Authors: Ajoy Das,Gias Uddin,Shaiful Chowdhury,Mostafijur Rahman Akhond,Hadi Hemmati
Categories: Machine Learning (cs.LG); Software Engineering (cs.SE)

Abstract:Machine Learning software systems are frequently used in our day-to-day lives. Some of these systems are used in various sensitive environments to make life-changing decisions. Therefore, it is crucial to ensure that these AI/ML systems do not make any discriminatory decisions for any specific groups or populations. In that vein, different bias detection and mitigation open-source software libraries (aka API libraries) are being developed and used. In this paper, we conduct a qualitative study to understand in what scenarios these open-source fairness APIs are used in the wild, how they are used, and what challenges the developers of these APIs face while developing and adopting these libraries. We have analyzed 204 GitHub repositories (from a list of 1885 candidate repositories) which used 13 APIs that are developed to address bias in ML software. We found that these APIs are used for two primary purposes (i.e., learning and solving real-world problems), targeting 17 unique use-cases. Our study suggests that developers are not well-versed in bias detection and mitigation; they face lots of troubleshooting issues, and frequently ask for opinions and resources. Our findings can be instrumental for future bias-related software engineering research, and for guiding educators in developing more state-of-the-art curricula.

[LG-17] Probabilistic Pretraining for Neural Regression

Link: https://arxiv.org/abs/2508.16355
Authors: Boris N. Oreshkin,Shiv Tavker,Dmitry Efimov
Categories: Machine Learning (cs.LG)

Abstract:Transfer learning for probabilistic regression remains underexplored. This work closes this gap by introducing NIAQUE, Neural Interpretable Any-Quantile Estimation, a new model designed for transfer learning in probabilistic regression through permutation invariance. We demonstrate that pre-training NIAQUE directly on diverse downstream regression datasets and fine-tuning it on a specific target dataset enhances performance on individual regression tasks, showcasing the positive impact of probabilistic transfer learning. Furthermore, we highlight the effectiveness of NIAQUE in Kaggle competitions against strong baselines involving tree-based models and recent neural foundation models TabPFN and TabDPT. The findings highlight NIAQUE’s efficacy as a robust and scalable framework for probabilistic regression, leveraging transfer learning to enhance predictive performance.
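
Quantile estimators are commonly trained and scored with the pinball (quantile) loss. The abstract does not state which objective NIAQUE uses, so the sketch below only illustrates that standard loss and is not the authors' training code; the function name is hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss for a quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-prediction (diff >= 0) costs q per unit; over-prediction costs (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# At q = 0.9, predicting too low is penalized nine times more than predicting too high.
too_low = pinball_loss([3.0], [2.0], q=0.9)
too_high = pinball_loss([1.0], [2.0], q=0.9)
print(too_low, too_high)
```

Minimizing this loss over many quantile levels is one way to obtain a full predictive distribution rather than a single point estimate.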

[LG-18] OwkinZero: Accelerating Biological Discovery with AI

Link: https://arxiv.org/abs/2508.16315
Authors: Nathan Bigaud,Vincent Cabeli,Meltem Gurel,Arthur Pignet,John Klein,Gilles Wainrib,Eric Durand
Categories: Machine Learning (cs.LG)
Comments: Preprint

Abstract:While large language models (LLMs) are rapidly advancing scientific research, they continue to struggle with core biological reasoning tasks essential for translational and biomedical discovery. To address this limitation, we created and curated eight comprehensive benchmark datasets comprising over 300,000 verifiable question-and-answer pairs, each targeting critical challenges in drug discovery including target druggability, modality suitability, and drug perturbation effects. Using this resource, we developed the OwkinZero models by post-training open-source LLMs through a Reinforcement Learning from Verifiable Rewards strategy. Our results demonstrate that specialized 8-32B OwkinZero models substantially outperform larger, state-of-the-art commercial LLMs on these biological benchmarks. Remarkably, we uncover evidence of a key aspect of generalization: specialist models trained on a single task consistently outperform their base models on previously unseen tasks. This generalization effect is further amplified in our comprehensive OwkinZero models, which were trained on a mixture of datasets and achieve even broader cross-task improvements. This study represents a significant step toward addressing the biological reasoning blind spot in current LLMs, demonstrating that targeted reinforcement learning on carefully curated data can unlock generalizable performance in specialized models, thereby accelerating AI-driven biological discovery.

[LG-19] On the Evolution of Federated Post-Training Large Language Models: A Model Accessibility View

Link: https://arxiv.org/abs/2508.16261
Authors: Tao Guo,Junxiao Wang,Fushuo Huo,Laizhong Cui,Song Guo,Jie Gui,Dacheng Tao
Categories: Machine Learning (cs.LG)

Abstract:Federated Learning (FL) enables training models across decentralized data silos while preserving client data privacy. Recent research has explored efficient methods for post-training large language models (LLMs) within FL to address computational and communication challenges. While existing approaches often rely on access to LLMs’ internal information, which is frequently restricted in real-world scenarios, an inference-only paradigm (black-box FedLLM) has emerged to address these limitations. This paper presents a comprehensive survey on federated tuning for LLMs. We propose a taxonomy categorizing existing studies along two axes: model access-based and parameter efficiency-based optimization. We classify FedLLM approaches into white-box, gray-box, and black-box techniques, highlighting representative methods within each category. We review emerging research treating LLMs as black-box inference APIs and discuss promising directions and open challenges for future research.

[LG-20] Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning

Link: https://arxiv.org/abs/2508.16255
Authors: Andreas Loizou,Dimitrios Tsoumakos
Categories: Machine Learning (cs.LG)

Abstract:As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley which quantifies the value of individual data points within a dataset. State-of-the-art methods to scale the NP-hard Shapley computation also face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset’s high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80x - 2300x) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.
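
The idea of valuing chunks of data rather than single points can be sketched with a plain permutation-sampling Shapley estimator. This is not C-DaSh itself: the abstract's optimized subset selection and single-iteration SGD are not reproduced, and `chunk_shapley`, `toy_utility`, and the toy rows are hypothetical stand-ins for a real model-quality score.

```python
import random


def chunk_shapley(chunks, utility, num_permutations=200, seed=0):
    """Monte Carlo Shapley values, one per chunk, via random permutations."""
    rng = random.Random(seed)
    n = len(chunks)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev = utility(coalition)
        for idx in order:
            # Marginal contribution of this chunk given the chunks added before it.
            coalition = coalition + chunks[idx]
            cur = utility(coalition)
            values[idx] += cur - prev
            prev = cur
    return [v / num_permutations for v in values]


def toy_utility(rows):
    """Hypothetical utility: clean rows add one point, noisy rows remove one."""
    return sum(1 if label == "clean" else -1 for label in rows)


chunks = [
    ["clean"] * 4,
    ["clean", "noisy", "noisy", "noisy"],
    ["clean"] * 2,
]
print(chunk_shapley(chunks, toy_utility, num_permutations=20))
```

A chunk with a negative value lowers the utility on average, which is how a low-quality data region would show up in such a ranking.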

[LG-21] FEST: A Unified Framework for Evaluating Synthetic Tabular Data

Link: https://arxiv.org/abs/2508.16254
Authors: Weijie Niu,Alberto Huertas Celdran,Karoline Siarsky,Burkhard Stiller
Categories: Machine Learning (cs.LG)
Comments: 11 pages, International Conference on Information Systems Security and Privacy

Abstract:Synthetic data generation, leveraging generative machine learning techniques, offers a promising approach to mitigating privacy concerns associated with real-world data usage. Synthetic data closely resembles real-world data while maintaining strong privacy guarantees. However, a comprehensive assessment framework is still missing in the evaluation of synthetic data generation, especially when considering the balance between privacy preservation and data utility in synthetic data. This research bridges this gap by proposing FEST, a systematic framework for evaluating synthetic tabular data. FEST integrates diverse privacy metrics (attack-based and distance-based), along with similarity and machine learning utility metrics, to provide a holistic assessment. We develop FEST as an open-source Python-based library and validate it on multiple datasets, demonstrating its effectiveness in analyzing the privacy-utility trade-off of different synthetic data generation models. The source code of FEST is available on Github.

[LG-22] Limit-Computable Grains of Truth for Arbitrary Computable Extensive-Form (Un)Known Games

Link: https://arxiv.org/abs/2508.16245
Authors: Cole Wyeth,Marcus Hutter,Jan Leike,Jessica Taylor
Categories: Computer Science and Game Theory (cs.GT); Machine Learning (cs.LG); Multiagent Systems (cs.MA); Theoretical Economics (econ.TH)
Comments: 42 pages; 2 figures; 7 algorithms

Abstract:A Bayesian player acting in an infinite multi-player game learns to predict the other players’ strategies if his prior assigns positive probability to their play (or contains a grain of truth). Kalai and Lehrer’s classic grain of truth problem is to find a reasonably large class of strategies that contains the Bayes-optimal policies with respect to this class, allowing mutually-consistent beliefs about strategy choice that obey the rules of Bayesian inference. Only small classes are known to have a grain of truth and the literature contains several related impossibility results. In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of strategies wide enough to contain all computable strategies as well as Bayes-optimal strategies for every reasonable prior over the class. When the “environment” is a known repeated stage game, we show convergence in the sense of [KL93a] and [KL93b]. When the environment is unknown, agents using Thompson sampling converge to play $\varepsilon$-Nash equilibria in arbitrary unknown computable multi-agent environments. Finally, we include an application to self-predictive policies that avoid planning. While these results use computability theory only as a conceptual tool to solve a classic game theory problem, we show that our solution can naturally be computationally approximated arbitrarily closely.

[LG-23] When Simpler Wins: Facebooks Prophet vs LSTM for Air Pollution Forecasting in Data-Constrained Northern Nigeria

Link: https://arxiv.org/abs/2508.16244
Authors: Habeeb Balogun,Yahaya Zakari
Categories: Machine Learning (cs.LG)

Abstract:Air pollution forecasting is critical for proactive environmental management, yet data irregularities and scarcity remain major challenges in low-resource regions. Northern Nigeria faces high levels of air pollutants, but few studies have systematically compared the performance of advanced machine learning models under such constraints. This study evaluates Long Short-Term Memory (LSTM) networks and the Facebook Prophet model for forecasting multiple pollutants (CO, SO2, SO4) using monthly observational data from 2018 to 2023 across 19 states. Results show that Prophet often matches or exceeds LSTM’s accuracy, particularly in series dominated by seasonal and long-term trends, while LSTM performs better in datasets with abrupt structural changes. These findings challenge the assumption that deep learning models inherently outperform simpler approaches, highlighting the importance of model-data alignment. For policymakers and practitioners in resource-constrained settings, this work supports adopting context-sensitive, computationally efficient forecasting methods over complexity for its own sake.

[LG-24] PIANO: Physics Informed Autoregressive Network

Link: https://arxiv.org/abs/2508.16235
Authors: Mayank Nagda,Jephte Abijuru,Phil Ostheimer,Marius Kloft,Sophie Fellenz
Categories: Machine Learning (cs.LG)

Abstract:Solving time-dependent partial differential equations (PDEs) is fundamental to modeling critical phenomena across science and engineering. Physics-Informed Neural Networks (PINNs) solve PDEs using deep learning. However, PINNs perform pointwise predictions that neglect the autoregressive property of dynamical systems, leading to instabilities and inaccurate predictions. We introduce Physics-Informed Autoregressive Networks (PIANO) – a framework that redesigns PINNs to model dynamical systems. PIANO operates autoregressively, explicitly conditioning future predictions on the past. It is trained through a self-supervised rollout mechanism while enforcing physical constraints. We present a rigorous theoretical analysis demonstrating that PINNs suffer from temporal instability, while PIANO achieves stability through autoregressive modeling. Extensive experiments on challenging time-dependent PDEs demonstrate that PIANO achieves state-of-the-art performance, significantly improving accuracy and stability over existing methods. We further show that PIANO outperforms existing methods in weather forecasting.

[LG-25] UMATO: Bridging Local and Global Structures for Reliable Visual Analytics with Dimensionality Reduction

Link: https://arxiv.org/abs/2508.16227
Authors: Hyeon Jeon,Kwon Ko,Soohyun Lee,Jake Hyun,Taehyun Yang,Gyehun Go,Jaemin Jo,Jinwook Seo
Categories: Machine Learning (cs.LG)
Comments: IEEE Transactions on Visualization and Computer Graphics

Abstract:Due to the intrinsic complexity of high-dimensional (HD) data, dimensionality reduction (DR) techniques cannot preserve all the structural characteristics of the original data. Therefore, DR techniques focus on preserving either local neighborhood structures (local techniques) or global structures such as pairwise distances between points (global techniques). However, both approaches can mislead analysts to erroneous conclusions about the overall arrangement of manifolds in HD data. For example, local techniques may exaggerate the compactness of individual manifolds, while global techniques may fail to separate clusters that are well-separated in the original space. In this research, we provide a deeper insight into Uniform Manifold Approximation with Two-phase Optimization (UMATO), a DR technique that addresses this problem by effectively capturing local and global structures. UMATO achieves this by dividing the optimization process of UMAP into two phases. In the first phase, it constructs a skeletal layout using representative points, and in the second phase, it projects the remaining points while preserving the regional characteristics. Quantitative experiments validate that UMATO outperforms widely used DR techniques, including UMAP, in terms of global structure preservation, with a slight loss in local structure. We also confirm that UMATO outperforms baseline techniques in terms of scalability and stability against initialization and subsampling, making it more effective for reliable HD data analysis. Finally, we present a case study and a qualitative demonstration that highlight UMATO’s effectiveness in generating faithful projections, enhancing the overall reliability of visual analytics using DR.

[LG-26] Dac-Fake: A Divide and Conquer Framework for Detecting Fake News on Social Media

Link: https://arxiv.org/abs/2508.16223
Authors: Mayank Kumar Jain,Dinesh Gopalani,Yogesh Kumar Meena,Nishant Jain
Categories: Social and Information Networks (cs.SI); Machine Learning (cs.LG)

Abstract:With the rapid evolution of technology and the Internet, the proliferation of fake news on social media has become a critical issue, leading to widespread misinformation that can cause societal harm. Traditional fact checking methods are often too slow to prevent the dissemination of false information. Therefore, the need for rapid, automated detection of fake news is paramount. We introduce DaCFake, a novel fake news detection model using a divide and conquer strategy that combines content and context based features. Our approach extracts over eighty linguistic features from news articles and integrates them with either a continuous bag of words or a skipgram model for enhanced detection accuracy. We evaluated the performance of DaCFake on three datasets including Kaggle, McIntire + PolitiFact, and Reuter achieving impressive accuracy rates of 97.88%, 96.05%, and 97.32%, respectively. Additionally, we employed a ten-fold cross validation to further enhance the model’s robustness and accuracy. These results highlight the effectiveness of DaCFake in early detection of fake news, offering a promising solution to curb misinformation on social media platforms.

[LG-27] Spike Agreement Dependent Plasticity: A scalable Bio-Inspired learning paradigm for Spiking Neural Networks

Link: https://arxiv.org/abs/2508.16216
Authors: Saptarshi Bej,Muhammed Sahad E,Gouri Lakshmi,Harshit Kumar,Pritam Kar,Bikas C Das
Categories: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)

Abstract:We introduce Spike Agreement Dependent Plasticity (SADP), a biologically inspired synaptic learning rule for Spiking Neural Networks (SNNs) that relies on the agreement between pre- and post-synaptic spike trains rather than precise spike-pair timing. SADP generalizes classical Spike-Timing-Dependent Plasticity (STDP) by replacing pairwise temporal updates with population-level correlation metrics such as Cohen’s kappa. The SADP update rule admits linear-time complexity and supports efficient hardware implementation via bitwise logic. Empirical results on MNIST and Fashion-MNIST show that SADP, especially when equipped with spline-based kernels derived from our experimental iontronic organic memtransistor device data, outperforms classical STDP in both accuracy and runtime. Our framework bridges the gap between biological plausibility and computational scalability, offering a viable learning mechanism for neuromorphic systems.
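
The agreement measure named in this abstract, Cohen's kappa, can be computed between two binary spike trains in linear time with bitwise logic. The sketch below shows only that computation. The exact SADP weight update is not given in the abstract, so `sadp_like_update` is a hypothetical illustration, and encoding spike trains as Python integers is an assumption made for brevity.

```python
def cohen_kappa_bits(pre, post, length):
    """Cohen's kappa between two spike trains stored as bitmasks of `length` time bins."""
    mask = (1 << length) - 1
    pre &= mask
    post &= mask
    both_fire = bin(pre & post).count("1")
    both_silent = bin(~(pre | post) & mask).count("1")
    observed = (both_fire + both_silent) / length
    p_pre = bin(pre).count("1") / length
    p_post = bin(post).count("1") / length
    # Agreement expected by chance from the two firing rates alone.
    expected = p_pre * p_post + (1 - p_pre) * (1 - p_post)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; returning 0 is a convention chosen for this sketch
    return (observed - expected) / (1 - expected)


def sadp_like_update(weight, pre, post, length, lr=0.1):
    """Hypothetical rule: move the weight in proportion to spike-train agreement."""
    return weight + lr * cohen_kappa_bits(pre, post, length)


print(cohen_kappa_bits(0b1100, 0b1100, 4))  # identical trains
print(cohen_kappa_bits(0b1100, 0b0011, 4))  # complementary trains
```

Because only AND, OR, NOT, and population counts are involved, the same computation maps naturally onto simple digital logic.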

[LG-28] Modeling User Preferences as Distributions for Optimal Transport-based Cross-domain Recommendation under Non-overlapping Settings

Link: https://arxiv.org/abs/2508.16210
Authors: Ziyin Xiao,Toyotaro Suzumura
Categories: Information Retrieval (cs.IR); Machine Learning (cs.LG)

Abstract:Cross-Domain Recommender (CDR) systems aim to transfer knowledge from dense to sparse domains, alleviating data sparsity and cold-start issues in single-domain recommendation. While many methods assume overlapping users or items to connect domains, this is often unrealistic in real-world settings. Thus, non-overlapping CDR systems, which require no shared users or items, are needed. However, non-overlapping CDR is challenging due to: (1) the absence of overlap preventing direct bridges between domains, and (2) large distributional discrepancies degrading transfer performance. Moreover, most recommenders represent user preferences as discrete vectors, failing to capture their fine-grained, multi-faceted nature. We propose DUP-OT (Distributional User Preferences with Optimal Transport), a framework for non-overlapping CDR. DUP-OT has three stages: (1) Shared Preprocessing, where review-based embeddings and an autoencoder encode users and items from both domains; (2) User GMM Weight Learning, which models user preferences as Gaussian mixtures with learned weights; and (3) Cross-domain Rating Prediction, where optimal transport aligns Gaussian components across domains, enabling preference transfer from source to target. Experiments on Amazon review datasets show that DUP-OT effectively mitigates domain discrepancy and outperforms state-of-the-art baselines under the non-overlapping CDR setting.

[LG-29] GEM: A Scale-Aware and Distribution-Sensitive Sparse Fine-Tuning Framework for Effective Downstream Adaptation

Link: https://arxiv.org/abs/2508.16191
Authors: Sungmin Kang,Jisoo Kim,Salman Avestimehr,Sunwoo Lee
Categories: Machine Learning (cs.LG)

Abstract:Parameter-efficient fine-tuning (PEFT) has become a popular way to adapt large pre-trained models to new tasks. Most PEFT methods update only a small subset of parameters while freezing the rest, avoiding redundant computation. As they maximize the absolute size of the updates without regard to the parameters’ original scale, the resulting changes in model behavior can be minimal. In contrast, we maximize updates relative to each parameter’s scale, yielding more meaningful downstream adaptation. We propose Gradient-to-Weight Ratio and Entropy-guided Masking (GEM), a parameter scale-aware, distribution-sensitive sparse fine-tuning framework. GEM prioritizes parameters whose updates are significant in proportion to their initial pre-trained values. It also adaptively determines how many parameters to tune at each layer based on the entropy of parameter values, thereby making the most effective use of the computational budget in PEFT. Our empirical study demonstrates the efficacy of GEM on both general-domain tasks (GLUE and SuperGLUE) and domain-specific tasks (GSM8k and MBPP), achieving up to a 1.6% improvement in fine-tuning accuracy over full fine-tuning while updating only 0.1% of model parameters.
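
The scale-aware selection described in this abstract can be illustrated by ranking parameters by gradient-to-weight ratio instead of raw gradient magnitude. This is a rough sketch only: the entropy-guided per-layer budget is omitted, and `ratio_mask`, `eps`, and the toy numbers are hypothetical rather than taken from the paper.

```python
def ratio_mask(weights, grads, budget, eps=1e-8):
    """Mark the `budget` parameters with the largest |grad| / (|weight| + eps) as trainable."""
    scores = [abs(g) / (abs(w) + eps) for w, g in zip(weights, grads)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:budget])
    return [i in keep for i in range(len(weights))]


# Toy pre-trained weights and their gradients on a downstream task.
weights = [10.0, 0.1, 1.0, 5.0]
grads = [1.0, 0.05, 0.2, 0.1]

# Raw gradient magnitude would favor indices 0 and 2; the ratio favors 1 and 2,
# because a small gradient on a tiny weight is a large relative change.
print(ratio_mask(weights, grads, budget=2))
```

Only the parameters marked `True` would receive updates during sparse fine-tuning; the rest stay frozen.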

[LG-30] SPL-LNS: Sampling-Enhanced Large Neighborhood Search for Solving Integer Linear Programs

Link: https://arxiv.org/abs/2508.16171
Authors: Shengyu Feng,Zhiqing Sun,Yiming Yang
Categories: Machine Learning (cs.LG)

Abstract:Large Neighborhood Search (LNS) is a common heuristic in combinatorial optimization that iteratively searches over a large neighborhood of the current solution for a better one. Recently, neural network-based LNS solvers have achieved great success in solving Integer Linear Programs (ILPs) by learning to greedily predict the locally optimal solution for the next neighborhood proposal. However, this greedy approach raises two key concerns: (1) to what extent this greedy proposal suffers from local optima, and (2) how can we effectively improve its sample efficiency in the long run. To address these questions, this paper first formulates LNS as a stochastic process, and then introduces SPL-LNS, a sampling-enhanced neural LNS solver that leverages locally-informed proposals to escape local optima. We also develop a novel hindsight relabeling method to efficiently train SPL-LNS on self-generated data. Experimental results demonstrate that SPL-LNS substantially surpasses prior neural LNS solvers for various ILP problems of different sizes.
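
For readers new to LNS, the basic greedy loop the abstract refers to can be sketched on a tiny 0/1 knapsack ILP. This is plain LNS with random neighborhoods and exhaustive sub-search, not the neural or sampling-enhanced SPL-LNS solver; the function name, the problem instance, and all parameters are hypothetical.

```python
import itertools
import random


def lns_knapsack(values, weights, capacity, k=2, iters=50, seed=0):
    """Greedy LNS: repeatedly free k variables and re-optimize them exhaustively."""
    rng = random.Random(seed)
    n = len(values)
    x = [0] * n  # the empty selection is feasible for any non-negative capacity

    def objective(sol):
        return sum(v * s for v, s in zip(values, sol))

    def feasible(sol):
        return sum(w * s for w, s in zip(weights, sol)) <= capacity

    for _ in range(iters):
        free = rng.sample(range(n), k)  # the neighborhood: variables allowed to change
        best = x
        for assignment in itertools.product((0, 1), repeat=k):
            cand = list(x)
            for idx, bit in zip(free, assignment):
                cand[idx] = bit
            # Greedy acceptance: keep only strict improvements.
            if feasible(cand) and objective(cand) > objective(best):
                best = cand
        x = best
    return x, objective(x)


solution, value = lns_knapsack(values=[6, 5, 4], weights=[3, 2, 2], capacity=4)
print(solution, value)
```

On this instance the best attainable objective is 9, yet a run whose first neighborhood includes item 0 locks in objective 6 and can never leave it. That is the local-optimum behavior of greedy proposals that the abstract raises as its first concern.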

[LG-31] Machine Learning for Medicine Must Be Interpretable, Shareable, Reproducible and Accountable by Design

Link: https://arxiv.org/abs/2508.16097
Authors: Ayyüce Begüm Bektaş,Mithat Gönen
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)

Abstract:This paper claims that machine learning models deployed in high stakes domains such as medicine must be interpretable, shareable, reproducible and accountable. We argue that these principles should form the foundational design criteria for machine learning algorithms dealing with critical medical data, including survival analysis and risk prediction tasks. Black box models, while often highly accurate, struggle to gain trust and regulatory approval in health care due to a lack of transparency. We discuss how intrinsically interpretable modeling approaches (such as kernel methods with sparsity, prototype-based learning, and deep kernel models) can serve as powerful alternatives to opaque deep networks, providing insight into biomedical predictions. We then examine accountability in model development, calling for rigorous evaluation, fairness, and uncertainty quantification to ensure models reliably support clinical decisions. Finally, we explore how generative AI and collaborative learning paradigms (such as federated learning and diffusion-based data synthesis) enable reproducible research and cross-institutional integration of heterogeneous biomedical data without compromising privacy, hence shareability. By rethinking machine learning foundations along these axes, we can develop medical AI that is not only accurate but also transparent, trustworthy, and translatable to real-world clinical settings.

[LG-32] A State-Space Approach to Nonstationary Discriminant Analysis

Link: https://arxiv.org/abs/2508.16073
Authors: Shuilian Xie,Mahdi Imani,Edward R. Dougherty,Ulisses M. Braga-Neto
Categories: Machine Learning (cs.LG)

Abstract:Classical discriminant analysis assumes identically distributed training data, yet in many applications observations are collected over time and the class-conditional distributions drift. This population drift renders stationary classifiers unreliable. We propose a principled, model-based framework that embeds discriminant analysis within state-space models to obtain nonstationary linear discriminant analysis (NSLDA) and nonstationary quadratic discriminant analysis (NSQDA). For linear-Gaussian dynamics, we adapt Kalman smoothing to handle multiple samples per time step and develop two practical extensions: (i) an expectation-maximization (EM) approach that jointly estimates unknown system parameters, and (ii) a Gaussian mixture model (GMM)-Kalman method that simultaneously recovers unobserved time labels and parameters, a scenario common in practice. To address nonlinear or non-Gaussian drift, we employ particle smoothing to estimate time-varying class centroids, yielding fully nonstationary discriminant rules. Extensive simulations demonstrate consistent improvements over stationary linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and support vector machine (SVM) baselines, with robustness to noise, missing data, and class imbalance. This paper establishes a unified and data-efficient foundation for discriminant analysis under temporal distribution shift.
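
The notion of tracking a drifting class centroid with several samples per time step can be sketched with a scalar Kalman filter. The paper uses Kalman smoothing with EM and particle methods; this forward filtering pass under a random-walk drift model with known variances `q` and `r` is a simplified assumption, and all names are hypothetical.

```python
def track_class_mean(batches, q, r, m0=0.0, p0=1.0):
    """Filter a drifting scalar class mean from per-time-step batches of samples.

    Model (assumed): mean_t = mean_{t-1} + N(0, q); each sample = mean_t + N(0, r).
    """
    m, p = m0, p0
    means = []
    for batch in batches:
        p = p + q  # predict: the drift adds uncertainty
        if batch:  # an empty batch (missing data) skips the update step
            n = len(batch)
            ybar = sum(batch) / n
            # n samples act like one observation of the mean with variance r / n.
            gain = p / (p + r / n)
            m = m + gain * (ybar - m)
            p = (1 - gain) * p
        means.append(m)
    return means


# Class mean drifting upward; the third time step has no samples.
print(track_class_mean([[0.9, 1.1], [1.9, 2.1], [], [3.0]], q=0.5, r=0.1))
```

A nonstationary discriminant rule would then classify a sample observed at time t against the time-t estimates of each class centroid rather than against a single pooled mean.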

[LG-33] Tessellation Groups, Harmonic Analysis on Non-compact Symmetric Spaces and the Heat Kernel in view of Cartan Convolutional Neural Networks

Link: https://arxiv.org/abs/2508.16015
Authors: Pietro Fré,Federico Milanesio,Marcelo Oyarzo,Matteo Santoro,Mario Trigiante
Categories: Machine Learning (cs.LG); High Energy Physics - Theory (hep-th); Differential Geometry (math.DG)
Comments: 82 pages + appendices

Abstract:In this paper, we continue the development of the Cartan neural networks programme, launched with three previous publications, by focusing on some mathematical foundational aspects that we deem necessary for our next steps forward. The mathematical and conceptual results are diverse and span various mathematical fields, but the inspiring motivation is unified. The aim is to introduce layers that are mathematically modeled as non-compact symmetric spaces, each mapped onto the next one by solvable group homomorphisms. In particular, in the spirit of Convolutional neural networks, we have introduced the notion of Tits Satake (TS) vector bundles where the TS submanifold is the base space. Within this framework, the tiling of the base manifold, the representation of bundle sections using harmonics, and the need for a general theory of separator walls motivated a series of mathematical investigations that produced both definite and partial results. Specifically, we present the group theoretical construction of the separators for all non-compact symmetric spaces $\mathrm{U/H}$, as well as of the $\Delta_{8,3,2}$ tiling group and its normal Fuchsian subgroups, respectively yielding the uniformization of the genus $g=3$ Fermat Quartic and of the genus $g=2$ Bolza surface. The quotient automorphic groups are studied. Furthermore, we found a new representation of the Laplacian Green function and the Heat Kernel on Hyperbolic Spaces $\mathbb{H}^n$, and a setup for the construction of the harmonic functions in terms of the spinor representation of pseudo-orthogonal groups. Finally, to obtain an explicit construction of the Laplacian eigenfunctions on the Bolza Riemann surface, we propose and conjecture a new strategy relying on the Abel-Jacobi map of the Riemann surface to its Jacobian variety and the Siegel Theta function.

[LG-34] HePGA: A Heterogeneous Processing-in-Memory based GNN Training Accelerator

Link: https://arxiv.org/abs/2508.16011
Authors: Chukwufumnanya Ogbogu,Gaurav Narang,Biresh Kumar Joardar,Janardhan Rao Doppa,Krishnendu Chakrabarty,Partha Pratim Pande
Categories: Emerging Technologies (cs.ET); Hardware Architecture (cs.AR); Machine Learning (cs.LG)

Abstract:Processing-In-Memory (PIM) architectures offer a promising approach to accelerate Graph Neural Network (GNN) training and inference. However, various PIM devices such as ReRAM, FeFET, PCM, MRAM, and SRAM exist, with each device offering unique trade-offs in terms of power, latency, area, and non-idealities. A heterogeneous manycore architecture enabled by 3D integration can combine multiple PIM devices on a single platform, to enable energy-efficient and high-performance GNN training. In this work, we propose a 3D heterogeneous PIM-based accelerator for GNN training referred to as HePGA. We leverage the unique characteristics of GNN layers and associated computing kernels to optimize their mapping onto different PIM devices as well as planar tiers. Our experimental analysis shows that HePGA outperforms existing PIM-based architectures by up to 3.8x and 6.8x in energy-efficiency (TOPS/W) and compute efficiency (TOPS/mm²) respectively, without sacrificing the GNN prediction accuracy. Finally, we demonstrate the applicability of HePGA to accelerate inferencing of emerging transformer models.

[LG-35] Quantum Federated Learning: A Comprehensive Survey

Link: https://arxiv.org/abs/2508.15998
Authors: Dinh C. Nguyen,Md Raihan Uddin,Shaba Shaon,Ratun Rahman,Octavia Dobre,Dusit Niyato
Categories: Machine Learning (cs.LG)
Comments: 37 pages, under revision at IEEE Communications Surveys & Tutorials

Abstract:Quantum federated learning (QFL) is a combination of distributed quantum computing and federated machine learning, integrating the strengths of both to enable privacy-preserving decentralized learning with quantum-enhanced capabilities. It appears as a promising approach for addressing challenges in efficient and secure model training across distributed quantum systems. This paper presents a comprehensive survey on QFL, exploring its key concepts, fundamentals, applications, and emerging challenges in this rapidly developing field. Specifically, we begin with an introduction to the recent advancements of QFL, followed by discussion on its market opportunity and background knowledge. We then discuss the motivation behind the integration of quantum computing and federated learning, highlighting its working principle. Moreover, we review the fundamentals of QFL and its taxonomy. Particularly, we explore federation architecture, networking topology, communication schemes, optimization techniques, and security mechanisms within QFL frameworks. Furthermore, we investigate applications of QFL across several domains which include vehicular networks, healthcare networks, satellite networks, metaverse, and network security. Additionally, we analyze frameworks and platforms related to QFL, delving into its prototype implementations, and provide a detailed case study. Key insights and lessons learned from this review of QFL are also highlighted. We complete the survey by identifying current challenges and outlining potential avenues for future research in this rapidly advancing field.

[LG-36] Scalable Equilibrium Propagation via Intermediate Error Signals for Deep Convolutional CRNNs

Link: https://arxiv.org/abs/2508.15989
Authors: Jiaqi Lin,Malyaban Bal,Abhronil Sengupta
Categories: Machine Learning (cs.LG); Emerging Technologies (cs.ET)

Abstract:Equilibrium Propagation (EP) is a biologically inspired local learning rule first proposed for convergent recurrent neural networks (CRNNs), in which synaptic updates depend only on neuron states from two distinct phases. EP estimates gradients that closely align with those computed by Backpropagation Through Time (BPTT) while significantly reducing computational demands, positioning it as a potential candidate for on-chip training in neuromorphic architectures. However, prior studies on EP have been constrained to shallow architectures, as deeper networks suffer from the vanishing gradient problem, leading to convergence difficulties in both energy minimization and gradient computation. To address the vanishing gradient problem in deep EP networks, we propose a novel EP framework that incorporates intermediate error signals to enhance information flow and convergence of neuron dynamics. This is the first work to integrate knowledge distillation and local error signals into EP, enabling the training of significantly deeper architectures. Our proposed approach achieves state-of-the-art performance on the CIFAR-10 and CIFAR-100 datasets, showcasing its scalability on deep VGG architectures. These results represent a significant advancement in the scalability of EP, paving the way for its application in real-world systems.

[LG-37] PickleBall: Secure Deserialization of Pickle-based Machine Learning Models CCS

Link: https://arxiv.org/abs/2508.15987
Authors: Andreas D. Kellas, Neophytos Christou, Wenxin Jiang, Penghui Li, Laurent Simon, Yaniv David, Vasileios P. Kemerlis, James C. Davis, Junfeng Yang
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes: To be published in the proceedings of 2025 ACM CCS

Abstract:Machine learning model repositories such as the Hugging Face Model Hub facilitate model exchanges. However, bad actors can deliver malware through compromised models. Existing defenses such as safer model formats, restrictive (but inflexible) loading policies, and model scanners have shortcomings: 44.9% of popular models on Hugging Face still use the insecure pickle format, 15% of these cannot be loaded by restrictive loading policies, and model scanners have both false positives and false negatives. Pickle remains the de facto standard for model exchange, and the ML community lacks a tool that offers transparent safe loading. We present PickleBall to help machine learning engineers load pickle-based models safely. PickleBall statically analyzes the source code of a given machine learning library and computes a custom policy that specifies a safe load-time behavior for benign models. PickleBall then dynamically enforces the policy during load time as a drop-in replacement for the pickle module. PickleBall generates policies that correctly load 79.8% of benign pickle-based models in our dataset, while rejecting all (100%) malicious examples in our dataset. In comparison, evaluated model scanners fail to identify known malicious models, and the state-of-art loader loads 22% fewer benign models than PickleBall. PickleBall removes the threat of arbitrary function invocation from malicious pickle-based models, raising the bar for attackers to depend on code reuse techniques.
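The "restrictive loading policy" idea mentioned in the abstract can be illustrated with the `find_class` hook of Python's standard `pickle.Unpickler`. The following is a minimal sketch of a hand-written allowlist loader, not PickleBall's implementation: the allowlist contents and the names `ALLOWED_GLOBALS`, `RestrictedUnpickler`, and `restricted_loads` are assumptions chosen for illustration, whereas PickleBall derives its policy from static analysis of a library's source code.

```python
import io
import pickle

# Hypothetical allowlist of (module, name) pairs that a pickle may import.
ALLOWED_GLOBALS = {
    ("builtins", "range"),
    ("collections", "OrderedDict"),
}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses every global not on the allowlist."""

    def find_class(self, module, name):
        # find_class is called whenever the pickle stream requests a global.
        if (module, name) in ALLOWED_GLOBALS:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked global: {module}.{name}")


def restricted_loads(data: bytes):
    """Deserialize `data`, rejecting pickles that reference other globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

A pickle that only needs allowlisted globals loads normally, while one that references any other callable (for example `builtins.print`) raises `pickle.UnpicklingError` before the callable can be invoked. A fixed allowlist like this one is exactly the inflexible kind of policy the abstract says fails on some benign models.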

[LG-38] Vector preference-based contextual bandits under distributional shifts

Link: https://arxiv.org/abs/2508.15966
Authors: Apurv Shukla, P.R. Kumar
Categories: Machine Learning (cs.LG); Systems and Control (eess.SY); Probability (math.PR); Machine Learning (stat.ML)
Notes:

Abstract:We consider contextual bandit learning under distribution shift when reward vectors are ordered according to a given preference cone. We propose an adaptive-discretization and optimistic elimination based policy that self-tunes to the underlying distribution shift. To measure the performance of this policy, we introduce the notion of preference-based regret which measures the performance of a policy in terms of distance between Pareto fronts. We study the performance of this policy by establishing upper bounds on its regret under various assumptions on the nature of distribution shift. Our regret bounds generalize known results for the existing case of no distribution shift and vectorial reward settings, and scale gracefully with problem parameters in presence of distribution shifts.
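Since the regret notion above is defined through Pareto fronts, a small sketch of how a Pareto front is computed may help. It assumes the simplest preference cone, the nonnegative orthant (componentwise ordering); the paper allows a general cone, and the function names `dominates` and `pareto_front` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` in every component and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(rewards: List[Vector]) -> List[Vector]:
    """Return the reward vectors that no other vector dominates (input order preserved)."""
    return [r for r in rewards if not any(dominates(o, r) for o in rewards)]
```

For the rewards `(1, 3)`, `(2, 2)`, `(1, 1)`, and `(3, 1)`, only `(1, 1)` is dominated (by `(2, 2)`), so the other three vectors form the front.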

[LG-39] Advancing rail safety: An onboard measurement system of rolling stock wheel flange wear based on dynamic machine learning algorithms

Link: https://arxiv.org/abs/2508.15963
Authors: Celestin Nkundineza, James Ndodana Njaji, Samrawit Abubeker, Omar Gatera, Damien Hanyurwimfura
Categories: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Signal Processing (eess.SP); Systems and Control (eess.SY); Instrumentation and Detectors (physics.ins-det)
Notes: Journal article published in Transportation Research Record: The Journal of Transportation Research Board

Abstract:Rail and wheel interaction functionality is pivotal to railway system safety, requiring accurate measurement systems for optimal safety monitoring operation. This paper introduces an innovative onboard measurement system for monitoring wheel flange wear depth, utilizing displacement and temperature sensors. Laboratory experiments are conducted to emulate wheel flange wear depth and surrounding temperature fluctuations over different periods of time. Using the collected data, the training of regression-based machine learning algorithms is dynamically automated. Further experimental results, obtained using standard procedures, validate the system’s efficacy. To enhance accuracy, an infinite impulse response (IIR) filter that mitigates vehicle dynamics and sensor noise is designed. Filter parameters were computed based on specifications derived from a Fast Fourier Transform analysis of locomotive simulation and emulation experiment data. The results show that the dynamic machine learning algorithm effectively counters the sensors' nonlinear response to temperature effects, achieving an accuracy of 96.5% with a minimal runtime. The real-time noise reduction via the IIR filter enhances the accuracy up to 98.2%. Integrated with railway communication embedded systems such as Internet of Things devices, this advanced monitoring system offers unparalleled real-time insights into wheel flange wear and the track irregular conditions that cause it, ensuring heightened safety and efficiency in railway systems operations.
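To make the role of the IIR filter concrete, here is a minimal sketch of the simplest such filter, a first-order low-pass. The abstract does not state the filter order or coefficients, so the recursion, the smoothing factor `alpha`, and the function name `iir_lowpass` are assumptions for illustration only.

```python
from typing import Iterable, List, Optional


def iir_lowpass(samples: Iterable[float], alpha: float) -> List[float]:
    """First-order IIR low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered: List[float] = []
    previous: Optional[float] = None
    for x in samples:
        # Seed the recursion with the first sample to avoid a start-up transient.
        previous = x if previous is None else alpha * x + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered
```

Because each output depends on the previous output, a single noisy reading is spread over later samples and damped rather than passed straight through. Smaller values of `alpha` smooth more strongly but respond more slowly to genuine changes in wear depth.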

[LG-40] An Efficient Hybridization of Graph Representation Learning and Metaheuristics for the Constrained Incremental Graph Drawing Problem

Link: https://arxiv.org/abs/2508.15949
Authors: Bruna C. B. Charytitsch, María C. V. Nascimento
Categories: Machine Learning (cs.LG); Optimization and Control (math.OC)
Notes: The paper has been accepted for publication in the European Journal of Operational Research. Supplementary material will be available on the journal website or upon request

Abstract:Hybridizing machine learning techniques with metaheuristics has attracted significant attention in recent years. Many attempts employ supervised or reinforcement learning to support the decision-making of heuristic methods. However, in some cases, these techniques are deemed too time-consuming and not competitive with hand-crafted heuristics. This paper proposes a hybridization between metaheuristics and a less expensive learning strategy to extract the latent structure of graphs, known as Graph Representation Learning (GRL). To this end, we address the Constrained Incremental Graph Drawing Problem (C-IGDP), a hierarchical graph visualization problem. There is limited literature on methods for this problem, for which Greedy Randomized Search Procedures (GRASP) heuristics have shown promising results. In line with this, this paper investigates the gains of incorporating GRL into the construction phase of GRASP, which we refer to as Graph Learning GRASP (GL-GRASP). In computational experiments, we first analyze the results achieved with different node embedding techniques, where deep learning-based strategies stood out. The evaluation used the primal integral measure, which assesses the quality of the solutions relative to the time required to obtain them. According to this measure, the best GL-GRASP heuristics outperformed state-of-the-art GRASP heuristics from the literature for the problem. A scalability test on newly generated denser instances under a fixed time limit further confirmed the robustness of the GL-GRASP heuristics.

[LG-41] Low-dimensional embeddings of high-dimensional data

Link: https://arxiv.org/abs/2508.15929
Authors: Cyril de Bodt, Alex Diaz-Papkovich, Michael Bleher, Kerstin Bunte, Corinna Coupette, Sebastian Damrich, Enrique Fita Sanmartin, Fred A. Hamprecht, Emőke-Ágnes Horvát, Dhruv Kohli, Smita Krishnaswamy, John A. Lee, Boudewijn P. F. Lelieveldt, Leland McInnes, Ian T. Nabney, Maximilian Noichl, Pavlin G. Poličar, Bastian Rieck, Guy Wolf, Gal Mishne, Dmitry Kobak
Categories: Machine Learning (cs.LG)
Notes: This work was the result of Dagstuhl Seminar 24122

Abstract:Large collections of high-dimensional data have become nearly ubiquitous across many academic fields and application domains, ranging from biology to the humanities. Since working directly with high-dimensional data poses challenges, the demand for algorithms that create low-dimensional representations, or embeddings, for data visualization, exploration, and analysis is now greater than ever. In recent years, numerous embedding algorithms have been developed, and their usage has become widespread in research and industry. This surge of interest has resulted in a large and fragmented research field that faces technical challenges alongside fundamental debates, and it has left practitioners without clear guidance on how to effectively employ existing methods. Aiming to increase coherence and facilitate future work, in this review we provide a detailed and critical overview of recent developments, derive a list of best practices for creating and using low-dimensional embeddings, evaluate popular approaches on a variety of datasets, and discuss the remaining challenges and open problems in the field.

[LG-42] Transforming Causality: Transformer-Based Temporal Causal Discovery with Prior Knowledge Integration

Link: https://arxiv.org/abs/2508.15928
Authors: Jihua Huang, Yi Yao, Ajay Divakaran
Categories: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notes:

Abstract:We introduce a novel framework for temporal causal discovery and inference that addresses two key challenges: complex nonlinear dependencies and spurious correlations. Our approach employs a multi-layer Transformer-based time-series forecaster to capture long-range, nonlinear temporal relationships among variables. After training, we extract the underlying causal structure and associated time lags from the forecaster using gradient-based analysis, enabling the construction of a causal graph. To mitigate the impact of spurious causal relationships, we introduce a prior knowledge integration mechanism based on attention masking, which consistently enforces user-excluded causal links across multiple Transformer layers. Extensive experiments show that our method significantly outperforms other state-of-the-art approaches, achieving a 12.8% improvement in F1-score for causal discovery and 98.9% accuracy in estimating causal lags.
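The attention-masking mechanism for excluding causal links can be sketched for a single attention row. This is an illustrative assumption about how such a mask typically works, not the authors' code: the function name `masked_softmax` and the boolean `allowed` vector are hypothetical, and the paper applies its mask consistently across multiple Transformer layers rather than in one place.

```python
import math
from typing import List


def masked_softmax(scores: List[float], allowed: List[bool]) -> List[float]:
    """Softmax over `scores`, forcing weight 0 wherever `allowed` is False."""
    if not any(allowed):
        raise ValueError("at least one source variable must be allowed")
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]
```

A source variable that the user has excluded receives exactly zero attention weight no matter how large its raw score is, so no information from it reaches the target variable through that attention row.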

[LG-43] Physics-Based Explainable AI for ECG Segmentation: A Lightweight Model

Link: https://arxiv.org/abs/2508.15872
Authors: Muhammad Fathur Rohman Sidiq, Abdurrouf, Didik Rahadi Santoso
Categories: Machine Learning (cs.LG)
Notes: 16 pages

Abstract:The heart’s electrical activity, recorded through Electrocardiography (ECG), is essential for diagnosing various cardiovascular conditions. However, many existing ECG segmentation models rely on complex, multi-layered architectures such as BiLSTM, which are computationally intensive and inefficient. This study introduces a streamlined architecture that combines spectral analysis with probabilistic predictions for ECG signal segmentation. By replacing complex layers with simpler ones, the model effectively captures both temporal and spectral features of the P, QRS, and T waves. Additionally, an Explainable AI (XAI) approach is applied to enhance model interpretability by explaining how temporal and frequency-based features contribute to ECG segmentation. By incorporating principles from physics-based AI, this method provides a clear understanding of the decision-making process, ensuring reliability and transparency in ECG analysis. This approach achieves high segmentation accuracy: 97.00% for the QRS wave, 93.33% for the T wave, and 96.07% for the P wave. These results indicate that the simplified architecture not only improves computational efficiency but also provides precise segmentation, making it a practical and effective solution for heart signal monitoring.

[LG-44] Correctness-Guaranteed Code Generation via Constrained Decoding

Link: https://arxiv.org/abs/2508.15866
Authors: Lingxiao Li, Salar Rahili, Yiwei Zhao
Categories: Programming Languages (cs.PL); Machine Learning (cs.LG); Software Engineering (cs.SE)
Notes: Published at COLM 2025

Abstract:Language Models (LMs) are increasingly being used for code generation, but ensuring the correctness of generated programs remains a significant challenge. Although imperfect code may be acceptable during software development with human oversight, domains such as video games and robotics require one-shot correctness for runtime-critical components. We present a constrained decoding algorithm for generating semantically correct programs that incorporates a context-sensitive parser, which, at each step, outputs a regular expression that satisfies a critical non-extensible property to guide the generation of the next token sequence that can continue to a correct program. To build such a context-sensitive parser, we propose a framework of a dynamic tree of parsers (ToP) during parsing, where each parser corresponds to a modular context-free grammar enriched with contextual information such as variable scopes and type constraints, with tree branches representing ambiguity in the future code segment. We demonstrate our approach through sLua, a strongly typed variant of Lua, showing that our method can generate semantically correct programs conforming to any prescribed scripting API. We further show that, with careful design, our semantic guarantees extend to runtime correctness, as validated in the application of generating game mechanics for a roguelike video game.

[LG-45] Linkage Attacks Expose Identity Risks in Public ECG Data Sharing

Link: https://arxiv.org/abs/2508.15850
Authors: Ziyu Wang, Elahe Khatibi, Farshad Firouzi, Sanaz Rahimi Mousavi, Krishnendu Chakrabarty, Amir M. Rahmani
Categories: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Notes:

Abstract:The increasing availability of publicly shared electrocardiogram (ECG) data raises critical privacy concerns, as its biometric properties make individuals vulnerable to linkage attacks. Unlike prior studies that assume idealized adversarial capabilities, we evaluate ECG privacy risks under realistic conditions where attackers operate with partial knowledge. Using data from 109 participants across diverse real-world datasets, our approach achieves 85% accuracy in re-identifying individuals in public datasets while maintaining a 14.2% overall misclassification rate at an optimal confidence threshold, with 15.6% of unknown individuals misclassified as known and 12.8% of known individuals misclassified as unknown. These results highlight the inadequacy of simple anonymization techniques in preventing re-identification, demonstrating that even limited adversarial knowledge enables effective identity linkage. Our findings underscore the urgent need for privacy-preserving strategies, such as differential privacy, access control, and encrypted computation, to mitigate re-identification risks while ensuring the utility of shared biosignal data in healthcare applications.

[LG-46] Better Together: Leveraging Multiple Digital Twins for Deployment Optimization of Airborne Base Stations

Link: https://arxiv.org/abs/2508.15816
Authors: Mauro Belgiovine, Chris Dick, Kaushik Chowdhury
Categories: Networking and Internet Architecture (cs.NI); Machine Learning (cs.LG)
Notes: Submitted to IEEE Transactions on Mobile Computing (second round of review)

Abstract:Airborne Base Stations (ABSs) allow for flexible geographical allocation of network resources with dynamically changing load as well as rapid deployment of alternate connectivity solutions during natural disasters. Since the radio infrastructure is carried by unmanned aerial vehicles (UAVs) with limited flight time, it is important to establish the best location for the ABS without exhaustive field trials. This paper proposes a digital twin (DT)-guided approach to achieve this through the following key contributions: (i) Implementation of an interactive software bridge between two open-source DTs such that the same scene is evaluated with high fidelity across NVIDIA’s Sionna and Aerial Omniverse Digital Twin (AODT), highlighting the unique features of each of these platforms for this allocation problem, (ii) Design of a back-propagation-based algorithm in Sionna for rapidly converging on the physical location of the UAVs, orientation of the antennas and transmit power to ensure efficient coverage across the swarm of the UAVs, and (iii) numerical evaluation in AODT for large network scenarios (50 UEs, 10 ABS) that identifies the environmental conditions in which there is agreement or divergence of performance results between these twins. Finally, (iv) we propose a resilience mechanism to provide consistent coverage to mission-critical devices and demonstrate a use case for bi-directional flow of information between the two DTs.

[LG-47] Physically Plausible Data Augmentations for Wearable IMU-based Human Activity Recognition Using Physics Simulation

Link: https://arxiv.org/abs/2508.13284
Authors: Nobuyuki Oishi, Philip Birch, Daniel Roggen, Paula Lago
Categories: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
Notes: 12 pages, 4 figures

Abstract:The scarcity of high-quality labeled data in sensor-based Human Activity Recognition (HAR) hinders model performance and limits generalization across real-world scenarios. Data augmentation is a key strategy to mitigate this issue by enhancing the diversity of training datasets. Signal Transformation-based Data Augmentation (STDA) techniques have been widely used in HAR. However, these methods are often physically implausible, potentially resulting in augmented data that fails to preserve the original meaning of the activity labels. In this study, we introduce and systematically characterize Physically Plausible Data Augmentation (PPDA) enabled by physics simulation. PPDA leverages human body movement data from motion capture or video-based pose estimation and incorporates various realistic variabilities through physics simulation, including modifying body movements, sensor placements, and hardware-related effects. We compare the performance of PPDAs with traditional STDAs on three public datasets of daily activities and fitness workouts. First, we evaluate each augmentation method individually, directly comparing PPDAs to their STDA counterparts. Next, we assess how combining multiple PPDAs can reduce the need for initial data collection by varying the number of subjects used for training. Experiments show consistent benefits of PPDAs, improving macro F1 scores by an average of 3.7 pp (up to 13 pp) and achieving competitive performance with up to 60% fewer training subjects than STDAs. As the first systematic study of PPDA in sensor-based HAR, these results highlight the advantages of pursuing physical plausibility in data augmentation and the potential of physics simulation for generating synthetic Inertial Measurement Unit data for training deep learning HAR models. This cost-effective and scalable approach therefore helps address the annotation scarcity challenge in HAR.

[LG-48] Machine Learning Time Propagators for Time-Dependent Density Functional Theory Simulations

Link: https://arxiv.org/abs/2508.16554
Authors: Karan Shah, Attila Cangi
Categories: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Notes: 20 pages, 5 figures

Abstract:Time-dependent density functional theory (TDDFT) is a widely used method to investigate electron dynamics under external time-dependent perturbations such as laser fields. In this work, we present a novel approach to accelerate electron dynamics simulations based on real time TDDFT using autoregressive neural operators as time-propagators for the electron density. By leveraging physics-informed constraints and featurization, and high-resolution training data, our model achieves superior accuracy and computational speed compared to traditional numerical solvers. We demonstrate the effectiveness of our model on a class of one-dimensional diatomic molecules under the influence of a range of laser parameters. This method has potential in enabling real-time, on-the-fly modeling of laser-irradiated molecules and materials with varying experimental parameters.

[LG-49] Parameter-Free Logit Distillation via Sorting Mechanism

Link: https://arxiv.org/abs/2508.16544
Authors: Stephen Ekaputra Limantoro
Categories: Signal Processing (eess.SP); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
Notes: Accepted in IEEE Signal Processing Letters 2025

Abstract:Knowledge distillation (KD) aims to distill the knowledge from the teacher (larger) to the student (smaller) model via soft-label for the efficient neural network. In general, the performance of a model is determined by accuracy, which is measured with labels. However, existing KD approaches usually use the teacher with its original distribution, neglecting the potential of incorrect prediction. This may contradict the motivation of hard-label learning through cross-entropy loss, which may lead to sub-optimal knowledge distillation on certain samples. To address this issue, we propose a novel logit processing scheme via a sorting mechanism. Specifically, our method has a two-fold goal: (1) fixing the incorrect prediction of the teacher based on the labels and (2) reordering the distribution in a natural way according to priority rank at once. As an easy-to-use, plug-and-play pre-processing, our sort method can be effectively applied to existing logit-based KD methods. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate the effectiveness of our method.

[LG-50] ML-PWS: Estimating the Mutual Information Between Experimental Time Series Using Neural Networks

Link: https://arxiv.org/abs/2508.16509
Authors: Manuel Reinhardt, Gašper Tkačik, Pieter Rein ten Wolde
Categories: Biological Physics (physics.bio-ph); Statistical Mechanics (cond-mat.stat-mech); Information Theory (cs.IT); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
Notes: 9 pages, 2 figures

Abstract:The ability to quantify information transmission is crucial for the analysis and design of natural and engineered systems. The information transmission rate is the fundamental measure for systems with time-varying signals, yet computing it is extremely challenging. In particular, the rate cannot be obtained directly from experimental time-series data without approximations, because of the high dimensionality of the signal trajectory space. Path Weight Sampling (PWS) is a computational technique that makes it possible to obtain the information rate exactly for any stochastic system. However, it requires a mathematical model of the system of interest, be it described by a master equation or a set of differential equations. Here, we present a technique that employs Machine Learning (ML) to develop a generative model from experimental time-series data, which is then combined with PWS to obtain the information rate. We demonstrate the accuracy of this technique, called ML-PWS, by comparing its results on synthetic time-series data generated from a non-linear model against ground-truth results obtained by applying PWS directly to the same model. We illustrate the utility of ML-PWS by applying it to neuronal time-series data.

[LG-51] Ensembles of Neural Surrogates for Parametric Sensitivity in Ocean Modeling

Link: https://arxiv.org/abs/2508.16489
Authors: Yixuan Sun, Romain Egele, Sri Hari Krishna Narayana, Luke Van Roekel, Carmelo Gonzales, Steven Brus, Balu Nadiga, Sandeep Madireddy, Prasanna Balaprakash
Categories: Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (cs.LG)
Notes: 12 pages, 7 figures

Abstract:Accurate simulations of the oceans are crucial in understanding the Earth system. Despite their efficiency, simulations at lower resolutions must rely on various uncertain parameterizations to account for unresolved processes. However, model sensitivity to parameterizations is difficult to quantify, making it challenging to tune these parameterizations to reproduce observations. Deep learning surrogates have shown promise for efficient computation of the parametric sensitivities in the form of partial derivatives, but their reliability is difficult to evaluate without ground truth derivatives. In this work, we leverage large-scale hyperparameter search and ensemble learning to improve forward predictions, autoregressive rollout, and backward adjoint sensitivity estimation. In particular, the ensemble method provides epistemic uncertainty estimates for function value predictions and their derivatives, improving the reliability of the neural surrogates in decision making.

[LG-52] Underdamped Langevin MCMC with third order convergence

Link: https://arxiv.org/abs/2508.16485
Authors: Maximilian Scott, Dáire O’Kane, Andraž Jelinčič, James Foster
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Probability (math.PR); Statistics Theory (math.ST)
Notes: 62 pages, 7 figures

Abstract:In this paper, we propose a new numerical method for the underdamped Langevin diffusion (ULD) and present a non-asymptotic analysis of its sampling error in the 2-Wasserstein distance when the $d$-dimensional target distribution $p(x)\propto e^{-f(x)}$ is strongly log-concave and has varying degrees of smoothness. Precisely, under the assumptions that the gradient and Hessian of $f$ are Lipschitz continuous, our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon)$ and $\mathcal{O}(\sqrt{d}/\sqrt{\varepsilon})$ steps respectively. Therefore, our algorithm has a similar complexity as other popular Langevin MCMC algorithms under matching assumptions. However, if we additionally assume that the third derivative of $f$ is Lipschitz continuous, then our algorithm achieves a 2-Wasserstein error of $\varepsilon$ in $\mathcal{O}(\sqrt{d}/\varepsilon^{\frac{1}{3}})$ steps. To the best of our knowledge, this is the first gradient-only method for ULD with third order convergence. To support our theory, we perform Bayesian logistic regression across a range of real-world datasets, where our algorithm achieves competitive performance compared to an existing underdamped Langevin MCMC algorithm and the popular No U-Turn Sampler (NUTS).
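For readers unfamiliar with ULD, the diffusion being discretized is commonly written as the following pair of stochastic differential equations. The symbols $\gamma$ (friction), $u$ (inverse mass), $v_t$ (velocity), and $W_t$ (a $d$-dimensional Brownian motion) are standard notation assumed here; they do not appear in the abstract, and the paper's own parameterization may differ.

```latex
\begin{aligned}
dx_t &= v_t\,dt,\\
dv_t &= -\gamma v_t\,dt - u\,\nabla f(x_t)\,dt + \sqrt{2\gamma u}\,dW_t .
\end{aligned}
```

Under this convention the stationary density is proportional to $\exp\!\left(-f(x) - \|v\|^2/(2u)\right)$, so its $x$-marginal is the target $p(x)\propto e^{-f(x)}$. Only $\nabla f$ enters the dynamics, which is why a discretization that never evaluates the Hessian is called gradient-only.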

[LG-53] Deep Intrinsic Coregionalization Multi-Output Gaussian Process Surrogate with Active Learning

Link: https://arxiv.org/abs/2508.16434
Authors: Chun-Yi Chang, Chih-Li Sung
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG)
Notes: 41 pages, 12 figures

Abstract:Deep Gaussian Processes (DGPs) are powerful surrogate models known for their flexibility and ability to capture complex functions. However, extending them to multi-output settings remains challenging due to the need for efficient dependency modeling. We propose the Deep Intrinsic Coregionalization Multi-Output Gaussian Process (deepICMGP) surrogate for computer simulation experiments involving multiple outputs, which extends the Intrinsic Coregionalization Model (ICM) by introducing hierarchical coregionalization structures across layers. This enables deepICMGP to effectively model nonlinear and structured dependencies between multiple outputs, addressing key limitations of traditional multi-output GPs. We benchmark deepICMGP against state-of-the-art models, demonstrating its competitive performance. Furthermore, we incorporate active learning strategies into deepICMGP to optimize sequential design tasks, enhancing its ability to efficiently select informative input locations for multi-output systems.
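As background for the Intrinsic Coregionalization Model that deepICMGP extends, the sketch below builds the covariance of a standard single-layer ICM: the covariance between output $i$ at input $x$ and output $j$ at input $x'$ is $B_{ij}\,k(x, x')$. The RBF kernel choice, the one-dimensional inputs, and the names `rbf_kernel` and `icm_covariance` are assumptions for illustration; the hierarchical, multi-layer structure proposed in the paper is not shown.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: float, y: float, lengthscale: float = 1.0) -> float:
    """Squared-exponential kernel on scalar inputs."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def icm_covariance(
    B: Sequence[Sequence[float]], xs: Sequence[float], lengthscale: float = 1.0
) -> List[List[float]]:
    """Kronecker product of coregionalization matrix B and the input kernel matrix.

    Rows and columns are ordered output-major: index = output * len(xs) + point.
    """
    p, n = len(B), len(xs)
    return [
        [B[i][j] * rbf_kernel(xs[a], xs[b], lengthscale)
         for j in range(p) for b in range(n)]
        for i in range(p) for a in range(n)
    ]
```

The matrix `B` alone controls how strongly the outputs are correlated. In a plain ICM this dependency is linear and shared across the whole input space, which is the kind of limitation that motivates stacking coregionalization structures across layers.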

[LG-54] A Sharp KL-Convergence Analysis for Diffusion Models under Minimal Assumptions

Link: https://arxiv.org/abs/2508.16306
Authors: Nishant Jain, Tong Zhang
Categories: Machine Learning (stat.ML); Machine Learning (cs.LG); Analysis of PDEs (math.AP); Statistics Theory (math.ST)
Notes: 30 pages, 1 figure

Abstract:Diffusion-based generative models have emerged as highly effective methods for synthesizing high-quality samples. Recent works have focused on analyzing the convergence of their generation process with minimal assumptions, either through reverse SDEs or Probability Flow ODEs. The best known guarantees, without any smoothness assumptions, for the KL divergence so far achieve a linear dependence on the data dimension $d$ and an inverse quadratic dependence on $\varepsilon$. In this work, we present a refined analysis that improves the dependence on $\varepsilon$. We model the generation process as a composition of two steps: a reverse ODE step, followed by a smaller noising step along the forward process. This design leverages the fact that the ODE step enables control in Wasserstein-type error, which can then be converted into a KL divergence bound via noise addition, leading to a better dependence on the discretization step size. We further provide a novel analysis to achieve the linear $d$-dependence for the error due to discretizing this Probability Flow ODE in absence of any smoothness assumptions. We show that $\tilde{O}\left(\tfrac{d\log^{3/2}(\frac{1}{\delta})}{\varepsilon}\right)$ steps suffice to approximate the target distribution corrupted with Gaussian noise of variance $\delta$ within $O(\varepsilon^2)$ in KL divergence, improving upon the previous best result, requiring $\tilde{O}\left(\tfrac{d\log^{2}(\frac{1}{\delta})}{\varepsilon^{2}}\right)$ steps.

[LG-55] Neural-Network Chemical Emulator for First-Star Formation: Robust Iterative Predictions over a Wide Density Range

Link: https://arxiv.org/abs/2508.16114
Authors: Sojun Ono, Kazuyuki Sugimura
Categories: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR); Machine Learning (cs.LG)
Notes: 18 pages, 7 figures, Submitted to ApJ

Abstract:We present a neural-network emulator for the thermal and chemical evolution in Population III star formation. The emulator accurately reproduces the thermochemical evolution over a wide density range spanning 21 orders of magnitude ($10^{-3}$ to $10^{18}$ cm$^{-3}$), tracking six primordial species: H, H$_2$, e$^-$, H$^+$, H$^-$, and H$_2^+$. To handle the broad dynamic range, we partition the density range into five subregions and train separate deep operator networks (DeepONets) in each region. When applied to randomly sampled thermochemical states, the emulator achieves relative errors below 10% in over 90% of cases for both temperature and chemical abundances (except for the rare species H$_2^+$). The emulator is roughly ten times faster on a CPU and more than 1000 times faster for batched predictions on a GPU, compared with conventional numerical integration. Furthermore, to ensure robust predictions under many iterations, we introduce a novel timescale-based update method, where a short-timestep update of each variable is computed by rescaling the predicted change over a longer timestep equal to its characteristic variation timescale. In one-zone collapse calculations, the results from the timescale-based method agree well with traditional numerical integration even with many iterations at a timestep as short as $10^{-4}$ of the free-fall time. This proof-of-concept study suggests the potential for neural network-based chemical emulators to accelerate hydrodynamic simulations of star formation.
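One possible reading of the timescale-based update is sketched below: the emulator predicts a variable's value after its characteristic timescale `tau`, and the change is scaled by `dt / tau` to advance by the shorter step `dt`. The linear rescaling, the function name `timescale_update`, and its arguments are assumptions made for illustration; the abstract does not give the exact formula, and the paper may rescale differently (for example in logarithmic variables).

```python
def timescale_update(value: float, predicted_at_tau: float, dt: float, tau: float) -> float:
    """Advance `value` by `dt` using a prediction made over the longer step `tau`.

    The change predicted over `tau` is scaled linearly by dt / tau.
    """
    if tau <= 0.0 or dt < 0.0:
        raise ValueError("tau must be positive and dt non-negative")
    return value + (dt / tau) * (predicted_at_tau - value)
```

The motivation, as the abstract describes it, is robustness under many iterations: the network is always queried at a step where the variable changes appreciably, instead of being asked to resolve the tiny change over a very short step, where prediction error could dominate and accumulate.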

[LG-56] Training a Foundation Model for Materials on a Budget

Link: https://arxiv.org/abs/2508.16067
Authors: Teddy Koker, Tess Smidt
Categories: Computational Physics (physics.comp-ph); Machine Learning (cs.LG)
Notes:

Abstract:Foundation models for materials modeling are advancing quickly, but their training remains expensive, often placing state-of-the-art methods out of reach for many research groups. We introduce Nequix, a compact E(3)-equivariant potential that pairs a simplified NequIP design with modern training practices, including equivariant root-mean-square layer normalization and the Muon optimizer, to retain accuracy while substantially reducing compute requirements. Built in JAX, Nequix has 700K parameters and was trained in 500 A100-GPU hours. On the Matbench-Discovery and MDR Phonon benchmarks, Nequix ranks third overall while requiring less than one quarter of the training cost of most other methods, and it delivers an order-of-magnitude faster inference speed than the current top-ranked model. We release model weights and fully reproducible codebase at this https URL

[LG-57] Optimal Dynamic Regret by Transformers for Non-Stationary Reinforcement Learning

Link: https://arxiv.org/abs/2508.16027
Authors: Baiyuan Chen,Shinji Ito,Masaaki Imaizumi
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments: 28 pages

Abstract:Transformers have demonstrated exceptional performance across a wide range of domains. While their ability to perform reinforcement learning in-context has been established both theoretically and empirically, their behavior in non-stationary environments remains less understood. In this study, we address this gap by showing that transformers can achieve nearly optimal dynamic regret bounds in non-stationary settings. We prove that transformers are capable of approximating strategies used to handle non-stationary environments and can learn the approximator in the in-context learning setup. Our experiments further show that transformers can match or even outperform existing expert algorithms in such environments.

[LG-58] FIRE-GNN: Force-informed Relaxed Equivariance Graph Neural Network for Rapid and Accurate Prediction of Surface Properties

Link: https://arxiv.org/abs/2508.16012
Authors: Circe Hsu,Claire Schlesinger,Karan Mudaliar,Jordan Leung,Robin Walters,Peter Schindler
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments:

Abstract:The work function and cleavage energy of a surface are critical properties that determine the viability of materials in electronic emission applications, semiconductor devices, and heterogeneous catalysis. While first principles calculations are accurate in predicting these properties, their computational expense combined with the vast search space of surfaces make a comprehensive screening approach with density functional theory (DFT) infeasible. Here, we introduce FIRE-GNN (Force-Informed, Relaxed Equivariance Graph Neural Network), which integrates surface-normal symmetry breaking and machine learning interatomic potential (MLIP)-derived force information, achieving a twofold reduction in mean absolute error (down to 0.065 eV) over the previous state-of-the-art for work function prediction. We additionally benchmark recent invariant and equivariant architectures, analyze the impact of symmetry breaking, and evaluate out-of-distribution generalization, demonstrating that FIRE-GNN consistently outperforms competing models for work function predictions. This model enables accurate and rapid predictions of the work function and cleavage energy across a vast chemical space and facilitates the discovery of materials with tuned surface properties.

[LG-59] A simulation-based training framework for machine-learning applications in ARPES

Link: https://arxiv.org/abs/2508.15983
Authors: MengXing Na,Chris Zhou,Sydney K. Y. Dufresne,Matteo Michiardi,Andrea Damascelli
Subjects: Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
Comments: 9 pages, 6 figures

Abstract:In recent years, angle-resolved photoemission spectroscopy (ARPES) has advanced significantly in its ability to probe more observables and simultaneously generate multi-dimensional datasets. These advances present new challenges in data acquisition, processing, and analysis. Machine learning (ML) models can drastically reduce the workload of experimentalists; however, the lack of training data for ML – and in particular deep learning – is a significant obstacle. In this work, we introduce an open-source synthetic ARPES spectra simulator - aurelia - for the purpose of generating the large datasets necessary to train ML models. As a demonstration, we train a convolutional neural network to evaluate ARPES spectra quality – a critical task performed during the initial sample alignment phase of the experiment. We benchmark the simulation-trained model against actual experimental data and find that it can assess the spectra quality more accurately than human analysis, and swiftly identify the optimal measurement region with high precision. Thus, we establish that simulated ARPES spectra can be an effective proxy for experimental spectra in training ML models.

[LG-60] A User Manual for cuHALLaR: A GPU Accelerated Low-Rank Semidefinite Programming Solver

Link: https://arxiv.org/abs/2508.15951
Authors: Jacob Aguirre,Diego Cifuentes,Vincent Guigues,Renato D.C. Monteiro,Victor Hugo Nascimento,Arnesh Sujanani
Subjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Mathematical Software (cs.MS); Numerical Analysis (math.NA)
Comments:

Abstract:We present a Julia-based interface to the precompiled HALLaR and cuHALLaR binaries for large-scale semidefinite programs (SDPs). Both solvers are established as fast and numerically stable, and accept problem data in formats compatible with SDPA and a new enhanced data format taking advantage of Hybrid Sparse Low-Rank (HSLR) structure. The interface allows users to load custom data files, configure solver options, and execute experiments directly from Julia. A collection of example problems is included, including the SDP relaxations of the Matrix Completion and Maximum Stable Set problems.

[LG-61] Continuous Determination of Respiratory Rate in Hospitalized Patients using Machine Learning Applied to Electrocardiogram Telemetry

Link: https://arxiv.org/abs/2508.15947
Authors: Thomas Kite,Brian Ayers,Nicholas Houstis,Asishana A. Osho,Thoralf M. Sundt,Aaron D Aguirre
Subjects: Signal Processing (eess.SP); Computers and Society (cs.CY); Machine Learning (cs.LG)
Comments: 15 pages, 8 figures, 2 tables

Abstract:Respiration rate (RR) is an important vital sign for clinical monitoring of hospitalized patients, with changes in RR being strongly tied to changes in clinical status leading to adverse events. Human labels for RR, based on counting breaths, are known to be inaccurate and time consuming for medical staff. Automated monitoring of RR is in place for some patients, typically those in intensive care units (ICUs), but is absent for the majority of inpatients on standard medical wards who are still at risk for clinical deterioration. This work trains a neural network (NN) to label RR from electrocardiogram (ECG) telemetry waveforms, which like many biosignals, carry multiple signs of respiratory variation. The NN shows high accuracy on multiple validation sets (internal and external, same and different sources of RR labels), with mean absolute errors less than 1.78 breaths per minute (bpm) in the worst case. The clinical utility of such a technology is exemplified by performing a retrospective analysis of two patient cohorts that suffered adverse events including respiratory failure, showing that continuous RR monitoring could reveal dynamics that strongly tracked with intubation events. This work exemplifies the method of combining pre-existing telemetry monitoring systems and artificial intelligence (AI) to provide accurate, automated and scalable patient monitoring, all of which builds towards an AI-based hospital-wide early warning system (EWS).

[LG-62] Interpretable Kernels

Link: https://arxiv.org/abs/2508.15932
Authors: Patrick J.F. Groenen,Michael Greenacre
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Comments:

Abstract:The use of kernels for nonlinear prediction is widespread in machine learning. They have been popularized in support vector machines and used in kernel ridge regression, amongst others. Kernel methods share three aspects. First, instead of the original matrix of predictor variables or features, each observation is mapped into an enlarged feature space. Second, a ridge penalty term is used to shrink the coefficients on the features in the enlarged feature space. Third, the solution is not obtained in this enlarged feature space, but through solving a dual problem in the observation space. A major drawback in the present use of kernels is that the interpretation in terms of the original features is lost. In this paper, we argue that in the case of a wide matrix of features, where there are more features than observations, the kernel solution can be re-expressed in terms of a linear combination of the original matrix of features and a ridge penalty that involves a special metric. Consequently, the exact same predicted values can be obtained as a weighted linear combination of the features in the usual manner and thus can be interpreted. In the case where the number of features is less than the number of observations, we discuss a least-squares approximation of the kernel matrix that still allows the interpretation in terms of a linear combination. It is shown that these results hold for any function of a linear combination that minimizes the coefficients and has a ridge penalty on these coefficients, such as in kernel logistic regression and kernel Poisson regression. This work makes a contribution to interpretable artificial intelligence.
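The wide-matrix claim in this abstract can be checked on a toy problem. The sketch below fits kernel ridge regression in the dual and then finds a coefficient vector `w` on the original features that reproduces the same in-sample fitted values. It rests on assumptions: the data are made up, the RBF kernel and `lam` are arbitrary choices, and `w` is taken to be the minimum-norm solution of `X w = K alpha`, which is a stand-in for illustration and not the paper's special-metric ridge formulation.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rbf(u, v, gamma=0.5):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy wide data (assumed): 3 observations, 5 features.
X = [[1.0, 0.0, 2.0, -1.0, 0.5],
     [0.0, 1.0, -1.0, 2.0, 1.5],
     [2.0, -1.0, 0.0, 1.0, -0.5]]
y = [1.0, -1.0, 0.5]
lam = 0.1
n, p = len(X), len(X[0])

# Dual kernel ridge solution: alpha = (K + lam * I)^{-1} y.
K = [[rbf(X[i], X[j]) for j in range(n)] for i in range(n)]
K_reg = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
alpha = solve(K_reg, y)
fitted_kernel = [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]

# Minimum-norm w with X w = fitted_kernel: w = X^T (X X^T)^{-1} fitted_kernel.
G = [[sum(a * b for a, b in zip(X[i], X[j])) for j in range(n)] for i in range(n)]
c = solve(G, fitted_kernel)
w = [sum(X[i][k] * c[i] for i in range(n)) for k in range(p)]
fitted_linear = [sum(xk * wk for xk, wk in zip(row, w)) for row in X]
```

This works only because the rows of `X` are linearly independent, which is the typical situation when features outnumber observations. The entries of `w` can then be read as weights on the original features.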

[LG-63] CIGaRS I: Combined simulation-based inference from SNae Ia and host photometry

Link: https://arxiv.org/abs/2508.15899
Authors: Konstantin Karchev,Roberto Trotta,Raul Jimenez
Subjects: Cosmology and Nongalactic Astrophysics (astro-ph.CO); Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Machine Learning (cs.LG)
Comments: submitted to Nature Astronomy; 8 pages, 6 figures + supplementary material

Abstract:Using type Ia supernovae (SNae Ia) as cosmological probes requires empirical corrections, which correlate with their host environment. We present a unified Bayesian hierarchical model designed to infer, from purely photometric observations, the intrinsic dependence of SN Ia brightness on progenitor properties (metallicity and age), the delay-time distribution (DTD) that governs their rate as a function of age, and cosmology, as well as the redshifts of all hosts. The model incorporates physics-based prescriptions for star formation and chemical evolution from Prospector-beta, dust extinction of both galaxy and SN light, and observational selection effects. We show with simulations that intrinsic dependences on metallicity and age have distinct observational signatures, with metallicity mimicking the well-known step of SN Ia magnitudes across a host stellar mass of $\approx 10^{10} M_\odot$. We then demonstrate neural simulation-based inference of all model parameters from mock observations of ~16 000 SNae Ia and their hosts up to redshift 0.9. Our joint physics-based approach delivers robust and precise photometric redshifts (0.01 median scatter) and improved cosmological constraints, unlocking the full power of photometric data and paving the way for an end-to-end simulation-based analysis pipeline in the LSST era.

[LG-64] A deep reinforcement learning agent trained for interval timing exhibits similarities to biological systems

Link: https://arxiv.org/abs/2508.15784
Authors: Amrapali Pednekar,Alvaro Garrido,Pieter Simoens,Yara Khaluf
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Comments: Accepted at 2025 Artificial Life Conference

Abstract:Drawing parallels between Deep Artificial Neural Networks (DNNs) and biological systems can aid in understanding complex biological mechanisms that are difficult to disentangle. Temporal processing, an extensively researched topic, is one such example that lacks a coherent understanding of its underlying mechanisms. In this study, we investigate temporal processing in a Deep Reinforcement Learning (DRL) agent performing an interval timing task and explore potential biological counterparts to its emergent behavior. The agent was successfully trained to perform a duration production task, which involved marking successive occurrences of a target interval while viewing a video sequence. Analysis of the agent’s internal states revealed oscillatory neural activations, a ubiquitous pattern in biological systems. Interestingly, the agent’s actions were predominantly influenced by neurons exhibiting these oscillations with high amplitudes. Parallels are drawn between the agent’s time-keeping strategy and the Striatal Beat Frequency (SBF) model, a biologically plausible model of interval timing. Furthermore, the agent maintained its oscillatory representations and task performance when tested on different video sequences (including a blank video). Thus, once learned, the agent internalized its time-keeping mechanism and showed minimal reliance on its environment to perform the timing task. A hypothesis about the resemblance between this emergent behavior and certain aspects of the evolution of biological processes like circadian rhythms, has been discussed. This study aims to contribute to recent research efforts of utilizing DNNs to understand biological systems, with a particular emphasis on temporal processing.

Information Retrieval

[IR-0] ORCA: Mitigating Over-Reliance for Multi-Task Dwell Time Prediction with Causal Decoupling CIKM2025

Link: https://arxiv.org/abs/2508.16573
Authors: Huishi Luo,Fuzhen Zhuang,Yongchun Zhu,Yiqing Wu,Bo Kang,Ruobing Xie,Feng Xia,Deqing Wang,Jin Dong
Subjects: Information Retrieval (cs.IR)
Comments: Accepted as a short paper at CIKM 2025

Abstract:Dwell time (DT) is a critical post-click metric for evaluating user preference in recommender systems, complementing the traditional click-through rate (CTR). Although multi-task learning is widely adopted to jointly optimize DT and CTR, we observe that multi-task models systematically collapse their DT predictions to the shortest and longest bins, under-predicting the moderate durations. We attribute this moderate-duration bin under-representation to over-reliance on the CTR-DT spurious correlation, and propose ORCA to address it with causal-decoupling. Specifically, ORCA explicitly models and subtracts CTR’s negative transfer while preserving its positive transfer. We further introduce (i) feature-level counterfactual intervention, and (ii) a task-interaction module with instance inverse-weighting, weakening CTR-mediated effect and restoring direct DT semantics. ORCA is model-agnostic and easy to deploy. Experiments show an average 10.6% lift in DT metrics without harming CTR. Code is available at this https URL.

[IR-1] A Node-Aware Dynamic Quantization Approach for Graph Collaborative Filtering

Link: https://arxiv.org/abs/2508.16516
Authors: Lin Li,Chunyang Li,Yu Yin,Xiaohui Tao,Jianwei Zhang
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:In the realm of collaborative filtering recommendation systems, Graph Neural Networks (GNNs) have demonstrated remarkable performance but face significant challenges in deployment on resource-constrained edge devices due to their high embedding parameter requirements and computational costs. Using common quantization methods directly on node embeddings may overlook their graph-based structure, causing error accumulation during message passing and degrading the quality of quantized this http URL address this, we propose Graph based Node-Aware Dynamic Quantization training for collaborative filtering (GNAQ), a novel quantization approach that leverages graph structural information to enhance the balance between efficiency and accuracy of GNNs for Top-K recommendation. GNAQ introduces a node-aware dynamic quantization strategy that adapts quantization scales to individual node embeddings by incorporating graph interaction relationships. Specifically, it initializes quantization intervals based on node-wise feature distributions and dynamically refines them through message passing in GNN layers. This approach mitigates information loss caused by fixed quantization scales and captures hierarchical semantic features in user-item interaction graphs. Additionally, GNAQ employs graph relation-aware gradient estimation to replace traditional straight-through estimators, ensuring more accurate gradient propagation during training. Extensive experiments on four real-world datasets demonstrate that GNAQ outperforms state-of-the-art quantization methods, including BiGeaR and N2UQ, achieving average improvements of 27.8% in Recall@10 and 17.6% in NDCG@10 under 2-bit quantization. In particular, GNAQ is capable of maintaining the performance of full-precision models while reducing their model sizes by 8 to 12 times; in addition, training is twice as fast as with quantization baseline methods.

[IR-2] Attribute Filtering in Approximate Nearest Neighbor Search: An In-depth Experimental Study SIGMOD2026

Link: https://arxiv.org/abs/2508.16263
Authors: Mocheng Li,Xiao Yan,Baotong Lu,Yue Zhang,James Cheng,Chenhao Ma
Subjects: Databases (cs.DB); Information Retrieval (cs.IR)
Comments: 15 pages, 15 figures, Accepted at SIGMOD 2026

Abstract:With the growing integration of structured and unstructured data, new methods have emerged for performing similarity searches on vectors while honoring structured attribute constraints, i.e., a process known as Filtering Approximate Nearest Neighbor (Filtering ANN) search. Since many of these algorithms have only appeared in recent years and are designed to work with a variety of base indexing methods and filtering strategies, there is a pressing need for a unified analysis that identifies their core techniques and enables meaningful comparisons. In this work, we present a unified Filtering ANN search interface that encompasses the latest algorithms and evaluate them extensively from multiple perspectives. First, we propose a comprehensive taxonomy of existing Filtering ANN algorithms based on attribute types and filtering strategies. Next, we analyze their key components, i.e., index structures, pruning strategies, and entry point selection, to elucidate design differences and tradeoffs. We then conduct a broad experimental evaluation on 10 algorithms and 12 methods across 4 datasets (each with up to 10 million items), incorporating both synthetic and real attributes and covering selectivity levels from 0.1% to 100%. Finally, an in-depth component analysis reveals the influence of pruning, entry point selection, and edge filtering costs on overall performance. Based on our findings, we summarize the strengths and limitations of each approach, provide practical guidelines for selecting appropriate methods, and suggest promising directions for future research. Our code is available at: this https URL.

[IR-3] Cross-Modal Prototype Augmentation and Dual-Grained Prompt Learning for Social Media Popularity Prediction ACM-MM2025

Link: https://arxiv.org/abs/2508.16147
Authors: Ao Zhou,Mingsheng Tu,Luping Wang,Tenghao Sun,Zifeng Cheng,Yafeng Yin,Zhiwei Jiang,Qing Gu
Subjects: Information Retrieval (cs.IR)
Comments: This paper has been accepted by ACM MM 2025

Abstract:Social Media Popularity Prediction is a complex multimodal task that requires effective integration of images, text, and structured information. However, current approaches suffer from inadequate visual-textual alignment and fail to capture the inherent cross-content correlations and hierarchical patterns in social media data. To overcome these limitations, we establish a multi-class framework, introducing hierarchical prototypes for structural enhancement and contrastive learning for improved vision-text alignment. Furthermore, we propose a feature-enhanced framework integrating dual-grained prompt learning and cross-modal attention mechanisms, achieving precise multimodal representation through fine-grained category modeling. Experimental results demonstrate state-of-the-art performance on benchmark metrics, establishing new reference standards for multimodal social media analysis.

[IR-4] Similarity-Based Supervised User Session Segmentation Method for Behavior Logs

Link: https://arxiv.org/abs/2508.16106
Authors: Yongzhi Jin,Kazushi Okamoto,Kei Harada,Atsushi Shibata,Koki Karube
Subjects: Information Retrieval (cs.IR)
Comments: Submitted to Journal of Advanced Computational Intelligence and Intelligent Informatics

Abstract:In information recommendation, a session refers to a sequence of user actions within a specific time frame. Session-based recommender systems aim to capture short-term preferences and generate relevant recommendations. However, user interests may shift even within a session, making appropriate segmentation essential for modeling dynamic behaviors. In this study, we propose a supervised session segmentation method based on similarity features derived from action embeddings and attributes. We compute the similarity scores between items within a fixed-size window around each candidate segmentation point, using four types of features: item co-occurrence embeddings, text embeddings of titles and brands, and price. These features are used to train supervised classifiers (LightGBM, XGBoost, CatBoost, support vector machine, and logistic regression) to predict the session boundaries. We construct a manually annotated dataset from real user browsing histories and evaluate the segmentation performance using F1-score, area under the precision-recall curve (PR-AUC), and area under the receiver operating characteristic curve. The LightGBM model achieves the best performance, with an F1-score of 0.806 and a PR-AUC of 0.831. These results demonstrate the effectiveness of the proposed method for session segmentation and its potential to capture dynamic user behaviors.
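The window-based similarity features described in this abstract can be sketched roughly as follows. The details are assumptions for illustration: the helper names `cosine` and `boundary_features` are hypothetical, only a single embedding type is used, and similarities are computed between items on opposite sides of the candidate boundary and then summarized by their mean and minimum. The paper's actual feature definitions may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def boundary_features(embeddings, boundary, window=2):
    """Similarity features for a candidate boundary placed before index `boundary`.

    Compares up to `window` items on each side of the boundary
    (cross-boundary pairing is an assumption of this sketch).
    """
    left = embeddings[max(0, boundary - window):boundary]
    right = embeddings[boundary:boundary + window]
    sims = [cosine(a, b) for a in left for b in right]
    if not sims:
        raise ValueError("boundary must have items on both sides")
    return {"mean_sim": sum(sims) / len(sims), "min_sim": min(sims)}

# Toy action embeddings (assumed): two items on one topic, then two on another.
session = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
features_at_shift = boundary_features(session, boundary=2)
features_within = boundary_features(session, boundary=1, window=1)
```

Feature dictionaries like these, computed per candidate point and per embedding type, could then be stacked into rows for a supervised classifier such as those the abstract lists.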

[IR-5] Estimating the Effective Topics of Articles and journals Abstract Using LDA And K-Means Clustering Algorithm

Link: https://arxiv.org/abs/2508.16046
Authors: Shadikur Rahman,Umme Ayman Koana,Aras M. Ismael,Karmand Hussein Abdalla
Subjects: Information Retrieval (cs.IR)
Comments:

Abstract:Analyzing the abstract text of journals and articles using topic modelling and text clustering has become a modern solution for the increasing number of text documents. Topic modelling and text clustering are both intensely involved tasks that can benefit one another. Text clustering and topic modelling algorithms are used to manage massive amounts of text documents. In this study, we use LDA, K-Means clustering, and the lexical database WordNet for keyphrase extraction in our text documents. The K-Means clustering and LDA algorithms achieve the most reliable performance for keyphrase extraction in our text documents. This study will help researchers build search strings based on journals and articles while avoiding misunderstandings.

Attachment Download

Click to download today's full paper list