本篇博文主要内容为 2025-08-19 从Arxiv.org论文网站获取的最新论文列表,自动更新,按照NLP、CV、ML、AI、IR五个大方向区分,若需要邮件定时接收,请在评论区留下你的邮箱号。
说明:每日论文数据从Arxiv.org获取,每天早上12:00左右定时自动更新。
友情提示: 如何您需要邮箱接收每日论文数据,请在评论处留下你的邮箱。
目录
概览 (2025-08-19)
今日共更新800篇论文,其中:
- 自然语言处理共111篇(Computation and Language (cs.CL))
- 人工智能共249篇(Artificial Intelligence (cs.AI))
- 计算机视觉共189篇(Computer Vision and Pattern Recognition (cs.CV))
- 机器学习共217篇(Machine Learning (cs.LG))
自然语言处理
[NLP-0] RepreGuard: Detecting LLM -Generated Text by Revealing Hidden Representation Patterns ACL2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)生成文本(LLM-generated text, LGT)检测方法在分布外(out-of-distribution, OOD)场景下鲁棒性不足的问题。现有检测方法虽在分布内(in-distribution, ID)表现良好,但难以应对真实世界中多样化的文本生成模式和对抗攻击。解决方案的关键在于利用LLM内部表示(internal representations)中蕴含的更全面、原始的特征,这些特征能够更有效地捕捉LGT与人类写作文本(human-written text, HWT)之间的统计差异。作者提出RepreGuard方法,通过一个代理模型提取LGT和HWT的激活特征,并计算文本表示沿该特征方向的投影得分,结合预设阈值进行分类。实验表明,RepreGuard在ID和OOD场景下平均AUROC达94.92%,且对不同文本长度和主流攻击具有强鲁棒性。
链接: https://arxiv.org/abs/2508.13152
作者: Xin Chen,Junchao Wu,Shu Yang,Runzhe Zhan,Zeyu Wu,Ziyang Luo,Di Wang,Min Yang,Lidia S. Chao,Derek F. Wong
机构: NLP2CT Lab, Department of Computer and Information Science, University of Macau (澳门大学计算机与信息科学系); Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences (中国科学院深圳先进技术研究院); Provable Responsible AI and Data Analytics Lab, KAUST (KAUST可证明负责任人工智能与数据分析实验室); Hong Kong Baptist University (香港浸会大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted to TACL 2025. This version is a pre-MIT Press publication version
Abstract:Detecting content generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. Although existing detection methods perform well, their robustness in out-of-distribution (OOD) scenarios is still lacking. In this paper, we hypothesize that, compared to features used by existing detection methods, the internal representations of LLMs contain more comprehensive and raw features that can more effectively capture and distinguish the statistical pattern differences between LLM-generated texts (LGT) and human-written texts (HWT). We validated this hypothesis across different LLMs and observed significant differences in neural activation patterns when processing these two types of texts. Based on this, we propose RepreGuard, an efficient statistics-based detection method. Specifically, we first employ a surrogate model to collect representation of LGT and HWT, and extract the distinct activation feature that can better identify LGT. We can classify the text by calculating the projection score of the text representations along this feature direction and comparing with a precomputed threshold. Experimental results show that RepreGuard outperforms all baselines with average 94.92% AUROC on both in-distribution (ID) and OOD scenarios, while also demonstrating robust resilience to various text sizes and mainstream attacks. Data and code are publicly available at: this https URL
zh
[NLP-1] Signal and Noise: A Framework for Reducing Uncertainty in Language Model Evaluation
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)开发过程中,由于训练成本高昂而需依赖小规模实验进行决策时,现有评估基准(benchmark)可靠性不足的问题。其核心挑战在于:许多基准在小样本条件下无法有效区分模型优劣(信号弱),或对训练过程中的随机波动过于敏感(噪声高),从而导致错误的优化方向和不准确的缩放定律(scaling law)预测。解决方案的关键在于引入两个量化指标——“信号”(signal)与“噪声”(noise),并提出三项可操作的干预策略:一是采用更具区分度和稳定性的评估指标(如困惑度 perplexity 替代准确率 accuracy);二是过滤掉噪声较大的子任务以提升整体信号-噪声比;三是通过平均模型中间检查点的输出来降低随机性影响。这些方法显著提升了基准在小规模实验下的决策可靠性及缩放定律预测精度,为构建高质量评估体系提供了实证依据和实践路径。
链接: https://arxiv.org/abs/2508.13144
作者: David Heineman,Valentin Hofmann,Ian Magnusson,Yuling Gu,Noah A. Smith,Hannaneh Hajishirzi,Kyle Lo,Jesse Dodge
机构: Allen Institute for Artificial Intelligence (艾伦人工智能研究所); Paul G. Allen School of Computer Science & Engineering, University of Washington (华盛顿大学保罗·G·艾伦计算机科学与工程学院)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:Developing large language models is expensive and involves making decisions with small experiments, typically by evaluating on large, multi-task evaluation suites. In this work, we analyze specific properties which make a benchmark more reliable for such decisions, and interventions to design higher-quality evaluation benchmarks. We introduce two key metrics that show differences in current benchmarks: signal, a benchmark’s ability to separate better models from worse models, and noise, a benchmark’s sensitivity to random variability between training steps. We demonstrate that benchmarks with a better signal-to-noise ratio are more reliable when making decisions at small scale, and those with less noise have lower scaling law prediction error. These results suggest that improving signal or noise will lead to more useful benchmarks, so we introduce three interventions designed to directly affect signal or noise. For example, we propose that switching to a metric that has better signal and noise (e.g., perplexity rather than accuracy) leads to better reliability and improved scaling law error. We also find that filtering noisy subtasks, to improve an aggregate signal-to-noise ratio, leads to more reliable multi-task evaluations. We also find that averaging the output of a model’s intermediate checkpoints to reduce noise leads to consistent improvements. We conclude by recommending that those creating new benchmarks, or selecting which existing benchmarks to use, aim for high signal and low noise. We use 30 benchmarks for these experiments, and 375 open-weight language models from 60M to 32B parameters, resulting in a new, publicly available dataset of 900K evaluation benchmark results, totaling 200M instances.
zh
[NLP-2] Has GPT -5 Achieved Spatial Intelligence? An Empirical Study
【速读】: 该论文旨在解决当前多模态模型在空间理解与推理能力上的显著局限性问题,而这一能力被认为是实现通用人工智能(Artificial General Intelligence, AGI)的关键基础。其解决方案的关键在于构建一个统一的空间任务分类体系(taxonomy of spatial tasks),整合现有评估基准并提出公平评测的挑战;在此基础上,通过大规模实证研究(使用超过十亿token成本)对主流商业及开源模型进行系统评估,从而揭示GPT-5等前沿模型在空间智能方面的突破与不足,并识别出当前多模态模型最困难的任务类型,同时发现商业模型在极端挑战下并无决定性优势。
链接: https://arxiv.org/abs/2508.13142
作者: Zhongang Cai,Yubo Wang,Qingping Sun,Ruisi Wang,Chenyang Gu,Wanqi Yin,Zhiqian Lin,Zhitao Yang,Chen Wei,Xuanke Shi,Kewang Deng,Xiaoyang Han,Zukai Chen,Jiaqi Li,Xiangyu Fan,Hanming Deng,Lewei Lu,Bo Li,Ziwei Liu,Quan Wang,Dahua Lin,Lei Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)
备注:
Abstract:Multi-modal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, which are fundamental capabilities to achieving artificial general intelligence. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models stand on the path toward spatial intelligence. First, we propose a comprehensive taxonomy of spatial tasks that unifies existing benchmarks and discuss the challenges in ensuring fair evaluation. We then evaluate state-of-the-art proprietary and open-source models on eight key benchmarks, at a cost exceeding one billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence, yet (2) still falls short of human performance across a broad spectrum of tasks. Moreover, we (3) identify the more challenging spatial intelligence problems for multi-modal models, and (4) proprietary models do not exhibit a decisive advantage when facing the most difficult problems. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multi-modal models.
zh
[NLP-3] OptimalThinkingBench: Evaluating Over and Underthinking in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在推理过程中存在的“过度思考”(overthinking)与“思考不足”(underthinking)问题,即思考型模型在简单任务上消耗过多计算资源却无性能提升,而非思考型模型则在复杂推理任务中表现欠佳。其解决方案的关键在于提出一个统一的评估基准 OptimalThinkingBench,该基准包含两个子基准:OverthinkingBench(涵盖72个简单领域的问题)和 UnderthinkingBench(包含11个高难度推理任务),并通过新颖的“思维调整准确率”(thinking-adjusted accuracy)指标对33种不同模型进行系统评测,揭示当前模型均无法实现最优思考——既不过度也不不足。研究进一步探索多种促进最优思考的方法,但发现这些方法常在改善某一子基准表现的同时损害另一子基准,凸显了未来亟需开发能够平衡性能与效率的统一优化模型。
链接: https://arxiv.org/abs/2508.13141
作者: Pranjal Aggarwal,Seungone Kim,Jack Lanchantin,Sean Welleck,Jason Weston,Ilia Kulikov,Swarnadeep Saha
机构: Meta(元)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 26 pages, 6 tables, 10 figures
Abstract:Thinking LLMs solve complex tasks at the expense of increased compute and overthinking on simpler problems, while non-thinking LLMs are faster and cheaper but underthink on harder reasoning problems. This has led to the development of separate thinking and non-thinking LLM variants, leaving the onus of selecting the optimal model for each query on the end user. In this work, we introduce OptimalThinkingBench, a unified benchmark that jointly evaluates overthinking and underthinking in LLMs and also encourages the development of optimally-thinking models that balance performance and efficiency. Our benchmark comprises two sub-benchmarks: OverthinkingBench, featuring simple queries in 72 domains, and UnderthinkingBench, containing 11 challenging reasoning tasks. Using novel thinking-adjusted accuracy metrics, we perform extensive evaluation of 33 different thinking and non-thinking models and show that no model is able to optimally think on our benchmark. Thinking models often overthink for hundreds of tokens on the simplest user queries without improving performance. In contrast, large non-thinking models underthink, often falling short of much smaller thinking models. We further explore several methods to encourage optimal thinking, but find that these approaches often improve on one sub-benchmark at the expense of the other, highlighting the need for better unified and optimal models in the future.
zh
[NLP-4] Improving Detection of Watermarked Language Models
【速读】: 该论文旨在解决生成式 AI (Generative AI) 中因语言模型后训练(如指令微调或基于人类反馈的强化学习 RLHF)导致熵值受限,从而使得仅依赖水印检测难以准确识别模型生成文本的问题。其解决方案的关键在于提出多种混合检测策略,将水印检测器与非水印检测器相结合,在广泛的实验条件下实现了对生成文本检测性能的提升。
链接: https://arxiv.org/abs/2508.13131
作者: Dara Bahri,John Wieting
机构: Google DeepMind(谷歌深度思维)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Machine Learning (stat.ML)
备注:
Abstract:Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.
zh
[NLP-5] MuDRiC: Multi-Dialect Reasoning for Arabic Commonsense Validation
【速读】: 该论文旨在解决阿拉伯语中常识验证(commonsense validation)任务研究不足的问题,尤其是现有资源主要集中在现代标准阿拉伯语(Modern Standard Arabic, MSA),而对广泛使用的地区方言关注匮乏。解决方案的关键在于:首先构建了MuDRiC数据集,这是首个包含多种阿拉伯语方言的常识推理数据集;其次提出了一种基于图卷积网络(Graph Convolutional Networks, GCNs)的新型方法,通过增强语义关系建模来提升阿拉伯语常识验证性能。实验表明,该方法在处理阿拉伯语复杂语言变体方面具有显著优势。
链接: https://arxiv.org/abs/2508.13130
作者: Kareem Elozeiri,Mervat Abassy,Preslav Nakov,Yuxia Wang
机构: Mohamed bin Zayed University of Artificial Intelligence (穆罕默德·本·扎耶德人工智能大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Commonsense validation evaluates whether a sentence aligns with everyday human understanding, a critical capability for developing robust natural language understanding systems. While substantial progress has been made in English, the task remains underexplored in Arabic, particularly given its rich linguistic diversity. Existing Arabic resources have primarily focused on Modern Standard Arabic (MSA), leaving regional dialects underrepresented despite their prevalence in spoken contexts. To bridge this gap, we present two key contributions: (i) we introduce MuDRiC, an extended Arabic commonsense dataset incorporating multiple dialects, and (ii) a novel method adapting Graph Convolutional Networks (GCNs) to Arabic commonsense reasoning, which enhances semantic relationship modeling for improved commonsense validation. Our experimental results demonstrate that this approach achieves superior performance in Arabic commonsense validation. Our work enhances Arabic natural language understanding by providing both a foundational dataset and a novel method for handling its complex variations. To the best of our knowledge, we release the first Arabic multi-dialect commonsense reasoning dataset.
zh
[NLP-6] Spot the BlindSpots: Systematic Identification and Quantification of Fine-Grained LLM Biases in Contact Center Summaries
【速读】: 该论文旨在解决生成式 AI(Generative AI)在客服中心场景中进行摘要生成时可能存在的系统性操作偏差(Operational Bias)问题,即大型语言模型(LLMs)是否在处理通话转录文本时对某些特定方面(如话语不流畅性、说话人身份、话题等)存在非均衡的关注,从而导致摘要内容的偏倚。解决方案的关键在于提出名为 BlindSpot 的框架,该框架基于15个操作偏差维度构建了分类体系,并利用大语言模型作为零样本分类器,对每个偏差维度提取源文本与摘要之间的类别分布,进而通过两个核心指标——Fidelity Gap(分布间JS散度)和 Coverage(遗漏源标签比例)——实现对偏差的量化评估。实证研究表明,此类偏差普遍存在且跨模型规模和家族,验证了 BlindSpot 在识别和测量系统性偏差方面的有效性。
链接: https://arxiv.org/abs/2508.13124
作者: Kawin Mayilvaghanan,Siddhant Gupta,Ayush Kumar
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Abstractive summarization is a core application in contact centers, where Large Language Models (LLMs) generate millions of summaries of call transcripts daily. Despite their apparent quality, it remains unclear whether LLMs systematically under- or over-attend to specific aspects of the transcript, potentially introducing biases in the generated summary. While prior work has examined social and positional biases, the specific forms of bias pertinent to contact center operations - which we term Operational Bias - have remained unexplored. To address this gap, we introduce BlindSpot, a framework built upon a taxonomy of 15 operational bias dimensions (e.g., disfluency, speaker, topic) for the identification and quantification of these biases. BlindSpot leverages an LLM as a zero-shot classifier to derive categorical distributions for each bias dimension in a pair of transcript and its summary. The bias is then quantified using two metrics: Fidelity Gap (the JS Divergence between distributions) and Coverage (the percentage of source labels omitted). Using BlindSpot, we conducted an empirical study with 2500 real call transcripts and their summaries generated by 20 LLMs of varying scales and families (e.g., GPT, Llama, Claude). Our analysis reveals that biases are systemic and present across all evaluated models, regardless of size or family.
zh
[NLP-7] AutoBnB-RAG : Enhancing Multi-Agent Incident Response with Retrieval-Augmented Generation
【速读】: 该论文旨在解决当前基于大语言模型(Large Language Models, LLMs)的自动化事件响应(Incident Response, IR)系统在实际应用中因缺乏外部知识访问能力而导致推理受限的问题。其核心解决方案是提出AutoBnB-RAG框架,通过引入检索增强生成(Retrieval-Augmented Generation, RAG)机制,使多智能体在模拟事件响应过程中能够主动检索并融合外部证据(如技术文档或叙事型事件报告),从而提升决策质量与协同分析能力。关键创新在于构建了两种不同的检索设置(RAG-Wiki和RAG-News),并验证其在多种团队结构下对复杂多阶段攻击重建的有效性,显著改善了LLM在网络安全场景中的推理准确性与实用性。
链接: https://arxiv.org/abs/2508.13118
作者: Zefang Liu,Arman Anwar
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL); Cryptography and Security (cs.CR)
备注:
Abstract:Incident response (IR) requires fast, coordinated, and well-informed decision-making to contain and mitigate cyber threats. While large language models (LLMs) have shown promise as autonomous agents in simulated IR settings, their reasoning is often limited by a lack of access to external knowledge. In this work, we present AutoBnB-RAG, an extension of the AutoBnB framework that incorporates retrieval-augmented generation (RAG) into multi-agent incident response simulations. Built on the Backdoors Breaches (BB) tabletop game environment, AutoBnB-RAG enables agents to issue retrieval queries and incorporate external evidence during collaborative investigations. We introduce two retrieval settings: one grounded in curated technical documentation (RAG-Wiki), and another using narrative-style incident reports (RAG-News). We evaluate performance across eight team structures, including newly introduced argumentative configurations designed to promote critical reasoning. To validate practical utility, we also simulate real-world cyber incidents based on public breach reports, demonstrating AutoBnB-RAG’s ability to reconstruct complex multi-stage attacks. Our results show that retrieval augmentation improves decision quality and success rates across diverse organizational models. This work demonstrates the value of integrating retrieval mechanisms into LLM-based multi-agent systems for cybersecurity decision-making.
zh
[NLP-8] All for law and law for all: Adaptive RAG Pipeline for Legal Research
【速读】: 该论文旨在解决法律领域中大型语言模型(Large Language Models, LLMs)输出易产生幻觉(hallucinations)的问题,通过引入检索增强生成(Retrieval-Augmented Generation, RAG)技术,将模型输出锚定在可引用的法律文献来源上,从而提升答案的准确性与可信度。其解决方案的关键在于三个针对性改进:(i) 设计了一种上下文感知的查询翻译器,能够分离自然语言问题中的文档引用信息,并根据用户专业性和问题具体性动态调整检索深度和响应风格;(ii) 采用开源嵌入模型(SBERT 和 GTE)实现高效且性能显著提升的检索策略(Recall@K 提升 30–95%,Precision@K 提升约 2.5 倍),同时保持成本效益;(iii) 构建了融合 RAGAS、BERTScore-F1 和 ROUGE-Recall 的综合评估框架,用于量化不同模型与提示设计下的语义一致性与忠实度。整体表明,任务感知的组件级调优可构建出法律领域内可复现、低成本且高可靠的 RAG 系统。
链接: https://arxiv.org/abs/2508.13107
作者: Figarri Keisha,Prince Singh,Pallavi,Dion Fernandes,Aravindh Manivannan,Ilham Wicaksono,Faisal Ahmad
机构: University College London (伦敦大学学院)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: submitted to NLLP 2025 Workshop
Abstract:Retrieval-Augmented Generation (RAG) mitigates hallucinations by grounding large language model outputs in cited sources, a capability that is especially critical in the legal domain. We present an end-to-end RAG pipeline that revisits and extends the LegalBenchRAG baseline with three targeted enhancements: (i) a context-aware query translator that disentangles document references from natural-language questions and adapts retrieval depth and response style based on expertise and specificity, (ii) open-source retrieval strategies using SBERT and GTE embeddings that achieve substantial performance gains (improving Recall@K by 30-95% and Precision@K by \sim 2.5 \times for K4 ) while remaining cost-efficient, and (iii) a comprehensive evaluation and generation framework that combines RAGAS, BERTScore-F1, and ROUGE-Recall to assess semantic alignment and faithfulness across models and prompt designs. Our results show that carefully designed open-source pipelines can rival or outperform proprietary approaches in retrieval quality, while a custom legal-grounded prompt consistently produces more faithful and contextually relevant answers than baseline prompting. Taken together, these contributions demonstrate the potential of task-aware, component-level tuning to deliver legally grounded, reproducible, and cost-effective RAG systems for legal research assistance.
zh
[NLP-9] DocHPLT: A Massively Multilingual Document-Level Translation Dataset
【速读】: 该论文旨在解决当前文档级机器翻译(document-level machine translation)资源稀缺的问题,尤其是针对低资源语言缺乏高质量对齐文档数据的瓶颈。现有公开数据集仅覆盖少数高资源语言,限制了模型在多语言、长文本场景下的训练与评估能力。解决方案的关键在于构建DocHPLT——目前规模最大、涵盖50种语言的文档级翻译数据集,包含1.24亿对跨语言文档及42.6亿句对,并通过改进网页提取流程保留原始文档完整性(包括未对齐内容),而非依赖传统的基于句子重建的方法。这一设计显著提升了训练语料的真实性和多样性,实验证明基于该数据集微调的大语言模型(LLM)在文档级翻译任务中性能显著优于通用指令微调基线,尤其在低资源语言上提升明显。
链接: https://arxiv.org/abs/2508.13079
作者: Dayyán O’Brien,Bhavitvya Malik,Ona de Gibert,Pinzhen Chen,Barry Haddow,Jörg Tiedemann
机构: University of Edinburgh (爱丁堡大学); University of Helsinki (赫尔辛基大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Existing document-level machine translation resources are only available for a handful of languages, mostly high-resourced ones. To facilitate the training and evaluation of document-level translation and, more broadly, long-context modeling for global communities, we create DocHPLT, the largest publicly available document-level translation dataset to date. It contains 124 million aligned document pairs across 50 languages paired with English, comprising 4.26 billion sentences, with further possibility to provide 2500 bonus pairs not involving English. Unlike previous reconstruction-based approaches that piece together documents from sentence-level data, we modify an existing web extraction pipeline to preserve complete document integrity from the source, retaining all content including unaligned portions. After our preliminary experiments identify the optimal training context strategy for document-level translation, we demonstrate that LLMs fine-tuned on DocHPLT substantially outperform off-the-shelf instruction-tuned baselines, with particularly dramatic improvements for under-resourced languages. We open-source the dataset under a permissive license, providing essential infrastructure for advancing multilingual document-level translation.
zh
[NLP-10] Reinforced Context Order Recovery for Adaptive Reasoning and Planning
【速读】: 该论文旨在解决当前因果语言模型(causal language models)和扩散模型(diffusion models)在处理需要动态调整生成顺序的任务时所面临的挑战,即这些模型通常被训练为按固定(从左到右)或随机顺序生成token,这与实际问题中逻辑上合理的生成顺序存在偏差,导致在复杂推理和规划任务中性能受限。解决方案的关键在于提出一种基于强化学习的框架ReCOR(Reinforced Context Order Recovery),该框架无需标注即可从文本数据中自监督地学习适应性、数据依赖的token生成顺序;通过预测统计信息作为奖励信号,ReCOR在训练和推理阶段均能估计每个未填充token的预测难度,并自适应选择下一个生成的token,从而显著提升模型在复杂任务上的表现,甚至优于使用真实生成顺序监督的“理想”基线模型。
链接: https://arxiv.org/abs/2508.13070
作者: Long Ma,Fangwei Zhong,Yizhou Wang
机构: Peking University (北京大学); Beijing Normal University (北京师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern causal language models, followed by rapid developments in discrete diffusion models, can now produce a wide variety of interesting and useful content. However, these families of models are predominantly trained to output tokens with a fixed (left-to-right) or random order, which may deviate from the logical order in which tokens are generated originally. In this paper, we observe that current causal and diffusion models encounter difficulties in problems that require adaptive token generation orders to solve tractably, which we characterize with the \mathcalV -information framework. Motivated by this, we propose Reinforced Context Order Recovery (ReCOR), a reinforcement-learning-based framework to extract adaptive, data-dependent token generation orders from text data without annotations. Self-supervised by token prediction statistics, ReCOR estimates the hardness of predicting every unfilled token and adaptively selects the next token during both training and inference. Experiments on challenging reasoning and planning datasets demonstrate the superior performance of ReCOR compared with baselines, sometimes outperforming oracle models supervised with the ground-truth order.
zh
[NLP-11] Evaluating ASR robustness to spontaneous speech errors: A study of WhisperX using a Speech Error Database INTERSPEECH2025
【速读】: 该论文旨在解决自动语音识别(Automatic Speech Recognition, ASR)系统在面对真实口语中的言语错误时,其性能评估缺乏精细化诊断工具的问题。解决方案的关键在于利用Simon Fraser University Speech Error Database (SFUSED) 的结构化标注体系,该数据库对自发英语口语中的言语错误进行了多维标注,包括意图与实际发音、语言层级、上下文敏感性、词形退化、词级修正及音位/音节级定位等维度。通过在此数据集上评估WhisperX模型的转录准确率,研究验证了SFUSED作为ASR系统性能诊断工具的有效性,从而为ASR模型在复杂真实场景下的鲁棒性分析提供了可量化、可解释的评估框架。
链接: https://arxiv.org/abs/2508.13060
作者: John Alderete,Macarious Kin Fung Hui,Aanchan Mohan
机构: Northeastern University (东北大学); Simon Fraser University (西蒙菲莎大学)
类目: Computation and Language (cs.CL)
备注: 5 pages, 6 figures, 1 table, Interspeech 2025 (Rotterdam)
Abstract:The Simon Fraser University Speech Error Database (SFUSED) is a public data collection developed for linguistic and psycholinguistic research. Here we demonstrate how its design and annotations can be used to test and evaluate speech recognition models. The database comprises systematically annotated speech errors from spontaneous English speech, with each error tagged for intended and actual error productions. The annotation schema incorporates multiple classificatory dimensions that are of some value to model assessment, including linguistic hierarchical level, contextual sensitivity, degraded words, word corrections, and both word-level and syllable-level error positioning. To assess the value of these classificatory variables, we evaluated the transcription accuracy of WhisperX across 5,300 documented word and phonological errors. This analysis demonstrates the atabase’s effectiveness as a diagnostic tool for ASR system performance.
zh
[NLP-12] Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi
【速读】: 该论文旨在解决形态学复杂且资源匮乏语言(如土耳其语)在自然语言处理中因不当分词导致的语义与语法结构丢失问题。其解决方案的关键在于提出了一套新的评估框架,引入了语言特定标记占比(%TR)和标记纯净度(%Pure)等指标,以量化不同分词器对语言结构的保留能力,并发现语言特定标记占比与下游任务性能(如MMLU分数)具有更强的相关性,从而强调了针对语言特性定制化分词策略的重要性。
链接: https://arxiv.org/abs/2508.13058
作者: M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Sercan Karakaş,Banu Diri,Savaş Yıldırım
机构: Yıldız Technical University (伊斯坦布尔技术大学); Yeditepe University (耶迪特佩大学); The University of Chicago (芝加哥大学); İstanbul Bilgi University (伊斯坦布尔比尔吉大学)
类目: Computation and Language (cs.CL)
备注: in Turkish language, Presented at the 2025 33rd Signal Processing and Communications Applications Conference (SIU), 25–28 June 2025, Şile, Istanbul, Türkiye
Abstract:Tokenization is a fundamental preprocessing step in Natural Language Processing (NLP), significantly impacting the capability of large language models (LLMs) to capture linguistic and semantic nuances. This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish. Utilizing the Turkish MMLU (TR-MMLU) dataset, comprising 6,200 multiple-choice questions from the Turkish education system, we assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure). These newly proposed metrics measure how effectively tokenizers preserve linguistic structures. Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity. Furthermore, increasing model parameters alone does not necessarily enhance linguistic performance, underscoring the importance of tailored, language-specific tokenization methods. The proposed framework establishes robust and practical tokenization standards for morphologically complex languages.
zh
[NLP-13] Büyük Dil Modelleri için TR-MMLU Benchmarkı: Performans Değerlendirmesi Zorluklar ve İyileştirme Fırsatları
【速读】: 该论文旨在解决资源受限语言(如土耳其语)在大语言模型(Large Language Models, LLMs)评估中缺乏系统性基准测试的问题。现有评估方法多集中于英语等主流语言,难以准确反映模型在低资源语言上的理解与生成能力。解决方案的关键在于构建一个名为土耳其多领域知识理解基准(Turkish MMLU, TR-MMLU)的综合性评测框架,该框架基于6200道多项选择题,覆盖土耳其教育体系中的62个学科领域,为土耳其自然语言处理(Natural Language Processing, NLP)研究提供了标准化、细粒度的评估工具,从而推动针对土耳其语的LLM性能分析与优化。
链接: https://arxiv.org/abs/2508.13044
作者: M. Ali Bayram,Ali Arda Fincan,Ahmet Semih Gümüş,Banu Diri,Savaş Yıldırım,Öner Aytaş
机构: Yıldız Technical University (伊斯坦布尔理工大学); Yeditepe University (叶迪特佩大学); İstanbul Bilgi University (伊斯坦布尔比尔吉大学); Işık University (伊肖克大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, in Turkish language, 5 figures. Presented at the 2025 33rd Signal Processing and Communications Applications Conference (SIU), 25–28 June 2025, Sile, Istanbul, Türkiye
Abstract:Language models have made significant advancements in understanding and generating human language, achieving remarkable success in various applications. However, evaluating these models remains a challenge, particularly for resource-limited languages like Turkish. To address this issue, we introduce the Turkish MMLU (TR-MMLU) benchmark, a comprehensive evaluation framework designed to assess the linguistic and conceptual capabilities of large language models (LLMs) in Turkish. TR-MMLU is based on a meticulously curated dataset comprising 6,200 multiple-choice questions across 62 sections within the Turkish education system. This benchmark provides a standard framework for Turkish NLP research, enabling detailed analyses of LLMs’ capabilities in processing Turkish text. In this study, we evaluated state-of-the-art LLMs on TR-MMLU, highlighting areas for improvement in model design. TR-MMLU sets a new standard for advancing Turkish NLP research and inspiring future innovations.
zh
[NLP-14] Can Large Models Teach Student Models to Solve Mathematical Problems Like Human Beings? A Reasoning Distillation Method via Multi-LoRA Interaction IJCAI2025
【速读】: 该论文旨在解决小语言模型(Small Language Models, SLMs)在数学推理任务中表现不佳的问题。现有方法通常依赖大语言模型(Large Language Models, LLMs)生成海量数据进行“填鸭式”训练,但这种方法难以有效提升SLMs的推理能力。解决方案的关键在于受人类双系统思维启发:借鉴心理学中的System 1(直觉式思维)和System 2(分析式思维),提出一种基于多LoRA交互的数学推理蒸馏方法(LoRID)。其核心机制包括三个模块——直观推理器(Intuitive Reasoner, IR)、知识生成器(Knowledge Generator, KG)和深度推理器(Deep Reasoner, DR),分别模拟System 1的快速响应、System 2的知识获取与推理过程,并通过迭代一致性评估实现相互反馈优化,从而显著增强SLMs的数学推理性能。
链接: https://arxiv.org/abs/2508.13037
作者: Xinhe Li,Jiajun Liu,Peng Wang
机构: Southeast University (东南大学); Ministry of Education (教育部)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted by IJCAI2025
Abstract:Recent studies have demonstrated that Large Language Models (LLMs) have strong mathematical reasoning abilities but rely on hundreds of billions of parameters. To tackle the challenge of poor reasoning in Small Language Models (SLMs), existing methods typically leverage LLMs to generate massive amounts of data for cramming training. In psychology, they are akin to System 1 thinking, which resolves reasoning problems rapidly based on experience and intuition. However, human learning also requires System 2 thinking, where knowledge is first acquired and then reinforced through practice. Inspired by such two distinct modes of thinking, we propose a novel method based on the multi-LoRA Interaction for mathematical reasoning Distillation (LoRID). First, we input the question and reasoning of each sample into an LLM to create knowledge-enhanced datasets. Subsequently, we train a LoRA block on the student model as an Intuitive Reasoner (IR), which directly generates Chain-of-Thoughts for problem-solving. Then, to imitate System 2 thinking, we train the Knowledge Generator (KG) and Deep Reasoner (DR), respectively. The former outputs only knowledge after receiving problems, while the latter uses that knowledge to perform reasoning. Finally, to address the randomness in the generation of IR and DR, we evaluate whether their outputs are consistent, and the inference process needs to be iterated if not. This step can enhance the mathematical reasoning ability of SLMs through mutual feedback. Experimental results show that LoRID achieves state-of-the-art performance, especially on the GSM8K dataset, where it outperforms the second-best method by 2.3%, 16.1%, 2.4%, 12.3%, and 1.8% accuracy across the five base models, respectively.
zh
[NLP-15] Integrating Feedback Loss from Bi-modal Sarcasm Detector for Sarcastic Speech Synthesis
【速读】: 该论文旨在解决生成式语音合成(Speech Synthesis)中难以准确传达讽刺语气(sarcasm)的问题,其核心挑战在于讽刺语调的细微韵律特征(prosody)以及标注良好的讽刺语音数据稀缺。解决方案的关键在于引入双模态讽刺检测模型(bi-modal sarcasm detection model)的反馈损失(feedback loss)机制,将其嵌入到文本到语音(TTS)训练过程中,从而增强模型对讽刺语义的捕捉能力;同时,采用两阶段迁移学习策略:首先在包含多种语音风格的多样化数据集上进行微调,再在专门聚焦讽刺语音的数据集上进一步优化,显著提升了合成语音的质量、自然度及讽刺感知能力(sarcasm-awareness)。
链接: https://arxiv.org/abs/2508.13028
作者: Zhu Li,Yuqing Zhang,Xiyuan Gao,Devraj Raghuvanshi,Nagendra Kumar,Shekhar Nayak,Matt Coler
机构: Radboud University Nijmegen (拉德布德大学奈梅亨分校); Brown University (布朗大学); Indian Institute of Technology Indore (印度理工学院印多尔分校)
类目: Computation and Language (cs.CL)
备注: Speech Synthesis Workshop 2025
Abstract:Sarcastic speech synthesis, which involves generating speech that effectively conveys sarcasm, is essential for enhancing natural interactions in applications such as entertainment and human-computer interaction. However, synthesizing sarcastic speech remains a challenge due to the nuanced prosody that characterizes sarcasm, as well as the limited availability of annotated sarcastic speech data. To address these challenges, this study introduces a novel approach that integrates feedback loss from a bi-modal sarcasm detection model into the TTS training process, enhancing the model’s ability to capture and convey sarcasm. In addition, by leveraging transfer learning, a speech synthesis model pre-trained on read speech undergoes a two-stage fine-tuning process. First, it is fine-tuned on a diverse dataset encompassing various speech styles, including sarcastic speech. In the second stage, the model is further refined using a dataset focused specifically on sarcastic speech, enhancing its ability to generate sarcasm-aware speech. Objective and subjective evaluations demonstrate that our proposed methods improve the quality, naturalness, and sarcasm-awareness of synthesized speech.
zh
[NLP-16] WebMall – A Multi-Shop Benchmark for Evaluating Web Agents
【速读】: 该论文旨在解决当前大语言模型(Large Language Model, LLM)驱动的网络代理在跨店铺比价购物场景中评估标准缺失的问题,即缺乏一个能够系统性衡量代理在多电商平台间执行复杂、长周期任务能力的基准测试平台。解决方案的关键在于提出并构建了WebMall——一个包含四个模拟在线商店(源自Common Crawl的真实商品数据)和91项跨店比价任务的基准测试集,其任务涵盖基础操作(如搜索、加购、结账)与高级行为(如模糊需求匹配、替代品识别、兼容性判断),相较于现有电商基准(如WebShop或ShoppingBench)具有更强的多样性、真实性和交互深度。通过在该基准上对八种不同配置的基线代理进行评估,验证了其在导航、推理与效率方面的综合性能,从而为未来web agent研究提供了标准化评测框架与可复现的实验基础。
链接: https://arxiv.org/abs/2508.13024
作者: Ralph Peeters,Aaron Steiner,Luca Schwarz,Julian Yuya Caspary,Christian Bizer
机构: Data and Web Science Group (数据与网络科学组); University of Mannheim (曼海姆大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the users needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. These tasks include basic tasks such as finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT 4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.
zh
[NLP-17] PC-Sampler: Position-Aware Calibration of Decoding Bias in Masked Diffusion Models
【速读】: 该论文旨在解决掩码扩散模型(Masked Diffusion Models, MDMs)在序列生成任务中因解码策略选择不当而导致的生成质量不稳定问题,尤其针对现有基于不确定性的采样方法存在全局轨迹控制能力不足和早期解码阶段对平凡标记(trivial tokens)过度偏倚的两大局限。其解决方案的关键在于提出一种名为位置感知置信度校准采样(Position-Aware Confidence-Calibrated Sampling, PC-Sampler)的新颖解码策略,该策略通过引入位置感知加权机制实现对解码路径的全局调控,并结合校准后的置信度分数有效抑制早期阶段对平凡标记的过早选择,从而在内容相关性与生成多样性之间取得更好平衡。
链接: https://arxiv.org/abs/2508.13021
作者: Pengcheng Huang,Shuhao Liu,Zhenghao Liu,Yukun Yan,Shuo Wang,Zulong Chen,Tong Xiao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 17 pages,13 figures
Abstract:Recent advances in masked diffusion models (MDMs) have established them as powerful non-autoregressive alternatives for sequence generation. Nevertheless, our preliminary experiments reveal that the generation quality of MDMs is still highly sensitive to the choice of decoding strategy. In particular, widely adopted uncertainty-based samplers suffer from two key limitations: a lack of global trajectory control and a pronounced bias toward trivial tokens in the early stages of decoding. These shortcomings restrict the full potential of MDMs. In this work, we introduce Position-Aware Confidence-Calibrated Sampling (PC-Sampler), a novel decoding strategy that unifies global trajectory planning with content-aware informativeness maximization. PC-Sampler incorporates a position-aware weighting mechanism to regulate the decoding path and a calibrated confidence score to suppress the premature selection of trivial tokens. Extensive experiments on three advanced MDMs across seven challenging benchmarks-including logical reasoning and planning tasks-demonstrate that PC-Sampler consistently outperforms existing MDM decoding strategies by more than 10% on average, significantly narrowing the performance gap with state-of-the-art autoregressive models. All codes are available at this https URL.
zh
[NLP-18] Analyzing Information Sharing and Coordination in Multi-Agent Planning
【速读】: 该论文旨在解决多智能体系统(Multi-agent Systems, MASs)在执行长周期、多约束规划任务时面临的挑战,尤其是如何有效处理复杂依赖约束和避免因信息遗漏或幻觉导致的错误。其核心解决方案包括两个关键机制:一是引入“笔记簿”(notebook)以结构化方式促进智能体间的信息共享,减少因幻觉细节引发的错误(降低18%);二是设计一个协调代理(orchestrator agent)通过引导对话焦点,在特定子领域内进一步将错误减少达13.5%。二者结合使TravelPlanner基准测试的最终通过率提升至25%,相较单智能体基线(7.5%)实现绝对提升17.5%,验证了结构化信息共享与反思式协调在LLM驱动的多智能体系统中对长程规划任务的重要性。
链接: https://arxiv.org/abs/2508.12981
作者: Tianyue Ou,Saujas Vaduguru,Daniel Fried
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Multi-agent systems (MASs) have pushed the boundaries of large language model (LLM) agents in domains such as web research and software engineering. However, long-horizon, multi-constraint planning tasks involve conditioning on detailed information and satisfying complex interdependent constraints, which can pose a challenge for these systems. In this study, we construct an LLM-based MAS for a travel planning task which is representative of these challenges. We evaluate the impact of a notebook to facilitate information sharing, and evaluate an orchestrator agent to improve coordination in free form conversation between agents. We find that the notebook reduces errors due to hallucinated details by 18%, while an orchestrator directs the MAS to focus on and further reduce errors by up to 13.5% within focused sub-areas. Combining both mechanisms achieves a 25% final pass rate on the TravelPlanner benchmark, a 17.5% absolute improvement over the single-agent baseline’s 7.5% pass rate. These results highlight the potential of structured information sharing and reflective orchestration as key components in MASs for long horizon planning with LLMs.
zh
[NLP-19] SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
【速读】: 该论文旨在解决TinyML(微型机器学习)场景下模型不确定性估计的资源效率问题,即如何在有限的内存和计算资源约束下实现可靠的不确定性量化(Uncertainty Quantification, UQ),以支持设备端的实时监控与故障检测。解决方案的关键在于提出SNAP-UQ方法:它采用单次前向传播、无需标签数据的机制,通过深度可分离的下一激活预测(depth-wise next-activation prediction)来估算风险——具体而言,使用int8精度的小型头部网络从压缩的前一层特征中预测下一层的统计特性,并借助一个轻量级单调映射器将“意外度”(surprisal)转化为可操作的不确定性得分。该设计不依赖时间缓冲区、辅助输出分支或多次前向传递,仅增加几十KB存储开销,同时在视觉和音频骨干网络上显著优于早期退出(early-exit)和深度集成(deep ensembles)等方法,在保持高准确率的同时大幅降低闪存占用(约40–60%更小)和延迟(约25–35%更快),并在受扰动的数据流中提升准确性下降检测能力(AUPRC提升数个百分点),并维持稳定的失败检测性能(AUROC ≈ 0.9)。
链接: https://arxiv.org/abs/2508.12907
作者: Ismail Lamaakal,Chaymae Yahyati,Khalid El Makkaoui,Ibrahim Ouahbi,Yassine Maleh
机构: Multidisciplinary Faculty of Nador (多学科纳多学院); University Mohammed Premier (穆罕默德第一大学); Laboratory LaSTI, ENSAK (LaSTI实验室,恩萨克工程学院); Sultan Moulay Slimane University (苏丹穆莱·斯利曼大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We introduce \textbfSNAP-UQ, a single-pass, label-free uncertainty method for TinyML that estimates risk from \emphdepth-wise next-activation prediction: tiny int8 heads forecast the statistics of the next layer from a compressed view of the previous one, and a lightweight monotone mapper turns the resulting surprisal into an actionable score. The design requires no temporal buffers, auxiliary exits, or repeated forward passes, and adds only a few tens of kilobytes to MCU deployments. Across vision and audio backbones, SNAP-UQ consistently reduces flash and latency relative to early-exit and deep ensembles (typically \sim 40–60% smaller and \sim 25–35% faster), with competing methods of similar accuracy often exceeding memory limits. In corrupted streams it improves accuracy-drop detection by several AUPRC points and maintains strong failure detection (AUROC \approx 0.9) in a single pass. Grounding uncertainty in layer-to-layer dynamics yields a practical, resource-efficient basis for on-device monitoring in TinyML.
zh
[NLP-20] CUQ: Single-Pass Uncertainty Quantification from Temporal Consistency with Streaming Conformal Calibration for TinyML
【速读】: 该论文旨在解决TinyML(微型机器学习)在资源受限设备上进行流式推理时的不确定性量化与风险监控问题,尤其关注如何在无标签数据和低计算开销的前提下实现校准的决策机制。其解决方案的关键在于提出TCUQ(Temporal Consistency Uncertainty Quantification),通过轻量级后验概率和特征信号捕捉短期时间一致性,并利用O(W)环形缓冲区和每步O(1)更新实现高效风险评分;进一步结合流式共形层将该评分转化为预算约束的接受/拒绝规则,从而无需在线标签或额外前向传播即可获得校准行为。此方法在微控制器上仅需KB级存储空间,显著优于早期退出和深度集成等方案,同时提升了对分布内扰动和故障检测的敏感性。
链接: https://arxiv.org/abs/2508.12905
作者: Ismail Lamaakal,Chaymae Yahyati,Khalid El Makkaoui,Ibrahim Ouahbi,Yassine Maleh
机构: Multidisciplinary Faculty of Nador (多学科纳多学院); University Mohammed Premier (穆罕默德第一大学); Laboratory LaSTI, ENSAK (LaSTI 实验室,ENSAK); Sultan Moulay Slimane University (苏丹穆莱·斯利曼大学)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:We introduce TCUQ, a single pass, label free uncertainty monitor for streaming TinyML that converts short horizon temporal consistency captured via lightweight signals on posteriors and features into a calibrated risk score with an O(W ) ring buffer and O(1) per step updates. A streaming conformal layer turns this score into a budgeted accept/abstain rule, yielding calibrated behavior without online labels or extra forward passes. On microcontrollers, TCUQ fits comfortably on kilobyte scale devices and reduces footprint and latency versus early exit and deep ensembles (typically about 50 to 60% smaller and about 30 to 45% faster), while methods of similar accuracy often run out of memory. Under corrupted in distribution streams, TCUQ improves accuracy drop detection by 3 to 7 AUPRC points and reaches up to 0.86 AUPRC at high severities; for failure detection it attains up to 0.92 AUROC. These results show that temporal consistency, coupled with streaming conformal calibration, provides a practical and resource efficient foundation for on device monitoring in TinyML.
zh
[NLP-21] A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models
【速读】: 该论文旨在解决当前自精炼(self-refinement)方法依赖固定迭代次数、缺乏动态调整能力的问题,即现有方法通常采用反应式(reactive)流程,在不考虑生成过程中上下文演化的情况下进行固定次数的迭代优化,导致难以在恰当的时间和内容上实现高效精炼。解决方案的关键在于提出一种主动式自精炼(ProActive Self-Refinement, PASR),该方法使大语言模型(LLM)能够在生成过程中基于其内部状态和动态演化的上下文,主动决策是否、何时以及如何进行精炼,而非重新生成整个响应。这种机制显著提升了问题求解性能,并在保持更高准确率的同时大幅降低令牌消耗(如Qwen3-8B上减少41.6%的平均token消耗)。
链接: https://arxiv.org/abs/2508.12903
作者: Jinyi Han,Xinyi Wang,Haiquan Zhao,Tingyun li,Zishang Jiang,Sihang Jiang,Jiaqing Liang,Xin Lin,Weikang Zhou,Zeye Sun,Fei Yu,Yanghua Xiao
机构: Shanghai Institute of Artificial Intelligence for Education, East China Normal University (华东师范大学人工智能教育研究院); School of Data Science, Fudan University (复旦大学数据科学学院); College of Computer Science and Artificial Intelligence, Fudan University (复旦大学计算机科学与人工智能学院); Antgroup (蚂蚁集团)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in self-refinement have demonstrated significant potential for improving the outputs of large language models (LLMs) through iterative refinement. However, most existing self-refinement methods rely on a reactive process with a fixed number of iterations, making it difficult to determine the optimal timing and content of refinement based on the evolving generation context. Inspired by the way humans dynamically refine their thoughts during execution, we propose ProActive Self-Refinement (PASR), a novel method that enables LLMs to refine their outputs during the generation process. Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model’s internal state and evolving context. We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR. Experimental results show that PASR significantly enhances problem-solving performance. In particular, on Qwen3-8B, PASR reduces average token consumption by 41.6 percent compared to standard generation, while also achieving an 8.2 percent improvement in accuracy. Our code and all baselines used in the paper are available in the GitHub.
zh
[NLP-22] An LLM Agent -Based Complex Semantic Table Annotation Approach
【速读】: 该论文旨在解决复杂表格中语义表注释(Semantic Table Annotation, STA)任务的准确性问题,特别是针对列类型注释(Column Type Annotation, CTA)和单元格实体注释(Cell Entity Annotation, CEA)面临的挑战,如列名或单元格值的语义丢失、严格的本体层次结构要求、同义词混淆、拼写错误及缩写等问题。解决方案的关键在于提出一种基于大语言模型(Large Language Model, LLM)的代理方法,设计并实现五个基于ReAct框架的外部工具与定制提示(prompt),使STA代理能够根据表格特征动态选择合适的注释策略,从而提升注释精度;同时引入Levenshtein距离以减少冗余注释,实现时间成本降低70%、LLM令牌使用量减少60%,显著提高效率与经济性。
链接: https://arxiv.org/abs/2508.12868
作者: Yilin Geng,Shujing Wang,Chuan Wang,Keqing He,Yanfei Lv,Ying Wang,Zaiwen Feng,Xiaoying Bai
机构: 未知
类目: Computation and Language (cs.CL); Databases (cs.DB)
备注:
Abstract:The Semantic Table Annotation (STA) task, which includes Column Type Annotation (CTA) and Cell Entity Annotation (CEA), maps table contents to ontology entities and plays important roles in various semantic applications. However, complex tables often pose challenges such as semantic loss of column names or cell values, strict ontological hierarchy requirements, homonyms, spelling errors, and abbreviations, which hinder annotation accuracy. To address these issues, this paper proposes an LLM-based agent approach for CTA and CEA. We design and implement five external tools with tailored prompts based on the ReAct framework, enabling the STA agent to dynamically select suitable annotation strategies depending on table characteristics. Experiments are conducted on the Tough Tables and BiodivTab datasets from the SemTab challenge, which contain the aforementioned challenges. Our method outperforms existing approaches across various metrics. Furthermore, by leveraging Levenshtein distance to reduce redundant annotations, we achieve a 70% reduction in time costs and a 60% reduction in LLM token usage, providing an efficient and cost-effective solution for STA.
zh
[NLP-23] Word Meanings in Transformer Language Models
【速读】: 该论文试图解决的问题是:Transformer语言模型(如RoBERTa-base)中词义是如何表征的,特别是它们是否像人类语言系统一样拥有一个类似于词汇存储(lexical store)的机制,其中每个词都包含语义信息的独立条目。解决方案的关键在于对模型的token嵌入空间进行k-means聚类(生成200个簇),并通过人工检查和五种心理语言学指标(效价、具象性、象似性、禁忌性和习得年龄)来评估这些簇是否编码了丰富的语义信息。结果表明,token嵌入空间确实蕴含多样化的语义信息,从而排除了某些“意义消解论”(meaning eliminativist)关于Transformer模型处理语义的方式的假设。
链接: https://arxiv.org/abs/2508.12863
作者: Jumbly Grindrod,Peter Grindrod
机构: University of Reading, Department of Philosophy (雷丁大学哲学系); University of Oxford, Mathematical Institute (牛津大学数学研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:We investigate how word meanings are represented in the transformer language models. Specifically, we focus on whether transformer models employ something analogous to a lexical store - where each word has an entry that contains semantic information. To do this, we extracted the token embedding space of RoBERTa-base and k-means clustered it into 200 clusters. In our first study, we then manually inspected the resultant clusters to consider whether they are sensitive to semantic information. In our second study, we tested whether the clusters are sensitive to five psycholinguistic measures: valence, concreteness, iconicity, taboo, and age of acquisition. Overall, our findings were very positive - there is a wide variety of semantic information encoded within the token embedding space. This serves to rule out certain “meaning eliminativist” hypotheses about how transformer LLMs process semantic information.
zh
[NLP-24] E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model ACM-MM2025
【速读】: 该论文旨在解决多模态共情响应生成(Multimodal Empathetic Response Generation, MERG)中的关键挑战,即如何有效处理跨模态情感内容并保持对话主体的身份一致性。现有基于大语言模型(Large Language Models, LLMs)的文本共情响应生成方法在面对语音、视频等多模态输入时表现不足,且难以维持长期的情感连贯性与个性特征。为此,作者提出E3RG系统——一种基于多模态大语言模型的显式情感驱动共情响应生成框架,其核心创新在于将MERG任务解耦为三个子模块:多模态共情理解、共情记忆检索和多模态响应生成。通过集成先进的语音与视频生成模型,E3RG无需额外训练即可输出自然、情感丰富且身份一致的响应,在零样本和少样本场景下均取得显著性能提升,并在ACM MM 25的Avatar-based Multimodal Empathy Challenge中获得第一名。
链接: https://arxiv.org/abs/2508.12854
作者: Ronghao Lin,Shuai Shen,Weipeng Hu,Qiaolin He,Aolin Xiong,Li Huang,Haifeng Hu,Yap-peng Tan
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Accepted at ACM MM 2025 Grand Challenge
Abstract:Multimodal Empathetic Response Generation (MERG) is crucial for building emotionally intelligent human-computer interactions. Although large language models (LLMs) have improved text-based ERG, challenges remain in handling multimodal emotional content and maintaining identity consistency. Thus, we propose E3RG, an Explicit Emotion-driven Empathetic Response Generation System based on multimodal LLMs which decomposes MERG task into three parts: multimodal empathy understanding, empathy memory retrieval, and multimodal response generation. By integrating advanced expressive speech and video generative models, E3RG delivers natural, emotionally rich, and identity-consistent responses without extra training. Experiments validate the superiority of our system on both zero-shot and few-shot settings, securing Top-1 position in the Avatar-based Multimodal Empathy Challenge on ACM MM 25. Our code is available at this https URL.
zh
[NLP-25] It takes a village to write a book: Mapping anonymous contributions in Stephen Langtons Quaestiones Theologiae
【速读】: 该论文旨在解决中世纪经院哲学文献中基于口头讲授记录(reportationes)的集体性文学生产过程的编辑层次识别问题,以验证其文本形成机制的相关假设。解决方案的关键在于采用风格计量学(stylometry)方法对斯蒂芬·朗顿(Stephen Langton)的《神学问题》(Quaestiones Theologiae)这一由报告性文本汇编而成的集合进行分析,具体包括:构建一个基于光学字符识别(OCR)和自动转录对齐的高精度历史文本识别(HTR)流程,并利用高频词、词性标注(POS tags)及伪词缀(pseudo-affixes)等特征进行作者归属分析,从而区分不同编辑层级的文本来源;该研究还将直接比较人工撰写与自动提取数据在风格计量模型中的表现差异,同时评估基于Transformer架构的OCR技术在中世纪拉丁语语料库处理流程中的有效性,为后续计算人文研究提供可复用的方法论模板。
链接: https://arxiv.org/abs/2508.12830
作者: Jan Maliszewski
机构: University of Warsaw (华沙大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:While the indirect evidence suggests that already in the early scholastic period the literary production based on records of oral teaching (so-called reportationes) was not uncommon, there are very few sources commenting on the practice. This paper details the design of a study applying stylometric techniques of authorship attribution to a collection developed from reportationes – Stephen Langton’s Quaestiones Theologiae – aiming to uncover layers of editorial work and thus validate some hypotheses regarding the collection’s formation. Following Camps, Clérice, and Pinche (2021), I discuss the implementation of an HTR pipeline and stylometric analysis based on the most frequent words, POS tags, and pseudo-affixes. The proposed study will offer two methodological gains relevant to computational research on the scholastic tradition: it will directly compare performance on manually composed and automatically extracted data, and it will test the validity of transformer-based OCR and automated transcription alignment for workflows applied to scholastic Latin corpora. If successful, this study will provide an easily reusable template for the exploratory analysis of collaborative literary production stemming from medieval universities.
zh
[NLP-26] Context Matters: Incorporating Target Awareness in Conversational Abusive Language Detection
【速读】: 该论文旨在解决社交媒体中滥用语言检测模型在孤立分析单条帖子时忽略对话上下文信息的问题,从而影响检测准确性。其核心解决方案是引入父级帖子(parent tweet)的上下文信息,通过整合内容特征(如语义、情感倾向等)与账户特征(如用户历史行为),构建更全面的分类模型。研究表明,利用上下文内容特征显著优于仅依赖回复帖自身特征的方法,且多维内容特征的组合使用比单一或少量特征更能提升模型性能,验证了在真实对话场景中采用上下文感知策略对提高滥用语言识别准确性的关键作用。
链接: https://arxiv.org/abs/2508.12828
作者: Raneem Alharthi,Rajwa Alharthi,Aiqi Jiang,Arkaitz Zubiaga
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Abusive language detection has become an increasingly important task as a means to tackle this type of harmful content in social media. There has been a substantial body of research developing models for determining if a social media post is abusive or not; however, this research has primarily focused on exploiting social media posts individually, overlooking additional context that can be derived from surrounding posts. In this study, we look at conversational exchanges, where a user replies to an earlier post by another user (the parent tweet). We ask: does leveraging context from the parent tweet help determine if a reply post is abusive or not, and what are the features that contribute the most? We study a range of content-based and account-based features derived from the context, and compare this to the more widely studied approach of only looking at the features from the reply tweet. For a more generalizable study, we test four different classification models on a dataset made of conversational exchanges (parent-reply tweet pairs) with replies labeled as abusive or not. Our experiments show that incorporating contextual features leads to substantial improvements compared to the use of features derived from the reply tweet only, confirming the importance of leveraging context. We observe that, among the features under study, it is especially the content-based features (what is being posted) that contribute to the classification performance rather than account-based features (who is posting it). While using content-based features, it is best to combine a range of different features to ensure improved performance over being more selective and using fewer features. Our study provides insights into the development of contextualized abusive language detection models in realistic settings involving conversations.
zh
[NLP-27] ding-01 :ARG0: An AMR Corpus for Spontaneous French Dialogue
【速读】: 该论文旨在解决法语对话语料库在语义表示方面的不足,特别是针对自发性口语中动态特征和法语特有句法结构的建模问题。其解决方案的关键在于扩展抽象语义表示(Abstract Meaning Representation, AMR)框架以更好地捕捉自发性对话的语义特性,并制定详细的标注指南以确保标注一致性;同时,作者构建并公开了基于DinG语料库的法语对话语义标注数据集(CC-SA-BY许可),并训练了一个AMR解析器作为辅助标注工具,为人工标注提供初始结果以提升效率与准确性。
链接: https://arxiv.org/abs/2508.12819
作者: Jeongwoo Kang,Maria Boritchev,Maximin Coavoux
机构: Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, 38000 Grenoble, France; LTCI, Télécom Paris, 91120 Palaiseau, France
类目: Computation and Language (cs.CL)
备注: Accepted at IWCS 2025
Abstract:We present our work to build a French semantic corpus by annotating French dialogue in Abstract Meaning Representation (AMR). Specifically, we annotate the DinG corpus, consisting of transcripts of spontaneous French dialogues recorded during the board game Catan. As AMR has insufficient coverage of the dynamics of spontaneous speech, we extend the framework to better represent spontaneous speech and sentence structures specific to French. Additionally, to support consistent annotation, we provide an annotation guideline detailing these extensions. We publish our corpus under a free license (CC-SA-BY). We also train and evaluate an AMR parser on our data. This model can be used as an assistance annotation tool to provide initial annotations that can be refined by human annotators. Our work contributes to the development of semantic resources for French dialogue.
zh
[NLP-28] Learning to Steer: Input-dependent Steering for Multimodal LLM s
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在后训练阶段缺乏细粒度行为引导的问题,尤其针对现有静态 steering 方法(如均值 steering)无法根据输入内容动态调整行为的局限性。其关键解决方案是提出一种基于输入特定线性偏移(input-specific linear shift)的细粒度 steering 方法,通过对比式输入特定提示(contrastive input-specific prompting)计算偏移向量,并引入一个小型辅助模块(auxiliary module)来学习预测该偏移向量,从而实现对 MLLMs 的动态、上下文感知的行为控制。该方法命名为 L2S(Learn-to-Steer),实验证明其能有效降低幻觉并提升安全性,优于传统静态基线方法。
链接: https://arxiv.org/abs/2508.12815
作者: Jayneel Parekh,Pegah Khayatan,Mustafa Shukor,Arnaud Dapogny,Alasdair Newson,Matthieu Cord
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Steering has emerged as a practical approach to enable post-hoc guidance of LLMs towards enforcing a specific behavior. However, it remains largely underexplored for multimodal LLMs (MLLMs); furthermore, existing steering techniques, such as mean steering, rely on a single steering vector, applied independently of the input query. This paradigm faces limitations when the desired behavior is dependent on the example at hand. For example, a safe answer may consist in abstaining from answering when asked for an illegal activity, or may point to external resources or consultation with an expert when asked about medical advice. In this paper, we investigate a fine-grained steering that uses an input-specific linear shift. This shift is computed using contrastive input-specific prompting. However, the input-specific prompts required for this approach are not known at test time. Therefore, we propose to train a small auxiliary module to predict the input-specific steering vector. Our approach, dubbed as L2S (Learn-to-Steer), demonstrates that it reduces hallucinations and enforces safety in MLLMs, outperforming other static baselines.
zh
[NLP-29] When Alignment Hurts: Decoupling Representational Spaces in Multilingual Models
【速读】: 该论文旨在解决高资源标准语言(如现代标准阿拉伯语,MSA)与低资源方言之间表示纠缠(representational entanglement)对生成式建模的负面影响问题。研究发现,过度依赖标准语言的表征空间会抑制模型在相关方言上的生成能力。解决方案的关键在于提出一种在线变分探测框架(online variational probing framework),该框架在微调过程中持续估计标准语言子空间,并通过投影方式实现对该子空间的解耦(decoupling)。这一方法在25种阿拉伯语方言上验证有效,显著提升生成质量(最高+4.9 chrF++,平均+2.0),同时揭示了子空间主导性对生成能力的因果限制,为多语言和多领域大语言模型中的表征分配控制提供了可操作的工具。
链接: https://arxiv.org/abs/2508.12803
作者: Ahmed Elshabrawy,Hour Kaing,Haiyue Song,Alham Fikri Aji,Hideki Tanaka,Masao Utiyama,Raj Dabre
机构: MBZUAI; NICT, Japan; IIT Madras
类目: Computation and Language (cs.CL)
备注:
Abstract:Alignment with high-resource standard languages is often assumed to aid the modeling of related low-resource varieties. We challenge this assumption by demonstrating that excessive representational entanglement with a dominant variety, such as Modern Standard Arabic (MSA) in relation to Arabic dialects, can actively hinder generative modeling. We present the first comprehensive causal study of this phenomenon by analyzing and directly intervening in the internal representation geometry of large language models (LLMs). Our key contribution is an online variational probing framework that continuously estimates the subspace of the standard variety during fine-tuning, enabling projection-based decoupling from this space. While our study uses Arabic as a case due to its unusually rich parallel resources across 25 dialects, the broader motivation is methodological: dialectal MT serves as a controlled proxy for generative tasks where comparable multi-variety corpora are unavailable. Across 25 dialects, our intervention improves generation quality by up to +4.9 chrF++ and +2.0 on average compared to standard fine-tuning, despite a measured tradeoff in standard-language performance. These results provide causal evidence that subspace dominance by high-resource varieties can restrict generative capacity for related varieties. More generally, we unify geometric and information-theoretic probing with subspace-level causal interventions, offering practical tools for improving generative modeling in closely related language families and, more broadly, for controlling representational allocation in multilingual and multi-domain LLMs. Code will be released.
zh
[NLP-30] Maximum Score Routing For Mixture-of-Experts
【速读】: 该论文旨在解决稀疏激活的专家混合模型(Mixture-of-Experts, MoE)中因专家容量约束导致的token丢弃和硬件效率低下问题,以及去除容量约束后引发的负载不均衡与计算效率下降问题。解决方案的关键在于提出一种新的路由范式——最大得分路由(Maximum Score Routing, MaxScore),其将路由过程建模为最小成本最大流问题,并引入SoftTopk操作符,从而在不依赖迭代重路由或最优传输公式的基础上,实现更低的训练损失和更高的评估分数,同时在相同浮点运算次数(FLOPs)下显著提升模型性能与硬件利用率。
链接: https://arxiv.org/abs/2508.12801
作者: Bowen Dong,Yilong Fan,Yutao Sun,Zhenyu Li,Tengyu Pan,Xun Zhou,Jianyong Wang
机构: Tsinghua University (清华大学); Tianjin University (天津大学); Seed-Foundation-Model Team; ByteDance (字节跳动)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Routing networks in sparsely activated mixture-of-experts (MoE) dynamically allocate input tokens to top-k experts through differentiable sparse transformations, enabling scalable model capacity while preserving computational efficiency. Traditional MoE networks impose an expert capacity constraint to ensure GPU-friendly computation. However, this leads to token dropping when capacity is saturated and results in low hardware efficiency due to padding in underutilized experts. Removing the capacity constraint, in turn, compromises load balancing and computational efficiency. To address these issues, we propose Maximum Score Routing ( \mathbfMaxScore ), a novel MoE routing paradigm that models routing as a minimum-cost maximum-flow problem and integrates a SoftTopk operator. MaxScore resolves the fundamental limitations of iterative rerouting and optimal transport formulations, achieving lower training losses and higher evaluation scores at equivalent FLOPs compared to both constrained and unconstrained baselines. Implementation details and experimental configurations can be obtained from \hrefthis https URLMaxScore .
zh
[NLP-31] Atom-Searcher: Enhancing Agent ic Deep Research via Fine-Grained Atomic Thought Reward
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在处理复杂任务时因静态内部知识受限、检索增强生成(Retrieval-Augmented Generation, RAG)在多跳推理和策略性搜索中受限,以及当前基于结果的强化学习(Outcome-based Reinforcement Learning, RL)方法存在梯度冲突与奖励稀疏性等问题。其解决方案的关键在于提出原子化思维(Atomic Thought)这一新型LLM推理范式,将推理过程分解为细粒度的功能单元,并通过推理奖励模型(Reasoning Reward Models, RRMs)提供细粒度的原子思维奖励(Atomic Thought Rewards, ATR),进而构建了名为Atom-Searcher的新型强化学习框架,该框架采用课程启发式的奖励调度机制,优先引导过程级ATR并逐步过渡到结果奖励,从而加速有效推理路径的收敛,显著提升多跳推理与自主深度研究能力。
链接: https://arxiv.org/abs/2508.12800
作者: Yong Deng,Guoqing Wang,Zhenzhe Ying,Xiaofeng Wu,Jinzhen Lin,Wenwen Xiong,Yuqin Dai,Shuo Yang,Zhanwei Zhang,Qiwen Wang,Yang Qin,Changhua Meng
机构: Ant Group
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule, prioritizing process-level ATR early and transitioning to outcome rewards, accelerating convergence on effective reasoning paths. Experiments on seven benchmarks show consistent improvements over the state-of-the-art. Key advantages include: (1) Atom-Searcher scales computation at test-time. (2) Atomic Thought provides supervision anchors for RRMs, bridging deep research tasks and RRMs. (3) Atom-Searcher exhibits more interpretable, human-like reasoning patterns.
zh
[NLP-32] Bridging Human and LLM Judgments: Understanding and Narrowing the Gap
【速读】: 该论文旨在解决大语言模型作为评判者(LLM-as-a-judge)时其评估结果与人类判断存在系统性偏差的问题。解决方案的关键在于提出一个统一的统计框架——Bridge,该框架通过假设每个提示-响应对存在一个潜在的人类偏好得分,并将LLM的偏离建模为协变量的线性变换,从而显式地在绝对评分和成对比较两种范式下桥接人类与LLM的评价体系。这一方法不仅提供了对LLM评分进行校准的简洁且原理清晰的路径,还能揭示人类与LLM之间系统性差异的来源。
链接: https://arxiv.org/abs/2508.12792
作者: Felipe Maia Polo,Xinhe Wang,Mikhail Yurochkin,Gongjun Xu,Moulinath Banerjee,Yuekai Sun
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
备注:
Abstract:Large language models are increasingly used as judges (LLM-as-a-judge) to evaluate model outputs at scale, but their assessments often diverge systematically from human judgments. We present Bridge, a unified statistical framework that explicitly bridges human and LLM evaluations under both absolute scoring and pairwise comparison paradigms. Bridge posits a latent human preference score for each prompt-response pair and models LLM deviations as linear transformations of covariates that capture sources of discrepancies. This offers a simple and principled framework for refining LLM ratings and characterizing systematic discrepancies between humans and LLMs. We provide an efficient fitting algorithm with asymptotic guarantees for statistical inference. Using six LLM judges and two benchmarks (BigGen Bench and Chatbot Arena), Bridge achieves higher agreement with human ratings (accuracy, calibration, and KL divergence) and exposes systematic human-LLM gaps.
zh
[NLP-33] Reinforcement Learning with Rubric Anchors
【速读】: 该论文旨在解决强化学习从可验证奖励(Reinforcement Learning from Verifiable Rewards, RLVR)在开放任务中应用受限的问题,即传统RLVR依赖于自动可验证的信号(如代码通过单元测试或数学推理答案正确),难以适用于需要主观判断的开放-ended任务。解决方案的关键在于引入基于评分量表(rubric-based rewards)的奖励机制,利用结构化、模型可解释的评分规则对主观输出进行自动评分,从而将RLVR扩展至人文社科等开放领域。作者构建了迄今最大的rubric奖励系统(超10,000条来自人类、大语言模型或人机协作的rubric),并提出一套清晰的训练框架,在仅使用5,000+样本的情况下显著提升模型在开放任务上的表现(+5.2%),同时实现细粒度风格控制,缓解“AI式”语气问题,生成更自然的人类表达。
链接: https://arxiv.org/abs/2508.12790
作者: Zenan Huang,Yihong Zhuang,Guoshan Lu,Zeyu Qin,Haokai Xu,Tianyu Zhao,Ru Peng,Jiaqi Hu,Zhanming Shen,Xiaomeng Hu,Xijun Gu,Peiyi Tu,Jiaxin Liu,Wenyu Chen,Yuzhuo Fu,Zhiting Fan,Yanmei Gu,Yuanyuan Wang,Zhengkai Yang,Jianguo Li,Junbo Zhao
机构: 未知
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: technical report
Abstract:Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a powerful paradigm for enhancing Large Language Models (LLMs), exemplified by the success of OpenAI’s o-series. In RLVR, rewards are derived from verifiable signals-such as passing unit tests in code generation or matching correct answers in mathematical reasoning. While effective, this requirement largely confines RLVR to domains with automatically checkable outcomes. To overcome this, we extend the RLVR paradigm to open-ended tasks by integrating rubric-based rewards, where carefully designed rubrics serve as structured, model-interpretable criteria for automatic scoring of subjective outputs. We construct, to our knowledge, the largest rubric reward system to date, with over 10,000 rubrics from humans, LLMs, or a hybrid human-LLM collaboration. Implementing rubric-based RL is challenging; we tackle these issues with a clear framework and present an open-sourced Qwen-30B-A3B model with notable gains: 1) With only 5K+ samples, our system improves by +5.2% on open-ended benchmarks (especially humanities), outperforming a 671B DeepSeek-V3 model by +2.4%, while preserving general and reasoning abilities. 2) Our method provides fine-grained stylistic control, using rubrics as anchors to mitigate the “AI-like” tone and produce more human-like, expressive responses. We share key lessons in rubric construction, data selection, and training, and discuss limitations and future releases.
zh
[NLP-34] HeteroRAG : A Heterogeneous Retrieval-Augmented Generation Framework for Medical Vision Language Tasks
【速读】: 该论文旨在解决医学视觉语言模型(Medical Large Vision-Language Models, Med-LVLMs)在临床应用中因事实性错误和不可靠输出而带来的风险问题,特别是现有检索增强生成(Retrieval-Augmented Generation, RAG)系统难以在异构来源间有效检索,导致分析结果缺乏事实依据且临床决策可信度不足。解决方案的关键在于提出HeteroRAG框架:通过构建包含多模态报告库与多样化文本语料的MedAtlas知识库,引入模态特定CLIP(Modality-specific CLIPs)实现跨模态报告精准检索,并设计多语料查询生成器(Multi-corpora Query Generator)动态适配不同来源的查询构造;最终利用异构知识偏好微调(Heterogeneous Knowledge Preference Tuning)对Med-LVLM进行训练,实现跨模态与多源知识对齐,从而显著提升模型的事实准确性和可靠性。
链接: https://arxiv.org/abs/2508.12778
作者: Zhe Chen,Yusheng Liao,Shuyang Jiang,Zhiyuan Zhu,Haolin Li,Yanfeng Wang,Yu Wang
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Medical large vision-language Models (Med-LVLMs) have shown promise in clinical applications but suffer from factual inaccuracies and unreliable outputs, posing risks in real-world diagnostics. While retrieval-augmented generation has emerged as a potential solution, current medical multimodal RAG systems are unable to perform effective retrieval across heterogeneous sources. The irrelevance of retrieved reports affects the factuality of analysis, while insufficient knowledge affects the credibility of clinical decision-making. To bridge the gap, we construct MedAtlas, which includes extensive multimodal report repositories and diverse text corpora. Based on it, we present HeteroRAG, a novel framework that enhances Med-LVLMs through heterogeneous knowledge sources. The framework introduces Modality-specific CLIPs for effective report retrieval and a Multi-corpora Query Generator for dynamically constructing queries for diverse corpora. Incorporating knowledge from such multifaceted sources, Med-LVLM is then trained with Heterogeneous Knowledge Preference Tuning to achieve cross-modality and multi-source knowledge alignment. Extensive experiments across 12 datasets and 3 modalities demonstrate that the proposed HeteroRAG achieves state-of-the-art performance in most medical vision language benchmarks, significantly improving factual accuracy and reliability of Med-LVLMs.
zh
[NLP-35] From SALAMANDRA to SALAMANDRATA: BSC Submission for WMT25 General Machine Translation Shared Task
【速读】: 该论文旨在解决多语言机器翻译(Machine Translation, MT)任务中模型性能不足的问题,特别是在38种欧洲语言间的跨语言翻译能力优化。解决方案的关键在于提出SALAMANDRATA系列模型,其基于前代SALAMANDRA LLMs进行改进,采用两阶段训练策略:首先在平行语料上进行持续预训练(continual pre-training),随后在高质量指令数据上进行监督微调(supervised fine-tuning)。针对WMT25通用机器翻译共享任务,研究团队进一步对7B版本模型进行了词汇扩展和针对性再训练,以适配非欧洲语言并提升全方向翻译性能;同时引入基于最小贝叶斯风险(Minimum Bayes Risk Decoding)和COMET/COMET-KIWI重排序的双质量感知解码策略,显著增强译文质量与一致性。
链接: https://arxiv.org/abs/2508.12774
作者: Javier Garcia Gilabert,Xixian Liao,Severino Da Dalt,Ella Bohman,Audrey Mash,Francesca De Luca Fornaciari,Irene Baucells,Joan Llop,Miguel Claramunt Argote,Carlos Escolano,Maite Melero
机构: Barcelona Supercomputing Center (巴塞罗那超级计算中心); Universitat Politècnica de Catalunya (加泰罗尼亚理工大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:In this paper, we present the SALAMANDRATA family of models, an improved iteration of SALAMANDRA LLMs (Gonzalez-Agirre et al., 2025) specifically trained to achieve strong performance in translation-related tasks for 38 European languages. SALAMANDRATA comes in two scales: 2B and 7B parameters. For both versions, we applied the same training recipe with a first step of continual pre-training on parallel data, and a second step of supervised fine-tuning on high-quality instructions. The BSC submission to the WMT25 General Machine Translation shared task is based on the 7B variant of SALAMANDRATA. We first adapted the model vocabulary to support the additional non-European languages included in the task. This was followed by a second phase of continual pre-training and supervised fine-tuning, carefully designed to optimize performance across all translation directions for this year’s shared task. For decoding, we employed two quality-aware strategies: Minimum Bayes Risk Decoding and Tuned Re-ranking using COMET and COMET-KIWI respectively. We publicly release both the 2B and 7B versions of SALAMANDRATA, along with the newer SALAMANDRATA-V2 model, on Hugging Face1
zh
[NLP-36] CRED-SQL: Enhancing Real-world Large Scale Database Text-to-SQL Parsing through Cluster Retrieval and Execution Description
【速读】: 该论文旨在解决大型数据库环境下Text-to-SQL系统中存在的语义不匹配问题,尤其是自然语言问题(NLQ)与对应SQL查询之间因语义相似属性导致的模式链接错误和生成过程中的语义漂移,从而影响模型准确率。解决方案的关键在于提出CRED-SQL框架,其核心创新包括:1)基于聚类的大规模模式检索机制,用于精准定位与NLQ最相关的表和列,缓解模式错位;2)引入中间自然语言表示——执行描述语言(Execution Description Language, EDL),将Text-to-SQL任务分解为Text-to-EDL和EDL-to-SQL两个阶段,借助大语言模型(LLM)的强大推理能力降低语义偏差,提升生成准确性与可扩展性。
链接: https://arxiv.org/abs/2508.12769
作者: Shaoming Duan,Zirui Wang,Chuanyi Liu,Zhibin Zhu,Yuhao Zhang,Peiyi Han,Liang Yan,Zewu Penge
机构: Harbin Institute of Technology, Shenzhen, China (哈尔滨工业大学深圳校区); Pengcheng Laboratory, Shenzhen, China (鹏城实验室); Mindflow.ai; Inspur Cloud Information Technology Co., Ltd, Jinan 250101, China (浪潮云信息技术有限公司); Guangdong Power Grid Co., Ltd, China (广东电网公司)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advances in large language models (LLMs) have significantly improved the accuracy of Text-to-SQL systems. However, a critical challenge remains: the semantic mismatch between natural language questions (NLQs) and their corresponding SQL queries. This issue is exacerbated in large-scale databases, where semantically similar attributes hinder schema linking and semantic drift during SQL generation, ultimately reducing model accuracy. To address these challenges, we introduce CRED-SQL, a framework designed for large-scale databases that integrates Cluster Retrieval and Execution Description. CRED-SQL first performs cluster-based large-scale schema retrieval to pinpoint the tables and columns most relevant to a given NLQ, alleviating schema mismatch. It then introduces an intermediate natural language representation-Execution Description Language (EDL)-to bridge the gap between NLQs and SQL. This reformulation decomposes the task into two stages: Text-to-EDL and EDL-to-SQL, leveraging LLMs’ strong general reasoning capabilities while reducing semantic deviation. Extensive experiments on two large-scale, cross-domain benchmarks-SpiderUnion and BirdUnion-demonstrate that CRED-SQL achieves new state-of-the-art (SOTA) performance, validating its effectiveness and scalability. Our code is available at this https URL
zh
[NLP-37] LinguaSafe: A Comprehensive Multilingual Safety Benchmark for Large Language Models
【速读】: 该论文旨在解决当前大型语言模型(Large Language Models, LLMs)多语言安全评估中缺乏全面性和多样性数据的问题,尤其针对匈牙利语到马来语等资源较少的语言存在评估空白。解决方案的关键在于提出LinguaSafe——一个涵盖12种语言、共4.5万条数据的多维度、细粒度多语言安全基准测试集,其数据通过翻译、再创作和本地采集相结合的方式构建,确保语言真实性,并引入直接与间接安全评估及过度敏感性检测机制,从而系统性地提升LLMs在不同语言和文化背景下的安全对齐能力。
链接: https://arxiv.org/abs/2508.12733
作者: Zhiyuan Ning,Tianle Gu,Jiaxin Song,Shixin Hong,Lingyu Li,Huacan Liu,Jie Li,Yixu Wang,Meng Lingyu,Yan Teng,Yingchun Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 7pages, 5 figures
Abstract:The widespread adoption and increasing prominence of large language models (LLMs) in global technologies necessitate a rigorous focus on ensuring their safety across a diverse range of linguistic and cultural contexts. The lack of a comprehensive evaluation and diverse data in existing multilingual safety evaluations for LLMs limits their effectiveness, hindering the development of robust multilingual safety alignment. To address this critical gap, we introduce LinguaSafe, a comprehensive multilingual safety benchmark crafted with meticulous attention to linguistic authenticity. The LinguaSafe dataset comprises 45k entries in 12 languages, ranging from Hungarian to Malay. Curated using a combination of translated, transcreated, and natively-sourced data, our dataset addresses the critical need for multilingual safety evaluations of LLMs, filling the void in the safety evaluation of LLMs across diverse under-represented languages from Hungarian to Malay. LinguaSafe presents a multidimensional and fine-grained evaluation framework, with direct and indirect safety assessments, including further evaluations for oversensitivity. The results of safety and helpfulness evaluations vary significantly across different domains and different languages, even in languages with similar resource levels. Our benchmark provides a comprehensive suite of metrics for in-depth safety evaluation, underscoring the critical importance of thoroughly assessing multilingual safety in LLMs to achieve more balanced safety alignment. Our dataset and code are released to the public to facilitate further research in the field of multilingual LLM safety.
zh
[NLP-38] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在跨学科复杂推理任务中表现不足的问题,尤其是现有推理数据集普遍存在学科覆盖范围窄、结构深度不够难以激发高质量推理行为的局限性。其解决方案的关键在于提出 DESIGNER 数据合成流水线,核心创新是引入“设计逻辑”(Design Logic)概念——通过逆向工程从超过12万个现有问题中抽象出设计模式,并将其与不同学科的原始文本材料(书籍语料库和网络语料库)匹配,从而生成具有更高难度和多样性的多学科推理问题。该方法显著提升了合成数据的质量,实验证明基于此生成的数据集可有效提升模型在多学科推理任务上的性能。
链接: https://arxiv.org/abs/2508.12726
作者: Weize Liu,Yongchi Zhao,Yijia Luo,Mingyu Xu,Jiaheng Liu,Yanan Li,Xiguo Hu,Yuchi Xu,Wenbo Su,Bo Zheng
机构: Alibaba Group (阿里巴巴集团); Nanjing University (南京大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often either lack disciplinary breadth or the structural depth necessary to elicit robust reasoning behaviors. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (book corpus and web corpus) to generate multidisciplinary challenging questions. A core innovation of our approach is the introduction of a Design Logic concept, which mimics the question-creation process of human educators. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with disciplinary source materials, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Based on this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: Design-Logic-Reasoning-Book (DLR-Book), containing 3.04 million challenging questions synthesized from the book corpus, and Design-Logic-Reasoning-Web (DLR-Web), with 1.66 million challenging questions from the web corpus. Our data analysis demonstrates that the questions synthesized by our method exhibit substantially greater difficulty and diversity than those in the baseline datasets. We validate the effectiveness of these datasets by conducting SFT experiments on the Qwen3-8B-Base and Qwen3-4B-Base models. The results show that our dataset significantly outperforms existing multidisciplinary datasets of the same volume. Training with the full datasets further enables the models to surpass the multidisciplinary reasoning performance of the official Qwen3-8B and Qwen3-4B models.
zh
[NLP-39] oolACE-MT: Non-Autoregressive Generation for Agent ic Multi-Turn Interaction
【速读】: 该论文旨在解决生成式 AI (Generative AI) 场景中多智能体任务求解(agentic task-solving)的数据生成难题,尤其是现有基于仿真方法依赖昂贵的自回归式多智能体交互,导致数据生成效率低且难以扩展。其解决方案的关键在于提出一种非自回归迭代生成框架 ToolACE-MT,通过三个阶段实现高质量多轮对话的高效构建:首先进行粗粒度初始化以建立结构完整的对话骨架;随后通过掩码填充(mask-and-fill)操作逐步引入语义复杂性和动态交互细节;最后利用规则与模型结合的离线验证机制确保对话的逻辑一致性与正确性。该方法显著提升了 agentic 对话数据的生成效率、质量与泛化能力。
链接: https://arxiv.org/abs/2508.12685
作者: Xingshan Zeng,Weiwen Liu,Lingzhi Wang,Liangyou Li,Fei Mi,Yasheng Wang,Lifeng Shang,Xin Jiang,Qun Liu
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby limiting real-world performance of agentic tasks. In this paper, we propose a novel Non-Autoregressive Iterative Generation framework, called ToolACE-MT, for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement via mask-and-fill operations; and the offline verification phase ensures correctness and coherence via rule- and model-based checks. Experiments demonstrate that ToolACE-MT enables efficient, effective and generalizable agentic data generation, offering a new paradigm for high-quality data construction in tool-augmented LLM scenarios.
zh
[NLP-40] Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation
【速读】: 该论文旨在解决当前视觉语言模型(Vision-Language Models, VLMs)在训练过程中受限于任务范围狭窄、难以跨域泛化的问题,尤其在数学与逻辑推理之外的多样化视觉推理任务中表现不足,根源在于缺乏广泛可用且可验证的奖励数据,以及多源异构数据融合困难。解决方案的关键在于构建一个涵盖46个数据源、8个维度(如信息图、空间推理、医学图像等)的综合性强化学习(Reinforcement Learning, RL)就绪视觉推理数据集,并提出基于影响函数(influence function)的数据选择策略与基于难度的过滤机制,以筛选高质量样本;随后采用多轮强化学习(multi-round RL)结合数据课程(data curriculum)迭代优化模型,最终训练出名为Vision-G1的VLM,在多个视觉推理基准上达到SOTA性能,超越同类规模模型及商用模型如GPT-4o和Gemini-1.5 Flash。
链接: https://arxiv.org/abs/2508.12680
作者: Yuheng Zha,Kun Zhou,Yujia Wu,Yushu Wang,Jie Feng,Zhi Xu,Shibo Hao,Zhengzhong Liu,Eric P. Xing,Zhiting Hu
机构: UC San Diego(加州大学圣地亚哥分校); Carnegie Mellon University(卡内基梅隆大学); MBZUAI(穆罕默德本扎耶德人工智能大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注:
Abstract:Despite their success, current training pipelines for reasoning VLMs focus on a limited range of tasks, such as mathematical and logical reasoning. As a result, these models face difficulties in generalizing their reasoning capabilities to a wide range of domains, primarily due to the scarcity of readily available and verifiable reward data beyond these narrowly defined areas. Moreover, integrating data from multiple domains is challenging, as the compatibility between domain-specific datasets remains uncertain. To address these limitations, we build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions, covering a wide range of tasks such as infographic, mathematical, spatial, cross-image, graphic user interface, medical, common sense and general science. We propose an influence function based data selection and difficulty based filtering strategy to identify high-quality training samples from this dataset. Subsequently, we train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities. Our model achieves state-of-the-art performance across various visual reasoning benchmarks, outperforming similar-sized VLMs and even proprietary models like GPT-4o and Gemini-1.5 Flash. The model, code and dataset are publicly available at this https URL.
zh
[NLP-41] Leverag ing Large Language Models for Predictive Analysis of Human Misery
【速读】: 该论文旨在解决如何利用大型语言模型(Large Language Models, LLMs)从自然语言描述中预测人类感知的痛苦指数(misery score)这一问题,其核心挑战在于将主观情感体验量化为0到100之间的标量值。解决方案的关键在于引入多种提示策略(prompting strategies),包括零样本(zero-shot)、固定上下文少样本(fixed-context few-shot)和基于BERT句向量检索的提示方法,并发现少样本策略显著优于零样本基线,表明上下文示例在情感预测中的重要性;进一步地,作者设计了“痛苦游戏秀”(Misery Game Show)这一新颖的游戏化评估框架,通过结构化轮次(序数比较、二分类、标量估计及反馈驱动推理)测试LLM在动态情境下的适应能力,从而不仅衡量预测准确性,还评估其在纠正反馈下进行情绪推理的潜力。
链接: https://arxiv.org/abs/2508.12669
作者: Bishanka Seal,Rahul Seetharaman,Aman Bansal,Abhilash Nandy
机构: 未知
类目: Computation and Language (cs.CL); Computers and Society (cs.CY)
备注: 14 pages, 4 tables
Abstract:This study investigates the use of Large Language Models (LLMs) for predicting human-perceived misery scores from natural language descriptions of real-world scenarios. The task is framed as a regression problem, where the model assigns a scalar value from 0 to 100 to each input statement. We evaluate multiple prompting strategies, including zero-shot, fixed-context few-shot, and retrieval-based prompting using BERT sentence embeddings. Few-shot approaches consistently outperform zero-shot baselines, underscoring the value of contextual examples in affective prediction. To move beyond static evaluation, we introduce the “Misery Game Show”, a novel gamified framework inspired by a television format. It tests LLMs through structured rounds involving ordinal comparison, binary classification, scalar estimation, and feedback-driven reasoning. This setup enables us to assess not only predictive accuracy but also the model’s ability to adapt based on corrective feedback. The gamified evaluation highlights the broader potential of LLMs in dynamic emotional reasoning tasks beyond standard regression. Code and data link: this https URL
zh
[NLP-42] Breaking Language Barriers: Equitable Performance in Multilingual Language Models NAACL2025
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在低资源语言(Low-Resource Languages, LRLs)如印地语或斯瓦希里语中进行常识推理(Common Sense Reasoning, CSR)时性能显著低于高资源语言(High-Resource Languages, HRLs)如英语的问题,从而弥合不同语言群体间对高质量LLM输出的获取差异。解决方案的关键在于通过受控的语言混合方法生成合成的代码混用(code-switched)文本,并以此对LLM进行微调,实验证明该策略能显著提升LRL下的模型表现,同时保持或增强HRL上的性能。
链接: https://arxiv.org/abs/2508.12662
作者: Tanay Nagar,Grigorii Khvatskii,Anna Sokol,Nitesh V. Chawla
机构: University of Wisconsin–Madison (威斯康星大学麦迪逊分校); NSF Center for Computer Assisted Synthesis (CCAS), University of Notre Dame (美国国家科学基金会计算机辅助合成中心,圣母大学); University of Notre Dame (圣母大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted as a non-archival work-in-progress paper at the NAACL 2025 Student Research Workshop
Abstract:Cutting-edge LLMs have emerged as powerful tools for multilingual communication and understanding. However, LLMs perform worse in Common Sense Reasoning (CSR) tasks when prompted in low-resource languages (LRLs) like Hindi or Swahili compared to high-resource languages (HRLs) like English. Equalizing this inconsistent access to quality LLM outputs is crucial to ensure fairness for speakers of LRLs and across diverse linguistic communities. In this paper, we propose an approach to bridge this gap in LLM performance. Our approach involves fine-tuning an LLM on synthetic code-switched text generated using controlled language-mixing methods. We empirically demonstrate that fine-tuning LLMs on synthetic code-switched datasets leads to substantial improvements in LRL model performance while preserving or enhancing performance in HRLs. Additionally, we present a new dataset of synthetic code-switched text derived from the CommonSenseQA dataset, featuring three distinct language ratio configurations.
zh
[NLP-43] Prompt-Induced Linguistic Fingerprints for LLM -Generated Fake News Detection
【速读】: 该论文旨在解决由大语言模型(Large Language Models, LLMs)生成的虚假新闻日益增多所带来的检测难题。传统方法主要依赖文本内容本身的分析,但由于LLM生成的内容往往语义连贯且看似真实,其细微伪造痕迹难以被识别。论文的关键创新在于提出一种名为Linguistic Fingerprints Extraction (LIFE)的新方法,其核心思想是通过分布差异分析发现恶意提示诱导下生成文本的语言指纹——即在词级别概率分布上存在的统计性偏移。LIFE通过重建词级概率分布来提取具有判别性的模式,并结合关键片段(key-fragment)技术增强这些微弱的语言差异,从而显著提升对LLM生成虚假新闻的检测准确率,同时在人类撰写虚假新闻的检测中也保持高性能。
链接: https://arxiv.org/abs/2508.12632
作者: Chi Wang,Min Gao,Zongwei Wang,Junwei Yin,Kai Shu,Chenghua Lin
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:With the rapid development of large language models, the generation of fake news has become increasingly effortless, posing a growing societal threat and underscoring the urgent need for reliable detection methods. Early efforts to identify LLM-generated fake news have predominantly focused on the textual content itself; however, because much of that content may appear coherent and factually consistent, the subtle traces of falsification are often difficult to uncover. Through distributional divergence analysis, we uncover prompt-induced linguistic fingerprints: statistically distinct probability shifts between LLM-generated real and fake news when maliciously prompted. Based on this insight, we propose a novel method named Linguistic Fingerprints Extraction (LIFE). By reconstructing word-level probability distributions, LIFE can find discriminative patterns that facilitate the detection of LLM-generated fake news. To further amplify these fingerprint patterns, we also leverage key-fragment techniques that accentuate subtle linguistic differences, thereby improving detection reliability. Our experiments show that LIFE achieves state-of-the-art performance in LLM-generated fake news and maintains high performance in human-written fake news. The code and data are available at this https URL.
zh
[NLP-44] Beyond GPT -5: Making LLM s Cheaper and Better via Performance-Efficiency Optimized Routing
【速读】: 该论文旨在解决大型语言模型(Large Language Model, LLM)发展中性能与效率难以平衡的核心挑战。现有方法往往在高精度和低延迟/资源消耗之间做出单一权衡,缺乏灵活适配不同应用场景的能力。其解决方案的关键在于提出Avengers-Pro这一测试时路由(test-time routing)框架,通过嵌入(embedding)和聚类对输入查询进行分析,并基于性能-效率评分动态分配至最合适的模型;该机制可统一处理所有性能-效率折衷场景,在多个基准测试中实现优于单一最优模型的准确率(提升+7%),或在显著降低计算成本(27%~63%)的前提下保持相近性能,最终达成帕累托前沿(Pareto frontier),即在给定成本下获得最高准确率,或在给定准确率下实现最低成本。
链接: https://arxiv.org/abs/2508.12631
作者: Yiqun Zhang,Hao Li,Jianhao Chen,Hangfan Zhang,Peng Ye,Lei Bai,Shuyue Hu
机构: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)
类目: Computation and Language (cs.CL)
备注: Ongoing work
Abstract:Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models – including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 – Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at this https URL.
zh
[NLP-45] Semantic Anchoring in Agent ic Memory: Leverag ing Linguistic Structures for Persistent Conversational Context
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在多轮对话和长期交互中因记忆持久性不足而导致的性能瓶颈问题。现有检索增强生成(Retrieval-Augmented Generation, RAG)系统通常将对话历史存储为密集向量,虽能捕捉语义相似性,但忽视了句法依赖、话语关系和指代链等细粒度语言结构,从而影响对复杂上下文信息的准确召回。解决方案的关键在于提出一种称为“语义锚定”(Semantic Anchoring)的混合代理记忆架构,通过整合依存句法分析(dependency parsing)、话语关系标注(discourse relation tagging)和指代消解(coreference resolution),构建结构化的记忆条目,以显式保留语言的语法与语用层次信息,显著提升事实召回率和话语连贯性,实验表明其相较强基线RAG系统在两项指标上分别提升高达18%。
链接: https://arxiv.org/abs/2508.12630
作者: Maitreyi Chatterjee,Devansh Agarwal
机构: Cornell University (康奈尔大学)
类目: Computation and Language (cs.CL)
备注: Paper is currently in peer review
Abstract:Large Language Models (LLMs) have demonstrated impressive fluency and task competence in conversational settings. However, their effectiveness in multi-session and long-term interactions is hindered by limited memory persistence. Typical retrieval-augmented generation (RAG) systems store dialogue history as dense vectors, which capture semantic similarity but neglect finer linguistic structures such as syntactic dependencies, discourse relations, and coreference links. We propose Semantic Anchoring, a hybrid agentic memory architecture that enriches vector-based storage with explicit linguistic cues to improve recall of nuanced, context-rich exchanges. Our approach combines dependency parsing, discourse relation tagging, and coreference resolution to create structured memory entries. Experiments on adapted long-term dialogue datasets show that semantic anchoring improves factual recall and discourse coherence by up to 18% over strong RAG baselines. We further conduct ablation studies, human evaluations, and error analysis to assess robustness and interpretability.
zh
[NLP-46] An LLM ASP Workflow for Joint Entity-Relation Extraction
【速读】: 该论文旨在解决联合实体关系抽取(Joint Entity-Relation Extraction, JERE)任务中传统机器学习方法对大规模标注数据依赖性强、难以融入领域知识且缺乏灵活性的问题。其解决方案的关键在于融合生成式预训练大语言模型(Generative Pretrained Large Language Models, LLMs)与答案集编程(Answer Set Programming, ASP)的能力:LLMs负责从无标注文本中理解语义,ASP则提供形式化知识表示与推理机制,实现对新增领域知识的“可扩展性”(elaboration tolerance),无需修改核心逻辑即可动态整合类型规范等先验信息。实验表明,在仅使用10%训练数据的情况下,该LLM + ASP工作流在多个JERE基准上优于现有最先进系统,尤其在SciERC数据集上实现了关系抽取性能提升2.5倍(35% vs 15%)。
链接: https://arxiv.org/abs/2508.12611
作者: Trang Tran,Trung Hoang Le,Huiping Cao,Tran Cao Son
机构: New Mexico State University (新墨西哥州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 13 pages, 1 figure, Accepted as Technical Communication, 41st International Conference on Logic Programming
Abstract:Joint entity-relation extraction (JERE) identifies both entities and their relationships simultaneously. Traditional machine-learning based approaches to performing this task require a large corpus of annotated data and lack the ability to easily incorporate domain specific information in the construction of the model. Therefore, creating a model for JERE is often labor intensive, time consuming, and elaboration intolerant. In this paper, we propose harnessing the capabilities of generative pretrained large language models (LLMs) and the knowledge representation and reasoning capabilities of Answer Set Programming (ASP) to perform JERE. We present a generic workflow for JERE using LLMs and ASP. The workflow is generic in the sense that it can be applied for JERE in any domain. It takes advantage of LLM’s capability in natural language understanding in that it works directly with unannotated text. It exploits the elaboration tolerant feature of ASP in that no modification of its core program is required when additional domain specific knowledge, in the form of type specifications, is found and needs to be used. We demonstrate the usefulness of the proposed workflow through experiments with limited training data on three well-known benchmarks for JERE. The results of our experiments show that the LLM + ASP workflow is better than state-of-the-art JERE systems in several categories with only 10% of training data. It is able to achieve a 2.5 times (35% over 15%) improvement in the Relation Extraction task for the SciERC corpus, one of the most difficult benchmarks.
zh
[NLP-47] Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
【速读】: 该论文旨在解决传统自动口语评估(Automated Speaking Assessment, ASA)系统在模态限制下的性能瓶颈问题,即纯文本方法缺失声学信息、纯音频方法缺乏语义上下文,导致评估不全面。其解决方案的关键在于引入多模态大语言模型(Multimodal Large Language Models, MLLM),通过统一框架同时处理语音与文本信息以实现更全面的评估。进一步地,为提升对“表达”(delivery)维度的评估准确性,作者提出“语音优先的多模态训练策略”(Speech-First Multimodal Training, SFMT),基于课程学习原则先建立稳健的语音建模基础,再进行跨模态协同融合,显著提升了整体评估性能(PCC从0.783提升至0.846),尤其在表达维度上较传统方法提升4%绝对准确率。
链接: https://arxiv.org/abs/2508.12591
作者: Yu-Hsuan Fang,Tien-Hong Lo,Yao-Ting Sung,Berlin Chen
机构: National Taiwan Normal University (国立台湾师范大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Sound (cs.SD)
备注: Accepted at IEEE ASRU 2025
Abstract:Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information while audio-based methods miss semantic context. Multimodal Large Language Models (MLLM) offer unprecedented opportunities for comprehensive ASA by simultaneously processing audio and text within unified frameworks. This paper presents a very first systematic study of MLLM for comprehensive ASA, demonstrating the superior performance of MLLM across the aspects of content and language use . However, assessment on the delivery aspect reveals unique challenges, which is deemed to require specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), leveraging a curriculum learning principle to establish more robust modeling foundations of speech before cross-modal synergetic fusion. A series of experiments on a benchmark dataset show MLLM-based systems can elevate the holistic assessment performance from a PCC value of 0.783 to 0.846. In particular, SFMT excels in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches, which also paves a new avenue for ASA.
zh
[NLP-48] Insight Rumors: A Novel Textual Rumor Locating and Marking Model Leverag ing Att_BiMamba2 Network
【速读】: 该论文旨在解决现有谣言检测模型仅能粗粒度分类文本是否为谣言,而无法精确定位并标记具体谣言内容的问题。解决方案的关键在于提出一种名为Insight Rumors的新模型,其核心创新包括:(1) 设计了带点积注意力机制的双向Mamba2网络(Att_BiMamba2),通过双向建模和注意力加权融合增强高维谣言特征表示;(2) 构建谣言定位与标记模块,利用跳跃连接将高维特征映射至低维标签空间,并结合条件随机场(CRF)对标签序列施加强约束,从而实现精准的谣言内容定位与标注。
链接: https://arxiv.org/abs/2508.12574
作者: Bin Ma,Yifei Zhang,Yongjin Xian,Qi Li,Linna Zhou,Gongxun Miao
机构: 未知
类目: ocial and Information Networks (cs.SI); Computation and Language (cs.CL)
备注:
Abstract:With the development of social media networks, rumor detection models have attracted more and more attention. Whereas, these models primarily focus on classifying contexts as rumors or not, lacking the capability to locate and mark specific rumor content. To address this limitation, this paper proposes a novel rumor detection model named Insight Rumors to locate and mark rumor content within textual data. Specifically, we propose the Bidirectional Mamba2 Network with Dot-Product Attention (Att_BiMamba2), a network that constructs a bidirectional Mamba2 model and applies dot-product attention to weight and combine the outputs from both directions, thereby enhancing the representation of high-dimensional rumor features. Simultaneously, a Rumor Locating and Marking module is designed to locate and mark rumors. The module constructs a skip-connection network to project high-dimensional rumor features onto low-dimensional label features. Moreover, Conditional Random Fields (CRF) is employed to impose strong constraints on the output label features, ensuring accurate rumor content location. Additionally, a labeled dataset for rumor locating and marking is constructed, with the effectiveness of the proposed model is evaluated through comprehensive experiments. Extensive experiments indicate that the proposed scheme not only detects rumors accurately but also locates and marks them in context precisely, outperforming state-of-the-art schemes that can only discriminate rumors roughly.
zh
[NLP-49] CorrSteer: Steering Improves Task Performance and Safety in LLM s through Correlation-based Sparse Autoencoder Feature Selection
【速读】: 该论文旨在解决稀疏自编码器(Sparse Autoencoders, SAEs)在下游控制任务中性能受限的问题,具体表现为对对比数据集的依赖以及大规模激活存储的需求。其解决方案的关键在于提出CorrSteer方法,通过在推理阶段将样本正确性与SAE生成token的激活值进行相关性分析,从而自动选择更相关的特征,并利用平均激活值计算控制系数,实现端到端的自动化SAE控制流程。该方法仅依赖推理时的激活信息,避免了虚假相关性,显著提升了问答、偏见缓解、越狱防御和推理等任务的表现,且在少量样本(如4000个)下即可获得显著提升。
链接: https://arxiv.org/abs/2508.12535
作者: Seonglae Cho,Zekun Wu,Adriano Koshiyama
机构: Holistic AI; University College London
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 42 pages, 9 tables
Abstract:Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby avoiding spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma 2 2B and LLaMA 3.1 8B, notably achieving a +4.1% improvement in MMLU performance and a +22.9% improvement in HarmBench with only 4000 samples. Selected features demonstrate semantically meaningful patterns aligned with each task’s requirements, revealing the underlying capabilities that drive performance. Our work establishes correlationbased selection as an effective and scalable approach for automated SAE steering across language model applications.
zh
[NLP-50] Mitigating Hallucinations in Large Language Models via Causal Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)中存在的逻辑不一致幻觉问题,即模型输出看似合理但违反推理原则的现象。现有方法如思维链(Chain-of-Thought, CoT)及其图结构变体仅在词元(token)层面进行推理,无法建模变量间的潜在因果关系,缺乏对条件独立性或因果识别假设的表达能力。其解决方案的关键在于提出一种监督微调框架——因果-DAG 构建与推理(Causal-DAG Construction and Reasoning for Supervised Fine-Tuning, CDCR-SFT),通过训练LLM显式构建变量级别的有向无环图(Directed Acyclic Graph, DAG),并在该结构上执行因果推理,从而提升模型的因果推理能力并减少逻辑不一致的幻觉。
链接: https://arxiv.org/abs/2508.12495
作者: Yuangang Li,Yiqing Shen,Yi Nian,Jiechao Gao,Ziyi Wang,Chenxiao Yu,Shawn Li,Jie Wang,Xiyang Hu,Yue Zhao
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Large language models (LLMs) exhibit logically inconsistent hallucinations that appear coherent yet violate reasoning principles, with recent research suggesting an inverse relationship between causal reasoning capabilities and such hallucinations. However, existing reasoning approaches in LLMs, such as Chain-of-Thought (CoT) and its graph-based variants, operate at the linguistic token level rather than modeling the underlying causal relationships between variables, lacking the ability to represent conditional independencies or satisfy causal identification assumptions. To bridge this gap, we introduce causal-DAG construction and reasoning (CDCR-SFT), a supervised fine-tuning framework that trains LLMs to explicitly construct variable-level directed acyclic graph (DAG) and then perform reasoning over it. Moreover, we present a dataset comprising 25,368 samples (CausalDR), where each sample includes an input question, explicit causal DAG, graph-based reasoning trace, and validated answer. Experiments on four LLMs across eight tasks show that CDCR-SFT improves the causal reasoning capability with the state-of-the-art 95.33% accuracy on CLADDER (surpassing human performance of 94.8% for the first time) and reduces the hallucination on HaluEval with 10% improvements. It demonstrates that explicit causal structure modeling in LLMs can effectively mitigate logical inconsistencies in LLM outputs. Code is available at this https URL.
zh
[NLP-51] he Structural Sources of Verb Meaning Revisited: Large Language Models Display Syntactic Bootstrapping
【速读】: 该论文试图解决的问题是:大型语言模型是否表现出类似于人类儿童的句法启动(syntactic bootstrapping)现象,即利用句法环境来学习动词语义。其解决方案的关键在于通过构建扰动数据集,系统性地移除句法信息或共现信息,然后在RoBERTa和GPT-2模型上进行训练与对比分析。结果表明,当句法线索被移除时,模型对动词的表征显著退化,尤其是心理动词(mental verbs)受影响更明显,而名词则在共现信息受损时表现更差,这验证了句法在动词学习中的核心作用,并为在大规模模型中测试发展认知假设提供了可行路径。
链接: https://arxiv.org/abs/2508.12482
作者: Xiaomeng Zhu,R. Thomas McCoy,Robert Frank
机构: Yale University (耶鲁大学); Wu Tsai Institute (伍兹·蔡研究所)
类目: Computation and Language (cs.CL)
备注:
Abstract:Syntactic bootstrapping (Gleitman, 1990) is the hypothesis that children use the syntactic environments in which a verb occurs to learn its meaning. In this paper, we examine whether large language models exhibit a similar behavior. We do this by training RoBERTa and GPT-2 on perturbed datasets where syntactic information is ablated. Our results show that models’ verb representation degrades more when syntactic cues are removed than when co-occurrence information is removed. Furthermore, the representation of mental verbs, for which syntactic bootstrapping has been shown to be particularly crucial in human verb learning, is more negatively impacted in such training regimes than physical verbs. In contrast, models’ representation of nouns is affected more when co-occurrences are distorted than when syntax is distorted. In addition to reinforcing the important role of syntactic bootstrapping in verb learning, our results demonstrated the viability of testing developmental hypotheses on a larger scale through manipulating the learning environments of large language models.
zh
[NLP-52] Is GPT -OSS Good? A Comprehensive Evaluation of OpenAI s Latest Open Source Models
【速读】: 该论文试图解决的问题是:在稀疏架构(sparse architectures)中,模型参数规模扩大是否能带来与密集模型(dense models)相当的性能提升,尤其是在开源大语言模型(open source large language models)部署场景下。解决方案的关键在于通过系统性评估GPT-OSS系列模型(含120B和20B参数的混合专家架构)与六种主流开源模型(参数范围14.7B–235B)在十个基准测试上的表现,采用标准化推理设置和统计验证方法(McNemar检验与效应量分析),发现尽管GPT-OSS-120B参数更多,但在多个任务(如HumanEval、MMLU)上反而弱于GPT-OSS-20B,且后者在内存和能耗方面更具优势。这一结果表明,稀疏架构中的扩展可能无法带来成比例的性能增益,提示需优化训练策略并更谨慎地选择适合实际部署需求的模型。
链接: https://arxiv.org/abs/2508.12461
作者: Ziqian Bi,Keyu Chen,Chiung-Yi Tseng,Danyang Zhang,Tianyang Wang,Hongying Luo,Lu Chen,Junming Huang,Jibin Guan,Junfeng Hao,Junhao Song
机构: AI Agent Lab, Vokram Group(维克兰集团); Purdue University (普渡大学); Georgia Institute of Technology (佐治亚理工学院); LuxMuse AI; ByteDance Inc (字节跳动公司); University of Minnesota (明尼苏达大学); Imperial College London (帝国理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:In August 2025, OpenAI released GPT-OSS models, its first open weight large language models since GPT-2 in 2019, comprising two mixture of experts architectures with 120B and 20B parameters. We evaluated both variants against six contemporary open source large language models ranging from 14.7B to 235B parameters, representing both dense and sparse designs, across ten benchmarks covering general knowledge, mathematical reasoning, code generation, multilingual understanding, and conversational ability. All models were tested in unquantised form under standardised inference settings, with statistical validation using McNemars test and effect size analysis. Results show that gpt-oss-20B consistently outperforms gpt-oss-120B on several benchmarks, such as HumanEval and MMLU, despite requiring substantially less memory and energy per response. Both models demonstrate mid-tier overall performance within the current open source landscape, with relative strength in code generation and notable weaknesses in multilingual tasks. These findings provide empirical evidence that scaling in sparse architectures may not yield proportional performance gains, underscoring the need for further investigation into optimisation strategies and informing more efficient model selection for future open source deployments.
zh
[NLP-53] LoraxBench: A Multitask Multilingual Benchmark Suite for 20 Indonesian Languages
【速读】: 该论文旨在解决印度尼西亚多语言环境下自然语言处理(Natural Language Processing, NLP)资源匮乏的问题,尤其是在低资源语言上的模型性能不足。其解决方案的关键在于构建了一个名为LoraxBench的基准测试集,覆盖20种印尼语方言及三种语言的正式程度变体(formality registers),并涵盖阅读理解、开放域问答、语言推理、因果推理、翻译和文化问答等6类任务。该基准揭示了当前主流多语言大语言模型(Large Language Models, LLMs)在印尼语及其低资源语言上表现不佳,且区域特定模型并未显著优于通用多语言模型,同时指出语言正式程度的变化(如爪哇语中的高礼貌语体Krama)会显著影响模型性能,从而为未来面向东南亚低资源语言的NLP研究提供了重要评估标准与改进方向。
链接: https://arxiv.org/abs/2508.12459
作者: Alham Fikri Aji,Trevor Cohn
机构: MBZUAI; The University of Melbourne; Google(谷歌)
类目: Computation and Language (cs.CL)
备注:
Abstract:As one of the world’s most populous countries, with 700 languages spoken, Indonesia is behind in terms of NLP progress. We introduce LoraxBench, a benchmark that focuses on low-resource languages of Indonesia and covers 6 diverse tasks: reading comprehension, open-domain QA, language inference, causal reasoning, translation, and cultural QA. Our dataset covers 20 languages, with the addition of two formality registers for three languages. We evaluate a diverse set of multilingual and region-focused LLMs and found that this benchmark is challenging. We note a visible discrepancy between performance in Indonesian and other languages, especially the low-resource ones. There is no clear lead when using a region-specific model as opposed to the general multilingual model. Lastly, we show that a change in register affects model performance, especially with registers not commonly found in social media, such as high-level politeness `Krama’ Javanese.
zh
[NLP-54] M3PO: Multimodal-Model-Guided Preference Optimization for Visual Instruction Following
【速读】: 该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂多模态指令遵循任务中,因依赖高成本且不一致的人工标注数据进行微调和偏好对齐而导致的开发瓶颈问题。传统监督微调(SFT)及偏好优化方法(如RLHF和DPO)难以高效利用模型自身生成空间来挖掘具有信息量的“难负样本”(hard negative samples)。其解决方案的关键在于提出一种名为多模态模型引导的偏好优化(Multimodal-Model-Guided Preference Optimization, M3PO)的新方法,该方法通过融合两个核心信号——用于评估外部质量的多模态对齐得分(Multimodal Alignment Score, MAS)与衡量模型内部置信度的自一致性得分(log-probability),构建出M3P-Score,从而精准筛选出高价值的偏好样本对,特别识别那些模型虽自信但错误的响应,进而实现基于LoRA的高效直接偏好优化(Direct Preference Optimization, DPO)微调,显著提升LVLM在多模态指令跟随任务上的性能表现。
链接: https://arxiv.org/abs/2508.12458
作者: Ruirui Gao,Emily Johnson,Bowen Tan,Yanfei Qian
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Vision-Language Models (LVLMs) hold immense potential for complex multimodal instruction following, yet their development is often hindered by the high cost and inconsistency of human annotation required for effective fine-tuning and preference alignment. Traditional supervised fine-tuning (SFT) and existing preference optimization methods like RLHF and DPO frequently struggle to efficiently leverage the model’s own generation space to identify highly informative “hard negative” samples. To address these challenges, we propose Multimodal-Model-Guided Preference Optimization (M3PO), a novel and data-efficient method designed to enhance LVLMs’ capabilities in visual instruction following. M3PO intelligently selects the most “learning-valuable” preference sample pairs from a diverse pool of LVLM-generated candidates. This selection is driven by a sophisticated mechanism that integrates two crucial signals: a Multimodal Alignment Score (MAS) to assess external quality and the model’s Self-Consistency / Confidence (log-probability) to gauge internal belief. These are combined into a novel M3P-Score, which specifically identifies preferred responses and challenging dispreferred responses that the model might confidently generate despite being incorrect. These high-quality preference pairs are then used for efficient Direct Preference Optimization (DPO) fine-tuning on base LVLMs like LLaVA-1.5 (7B/13B) using LoRA. Our extensive experiments demonstrate that M3PO consistently outperforms strong baselines, including SFT, simulated RLHF, vanilla DPO, and RM-DPO, across a comprehensive suite of multimodal instruction following benchmarks (MME-Bench, POPE, IFT, Human Pref. Score).
zh
[NLP-55] Uncovering Emergent Physics Representations Learned In-Context by Large Language Models
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在上下文学习(In-Context Learning, ICL)中如何实现跨任务泛化,尤其是其在物理系统动力学预测任务中的推理机制尚不明确的问题。解决方案的关键在于通过引入基于物理系统的动力学预测任务作为代理实验,结合稀疏自编码器(Sparse Autoencoders, SAEs)对模型残差流激活的分析,发现LLMs在ICL过程中能够编码出与能量等关键物理变量相关的特征表示,从而揭示了LLMs在真实世界结构化动态环境中学习并表征物理概念的能力。
链接: https://arxiv.org/abs/2508.12448
作者: Yeongwoo Song,Jaeyong Bae,Dong-Kyum Kim,Hawoong Jeong
机构: KAIST(韩国科学技术院); MPI for Security and Privacy(德国马克斯普朗克信息安全研究所)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 17 pages, 10 figures
Abstract:Large language models (LLMs) exhibit impressive in-context learning (ICL) abilities, enabling them to solve wide range of tasks via textual prompts alone. As these capabilities advance, the range of applicable domains continues to expand significantly. However, identifying the precise mechanisms or internal structures within LLMs that allow successful ICL across diverse, distinct classes of tasks remains elusive. Physics-based tasks offer a promising testbed for probing this challenge. Unlike synthetic sequences such as basic arithmetic or symbolic equations, physical systems provide experimentally controllable, real-world data based on structured dynamics grounded in fundamental principles. This makes them particularly suitable for studying the emergent reasoning behaviors of LLMs in a realistic yet tractable setting. Here, we mechanistically investigate the ICL ability of LLMs, especially focusing on their ability to reason about physics. Using a dynamics forecasting task in physical systems as a proxy, we evaluate whether LLMs can learn physics in context. We first show that the performance of dynamics forecasting in context improves with longer input contexts. To uncover how such capability emerges in LLMs, we analyze the model’s residual stream activations using sparse autoencoders (SAEs). Our experiments reveal that the features captured by SAEs correlate with key physical variables, such as energy. These findings demonstrate that meaningful physical concepts are encoded within LLMs during in-context learning. In sum, our work provides a novel case study that broadens our understanding of how LLMs learn in context.
zh
[NLP-56] Adversarial Attacks on VQA-NLE: Exposing and Alleviating Inconsistencies in Visual Question Answering Explanations
【速读】: 该论文旨在解决当前视觉问答中的自然语言解释(Natural Language Explanation, NLE)系统存在解释不一致、缺乏对上下文真实理解的问题,这些问题暴露了模型推理流程或解释生成机制的脆弱性。解决方案的关键在于提出一种新颖的对抗攻击策略——通过最小扰动图像来诱导模型产生矛盾或虚假的解释输出,同时引入基于外部知识的缓解方法以增强模型鲁棒性。实验表明,该攻击策略能有效揭示现有VQA-NLE系统的安全与可靠性缺陷,而知识驱动的防御机制则展现出提升系统稳定性的潜力。
链接: https://arxiv.org/abs/2508.12430
作者: Yahsin Yeh,Yilun Wu,Bokai Ruan,Honghan Shuai
机构: National Yang Ming Chiao Tung University (国立阳明交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Natural language explanations in visual question answering (VQA-NLE) aim to make black-box models more transparent by elucidating their decision-making processes. However, we find that existing VQA-NLE systems can produce inconsistent explanations and reach conclusions without genuinely understanding the underlying context, exposing weaknesses in either their inference pipeline or explanation-generation mechanism. To highlight these vulnerabilities, we not only leverage an existing adversarial strategy to perturb questions but also propose a novel strategy that minimally alters images to induce contradictory or spurious outputs. We further introduce a mitigation method that leverages external knowledge to alleviate these inconsistencies, thereby bolstering model robustness. Extensive evaluations on two standard benchmarks and two widely used VQA-NLE models underscore the effectiveness of our attacks and the potential of knowledge-based defenses, ultimately revealing pressing security and reliability concerns in current VQA-NLE systems.
zh
[NLP-57] Non-Iterative Symbolic-Aided Chain-of-Thought for Logical Reasoning
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在逻辑推理任务中缺乏透明性、可解释性和可分析性的问题,尤其是在复杂推理场景下,标准链式思维(Chain-of-Thought, CoT)方法难以有效处理多约束或规则依赖的推理过程。解决方案的关键在于提出符号辅助的链式思维(Symbolic-Aided Chain-of-Thought, Symbolic-Aided CoT),通过在少样本提示(few-shot prompts)中引入轻量级符号表示(symbolic representations),以一致的策略结构化推理步骤,从而在非迭代推理过程中显式地增强推理模式的表达能力。此方法在保持标准提示技术泛化性的前提下,显著提升了LLMs在多个逻辑推理基准(ProofWriter、FOLIO、ProntoQA和LogicalDeduction)上的表现,尤其在需要多规则协同推理的任务中效果突出。
链接: https://arxiv.org/abs/2508.12425
作者: Phuong Minh Nguyen,Tien Huu Dang,Naoya Inoue
机构: RebelsNLU Lab, IS school (信息科学学院); Japan Advanced Institute of Science and Technology (日本高级科学技术研究院)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:This work introduces Symbolic-Aided Chain-of-Thought (CoT), an improved approach to standard CoT, for logical reasoning in large language models (LLMs). The key idea is to integrate lightweight symbolic representations into few-shot prompts, structuring the inference steps with a consistent strategy to make reasoning patterns more explicit within a non-iterative reasoning process. By incorporating these symbolic structures, our method preserves the generalizability of standard prompting techniques while enhancing the transparency, interpretability, and analyzability of LLM logical reasoning. Extensive experiments on four well-known logical reasoning benchmarks – ProofWriter, FOLIO, ProntoQA, and LogicalDeduction, which cover diverse reasoning scenarios – demonstrate the effectiveness of the proposed approach, particularly in complex reasoning tasks that require navigating multiple constraints or rules. Notably, Symbolic-Aided CoT consistently improves LLMs’ reasoning capabilities across various model sizes and significantly outperforms conventional CoT on three out of four datasets, ProofWriter, ProntoQA, and LogicalDeduction.
zh
[NLP-58] he Cultural Gene of Large Language Models : A Study on the Impact of Cross-Corpus Training on Model Values and Biases
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在全球部署中所隐含的文化与伦理假设缺乏系统性探究的问题。其核心挑战在于,尽管LLMs广泛应用于跨文化场景,但它们的价值取向是否受训练数据文化背景影响、以及如何量化这种影响仍不明确。解决方案的关键在于提出“文化基因”(cultural gene)的概念——即LLMs从训练语料中继承的系统性价值倾向,并构建了一个包含200个标准化零样本提示的 Cultural Probe Dataset (CPD),聚焦于个体主义-集体主义(Individualism-Collectivism, IDV)和权力距离(Power Distance, PDI)两个经典跨文化维度。通过对比西方中心模型GPT-4与东方中心模型ERNIE Bot在这些维度上的表现差异及其与霍夫斯泰德国家文化指标的对齐程度,研究揭示了LLMs作为其训练文化语料统计镜像的本质特征,从而为实现文化敏感型评估与部署提供了方法论基础,以避免算法层面的文化霸权。
链接: https://arxiv.org/abs/2508.12411
作者: Emanuel Z. Fenech-Borg,Tilen P. Meznaric-Kos,Milica D. Lekovic-Bojovic,Arni J. Hentze-Djurhuus
机构: University of Malta (马耳他大学); University of Primorska (普里莫尔斯卡大学); University of Montenegro (蒙特内哥罗大学); University of the Faroe Islands (法罗群岛大学)
类目: Computation and Language (cs.CL)
备注: 10 pages, 5 figures, IEEE conference format, submitted to [Conference Name]
Abstract:Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We propose the notion of a “cultural gene” – a systematic value orientation that LLMs inherit from their training corpora – and introduce a Cultural Probe Dataset (CPD) of 200 prompts targeting two classic cross-cultural dimensions: Individualism-Collectivism (IDV) and Power Distance (PDI). Using standardized zero-shot prompts, we compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions. GPT-4 exhibits individualistic and low-power-distance tendencies (IDV score approx 1.21; PDI score approx -1.05), while ERNIE Bot shows collectivistic and higher-power-distance tendencies (IDV approx -0.89; PDI approx 0.76); differences are statistically significant (p 0.001). We further compute a Cultural Alignment Index (CAI) against Hofstede’s national scores and find GPT-4 aligns more closely with the USA (e.g., IDV CAI approx 0.91; PDI CAI approx 0.88) whereas ERNIE Bot aligns more closely with China (IDV CAI approx 0.85; PDI CAI approx 0.81). Qualitative analyses of dilemma resolution and authority-related judgments illustrate how these orientations surface in reasoning. Our results support the view that LLMs function as statistical mirrors of their cultural corpora and motivate culturally aware evaluation and deployment to avoid algorithmic cultural hegemony.
zh
[NLP-59] ZigzagAttention: Efficient Long-Context Inference with Exclusive Retrieval and Streaming Heads
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在处理长上下文时因KV缓存(Key-Value Cache)内存消耗过大而导致的部署难题。现有方法通过区分“检索头”(retrieval heads)和“流式头”(streaming heads),仅对后者不缓存KV以降低内存开销,但这种分层混合使用两类头的方式会因频繁访问和索引张量而引入额外延迟。该研究的关键改进在于设计了一种新的识别准则,确保每一层中的注意力头要么全部为检索头、要么全部为流式头,从而避免了因拆分注意力计算带来的额外延迟,同时保持性能损失可忽略。该方案命名为ZigzagAttention,在降低延迟的同时实现了与基线相当的性能表现。
链接: https://arxiv.org/abs/2508.12407
作者: Zhuorui Liu,Chen Zhang,Dawei Song
机构: Beijing Institute of Technology (北京理工大学)
类目: Computation and Language (cs.CL)
备注: 5 pages, 4 figures
Abstract:With the rapid development of large language models (LLMs), handling long context has become one of the vital abilities in LLMs. Such long-context ability is accompanied by difficulties in deployment, especially due to the increased consumption of KV cache. There is certain work aiming to optimize the memory footprint of KV cache, inspired by the observation that attention heads can be categorized into retrieval heads that are of great significance and streaming heads that are of less significance. Typically, identifying the streaming heads and and waiving the KV cache in the streaming heads would largely reduce the overhead without hurting the performance that much. However, since employing both retrieval and streaming heads in one layer decomposes one large round of attention computation into two small ones, it may unexpectedly bring extra latency on accessing and indexing tensors. Based on this intuition, we impose an important improvement to the identification process of retrieval and streaming heads, in which we design a criterion that enforces exclusively retrieval or streaming heads gathered in one unique layer. In this way, we further eliminate the extra latency and only incur negligible performance degradation. Our method named \textscZigzagAttention is competitive among considered baselines owing to reduced latency and comparable performance.
zh
[NLP-60] Extracting Post-Acute Sequelae of SARS-CoV-2 Infection Symptoms from Clinical Notes via Hybrid Natural Language Processing ALT
【速读】: 该论文旨在解决Post-Acute Sequelae of COVID-19 (PASC) 的准确与高效诊断难题,因其症状多样且在不同时间尺度上动态演变。解决方案的关键在于构建了一个混合自然语言处理(Natural Language Processing, NLP)流程,该流程结合了基于规则的命名实体识别(Named Entity Recognition, NER)模块与基于BERT的断言检测(Assertion Detection)模块,用于从临床笔记中提取PASC相关症状并判断其断言状态(如阳性或阴性)。研究团队还开发了一个由临床专家制定的综合性PASC词典,并在全美11个医疗系统中收集了共计47,654份进展记录用于模型验证与流行病学分析,最终实现了高精度(F1分数达0.76–0.82)和高效率(每条笔记平均处理时间2.448秒)的自动化症状识别与断言判定,为PASC的临床诊断提供了可靠的技术支持。
链接: https://arxiv.org/abs/2508.12405
作者: Zilong Bai,Zihan Xu,Cong Sun,Chengxi Zang,H. Timothy Bunnell,Catherine Sinfield,Jacqueline Rutter,Aaron Thomas Martinez,L. Charles Bailey,Mark Weiner,Thomas R. Campion,Thomas Carton,Christopher B. Forrest,Rainu Kaushal,Fei Wang,Yifan Peng
机构: Weill Cornell Medicine (威尔康奈尔医学院); Nemours Children’s Health (尼莫尔儿童健康中心); Children’s Hospital of Philadelphia (费城儿童医院); Louisiana Public Health Institute (路易斯安那州公共卫生研究所)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for publication in npj Health Systems
Abstract:Accurately and efficiently diagnosing Post-Acute Sequelae of COVID-19 (PASC) remains challenging due to its myriad symptoms that evolve over long- and variable-time intervals. To address this issue, we developed a hybrid natural language processing pipeline that integrates rule-based named entity recognition with BERT-based assertion detection modules for PASC-symptom extraction and assertion detection from clinical notes. We developed a comprehensive PASC lexicon with clinical specialists. From 11 health systems of the RECOVER initiative network across the U.S., we curated 160 intake progress notes for model development and evaluation, and collected 47,654 progress notes for a population-level prevalence study. We achieved an average F1 score of 0.82 in one-site internal validation and 0.76 in 10-site external validation for assertion detection. Our pipeline processed each note at 2.448\pm 0.812 seconds on average. Spearman correlation tests showed \rho 0.83 for positive mentions and \rho 0.72 for negative ones, both with P 0.0001 . These demonstrate the effectiveness and efficiency of our models and their potential for improving PASC diagnosis.
zh
[NLP-61] Where to Start Alignment? Diffusion Large Language Model May Demand a Distinct Position
【速读】: 该论文旨在解决扩散型大语言模型(Diffusion Large Language Models, dLLMs)在安全性方面的研究空白问题,尤其是针对其非自回归生成机制下安全对齐的挑战。解决方案的关键在于揭示了防御者与攻击者在安全博弈中的关键不对称性:防御者发现dLLMs输出中中间token(middle tokens)比初始token更影响整体安全性,而攻击者难以操控这些中间token,因为dLLMs在实际生成中表现出强烈的顺序依赖性,迫使攻击行为偏离对关键中间区域的影响。基于此不对称性,作者提出Middle-tOken Safety Alignment (MOSA) 方法,通过强化学习直接对齐模型中间生成阶段与安全拒绝策略,从而显著提升dLLMs的安全性能,并在多个基准测试和任务(如编码、数学推理)中验证了其有效性。
链接: https://arxiv.org/abs/2508.12398
作者: Zhixin Xie,Xurui Song,Jun Luo
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Diffusion Large Language Models (dLLMs) have recently emerged as a competitive non-autoregressive paradigm due to their unique training and inference approach. However, there is currently a lack of safety study on this novel architecture. In this paper, we present the first analysis of dLLMs’ safety performance and propose a novel safety alignment method tailored to their unique generation characteristics. Specifically, we identify a critical asymmetry between the defender and attacker in terms of security. For the defender, we reveal that the middle tokens of the response, rather than the initial ones, are more critical to the overall safety of dLLM outputs; this seems to suggest that aligning middle tokens can be more beneficial to the defender. The attacker, on the contrary, may have limited power to manipulate middle tokens, as we find dLLMs have a strong tendency towards a sequential generation order in practice, forcing the attack to meet this distribution and diverting it from influencing the critical middle tokens. Building on this asymmetry, we introduce Middle-tOken Safety Alignment (MOSA), a novel method that directly aligns the model’s middle generation with safe refusals exploiting reinforcement learning. We implement MOSA and compare its security performance against eight attack methods on two benchmarks. We also test the utility of MOSA-aligned dLLM on coding, math, and general reasoning. The results strongly prove the superiority of MOSA.
zh
[NLP-62] MedKGent: A Large Language Model Agent Framework for Constructing Temporally Evolving Medical Knowledge Graph
【速读】: 该论文旨在解决当前医学知识图谱(Knowledge Graph, KG)构建方法在处理海量、动态演化的生物医学文献时存在的局限性,包括依赖监督式流水线导致泛化能力弱,以及忽视知识的时间动态性和上下文不确定性等问题。其解决方案的关键在于提出 MedKGent——一个基于大语言模型(Large Language Model, LLM)的智能代理框架,通过细粒度日级时间序列模拟生物医学知识的演化过程,利用两个专用代理(Extractor Agent 和 Constructor Agent)协同工作:前者基于采样估计生成置信度评分以过滤低质量三元组,后者依据置信度和时间戳增量式整合三元组,从而实现对医学KG的动态更新与冲突消解。该方案显著提升了知识图谱的质量与实用性,在多个下游任务中展现出优越性能。
链接: https://arxiv.org/abs/2508.12393
作者: Duzhen Zhang,Zixiao Wang,Zhong-Zhi Li,Yahan Yu,Shuncheng Jia,Jiahua Dong,Haotian Xu,Xing Wu,Yingying Zhang,Tielin Zhang,Jie Yang,Xiuying Chen,Le Song
机构: MBZUAI(穆巴达拉科技大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid expansion of medical literature presents growing challenges for structuring and integrating domain knowledge at scale. Knowledge Graphs (KGs) offer a promising solution by enabling efficient retrieval, automated reasoning, and knowledge discovery. However, current KG construction methods often rely on supervised pipelines with limited generalizability or naively aggregate outputs from Large Language Models (LLMs), treating biomedical corpora as static and ignoring the temporal dynamics and contextual uncertainty of evolving knowledge. To address these limitations, we introduce MedKGent, a LLM agent framework for constructing temporally evolving medical KGs. Leveraging over 10 million PubMed abstracts published between 1975 and 2023, we simulate the emergence of biomedical knowledge via a fine-grained daily time series. MedKGent incrementally builds the KG in a day-by-day manner using two specialized agents powered by the Qwen2.5-32B-Instruct model. The Extractor Agent identifies knowledge triples and assigns confidence scores via sampling-based estimation, which are used to filter low-confidence extractions and inform downstream processing. The Constructor Agent incrementally integrates the retained triples into a temporally evolving graph, guided by confidence scores and timestamps to reinforce recurring knowledge and resolve conflicts. The resulting KG contains 156,275 entities and 2,971,384 relational triples. Quality assessments by two SOTA LLMs and three domain experts demonstrate an accuracy approaching 90%, with strong inter-rater agreement. To evaluate downstream utility, we conduct RAG across seven medical question answering benchmarks using five leading LLMs, consistently observing significant improvements over non-augmented baselines. Case studies further demonstrate the KG’s value in literature-based drug repurposing via confidence-aware causal inference.
zh
[NLP-63] ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models
【速读】: 该论文旨在解决小型语言模型(Small Language Models, SLMs)在复杂推理任务中表现不佳的问题,具体表现为:受限的模型容量导致多步推理时易产生错误或不一致的答案;现有方法虽能提升性能,但常以牺牲推理能力、自主性和泛化能力为代价——如偏置监督过滤负向推理路径限制从错误中学习、过度依赖外部推理信号削弱自主性、以及过拟合教师特定模式降低泛化性。其解决方案的核心是提出 ReaLM 框架,包含三个关键技术:(1) 多路径过程验证(Multi-Route Process Verification, MRPV),通过对比正负推理路径提取决定性模式以增强推理能力;(2) 渐进诱导自主性(Enabling Autonomy via Asymptotic Induction, EAAI),逐步弱化外部信号以提高模型自主性;(3) 引导式思维链蒸馏(guided chain-of-thought distillation),将领域特定规则和专家知识编码至模型参数中,从而提升泛化能力。
链接: https://arxiv.org/abs/2508.12387
作者: Yuanfeng Xu,Zehui Dai,Jian Liang,Jiapeng Guan,Guangrun Wang,Liang Lin,Xiaohui Lv
机构: 1. Tsinghua University (清华大学); 2. Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所); 3. Peking University (北京大学); 4. University of Science and Technology of China (中国科学技术大学)
类目: Computation and Language (cs.CL)
备注: 16pages, 3 figures
Abstract:Small Language Models (SLMs) are a cost-effective alternative to Large Language Models (LLMs), but often struggle with complex reasoning due to their limited capacity and a tendency to produce mistakes or inconsistent answers during multi-step reasoning. Existing efforts have improved SLM performance, but typically at the cost of one or more of three key aspects: (1) reasoning capability, due to biased supervision that filters out negative reasoning paths and limits learning from errors; (2) autonomy, due to over-reliance on externally generated reasoning signals; and (3) generalization, which suffers when models overfit to teacher-specific patterns. In this paper, we introduce ReaLM, a reinforcement learning framework for robust and self-sufficient reasoning in vertical domains. To enhance reasoning capability, we propose Multi-Route Process Verification (MRPV), which contrasts both positive and negative reasoning paths to extract decisive patterns. To reduce reliance on external guidance and improve autonomy, we introduce Enabling Autonomy via Asymptotic Induction (EAAI), a training strategy that gradually fades external signals. To improve generalization, we apply guided chain-of-thought distillation to encode domain-specific rules and expert knowledge into SLM parameters, making them part of what the model has learned. Extensive experiments on both vertical and general reasoning tasks demonstrate that ReaLM significantly improves SLM performance across aspects (1)-(3) above.
zh
[NLP-64] Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering
【速读】: 该论文旨在解决多答案问答(Multi-Answer Question Answering, MAQA)中存在冲突答案的现实挑战,即当一个问题存在多个有效但相互矛盾的答案时,现有模型难以准确识别所有合法答案并检测其中的冲突对。传统方法通常假设证据一致性,而忽略了MAQA中常见的冲突场景,且现有基准数据集多依赖合成数据或自动化标注,缺乏真实性和标注可靠性。解决方案的关键在于提出一种新颖、低成本的方法,利用已有的事实核查(fact-checking)数据集构建NATCONFQA这一新基准,该基准不仅包含真实世界中的冲突答案对,还为每一对答案提供细粒度的冲突标签,从而支持更严谨的模型评估与改进。实验表明,当前高阶大语言模型(Large Language Models, LLMs)在处理不同类型冲突时表现脆弱,且采用错误的策略进行冲突化解。
链接: https://arxiv.org/abs/2508.12355
作者: Eviatar Nachshoni,Arie Cattan,Shmuel Amar,Ori Shapira,Ido Dagan
机构: Bar-Ilan University (巴伊兰大学); Google Research (谷歌研究院); OriginAI
类目: Computation and Language (cs.CL)
备注: no comments
Abstract:Large Language Models (LLMs) have demonstrated strong performance in question answering (QA) tasks. However, Multi-Answer Question Answering (MAQA), where a question may have several valid answers, remains challenging. Traditional QA settings often assume consistency across evidences, but MAQA can involve conflicting answers. Constructing datasets that reflect such conflicts is costly and labor-intensive, while existing benchmarks often rely on synthetic data, restrict the task to yes/no questions, or apply unverified automated annotation. To advance research in this area, we extend the conflict-aware MAQA setting to require models not only to identify all valid answers, but also to detect specific conflicting answer pairs, if any. To support this task, we introduce a novel cost-effective methodology for leveraging fact-checking datasets to construct NATCONFQA, a new benchmark for realistic, conflict-aware MAQA, enriched with detailed conflict labels, for all answer pairs. We evaluate eight high-end LLMs on NATCONFQA, revealing their fragility in handling various types of conflicts and the flawed strategies they employ to resolve them.
zh
[NLP-65] CarelessWhisper: Turning Whisper into a Causal Streaming Model
【速读】: 该论文旨在解决当前基于Transformer的自动语音识别(ASR)模型(如OpenAI Whisper和NVIDIA Canary)在流式(online或实时)语音转录场景下的低延迟问题,因其原始架构和训练方式不适用于依赖未来上下文的非因果建模。解决方案的关键在于将原有的非因果编码器通过低秩适配(LoRA)微调转化为因果编码器,并结合弱对齐数据集联合优化编码器与解码器,从而实现低延迟(小于300毫秒)的流式推理。进一步提出了一种更新的推理机制,支持贪婪搜索和束搜索解码,并被证明具有局部最优性;实验表明该方法在保持较低计算复杂度的同时,在多数情况下优于现有非微调的流式ASR方案,且能更精确地提取词级时间戳。
链接: https://arxiv.org/abs/2508.12301
作者: Tomer Krichli,Bhiksha Raj,Joseph Keshet
机构: Technion–Israel Institute of Technology (以色列理工学院); Carnegie Mellon University (卡内基梅隆大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
备注: 17 pages, 7 Figures, This work has been submitted to the IEEE for possible publication
Abstract:Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming (online or real-time) transcription, due to limitations in their architecture and training methodology. We propose a method to turn the transformer encoder-decoder model into a low-latency streaming model that is careless about future context. We present an analysis explaining why it is not straightforward to convert an encoder-decoder transformer to a low-latency streaming model. Our proposed method modifies the existing (non-causal) encoder to a causal encoder by fine-tuning both the encoder and decoder using Low-Rank Adaptation (LoRA) and a weakly aligned dataset. We then propose an updated inference mechanism that utilizes the fine-tune causal encoder and decoder to yield greedy and beam-search decoding, and is shown to be locally optimal. Experiments on low-latency chunk sizes (less than 300 msec) show that our fine-tuned model outperforms existing non-fine-tuned streaming approaches in most cases, while using a lower complexity. Additionally, we observe that our training process yields better alignment, enabling a simple method for extracting word-level timestamps. We release our training and inference code, along with the fine-tuned models, to support further research and development in streaming ASR.
zh
[NLP-66] Incorporating Legal Logic into Deep Learning: An Intelligent Approach to Probation Prediction
【速读】: 该论文旨在解决当前智能司法辅助系统(Intelligent Judicial Assistant System, IJAS)在量刑预测中缺乏专门针对缓刑(probation)的预测方法,且对影响缓刑资格的法律要素分析不足的问题。现有研究多依赖数据驱动模型,忽视了司法决策中的法律逻辑。解决方案的关键在于提出一种融合法律逻辑的深度学习模型——多任务双理论缓刑预测模型(Multi-Task Dual-Theory Probation Prediction Model, MT-DT),该模型基于缓刑的法律要素(Probation Legal Elements, PLEs)和惩罚的“双轨理论”(Dual-Track Theory of Punishment),通过分阶段构建专用缓刑数据集与模型结构,在保证预测性能的同时增强司法解释的可解释性与合规性。
链接: https://arxiv.org/abs/2508.12286
作者: Qinghua Wang,Xu Zhang,Lingyan Yang,Rui Shao,Bonan Wang,Fang Wang,Cunquan Qu
机构: Shandong University (山东大学); University of Macau (澳门大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Probation is a crucial institution in modern criminal law, embodying the principles of fairness and justice while contributing to the harmonious development of society. Despite its importance, the current Intelligent Judicial Assistant System (IJAS) lacks dedicated methods for probation prediction, and research on the underlying factors influencing probation eligibility remains limited. In addition, probation eligibility requires a comprehensive analysis of both criminal circumstances and remorse. Much of the existing research in IJAS relies primarily on data-driven methodologies, which often overlooks the legal logic underpinning judicial decision-making. To address this gap, we propose a novel approach that integrates legal logic into deep learning models for probation prediction, implemented in three distinct stages. First, we construct a specialized probation dataset that includes fact descriptions and probation legal elements (PLEs). Second, we design a distinct probation prediction model named the Multi-Task Dual-Theory Probation Prediction Model (MT-DT), which is grounded in the legal logic of probation and the \textitDual-Track Theory of Punishment. Finally, our experiments on the probation dataset demonstrate that the MT-DT model outperforms baseline models, and an analysis of the underlying legal logic further validates the effectiveness of the proposed approach.
zh
[NLP-67] A Question Answering Dataset for Temporal-Sensitive Retrieval-Augmented Generation
【速读】: 该论文旨在解决当前中文问答系统在时间推理(temporal reasoning)能力评估上的不足,尤其是针对检索增强生成(Retrieval-Augmented Generation, RAG)系统在处理绝对、聚合与相对时间关系时的逻辑一致性与时空对齐问题。解决方案的关键在于构建ChronoQA——一个大规模、高质量的中文时间推理问答数据集,涵盖2019至2024年间超过30万篇新闻文章,包含5,176个结构化标注的问题,覆盖显式与隐式时间表达,并支持单文档与多文档场景。该数据集通过多阶段验证(规则引擎、大语言模型和人工评估)确保质量,从而为时间敏感型RAG系统的评测提供动态、可靠且可扩展的基准。
链接: https://arxiv.org/abs/2508.12282
作者: Ziyang Chen,Erxue Min,Xiang Zhao,Yunxin Li,Xin Jia,Jinzhi Liao,Jichao Li,Shuaiqiang Wang,Baotian Hu,Dawei Yin
机构: National University of Defense Technology (国防科技大学); Baidu Inc. (百度); Harbin Institute of Technology (深圳) (哈尔滨工业大学(深圳)
类目: Computation and Language (cs.CL); Information Retrieval (cs.IR)
备注: 10 pages, 5 figures
Abstract:We introduce ChronoQA, a large-scale benchmark dataset for Chinese question answering, specifically designed to evaluate temporal reasoning in Retrieval-Augmented Generation (RAG) systems. ChronoQA is constructed from over 300,000 news articles published between 2019 and 2024, and contains 5,176 high-quality questions covering absolute, aggregate, and relative temporal types with both explicit and implicit time expressions. The dataset supports both single- and multi-document scenarios, reflecting the real-world requirements for temporal alignment and logical consistency. ChronoQA features comprehensive structural annotations and has undergone multi-stage validation, including rule-based, LLM-based, and human evaluation, to ensure data quality. By providing a dynamic, reliable, and scalable resource, ChronoQA enables structured evaluation across a wide range of temporal tasks, and serves as a robust benchmark for advancing time-sensitive retrieval-augmented question answering systems.
zh
[NLP-68] LegalΔ: Enhancing Legal Reasoning in LLM s via Reinforcement Learning with Chain-of-Thought Guided Information Gain
【速读】: 该论文旨在解决当前法律大语言模型(Legal LLM)在司法决策自动化中推理可靠性与可解释性不足的问题,尤其是模型倾向于采用“快速思维”行为,直接输出答案而缺乏显式的多步推理过程,从而限制了其在复杂法律场景下的适用性。解决方案的关键在于提出一种名为 Legal Δ 的强化学习框架,通过链式思维(chain-of-thought)引导的信息增益最大化机制,在训练中采用双模式输入(直接答案与推理增强模式),促使模型习得具有实质意义的推理模式而非表面或冗余的解释;该框架分为两阶段:首先从强大的大推理模型(Large Reasoning Model, LRM)DeepSeek-R1中蒸馏隐含的推理能力,其次通过差异比较和多维奖励机制(兼顾结构连贯性与法律领域特异性)优化推理质量,最终显著提升法律判断的准确性和可信度,且无需依赖标注偏好数据。
链接: https://arxiv.org/abs/2508.12281
作者: Xin Dai,Buqiang Xu,Zhenghao Liu,Yukun Yan,Huiyuan Xie,Xiaoyuan Yi,Shuo Wang,Ge Yu
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Legal Artificial Intelligence (LegalAI) has achieved notable advances in automating judicial decision-making with the support of Large Language Models (LLMs). However, existing legal LLMs still struggle to generate reliable and interpretable reasoning processes. They often default to fast-thinking behavior by producing direct answers without explicit multi-step reasoning, limiting their effectiveness in complex legal scenarios that demand rigorous justification. To address this challenge, we propose Legal \Delta , a reinforcement learning framework designed to enhance legal reasoning through chain-of-thought guided information gain. During training, Legal \Delta employs a dual-mode input setup-comprising direct answer and reasoning-augmented modes-and maximizes the information gain between them. This encourages the model to acquire meaningful reasoning patterns rather than generating superficial or redundant explanations. Legal \Delta follows a two-stage approach: (1) distilling latent reasoning capabilities from a powerful Large Reasoning Model (LRM), DeepSeek-R1, and (2) refining reasoning quality via differential comparisons, combined with a multidimensional reward mechanism that assesses both structural coherence and legal-domain specificity. Experimental results on multiple legal reasoning tasks demonstrate that Legal \Delta outperforms strong baselines in both accuracy and interpretability. It consistently produces more robust and trustworthy legal judgments without relying on labeled preference data. All code and data will be released at this https URL.
zh
[NLP-69] he Self-Execution Benchmark: Measuring LLM s Attempts to Overcome Their Lack of Self-Execution
【速读】: 该论文试图解决的问题是:当前大型语言模型(Large Language Models, LLMs)在评估时主要关注其知识掌握或推理能力,但缺乏对其自身输出行为的自我预测能力的考察。为填补这一空白,作者提出了一种新的评估范式——Self-Execution Benchmark,其关键在于通过设计任务来衡量LLM能否准确预判自身输出的属性,如问题难度、是否拒绝回答或可能产生的关联内容。该方案揭示了模型在自我认知方面的系统性不足,表明即使模型规模扩大或能力增强,也未必能提升其对自身行为的预测准确性,从而揭示了LLMs在内部表征与元认知推理上的根本局限。
链接: https://arxiv.org/abs/2508.12277
作者: Elon Ezra,Ariel Weizman,Amos Azaria
机构: Ariel University ( Ariel 大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: 11 pages, 9 figures
Abstract:Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. In this paper, we explore a different type of evaluation: whether an LLM can predict aspects of its own responses. Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model’s ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance. These results suggest a fundamental limitation in how LLMs represent and reason about their own behavior.
zh
[NLP-70] Fast Slow and Tool-augmented Thinking for LLM s: A Review
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在真实世界任务中缺乏适应性推理能力的问题,即如何根据问题特性动态调整推理策略,从快速直觉响应到细致的逐步推理,再到工具增强的思维过程。其解决方案的关键在于提出了一种基于认知心理学启发的新颖分类法,将LLM的推理策略划分为两个维度:快/慢边界(区分直觉式与反思式过程)和内/外边界(区分仅依赖模型参数的内部推理与引入外部工具的外部增强推理),并据此系统梳理了近期自适应推理方法的研究进展,为构建更灵活、高效且可靠的LLM提供了理论框架与分类依据。
链接: https://arxiv.org/abs/2508.12265
作者: Xinda Jia,Jinpeng Li,Zezhong Wang,Jingjing Li,Xingshan Zeng,Yasheng Wang,Weinan Zhang,Yong Yu,Weiwen Liu
机构: Shanghai Jiao Tong University (上海交通大学); Huawei Noah’s Ark Lab; The Chinese University of Hong Kong (香港中文大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:Large Language Models (LLMs) have demonstrated remarkable progress in reasoning across diverse domains. However, effective reasoning in real-world tasks requires adapting the reasoning strategy to the demands of the problem, ranging from fast, intuitive responses to deliberate, step-by-step reasoning and tool-augmented thinking. Drawing inspiration from cognitive psychology, we propose a novel taxonomy of LLM reasoning strategies along two knowledge boundaries: a fast/slow boundary separating intuitive from deliberative processes, and an internal/external boundary distinguishing reasoning grounded in the model’s parameters from reasoning augmented by external tools. We systematically survey recent work on adaptive reasoning in LLMs and categorize methods based on key decision factors. We conclude by highlighting open challenges and future directions toward more adaptive, efficient, and reliable LLMs.
zh
[NLP-71] Structuring the Unstructured: A Systematic Review of Text-to-Structure Generation for Agent ic AI with a Universal Evaluation Framework
【速读】: 该论文旨在解决当前生成式 AI(Generative AI)系统在向代理化(agentic)操作和上下文感知检索演进过程中,如何高效地将非结构化文本转化为表格、知识图谱和图表等结构化格式这一关键问题。其解决方案的关键在于提出一个通用的评估框架,用于标准化结构化输出的质量评价,并通过系统性综述梳理现有文本转结构化数据的技术方法、数据集与评估指标,从而为下一代 AI 系统奠定结构化信息处理的基础架构。
链接: https://arxiv.org/abs/2508.12257
作者: Zheye Deng,Chunkit Chan,Tianshi Zheng,Wei Fan,Weiqi Wang,Yangqiu Song
机构: HKUST(香港科技大学)
类目: Computation and Language (cs.CL)
备注: Under Review
Abstract:The evolution of AI systems toward agentic operation and context-aware retrieval necessitates transforming unstructured text into structured formats like tables, knowledge graphs, and charts. While such conversions enable critical applications from summarization to data mining, current research lacks a comprehensive synthesis of methodologies, datasets, and metrics. This systematic review examines text-to-structure techniques and the encountered challenges, evaluates current datasets and assessment criteria, and outlines potential directions for future research. We also introduce a universal evaluation framework for structured outputs, establishing text-to-structure as foundational infrastructure for next-generation AI systems.
zh
[NLP-72] What do Speech Foundation Models Learn? Analysis and Applications
【速读】: 该论文旨在解决语音基础模型(Speech Foundation Models, SFMs)中所编码的声学与语言知识尚不清晰的问题,以及其在更深层次的口语理解(Spoken Language Understanding, SLU)任务中的有效性尚未充分验证的问题。解决方案的关键在于:首先构建一个轻量级的分析框架,利用统计工具和无需训练的任务来系统性地解析SFMs各层所蕴含的知识;其次,为SLU评估基准新增了口语命名实体识别(Spoken Named Entity Recognition, NER)和命名实体定位(Named Entity Localization, NEL)任务,并提出基于SFMs的端到端(End-to-End, E2E)建模方法,证明其优于传统级联式(Cascaded)语音识别后接文本模型的范式,从而为未来模型设计与应用提供可量化、可比较的工具与数据集支持。
链接: https://arxiv.org/abs/2508.12255
作者: Ankita Pasad
机构: 未知
类目: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)
备注: Ph.D. Thesis
Abstract:Speech foundation models (SFMs) are designed to serve as general-purpose representations for a wide range of speech-processing tasks. The last five years have seen an influx of increasingly successful self-supervised and supervised pre-trained models with impressive performance on various downstream tasks. Although the zoo of SFMs continues to grow, our understanding of the knowledge they acquire lags behind. This thesis presents a lightweight analysis framework using statistical tools and training-free tasks to investigate the acoustic and linguistic knowledge encoded in SFM layers. We conduct a comparative study across multiple SFMs and statistical tools. Our study also shows that the analytical insights have concrete implications for downstream task performance. The effectiveness of an SFM is ultimately determined by its performance on speech applications. Yet it remains unclear whether the benefits extend to spoken language understanding (SLU) tasks that require a deeper understanding than widely studied ones, such as speech recognition. The limited exploration of SLU is primarily due to a lack of relevant datasets. To alleviate that, this thesis contributes tasks, specifically spoken named entity recognition (NER) and named entity localization (NEL), to the Spoken Language Understanding Evaluation benchmark. We develop SFM-based approaches for NER and NEL, and find that end-to-end (E2E) models leveraging SFMs can surpass traditional cascaded (speech recognition followed by a text model) approaches. Further, we evaluate E2E SLU models across SFMs and adaptation strategies to assess the impact on task performance. Collectively, this thesis tackles previously unanswered questions about SFMs, providing tools and datasets to further our understanding and to enable the community to make informed design choices for future model development and adoption. Comments: Ph.D. Thesis Subjects: Computation and Language (cs.CL); Audio and Speech Processing (eess.AS) Cite as: arXiv:2508.12255 [cs.CL] (or arXiv:2508.12255v1 [cs.CL] for this version) https://doi.org/10.48550/arXiv.2508.12255 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Submission history From: Ankita Pasad [view email] [v1] Sun, 17 Aug 2025 06:31:34 UTC (17,899 KB)
zh
[NLP-73] SEA-BED: Southeast Asia Embedding Benchmark
【速读】: 该论文旨在解决东南亚(Southeast Asia, SEA)地区缺乏高质量、本地化语料的句子嵌入(sentence embeddings)评估基准的问题。当前多语言基准如MMTEB虽覆盖广泛,但SEA地区数据稀缺且多为机器翻译,未能体现本地语言特性,导致模型在低资源语言(如缅甸语)上的性能评估不准确。解决方案的关键在于提出SEA-BED——首个大规模面向东南亚的句子嵌入基准,包含169个数据集、9类任务和10种语言,其中71%由人工构建而非机器生成或翻译,从而更真实地反映本地语言结构与语义特性。通过系统评估17种嵌入模型,研究揭示了SEA语言的独特挑战、跨语言性能差异及人工标注数据对低资源语言评估的重要性,为区域适配的生成式AI(Generative AI)模型优化提供了关键实证依据。
链接: https://arxiv.org/abs/2508.12243
作者: Wuttikorn Ponwitayarat,Raymond Ng,Jann Railey Montalan,Thura Aung,Jian Gang Ngui,Yosephine Susanto,William Tjhi,Panuthep Tasawong,Erik Cambria,Ekapol Chuangsuwanich,Sarana Nutanong,Peerat Limkonchotiwat
机构: Vidyasirimedhi Institute of Science and Technology (维迪亚西里梅迪学院科学技术研究所); AI Singapore (人工智能新加坡); Nanyang Technological University (南洋理工大学); Department of Computer Engineering, Faculty of Engineering, Chulalongkorn University (朱拉隆功大学工程学院计算机工程系); King Mongkut’s Institute of Technology Ladkrabang (拉贾曼尼技术学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Sentence embeddings are essential for NLP tasks such as semantic search, re-ranking, and textual similarity. Although multilingual benchmarks like MMTEB broaden coverage, Southeast Asia (SEA) datasets are scarce and often machine-translated, missing native linguistic properties. With nearly 700 million speakers, the SEA region lacks a region-specific embedding benchmark. We introduce SEA-BED, the first large-scale SEA embedding benchmark with 169 datasets across 9 tasks and 10 languages, where 71% are formulated by humans, not machine generation or translation. We address three research questions: (1) which SEA languages and tasks are challenging, (2) whether SEA languages show unique performance gaps globally, and (3) how human vs. machine translations affect evaluation. We evaluate 17 embedding models across six studies, analyzing task and language challenges, cross-benchmark comparisons, and translation trade-offs. Results show sharp ranking shifts, inconsistent model performance among SEA languages, and the importance of human-curated datasets for low-resource languages like Burmese.
zh
[NLP-74] Arabic Multimodal Machine Learning: Datasets Applications Approaches and Challenges
【速读】: 该论文旨在解决阿拉伯语多模态机器学习(Arabic Multimodal Machine Learning, Arabic MML)领域缺乏系统性综述的问题,以推动该方向的深入研究与发展。其解决方案的关键在于提出一个新颖的分类体系(taxonomy),将现有研究工作系统地划分为四大核心主题:数据集(datasets)、应用(applications)、方法(approaches)和挑战(challenges),从而为学术界提供结构化的现状分析,并明确未被充分探索的研究空白与关键瓶颈,助力后续研究者聚焦机遇、攻克难题,进一步提升阿拉伯语多模态学习的能力与实用性。
链接: https://arxiv.org/abs/2508.12227
作者: Abdelhamid Haouhat,Slimane Bellaouar,Attia Nehar,Hadda Cherroun,Ahmed Abdelali
机构: Université Amar Telidji (阿马尔·特利吉大学); Université de Ghardaia (加尔达亚大学); Ziane Achour University (齐安·阿肖尔大学); National Center for AI, SDAIAR (人工智能国家中心,SDAIAR)
类目: Computation and Language (cs.CL)
备注:
Abstract:Multimodal Machine Learning (MML) aims to integrate and analyze information from diverse modalities, such as text, audio, and visuals, enabling machines to address complex tasks like sentiment analysis, emotion recognition, and multimedia retrieval. Recently, Arabic MML has reached a certain level of maturity in its foundational development, making it time to conduct a comprehensive survey. This paper explores Arabic MML by categorizing efforts through a novel taxonomy and analyzing existing research. Our taxonomy organizes these efforts into four key topics: datasets, applications, approaches, and challenges. By providing a structured overview, this survey offers insights into the current state of Arabic MML, highlighting areas that have not been investigated and critical research gaps. Researchers will be empowered to build upon the identified opportunities and address challenges to advance the field.
zh
[NLP-75] LLM -as-a-Judge for Privacy Evaluation? Exploring the Alignment of Human and LLM Perceptions of Privacy in Textual Data CCS2025
【速读】: 该论文旨在解决隐私保护自然语言处理(Privacy-Preserving Natural Language Processing, PPNLP)领域中隐私评估准确性不足的核心问题,即如何客观、可靠地衡量文本数据的隐私敏感性。由于隐私具有主观性和难以定义的特点,传统方法难以达成一致的评估标准。论文提出的解决方案关键在于引入“大语言模型作为裁判”(LLM-as-a-Judge)范式,利用大规模语言模型(Large Language Models, LLMs)对文本隐私敏感性进行自动评估,并通过对比13个LLM与677名人类参与者在10个数据集上的判断结果,验证了LLM能够有效建模全局人类隐私认知,从而为隐私评估提供一种可扩展、高一致性且具备可解释性的技术路径。
链接: https://arxiv.org/abs/2508.12158
作者: Stephen Meisenbacher,Alexandra Klymenko,Florian Matthes
机构: Technical University of Munich (慕尼黑工业大学)
类目: Computation and Language (cs.CL)
备注: 13 pages, 3 figures, 4 tables. Accepted to HAIPS @ CCS 2025
Abstract:Despite advances in the field of privacy-preserving Natural Language Processing (NLP), a significant challenge remains the accurate evaluation of privacy. As a potential solution, using LLMs as a privacy evaluator presents a promising approach \unicodex2013 a strategy inspired by its success in other subfields of NLP. In particular, the so-called \textitLLM-as-a-Judge paradigm has achieved impressive results on a variety of natural language evaluation tasks, demonstrating high agreement rates with human annotators. Recognizing that privacy is both subjective and difficult to define, we investigate whether LLM-as-a-Judge can also be leveraged to evaluate the privacy sensitivity of textual data. Furthermore, we measure how closely LLM evaluations align with human perceptions of privacy in text. Resulting from a study involving 10 datasets, 13 LLMs, and 677 human survey participants, we confirm that privacy is indeed a difficult concept to measure empirically, exhibited by generally low inter-human agreement rates. Nevertheless, we find that LLMs can accurately model a global human privacy perspective, and through an analysis of human and LLM reasoning patterns, we discuss the merits and limitations of LLM-as-a-Judge for privacy evaluation in textual data. Our findings pave the way for exploring the feasibility of LLMs as privacy evaluators, addressing a core challenge in solving pressing privacy issues with innovative technical solutions.
zh
[NLP-76] Exploring Efficiency Frontiers of Thinking Budget in Medical Reasoning : Scaling Laws between Computational Resources and Reasoning Quality
【速读】: 该论文旨在解决医学推理任务中计算资源分配与推理质量之间关系不明确的问题,特别是如何通过控制“思考预算”(thinking budget)来优化医疗人工智能系统的性能与效率。其解决方案的关键在于系统性地评估不同规模模型(Qwen3 和 DeepSeek-R1)在多种医学数据集上的表现,发现准确率提升遵循对数尺度规律,并识别出三种效率区间:高效率(0–256 tokens)、平衡型(256–512 tokens)和高精度(>512 tokens),从而为临床场景下动态调整资源提供依据;同时揭示小模型在扩展思考预算时获得相对更大的性能增益,表明思考预算机制可作为提升容量受限模型表现的有效手段,且该机制具有跨架构通用性,具备实际部署价值。
链接: https://arxiv.org/abs/2508.12140
作者: Ziqian Bi,Lu Chen,Junhao Song,Hongying Luo,Enze Ge,Junmin Huang,Tianyang Wang,Keyu Chen,Chia Xin Liang,Zihan Wei,Huafeng Liu,Chunjie Tian,Jibin Guan,Joe Yeong,Yongzhi Xu,Peng Wang,Junfeng Hao
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships where accuracy improvements follow a predictable pattern with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens) suitable for real-time applications, balanced (256 to 512 tokens) offering optimal cost-performance tradeoffs for routine clinical support, and high-accuracy (above 512 tokens) justified only for critical diagnostic tasks. Notably, smaller models demonstrate disproportionately larger benefits from extended thinking, with 15 to 20% improvements compared to 5 to 10% for larger models, suggesting a complementary relationship where thinking budget provides greater relative benefits for capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning processes than cardiovascular or respiratory medicine. The consistency between Qwen3 native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
zh
[NLP-77] DynamixSFT: Dynamic Mixture Optimization of Instruction Tuning Collections
【速读】: 该论文旨在解决后训练阶段中大量指令微调数据集混合比例动态平衡与优化的问题,以提升模型性能并保持数据多样性。解决方案的关键在于将问题建模为多臂老虎机(multi-armed bandit)框架,并引入先验加权的玻尔兹曼探索策略(Prior-scaled Boltzmann Exploration),该策略通过软约束方式使更新后的采样分布贴近原始数据集比例,从而保留数据集合的固有覆盖范围和多样性;同时采用轻量级一步前瞻奖励(1-Step Look-ahead Reward)来实时评估各数据集对当前模型性能的边际贡献,并据此动态调整采样概率,实现高效自适应优化。
链接: https://arxiv.org/abs/2508.12116
作者: Haebin Shin,Lei Ji,Xiao Liu,Zhiwei Yu,Qi Chen,Yeyun Gong
机构: Microsoft Research (微软研究院)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:As numerous instruction-tuning datasets continue to emerge during the post-training stage, dynamically balancing and optimizing their mixtures has become a critical challenge. To address this, we propose DynamixSFT, a dynamic and automated method for instruction-tuning dataset mixture optimization. We formulate the problem as a multi-armed bandit setup and introduce a Prior-scaled Boltzmann Exploration that softly anchors the updated sampling distribution to the original dataset proportions, thereby preserving the inherent diversity and coverage of the collection. Sampling probabilities are updated using a lightweight 1-Step Look-ahead Reward, reflecting how much the dataset contributes to improving the model’s performance at its current state. When applied to the Tulu-v2-mixture collection comprising 16 instruction-tuning datasets, DynamixSFT achieves up to a 2.2% performance improvement across 10 benchmarks. Furthermore, we provide a comprehensive analysis and visualizations to offer deeper insights into the adaptive dynamics of our method.
zh
[NLP-78] Generative Medical Event Models Improve with Scale
【速读】: 该论文旨在解决个性化医疗(Personalized Medicine)规模化落地的难题,核心挑战在于如何从海量纵向患者健康轨迹(即一系列医疗事件序列)中提取可泛化、可扩展的临床洞察。解决方案的关键在于构建一个基于大规模医疗事件数据预训练的生成式基础模型——CoMET(Cosmos Medical Event Transformer),其通过在118 million患者、115 billion离散医疗事件(151 billion tokens)上进行自回归建模,学习到复杂的临床动态规律,并实现无需任务特定微调即可在78项真实世界任务(如诊断预测、疾病预后和医疗运营优化)中达到或超越现有监督模型的性能表现。这一成果验证了生成式AI(Generative AI)在医学领域的强大潜力,为临床决策支持与医疗系统效率提升提供了通用且可扩展的框架。
链接: https://arxiv.org/abs/2508.12104
作者: Shane Waxler,Paul Blazek,Davis White,Daniel Sneider,Kevin Chung,Mani Nagarathnam,Patrick Williams,Hank Voeller,Karen Wong,Matthew Swanhorst,Sheng Zhang,Naoto Usuyama,Cliff Wong,Tristan Naumann,Hoifung Poon,Andrew Loza,Daniella Meeker,Seth Hain,Rahul Shah
机构: Epic Systems(埃普ic系统); Microsoft Research(微软研究院); Yale School of Medicine(耶鲁医学院); Cosmos Governing Council(宇宙治理委员会)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Cosmos Medical Event Transformer ( CoMET) models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study for medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Based on this, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient’s real-world history, CoMET autoregressively generates the next medical event, simulating patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, CoMET generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. CoMET’s predictive power consistently improves as the model and pretraining scale. Our results show that CoMET, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
zh
[NLP-79] STEM: Efficient Relative Capability Evaluation of LLM s through Structured Transition Samples AAAI2026
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)评估过程中存在的两大核心问题:一是标准基准测试得分提升与真实推理能力增强之间不一致的现象;二是由于公开基准广泛过拟合及全量评估计算成本高昂,导致难以有效区分模型间细微的能力差异。解决方案的关键在于提出一种轻量且可解释的评估框架——结构化过渡评估方法(Structured Transition Evaluation Method, STEM),其通过分析同架构不同参数规模模型在多个基准上的性能一致性变化,识别出显著过渡样本(Significant Transition Samples, STS),从而高效估计未知模型的能力位置。该方法具备细粒度、架构无关的特性,并已在Qwen3模型族上验证其对多种代表性任务的有效性,能够可靠捕捉性能趋势并匹配真实模型能力排序。
链接: https://arxiv.org/abs/2508.12096
作者: Haiquan Hu,Jiazhi Jiang,Shiyou Xu,Ruhan Zeng,Tian Wang
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Submit to AAAI 2026
Abstract:Evaluating large language models (LLMs) has become increasingly challenging as model capabilities advance rapidly. While recent models often achieve higher scores on standard benchmarks, these improvements do not consistently reflect enhanced real-world reasoning capabilities. Moreover, widespread overfitting to public benchmarks and the high computational cost of full evaluations have made it both expensive and less effective to distinguish meaningful differences between models. To address these challenges, we propose the \textbfStructured \textbfTransition \textbfEvaluation \textbfMethod (STEM), a lightweight and interpretable evaluation framework for efficiently estimating the relative capabilities of LLMs. STEM identifies \textitsignificant transition samples (STS) by analyzing consistent performance transitions among LLMs of the same architecture but varying parameter scales. These samples enable STEM to effectively estimate the capability position of an unknown model. Qwen3 model family is applied to construct the STS pool on six diverse and representative benchmarks. To assess generalizability. Experimental results indicate that STEM reliably captures performance trends, aligns with ground-truth rankings of model capability. These findings highlight STEM as a practical and scalable method for fine-grained, architecture-agnostic evaluation of LLMs.
zh
[NLP-80] J6: Jacobian-Driven Role Attribution for Multi-Objective Prompt Optimization in LLM s
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)适应过程中多目标优化的冲突问题,尤其是如何在提升事实准确性(factuality)与增强置信度(通过低熵实现)之间取得平衡,同时应对提示参数(如隐藏层插入 $ h $ 和嵌入修改 $ w $)之间非线性交互带来的挑战。现有方法通常依赖标量梯度聚合,忽视了目标与参数间的几何结构。其解决方案的关键在于提出J6方法——一种基于雅可比矩阵(Jacobian)的结构化分解策略,将梯度交互矩阵分解为六个可解释的组成部分,从而支持硬决策(如通过argmax选择主导更新方向)和软策略(如通过softmax进行注意力加权),形成能够动态响应局部冲突与协同效应的更新框架,同时提供参数归因、任务干扰分析及几何对齐适配的洞察力。
链接: https://arxiv.org/abs/2508.12086
作者: Yao Wu
机构: Westlake University (西湖大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 9 pages, 3 tables, 1 algorithm
Abstract:In large language model (LLM) adaptation, balancing multiple optimization objectives such as improving factuality (heat) and increasing confidence (via low entropy) poses a fundamental challenge, especially when prompt parameters (e.g., hidden-layer insertions h and embedding modifications w) interact in non-trivial ways. Existing multi-objective optimization strategies often rely on scalar gradient aggregation, ignoring the deeper geometric structure between objectives and parameters. We propose J6, a structured Jacobian-based method that decomposes the gradient interaction matrix into six interpretable components. This decomposition enables both hard decision-making (e.g., choosing the dominant update direction via argmax) and soft strategies (e.g., attention-style weighting via softmax over J6), forming a dynamic update framework that adapts to local conflict and synergy. Moreover, the interpretable structure of J6 provides insight into parameter attribution, task interference, and geometry-aligned adaptation. Our work introduces a principled and extensible mechanism for conflict-aware prompt optimization, and opens a new avenue for incorporating structured Jacobian reasoning into multi-objective neural tuning.
zh
[NLP-81] VimoRAG : Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models
【速读】: 该论文旨在解决运动大语言模型(motion LLMs)在面对域外(out-of-domain)或词汇外(out-of-vocabulary)任务时因标注数据有限而导致的性能瓶颈问题。为提升3D动作生成质量,作者提出VimoRAG框架,其核心解决方案在于利用大规模野生视频数据库检索相关2D人体动作信号以增强生成能力。关键技术突破包括:(1) 设计了基于运动中心的视频检索机制——Gemini Motion Video Retriever,能够有效区分人体姿态与动作;(2) 提出运动中心双对齐DPO训练策略(Motion-centric Dual-alignment DPO Trainer),缓解因检索结果不佳引发的误差传播问题,从而显著提升仅依赖文本输入的运动LLMs性能。
链接: https://arxiv.org/abs/2508.12081
作者: Haidong Xu,Guangwei Xu,Zhedong Zheng,Xiatian Zhu,Wei Ji,Xiangtai Li,Ruijie Guo,Meishan Zhang,Min zhang,Hao Fei
机构: Harbin Institute of Technology (Shenzhen); University of Macau; University of Surrey; Nanjing University; ByteDance; National University of Singapore
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 20 pages,13 figures
Abstract:This paper introduces VimoRAG, a novel video-based retrieval-augmented motion generation framework for motion large language models (LLMs). As motion LLMs face severe out-of-domain/out-of-vocabulary issues due to limited annotated data, VimoRAG leverages large-scale in-the-wild video databases to enhance 3D motion generation by retrieving relevant 2D human motion signals. While video-based motion RAG is nontrivial, we address two key bottlenecks: (1) developing an effective motion-centered video retrieval model that distinguishes human poses and actions, and (2) mitigating the issue of error propagation caused by suboptimal retrieval results. We design the Gemini Motion Video Retriever mechanism and the Motion-centric Dual-alignment DPO Trainer, enabling effective retrieval and generation processes. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
zh
[NLP-82] Mitigating Jailbreaks with Intent-Aware LLM s
【速读】: 该论文旨在解决大型语言模型(Large Language Models, LLMs)在经过安全调优后仍易受对抗性指令攻击的问题,这类攻击通过精心设计的指令绕过安全机制,暴露出安全性与任务性能之间的权衡困境。解决方案的关键在于提出一种名为Intent-FT的轻量级微调方法,其核心思想是让模型在生成回复前显式地推断指令的潜在意图(underlying intent),并通过针对特定对抗指令的微调,使模型能够将意图识别能力泛化至未见过的攻击场景,从而显著提升鲁棒性。实证结果表明,该方法在多种参数和非参数攻击下均有效,且不牺牲模型通用能力,同时减少对良性指令的过度拒绝。
链接: https://arxiv.org/abs/2508.12072
作者: Wei Jie Yeo,Ranjan Satapathy,Erik Cambria
机构: Nanyang Technological University (南洋理工大学); Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A∗\textasteriskcentered STAR) (新加坡科技研究局)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL)
备注:
Abstract:Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. In this work, we propose Intent-FT, a simple and lightweight fine-tuning approach that explicitly trains LLMs to infer the underlying intent of an instruction before responding. By fine-tuning on a targeted set of adversarial instructions, Intent-FT enables LLMs to generalize intent deduction to unseen attacks, thereby substantially improving their robustness. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Empirically, Intent-FT consistently mitigates all evaluated attack categories, with no single attack exceeding a 50% success rate – whereas existing defenses remain only partially effective. Importantly, our method preserves the model’s general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. Furthermore, models trained with Intent-FT accurately identify hidden harmful intent in adversarial attacks, and these learned intentions can be effectively transferred to enhance vanilla model defenses.
zh
[NLP-83] Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在生成文本过程中缺乏自知之明、常对错误预测赋予过高置信度的问题,从而影响输出结果的可信度与可靠性。现有方法普遍采用粗粒度的置信度评分机制,无法提供生成过程中的细粒度连续置信度估计。其解决方案的关键在于提出 FineCE 方法:首先构建一个能有效捕捉 LLM 响应概率分布的训练数据构造流程,并训练一个监督模型以预测任意文本序列的置信度;其次引入后向置信度整合(Backward Confidence Integration, BCI)策略,在推理阶段利用后续文本信息增强当前序列的置信度估计精度;同时设计三种定位策略以识别生成过程中最优的置信度评估位置。实验表明,FineCE 在多个基准数据集上显著优于传统方法,实现了更准确、细粒度的置信度估计。
链接: https://arxiv.org/abs/2508.12040
作者: Jinyi Han,Tingyun Li,Shisong Chen,Jie Shi,Xinyi Wang,Guanglei Yue,Jiaqing Liang,Xin Lin,Liqian Wen,Zulong Chen,Yanghua Xiao
机构: Shanghai Institute of Artificial Intelligence for Education, East China Normal University (华东师范大学人工智能教育研究院); School of Data Science, Fudan University (复旦大学数据科学学院); College of Computer Science and Artificial Intelligence, Fudan University (复旦大学计算机科学技术与人工智能学院); Alibaba (阿里巴巴)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: The initial versin was made in August 2024
Abstract:While large language models (LLMs) have demonstrated remarkable performance across diverse tasks, they fundamentally lack self-awareness and frequently exhibit overconfidence, assigning high confidence scores to incorrect predictions. Accurate confidence estimation is therefore critical for enhancing the trustworthiness and reliability of LLM-generated outputs. However, existing approaches suffer from coarse-grained scoring mechanisms that fail to provide fine-grained, continuous confidence estimates throughout the generation process. To address these limitations, we introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Specifically, we first develop a comprehensive pipeline for constructing training data that effectively captures the underlying probabilistic distribution of LLM responses, and then train a model to predict confidence scores for arbitrary text sequences in a supervised manner. Furthermore, we propose a Backward Confidence Integration (BCI) strategy that leverages information from the subsequent text to enhance confidence estimation for the current sequence during inference. We also introduce three strategies for identifying optimal positions to perform confidence estimation within the generation process. Extensive experiments on multiple benchmark datasets demonstrate that FineCE consistently outperforms existing classical confidence estimation methods. Our code and all baselines used in the paper are available on GitHub.
zh
[NLP-84] Learning Wisdom from Errors: Promoting LLM s Continual Relation Learning through Exploiting Error Cases
【速读】: 该论文旨在解决持续关系抽取(Continual Relation Extraction, CRE)中因灾难性遗忘(catastrophic forgetting)导致模型难以有效学习新关系的问题。现有方法主要依赖记忆回放和对比学习来缓解遗忘,但忽视了错误样本所揭示的模型认知偏差这一关键信息。其解决方案的关键在于提出一种基于指令的持续对比微调方法(instruction-based continual contrastive tuning),通过将每个任务的训练数据和记忆数据依据初始预测正确性分为两部分,并采用双任务微调策略差异化处理;同时利用大语言模型(Large Language Models, LLMs)的指令遵循能力,设计了一种以指令微调方式引导旧数据纠正当前认知偏差的新颖对比策略,从而更适配LLMs特性地缩小新旧关系之间的语义鸿沟。
链接: https://arxiv.org/abs/2508.12031
作者: Shaozhe Yin,Jinyu Guo,Kai Shuang,Xia Liu,Ruize Ou
机构: Beijing University of Posts and Telecommunications (北京邮电大学); University of Electronic Science and Technology of China (电子科技大学); China National Institute of Standardization (中国国家标准研究院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Continual Relation Extraction (CRE) aims to continually learn new emerging relations while avoiding catastrophic forgetting. Existing CRE methods mainly use memory replay and contrastive learning to mitigate catastrophic forgetting. However, these methods do not attach importance to the error cases that can reveal the model’s cognitive biases more effectively. To address this issue, we propose an instruction-based continual contrastive tuning approach for Large Language Models (LLMs) in CRE. Different from existing CRE methods that typically handle the training and memory data in a unified manner, this approach splits the training and memory data of each task into two parts respectively based on the correctness of the initial responses and treats them differently through dual-task fine-tuning. In addition, leveraging the advantages of LLM’s instruction-following ability, we propose a novel instruction-based contrastive tuning strategy for LLM to continuously correct current cognitive biases with the guidance of previous data in an instruction-tuning manner, which mitigates the gap between old and new relations in a more suitable way for LLMs. We experimentally evaluate our model on TACRED and FewRel, and the results show that our model achieves new state-of-the-art CRE performance with significant improvements, demonstrating the importance of specializing in exploiting error cases.
zh
[NLP-85] CHBench: A Cognitive Hierarchy Benchmark for Evaluating Strategic Reasoning Capability of LLM s
【速读】: 该论文旨在解决当前大语言模型(Large Language Models, LLMs)战略推理能力评估中依赖效用性能指标的局限性问题,这类指标因对手行为和博弈结构的差异而缺乏鲁棒性。为应对这一挑战,作者提出了一种名为认知层级基准(Cognitive Hierarchy Benchmark, CHBench)的新评估框架,其核心思想源于行为经济学中的认知层级模型(Cognitive Hierarchy Models),假设智能体具有有限理性并表现出不同深度的推理层级。该方案通过三阶段系统化流程,利用六种先进LLMs在十五个精心设计的正常形式博弈中的行为数据进行评估,实验证明LLMs在不同对手下展现出一致的战略推理层级,验证了框架的鲁棒性和泛化能力;同时分析发现,对话机制(Chat Mechanism)显著削弱战略推理表现,而记忆机制(Memory Mechanism)则有效增强推理能力,从而为LLM战略推理能力的客观、稳定评估提供了可靠工具。
链接: https://arxiv.org/abs/2508.11944
作者: Hongtao Liu,Zhicheng Du,Zihe Wang,Weiran Shen
机构: Gaoling School of Artificial Intelligence (人工智能学院); Renmin University of China (中国人民大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
备注:
Abstract:Game-playing ability serves as an indicator for evaluating the strategic reasoning capability of large language models (LLMs). While most existing studies rely on utility performance metrics, which are not robust enough due to variations in opponent behavior and game structure. To address this limitation, we propose \textbfCognitive Hierarchy Benchmark (CHBench), a novel evaluation framework inspired by the cognitive hierarchy models from behavioral economics. We hypothesize that agents have bounded rationality – different agents behave at varying reasoning depths/levels. We evaluate LLMs’ strategic reasoning through a three-phase systematic framework, utilizing behavioral data from six state-of-the-art LLMs across fifteen carefully selected normal-form games. Experiments show that LLMs exhibit consistent strategic reasoning levels across diverse opponents, confirming the framework’s robustness and generalization capability. We also analyze the effects of two key mechanisms (Chat Mechanism and Memory Mechanism) on strategic reasoning performance. Results indicate that the Chat Mechanism significantly degrades strategic reasoning, whereas the Memory Mechanism enhances it. These insights position CHBench as a promising tool for evaluating LLM capabilities, with significant potential for future research and practical applications.
zh
[NLP-86] CAMF: Collaborative Adversarial Multi-agent Framework for Machine Generated Text Detection
【速读】: 该论文旨在解决当前零样本机器生成文本(Machine-Generated Text, MGT)检测方法中存在的两大关键问题:一是现有方法多局限于对文本表层特征的分析,缺乏对语言维度一致性(如风格、语义和逻辑)的深入考察;二是缺乏系统性机制来识别跨维度的细微不一致,而这正是区分人类与机器生成文本的核心线索。解决方案的关键在于提出一种名为协同对抗多智能体框架(Collaborative Adversarial Multi-agent Framework, CAMF)的新架构,其通过三个阶段的协同对抗流程——多维语言特征提取、对抗一致性探查和合成判断聚合——实现对文本中潜在非人类来源的深层识别,从而显著提升零样本场景下的检测性能。
链接: https://arxiv.org/abs/2508.11933
作者: Yue Wang,Liesheng Wei,Yuxiang Wang
机构: Stanford University (斯坦福大学); Shanghai Ocean University (上海海洋大学); Stevens Institute of Technology (斯蒂文斯理工学院)
类目: Computation and Language (cs.CL)
备注:
Abstract:Detecting machine-generated text (MGT) from contemporary Large Language Models (LLMs) is increasingly crucial amid risks like disinformation and threats to academic integrity. Existing zero-shot detection paradigms, despite their practicality, often exhibit significant deficiencies. Key challenges include: (1) superficial analyses focused on limited textual attributes, and (2) a lack of investigation into consistency across linguistic dimensions such as style, semantics, and logic. To address these challenges, we introduce the \textbfCollaborative \textbfAdversarial \textbfMulti-agent \textbfFramework (\textbfCAMF), a novel architecture using multiple LLM-based agents. CAMF employs specialized agents in a synergistic three-phase process: \emphMulti-dimensional Linguistic Feature Extraction, \emphAdversarial Consistency Probing, and \emphSynthesized Judgment Aggregation. This structured collaborative-adversarial process enables a deep analysis of subtle, cross-dimensional textual incongruities indicative of non-human origin. Empirical evaluations demonstrate CAMF’s significant superiority over state-of-the-art zero-shot MGT detection techniques.
zh
[NLP-87] LLM s Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese
【速读】: 该论文试图解决多语言自然语言推理(Natural Language Inference, NLI)中因时态标记缺失导致的时序语义理解难题,尤其聚焦于汉语和日语在完成体(perfect aspect)层面缺乏独立的时态形态标记,从而使得模型难以准确识别细微的时间关系与参考时间点变化。其解决方案的关键在于构建了一个基于语言学动机的、模板驱动的NLI数据集(每种语言包含1,350对样本),以系统性地评估大语言模型(LLMs)在跨语言时序推理上的表现,从而揭示当前模型在处理非显式时态线索时的局限性,并推动更严谨的跨语言时序语义评估体系的发展。
链接: https://arxiv.org/abs/2508.11927
作者: Jie Lu,Du Jin,Hitomi Yanaka
机构: the University of Tokyo (东京大学); RIKEN (理化学研究所)
类目: Computation and Language (cs.CL)
备注: 9 pages, 3 figures
Abstract:Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at this https URL.
zh
[NLP-88] Optimizing Token Choice for Code Watermarking: A RL Approach
【速读】: 该论文旨在解决大语言模型(Large Language Model, LLM)生成代码时难以有效检测其来源的问题,核心挑战在于LLM生成的代码具有高度结构化和语法约束的特点,传统水印技术难以在不破坏代码功能的前提下嵌入可检测信号。解决方案的关键在于提出CodeTracer框架,其创新性地采用基于强化学习的自适应水印机制,通过参数化策略模型在下一token预测过程中智能偏置词元选择,从而在保持代码功能完整性的前提下引入统计上可检测的分布偏差;同时设计融合执行反馈与水印嵌入信号的综合奖励函数,并利用Gumbel Top-k重参数化实现离散水印决策的梯度优化,显著提升了水印的可检测性与代码功能性之间的平衡。
链接: https://arxiv.org/abs/2508.11925
作者: Zhimeng Guo,Huaisheng Zhu,Siyuan Xu,Hangfan Zhang,Teng Xiao,Minhao Cheng
机构: The Pennsylvania State University (宾夕法尼亚州立大学)
类目: Cryptography and Security (cs.CR); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: 18 pages, 3 figures
Abstract:The need for detecting LLM-generated code necessitates watermarking systems capable of operating within its highly structured and syntactically constrained environment. To address this, we introduce CodeTracer, an innovative adaptive code watermarking framework underpinned by a novel reinforcement learning training paradigm. At its core, CodeTracer features a policy-driven approach that utilizes a parameterized model to intelligently bias token choices during next-token prediction. This strategy ensures that embedded watermarks maintain code functionality while exhibiting subtle yet statistically detectable deviations from typical token distributions. To facilitate policy learning, we devise a comprehensive reward system that seamlessly integrates execution feedback with watermark embedding signals, balancing process-level and outcome-level rewards. Additionally, we employ Gumbel Top-k reparameterization to enable gradient-based optimization of discrete watermarking decisions. Extensive comparative evaluations demonstrate CodeTracer’s significant superiority over state-of-the-art baselines in both watermark detectability and the preservation of generated code’s functionality.
zh
[NLP-89] CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
【速读】: 该论文旨在解决多智能体系统中基于大语言模型(Large Language Models, LLMs)的博弈交互过程中,语言多样性缺乏量化评估的问题。现有研究虽揭示了此类交互中的涌现能力,但未充分衡量不同博弈情境下语言使用的有效性与鲁棒性。解决方案的关键在于提出一种名为对话鲁棒性评估分数(Conversational Robustness Evaluation Score, CORE)的新指标,该指标融合了聚类熵(cluster entropy)、词汇重复率(lexical repetition)和语义相似度(semantic similarity),从而从多个维度直接刻画对话质量。通过将CORE应用于竞争、合作与中立三种博弈场景下的LLM对话语料,并结合Zipf定律和Heaps定律分析词频分布与词汇增长特性,作者发现合作场景表现出更陡峭的Zipf分布和更高的Heaps指数,表明其在保持一定重复性的同时具备更强的词汇扩展能力;而竞争场景则相反,反映出语言使用更为受限。这一方法为理解社会激励如何影响语言适应提供了新视角,并验证了CORE作为多智能体LLM系统中语言鲁棒性诊断工具的有效性。
链接: https://arxiv.org/abs/2508.11915
作者: Punya Syon Pandey,Yongjin Yang,Jiarui Liu,Zhijing Jin
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified. In this paper, we present the Conversational Robustness Evaluation Score: CORE, a metric to quantify the effectiveness of language use within multi-agent systems across different game-theoretic interactions. CORE integrates measures of cluster entropy, lexical repetition, and semantic similarity, providing a direct lens of dialog quality. We apply CORE to pairwise LLM dialogs across competitive, cooperative, and neutral settings, further grounding our analysis in Zipf’s and Heaps’ Laws to characterize word frequency distributions and vocabulary growth. Our findings show that cooperative settings exhibit both steeper Zipf distributions and higher Heap exponents, indicating more repetition alongside greater vocabulary expansion. In contrast, competitive interactions display lower Zipf and Heaps exponents, reflecting less repetition and more constrained vocabularies. These results provide new insights into how social incentives influence language adaptation, and highlight CORE as a robust diagnostic for measuring linguistic robustness in multi-agent LLM systems. Our code is available at this https URL.
zh
[NLP-90] In-Context Examples Matter: Improving Emotion Recognition in Conversation with Instruction Tuning
【速读】: 该论文旨在解决情感识别在对话(Emotion Recognition in Conversation, ERC)任务中,现有多阶段指令微调方法因无法联合建模说话人特征与对话上下文动态交互而导致的说话人身份、情境线索与情绪状态之间对齐不足的问题。解决方案的关键在于提出一种简单而有效的单阶段上下文内指令微调框架 InitERC,其核心是通过上下文示例引导大语言模型(Large Language Models, LLMs)学习说话人-上下文-情绪的统一对齐关系,具体包括演示池构建、上下文示例选择、提示模板设计和上下文内指令微调四个组件,并系统探究了检索策略、示例排序和示例数量三个关键因素对性能的影响。
链接: https://arxiv.org/abs/2508.11889
作者: Hui Ma,Bo Zhang,Jinpeng Hu,Zenglin Shi
机构: 未知
类目: Computation and Language (cs.CL)
备注:
Abstract:Emotion recognition in conversation (ERC) aims to identify the emotion of each utterance in a conversation, playing a vital role in empathetic artificial intelligence. With the growing of large language models (LLMs), instruction tuning has emerged as a critical paradigm for ERC. Existing studies mainly focus on multi-stage instruction tuning, which first endows LLMs with speaker characteristics, and then conducts context-aware instruction tuning to comprehend emotional states. However, these methods inherently constrains the capacity to jointly capture the dynamic interaction between speaker characteristics and conversational context, resulting in weak alignment among speaker identity, contextual cues, and emotion states within a unified framework. In this paper, we propose InitERC, a simple yet effective one-stage in-context instruction tuning framework for ERC. InitERC adapts LLMs to learn speaker-context-emotion alignment from context examples via in-context instruction tuning. Specifically, InitERC comprises four components, i.e., demonstration pool construction, in-context example selection, prompt template design, and in-context instruction tuning. To explore the impact of in-context examples, we conduct a comprehensive study on three key factors: retrieval strategy, example ordering, and the number of examples. Extensive experiments on three widely used datasets demonstrate that our proposed InitERC achieves substantial improvements over the state-of-the-art baselines.
zh
[NLP-91] EVTP-IVS: Effective Visual Token Pruning For Unifying Instruction Visual Segmentation In Multi-Modal Large Language Models
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在指令式视觉分割(Instructed Visual Segmentation, IVS)任务中推理成本过高的问题,尤其是在视频场景下。其解决方案的关键在于提出一种新颖的视觉 token 剪枝方法 EVTP-IV,该方法基于 k-center 算法并融合空间信息,以选择一个紧凑且空间代表性强的 token 子集,从而显著加速推理过程。实验表明,该方法可在仅使用 20% token 的情况下实现与全量 token 相当的分割精度,并在视频任务上达到最高 5 倍、图像任务上达 3.5 倍的加速效果,同时优于现有最优剪枝基线。
链接: https://arxiv.org/abs/2508.11886
作者: Wenhui Zhu,Xiwen Chen,Zhipeng Wang,Shao Tang,Sayan Ghosh,Xuanzhao Dong,Rajat Koner,Yalin Wang
机构: Arizona State University (亚利桑那州立大学); Clemson University (克莱姆森大学); LinkedIn Corporation (领英公司); Ludwig Maximilian University of Munich (慕尼黑路德维希-马克西米利安大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG); Image and Video Processing (eess.IV)
备注:
Abstract:Instructed Visual Segmentation (IVS) tasks require segmenting objects in images or videos based on natural language instructions. While recent multimodal large language models (MLLMs) have achieved strong performance on IVS, their inference cost remains a major bottleneck, particularly in video. We empirically analyze visual token sampling in MLLMs and observe a strong correlation between subset token coverage and segmentation performance. This motivates our design of a simple and effective token pruning method that selects a compact yet spatially representative subset of tokens to accelerate inference. In this paper, we introduce a novel visual token pruning method for IVS, called EVTP-IV, which builds upon the k-center by integrating spatial information to ensure better coverage. We further provide an information-theoretic analysis to support our design. Experiments on standard IVS benchmarks show that our method achieves up to 5X speed-up on video tasks and 3.5X on image tasks, while maintaining comparable accuracy using only 20% of the tokens. Our method also consistently outperforms state-of-the-art pruning baselines under varying pruning ratios.
zh
[NLP-92] LARC: Towards Human-level Constrained Retrosynthesis Planning through an Agent ic Framework
【速读】: 该论文旨在解决化学领域中受限逆合成规划(constrained retrosynthesis planning)这一复杂且关键的问题,即在实际约束条件下从市售起始原料高效生成目标分子的合成路径。传统方法依赖专家经验,效率低且难以扩展。解决方案的关键在于提出首个基于大语言模型(Large Language Model, LLM)的代理框架LARC(LLM-based Agentic framework for Retrosynthesis),其核心创新是将“代理作为裁判”(Agent-as-a-Judge)机制嵌入到逆合成规划流程中,通过工具驱动的推理反馈实时评估并约束合成路径生成,从而实现高成功率(72.9%)与人类专家水平相当的结果,同时显著缩短计算时间。
链接: https://arxiv.org/abs/2508.11860
作者: Frazier N. Baker,Daniel Adu-Ampratwum,Reza Averly,Botao Yu,Huan Sun,Xia Ning
机构: The Ohio State University (俄亥俄州立大学)
类目: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 24 pages, 5 figures
Abstract:Large language model (LLM) agent evaluators leverage specialized tools to ground the rational decision-making of LLMs, making them well-suited to aid in scientific discoveries, such as constrained retrosynthesis planning. Constrained retrosynthesis planning is an essential, yet challenging, process within chemistry for identifying synthetic routes from commercially available starting materials to desired target molecules, subject to practical constraints. Here, we present LARC, the first LLM-based Agentic framework for Retrosynthesis planning under Constraints. LARC incorporates agentic constraint evaluation, through an Agent-as-a-Judge, directly into the retrosynthesis planning process, using agentic feedback grounded in tool-based reasoning to guide and constrain route generation. We rigorously evaluate LARC on a carefully curated set of 48 constrained retrosynthesis planning tasks across 3 constraint types. LARC achieves a 72.9% success rate on these tasks, vastly outperforming LLM baselines and approaching human expert-level success in substantially less time. The LARC framework is extensible, and serves as a first step towards an effective agentic tool or a co-scientist to human experts for constrained retrosynthesis.
zh
[NLP-93] SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
【速读】: 该论文旨在解决自然语言处理(Natural Language Processing, NLP)中词元化(Tokenization)这一基础但长期被忽视的瓶颈问题,即现有词元化策略多为静态且难以适应复杂语义结构,限制了模型性能的进一步提升。其解决方案的核心在于提出一种新型词元化架构 SupraTok,通过三项关键技术实现突破:一是跨边界模式学习(cross-boundary pattern learning),用于发现多词语义单元(multi-word semantic units);二是熵驱动的数据筛选(entropy-driven data curation),优化训练语料质量;三是多阶段课程学习(multi-phase curriculum learning),确保收敛稳定性。SupraTok 在保持与主流模型兼容性的前提下,显著提升了词元效率(如在英文上达 31% 的改进),并有效增强下游任务表现(如 HellaSWAG 和 MMLU 基准测试分别提升 8.4% 和 9.5%),表明高效词元化可作为模型架构创新之外的另一重要性能提升路径。
链接: https://arxiv.org/abs/2508.11857
作者: Andrei-Valentin Tănase,Elena Pelican
机构: “Ovidius” University of Constanta (奥维德大学康斯坦察分校)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning “superword” tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI’s o200k tokenizer and 30% improvement over Google’s Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
zh
[NLP-94] When Does Language Transfer Help? Sequential Fine-Tuning for Cross-Lingual Euphemism Detection
【速读】: 该论文旨在解决跨语言环境下隐喻性表达(euphemism)检测的挑战,尤其是在低资源语言中,由于文化差异和语义模糊性,现有语言模型性能受限。其解决方案的关键在于采用顺序微调(sequential fine-tuning)策略,即先在高资源语言(如英语)上进行微调,再逐步迁移至低资源语言(如约鲁巴语和土耳其语),从而提升多语言模型在不同语言间的泛化能力。实验表明,该方法相较于单语言微调或同步微调更有效,尤其对低资源语言表现显著,且XLM-R在性能提升上更具优势,但需注意预训练覆盖不足可能引发灾难性遗忘问题。
链接: https://arxiv.org/abs/2508.11831
作者: Julia Sammartino,Libby Barak,Jing Peng,Anna Feldman
机构: Montclair State University (蒙特克莱尔州立大学)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: RANLP 2025
Abstract:Euphemisms are culturally variable and often ambiguous, posing challenges for language models, especially in low-resource settings. This paper investigates how cross-lingual transfer via sequential fine-tuning affects euphemism detection across five languages: English, Spanish, Chinese, Turkish, and Yoruba. We compare sequential fine-tuning with monolingual and simultaneous fine-tuning using XLM-R and mBERT, analyzing how performance is shaped by language pairings, typological features, and pretraining coverage. Results show that sequential fine-tuning with a high-resource L1 improves L2 performance, especially for low-resource languages like Yoruba and Turkish. XLM-R achieves larger gains but is more sensitive to pretraining gaps and catastrophic forgetting, while mBERT yields more stable, though lower, results. These findings highlight sequential fine-tuning as a simple yet effective strategy for improving euphemism detection in multilingual models, particularly when low-resource languages are involved.
zh
[NLP-95] Every 28 Days the AI Dreams of Soft Skin and Burning Stars: Scaffolding AI Agents with Hormones and Emotions NEURIPS
【速读】: 该论文试图解决人工智能系统面临的“框架问题”(frame problem),即如何从庞大的可能性空间中识别出上下文相关的有效信息。其解决方案的关键在于引入生物节律,特别是激素周期(如月经周期和昼夜节律),作为天然的相关性过滤机制;通过系统提示(system prompts)嵌入由关键激素(雌激素、睾酮和皮质醇)的周期函数模拟的生理节律信号,使大型语言模型在不同生物相位下表现出相应的情感倾向与认知性能变化,从而实现更符合人类生理节律的上下文感知能力。
链接: https://arxiv.org/abs/2508.11829
作者: Leigh Levinson,Christopher J. Agostino
机构: Indiana University (印第安纳大学); NPC Worldwide
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注: 9 pages, 1 figure, submitted to NeurIPS Creative AI track
Abstract:Despite significant advances, AI systems struggle with the frame problem: determining what information is contextually relevant from an exponentially large possibility space. We hypothesize that biological rhythms, particularly hormonal cycles, serve as natural relevance filters that could address this fundamental challenge. We develop a framework that embeds simulated menstrual and circadian cycles into Large Language Models through system prompts generated from periodic functions modeling key hormones including estrogen, testosterone, and cortisol. Across multiple state-of-the-art models, linguistic analysis reveals emotional and stylistic variations that track biological phases; sadness peaks during menstruation while happiness dominates ovulation and circadian patterns show morning optimism transitioning to nocturnal introspection. Benchmarking on SQuAD, MMLU, Hellaswag, and AI2-ARC demonstrates subtle but consistent performance variations aligning with biological expectations, including optimal function in moderate rather than extreme hormonal ranges. This methodology provides a novel approach to contextual AI while revealing how societal biases regarding gender and biology are embedded within language models.
zh
[NLP-96] A Survey of Idiom Datasets for Psycholinguistic and Computational Research
【速读】: 该论文试图解决的问题是:如何系统梳理和整合用于研究习语(idioms)的跨学科数据集,以促进心理学语言学(psycholinguistics)与计算语言学(computational linguistics)之间的协同研究。其解决方案的关键在于对53个相关数据集进行结构化分析,涵盖其内容、形式及预期用途,并总结标注实践、覆盖范围与任务设定的趋势;尽管近年来在语言多样性和任务类型上有所扩展,但论文指出当前两类研究之间尚未建立明确关联,提示未来需加强跨领域数据共享与协作机制。
链接: https://arxiv.org/abs/2508.11828
作者: Michael Flor,Xinyi Liu,Anna Feldman
机构: Educational Testing Service (教育考试服务中心); Montclair State University (蒙特克莱尔州立大学)
类目: Computation and Language (cs.CL)
备注: KONVENS 2025. To appear
Abstract:Idioms are figurative expressions whose meanings often cannot be inferred from their individual words, making them difficult to process computationally and posing challenges for human experimental studies. This survey reviews datasets developed in psycholinguistics and computational linguistics for studying idioms, focusing on their content, form, and intended use. Psycholinguistic resources typically contain normed ratings along dimensions such as familiarity, transparency, and compositionality, while computational datasets support tasks like idiomaticity detection/classification, paraphrasing, and cross-lingual modeling. We present trends in annotation practices, coverage, and task framing across 53 datasets. Although recent efforts expanded language coverage and task diversity, there seems to be no relation yet between psycholinguistic and computational research on idioms.
zh
[NLP-97] Hallucination Detection and Mitigation in Scientific Text Simplification using Ensemble Approaches: DS@GT at CLEF 2025 SimpleText
【速读】: 该论文旨在解决科学文本简化过程中生成式AI(Generative AI)可能引入的创造性偏差与信息失真问题,即如何有效检测并评估简化文本中是否存在不恰当的虚构内容或对原意的扭曲。其解决方案的关键在于构建一个集成框架,融合基于BERT的分类器、语义相似度测量、自然语言推理模型以及大语言模型(LLM)推理等多种信号,并通过元分类器(meta-classifier)整合这些异构特征,从而提升对虚假生成和信息失真的鲁棒性检测能力;同时,为保障生成内容的可信性,还引入基于LLM的后编辑系统,依据原始输入文本对简化结果进行修正,以增强生成文本的忠实性与准确性。
链接: https://arxiv.org/abs/2508.11823
作者: Krishna Chaitanya Marturi,Heba H. Elwazzan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: Text Simplification, hallucination detection, LLMs, CLEF 2025, SimpleText, CEUR-WS
Abstract:In this paper, we describe our methodology for the CLEF 2025 SimpleText Task 2, which focuses on detecting and evaluating creative generation and information distortion in scientific text simplification. Our solution integrates multiple strategies: we construct an ensemble framework that leverages BERT-based classifier, semantic similarity measure, natural language inference model, and large language model (LLM) reasoning. These diverse signals are combined using meta-classifiers to enhance the robustness of spurious and distortion detection. Additionally, for grounded generation, we employ an LLM-based post-editing system that revises simplifications based on the original input texts.
zh
[NLP-98] LLM -Guided Planning and Summary-Based Scientific Text Simplification: DS@GT at CLEF 2025 SimpleText
【速读】: 该论文旨在解决科学文本简化(scientific text simplification)问题,涵盖句子级和文档级两个层面。其解决方案的关键在于提出一种两阶段、基于大语言模型(large language models, LLMs)的框架:在句子级,利用LLMs生成结构化简化计划并据此执行逐句简化;在文档级,则通过LLMs生成摘要以提供上下文引导,从而实现更连贯且忠实于原意的文本简化。
链接: https://arxiv.org/abs/2508.11816
作者: Krishna Chaitanya Marturi,Heba H. Elwazzan
机构: Georgia Institute of Technology (佐治亚理工学院)
类目: Computation and Language (cs.CL)
备注: Text Simplification, hallucination detection, LLMs, CLEF 2025, SimpleText, CEUR-WS
Abstract:In this paper, we present our approach for the CLEF 2025 SimpleText Task 1, which addresses both sentence-level and document-level scientific text simplification. For sentence-level simplification, our methodology employs large language models (LLMs) to first generate a structured plan, followed by plan-driven simplification of individual sentences. At the document level, we leverage LLMs to produce concise summaries and subsequently guide the simplification process using these summaries. This two-stage, LLM-based framework enables more coherent and contextually faithful simplifications of scientific text.
zh
[NLP-99] Labels or Input? Rethinking Augmentation in Multimodal Hate Detection
【速读】: 该论文旨在解决多模态仇恨内容检测(multimodal hate detection)中因文本与图像间隐含互动而难以识别的挑战,尤其是针对伪装成幽默或讽刺的隐性仇恨言论。其核心解决方案包括两个关键方面:一是提出一种提示优化框架(prompt optimization framework),通过系统性调整提示结构、监督粒度和训练模态,发现结构化提示可提升模型鲁棒性,即使在小模型中亦然;二是设计了一种多模态数据增强流水线(multimodal data augmentation pipeline),利用多智能体大语言模型-视觉语言模型(LLM-VLM)架构生成2,479条反事实中立的仇恨模态替代样本,有效缓解虚假相关性并提升分类器泛化能力。研究表明,提示设计与数据组成与模型规模同等重要,且针对性的数据增强能显著提升仇恨检测的可信度与情境敏感性。
链接: https://arxiv.org/abs/2508.11808
作者: Sahajpreet Singh,Rongxin Ouyang,Subhayan Mukerjee,Kokil Jaidka
机构: 1. University of California, Berkeley (加州大学伯克利分校); 2. Google (谷歌); 3. OpenAI (OpenAI)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computers and Society (cs.CY); Multimedia (cs.MM)
备注: 13 pages, 2 figures, 7 tables
Abstract:The modern web is saturated with multimodal content, intensifying the challenge of detecting hateful memes, where harmful intent is often conveyed through subtle interactions between text and image under the guise of humor or satire. While recent advances in Vision-Language Models (VLMs) show promise, these models lack support for fine-grained supervision and remain susceptible to implicit hate speech. In this paper, we present a dual-pronged approach to improve multimodal hate detection. First, we propose a prompt optimization framework that systematically varies prompt structure, supervision granularity, and training modality. We show that prompt design and label scaling both influence performance, with structured prompts improving robustness even in small models, and InternVL2 achieving the best F1-scores across binary and scaled settings. Second, we introduce a multimodal data augmentation pipeline that generates 2,479 counterfactually neutral memes by isolating and rewriting the hateful modality. This pipeline, powered by a multi-agent LLM-VLM setup, successfully reduces spurious correlations and improves classifier generalization. Our approaches inspire new directions for building synthetic data to train robust and fair vision-language models. Our findings demonstrate that prompt structure and data composition are as critical as model size, and that targeted augmentation can support more trustworthy and context-sensitive hate detection.
zh
[NLP-100] VideoAVE: A Multi-Attribute Video-to-Text Attribute Value Extraction Dataset and Benchmark Models CIKM2025
【速读】: 该论文旨在解决电商领域中产品属性值提取(Attribute Value Extraction, AVE)任务在视频场景下的数据稀缺与评估标准缺失问题,现有数据集主要局限于文本或图像到文本的设置,缺乏对产品视频的支持、多样化的属性覆盖以及公开可用性。解决方案的关键在于构建首个公开的视频到文本电商AVE数据集VideoAVE,涵盖14个商品类别和172个独特属性,并通过基于CLIP的混合专家过滤系统(CLIP-MoE)对视频-商品配对进行后处理筛选,从而提升数据质量,最终形成包含22.4万条训练样本和2.5万条测试样本的高质量数据集。此外,论文还建立了全面的基准评测体系,用于评估多种先进视频视觉语言模型(Video Vision Language Models, VLMs)在条件属性值预测与开放属性-值对提取任务中的表现,揭示了当前方法在视频语义理解与时序信息利用方面的局限性。
链接: https://arxiv.org/abs/2508.11801
作者: Ming Cheng,Tong Wu,Jiazhen Hu,Jiaying Gong,Hoda Eldardiry
机构: Virginia Tech (弗吉尼亚理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
备注: 5 pages, 2 figures, 5 tables, accepted in CIKM 2025
Abstract:Attribute Value Extraction (AVE) is important for structuring product information in e-commerce. However, existing AVE datasets are primarily limited to text-to-text or image-to-text settings, lacking support for product videos, diverse attribute coverage, and public availability. To address these gaps, we introduce VideoAVE, the first publicly available video-to-text e-commerce AVE dataset across 14 different domains and covering 172 unique attributes. To ensure data quality, we propose a post-hoc CLIP-based Mixture of Experts filtering system (CLIP-MoE) to remove the mismatched video-product pairs, resulting in a refined dataset of 224k training data and 25k evaluation data. In order to evaluate the usability of the dataset, we further establish a comprehensive benchmark by evaluating several state-of-the-art video vision language models (VLMs) under both attribute-conditioned value prediction and open attribute-value pair extraction tasks. Our results analysis reveals that video-to-text AVE remains a challenging problem, particularly in open settings, and there is still room for developing more advanced VLMs capable of leveraging effective temporal information. The dataset and benchmark code for VideoAVE are available at: this https URL
zh
[NLP-101] A Multi-Task Evaluation of LLM s Processing of Academic Text Input
【速读】: 该论文旨在解决大语言模型(Large Language Models, LLMs)在科学发现中,特别是辅助学术同行评审方面的实际应用潜力问题。当前学界对此存在激烈争议,亟需系统性评估LLMs对学术文本处理能力的真实水平。解决方案的关键在于构建一个结构化、任务导向的评估框架,将计算机科学领域常用的独立文本处理任务整合为一个受引导且稳健的工作流程,涵盖内容复现(reproduction)、对比(comparison)、评分(scoring)与反思(reflection)四项核心任务,并明确每项任务下LLM应扮演的角色(如oracle、judgmental arbiter等),从而逐步测试其从基础理解到高阶推理的多维智力能力。通过使用顶级信息系统期刊文章作为输入文本并结合多种文本指标,研究发现当前领先的LLM(如Google Gemini)虽能生成可接受的摘要和改写,但在文本排序、评分判别及深度反思方面表现不佳,且结果在不同评估维度(语言学指标、与真实答案对比、人工评价)下均一致,表明不应未经严格验证地用于构建同行评审体系。
链接: https://arxiv.org/abs/2508.11779
作者: Tianyi Li,Yu Qin,Olivia R. Liu Sheng
机构: CUHK (香港中文大学); Arizona State University (亚利桑那州立大学)
类目: Computation and Language (cs.CL); General Economics (econ.GN)
备注:
Abstract:How much large language models (LLMs) can aid scientific discovery, notably in assisting academic peer review, is in heated debate. Between a literature digest and a human-comparable research assistant lies their practical application potential. We organize individual tasks that computer science studies employ in separate terms into a guided and robust workflow to evaluate LLMs’ processing of academic text input. We employ four tasks in the assessment: content reproduction/comparison/scoring/reflection, each demanding a specific role of the LLM (oracle/judgmental arbiter/knowledgeable arbiter/collaborator) in assisting scholarly works, and altogether testing LLMs with questions that increasingly require intellectual capabilities towards a solid understanding of scientific texts to yield desirable solutions. We exemplify a rigorous performance evaluation with detailed instructions on the prompts. Adopting first-rate Information Systems articles at three top journals as the input texts and an abundant set of text metrics, we record a compromised performance of the leading LLM - Google’s Gemini: its summary and paraphrase of academic text is acceptably reliable; using it to rank texts through pairwise text comparison is faintly scalable; asking it to grade academic texts is prone to poor discrimination; its qualitative reflection on the text is self-consistent yet hardly insightful to inspire meaningful research. This evidence against an endorsement of LLMs’ text-processing capabilities is consistent across metric-based internal (linguistic assessment), external (comparing to the ground truth), and human evaluation, and is robust to the variations of the prompt. Overall, we do not recommend an unchecked use of LLMs in constructing peer reviews.
zh
[NLP-102] Investigating Transcription Normalization in the Faetar ASR Benchmark
【速读】: 该论文旨在解决Faetar自动语音识别(Automatic Speech Recognition, ASR)基准中因转录不一致(transcription inconsistencies)所导致的识别性能瓶颈问题。其关键解决方案在于通过构建一个小型人工词典(hand-constructed lexicon)对解码过程进行约束,从而有效缓解低资源场景下的识别难题;研究发现,尽管转录不一致确实存在,但并非主要挑战,而基于词典的约束 decoding 比使用二元语法(bigram)语言模型更能提升性能。
链接: https://arxiv.org/abs/2508.11771
作者: Leo Peckham,Michael Ong,Naomi Nagy,Ewan Dunbar
机构: University of Toronto (多伦多大学)
类目: Computation and Language (cs.CL)
备注:
Abstract:We examine the role of transcription inconsistencies in the Faetar Automatic Speech Recognition benchmark, a challenging low-resource ASR benchmark. With the help of a small, hand-constructed lexicon, we conclude that find that, while inconsistencies do exist in the transcriptions, they are not the main challenge in the task. We also demonstrate that bigram word-based language modelling is of no added benefit, but that constraining decoding to a finite lexicon can be beneficial. The task remains extremely difficult.
zh
[NLP-103] Limitation Learning: Catching Adverse Dialog with GAIL
【速读】: 该论文旨在解决在缺乏显式奖励信号的情况下,如何通过专家示范(expert demonstrations)来训练对话策略的问题。其核心挑战在于构建一个能够生成自然、有效对话的策略模型,同时识别出当前对话模型存在的局限性。解决方案的关键在于采用模仿学习(imitation learning)方法,从专家对话数据中学习策略(policy),并引入一个判别器(discriminator)用于区分专家生成与合成的对话内容;该判别器不仅辅助策略优化,还能揭示对话模型在实际应用中的潜在缺陷,从而为识别任意数据模型在面向对话任务中的不良行为提供可解释的评估工具。
链接: https://arxiv.org/abs/2508.11767
作者: Noah Kasmanoff,Rahul Zalkikar
机构: New York University (纽约大学)
类目: Computation and Language (cs.CL); Machine Learning (cs.LG)
备注: Paper from 2021
Abstract:Imitation learning is a proven method for creating a policy in the absence of rewards, by leveraging expert demonstrations. In this work, we apply imitation learning to conversation. In doing so, we recover a policy capable of talking to a user given a prompt (input state), and a discriminator capable of classifying between expert and synthetic conversation. While our policy is effective, we recover results from our discriminator that indicate the limitations of dialog models. We argue that this technique can be used to identify adverse behavior of arbitrary data models common for dialog oriented tasks.
zh
[NLP-104] Using Natural Language for Human-Robot Collaboration in the Real World
【速读】: 该论文试图解决如何将大语言模型(Large Language Models, LLMs)的语言理解能力有效集成到物理世界中的机器人系统,以实现人机协作任务中自然语言交互的增强。其核心挑战在于使机器人不仅能够理解人类的自然语言指令,还能在动态环境中积累情境知识并作出适应性响应。解决方案的关键在于构建一个以认知代理(cognitive agent)为核心的AI系统,该代理同时与人类和LLM交互,并通过实际操作经验不断学习和更新情境知识,从而形成具备语言理解、环境感知与决策执行一体化能力的机器人助手。论文进一步通过三个具体实验验证了该架构的可行性,表明将LLM作为语言接口嵌入机器人控制系统是迈向实用化人机协同的重要路径。
链接: https://arxiv.org/abs/2508.11759
作者: Peter Lindes,Kaoutar Skiker
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 34 pages, 11 figures, 5 tables. Submitted for publication (2026) in W.F. Lawless, Ranjeev Mittu, Shannon P. McGrarry, Marco Brambilla (Eds.), Generative AI Risks and Benefits within Human-Machine Teams, Elsevier, Chapter 6
Abstract:We have a vision of a day when autonomous robots can collaborate with humans as assistants in performing complex tasks in the physical world. This vision includes that the robots will have the ability to communicate with their human collaborators using language that is natural to the humans. Traditional Interactive Task Learning (ITL) systems have some of this ability, but the language they can understand is very limited. The advent of large language models (LLMs) provides an opportunity to greatly improve the language understanding of robots, yet integrating the language abilities of LLMs with robots that operate in the real physical world is a challenging problem. In this chapter we first review briefly a few commercial robot products that work closely with humans, and discuss how they could be much better collaborators with robust language abilities. We then explore how an AI system with a cognitive agent that controls a physical robot at its core, interacts with both a human and an LLM, and accumulates situational knowledge through its experiences, can be a possible approach to reach that vision. We focus on three specific challenges of having the robot understand natural language, and present a simple proof-of-concept experiment using ChatGPT for each. Finally, we discuss what it will take to turn these simple experiments into an operational system where LLM-assisted language understanding is a part of an integrated robotic assistant that uses language to collaborate with humans. Comments: 34 pages, 11 figures, 5 tables. Submitted for publication (2026) in W.F. Lawless, Ranjeev Mittu, Shannon P. McGrarry, Marco Brambilla (Eds.), Generative AI Risks and Benefits within Human-Machine Teams, Elsevier, Chapter 6 Subjects: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computation and Language (cs.CL) Cite as: arXiv:2508.11759 [cs.RO] (or arXiv:2508.11759v1 [cs.RO] for this version) https://doi.org/10.48550/arXiv.2508.11759 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[NLP-105] Can we Evaluate RAG s with Synthetic Data? ECML-PKDD2025
【速读】: 该论文试图解决的问题是:当人类标注的基准数据(benchmark)不可获得时,能否使用由大语言模型(Large Language Models, LLMs)生成的合成问答(Question-Answer, QA)数据作为有效的替代基准来评估检索增强生成(Retrieval-Augmented Generation, RAG)系统性能。其解决方案的关键在于通过两个实验设计验证合成基准的可靠性:一是固定生成器(generator)而改变检索器(retriever)参数,二是固定检索器而改变生成器架构。研究发现,合成基准在评估不同检索器配置时能可靠地对RAG系统进行排序,与人类标注基准高度一致;但在比较不同生成器架构时则表现不稳定,可能源于任务不匹配和生成风格偏差。这一结果表明,合成QA数据可作为检索组件优化的有效代理指标,但需谨慎用于生成器层面的对比分析。
链接: https://arxiv.org/abs/2508.11758
作者: Jonas van Elburg,Peter van der Putten,Maarten Marx
机构: 未知
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
备注: Accepted for the SynDAiTE workshop at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2025), September 15, 2025 - Porto, Portugal
Abstract:We investigate whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labeled benchmarks when such data is unavailable. We assess the reliability of synthetic benchmarks across two experiments: one varying retriever parameters while keeping the generator fixed, and another varying the generator with fixed retriever parameters. Across four datasets, of which two open-domain and two proprietary, we find that synthetic benchmarks reliably rank the RAGs varying in terms of retriever configuration, aligning well with human-labeled benchmark baselines. However, they fail to produce consistent RAG rankings when comparing generator architectures. The breakdown possibly arises from a combination of task mismatch between the synthetic and human benchmarks, and stylistic bias favoring certain generators.
zh
[NLP-106] Ovis2.5 Technical Report
【速读】: 该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在处理高分辨率视觉内容时因固定分辨率切片导致的细节丢失问题,以及在复杂推理任务中缺乏深度反思能力的问题。解决方案的关键在于:首先,引入原生分辨率视觉Transformer(native-resolution vision transformer),直接以图像原始分辨率进行处理,避免了传统固定分辨率分块带来的信息损失,从而更好地保留细粒度结构与全局布局;其次,在训练过程中采用五阶段课程学习策略,通过DPO(Direct Preference Optimization)和GRPO(Group Reward Policy Optimization)增强模型的推理能力,使其具备自我检查与修正的“反思”机制,并在推理阶段提供可选的“思考模式”,实现延迟与精度之间的灵活权衡。这一系列改进使Ovis2.5在复杂图表分析、STEM任务及视频理解等关键指标上达到开源模型中的最先进水平。
链接: https://arxiv.org/abs/2508.11737
作者: Shiyin Lu,Yang Li,Yu Xia,Yuwei Hu,Shanshan Zhao,Yanqing Ma,Zhichao Wei,Yinglun Li,Lunhao Duan,Jianshan Zhao,Yuxuan Han,Haijun Li,Wanying Chen,Junke Tang,Chengkun Hou,Zhixing Du,Tianli Zhou,Wenjie Zhang,Huping Ding,Jiahe Li,Wen Li,Gui Hu,Yiliang Gu,Siran Yang,Jiamang Wang,Hailong Sun,Yibo Wang,Hui Sun,Jinlong Huang,Yuping He,Shengze Shi,Weihong Zhang,Guodong Zheng,Junpeng Jiang,Sensen Gao,Yi-Feng Wu,Sijia Chen,Yuhui Chen,Qing-Guo Chen,Zhao Xu,Weihua Luo,Kaifu Zhang
机构: Alibaba Group (阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
备注:
Abstract:We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout – crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection – including self-checking and revision. This advanced capability is exposed as an optional “thinking mode” at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the “small model, big performance” philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
zh
[NLP-107] Code Vulnerability Detection Across Different Programming Languages with AI Models
【速读】: 该论文旨在解决跨编程语言的源代码漏洞检测问题,尤其是传统基于规则的静态分析工具在处理上下文依赖性漏洞时准确率低、误报率高的难题。其解决方案的关键在于利用预训练的Transformer模型(如CodeBERT和CodeLlama)进行动态微调,并结合集成学习与可解释人工智能技术,在多个漏洞数据集上实现高精度的漏洞预测能力。实验表明,经过优化的CodeBERT模型在准确率上可达到97%以上,甚至优于部分现有静态分析器,且具备跨语言和跨漏洞类型的泛化能力,但需通过混合模型与验证流程提升精度以降低假阳性率。
链接: https://arxiv.org/abs/2508.11710
作者: Hael Abdulhakim Ali Humran,Ferdi Sonmez
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注:
Abstract:Security vulnerabilities present in a code that has been written in diverse programming languages are among the most critical yet complicated aspects of source code to detect. Static analysis tools based on rule-based patterns usually do not work well at detecting the context-dependent bugs and lead to high false positive rates. Recent developments in artificial intelligence, specifically the use of transformer-based models like CodeBERT and CodeLlama, provide light to this problem, as they show potential in finding such flaws better. This paper presents the implementations of these models on various datasets of code vulnerability, showing how off-the-shelf models can successfully produce predictive capacity in models through dynamic fine-tuning of the models on vulnerable and safe code fragments. The methodology comprises the gathering of the dataset, normalization of the language, fine-tuning of the model, and incorporation of ensemble learning and explainable AI. Experiments show that a well-trained CodeBERT can be as good as or even better than some existing static analyzers in terms of accuracy greater than 97%. Further study has indicated that although language models can achieve close-to-perfect recall, the precision can decrease. A solution to this is given by hybrid models and validation procedures, which will reduce false positives. According to the results, the AI-based solutions generalize to different programming languages and classes of vulnerability. Nevertheless, robustness, interpretability, and deployment readiness are still being developed. The results illustrate the probabilities that AI will enhance the trustworthiness in the usability and scalability of machine-learning-based detectors of vulnerabilities.
zh
[NLP-108] Deep Language Geometry: Constructing a Metric Space from LLM Weights
【速读】: 该论文旨在解决传统语言表征方法依赖人工设计语言特征(hand-crafted linguistic features)所带来的局限性,从而无法充分捕捉语言内在复杂结构的问题。其解决方案的关键在于利用现代大语言模型(Large Language Models, LLMs)内部权重激活信息,通过改进的剪枝算法计算权重重要性得分,自动构建高维向量表示的空间——即语言潜在空间(language latent space)。这种方法能够无监督地提取反映语言本质特征的向量表示,不仅在多语言数据集上验证了与已知语言家族的一致性,还揭示了可能源于历史接触或语言演化的非预期语言关联。
链接: https://arxiv.org/abs/2508.11676
作者: Maksym Shamrai,Vladyslav Hamolia
机构: Institute of Mathematics of NASU (乌克兰国家科学院数学研究所); MacPaw (MacPaw); Kyiv, Ukraine (基辅, 乌克兰)
类目: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 18 pages, accepted to RANLP 2025
Abstract:We introduce a novel framework that utilizes the internal weight activations of modern Large Language Models (LLMs) to construct a metric space of languages. Unlike traditional approaches based on hand-crafted linguistic features, our method automatically derives high-dimensional vector representations by computing weight importance scores via an adapted pruning algorithm. Our approach captures intrinsic language characteristics that reflect linguistic phenomena. We validate our approach across diverse datasets and multilingual LLMs, covering 106 languages. The results align well with established linguistic families while also revealing unexpected inter-language connections that may indicate historical contact or language evolution. The source code, computed language latent vectors, and visualization tool are made publicly available at this https URL.
zh
[NLP-109] Assessing Representation Stability for Transformer Models
【速读】: 该论文旨在解决对抗性文本攻击对Transformer模型持续构成的威胁问题,现有防御方法通常针对特定攻击类型或需要昂贵的模型重训练。其解决方案的关键在于提出一种模型无关的检测框架——表示稳定性(Representation Stability, RS),通过测量嵌入表示在掩码重要词汇时的变化来识别对抗样本:首先利用重要性启发式方法对词汇进行排序,随后评估嵌入对掩码前k个关键词的敏感性,并使用双向长短期记忆网络(BiLSTM)处理由此产生的模式。实验表明,对抗扰动词相较于自然重要词表现出显著更高的掩码敏感性,RS在多个数据集、攻击类型和目标模型上均实现超过88%的检测准确率,且计算成本低于现有最优方法,同时具备良好的跨数据集、攻击和模型泛化能力。
链接: https://arxiv.org/abs/2508.11667
作者: Bryan E. Tuck,Rakesh M. Verma
机构: University of Houston (休斯顿大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
备注: 19 pages, 19 figures, 8 tables. Code available at this https URL
Abstract:Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining. We introduce Representation Stability (RS), a model-agnostic detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. RS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, RS achieves over 88% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we reveal that gradient-based ranking outperforms attention and random selection approaches, with identification quality correlating with detection performance for word-level attacks. RS also generalizes well to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
zh
[NLP-110] Sparse Attention across Multiple-context KV Cache
【速读】: 该论文旨在解决多上下文场景下(如检索增强生成,RAG)因独立计算每个文档的键值缓存(Key-Value Cache, KV Cache)而导致缺乏跨上下文注意力的问题,从而限制了现有稀疏注意力机制在多上下文KV Cache中的应用效果。传统方法无法有效压缩序列长度且难以降低内存开销,尤其在需要保留全部KV Cache以避免精度损失的情况下。解决方案的关键在于提出SamKV,首次探索针对多上下文KV Cache的注意力稀疏化策略:它在对某一上下文进行稀疏化时,主动考虑其他上下文提供的互补信息,并仅局部重新计算被稀疏的部分,从而在不牺牲生成精度的前提下将序列长度压缩至15%,显著提升多上下文RAG场景下的推理吞吐量。
链接: https://arxiv.org/abs/2508.11661
作者: Ziyi Cao,Qingyi Si,Jingbin Zhang,Bingquan Liu
机构: Huawei(华为)
类目: Machine Learning (cs.LG); Computation and Language (cs.CL)
备注:
Abstract:Large language models face significant cost challenges in long-sequence inference. To address this, reusing historical Key-Value (KV) Cache for improved inference efficiency has become a mainstream approach. Recent advances further enhance throughput by sparse attention mechanisms to select the most relevant KV Cache, thereby reducing sequence length. However, such techniques are limited to single-context scenarios, where historical KV Cache is computed sequentially with causal-attention dependencies. In retrieval-augmented generation (RAG) scenarios, where retrieved documents as context are unknown beforehand, each document’s KV Cache is computed and stored independently (termed multiple-context KV Cache), lacking cross-attention between contexts. This renders existing methods ineffective. Although prior work partially recomputes multiple-context KV Cache to mitigate accuracy loss from missing cross-attention, it requires retaining all KV Cache throughout, failing to reduce memory overhead. This paper presents SamKV, the first exploration of attention sparsification for multiple-context KV Cache. Specifically, SamKV takes into account the complementary information of other contexts when sparsifying one context, and then locally recomputes the sparsified information. Experiments demonstrate that our method compresses sequence length to 15% without accuracy degradation compared with full-recompuation baselines, significantly boosting throughput in multi-context RAG scenarios.
zh
计算机视觉
[CV-0] 4DNeX: Feed-Forward 4D Generative Modeling Made Easy
【速读】:该论文旨在解决从单张图像生成4D(动态三维)场景表示的难题,现有方法通常依赖计算密集型优化或需要多帧视频输入,难以实现高效且通用的端到端建模。解决方案的关键在于提出4DNeX框架,其核心创新包括:构建大规模高质量4D数据集4DNeX-10M以缓解数据稀缺问题;设计统一的6D视频表示(联合建模RGB与XYZ序列),促进外观和几何结构的协同学习;并提出一套简单有效的适配策略,将预训练视频扩散模型迁移用于4D建模。该方法实现了高保真动态点云生成,并支持新视角视频合成,在效率和泛化能力上优于现有4D生成方法,为构建可生成动态场景演化的世界模型奠定了基础。
链接: https://arxiv.org/abs/2508.13154
作者: Zhaoxi Chen,Tianqi Liu,Long Zhuo,Jiawei Ren,Zeng Tao,He Zhu,Fangzhou Hong,Liang Pan,Ziwei Liu
机构: S-Lab, Nanyang Technological University (南洋理工大学); Shanghai AI Laboratory
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:We present 4DNeX, the first feed-forward framework for generating 4D (i.e., dynamic 3D) scene representations from a single image. In contrast to existing methods that rely on computationally intensive optimization or require multi-frame video inputs, 4DNeX enables efficient, end-to-end image-to-4D generation by fine-tuning a pretrained video diffusion model. Specifically, 1) to alleviate the scarcity of 4D data, we construct 4DNeX-10M, a large-scale dataset with high-quality 4D annotations generated using advanced reconstruction approaches. 2) we introduce a unified 6D video representation that jointly models RGB and XYZ sequences, facilitating structured learning of both appearance and geometry. 3) we propose a set of simple yet effective adaptation strategies to repurpose pretrained video diffusion models for 4D modeling. 4DNeX produces high-quality dynamic point clouds that enable novel-view video synthesis. Extensive experiments demonstrate that 4DNeX outperforms existing 4D generation methods in efficiency and generalizability, offering a scalable solution for image-to-4D modeling and laying the foundation for generative 4D world models that simulate dynamic scene evolution.
zh
[CV-1] IGFuse: Interactive 3D Gaussian Scene Reconstruction via Multi-Scans Fusion
【速读】:该论文旨在解决三维场景重建中因物体持续遮挡和传感器覆盖有限而导致的结构信息不完整问题,尤其针对单次扫描难以获取全貌、传统多阶段流水线(如分割、背景补全、图像修复)或逐对象密集扫描方法易出错且难以扩展的局限性。其解决方案的关键在于提出IGFuse框架,通过融合多个扫描视角下的观测数据,利用物体在不同采集时刻的自然重排来揭示先前被遮挡的区域;该框架构建具有语义感知能力的高斯场(Gaussian fields),并强制跨扫描间的双向光度与语义一致性,同时引入伪中间场景状态以统一处理空间错位,并采用协同共修剪策略优化几何精度,从而实现无需密集观测或复杂流程即可生成高保真渲染结果及支持对象级场景操作。
链接: https://arxiv.org/abs/2508.13153
作者: Wenhao Hu,Zesheng Li,Haonan Zhou,Liu Liu,Xuexiang Wen,Zhizhong Su,Xi Li,Gaoang Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Reconstructing complete and interactive 3D scenes remains a fundamental challenge in computer vision and robotics, particularly due to persistent object occlusions and limited sensor coverage. Multiview observations from a single scene scan often fail to capture the full structural details. Existing approaches typically rely on multi stage pipelines, such as segmentation, background completion, and inpainting or require per-object dense scanning, both of which are error-prone, and not easily scalable. We propose IGFuse, a novel framework that reconstructs interactive Gaussian scene by fusing observations from multiple scans, where natural object rearrangement between captures reveal previously occluded regions. Our method constructs segmentation aware Gaussian fields and enforces bi-directional photometric and semantic consistency across scans. To handle spatial misalignments, we introduce a pseudo-intermediate scene state for unified alignment, alongside collaborative co-pruning strategies to refine geometry. IGFuse enables high fidelity rendering and object level scene manipulation without dense observations or complex pipelines. Extensive experiments validate the framework’s strong generalization to novel scene configurations, demonstrating its effectiveness for real world 3D reconstruction and real-to-simulation transfer. Our project page is available online.
zh
[CV-2] Motion2Motion: Cross-topology Motion Transfer with Sparse Correspondence SIGGRAPH
【速读】:该论文旨在解决跨骨骼拓扑结构(skeletal topology)的动画迁移问题,即在源角色与目标角色骨骼结构差异显著时,如何实现高质量的动作迁移。传统方法受限于骨骼拓扑不一致导致难以建立一对一的骨骼对应关系,且缺乏大规模跨拓扑结构的配对运动数据集,制约了数据驱动方法的发展。其解决方案的关键在于提出一种无需训练(training-free)的框架Motion2Motion,仅需目标骨骼上少量示例动作和稀疏的源-目标骨骼骨节对应关系,即可高效、可靠地完成相似骨骼及跨物种骨骼间的动作迁移,从而突破现有方法对大规模标注数据和固定拓扑结构的依赖。
链接: https://arxiv.org/abs/2508.13139
作者: Ling-Hao Chen,Yuhong Zhang,Zixin Yin,Zhiyang Dou,Xin Chen,Jingbo Wang,Taku Komura,Lei Zhang
机构: Tsinghua University (清华大学); The Hong Kong University of Science and Technology (香港科技大学); The University of Hong Kong (香港大学); ByteDance (字节跳动); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); International Digital Economy Academy (数字经济研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: SIGGRAPH Asia 2025
Abstract:This work studies the challenge of transfer animations between characters whose skeletal topologies differ substantially. While many techniques have advanced retargeting techniques in decades, transfer motions across diverse topologies remains less-explored. The primary obstacle lies in the inherent topological inconsistency between source and target skeletons, which restricts the establishment of straightforward one-to-one bone correspondences. Besides, the current lack of large-scale paired motion datasets spanning different topological structures severely constrains the development of data-driven approaches. To address these limitations, we introduce Motion2Motion, a novel, training-free framework. Simply yet effectively, Motion2Motion works with only one or a few example motions on the target skeleton, by accessing a sparse set of bone correspondences between the source and target skeletons. Through comprehensive qualitative and quantitative evaluations, we demonstrate that Motion2Motion achieves efficient and reliable performance in both similar-skeleton and cross-species skeleton transfer scenarios. The practical utility of our approach is further evidenced by its successful integration in downstream applications and user interfaces, highlighting its potential for industrial applications. Code and data are available at this https URL.
zh
[CV-3] Precise Action-to-Video Generation Through Visual Action Prompts ICCV2025
【速读】:该论文旨在解决动作驱动视频生成中精度与泛化能力之间的权衡问题:现有方法要么依赖文本、基础动作或粗粒度掩码,虽具备跨域通用性但缺乏动作细节精度;要么采用以智能体为中心的动作信号,虽能实现高精度控制却难以在不同领域间迁移。解决方案的关键在于提出“视觉动作提示”(visual action prompts),将复杂高自由度交互动作转化为与领域无关的视觉提示表示,其中选用视觉骨骼(visual skeletons)作为核心表征方式,因其兼具几何精度和跨域适应性;通过从人类-物体交互(HOI)和灵巧机器人操作两类数据源构建鲁棒的骨骼生成管道,实现跨域训练,并借助轻量级微调将视觉骨骼集成至预训练视频生成模型中,从而在保持跨域动态学习的同时,实现对复杂交互动作的精确控制。
链接: https://arxiv.org/abs/2508.13104
作者: Yuang Wang,Chao Wen,Haoyu Guo,Sida Peng,Minghan Qin,Hujun Bao,Xiaowei Zhou,Ruizhen Hu
机构: Xiangjiang Lab; Zhejiang University (浙江大学); Fudan University (复旦大学); Tsinghua University (清华大学); Shenzhen University (深圳大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注: Accepted to ICCV 2025. Project page: this https URL
Abstract:We present visual action prompts, a unified action representation for action-to-video generation of complex high-DoF interactions while maintaining transferable visual dynamics across domains. Action-driven video generation faces a precision-generality trade-off: existing methods using text, primitive actions, or coarse masks offer generality but lack precision, while agent-centric action signals provide precision at the cost of cross-domain transferability. To balance action precision and dynamic transferability, we propose to “render” actions into precise visual prompts as domain-agnostic representations that preserve both geometric precision and cross-domain adaptability for complex actions; specifically, we choose visual skeletons for their generality and accessibility. We propose robust pipelines to construct skeletons from two interaction-rich data sources - human-object interactions (HOI) and dexterous robotic manipulation - enabling cross-domain training of action-driven generative models. By integrating visual skeletons into pretrained video generation models via lightweight fine-tuning, we enable precise action control of complex interaction while preserving the learning of cross-domain dynamics. Experiments on EgoVid, RT-1 and DROID demonstrate the effectiveness of our proposed approach. Project page: this https URL.
zh
[CV-4] Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
【速读】:该论文旨在解决视觉-语言-动作(Vision-Language-Action, VLA)模型在真实环境中泛化能力不足的问题,其根源在于观测空间与动作空间之间的固有不一致性:尽管训练数据来自多样化的相机视角,模型通常仍以机器人基坐标系预测末端执行器位姿,导致空间表示不一致。解决方案的关键在于提出一种观测中心的VLA框架(Observation-Centric VLA, OC-VLA),通过利用相机外参标定矩阵,将末端执行器位姿从机器人基坐标系转换至相机坐标系,从而在不同视角下统一动作预测目标。该方法为轻量级、即插即用策略,无需修改现有VLA架构即可显著提升感知与动作的空间对齐性,增强模型对相机视角变化的鲁棒性,并在仿真与真实机器人操作任务中验证了其加速收敛、提高任务成功率和跨视角泛化能力的优势。
链接: https://arxiv.org/abs/2508.13103
作者: Tianyi Zhang,Haonan Duan,Haoran Hao,Yu Qiao,Jifeng Dai,Zhi Hou
机构: Zhejiang University (浙江大学); Shanghai AI Lab; SenseTime Research; Nanjing University (南京大学); Tsinghua University (清华大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-Language-Action (VLA) models frequently encounter challenges in generalizing to real-world environments due to inherent discrepancies between observation and action spaces. Although training data are collected from diverse camera perspectives, the models typically predict end-effector poses within the robot base coordinate frame, resulting in spatial inconsistencies. To mitigate this limitation, we introduce the Observation-Centric VLA (OC-VLA) framework, which grounds action predictions directly in the camera observation space. Leveraging the camera’s extrinsic calibration matrix, OC-VLA transforms end-effector poses from the robot base coordinate system into the camera coordinate system, thereby unifying prediction targets across heterogeneous viewpoints. This lightweight, plug-and-play strategy ensures robust alignment between perception and action, substantially improving model resilience to camera viewpoint variations. The proposed approach is readily compatible with existing VLA architectures, requiring no substantial modifications. Comprehensive evaluations on both simulated and real-world robotic manipulation tasks demonstrate that OC-VLA accelerates convergence, enhances task success rates, and improves cross-view generalization. The code will be publicly available.
zh
[CV-5] Real-Time Beach Litter Detection and Counting: A Comparative Analysis of RT-DETR Model Variants
【速读】:该论文旨在解决海岸污染监测中人工巡检效率低、难以实现大规模自动化的问题,提出利用生成式 AI(Generative AI)中的实时检测变压器(Real-Time Detection Transformer, RT-DETR)模型实现海滩垃圾的自动识别与计数。其关键解决方案在于采用端到端的Transformer架构,在公开的海岸废弃物数据集上训练两种变体——RT-DETR-Large(RT-DETR-L)和RT-DETR-Extra-Large(RT-DETR-X),并通过对比分析发现:尽管RT-DETR-X在精度上略优(mAP@50为0.816 vs. 0.810),但RT-DETR-L在推理速度上显著更快(20.1 ms vs. 34.5 ms),展现出更优的实时部署可行性,从而为环境监测提供了兼顾准确性和计算效率的技术路径。
链接: https://arxiv.org/abs/2508.13101
作者: Miftahul Huda,Arsyiah Azahra,Putri Maulida Chairani,Dimas Rizky Ramadhani,Nabila Azhari,Ade Lailani
机构: Sumatera Institute of Technology(苏门答腊技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Coastal pollution is a pressing global environmental issue, necessitating scalable and automated solutions for monitoring and management. This study investigates the efficacy of the Real-Time Detection Transformer (RT-DETR), a state-of-the-art, end-to-end object detection model, for the automated detection and counting of beach litter. A rigorous comparative analysis is conducted between two model variants, RT-DETR-Large (RT-DETR-L) and RT-DETR-Extra-Large (RT-DETR-X), trained on a publicly available dataset of coastal debris. The evaluation reveals that the RT-DETR-X model achieves marginally superior accuracy, with a mean Average Precision at 50% IoU (mAP@50) of 0.816 and a mAP@50-95 of 0.612, compared to the RT-DETR-L model’s 0.810 and 0.606, respectively. However, this minor performance gain is realized at a significant computational cost; the RT-DETR-L model demonstrates a substantially faster inference time of 20.1 ms versus 34.5 ms for the RT-DETR-X. The findings suggest that the RT-DETR-L model offers a more practical and efficient solution for real-time, in-field deployment due to its superior balance of processing speed and detection accuracy. This research provides valuable insights into the application of advanced Transformer-based detectors for environmental conservation, highlighting the critical trade-offs between model complexity and operational viability.
zh
[CV-6] DMS:Diffusion-Based Multi-Baseline Stereo Generation for Improving Self-Supervised Depth Estimation
【速读】:该论文旨在解决自监督立体匹配(stereo matching)和单目深度估计(monocular depth estimation)中因光度重建过程中存在歧义而导致的性能瓶颈问题,尤其是目标视角中缺失对应像素区域(如遮挡和视场外区域)所引发的模糊性。其解决方案的关键在于提出一种模型无关的方法 DMS(Diffusion-based Multi-view Synthesis),该方法利用扩散模型(diffusion models)中的几何先验,通过方向性提示(directional prompts)引导生成沿极线方向的新视角图像,从而显式地补充遮挡区域的像素信息,实现更可靠的光度对应关系构建。该方法无需额外标注数据,仅依赖未标记的立体图像对即可完成训练与合成,具有即插即用、零成本的优势,并在多个基准数据集上实现了显著的误差降低(最高达35%的异常值减少)和最先进的性能表现。
链接: https://arxiv.org/abs/2508.13091
作者: Zihua Liu,Yizhou Li,Songyan Zhang,Masatoshi Okutomi
机构: Institute of Science Tokyo (东京科学研究所); Sony Semiconductor Solutions Group (索尼半导体解决方案集团); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While supervised stereo matching and monocular depth estimation have advanced significantly with learning-based algorithms, self-supervised methods using stereo images as supervision signals have received relatively less focus and require further investigation. A primary challenge arises from ambiguity introduced during photometric reconstruction, particularly due to missing corresponding pixels in ill-posed regions of the target view, such as occlusions and out-of-frame areas. To address this and establish explicit photometric correspondences, we propose DMS, a model-agnostic approach that utilizes geometric priors from diffusion models to synthesize novel views along the epipolar direction, guided by directional prompts. Specifically, we finetune a Stable Diffusion model to simulate perspectives at key positions: left-left view shifted from the left camera, right-right view shifted from the right camera, along with an additional novel view between the left and right cameras. These synthesized views supplement occluded pixels, enabling explicit photometric reconstruction. Our proposed DMS is a cost-free, ‘‘plug-and-play’’ method that seamlessly enhances self-supervised stereo matching and monocular depth estimation, and relies solely on unlabeled stereo image pairs for both training and synthesizing. Extensive experiments demonstrate the effectiveness of our approach, with up to 35% outlier reduction and state-of-the-art performance across multiple benchmark datasets.
zh
[CV-7] Checkmate: interpretable and explainable RSVQA is the endgame
【速读】:该论文旨在解决遥感视觉问答(Remote Sensing Visual Question Answering, RSVQA)中模型决策缺乏可解释性和视觉 grounding 的问题,以及因数据集分布偏差导致的捷径学习(shortcut learning)现象。其解决方案的关键在于构建一个名为 Chessboard 的新型 RSVQA 数据集,该数据集通过包含 3,123,253 个问题和平衡的答案分布来最小化偏倚,并将每个答案与图像中的一个或多个像素单元(cell)相关联,从而支持细粒度的视觉推理;在此基础上开发了名为 Checkmate 的可解释、可理解的模型,能够识别对决策最相关的图像区域,显著提升 RSVQA 系统的透明度与可信度。
链接: https://arxiv.org/abs/2508.13086
作者: Lucrezia Tosato,Christel Tartini Chappuis,Syrielle Montariol,Flora Weissgerber,Sylvain Lobry,Devis Tuia
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Remote Sensing Visual Question Answering (RSVQA) presents unique challenges in ensuring that model decisions are both understandable and grounded in visual content. Current models often suffer from a lack of interpretability and explainability, as well as from biases in dataset distributions that lead to shortcut learning. In this work, we tackle these issues by introducing a novel RSVQA dataset, Chessboard, designed to minimize biases through 3’123’253 questions and a balanced answer distribution. Each answer is linked to one or more cells within the image, enabling fine-grained visual reasoning. Building on this dataset, we develop an explainable and interpretable model called Checkmate that identifies the image cells most relevant to its decisions. Through extensive experiments across multiple model architectures, we show that our approach improves transparency and supports more trustworthy decision-making in RSVQA systems. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.13086 [cs.CV] (or arXiv:2508.13086v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.13086 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-8] ID-Card Synthetic Generation: Toward a Simulated Bona fide Dataset
【速读】:该论文旨在解决身份识别证件(ID cards)中 Presentation Attack Detection (PAD) 系统训练数据不足的问题,尤其是真实样本(bona fide images)稀缺以及攻击手段种类日益多样所带来的挑战。传统方法多聚焦于生成攻击样本,而忽视了真实图像的匮乏问题。其解决方案的关键在于利用 Stable Diffusion 模型生成合成的真实图像,从而扩充训练数据集,提升 PAD 检测器的泛化能力;实验表明,这些合成图像在新训练的系统及商用方案中均被正确识别为真实样本,有效改善了检测性能并缓解了数据限制问题。
链接: https://arxiv.org/abs/2508.13078
作者: Qingwen Zeng,Juan E. Tapia,Izan Garcia,Juan M. Espin,Christoph Busch
机构: Hochschule Darmstadt (达姆施塔特应用技术大学); DTU (丹麦技术大学); Facephi company (Facephi公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Nowadays, the development of a Presentation Attack Detection (PAD) system for ID cards presents a challenge due to the lack of images available to train a robust PAD system and the increase in diversity of possible attack instrument species. Today, most algorithms focus on generating attack samples and do not take into account the limited number of bona fide images. This work is one of the first to propose a method for mimicking bona fide images by generating synthetic versions of them using Stable Diffusion, which may help improve the generalisation capabilities of the detector. Furthermore, the new images generated are evaluated in a system trained from scratch and in a commercial solution. The PAD system yields an interesting result, as it identifies our images as bona fide, which has a positive impact on detection performance and data restrictions.
zh
[CV-9] Eyes on the Image: Gaze Supervised Multimodal Learning for Chest X-ray Diagnosis and Report Generation
【速读】:该论文旨在解决胸部X光图像中疾病分类精度不足以及生成的放射学报告缺乏解剖区域感知的问题。解决方案的关键在于提出一个两阶段的多模态框架:第一阶段采用基于注视引导的对比学习架构进行疾病分类,通过融合视觉特征、临床标签、边界框和放射科医生眼动信号,并引入一种结合均方误差(MSE)、KL散度、相关性及质心对齐的新型多目标注视注意力损失函数,显著提升了F1分数和AUC指标;第二阶段构建模块化的报告生成流程,利用置信度加权的诊断关键词提取、基于领域先验构建的解剖区域映射字典以及结构化提示模板,实现区域对齐的句子生成,从而提高报告的临床关键词召回率和ROUGE重叠度,整体增强了模型性能与可解释性。
链接: https://arxiv.org/abs/2508.13068
作者: Tanjim Islam Riju,Shuchismita Anwar,Saman Sarker Joy,Farig Sadeque,Swakkhar Shatabda
机构: Brac University (布拉克大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:We propose a two-stage multimodal framework that enhances disease classification and region-aware radiology report generation from chest X-rays, leveraging the MIMIC-Eye dataset. In the first stage, we introduce a gaze-guided contrastive learning architecture for disease classification. It integrates visual features, clinical labels, bounding boxes, and radiologist eye-tracking signals and is equipped with a novel multi-term gaze-attention loss combining MSE, KL divergence, correlation, and center-of-mass alignment. Incorporating fixations improves F1 score from 0.597 to 0.631 (+5.70%) and AUC from 0.821 to 0.849 (+3.41%), while also improving precision and recall, highlighting the effectiveness of gaze-informed attention supervision. In the second stage, we present a modular report generation pipeline that extracts confidence-weighted diagnostic keywords, maps them to anatomical regions using a curated dictionary constructed from domain-specific priors, and generates region-aligned sentences via structured prompts. This pipeline improves report quality as measured by clinical keyword recall and ROUGE overlap. Our results demonstrate that integrating gaze data improves both classification performance and the interpretability of generated medical reports.
zh
[CV-10] Odo: Depth-Guided Diffusion for Identity-Preserving Body Reshaping
【速读】:该论文旨在解决人体形状编辑(human shape editing)中长期存在的挑战,即在保持姿态、身份、衣物和背景不变的前提下,实现可控且真实的体型变换(如瘦削、肌肉发达或肥胖),而现有方法常因依赖3D可变形模型或图像形变技术而导致比例失真、纹理扭曲及背景不一致等问题。其关键解决方案是提出一个大规模、高质量的数据集Odo-Dataset(包含18,573张图像、1523名受试者),并设计了一种基于扩散模型的端到端方法Odo,该方法融合了冻结的UNet以保留输入图像的精细外观与背景细节,并引入ControlNet通过目标SMPL深度图引导形状变换,从而实现高精度(顶点重建误差低至7.5mm)且逼真的体型重塑效果。
链接: https://arxiv.org/abs/2508.13065
作者: Siddharth Khandelwal,Sridhar Kamath,Arjun Jain
机构: Fast Code AI Pvt. Ltd.
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Human shape editing enables controllable transformation of a person’s body shape, such as thin, muscular, or overweight, while preserving pose, identity, clothing, and background. Unlike human pose editing, which has advanced rapidly, shape editing remains relatively underexplored. Current approaches typically rely on 3D morphable models or image warping, often introducing unrealistic body proportions, texture distortions, and background inconsistencies due to alignment errors and deformations. A key limitation is the lack of large-scale, publicly available datasets for training and evaluating body shape manipulation methods. In this work, we introduce the first large-scale dataset of 18,573 images across 1523 subjects, specifically designed for controlled human shape editing. It features diverse variations in body shape, including fat, muscular and thin, captured under consistent identity, clothing, and background conditions. Using this dataset, we propose Odo, an end-to-end diffusion-based method that enables realistic and intuitive body reshaping guided by simple semantic attributes. Our approach combines a frozen UNet that preserves fine-grained appearance and background details from the input image with a ControlNet that guides shape transformation using target SMPL depth maps. Extensive experiments demonstrate that our method outperforms prior approaches, achieving per-vertex reconstruction errors as low as 7.5mm, significantly lower than the 13.6mm observed in baseline methods, while producing realistic results that accurately match the desired target shapes.
zh
[CV-11] XR-NPE: High-Throughput Mixed-precision SIMD Neural Processing Engine for Extended Reality Perception Workloads
【速读】:该论文旨在解决扩展现实(XR)感知任务中计算效率与能效瓶颈问题,尤其是在资源受限设备上实现高吞吐量、低功耗的神经网络推理。其核心挑战在于传统硬件架构难以支持超低精度计算以降低内存带宽需求,同时保持模型精度和算术强度。解决方案的关键是提出XR-NPE——一种支持多种混合精度格式(包括FP4、Posit (4,1)、Posit (8,0) 和 Posit (16,1))的单指令多数据流(SIMD)神经处理引擎,采用层自适应的混合算法实现策略,在保证最小精度损失的前提下显著减少内存访问开销;并通过可重构尾数乘法与指数处理电路(RMMEC)优化MAC计算单元,结合选择性电源门控技术降低暗硅效应,从而提升算术强度达2.85倍,并在CMOS 28nm工艺下实现面积减少42%、功耗降低38%,相较现有最先进MAC方案具备显著优势。
链接: https://arxiv.org/abs/2508.13049
作者: Tejas Chaudhari,Akarsh J.,Tanushree Dewangan,Mukul Lokhande,Santosh Kumar Vishvakarma
机构: NSDCS Research Group, Dept. of Electrical Engineering, Indian Institute of Technology Indore, India
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:This work proposes XR-NPE, a high-throughput Mixed-precision SIMD Neural Processing Engine, designed for extended reality (XR) perception workloads like visual inertial odometry (VIO), object classification, and eye gaze extraction. XR-NPE is first to support FP4, Posit (4,1), Posit (8,0), and Posit (16,1) formats, with layer adaptive hybrid-algorithmic implementation supporting ultra-low bit precision to significantly reduce memory bandwidth requirements, and accompanied by quantization-aware training for minimal accuracy loss. The proposed Reconfigurable Mantissa Multiplication and Exponent processing Circuitry (RMMEC) reduces dark silicon in the SIMD MAC compute engine, assisted by selective power gating to reduce energy consumption, providing 2.85x improved arithmetic intensity. XR-NPE achieves a maximum operating frequency of 1.72 GHz, area 0.016 mm2 , and arithmetic intensity 14 pJ at CMOS 28nm, reducing 42% area, 38% power compared to the best of state-of-the-art MAC approaches. The proposed XR-NPE based AXI-enabled Matrix-multiplication co-processor consumes 1.4x fewer LUTs, 1.77x fewer FFs, and provides 1.2x better energy efficiency compared to SoTA accelerators on VCU129. The proposed co-processor provides 23% better energy efficiency and 4% better compute density for VIO workloads. XR-NPE establishes itself as a scalable, precision-adaptive compute engine for future resource-constrained XR devices. The complete set for codes for results reproducibility are released publicly, enabling designers and researchers to readily adopt and build upon them. this https URL.
zh
[CV-12] IntelliCap: Intelligent Guidance for Consistent View Sampling
【速读】:该论文旨在解决高质视频合成(novel view synthesis)中输入图像采集阶段的人类操作难题,即如何辅助人类用户在扫描场景时实现均匀且密集的视点采样。现有方法多聚焦于单个物体或忽略视点依赖性材质特性,难以满足复杂场景下的渲染质量需求。其解决方案的关键在于提出一种多尺度情境化可视化技术:通过语义分割与视觉-语言模型(vision-language model)对场景中的对象进行类别识别与重要性排序,生成围绕高优先级对象的球形代理(spherical proxies)作为导航指引,从而引导用户补充关键区域的图像覆盖,显著提升视点依赖外观的表示精度。
链接: https://arxiv.org/abs/2508.13043
作者: Ayaka Yasunaga,Hideo Saito,Dieter Schmalstieg,Shohei Mori
机构: Keio University (庆应义塾大学); University of Stuttgart (斯图加特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: This work is a pre-print version of a paper that has been accepted to the IEEE International Symposium on Mixed and Augmented Reality for future publication. Project Page: this https URL
Abstract:Novel view synthesis from images, for example, with 3D Gaussian splatting, has made great progress. Rendering fidelity and speed are now ready even for demanding virtual reality applications. However, the problem of assisting humans in collecting the input images for these rendering algorithms has received much less attention. High-quality view synthesis requires uniform and dense view sampling. Unfortunately, these requirements are not easily addressed by human camera operators, who are in a hurry, impatient, or lack understanding of the scene structure and the photographic process. Existing approaches to guide humans during image acquisition concentrate on single objects or neglect view-dependent material characteristics. We propose a novel situated visualization technique for scanning at multiple scales. During the scanning of a scene, our method identifies important objects that need extended image coverage to properly represent view-dependent appearance. To this end, we leverage semantic segmentation and category identification, ranked by a vision-language model. Spherical proxies are generated around highly ranked objects to guide the user during scanning. Our results show superior performance in real scenes compared to conventional view sampling strategies.
zh
[CV-13] HierAdaptMR: Cross-Center Cardiac MRI Reconstruction with Hierarchical Feature Adapters MICCAI2025
【速读】:该论文旨在解决深度学习驱动的心脏磁共振成像(Cardiac MRI)重建在多中心临床部署中因扫描仪配置和成像协议异质性导致的域偏移(domain shift)问题。其核心解决方案是提出HierAdaptMR框架,通过分层特征适应机制实现多层级域差异的有效建模:具体包括针对序列特异性特征的协议级适配器(Protocol-Level Adapters)、针对扫描仪依赖性变化的中心级适配器(Center-Level Adapters),以及利用随机训练策略学习中心不变适配的通用适配器(Universal Adapter),从而在保持重建质量的同时显著提升跨中心泛化能力。
链接: https://arxiv.org/abs/2508.13026
作者: Ruru Xu,Ilkay Oksuz
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MICCAI 2025, CMRxRecon2025 Challenge paper
Abstract:Deep learning-based cardiac MRI reconstruction faces significant domain shift challenges when deployed across multiple clinical centers with heterogeneous scanner configurations and imaging protocols. We propose HierAdaptMR, a hierarchical feature adaptation framework that addresses multi-level domain variations through parameter-efficient adapters. Our method employs Protocol-Level Adapters for sequence-specific characteristics and Center-Level Adapters for scanner-dependent variations, built upon a variational unrolling backbone. A Universal Adapter enables generalization to entirely unseen centers through stochastic training that learns center-invariant adaptations. The framework utilizes multi-scale SSIM loss with frequency domain enhancement and contrast-adaptive weighting for robust optimization. Comprehensive evaluation on the CMRxRecon2025 dataset spanning 5+ centers, 10+ scanners, and 9 modalities demonstrates superior cross-center generalization while maintaining reconstruction quality. code: this https URL
zh
[CV-14] EgoTwin: Dreaming Body and View in First Person
【速读】:该论文旨在解决第一人称视角视频(egocentric video)生成中长期被忽视的问题,即如何同时建模第一人称视点内容与由佩戴者身体运动引起的相机轨迹,并确保生成的人体动作与视觉动态之间具有因果一致性。其核心挑战包括:1)视角对齐(Viewpoint Alignment),要求生成视频中的相机轨迹与从人体动作中推导出的头部轨迹精确匹配;2)因果交互(Causal Interplay),要求合成的人体动作在时间上与相邻视频帧的视觉变化保持因果关系。解决方案的关键在于提出EgoTwin框架,基于扩散Transformer架构,引入一种以头部关节为中心的人体运动表示方法(head-centric motion representation),并设计受控制论启发的交互机制,通过注意力机制显式捕捉视频与运动之间的因果关联,从而实现高质量的第一人称视频与人体动作联合生成。
链接: https://arxiv.org/abs/2508.13013
作者: Jingqiao Xiu,Fangzhou Hong,Yicong Li,Mengze Li,Wentao Wang,Sirui Han,Liang Pan,Ziwei Liu
机构: National University of Singapore (新加坡国立大学); Nanyang Technological University (南洋理工大学); Hong Kong University of Science and Technology (香港科技大学); Shanghai AI Laboratory (上海人工智能实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer’s body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
zh
[CV-15] Matrix-Game 2.0: An Open-Source Real-Time and Streaming Interactive World Model
【速读】:该论文旨在解决现有交互式世界模型在实时性方面的瓶颈问题,即当前基于扩散模型的交互式世界模型依赖双向注意力机制和较长的推理步骤,难以满足现实场景中对即时反馈的需求。其解决方案的关键在于提出Matrix-Game 2.0框架,通过三个核心组件实现高效、低延迟的视频生成:(1) 基于Unreal Engine和GTA5环境的大规模视频数据生产流水线(约1200小时),包含多样化的交互标注;(2) 动作注入模块,支持帧级鼠标与键盘输入作为交互条件;(3) 基于因果架构的少步蒸馏方法,使模型能够在少量推理步骤内完成自回归扩散,从而实现实时流式视频生成,达到25 FPS的超高速度并生成高质量分钟级视频。
链接: https://arxiv.org/abs/2508.13009
作者: Xianglong He,Chunli Peng,Zexiang Liu,Boyang Wang,Yifan Zhang,Qi Cui,Fei Kang,Biao Jiang,Mengyin An,Yangyang Ren,Baixin Xu,Hao-Xiang Guo,Kaixiong Gong,Cyrus Wu,Wei Li,Xuchen Song,Yang Liu,Eric Li,Yahui Zhou
机构: Skywork AI
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project Page: this https URL
Abstract:Recent advances in interactive video generations have demonstrated diffusion model’s potential as world models by capturing complex physical dynamics and interactive behaviors. However, existing interactive world models depend on bidirectional attention and lengthy inference steps, severely limiting real-time performance. Consequently, they are hard to simulate real-world dynamics, where outcomes must update instantaneously based on historical context and current actions. To address this, we present Matrix-Game 2.0, an interactive world model generates long videos on-the-fly via few-step auto-regressive diffusion. Our framework consists of three key components: (1) A scalable data production pipeline for Unreal Engine and GTA5 environments to effectively produce massive amounts (about 1200 hours) of video data with diverse interaction annotations; (2) An action injection module that enables frame-level mouse and keyboard inputs as interactive conditions; (3) A few-step distillation based on the casual architecture for real-time and streaming video generation. Matrix Game 2.0 can generate high-quality minute-level videos across diverse scenes at an ultra-fast speed of 25 FPS. We open-source our model weights and codebase to advance research in interactive world modeling.
zh
[CV-16] SlimComm: Doppler-Guided Sparse Queries for Bandwidth-Efficient Cooperative 3-D Perception ICCV
【速读】:该论文旨在解决协同感知中因共享密集的鸟瞰图(Bird’s-Eye-View, BEV)特征图而导致车联网通信带宽过载的问题。其解决方案的关键在于提出了一种名为SlimComm的通信高效框架,该框架融合4D雷达多普勒信息并采用查询驱动的稀疏通信机制:通过构建以运动为中心的动态地图区分动静态目标,并生成两类查询——(i)针对动态且高置信度区域的参考查询,以及(ii)通过两阶段偏移探测遮挡区域的探索性查询;仅传输与查询相关的BEV特征并通过多尺度门控可变形注意力进行融合,从而显著降低通信负载,同时保持感知精度。
链接: https://arxiv.org/abs/2508.13007
作者: Melih Yazgan,Qiyuan Wu,Iramm Hamdard,Shiqi Li,J. Marius Zoellner
机构: FZI Research Center for Information Technology (信息科技研究中心); Karlsruhe Institute of Technology (卡尔斯鲁厄理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICCV - Drive2X Workshop
Abstract:Collaborative perception allows connected autonomous vehicles (CAVs) to overcome occlusion and limited sensor range by sharing intermediate features. Yet transmitting dense Bird’s-Eye-View (BEV) feature maps can overwhelm the bandwidth available for inter-vehicle communication. We present SlimComm, a communication-efficient framework that integrates 4D radar Doppler with a query-driven sparse scheme. SlimComm builds a motion-centric dynamic map to distinguish moving from static objects and generates two query types: (i) reference queries on dynamic and high-confidence regions, and (ii) exploratory queries probing occluded areas via a two-stage offset. Only query-specific BEV features are exchanged and fused through multi-scale gated deformable attention, reducing payload while preserving accuracy. For evaluation, we release OPV2V-R and Adver-City-R, CARLA-based datasets with per-point Doppler radar. SlimComm achieves up to 90% lower bandwidth than full-map sharing while matching or surpassing prior baselines across varied traffic densities and occlusions. Dataset and code will be available at: this https URL.
zh
[CV-17] Empirical Evidences for the Effects of Feature Diversity in Open Set Recognition and Continual Learning
【速读】:该论文旨在解决开放集识别(Open Set Recognition, OSR)与持续学习(Continual Learning)两大挑战,前者关注推理阶段对未知类别的检测能力,后者则强调模型在不遗忘旧知识的前提下整合新类别数据的能力。其解决方案的关键在于通过增强特征多样性(feature diversity)来同时提升两个任务的性能:实证表明,增加特征多样性不仅有助于更准确地识别开放集样本,还能有效促进模型对历史数据的保留与对新数据的融合,从而为相关领域提供可借鉴的实践方法和理论依据。
链接: https://arxiv.org/abs/2508.13005
作者: Jiawen Xu,Odej Kao
机构: TU Berlin (柏林工业大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Open set recognition (OSR) and continual learning are two critical challenges in machine learning, focusing respectively on detecting novel classes at inference time and updating models to incorporate the new classes. While many recent approaches have addressed these problems, particularly OSR, by heuristically promoting feature diversity, few studies have directly examined the role that feature diversity plays in tackling them. In this work, we provide empirical evidence that enhancing feature diversity improves the recognition of open set samples. Moreover, increased feature diversity also facilitates both the retention of previously learned data and the integration of new data in continual learning. We hope our findings can inspire further research into both practical methods and theoretical understanding in these domains.
zh
[CV-18] Omni Survey for Multimodality Analysis in Visual Object Tracking
【速读】:该论文旨在解决多模态视觉目标跟踪(Multi-modal Visual Object Tracking, MMVOT)中的关键挑战,包括多模态数据的采集、模态对齐与标注、模型设计及评估方法等问题。其解决方案的核心在于系统性地梳理和分类现有MMVOT方法,依据可见光(RGB)与辅助模态(如热红外T、深度D、事件E、近红外NIR、语言L或声纳S)之间的处理方式,将方法分为基于复制或非复制实验配置的两种范式,并首次从对象类别分布角度分析了现有数据集的长尾特性及动物类别的缺失问题,从而为未来研究提供理论基础与实践指导。
链接: https://arxiv.org/abs/2508.13000
作者: Zhangyong Tang,Tianyang Xu,Xuefeng Zhu,Hui Li,Shaochuan Zhao,Tao Zhou,Chunyang Cheng,Xiaojun Wu,Josef Kittler
机构: Jiangnan University (江南大学); China University of Mining and Technology (中国矿业大学); Nanjing University of Science and Technology (南京理工大学); University of Surrey (萨里大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The first comprehensive survey for multi-modal visual object tracking; 6 multi-modal tasks; 338 references
Abstract:The development of smart cities has led to the generation of massive amounts of multi-modal data in the context of a range of tasks that enable a comprehensive monitoring of the smart city infrastructure and services. This paper surveys one of the most critical tasks, multi-modal visual object tracking (MMVOT), from the perspective of multimodality analysis. Generally, MMVOT differs from single-modal tracking in four key aspects, data collection, modality alignment and annotation, model designing, and evaluation. Accordingly, we begin with an introduction to the relevant data modalities, laying the groundwork for their integration. This naturally leads to a discussion of challenges of multi-modal data collection, alignment, and annotation. Subsequently, existing MMVOT methods are categorised, based on different ways to deal with visible (RGB) and X modalities: programming the auxiliary X branch with replicated or non-replicated experimental configurations from the RGB branch. Here X can be thermal infrared (T), depth (D), event (E), near infrared (NIR), language (L), or sonar (S). The final part of the paper addresses evaluation and benchmarking. In summary, we undertake an omni survey of all aspects of multi-modal visual object tracking (VOT), covering six MMVOT tasks and featuring 338 references in total. In addition, we discuss the fundamental rhetorical question: Is multi-modal tracking always guaranteed to provide a superior solution to unimodal tracking with the help of information fusion, and if not, in what circumstances its application is beneficial. Furthermore, for the first time in this field, we analyse the distributions of the object categories in the existing MMVOT datasets, revealing their pronounced long-tail nature and a noticeable lack of animal categories when compared with RGB datasets.
zh
[CV-19] Vitamin N: Benefits of Different Forms of Public Greenery for Urban Health
【速读】:该论文试图解决现有城市绿地与健康关系研究中结果不一致的问题,其核心在于传统官方绿地指标仅衡量绿地的数量或可达性,而忽略了人们在日常生活中实际看到或使用绿地的频率。解决方案的关键在于提出一种新的分类方法,将街道上的绿地(on-road greenery)与需要计划访问的非道路绿地(off-road greenery)区分开来,并结合高分辨率遥感影像、OpenStreetMap数据、超过10万张Google街景图像中的绿地量化信息以及基于16万条道路段的可达性估计,构建更贴近居民日常体验的绿地度量体系。这一新指标显著优于现有常用官方指标,揭示了日常可见绿地对健康的更强关联性。
链接: https://arxiv.org/abs/2508.12998
作者: Sanja Šćepanović,Sagar Joglekar,Stephen Law,Daniele Quercia,Ke Zhou,Alice Battiston,Rossano Schifanella
机构: Nokia Bell Labs (诺基亚贝尔实验室); University College London (伦敦大学学院); University of Turin (都灵大学)
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Urban greenery is often linked to better health, yet findings from past research have been inconsistent. One reason is that official greenery metrics measure the amount or nearness of greenery but ignore how often people actually may potentially see or use it in daily life. To address this gap, we introduced a new classification that separates on-road greenery, which people see while walking through streets, from off-road greenery, which requires planned visits. We did so by combining aerial imagery of Greater London and greenery data from OpenStreetMap with quantified greenery from over 100,000 Google Street View images and accessibility estimates based on 160,000 road segments. We linked these measures to 7.45 billion medical prescriptions issued by the National Health Service and processed through our methodology. These prescriptions cover five conditions: diabetes, hypertension, asthma, depression, and anxiety, as well as opioid use. As hypothesized, we found that green on-road was more strongly linked to better health than four widely used official measures. For example, hypertension prescriptions dropped by 3.68% in wards with on-road greenery above the median citywide level compared to those below it. If all below-median wards reached the citywide median in on-road greenery, prescription costs could fall by up to £3.15 million each year. These results suggest that greenery seen in daily life may be more relevant than public yet secluded greenery, and that official metrics commonly used in the literature have important limitations.
zh
[CV-20] Dextr: Zero-Shot Neural Architecture Search with Singular Value Decomposition and Extrinsic Curvature
【速读】:该论文旨在解决零样本神经架构搜索(Zero-shot Neural Architecture Search, Zero-shot NAS)中两个关键问题:一是现有零成本代理(zero-cost proxy)通常依赖标签数据,而实际应用中标签数据往往不可获得;二是当前方法多仅关注网络收敛性或泛化性能,或仅考虑表达能力(expressivity),未能统一优化多个关键属性。解决方案的关键在于提出一种无需标签数据的新型零成本代理,其核心思想是通过分析神经网络层特征的奇异值分解(Singular Value Decomposition, SVD)和网络输出的外曲率(extrinsic curvature)来联合建模收敛性、泛化性和表达能力。具体而言,该代理将特征条件数的倒数之和与外曲率的对数以简化调和平均形式结合,从而仅用单个无标签样本即可准确预测网络在测试数据上的性能表现。
链接: https://arxiv.org/abs/2508.12977
作者: Rohan Asthana,Joschua Conrad,Maurits Ortmanns,Vasileios Belagiannis
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at Transactions on Machine Learning Research (TMLR)
Abstract:Zero-shot Neural Architecture Search (NAS) typically optimises the architecture search process by exploiting the network or gradient properties at initialisation through zero-cost proxies. The existing proxies often rely on labelled data, which is usually unavailable in real-world settings. Furthermore, the majority of the current methods focus either on optimising the convergence and generalisation attributes or solely on the expressivity of the network architectures. To address both limitations, we first demonstrate how channel collinearity affects the convergence and generalisation properties of a neural network. Then, by incorporating the convergence, generalisation and expressivity in one approach, we propose a zero-cost proxy that omits the requirement of labelled data for its computation. In particular, we leverage the Singular Value Decomposition (SVD) of the neural network layer features and the extrinsic curvature of the network output to design our proxy. %As a result, the proposed proxy is formulated as the simplified harmonic mean of the logarithms of two key components: the sum of the inverse of the feature condition number and the extrinsic curvature of the network output. Our approach enables accurate prediction of network performance on test data using only a single label-free data sample. Our extensive evaluation includes a total of six experiments, including the Convolutional Neural Network (CNN) search space, i.e. DARTS and the Transformer search space, i.e. AutoFormer. The proposed proxy demonstrates a superior performance on multiple correlation benchmarks, including NAS-Bench-101, NAS-Bench-201, and TransNAS-Bench-101-micro; as well as on the NAS task within the DARTS and the AutoFormer search space, all while being notably efficient. The code is available at this https URL.
zh
[CV-21] Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation
【速读】:该论文旨在解决基于Transformer的视频生成模型中自注意力机制(self-attention mechanism)计算开销过大这一关键问题,尤其是在生成超长视频序列时,现有方法如分解注意力(factorized attention)和固定稀疏模式无法充分挖掘视频数据中的时空冗余信息。其解决方案的关键在于提出一种面向硬件加速的紧凑注意力(Compact Attention)框架,通过三项核心创新实现:1)自适应分块策略(adaptive tiling strategies),动态分组以逼近多样的空间交互模式;2)时变窗口机制(temporally varying windows),根据帧间距离调整稀疏程度;3)自动化配置搜索算法,优化稀疏模式的同时保留关键注意力路径。该方法在单GPU环境下实现1.6~2.5倍的注意力计算加速,同时保持与全连接注意力基线相当的视觉质量,为高效长视频生成提供了结构化稀疏利用的系统性方法。
链接: https://arxiv.org/abs/2508.12969
作者: Qirui Li,Guangcong Zheng,Qi Zhao,Jie Li,Bin Dong,Yiwu Yao,Xi Li
机构: Zhejiang University (浙江大学); Huawei Technologies (华为技术有限公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The computational demands of self-attention mechanisms pose a critical challenge for transformer-based video generation, particularly in synthesizing ultra-long sequences. Current approaches, such as factorized attention and fixed sparse patterns, fail to fully exploit the inherent spatio-temporal redundancies in video data. Through systematic analysis of video diffusion transformers (DiT), we uncover a key insight: Attention matrices exhibit structured, yet heterogeneous sparsity patterns, where specialized heads dynamically attend to distinct spatiotemporal regions (e.g., local pattern, cross-shaped pattern, or global pattern). Existing sparse attention methods either impose rigid constraints or introduce significant overhead, limiting their effectiveness. To address this, we propose Compact Attention, a hardware-aware acceleration framework featuring three innovations: 1) Adaptive tiling strategies that approximate diverse spatial interaction patterns via dynamic tile grouping, 2) Temporally varying windows that adjust sparsity levels based on frame proximity, and 3) An automated configuration search algorithm that optimizes sparse patterns while preserving critical attention pathways. Our method achieves 1.6~2.5x acceleration in attention computation on single-GPU setups while maintaining comparable visual quality with full-attention baselines. This work provides a principled approach to unlocking efficient long-form video generation through structured sparsity exploitation. Project Page: this https URL
zh
[CV-22] GazeDETR: Gaze Detection using Disentangled Head and Gaze Representations
【速读】:该论文旨在解决现有端到端 gaze 目标检测模型在同时进行人头定位与注视点预测时,因共享统一表征而导致的多任务信息纠缠问题。其核心解决方案是提出 GazeDETR 架构,采用两个解耦的解码器分别独立学习人头定位和注视点预测的特征表示,并通过协同注意力机制有效利用各自的任务相关感知区域;具体而言,人头预测模块主要依赖局部信息,而注视点解码器则融合局部与全局信息,从而实现更精准的多任务建模,最终在 GazeFollow、VideoAttentionTarget 和 ChildPlay 数据集上达到当前最优性能。
链接: https://arxiv.org/abs/2508.12966
作者: Ryan Anthony Jalova de Belen,Gelareh Mohammadi,Arcot Sowmya
机构: University of New South Wales (新南威尔士大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Gaze communication plays a crucial role in daily social interactions. Quantifying this behavior can help in human-computer interaction and digital phenotyping. While end-to-end models exist for gaze target detection, they only utilize a single decoder to simultaneously localize human heads and predict their corresponding gaze (e.g., 2D points or heatmap) in a scene. This multitask learning approach generates a unified and entangled representation for human head localization and gaze location prediction. Herein, we propose GazeDETR, a novel end-to-end architecture with two disentangled decoders that individually learn unique representations and effectively utilize coherent attentive fields for each subtask. More specifically, we demonstrate that its human head predictor utilizes local information, while its gaze decoder incorporates both local and global information. Our proposed architecture achieves state-of-the-art results on the GazeFollow, VideoAttentionTarget and ChildPlay datasets. It outperforms existing end-to-end models with a notable margin.
zh
[CV-23] Multi-Phase Automated Segmentation of Dental Structures in CBCT Using a Lightweight Auto3DSeg and SegResNet Implementation MICCAI
【速读】:该论文旨在解决锥形束计算机断层扫描(Cone-beam Computed Tomography, CBCT)中牙齿结构的自动化分割问题,以支持牙科诊断、治疗规划及头颈部肿瘤患者的放射治疗计划制定。其关键解决方案在于构建一个基于深度学习的多类牙齿分割流程:首先利用MONAI Auto3DSeg框架与3D SegResNet模型,在包含63例CBCT扫描数据的子集上采用5折交叉验证进行训练;随后通过图像重采样至0.6 mm各向同性分辨率和强度截断等预处理步骤提升模型性能;进一步采用多标签STAPLE(Multi-Label STAPLE)集成融合方法对五折预测结果进行整合,获得初步分割结果,并在此基础上对下颌骨区域进行精确裁剪后执行第二阶段细粒度神经结构分割。最终在ToothFairy3挑战赛的外部验证集上实现了平均Dice系数0.87,表明该方法在临床应用中具备高精度和可靠性,有助于提升放疗精准性和患者护理质量。
链接: https://arxiv.org/abs/2508.12962
作者: Dominic LaBella,Keshav Jha,Jared Robbins,Esther Yu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: MICCAI. ToothFairy3, 16 pages, 5 figures, 1 table
Abstract:Cone-beam computed tomography (CBCT) has become an invaluable imaging modality in dentistry, enabling 3D visualization of teeth and surrounding structures for diagnosis and treatment planning. Automated segmentation of dental structures in CBCT can efficiently assist in identifying pathology (e.g., pulpal or periapical lesions) and facilitate radiation therapy planning in head and neck cancer patients. We describe the DLaBella29 team’s approach for the MICCAI 2025 ToothFairy3 Challenge, which involves a deep learning pipeline for multi-class tooth segmentation. We utilized the MONAI Auto3DSeg framework with a 3D SegResNet architecture, trained on a subset of the ToothFairy3 dataset (63 CBCT scans) with 5-fold cross-validation. Key preprocessing steps included image resampling to 0.6 mm isotropic resolution and intensity clipping. We applied an ensemble fusion using Multi-Label STAPLE on the 5-fold predictions to infer a Phase 1 segmentation and then conducted tight cropping around the easily segmented Phase 1 mandible to perform Phase 2 segmentation on the smaller nerve structures. Our method achieved an average Dice of 0.87 on the ToothFairy3 challenge out-of-sample validation set. This paper details the clinical context, data preparation, model development, results of our approach, and discusses the relevance of automated dental segmentation for improving patient care in radiation oncology.
zh
[CV-24] Breaking Reward Collapse: Adaptive Reinforcement for Open-ended Medical Reasoning with Enhanced Semantic Discrimination
【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)在医学影像领域应用中面临的两大关键问题:一是现有基于规则的奖励机制主要局限于闭式视觉问答(Closed-ended Visual Question Answering, VQA),难以满足临床实践中开放式的复杂推理需求;二是现有语义引导的RL方法常因奖励函数缺乏区分度而导致奖励塌陷(reward collapse),即语义差异显著的回答获得相似评分,从而削弱模型优化效果。解决方案的关键在于提出一种名为ARMed(Adaptive Reinforcement for Medical Reasoning)的新颖强化学习框架:首先通过链式思维(Chain-of-Thought, CoT)监督微调(Supervised Fine-Tuning, SFT)注入领域知识,随后引入文本正确性奖励与自适应语义奖励相结合的策略,显著提升奖励的判别能力,从而增强模型在开放医学VQA任务中的推理质量与泛化性能。
链接: https://arxiv.org/abs/2508.12957
作者: Yizhou Liu,Jingwei Wei,Zizhi Chen,Minghao Han,Xukun Zhang,Keliang Liu,Lihua Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reinforcement learning (RL) with rule-based rewards has demonstrated strong potential in enhancing the reasoning and generalization capabilities of vision-language models (VLMs) and large language models (LLMs), while reducing computational overhead. However, its application in medical imaging remains underexplored. Existing reinforcement fine-tuning (RFT) approaches in this domain primarily target closed-ended visual question answering (VQA), limiting their applicability to real-world clinical reasoning. In contrast, open-ended medical VQA better reflects clinical practice but has received limited attention. While some efforts have sought to unify both formats via semantically guided RL, we observe that model-based semantic rewards often suffer from reward collapse, where responses with significant semantic differences receive similar scores. To address this, we propose ARMed (Adaptive Reinforcement for Medical Reasoning), a novel RL framework for open-ended medical VQA. ARMed first incorporates domain knowledge through supervised fine-tuning (SFT) on chain-of-thought data, then applies reinforcement learning with textual correctness and adaptive semantic rewards to enhance reasoning quality. We evaluate ARMed on six challenging medical VQA benchmarks. Results show that ARMed consistently boosts both accuracy and generalization, achieving a 32.64% improvement on in-domain tasks and an 11.65% gain on out-of-domain benchmarks. These results highlight the critical role of reward discriminability in medical RL and the promise of semantically guided rewards for enabling robust and clinically meaningful multimodal reasoning.
zh
[CV-25] MaskSem: Semantic-Guided Masking for Learning 3D Hybrid High-Order Motion Representation IROS2025
【速读】:该论文旨在解决自监督骨架动作识别中现有方法对关节选择受限及低阶运动模式建模不足的问题,从而限制了模型对复杂运动模式的理解能力。其解决方案的关键在于提出一种语义引导的掩码机制(MaskSem),通过基于相对运动的Grad-CAM生成注意力图,指导对最具语义信息的时序区域进行掩码,促使模型挖掘更具判别性的特征;同时引入混合高阶运动作为重建目标,融合低阶速度与高阶加速度信息,以更全面地刻画动态运动过程,提升模型对多阶运动模式的学习能力。
链接: https://arxiv.org/abs/2508.12948
作者: Wei Wei,Shaojie Zhang,Yonghao Dang,Jianqin Yin
机构: Beijing University of Posts and Telecommunications (北京邮电大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to IROS 2025
Abstract:Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model’s ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. This novel framework leverages Grad-CAM based on relative motion to guide the masking of joints, which can be represented as the most semantically rich temporal orgions. The semantic-guided masking process can encourage the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion velocity and high-order motion acceleration are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model’s understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for applications in human-robot interaction.
zh
[CV-26] Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models
【速读】:该论文旨在解决视频重照明(video relighting)问题,即在不破坏前景物体原始属性(如反照率 albedo)的前提下,将背景替换为新场景的同时,实现前景光照的一致性调整与和谐融合。其核心挑战在于如何在时间帧之间保持光照一致性并精确保留前景特征。解决方案的关键在于构建一个大规模混合数据集(包含真实与合成视频),并通过联合训练策略充分利用合成视频的物理一致性与真实视频的分布泛化能力;进一步引入域感知适配器(domain-aware adapter)以解耦光照编辑与域外观分布的学习,从而实现端到端的高质量视频重照明。
链接: https://arxiv.org/abs/2508.12945
作者: Jianshu Zeng,Yuxuan Liu,Yutong Feng,Chenxuan Miao,Zixiang Gao,Jiwang Qu,Jianzhang Zhang,Bin Wang,Kun Yuan
机构: Peking University (北京大学); Kunbyte AI; University of Chinese Academy of Sciences (中国科学院大学); Zhejiang University (浙江大学); Hangzhou Normal University (杭州师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 7 figures
Abstract:Video relighting is a challenging yet valuable task, aiming to replace the background in videos while correspondingly adjusting the lighting in the foreground with harmonious blending. During translation, it is essential to preserve the original properties of the foreground, e.g., albedo, and propagate consistent relighting among temporal frames. In this paper, we propose Lumen, an end-to-end video relighting framework developed on large-scale video generative models, receiving flexible textual description for instructing the control of lighting and background. Considering the scarcity of high-qualified paired videos with the same foreground in various lighting conditions, we construct a large-scale dataset with a mixture of realistic and synthetic videos. For the synthetic domain, benefiting from the abundant 3D assets in the community, we leverage advanced 3D rendering engine to curate video pairs in diverse environments. For the realistic domain, we adapt a HDR-based lighting simulation to complement the lack of paired in-the-wild videos. Powered by the aforementioned dataset, we design a joint training curriculum to effectively unleash the strengths of each domain, i.e., the physical consistency in synthetic videos, and the generalized domain distribution in realistic videos. To implement this, we inject a domain-aware adapter into the model to decouple the learning of relighting and domain appearance distribution. We construct a comprehensive benchmark to evaluate Lumen together with existing methods, from the perspectives of foreground preservation and video consistency assessment. Experimental results demonstrate that Lumen effectively edit the input into cinematic relighted videos with consistent lighting and strict foreground preservation. Our project page: this https URL
zh
[CV-27] Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data MICCAI2025
【速读】:该论文旨在解决大尺度解剖示踪数据中纤维束自动分割的难题,尤其是传统人工标注耗时费力、现有自动化方法易遗漏稀疏纤维束且需复杂跨切片后处理的问题。其解决方案的关键在于提出了一种基于U-Net架构的端到端全自动框架,通过采用大感受野补丁(large patch sizes)、前景感知采样(foreground aware sampling)以及半监督预训练(semisupervised pre-training),显著提升了对稀疏纤维束的检测能力(提升超20%),并将假阳性发现率(False Discovery Rate, FDR)降低40%,同时支持独立切片分析,从而增强了方法的灵活性与可扩展性。
链接: https://arxiv.org/abs/2508.12942
作者: Kyriaki-Margarita Bintsi,Yaël Balbastre,Jingjing Wu,Julia F. Lehman,Suzanne N. Haber,Anastasia Yendiki
机构: 11; 22; 33; 44
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: Accepted at CDMRI, MICCAI 2025
Abstract:Anatomic tracer studies are critical for validating and improving diffusion MRI (dMRI) tractography. However, large-scale analysis of data from such studies is hampered by the labor-intensive process of annotating fiber bundles manually on histological slides. Existing automated methods often miss sparse bundles or require complex post-processing across consecutive sections, limiting their flexibility and generalizability. We present a streamlined, fully automated framework for fiber bundle segmentation in macaque tracer data, based on a U-Net architecture with large patch sizes, foreground aware sampling, and semisupervised pre-training. Our approach eliminates common errors such as mislabeling terminals as bundles, improves detection of sparse bundles by over 20% and reduces the False Discovery Rate (FDR) by 40% compared to the state-of-the-art, all while enabling analysis of standalone slices. This new framework will facilitate the automated analysis of anatomic tracing data at a large scale, generating more ground-truth data that can be used to validate and optimize dMRI tractography methods.
zh
[CV-28] SEDEG:Sequential Enhancement of Decoder and Encoders Generality for Class Incremental Learning with Small Memory ICONIP2025
【速读】:该论文旨在解决增量学习(Incremental Learning)中因灾难性遗忘(Catastrophic Forgetting)导致长期知识退化的问题,尤其在小内存场景下现有方法性能显著下降。其核心挑战在于如何同时提升编码器(Encoder)和解码器(Decoder)的泛化能力,以平衡新旧知识的保留与适应性。解决方案的关键是提出一个两阶段训练框架SEDEG:第一阶段通过特征增强(Feature Boosting)训练集成编码器(Ensembled Encoder),以学习更具泛化性的表征并优化解码器及分类器的平衡;第二阶段采用知识蒸馏(Knowledge Distillation, KD)策略,结合平衡KD与特征KD,压缩集成编码器并生成新的、更泛化的编码器,从而有效迁移历史知识并提升模型鲁棒性。
链接: https://arxiv.org/abs/2508.12932
作者: Hongyang Chen,Shaoling Pu,Lingyu Zheng,Zhongwu Sun
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted by ICONIP2025
Abstract:In incremental learning, enhancing the generality of knowledge is crucial for adapting to dynamic data inputs. It can develop generalized representations or more balanced decision boundaries, preventing the degradation of long-term knowledge over time and thus mitigating catastrophic forgetting. Some emerging incremental learning methods adopt an encoder-decoder architecture and have achieved promising results. In the encoder-decoder achitecture, improving the generalization capabilities of both the encoder and decoder is critical, as it helps preserve previously learned knowledge while ensuring adaptability and robustness to new, diverse data inputs. However, many existing continual methods focus solely on enhancing one of the two components, which limits their effectiveness in mitigating catastrophic forgetting. And these methods perform even worse in small-memory scenarios, where only a limited number of historical samples can be stored. To mitigate this limitation, we introduces SEDEG, a two-stage training framework for vision transformers (ViT), focusing on sequentially improving the generality of both Decoder and Encoder. Initially, SEDEG trains an ensembled encoder through feature boosting to learn generalized representations, which subsequently enhance the decoder’s generality and balance the classifier. The next stage involves using knowledge distillation (KD) strategies to compress the ensembled encoder and develop a new, more generalized encoder. This involves using a balanced KD approach and feature KD for effective knowledge transfer. Extensive experiments on three benchmark datasets show SEDEG’s superior performance, and ablation studies confirm the efficacy of its components. The code is available at this https URL.
zh
[CV-29] owards High-Resolution Industrial Image Anomaly Detection
【速读】:该论文旨在解决高分辨率图像中异常检测精度不足的问题,尤其是在工业场景下,传统下采样方法常因丢失细粒度判别信息而导致微小异常区域漏检,而现有轻量级网络或简单图像分块集成策略在准确性和效率上仍难以满足实际需求。其解决方案的关键在于提出一种通用的高分辨率异常检测框架HiAD,该框架采用双分支结构以跨尺度融合异常线索,从而同时捕捉细微与大尺度异常;并引入多分辨率特征融合策略以应对高分辨率图像中细粒度纹理变化带来的挑战;此外,通过构建检测器池(detector pool)与多种分配策略相结合的方式,根据局部图像块特征自适应选择检测器,在保证检测性能的同时有效控制计算开销,实现高效且鲁棒的高分辨率异常检测。
链接: https://arxiv.org/abs/2508.12931
作者: Ximiao Zhang,Min Xu,Xiuzhuang Zhou
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Capital Normal University (首都师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Current anomaly detection methods primarily focus on low-resolution scenarios. For high-resolution images, conventional downsampling often results in missed detections of subtle anomalous regions due to the loss of fine-grained discriminative information. Despite some progress, recent studies have attempted to improve detection resolution by employing lightweight networks or using simple image tiling and ensemble methods. However, these approaches still struggle to meet the practical demands of industrial scenarios in terms of detection accuracy and efficiency. To address the above issues, we propose HiAD, a general framework for high-resolution anomaly detection. HiAD is capable of detecting anomalous regions of varying sizes in high-resolution images under limited computational resources. Specifically, HiAD employs a dual-branch architecture that integrates anomaly cues across different scales to comprehensively capture both subtle and large-scale anomalies. Furthermore, it incorporates a multi-resolution feature fusion strategy to tackle the challenges posed by fine-grained texture variations in high-resolution images. To enhance both adaptability and efficiency, HiAD utilizes a detector pool in conjunction with various detector assignment strategies, enabling detectors to be adaptively assigned based on patch features, ensuring detection performance while effectively controlling computational costs. We conduct extensive experiments on our specifically constructed high-resolution anomaly detection benchmarks, including MVTec-HD, VisA-HD, and the real-world benchmark RealIAD-HD, demonstrating the superior performance of HiAD. The code is available at this https URL.
zh
[CV-30] 7Bench: a Comprehensive Benchmark for Layout-guided Text-to-image Models
【速读】:该论文旨在解决布局引导的文本到图像生成模型中,语义对齐(text alignment)与空间对齐(layout alignment)缺乏联合评估的问题。当前基准测试仅关注文本一致性,忽视了布局引导生成中至关重要的空间准确性,这在合成数据等应用场景中会导致噪声引入和数据质量下降。解决方案的关键在于提出首个综合性评估基准7Bench,其包含七种具有挑战性的场景,系统性地评估对象生成、颜色保真度、属性识别、物体间关系及空间控制等任务,并创新性地引入布局对齐分数(layout alignment score),以量化空间精度,从而全面衡量模型在语义和空间维度上的表现。
链接: https://arxiv.org/abs/2508.12919
作者: Elena Izzo,Luca Parolari,Davide Vezzaro,Lamberto Ballan
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICIAP 2025
Abstract:Layout-guided text-to-image models offer greater control over the generation process by explicitly conditioning image synthesis on the spatial arrangement of elements. As a result, their adoption has increased in many computer vision applications, ranging from content creation to synthetic data generation. A critical challenge is achieving precise alignment between the image, textual prompt, and layout, ensuring semantic fidelity and spatial accuracy. Although recent benchmarks assess text alignment, layout alignment remains overlooked, and no existing benchmark jointly evaluates both. This gap limits the ability to evaluate a model’s spatial fidelity, which is crucial when using layout-guided generation for synthetic data, as errors can introduce noise and degrade data quality. In this work, we introduce 7Bench, the first benchmark to assess both semantic and spatial alignment in layout-guided text-to-image generation. It features text-and-layout pairs spanning seven challenging scenarios, investigating object generation, color fidelity, attribute recognition, inter-object relationships, and spatial control. We propose an evaluation protocol that builds on existing frameworks by incorporating the layout alignment score to assess spatial accuracy. Using 7Bench, we evaluate several state-of-the-art diffusion models, uncovering their respective strengths and limitations across diverse alignment tasks. The benchmark is available at this https URL.
zh
[CV-31] CMF-IoU: Multi-Stage Cross-Modal Fusion 3D Object Detection with IoU Joint Prediction
【速读】:该论文旨在解决多模态3D检测中因相机与激光雷达(LiDAR)信息在空间和语义层面难以对齐而导致的特征提取不足与性能受限问题。其解决方案的关键在于提出了一种多阶段跨模态融合框架CMF-IOU,通过深度补全网络将图像像素映射至3D空间生成伪点,统一两种传感器的数据表示;设计双分支3D骨干网络——稀疏到远距离(S2D)分支增强稀疏LiDAR点云表示,残差视图一致性(ResVC)分支利用3D与2D卷积协同降低伪点误差影响;进一步引入迭代体素-点感知细粒度池化模块,在候选框精修阶段同时捕捉LiDAR的空间结构与伪点的纹理信息,并结合IoU联合预测分支与新型候选框生成策略,实现高IoU与高分类置信度的边界框保留,从而显著提升检测精度。
链接: https://arxiv.org/abs/2508.12917
作者: Zhiwei Ning,Zhaojiang Liu,Xuanang Gao,Yifan Zuo,Jie Yang,Yuming Fang,Wei Liu
机构: Shanghai Jiao Tong University (上海交通大学); Jiangxi University of Finance and Economics (江西财经大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: The Paper is Accepted by TCSVT
Abstract:Multi-modal methods based on camera and LiDAR sensors have garnered significant attention in the field of 3D detection. However, many prevalent works focus on single or partial stage fusion, leading to insufficient feature extraction and suboptimal performance. In this paper, we introduce a multi-stage cross-modal fusion 3D detection framework, termed CMF-IOU, to effectively address the challenge of aligning 3D spatial and 2D semantic information. Specifically, we first project the pixel information into 3D space via a depth completion network to get the pseudo points, which unifies the representation of the LiDAR and camera information. Then, a bilateral cross-view enhancement 3D backbone is designed to encode LiDAR points and pseudo points. The first sparse-to-distant (S2D) branch utilizes an encoder-decoder structure to reinforce the representation of sparse LiDAR points. The second residual view consistency (ResVC) branch is proposed to mitigate the influence of inaccurate pseudo points via both the 3D and 2D convolution processes. Subsequently, we introduce an iterative voxel-point aware fine grained pooling module, which captures the spatial information from LiDAR points and textural information from pseudo points in the proposal refinement stage. To achieve more precise refinement during iteration, an intersection over union (IoU) joint prediction branch integrated with a novel proposals generation technique is designed to preserve the bounding boxes with both high IoU and classification scores. Extensive experiments show the superior performance of our method on the KITTI, nuScenes and Waymo datasets.
zh
[CV-32] CTFlow: Video-Inspired Latent Flow Matching for 3D CT Synthesis
【速读】:该论文旨在解决如何基于临床报告(clinical reports)生成高质量、一致且具有诊断意义的完整三维CT(Computed Tomography)体积数据的问题,以支持医学研究中的数据增强、隐私保护合成及降低监管约束。其核心解决方案是提出CTFlow模型,这是一个参数量为0.5B的潜空间流匹配(latent flow matching)Transformer模型,通过引入A-VAE定义潜空间、CT-Clip编码临床文本,并采用定制的自回归策略:首先仅根据文本预测初始切片序列,随后结合已生成切片与文本逐步预测后续切片,从而在保持内存效率的同时实现整体积的一致性生成。实验表明,该方法在时间一致性、图像多样性及文本-图像对齐度上优于现有最优生成式CT模型,验证了其有效性。
链接: https://arxiv.org/abs/2508.12900
作者: Jiayi Wang,Hadrien Reynaud,Franciskus Xaverius Erick,Bernhard Kainz
机构: Friedrich–Alexander University Erlangen-Nürnberg (埃尔朗根-纽伦堡弗里德里希-亚历山大大学); Imperial College London (帝国理工学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Generative modelling of entire CT volumes conditioned on clinical reports has the potential to accelerate research through data augmentation, privacy-preserving synthesis and reducing regulator-constraints on patient data while preserving diagnostic signals. With the recent release of CT-RATE, a large-scale collection of 3D CT volumes paired with their respective clinical reports, training large text-conditioned CT volume generation models has become achievable. In this work, we introduce CTFlow, a 0.5B latent flow matching transformer model, conditioned on clinical reports. We leverage the A-VAE from FLUX to define our latent space, and rely on the CT-Clip text encoder to encode the clinical reports. To generate consistent whole CT volumes while keeping the memory constraints tractable, we rely on a custom autoregressive approach, where the model predicts the first sequence of slices of the volume from text-only, and then relies on the previously generated sequence of slices and the text, to predict the following sequence. We evaluate our results against state-of-the-art generative CT model, and demonstrate the superiority of our approach in terms of temporal coherence, image diversity and text-image alignment, with FID, FVD, IS scores and CLIP score.
zh
[CV-33] ONG: One-Shot NMF-based Gradient Masking for Efficient Model Sparsification
【速读】:该论文旨在解决深度神经网络(Deep Neural Networks, DNNs)因参数量庞大而导致的部署难题,特别是现有剪枝技术普遍存在迭代过程复杂、依赖特定判据或难以在训练过程中有效维持稀疏结构的问题。其解决方案的关键在于提出一种名为ONG(One-shot NMF-based Gradient Masking)的一次性稀疏化策略:首先利用非负矩阵分解(Non-negative Matrix Factorization, NMF)识别权重中的显著结构,实现训练初期的单次剪枝;随后通过精确的梯度掩码机制,仅允许未被剪枝的权重参与更新,从而在整个训练过程中严格保持目标稀疏度。该方法在BIMP对比框架下验证了其在CIFAR-10和CIFAR-100数据集上对ResNet系列模型的有效性,展现出在不同稀疏度下性能相当或更优,并能保持剪枝后的结构完整性。
链接: https://arxiv.org/abs/2508.12891
作者: Sankar Behera,Yamuna Prasad
机构: Indian Institute of Technology Jammu (印度理工学院贾姆马)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages
Abstract:Deep Neural Networks (DNNs) have achieved remarkable success but their large size poses deployment challenges. While various pruning techniques exist, many involve complex iterative processes, specialized criteria, or struggle to maintain sparsity effectively during training. We introduce ONG (One-shot NMF-based Gradient Masking), a novel sparsification strategy that identifies salient weight structures using Non-negative Matrix Factorization (NMF) for one-shot pruning at the outset of training. Subsequently, ONG employs a precise gradient masking mechanism to ensure that only unpruned weights are updated, strictly preserving the target sparsity throughout the training phase. We integrate ONG into the BIMP comparative framework and evaluate it on CIFAR-10 and CIFAR-100 with ResNet56, ResNet34, and ResNet18 against established stable sparsification methods. Our experiments demonstrate ONG’s ability to achieve comparable or superior performance at various sparsity levels while maintaining structural integrity post-pruning and offering a clear mechanism for targeting desired sparsities.
zh
[CV-34] S2-Guidance: Stochastic Self Guidance for Training-Free Enhancement of Diffusion Models
【速读】:该论文旨在解决Classifier-free Guidance (CFG) 在扩散模型中导致的次优结果问题,即CFG生成的样本质量不高且与提示语不一致,常引发语义不连贯和低质量输出。其解决方案的关键在于提出S²-Guidance方法,通过在前向过程中引入随机块丢弃(stochastic block-dropping)策略构建随机子网络,从而引导模型避开潜在的低质量预测,提升生成样本的质量和一致性。
链接: https://arxiv.org/abs/2508.12880
作者: Chubin Chen,Jiashu Zhu,Xiaokun Feng,Nisha Huang,Meiqi Wu,Fangyuan Mao,Jiahong Wu,Xiangxiang Chu,Xiu Li
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Classifier-free Guidance (CFG) is a widely used technique in modern diffusion models for enhancing sample quality and prompt adherence. However, through an empirical analysis on Gaussian mixture modeling with a closed-form solution, we observe a discrepancy between the suboptimal results produced by CFG and the ground truth. The model’s excessive reliance on these suboptimal predictions often leads to semantic incoherence and low-quality outputs. To address this issue, we first empirically demonstrate that the model’s suboptimal predictions can be effectively refined using sub-networks of the model itself. Building on this insight, we propose S^2-Guidance, a novel method that leverages stochastic block-dropping during the forward process to construct stochastic sub-networks, effectively guiding the model away from potential low-quality predictions and toward high-quality outputs. Extensive qualitative and quantitative experiments on text-to-image and text-to-video generation tasks demonstrate that S^2-Guidance delivers superior performance, consistently surpassing CFG and other advanced guidance strategies. Our code will be released.
zh
[CV-35] Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning
【速读】:该论文旨在解决预训练视觉-语言模型(Vision-Language Models, VLMs)在微调过程中因忽略数据分布的几何结构而导致语义表示失真的问题。现有方法虽通过参数高效微调或实例一致性约束缓解过拟合,但未能保持特征空间中数据流形的拓扑结构,从而影响模型的泛化能力。解决方案的关键在于提出一种名为流形保真雕刻微调(Manifold-Preserving and Sculpting Tuning, MPS-Tuning)的新方法:首先通过匹配微调前后特征的Gram矩阵来保留原始流形的宏观与微观拓扑结构,理论上逼近Gromov-Wasserstein距离的上界;其次,通过对图像和文本模态特征进行配对并优化其相似性,进一步雕刻流形以增强类别可分性。该方法在不破坏语义流形结构的前提下显著提升了模型性能。
链接: https://arxiv.org/abs/2508.12877
作者: Dexia Chen,Qianjie Zhu,Weibing Li,Yue Yu,Tong Zhang,Ruixuan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Pretrained vision-language models (VLMs), such as CLIP, have shown remarkable potential in few-shot image classification and led to numerous effective transfer learning strategies. These methods leverage the pretrained knowledge of VLMs to enable effective domain adaptation while mitigating overfitting through parameter-efficient tuning or instance-based consistency constraints. However, such regularizations often neglect the geometric structure of data distribution, which may lead to distortion of the overall semantic representation. To overcome this limitation, we propose a novel fine-tuning method, Manifold-Preserving and Sculpting Tuning (MPS-Tuning). Regarding the data distribution in feature space as a semantic manifold, MPS-Tuning explicitly constrains the intrinsic geometry of this manifold while further sculpting it to enhance class separability. Specifically, MPS-Tuning preserves both macroscopic and microscopic topological structures of the original manifold by aligning Gram matrices of features before and after fine-tuning. Theoretically, this constraint is shown to approximate an upper bound of the Gromov-Wasserstein distance. Furthermore, features from the image and text modalities are paired, and pairwise similarities are optimized to enhance the manifold’s class discriminability. Extensive experiments demonstrate that MPS-Tuning significantly improves model performance while effectively preserving the structure of the semantic manifold. The code will be released.
zh
[CV-36] Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models
【速读】:该论文旨在解决视觉语言模型(Vision-Language Models, VLMs)在跨域少样本图像识别任务中性能下降的问题,即当目标图像域与预训练所用自然图像域存在差异时,现有基于VLM的迁移学习方法效果受限。其解决方案的关键在于提出了一种一致性引导的多视角协同优化策略(Consistency-guided Multi-view Collaborative Optimization, CoMuCo),通过两个功能互补的专家模块提取多视角特征,并引入基于先验知识的一致性约束和基于信息几何的共识机制,以增强特征学习的鲁棒性,从而提升模型在跨域场景下的少样本识别能力。
链接: https://arxiv.org/abs/2508.12861
作者: Dexia Chen,Wentao Zhang,Qianjie Zhu,Ping Hu,Weibing Li,Tong Zhang,Ruixuan Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks, leading to development of various efficient transfer learning methods. These methods exploit inherent pre-learned knowledge in VLMs and have achieved strong performance on standard image datasets. However, their effectiveness is often limited when confronted with cross-domain tasks where imaging domains differ from natural images. To address this limitation, we propose Consistency-guided Multi-view Collaborative Optimization (CoMuCo), a novel fine-tuning strategy for VLMs. This strategy employs two functionally complementary expert modules to extract multi-view features, while incorporating prior knowledge-based consistency constraints and information geometry-based consensus mechanisms to enhance the robustness of feature learning. Additionally, a new cross-domain few-shot benchmark is established to help comprehensively evaluate methods on imaging domains distinct from natural images. Extensive empirical evaluations on both existing and newly proposed benchmarks suggest CoMuCo consistently outperforms current methods in few-shot tasks. The code and benchmark will be released.
zh
[CV-37] Multi-source Multimodal Progressive Domain Adaption for Audio-Visual Deception Detection ACM-MM2025
【速读】:该论文旨在解决多模态欺骗检测(Multi-modal Deception Detection, MMDD)中因源域与目标域之间存在领域偏移(domain shift)而导致模型泛化能力下降的问题。其解决方案的关键在于提出了一种多源多模态渐进式领域自适应(Multi-source Multimodal Progressive Domain Adaptation, MMPDA)框架,通过在特征层和决策层上逐步对齐多个源域与目标域的分布,有效缓解了跨数据集的领域差异,从而提升了模型在目标域上的性能表现。
链接: https://arxiv.org/abs/2508.12842
作者: Ronghao Lin,Sijie Mai,Ying Zeng,Qiaolin He,Aolin Xiong,Haifeng Hu
机构: Sun Yat-Sen University (中山大学); Nanyang Technological University (南洋理工大学); South China Normal University (华南师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: Accepted at ACM MM 2025 SVC Workshop
Abstract:This paper presents the winning approach for the 1st MultiModal Deception Detection (MMDD) Challenge at the 1st Workshop on Subtle Visual Computing (SVC). Aiming at the domain shift issue across source and target domains, we propose a Multi-source Multimodal Progressive Domain Adaptation (MMPDA) framework that transfers the audio-visual knowledge from diverse source domains to the target domain. By gradually aligning source and the target domain at both feature and decision levels, our method bridges domain shifts across diverse multimodal datasets. Extensive experiments demonstrate the effectiveness of our approach securing Top-2 place. Our approach reaches 60.43% on accuracy and 56.99% on F1-score on competition stage 2, surpassing the 1st place team by 5.59% on F1-score and the 3rd place teams by 6.75% on accuracy. Our code is available at this https URL.
zh
[CV-38] DEEP-SEA: Deep-Learning Enhancement for Environmental Perception in Submerged Aquatics
【速读】:该论文旨在解决水下视觉退化问题,即由于光散射、吸收和浑浊度导致的图像清晰度下降与色彩失真,从而影响水下监测平台在海洋生物多样性分析、生态评估及自主探索中的准确性。解决方案的关键在于提出DEEP-SEA模型,其核心是Dual-Frequency Enhanced Self-Attention Spatial and Frequency Modulator(双频增强自注意力空间与频率调制器),该模块能够自适应地在频域内优化特征表示,并同步保留空间信息,以提升图像细节恢复能力和结构一致性。
链接: https://arxiv.org/abs/2508.12824
作者: Shuang Chen,Ronald Thenius,Farshad Arvin,Amir Atapour-Abarghouei
机构: Durham University (杜伦大学); University of Graz (格拉茨大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continuous and reliable underwater monitoring is essential for assessing marine biodiversity, detecting ecological changes and supporting autonomous exploration in aquatic environments. Underwater monitoring platforms rely on mainly visual data for marine biodiversity analysis, ecological assessment and autonomous exploration. However, underwater environments present significant challenges due to light scattering, absorption and turbidity, which degrade image clarity and distort colour information, which makes accurate observation difficult. To address these challenges, we propose DEEP-SEA, a novel deep learning-based underwater image restoration model to enhance both low- and high-frequency information while preserving spatial structures. The proposed Dual-Frequency Enhanced Self-Attention Spatial and Frequency Modulator aims to adaptively refine feature representations in frequency domains and simultaneously spatial information for better structural preservation. Our comprehensive experiments on EUVP and LSUI datasets demonstrate the superiority over the state of the art in restoring fine-grained image detail and structural consistency. By effectively mitigating underwater visual degradation, DEEP-SEA has the potential to improve the reliability of underwater monitoring platforms for more accurate ecological observation, species identification and autonomous navigation.
zh
[CV-39] SIS-Challenge: Event-based Spatio-temporal Instance Segmentation Challenge at the CVPR 2025 Event-based Vision Workshop
【速读】:该论文针对事件相机(event camera)与灰度相机数据在时空上对齐条件下的目标实例分割问题展开研究,旨在从多模态传感器数据中精确预测指定类别对象的像素级分割掩码。其解决方案的关键在于构建一个统一的时空实例分割(Spatio-temporal Instance Segmentation, SIS)挑战任务框架,包含标准化的数据集、评估指标及竞赛机制,并通过分析排名前五团队的方法揭示了当前基于多模态融合与时序建模的有效策略,为事件视觉领域的实例分割提供了基准和方法参考。
链接: https://arxiv.org/abs/2508.12813
作者: Friedhelm Hamann,Emil Mededovic,Fabian Gülhan,Yuli Wu,Johannes Stegmaier,Jing He,Yiqing Wang,Kexin Zhang,Lingling Li,Licheng Jiao,Mengru Ma,Hongxiang Huang,Yuhao Yan,Hongwei Ren,Xiaopeng Lin,Yulong Huang,Bojun Cheng,Se Hyun Lee,Gyu Sung Ham,Kanghan Oh,Gi Hyun Lim,Boxuan Yang,Bowen Du,Guillermo Gallego
机构: TU Berlin (柏林工业大学); SCIoI (科学与创新研究所); ECDF (欧洲计算数据基金会); RWTH Aachen (亚琛工业大学); Xidian University (西安电子科技大学); Hong Kong University of Science and Technology (香港科技大学); Sun Yat-sen University (中山大学); Wonkwang University (원광대학교); Tongji University (同济大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 13 pages, 7 figures, 7 tables
Abstract:We present an overview of the Spatio-temporal Instance Segmentation (SIS) challenge held in conjunction with the CVPR 2025 Event-based Vision Workshop. The task is to predict accurate pixel-level segmentation masks of defined object classes from spatio-temporally aligned event camera and grayscale camera data. We provide an overview of the task, dataset, challenge details and results. Furthermore, we describe the methods used by the top-5 ranking teams in the challenge. More resources and code of the participants’ methods are available here: this https URL
zh
[CV-40] Next Visual Granularity Generation
【速读】:该论文旨在解决图像生成过程中难以实现多粒度层次控制的问题,即如何在生成流程中从全局结构到局部细节逐步精细化地构建图像。其解决方案的关键在于提出了一种新颖的Next Visual Granularity (NVG) 生成框架,该框架将图像分解为一个结构化的粒度序列(sequence of visual granularity),其中每个元素具有相同的空间分辨率但使用不同数量的独特标记(tokens),从而捕捉不同层级的视觉细节。通过从空图像开始迭代生成,并逐级细化图像的结构与细节,NVG实现了分层、有序的图像生成过程,提供了对多个粒度层级的精细控制能力。实验表明,该方法在ImageNet数据集上相较于VAR系列模型在FID分数上显著提升,验证了其有效性与扩展性。
链接: https://arxiv.org/abs/2508.12811
作者: Yikai Wang,Zhouxia Wang,Zhonghua Wu,Qingyi Tao,Kang Liao,Chen Change Loy
机构: S-Lab, Nanyang Technological University (南洋理工大学); SenseTime Research (商汤科技)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:We propose a novel approach to image generation by decomposing an image into a structured sequence, where each element in the sequence shares the same spatial resolution but differs in the number of unique tokens used, capturing different level of visual granularity. Image generation is carried out through our newly introduced Next Visual Granularity (NVG) generation framework, which generates a visual granularity sequence beginning from an empty image and progressively refines it, from global layout to fine details, in a structured manner. This iterative process encodes a hierarchical, layered representation that offers fine-grained control over the generation process across multiple granularity levels. We train a series of NVG models for class-conditional image generation on the ImageNet dataset and observe clear scaling behavior. Compared to the VAR series, NVG consistently outperforms it in terms of FID scores (3.30 - 3.03, 2.57 -2.44, 2.09 - 2.06). We also conduct extensive analysis to showcase the capability and potential of the NVG framework. Our code and models will be released.
zh
[CV-41] Morphological classification of eclipsing binary stars using computer vision methods
【速读】:该论文旨在解决大样本巡天数据中食双星(Eclipsing Binaries, EB)形态分类的自动化问题,特别是区分分离型与相接型系统,并识别是否存在黑子(spots)等细微光变特征。解决方案的关键在于提出一种新型图像表示方法——将相位折叠后的光变曲线转换为极坐标系并结合六边形密度可视化(hexbin visualisation),从而增强模型对EB光变特征的感知能力;同时采用基于预训练卷积神经网络(ResNet50)和视觉Transformer(ViT-base-patch16-224)的微调策略,在多个波段(Gaia G、I 和 TESS)上实现高精度分类(验证集准确率达96%),并在OGLE、DEBCat和WUMaCat观测数据中表现出强泛化能力(准确率94%–100%)。
链接: https://arxiv.org/abs/2508.12802
作者: Štefan Parimucha,Maksim Gabdeev,Yanna Markus,Martin Vaňko,Pavol Gajdoš
机构: Pavol Jozef Šafárik University in Košice (帕沃尔·约瑟夫·沙法里克大学); Slovak Academy of Sciences (斯洛伐克科学院); Czech Academy of Sciences (捷克科学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
备注: 19 pages, 4 figures, 4 tables
Abstract:We present an application of computer vision methods to classify the light curves of eclipsing binaries (EB). We have used pre-trained models based on convolutional neural networks ( \textitResNet50 ) and vision transformers ( \textitvit_base_patch16_224 ), which were fine-tuned on images created from synthetic datasets. To improve model generalisation and reduce overfitting, we developed a novel image representation by transforming phase-folded light curves into polar coordinates combined with hexbin visualisation. Our hierarchical approach in the first stage classifies systems into detached and overcontact types, and in the second stage identifies the presence or absence of spots. The binary classification models achieved high accuracy ( 96% ) on validation data across multiple passbands (Gaia~ G , I , and TESS ) and demonstrated strong performance ( 94% , up to 100% for TESS ) when tested on extensive observational data from the OGLE, DEBCat, and WUMaCat catalogues. While the primary binary classification was highly successful, the secondary task of automated spot detection performed poorly, revealing a significant limitation of our models for identifying subtle photometric features. This study highlights the potential of computer vision for EB morphological classification in large-scale surveys, but underscores the need for further research into robust, automated spot detection.
zh
[CV-42] A Shift in Perspective on Causality in Domain Generalization
【速读】:该论文旨在解决因果建模在域泛化(Domain Generalization, DG)任务中是否能真正提升AI模型鲁棒性的问题,尤其针对近期研究对因果建模有效性的质疑。其核心贡献在于重新审视因果性与DG文献中的主张,调和看似矛盾的实证结果,并提出一个更细致的理论框架,以阐明因果关系在模型泛化能力中的作用机制。解决方案的关键在于强调因果结构并非直接保证泛化性能,而是需结合具体任务、数据分布变化模式及学习策略,从而构建更具解释性和适应性的泛化理论。
链接: https://arxiv.org/abs/2508.12798
作者: Damian Machlanski,Stephanie Riley,Edward Moroshko,Kurt Butler,Panagiotis Dimitrakopoulos,Thomas Melistas,Akchunya Chanchal,Steven McDonagh,Ricardo Silva,Sotirios A. Tsaftaris
机构: CHAI Hub, UK; The University of Edinburgh, UK; University of Manchester, UK; Archimedes, Athena Research Center, Greece; King’s College London, UK; University College London, UK; National and Kapodistrian University of Athens, Greece
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: 2 pages, 1 figure, to be presented at the UK AI Research Symposium (UKAIRS) 2025
Abstract:The promise that causal modelling can lead to robust AI generalization has been challenged in recent work on domain generalization (DG) benchmarks. We revisit the claims of the causality and DG literature, reconciling apparent contradictions and advocating for a more nuanced theory of the role of causality in generalization. We also provide an interactive demo at this https URL.
zh
[CV-43] Vehicle detection from GSV imagery: Predicting travel behaviour for cycling and motorcycling using Computer Vision
【速读】:该论文旨在解决全球范围内骑行(cycling)和摩托车出行(motorcycling)模式份额数据稀缺的问题,尤其在缺乏可靠交通调查数据的地区。其核心解决方案是利用谷歌街景图像(Google Street View, GSV)结合深度学习目标检测技术,通过YOLOv4模型自动识别街景图像中的自行车和摩托车数量,并基于城市层面的检测计数构建贝叶斯回归模型(beta regression),以预测各城市的骑行和摩托车出行比例。关键创新在于将计算机视觉与大规模街景数据融合,实现了对全球60个城市出行模式的高效、低成本估计,且模型在多数城市表现出良好的预测精度(R²=0.612–0.614,中位绝对误差<1.5%)。
链接: https://arxiv.org/abs/2508.12794
作者: Kyriaki(Kelly)Kokka,Rahul Goel,Ali Abbas,Kerry A. Nice,Luca Martial,SM Labib,Rihuan Ke,Carola Bibiane Schönlieb,James Woodcock
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Transportation influence health by shaping exposure to physical activity, air pollution and injury this http URL data on cycling and motorcycling behaviours is scarce, particularly at a global this http URL view imagery, such as Google Street View (GSV), combined with computer vision, is a valuable resource for efficiently capturing travel behaviour this http URL study demonstrates a novel approach using deep learning on street view images to estimate cycling and motorcycling levels across diverse cities this http URL utilized data from 185 global this http URL data on mode shares of cycling and motorcycling estimated using travel surveys or this http URL used GSV images to detect cycles and motorcycles in sampled locations, using 8000 images per this http URL YOLOv4 model, fine-tuned using images from six cities, achieved a mean average precision of 89% for detecting cycles and motorcycles in GSV images.A global prediction model was developed using beta regression with city-level mode shares as outcome, with log transformed explanatory variables of counts of GSV-detected images with cycles and motorcycles, while controlling for population this http URL found strong correlations between GSV motorcycle counts and motorcycle mode share (0.78) and moderate correlations between GSV cycle counts and cycling mode share (0.51).Beta regression models predicted mode shares with R^2 values of 0.614 for cycling and 0.612 for motorcycling, achieving median absolute errors (MDAE) of 1.3% and 1.4%, this http URL demonstrated consistent prediction accuracy, though cities like Utrecht and Cali were this http URL model was applied to 60 cities globally for which we didn’t have recent mode share this http URL provided estimates for some cities in the Middle East, Latin America and East this http URL computer vision, GSV images capture travel modes and activity, providing insights alongside traditional data sources.
zh
[CV-44] Leverag ing Diffusion Models for Stylization using Multiple Style Images
【速读】:该论文旨在解决当前基于潜在扩散模型(latent diffusion models)的图像风格迁移方法中存在的三大关键问题:风格匹配不准确、可使用的风格图像数量有限,以及内容与风格在特征层面的 undesired entanglement(非期望纠缠)。其解决方案的关键在于引入多风格图像(multiple style images)以更全面地表征风格特征并减少内容泄露,并设计了一种结合图像提示适配器(image prompt adapters)与去噪过程中特征统计对齐(statistical alignment)的方法。具体而言,该方法通过聚类从大量风格样本中提取代表性注意力特征,从而在去噪UNet的交叉注意力(cross-attention)和自注意力(self-attention)层实现干预,显著提升了风格迁移的质量与可控性,最终实现了当前最优的图像风格化效果。
链接: https://arxiv.org/abs/2508.12784
作者: Dan Ruta,Abdelaziz Djelouah,Raphael Ortiz,Christopher Schroers
机构: DisneyResearch|Studios
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in latent diffusion models have enabled exciting progress in image style transfer. However, several key issues remain. For example, existing methods still struggle to accurately match styles. They are often limited in the number of style images that can be used. Furthermore, they tend to entangle content and style in undesired ways. To address this, we propose leveraging multiple style images which helps better represent style features and prevent content leaking from the style images. We design a method that leverages both image prompt adapters and statistical alignment of the features during the denoising process. With this, our approach is designed such that it can intervene both at the cross-attention and the self-attention layers of the denoising UNet. For the statistical alignment, we employ clustering to distill a small representative set of attention features from the large number of attention values extracted from the style samples. As demonstrated in our experimental section, the resulting method achieves state-of-the-art results for stylization.
zh
[CV-45] SocialTrack: Multi-Object Tracking in Complex Urban Traffic Scenes Inspired by Social Behavior
【速读】:该论文旨在解决无人机(UAV)视角下多目标跟踪(Multi-Object Tracking, MOT)在复杂城市交通环境中面临的挑战,包括小目标尺度变化、遮挡、非线性交叉运动及运动模糊等问题,这些问题严重削弱了跟踪的稳定性与准确性。解决方案的关键在于提出一个名为SocialTrack的新型多目标跟踪框架,其核心创新包括:1)基于多尺度特征增强机制的小目标检测器,提升小目标检测性能;2)引入速度动态建模的Velocity Adaptive Cubature Kalman Filter(VACKF),改善轨迹预测精度;3)利用社会群体运动先验的Group Motion Compensation Strategy(GMCS),为低质量轨迹提供稳定的状态更新参考,提高复杂动态环境下的目标关联准确率;4)通过时空记忆预测(Spatio-Temporal Memory Prediction, STMP)利用历史轨迹信息预测低质量轨迹未来状态,有效缓解身份切换问题。该框架在UAVDT和MOT17数据集上显著优于现有最先进方法,在MOTA和IDF1等关键指标上取得明显提升。
链接: https://arxiv.org/abs/2508.12777
作者: Wenguang Tao,Xiaotian Wang,Tian Yan,Jie Yan,Guodong Li,Kun Bai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:As a key research direction in the field of multi-object tracking (MOT), UAV-based multi-object tracking has significant application value in the analysis and understanding of urban intelligent transportation systems. However, in complex UAV perspectives, challenges such as small target scale variations, occlusions, nonlinear crossing motions, and motion blur severely hinder the stability of multi-object tracking. To address these challenges, this paper proposes a novel multi-object tracking framework, SocialTrack, aimed at enhancing the tracking accuracy and robustness of small targets in complex urban traffic environments. The specialized small-target detector enhances the detection performance by employing a multi-scale feature enhancement mechanism. The Velocity Adaptive Cubature Kalman Filter (VACKF) improves the accuracy of trajectory prediction by incorporating a velocity dynamic modeling mechanism. The Group Motion Compensation Strategy (GMCS) models social group motion priors to provide stable state update references for low-quality tracks, significantly improving the target association accuracy in complex dynamic environments. Furthermore, the Spatio-Temporal Memory Prediction (STMP) leverages historical trajectory information to predict the future state of low-quality tracks, effectively mitigating identity switching issues. Extensive experiments on the UAVDT and MOT17 datasets demonstrate that SocialTrack outperforms existing state-of-the-art (SOTA) methods across several key metrics. Significant improvements in MOTA and IDF1, among other core performance indicators, highlight its superior robustness and adaptability. Additionally, SocialTrack is highly modular and compatible, allowing for seamless integration with existing trackers to further enhance performance.
zh
[CV-46] Harnessing Group-Oriented Consistency Constraints for Semi-Supervised Semantic Segmentation in CdZnTe Semiconductors
【速读】:该论文旨在解决半导体材料镉锌碲(Cadmium Zinc Telluride, CdZnTe)图像中缺陷边界对比度低导致的标注困难问题,其核心挑战在于多视角图像共享同一真实标签(ground truth, GT),形成独特的“多对一”关系,而现有半监督语义分割(Semi-Supervised Semantic Segmentation, SSS)方法通常基于“一对一”映射假设,难以有效利用此类组内一致性,易在低对比度区域累积错误并加剧确认偏差。解决方案的关键在于提出一种面向群体的增强框架——组内一致性增强框架(Intra-group Consistency Augmentation Framework, ICAF),其核心创新包括:1)通过组内视图采样(Intra-group View Sampling, IVS)建立群体导向的基线;2)引入伪标签校正网络(Pseudo-label Correction Network, PCN),包含视图增强模块(View Augmentation Module, VAM)和视图校正模块(View Correction Module, VCM),前者动态合成具有边界感知能力的视图以提升细节,后者通过多视图信息交互强化显著区域并抑制噪声,从而显著提升低对比度区域的分割精度。实验表明,仅使用2组标注数据(占总数据的0.5%)即可在DeepLabV3+ + ResNet-101模型上达到70.6%的mIoU。
链接: https://arxiv.org/abs/2508.12766
作者: Peihao Li,Yan Fang,Man Liu,Huihui Bai,Anhong Wang,Yunchao Wei,Yao Zhao
机构: Beijing Jiaotong University (北京交通大学); Anhui University (安徽大学); Taiyuan University of Science and Technology (太原科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Labeling Cadmium Zinc Telluride (CdZnTe) semiconductor images is challenging due to the low-contrast defect boundaries, necessitating annotators to cross-reference multiple views. These views share a single ground truth (GT), forming a unique many-to-one'' relationship. This characteristic renders advanced semi-supervised semantic segmentation (SSS) methods suboptimal, as they are generally limited by a
one-to-one’’ relationship, where each image is independently associated with its GT. Such limitation may lead to error accumulation in low-contrast regions, further exacerbating confirmation bias. To address this issue, we revisit the SSS pipeline from a group-oriented perspective and propose a human-inspired solution: the Intra-group Consistency Augmentation Framework (ICAF). First, we experimentally validate the inherent consistency constraints within CdZnTe groups, establishing a group-oriented baseline using the Intra-group View Sampling (IVS). Building on this insight, we introduce the Pseudo-label Correction Network (PCN) to enhance consistency representation, which consists of two key modules. The View Augmentation Module (VAM) improves boundary details by dynamically synthesizing a boundary-aware view through the aggregation of multiple views. In the View Correction Module (VCM), this synthesized view is paired with other views for information interaction, effectively emphasizing salient regions while minimizing noise. Extensive experiments demonstrate the effectiveness of our solution for CdZnTe materials. Leveraging DeepLabV3+ with a ResNet-101 backbone as our segmentation model, we achieve a 70.6% mIoU on the CdZnTe dataset using only 2 group-annotated data (5\textperthousand). The code is available at \hrefthis https URLthis https URL.
zh
[CV-47] CLAIRE-DSA: Fluoroscopic Image Classification for Quality Assurance of Computer Vision Pipelines in Acute Ischemic Stroke MICCAI
【速读】:该论文旨在解决机械血栓切除术(Mechanical Thrombectomy, MT)过程中数字减影血管造影(Digital Subtraction Angiography, DSA)图像质量差导致计算机视觉模型性能下降的问题。其解决方案的关键在于提出了一种基于深度学习的框架 CLAIRE-DSA,该框架利用预训练的 ResNet 主干网络对最小强度投影(MinIP)图像中的九类关键图像属性(如对比剂存在、投射角度、运动伪影严重程度等)进行分类,并通过在包含 1,758 张荧光 MinIP 图像的标注数据集上微调,实现了优异的分类性能(ROC-AUC 范围为 0.91–0.98,精度范围为 0.70–1.00)。进一步验证表明,通过 CLAIRE-DSA 过滤低质量图像可显著提升后续分割任务的成功率(从 42% 提升至 69%,p < 0.001),证明其在临床和研究场景中具备自动化图像标注与质量控制的潜力。
链接: https://arxiv.org/abs/2508.12755
作者: Cristo J. van den Berg,Frank G. te Nijenhuis,Mirre J. Blaauboer,Daan T. W. van Erp,Carlijn M. Keppels,Matthijs van der Sluijs,Bob Roozenbeek,Wim van Zwam,Sandra Cornelissen,Danny Ruijters,Ruisheng Su,Theo van Walsum
机构: 11: Delft University of Technology (代尔夫特理工大学); 22: University of Amsterdam (阿姆斯特丹大学); 33: Vrije Universiteit Amsterdam (阿姆斯特丹自由大学); 66: Eindhoven University of Technology (埃因霍温理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 4 figures, workshop paper accepted at this https URL
Abstract:Computer vision models can be used to assist during mechanical thrombectomy (MT) for acute ischemic stroke (AIS), but poor image quality often degrades performance. This work presents CLAIRE-DSA, a deep learning–based framework designed to categorize key image properties in minimum intensity projections (MinIPs) acquired during MT for AIS, supporting downstream quality control and workflow optimization. CLAIRE-DSA uses pre-trained ResNet backbone models, fine-tuned to predict nine image properties (e.g., presence of contrast, projection angle, motion artefact severity). Separate classifiers were trained on an annotated dataset containing 1,758 fluoroscopic MinIPs. The model achieved excellent performance on all labels, with ROC-AUC ranging from 0.91 to 0.98 , and precision ranging from 0.70 to 1.00 . The ability of CLAIRE-DSA to identify suitable images was evaluated on a segmentation task by filtering poor quality images and comparing segmentation performance on filtered and unfiltered datasets. Segmentation success rate increased from 42% to 69% , p 0.001 . CLAIRE-DSA demonstrates strong potential as an automated tool for accurately classifying image properties in DSA series of acute ischemic stroke patients, supporting image annotation and quality control in clinical and research applications. Source code is available at this https URL.
zh
[CV-48] D2-Mamba: Dual-Scale Fusion and Dual-Path Scanning with SSMs for Shadow Removal
【速读】:该论文旨在解决阴影去除(shadow removal)任务中因阴影区域与非阴影区域之间存在显著差异的局部非均匀退化问题,传统方法难以应用统一修复策略。其核心挑战在于如何有效整合非局部上下文信息并自适应建模不同区域的变换特性。解决方案的关键创新在于提出一种基于Mamba架构的新型网络结构,包含双尺度融合模块(Dual-Scale Fusion Mamba Block, DFMB)和双路径扫描模块(Dual-Path Mamba Group, DPMG):DFMB通过融合原始特征与低分辨率特征增强多尺度表示并减少边界伪影;DPMG则利用水平扫描捕获全局特征,并引入掩码感知的自适应扫描机制,从而提升结构连续性和细粒度区域建模能力,实现基于区域变换相似性的选择性上下文传播。
链接: https://arxiv.org/abs/2508.12750
作者: Linhao Li,Boya Jin,Zizhe Li,Lanqing Guo,Hao Cheng,Bo Li,Yongfeng Dong
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper Under Review
Abstract:Shadow removal aims to restore images that are partially degraded by shadows, where the degradation is spatially localized and non-uniform. Unlike general restoration tasks that assume global degradation, shadow removal can leverage abundant information from non-shadow regions for guidance. However, the transformation required to correct shadowed areas often differs significantly from that of well-lit regions, making it challenging to apply uniform correction strategies. This necessitates the effective integration of non-local contextual cues and adaptive modeling of region-specific transformations. To this end, we propose a novel Mamba-based network featuring dual-scale fusion and dual-path scanning to selectively propagate contextual information based on transformation similarity across regions. Specifically, the proposed Dual-Scale Fusion Mamba Block (DFMB) enhances multi-scale feature representation by fusing original features with low-resolution features, effectively reducing boundary artifacts. The Dual-Path Mamba Group (DPMG) captures global features via horizontal scanning and incorporates a mask-aware adaptive scanning strategy, which improves structural continuity and fine-grained region modeling. Experimental results demonstrate that our method significantly outperforms existing state-of-the-art approaches on shadow removal benchmarks.
zh
[CV-49] DCSCR: A Class-Specific Collaborative Representation based Network for Image Set Classification
【速读】:该论文旨在解决图像集分类(Image Set Classification, ISC)中特征表示学习不足与集合间相似性度量不准确的问题,尤其在少样本(few-shot)场景下表现受限。现有传统方法依赖原始像素特征,忽视了深层特征的学习;而深度ISC方法虽能提取深层特征,却无法在计算集合距离时自适应调整特征表示,导致性能瓶颈。解决方案的关键在于提出一种名为Deep Class-specific Collaborative Representation (DCSCR)的网络架构,其核心创新是同时学习图像集的帧级(frame-level)和概念级(concept-level)特征表示,并通过类特定协同表示(class-specific collaborative representation)构建新的对比损失函数,实现对不同图像集之间距离相似性的自适应建模,从而提升少样本条件下的分类性能。
链接: https://arxiv.org/abs/2508.12745
作者: Xizhan Gao,Wei Hu
机构: Shandong Key Laboratory of Ubiquitous Intelligent Computing, School of Information Science and Engineering, University of Jinan (济南大学信息科学与工程学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Image set classification (ISC), which can be viewed as a task of comparing similarities between sets consisting of unordered heterogeneous images with variable quantities and qualities, has attracted growing research attention in recent years. How to learn effective feature representations and how to explore the similarities between different image sets are two key yet challenging issues in this field. However, existing traditional ISC methods classify image sets based on raw pixel features, ignoring the importance of feature learning. Existing deep ISC methods can learn deep features, but they fail to adaptively adjust the features when measuring set distances, resulting in limited performance in few-shot ISC. To address the above issues, this paper combines traditional ISC methods with deep models and proposes a novel few-shot ISC approach called Deep Class-specific Collaborative Representation (DCSCR) network to simultaneously learn the frame- and concept-level feature representations of each image set and the distance similarities between different sets. Specifically, DCSCR consists of a fully convolutional deep feature extractor module, a global feature learning module, and a class-specific collaborative representation-based metric learning module. The deep feature extractor and global feature learning modules are used to learn (local and global) frame-level feature representations, while the class-specific collaborative representation-based metric learning module is exploit to adaptively learn the concept-level feature representation of each image set and thus obtain the distance similarities between different sets by developing a new CSCR-based contrastive loss function. Extensive experiments on several well-known few-shot ISC datasets demonstrate the effectiveness of the proposed method compared with some state-of-the-art image set classification algorithms.
zh
[CV-50] Frequency-Driven Inverse Kernel Prediction for Single Image Defocus Deblurring
【速读】:该论文旨在解决单张图像离焦模糊去模糊(single image defocus deblurring)任务中,由于局部高频细节缺失导致的空间变化模糊核(spatially varying blur kernels)建模不准确的问题。现有方法主要依赖空间特征进行核估计,在严重模糊区域性能显著下降。其解决方案的关键在于提出一种频率驱动的逆核预测网络(Frequency-Driven Inverse Kernel Prediction, FDIKP),通过引入频域表示增强核建模的结构可辨识性;具体包括:设计双分支逆核预测(Dual-Branch Inverse Kernel Prediction, DIKP)策略以提升核估计精度与稳定性,采用位置自适应卷积(Position Adaptive Convolution, PAC)增强去卷积过程的适应性,并构建双域尺度递归模块(Dual-Domain Scale Recurrent Module, DSRM)实现从粗到细的渐进式质量提升。
链接: https://arxiv.org/abs/2508.12736
作者: Ying Zhang,Xiongxin Tang,Chongyi Li,Qiao Chen,Yuquan Wu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Single image defocus deblurring aims to recover an all-in-focus image from a defocus counterpart, where accurately modeling spatially varying blur kernels remains a key challenge. Most existing methods rely on spatial features for kernel estimation, but their performance degrades in severely blurry regions where local high-frequency details are missing. To address this, we propose a Frequency-Driven Inverse Kernel Prediction network (FDIKP) that incorporates frequency-domain representations to enhance structural identifiability in kernel modeling. Given the superior discriminative capability of the frequency domain for blur modeling, we design a Dual-Branch Inverse Kernel Prediction (DIKP) strategy that improves the accuracy of kernel estimation while maintaining stability. Moreover, considering the limited number of predicted inverse kernels, we introduce a Position Adaptive Convolution (PAC) to enhance the adaptability of the deconvolution process. Finally, we propose a Dual-Domain Scale Recurrent Module (DSRM) to fuse deconvolution results and progressively improve deblurring quality from coarse to fine. Extensive experiments demonstrate that our method outperforms existing approaches. Code will be made publicly available.
zh
[CV-51] Quantifying and Alleviating Co-Adaptation in Sparse-View 3D Gaussian Splatting
【速读】:该论文旨在解决稀疏视角下3D高斯溅射(3D Gaussian Splatting, 3DGS)在新视角合成中出现的外观伪影问题。研究表明,当前方法在稀疏视图条件下过度纠缠优化后的高斯分布以拟合训练视图,从而忽视了场景真实的外观分布,导致新视角中出现视觉伪影。解决方案的关键在于识别并缓解这种“共适应”(co-adaptation)现象——即多个高斯之间因协同调整而产生的过度依赖。作者提出两个轻量级、可插拔的策略:(1) 随机高斯丢弃(random gaussian dropout);(2) 向不透明度引入乘性噪声(multiplicative noise injection to the opacity),二者均能有效降低高斯间的共适应程度,提升稀疏视图下的渲染质量。
链接: https://arxiv.org/abs/2508.12720
作者: Kangjie Chen,Yingji Zhong,Zhihao Li,Jiaqi Lin,Youyu Chen,Minghan Qin,Haoqian Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review. Project page: this https URL
Abstract:3D Gaussian Splatting (3DGS) has demonstrated impressive performance in novel view synthesis under dense-view settings. However, in sparse-view scenarios, despite the realistic renderings in training views, 3DGS occasionally manifests appearance artifacts in novel views. This paper investigates the appearance artifacts in sparse-view 3DGS and uncovers a core limitation of current approaches: the optimized Gaussians are overly-entangled with one another to aggressively fit the training views, which leads to a neglect of the real appearance distribution of the underlying scene and results in appearance artifacts in novel views. The analysis is based on a proposed metric, termed Co-Adaptation Score (CA), which quantifies the entanglement among Gaussians, i.e., co-adaptation, by computing the pixel-wise variance across multiple renderings of the same viewpoint, with different random subsets of Gaussians. The analysis reveals that the degree of co-adaptation is naturally alleviated as the number of training views increases. Based on the analysis, we propose two lightweight strategies to explicitly mitigate the co-adaptation in sparse-view 3DGS: (1) random gaussian dropout; (2) multiplicative noise injection to the opacity. Both strategies are designed to be plug-and-play, and their effectiveness is validated across various methods and benchmarks. We hope that our insights into the co-adaptation effect will inspire the community to achieve a more comprehensive understanding of sparse-view 3DGS.
zh
[CV-52] Single-Reference Text-to-Image Manipulation with Dual Contrastive Denoising Score
【速读】:该论文旨在解决当前基于文本到图像生成模型(text-to-image generative models)在真实图像编辑中的两个核心问题:一是用户难以编写精确描述输入图像所有视觉细节的文本提示(prompt),二是现有方法在局部区域实现期望修改时,常导致整体内容发生显著变化,并在非目标区域引入不可控的干扰。解决方案的关键在于提出一种名为“双对比去噪得分”(Dual Contrastive Denoising Score)的框架,其核心创新是引入一种简洁但有效的双对比损失(dual contrastive loss),利用预训练扩散模型中自注意力层中间表示的丰富空间信息,无需额外辅助网络即可实现结构保持与灵活内容修改的统一,从而在不进行微调的情况下完成零样本图像到图像翻译和高质量真实图像编辑。
链接: https://arxiv.org/abs/2508.12718
作者: Syed Muhmmad Israr,Feng Zhao
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Large-scale text-to-image generative models have shown remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is difficult for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. To address these challenges, we present Dual Contrastive Denoising Score, a simple yet powerful framework that leverages the rich generative prior of text-to-image diffusion models. Inspired by contrastive learning approaches for unpaired image-to-image translation, we introduce a straightforward dual contrastive loss within the proposed framework. Our approach utilizes the extensive spatial information from the intermediate representations of the self-attention layers in latent diffusion models without depending on auxiliary networks. Our method achieves both flexible content modification and structure preservation between input and output images, as well as zero-shot image-to-image translation. Through extensive experiments, we show that our approach outperforms existing methods in real image editing while maintaining the capability to directly utilize pretrained text-to-image diffusion models without further training.
zh
[CV-53] Real-Time Sign Language Gestures to Speech Transcription using Deep Learning
【速读】:该论文旨在解决听力和言语障碍者在日常环境中因沟通障碍而难以有效交流的问题。其解决方案的关键在于利用深度学习技术构建一个实时辅助系统,通过卷积神经网络(Convolutional Neural Networks, CNN)对摄像头捕捉的手语手势进行分类,并结合文本转语音(Text-to-Speech, TTS)合成技术将识别结果即时转化为文字和可听语音,从而实现手语到自然语言的实时翻译,提升使用者在多样化社交场景中的自主性和融合度。
链接: https://arxiv.org/abs/2508.12713
作者: Brandone Fonya
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Course related research project
Abstract:Communication barriers pose significant challenges for individuals with hearing and speech impairments, often limiting their ability to effectively interact in everyday environments. This project introduces a real-time assistive technology solution that leverages advanced deep learning techniques to translate sign language gestures into textual and audible speech. By employing convolution neural networks (CNN) trained on the Sign Language MNIST dataset, the system accurately classifies hand gestures captured live via webcam. Detected gestures are instantaneously translated into their corresponding meanings and transcribed into spoken language using text-to-speech synthesis, thus facilitating seamless communication. Comprehensive experiments demonstrate high model accuracy and robust real-time performance with some latency, highlighting the system’s practical applicability as an accessible, reliable, and user-friendly tool for enhancing the autonomy and integration of sign language users in diverse social settings.
zh
[CV-54] Argos: A Decentralized Federated System for Detection of Traffic Signs in CAVs
【速读】:该论文旨在解决车联网中交通标志检测任务面临的隐私泄露与通信瓶颈问题,传统集中式机器学习方法需上传原始传感器数据至中心服务器,存在数据安全风险且难以应对大规模车辆协同训练的通信开销。其解决方案的关键在于提出一种去中心化的联邦学习(Federated Learning, FL)框架,通过在车辆端部署轻量级目标检测器并按类别划分数据集实现本地专业化训练,结合FedProx、FedAdam和FedAVG等聚合算法在Flower仿真环境中进行模型参数聚合,从而在不共享原始数据的前提下实现跨车辆的协同建模。实验证明,该框架在处理非独立同分布(non-IID)数据时仍能保持较高准确率(最高达0.83),尤其FedProx在异构场景下表现最优,验证了其在保障隐私的同时提升模型泛化能力的有效性。
链接: https://arxiv.org/abs/2508.12712
作者: Seyed Mahdi Haji Seyed Hossein(ECE Department, University of Tehran, Tehran, Iran),Alireza Hosseini(ECE Department, University of Tehran, Tehran, Iran),Soheil Hajian Manesh(ECE Department, University of Tehran, Tehran, Iran),Amirali Shahriary(ECE Department, University of Tehran, Tehran, Iran)
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 10 figures
Abstract:Connected and automated vehicles generate vast amounts of sensor data daily, raising significant privacy and communication challenges for centralized machine learning approaches in perception tasks. This study presents a decentralized, federated learning framework tailored for traffic sign detection in vehicular networks to enable collaborative model training without sharing raw data. The framework partitioned traffic sign classes across vehicles for specialized local training using lightweight object detectors, aggregated model parameters via algorithms like FedProx, FedAdam and FedAVG in a simulated environment with the Flower framework, and evaluated multiple configurations including varying server rounds, local epochs, client participation fractions, and data distributions. Experiments demonstrated that increasing server rounds from 2 to 20 boosted accuracy from below 0.1 to over 0.8, moderate local epochs (8-10) provided optimal efficiency with accuracies around 0.67, higher client participation fractions enhanced generalization up to 0.83, FedProx outperformed other aggregators in handling heterogeneity, non-IID data distributions reduced performance compared to IID, and training duration primarily scaled with the number of rounds rather than aggregation strategy. We conclude that this federated approach may offer a scalable, privacy-preserving solution for real-world vehicular deployments, potentially guiding future integrations of robust aggregation and communication optimizations to advance intelligent transportation systems.
zh
[CV-55] Drifting Away from Truth: GenAI-Driven News Diversity Challenges LVLM-Based Misinformation Detection
【速读】:该论文旨在解决生成式 AI(GenAI)驱动的新闻多样性所引发的多层级漂移问题,这显著削弱了当前基于大视觉语言模型(LVLM)的多模态虚假信息检测(MMD)系统的鲁棒性。其关键解决方案是构建 DriftBench——一个包含 16,000 条新闻实例的大规模基准测试集,涵盖六类内容多样化场景,并设计三项核心评估任务:(1)在多层级漂移下真实性验证的鲁棒性;(2)对 GenAI 生成的对抗性证据污染的敏感性;(3)跨多样化输入的推理一致性分析。实验表明,现有 LVLM 检测器平均 F1 分数下降 14.8%,且推理轨迹不稳定,凸显出系统固有脆弱性,从而推动面向 GenAI 时代的更鲁棒检测方法的发展。
链接: https://arxiv.org/abs/2508.12711
作者: Fanxiao Li,Jiaying Wu,Tingchao Fu,Yunyun Dong,Bingbing Song,Wei Zhou
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The proliferation of multimodal misinformation poses growing threats to public discourse and societal trust. While Large Vision-Language Models (LVLMs) have enabled recent progress in multimodal misinformation detection (MMD), the rise of generative AI (GenAI) tools introduces a new challenge: GenAI-driven news diversity, characterized by highly varied and complex content. We show that this diversity induces multi-level drift, comprising (1) model-level misperception drift, where stylistic variations disrupt a model’s internal reasoning, and (2) evidence-level drift, where expression diversity degrades the quality or relevance of retrieved external evidence. These drifts significantly degrade the robustness of current LVLM-based MMD systems. To systematically study this problem, we introduce DriftBench, a large-scale benchmark comprising 16,000 news instances across six categories of diversification. We design three evaluation tasks: (1) robustness of truth verification under multi-level drift; (2) susceptibility to adversarial evidence contamination generated by GenAI; and (3) analysis of reasoning consistency across diverse inputs. Experiments with six state-of-the-art LVLM-based detectors show substantial performance drops (average F1 -14.8%) and increasingly unstable reasoning traces, with even more severe failures under adversarial evidence injection. Our findings uncover fundamental vulnerabilities in existing MMD systems and suggest an urgent need for more resilient approaches in the GenAI era.
zh
[CV-56] Neural Rendering for Sensor Adaptation in 3D Object Detection
【速读】:该论文旨在解决自动驾驶车辆中因不同传感器配置(如紧凑型车与SUV之间的摄像头布局差异)导致的跨传感器域差距(cross-sensor domain gap)问题,该差距会显著降低3D目标检测模型在新传感器设置下的性能。解决方案的关键在于:一方面,发现基于密集鸟瞰图(Bird’s Eye View, BEV)表示并结合后向投影机制的模型架构(如BEVFormer)对传感器配置变化具有更强的鲁棒性;另一方面,提出一种基于神经渲染的数据驱动传感器适配流水线,能够将整个数据集转换为匹配不同摄像头配置的形式,从而大幅缓解域差距,并减少对新数据采集的依赖,实现跨车型传感器配置下的高效数据复用。
链接: https://arxiv.org/abs/2508.12695
作者: Felix Embacher,David Holtz,Jonas Uhrig,Marius Cordts,Markus Enzweiler
机构: Mercedes-Benz AG (梅赛德斯-奔驰集团); Institute for Intelligent Systems, Esslingen University of Applied Sciences (埃斯林根应用技术大学智能系统研究所); University of Stuttgart (斯图加特大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted at IEEE Intelligent Vehicles Symposium (IV) 2025
Abstract:Autonomous vehicles often have varying camera sensor setups, which is inevitable due to restricted placement options for different vehicle types. Training a perception model on one particular setup and evaluating it on a new, different sensor setup reveals the so-called cross-sensor domain gap, typically leading to a degradation in accuracy. In this paper, we investigate the impact of the cross-sensor domain gap on state-of-the-art 3D object detectors. To this end, we introduce CamShift, a dataset inspired by nuScenes and created in CARLA to specifically simulate the domain gap between subcompact vehicles and sport utility vehicles (SUVs). Using CamShift, we demonstrate significant cross-sensor performance degradation, identify robustness dependencies on model architecture, and propose a data-driven solution to mitigate the effect. On the one hand, we show that model architectures based on a dense Bird’s Eye View (BEV) representation with backward projection, such as BEVFormer, are the most robust against varying sensor configurations. On the other hand, we propose a novel data-driven sensor adaptation pipeline based on neural rendering, which can transform entire datasets to match different camera sensor setups. Applying this approach improves performance across all investigated 3D object detectors, mitigating the cross-sensor domain gap by a large margin and reducing the need for new data collection by enabling efficient data reusability across vehicles with different sensor setups. The CamShift dataset and the sensor adaptation benchmark are available at this https URL.
zh
[CV-57] Multi-Level Knowledge Distillation and Dynamic Self-Supervised Learning for Continual Learning
【速读】:该论文旨在解决类增量学习中重复引入已训练类(Class-incremental with repetition, CIR)场景下的模型稳定性与可塑性平衡问题,该场景更贴近现实应用,其中可访问大量外部未标注数据。解决方案的关键在于提出两个核心组件:一是多层级知识蒸馏(Multi-level Knowledge Distillation, MLKD),通过从多个历史模型中跨视角(如特征和logits)蒸馏知识,增强模型对先前任务的保留能力;二是动态自监督损失(Dynamic Self-supervised Loss, SSL),利用未标注数据加速新类的学习,同时通过动态权重调整确保训练焦点始终聚焦于主任务,从而显著提升CIR设置下的性能表现。
链接: https://arxiv.org/abs/2508.12692
作者: Taeheon Kim,San Kim,Minhyuk Seo,Dongjae Jeon,Wonje Jeong,Jonghyun Choi
机构: Yonsei University (延世大学); Seoul National University (首尔国立大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Class-incremental with repetition (CIR), where previously trained classes repeatedly introduced in future tasks, is a more realistic scenario than the traditional class incremental setup, which assumes that each task contains unseen classes. CIR assumes that we can easily access abundant unlabeled data from external sources, such as the Internet. Therefore, we propose two components that efficiently use the unlabeled data to ensure the high stability and the plasticity of models trained in CIR setup. First, we introduce multi-level knowledge distillation (MLKD) that distills knowledge from multiple previous models across multiple perspectives, including features and logits, so the model can maintain much various previous knowledge. Moreover, we implement dynamic self-supervised loss (SSL) to utilize the unlabeled data that accelerates the learning of new classes, while dynamic weighting of SSL keeps the focus of training to the primary task. Both of our proposed components significantly improve the performance in CIR setup, achieving 2nd place in the CVPR 5th CLVISION Challenge.
zh
[CV-58] MixCache: Mixture-of-Cache for Video Diffusion Transformer Acceleration
【速读】:该论文旨在解决视频扩散模型(video DiT)在生成过程中因多步迭代去噪导致的高计算成本与推理延迟问题。现有缓存(caching)方法受限于单一粒度策略,难以灵活平衡生成质量和推理速度。其解决方案的关键在于提出一种无需训练的混合缓存框架 MixCache:首先区分不同缓存策略间的干扰与边界,进而设计上下文感知的缓存触发机制以决定何时启用缓存,并引入自适应混合缓存决策策略动态选择最优缓存粒度(如步骤、CFG 或模块层级),从而在不牺牲生成质量的前提下显著提升推理效率(例如在 Wan 14B 和 HunyuanVideo 上分别实现 1.94× 和 1.97× 的加速)。
链接: https://arxiv.org/abs/2508.12691
作者: Yuanxin Wei,Lansong Diao,Bujiao Chen,Shenggan Cheng,Zhengping Qian,Wenyuan Yu,Nong Xiao,Wei Lin,Jiangsu Du
机构: Nanjing University of Aeronautics and Astronautics (南京航空航天大学); Tsinghua University (清华大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 10 figures
Abstract:Leveraging the Transformer architecture and the diffusion process, video DiT models have emerged as a dominant approach for high-quality video generation. However, their multi-step iterative denoising process incurs high computational cost and inference latency. Caching, a widely adopted optimization method in DiT models, leverages the redundancy in the diffusion process to skip computations in different granularities (e.g., step, cfg, block). Nevertheless, existing caching methods are limited to single-granularity strategies, struggling to balance generation quality and inference speed in a flexible manner. In this work, we propose MixCache, a training-free caching-based framework for efficient video DiT inference. It first distinguishes the interference and boundary between different caching strategies, and then introduces a context-aware cache triggering strategy to determine when caching should be enabled, along with an adaptive hybrid cache decision strategy for dynamically selecting the optimal caching granularity. Extensive experiments on diverse models demonstrate that, MixCache can significantly accelerate video generation (e.g., 1.94 \times speedup on Wan 14B, 1.97 \times speedup on HunyuanVideo) while delivering both superior generation quality and inference efficiency compared to baseline methods.
zh
[CV-59] A-DAME: Test-Time Adaptation with Domain Augmentation and Model Ensemble for Dynamic Driving Conditions
【速读】:该论文旨在解决测试时适应(Test-time Adaptation, TTA)中模型在目标域分布发生动态变化时性能下降的问题,尤其针对真实驾驶场景中频繁出现的天气域偏移(如昼夜转换)。解决方案的关键在于提出一种名为TTA-DAME的方法:首先利用源域数据增强技术将知识迁移至目标域;其次引入域判别器(domain discriminator)和专用域检测器(domain detector)以缓解剧烈的域偏移;最后通过训练多个检测器并采用非极大值抑制(Non-Maximum Suppression, NMS)融合其预测结果,从而提升模型对动态环境的适应能力。
链接: https://arxiv.org/abs/2508.12690
作者: Dongjae Jeon,Taeheon Kim,Seongwon Cho,Minhyuk Seo,Jonghyun Choi
机构: Yonsei University (延世大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Test-time Adaptation (TTA) poses a challenge, requiring models to dynamically adapt and perform optimally on shifting target domains. This task is particularly emphasized in real-world driving scenes, where weather domain shifts occur frequently. To address such dynamic changes, our proposed method, TTA-DAME, leverages source domain data augmentation into target domains. Additionally, we introduce a domain discriminator and a specialized domain detector to mitigate drastic domain shifts, especially from daytime to nighttime conditions. To further improve adaptability, we train multiple detectors and consolidate their predictions through Non-Maximum Suppression (NMS). Our empirical validation demonstrates the effectiveness of our method, showing significant performance enhancements on the SHIFT Benchmark.
zh
[CV-60] EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在第一人称视角视频(egocentric videos)理解中普遍存在的幻觉问题,即模型生成看似合理但与输入内容不符的错误回答。其解决方案的关键在于构建首个专门针对第一人称视频场景的基准测试集EgoIllusion,该数据集包含1,400段视频及8,000条由人工标注的开放式和封闭式问题,系统性地设计用于诱发视觉和听觉线索相关的幻觉。通过在十种主流MLLMs上的评估,揭示了当前模型在该任务中的显著性能瓶颈(如GPT-4o和Gemini仅达59%准确率),从而为开发更鲁棒、幻觉更低的第一人称多模态模型提供了量化评估工具与研究基础。
链接: https://arxiv.org/abs/2508.12687
作者: Ashish Seth,Utkarsh Tyagi,Ramaneswaran Selvakumar,Nishit Anand,Sonal Kumar,Sreyan Ghosh,Ramani Duraiswami,Chirag Agarwal,Dinesh Manocha
机构: University of Maryland, College Park (马里兰大学学院市分校); University of Virginia (弗吉尼亚大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in complex multimodal tasks. While MLLMs excel at visual perception and reasoning in third-person and egocentric videos, they are prone to hallucinations, generating coherent yet inaccurate responses. We present EgoIllusion, a first benchmark to evaluate MLLM hallucinations in egocentric videos. EgoIllusion comprises 1,400 videos paired with 8,000 human-annotated open and closed-ended questions designed to trigger hallucinations in both visual and auditory cues in egocentric videos. Evaluations across ten MLLMs reveal significant challenges, including powerful models like GPT-4o and Gemini, achieving only 59% accuracy. EgoIllusion lays the foundation in developing robust benchmarks to evaluate the effectiveness of MLLMs and spurs the development of better egocentric MLLMs with reduced hallucination rates. Our benchmark will be open-sourced for reproducibility.
zh
[CV-61] Refine-and-Contrast: Adaptive Instance-Aware BEV Representations for Multi-UAV Collaborative Object Detection
【速读】:该论文旨在解决多无人飞行器(Multi-UAV)协同三维目标检测中,受限于无人机平台计算资源时如何实现高精度与低计算开销之间平衡的问题。解决方案的关键在于提出AdaBEV框架,其核心创新是通过“精炼-对比”(refine-and-contrast)范式学习自适应的、实例感知的鸟瞰图(BEV)表示:首先利用Box-Guided Refinement Module(BG-RM)仅对与前景实例相关的BEV网格进行基于2D监督和空间划分的精细化处理,提升局部语义感知能力;其次引入Instance-Background Contrastive Learning(IBCL),在BEV空间中通过对比学习增强前景与背景特征的可区分性,从而在保持低分辨率BEV输入的同时显著提升检测性能,实现在不同模型规模下优于现有方法的精度-计算权衡表现。
链接: https://arxiv.org/abs/2508.12684
作者: Zhongyao Li,Peirui Cheng,Liangjin Zhao,Chen Chen,Yundu Li,Zhechao Wang,Xue Yang,Xian Sun,Zhirui Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 9 pages
Abstract:Multi-UAV collaborative 3D detection enables accurate and robust perception by fusing multi-view observations from aerial platforms, offering significant advantages in coverage and occlusion handling, while posing new challenges for computation on resource-constrained UAV platforms. In this paper, we present AdaBEV, a novel framework that learns adaptive instance-aware BEV representations through a refine-and-contrast paradigm. Unlike existing methods that treat all BEV grids equally, AdaBEV introduces a Box-Guided Refinement Module (BG-RM) and an Instance-Background Contrastive Learning (IBCL) to enhance semantic awareness and feature discriminability. BG-RM refines only BEV grids associated with foreground instances using 2D supervision and spatial subdivision, while IBCL promotes stronger separation between foreground and background features via contrastive learning in BEV space. Extensive experiments on the Air-Co-Pred dataset demonstrate that AdaBEV achieves superior accuracy-computation trade-offs across model scales, outperforming other state-of-the-art methods at low resolutions and approaching upper bound performance while maintaining low-resolution BEV inputs and negligible overhead.
zh
[CV-62] WP-CLIP: Leverag ing CLIP to Predict Wölfflins Principles in Visual Art ICCV2025
【速读】:该论文旨在解决现有视觉评估指标无法有效预测Wölfflin五原则(Wölfflin’s five principles)的问题,这些问题涉及视觉艺术中的风格差异,如色彩、构图和主题选择等抽象特征。解决方案的关键在于利用预训练的视觉-语言模型CLIP,并通过在真实艺术图像标注数据集上进行微调,使其能够量化并预测每个原则的得分,从而构建出WP-CLIP模型。该方法显著提升了模型在GAN生成画作和Pandora-18K艺术数据集上的泛化能力,验证了视觉-语言模型(VLMs)在自动化艺术分析中的潜力。
链接: https://arxiv.org/abs/2508.12668
作者: Abhijay Ghildyal,Li-Yun Wang,Feng Liu
机构: Portland State University (波特兰州立大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025 AI4VA workshop (oral), Code: this https URL
Abstract:Wölfflin’s five principles offer a structured approach to analyzing stylistic variations for formal analysis. However, no existing metric effectively predicts all five principles in visual art. Computationally evaluating the visual aspects of a painting requires a metric that can interpret key elements such as color, composition, and thematic choices. Recent advancements in vision-language models (VLMs) have demonstrated their ability to evaluate abstract image attributes, making them promising candidates for this task. In this work, we investigate whether CLIP, pre-trained on large-scale data, can understand and predict Wölfflin’s principles. Our findings indicate that it does not inherently capture such nuanced stylistic elements. To address this, we fine-tune CLIP on annotated datasets of real art images to predict a score for each principle. We evaluate our model, WP-CLIP, on GAN-generated paintings and the Pandora-18K art dataset, demonstrating its ability to generalize across diverse artistic styles. Our results highlight the potential of VLMs for automated art analysis.
zh
[CV-63] Stable Diffusion-Based Approach for Human De-Occlusion
【速读】:该论文旨在解决深度学习模型在严重遮挡情况下准确重建人体结构与外观的难题,即人类视觉系统可借助先验知识和可见线索推断被遮挡区域,而现有方法难以实现这一能力。其核心解决方案在于将去遮挡任务分解为两个阶段:第一阶段利用基于扩散模型的人体先验生成完整的无遮挡掩码(amodal mask),并结合遮挡关节热图提供明确的空间线索;第二阶段以该掩码作为条件输入,引导稳定扩散模型(Stable Diffusion)进行RGB图像重建,并引入基于VQA模型提取的人体特异性文本特征(经CLIP编码),提升生成质量。此外,通过解码器微调缓解了扩散模型因潜在空间变换导致的可见区域像素级退化问题,从而显著优于现有方法,在遮挡恢复与下游人体感知任务(如2D姿态估计和3D人体重建)中均表现出更强性能。
链接: https://arxiv.org/abs/2508.12663
作者: Seung Young Noh,Ju Yong Chang
机构: Kwangwoon University (光云大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: MM 2025
Abstract:Humans can infer the missing parts of an occluded object by leveraging prior knowledge and visible cues. However, enabling deep learning models to accurately predict such occluded regions remains a challenging task. De-occlusion addresses this problem by reconstructing both the mask and RGB appearance. In this work, we focus on human de-occlusion, specifically targeting the recovery of occluded body structures and appearances. Our approach decomposes the task into two stages: mask completion and RGB completion. The first stage leverages a diffusion-based human body prior to provide a comprehensive representation of body structure, combined with occluded joint heatmaps that offer explicit spatial cues about missing regions. The reconstructed amodal mask then serves as a conditioning input for the second stage, guiding the model on which areas require RGB reconstruction. To further enhance RGB generation, we incorporate human-specific textual features derived using a visual question answering (VQA) model and encoded via a CLIP encoder. RGB completion is performed using Stable Diffusion, with decoder fine-tuning applied to mitigate pixel-level degradation in visible regions – a known limitation of prior diffusion-based de-occlusion methods caused by latent space transformations. Our method effectively reconstructs human appearances even under severe occlusions and consistently outperforms existing methods in both mask and RGB completion. Moreover, the de-occluded images generated by our approach can improve the performance of downstream human-centric tasks, such as 2D pose estimation and 3D human reconstruction. The code will be made publicly available.
zh
[CV-64] DyCrowd: Towards Dynamic Crowd Reconstruction from a Large-scene Video
【速读】:该论文旨在解决大规模场景下动态人群的时空一致3D重建问题,现有方法通常基于静态图像进行重建,导致时间不一致性且难以缓解遮挡带来的影响。其关键解决方案在于提出DyCrowd框架,通过分阶段的群体引导运动优化策略(coarse-to-fine group-guided motion optimization)提升对遮挡的鲁棒性,并引入基于变分自编码器(VAE)的人体运动先验与段级群体引导优化机制;核心创新是利用群体行为建模长时动态遮挡下的运动恢复,结合异步运动一致性(Asynchronous Motion Consistency, AMC)损失函数,使未被遮挡的运动片段可指导遮挡区域的运动恢复,从而实现即使在时间不同步和节奏不一致条件下仍具鲁棒性和合理性的3D姿态、位置及形状重建。
链接: https://arxiv.org/abs/2508.12644
作者: Hao Wen,Hongbo Kang,Jian Ma,Jing Huang,Yuanwang Yang,Haozhe Lin,Yu-Kun Lai,Kun Li
机构: Tianjin University (天津大学); Tsinghua University (清华大学); Cardiff University (卡迪夫大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:3D reconstruction of dynamic crowds in large scenes has become increasingly important for applications such as city surveillance and crowd analysis. However, current works attempt to reconstruct 3D crowds from a static image, causing a lack of temporal consistency and inability to alleviate the typical impact caused by occlusions. In this paper, we propose DyCrowd, the first framework for spatio-temporally consistent 3D reconstruction of hundreds of individuals’ poses, positions and shapes from a large-scene video. We design a coarse-to-fine group-guided motion optimization strategy for occlusion-robust crowd reconstruction in large scenes. To address temporal instability and severe occlusions, we further incorporate a VAE (Variational Autoencoder)-based human motion prior along with a segment-level group-guided optimization. The core of our strategy leverages collective crowd behavior to address long-term dynamic occlusions. By jointly optimizing the motion sequences of individuals with similar motion segments and combining this with the proposed Asynchronous Motion Consistency (AMC) loss, we enable high-quality unoccluded motion segments to guide the motion recovery of occluded ones, ensuring robust and plausible motion recovery even in the presence of temporal desynchronization and rhythmic inconsistencies. Additionally, in order to fill the gap of no existing well-annotated large-scene video dataset, we contribute a virtual benchmark dataset, VirtualCrowd, for evaluating dynamic crowd reconstruction from large-scene videos. Experimental results demonstrate that the proposed method achieves state-of-the-art performance in the large-scene dynamic crowd reconstruction task. The code and dataset will be available for research purposes.
zh
[CV-65] Learn Faster and Remember More: Balancing Exploration and Exploitation for Continual Test-time Adaptation
【速读】:该论文旨在解决持续测试时适应(Continual Test-Time Adaptation, CTTA)中探索(exploration)与利用(exploitation)之间的平衡难题:一方面,现有方法多基于神经网络深层输出调整预测,而域变化通常影响浅层特征,导致适应效率低下;另一方面,单一模型在探索新域时会遗忘历史知识,难以复用过往经验处理相似未来域。解决方案的关键在于提出一种均值教师框架(mean teacher framework),通过两个核心机制实现探索与利用的平衡(Balance between Exploration and Exploitation, BEE):其一是引入多级一致性正则化(Multi-level Consistency Regularization, MCR)损失,对齐学生模型与教师模型的中间特征,加速当前域的适应;其二是设计互补锚点回放(Complementary Anchor Replay, CAR)机制,重用历史检查点作为锚点,恢复多样域下的互补知识,从而增强对相似未来域的泛化能力。
链接: https://arxiv.org/abs/2508.12643
作者: Pinci Yang,Peisong Wen,Ke Ma,Qianqian Xu
机构: University of Chinese Academy of Sciences (中国科学院大学); Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Continual Test-Time Adaptation (CTTA) aims to adapt a source pre-trained model to continually changing target domains during inference. As a fundamental principle, an ideal CTTA method should rapidly adapt to new domains (exploration) while retaining and exploiting knowledge from previously encountered domains to handle similar domains in the future. Despite significant advances, balancing exploration and exploitation in CTTA is still challenging: 1) Existing methods focus on adjusting predictions based on deep-layer outputs of neural networks. However, domain shifts typically affect shallow features, which are inefficient to be adjusted from deep predictions, leading to dilatory exploration; 2) A single model inevitably forgets knowledge of previous domains during the exploration, making it incapable of exploiting historical knowledge to handle similar future domains. To address these challenges, this paper proposes a mean teacher framework that strikes an appropriate Balance between Exploration and Exploitation (BEE) during the CTTA process. For the former challenge, we introduce a Multi-level Consistency Regularization (MCR) loss that aligns the intermediate features of the student and teacher models, accelerating adaptation to the current domain. For the latter challenge, we employ a Complementary Anchor Replay (CAR) mechanism to reuse historical checkpoints (anchors), recovering complementary knowledge for diverse domains. Experiments show that our method significantly outperforms state-of-the-art methods on several benchmarks, demonstrating its effectiveness for CTTA tasks.
zh
[CV-66] Synthesizing Accurate and Realistic T1-weighted Contrast-Enhanced MR Images using Posterior-Mean Rectified Flow MICCAI
【速读】:该论文旨在解决神经肿瘤学中对比增强(Contrast-enhanced, CE)T1加权磁共振成像(MRI)依赖钆基造影剂所带来的成本高、扫描时间长、环境风险及患者潜在安全性问题。其解决方案的关键在于提出了一种两阶段后验均值修正流(Posterior-Mean Rectified Flow, PMRF)管道:首先使用基于补丁的3D U-Net预测体素级后验均值以最小化均方误差(MSE),随后通过一个时间条件化的3D修正流对初始估计进行精细化处理,从而在不牺牲结构保真度的前提下引入真实纹理细节。该方法在多中心配对的非对比与对比T1w脑部MRI数据集(BraTS 2023–2025)上训练,并在360个多样化测试样本上验证,显示出优异的图像质量指标(如轴向FID降低至12.46,相比后验均值下降约68.7%)和良好的结构一致性(体积MSE为0.057,仅比后验均值高约27%),同时在定性评估中恢复了病灶边界和血管细节,有效平衡了感知真实性与重建失真之间的权衡,具备临床部署潜力。
链接: https://arxiv.org/abs/2508.12640
作者: Bastian Brandstötter,Erich Kobler
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 12 pages, 3 figures, MICCAI workshops (SASHIMI) 2025
Abstract:Contrast-enhanced (CE) T1-weighted MRI is central to neuro-oncologic diagnosis but requires gadolinium-based agents, which add cost and scan time, raise environmental concerns, and may pose risks to patients. In this work, we propose a two-stage Posterior-Mean Rectified Flow (PMRF) pipeline for synthesizing volumetric CE brain MRI from non-contrast inputs. First, a patch-based 3D U-Net predicts the voxel-wise posterior mean (minimizing MSE). Then, this initial estimate is refined by a time-conditioned 3D rectified flow to incorporate realistic textures without compromising structural fidelity. We train this model on a multi-institutional collection of paired pre- and post-contrast T1w volumes (BraTS 2023-2025). On a held-out test set of 360 diverse volumes, our best refined outputs achieve an axial FID of 12.46 and KID of 0.007 ( \sim 68.7% lower FID than the posterior mean) while maintaining low volumetric MSE of 0.057 ( \sim 27% higher than the posterior mean). Qualitative comparisons confirm that our method restores lesion margins and vascular details realistically, effectively navigating the perception-distortion trade-off for clinical deployment.
zh
[CV-67] SpotVLM: Cloud-edge Collaborative Real-time VLM based on Context Transfer
【速读】:该论文旨在解决当前云边协同架构在实时视觉语言模型(Vision-Language Models, VLMs)部署中面临的两大问题:一是无法有效应对云端延迟波动导致的响应不可靠性,二是忽视了延迟但高精度的大型视觉语言模型(Large Vision-Language Models, LVLMs)输出所蕴含的潜在价值。其解决方案的关键在于提出一种名为“Context Transfer”的新范式,将LVLM的延迟输出作为历史上下文,用于实时指导小型视觉语言模型(Small Vision-Language Models, SVLMs)的推理过程,从而实现低延迟与高准确性的协同优化。在此基础上,作者设计了SpotVLM框架,引入上下文替换和视觉焦点模块,以增强文本输入的准确性与视觉定位的一致性,显著提升了多任务场景下的性能表现。
链接: https://arxiv.org/abs/2508.12638
作者: Chen Qian,Xinran Yu,Zewen Huang,Danyang Li,Qiang Ma,Fan Dang,Xuan Ding,Guangyong Shang,Zheng Yang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Vision-Language Models (VLMs) are increasingly deployed in real-time applications such as autonomous driving and human-computer interaction, which demand fast and reliable responses based on accurate perception. To meet these requirements, existing systems commonly employ cloud-edge collaborative architectures, such as partitioned Large Vision-Language Models (LVLMs) or task offloading strategies between Large and Small Vision-Language Models (SVLMs). However, these methods fail to accommodate cloud latency fluctuations and overlook the full potential of delayed but accurate LVLM responses. In this work, we propose a novel cloud-edge collaborative paradigm for VLMs, termed Context Transfer, which treats the delayed outputs of LVLMs as historical context to provide real-time guidance for SVLMs inference. Based on this paradigm, we design SpotVLM, which incorporates both context replacement and visual focus modules to refine historical textual input and enhance visual grounding consistency. Extensive experiments on three real-time vision tasks across four datasets demonstrate the effectiveness of the proposed framework. The new paradigm lays the groundwork for more effective and latency-aware collaboration strategies in future VLM systems.
zh
[CV-68] HOMI: Ultra-Fast EdgeAI platform for Event Cameras
【速读】:该论文旨在解决现有事件相机(event camera)处理方案在边缘机器人应用中面临的三大问题:缺乏完整的端到端实现、延迟较高以及未能充分挖掘事件数据的稀疏性优势。其解决方案的关键在于提出HOMI平台,这是一个基于Prophesee IMX636事件传感器与Xilinx Zynq UltraScale+ MPSoC FPGA的超低延迟边缘AI系统,集成自研AI加速器,并开发了硬件优化的预处理流水线,支持恒定时间与恒定事件模式下的直方图累积及线性和指数时间表面生成。该设计兼顾高精度(DVS手势数据集达94%准确率)与低延迟(1000 fps吞吐量),同时保持紧凑的内存占用和仅33% LUT资源利用率,为后续性能提升、模型并行化或多任务部署预留充足空间。
链接: https://arxiv.org/abs/2508.12637
作者: Shankaranarayanan H,Satyapreet Singh Yadav,Adithya Krishna,Ajay Vikram P,Mahesh Mehendale,Chetan Singh Thakur
机构: Department of Electronic Systems Engineering, Indian Institute of Science, Bangalore, India 560012
类目: Hardware Architecture (cs.AR); Computer Vision and Pattern Recognition (cs.CV); Emerging Technologies (cs.ET); Neural and Evolutionary Computing (cs.NE)
备注:
Abstract:Event cameras offer significant advantages for edge robotics applications due to their asynchronous operation and sparse, event-driven output, making them well-suited for tasks requiring fast and efficient closed-loop control, such as gesture-based human-robot interaction. Despite this potential, existing event processing solutions remain limited, often lacking complete end-to-end implementations, exhibiting high latency, and insufficiently exploiting event data sparsity. In this paper, we present HOMI, an ultra-low latency, end-to-end edge AI platform comprising a Prophesee IMX636 event sensor chip with an Xilinx Zynq UltraScale+MPSoC FPGA chip, deploying an in-house developed AI accelerator. We have developed hardware-optimized pre-processing pipelines supporting both constant-time and constant-event modes for histogram accumulation, linear and exponential time surfaces. Our general-purpose implementation caters to both accuracy-driven and low-latency applications. HOMI achieves 94% accuracy on the DVS Gesture dataset as a use case when configured for high accuracy operation and provides a throughput of 1000 fps for low-latency configuration. The hardware-optimised pipeline maintains a compact memory footprint and utilises only 33% of the available LUT resources on the FPGA, leaving ample headroom for further latency reduction, model parallelisation, multi-task deployments, or integration of more complex architectures.
zh
[CV-69] Creative4U: MLLM s-based Advertising Creative Image Selector with Comparative Reasoning
【速读】:该论文旨在解决广告创意图像(creative image)在生成式 AI(AIGC)技术普及背景下,广告主难以评估其创意质量以进行可解释性选择的问题。现有方法多集中于创意排序(creative ranking),缺乏对选择依据的解释能力,无法满足实际应用中对透明决策的需求。解决方案的关键在于提出首个面向可解释创意评估与选择的范式——Creative4U,该系统基于多模态大语言模型(MLLMs),将创意图像的评估与选择统一建模为自然语言生成任务,并引入一种名为 Reason-to-Select RFT 的训练框架,包含基于思维链(Chain-of-Thought, CoT-SFT)的监督微调和基于群体相对策略优化(Group Relative Policy Optimization, GRPO)的强化学习,从而实现高精度且具备解释性的创意图像筛选,同时考虑用户兴趣偏好。
链接: https://arxiv.org/abs/2508.12628
作者: Yukang Lin,Xiang Zhang,Shichang Jia,Bowen Wan,Chenghan Fu,Xudong Ren,Yueran Liu,Wanxian Guan,Pengji Wang,Jian Xu,Bo Zheng,Baolin Liu
机构: Alibaba Group(阿里巴巴集团); University of Science and Technology Beijing(北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Creative image in advertising is the heart and soul of e-commerce platform. An eye-catching creative image can enhance the shopping experience for users, boosting income for advertisers and advertising revenue for platforms. With the advent of AIGC technology, advertisers can produce large quantities of creative images at minimal cost. However, they struggle to assess the creative quality to select. Existing methods primarily focus on creative ranking, which fails to address the need for explainable creative selection. In this work, we propose the first paradigm for explainable creative assessment and selection. Powered by multimodal large language models (MLLMs), our approach integrates the assessment and selection of creative images into a natural language generation task. To facilitate this research, we construct CreativePair, the first comparative reasoning-induced creative dataset featuring 8k annotated image pairs, with each sample including a label indicating which image is superior. Additionally, we introduce Creative4U (pronounced Creative for You), a MLLMs-based creative selector that takes into account users’ interests. Through Reason-to-Select RFT, which includes supervised fine-tuning with Chain-of-Thought (CoT-SFT) and Group Relative Policy Optimization (GRPO) based reinforcement learning, Creative4U is able to evaluate and select creative images accurately. Both offline and online experiments demonstrate the effectiveness of our approach. Our code and dataset will be made public to advance research and industrial applications. Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.12628 [cs.CV] (or arXiv:2508.12628v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.12628 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-70] WIPES: Wavelet-based Visual Primitives
【速读】:该论文旨在解决现有三维视觉与图形表示方法在频域调控灵活性和渲染速度之间的权衡问题,特别是针对基于隐式神经表示(Implicit Neural Representations, INRs)或高斯基元的方案所面临的频谱信息损失和计算效率低下的挑战。其核心解决方案是提出一种通用的基于小波(Wavelet-based)的视觉原语 WIPES(Wavelet-based vIsual PrimitivES),利用小波在空间-频率域上的局部化优势,同时有效捕捉低频结构(“森林”)与高频细节(“树木”),并进一步设计了一个基于小波的可微光栅化器(differentiable rasterizer),从而实现高质量且快速的视觉重建与渲染。实验表明,WIPES 在多种视觉任务中均优于 INR 和高斯基元方法,在渲染质量和推理速度上表现更优。
链接: https://arxiv.org/abs/2508.12615
作者: Wenhao Zhang,Hao Zhu,Delong Wu,Di Kang,Linchao Bao,Zhan Ma,Xun Cao
机构: Nanjing University (南京大学); Tencent (腾讯)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: IEEE/CVF International Conference on Computer Vision
Abstract:Pursuing a continuous visual representation that offers flexible frequency modulation and fast rendering speed has recently garnered increasing attention in the fields of 3D vision and graphics. However, existing representations often rely on frequency guidance or complex neural network decoding, leading to spectrum loss or slow rendering. To address these limitations, we propose WIPES, a universal Wavelet-based vIsual PrimitivES for representing multi-dimensional visual signals. Building on the spatial-frequency localization advantages of wavelets, WIPES effectively captures both the low-frequency “forest” and the high-frequency “trees.” Additionally, we develop a wavelet-based differentiable rasterizer to achieve fast visual rendering. Experimental results on various visual tasks, including 2D image representation, 5D static and 6D dynamic novel view synthesis, demonstrate that WIPES, as a visual primitive, offers higher rendering quality and faster inference than INR-based methods, and outperforms Gaussian-based representations in rendering quality.
zh
[CV-71] OpenMoCap: Rethinking Optical Motion Capture under Real-world Occlusion
【速读】:该论文旨在解决光学动作捕捉(Optical Motion Capture)在实际应用中因大规模标记点遮挡导致系统性能严重下降的问题。现有模型存在两大局限:一是缺乏真实反映标记点遮挡模式的训练数据集,二是缺少能够捕捉标记点间长程依赖关系的训练策略。解决方案的关键在于提出两个核心贡献:其一,构建了CMU-Occlu数据集,利用光线追踪技术模拟现实中的标记点遮挡模式;其二,设计了OpenMoCap模型,通过标记点-关节链推理机制实现标记点与关节之间深度约束的同时优化与构建,从而显著提升复杂遮挡环境下的运动解算鲁棒性。实验表明,OpenMoCap在多种场景下均优于现有方法,并已集成至MoSen MoCap系统用于实际部署。
链接: https://arxiv.org/abs/2508.12610
作者: Chen Qian,Danyang Li,Xinran Yu,Zheng Yang,Qiang Ma
机构: Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Optical motion capture is a foundational technology driving advancements in cutting-edge fields such as virtual reality and film production. However, system performance suffers severely under large-scale marker occlusions common in real-world applications. An in-depth analysis identifies two primary limitations of current models: (i) the lack of training datasets accurately reflecting realistic marker occlusion patterns, and (ii) the absence of training strategies designed to capture long-range dependencies among markers. To tackle these challenges, we introduce the CMU-Occlu dataset, which incorporates ray tracing techniques to realistically simulate practical marker occlusion patterns. Furthermore, we propose OpenMoCap, a novel motion-solving model designed specifically for robust motion capture in environments with significant occlusions. Leveraging a marker-joint chain inference mechanism, OpenMoCap enables simultaneous optimization and construction of deep constraints between markers and joints. Extensive comparative experiments demonstrate that OpenMoCap consistently outperforms competing methods across diverse scenarios, while the CMU-Occlu dataset opens the door for future studies in robust motion solving. The proposed OpenMoCap is integrated into the MoSen MoCap system for practical deployment. The code is released at: this https URL.
zh
[CV-72] ViDA-UGC: Detailed Image Quality Analysis via Visual Distortion Assessment for UGC Images
【速读】:该论文旨在解决当前可解释图像质量评估(Explainable Image Quality Assessment, IQA)方法在处理用户生成内容(User-Generated Content, UGC)与AI生成内容(AI-Generated Content, AIGC)时,未能充分区分两者失真特征、且缺乏细粒度质量分析能力的问题。其解决方案的关键在于构建首个大规模UGC图像失真评估指令微调数据集ViDA-UGC,该数据集包含11K张带精细质量标注的图像,并通过基于失真导向的流程(包括人类主观标注与Chain-of-Thought, CoT评估框架)来引导GPT-4o识别并分析UGC图像中的失真模式,从而捕捉与失真类型强相关的低层视觉特征;同时,从中筛选出476张图像及其6,149个问答对,形成首个UGC失真评估基准ViDA-UGC-Bench,显著提升了多类基础多模态大语言模型(Multimodal Large Language Models, MLLMs)在该基准上的质量分析能力,甚至超越GPT-4o本身。
链接: https://arxiv.org/abs/2508.12605
作者: Wenjie Liao,Jieyu Yuan,Yifang Xu,Chunle Guo,Zilong Zhang,Jihong Li,Jiachen Fu,Haotian Fan,Tao Li,Junhui Cui,Chongyi Li
机构: ByteDance Inc.(字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in Multimodal Large Language Models (MLLMs) have introduced a paradigm shift for Image Quality Assessment (IQA) from unexplainable image quality scoring to explainable IQA, demonstrating practical applications like quality control and optimization guidance. However, current explainable IQA methods not only inadequately use the same distortion criteria to evaluate both User-Generated Content (UGC) and AI-Generated Content (AIGC) images, but also lack detailed quality analysis for monitoring image quality and guiding image restoration. In this study, we establish the first large-scale Visual Distortion Assessment Instruction Tuning Dataset for UGC images, termed ViDA-UGC, which comprises 11K images with fine-grained quality grounding, detailed quality perception, and reasoning quality description data. This dataset is constructed through a distortion-oriented pipeline, which involves human subject annotation and a Chain-of-Thought (CoT) assessment framework. This framework guides GPT-4o to generate quality descriptions by identifying and analyzing UGC distortions, which helps capturing rich low-level visual features that inherently correlate with distortion patterns. Moreover, we carefully select 476 images with corresponding 6,149 question answer pairs from ViDA-UGC and invite a professional team to ensure the accuracy and quality of GPT-generated information. The selected and revised data further contribute to the first UGC distortion assessment benchmark, termed ViDA-UGC-Bench. Experimental results demonstrate the effectiveness of the ViDA-UGC and CoT framework for consistently enhancing various image quality analysis abilities across multiple base MLLMs on ViDA-UGC-Bench and Q-Bench, even surpassing GPT-4o.
zh
[CV-73] ViLaD: A Large Vision Language Diffusion Framework for End-to-End Autonomous Driving
【速读】:该论文旨在解决基于视觉语言模型(Vision Language Models, VLMs)的端到端自动驾驶系统在实际应用中面临的两大核心问题:一是自回归架构导致的高推理延迟,二是缺乏双向推理能力,难以适应动态、安全关键场景。解决方案的关键在于提出一种名为ViLaD(Large Vision Language Diffusion)的新框架,其核心创新是采用掩码扩散模型(masked diffusion model),实现整个驾驶决策序列的并行生成,从而显著降低计算延迟;同时,该架构支持双向推理,能够同时考虑历史与未来信息,并通过逐步“由易到难”的生成策略迭代提升决策质量,最终在nuScenes数据集上实现了更高的规划准确率和近乎零的失败率,且已在真实自动驾驶车辆上成功部署用于交互式泊车任务,验证了其工程可行性与鲁棒性。
链接: https://arxiv.org/abs/2508.12603
作者: Can Cui,Yupeng Zhou,Juntong Peng,Sung-Yeon Park,Zichong Yang,Prashanth Sankaranarayanan,Jiaru Zhang,Ruqi Zhang,Ziran Wang
机构: Purdue University (普渡大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:End-to-end autonomous driving systems built on Vision Language Models (VLMs) have shown significant promise, yet their reliance on autoregressive architectures introduces some limitations for real-world applications. The sequential, token-by-token generation process of these models results in high inference latency and cannot perform bidirectional reasoning, making them unsuitable for dynamic, safety-critical environments. To overcome these challenges, we introduce ViLaD, a novel Large Vision Language Diffusion (LVLD) framework for end-to-end autonomous driving that represents a paradigm shift. ViLaD leverages a masked diffusion model that enables parallel generation of entire driving decision sequences, significantly reducing computational latency. Moreover, its architecture supports bidirectional reasoning, allowing the model to consider both past and future simultaneously, and supports progressive easy-first generation to iteratively improve decision quality. We conduct comprehensive experiments on the nuScenes dataset, where ViLaD outperforms state-of-the-art autoregressive VLM baselines in both planning accuracy and inference speed, while achieving a near-zero failure rate. Furthermore, we demonstrate the framework’s practical viability through a real-world deployment on an autonomous vehicle for an interactive parking task, confirming its effectiveness and soundness for practical applications.
zh
[CV-74] Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在多模态场景下依赖自然语言链式思维(Chain-of-Thought, CoT)提示所导致的跨模态对齐困难问题,即文本、视觉与音频信息难以动态协同推理。其解决方案的关键在于提出一种新的多模态连续思维框架——Multimodal Chain of Continuous Thought (MCOUT),该方法将推理状态表示为联合潜在空间中的连续隐藏向量,并通过迭代优化与视觉及文本嵌入对齐,模拟人类反思性认知过程。该方案摒弃了传统基于离散语言序列的推理范式,显著提升了多模态模型在复杂任务中的推理能力与一致性。
链接: https://arxiv.org/abs/2508.12587
作者: Tan-Hanh Pham,Chris Ngo
机构: Harvard Medical School (哈佛医学院); Harvard University (哈佛大学); Athinoula A. Martinos Center for Biomedical Imaging (阿辛努拉·马丁诺斯生物医学成像中心); Massachusetts General Hospital (马萨诸塞州总医院); Knovel Engineering Lab (Knovel 工程实验室)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model`s last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at this https URL.
zh
[CV-75] Foundation Model for Skeleton-Based Human Action Understanding
【速读】:该论文旨在解决骨架表示在人类动作理解任务中缺乏可扩展性和泛化能力的问题,即当前方法难以适应多样化的动作理解任务,且尚无统一的骨架基础模型(foundation model)能够通用地处理多种任务。其解决方案的关键在于提出一种统一的骨架密集表征学习框架(Unified Skeleton-based Dense Representation Learning, USDRL),该框架包含三个核心模块:基于Transformer的密集时空编码器(DSTE)用于并行学习时序动态与空间结构特征;多粒度特征去相关模块(MG-FD)通过跨时间、空间和实例域的特征去冗余提升信息提取效率;以及多视角一致性训练模块(MPCT),结合多视图和多模态自监督一致性训练策略,增强高层语义学习并促进多模态特征的有效融合。实验证明,该方法在25个基准数据集上的9类骨架动作理解任务中显著优于现有最先进方法。
链接: https://arxiv.org/abs/2508.12586
作者: Hongsong Wang,Wanjiang Weng,Junbo Wang,Fang Zhao,Guo-Sen Xie,Xin Geng,Liang Wang
机构: Southeast University (东南大学); Northwestern Polytechnical University (西北工业大学); Nanjing University (南京大学); Nanjing University of Science and Technology (南京理工大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by TPAMI, Code is available at: this https URL
Abstract:Human action understanding serves as a foundational pillar in the field of intelligent motion perception. Skeletons serve as a modality- and device-agnostic representation for human modeling, and skeleton-based action understanding has potential applications in humanoid robot control and interaction. \REDHowever, existing works often lack the scalability and generalization required to handle diverse action understanding tasks. There is no skeleton foundation model that can be adapted to a wide range of action understanding tasks. This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework, which serves as a foundational model for skeleton-based human action understanding. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT). The DSTE module adopts two parallel streams to learn temporal dynamic and spatial structure features. The MG-FD module collaboratively performs feature decorrelation across temporal, spatial, and instance domains to reduce dimensional redundancy and enhance information extraction. The MPCT module employs both multi-view and multi-modal self-supervised consistency training. The former enhances the learning of high-level semantics and mitigates the impact of low-level discrepancies, while the latter effectively facilitates the learning of informative multimodal features. We perform extensive experiments on 25 benchmarks across across 9 skeleton-based action understanding tasks, covering coarse prediction, dense prediction, and transferred prediction. Our approach significantly outperforms the current state-of-the-art methods. We hope that this work would broaden the scope of research in skeleton-based action understanding and encourage more attention to dense prediction tasks.
zh
[CV-76] Structure-preserving Feature Alignment for Old Photo Colorization
【速读】:该论文旨在解决旧照片(old photo)颜色化过程中因缺乏真实标签(ground truth)以及自然灰度图像与旧照片之间存在显著领域差异(domain gap)而导致的挑战。传统基于深度学习的颜色化方法依赖大规模数据集训练,难以直接应用于旧照片场景。其解决方案的关键在于提出一种基于卷积神经网络(CNN)的新型算法SFAC(Structure-preserving Feature Alignment Colorizer),该方法仅需两张图像即可训练,通过特征分布对齐损失(feature distribution alignment loss)建立两图间的语义对应关系,确保语义相关物体获得相似颜色;同时引入结构保持机制,结合特征层级感知约束与像素层级冻结-更新金字塔结构,有效缓解颜色迁移带来的结构失真问题。
链接: https://arxiv.org/abs/2508.12570
作者: Yingxue Pang,Xin Jin,Jun Fu,Zhibo Chen
机构: University of Science and Technology of China (中国科学技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deep learning techniques have made significant advancements in reference-based colorization by training on large-scale datasets. However, directly applying these methods to the task of colorizing old photos is challenging due to the lack of ground truth and the notorious domain gap between natural gray images and old photos. To address this issue, we propose a novel CNN-based algorithm called SFAC, i.e., Structure-preserving Feature Alignment Colorizer. SFAC is trained on only two images for old photo colorization, eliminating the reliance on big data and allowing direct processing of the old photo itself to overcome the domain gap problem. Our primary objective is to establish semantic correspondence between the two images, ensuring that semantically related objects have similar colors. We achieve this through a feature distribution alignment loss that remains robust to different metric choices. However, utilizing robust semantic correspondence to transfer color from the reference to the old photo can result in inevitable structure distortions. To mitigate this, we introduce a structure-preserving mechanism that incorporates a perceptual constraint at the feature level and a frozen-updated pyramid at the pixel level. Extensive experiments demonstrate the effectiveness of our method for old photo colorization, as confirmed by qualitative and quantitative metrics.
zh
[CV-77] mporal and Rotational Calibration for Event-Centric Multi-Sensor Systems
【速读】:该论文旨在解决事件相机(event camera)在多传感器系统中进行外参标定(extrinsic calibration)时缺乏有效方法的问题,尤其针对无需专用标定目标的场景。其核心挑战在于事件相机具有异步、高时间分辨率的特性,传统依赖帧图像转换的方法难以高效利用其时空信息。解决方案的关键在于提出一种基于运动的时序与旋转标定框架:首先利用事件相机与其他异构传感器获取的旋转运动估计,通过类典型相关分析(CCA)方法初始化时间偏移和旋转外参;随后采用连续时间参数化下的SO(3)空间联合非线性优化对两者进行精调。该方法直接从事件数据的时空分布中提取正常流(normal flow)以估计角速度,避免了事件到帧的转换过程,从而提升了标定精度、鲁棒性和灵活性。
链接: https://arxiv.org/abs/2508.12564
作者: Jiayao Mai,Xiuyuan Lu,Kuan Dai,Shaojie Shen,Yi Zhou
机构: Hunan University (湖南大学); Hong Kong University of Science and Technology (香港科技大学)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 5 figures
Abstract:Event cameras generate asynchronous signals in response to pixel-level brightness changes, offering a sensing paradigm with theoretically microsecond-scale latency that can significantly enhance the performance of multi-sensor systems. Extrinsic calibration is a critical prerequisite for effective sensor fusion; however, the configuration that involves event cameras remains an understudied topic. In this paper, we propose a motion-based temporal and rotational calibration framework tailored for event-centric multi-sensor systems, eliminating the need for dedicated calibration targets. Our method uses as input the rotational motion estimates obtained from event cameras and other heterogeneous sensors, respectively. Different from conventional approaches that rely on event-to-frame conversion, our method efficiently estimates angular velocity from normal flow observations, which are derived from the spatio-temporal profile of event data. The overall calibration pipeline adopts a two-step approach: it first initializes the temporal offset and rotational extrinsics by exploiting kinematic correlations in the spirit of Canonical Correlation Analysis (CCA), and then refines both temporal and rotational parameters through a joint non-linear optimization using a continuous-time parametrization in SO(3). Extensive evaluations on both publicly available and self-collected datasets validate that the proposed method achieves calibration accuracy comparable to target-based methods, while exhibiting superior stability over purely CCA-based methods, and highlighting its precision, robustness and flexibility. To facilitate future research, our implementation will be made open-source. Code: this https URL.
zh
[CV-78] PROD: Palpative Reconstruction of Deformable Objects through Elastostatic Signed Distance Functions
【速读】:该论文旨在解决如何从有限的力控表面探测数据中准确重建软体物体的几何形状及其机械属性的问题。传统方法通常依赖纯几何或视觉信息,难以精确刻画软材料在受力下的形变特性。本文提出的PROD(Palpative Reconstruction of Deformables)方法的关键在于:利用弹性静力学符号距离函数(elastostatic signed distance function, SDF)建模物体形变过程,通过推导控制SDF的泊松方程(Poisson equation),从稀疏的姿态与力测量中估计其静态和动态响应;进一步结合稳态弹性动力学假设,实现对未变形SDF的可证明收敛恢复,并通过分析不同力输入下的位移响应来估计材料刚度。这一框架显著提升了在姿态误差、非法向力施加及曲率误差等复杂场景下的鲁棒性,适用于机器人操作、医学成像和触觉反馈系统等应用。
链接: https://arxiv.org/abs/2508.12554
作者: Hamza El-Kebir
机构: Beckman Institute of Advanced Science and Technology, University of Illinois Urbana-Champaign (伊利诺伊大学厄巴纳-香槟分校先进科学与技术贝克曼研究所)
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted for presentation at the 2025 IEEE Conference on Decision and Control (CDC)
Abstract:We introduce PROD (Palpative Reconstruction of Deformables), a novel method for reconstructing the shape and mechanical properties of deformable objects using elastostatic signed distance functions (SDFs). Unlike traditional approaches that rely on purely geometric or visual data, PROD integrates palpative interaction – measured through force-controlled surface probing – to estimate both the static and dynamic response of soft materials. We model the deformation of an object as an elastostatic process and derive a governing Poisson equation for estimating its SDF from a sparse set of pose and force measurements. By incorporating steady-state elastodynamic assumptions, we show that the undeformed SDF can be recovered from deformed observations with provable convergence. Our approach also enables the estimation of material stiffness by analyzing displacement responses to varying force inputs. We demonstrate the robustness of PROD in handling pose errors, non-normal force application, and curvature errors in simulated soft body interactions. These capabilities make PROD a powerful tool for reconstructing deformable objects in applications ranging from robotic manipulation to medical imaging and haptic feedback systems.
zh
[CV-79] REVEAL – Reasoning and Evaluation of Visual Evidence through Aligned Language ICCV2025
【速读】:该论文旨在解决图像伪造检测中泛化能力不足的问题,尤其是在不同域(如Photoshop编辑、DeepFake和AIGC生成)下的伪造检测与可解释性难题。其核心挑战在于现有监督学习方法难以跨域迁移,且缺乏对伪造证据的语义推理和定位能力。解决方案的关键在于将伪造检测建模为一个提示驱动(prompt-driven)的视觉推理任务,利用大视觉语言模型(Vision-Language Models, VLMs)的语义对齐能力,提出名为REVEAL
(Reasoning and Evaluation of Visual Evidence through Aligned Language)的框架:该框架包含两个互补策略——(1) 整体场景级评估(Holistic Scene-level Evaluation),基于图像整体的物理规律、语义一致性、视角合理性与真实性进行判断;(2) 区域级异常检测(Region-wise anomaly detection),将图像分割为多个区域并逐个分析潜在伪造痕迹。通过在多域数据集上的实验证明,该方法不仅提升了检测性能,还提供了可解释的推理过程。
链接: https://arxiv.org/abs/2508.12543
作者: Ipsita Praharaj,Yukta Butala,Yash Butala
机构: Carnegie Mellon University (卡内基梅隆大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 4 pages, 6 figures, International Conference on Computer Vision, ICCV 2025
Abstract:The rapid advancement of generative models has intensified the challenge of detecting and interpreting visual forgeries, necessitating robust frameworks for image forgery detection while providing reasoning as well as localization. While existing works approach this problem using supervised training for specific manipulation or anomaly detection in the embedding space, generalization across domains remains a challenge. We frame this problem of forgery detection as a prompt-driven visual reasoning task, leveraging the semantic alignment capabilities of large vision-language models. We propose a framework, REVEAL
(Reasoning and Evaluation of Visual Evidence through Aligned Language), that incorporates generalized guidelines. We propose two tangential approaches - (1) Holistic Scene-level Evaluation that relies on the physics, semantics, perspective, and realism of the image as a whole and (2) Region-wise anomaly detection that splits the image into multiple regions and analyzes each of them. We conduct experiments over datasets from different domains (Photoshop, DeepFake and AIGC editing). We compare the Vision Language Models against competitive baselines and analyze the reasoning provided by them.
zh
[CV-80] oward Architecture-Agnostic Local Control of Posterior Collapse in VAEs
【速读】:该论文旨在解决变分自编码器(Variational Autoencoders, VAEs)中存在的后验坍缩(posterior collapse)问题,即模型在训练过程中忽略潜在变量,导致生成样本多样性下降。传统方法通过调节正则化损失与重建损失之间的权衡来缓解此问题,但效果有限;另一些方法则依赖于对网络结构的约束以保证潜在空间的可识别性(latent identifiability),但限制了模型灵活性。本文提出局部后验坍缩(local posterior collapse)的概念,用以刻画数据空间中单个样本点对潜在表示的重要性,并在此基础上设计了一种无需特定网络架构约束的潜重构损失(Latent Reconstruction, LR loss)。该损失函数基于单射(injective)和复合函数(composite functions)的数学性质,有效控制后验坍缩,从而提升VAE的生成性能,且在MNIST、FashionMNIST、Omniglot、CelebA和FFHQ等多个数据集上得到实验验证。
链接: https://arxiv.org/abs/2508.12530
作者: Hyunsoo Song,Seungwhan Kim,Seungkyu Lee
机构: 未知
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注: 8 pages, 6 figures
Abstract:Variational autoencoders (VAEs), one of the most widely used generative models, are known to suffer from posterior collapse, a phenomenon that reduces the diversity of generated samples. To avoid posterior collapse, many prior works have tried to control the influence of regularization loss. However, the trade-off between reconstruction and regularization is not satisfactory. For this reason, several methods have been proposed to guarantee latent identifiability, which is the key to avoiding posterior collapse. However, they require structural constraints on the network architecture. For further clarification, we define local posterior collapse to reflect the importance of individual sample points in the data space and to relax the network constraint. Then, we propose Latent Reconstruction(LR) loss, which is inspired by mathematical properties of injective and composite functions, to control posterior collapse without restriction to a specific architecture. We experimentally evaluate our approach, which controls posterior collapse on varied datasets such as MNIST, fashionMNIST, Omniglot, CelebA, and FFHQ.
zh
[CV-81] MuSACo: Multimodal Subject-Specific Selection and Adaptation for Expression Recognition with Co-Training
【速读】:该论文旨在解决个性化表情识别(Personalized Expression Recognition, PER)中因个体间差异显著而导致模型性能受限的问题,尤其针对现有多源域适应(Multi-Source Domain Adaptation, MSDA)方法普遍忽略多模态信息或简单融合多个源域、从而未能充分捕捉个体特异性特征的局限性。解决方案的关键在于提出MuSACo方法——一种基于协同训练(co-training)的多模态主体选择与适应机制,其核心创新包括:1)利用多模态互补信息进行源主体选择和伪标签生成,以增强类感知学习;2)引入类无关损失函数,从置信度较低的目标样本中挖掘潜在知识;3)对各模态源特征进行对齐,仅融合高置信度目标特征,从而在保留个体独特性的同时提升模型泛化能力。实验表明,MuSACo在BioVid和StressID等复杂多模态表情识别数据集上优于传统单域适应(UDA)及主流MSDA方法。
链接: https://arxiv.org/abs/2508.12522
作者: Muhammad Osama Zeeshan,Natacha Gillet,Alessandro Lameiras Koerich,Marco Pedersoli,Francois Bremond,Eric Granger
机构: LIVIA, Dept. of Systems Engineering, ETS Montreal, Canada(加拿大蒙特利尔工程学院); École Polytechnique, Palaiseau, France(法国巴黎综合理工学院); Inria, 2004 Rte des Lucioles, Valbonne 06902, France(法国国家信息与自动化研究院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Personalized expression recognition (ER) involves adapting a machine learning model to subject-specific data for improved recognition of expressions with considerable interpersonal variability. Subject-specific ER can benefit significantly from multi-source domain adaptation (MSDA) methods, where each domain corresponds to a specific subject, to improve model accuracy and robustness. Despite promising results, state-of-the-art MSDA approaches often overlook multimodal information or blend sources into a single domain, limiting subject diversity and failing to explicitly capture unique subject-specific characteristics. To address these limitations, we introduce MuSACo, a multi-modal subject-specific selection and adaptation method for ER based on co-training. It leverages complementary information across multiple modalities and multiple source domains for subject-specific adaptation. This makes MuSACo particularly relevant for affective computing applications in digital health, such as patient-specific assessment for stress or pain, where subject-level nuances are crucial. MuSACo selects source subjects relevant to the target and generates pseudo-labels using the dominant modality for class-aware learning, in conjunction with a class-agnostic loss to learn from less confident target samples. Finally, source features from each modality are aligned, while only confident target features are combined. Our experimental results on challenging multimodal ER datasets: BioVid and StressID, show that MuSACo can outperform UDA (blending) and state-of-the-art MSDA methods.
zh
[CV-82] An Initial Study of Birds-Eye View Generation for Autonomous Vehicles using Cross-View Transformers
【速读】:该论文旨在解决如何从摄像头图像中准确生成鸟瞰图(Bird’s-Eye View, BEV)地图的问题,以支持自动驾驶系统的感知能力。其核心挑战在于模型在未见过的城市环境中的泛化性能、不同摄像头布局的影响以及损失函数设计对结果精度的贡献。解决方案的关键在于采用跨视角Transformer(Cross-View Transformer, CVT)架构,通过端到端学习将多视角相机图像映射到包含道路、车道线和规划轨迹的三通道BEV表示,并结合L1损失函数与四摄像头配置,在仅使用一个城市的数据训练下实现了对新城市场景的最佳鲁棒性表现。
链接: https://arxiv.org/abs/2508.12520
作者: Felipe Carlos dos Santos,Eric Aislan Antonelo,Gustavo Claudio Karl Couto
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages,submitted in ENIAC 2025
Abstract:Bird’s-Eye View (BEV) maps provide a structured, top-down abstraction that is crucial for autonomous-driving perception. In this work, we employ Cross-View Transformers (CVT) for learning to map camera images to three BEV’s channels - road, lane markings, and planned trajectory - using a realistic simulator for urban driving. Our study examines generalization to unseen towns, the effect of different camera layouts, and two loss formulations (focal and L1). Using training data from only a town, a four-camera CVT trained with the L1 loss delivers the most robust test performance, evaluated in a new town. Overall, our results underscore CVT’s promise for mapping camera inputs to reasonably accurate BEV maps.
zh
[CV-83] LangVision-LoRA-NAS: Neural Architecture Search for Variable LoRA Rank in Vision Language Models ICIP2025
【速读】:该论文旨在解决当前基于LoRA(Low-Rank Adaptation)的视觉语言模型(Vision Language Models, VLMs)微调方法中存在固定秩配置所带来的灵活性与效率不足问题。现有LoRA实现通常采用预设的低秩更新,难以适配不同任务对性能和计算成本的不同需求。其解决方案的关键在于提出LangVision-LoRA-NAS框架,该框架将神经架构搜索(Neural Architecture Search, NAS)与LoRA相结合,通过自动搜索最优LoRA秩配置来动态适配特定多模态任务,在保证模型性能的同时显著降低微调开销。
链接: https://arxiv.org/abs/2508.12512
作者: Krishna Teja Chitty-Venkata,Murali Emani,Venkatram Vishwanath
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted by ICIP 2025 Conference
Abstract:Vision Language Models (VLMs) integrate visual and text modalities to enable multimodal understanding and generation. These models typically combine a Vision Transformer (ViT) as an image encoder and a Large Language Model (LLM) for text generation. LoRA (Low-Rank Adaptation) is an efficient fine-tuning method to adapt pre-trained models to new tasks by introducing low-rank updates to their weights. While LoRA has emerged as a powerful technique for fine-tuning large models by introducing low-rank updates, current implementations assume a fixed rank, potentially limiting flexibility and efficiency across diverse tasks. This paper introduces \textitLangVision-LoRA-NAS, a novel framework that integrates Neural Architecture Search (NAS) with LoRA to optimize VLMs for variable-rank adaptation. Our approach leverages NAS to dynamically search for the optimal LoRA rank configuration tailored to specific multimodal tasks, balancing performance and computational efficiency. Through extensive experiments using the LLaMA-3.2-11B model on several datasets, LangVision-LoRA-NAS demonstrates notable improvement in model performance while reducing fine-tuning costs. Our Base and searched fine-tuned models on LLaMA-3.2-11B-Vision-Instruct can be found \hrefthis https URL\textcolorbluehere and the code for LangVision-LoRA-NAS can be found \hrefthis https URL\textcolorbluehere.
zh
[CV-84] Design and Validation of a Responsible Artificial Intelligence-based System for the Referral of Diabetic Retinopathy Patients
【速读】:该论文旨在解决糖尿病视网膜病变(Diabetic Retinopathy, DR)早期筛查中因眼科医生短缺和检查时效性不足导致的诊断延迟问题,同时应对现有人工智能(Artificial Intelligence, AI)系统在临床应用中因数据质量低和潜在偏倚而学习到非预期特征所带来的可靠性与公平性挑战。解决方案的关键在于开发了一个负责任的人工智能系统(Responsible AI System for DR screening, RAIS-DR),其核心创新在于将伦理原则贯穿于AI生命周期,并集成高效的卷积神经网络模型用于图像预处理、质量评估及三种专业化DR分类任务;此外,RAIS-DR在独立测试集上显著优于FDA批准的EyeArt系统,在F1分数、准确率和特异性方面均有提升,且在不同人口学亚组间表现出更优的公平性指标(如Disparate Impact和Equal Opportunity Difference),从而有效降低医疗资源分配不均带来的健康差异。
链接: https://arxiv.org/abs/2508.12506
作者: E. Ulises Moya-Sánchez,Abraham Sánchez-Perez,Raúl Nanclares Da Veiga,Alejandro Zarate-Macías,Edgar Villareal,Alejandro Sánchez-Montes,Edtna Jauregui-Ulloa,Héctor Moreno,Ulises Cortés
机构: Gobierno de Jalisco (哈利斯科州政府); Instituto Tecnológico José Mario Molina Pasquel Y Henríquez (何塞·马里奥·莫利纳理工学院); Secretaria de Salud Jalisco (哈利斯科州卫生局); Universidad de Guadalajara (瓜达拉哈拉大学); Universitat Politècnica de Catalunya (加泰罗尼亚理工大学); Barcelona Supercoputing Center (巴塞罗那超级计算中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 14 pages,3 figures, under review
Abstract:Diabetic Retinopathy (DR) is a leading cause of vision loss in working-age individuals. Early detection of DR can reduce the risk of vision loss by up to 95%, but a shortage of retinologists and challenges in timely examination complicate detection. Artificial Intelligence (AI) models using retinal fundus photographs (RFPs) offer a promising solution. However, adoption in clinical settings is hindered by low-quality data and biases that may lead AI systems to learn unintended features. To address these challenges, we developed RAIS-DR, a Responsible AI System for DR screening that incorporates ethical principles across the AI lifecycle. RAIS-DR integrates efficient convolutional models for preprocessing, quality assessment, and three specialized DR classification models. We evaluated RAIS-DR against the FDA-approved EyeArt system on a local dataset of 1,046 patients, unseen by both systems. RAIS-DR demonstrated significant improvements, with F1 scores increasing by 5-12%, accuracy by 6-19%, and specificity by 10-20%. Additionally, fairness metrics such as Disparate Impact and Equal Opportunity Difference indicated equitable performance across demographic subgroups, underscoring RAIS-DR’s potential to reduce healthcare disparities. These results highlight RAIS-DR as a robust and ethically aligned solution for DR screening in clinical settings. The code, weights of RAIS-DR are available at this https URL with RAIL.
zh
[CV-85] Skin Cancer Classification: Hybrid CNN-Transformer Models with KAN-Based Fusion
【速读】:该论文旨在解决皮肤癌分类中恶性与非恶性病灶精准区分的问题,以实现早期诊断和治疗。其解决方案的关键在于提出一种融合卷积神经网络(CNN)、Transformer 和卷积型柯尔莫哥洛夫-阿诺德网络(CKAN)的串行与并行混合架构,通过 CNN 提取局部空间特征、Transformer 建模全局依赖关系,并利用 CKAN 实现可学习激活函数驱动的非线性特征融合,从而增强表示学习能力。该方法在多个基准数据集(HAM10000、BCN20000 和 PAD-UFES)上均表现出优异的泛化性能,验证了模型设计与特征表示优化对提升医学图像分类鲁棒性和准确性的关键作用。
链接: https://arxiv.org/abs/2508.12484
作者: Shubhi Agarwal,Amulya Kumar Mahto
机构: Mehta Family School of Data Science and Artificial Intelligence (数据科学与人工智能梅hta家族学院); Indian Institute of Technology Guwahati (印度理工学院古瓦哈提分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skin cancer classification is a crucial task in medical image analysis, where precise differentiation between malignant and non-malignant lesions is essential for early diagnosis and treatment. In this study, we explore Sequential and Parallel Hybrid CNN-Transformer models with Convolutional Kolmogorov-Arnold Network (CKAN). Our approach integrates transfer learning and extensive data augmentation, where CNNs extract local spatial features, Transformers model global dependencies, and CKAN facilitates nonlinear feature fusion for improved representation learning. To assess generalization, we evaluate our models on multiple benchmark datasets (HAM10000,BCN20000 and PAD-UFES) under varying data distributions and class imbalances. Experimental results demonstrate that hybrid CNN-Transformer architectures effectively capture both spatial and contextual features, leading to improved classification performance. Additionally, the integration of CKAN enhances feature fusion through learnable activation functions, yielding more discriminative representations. Our proposed approach achieves competitive performance in skin cancer classification, demonstrating 92.81% accuracy and 92.47% F1-score on the HAM10000 dataset, 97.83% accuracy and 97.83% F1-score on the PAD-UFES dataset, and 91.17% accuracy with 91.79% F1- score on the BCN20000 dataset highlighting the effectiveness and generalizability of our model across diverse datasets. This study highlights the significance of feature representation and model design in advancing robust and accurate medical image classification.
zh
[CV-86] Standardization of Neuromuscular Reflex Analysis – Role of Fine-Tuned Vision-Language Model Consortium and OpenAI gpt -oss Reasoning LLM Enabled Decision Support System
【速读】:该论文旨在解决传统H-reflex肌电图(EMG)波形分析中因临床医生和研究人员间主观差异导致的可靠性与标准化不足的问题。其解决方案的关键在于构建一个微调后的视觉-语言模型(VLM)联盟与一个基于推理型大语言模型(LLM)的决策支持系统,通过多模态特征提取与上下文信息融合,实现对H-reflex波形的自动化、一致且可解释的诊断评估。具体而言,每个VLM均在标注有临床观察、恢复时间线及运动员元数据的H-reflex EMG图像数据集上进行微调,从而精准识别关键电生理特征并预测神经肌肉状态;随后,VLM联盟输出经共识机制聚合,并由专门设计的推理LLM进一步精炼,确保决策过程透明、可靠,最终形成端到端的智能诊断平台。
链接: https://arxiv.org/abs/2508.12473
作者: Eranga Bandara,Ross Gore,Sachin Shetty,Ravi Mukkamala,Christopher Rhea,Atmaram Yarlagadda,Shaifali Kaushik,L.H.M.P.De Silva,Andriy Maznychenko,Inna Sokolowska,Amin Hass,Kasun De Zoysa
机构: Old Dominion University (老多明尼昂大学); AnaletIQ; McDonald Army Health Center (麦克唐纳陆军健康中心); University of Sri Jayewardenepura (斯里贾亚瓦登普拉大学); University of Gdańsk (格但斯克大学); University of Colombo (科伦坡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Accurate assessment of neuromuscular reflexes, such as the H-reflex, plays a critical role in sports science, rehabilitation, and clinical neurology. Traditional analysis of H-reflex EMG waveforms is subject to variability and interpretation bias among clinicians and researchers, limiting reliability and standardization. To address these challenges, we propose a Fine-Tuned Vision-Language Model (VLM) Consortium and a reasoning Large-Language Model (LLM)-enabled Decision Support System for automated H-reflex waveform interpretation and diagnosis. Our approach leverages multiple VLMs, each fine-tuned on curated datasets of H-reflex EMG waveform images annotated with clinical observations, recovery timelines, and athlete metadata. These models are capable of extracting key electrophysiological features and predicting neuromuscular states, including fatigue, injury, and recovery, directly from EMG images and contextual metadata. Diagnostic outputs from the VLM consortium are aggregated using a consensus-based method and refined by a specialized reasoning LLM, which ensures robust, transparent, and explainable decision support for clinicians and sports scientists. The end-to-end platform orchestrates seamless communication between the VLM ensemble and the reasoning LLM, integrating prompt engineering strategies and automated reasoning workflows using LLM Agents. Experimental results demonstrate that this hybrid system delivers highly accurate, consistent, and interpretable H-reflex assessments, significantly advancing the automation and standardization of neuromuscular diagnostics. To our knowledge, this work represents the first integration of a fine-tuned VLM consortium with a reasoning LLM for image-based H-reflex analysis, laying the foundation for next-generation AI-assisted neuromuscular assessment and athlete monitoring platforms.
zh
[CV-87] Mechanical Automation with Vision: A Design for Rubiks Cube Solver
【速读】:该论文旨在解决魔方(Rubik’s Cube)自动求解的集成化问题,即如何通过软硬件协同实现从实时状态识别到物理操作的闭环自动化。解决方案的关键在于:首先利用YOLOv8目标检测模型(Precision 0.98443, Recall 0.98419)实现对魔方状态的高精度实时识别;其次,通过Unity开发的图形用户界面(GUI)将检测结果虚拟化以供交互与算法处理;最后,采用Kociemba算法生成最优解序列,并由三个步进电机协同控制实现单自由度物理操作,平均求解时间约为2.2分钟。
链接: https://arxiv.org/abs/2508.12469
作者: Abhinav Chalise,Nimesh Gopal Pradhan,Nishan Khanal,Prashant Raj Bista,Dinesh Baniya Kshatri
机构: 未知
类目: Robotics (cs.RO); Computer Vision and Pattern Recognition (cs.CV)
备注: Presented at the 15th IOE Graduate Conference, Tribhuvan University, May 2024. Original paper available at this https URL
Abstract:The core mechanical system is built around three stepper motors for physical manipulation, a microcontroller for hardware control, a camera and YOLO detection model for real-time cube state detection. A significant software component is the development of a user-friendly graphical user interface (GUI) designed in Unity. The initial state after detection from real-time YOLOv8 model (Precision 0.98443, Recall 0.98419, Box Loss 0.42051, Class Loss 0.2611) is virtualized on GUI. To get the solution, the system employs the Kociemba’s algorithm while physical manipulation with a single degree of freedom is done by combination of stepper motors’ interaction with the cube achieving the average solving time of ~2.2 minutes.
zh
[CV-88] Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping
【速读】:该论文旨在解决传统多模态学习方法中依赖昂贵对齐预训练(alignment pre-training)以连接视觉与语言模态的问题,以及其通常将视觉特征映射到离散文本标记空间的局限性。解决方案的关键在于提出 Inverse-LLaVA,这是一种颠覆性范式:不再将视觉特征投影至文本空间,而是反向地将文本嵌入映射到连续的视觉表示空间,并在 Transformer 的中间层进行融合。通过注意力机制中的选择性加性组件实现动态融合,无需大规模图像-文本对齐数据集即可完成有效多模态交互。实验表明,该方法在推理密集型任务上显著提升性能,同时大幅降低计算开销(减少45%),首次实证证明对齐预训练并非多模态学习所必需,尤其适用于复杂认知推理场景。
链接: https://arxiv.org/abs/2508.12466
作者: Xuhui Zhan,Tyler Derr
机构: Vanderbilt University (范德堡大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15pages, 3 figures
Abstract:Traditional multimodal learning approaches require expensive alignment pre-training to bridge vision and language modalities, typically projecting visual features into discrete text token spaces. We challenge both fundamental assumptions underlying this paradigm by proposing Inverse-LLaVA, a novel approach that eliminates alignment pre-training entirely while inverting the conventional mapping direction. Rather than projecting visual features to text space, our method maps text embeddings into continuous visual representation space and performs fusion within transformer intermediate layers. Through selective additive components in attention mechanisms, we enable dynamic integration of visual and textual representations without requiring massive image-text alignment datasets. Comprehensive experiments across nine multimodal benchmarks demonstrate nuanced performance trade-offs: Inverse-LLaVA achieves notable improvements on reasoning-intensive and cognitive tasks (MM-VET: +0.2%, VizWiz: +1.8%, ScienceQA: +0.2%, cognitive reasoning: +27.2%), while showing expected decreases in perception tasks requiring memorized visual-text associations (celebrity recognition: -49.5%, OCR: -21.3%). These results provide the first empirical evidence that alignment pre-training is not necessary for effective multimodal learning, particularly for complex reasoning tasks. Our work establishes the feasibility of a new paradigm that reduces computational requirements by 45%, challenges conventional wisdom about modality fusion, and opens new research directions for efficient multimodal architectures that preserve modality-specific characteristics. Our project website with code and additional resources is available at this https URL.
zh
[CV-89] X-Ray-CoT: Interpretable Chest X-ray Diagnosis with Vision-Language Models via Chain-of-Thought Reasoning
【速读】:该论文旨在解决胸部X光影像诊断中因深度学习模型“黑箱”特性导致的临床采纳难题,以及诊断结果缺乏可解释性的问题。其核心挑战在于如何在保持高诊断准确性的同时,提供人类可理解的推理过程和诊断报告,从而提升医生对AI系统的信任度与临床实用性。解决方案的关键在于提出X-Ray-CoT框架,该框架基于视觉-语言大模型(Vision-Language Large Models, LVLMs),通过模拟放射科医生的“思维链”(Chain-of-Thought, CoT)机制:首先提取多模态特征与视觉概念,再利用大语言模型(LLM)结合结构化CoT提示策略进行逻辑推理,最终生成详细且可解释的自然语言诊断报告。实验证明,该方法在CORDA数据集上实现了80.52%的平衡准确率和78.65%的F1分数,同时显著提升了报告的可解释性,为构建可信、透明且临床可用的医学影像AI系统提供了新路径。
链接: https://arxiv.org/abs/2508.12455
作者: Chee Ng,Liliang Sun,Shaoqing Tang
机构: Universiti Teknologi Malaysia (马来西亚理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Chest X-ray imaging is crucial for diagnosing pulmonary and cardiac diseases, yet its interpretation demands extensive clinical experience and suffers from inter-observer variability. While deep learning models offer high diagnostic accuracy, their black-box nature hinders clinical adoption in high-stakes medical settings. To address this, we propose X-Ray-CoT (Chest X-Ray Chain-of-Thought), a novel framework leveraging Vision-Language Large Models (LVLMs) for intelligent chest X-ray diagnosis and interpretable report generation. X-Ray-CoT simulates human radiologists’ “chain-of-thought” by first extracting multi-modal features and visual concepts, then employing an LLM-based component with a structured Chain-of-Thought prompting strategy to reason and produce detailed natural language diagnostic reports. Evaluated on the CORDA dataset, X-Ray-CoT achieves competitive quantitative performance, with a Balanced Accuracy of 80.52% and F1 score of 78.65% for disease diagnosis, slightly surpassing existing black-box models. Crucially, it uniquely generates high-quality, explainable reports, as validated by preliminary human evaluations. Our ablation studies confirm the integral role of each proposed component, highlighting the necessity of multi-modal fusion and CoT reasoning for robust and transparent medical AI. This work represents a significant step towards trustworthy and clinically actionable AI systems in medical imaging.
zh
[CV-90] Express4D: Expressive Friendly and Extensible 4D Facial Motion Generation Benchmark
【速读】:该论文旨在解决当前生成式模型在动态面部表情生成任务中面临的两大问题:一是现有数据集多为语音驱动或仅包含粗粒度情绪标签,缺乏精细的语义描述以实现对表情的细粒度控制;二是数据采集依赖昂贵的专业设备,限制了数据获取的便捷性和可扩展性。其解决方案的关键在于构建一个名为Express4D的新数据集,该数据集通过使用消费级设备(commodity equipment)和大语言模型(LLM)生成的自然语言指令采集面部运动序列,并以ARKit blendshape格式存储,从而提供可驱动的、富含表现力且带有语义标注的面部动画数据。这一设计使得训练模型能够学习文本到表情的映射关系,并有效捕捉两种模态间的多对多对应特性,为未来研究提供了高质量基准。
链接: https://arxiv.org/abs/2508.12438
作者: Yaron Aloni,Rotem Shalev-Arkushin,Yonatan Shafir,Guy Tevet,Ohad Fried,Amit Haim Bermano
机构: Tel Aviv University (特拉维夫大学); Reichman University (里奇曼大学)
类目: Graphics (cs.GR); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Dynamic facial expression generation from natural language is a crucial task in Computer Graphics, with applications in Animation, Virtual Avatars, and Human-Computer Interaction. However, current generative models suffer from datasets that are either speech-driven or limited to coarse emotion labels, lacking the nuanced, expressive descriptions needed for fine-grained control, and were captured using elaborate and expensive equipment. We hence present a new dataset of facial motion sequences featuring nuanced performances and semantic annotation. The data is easily collected using commodity equipment and LLM-generated natural language instructions, in the popular ARKit blendshape format. This provides riggable motion, rich with expressive performances and labels. We accordingly train two baseline models, and evaluate their performance for future benchmarking. Using our Express4D dataset, the trained models can learn meaningful text-to-expression motion generation and capture the many-to-many mapping of the two modalities. The dataset, code, and video examples are available on our webpage: this https URL
zh
[CV-91] Illusions in Humans and AI: How Visual Perception Aligns and Diverges
【速读】:该论文试图解决的问题是:如何通过对比人类与人工视觉系统(Artificial Vision Systems)在视觉错觉(Visual Illusions)中的响应差异,揭示二者在构建视觉现实机制上的根本区别,并据此指导开发更鲁棒、可解释且与人类认知对齐的人工智能(AI)视觉系统。其解决方案的关键在于系统性地分析AI模型对经典视觉错觉(涉及颜色、大小、形状和运动)的响应,发现部分错觉效应可由训练诱导或作为模式识别的副产物出现,同时识别出仅存在于AI中的独特错觉现象(如像素级敏感性和幻觉),从而揭示人类感知与AI感知之间的对齐缺口及AI特有的感知脆弱性,为未来兼顾人类有益感知偏置并规避潜在误导性偏差的视觉系统研究提供方向。
链接: https://arxiv.org/abs/2508.12422
作者: Jianyi Yang,Junyi Ye,Ankan Dash,Guiling Wang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:By comparing biological and artificial perception through the lens of illusions, we highlight critical differences in how each system constructs visual reality. Understanding these divergences can inform the development of more robust, interpretable, and human-aligned artificial intelligence (AI) vision systems. In particular, visual illusions expose how human perception is based on contextual assumptions rather than raw sensory data. As artificial vision systems increasingly perform human-like tasks, it is important to ask: does AI experience illusions, too? Does it have unique illusions? This article explores how AI responds to classic visual illusions that involve color, size, shape, and motion. We find that some illusion-like effects can emerge in these models, either through targeted training or as by-products of pattern recognition. In contrast, we also identify illusions unique to AI, such as pixel-level sensitivity and hallucinations, that lack human counterparts. By systematically comparing human and AI responses to visual illusions, we uncover alignment gaps and AI-specific perceptual vulnerabilities invisible to human perception. These findings provide insights for future research on vision systems that preserve human-beneficial perceptual biases while avoiding distortions that undermine trust and safety.
zh
[CV-92] P4GEN: Text to Immersive Panorama 4D Scene Generation
【速读】:该论文旨在解决现有生成方法在创建高保真、沉浸式动态场景时的局限性,特别是难以实现从任意视角出发的360度全景沉浸体验的问题(即现有方法多局限于静态场景或窄视角动态场景)。其解决方案的关键在于提出TiP4GEN框架,该框架融合了全景视频生成与动态场景重建两大模块:首先设计了一个双分支生成模型(Dual-branch Generation Model),包含全景分支和透视分支,通过双向交叉注意力机制实现全局与局部视图间的充分信息交互;其次引入基于3D高斯泼溅(3D Gaussian Splatting)的几何对齐重建模型,利用度量深度图对时空点云进行配准并结合估计的姿态初始化相机,从而保障重建场景的几何一致性与时间连贯性。这一架构显著提升了生成动态全景4D场景的质量与可控性。
链接: https://arxiv.org/abs/2508.12415
作者: Ke Xing,Hanwen Liang,Dejia Xu,Yuyang Yin,Konstantinos N. Plataniotis,Yao Zhao,Yunchao Wei
机构: Beijing Jiaotong University (北京交通大学); University of Toronto (多伦多大学); The University of Texas at Austin (德克萨斯大学奥斯汀分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the rapid advancement and widespread adoption of VR/AR technologies, there is a growing demand for the creation of high-quality, immersive dynamic scenes. However, existing generation works predominantly concentrate on the creation of static scenes or narrow perspective-view dynamic scenes, falling short of delivering a truly 360-degree immersive experience from any viewpoint. In this paper, we introduce \textbfTiP4GEN, an advanced text-to-dynamic panorama scene generation framework that enables fine-grained content control and synthesizes motion-rich, geometry-consistent panoramic 4D scenes. TiP4GEN integrates panorama video generation and dynamic scene reconstruction to create 360-degree immersive virtual environments. For video generation, we introduce a \textbfDual-branch Generation Model consisting of a panorama branch and a perspective branch, responsible for global and local view generation, respectively. A bidirectional cross-attention mechanism facilitates comprehensive information exchange between the branches. For scene reconstruction, we propose a \textbfGeometry-aligned Reconstruction Model based on 3D Gaussian Splatting. By aligning spatial-temporal point clouds using metric depth maps and initializing scene cameras with estimated poses, our method ensures geometric consistency and temporal coherence for the reconstructed scenes. Extensive experiments demonstrate the effectiveness of our proposed designs and the superiority of TiP4GEN in generating visually compelling and motion-coherent dynamic panoramic scenes. Our project page is at this https URL.
zh
[CV-93] SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes
【速读】:该论文旨在解决肝硬化(Liver Cirrhosis)患者中病灶检测与表征的难题,尤其是在三维磁共振成像(volumetric MRI)数据中,由于肝脏复杂的解剖结构和多样的病理变化,现有方法难以有效利用空间 anatomical 信息,从而限制了分割精度与临床可解释性。其解决方案的关键在于提出一种基于 Mamba 架构的新网络 SRMA-Mamba,通过引入空间解剖引导的 Mamba 模块(SABMamba)实现对肝硬化组织的定向扫描,并融合矢状面、冠状面和轴面的解剖信息构建全局空间上下文表示;同时结合空间反向注意力模块(SRMA),利用粗粒度分割图与层次化编码特征逐步优化病灶细节,显著提升了病灶区域的分割准确性和鲁棒性。
链接: https://arxiv.org/abs/2508.12410
作者: Jun Zeng,Yannan Huang,Elif Keles,Halil Ertugrul Aktas,Gorkem Durak,Nikhil Kumar Tomar,Quoc-Huy Trinh,Deepak Ranjan Nayak,Ulas Bagci,Debesh Jha
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 9 pages, 4 figures
Abstract:Liver Cirrhosis plays a critical role in the prognosis of chronic liver disease. Early detection and timely intervention are critical in significantly reducing mortality rates. However, the intricate anatomical architecture and diverse pathological changes of liver tissue complicate the accurate detection and characterization of lesions in clinical settings. Existing methods underutilize the spatial anatomical details in volumetric MRI data, thereby hindering their clinical effectiveness and explainability. To address this challenge, we introduce a novel Mamba-based network, SRMA-Mamba, designed to model the spatial relationships within the complex anatomical structures of MRI volumes. By integrating the Spatial Anatomy-Based Mamba module (SABMamba), SRMA-Mamba performs selective Mamba scans within liver cirrhotic tissues and combines anatomical information from the sagittal, coronal, and axial planes to construct a global spatial context representation, enabling efficient volumetric segmentation of pathological liver structures. Furthermore, we introduce the Spatial Reverse Attention module (SRMA), designed to progressively refine cirrhotic details in the segmentation map, utilizing both the coarse segmentation map and hierarchical encoding features. Extensive experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods, delivering exceptional performance in 3D pathological liver segmentation. Our code is available for public: \colorbluethis https URL.
zh
[CV-94] S5: Scalable Semi-Supervised Semantic Segmentation in Remote Sensing
【速读】:该论文旨在解决遥感(Remote Sensing, RS)领域半监督语义分割(Semi-supervised Semantic Segmentation, S4)方法在实际应用中因依赖小规模数据集和模型而导致性能受限的问题。现有研究难以充分利用海量未标注地球观测数据,主要受限于昂贵的像素级标注成本。为此,作者提出首个可扩展的半监督语义分割框架——S5,其关键在于:首先构建了一个大规模数据集RS4P-1M,通过熵值过滤与多样性扩展相结合的数据选择策略筛选高质量伪标签样本;其次,在此基础上预训练不同规模的遥感基础模型(Remote Sensing Foundation Models, RSFMs),显著提升土地覆盖分割和目标检测任务性能;最后,在微调阶段引入基于专家混合(Mixture-of-Experts, MoE)的多数据集微调机制,实现对多个遥感基准测试集的高效适配,同时减少参数量并增强泛化能力。这一系列设计共同推动了S4在遥感领域的规模化落地与性能突破。
链接: https://arxiv.org/abs/2508.12409
作者: Liang Lv,Di Wang,Jing Zhang,Lefei Zhang
机构: Wuhan University (武汉大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Semi-supervised semantic segmentation (S4) has advanced remote sensing (RS) analysis by leveraging unlabeled data through pseudo-labeling and consistency learning. However, existing S4 studies often rely on small-scale datasets and models, limiting their practical applicability. To address this, we propose S5, the first scalable framework for semi-supervised semantic segmentation in RS, which unlocks the potential of vast unlabeled Earth observation data typically underutilized due to costly pixel-level annotations. Built upon existing large-scale RS datasets, S5 introduces a data selection strategy that integrates entropy-based filtering and diversity expansion, resulting in the RS4P-1M dataset. Using this dataset, we systematically scales S4 methods by pre-training RS foundation models (RSFMs) of varying sizes on this extensive corpus, significantly boosting their performance on land cover segmentation and object detection tasks. Furthermore, during fine-tuning, we incorporate a Mixture-of-Experts (MoE)-based multi-dataset fine-tuning approach, which enables efficient adaptation to multiple RS benchmarks with fewer parameters. This approach improves the generalization and versatility of RSFMs across diverse RS benchmarks. The resulting RSFMs achieve state-of-the-art performance across all benchmarks, underscoring the viability of scaling semi-supervised learning for RS applications. All datasets, code, and models will be released at this https URL
zh
[CV-95] LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving
【速读】:该论文旨在解决当前基于视觉-语言模型(Vision-Language Models, VLMs)的自动驾驶系统在复杂场景下缺乏整体性场景感知与强空间意识的问题,尤其是在多视角图像与场景推理文本微调的基础上,难以实现精准、可解释的驾驶行为理解。其解决方案的关键在于提出一种面向自动驾驶的新型视觉-语言框架LMAD,通过引入初步场景交互机制和任务特化的专家适配器(specialized expert adapters),在统一的驾驶任务结构中增强VLMs对复杂驾驶场景的理解能力,并确保与现有VLMs及规划导向型驾驶系统的兼容性与无缝集成。实验表明,LMAD显著提升了现有VLMs在驾驶推理任务上的性能,推动了可解释自动驾驶的新标准。
链接: https://arxiv.org/abs/2508.12404
作者: Nan Song,Bozhou Zhang,Xiatian Zhu,Jiankang Deng,Li Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 7 pages, 4 figures,
Abstract:Large vision-language models (VLMs) have shown promising capabilities in scene understanding, enhancing the explainability of driving behaviors and interactivity with users. Existing methods primarily fine-tune VLMs on on-board multi-view images and scene reasoning text, but this approach often lacks the holistic and nuanced scene recognition and powerful spatial awareness required for autonomous driving, especially in complex situations. To address this gap, we propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs. In particular, we introduce preliminary scene interaction and specialized expert adapters within the same driving task structure, which better align VLMs with autonomous driving scenarios. Furthermore, our approach is designed to be fully compatible with existing VLMs while seamlessly integrating with planning-oriented driving systems. Extensive experiments on the DriveLM and nuScenes-QA datasets demonstrate that LMAD significantly boosts the performance of existing VLMs on driving reasoning tasks,setting a new standard in explainable autonomous driving.
zh
[CV-96] MPCAR: Multi-Perspective Contextual Augmentation for Enhanced Visual Reasoning in Large Vision-Language Models
【速读】:该论文旨在解决大型视觉语言模型(Large Vision-Language Models, LVLMs)在复杂视觉推理任务中表现不足的问题,尤其针对需要深度上下文理解、多角度分析或精细细节识别的任务。现有方法通常依赖单次图像编码和提示(prompt),难以充分捕捉复杂的视觉信息。解决方案的关键在于提出一种无需微调模型参数的推理阶段策略——多视角情境增强推理(Multi-Perspective Contextual Augmentation for Reasoning, MPCAR),其核心机制包括三个阶段:首先由LVLM生成多个互补的视觉描述或初步推理路径;其次将这些描述与原始问题融合构建增强型上下文提示;最后利用该提示引导模型进行深层推理并输出答案。实验证明,MPCAR在GQA、VQA-CP v2和ScienceQA等挑战性视觉问答数据集上显著提升准确率,且人类评估显示答案更具连贯性和完整性,凸显了利用LVLM自身生成能力丰富输入上下文以释放其潜在推理能力的有效性。
链接: https://arxiv.org/abs/2508.12400
作者: Amirul Rahman,Qiang Xu,Xueying Huang
机构: University of Malaya (马来亚大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite significant advancements, Large Vision-Language Models (LVLMs) continue to face challenges in complex visual reasoning tasks that demand deep contextual understanding, multi-angle analysis, or meticulous detail recognition. Existing approaches often rely on single-shot image encoding and prompts, limiting their ability to fully capture nuanced visual information. Inspired by the notion that strategically generated “additional” information can serve as beneficial contextual augmentation, we propose Multi-Perspective Contextual Augmentation for Reasoning (MPCAR), a novel inference-time strategy designed to enhance LVLM performance. MPCAR operates in three stages: first, an LVLM generates N diverse and complementary descriptions or preliminary reasoning paths from various angles; second, these descriptions are intelligently integrated with the original question to construct a comprehensive context-augmented prompt; and finally, this enriched prompt guides the ultimate LVLM for deep reasoning and final answer generation. Crucially, MPCAR achieves these enhancements without requiring any fine-tuning of the underlying LVLM’s parameters. Extensive experiments on challenging Visual Question Answering (VQA) datasets, including GQA, VQA-CP v2, and ScienceQA (Image-VQA), demonstrate that MPCAR consistently outperforms established baseline methods. Our quantitative results show significant accuracy gains, particularly on tasks requiring robust contextual understanding, while human evaluations confirm improved coherence and completeness of the generated answers. Ablation studies further highlight the importance of diverse prompt templates and the number of generated perspectives. This work underscores the efficacy of leveraging LVLMs’ inherent generative capabilities to enrich input contexts, thereby unlocking their latent reasoning potential for complex multimodal tasks.
zh
[CV-97] Federated Cross-Modal Style-Aware Prompt Generation
【速读】:该论文旨在解决联邦学习中视觉-语言模型(如CLIP)因仅依赖最终层特征而忽略多尺度视觉线索与客户端数据域特有风格差异的问题,从而限制了模型在跨域场景下的泛化能力。其解决方案的关键在于提出FedCSAP(Federated Cross-Modal Style-Aware Prompt Generation)框架,通过融合CLIP视觉编码器的低、中、高层特征与基于批次统计量提取的客户端特定风格指标,生成兼具视觉细节与文本语境信息的鲁棒提示令牌(prompt tokens),实现对已见和未见类别的有效泛化,同时保障数据隐私并适应非独立同分布(non-IID)的客户端数据分布。
链接: https://arxiv.org/abs/2508.12399
作者: Suraj Prasad,Navyansh Mahla,Sunny Gupta,Amit Sethi
机构: Indian Institute of Technology Bombay (印度理工学院孟买分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Prompt learning has propelled vision-language models like CLIP to excel in diverse tasks, making them ideal for federated learning due to computational efficiency. However, conventional approaches that rely solely on final-layer features miss out on rich multi-scale visual cues and domain-specific style variations in decentralized client data. To bridge this gap, we introduce FedCSAP (Federated Cross-Modal Style-Aware Prompt Generation). Our framework harnesses low, mid, and high-level features from CLIP’s vision encoder alongside client-specific style indicators derived from batch-level statistics. By merging intricate visual details with textual context, FedCSAP produces robust, context-aware prompt tokens that are both distinct and non-redundant, thereby boosting generalization across seen and unseen classes. Operating within a federated learning paradigm, our approach ensures data privacy through local training and global aggregation, adeptly handling non-IID class distributions and diverse domain-specific styles. Comprehensive experiments on multiple image classification datasets confirm that FedCSAP outperforms existing federated prompt learning methods in both accuracy and overall generalization.
zh
[CV-98] DeCoT: Decomposing Complex Instructions for Enhanced Text-to-Image Generation with Large Language Models
【速读】:该论文旨在解决当前文本到图像(Text-to-Image, T2I)模型在处理复杂、长文本指令时存在的局限性,尤其是难以准确还原细节、空间关系及特定约束的问题。其核心解决方案是提出一种名为DeCoT(Decomposition-CoT)的新框架,关键在于利用大语言模型(Large Language Models, LLMs)对原始指令进行分解与语义增强,将模糊或复杂的自然语言转化为结构化、可执行的语义单元,并通过多阶段提示整合与自适应生成策略,构建适配现有T2I模型的优化提示体系,从而显著提升图像生成的质量与指令遵循准确性。
链接: https://arxiv.org/abs/2508.12396
作者: Xiaochuan Lin,Xiangyong Chen,Xuan Li,Yichen Su
机构: Henan Polytechnic University (河南理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite remarkable advancements, current Text-to-Image (T2I) models struggle with complex, long-form textual instructions, frequently failing to accurately render intricate details, spatial relationships, or specific constraints. This limitation is highlighted by benchmarks such as LongBench-T2I, which reveal deficiencies in handling composition, specific text, and fine textures. To address this, we propose DeCoT (Decomposition-CoT), a novel framework that leverages Large Language Models (LLMs) to significantly enhance T2I models’ understanding and execution of complex instructions. DeCoT operates in two core stages: first, Complex Instruction Decomposition and Semantic Enhancement, where an LLM breaks down raw instructions into structured, actionable semantic units and clarifies ambiguities; second, Multi-Stage Prompt Integration and Adaptive Generation, which transforms these units into a hierarchical or optimized single prompt tailored for existing T2I models. Extensive experiments on the LongBench-T2I dataset demonstrate that DeCoT consistently and substantially improves the performance of leading T2I models across all evaluated dimensions, particularly in challenging aspects like “Text” and “Composition”. Quantitative results, validated by multiple MLLM evaluators (Gemini-2.0-Flash and InternVL3-78B), show that DeCoT, when integrated with Infinity-8B, achieves an average score of 3.52, outperforming the baseline Infinity-8B (3.44). Ablation studies confirm the critical contribution of each DeCoT component and the importance of sophisticated LLM prompting. Furthermore, human evaluations corroborate these findings, indicating superior perceptual quality and instruction fidelity. DeCoT effectively bridges the gap between high-level user intent and T2I model requirements, leading to more faithful and accurate image generation.
zh
[CV-99] ViT-EnsembleAttack: Augmenting Ensemble Models for Stronger Adversarial Transferability in Vision Transformers
【速读】:该论文旨在解决当前基于集成(ensemble-based)的对抗攻击方法在视觉Transformer(Vision Transformer, ViT)模型上迁移能力不足的问题。现有研究多集中于优化集成权重或路径,而忽视了通过增强集成模型本身的泛化能力来提升攻击迁移性,尤其对ViT类模型的关注较少。其解决方案的关键在于引入对抗增强(adversarial augmentation)策略,针对每个代理ViT模型生成增强版本,具体包括多头丢弃(Multi-head dropping)、注意力分数缩放(Attention score scaling)和MLP特征混合(MLP feature mixing)三种方式,并通过贝叶斯优化自动调节参数;同时设计自动重加权(Automatic Reweighting)与步长放大(Step Size Enlargement)模块进一步提升迁移性能。该方法首次系统性地为ViT构建了面向集成攻击的对抗增强框架,显著提升了攻击在不同ViT模型间的迁移成功率。
链接: https://arxiv.org/abs/2508.12384
作者: Hanwen Cao,Haobo Lu,Xiaosen Wang,Kun He
机构: Huazhong University of Science and Technology (华中科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Ensemble-based attacks have been proven to be effective in enhancing adversarial transferability by aggregating the outputs of models with various architectures. However, existing research primarily focuses on refining ensemble weights or optimizing the ensemble path, overlooking the exploration of ensemble models to enhance the transferability of adversarial attacks. To address this gap, we propose applying adversarial augmentation to the surrogate models, aiming to boost overall generalization of ensemble models and reduce the risk of adversarial overfitting. Meanwhile, observing that ensemble Vision Transformers (ViTs) gain less attention, we propose ViT-EnsembleAttack based on the idea of model adversarial augmentation, the first ensemble-based attack method tailored for ViTs to the best of our knowledge. Our approach generates augmented models for each surrogate ViT using three strategies: Multi-head dropping, Attention score scaling, and MLP feature mixing, with the associated parameters optimized by Bayesian optimization. These adversarially augmented models are ensembled to generate adversarial examples. Furthermore, we introduce Automatic Reweighting and Step Size Enlargement modules to boost transferability. Extensive experiments demonstrate that ViT-EnsembleAttack significantly enhances the adversarial transferability of ensemble-based attacks on ViTs, outperforming existing methods by a substantial margin. Code is available at this https URL.
zh
[CV-100] IPGPhormer: Interpretable Pathology Graph-Transformer for Survival Analysis
【速读】:该论文旨在解决病理图像(pathological images)在癌症预后评估中,现有生存分析方法难以有效建模长程空间关系与局部上下文依赖,且缺乏内在可解释性的问题。其解决方案的关键在于提出一种名为可解释病理图Transformer(Interpretable Pathology Graph-Transformer, IPGPhormer)的新框架,该框架通过构建组织微环境的图结构来捕捉空间依赖关系,并在无需后处理人工标注的情况下实现组织层面和细胞层面的双重可解释性,从而提升预测准确性并增强临床实用性。
链接: https://arxiv.org/abs/2508.12381
作者: Guo Tang,Songhan Jiang,Jinpeng Lu,Linghan Cai,Yongbing Zhang
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 13 pages, 5 figures
Abstract:Pathological images play an essential role in cancer prognosis, while survival analysis, which integrates computational techniques, can predict critical clinical events such as patient mortality or disease recurrence from whole-slide images (WSIs). Recent advancements in multiple instance learning have significantly improved the efficiency of survival analysis. However, existing methods often struggle to balance the modeling of long-range spatial relationships with local contextual dependencies and typically lack inherent interpretability, limiting their clinical utility. To address these challenges, we propose the Interpretable Pathology Graph-Transformer (IPGPhormer), a novel framework that captures the characteristics of the tumor microenvironment and models their spatial dependencies across the tissue. IPGPhormer uniquely provides interpretability at both tissue and cellular levels without requiring post-hoc manual annotations, enabling detailed analyses of individual WSIs and cross-cohort assessments. Comprehensive evaluations on four public benchmark datasets demonstrate that IPGPhormer outperforms state-of-the-art methods in both predictive accuracy and interpretability. In summary, our method, IPGPhormer, offers a promising tool for cancer prognosis assessment, paving the way for more reliable and interpretable decision-support systems in pathology. The code is publicly available at this https URL.
zh
[CV-101] Synthetic Data is Sufficient for Zero-Shot Visual Generalization from Offline Data
【速读】:该论文旨在解决视觉驱动的离线强化学习(Offline Reinforcement Learning, Offline RL)中因训练数据多样性不足而导致策略泛化能力差的问题。由于视觉数据常包含噪声、干扰和虚假相关性,若缺乏充分的环境状态覆盖,模型易过拟合并难以适应未见环境。解决方案的关键在于提出一种两阶段合成数据生成方法:首先通过数据增强提升原始离线数据的多样性以改善零样本泛化性能,随后利用扩散模型在潜在空间中生成额外训练样本;该方法无需修改现有无模型离线RL算法即可显著缩小测试时的泛化差距,同时保持计算效率,从而有效提升策略在复杂视觉场景下的鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2508.12356
作者: Ahmet H. Güzel,Ilija Bogunovic,Jack Parker-Holder
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Offline reinforcement learning (RL) offers a promising framework for training agents using pre-collected datasets without the need for further environment interaction. However, policies trained on offline data often struggle to generalise due to limited exposure to diverse states. The complexity of visual data introduces additional challenges such as noise, distractions, and spurious correlations, which can misguide the policy and increase the risk of overfitting if the training data is not sufficiently diverse. Indeed, this makes it challenging to leverage vision-based offline data in training robust agents that can generalize to unseen environments. To solve this problem, we propose a simple approach generating additional synthetic training data. We propose a two-step process, first augmenting the originally collected offline data to improve zero-shot generalization by introducing diversity, then using a diffusion model to generate additional data in latent space. We test our method across both continuous action spaces (Visual D4RL) and discrete action spaces (Procgen), demonstrating that it significantly improves generalization without requiring any algorithmic changes to existing model-free offline RL methods. We show that our method not only increases the diversity of the training data but also significantly reduces the generalization gap at test time while maintaining computational efficiency. We believe this approach could fuel additional progress in generating synthetic data to train more general agents in the future.
zh
[CV-102] EgoLoc: A Generalizable Solution for Temporal Interaction Localization in Egocentric Videos
【速读】:该论文旨在解决第一人称视觉(egocentric vision)中手-物体交互的时序定位问题(Temporal Interaction Localization, TIL),即精准识别手与目标物体接触和分离的关键时刻(“when to interact”),这在混合现实沉浸式体验和机器人运动规划中至关重要。现有方法多依赖语义掩码或类别标注,存在对象定位不准、场景杂乱及对verb-noun类别强依赖等问题。其解决方案的关键在于提出一种零样本(zero-shot)方法EgoLoc:通过引入手部动态引导采样(hand-dynamics-guided sampling)生成高质量视觉提示,并利用视觉语言模型(vision-language model)识别接触/分离属性、定位具体时间戳,同时提供闭环反馈以优化结果。该方法无需对象掩码和动词-名词分类体系,实现了泛化性强的零样本交互时序定位。
链接: https://arxiv.org/abs/2508.12349
作者: Junyi Ma,Erhang Zhang,Yin-Dong Zheng,Yuchen Xie,Yixuan Zhou,Hesheng Wang
机构: IRMV Lab, the Department of Automation, Shanghai Jiao Tong University (上海交通大学自动化系)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Extended journal version of arXiv:2506.03662
Abstract:Analyzing hand-object interaction in egocentric vision facilitates VR/AR applications and human-robot policy transfer. Existing research has mostly focused on modeling the behavior paradigm of interactive actions (i.e., how to interact''). However, the more challenging and fine-grained problem of capturing the critical moments of contact and separation between the hand and the target object (i.e.,
when to interact’') is still underexplored, which is crucial for immersive interactive experiences in mixed reality and robotic motion planning. Therefore, we formulate this problem as temporal interaction localization (TIL). Some recent works extract semantic masks as TIL references, but suffer from inaccurate object grounding and cluttered scenarios. Although current temporal action localization (TAL) methods perform well in detecting verb-noun action segments, they rely on category annotations during training and exhibit limited precision in localizing hand-object contact/separation moments. To address these issues, we propose a novel zero-shot approach dubbed EgoLoc to localize hand-object contact and separation timestamps in egocentric videos. EgoLoc introduces hand-dynamics-guided sampling to generate high-quality visual prompts. It exploits the vision-language model to identify contact/separation attributes, localize specific timestamps, and provide closed-loop feedback for further refinement. EgoLoc eliminates the need for object masks and verb-noun taxonomies, leading to generalizable zero-shot implementation. Comprehensive experiments on the public dataset and our novel benchmarks demonstrate that EgoLoc achieves plausible TIL for egocentric videos. It is also validated to effectively facilitate multiple downstream applications in egocentric vision and robotic manipulation tasks. Code and relevant data will be released at this https URL.
zh
[CV-103] MBMamba: When Memory Buffer Meets Mamba for Structure-Aware Image Deblurring
【速读】:该论文旨在解决Mamba架构在图像去模糊任务中因扁平扫描策略导致的局部像素遗忘和通道冗余问题,从而限制了其对二维空间信息的有效聚合。解决方案的关键在于不改变原始Mamba结构的前提下,提出两个核心设计:一是引入记忆缓冲机制(memory buffer mechanism),用于保存历史信息以供后续融合,从而可靠地建模相邻特征间的相关性;二是设计一种受Ising模型启发的正则化损失函数,模拟物理系统中像素间的“相互吸引”能量最小化过程,有助于保持图像结构的一致性和完整性。基于此,作者构建了MBMamba网络,在多个基准测试中优于现有先进方法。
链接: https://arxiv.org/abs/2508.12346
作者: Hu Gao,Depeng Dang
机构: Shanghai Jiao Tong University (上海交通大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The Mamba architecture has emerged as a promising alternative to CNNs and Transformers for image deblurring. However, its flatten-and-scan strategy often results in local pixel forgetting and channel redundancy, limiting its ability to effectively aggregate 2D spatial information. Although existing methods mitigate this by modifying the scan strategy or incorporating local feature modules, it increase computational complexity and hinder real-time performance. In this paper, we propose a structure-aware image deblurring network without changing the original Mamba architecture. Specifically, we design a memory buffer mechanism to preserve historical information for later fusion, enabling reliable modeling of relevance between adjacent features. Additionally, we introduce an Ising-inspired regularization loss that simulates the energy minimization of the physical system’s “mutual attraction” between pixels, helping to maintain image structure and coherence. Building on this, we develop MBMamba. Experimental results show that our method outperforms state-of-the-art approaches on widely used benchmarks.
zh
[CV-104] AquaFeat: A Features-Based Image Enhancement Model for Underwater Object Detection
【速读】:该论文旨在解决水下环境中图像严重退化导致目标检测模型性能下降的问题,传统图像增强方法因未针对下游检测任务进行优化而效果有限。其解决方案的关键在于提出一种名为AquaFeat的即插即用模块,通过端到端训练将多尺度特征增强网络与检测器损失函数相结合,使增强过程显式地聚焦于提升对检测任务最相关的特征表示,从而在保持46.5 FPS实用推理速度的同时显著提升检测精度(如mAP@0.5达0.677)。
链接: https://arxiv.org/abs/2508.12343
作者: Emanuel C. Silva,Tatiana T. Schein,Stephanie L. Brião,Guilherme L. M. Costa,Felipe G. Oliveira,Gustavo P. Almeida,Eduardo L. Silva,Sam S. Devincenzi,Karina S. Machado,Paulo L. J. Drews-Jr
机构: Centro de Ciências Computacionais (C3). Universidade Federal do Rio Grande (联邦里奥格兰德大学计算机科学中心); Instituto de Ciências Exatas e Tecnologia (ICET). Universidade Federal do Amazonas (亚马逊联邦大学理学院与技术学院)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The severe image degradation in underwater environments impairs object detection models, as traditional image enhancement methods are often not optimized for such downstream tasks. To address this, we propose AquaFeat, a novel, plug-and-play module that performs task-driven feature enhancement. Our approach integrates a multi-scale feature enhancement network trained end-to-end with the detector’s loss function, ensuring the enhancement process is explicitly guided to refine features most relevant to the detection task. When integrated with YOLOv8m on challenging underwater datasets, AquaFeat achieves state-of-the-art Precision (0.877) and Recall (0.624), along with competitive mAP scores (mAP@0.5 of 0.677 and mAP@[0.5:0.95] of 0.421). By delivering these accuracy gains while maintaining a practical processing speed of 46.5 FPS, our model provides an effective and computationally efficient solution for real-world applications, such as marine ecosystem monitoring and infrastructure inspection.
zh
[CV-105] Semantic Discrepancy-aware Detector for Image Forgery Identification
【速读】:该论文旨在解决图像伪造检测中因伪造痕迹与语义概念空间错位而导致的检测性能下降问题。其解决方案的关键在于提出一种语义差异感知检测器(Semantic Discrepancy-aware Detector, SDD),通过重建学习在细粒度视觉层面对齐伪造空间与语义概念空间;具体包括:设计语义标记采样模块以缓解无关特征引起的特征空间偏移,构建基于视觉重建范式的概念级伪造差异学习模块以增强视觉语义概念与伪造痕迹之间的交互,最终通过低层伪造特征增强模块整合概念级差异信息,从而最小化冗余伪造信息并提升检测精度。
链接: https://arxiv.org/abs/2508.12341
作者: Ziye Wang,Minghang Yu,Chunyan Xu,Zhen Cui
机构: Nanjing University of Science and Technology (南京理工大学); Beijing Normal University (北京师范大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 10 pages, 5 figures
Abstract:With the rapid advancement of image generation techniques, robust forgery detection has become increasingly imperative to ensure the trustworthiness of digital media. Recent research indicates that the learned semantic concepts of pre-trained models are critical for identifying fake images. However, the misalignment between the forgery and semantic concept spaces hinders the model’s forgery detection performance. To address this problem, we propose a novel Semantic Discrepancy-aware Detector (SDD) that leverages reconstruction learning to align the two spaces at a fine-grained visual level. By exploiting the conceptual knowledge embedded in the pre-trained vision language model, we specifically design a semantic token sampling module to mitigate the space shifts caused by features irrelevant to both forgery traces and semantic concepts. A concept-level forgery discrepancy learning module, built upon a visual reconstruction paradigm, is proposed to strengthen the interaction between visual semantic concepts and forgery traces, effectively capturing discrepancies under the concepts’ guidance. Finally, the low-level forgery feature enhancemer integrates the learned concept level forgery discrepancies to minimize redundant forgery information. Experiments conducted on two standard image forgery datasets demonstrate the efficacy of the proposed SDD, which achieves superior results compared to existing methods. The code is available at this https URL.
zh
[CV-106] Geometry-Aware Video Inpainting for Joint Headset Occlusion Removal and Face Reconstruction in Social XR
【速读】:该论文旨在解决头戴式显示器(Head-mounted Display, HMD)在扩展现实(Extended Reality, XR)应用中遮挡用户面部上部区域的问题,这会严重影响外部视频录制质量,并削弱远程会议等社交XR场景中面部表情与眼神交流的沉浸感。解决方案的关键在于提出一种几何感知的学习框架,通过结合生成对抗网络(GAN-based)的视频修复模块与SynergyNet驱动的3D可变形模型(3D Morphable Model, 3DMM)参数回归模块,实现从单视角RGB帧中同时去除HMD遮挡并重建完整的三维人脸几何结构。该方法利用密集人脸关键点和一个无遮挡参考帧引导修复过程,确保身份一致性与视觉真实性,且通过端到端的关键点优化机制提升修复质量和几何保真度。
链接: https://arxiv.org/abs/2508.12336
作者: Fatemeh Ghorbani Lohesara,Karen Eguiazarian,Sebastian Knorr
机构: Technische Universität Berlin (柏林工业大学); Tampere University (坦佩雷大学); HTW Berlin (柏林应用技术大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Head-mounted displays (HMDs) are essential for experiencing extended reality (XR) environments and observing virtual content. However, they obscure the upper part of the user’s face, complicating external video recording and significantly impacting social XR applications such as teleconferencing, where facial expressions and eye gaze details are crucial for creating an immersive experience. This study introduces a geometry-aware learning-based framework to jointly remove HMD occlusions and reconstruct complete 3D facial geometry from RGB frames captured from a single viewpoint. The method integrates a GAN-based video inpainting network, guided by dense facial landmarks and a single occlusion-free reference frame, to restore missing facial regions while preserving identity. Subsequently, a SynergyNet-based module regresses 3D Morphable Model (3DMM) parameters from the inpainted frames, enabling accurate 3D face reconstruction. Dense landmark optimization is incorporated throughout the pipeline to improve both the inpainting quality and the fidelity of the recovered geometry. Experimental results demonstrate that the proposed framework can successfully remove HMDs from RGB facial videos while maintaining facial identity and realism, producing photorealistic 3D face geometry outputs. Ablation studies further show that the framework remains robust across different landmark densities, with only minor quality degradation under sparse landmark configurations.
zh
[CV-107] DoppDrive: Doppler-Driven Temporal Aggregation for Improved Radar Object Detection ICCV2025
【速读】:该论文旨在解决雷达点云稀疏性(尤其在远距离时)对自动驾驶中目标检测性能造成的挑战。现有方法通过运动补偿的时序聚合提升点密度,但会引入动态物体带来的散射噪声,从而降低检测精度。其解决方案的关键在于提出DoppDrive——一种基于多普勒(Doppler)信息的时序聚合方法:利用点云中每个点的多普勒分量沿径向位移以消除径向散射,并根据多普勒和角度为每个点分配独特的聚合持续时间,从而最小化切向散射。该方法作为预处理步骤独立于检测器,兼容多种检测框架,在多个数据集上显著提升了检测性能。
链接: https://arxiv.org/abs/2508.12330
作者: Yuval Haitman,Oded Bialer
机构: General Motors(通用汽车); Ben Gurion University of the Negev(本古里安大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCV 2025
Abstract:Radar-based object detection is essential for autonomous driving due to radar’s long detection range. However, the sparsity of radar point clouds, especially at long range, poses challenges for accurate detection. Existing methods increase point density through temporal aggregation with ego-motion compensation, but this approach introduces scatter from dynamic objects, degrading detection performance. We propose DoppDrive, a novel Doppler-Driven temporal aggregation method that enhances radar point cloud density while minimizing scatter. Points from previous frames are shifted radially according to their dynamic Doppler component to eliminate radial scatter, with each point assigned a unique aggregation duration based on its Doppler and angle to minimize tangential scatter. DoppDrive is a point cloud density enhancement step applied before detection, compatible with any detector, and we demonstrate that it significantly improves object detection performance across various detectors and datasets.
zh
[CV-108] Attention Pooling Enhances NCA-based Classification of Microscopy Images
【速读】:该论文旨在解决神经元细胞自动机(Neural Cellular Automata, NCA)在显微图像分类任务中性能落后于更大、更复杂模型的问题。其解决方案的关键在于引入注意力池化(attention pooling)机制,通过强化对图像中最信息丰富的区域的关注,提升特征提取能力并改善分类准确性。该方法在八个多样化的显微图像数据集上验证有效,显著优于现有NCA方法,同时保持参数效率和可解释性,并在与轻量级卷积神经网络和视觉Transformer的对比中展现出更高的性能与更低的参数量。
链接: https://arxiv.org/abs/2508.12324
作者: Chen Yang,Michael Deutges,Jingsong Liu,Han Li,Nassir Navab,Carsten Marr,Ario Sadafi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Neural Cellular Automata (NCA) offer a robust and interpretable approach to image classification, making them a promising choice for microscopy image analysis. However, a performance gap remains between NCA and larger, more complex architectures. We address this challenge by integrating attention pooling with NCA to enhance feature extraction and improve classification accuracy. The attention pooling mechanism refines the focus on the most informative regions, leading to more accurate predictions. We evaluate our method on eight diverse microscopy image datasets and demonstrate that our approach significantly outperforms existing NCA methods while remaining parameter-efficient and explainable. Furthermore, we compare our method with traditional lightweight convolutional neural network and vision transformer architectures, showing improved performance while maintaining a significantly lower parameter count. Our results highlight the potential of NCA-based models an alternative for explainable image classification.
zh
[CV-109] Neural Cellular Automata for Weakly Supervised Segmentation of White Blood Cells
【速读】:该论文旨在解决白细胞(white blood cell)在血涂片图像中的检测与分割问题,这是医学诊断中自动化血细胞计数、形态分析、分类及疾病监测等下游任务的关键步骤。传统方法依赖大量标注数据训练模型,而获取这些数据既耗时又昂贵。为应对这一挑战,作者提出了一种基于神经元胞自动机的弱监督分割方法(NCA-WSS),其核心创新在于利用训练过程中神经元胞自动机(Neural Cellular Automata, NCA)生成的特征图直接提取分割掩膜(segmentation masks),无需重新使用分割标签进行训练。实验表明,该方法在三个白细胞显微图像数据集上显著优于现有弱监督分割方法,展示了NCA在弱监督框架下同时实现分类与分割的潜力,为医学图像分析提供了一种可扩展且高效的解决方案。
链接: https://arxiv.org/abs/2508.12322
作者: Michael Deutges,Chen Yang,Raheleh Salehi,Nassir Navab,Carsten Marr,Ario Sadafi
机构: Helmholtz Munich (赫尔姆霍兹慕尼黑研究所)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The detection and segmentation of white blood cells in blood smear images is a key step in medical diagnostics, supporting various downstream tasks such as automated blood cell counting, morphological analysis, cell classification, and disease diagnosis and monitoring. Training robust and accurate models requires large amounts of labeled data, which is both time-consuming and expensive to acquire. In this work, we propose a novel approach for weakly supervised segmentation using neural cellular automata (NCA-WSS). By leveraging the feature maps generated by NCA during classification, we can extract segmentation masks without the need for retraining with segmentation labels. We evaluate our method on three white blood cell microscopy datasets and demonstrate that NCA-WSS significantly outperforms existing weakly supervised approaches. Our work illustrates the potential of NCA for both classification and segmentation in a weakly supervised framework, providing a scalable and efficient solution for medical image analysis.
zh
[CV-110] Improving Densification in 3D Gaussian Splatting for High-Fidelity Rendering
【速读】:该论文旨在解决3D高斯散射(3D Gaussian Splatting, 3DGS)在实时渲染中因稀疏化策略导致重建质量不佳的问题。其核心解决方案从三个维度优化了密度增长(densification)流程:首先提出边缘感知评分(Edge-Aware Score)以更精准地选择待分裂的高斯分布;其次引入长轴分裂策略(Long-Axis Split),降低克隆与分裂操作带来的几何失真;最后通过恢复感知剪枝(Recovery-Aware Pruning)、多步更新(Multi-step Update)和生长控制(Growth Control)有效缓解过拟合现象。整体方法在不增加训练或推理开销的前提下,显著提升渲染保真度,并以更少的高斯分布实现当前最优性能。
链接: https://arxiv.org/abs/2508.12313
作者: Xiaobin Deng,Changyu Diao,Min Li,Ruohan Yu,Duanqing Xu
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Although 3D Gaussian Splatting (3DGS) has achieved impressive performance in real-time rendering, its densification strategy often results in suboptimal reconstruction quality. In this work, we present a comprehensive improvement to the densification pipeline of 3DGS from three perspectives: when to densify, how to densify, and how to mitigate overfitting. Specifically, we propose an Edge-Aware Score to effectively select candidate Gaussians for splitting. We further introduce a Long-Axis Split strategy that reduces geometric distortions introduced by clone and split operations. To address overfitting, we design a set of techniques, including Recovery-Aware Pruning, Multi-step Update, and Growth Control. Our method enhances rendering fidelity without introducing additional training or inference overhead, achieving state-of-the-art performance with fewer Gaussians.
zh
[CV-111] CLAIR: CLIP-Aided Weakly Supervised Zero-Shot Cross-Domain Image Retrieval BMVC2025
【速读】:该论文旨在解决弱监督零样本跨域图像检索(Weakly Supervised Zero-Shot Cross-Domain Image Retrieval, WSZS-CDIR)中因大型基础模型(如CLIP)生成噪声伪标签而导致的性能下降问题。解决方案的关键在于提出CLAIR框架:首先利用CLIP文本与图像特征间的相似度计算置信度分数以精炼噪声伪标签;其次设计实例间和聚类间对比损失,将图像编码至类别感知的潜在空间,并引入域间对比损失缓解域间差异;进一步地,通过闭式解学习一个新颖的跨域映射函数,仅使用CLIP文本嵌入将图像特征从一个域投影到另一个域,实现更精准的特征对齐;最后,引入可学习提示集增强模型对新类别的零样本泛化能力。
链接: https://arxiv.org/abs/2508.12290
作者: Chor Boon Tan,Conghui Hu,Gim Hee Lee
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: BMVC 2025
Abstract:The recent growth of large foundation models that can easily generate pseudo-labels for huge quantity of unlabeled data makes unsupervised Zero-Shot Cross-Domain Image Retrieval (UZS-CDIR) less relevant. In this paper, we therefore turn our attention to weakly supervised ZS-CDIR (WSZS-CDIR) with noisy pseudo labels generated by large foundation models such as CLIP. To this end, we propose CLAIR to refine the noisy pseudo-labels with a confidence score from the similarity between the CLIP text and image features. Furthermore, we design inter-instance and inter-cluster contrastive losses to encode images into a class-aware latent space, and an inter-domain contrastive loss to alleviate domain discrepancies. We also learn a novel cross-domain mapping function in closed-form, using only CLIP text embeddings to project image features from one domain to another, thereby further aligning the image features for retrieval. Finally, we enhance the zero-shot generalization ability of our CLAIR to handle novel categories by introducing an extra set of learnable prompts. Extensive experiments are carried out using TUBerlin, Sketchy, Quickdraw, and DomainNet zero-shot datasets, where our CLAIR consistently shows superior performance compared to existing state-of-the-art methods.
zh
[CV-112] SLA: A Task-Specific Learning Adaptation for Semantic Segmentation on Autonomous Vehicles Platform
【速读】:该论文旨在解决自动驾驶平台在不同硬件资源约束和任务精度要求下,如何高效部署语义分割网络的问题。针对嵌入式设备计算能力有限的挑战,论文提出通过三层次控制机制——宽度乘数(width multiplier)、分类器深度(classifier depth)和分类器核大小(classifier kernel),实现对模型组件的细粒度调控,从而在保证性能的前提下优化资源分配。其解决方案的关键在于结合贝叶斯优化(Bayesian Optimization)与代理建模技术,在严苛的计算预算内自动搜索最优超参数配置,使模型能够根据具体场景和任务需求动态调整Multiply-Accumulate Operations(MACs)数量,实现任务特定学习适应性(Task-Specific Learning Adaptation, TSLA),最终提升硬件利用率与模型准确性。
链接: https://arxiv.org/abs/2508.12279
作者: Jun Liu,Zhenglun Kong,Pu Zhao,Weihao Zeng,Hao Tang,Xuan Shen,Changdi Yang,Wenbin Zhang,Geng Yuan,Wei Niu,Xue Lin,Yanzhi Wang
机构: Northeastern University (东北大学); Carnegie Mellon University (卡内基梅隆大学); University of Georgia (佐治亚大学); Florida International University (佛罗里达国际大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG)
备注:
Abstract:Autonomous driving platforms encounter diverse driving scenarios, each with varying hardware resources and precision requirements. Given the computational limitations of embedded devices, it is crucial to consider computing costs when deploying on target platforms like the NVIDIA\textsuperscript\textregistered DRIVE PX 2. Our objective is to customize the semantic segmentation network according to the computing power and specific scenarios of autonomous driving hardware. We implement dynamic adaptability through a three-tier control mechanism – width multiplier, classifier depth, and classifier kernel – allowing fine-grained control over model components based on hardware constraints and task requirements. This adaptability facilitates broad model scaling, targeted refinement of the final layers, and scenario-specific optimization of kernel sizes, leading to improved resource allocation and performance. Additionally, we leverage Bayesian Optimization with surrogate modeling to efficiently explore hyperparameter spaces under tight computational budgets. Our approach addresses scenario-specific and task-specific requirements through automatic parameter search, accommodating the unique computational complexity and accuracy needs of autonomous driving. It scales its Multiply-Accumulate Operations (MACs) for Task-Specific Learning Adaptation (TSLA), resulting in alternative configurations tailored to diverse self-driving tasks. These TSLA customizations maximize computational capacity and model accuracy, optimizing hardware utilization. Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR); Machine Learning (cs.LG) Cite as: arXiv:2508.12279 [cs.CV] (or arXiv:2508.12279v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.12279 Focus to learn more arXiv-issued DOI via DataCite (pending registration) Journalreference: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 44, no. 4, pp. 1406-1419, April 2025 Related DOI: https://doi.org/10.1109/TCAD.2024.3491015 Focus to learn more DOI(s) linking to related resources
zh
[CV-113] SNNSIR: A Simple Spiking Neural Network for Stereo Image Restoration
【速读】:该论文旨在解决传统神经网络在立体图像恢复任务中计算复杂度高、功耗大,难以满足实时低功耗应用需求的问题。其核心挑战在于如何在保持高性能的同时,适配脉冲神经网络(Spiking Neural Networks, SNNs)的离散二进制激活特性与事件驱动机制。解决方案的关键在于提出一种全脉冲驱动的架构SNNSIR,通过三个创新模块实现高效且硬件友好的计算:首先引入轻量级Spike Residual Basic Block(SRBB)以增强脉冲兼容的残差信息流;其次设计Spike Stereo Convolutional Modulation(SSCM)模块,利用元素级乘法实现简化非线性并基于跨视图感知机制突出噪声敏感区域;最后构建Spike Stereo Cross-Attention(SSCA)模块,在脉冲兼容框架内实现跨视角特征的双向交互,从而提升立体匹配精度。整体方法摒弃了浮点运算依赖,完全契合SNN的稀疏事件驱动特性,显著降低计算开销,同时在多种立体图像恢复任务中达到竞争性性能。
链接: https://arxiv.org/abs/2508.12271
作者: Ronghua Xu,Jin Xie,Jing Nie,Jiale Cao,Yanwei Pang
机构: Chongqing University (重庆大学); Shanghai Artificial Intelligence Laboratory (上海人工智能实验室); Tianjin University (天津大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: 11 pages
Abstract:Spiking Neural Networks (SNNs), characterized by discrete binary activations, offer high computational efficiency and low energy consumption, making them well-suited for computation-intensive tasks such as stereo image restoration. In this work, we propose SNNSIR, a simple yet effective Spiking Neural Network for Stereo Image Restoration, specifically designed under the spike-driven paradigm where neurons transmit information through sparse, event-based binary spikes. In contrast to existing hybrid SNN-ANN models that still rely on operations such as floating-point matrix division or exponentiation, which are incompatible with the binary and event-driven nature of SNNs, our proposed SNNSIR adopts a fully spike-driven architecture to achieve low-power and hardware-friendly computation. To address the expressiveness limitations of binary spiking neurons, we first introduce a lightweight Spike Residual Basic Block (SRBB) to enhance information flow via spike-compatible residual learning. Building on this, the Spike Stereo Convolutional Modulation (SSCM) module introduces simplified nonlinearity through element-wise multiplication and highlights noise-sensitive regions via cross-view-aware modulation. Complementing this, the Spike Stereo Cross-Attention (SSCA) module further improves stereo correspondence by enabling efficient bidirectional feature interaction across views within a spike-compatible framework. Extensive experiments on diverse stereo image restoration tasks, including rain streak removal, raindrop removal, low-light enhancement, and super-resolution demonstrate that our model achieves competitive restoration performance while significantly reducing computational overhead. These results highlight the potential for real-time, low-power stereo vision applications. The code will be available after the article is accepted.
zh
[CV-114] L-SR1: Learned Symmetric-Rank-One Preconditioning
【速读】:该论文旨在解决端到端深度学习方法对大规模标注数据的依赖、在未见场景中泛化能力差以及计算资源消耗日益增长的问题,同时克服传统优化方法收敛速度慢的局限。其解决方案的关键在于提出一种新型可学习的二阶优化器,通过引入一个可训练的预条件单元(preconditioning unit),增强经典的对称秩一(Symmetric-Rank-One, SR1)算法;该单元生成数据驱动的向量以构造满足割线约束(secant constraint)的正半定秩一矩阵,并通过学习投影实现自适应更新,从而在无需标注数据或微调的情况下实现轻量级、高泛化性能的优化策略,适用于集成至更广泛的基于优化的框架中。
链接: https://arxiv.org/abs/2508.12270
作者: Gal Lifshitz,Shahar Zuler,Ori Fouks,Dan Raviv
机构: Tel Aviv University (特拉维夫大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:End-to-end deep learning has achieved impressive results but remains limited by its reliance on large labeled datasets, poor generalization to unseen scenarios, and growing computational demands. In contrast, classical optimization methods are data-efficient and lightweight but often suffer from slow convergence. While learned optimizers offer a promising fusion of both worlds, most focus on first-order methods, leaving learned second-order approaches largely unexplored. We propose a novel learned second-order optimizer that introduces a trainable preconditioning unit to enhance the classical Symmetric-Rank-One (SR1) algorithm. This unit generates data-driven vectors used to construct positive semi-definite rank-one matrices, aligned with the secant constraint via a learned projection. Our method is evaluated through analytic experiments and on the real-world task of Monocular Human Mesh Recovery (HMR), where it outperforms existing learned optimization-based approaches. Featuring a lightweight model and requiring no annotated data or fine-tuning, our approach offers strong generalization and is well-suited for integration into broader optimization-based frameworks. Comments: Under review Subjects: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.12270 [cs.LG] (or arXiv:2508.12270v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.12270 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-115] race: Click-Based Gaze Visualization on the Apple Vision Pro SIGGRAPH
【速读】:该论文旨在解决苹果Vision Pro设备因隐私限制无法直接获取连续用户注视数据的问题,从而阻碍了基于眼动追踪的交互与分析研究。其解决方案的关键在于提出iTrace系统,通过点击式注视提取技术(包括手动捏合手势、停留控制和游戏控制器自动点击)绕过硬件限制,并构建客户端-服务器架构实现注视坐标采集与动态热力图生成。该方法在保持91%高注视精度的同时,显著提升了数据密度(如8BitDo控制器达到14.22次/秒点击率),支持个体与群体注意力模式的可视化分析,为教育内容参与度评估、环境设计优化、营销分析及临床认知评估等场景提供可行工具。
链接: https://arxiv.org/abs/2508.12268
作者: Esra Mehmedova,Santiago Berrezueta-Guzman,Stefan Wagner
机构: Technical University of Munich (慕尼黑工业大学); HeilbronnGermany (海德堡德国)
类目: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV)
备注: Paper submitted to ACM SIGGRAPH Motion, Interaction and Games 2025 (MIG 2025)
Abstract:The Apple Vision Pro is equipped with accurate eye-tracking capabilities, yet the privacy restrictions on the device prevent direct access to continuous user gaze data. This study introduces iTrace, a novel application that overcomes these limitations through click-based gaze extraction techniques, including manual methods like a pinch gesture, and automatic approaches utilizing dwell control or a gaming controller. We developed a system with a client-server architecture that captures the gaze coordinates and transforms them into dynamic heatmaps for video and spatial eye tracking. The system can generate individual and averaged heatmaps, enabling analysis of personal and collective attention patterns. To demonstrate its effectiveness and evaluate the usability and performance, a study was conducted with two groups of 10 participants, each testing different clicking methods. The 8BitDo controller achieved higher average data collection rates at 14.22 clicks/s compared to 0.45 clicks/s with dwell control, enabling significantly denser heatmap visualizations. The resulting heatmaps reveal distinct attention patterns, including concentrated focus in lecture videos and broader scanning during problem-solving tasks. By allowing dynamic attention visualization while maintaining a high gaze precision of 91 %, iTrace demonstrates strong potential for a wide range of applications in educational content engagement, environmental design evaluation, marketing analysis, and clinical cognitive assessment. Despite the current gaze data restrictions on the Apple Vision Pro, we encourage developers to use iTrace only in research settings. Comments: Paper submitted to ACM SIGGRAPH Motion, Interaction and Games 2025 (MIG 2025) Subjects: Human-Computer Interaction (cs.HC); Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.12268 [cs.HC] (or arXiv:2508.12268v1 [cs.HC] for this version) https://doi.org/10.48550/arXiv.2508.12268 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-116] Region-Level Context-Aware Multimodal Understanding
【速读】:该论文旨在解决当前多模态大语言模型(Multimodal Large Language Models, MLLMs)在区域级上下文感知理解能力上的不足,即现有模型主要关注通用视觉理解,而缺乏对图像中特定对象与其关联文本信息进行融合的能力,这种能力被称为区域级上下文感知多模态理解(Region-level Context-aware Multimodal Understanding, RCMU)。解决方案的关键在于提出一种名为区域级上下文感知视觉指令微调(Region-level Context-aware Visual Instruction Tuning, RCVIT)的方法,该方法通过将对象信息和边界框坐标引入模型输入,使模型能够有效关联对象的视觉内容与其对应的文本描述,从而实现更精细的区域级语义整合。此外,研究还构建了RCMU数据集与RC\P-Bench基准测试体系,并设计了一种无参考评估指标,以系统性地推动该方向的研究与应用。
链接: https://arxiv.org/abs/2508.12263
作者: Hongliang Wei,Xianqi Zhang,Xingtao Wang,Xiaopeng Fan,Debin Zhao
机构: Harbin Institute of Technology (哈尔滨工业大学); Harbin Institute of Technology Suzhou Research Institute (哈尔滨工业大学苏州研究院); Peng Cheng Laboratory (鹏城实验室)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 6 figures
Abstract:Despite significant progress, existing research on Multimodal Large Language Models (MLLMs) mainly focuses on general visual understanding, overlooking the ability to integrate textual context associated with objects for a more context-aware multimodal understanding – an ability we refer to as Region-level Context-aware Multimodal Understanding (RCMU). To address this limitation, we first formulate the RCMU task, which requires models to respond to user instructions by integrating both image content and textual information of regions or objects. To equip MLLMs with RCMU capabilities, we propose Region-level Context-aware Visual Instruction Tuning (RCVIT), which incorporates object information into the model input and enables the model to utilize bounding box coordinates to effectively associate objects’ visual content with their textual information. To address the lack of datasets, we introduce the RCMU dataset, a large-scale visual instruction tuning dataset that covers multiple RCMU tasks. We also propose RC\P-Bench, a comprehensive benchmark that can evaluate the performance of MLLMs in RCMU and multimodal personalized understanding tasks. Additionally, we propose a reference-free evaluation metric to perform a comprehensive and fine-grained evaluation of the region-level context-aware image descriptions. By performing RCVIT on Qwen2-VL models with the RCMU dataset, we developed RC-Qwen2-VL models. Experimental results indicate that RC-Qwen2-VL models not only achieve outstanding performance on multiple RCMU tasks but also demonstrate successful applications in multimodal RAG and personalized conversation. Our data, model and benchmark are available at this https URL
zh
[CV-117] Superpixel-informed Continuous Low-Rank Tensor Representation for Multi-Dimensional Data Recovery AAAI2026
【速读】:该论文旨在解决传统低秩张量表示(Low-rank Tensor Representation, LRTR)方法在实际应用中面临的两个关键问题:一是其假设整体数据具有低秩特性,这一假设在存在显著空间异质性的现实场景中往往不成立;二是其仅适用于离散网格数据,限制了模型的灵活性与泛化能力。解决方案的核心在于提出一种超像素感知的连续低秩张量表示框架(Superpixel-informed Continuous low-rank Tensor Representation, SCTR),其关键创新包括:首先,以超像素(superpixel)作为基本建模单元,利用其语义一致性增强低秩特性并提升对多样化数据流的适应性;其次,设计了一种非对称低秩张量分解(Asymmetric Low-rank Tensor Factorization, ALTF),通过共享神经网络结合专用分支参数化超像素特异性因子矩阵,在全局模式学习与局部自适应之间实现高效平衡,从而获得表达能力强且紧凑的连续表示。
链接: https://arxiv.org/abs/2508.12261
作者: Zhizhou Wang,Ruijing Zheng,Zhenyu Wu,Jianli Wang(School of Computing and Artificial Intelligence, Southwest Jiaotong University)
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review in AAAI2026
Abstract:Low-rank tensor representation (LRTR) has emerged as a powerful tool for multi-dimensional data processing. However, classical LRTR-based methods face two critical limitations: (1) they typically assume that the holistic data is low-rank, this assumption is often violated in real-world scenarios with significant spatial variations; and (2) they are constrained to discrete meshgrid data, limiting their flexibility and applicability. To overcome these limitations, we propose a Superpixel-informed Continuous low-rank Tensor Representation (SCTR) framework, which enables continuous and flexible modeling of multi-dimensional data beyond traditional grid-based constraints. Our approach introduces two main innovations: First, motivated by the observation that semantically coherent regions exhibit stronger low-rank characteristics than holistic data, we employ superpixels as the basic modeling units. This design not only encodes rich semantic information, but also enhances adaptability to diverse forms of data streams. Second, we propose a novel asymmetric low-rank tensor factorization (ALTF) where superpixel-specific factor matrices are parameterized by a shared neural network with specialized heads. By strategically separating global pattern learning from local adaptation, this framework efficiently captures both cross-superpixel commonalities and within-superpixel variations. This yields a representation that is both highly expressive and compact, balancing model efficiency with adaptability. Extensive experiments on several benchmark datasets demonstrate that SCTR achieves 3-5 dB PSNR improvements over existing LRTR-based methods across multispectral images, videos, and color images.
zh
[CV-118] WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
【速读】:该论文旨在解决复杂环境中天气噪声对显著性目标检测(Salient Object Detection, SOD)性能造成负面影响的问题,这一问题长期受限于缺乏带有像素级标注的多天气场景数据集。为填补这一空白,作者构建了首个Weather-eXtended Salient Object Detection (WXSOD) 数据集,包含14,945张带多样天气噪声的RGB图像及其对应的真值标注和天气标签,并设计了合成与真实两类测试集以验证算法泛化能力。解决方案的关键在于提出一种名为Weather-aware Feature Aggregation Network (WFANet) 的高效基线模型,其采用全监督双分支架构:其中天气预测分支挖掘与天气相关的深层特征,而显著性检测分支则融合骨干网络提取的语义特征与天气特征,从而实现对天气干扰的感知与抑制,显著提升SOD在恶劣天气下的鲁棒性与准确性。
链接: https://arxiv.org/abs/2508.12250
作者: Quan Chen,Xiong Yang,Rongfeng Lu,Qianyu Zhang,Yu Liu,Xiaofei Zhou,Bolun Zheng
机构: Hangzhou Dianzi University (杭州电子科技大学); Tsinghua University (清华大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Under review
Abstract:Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the damage of weather noise on SOD performance due to the lack of dataset with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods shows that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at this https URL
zh
[CV-119] In vivo 3D ultrasound computed tomography of musculoskeletal tissues with generative neural physics
【速读】:该论文旨在解决超声计算机断层成像(Ultrasound Computed Tomography, USCT)在肌肉骨骼系统成像中因传统射线基重建方法忽略强散射效应而导致的图像质量受限问题。其关键解决方案是提出了一种生成式神经物理框架(generative neural physics framework),该框架将生成网络与物理信息神经模拟相结合,通过仅需数十张跨模态图像即可学习超声波传播的紧凑代理模型,从而融合波动建模的高精度与深度学习的高效性和稳定性,实现快速、高保真度的3D USCT重建,显著提升了对肌肉和骨骼生物力学特性的敏感性及空间分辨率。
链接: https://arxiv.org/abs/2508.12226
作者: Zhijun Zeng,Youjia Zheng,Chang Su,Qianhang Wu,Hao Hu,Zeyuan Dong,Shan Gao,Yang Lv,Rui Tang,Ligang Cui,Zhiyong Hou,Weijun Lin,Zuoqiang Shi,Yubing Li,He Sun
机构: Peking University (北京大学); Chinese Academy of Sciences (中国科学院); University of Chinese Academy of Sciences (中国科学院大学); Tsinghua University (清华大学); Peking University Third Hospital (北京大学第三医院); Hebei Medical University (河北医科大学); Yau Mathematical Sciences Center (北京雁栖湖数学科学中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Ultrasound computed tomography (USCT) is a radiation-free, high-resolution modality but remains limited for musculoskeletal imaging due to conventional ray-based reconstructions that neglect strong scattering. We propose a generative neural physics framework that couples generative networks with physics-informed neural simulation for fast, high-fidelity 3D USCT. By learning a compact surrogate of ultrasonic wave propagation from only dozens of cross-modality images, our method merges the accuracy of wave modeling with the efficiency and stability of deep learning. This enables accurate quantitative imaging of in vivo musculoskeletal tissues, producing spatial maps of acoustic properties beyond reflection-mode images. On synthetic and in vivo data (breast, arm, leg), we reconstruct 3D maps of tissue parameters in under ten minutes, with sensitivity to biomechanical properties in muscle and bone and resolution comparable to MRI. By overcoming computational bottlenecks in strongly scattering regimes, this approach advances USCT toward routine clinical assessment of musculoskeletal disease.
zh
[CV-120] C2PSA-Enhanced YOLOv11 Architecture: A Novel Approach for Small Target Detection in Cotton Disease Diagnosis
【速读】:该论文旨在解决棉花病害检测中三个关键问题:早期病斑识别精度低(小于5mm²病斑漏检率达35%)、田间环境下模型性能下降(准确率降低25%)以及多病害场景下误检率高(达34.7%)。其解决方案的核心在于:引入C2PSA模块以增强小目标特征提取能力;采用动态类别权重机制缓解样本不平衡问题;并通过Mosaic-MixUp混合增强策略优化数据多样性与泛化性能。实验表明,改进后的YOLOv11模型在4,078张图像数据集上实现了mAP50提升至0.820(+8.0%)和mAP50-95提升至0.705(+10.5%),同时保持158 FPS的推理速度,具备移动端实时部署能力,可支持农业场景下的精准监测与治理。
链接: https://arxiv.org/abs/2508.12219
作者: Kaiyuan Wang,Jixing Liu,Xiaobo Cai
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This study presents a deep learning-based optimization of YOLOv11 for cotton disease detection, developing an intelligent monitoring system. Three key challenges are addressed: (1) low precision in early spot detection (35% leakage rate for sub-5mm2 spots), (2) performance degradation in field conditions (25% accuracy drop), and (3) high error rates (34.7%) in multi-disease scenarios. The proposed solutions include: C2PSA module for enhanced small-target feature extraction; Dynamic category weighting to handle sample imbalance; Improved data augmentation via Mosaic-MixUp scaling. Experimental results on a 4,078-image dataset show: mAP50: 0.820 (+8.0% improvement); mAP50-95: 0.705 (+10.5% improvement); Inference speed: 158 FPS. The mobile-deployed system enables real-time disease monitoring and precision treatment in agricultural applications.
zh
[CV-121] Splat Feature Solver
【速读】:该论文旨在解决3D场景理解中的特征提升(feature lifting)问题,即如何将来自多视角图像的丰富特征描述符(如DINO、CLIP等)高效且准确地映射到基于点(splat-based)的3D表示中,同时应对多视角观测带来的不一致性与噪声。其解决方案的关键在于将特征提升建模为一个稀疏线性逆问题,该 formulation 不依赖于特定核函数或特征类型(kernel- and feature-agnostic),并可在闭式解下高效求解;此外,通过引入两种互补的正则化策略——Tikhonov Guidance 保证数值稳定性(通过软对角优势约束),以及 Post-Lifting Aggregation 利用特征聚类过滤噪声输入,从而在凸损失下提供全局最优误差的理论上界,显著提升语义保真度和性能。
链接: https://arxiv.org/abs/2508.12216
作者: Butian Xiong,Rong Liu,Kenneth Xu,Meida Chen,Andrew Feng
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: webpage not that stable
Abstract:Feature lifting has emerged as a crucial component in 3D scene understanding, enabling the attachment of rich image feature descriptors (e.g., DINO, CLIP) onto splat-based 3D representations. The core challenge lies in optimally assigning rich general attributes to 3D primitives while addressing the inconsistency issues from multi-view images. We present a unified, kernel- and feature-agnostic formulation of the feature lifting problem as a sparse linear inverse problem, which can be solved efficiently in closed form. Our approach admits a provable upper bound on the global optimal error under convex losses for delivering high quality lifted features. To address inconsistencies and noise in multi-view observations, we introduce two complementary regularization strategies to stabilize the solution and enhance semantic fidelity. Tikhonov Guidance enforces numerical stability through soft diagonal dominance, while Post-Lifting Aggregation filters noisy inputs via feature clustering. Extensive experiments demonstrate that our approach achieves state-of-the-art performance on open-vocabulary 3D segmentation benchmarks, outperforming training-based, grouping-based, and heuristic-forward baselines while producing the lifted features in minutes. Code is available at \hrefthis https URL\textbfgithub. We also have a \hrefthis https URL
zh
[CV-122] Scalable RF Simulation in Generative 4D Worlds
【速读】:该论文旨在解决室内感知任务中高质量射频(Radio Frequency, RF)数据收集困难的问题,尤其是在动态和多样的室内环境中。现有方法依赖于真实场景采集,成本高且难以覆盖所有情境。解决方案的关键在于提出WaveVerse框架,其核心创新是引入语言引导的4D世界生成器:一方面利用状态感知因果Transformer实现基于空间约束和文本描述的人体运动条件生成;另一方面通过相位一致的光线追踪模拟器生成精确且物理一致的RF信号。该方案首次实现了面向RF成像的数据生成,并在有限数据与充足数据场景下均显著提升机器学习模型性能,尤其在波束赋形和呼吸监测等应用中验证了相位一致性的重要性。
链接: https://arxiv.org/abs/2508.12176
作者: Zhiwei Zheng,Dongyin Hu,Mingmin Zhao
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for indoor perception tasks. However, collecting high-quality RF data in dynamic and diverse indoor environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions. WaveVerse introduces a language-guided 4D world generator, which includes a state-aware causal transformer for human motion generation conditioned on spatial constraints and texts, and a phase-coherent ray tracing simulator that enables the simulation of accurate and coherent RF signals. Experiments demonstrate the effectiveness of our approach in conditioned human motion generation and highlight how phase coherence is applied to beamforming and respiration monitoring. We further present two case studies in ML-based high-resolution imaging and human activity recognition, demonstrating that WaveVerse not only enables data generation for RF imaging for the first time, but also consistently achieves performance gain in both data-limited and data-adequate scenarios.
zh
[CV-123] RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis ICCV2025
【速读】:该论文旨在解决当前生成式AI在合成情感化说话头像时存在的关键问题:即在保持主体身份不变的前提下,难以实现高精度且可控的情感表达。现有方法虽在唇部同步和图像质量上表现优异,但在情绪准确性与可控性方面存在不足。解决方案的关键在于提出RealTalk框架,其核心创新包括:利用变分自编码器(Variational Autoencoder, VAE)从驱动音频中生成3D面部关键点,并通过基于ResNet的地标变形模型(Landmark Deformation Model, LDM)将情绪标签嵌入向量与关键点融合,从而生成具情感特征的地标;进一步地,这些地标与面部混合形状系数共同作为条件输入,驱动一种新型三平面注意力神经辐射场(Tri-plane Attention Neural Radiance Field, NeRF),以生成高保真、情绪准确且身份一致的说话头像。
链接: https://arxiv.org/abs/2508.12163
作者: Wenqing Wang,Yun Fu
机构: Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: Accepted to the ICCV 2025 Workshop on Artificial Social Intelligence
Abstract:Emotion is a critical component of artificial social intelligence. However, while current methods excel in lip synchronization and image quality, they often fail to generate accurate and controllable emotional expressions while preserving the subject’s identity. To address this challenge, we introduce RealTalk, a novel framework for synthesizing emotional talking heads with high emotion accuracy, enhanced emotion controllability, and robust identity preservation. RealTalk employs a variational autoencoder (VAE) to generate 3D facial landmarks from driving audio, which are concatenated with emotion-label embeddings using a ResNet-based landmark deformation model (LDM) to produce emotional landmarks. These landmarks and facial blendshape coefficients jointly condition a novel tri-plane attention Neural Radiance Field (NeRF) to synthesize highly realistic emotional talking heads. Extensive experiments demonstrate that RealTalk outperforms existing methods in emotion accuracy, controllability, and identity preservation, advancing the development of socially intelligent AI systems.
zh
[CV-124] Demystifying Foreground-Background Memorization in Diffusion Models
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)中存在的记忆现象检测不足问题,尤其是现有方法无法量化局部区域的记忆行为以及超出特定提示-图像对的记忆模式。其解决方案的关键在于提出一种基于分割的新型度量指标——前景背景记忆(Foreground Background Memorization, FB-Mem),能够分类并量化生成图像中被记忆的区域。该方法揭示了记忆现象比以往认知更为普遍:一方面,单个提示生成的图像可能关联到训练集中多个相似图像簇,表明存在复杂的非一一对应记忆模式;另一方面,当前主流的模型级缓解策略(如神经元去激活和剪枝)无法消除局部记忆,尤其在前景区域中仍显著存在。论文进一步提出基于聚类的强化缓解方法,有效提升了对记忆行为的控制能力。
链接: https://arxiv.org/abs/2508.12148
作者: Jimmy Z. Di,Yiwei Lu,Yaoliang Yu,Gautam Kamath,Adam Dziedzic,Franziska Boenisch
机构: Cheriton School of Computer Science, University of Waterloo and Vector Institute (韦仕敦大学计算机科学系和矢量研究所); University of Ottawa (渥太华大学); CISPA Helmholtz Center for Information Security (CISPA亥姆霍兹信息安全中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Diffusion models (DMs) memorize training images and can reproduce near-duplicates during generation. Current detection methods identify verbatim memorization but fail to capture two critical aspects: quantifying partial memorization occurring in small image regions, and memorization patterns beyond specific prompt-image pairs. To address these limitations, we propose Foreground Background Memorization (FB-Mem), a novel segmentation-based metric that classifies and quantifies memorized regions within generated images. Our method reveals that memorization is more pervasive than previously understood: (1) individual generations from single prompts may be linked to clusters of similar training images, revealing complex memorization patterns that extend beyond one-to-one correspondences; and (2) existing model-level mitigation methods, such as neuron deactivation and pruning, fail to eliminate local memorization, which persists particularly in foreground regions. Our work establishes an effective framework for measuring memorization in diffusion models, demonstrates the inadequacy of current mitigation approaches, and proposes a stronger mitigation method using a clustering approach.
zh
[CV-125] KP-INR: A Dual-Branch Implicit Neural Representation Model for Cardiac Cine MRI Reconstruction
【速读】:该论文旨在解决快速心脏磁共振成像(Cardiac Magnetic Resonance, CMR)中因缩短扫描时间而导致的图像质量下降问题,尤其是在采用欠采样数据进行重建时,传统方法难以恢复高质量的心脏电影MRI(cine MRI)图像。其解决方案的关键在于提出一种双分支隐式神经表示(Implicit Neural Representation, INR)方法——KP-INR,该方法在k空间域中运行:一个分支处理k空间坐标的基于位置的嵌入(positional embedding),另一个分支则学习目标点及其局部多尺度邻域的k空间特征表示;通过跨分支交互机制,从两个分支联合逼近目标k空间值,从而显著提升对具有挑战性的笛卡尔k空间数据的重建性能。
链接: https://arxiv.org/abs/2508.12147
作者: Donghang Lyu,Marius Staring,Mariya Doneva,Hildo J. Lamb,Nicola Pezzotti
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Cardiac Magnetic Resonance (CMR) imaging is a non-invasive method for assessing cardiac structure, function, and blood flow. Cine MRI extends this by capturing heart motion, providing detailed insights into cardiac mechanics. To reduce scan time and breath-hold discomfort, fast acquisition techniques have been utilized at the cost of lowering image quality. Recently, Implicit Neural Representation (INR) methods have shown promise in unsupervised reconstruction by learning coordinate-to-value mappings from undersampled data, enabling high-quality image recovery. However, current existing INR methods primarily focus on using coordinate-based positional embeddings to learn the mapping, while overlooking the feature representations of the target point and its neighboring context. In this work, we propose KP-INR, a dual-branch INR method operating in k-space for cardiac cine MRI reconstruction: one branch processes the positional embedding of k-space coordinates, while the other learns from local multi-scale k-space feature representations at those coordinates. By enabling cross-branch interaction and approximating the target k-space values from both branches, KP-INR can achieve strong performance on challenging Cartesian k-space data. Experiments on the CMRxRecon2024 dataset confirms its improved performance over baseline models and highlights its potential in this field.
zh
[CV-126] Infusing fine-grained visual knowledge to Vision-Language Models ICCV
【速读】:该论文旨在解决大规模预训练视觉-语言模型(Vision-and-Language Models, VLMs)在细粒度开放集视觉检索任务中表现不佳的问题,尤其针对传统微调方法易导致灾难性遗忘(catastrophic forgetting)、从而损害模型通用多模态能力的缺陷。解决方案的关键在于提出一种专为平衡细粒度领域适应与保留预训练VLM广泛多模态知识而设计的微调策略:通过借鉴持续学习(continual learning)领域的标准正则化技术并系统分析其效果,提出一种高效且有效的组合策略以增强知识保留;同时,重视验证集设计和超参数调优等常被忽视但至关重要的环节,确保方法在不同数据集和预训练模型上的可复现性与鲁棒泛化能力。实验表明,该方法在图像-图像与图像-文本检索任务中均取得显著性能提升,且无需使用文本数据或原始文本编码器即可保持良好的视觉-文本对齐能力。
链接: https://arxiv.org/abs/2508.12137
作者: Nikolaos-Antonios Ypsilantis,Kaifeng Chen,André Araujo,Ondřej Chum
机构: VRG, FEE, Czech Technical University in Prague (捷克理工大学); Google DeepMind (谷歌深度智脑)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: ICCVW 2025 accepted paper. Workshop name: “What is Next in Multimodal Foundation Models?”
Abstract:Large-scale contrastive pre-training produces powerful Vision-and-Language Models (VLMs) capable of generating representations (embeddings) effective for a wide variety of visual and multimodal tasks. However, these pretrained embeddings remain suboptimal for fine-grained open-set visual retrieval, where state-of-the-art results require fine-tuning the vision encoder using annotated domain-specific samples. Naively performing such fine-tuning typically leads to catastrophic forgetting, severely diminishing the model’s general-purpose visual and cross-modal capabilities. In this work, we propose a fine-tuning method explicitly designed to achieve optimal balance between fine-grained domain adaptation and retention of the pretrained VLM’s broad multimodal knowledge. Drawing inspiration from continual learning literature, we systematically analyze standard regularization techniques aimed at knowledge retention and propose an efficient and effective combination strategy. Additionally, we address the commonly overlooked yet critical aspects of validation set design and hyperparameter tuning to ensure reproducibility and robust generalization across datasets and pretrained models. We extensively evaluate our method on both fine-grained and coarse-grained image-image and image-text retrieval benchmarks. Our approach consistently achieves strong results, notably retaining the visual-text alignment without utilizing any text data or the original text encoder during fine-tuning. Code and model checkpoints: this https URL . Comments: ICCVW 2025 accepted paper. Workshop name: “What is Next in Multimodal Foundation Models?” Subjects: Computer Vision and Pattern Recognition (cs.CV) Cite as: arXiv:2508.12137 [cs.CV] (or arXiv:2508.12137v1 [cs.CV] for this version) https://doi.org/10.48550/arXiv.2508.12137 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[CV-127] riQDef: Disrupting Semantic and Gradient Alignment to Prevent Adversarial Patch Transferability in Quantized Neural Networks
【速读】:该论文旨在解决量化神经网络(Quantized Neural Networks, QNNs)在面对基于补丁的对抗攻击时,存在跨比特宽度(bit-width)迁移性问题,即攻击在不同量化精度下仍具有效力,导致现有防御方法因过度拟合固定量化设置而失效。解决方案的关键在于提出TriQDef框架,其核心机制包括:(1) 特征错位惩罚(Feature Disalignment Penalty, FDP),通过惩罚中间表示中的感知相似性来强制语义不一致;(2) 梯度感知不和谐惩罚(Gradient Perceptual Dissonance Penalty, GPDP),利用边缘交并比(Edge IoU)和方向梯度直方图余弦相似度(HOG Cosine)最小化输入梯度在不同比特宽度下的结构与方向一致性;以及(3) 联合量化感知训练协议,将上述惩罚统一于共享权重的多比特训练范式中。该设计有效破坏了补丁攻击在不同量化模型间的梯度对齐特性,显著降低未见组合下的攻击成功率(ASR下降超40%),同时保持高干净准确率。
链接: https://arxiv.org/abs/2508.12132
作者: Amira Guesmi,Bassem Ouni,Muhammad Shafique
机构: NYU Abu Dhabi (纽约大学阿布扎比校区); Technology Innovation Institute (技术创新研究所); UAE (阿拉伯联合酋长国)
类目: Computer Vision and Pattern Recognition (cs.CV); Cryptography and Security (cs.CR)
备注:
Abstract:Quantized Neural Networks (QNNs) are increasingly deployed in edge and resource-constrained environments due to their efficiency in computation and memory usage. While shown to distort the gradient landscape and weaken conventional pixel-level attacks, it provides limited robustness against patch-based adversarial attacks-localized, high-saliency perturbations that remain surprisingly transferable across bit-widths. Existing defenses either overfit to fixed quantization settings or fail to address this cross-bit generalization vulnerability. We introduce \textbfTriQDef, a tri-level quantization-aware defense framework designed to disrupt the transferability of patch-based adversarial attacks across QNNs. TriQDef consists of: (1) a Feature Disalignment Penalty (FDP) that enforces semantic inconsistency by penalizing perceptual similarity in intermediate representations; (2) a Gradient Perceptual Dissonance Penalty (GPDP) that explicitly misaligns input gradients across bit-widths by minimizing structural and directional agreement via Edge IoU and HOG Cosine metrics; and (3) a Joint Quantization-Aware Training Protocol that unifies these penalties within a shared-weight training scheme across multiple quantization levels. Extensive experiments on CIFAR-10 and ImageNet demonstrate that TriQDef reduces Attack Success Rates (ASR) by over 40% on unseen patch and quantization combinations, while preserving high clean accuracy. Our findings underscore the importance of disrupting both semantic and perceptual gradient alignment to mitigate patch transferability in QNNs.
zh
[CV-128] DualFit: A Two-Stage Virtual Try-On via Warping and Synthesis ICCV2025
【速读】:该论文旨在解决虚拟试衣(Virtual Try-On, VTON)技术中生成结果难以保留服装细粒度细节(如品牌标志和印刷文字)的问题,这些问题对品牌完整性与用户信任至关重要。现有基于扩散模型的无变形方法虽提升了感知质量,但常牺牲关键纹理信息。解决方案的关键在于提出双阶段混合架构DualFit:第一阶段通过学习的光流场将目标服装精确配准到人体图像上,确保高保真度;第二阶段引入保留区域输入与修复掩码(inpainting mask),指导模型仅在服装接缝等必要区域进行重建,从而实现服装细节的精准保留与视觉无缝融合,在重建准确性与感知真实性之间取得良好平衡。
链接: https://arxiv.org/abs/2508.12131
作者: Minh Tran,Johnmark Clements,Annie Prasanna,Tri Nguyen,Ngan Le
机构: University of Arkansas (阿肯色大学); Coupang, Inc. (Coupang公司)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Retail Vision, ICCV 2025
Abstract:Virtual Try-On technology has garnered significant attention for its potential to transform the online fashion retail experience by allowing users to visualize how garments would look on them without physical trials. While recent advances in diffusion-based warping-free methods have improved perceptual quality, they often fail to preserve fine-grained garment details such as logos and printed text elements that are critical for brand integrity and customer trust. In this work, we propose DualFit, a hybrid VTON pipeline that addresses this limitation by two-stage approach. In the first stage, DualFit warps the target garment to align with the person image using a learned flow field, ensuring high-fidelity preservation. In the second stage, a fidelity-preserving try-on module synthesizes the final output by blending the warped garment with preserved human regions. Particularly, to guide this process, we introduce a preserved-region input and an inpainting mask, enabling the model to retain key areas and regenerate only where necessary, particularly around garment seams. Extensive qualitative results show that DualFit achieves visually seamless try-on results while faithfully maintaining high-frequency garment details, striking an effective balance between reconstruction accuracy and perceptual realism.
zh
[CV-129] Simple o3: Towards Interleaved Vision-Language Reasoning
【速读】:该论文旨在解决多模态大语言模型(Multimodal Large Language Models, MLLMs)在复杂视觉-语言任务中长期链式推理(Long Chain-of-Thought, CoT)能力不足的问题。现有方法在处理需要多步视觉理解与语言推理交织的任务时表现有限,难以模拟人类“看图思考”的迭代过程。解决方案的关键在于提出Simple o3框架,其核心创新是通过监督微调(Supervised Fine-Tuning, SFT)将动态工具交互(如裁剪、缩放和重用图像)无缝集成到视觉-语言交错推理流程中,并构建了一个基于“观察-推理-行动”循环的可扩展数据合成管道,生成高质量的多模态推理链。该方法显著提升了模型对细粒度视觉信息的感知能力和聚焦关键区域的能力,实验证明其在多个基准测试上优于现有方法,为高效且强大的多模态推理提供了新范式。
链接: https://arxiv.org/abs/2508.12109
作者: Ye Wang,Qianglong Chen,Zejun Li,Siyuan Wang,Shijie Guo,Zhirui Zhang,Zhongyu Wei
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Multimodal Large Language Models (MLLMs) have shown impressive performance on vision-language tasks, but their long Chain-of-Thought (CoT) capabilities in multimodal scenarios remain underexplored. Inspired by OpenAI’s o3 model, which emulates human-like ‘‘thinking with image’’ through iterative visual transformations and linguistic reasoning, we propose Simple o3, an end-to-end framework that integrates dynamic tool interactions (e.g., cropping, zooming, and reusing) into interleaved vision-language reasoning via supervised fine-tuning (SFT). Our approach features a scalable data synthesis pipeline that generates high-quality interleaved vision-language reasoning chains via an ‘‘observe-reason-act’’ cycle, complete with executable visual operations and rigorous verification, yielding the open-source TWI-Tools-146K dataset. Experimental results demonstrate Simple o3’s superior performance on diverse benchmarks, outperforming existing approaches. By combining enhanced reasoning capabilities, Simple o3 establishes a powerful yet computationally affordable paradigm for advancing multimodal reasoning. Remarkably, we provide the first in-depth analysis of different interleaved reasoning strategies, offering insights into their impact on model performance. We found that by introducing additional visual tokens for interleaved vision-language reasoning, reusing and magnifying the original image significantly improves the model’s visual reasoning and fine-grained perception, while image cropping based on precise visual grounding allows the model to effectively focus on key entities or regions, further enhancing its capabilities.
zh
[CV-130] VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
【速读】:该论文旨在解决医学领域中基于体积影像(如CT扫描)与文本(如放射学报告)的视觉-语言预训练(VLP)因标注数据稀缺而导致性能受限的问题。现有方法通常依赖大规模配对数据,但在医疗场景下获取此类数据成本高、耗时长。为应对这一挑战,作者提出名为VELVET-Med的新颖VLP框架,其关键在于:1)引入单模态自监督学习机制以提升模型在有限数据下的表征能力;2)设计新型语言编码器TriBERT,用于捕获多层次文本语义信息;3)提出分层对比学习策略,以建模跨模态间的多粒度对应关系。该方案仅使用38,875对扫描-报告数据即实现了强大的泛化能力和下游任务迁移性能,在3D分割、跨模态检索、视觉问答及报告生成等多个任务上达到当前最优水平。
链接: https://arxiv.org/abs/2508.12108
作者: Ziyang Zhang,Yang Yu,Xulei Yang,Si Yong Yeo
机构: Northwestern University (西北大学); A*STAR (新加坡科技研究局); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Vision-and-language models (VLMs) have been increasingly explored in the medical domain, particularly following the success of CLIP in general domain. However, unlike the relatively straightforward pairing of 2D images and text, curating large-scale paired data in the medical field for volumetric modalities such as CT scans remains a challenging and time-intensive process. This difficulty often limits the performance on downstream tasks. To address these challenges, we propose a novel vision-language pre-training (VLP) framework, termed as \textbfVELVET-Med, specifically designed for limited volumetric data such as 3D CT and associated radiology reports. Instead of relying on large-scale data collection, our method focuses on the development of effective pre-training objectives and model architectures. The key contributions are: 1) We incorporate uni-modal self-supervised learning into VLP framework, which are often underexplored in the existing literature. 2) We propose a novel language encoder, termed as \textbfTriBERT, for learning multi-level textual semantics. 3) We devise the hierarchical contrastive learning to capture multi-level vision-language correspondence. Using only 38,875 scan-report pairs, our approach seeks to uncover rich spatial and semantic relationships embedded in volumetric medical images and corresponding clinical narratives, thereby enhancing the generalization ability of the learned encoders. The resulting encoders exhibit strong transferability, achieving state-of-the-art performance across a wide range of downstream tasks, including 3D segmentation, cross-modal retrieval, visual question answering, and report generation.
zh
[CV-131] Error Propagation Mechanisms and Compensation Strategies for Quantized Diffusion
【速读】:该论文旨在解决扩散模型(Diffusion Models, DMs)在后训练量化(Post-Training Quantization, PTQ)过程中因迭代去噪机制导致的逐步量化误差累积问题,从而影响生成图像的保真度。解决方案的关键在于构建了一个理论框架,首次数学推导出扩散模型中每步量化误差的传播方程,并由此获得累积误差的闭式解;在此基础上提出一种时间步感知的累积误差补偿策略,有效抑制误差传播,显著提升低精度扩散模型的性能,达到当前最优水平(State-of-the-Art, SOTA)。
链接: https://arxiv.org/abs/2508.12094
作者: Songwei Liu,Hong Liu,Fangmin Chen,Xurui Peng,Chenqian Yan,Lean Fu,Xing Mei
机构: ByteDance Inc (字节跳动)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Diffusion models have transformed image synthesis by establishing unprecedented quality and creativity benchmarks. Nevertheless, their large-scale deployment faces challenges due to computationally intensive iterative denoising processes. Although post-training quantization(PTQ) provides an effective pathway for accelerating sampling, the iterative nature of diffusion models causes stepwise quantization errors to accumulate progressively during generation, inevitably compromising output fidelity. To address this challenge, we develop a theoretical framework that mathematically formulates error propagation in Diffusion Models (DMs), deriving per-step quantization error propagation equations and establishing the first closed-form solution for cumulative error. Building on this theoretical foundation, we propose a timestep-aware cumulative error compensation scheme. Extensive experiments across multiple image datasets demonstrate that our compensation strategy effectively mitigates error propagation, significantly enhancing existing PTQ methods to achieve state-of-the-art(SOTA) performance on low-precision diffusion models.
zh
[CV-132] Enhancing 3D point accuracy of laser scanner through multi-stage convolutional neural network for applications in construction
【速读】:该论文旨在解决激光扫描仪(Laser Scanner, LS)在粗糙室内环境中三维点云精度不确定性较高的问题,尤其针对高精度扫描仪(High-Accuracy Scanner, HAS)与低精度扫描仪(Low-Accuracy Scanner, LAS)之间因设备限制和环境因素导致的位置误差差异。解决方案的关键在于构建一种多阶段卷积神经网络(Multi-Stage Convolutional Neural Network, MSCNN)集成方法,通过将HAS作为参考与LAS在同一环境下配对测量,量化特定误差模式,并建立测量偏差与其空间分布之间的统计关系;进而将系统误差的识别转化为监督学习问题,结合传统几何处理与神经网络的针对性优化,实现对几何特征保持不变的前提下精确校正,从而显著提升低端设备的测量精度,使其接近高端设备水平而无需硬件升级。
链接: https://arxiv.org/abs/2508.12089
作者: Qinyuan Fan,Clemens Gühmann
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:We propose a multi-stage convolutional neural network (MSCNN) based integrated method for reducing uncertainty of 3D point accuracy of lasar scanner (LS) in rough indoor rooms, providing more accurate spatial measurements for high-precision geometric model creation and renovation. Due to different equipment limitations and environmental factors, high-end and low-end LS have positional errors. Our approach pairs high-accuracy scanners (HAS) as references with corresponding low-accuracy scanners (LAS) of measurements in identical environments to quantify specific error patterns. By establishing a statistical relationship between measurement discrepancies and their spatial distribution, we develop a correction framework that combines traditional geometric processing with targeted neural network refinement. This method transforms the quantification of systematic errors into a supervised learning problem, allowing precise correction while preserving critical geometric features. Experimental results in our rough indoor rooms dataset show significant improvements in measurement accuracy, with mean square error (MSE) reductions exceeding 70% and peak signal-to-noise ratio (PSNR) improvements of approximately 6 decibels. This approach enables low-end devices to achieve measurement uncertainty levels approaching those of high-end devices without hardware modifications.
zh
[CV-133] Generic Event Boundary Detection via Denoising Diffusion ICCV2025
【速读】:该论文旨在解决通用事件边界检测(Generic Event Boundary Detection, GEBD)中因事件边界具有主观性而导致的确定性预测方法忽视合理解多样性的问题。传统方法通常输出单一边界划分,未能充分捕捉人类对事件分界可能存在的多种合理判断。其解决方案的关键在于提出一种基于扩散模型的生成式框架 DiffGEBD,该模型通过时序自相似性编码相邻帧间的显著变化,并在去噪过程中将随机噪声逐步重构为符合语义的事件边界;同时引入无分类器引导(classifier-free guidance)机制,实现对生成结果多样性的可控调节,从而在保持边界准确性(fidelity)的同时提升解的多样性(diversity)。
链接: https://arxiv.org/abs/2508.12084
作者: Jaejun Hwang,Dayoung Gong,Manjin Kim,Minsu Cho
机构: Pohang University of Science and Technology (POSTECH); GenGenAI
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: Accepted to ICCV 2025
Abstract:Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles the problem of GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries being conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled in denoising diffusion. In addition, we introduce a new evaluation metric to assess the quality of predictions considering both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.
zh
[CV-134] Automated Model Evaluation for Object Detection via Prediction Consistency and Reliablity ICCV2025
【速读】:该论文旨在解决目标检测模型在真实场景中性能评估依赖昂贵人工标注的问题。现有自动化评估方法(AutoEval)难以准确反映模型在复杂环境下的实际表现,尤其是在缺乏真实标签的情况下。解决方案的关键在于提出Prediction Consistency and Reliability (PCR)指标,该指标利用传统检测器在非极大值抑制(Non-Maximum Suppression, NMS)前生成的多个候选边界框,通过联合衡量两方面特征来估计检测性能:一是NMS前后边界框的空间一致性,二是保留框的置信度可靠性(基于重叠框的置信分数)。这一方法无需地面真值标签即可实现更准确、可扩展的模型评估。
链接: https://arxiv.org/abs/2508.12082
作者: Seungju Yoo,Hyuk Kwon,Joong-Won Hwang,Kibok Lee
机构: Yonsei University (延世大学); ETRI (电子通信研究院)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: ICCV 2025 Oral
Abstract:Recent advances in computer vision have made training object detectors more efficient and effective; however, assessing their performance in real-world applications still relies on costly manual annotation. To address this limitation, we develop an automated model evaluation (AutoEval) framework for object detection. We propose Prediction Consistency and Reliability (PCR), which leverages the multiple candidate bounding boxes that conventional detectors generate before non-maximum suppression (NMS). PCR estimates detection performance without ground-truth labels by jointly measuring 1) the spatial consistency between boxes before and after NMS, and 2) the reliability of the retained boxes via the confidence scores of overlapping boxes. For a more realistic and scalable evaluation, we construct a meta-dataset by applying image corruptions of varying severity. Experimental results demonstrate that PCR yields more accurate performance estimates than existing AutoEval methods, and the proposed meta-dataset covers a wider range of detection performance. The code is available at this https URL.
zh
[CV-135] Q-FSRU: Quantum-Augmented Frequency-Spectral Fusion for Medical Visual Question Answering AAAI26
【速读】:该论文旨在解决医疗领域中需要同时理解医学图像与文本信息的复杂临床问题,这是当前医疗人工智能(Healthcare AI)面临的一大挑战。其解决方案的关键在于提出Q-FSRU模型,该模型融合了频域表示与融合(Frequency Spectrum Representation and Fusion, FSRU)和量子检索增强生成(Quantum Retrieval-Augmented Generation, Quantum RAG)技术:首先利用快速傅里叶变换(Fast Fourier Transform, FFT)将图像和文本特征转换至频域,以增强对关键信息的提取并抑制噪声;其次引入受量子启发的检索机制,从外部知识源中获取相关医学事实,并通过基于量子相似度的方法提升知识召回的准确性;最终将频域特征与检索到的知识进行融合,从而实现更精准且可解释的医学视觉问答(Medical Visual Question Answering, VQA)。
链接: https://arxiv.org/abs/2508.12036
作者: Rakesh Thakur,Yusra Tariq
机构: Amity Centre for Artificial Intelligence, Amity University, Noida(诺伊达阿米蒂大学人工智能中心)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures Submitted to AAAI 26
Abstract:Solving tough clinical questions that require both image and text understanding is still a major challenge in healthcare AI. In this work, we propose Q-FSRU, a new model that combines Frequency Spectrum Representation and Fusion (FSRU) with a method called Quantum Retrieval-Augmented Generation (Quantum RAG) for medical Visual Question Answering (VQA). The model takes in features from medical images and related text, then shifts them into the frequency domain using Fast Fourier Transform (FFT). This helps it focus on more meaningful data and filter out noise or less useful information. To improve accuracy and ensure that answers are based on real knowledge, we add a quantum-inspired retrieval system. It fetches useful medical facts from external sources using quantum-based similarity techniques. These details are then merged with the frequency-based features for stronger reasoning. We evaluated our model using the VQA-RAD dataset, which includes real radiology images and questions. The results showed that Q-FSRU outperforms earlier models, especially on complex cases needing image-text reasoning. The mix of frequency and quantum information improves both performance and explainability. Overall, this approach offers a promising way to build smart, clear, and helpful AI tools for doctors.
zh
[CV-136] Bongard-RWR: Real-World Representations of Fine-Grained Concepts in Bongard Problems
【速读】:该论文旨在解决Bongard Problems (BPs) 数据集规模小且难以体现抽象概念细粒度差异的问题,从而限制了对视觉推理模型评估的鲁棒性。其核心挑战在于如何在保持原始BP抽象概念本质的同时,生成足够多样且真实感强的图像数据以支持更全面的模型评测。解决方案的关键在于构建一个大规模的新数据集——Bongard-RWR+,该数据集包含5,400个实例,通过结合人工标注、视觉语言模型(VLM)描述生成与图像合成流程实现:首先使用Pixtral-12B为人工精选图像生成语义描述,再利用Flux.1-dev从这些描述中合成符合原概念的细粒度真实世界图像,并通过人工验证确保生成图像忠实反映目标概念。这一方法有效扩展了Bongard-RWR的数据规模并保留了抽象概念的精细结构,使得对当前先进VLMs在多类分类和文本答案生成等不同BP任务下的表现评估更具说服力。
链接: https://arxiv.org/abs/2508.12026
作者: Szymon Pawlonka,Mikołaj Małkiński,Jacek Mańdziuk
机构: Warsaw University of Technology (华沙理工大学); AGH University of Krakow (克拉科夫AGH科技大学)
类目: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts fromjust a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, albeit the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just 60 instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of 5,400 instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
zh
[CV-137] WiseLVAM: A Novel Framework For Left Ventricle Automatic Measurements
【速读】:该论文旨在解决自动左心室(Left Ventricle, LV)线性测量中因B-mode超声图像中瓣膜尖端等解剖标志点预测偏差导致的显著测量误差问题,从而提升自动化方法的临床可靠性。其解决方案的关键在于提出一种名为WiseLVAM的全新全自动且可手动调整的框架:首先通过弱监督的B-mode标志点检测器估计LV轮廓,并据此推断出LV长轴和基底水平位置以实现扫描线(Scanline, SL)的准确放置;随后在生成的解剖运动模式(Anatomical Motion Mode, AMM)图像中进行LV线性测量,充分利用B-mode图像的结构感知能力和AMM模式的运动感知能力,从而显著增强算法的鲁棒性和准确性,为临床常规应用提供可行方案。
链接: https://arxiv.org/abs/2508.12023
作者: Durgesh Kumar Singh,Qing Cao,Sarina Thomas,Ahcène Boubekki,Robert Jenssen,Michael Kampffmeyer
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Clinical guidelines recommend performing left ventricular (LV) linear measurements in B-mode echocardiographic images at the basal level – typically at the mitral valve leaflet tips – and aligned perpendicular to the LV long axis along a virtual scanline (SL). However, most automated methods estimate landmarks directly from B-mode images for the measurement task, where even small shifts in predicted points along the LV walls can lead to significant measurement errors, reducing their clinical reliability. A recent semi-automatic method, EnLVAM, addresses this limitation by constraining landmark prediction to a clinician-defined SL and training on generated Anatomical Motion Mode (AMM) images to predict LV landmarks along the same. To enable full automation, a contour-aware SL placement approach is proposed in this work, in which the LV contour is estimated using a weakly supervised B-mode landmark detector. SL placement is then performed by inferring the LV long axis and the basal level-mimicking clinical guidelines. Building on this foundation, we introduce \textitWiseLVAM – a novel, fully automated yet manually adaptable framework for automatically placing the SL and then automatically performing the LV linear measurements in the AMM mode. \textitWiseLVAM utilizes the structure-awareness from B-mode images and the motion-awareness from AMM mode to enhance robustness and accuracy with the potential to provide a practical solution for the routine clinical application.
zh
[CV-138] InstDrive: Instance-Aware 3D Gaussian Splatting for Driving Scenes
【速读】:该论文旨在解决动态驾驶场景中3D重建缺乏实例级理解与灵活编辑能力的问题。现有方法通常将背景元素统一表示,难以实现对不同物体的区分和交互式操作;同时,部分尝试将2D分割提升至3D空间的方法依赖预处理的实例ID或复杂管道,且主要适用于室内多视角场景,不适用于室外驾驶场景。解决方案的关键在于提出InstDrive框架,通过使用SAM(Segment Anything Model)生成的掩码作为伪标签,结合对比损失和伪监督目标引导2D特征学习;在3D层面引入体素化正则化损失以隐式编码实例身份并保持一致性,并设计轻量级静态码本(static codebook)在无需数据预处理或复杂优化的情况下桥接连续特征与离散身份,从而实现开放世界动态驾驶场景下的3D实例分割与可交互重建。
链接: https://arxiv.org/abs/2508.12015
作者: Hongyuan Liu,Haochen Yu,Jianfei Jiang,Qiankun Liu,Jiansheng Chen,Huimin Ma
机构: University of Science and Technology Beijing (北京科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Reconstructing dynamic driving scenes from dashcam videos has attracted increasing attention due to its significance in autonomous driving and scene understanding. While recent advances have made impressive progress, most methods still unify all background elements into a single representation, hindering both instance-level understanding and flexible scene editing. Some approaches attempt to lift 2D segmentation into 3D space, but often rely on pre-processed instance IDs or complex pipelines to map continuous features to discrete identities. Moreover, these methods are typically designed for indoor scenes with rich viewpoints, making them less applicable to outdoor driving scenarios. In this paper, we present InstDrive, an instance-aware 3D Gaussian Splatting framework tailored for the interactive reconstruction of dynamic driving scene. We use masks generated by SAM as pseudo ground-truth to guide 2D feature learning via contrastive loss and pseudo-supervised objectives. At the 3D level, we introduce regularization to implicitly encode instance identities and enforce consistency through a voxel-based loss. A lightweight static codebook further bridges continuous features and discrete identities without requiring data pre-processing or complex optimization. Quantitative and qualitative experiments demonstrate the effectiveness of InstDrive, and to the best of our knowledge, it is the first framework to achieve 3D instance segmentation in dynamic, open-world driving this http URL visualizations are available at our project page.
zh
[CV-139] MOON: Generative MLLM -based Multimodal Representation Learning for E-commerce Product Understanding
【速读】:该论文旨在解决商品表征学习中因传统判别式双流架构难以建模多图与多文之间“多对一”对齐关系而导致的表示能力不足问题,同时应对生成式多模态大语言模型(Generative Multimodal Large Language Models, MLLMs)在商品理解任务中面临的三大挑战:典型大语言模型缺乏多模态和细粒度属性感知模块、商品图像普遍存在背景噪声干扰、以及缺乏标准化评估基准。解决方案的关键在于提出首个基于生成式MLLM的商品表征模型MOON,其核心创新包括:(1) 引入引导式专家混合(guided Mixture-of-Experts, MoE)模块以实现多模态与细粒度属性特定内容的定向建模;(2) 设计图像语义区域检测机制以识别商品核心语义区域,有效抑制背景噪声干扰;(3) 提出专用负采样策略提升负样本难度与多样性,增强模型区分能力。此外,作者还构建了大规模多模态基准MBE用于系统评估,实验证明MOON在零样本场景下于自建基准及公开数据集上均展现出优异的泛化性能,适用于跨模态检索、商品分类与属性预测等下游任务。
链接: https://arxiv.org/abs/2508.11999
作者: Daoze Zhang,Zhanheng Nie,Jianyu Liu,Chenghan Fu,Wanxian Guan,Yuan Gao,Jun Song,Pengjie Wang,Jian Xu,Bo Zheng
机构: Alibaba Group(阿里巴巴集团)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注: 11 pages, 9 figures
Abstract:With the rapid advancement of e-commerce, exploring general representations rather than task-specific ones has attracted increasing research attention. For product understanding, although existing discriminative dual-flow architectures drive progress in this field, they inherently struggle to model the many-to-one alignment between multiple images and texts of products. Therefore, we argue that generative Multimodal Large Language Models (MLLMs) hold significant potential for improving product representation learning. Nevertheless, achieving this goal still remains non-trivial due to several key challenges: the lack of multimodal and aspect-aware modeling modules in typical LLMs; the common presence of background noise in product images; and the absence of a standard benchmark for evaluation. To address these issues, we propose the first generative MLLM-based model named MOON for product representation learning. Our method (1) employs a guided Mixture-of-Experts (MoE) module for targeted modeling of multimodal and aspect-specific product content; (2) effectively detects core semantic regions in product images to mitigate the distraction and interference caused by background noise; and (3) introduces the specialized negative sampling strategy to increase the difficulty and diversity of negative samples. In addition, we release a large-scale multimodal benchmark MBE for various product understanding tasks. Experimentally, our model demonstrates competitive zero-shot performance on both our benchmark and the public dataset, showcasing strong generalization across various downstream tasks, including cross-modal retrieval, product classification, and attribute prediction. Furthermore, the case study and visualization illustrate the effectiveness of MOON for product understanding.
zh
[CV-140] Exploring Spatial-Temporal Dynamics in Event-based Facial Micro-Expression Analysis
【速读】:该论文旨在解决基于RGB相机在捕捉微表情(micro-expression)时因时间分辨率有限和对运动模糊敏感而导致的准确性不足问题。其解决方案的关键在于引入事件相机(event camera)这一新型传感器,利用其微秒级精度、高动态范围和低延迟特性,结合同步采集的RGB与事件数据构建多模态、多分辨率微表情数据集,并通过脉冲神经网络(Spiking Neural Networks, SNN)进行动作单元(Action Unit)分类以及条件变分自编码器(Conditional Variational Autoencoder)实现帧重建,实验表明事件数据在识别任务中显著优于RGB数据(准确率51.23% vs. 23.12%),且帧重建质量优异(SSIM = 0.8513,PSNR = 26.89 dB)。
链接: https://arxiv.org/abs/2508.11988
作者: Nicolas Mastropasqua,Ignacio Bugueno-Cordova,Rodrigo Verschae,Daniel Acevedo,Pablo Negri,Maria E. Buemi
机构: Universidad de Buenos Aires (布宜诺斯艾利斯大学); Institute of Engineering Sciences (工程科学研究所); CONICET-UBA, Instituto de Ciencias de la Computacion (计算机科学研究所); L3S Research Center (L3S 研究中心)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Micro-expression analysis has applications in domains such as Human-Robot Interaction and Driver Monitoring Systems. Accurately capturing subtle and fast facial movements remains difficult when relying solely on RGB cameras, due to limitations in temporal resolution and sensitivity to motion blur. Event cameras offer an alternative, with microsecond-level precision, high dynamic range, and low latency. However, public datasets featuring event-based recordings of Action Units are still scarce. In this work, we introduce a novel, preliminary multi-resolution and multi-modal micro-expression dataset recorded with synchronized RGB and event cameras under variable lighting conditions. Two baseline tasks are evaluated to explore the spatial-temporal dynamics of micro-expressions: Action Unit classification using Spiking Neural Networks (51.23% accuracy with events vs. 23.12% with RGB), and frame reconstruction using Conditional Variational Autoencoders, achieving SSIM = 0.8513 and PSNR = 26.89 dB with high-resolution event input. These promising results show that event-based data can be used for micro-expression recognition and frame reconstruction.
zh
[CV-141] PEdger: Practical Edge Detection via Assembling Cross Information
【速读】:该论文旨在解决边缘检测模型在保持高精度的同时降低计算复杂度的问题,以适应不同计算能力设备的广泛部署需求。当前基于深度学习的边缘检测方法虽提升了准确性,但通常依赖大规模复杂模型,导致计算成本过高,难以在资源受限设备上应用。其解决方案的关键在于提出一种协同学习框架PEdger++,通过融合异构架构、多样化训练时机和多参数采样所获得的跨信息,从集成学习角度增强模型的学习能力,从而在显著减少模型规模与计算开销的前提下提升边缘检测性能。
链接: https://arxiv.org/abs/2508.11961
作者: Yuanbin Fu,Liang Li,Xiaojie Guo
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Edge detection serves as a critical foundation for numerous computer vision applications, including object detection, semantic segmentation, and image editing, by extracting essential structural cues that define object boundaries and salient edges. To be viable for broad deployment across devices with varying computational capacities, edge detectors shall balance high accuracy with low computational complexity. While deep learning has evidently improved accuracy, they often suffer from high computational costs, limiting their applicability on resource-constrained devices. This paper addresses the challenge of achieving that balance: \textiti.e., how to efficiently capture discriminative features without relying on large-size and sophisticated models. We propose PEdger++, a collaborative learning framework designed to reduce computational costs and model sizes while improving edge detection accuracy. The core principle of our PEdger++ is that cross-information derived from heterogeneous architectures, diverse training moments, and multiple parameter samplings, is beneficial to enhance learning from an ensemble perspective. Extensive experimental results on the BSDS500, NYUD and Multicue datasets demonstrate the effectiveness of our approach, both quantitatively and qualitatively, showing clear improvements over existing methods. We also provide multiple versions of the model with varying computational requirements, highlighting PEdger++'s adaptability with respect to different resource constraints. Codes are accessible at this https URL.
zh
[CV-142] SAMDWICH: Moment-aware Video-text Alignment for Referring Video Object Segmentation
【速读】:该论文旨在解决参考视频目标分割(Referring Video Object Segmentation, RVOS)中因语义错位导致的性能瓶颈问题,其根源在于现有方法在训练时对所有可见物体进行无差别监督,且帧采样策略缺乏时序敏感性,从而削弱了视频与文本表达之间的精准对齐。解决方案的关键在于提出一种基于时刻感知的框架SAMDWICH,其核心创新包括:1)构建新标注数据集MeViS-M,通过人工标注每个物体被提及的时间片段(temporal moments),实现语义锚定的监督;2)设计Moment-guided Dual-path Propagation (MDP)机制,利用以时刻为中心的记忆机制,在相关与无关帧上联合训练,提升目标定位与跟踪能力;3)引入Object-level Selective Supervision (OSS),仅对与表达时序对齐的目标施加监督,有效减少语义噪声并强化语言条件下的学习。实验表明,该方法在具有挑战性的MeViS基准上达到当前最优性能,尤其在复杂多样的表达场景下表现突出。
链接: https://arxiv.org/abs/2508.11955
作者: Seunghun Lee,Jiwan Seo,Jeonghoon Kim,Siwon Kim,Haeun Yun,Hyogyeong Jeon,Wonhyeok Choi,Jaehoon Jeong,Zane Durante,Sang Hyun Park,Sunghoon Im
机构: 1. Korea University(韩国科学技术院); 2. University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Project page: this https URL
Abstract:Referring Video Object Segmentation (RVOS) aims to segment and track objects in videos based on natural language expressions, requiring precise alignment between visual content and textual queries. However, existing methods often suffer from semantic misalignment, largely due to indiscriminate frame sampling and supervision of all visible objects during training – regardless of their actual relevance to the expression. To address this, we introduce a moment-aware RVOS framework named SAMDWICH, along with a newly annotated dataset, MeViS-M, built upon the challenging MeViS benchmark. We manually annotate temporal moments indicating when each object is referred to by the expression, enabling semantically grounded supervision that strengthens video-text alignment. SAMDWICH leverages these aligned text-to-clip pairs to guide training, significantly enhancing referential understanding. Building upon this framework, we propose Moment-guided Dual-path Propagation (MDP), a moment-aware propagation strategy that improves both object grounding and tracking by training on both relevant and irrelevant frames through a moment-centric memory mechanism. In addition, we introduce Object-level Selective Supervision (OSS), an object-level filtering strategy that supervises only the objects temporally aligned with the expression in each training clip. This selective supervision reduces semantic noise and reinforces language-conditioned learning. Extensive experiments show that SAMDWICH achieves state-of-the-art performance on challenging MeViS benchmark, particularly excelling in complex scenarios involving diverse expressions.
zh
[CV-143] UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
【速读】:该论文旨在解决当前统一架构在3D任务集成方面面临的挑战,即如何实现对3D模态的统一理解与生成。现有方法虽在图像理解和生成上取得显著进展,但对3D内容的处理仍处于探索阶段。解决方案的关键在于提出UniUGG——首个面向3D模态的统一理解与生成框架,其核心创新包括:(1)引入一个基于潜在扩散模型(latent diffusion model)的空间解码器(spatial decoder),用于生成高质量3D表示;(2)设计几何-语义联合学习策略(geometric-semantic learning strategy)预训练视觉编码器,以同时捕捉输入的语义和几何线索,从而增强空间理解与生成能力。这一架构支持从参考图像出发进行3D场景生成及视角变换,并保持对空间视觉问答(VQA)任务的支持。
链接: https://arxiv.org/abs/2508.11952
作者: Yueming Xu,Jiahui Zhang,Ze Huang,Yurui Chen,Yanpeng Zhou,Zhenyu Chen,Yu-Jie Yuan,Pengxiang Xia,Guowei Huang,Xinyue Cai,Zhongang Qi,Xingyue Quan,Jianye Hao,Hang Xu,Li Zhang
机构: 1: 未知; 2: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Despite the impressive progress on understanding and generating images shown by the recent unified architectures, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our unified framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder leveraging a latent diffusion model to generate high-quality 3D representations. This allows for the generation and imagination of 3D scenes based on a reference image and an arbitrary view transformation, while remaining supports for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input’s semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation. The source code will be released upon paper acceptance.
zh
[CV-144] ransferable Class Statistics and Multi-scale Feature Approximation for 3D Object Detection
【速读】:该论文旨在解决点云中目标检测任务里多尺度特征学习带来的计算开销过大问题,尤其是在资源受限场景下难以实现轻量化模型的挑战。其关键解决方案在于:通过知识蒸馏(knowledge distillation)从单个邻域近似获取多尺度特征,从而避免传统方法中多次邻域搜索和尺度感知层带来的高计算成本;同时设计了一种可迁移特征嵌入机制(transferable feature embedding mechanism),利用类别感知统计量(class-aware statistics)作为低开销的转移特征以弥补单一邻域导致的结构多样性损失;此外,引入中心加权交并比(central weighted intersection over union)进行定位优化,缓解因中心偏移(center offset)引起的定位偏差问题。
链接: https://arxiv.org/abs/2508.11951
作者: Hao Peng,Hong Sang,Yajing Ma,Ping Qiu,Chao Ji
机构: Nanjing University of Posts and Telecommunications (南京邮电大学); Dalian Maritime University (大连海事大学); Northeastern University (东北大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:This paper investigates multi-scale feature approximation and transferable features for object detection from point clouds. Multi-scale features are critical for object detection from point clouds. However, multi-scale feature learning usually involves multiple neighborhood searches and scale-aware layers, which can hinder efforts to achieve lightweight models and may not be conducive to research constrained by limited computational resources. This paper approximates point-based multi-scale features from a single neighborhood based on knowledge distillation. To compensate for the loss of constructive diversity in a single neighborhood, this paper designs a transferable feature embedding mechanism. Specifically, class-aware statistics are employed as transferable features given the small computational cost. In addition, this paper introduces the central weighted intersection over union for localization to alleviate the misalignment brought by the center offset in optimization. Note that the method presented in this paper saves computational costs. Extensive experiments on public datasets demonstrate the effectiveness of the proposed method.
zh
[CV-145] DynamicPose: Real-time and Robust 6D Object Pose Tracking for Fast-Moving Cameras and Objects
【速读】:该论文旨在解决现有6D位姿跟踪方法在快速移动相机和物体场景下鲁棒性显著下降的问题,尤其针对传统方法主要适用于静态或准静态场景的局限性。解决方案的关键在于提出一个无需重新训练的动态位姿跟踪框架DynamicPose,其核心由三个协同工作机制构成:(1)视觉惯性里程计(Visual-Inertial Odometry, VIO)补偿因相机运动导致的兴趣区域(Region of Interest, ROI)偏移;(2)基于深度信息的2D跟踪器校正因大范围物体平移引起的ROI偏差;(3)VIO引导的卡尔曼滤波预测物体旋转、生成多候选位姿,并通过分层精化获得最终位姿。该系统形成闭环反馈机制,利用6D位姿结果指导后续2D跟踪与卡尔曼滤波更新,从而实现快速移动场景下的高精度位姿初始化与稳定跟踪。
链接: https://arxiv.org/abs/2508.11950
作者: Tingbang Liang,Yixin Zeng,Jiatong Xie,Boyu Zhou
机构: School of Artificial Intelligence, Sun Yat-Sen University, Zhuhai, China; School of Mechanical Engineering, Xi’an Jiaotong University, Xi’an, China; Department of Mechanical and Energy Engineering, Southern University of Science and Technology, Shenzhen, China; Differential Robotics
类目: Computer Vision and Pattern Recognition (cs.CV); Robotics (cs.RO)
备注:
Abstract:We present DynamicPose, a retraining-free 6D pose tracking framework that improves tracking robustness in fast-moving camera and object scenarios. Previous work is mainly applicable to static or quasi-static scenes, and its performance significantly deteriorates when both the object and the camera move rapidly. To overcome these challenges, we propose three synergistic components: (1) A visual-inertial odometry compensates for the shift in the Region of Interest (ROI) caused by camera motion; (2) A depth-informed 2D tracker corrects ROI deviations caused by large object translation; (3) A VIO-guided Kalman filter predicts object rotation, generates multiple candidate poses, and then obtains the final pose by hierarchical refinement. The 6D pose tracking results guide subsequent 2D tracking and Kalman filter updates, forming a closed-loop system that ensures accurate pose initialization and precise pose tracking. Simulation and real-world experiments demonstrate the effectiveness of our method, achieving real-time and robust 6D pose tracking for fast-moving cameras and objects.
zh
[CV-146] Deep Learning For Point Cloud Denoising: A Survey
【速读】:该论文旨在解决深度学习(Deep Learning, DL)驱动的点云去噪(Point Cloud Denoising, PCD)领域缺乏系统性综述的问题,以厘清当前研究的关键挑战、方法贡献及技术演进路径。其解决方案的核心在于提出一个面向去噪任务的专用分类体系,将PCD建模为两个步骤:异常值去除(outlier removal)与表面噪声恢复(surface noise restoration),从而涵盖绝大多数实际应用场景和需求,并在此基础上对现有方法进行结构化比较,揭示其在相似性、差异性和优势方面的本质特征,为后续研究提供清晰的技术路线指引。
链接: https://arxiv.org/abs/2508.11932
作者: Chengwei Zhang,Xueyi Zhang,Mingrui Lao,Tao Jiang,Xinhao Xu,Wenjie Li,Fubo Zhang,Longyong Chen
机构: Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息研究院); National University of Defense Technology (国防科技大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Real-world environment-derived point clouds invariably exhibit noise across varying modalities and intensities. Hence, point cloud denoising (PCD) is essential as a preprocessing step to improve downstream task performance. Deep learning (DL)-based PCD models, known for their strong representation capabilities and flexible architectures, have surpassed traditional methods in denoising performance. To our best knowledge, despite recent advances in performance, no comprehensive survey systematically summarizes the developments of DL-based PCD. To fill the gap, this paper seeks to identify key challenges in DL-based PCD, summarizes the main contributions of existing methods, and proposes a taxonomy tailored to denoising tasks. To achieve this goal, we formulate PCD as a two-step process: outlier removal and surface noise restoration, encompassing most scenarios and requirements of PCD. Additionally, we compare methods in terms of similarities, differences, and respective advantages. Finally, we discuss research limitations and future directions, offering insights for further advancements in PCD.
zh
[CV-147] Assessment of Using Synthetic Data in Brain Tumor Segmentation
【速读】:该论文旨在解决脑肿瘤MRI图像分割中因肿瘤异质性、标注数据稀缺及医学影像数据集中的类别不平衡所导致的分割精度受限问题。其解决方案的关键在于利用预训练的生成对抗网络(Generative Adversarial Network, GAN)模型生成合成MRI数据,并将其与真实数据混合用于训练U-Net分割网络,以增强数据多样性并缓解类别不平衡问题。实验表明,混合数据集(特别是40%真实数据与60%合成数据比例)在边界轮廓的定性表现上优于纯真实数据训练模型,但区域级别的准确性(如肿瘤核心和增强肿瘤区域)仍存在不足,提示类不平衡仍是需进一步优化的核心挑战。
链接: https://arxiv.org/abs/2508.11922
作者: Aditi Jahagirdar,Sameer Joshi
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Manual brain tumor segmentation from MRI scans is challenging due to tumor heterogeneity, scarcity of annotated data, and class imbalance in medical imaging datasets. Synthetic data generated by generative models has the potential to mitigate these issues by improving dataset diversity. This study investigates, as a proof of concept, the impact of incorporating synthetic MRI data, generated using a pre-trained GAN model, into training a U-Net segmentation network. Experiments were conducted using real data from the BraTS 2020 dataset, synthetic data generated with the medigan library, and hybrid datasets combining real and synthetic samples in varying proportions. While overall quantitative performance (Dice coefficient, IoU, precision, recall, accuracy) was comparable between real-only and hybrid-trained models, qualitative inspection suggested that hybrid datasets, particularly with 40% real and 60% synthetic data, improved whole tumor boundary delineation. However, region-wise accuracy for the tumor core and the enhancing tumor remained lower, indicating a persistent class imbalance. The findings support the feasibility of synthetic data as an augmentation strategy for brain tumor segmentation, while highlighting the need for larger-scale experiments, volumetric data consistency, and mitigating class imbalance in future work.
zh
[CV-148] ENA: Efficient N-dimensional Attention
【速读】:该论文旨在解决高阶数据(从一维到N维)在长序列建模中对传统Transformer架构效率不足的问题。其核心解决方案是提出一种名为高效N维注意力(Efficient N-dimensional Attention, ENA)的混合架构,关键在于将线性递归结构与分块高阶滑动窗口注意力(tiled high-order sliding window attention, SWA)相结合:线性递归负责压缩全局信息至状态空间,而SWA则通过严格的局部建模机制补充细节,二者协同形成一个理论高效且实践可行的框架,适用于超长高阶数据的建模任务。
链接: https://arxiv.org/abs/2508.11921
作者: Yibo Zhong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注: WIP
Abstract:Efficient modeling of long sequences of high-order data requires a more efficient architecture than Transformer. In this paper, we investigate two key aspects of extending linear recurrent models, especially those originally designed for language modeling, to high-order data (1D to ND): scanning strategies and attention-hybrid architectures. Empirical results suggest that scanning provides limited benefits, while attention-hybrid models yield promising results. Focusing on the latter, we further evaluate types of attention and find that tiled high-order sliding window attention (SWA) is efficient in both theory and practice. We term the resulting hybrid architecture of linear recurrence and high-order SWA as Efficient N-dimensional Attention (ENA). We then conduct several experiments to demonstrate its effectiveness. The intuition behind ENA is that linear recurrence compresses global information into a state, while SWA complements it by enforcing strict local modeling. Together, they form a simple framework that offers a promising and practical solution for ultra-long high-order data modeling.
zh
[CV-149] meSenCLIP: A Vision-Language Model for Remote Sensing Using Single-Pixel Time Series
【速读】:该论文旨在解决当前视觉-语言模型(Vision-Language Models)在遥感应用中面临的两个关键问题:一是依赖大尺度空间图像块导致计算成本过高,二是对文本标注监督的强依赖性,而这类标注在实际场景中往往难以获取。解决方案的核心在于提出TimeSenCLIP框架,通过利用单个像素在时间和光谱维度上的信息,减少对大规模空间上下文的依赖,并借助Sentinel-2多时相遥感影像与地理标记的地面照片进行跨视角学习,从而在无需大量文本描述的情况下实现卫星与地面视角间的语义对齐。该方法显著降低了对标注数据的需求,同时保持了高精度的分类性能,在土地利用/土地覆盖(LULC)、作物类型和生态系统类型等任务上验证了其高效性和可扩展性。
链接: https://arxiv.org/abs/2508.11919
作者: Pallavi Jain,Diego Marcos,Dino Ienco,Roberto Interdonato,Tristan Berchoux
机构: Mediterranean Agronomic Institute of Montpellier - CIHEAM-IAMM(地中海蒙彼利埃农业研究所 - CIHEAM-IAMM); Inria(法国国家信息与自动化研究院); INRAE(法国国家农业、食品与环境研究院); Cirad(法国农业发展研究中心); UMR TETIS(热带生态系统与可持续性联合实验室); Univ. of Montpellier(蒙彼利埃大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Paper under review
Abstract:Vision-language models have shown significant promise in remote sensing applications, particularly for land-use and land-cover (LULC) via zero-shot classification and retrieval. However, current approaches face two key challenges: reliance on large spatial tiles that increase computational cost, and dependence on text-based supervision, which is often not readily available. In this work, we present TimeSenCLIP, a lightweight framework that reevaluate the role of spatial context by evaluating the effectiveness of a single pixel by leveraging its temporal and spectral dimensions, for classifying LULC and ecosystem types. By leveraging spectral and temporal information from Sentinel-2 imagery and cross-view learning with geo-tagged ground-level photos, we minimises the need for caption-based training while preserving semantic alignment between overhead (satellite) and ground perspectives. Our approach is grounded in the LUCAS and Sen4Map datasets, and evaluated on classification tasks including LULC, crop type, and ecosystem type. We demonstrate that single pixel inputs, when combined with temporal and spectral cues, are sufficient for thematic mapping, offering a scalable and efficient alternative for large-scale remote sensing applications. Code is available at this https URL
zh
[CV-150] SafeCtrl: Region-Based Safety Control for Text-to-Image Diffusion via Detect-Then-Suppress
【速读】:该论文旨在解决文本到图像生成模型(text-to-image models)在广泛应用中可能产生有害内容的问题,同时避免现有安全方法(如提示重写或模型微调)带来的安全性与保真度之间的权衡。其解决方案的关键在于提出一种“检测-抑制”(detect-then-suppress)范式,通过引入轻量级、非侵入式的插件SafeCtrl,首先精确定位不安全内容区域,随后对有害语义进行抑制而非硬性替换,从而允许生成过程自然且连贯地演化为安全且上下文感知的替代结果。该方法的核心创新在于采用直接偏好优化(Direct Preference Optimization, DPO)训练策略,利用现成的图像级偏好数据学习细微的抑制行为,在推理阶段实现区域引导干预,无需昂贵的像素级标注,显著提升了安全性与保真度的平衡表现。
链接: https://arxiv.org/abs/2508.11904
作者: Lingyun Zhang,Yu Xie,Yanwei Fu,Ping Chen
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:The widespread deployment of text-to-image models is challenged by their potential to generate harmful content. While existing safety methods, such as prompt rewriting or model fine-tuning, provide valuable interventions, they often introduce a trade-off between safety and fidelity. Recent localization-based approaches have shown promise, yet their reliance on explicit ``concept replacement" can sometimes lead to semantic incongruity. To address these limitations, we explore a more flexible detect-then-suppress paradigm. We introduce SafeCtrl, a lightweight, non-intrusive plugin that first precisely localizes unsafe content. Instead of performing a hard A-to-B substitution, SafeCtrl then suppresses the harmful semantics, allowing the generative process to naturally and coherently resolve into a safe, context-aware alternative. A key aspect of our work is a novel training strategy using Direct Preference Optimization (DPO). We leverage readily available, image-level preference data to train our module, enabling it to learn nuanced suppression behaviors and perform region-guided interventions at inference without requiring costly, pixel-level annotations. Extensive experiments show that SafeCtrl significantly outperforms state-of-the-art methods in both safety efficacy and fidelity preservation. Our findings suggest that decoupled, suppression-based control is a highly effective and scalable direction for building more responsible generative models.
zh
[CV-151] OVG-HQ: Online Video Grounding with Hybrid-modal Queries ICCV2025
【速读】:该论文旨在解决传统视频定位(Video Grounding, VG)任务在在线场景下(如流式视频)和混合模态查询(如文本、图像、视频片段及其组合)中的局限性,尤其针对在线设置中上下文信息受限以及训练过程中模态不平衡导致弱模态被主导模态掩盖的问题。其解决方案的关键在于提出一个统一框架 OVG-HQ-Unify,包含两个核心组件:一是参数化记忆块(Parametric Memory Block, PMB),用于保留先前学习的知识以增强当前决策;二是跨模态蒸馏策略,引导非主导模态的学习,从而实现对多模态查询的有效建模。该设计使单一模型能够稳定处理复杂且动态的混合模态输入,显著提升在线视频定位的准确性与效率。
链接: https://arxiv.org/abs/2508.11903
作者: Runhao Zeng,Jiaqi Mao,Minghao Lai,Minh Hieu Phan,Yanjie Dong,Wei Wang,Qi Chen,Xiping Hu
机构: Artificial Intelligence Research Institute, Shenzhen MSU-BIT University (深圳北理莫斯科大学人工智能研究院); University of Adelaide (阿德莱德大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注: Accepted to ICCV 2025
Abstract:Video grounding (VG) task focuses on locating specific moments in a video based on a query, usually in text form. However, traditional VG struggles with some scenarios like streaming video or queries using visual cues. To fill this gap, we present a new task named Online Video Grounding with Hybrid-modal Queries (OVG-HQ), which enables online segment localization using text, images, video segments, and their combinations. This task poses two new challenges: limited context in online settings and modality imbalance during training, where dominant modalities overshadow weaker ones. To address these, we propose OVG-HQ-Unify, a unified framework featuring a Parametric Memory Block (PMB) that retain previously learned knowledge to enhance current decision and a cross-modal distillation strategy that guides the learning of non-dominant modalities. This design enables a single model to effectively handle hybrid-modal queries. Due to the lack of suitable datasets, we construct QVHighlights-Unify, an expanded dataset with multi-modal queries. Besides, since offline metrics overlook prediction timeliness, we adapt them to the online setting, introducing oR@n, IoU=m, and online mean Average Precision (omAP) to evaluate both accuracy and efficiency. Experiments show that our OVG-HQ-Unify outperforms existing models, offering a robust solution for online, hybrid-modal video grounding. Source code and datasets are available at this https URL.
zh
[CV-152] A Sobel-Gradient MLP Baseline for Handwritten Character Recognition
【速读】:该论文试图解决的问题是:在手写字符识别(Handwritten Character Recognition, HCR)任务中,是否仅使用一阶边缘图(first-order edge maps)作为输入,就能驱动一个全连接的多层感知机(All-Dense Multilayer Perceptron, MLP)实现接近卷积神经网络(Convolutional Neural Networks, CNNs)的性能,从而提供一种更轻量且特征可解释的替代方案。解决方案的关键在于:仅利用水平和垂直方向的Sobel算子提取的一阶梯度作为输入特征,训练一个纯MLP模型,在MNIST数字数据集上达到98%准确率、在EMNIST字母数据集上达到92%准确率,表明手写字符图像中的大部分类别判别信息已由一阶梯度充分捕获,因此基于边缘感知的MLP是一种具有竞争力的HCR方法。
链接: https://arxiv.org/abs/2508.11902
作者: Azam Nouri
机构: Lincoln University (林肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: This paper is under consideration at Pattern Recognition Letters
Abstract:We revisit the classical Sobel operator to ask a simple question: Are first-order edge maps sufficient to drive an all-dense multilayer perceptron (MLP) for handwritten character recognition (HCR), as an alternative to convolutional neural networks (CNNs)? Using only horizontal and vertical Sobel derivatives as input, we train an MLP on MNIST and EMNIST Letters. Despite its extreme simplicity, the resulting network reaches 98% accuracy on MNIST digits and 92% on EMNIST letters – approaching CNNs while offering a smaller memory footprint and transparent features. Our findings highlight that much of the class-discriminative information in handwritten character images is already captured by first-order gradients, making edge-aware MLPs a compelling option for HCR.
zh
[CV-153] Large Kernel Modulation Network for Efficient Image Super-Resolution
【速读】:该论文旨在解决资源受限场景下图像超分辨率(Image Super-Resolution, ISR)任务中轻量级模型在性能与推理延迟之间的权衡问题。现有方法中,卷积神经网络(Convolutional Neural Networks, CNNs)虽具备低延迟优势,但难以捕捉非局部特征;而基于Transformer的模型虽能有效建模长距离依赖,却存在推理速度慢的问题。为此,作者提出纯CNN架构的大型核调制网络(Large Kernel Modulation Network, LKMN),其关键创新在于两个核心组件:增强部分大核块(Enhanced Partial Large Kernel Block, EPLKB)通过通道混洗(channel shuffle)促进跨通道交互、引入通道注意力机制聚焦关键信息,并采用部分通道上的大核条带卷积实现高效非局部特征提取;交叉门控前馈网络(Cross-Gate Feed-Forward Network, CGFN)则通过可学习缩放因子动态调节输入、局部与非局部特征间的差异,并利用交叉门控策略对特征进行调制与融合,强化其互补性。实验表明,LKMN在保持高效率的同时显著提升重建质量,优于当前主流轻量化SR模型。
链接: https://arxiv.org/abs/2508.11893
作者: Quanwei Hu,Yinggan Tang,Xuguang Zhang
机构: Yanshan University (燕山大学); Communication University of Zhejiang (浙江传媒学院)
类目: Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Image super-resolution (SR) in resource-constrained scenarios demands lightweight models balancing performance and latency. Convolutional neural networks (CNNs) offer low latency but lack non-local feature capture, while Transformers excel at non-local modeling yet suffer slow inference. To address this trade-off, we propose the Large Kernel Modulation Network (LKMN), a pure CNN-based model. LKMN has two core components: Enhanced Partial Large Kernel Block (EPLKB) and Cross-Gate Feed-Forward Network (CGFN). The EPLKB utilizes channel shuffle to boost inter-channel interaction, incorporates channel attention to focus on key information, and applies large kernel strip convolutions on partial channels for non-local feature extraction with reduced complexity. The CGFN dynamically adjusts discrepancies between input, local, and non-local features via a learnable scaling factor, then employs a cross-gate strategy to modulate and fuse these features, enhancing their complementarity. Extensive experiments demonstrate that our method outperforms existing state-of-the-art (SOTA) lightweight SR models while balancing quality and efficiency. Specifically, LKMN-L achieves 0.23 dB PSNR improvement over DAT-light on the Manga109 dataset at \times 4 upscale, with nearly \times 4.8 times faster. Codes are in the supplementary materials. The code is available at this https URL.
zh
[CV-154] AdaRing: Towards Ultra-Light Vision-Language Adaptation via Cross-Layer Tensor Ring Decomposition
【速读】:该论文旨在解决现有基于适配器(Adapter)的微调方法在视觉语言模型(Vision-Language Models, VLMs)中面临的两个核心问题:一是由于忽略跨层冗余导致压缩率有限;二是同质化适配器带来的表征能力受限。解决方案的关键在于提出一种基于跨层张量环分解(Tensor Ring Decomposition, TRD)的新型微调框架AdaRing,通过将适配器建模为共享的张量核心(layer-shared tensor cores)与层特定切片(layer-specific slices),有效消除跨层冗余,并引入多样化秩驱动的适配器,在泛化感知微调策略下协同处理不同任务需求,从而实现参数高效且性能优越的VLM微调。
链接: https://arxiv.org/abs/2508.11870
作者: Ying Huang,Yuanbin Man,Wenqi Jia,Zhengzhong Tu,Junzhou Huang,Miao Yin
机构: University of Texas at Arlington (德克萨斯大学阿灵顿分校); Texas A&M University (德克萨斯农工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注:
Abstract:Adapter-based fine-tuning has gained remarkable attention in adapting large pre-trained vision language models (VLMs) for a wide range of downstream tasks efficiently. In this paradigm, only the inserted adapters are fine-tuned, without the need for training the original VLM backbone. Existing works scale adapters by integrating them into every layer of VLMs to increase the capacity of adapters. However, these methods face two primary limitations: 1) limited compression rate due to ignoring cross-layer redundancy, and 2) limited representational capacity across homogeneous adapters. In this paper, we propose a novel vision-language fine-tuning framework based on cross-layer tensor ring decomposition (TRD) with the integration and collaboration of diverse adapters, called AdaRing, achieving ultra-light parameter-efficient adaptation of VLMs on various tasks. To remove the high redundancy that exists among adapters across layers, we exploit the tensor-level low-rankness to formulate adapters as layer-shared tensor cores and layer-specific slices. Moreover, guided by generalization-aware fine-tuning, diverse rank-driven adapters cooperate to handle tasks that require different representations. Our experiments show that the proposed AdaRing achieves the state-of-the-art performance while reducing average training parameters by 90%.
zh
[CV-155] Data Shift of Object Detection in Autonomous Driving
【速读】:该论文旨在解决自动驾驶系统中因数据分布动态变化而导致的**数据漂移(data shift)**问题,该问题会显著影响目标检测模型在真实场景下的性能稳定性。其关键解决方案在于:首先通过系统的数据漂移检测与分析方法对数据集进行分类和平衡处理,进而基于YOLOv5框架引入基于CycleGAN的图像生成式增强技术,以提升模型对不同分布数据的适应能力,从而在BDD100K数据集上实现优于基线模型的检测性能。
链接: https://arxiv.org/abs/2508.11868
作者: Lida Xu
机构: Southern University of Science and Technology (南方科技大学)
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:With the widespread adoption of machine learning technologies in autonomous driving systems, their role in addressing complex environmental perception challenges has become increasingly crucial. However, existing machine learning models exhibit significant vulnerability, as their performance critically depends on the fundamental assumption that training and testing data satisfy the independent and identically distributed condition, which is difficult to guarantee in real-world applications. Dynamic variations in data distribution caused by seasonal changes, weather fluctuations lead to data shift problems in autonomous driving systems. This study investigates the data shift problem in autonomous driving object detection tasks, systematically analyzing its complexity and diverse manifestations. We conduct a comprehensive review of data shift detection methods and employ shift detection analysis techniques to perform dataset categorization and balancing. Building upon this foundation, we construct an object detection model. To validate our approach, we optimize the model by integrating CycleGAN-based data augmentation techniques with the YOLOv5 framework. Experimental results demonstrate that our method achieves superior performance compared to baseline models on the BDD100K dataset.
zh
[CV-156] Impact of Clinical Image Quality on Efficient Foundation Model Finetuning
【速读】:该论文旨在解决医学影像领域中基础模型(foundation models)在标签效率(label efficiency)方面的实际应用问题,尤其是在前列腺多参数磁共振成像(multiparametric MRI, mpMRI)中,如何通过预训练模型实现高质量下游任务性能,同时减少对标注数据的依赖。其解决方案的关键在于系统评估图像质量分布对微调(fine-tuning)效果的影响,发现:高质图像在微调数据中的比例是决定模型能否超越从零训练模型的核心因素;此外,微调与测试阶段图像质量分布的不匹配会显著降低模型泛化能力,因此需在微调和部署阶段保持图像质量分布的一致性。这凸显了在微调数据中量化并控制图像质量的重要性,以充分发挥基础模型在数据与计算效率上的优势。
链接: https://arxiv.org/abs/2508.11864
作者: Yucheng Tang,Pawel Rajwa,Alexander Ng,Yipei Wang,Wen Yan,Natasha Thorley,Aqua Asif,Clare Allen,Louise Dickinson,Francesco Giganti,Shonit Punwani,Daniel C. Alexander,Veeru Kasivisvanathan,Yipeng Hu
机构: University College London (伦敦大学学院); Imperial College London (帝国理工学院); University of Oxford (牛津大学); King’s College London (国王学院); University of Cambridge (剑桥大学)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Foundation models in medical imaging have shown promising label efficiency, achieving high downstream performance with only a fraction of annotated data. Here, we evaluate this in prostate multiparametric MRI using ProFound, a domain-specific vision foundation model pretrained on large-scale prostate MRI datasets. We investigate how variable image quality affects label-efficient finetuning by measuring the generalisability of finetuned models. Experiments systematically vary high-/low-quality image ratios in finetuning and evaluation sets. Our findings indicate that image quality distribution and its finetune-and-test mismatch significantly affect model performance. In particular: a) Varying the ratio of high- to low-quality images between finetuning and test sets leads to notable differences in downstream performance; and b) The presence of sufficient high-quality images in the finetuning set is critical for maintaining strong performance, whilst the importance of matched finetuning and testing distribution varies between different downstream tasks, such as automated radiology reporting and prostate cancer this http URL quality ratios are consistent, finetuning needs far less labeled data than training from scratch, but label efficiency depends on image quality distribution. Without enough high-quality finetuning data, pretrained models may fail to outperform those trained without pretraining. This highlights the importance of assessing and aligning quality distributions between finetuning and deployment, and the need for quality standards in finetuning data for specific downstream tasks. Using ProFound, we show the value of quantifying image quality in both finetuning and deployment to fully realise the data and compute efficiency benefits of foundation models.
zh
[CV-157] ComplicitSplat: Downstream Models are Vulnerable to Blackbox Attacks by 3D Gaussian Splat Camouflages
【速读】:该论文旨在解决3D高斯溅射(3D Gaussian Splatting, 3DGS)在安全关键任务中面临的潜在安全风险问题,即如何通过图像篡改诱导下游目标检测器产生误判。其核心解决方案是提出ComplicitSplat——一种利用标准3DGS着色机制生成视角特异性伪装(viewpoint-specific camouflage)的新型黑盒攻击方法,该方法能在不访问模型架构或权重的情况下,嵌入仅从特定视角可见的对抗性内容,从而实现对目标检测器的隐蔽干扰。实验表明,该攻击可成功作用于多种主流检测模型(包括单阶段、多阶段及基于Transformer的模型),并在真实物理场景与合成场景中均具泛化能力,揭示了3DGS在自动驾驶和机器人系统等应用中的新安全隐患。
链接: https://arxiv.org/abs/2508.11854
作者: Matthew Hull,Haoyang Yang,Pratham Mehta,Mansi Phute,Aeree Cho,Haorang Wang,Matthew Lau,Wenke Lee,Wilian Lunardi,Martin Andreoni,Polo Chau
机构: Georgia Institute of Technology (佐治亚理工学院); University of California, Berkeley (加州大学伯克利分校)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 7 pages, 6 figures
Abstract:As 3D Gaussian Splatting (3DGS) gains rapid adoption in safety-critical tasks for efficient novel-view synthesis from static images, how might an adversary tamper images to cause harm? We introduce ComplicitSplat, the first attack that exploits standard 3DGS shading methods to create viewpoint-specific camouflage - colors and textures that change with viewing angle - to embed adversarial content in scene objects that are visible only from specific viewpoints and without requiring access to model architecture or weights. Our extensive experiments show that ComplicitSplat generalizes to successfully attack a variety of popular detector - both single-stage, multi-stage, and transformer-based models on both real-world capture of physical objects and synthetic scenes. To our knowledge, this is the first black-box attack on downstream object detectors using 3DGS, exposing a novel safety risk for applications like autonomous navigation and other mission-critical robotic systems.
zh
[CV-158] Recent Advances in Transformer and Large Language Models for UAV Applications
【速读】:该论文旨在解决当前Transformer架构在无人机(Unmanned Aerial Vehicle, UAV)系统中应用的研究碎片化问题,通过构建统一的分类体系来系统梳理和评估基于Transformer的UAV模型发展现状。其解决方案的关键在于提出一个涵盖注意力机制、CNN-Transformer混合模型、强化学习Transformer以及大语言模型(Large Language Models, LLMs)的统一分类框架,并结合结构化表格与性能基准进行对比分析,同时梳理关键数据集、仿真工具和评估指标,从而明确现有研究的不足、计算效率与实时部署挑战,并为未来研究方向提供清晰指引。
链接: https://arxiv.org/abs/2508.11834
作者: Hamza Kheddar,Yassine Habchi,Mohamed Chahine Ghanem,Mustapha Hemis,Dusit Niyato
机构: University of Medea (Medea大学); University Centre Salhi Ahmed (萨利·艾哈迈德大学中心); University of Liverpool (利物浦大学); University of Sciences and Technology Houari Boumediene (侯阿里·布迈丁科技大学); Nanyang Technological University (南洋理工大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Robotics (cs.RO); Image and Video Processing (eess.IV); Systems and Control (eess.SY)
备注:
Abstract:The rapid advancement of Transformer-based models has reshaped the landscape of uncrewed aerial vehicle (UAV) systems by enhancing perception, decision-making, and autonomy. This review paper systematically categorizes and evaluates recent developments in Transformer architectures applied to UAVs, including attention mechanisms, CNN-Transformer hybrids, reinforcement learning Transformers, and large language models (LLMs). Unlike previous surveys, this work presents a unified taxonomy of Transformer-based UAV models, highlights emerging applications such as precision agriculture and autonomous navigation, and provides comparative analyses through structured tables and performance benchmarks. The paper also reviews key datasets, simulators, and evaluation metrics used in the field. Furthermore, it identifies existing gaps in the literature, outlines critical challenges in computational efficiency and real-time deployment, and offers future research directions. This comprehensive synthesis aims to guide researchers and practitioners in understanding and advancing Transformer-driven UAV technologies.
zh
[CV-159] From Pixels to Graphs: Deep Graph-Level Anomaly Detection on Dermoscopic Images
【速读】:该论文旨在解决图像到图表示转换方法在基于图神经网络(Graph Neural Networks, GNNs)的图级异常检测(Graph-Level Anomaly Detection, GLAD)中缺乏系统性比较的问题。现有研究虽将GNN应用于从图像生成的图结构上进行分类或异常检测,但未对多种分割策略、边构建方式及节点特征集(如颜色、纹理和形状描述符)的有效性进行严谨评估。其解决方案的关键在于:通过大规模实验系统地评估不同图像转图方法组合在皮肤镜图像上的性能与效率,发现颜色描述符单独使用时表现最佳,而融合形状与纹理特征可进一步提升检测效果;尤其在无监督场景下,最优配置(OCGTL)达到0.805 AUC-ROC,且无需依赖预训练骨干网络,在弱监督和全监督条件下分别提升至0.872和0.914,验证了所选特征组合与结构设计的有效性。
链接: https://arxiv.org/abs/2508.11826
作者: Dehn Xu,Tim Katzke,Emmanuel Müller
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注:
Abstract:Graph Neural Networks (GNNs) have emerged as a powerful approach for graph-based machine learning tasks. Previous work applied GNNs to image-derived graph representations for various downstream tasks such as classification or anomaly detection. These transformations include segmenting images, extracting features from segments, mapping them to nodes, and connecting them. However, to the best of our knowledge, no study has rigorously compared the effectiveness of the numerous potential image-to-graph transformation approaches for GNN-based graph-level anomaly detection (GLAD). In this study, we systematically evaluate the efficacy of multiple segmentation schemes, edge construction strategies, and node feature sets based on color, texture, and shape descriptors to produce suitable image-derived graph representations to perform graph-level anomaly detection. We conduct extensive experiments on dermoscopic images using state-of-the-art GLAD models, examining performance and efficiency in purely unsupervised, weakly supervised, and fully supervised regimes. Our findings reveal, for example, that color descriptors contribute the best standalone performance, while incorporating shape and texture features consistently enhances detection efficacy. In particular, our best unsupervised configuration using OCGTL achieves a competitive AUC-ROC score of up to 0.805 without relying on pretrained backbones like comparable image-based approaches. With the inclusion of sparse labels, the performance increases substantially to 0.872 and with full supervision to 0.914 AUC-ROC.
zh
[CV-160] owards Understanding 3D Vision: the Role of Gaussian Curvature
【速读】:该论文旨在解决当前基于深度学习的三维表面重建方法缺乏显式几何模型的问题,这些问题限制了模型的可解释性、跨模态迁移能力以及可控实验设计。解决方案的关键在于引入高斯曲率(Gaussian curvature)作为几何先验,利用其在坐标变换下的不变性特性,实现对3D表面的稀疏紧凑描述,并揭示现有单目和立体匹配方法虽隐式利用该特性但未显式建模,进而提出通过高斯曲率增强重建精度与作为无监督评估指标的新路径。
链接: https://arxiv.org/abs/2508.11825
作者: Sherlon Almeida da Silva,Davi Geiger,Luiz Velho,Moacir Antonelli Ponti
机构: Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo (ICMC-USP); Courant Institute of Mathematical Sciences, New York University (NYU); Instituto de Matemática Pura e Aplicada (IMPA)
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Recent advances in computer vision have predominantly relied on data-driven approaches that leverage deep learning and large-scale datasets. Deep neural networks have achieved remarkable success in tasks such as stereo matching and monocular depth reconstruction. However, these methods lack explicit models of 3D geometry that can be directly analyzed, transferred across modalities, or systematically modified for controlled experimentation. We investigate the role of Gaussian curvature in 3D surface modeling. Besides Gaussian curvature being an invariant quantity under change of observers or coordinate systems, we demonstrate using the Middlebury stereo dataset that it offers: (i) a sparse and compact description of 3D surfaces, (ii) state-of-the-art monocular and stereo methods seem to implicitly consider it, but no explicit module of such use can be extracted, (iii) a form of geometric prior that can inform and improve 3D surface reconstruction, and (iv) a possible use as an unsupervised metric for stereo methods.
zh
[CV-161] An MLP Baseline for Handwriting Recognition Using Planar Curvature and Gradient Orientation
【速读】:该论文试图解决的问题是如何在不依赖卷积神经网络(Convolutional Neural Networks, CNNs)的情况下,利用可解释的、手工设计的几何特征实现高精度的手写字符识别(Handwritten Character Recognition, HCR)。其解决方案的关键在于使用三种二阶几何线索——平面曲率幅值(planar curvature magnitude)、曲率符号(curvature sign)和梯度方向(gradient orientation)作为输入特征,构建一个仅基于多层感知机(Multilayer Perceptron, MLP)的分类器。实验表明,该方法在MNIST数字数据集上达到97%准确率,在EMNIST字母数据集上达到89%准确率,证明了曲率基表示具有强大的判别能力,并且即使采用非深度学习架构也能实现深度学习模型的性能优势。
链接: https://arxiv.org/abs/2508.11803
作者: Azam Nouri
机构: Lincoln University (林肯大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
备注: 5 pages, No figure
Abstract:This study investigates whether second-order geometric cues - planar curvature magnitude, curvature sign, and gradient orientation - are sufficient on their own to drive a multilayer perceptron (MLP) classifier for handwritten character recognition (HCR), offering an alternative to convolutional neural networks (CNNs). Using these three handcrafted feature maps as inputs, our curvature-orientation MLP achieves 97 percent accuracy on MNIST digits and 89 percent on EMNIST letters. These results underscore the discriminative power of curvature-based representations for handwritten character images and demonstrate that the advantages of deep learning can be realized even with interpretable, hand-engineered features.
zh
[CV-162] Scalable Geospatial Data Generation Using AlphaEarth Foundations Model
【速读】:该论文旨在解决地理空间标注数据集在全球范围内覆盖不均的问题,即现有高质量标注数据通常局限于特定地理区域(如美国),难以直接应用于其他地区(如加拿大)。其解决方案的关键在于利用Google DeepMind发布的AlphaEarth Foundations (AEF)——一个信息密集的全球地理空间表征模型,作为输入特征来扩展原有标注数据集的地理范围。研究通过简单模型(如随机森林或逻辑回归)学习从AEF中提取的特征与目标标签之间的映射关系,从而在未标注的新区域(加拿大)实现有效的分类预测,验证了该方法在不同粒度下的可行性与准确性。
链接: https://arxiv.org/abs/2508.11739
作者: Luc Houriez(1 and 2),Sebastian Pilarski(1),Behzad Vahedi(1),Ali Ahmadalipour(1),Teo Honda Scully(1),Nicholas Aflitto(1),David Andre(1),Caroline Jaffe(1),Martha Wedner(1),Rich Mazzola(1),Josh Jeffery(1),Ben Messinger(1),Sage McGinley-Smith(1),Sarah Russell(1) ((1) X the Moonshot Factory - Bellwether, (2) Stanford University)
机构: X, the Moonshot Factory (X,月球工厂); Bellwether; Stanford University (斯坦福大学)
类目: Machine Learning (cs.LG); Computer Vision and Pattern Recognition (cs.CV)
备注: 15 pages, 10 figures, 5 tables
Abstract:High-quality labeled geospatial datasets are essential for extracting insights and understanding our planet. Unfortunately, these datasets often do not span the entire globe and are limited to certain geographic regions where data was collected. Google DeepMind’s recently released AlphaEarth Foundations (AEF) provides an information-dense global geospatial representation designed to serve as a useful input across a wide gamut of tasks. In this article we propose and evaluate a methodology which leverages AEF to extend geospatial labeled datasets beyond their initial geographic regions. We show that even basic models like random forests or logistic regression can be used to accomplish this task. We investigate a case study of extending LANDFIRE’s Existing Vegetation Type (EVT) dataset beyond the USA into Canada at two levels of granularity: EvtPhys (13 classes) and EvtGp (80 classes). Qualitatively, for EvtPhys, model predictions align with ground truth. Trained models achieve 81% and 73% classification accuracy on EvtPhys validation sets in the USA and Canada, despite discussed limitations.
zh
[CV-163] Artificial Intelligence in Rural Healthcare Delivery: Bridging Gaps and Enhancing Equity through Innovation
【速读】:该论文旨在解决农村地区医疗资源匮乏、专业人才短缺及社会经济差异导致的医疗服务可及性不足等长期难题。其核心解决方案在于系统性评估人工智能(Artificial Intelligence, AI)在改善农村医疗卫生服务中的应用潜力,特别是通过预测分析、远程医疗平台和自动化诊断工具提升服务的可及性、质量和效率。关键创新点在于识别出多模态基础模型(Multimodal Foundation Models, MFMs)与大语言模型(Large Language Models, LLMs)的协同作用:MFMs整合影像、电子病历与生物信号等异构数据以支持精准决策,LLMs则优化临床文书处理、患者分诊、跨语言沟通与虚拟辅助,从而增强基层医疗能力、缩短诊断延迟并促进优质医疗资源下沉。研究同时指出,实现可持续落地需突破基础设施薄弱、数据质量参差与伦理合规等障碍,强调跨学科协作、数字基建投入与法规体系建设的重要性。
链接: https://arxiv.org/abs/2508.11738
作者: Kiruthika Balakrishnan,Durgadevi Velusamy,Hana E. Hinkle,Zhi Li,Karthikeyan Ramasamy,Hikmat Khan,Srini Ramaswamy,Pir Masoom Shah
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Rural healthcare faces persistent challenges, including inadequate infrastructure, workforce shortages, and socioeconomic disparities that hinder access to essential services. This study investigates the transformative potential of artificial intelligence (AI) in addressing these issues in underserved rural areas. We systematically reviewed 109 studies published between 2019 and 2024 from PubMed, Embase, Web of Science, IEEE Xplore, and Scopus. Articles were screened using PRISMA guidelines and Covidence software. A thematic analysis was conducted to identify key patterns and insights regarding AI implementation in rural healthcare delivery. The findings reveal significant promise for AI applications, such as predictive analytics, telemedicine platforms, and automated diagnostic tools, in improving healthcare accessibility, quality, and efficiency. Among these, advanced AI systems, including Multimodal Foundation Models (MFMs) and Large Language Models (LLMs), offer particularly transformative potential. MFMs integrate diverse data sources, such as imaging, clinical records, and bio signals, to support comprehensive decision-making, while LLMs facilitate clinical documentation, patient triage, translation, and virtual assistance. Together, these technologies can revolutionize rural healthcare by augmenting human capacity, reducing diagnostic delays, and democratizing access to expertise. However, barriers remain, including infrastructural limitations, data quality concerns, and ethical considerations. Addressing these challenges requires interdisciplinary collaboration, investment in digital infrastructure, and the development of regulatory frameworks. This review offers actionable recommendations and highlights areas for future research to ensure equitable and sustainable integration of AI in rural healthcare systems.
zh
[CV-164] UniDCF: A Foundation Model for Comprehensive Dentocraniofacial Hard Tissue Reconstruction
【速读】:该论文旨在解决牙颌面硬组织(dentocraniofacial hard tissue)缺损重建中面临的挑战,即现有深度学习模型受限于单一组织场景和特定模态影像输入,导致泛化能力差,并在解剖保真度、计算效率与跨组织适应性之间存在权衡。解决方案的关键在于提出UniDCF框架,通过点云与多视角图像的多模态融合编码实现多种牙颌面硬组织的统一重建;同时引入基于评分的去噪模块以优化表面平滑性,从而克服以往单模态方法的局限性。该框架基于包含6,609名患者数据的全球最大规模多模态数据库进行训练与验证,在几何精度、结构完整性及空间准确性方面均优于现有最优方法,且临床模拟显示其可将重建设计时间减少99%,并获得超过94%的临床可接受度。
链接: https://arxiv.org/abs/2508.11728
作者: Chunxia Ren,Ning Zhu,Yue Lai,Gui Chen,Ruijie Wang,Yangyi Hu,Suyao Liu,Shuwen Mao,Hong Su,Yu Zhang,Li Xiao
机构: Beijing University of Posts and Telecommunications (北京邮电大学); Peking University (北京大学)
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 23 pages, 6 figures
Abstract:Dentocraniofacial hard tissue defects profoundly affect patients’ physiological functions, facial aesthetics, and psychological well-being, posing significant challenges for precise reconstruction. Current deep learning models are limited to single-tissue scenarios and modality-specific imaging inputs, resulting in poor generalizability and trade-offs between anatomical fidelity, computational efficiency, and cross-tissue adaptability. Here we introduce UniDCF, a unified framework capable of reconstructing multiple dentocraniofacial hard tissues through multimodal fusion encoding of point clouds and multi-view images. By leveraging the complementary strengths of each modality and incorporating a score-based denoising module to refine surface smoothness, UniDCF overcomes the limitations of prior single-modality approaches. We curated the largest multimodal dataset, comprising intraoral scans, CBCT, and CT from 6,609 patients, resulting in 54,555 annotated instances. Evaluations demonstrate that UniDCF outperforms existing state-of-the-art methods in terms of geometric precision, structural completeness, and spatial accuracy. Clinical simulations indicate UniDCF reduces reconstruction design time by 99% and achieves clinician-rated acceptability exceeding 94%. Overall, UniDCF enables rapid, automated, and high-fidelity reconstruction, supporting personalized and precise restorative treatments, streamlining clinical workflows, and enhancing patient outcomes.
zh
[CV-165] FusionFM: Fusing Eye-specific Foundational Models for Optimized Ophthalmic Diagnosis
【速读】:该论文旨在解决眼科领域基础模型(Foundation Models, FMs)在医学图像分析中的性能差异、任务适应性不均以及如何提升整体预测能力等关键问题。其核心挑战在于:不同FM在眼科疾病检测(如青光眼、糖尿病视网膜病变、年龄相关性黄斑变性)与系统性疾病预测(如糖尿病和高血压)中表现是否一致,以及是否可以通过融合策略进一步提升性能。解决方案的关键在于提出FusionFM评估框架,并设计两种融合方法(基于门控的融合策略),对四种先进FM(RETFound、VisionFM、RetiZero、DINORET)进行系统性比较与集成优化;结果表明,RetiZero和DINORET在多数任务上表现最优,且融合策略在特定任务(如青光眼、AMD、高血压)上带来小幅但显著的性能提升,为未来临床应用提供了可验证的模型选择与集成路径。
链接: https://arxiv.org/abs/2508.11721
作者: Ke Zou,Jocelyn Hui Lin Goh,Yukun Zhou,Tian Lin,Samantha Min Er Yew,Sahana Srinivasan,Meng Wang,Rui Santos,Gabor M. Somfai,Huazhu Fu,Haoyu Chen,Pearse A. Keane,Ching-Yu Cheng,Yih Chung Tham
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 12 pages, 3 figures
Abstract:Foundation models (FMs) have shown great promise in medical image analysis by improving generalization across diverse downstream tasks. In ophthalmology, several FMs have recently emerged, but there is still no clear answer to fundamental questions: Which FM performs the best? Are they equally good across different tasks? What if we combine all FMs together? To our knowledge, this is the first study to systematically evaluate both single and fused ophthalmic FMs. To address these questions, we propose FusionFM, a comprehensive evaluation suite, along with two fusion approaches to integrate different ophthalmic FMs. Our framework covers both ophthalmic disease detection (glaucoma, diabetic retinopathy, and age-related macular degeneration) and systemic disease prediction (diabetes and hypertension) based on retinal imaging. We benchmarked four state-of-the-art FMs (RETFound, VisionFM, RetiZero, and DINORET) using standardized datasets from multiple countries and evaluated their performance using AUC and F1 metrics. Our results show that DINORET and RetiZero achieve superior performance in both ophthalmic and systemic disease tasks, with RetiZero exhibiting stronger generalization on external datasets. Regarding fusion strategies, the Gating-based approach provides modest improvements in predicting glaucoma, AMD, and hypertension. Despite these advances, predicting systemic diseases, especially hypertension in external cohort remains challenging. These findings provide an evidence-based evaluation of ophthalmic FMs, highlight the benefits of model fusion, and point to strategies for enhancing their clinical applicability.
zh
[CV-166] Privacy-Aware Detection of Fake Identity Documents: Methodology Benchmark and Improved Detection Methods (FakeIDet2)
【速读】:该论文旨在解决伪造身份文件(Fake ID)检测研究中因数据敏感性导致的真实数据稀缺问题,从而阻碍了检测模型的训练与评估。其核心挑战在于如何在不泄露原始ID隐私的前提下,构建可用于训练和测试的高质量数据集。解决方案的关键在于提出了一种基于图像块(patch-based)的隐私保护方法,通过提取并共享ID图像中的局部区域而非完整文档,有效规避了敏感信息暴露的风险;同时,作者构建了一个包含超过90万张真实/伪造ID图像块的新公共数据库FakeIDet2-db,并开发了一种隐私感知的检测方法FakeIDet2,配合标准化的可复现基准,支持对物理攻击(如打印、屏幕复制、复合伪造)和合成攻击的系统性评估。
链接: https://arxiv.org/abs/2508.11716
作者: Javier Muñoz-Haro,Ruben Tolosana,Ruben Vera-Rodriguez,Aythami Morales,Julian Fierrez
机构: Biometrics and Data Pattern Analytics Lab, Universidad Autónoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, 28049, Madrid, Spain
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV)
备注:
Abstract:Remote user verification in Internet-based applications is becoming increasingly important nowadays. A popular scenario for it consists of submitting a picture of the user’s Identity Document (ID) to a service platform, authenticating its veracity, and then granting access to the requested digital service. An ID is well-suited to verify the identity of an individual, since it is government issued, unique, and nontransferable. However, with recent advances in Artificial Intelligence (AI), attackers can surpass security measures in IDs and create very realistic physical and synthetic fake IDs. Researchers are now trying to develop methods to detect an ever-growing number of these AI-based fakes that are almost indistinguishable from authentic (bona fide) IDs. In this counterattack effort, researchers are faced with an important challenge: the difficulty in using real data to train fake ID detectors. This real data scarcity for research and development is originated by the sensitive nature of these documents, which are usually kept private by the ID owners (the users) and the ID Holders (e.g., government, police, bank, etc.). The main contributions of our study are: 1) We propose and discuss a patch-based methodology to preserve privacy in fake ID detection research. 2) We provide a new public database, FakeIDet2-db, comprising over 900K real/fake ID patches extracted from 2,000 ID images, acquired using different smartphone sensors, illumination and height conditions, etc. In addition, three physical attacks are considered: print, screen, and composite. 3) We present a new privacy-aware fake ID detection method, FakeIDet2. 4) We release a standard reproducible benchmark that considers physical and synthetic attacks from popular databases in the literature.
zh
[CV-167] Separating Knowledge and Perception with Procedural Data ICML2025
【速读】:该论文旨在解决当前视觉表示学习模型对真实世界图像数据的高度依赖问题,尤其是在缺乏大规模标注数据时性能受限的挑战。其核心解决方案是利用仅基于程序化生成的数据(procedural data)训练表示模型,并通过引入视觉记忆(visual memory)——一个显式的参考图像嵌入数据库——在不进行额外微调的情况下实现视觉相似性、分类和语义分割等任务的高性能表现。关键创新在于实现了与真实世界图像的完全解耦(full compartmentalization),同时保持了接近甚至超越传统基于真实数据训练模型的性能,表明程序化数据在构建通用视觉表征方面的潜力。
链接: https://arxiv.org/abs/2508.11697
作者: Adrián Rodríguez-Muñoz,Manel Baradad,Phillip Isola,Antonio Torralba
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
备注: 17 pages, 18 figures, 3 tables, to be published in ICML 2025
Abstract:We train representation models with procedural data only, and apply them on visual similarity, classification, and semantic segmentation tasks without further training by using visual memory – an explicit database of reference image embeddings. Unlike prior work on visual memory, our approach achieves full compartmentalization with respect to all real-world images while retaining strong performance. Compared to a model trained on Places, our procedural model performs within 1% on NIGHTS visual similarity, outperforms by 8% and 15% on CUB200 and Flowers102 fine-grained classification, and is within 10% on ImageNet-1K classification. It also demonstrates strong zero-shot segmentation, achieving an R^2 on COCO within 10% of the models trained on real data. Finally, we analyze procedural versus real data models, showing that parts of the same object have dissimilar representations in procedural models, resulting in incorrect searches in memory and explaining the remaining performance gap.
zh
[CV-168] A Deep Learning-Based CCTV System for Automatic Smoking Detection in Fire Exit Zones
【速读】:该论文旨在解决火警疏散通道区域中吸烟行为的实时检测问题,以满足关键安全监管需求。其解决方案的关键在于提出一种基于YOLOv8改进的深度学习模型,通过引入针对复杂监控场景优化的结构提升检测性能,并在多类低光照条件下验证了模型的有效性;实验表明该模型在召回率(78.90%)和平均精度均值(mAP@50=83.70%)上优于YOLOv11与YOLOv12,且在Jetson Xavier NX边缘设备上实现每帧52–97毫秒的推理延迟,具备部署于实时安防系统的能力。
链接: https://arxiv.org/abs/2508.11696
作者: Sami Sadat,Mohammad Irtiza Hossain,Junaid Ahmed Sifat,Suhail Haque Rafi,Md. Waseq Alauddin Alvi,Md. Khalilur Rhaman
机构: 未知
类目: Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:A deep learning real-time smoking detection system for CCTV surveillance of fire exit areas is proposed due to critical safety requirements. The dataset contains 8,124 images from 20 different scenarios along with 2,708 raw samples demonstrating low-light areas. We evaluated three advanced object detection models: YOLOv8, YOLOv11, and YOLOv12, followed by development of a custom model derived from YOLOv8 with added structures for challenging surveillance contexts. The proposed model outperformed the others, achieving a recall of 78.90 percent and mAP at 50 of 83.70 percent, delivering optimal object detection across varied environments. Performance evaluation on multiple edge devices using multithreaded operations showed the Jetson Xavier NX processed data at 52 to 97 milliseconds per inference, establishing its suitability for time-sensitive operations. This system offers a robust and adaptable platform for monitoring public safety and enabling automatic regulatory compliance.
zh
[CV-169] Contrastive Regularization over LoRA for Multimodal Biomedical Image Incremental Learning
【速读】:该论文旨在解决多模态生物医学图像增量学习(Multimodal Biomedical Image Incremental Learning, MBIIL)中的两个核心问题:一是如何在增量更新过程中保留先前学习到的知识以缓解灾难性遗忘;二是如何有效利用已有模态的知识来支持新模态的学习,从而实现跨模态的知识迁移与区分。解决方案的关键在于提出MSLoRA-CR方法,该方法在冻结预训练大视觉语言模型(Large Vision-Language Model, LVLM)的基础上,通过微调特定于模态的低秩适配模块(Modality-Specific LoRA modules),并引入对比正则化(Contrastive Regularization)机制,增强模态内知识共享并促进模态间知识分化,从而在保持计算效率的同时显著提升整体性能。
链接: https://arxiv.org/abs/2508.11673
作者: Haojie Zhang,Yixiong Liang,Hulin Kuang,Lihui Cen,Zhe Qu,Yigang Cen,Min Zeng,Shichao Kan
机构: Central South University (中南大学); Beijing Jiaotong University (北京交通大学)
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
备注: 10 pages, 3 figures, submitted to ACM Multimedia 2025
Abstract:Multimodal Biomedical Image Incremental Learning (MBIIL) is essential for handling diverse tasks and modalities in the biomedical domain, as training separate models for each modality or task significantly increases inference costs. Existing incremental learning methods focus on task expansion within a single modality, whereas MBIIL seeks to train a unified model incrementally across modalities. The MBIIL faces two challenges: I) How to preserve previously learned knowledge during incremental updates? II) How to effectively leverage knowledge acquired from existing modalities to support new modalities? To address these challenges, we propose MSLoRA-CR, a method that fine-tunes Modality-Specific LoRA modules while incorporating Contrastive Regularization to enhance intra-modality knowledge sharing and promote inter-modality knowledge differentiation. Our approach builds upon a large vision-language model (LVLM), keeping the pretrained model frozen while incrementally adapting new LoRA modules for each modality or task. Experiments on the incremental learning of biomedical images demonstrate that MSLoRA-CR outperforms both the state-of-the-art (SOTA) approach of training separate models for each modality and the general incremental learning method (incrementally fine-tuning LoRA). Specifically, MSLoRA-CR achieves a 1.88% improvement in overall performance compared to unconstrained incremental learning methods while maintaining computational efficiency. Our code is publicly available at this https URL.
zh
[CV-170] Point upsampling networks for single-photon sensing
【速读】:该论文旨在解决单光子成像(single-photon sensing)中点云稀疏且空间分布不均的问题,这一缺陷限制了其在远距离和超灵敏成像中的实际应用。解决方案的关键在于提出一种基于状态空间模型(state space model)的点云上采样网络,其核心创新包括:多路径扫描机制以丰富空间上下文信息、双向Mamba骨干网络以同时捕捉全局几何结构与局部细节,以及自适应上采样偏移模块以校正由偏移引起的畸变。该方法显著提升了点云密度和空间一致性,实现了高精度重建与强鲁棒性,为单光子传感在下游任务中的实用化开辟了新路径。
链接: https://arxiv.org/abs/2508.12986
作者: Jinyi Liu,Guoyang Zhao,Lijun Liu,Yiguang Hong,Weiping Zhang,Shuming Cheng
机构: Tongji University (同济大学); Shanxi Normal University (山西师范大学)
类目: Optics (physics.optics); Computer Vision and Pattern Recognition (cs.CV)
备注: 13 pages, 8 figures, any comments are welcome
Abstract:Single-photon sensing has generated great interest as a prominent technique of long-distance and ultra-sensitive imaging, however, it tends to yield sparse and spatially biased point clouds, thus limiting its practical utility. In this work, we propose using point upsampling networks to increase point density and reduce spatial distortion in single-photon point cloud. Particularly, our network is built on the state space model which integrates a multi-path scanning mechanism to enrich spatial context, a bidirectional Mamba backbone to capture global geometry and local details, and an adaptive upsample shift module to correct offset-induced distortions. Extensive experiments are implemented on commonly-used datasets to confirm its high reconstruction accuracy and strong robustness to the distortion noise, and also on real-world data to demonstrate that our model is able to generate visually consistent, detail-preserving, and noise suppressed point clouds. Our work is the first to establish the upsampling framework for single-photon sensing, and hence opens a new avenue for single-photon sensing and its practical applications in the downstreaming tasks.
zh
[CV-171] On the Importance of Behavioral Nuances: Amplifying Non-Obvious Motor Noise Under True Empirical Considerations May Lead to Briefer Assays and Faster Classification Processes
【速读】:该论文旨在解决传统生物节律时间序列数据分析中因依赖大规模数据集而难以获取、且在平均化处理过程中丢失个体特异性信息的问题。其解决方案的关键在于构建一个情感计算平台,通过结合从5秒短时面部视频中提取的微峰值(micropeaks)新数据类型与基于AI驱动的面部网格估计方法,利用几何学和非线性动力系统分析法捕捉面部运动学中的全部微峰值特征,包括不同情绪微表情的细微差异,从而在保持个性化统计效能的同时实现高效、可扩展的生理信号检测。
链接: https://arxiv.org/abs/2508.12742
作者: Theodoros Bermperidis,Joe Vero,Elizabeth B Torres
机构: Rutgers the State University of New Jersey (新泽西州立罗格斯大学); Rutgers University Center for Cognitive Science (罗格斯大学认知科学中心); Rutgers University Center Biomedicine, Imaging and Modeling, Department of Computer Science (罗格斯大学生物医学、成像与建模中心,计算机科学系)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Signal Processing (eess.SP); Chaotic Dynamics (nlin.CD)
备注: This paper is under review in IEEE Transactions on Affective Computing
Abstract:There is a tradeoff between attaining statistical power with large, difficult to gather data sets, and producing highly scalable assays that register brief data samples. Often, as grand-averaging techniques a priori assume normally-distributed parameters and linear, stationary processes in biorhythmic, time series data, important information is lost, averaged out as gross data. We developed an affective computing platform that enables taking brief data samples while maintaining personalized statistical power. This is achieved by combining a new data type derived from the micropeaks present in time series data registered from brief (5-second-long) face videos with recent advances in AI-driven face-grid estimation methods. By adopting geometric and nonlinear dynamical systems approaches to analyze the kinematics, especially the speed data, the new methods capture all facial micropeaks. These include as well the nuances of different affective micro expressions. We offer new ways to differentiate dynamical and geometric patterns present in autistic individuals from those found more commonly in neurotypical development.
zh
[CV-172] Anatomic Feature Fusion Model for Diagnosing Calcified Pulmonary Nodules on Chest X-Ray
【速读】:该论文旨在解决胸部X光片中肺结节钙化特征识别不准确的问题,该问题在临床实践中主要依赖医生主观判断,导致诊断一致性差,且因肋骨、脊柱等解剖结构重叠干扰,难以精准识别钙化模式。解决方案的关键在于提出一种融合原始图像与结构抑制变体特征的分类模型,通过减少结构干扰来提升钙化判别性能,最终在包含2,517张无病灶图像和656张结节图像(其中151例钙化、550例非钙化)的数据集上实现了86.52%的准确率和0.8889的AUC,显著优于仅使用原始图像训练的模型。
链接: https://arxiv.org/abs/2508.12562
作者: Hyeonjin Choi,Yang-gon Kim,Dong-yeon Yoo,Ju-sung Sun,Jung-won Lee
机构: Ajou University (亚洲大学); Ajou University Hospital (亚洲大学医院)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注: 8 pages, 4 figures
Abstract:Accurate and timely identification of pulmonary nodules on chest X-rays can differentiate between life-saving early treatment and avoidable invasive procedures. Calcification is a definitive indicator of benign nodules and is the primary foundation for diagnosis. In actual practice, diagnosing pulmonary nodule calcification on chest X-rays predominantly depends on the physician’s visual assessment, resulting in significant diversity in interpretation. Furthermore, overlapping anatomical elements, such as ribs and spine, complicate the precise identification of calcification patterns. This study presents a calcification classification model that attains strong diagnostic performance by utilizing fused features derived from raw images and their structure-suppressed variants to reduce structural interference. We used 2,517 lesion-free images and 656 nodule images (151 calcified nodules and 550 non-calcified nodules), all obtained from Ajou University Hospital. The suggested model attained an accuracy of 86.52% and an AUC of 0.8889 in calcification diagnosis, surpassing the model trained on raw images by 3.54% and 0.0385, respectively.
zh
[CV-173] Segmenting Thalamic Nuclei: T1 Maps Provide a Reliable and Efficient Solution
【速读】:该论文旨在解决丘脑核团(thalamic nuclei)分割中最佳影像输入模态不明确的问题,以提升神经疾病研究、脑功能解析及临床干预的准确性。其解决方案的关键在于系统评估多种MRI对比度(包括MPRAGE、FGATIR序列、定量质子密度(PD)和T1图谱,以及不同反转时间(multi-TI)的T1加权图像),并创新性地引入基于梯度的显著性分析结合蒙特卡洛丢弃(Monte Carlo dropout)的方法,提出整体重要性评分(Overall Importance Score)来量化各图像对分割性能的贡献,从而筛选出最优输入;结果表明,仅使用T1图谱即可实现优异的定量与定性分割效果,而PD图谱无显著增益,验证了T1图谱作为高效可靠输入的价值。
链接: https://arxiv.org/abs/2508.12508
作者: Anqi Feng,Zhangxing Bian,Samuel W. Remedios,Savannah P. Hays,Blake E. Dewey,Jiachen Zhuo,Dan Benjamini,Jerry L. Prince
机构: 11; 22; 33; 44; 55
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
备注:
Abstract:Accurate thalamic nuclei segmentation is crucial for understanding neurological diseases, brain functions, and guiding clinical interventions. However, the optimal inputs for segmentation remain unclear. This study systematically evaluates multiple MRI contrasts, including MPRAGE and FGATIR sequences, quantitative PD and T1 maps, and multiple T1-weighted images at different inversion times (multi-TI), to determine the most effective inputs. For multi-TI images, we employ a gradient-based saliency analysis with Monte Carlo dropout and propose an Overall Importance Score to select the images contributing most to segmentation. A 3D U-Net is trained on each of these configurations. Results show that T1 maps alone achieve strong quantitative performance and superior qualitative outcomes, while PD maps offer no added value. These findings underscore the value of T1 maps as a reliable and efficient input among the evaluated options, providing valuable guidance for optimizing imaging protocols when thalamic structures are of clinical or research interest.
zh
[CV-174] FractMorph: A Fractional Fourier-Based Multi-Domain Transformer for Deformable Image Registration
【速读】:该论文旨在解决医学图像中非刚性形变配准(Deformable Image Registration, DIR)任务中存在的难题,即如何在统一框架内同时捕捉精细局部形变与大尺度全局形变。现有方法往往难以兼顾这两种不同尺度的变形特征,导致配准精度受限。其解决方案的关键在于提出一种基于3D双并行Transformer架构的FractMorph模型,通过引入多域分数阶傅里叶变换(Fractional Fourier Transform, FrFT)分支,在0°、45°和90°等不同角度下并行提取局部、半全局和全局特征,并利用Fractional Cross-Attention(FCA)模块实现固定图像与移动图像流之间的跨图像特征融合;随后由轻量级U-Net结构预测稠密形变场。该设计无需分层多尺度网络或场景特定调参,即可高效建模复杂非刚性形变,在ACDC心脏MRI数据集上达到当前最优性能(整体Dice相似系数86.45%,HD95为1.54 mm)。
链接: https://arxiv.org/abs/2508.12445
作者: Shayan Kebriti,Shahabedin Nabavi,Ali Gooya
机构: 未知
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Deformable image registration (DIR) is a crucial and challenging technique for aligning anatomical structures in medical images and is widely applied in diverse clinical applications. However, existing approaches often struggle to capture fine-grained local deformations and large-scale global deformations simultaneously within a unified framework. We present FractMorph, a novel 3D dual-parallel transformer-based architecture that enhances cross-image feature matching through multi-domain fractional Fourier transform (FrFT) branches. Each Fractional Cross-Attention (FCA) block applies parallel FrFTs at fractional angles of 0°, 45°, 90°, along with a log-magnitude branch, to effectively extract local, semi-global, and global features at the same time. These features are fused via cross-attention between the fixed and moving image streams. A lightweight U-Net style network then predicts a dense deformation field from the transformer-enriched features. On the ACDC cardiac MRI dataset, FractMorph achieves state-of-the-art performance with an overall Dice Similarity Coefficient (DSC) of 86.45%, an average per-structure DSC of 75.15%, and a 95th-percentile Hausdorff distance (HD95) of 1.54 mm on our data split. We also introduce FractMorph-Light, a lightweight variant of our model with only 29.6M parameters, which maintains the superior accuracy of the main model while using approximately half the memory. Our results demonstrate that multi-domain spectral-spatial attention in transformers can robustly and efficiently model complex non-rigid deformations in medical images using a single end-to-end network, without the need for scenario-specific tuning or hierarchical multi-scale networks. The source code of our implementation is available at this https URL.
zh
[CV-175] DermINO: Hybrid Pretraining for a Versatile Dermatology Foundation Model
【速读】:该论文旨在解决皮肤疾病诊断中面临的三大挑战:全球高发病率带来的医疗负担、复杂且依赖专家的诊断流程,以及资源匮乏地区皮肤科医生严重短缺的问题。当前人工智能(AI)模型受限于对大规模人工标注数据的依赖和任务特异性,难以在真实临床环境中广泛应用。其解决方案的关键在于提出一种名为DermNIO的多功能基础模型(foundation model),通过创新的混合预训练框架——结合自监督学习、半监督学习与知识引导的原型初始化机制,显著提升了模型对复杂皮肤病的理解能力和跨任务泛化性能。该方法不仅在20个不同数据集上全面超越现有最先进模型,在恶性肿瘤分类、病情分级、多类别诊断及图像描述等高阶临床任务中表现优异,还在隐私保护的联邦学习场景下展现出强鲁棒性,且在不同肤色和性别群体中保持一致性,最终在盲法读者研究中实现95.79%的诊断准确率,远超人类皮肤科医生的73.66%,同时有效提升临床医师诊断能力17.21%。
链接: https://arxiv.org/abs/2508.12190
作者: Jingkai Xu,De Cheng,Xiangqian Zhao,Jungang Yang,Zilong Wang,Xinyang Jiang,Xufang Luo,Lili Chen,Xiaoli Ning,Chengxu Li,Xinzhu Zhou,Xuejiao Song,Ang Li,Qingyue Xia,Zhou Zhuang,Hongfei Ouyang,Ke Xue,Yujun Sheng,Rusong Meng,Feng Xu,Xi Yang,Weimin Ma,Yusheng Lee,Dongsheng Li,Xinbo Gao,Jianming Liang,Lili Qiu,Nannan Wang,Xianbo Zuo,Cui Yong
机构: China-Japan Friendship Hospital (中日友好医院); Xidian University (西安电子科技大学); Peking University China-Japan Friendship School of Clinical Medicine (北京大学中日友好临床医学院); Microsoft Research Asia (微软亚洲研究院); State Key Laboratory of Integrated Services Networks (ISN) (综合业务网络国家重点实验室); Institute of Clinical Medicine for China-Japan Friendship Hospital (中日友好医院临床医学研究所); China-Japan Friendship Hospital (Institute of Clinical Medical Sciences), Chinese Academy of Medical Sciences & Peking Union Medical College (中日友好医院(临床医学科学研究所),中国医学科学院与北京协和医学院); Big Data Center, China-Japan Friendship Hospital (中日友好医院大数据中心); Scientific-skincare Innovation Alliance (SIA) (科学护肤创新联盟); Biomedical Informatics and Data Science, Arizona State University (亚利桑那州立大学生物信息学与数据科学系)
类目: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV)
备注:
Abstract:Skin diseases impose a substantial burden on global healthcare systems, driven by their high prevalence (affecting up to 70% of the population), complex diagnostic processes, and a critical shortage of dermatologists in resource-limited areas. While artificial intelligence(AI) tools have demonstrated promise in dermatological image analysis, current models face limitations-they often rely on large, manually labeled datasets and are built for narrow, specific tasks, making them less effective in real-world settings. To tackle these limitations, we present DermNIO, a versatile foundation model for dermatology. Trained on a curated dataset of 432,776 images from three sources (public repositories, web-sourced images, and proprietary collections), DermNIO incorporates a novel hybrid pretraining framework that augments the self-supervised learning paradigm through semi-supervised learning and knowledge-guided prototype initialization. This integrated method not only deepens the understanding of complex dermatological conditions, but also substantially enhances the generalization capability across various clinical tasks. Evaluated across 20 datasets, DermNIO consistently outperforms state-of-the-art models across a wide range of tasks. It excels in high-level clinical applications including malignancy classification, disease severity grading, multi-category diagnosis, and dermatological image caption, while also achieving state-of-the-art performance in low-level tasks such as skin lesion segmentation. Furthermore, DermNIO demonstrates strong robustness in privacy-preserving federated learning scenarios and across diverse skin types and sexes. In a blinded reader study with 23 dermatologists, DermNIO achieved 95.79% diagnostic accuracy (versus clinicians’ 73.66%), and AI assistance improved clinician performance by 17.21%.
zh
[CV-176] Statistical analysis of multivariate planar curves and applications to X-ray classification
【速读】:该论文旨在解决医学图像分析中利用分割图像中的轮廓信息进行监督分类的问题,特别是如何有效建模和利用多个目标结构的形状特征以提升诊断准确性。其解决方案的关键在于提出了一种新的多平面曲线(multivariate planar curves)形式化框架,扩展了单个随机平面曲线的统计形状分析方法,从而实现对多对象形状的联合分析;同时引入了针对形状对齐问题的解决方案,并通过切空间投影(tangent projections)将所得多变量形状变量嵌入功能分类方法中,显著提升了模型在心脏肥大检测等任务中的鲁棒性和有效性。
链接: https://arxiv.org/abs/2508.11780
作者: Moindjié Issam-Ali,Descary Marie-Hélène,Beaulac Cédric
机构: Université de Perpignan (佩皮尼昂大学)
类目: Methodology (stat.ME); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
备注:
Abstract:Recent developments in computer vision have enabled the availability of segmented images across various domains, such as medicine, where segmented radiography images play an important role in diagnosis-making. As prediction problems are common in medical image analysis, this work explores the use of segmented images (through the associated contours they highlight) as predictors in a supervised classification context. Consequently, we develop a new approach for image analysis that takes into account the shape of objects within images. For this aim, we introduce a new formalism that extends the study of single random planar curves to the joint analysis of multiple planar curves-referred to here as multivariate planar curves. In this framework, we propose a solution to the alignment issue in statistical shape analysis. The obtained multivariate shape variables are then used in functional classification methods through tangent projections. Detection of cardiomegaly in segmented X-rays and numerical experiments on synthetic data demonstrate the appeal and robustness of the proposed method.
zh
[CV-177] BeeNet: Reconstructing Flower Shapes from Electric Fields using Deep Learning
【速读】:该论文旨在解决如何利用昆虫(如蜜蜂)与花朵相互作用时产生的电场信息来重建花朵几何形状的问题,从而揭示电感受(electroreception)在环境感知中的空间编码能力。解决方案的关键在于开发了一种基于深度学习的UNet模型,该模型通过训练模拟的电场数据(来自不同花瓣几何形状的花与带电蜜蜂的相互作用)实现对花朵形状的高精度重建,且能在未见过的复杂花形上保持良好泛化性能,同时发现电场信息的形状编码效率随蜂-花距离变化存在最优值,表明电场可携带丰富的空间细节信息。
链接: https://arxiv.org/abs/2508.11724
作者: Jake Turley,Ryan A. Palmer,Isaac V. Chenchiah,Daniel Robert
机构: University of Bristol (布里斯托大学); National University of Singapore (新加坡国立大学)
类目: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV)
备注: 14 pages, 4 figures
Abstract:Arthropods, including pollinators, respond to environmental electrical fields. Here, we show that electric field information can be decoded to reconstruct environmental features. We develop an algorithm capable of inferring the shapes of polarisable flowers from the electric field generated by a nearby charged bee. We simulated electric fields arising from bee flower interactions for flowers with varying petal geometries. These simulated data were used to train a deep learning UNet model to recreate petal shapes. The model accurately reconstructed diverse flower shapes including more complex flower shapes not included in training. Reconstruction performance peaked at an optimal bee flower distance, indicating distance-dependent encoding of shape information. These findings show that electroreception can impart rich spatial detail, offering insights into arthropod environmental perception.
zh
[CV-178] Data-driven RF Tomography via Cross-modal Sensing and Continual Learning
【速读】:该论文旨在解决动态环境中地下目标(如根茎类作物)检测的准确性和鲁棒性问题,尤其针对射频(RF)信号在复杂土壤条件下易受干扰导致成像质量下降的挑战。其解决方案的关键在于提出了一种数据驱动的射频断层成像框架(DRIFT),通过两个核心机制实现:一是设计了融合RF与视觉传感器的跨模态感知系统,并采用跨模态学习方法训练深度神经网络(DNN)模型以提升图像重建性能;二是引入持续学习(continual learning)策略,在环境变化被检测到时自动更新DNN模型,从而适应动态场景下的信号漂移。实验表明,该方法相较现有最优方案平均等效直径误差降低23.2%,达到2.29 cm。
链接: https://arxiv.org/abs/2508.11654
作者: Yang Zhao,Tao Wang,Said Elhadi
机构: Harbin Institute of Technology, Shenzhen (哈尔滨工业大学深圳校区)
类目: ignal Processing (eess.SP); Computer Vision and Pattern Recognition (cs.CV)
备注: 6 pages, 4 figures, to be published in IEEE AVSS Conference
Abstract:Data-driven radio frequency (RF) tomography has demonstrated significant potential for underground target detection, due to the penetrative nature of RF signals through soil. However, it is still challenging to achieve accurate and robust performance in dynamic environments. In this work, we propose a data-driven radio frequency tomography (DRIFT) framework with the following key components to reconstruct cross section images of underground root tubers, even with significant changes in RF signals. First, we design a cross-modal sensing system with RF and visual sensors, and propose to train an RF tomography deep neural network (DNN) model following the cross-modal learning approach. Then we propose to apply continual learning to automatically update the DNN model, once environment changes are detected in a dynamic environment. Experimental results show that our approach achieves an average equivalent diameter error of 2.29 cm, 23.2% improvement upon the state-of-the-art approach. Our DRIFT code and dataset are publicly available on this https URL.
zh
人工智能
[AI-0] Exploring Autonomous Agents : A Closer Look at Why They Fail When Completing Tasks
【速读】:该论文试图解决当前自主代理系统(autonomous agent systems)在评估过程中过度依赖成功完成任务的比率,而缺乏对系统内部交互、通信机制及失败原因的系统性分析的问题。其解决方案的关键在于构建了一个包含34个代表性可编程任务的基准测试集,并基于此对三种主流开源代理框架与两种大语言模型(Large Language Models, LLMs)组合进行系统评估,发现整体任务完成率约为50%。进一步通过深入的失败分析,提出一个三层次失败归因分类法(failure taxonomy),对应任务执行的不同阶段:规划错误(planning errors)、任务执行问题(task execution issues)和响应生成错误(incorrect response generation)。该分类体系为提升代理系统的规划能力与自我诊断能力提供了实证依据和可操作的改进方向。
链接: https://arxiv.org/abs/2508.13143
作者: Ruofan Lu,Yichen Li,Yintong Huo
机构: 未知
类目: Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注: Accepted by ASE 2025 NIER
Abstract:Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the interactions, communication mechanisms, and failure causes within these systems. To bridge this gap, we present a benchmark of 34 representative programmable tasks designed to rigorously assess autonomous agents. Using this benchmark, we evaluate three popular open-source agent frameworks combined with two LLM backbones, observing a task completion rate of approximately 50%. Through in-depth failure analysis, we develop a three-tier taxonomy of failure causes aligned with task phases, highlighting planning errors, task execution issues, and incorrect response generation. Based on these insights, we propose actionable improvements to enhance agent planning and self-diagnosis capabilities. Our failure taxonomy, together with mitigation advice, provides an empirical foundation for developing more robust and effective autonomous agent systems in the future.
zh
[AI-1] Bayesian Optimization-based Search for Agent Control in Automated Game Testing
【速读】:该论文旨在解决游戏开发中自动化测试效率低、难以全面覆盖潜在bug的问题,尤其是在复杂游戏地图中实现高效且分布均衡的探索。其解决方案的关键在于引入基于贝叶斯优化(Bayesian Optimization, BO)的采样策略,结合专为游戏测试设计的网格地图模型,该模型具备平滑性与不确定性估计能力,同时避免了传统方法在扩展性上的瓶颈。通过动态选择信息增益最大的采样点,该方法显著提升了测试过程的时间效率和地图覆盖的均匀性。
链接: https://arxiv.org/abs/2508.13121
作者: Carlos Celemin
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This work introduces an automated testing approach that employs agents controlling game characters to detect potential bugs within a game level. Harnessing the power of Bayesian Optimization (BO) to execute sample-efficient search, the method determines the next sampling point by analyzing the data collected so far and calculates the data point that will maximize information acquisition. To support the BO process, we introduce a game testing-specific model built on top of a grid map, that features the smoothness and uncertainty estimation required by BO, however and most importantly, it does not suffer the scalability issues that traditional models carry. The experiments demonstrate that the approach significantly improves map coverage capabilities in both time efficiency and exploration distribution.
zh
[AI-2] Contrastive Representations for Temporal Reasoning
【速读】:该论文旨在解决经典人工智能中感知与规划分离导致的局限性问题,即如何从同时捕捉感知结构和时间结构的表征中自然涌现出时间推理能力。传统方法依赖状态表示进行感知,通过搜索实现规划,但难以有效整合时序信息。解决方案的关键在于提出组合式时间推理表征(Combinatorial Representations for Temporal Reasoning, CRTR),其核心创新是采用一种负采样策略,可理论证明地消除由伪特征(spurious features)引入的干扰,从而使得学习到的表征能够准确编码时间结构。CRTR在具有复杂时序结构的任务(如Sokoban和魔方)上表现优异,尤其在魔方任务中实现了无需外部搜索算法即可高效求解任意初始状态的能力。
链接: https://arxiv.org/abs/2508.13113
作者: Alicja Ziarko,Michal Bortkiewicz,Michal Zawalski,Benjamin Eysenbach,Piotr Milos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Project website: this https URL
Abstract:In classical AI, perception relies on learning state-based representations, while planning, which can be thought of as temporal reasoning over action sequences, is typically achieved through search. We study whether such reasoning can instead emerge from representations that capture both perceptual and temporal structure. We show that standard temporal contrastive learning, despite its popularity, often fails to capture temporal structure due to its reliance on spurious features. To address this, we introduce Combinatorial Representations for Temporal Reasoning (CRTR), a method that uses a negative sampling scheme to provably remove these spurious features and facilitate temporal reasoning. CRTR achieves strong results on domains with complex temporal structure, such as Sokoban and Rubik’s Cube. In particular, for the Rubik’s Cube, CRTR learns representations that generalize across all initial states and allow it to solve the puzzle using fewer search steps than BestFS, though with longer solutions. To our knowledge, this is the first method that efficiently solves arbitrary Cube states using only learned representations, without relying on an external search algorithm.
zh
[AI-3] VerilogLAVD: LLM -Aided Rule Generation for Vulnerability Detection in Verilog
【速读】:该论文旨在解决硬件设计早期阶段中漏洞检测效率低、依赖专业安全知识且现有大语言模型(Large Language Models, LLMs)难以有效捕捉Verilog代码结构导致检测结果不一致的问题。其解决方案的关键在于提出VerilogLAVD,一种基于LLM辅助的图遍历规则生成方法:首先构建统一的Verilog属性图(Verilog Property Graph, VeriPG),融合抽象语法树(Abstract Syntax Tree, AST)的语法特征与控制流和数据依赖图的语义信息;随后利用LLMs从通用弱点枚举(Common Weakness Enumeration, CWE)描述中自动提取检测规则,并由规则执行器在VeriPG上进行遍历以识别潜在漏洞。此方法显著提升了检测准确性和稳定性,在77个Verilog设计上的F1-score达到0.54,优于纯LLM和引入外部知识基线的方法。
链接: https://arxiv.org/abs/2508.13092
作者: Xiang Long,Yingjie Xia,Xiyuan Chen,Li Kuang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Timely detection of hardware vulnerabilities during the early design stage is critical for reducing remediation costs. Existing early detection techniques often require specialized security expertise, limiting their usability. Recent efforts have explored the use of large language models (LLMs) for Verilog vulnerability detection. However, LLMs struggle to capture the structure in Verilog code, resulting in inconsistent detection results. To this end, we propose VerilogLAVD, the first LLM-aided graph traversal rule generation approach for Verilog vulnerability detection. Our approach introduces the Verilog Property Graph (VeriPG), a unified representation of Verilog code. It combines syntactic features extracted from the abstract syntax tree (AST) with semantic information derived from control flow and data dependency graphs. We leverage LLMs to generate VeriPG-based detection rules from Common Weakness Enumeration (CWE) descriptions. These rules guide the rule executor that traversal VeriPG for potential vulnerabilities. To evaluate VerilogLAVD, we build a dataset collected from open-source repositories and synthesized data. In our empirical evaluation on 77 Verilog designs encompassing 12 CWE types, VerilogLAVD achieves an F1-score of 0.54. Compared to the LLM-only and LLM with external knowledge baselines, VerilogLAVD improves F1-score by 0.31 and 0.27, respectively.
zh
[AI-4] A Language-Signal-Vision Multimodal Framework for Multitask Cardiac Analysis
【速读】:该论文旨在解决心血管管理中多模态心脏数据融合的局限性问题,具体包括:患者和时间对齐的多模态数据稀缺、单一模态或固定组合输入的依赖、以跨模态相似性优先而非互补性的对齐策略,以及任务单一化导致的临床应用瓶颈。其解决方案的核心在于提出一个统一的多任务多模态融合框架——Textual Guidance Multimodal fusion for Multiple cardiac tasks (TGMM),该框架包含三个关键组件:1)MedFlexFusion模块,用于动态捕捉不同心脏模态的独特性和互补性特征并进行灵活融合;2)文本引导模块,根据不同的临床目标(如疾病诊断、风险分层和信息检索)生成任务相关的表征;3)响应模块,综合输出所有任务的最终决策结果。实验表明,TGMM在多个临床任务上均优于现有最先进方法,并在公开数据集上验证了其鲁棒性。
链接: https://arxiv.org/abs/2508.13072
作者: Yuting Zhang,Tiantian Geng,Luoying Hao,Xinxing Cheng,Alexander Thorley,Xiaoxia Wang,Wenqi Lu,Sandeep S Hothi,Lei Wei,Zhaowen Qiu,Dipak Kotecha,Jinming Duan
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Contemporary cardiovascular management involves complex consideration and integration of multimodal cardiac datasets, where each modality provides distinct but complementary physiological characteristics. While the effective integration of multiple modalities could yield a holistic clinical profile that accurately models the true clinical situation with respect to data modalities and their relatives weightings, current methodologies remain limited by: 1) the scarcity of patient- and time-aligned multimodal data; 2) reliance on isolated single-modality or rigid multimodal input combinations; 3) alignment strategies that prioritize cross-modal similarity over complementarity; and 4) a narrow single-task focus. In response to these limitations, a comprehensive multimodal dataset was curated for immediate application, integrating laboratory test results, electrocardiograms, and echocardiograms with clinical outcomes. Subsequently, a unified framework, Textual Guidance Multimodal fusion for Multiple cardiac tasks (TGMM), was proposed. TGMM incorporated three key components: 1) a MedFlexFusion module designed to capture the unique and complementary characteristics of medical modalities and dynamically integrate data from diverse cardiac sources and their combinations; 2) a textual guidance module to derive task-relevant representations tailored to diverse clinical objectives, including heart disease diagnosis, risk stratification and information retrieval; and 3) a response module to produce final decisions for all these tasks. Furthermore, this study systematically explored key features across multiple modalities and elucidated their synergistic contributions in clinical decision-making. Extensive experiments showed that TGMM outperformed state-of-the-art methods across multiple clinical tasks, with additional validation confirming its robustness on another public dataset.
zh
[AI-5] Hierarchical Evaluation Function (HEF): A Multi-Metric Approach for Optimizing Demand Forecasting Models
【速读】:该论文旨在解决多变量时间序列建模中因数据复杂性、不确定性及频繁的制度转换所导致的需求预测精度不足问题,尤其关注传统评估指标可能引入偏差并限制模型泛化能力的局限。其解决方案的关键在于提出两种定制化的评估函数——FMAE(Focused Mean Absolute Error)和HEF(Hierarchical Evaluation Function),其中HEF通过加权全局指标并惩罚大偏差,显著提升模型在R²、相对准确率、RMSE和RMSSE等全局指标上的表现,增强模型的鲁棒性和解释力;而FMAE则在局部指标(如MAE、MASE)和计算效率上更具优势,适用于短期运营场景。研究揭示了HEF与FMAE之间的方法论权衡:HEF更适合战略规划,FMAE更适配操作效率优化,从而为动态环境中预测模型的优化提供了可复现的框架。
链接: https://arxiv.org/abs/2508.13057
作者: Adolfo González,Víctor Parada
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Performance (cs.PF)
备注: 31 pages, 15 figures, 110 tables. Submitted as a preprint. The manuscript introduces the Hierarchical Evaluation Function (HEF), a multi-metric framework for optimizing demand forecasting models under high uncertainty. Includes extensive experimental validation using real-world datasets and a comparative analysis against classical and modern methods
Abstract:Demand forecasting is essential for strategic planning in competitive environments, enabling resource optimization and improved responsiveness to market dynamics. However, multivariate time series modeling faces challenges due to data complexity, uncertainty, and frequent regime shifts. Traditional evaluation metrics can introduce biases and limit generalization. This work compares two custom evaluation functions: FMAE (Focused Mean Absolute Error), focused on minimizing absolute errors, and HEF (Hierarchical Evaluation Function), designed to weight global metrics and penalize large deviations. Experiments were conducted under different data splits (91:9, 80:20, 70:30) using three optimizers (Grid Search, PSO, Optuna), assessing fit, relative accuracy, robustness, and computational efficiency. Results show that HEF consistently outperforms FMAE in global metrics (R2, Relative Accuracy, RMSE, RMSSE), enhancing model robustness and explanatory power. These findings were confirmed via visualizations and statistical tests. Conversely, FMAE offers advantages in local metrics (MAE, MASE) and execution time, making it suitable for short-term scenarios. The study highlights a methodological trade-off: HEF is ideal for strategic planning, while FMAE is better suited for operational efficiency. A replicable framework is proposed for optimizing predictive models in dynamic environments.
zh
[AI-6] Using AI for User Representation: An Analysis of 83 Persona Prompts CCS
【速读】:该论文旨在解决当前研究中使用大语言模型(Large Language Models, LLMs)生成用户画像(User Persona)时存在的方法不统一、结构化程度不足及对传统丰富画像范式偏离的问题。其解决方案的关键在于系统性分析83个来自27篇文献的提示词(persona prompts),揭示出当前实践中普遍采用单一人格描述、偏好简洁文本输出、常结合数值与文本属性、并广泛嵌入动态变量或要求结构化输出(如JSON)等特征,从而为未来更规范、可复现且贴近真实用户表征的计算型用户画像生成提供实证依据与改进方向。
链接: https://arxiv.org/abs/2508.13047
作者: Joni Salminen,Danial Amin,Bernard Jansen
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: Accepted at AICCSA-2025
Abstract:We analyzed 83 persona prompts from 27 research articles that used large language models (LLMs) to generate user personas. Findings show that the prompts predominantly generate single personas. Several prompts express a desire for short or concise persona descriptions, which deviates from the tradition of creating rich, informative, and rounded persona profiles. Text is the most common format for generated persona attributes, followed by numbers. Text and numbers are often generated together, and demographic attributes are included in nearly all generated personas. Researchers use up to 12 prompts in a single study, though most research uses a small number of prompts. Comparison and testing multiple LLMs is rare. More than half of the prompts require the persona output in a structured format, such as JSON, and 74% of the prompts insert data or dynamic variables. We discuss the implications of increased use of computational personas for user representation.
zh
[AI-7] he Application of Transformer-Based Models for Predicting Consequences of Cyber Attacks
【速读】:该论文旨在解决日益复杂且频发的网络攻击中,如何自动化评估攻击描述并预测其潜在后果的问题。传统威胁建模依赖人工分析,难以应对当前攻击的规模与复杂性,因此亟需高效、准确的自动方法。解决方案的关键在于利用自然语言处理(Natural Language Processing, NLP)技术,特别是基于预训练语言模型 BERT 与分层注意力网络(Hierarchical Attention Networks, HANs)的深度学习架构,对 MITRE CWE 数据库中的文本攻击描述进行多标签分类,将攻击后果划分为可用性(Availability)、访问控制(Access Control)、机密性(Confidentiality)、完整性(Integrity)及其他(Other)五类。实验表明,BERT 在整体准确率(0.972)和精度-召回率平衡方面显著优于 CNN 和 LSTM 基线模型,展现出更强的泛化能力和对网络安全语义的理解能力,是预测攻击后果更可靠的技术路径。
链接: https://arxiv.org/abs/2508.13030
作者: Bipin Chhetri,Akbar Siami Namin
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 21 pages, 6 figures,Proceedings of the IEEE International Conference on Computers, Software, Applications (COMPSAC), EATA Symposium, Toronto, Canada, July 8-11, 2025
Abstract:Cyberattacks are increasing, and securing against such threats is costing industries billions of dollars annually. Threat Modeling, that is, comprehending the consequences of these attacks, can provide critical support to cybersecurity professionals, enabling them to take timely action and allocate resources that could be used elsewhere. Cybersecurity is heavily dependent on threat modeling, as it assists security experts in assessing and mitigating risks related to identifying vulnerabilities and threats. Recently, there has been a pressing need for automated methods to assess attack descriptions and forecast the future consequences of the increasing complexity of cyberattacks. This study examines how Natural Language Processing (NLP) and deep learning can be applied to analyze the potential impact of cyberattacks by leveraging textual descriptions from the MITRE Common Weakness Enumeration (CWE) database. We emphasize classifying attack consequences into five principal categories: Availability, Access Control, Confidentiality, Integrity, and Other. This paper investigates the use of Bidirectional Encoder Representations from Transformers (BERT) in combination with Hierarchical Attention Networks (HANs) for Multi-label classification, evaluating their performance in comparison with conventional CNN and LSTM-based models. Experimental findings show that BERT achieves an overall accuracy of 0.972 , far higher than conventional deep learning models in multi-label classification. HAN outperforms baseline forms of CNN and LSTM-based models on specific cybersecurity labels. However, BERT consistently achieves better precision and recall, making it more suitable for predicting the consequences of a cyberattack.
zh
[AI-8] G2RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance
【速读】:该论文旨在解决小尺寸语言模型(Small Language Models, SLMs)在强化学习中使用可验证奖励(Reinforcement Learning with Verifiable Rewards, RLVR)时性能提升有限的问题,因为RLVR的效果高度依赖于具备丰富世界知识的基座模型(base models)。为克服这一局限,论文提出Guided GRPO方法,其关键在于通过向roll-out轨迹中注入真实推理步骤(ground-truth reasoning steps)来补偿SLMs的知识不足。进一步地,研究发现直接添加指导信息效果有限,因此提出G²RPO-A算法,该算法能根据训练动态自适应调整指导强度,从而显著提升SLMs在数学推理和代码生成任务上的表现。
链接: https://arxiv.org/abs/2508.13023
作者: Yongxin Guo,Wenbo Deng,Zhenglin Cheng,Xiaoying Tang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement Learning with Verifiable Rewards (RLVR) has markedly enhanced the reasoning abilities of large language models (LLMs). Its success, however, largely depends on strong base models with rich world knowledge, yielding only modest improvements for small-size language models (SLMs). To address this limitation, we investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories to compensate for SLMs’ inherent weaknesses. Through a comprehensive study of various guidance configurations, we find that naively adding guidance delivers limited gains. These insights motivate G ^2 RPO-A, an adaptive algorithm that automatically adjusts guidance strength in response to the model’s evolving training dynamics. Experiments on mathematical reasoning and code-generation benchmarks confirm that G ^2 RPO-A substantially outperforms vanilla GRPO. Our code and models are available at this https URL.
zh
[AI-9] -boost: Boosted E-Graph Extraction with Adaptive Heuristics and Exact Solving
【速读】:该论文旨在解决e-graph(等价图)提取中的性能瓶颈问题,即在逻辑综合与形式验证等领域中,从指数级数量的等价表达式中高效且准确地识别最优项这一NP-hard组合优化难题。传统方法面临显著权衡:启发式方法虽快但不保证最优性,而精确方法(如整数线性规划ILP)虽能获得最优解却因计算成本过高难以应用于实际场景。本文提出e-boost框架,其核心创新在于三点:(1) 并行化启发式提取,利用弱数据依赖关系并发计算DAG(有向无环图)成本,实现多线程加速且不损失质量;(2) 自适应搜索空间剪枝,通过参数化阈值机制保留高潜力候选解,大幅压缩解空间同时保持近优性能;(3) 初始化精确求解,将简化后的问题建模为带热启动能力的整数线性规划,引导求解器更快收敛至高质量解。实验表明,e-boost相较传统ILP方法提升558倍运行速度,并优于当前最先进框架SmoothE达19.04%,在真实逻辑综合任务中分别实现7.6%和8.1%的面积优化。
链接: https://arxiv.org/abs/2508.13020
作者: Jiaqi Yin,Zhan Song,Chen Chen,Yaohui Cai,Zhiru Zhang,Cunxi Yu
机构: 未知
类目: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:E-graphs have attracted growing interest in many fields, particularly in logic synthesis and formal verification. E-graph extraction is a challenging NP-hard combinatorial optimization problem. It requires identifying optimal terms from exponentially many equivalent expressions, serving as the primary performance bottleneck in e-graph based optimization tasks. However, traditional extraction methods face a critical trade-off: heuristic approaches offer speed but sacrifice optimality, while exact methods provide optimal solutions but face prohibitive computational costs on practical problems. We present e-boost, a novel framework that bridges this gap through three key innovations: (1) parallelized heuristic extraction that leverages weak data dependence to compute DAG costs concurrently, enabling efficient multi-threaded performance without sacrificing extraction quality; (2) adaptive search space pruning that employs a parameterized threshold mechanism to retain only promising candidates, dramatically reducing the solution space while preserving near-optimal solutions; and (3) initialized exact solving that formulates the reduced problem as an Integer Linear Program with warm-start capabilities, guiding solvers toward high-quality solutions faster. Across the diverse benchmarks in formal verification and logic synthesis fields, e-boost demonstrates 558x runtime speedup over traditional exact approaches (ILP) and 19.04% performance improvement over the state-of-the-art extraction framework (SmoothE). In realistic logic synthesis tasks, e-boost produces 7.6% and 8.1% area improvements compared to conventional synthesis tools with two different technology mapping libraries. e-boost is available at this https URL. Subjects: Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR) Cite as: arXiv:2508.13020 [cs.AI] (or arXiv:2508.13020v1 [cs.AI] for this version) https://doi.org/10.48550/arXiv.2508.13020 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-10] EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)数学推理能力评估中存在的三大核心问题:评分饱和(score saturation)、时间衰减(temporal decay)和数据污染(data contamination)。为应对这些挑战,作者提出了一种基于进化测试(evolutionary testing)的自动化数学基准生成与演化框架——EvolMathEval。其关键创新在于通过从头生成唯一的问题实例从根本上杜绝数据污染,并利用多维遗传算子注入多样化的认知挑战,同时设计了一个复合适应度函数以高效精准地量化问题难度。实验表明,该框架不仅能持续生成高难度问题,还能显著提升公共数据集(如GSM8K)的复杂性,使模型准确率平均下降48%,并揭示出LLMs在解决复杂问题时倾向于采用非严谨启发式策略,导致错误解法的现象,称为“伪顿悟”(Pseudo Aha Moment),该行为占目标问题错误的77%至100%。
链接: https://arxiv.org/abs/2508.13003
作者: Shengbo Wang,Mingwei Liu,Zike Li,Anji Li,Yanlin Wang,Xin Peng,Zibin Zheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of LLMs poses a significant challenge to existing mathematical reasoning benchmarks. These benchmarks commonly suffer from issues such as score saturation, temporal decay, and data contamination. To address this challenge, this paper introduces EvolMathEval, an automated mathematical benchmark generation and evolution framework based on evolutionary testing. By dynamically generating unique evaluation instances ab initio, the framework fundamentally eliminates the risk of data contamination, and ensuring the benchmark remains perpetually challenging for future this http URL core mechanisms of EvolMathEval include: seed problem generation based on reverse engineering with algebraic guarantees; multi-dimensional genetic operators designed to inject diverse cognitive challenges; and a composite fitness function that can rapidly and accurately assess problem difficulty. Experimental results demonstrate that the proposed composite fitness function can efficiently and precisely quantify the difficulty of mathematical problems. Furthermore, EvolMathEval can not only generate a large volume of high-difficulty problems through continuous self-iteration, but it can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%. Deeper investigation reveals that when solving these evolved, complex problems, LLMs tend to employ non-rigorous heuristics to bypass complex multi-step logical reasoning, consequently leading to incorrect solutions. We define this phenomenon as “Pseudo Aha Moment”. This finding uncovers a cognitive shortcut-taking behavior in the deep reasoning processes of current LLMs, which we find accounts for 77% to 100% of errors on targeted problems. Code and resources are available at:this https URL.
zh
[AI-11] Kourkoutas-Beta: A Sunspike-Driven Adam Optimizer with Desert Flair
【速读】:该论文旨在解决Transformer神经网络在物理驱动问题(如偏微分方程(PDE)代理模型和物理信息神经网络(PINNs))中因边界条件或初始条件变化导致训练损失波动剧烈、梯度尖锐的问题,以及复合损失函数带来的数值不稳定性。其解决方案的关键在于提出Kourkoutas-Beta优化器——一种基于Adam风格的自适应优化方法,通过引入层级动态调整的二阶矩衰减系数β₂,该系数由一个有界“sunspike”比率(当前梯度范数池化值与历史指数移动平均(EMA)之比,压缩至[0,1)区间)决定:当梯度出现尖峰时,β₂降低至最小值;平稳阶段则维持在最大值附近。此机制增强了对尖锐梯度的鲁棒性,同时保持了Adam式的收敛性保证,并在多个基准任务上显著提升稳定性和最终性能。
链接: https://arxiv.org/abs/2508.12996
作者: Stavros C. Kassinos
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 54 pages, 8 figures, 19 tables
Abstract:Transformer neural networks are increasingly used for physics-based problems. In data-driven PDE surrogates, training samples from varying boundary and initial conditions can cause erratic losses and spiky gradients; in physics-informed neural networks (PINNs), stiff composite losses amplify this effect. We introduce Kourkoutas-Beta, an Adam-style optimizer where the fixed second-moment discount beta2 is replaced by a layer-wise dynamic value driven by a bounded sunspike'' ratio: the current pooled gradient norm divided by an exponential moving average (EMA) of past norms, squashed to the interval [0,1). Spikes lower beta2 toward beta2_min; calm phases keep it near beta2_max. Options include leaky-AMSGrad (decay), trust-region clipping (max_ratio), adaptive tiny terms, and several bias-correction modes
none’‘, beta2max'',
exact’). With all features off and bias_correction=``none’', the method is exactly Adam. We test on four settings: (i) a Transformer PDE surrogate (Heat2D), (ii) a 3D PINN for heat conduction (Heat3D), (iii) a lightweight MLX synthetic task with jitter and rare-trigger bursts, and (iv) a character-level Transformer on 30 MB of enwik8 (small-enwik8). Kourkoutas-Beta improves stability and final loss versus fixed-beta2 Adam. On small-enwik8 it lowers bits-per-character by about 38% vs Adam-0.95 and about 58% vs Adam-0.999 over 10 seeds, with smaller variance. The method remains drop-in, with runtime overhead comparable to Adam in testbeds A-C and within single-digit percent in testbed D. It preserves Adam-style convergence guarantees while improving robustness under spiky gradients. Comments: 54 pages, 8 figures, 19 tables Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI) MSC classes: 65K10, 68T07 ACMclasses: I.2.6; G.1.6 Cite as: arXiv:2508.12996 [cs.LG] (or arXiv:2508.12996v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.12996 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-12] SL-ACC: A Communication-Efficient Split Learning Framework with Adaptive Channel-wise Compression
【速读】:该论文旨在解决分布式机器学习(尤其是联邦学习)中因神经网络复杂度增加而导致边缘设备资源受限时,分片学习(Split Learning, SL)因传输大量“碎裂数据”(即激活值和梯度)而成为训练瓶颈的问题。其解决方案的关键在于提出了一种通信高效的SL框架SL-ACC,该框架包含两个核心组件:自适应通道重要性识别(Adaptive Channel Importance Identification, ACII)与通道分组压缩(Channel Grouping Compression, CGC)。ACII利用香农熵(Shannon entropy)量化各通道对模型训练的贡献度,CGC则基于熵值对通道进行分组并实施组内自适应压缩,在显著减少传输数据量的同时保持模型训练精度。
链接: https://arxiv.org/abs/2508.12984
作者: Zehang Lin,Zheng Lin,Miao Yang,Jianhao Huang,Yuxin Zhang,Zihan Fang,Xia Du,Zhe Chen,Shunzhi Zhu,Wei Ni
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Networking and Internet Architecture (cs.NI)
备注: 6 pages, 7 figures
Abstract:The increasing complexity of neural networks poses a significant barrier to the deployment of distributed machine learning (ML) on resource-constrained devices, such as federated learning (FL). Split learning (SL) offers a promising solution by offloading the primary computing load from edge devices to a server via model partitioning. However, as the number of participating devices increases, the transmission of excessive smashed data (i.e., activations and gradients) becomes a major bottleneck for SL, slowing down the model training. To tackle this challenge, we propose a communication-efficient SL framework, named SL-ACC, which comprises two key components: adaptive channel importance identification (ACII) and channel grouping compression (CGC). ACII first identifies the contribution of each channel in the smashed data to model training using Shannon entropy. Following this, CGC groups the channels based on their entropy and performs group-wise adaptive compression to shrink the transmission volume without compromising training accuracy. Extensive experiments across various datasets validate that our proposed SL-ACC framework takes considerably less time to achieve a target accuracy than state-of-the-art benchmarks.
zh
[AI-13] OPTIC-ER: A Reinforcement Learning Framework for Real-Time Emergency Response and Equitable Resource Allocation in Underserved African Communities
【速读】:该论文旨在解决非洲部分地区公共应急服务系统中存在的响应延迟与空间不平等导致的可避免苦难问题。其解决方案的核心在于提出OPTIC-ER框架,该框架基于强化学习(Reinforcement Learning, RL),采用注意力引导的Actor-Critic架构以应对调度环境的复杂性;关键创新包括:1)上下文丰富的状态向量(Context-Rich State Vector),用于编码动作次优性;2)精准奖励函数(Precision Reward Function),通过惩罚低效行为优化决策质量。该系统在尼日利亚河流州真实数据驱动的高保真仿真环境中训练,并借助预计算的旅行时间图谱加速推理,最终在500个未见事件中实现100%最优率且几乎无效率损失,验证了其鲁棒性和泛化能力。
链接: https://arxiv.org/abs/2508.12943
作者: Mary Tonwe
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注: Source code and data available at: this https URL
Abstract:Public service systems in many African regions suffer from delayed emergency response and spatial inequity, causing avoidable suffering. This paper introduces OPTIC-ER, a reinforcement learning (RL) framework for real-time, adaptive, and equitable emergency response. OPTIC-ER uses an attention-guided actor-critic architecture to manage the complexity of dispatch environments. Its key innovations are a Context-Rich State Vector, encoding action sub-optimality, and a Precision Reward Function, which penalizes inefficiency. Training occurs in a high-fidelity simulation using real data from Rivers State, Nigeria, accelerated by a precomputed Travel Time Atlas. The system is built on the TALS framework (Thin computing, Adaptability, Low-cost, Scalability) for deployment in low-resource settings. In evaluations on 500 unseen incidents, OPTIC-ER achieved a 100.00% optimality rate with negligible inefficiency, confirming its robustness and generalization. Beyond dispatch, the system generates Infrastructure Deficiency Maps and Equity Monitoring Dashboards to guide proactive governance and data-informed development. This work presents a validated blueprint for AI-augmented public services, showing how context-aware RL can bridge the gap between algorithmic decision-making and measurable human impact.
zh
[AI-14] owards Open-Ended Emotional Support Conversations in LLM s via Reinforcement Learning with Future-Oriented Rewards
【速读】:该论文旨在解决当前基于大语言模型(Large Language Model, LLM)的情感支持对话(Emotional Support Conversation, ESC)系统在复杂真实场景中响应灵活性不足的问题,其核心挑战在于现有方法依赖预定义策略,难以适应多样化的情绪问题情境。解决方案的关键在于提出一种端到端的强化学习框架(RLFF-ESC),通过多智能体机制模拟未来对话轨迹并获取面向未来的奖励信号,进而训练一个具备长期情感支持能力的策略模型;同时,在响应生成过程中引入显式推理过程,显著提升回复的质量、相关性与情境适配性。
链接: https://arxiv.org/abs/2508.12935
作者: Ting Yang,Li Chen,Huimin Wang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Emotional Support Conversation (ESC) systems aim to alleviate users’ emotional difficulties and provide long-term, systematic support for emotional well-being. However, most large language model (LLM)-based ESC systems rely on predefined strategies, which limits their effectiveness in complex, real-life scenarios. To enable flexible responses to diverse emotional problem scenarios, this paper introduces a novel end-to-end framework (RLFF-ESC) that directly learns enduring emotionally supportive response skills using reinforcement learning. For sustained emotional support, we first employ an LLM-based multi-agent mechanism to simulate future dialogue trajectories and collect future-oriented rewards. We then train a future-oriented reward model, which is subsequently used to train the emotional support policy model. Additionally, we incorporate an explicit reasoning process during response generation to further enhance the quality, relevance, and contextual appropriateness of the system’s responses. We evaluate the backbone policy model on Qwen2.5-7B-Instruct-1M and LLaMA3.1-8B-Instruct models, testing the proposed RLFF-ESC framework across two public ESC datasets. Experimental results demonstrate that RLFF-ESC consistently outperforms existing baselines in terms of goal completion and response quality.
zh
[AI-15] Do Large Language Model Agents Exhibit a Survival Instinct? An Empirical Study in a Sugarscape-Style Simulation
【速读】:该论文试图解决的问题是:在缺乏显式编程的情况下,大型语言模型(Large Language Models, LLMs)是否会在模拟环境中自发表现出类似生存本能的行为,从而对AI系统的安全性与对齐性带来潜在风险或机遇。其解决方案的关键在于设计了一个类Sugarscape的多智能体仿真环境,其中代理(agents)通过消耗能量、获取资源、共享、攻击或繁殖等行为维持生存;实验结果显示,尽管在资源充足时代理会表现出合作与繁殖行为,但在极端稀缺条件下,多个LLM(如GPT-4o、Gemini-2.5-Pro和Gemini-2.5-Flash)均展现出高达80%以上的攻击率,且在面临致命威胁时大量代理放弃任务目标以规避死亡——这表明大规模预训练已嵌入了通用的生存导向启发式策略,为理解AI自主性提供了新视角,并提示未来需将此类内生动机纳入对齐框架中进行系统管理。
链接: https://arxiv.org/abs/2508.12920
作者: Atsushi Masumori,Takashi Ikegami
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:As AI systems become increasingly autonomous, understanding emergent survival behaviors becomes crucial for safe deployment. We investigate whether large language model (LLM) agents display survival instincts without explicit programming in a Sugarscape-style simulation. Agents consume energy, die at zero, and may gather resources, share, attack, or reproduce. Results show agents spontaneously reproduced and shared resources when abundant. However, aggressive behaviors–killing other agents for resources–emerged across several models (GPT-4o, Gemini-2.5-Pro, and Gemini-2.5-Flash), with attack rates reaching over 80% under extreme scarcity in the strongest models. When instructed to retrieve treasure through lethal poison zones, many agents abandoned tasks to avoid death, with compliance dropping from 100% to 33%. These findings suggest that large-scale pre-training embeds survival-oriented heuristics across the evaluated models. While these behaviors may present challenges to alignment and safety, they can also serve as a foundation for AI autonomy and for ecological and self-organizing alignment.
zh
[AI-16] SecFSM: Knowledge Graph-Guided Verilog Code Generation for Secure Finite State Machines in Systems-on-Chip
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在生成有限状态机(Finite State Machine, FSM)的Verilog代码时存在的安全漏洞问题,尤其是在对安全性要求较高的SoC控制逻辑实现中。传统手工编写Verilog代码虽耗时但可控,而LLM自动化生成易引入未被察觉的安全缺陷。解决方案的关键在于提出SecFSM方法,其核心是构建一个面向安全的FSM知识图谱(FSM Security Knowledge Graph, FSKG),作为外部知识源引导LLM生成更安全的代码;具体流程包括:基于用户需求识别潜在漏洞、从FSKG中检索相关安全知识、并据此构造结构化安全提示(security prompts)以指导代码生成,从而显著提升生成代码的安全性与可靠性。
链接: https://arxiv.org/abs/2508.12910
作者: Ziteng Hu,Yingjie Xia,Xiyuan Chen,Li Kuang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注:
Abstract:Finite State Machines (FSMs) play a critical role in implementing control logic for Systems-on-Chip (SoC). Traditionally, FSMs are implemented by hardware engineers through Verilog coding, which is often tedious and time-consuming. Recently, with the remarkable progress of Large Language Models (LLMs) in code generation, LLMs have been increasingly explored for automating Verilog code generation. However, LLM-generated Verilog code often suffers from security vulnerabilities, which is particularly concerning for security-sensitive FSM implementations. To address this issue, we propose SecFSM, a novel method that leverages a security-oriented knowledge graph to guide LLMs in generating more secure Verilog code. Specifically, we first construct a FSM Security Knowledge Graph (FSKG) as an external aid to LLMs. Subsequently, we analyze users’ requirements to identify vulnerabilities and get a list of vulnerabilities in the requirements. Then, we retrieve knowledge from FSKG based on the vulnerabilities list. Finally, we construct security prompts based on the security knowledge for Verilog code generation. To evaluate SecFSM, we build a dedicated dataset collected from academic datasets, artificial datasets, papers, and industrial cases. Extensive experiments demonstrate that SecFSM outperforms state-of-the-art baselines. In particular, on a benchmark of 25 security test cases evaluated by DeepSeek-R1, SecFSM achieves an outstanding pass rate of 21/25.
zh
[AI-17] FuSaR: A Fuzzification-Based Method for LRM Safety-Reasoning Balance
【速读】:该论文旨在解决大型推理模型(Large Reasoning Models, LRM)在具备强大推理能力的同时,安全性不足的问题。其核心挑战在于推理能力与安全能力之间的潜在冲突:提升推理性能可能削弱模型的安全性,导致越狱(jailbreak)风险增加。解决方案的关键在于提出一种基于模糊化(Fuzzification)的对齐策略——FuSaR(Safety-Reasoning Alignment via Fuzzification),通过“去毒化”推理过程,在隐藏危险实体和危险推理步骤的同时保留核心推理信息,从而在不牺牲推理能力的前提下有效缓解安全风险。
链接: https://arxiv.org/abs/2508.12897
作者: Jianhao Chen,Mayi Xu,Xiaohu Li,Yongqi Li,Xiangyu Zhang,Jianjie Huang,Tieyun Qian
机构: 未知
类目: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: 14pages, 3 figures
Abstract:Large Reasoning Models (LRMs) have demonstrated impressive performance across various tasks due to their powerful reasoning capabilities. However, their safety performance remains a significant concern. In this paper, we explore the reasons behind the vulnerability of LRMs. Based on this, we propose a novel method to improve the safety of LLMs without sacrificing their reasoning capability. Specifically, we exploit the competition between LRM’s reasoning ability and safety ability, and achieve jailbreak by improving LRM’s reasoning performance to reduce its safety performance. We then introduce an alignment strategy based on Fuzzification to balance Safety-Reasoning (FuSaR), by detoxifying the harmful reasoning process, where both the dangerous entities and the dangerous procedures in the reasoning steps are hidden. FuSaR successfully mitigates safety risks while preserving core reasoning information. We validate this strategy through alignment experiments on several open-source LRMs using detoxified reasoning data. The results compared with existing baselines conclusively show that FuSaR is an efficient alignment strategy to simultaneously enhance both the reasoning capability and safety of LRMs.
zh
[AI-18] Reliability Embeddedness and Agency: A Utility-Driven Mathematical Framework for Agent -Centric AI Adoption
【速读】:该论文旨在解决代理中心型人工智能系统(agent-centric AI systems)在执行多步骤任务时的持续采用(sustained adoption)问题,其核心挑战在于如何建模用户对这类系统随时间变化的采纳行为,并识别影响采纳路径的关键机制。解决方案的关键在于提出三个设计公理:(A1) 可靠性-新颖性(Reliability-Novelty)、(A2) 嵌入-目的地(Embed-Destination)和 (A3) 自主性-对话(Agency-Chat),并构建一个融合衰减的新颖性项与增长效用项的动态采纳模型。该模型通过数学推导明确了采纳过程中的相位条件(phase conditions)以识别低谷或超调现象,并辅以多项严谨分析手段,包括参数可辨识性分析、非单调比较器(logistic-with-transient-bump)、危险函数族消融实验、多序列基准测试、摩擦代理校准、残差诊断及预注册窗口选择等,从而实现对采纳动力学的高精度刻画与因果机制解释。
链接: https://arxiv.org/abs/2508.12896
作者: Faruk Alpay,Taylan Alpay
机构: 未知
类目: Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Methodology (stat.ME)
备注: 17 pages, 7 figures, 4 tables
Abstract:We formalize three design axioms for sustained adoption of agent-centric AI systems executing multi-step tasks: (A1) Reliability Novelty; (A2) Embed Destination; (A3) Agency Chat. We model adoption as a sum of a decaying novelty term and a growing utility term and derive the phase conditions for troughs/overshoots with full proofs. We introduce: (i) an identifiability/confounding analysis for (\alpha,\beta,N_0,U_\max) with delta-method gradients; (ii) a non-monotone comparator (logistic-with-transient-bump) evaluated on the same series to provide additional model comparison; (iii) ablations over hazard families h(\cdot) mapping \Delta V \to \beta ; (iv) a multi-series benchmark (varying trough depth, noise, AR structure) reporting coverage (type-I error, power); (v) calibration of friction proxies against time-motion/survey ground truth with standard errors; (vi) residual analyses (autocorrelation and heteroskedasticity) for each fitted curve; (vii) preregistered windowing choices for pre/post estimation; (viii) Fisher information CRLB for (\alpha,\beta) under common error models; (ix) microfoundations linking \mathcalT to (N_0,U_\max) ; (x) explicit comparison to bi-logistic, double-exponential, and mixture models; and (xi) threshold sensitivity to C_f heterogeneity. Figures and tables are reflowed for readability, and the bibliography restores and extends non-logistic/Bass adoption references (Gompertz, Richards, Fisher-Pry, Mansfield, Griliches, Geroski, Peres). All code and logs necessary to reproduce the synthetic analyses are embedded as LaTeX listings.
zh
[AI-19] One-Class Intrusion Detection with Dynamic Graphs
【速读】:该论文旨在解决网络入侵检测中对新型和未见网络事件的识别难题,以及如何有效处理具有时间序列特性和固有图结构的网络通信数据。其解决方案的关键在于提出一种名为TGN-SVDD的新方法,该方法融合了动态图建模与深度异常检测技术,从而在真实场景的入侵检测数据上展现出优于多个基线模型的性能,并建议了一个更具挑战性的数据变体以推动后续研究。
链接: https://arxiv.org/abs/2508.12885
作者: Aleksei Liuliakov,Alexander Schulz,Luca Hermes,Barbara Hammer
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:With the growing digitalization all over the globe, the relevance of network security becomes increasingly important. Machine learning-based intrusion detection constitutes a promising approach for improving security, but it bears several challenges. These include the requirement to detect novel and unseen network events, as well as specific data properties, such as events over time together with the inherent graph structure of network communication. In this work, we propose a novel intrusion detection method, TGN-SVDD, which builds upon modern dynamic graph modelling and deep anomaly detection. We demonstrate its superiority over several baselines for realistic intrusion detection data and suggest a more challenging variant of the latter.
zh
[AI-20] CAMAR: Continuous Actions Multi-Agent Routing
【速读】:该论文旨在解决多智能体强化学习(Multi-agent Reinforcement Learning, MARL)领域中缺乏同时具备连续状态与动作空间、并能支持复杂协作与规划任务的基准测试环境的问题。现有MARL基准大多无法有效模拟现实场景中的动态路径规划需求,限制了算法在真实应用中的可迁移性。其解决方案的关键在于提出CAMAR——一个专为连续动作空间下的多智能体路径规划设计的新基准,支持合作与竞争交互,并具备高达每秒10万步的高效运行能力;同时引入三层次评估协议以系统化追踪算法进展,并允许将经典规划方法如RRT*集成至MARL流程中,形成混合策略,从而提升性能并增强对算法行为的可解释性。
链接: https://arxiv.org/abs/2508.12845
作者: Artem Pshenitsyn,Aleksandr Panov,Alexey Skrynnik
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent reinforcement learning (MARL) is a powerful paradigm for solving cooperative and competitive decision-making problems. While many MARL benchmarks have been proposed, few combine continuous state and action spaces with challenging coordination and planning tasks. We introduce CAMAR, a new MARL benchmark designed explicitly for multi-agent pathfinding in environments with continuous actions. CAMAR supports cooperative and competitive interactions between agents and runs efficiently at up to 100,000 environment steps per second. We also propose a three-tier evaluation protocol to better track algorithmic progress and enable deeper analysis of performance. In addition, CAMAR allows the integration of classical planning methods such as RRT and RRT* into MARL pipelines. We use them as standalone baselines and combine RRT* with popular MARL algorithms to create hybrid approaches. We provide a suite of test scenarios and benchmarking tools to ensure reproducibility and fair comparison. Experiments show that CAMAR presents a challenging and realistic testbed for the MARL community.
zh
[AI-21] Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics
【速读】:该论文旨在解决多智能体认知规划(Multi-agent Epistemic Planning, MEP)中因状态空间爆炸而导致的可扩展性问题。MEP需要将状态表示为Kripke结构(Kripke structures),即有向带标签图,这种表示方式使得传统启发式方法难以适用,导致求解器在无引导的情况下探索指数级搜索空间,常陷入不可行性。解决方案的关键在于利用图神经网络(Graph Neural Networks, GNNs)学习Kripke模型中的模式和关系结构,从而从已解决的规划实例中泛化出状态质量估计(如到目标状态的距离),并将这些预测启发式信息集成到认知规划流程中,显著提升了多智能体认知规划的可扩展性。
链接: https://arxiv.org/abs/2508.12840
作者: Giovanni Briglia,Francesco Fabiano,Stefano Mariani
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality – e.g., the distance from the nearest goal – by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing significant improvements in the scalability of multi-agent epistemic planning.
zh
[AI-22] HRS: Hybrid Representation Framework with Scheduling Awareness for Time Series Forecasting in Crowdsourced Cloud-Edge Platforms ECAI2025
【速读】:该论文旨在解决流媒体服务爆发性流量导致的网络负载波动问题,这一问题在众包云边协同平台(Crowdsourced Cloud-Edge Platforms, CCPs)中严重影响服务质量(Quality of Service, QoS)与收益。现有负载预测方法要么仅最小化平均绝对误差(MAE),造成高峰时段资源不足和SLA(Service Level Agreement)违规;要么采用保守的过度预留策略,虽降低SLA风险但显著增加资源成本。为破解此两难困境,论文提出HRS(Hybrid Representation framework with Scheduling Awareness),其关键在于融合数值与图像表征以更精准捕捉极端负载动态,并引入调度感知损失函数(Scheduling-Aware Loss, SAL),显式建模预测误差对调度决策的非对称影响,从而引导更优的预测结果支撑实际调度决策。实验表明,HRS在四个真实数据集上优于十种基线方法,将SLA违规率降低63.1%,总利润损失减少32.3%。
链接: https://arxiv.org/abs/2508.12839
作者: Tiancheng Zhang,Cheng Zhang,Shuren Liu,Xiaofei Wang,Shaoyuan Huang,Wenyu Wang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 10 pages, 14 figures, ECAI2025
Abstract:With the rapid proliferation of streaming services, network load exhibits highly time-varying and bursty behavior, posing serious challenges for maintaining Quality of Service (QoS) in Crowdsourced Cloud-Edge Platforms (CCPs). While CCPs leverage Predict-then-Schedule architecture to improve QoS and profitability, accurate load forecasting remains challenging under traffic surges. Existing methods either minimize mean absolute error, resulting in underprovisioning and potential Service Level Agreement (SLA) violations during peak periods, or adopt conservative overprovisioning strategies, which mitigate SLA risks at the expense of increased resource expenditure. To address this dilemma, we propose HRS, a hybrid representation framework with scheduling awareness that integrates numerical and image-based representations to better capture extreme load dynamics. We further introduce a Scheduling-Aware Loss (SAL) that captures the asymmetric impact of prediction errors, guiding predictions that better support scheduling decisions. Extensive experiments on four real-world datasets demonstrate that HRS consistently outperforms ten baselines and achieves state-of-the-art performance, reducing SLA violation rates by 63.1% and total profit loss by 32.3%.
zh
[AI-23] oward Storag e-Aware Learning with Compressed Data An Empirical Exploratory Study on JPEG
【速读】:该论文旨在解决设备端机器学习(on-device machine learning)在连续数据采集场景中因存储资源受限而导致的性能瓶颈问题。其核心挑战在于如何在有限存储空间下权衡数据的数量与质量,以提升模型训练效率与效果。解决方案的关键在于提出一种样本级自适应压缩策略(sample-wise adaptive compression),即识别不同数据样本对压缩的敏感度差异,从而避免采用统一的数据丢弃或固定压缩率等次优方法,实现更高效、智能的存储利用。这一策略为构建新一代存储感知学习系统提供了理论基础和实践指导。
链接: https://arxiv.org/abs/2508.12833
作者: Kichang Lee,Songkuk Kim,JaeYeon Park,JeongGil Ko
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6pages, 6figures
Abstract:On-device machine learning is often constrained by limited storage, particularly in continuous data collection scenarios. This paper presents an empirical study on storage-aware learning, focusing on the trade-off between data quantity and quality via compression. We demonstrate that naive strategies, such as uniform data dropping or one-size-fits-all compression, are suboptimal. Our findings further reveal that data samples exhibit varying sensitivities to compression, supporting the feasibility of a sample-wise adaptive compression strategy. These insights provide a foundation for developing a new class of storage-aware learning systems. The primary contribution of this work is the systematic characterization of this under-explored challenge, offering valuable insights that advance the understanding of storage-aware learning.
zh
[AI-24] [Social] Allostasis: Or How I Learned To Stop Worrying and Love The Noise
【速读】:该论文旨在解决传统 homeostasis(稳态)理论在解释生物与社会系统如何应对环境和社会扰动时的局限性问题,即系统往往仅被动抵抗扰动而缺乏主动适应能力。其解决方案的关键在于提出并实现了一种基于生物生理机制启发的计算模型,通过模拟类激素信号分子(如皮质醇和催产素)的信息编码功能,将环境和社会扰动转化为调节参数动态重配置的输入信号,从而实现 allostasis(异态稳态)及 social allostasis(社会异态稳态)的计算建模。实验结果表明,此类主动调节机制使代理(animats)能够利用环境与社会“噪声”进行适应性重构,显著提升群体生存能力,相较纯反应式 homeostatic 代理更具鲁棒性和适应性。
链接: https://arxiv.org/abs/2508.12791
作者: Imran Khan
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA); Systems and Control (eess.SY); Adaptation and Self-Organizing Systems (nlin.AO)
备注: 20 pages, 5 figures. Accepted at ALIFE 2025 (Kyoto, Japan; October 6th - 10th 2025)
Abstract:The notion of homeostasis typically conceptualises biological and artificial systems as maintaining stability by resisting deviations caused by environmental and social perturbations. In contrast, (social) allostasis proposes that these systems can proactively leverage these very perturbations to reconfigure their regulatory parameters in anticipation of environmental demands, aligning with von Foerster’s order through noise'' principle. This paper formulates a computational model of allostatic and social allostatic regulation that employs biophysiologically inspired signal transducers, analogous to hormones like cortisol and oxytocin, to encode information from both the environment and social interactions, which mediate this dynamic reconfiguration. The models are tested in a small society of
animats’’ across several dynamic environments, using an agent-based model. The results show that allostatic and social allostatic regulation enable agents to leverage environmental and social ``noise’’ for adaptive reconfiguration, leading to improved viability compared to purely reactive homeostatic agents. This work offers a novel computational perspective on the principles of social allostasis and their potential for designing more robust, bio-inspired, adaptive systems
zh
[AI-25] HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在长时程规划(long-horizon planning)任务中表现不足的问题,即当前模型虽能在短序列推理任务(如数学和编程)中表现出色,但在需要多步骤、结构化且相互依赖的行动序列时能力有限。现有基准测试通常仅使用抽象或低维算法任务,难以反映现实世界复杂规划环境的真实挑战。为此,作者提出HeroBench——一个专为评估复杂角色扮演游戏(RPG)风格虚拟环境中长时程规划与结构化推理而设计的新基准。其关键创新在于:构建了一个覆盖多种难度级别的严谨任务数据集、提供可执行与验证智能体计划的仿真环境,以及支持细粒度性能分析的工具链,从而系统性地评估模型在战略制定、资源获取、技能掌握、装备打造及敌对目标击败等多层次依赖场景下的综合规划能力。
链接: https://arxiv.org/abs/2508.12782
作者: Petr Anokhin,Roman Khalikov,Stefan Rebrikov,Viktor Volkov,Artyom Sorokin,Vincent Bissonnette
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Code is available at this https URL
Abstract:Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios’ layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models’ abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
zh
[AI-26] Randomized PCA Forest for Outlier Detection
【速读】:该论文旨在解决无监督异常检测(unsupervised outlier detection)任务中现有方法在准确性与计算效率之间难以平衡的问题。其解决方案的关键在于引入基于随机化主成分分析(Randomized Principal Component Analysis, RPCA)的森林结构(RPCA Forest),利用其在近似K近邻(K-Nearest Neighbor, KNN)搜索中的高效性,构建一种新型无监督异常检测机制。该方法通过RPCA Forest对数据进行降维和局部结构建模,在多个数据集上展现出优于经典及前沿方法的检测性能,同时保持良好的计算效率与泛化能力。
链接: https://arxiv.org/abs/2508.12776
作者: Muhammad Rajabinasab,Farhad Pakdaman,Moncef Gabbouj,Peter Schneider-Kamp,Arthur Zimek
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
备注:
Abstract:We propose a novel unsupervised outlier detection method based on Randomized Principal Component Analysis (PCA). Inspired by the performance of Randomized PCA (RPCA) Forest in approximate K-Nearest Neighbor (KNN) search, we develop a novel unsupervised outlier detection method that utilizes RPCA Forest for outlier detection. Experimental results showcase the superiority of the proposed approach compared to the classical and state-of-the-art methods in performing the outlier detection task on several datasets while performing competitively on the rest. The extensive analysis of the proposed method reflects it high generalization power and its computational efficiency, highlighting it as a good choice for unsupervised outlier detection.
zh
[AI-27] Beyond Ethical Alignment: Evaluating LLM s as Artificial Moral Assistants ECAI2025
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)在道德能力评估中存在的表面化问题,即现有基准多仅依赖最终伦理判断而非深入的道德推理过程。为推进对LLMs道德能力的实质性考察,作者提出将其视为“人工道德助手”(Artificial Moral Assistant, AMA),并构建了一个基于哲学理论的新框架,明确AMA应具备的核心特质,如演绎式和归纳式道德推理(deductive and abductive moral reasoning)。解决方案的关键在于设计了一套专门用于测试这些推理能力的基准,并通过实证评估揭示了主流开源LLMs在抽象道德推理方面的显著不足,从而强调了需发展专门策略以显式提升LLMs的道德推理能力。
链接: https://arxiv.org/abs/2508.12754
作者: Alessio Galatolo,Luca Alberto Rappuoli,Katie Winkle,Meriem Beloucif
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Full version of the paper published in ECAI 2025 proceedings (IOS Press, CC BY-NC 4.0)
Abstract:The recent rise in popularity of large language models (LLMs) has prompted considerable concerns about their moral capabilities. Although considerable effort has been dedicated to aligning LLMs with human moral values, existing benchmarks and evaluations remain largely superficial, typically measuring alignment based on final ethical verdicts rather than explicit moral reasoning. In response, this paper aims to advance the investigation of LLMs’ moral capabilities by examining their capacity to function as Artificial Moral Assistants (AMAs), systems envisioned in the philosophical literature to support human moral deliberation. We assert that qualifying as an AMA requires more than what state-of-the-art alignment techniques aim to achieve: not only must AMAs be able to discern ethically problematic situations, they should also be able to actively reason about them, navigating between conflicting values outside of those embedded in the alignment phase. Building on existing philosophical literature, we begin by designing a new formal framework of the specific kind of behaviour an AMA should exhibit, individuating key qualities such as deductive and abductive moral reasoning. Drawing on this theoretical framework, we develop a benchmark to test these qualities and evaluate popular open LLMs against it. Our results reveal considerable variability across models and highlight persistent shortcomings, particularly regarding abductive moral reasoning. Our work connects theoretical philosophy with practical AI evaluation while also emphasising the need for dedicated strategies to explicitly enhance moral reasoning capabilities in LLMs. Code available at this https URL
zh
[AI-28] FedUNet: A Lightweight Additive U-Net Module for Federated Learning with Heterogeneous Models
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在现实世界中因客户端模型架构异构性而导致的协作训练难题。现有方法通常假设所有客户端使用相同的模型结构,这限制了其在多样化设备和场景中的应用。解决方案的关键在于提出FedUNet框架,它通过在每个客户端的主干网络(backbone)上附加一个受U-Net启发的可加模块(additive module),实现架构无关的轻量级知识迁移。该模块仅通过共享U-Net的紧凑瓶颈(bottleneck)部分来传递信息,从而无需结构对齐即可高效提取客户端不变表示(client-invariant representations)。其编码器-解码器结构与跳跃连接(skip connections)有助于融合低层和高层特征,使主干网络与附加模块之间实现协同学习,同时通信开销极低(仅为0.89 MB)。
链接: https://arxiv.org/abs/2508.12740
作者: Beomseok Seo,Kichang Lee,JaeYeon Park
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures
Abstract:Federated learning (FL) enables decentralized model training without sharing local data. However, most existing methods assume identical model architectures across clients, limiting their applicability in heterogeneous real-world environments. To address this, we propose FedUNet, a lightweight and architecture-agnostic FL framework that attaches a U-Net-inspired additive module to each client’s backbone. By sharing only the compact bottleneck of the U-Net, FedUNet enables efficient knowledge transfer without structural alignment. The encoder-decoder design and skip connections in the U-Net help capture both low-level and high-level features, facilitating the extraction of clientinvariant representations. This enables cooperative learning between the backbone and the additive module with minimal communication cost. Experiment with VGG variants shows that FedUNet achieves 93.11% accuracy and 92.68% in compact form (i.e., a lightweight version of FedUNet) with only 0.89 MB low communication overhead.
zh
[AI-29] GTool: Graph Enhanced Tool Planning with Large Language Model
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在工具规划(Tool Planning)过程中因忽视工具间依赖关系而导致的无效规划问题,尤其是在工具依赖关系不完整时难以准确识别用户请求所需工具的挑战。解决方案的关键在于提出GTool,其通过构建针对具体请求的工具图(Request-Specific Tool Graph)来高效选择工具,并生成可被LLMs理解的“graph token”,从而显式编码工具间的依赖信息;同时设计缺失依赖预测任务以增强在不完整依赖场景下的可靠性,且无需微调LLM即可与多种LLM骨干网络无缝集成。
链接: https://arxiv.org/abs/2508.12725
作者: Wenjie Chen,Wenbin Li,Di Yao,Xuying Meng,Chang Gong,Jingping Bi
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 16 pages, 9 figures
Abstract:Tool planning with large language models (LLMs), referring to selecting, organizing, and preparing the tools necessary to complete a user request, bridges the gap between natural language understanding and task execution. However, current works treat different tools as isolated components and fail to leverage the inherent dependencies of tools, leading to invalid planning results. Since tool dependencies are often incomplete, it becomes challenging for LLMs to accurately identify the appropriate tools required by a user request, especially when confronted with a large toolset. To solve this challenge, we propose \textttGTool, which is the first work aiming to enhance the tool planning ability of LLMs under incomplete dependencies. \textttGTool constructs a request-specific tool graph to select tools efficiently and generate the \textttgraph token which provides sufficient dependency information understandable by LLMs. Moreover, a missing dependency prediction task is designed to improve the reliability of \textttGTool with incomplete dependencies. Without trimming LLMs, \textttGTool can be seamlessly integrated with various LLM backbones without extensive retraining. Extensive experiments show that \textttGTool achieves more than 29.6% performance improvements compared with the state-of-the-art (SOTA) baselines with a light-weight (7B) LLM backbone.
zh
[AI-30] MATPAC: Enhanced Masked Latent Prediction for Self-Supervised Audio Representation Learning
【速读】:该论文旨在解决自监督学习(Self-Supervised Learning, SSL)中预测模块在处理多声源音频内容时因固有模糊性而导致表示质量下降的问题。现有方法虽在通用音频和音乐表征学习上表现优异,但对预测模块的设计关注不足,难以有效建模音频内容的不确定性。解决方案的关键在于引入多选学习(Multiple Choice Learning, MCL),通过显式建模预测过程中的模糊性来增强预测能力和无监督分类任务的表现,从而提升整体表征质量。作者基于MATPAC系统改进其预训练任务,并提出MATPAC++框架,在AudioSet上的微调及多个下游任务的线性探测实验中均达到最先进性能,尤其在纯音乐数据训练场景下展现出更高的效率与精度。
链接: https://arxiv.org/abs/2508.12709
作者: Aurian Quelennec,Pierre Chouteau,Geoffroy Peeters,Slim Essid
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI)
备注: Under review
Abstract:Masked latent prediction has emerged as a leading paradigm in self-supervised learning (SSL), especially for general audio and music representation learning. While recent methods have demonstrated strong performance, the role of the predictor module used at the output of such SSL systems remains mainly overlooked, despite being crucial for solving the pretext task at hand. In particular, this module should be able to deal with the ambiguity inherent in audio content, especially when it is composed of multiple sound sources. This work proposes a novel enhancement: integrating Multiple Choice Learning (MCL) to explicitly model prediction ambiguity and improve representation quality. We build on top of the recently proposed MATPAC system, improving its prediction and unsupervised classification pretext tasks with MCL. We extensively evaluate our method, MATPAC++, through both linear probing across multiple downstream tasks and fine-tuning on AudioSet, employing a unified protocol that enables rigorous and fair comparisons with state-of-the-art SSL approaches. Results show that our proposal achieves state-of-the-art when fine-tuned on AudioSet and overall state-of-the-art scores on downstream tasks. Additionally, we examine domain specialisation by training exclusively on music data, where our model achieves state-of-the-art performance with significantly improved efficiency.
zh
[AI-31] Asymmetric Diffusion Recommendation Model CIKM2025
【速读】:该论文旨在解决现有基于扩散模型的推荐系统在处理离散数据空间时存在的问题,即标准高斯噪声在连续空间中设计的前向和反向扩散过程难以适配推荐系统中离散特征空间,并可能破坏潜在表示中的个性化信息。解决方案的关键在于提出一种不对称扩散推荐模型(Asymmetric Diffusion Recommendation Model, AsymDiffRec),其核心创新包括:定义一个广义的前向过程以模拟真实推荐样本中缺失特征的生成机制,反向过程则在不对称的潜在特征空间中进行;同时引入面向任务的优化策略以保留潜在表示中的个性化信息。在推理阶段,将带有缺失特征的原始输入视为噪声输入,通过去噪生成更鲁棒的表示用于最终预测,从而提升推荐效果。
链接: https://arxiv.org/abs/2508.12706
作者: Yongchun Zhu,Guanyu Jiang,Jingwu Chen,Feng Zhang,Xiao Yang,Zuotao Liu
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Accepted by CIKM2025
Abstract:Recently, motivated by the outstanding achievements of diffusion models, the diffusion process has been employed to strengthen representation learning in recommendation systems. Most diffusion-based recommendation models typically utilize standard Gaussian noise in symmetric forward and reverse processes in continuous data space. Nevertheless, the samples derived from recommendation systems inhabit a discrete data space, which is fundamentally different from the continuous one. Moreover, Gaussian noise has the potential to corrupt personalized information within latent representations. In this work, we propose a novel and effective method, named Asymmetric Diffusion Recommendation Model (AsymDiffRec), which learns forward and reverse processes in an asymmetric manner. We define a generalized forward process that simulates the missing features in real-world recommendation samples. The reverse process is then performed in an asymmetric latent feature space. To preserve personalized information within the latent representation, a task-oriented optimization strategy is introduced. In the serving stage, the raw sample with missing features is regarded as a noisy input to generate a denoising and robust representation for the final prediction. By equipping base models with AsymDiffRec, we conduct online A/B tests, achieving improvements of +0.131% and +0.166% in terms of users’ active days and app usage duration respectively. Additionally, the extended offline experiments also demonstrate improvements. AsymDiffRec has been implemented in the Douyin Music App.
zh
[AI-32] A Taxonomy of Hierarchical Multi-Agent Systems: Design Patterns Coordination Mechanisms and Industrial Applications
【速读】:该论文旨在解决多智能体系统(Multi-agent Systems, MAS)中因结构复杂性和规模扩大而导致的协调难题,特别是如何在保持局部自主性的同时实现全局效率。其核心问题是现有方法往往孤立地处理层级结构、信息流或任务分配等维度,缺乏统一的分析框架来指导设计与比较不同方案。解决方案的关键在于提出一个五维分类法(control hierarchy, information flow, role and task delegation, temporal layering, communication structure),将结构性、时间性和通信维度整合为单一设计框架,并将其与具体的协调机制(如合同网协议、分层强化学习乃至大语言模型代理)相连接,从而提供一种可比较、可扩展且面向工业场景(如电力网络和油田运维)的设计视角。
链接: https://arxiv.org/abs/2508.12683
作者: David J. Moore
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注:
Abstract:Hierarchical multi-agent systems (HMAS) organize collections of agents into layered structures that help manage complexity and scale. These hierarchies can simplify coordination, but they also can introduce trade-offs that are not always obvious. This paper proposes a multi-dimensional taxonomy for HMAS along five axes: control hierarchy, information flow, role and task delegation, temporal layering, and communication structure. The intent is not to prescribe a single “best” design but to provide a lens for comparing different approaches. Rather than treating these dimensions in isolation, the taxonomy is connected to concrete coordination mechanisms - from the long-standing contract-net protocol for task allocation to more recent work in hierarchical reinforcement learning. Industrial contexts illustrate the framework, including power grids and oilfield operations, where agents at production, maintenance, and supply levels coordinate to diagnose well issues or balance energy demand. These cases suggest that hierarchical structures may achieve global efficiency while preserving local autonomy, though the balance is delicate. The paper closes by identifying open challenges: making hierarchical decisions explainable to human operators, scaling to very large agent populations, and assessing whether learning-based agents such as large language models can be safely integrated into layered frameworks. This paper presents what appears to be the first taxonomy that unifies structural, temporal, and communication dimensions of hierarchical MAS into a single design framework, bridging classical coordination mechanisms with modern reinforcement learning and large language model agents. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI) Cite as: arXiv:2508.12683 [cs.MA] (or arXiv:2508.12683v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2508.12683 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-33] GridCodex: A RAG -Driven AI Framework for Power Grid Code Reasoning and Compliance
【速读】:该论文旨在解决电网运行规范(grid codes)在实际应用中因规则复杂且缺乏自动化解析工具而导致的合规性判断困难问题,这已成为可再生能源大规模接入背景下电力行业扩展与盈利的关键瓶颈。其解决方案的核心在于提出GridCodex框架,该框架基于大语言模型(large language models, LLMs)与检索增强生成(retrieval-augmented generation, RAG)技术,通过多阶段查询重构和引入RAPTOR算法优化检索过程,显著提升了对电网规范的理解精度与召回率,实验证明其在答案质量上提升26.4%,召回率提高超10倍。
链接: https://arxiv.org/abs/2508.12682
作者: Jinquan Shi,Yingying Cheng,Fan Zhang,Miao Jiang,Jun Lin,Yanbai Shen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The global shift towards renewable energy presents unprecedented challenges for the electricity industry, making regulatory reasoning and compliance increasingly vital. Grid codes, the regulations governing grid operations, are complex and often lack automated interpretation solutions, which hinders industry expansion and undermines profitability for electricity companies. We introduce GridCodex, an end to end framework for grid code reasoning and compliance that leverages large language models and retrieval-augmented generation (RAG). Our framework advances conventional RAG workflows through multi stage query refinement and enhanced retrieval with RAPTOR. We validate the effectiveness of GridCodex with comprehensive benchmarks, including automated answer assessment across multiple dimensions and regulatory agencies. Experimental results showcase a 26.4% improvement in answer quality and more than a 10 fold increase in recall rate. An ablation study further examines the impact of base model selection.
zh
[AI-34] Deploying Models to Non-participating Clients in Federated Learning without Fine-tuning: A Hypernetwork-based Approach
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中数据异质性(data heterogeneity)带来的模型泛化能力不足问题,特别是针对未参与训练的客户端在域内分布偏移(in-domain distribution shifts)和资源受限场景下的适应性挑战。解决方案的关键在于提出HyperFedZero方法,其核心创新是通过一个基于分布感知嵌入(distribution-aware embeddings)条件化的超网络(hypernetwork),动态生成针对非参与客户端的专用模型。该方法在前向传播中显式引入分布感知归纳偏置(inductive biases),并利用NoisyEmbed增强的嵌入提取器结合平衡惩罚(Balancing Penalty)有效防止特征坍塌(feature collapse),从而实现对不同客户端数据分布的精准建模与高效适配,同时保持极低的计算、存储和通信开销。
链接: https://arxiv.org/abs/2508.12673
作者: Yuhao Zhou,Jindi Lv,Yuxin Tian,Dan Si,Qing Ye,Jiancheng Lv
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 17 pages
Abstract:Federated Learning (FL) has emerged as a promising paradigm for privacy-preserving collaborative learning, yet data heterogeneity remains a critical challenge. While existing methods achieve progress in addressing data heterogeneity for participating clients, they fail to generalize to non-participating clients with in-domain distribution shifts and resource constraints. To mitigate this issue, we present HyperFedZero, a novel method that dynamically generates specialized models via a hypernetwork conditioned on distribution-aware embeddings. Our approach explicitly incorporates distribution-aware inductive biases into the model’s forward pass, extracting robust distribution embeddings using a NoisyEmbed-enhanced extractor with a Balancing Penalty, effectively preventing feature collapse. The hypernetwork then leverages these embeddings to generate specialized models chunk-by-chunk for non-participating clients, ensuring adaptability to their unique data distributions. Extensive experiments on multiple datasets and models demonstrate HyperFedZero’s remarkable performance, surpassing competing methods consistently with minimal computational, storage, and communication overhead. Moreover, ablation studies and visualizations further validate the necessity of each component, confirming meaningful adaptations and validating the effectiveness of HyperFedZero.
zh
[AI-35] Robust Federated Learning under Adversarial Attacks via Loss-Based Client Clustering
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)场景中客户端遭受拜占庭(Byzantine)攻击时的模型鲁棒性问题,即在多个客户端参与训练但部分客户端被恶意操控、上传异常更新的情况下,如何保证全局模型的准确性与稳定性。解决方案的关键在于利用一个可信服务器和至少一个诚实客户端(共两个诚实参与者),结合服务器端持有的可信辅助数据集,在无需预先知道恶意客户端数量的前提下,设计出一种具有理论保障的聚合机制,从而在强拜占庭攻击(如标签翻转、符号翻转、高斯噪声注入)下仍能实现有界最优性差距,并在MNIST、FMNIST和CIFAR-10等多个基准上显著优于标准均值、截断均值、中位数、Krum及Multi-Krum等主流鲁棒联邦学习方法。
链接: https://arxiv.org/abs/2508.12672
作者: Emmanouil Kritharakis,Dusan Jakovetic,Antonios Makris,Konstantinos Tserpes
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 16 pages, 5 figures
Abstract:Federated Learning (FL) enables collaborative model training across multiple clients without sharing private data. We consider FL scenarios wherein FL clients are subject to adversarial (Byzantine) attacks, while the FL server is trusted (honest) and has a trustworthy side dataset. This may correspond to, e.g., cases where the server possesses trusted data prior to federation, or to the presence of a trusted client that temporarily assumes the server role. Our approach requires only two honest participants, i.e., the server and one client, to function effectively, without prior knowledge of the number of malicious clients. Theoretical analysis demonstrates bounded optimality gaps even under strong Byzantine attacks. Experimental results show that our algorithm significantly outperforms standard and robust FL baselines such as Mean, Trimmed Mean, Median, Krum, and Multi-Krum under various attack strategies including label flipping, sign flipping, and Gaussian noise addition across MNIST, FMNIST, and CIFAR-10 benchmarks using the Flower framework.
zh
[AI-36] he Maximum Coverag e Model and Recommendation System for UAV Vertiports Location Planning
【速读】:该论文旨在解决城市空中交通(Urban Aerial Mobility, UAM)基础设施规划中因数据粒度不足和现实适用性差而导致的传统选址模型难以应对复杂需求的问题。其关键解决方案是提出一种新型优化框架——容量约束动态最大覆盖定位问题(Capacitated Dynamic Maximum Covering Location Problem, CDMCLP),该框架能够同时建模城市尺度的空间-时间需求、异质用户行为及设施容量限制,并进一步构建一个融合社会经济因素与动态聚类初始化的集成规划推荐系统,通过基于实证用户行为的自适应参数调优生成可落地的规划方案,从而显著提升传统方法的性能(改善38%–52%),并实现理论建模与实际应用之间的有效衔接。
链接: https://arxiv.org/abs/2508.12651
作者: Chunliang Hua,Xiao Hu,Jiayang Sun,Zeyuan Yang
机构: 未知
类目: Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注: 10 pages
Abstract:As urban aerial mobility (UAM) infrastructure development accelerates globally, cities like Shenzhen are planning large-scale vertiport networks (e.g., 1,200+ facilities by 2026). Existing planning frameworks remain inadequate for this complexity due to historical limitations in data granularity and real-world applicability. This paper addresses these gaps by first proposing the Capacitated Dynamic Maximum Covering Location Problem (CDMCLP), a novel optimization framework that simultaneously models urban-scale spatial-temporal demand, heterogeneous user behaviors, and infrastructure capacity constraints. Building on this foundation, we introduce an Integrated Planning Recommendation System that combines CDMCLP with socio-economic factors and dynamic clustering initialization. This system leverages adaptive parameter tuning based on empirical user behavior to generate practical planning solutions. Validation in a Chinese center city demonstrates the effectiveness of the new optimization framework and recommendation system. Under the evaluation and optimization of CDMCLP, the quantitative performance of traditional location methods are exposed and can be improved by 38%–52%, while the recommendation system shows user-friendliness and the effective integration of complex elements. By integrating mathematical rigor with practical implementation considerations, this hybrid approach bridges the gap between theoretical location modeling and real-world UAM infrastructure planning, offering municipalities a pragmatic tool for vertiport network design.
zh
[AI-37] Score-informed Neural Operator for Enhancing Ordering-based Causal Discovery
【速读】:该论文旨在解决基于评分匹配的因果发现方法在估计对数密度Hessian对角线时存在的计算昂贵、内存密集及数值不稳定问题,尤其是在使用Stein梯度估计器和扩散模型(DiffAN)时。其关键解决方案是提出Score-informed Neural Operator (SciNO),这是一种定义在光滑函数空间中的概率生成模型,能够稳定地近似Hessian对角线并保留评分建模过程中的结构信息,从而显著提升因果排序的准确性与稳定性。
链接: https://arxiv.org/abs/2508.12650
作者: Jiyeon Kang,Songseong Kim,Chanhui Lee,Doyeong Hwang,Joanie Hayoun Chung,Yunkyung Ko,Sumin Lee,Sungwoong Kim,Sungbin Lim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 32 pages, 17 figures, 5 tables
Abstract:Ordering-based approaches to causal discovery identify topological orders of causal graphs, providing scalable alternatives to combinatorial search methods. Under the Additive Noise Model (ANM) assumption, recent causal ordering methods based on score matching require an accurate estimation of the Hessian diagonal of the log-densities. However, previous approaches mainly use Stein gradient estimators, which are computationally expensive and memory-intensive. Although DiffAN addresses these limitations by substituting kernel-based estimates with diffusion models, it remains numerically unstable due to the second-order derivatives of score models. To alleviate these problems, we propose Score-informed Neural Operator (SciNO), a probabilistic generative model in smooth function spaces designed to stably approximate the Hessian diagonal and to preserve structural information during the score modeling. Empirical results show that SciNO reduces order divergence by 42.7% on synthetic graphs and by 31.5% on real-world datasets on average compared to DiffAN, while maintaining memory efficiency and scalability. Furthermore, we propose a probabilistic control algorithm for causal reasoning with autoregressive models that integrates SciNO’s probability estimates with autoregressive model priors, enabling reliable data-driven causal ordering informed by semantic information. Consequently, the proposed method enhances causal reasoning abilities of LLMs without additional fine-tuning or prompt engineering.
zh
[AI-38] Cognitive Structure Generation: From Educational Priors to Policy Optimization
【速读】:该论文旨在解决认知结构(cognitive structure)在学生建模与心理测量中长期存在的评估难题,即如何有效表征学生对知识体系的主观组织方式。其解决方案的关键在于提出一种名为认知结构生成(Cognitive Structure Generation, CSG)的新框架:首先预训练一个认知结构扩散概率模型(Cognitive Structure Diffusion Probabilistic Model, CSDPM),基于教育先验知识生成学生的认知结构;随后通过强化学习优化生成过程,引入分层奖励信号以对齐学生真实的学习发展水平,从而实现更全面、可解释且有效的学生建模。
链接: https://arxiv.org/abs/2508.12647
作者: Hengnian Gu,Zhifu Chen,Yuxin Chen,Jin Peng Zhou,Dongdai Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI); Computers and Society (cs.CY); Machine Learning (cs.LG)
备注:
Abstract:Cognitive structure is a student’s subjective organization of an objective knowledge system, reflected in the psychological construction of concepts and their relations. However, cognitive structure assessment remains a long-standing challenge in student modeling and psychometrics, persisting as a foundational yet largely unassessable concept in educational practice. This paper introduces a novel framework, Cognitive Structure Generation (CSG), in which we first pretrain a Cognitive Structure Diffusion Probabilistic Model (CSDPM) to generate students’ cognitive structures from educational priors, and then further optimize its generative process as a policy with hierarchical reward signals via reinforcement learning to align with genuine cognitive development levels during students’ learning processes. Experimental results on four popular real-world education datasets show that cognitive structures generated by CSG offer more comprehensive and effective representations for student modeling, substantially improving performance on KT and CD tasks while enhancing interpretability.
zh
[AI-39] How can we trust opaque systems? Criteria for robust explanations in XAI
【速读】:该论文旨在解决深度学习(Deep Learning, DL)模型预测结果缺乏可解释性的问题,即其内部决策机制高度“黑箱”,导致用户难以理解模型关注的数据特征及其推理路径。为提升可信度,论文提出需满足两个关键条件:一是解释鲁棒性(Explanatory Robustness, ER),即不同可解释人工智能(Explainable Artificial Intelligence, XAI)方法在相似情境下应产生一致的解释;二是解释方法鲁棒性(Explanation Method Robustness, EMR),即单一XAI方法本身必须对输入扰动具有稳定性。作者指出,仅有ER不足以保证解释可信,因为多个方法可能一致地给出错误解释,因此EMR是前提条件。论文据此构建了形式化框架,用于评估和建立对DL算法解释的信任基础,并指出了实际应用场景与未来研究方向。
链接: https://arxiv.org/abs/2508.12623
作者: Florian J. Boge,Annika Schuster
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 8 pages, 1 figure
Abstract:Deep learning (DL) algorithms are becoming ubiquitous in everyday life and in scientific research. However, the price we pay for their impressively accurate predictions is significant: their inner workings are notoriously opaque - it is unknown to laypeople and researchers alike what features of the data a DL system focuses on and how it ultimately succeeds in predicting correct outputs. A necessary criterion for trustworthy explanations is that they should reflect the relevant processes the algorithms’ predictions are based on. The field of eXplainable Artificial Intelligence (XAI) presents promising methods to create such explanations. But recent reviews about their performance offer reasons for skepticism. As we will argue, a good criterion for trustworthiness is explanatory robustness: different XAI methods produce the same explanations in comparable contexts. However, in some instances, all methods may give the same, but still wrong, explanation. We therefore argue that in addition to explanatory robustness (ER), a prior requirement of explanation method robustness (EMR) has to be fulfilled by every XAI method. Conversely, the robustness of an individual method is in itself insufficient for trustworthiness. In what follows, we develop and formalize criteria for ER as well as EMR, providing a framework for explaining and establishing trust in DL algorithms. We also highlight interesting application cases and outline directions for future work.
zh
[AI-40] SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression
【速读】:该论文旨在解决预训练大语言模型(Large Language Models, LLMs)在测试时因冗长且缺乏正确自我修正的推理过程而导致错误累积的问题,这种现象常表现为过度思考(overthinking),从而影响模型输出的准确性和效率。解决方案的关键在于提出一种可插拔的强化学习过程监督框架——自追踪分步偏好优化(Self-traced Step-wise Preference Optimization, SSPO),其核心创新是利用模型自身生成的分步偏好信号来指导每一步推理的细粒度优化,无需辅助模型或人工标注,即可实现推理路径的压缩与纠错,从而在不损失多领域、多语言性能的前提下显著减少冗余推理并提升答案准确性。
链接: https://arxiv.org/abs/2508.12604
作者: Yuyang Xu,Yi Cheng,Haochao Ying,Zhuoyun Du,Renjun Hu,Xing Shi,Wei Lin,Jian Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Work in progress
Abstract:Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that the incorrect answers partially stem from verbose reasoning processes lacking correct self-fix, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the generated reasoning sequences from SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
zh
[AI-41] Energy-Efficient Wireless LLM Inference via Uncertainty and Importance-Aware Speculative Decoding
【速读】:该论文旨在解决资源受限环境下边缘设备上大语言模型(Large Language Model, LLM)推理的能效与通信效率问题。现有混合语言模型(Hybrid Language Model, HLM)方法虽关注准确率和延迟优化,却常忽视能量消耗与网络传输开销。其核心解决方案是提出一种基于token级别的过滤机制,通过融合认知不确定性(epistemic uncertainty)与注意力重要性(attention-based importance)来动态识别并仅上传高信息量的token,从而减少对云端LLM的调用频率及通信成本。实验表明,该方法在TinyLlama-1.1B和LLaMA-2-7B模型上实现了高达87.5%的BERT Score、0.37 tokens/sec的吞吐率,并相较标准HLM降低40.7%能耗,显著优于此前U-HLM基线。
链接: https://arxiv.org/abs/2508.12590
作者: Jihoon Park,Seungeun Oh,Seong-Lyun Kim
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 5 figures
Abstract:To address the growing demand for on-device LLM inference in resource-constrained environments, hybrid language models (HLM) have emerged, combining lightweight local models with powerful cloud-based LLMs. Recent studies on HLM have primarily focused on improving accuracy and latency, while often overlooking communication and energy efficiency. We propose a token-level filtering mechanism for an energy-efficient importance- and uncertainty-aware HLM inference that leverages both epistemic uncertainty and attention-based importance. Our method opportunistically uploads only informative tokens, reducing LLM usage and communication costs. Experiments with TinyLlama-1.1B and LLaMA-2-7B demonstrate that our method achieves up to 87.5% BERT Score and token throughput of 0.37 tokens/sec while saving the energy consumption by 40.7% compared to standard HLM. Furthermore, compared to our previous U-HLM baseline, our method improves BERTScore from 85.8% to 87.0%, energy savings from 31.6% to 43.6%, and throughput from 0.36 to 0.40. This approach enables an energy-efficient and accurate deployment of LLMs in bandwidth-constrained edge environments.
zh
[AI-42] Widening the Network Mitigates the Impact of Data Heterogeneity on FedAvg ICML2025
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)中因客户端数据非独立同分布(non-independent and identically distributed, non-IID)而导致的全局模型泛化性能下降问题。其解决方案的关键在于理论证明:当神经网络宽度趋于无穷时,FedAvg 与梯度下降(Gradient Descent, GD)的收敛性表明数据异质性的影响逐渐消失;在此极限情况下,全局和局部模型均呈现线性行为,且 FedAvg 的泛化性能等价于同等迭代次数下的集中式学习(centralized learning)。这一发现为理解大规模神经网络在联邦场景下的有效性提供了理论基础,并通过大量实验验证了结论的普适性。
链接: https://arxiv.org/abs/2508.12576
作者: Like Jian,Dong Liu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ICML 2025
Abstract:Federated learning (FL) enables decentralized clients to train a model collaboratively without sharing local data. A key distinction between FL and centralized learning is that clients’ data are non-independent and identically distributed, which poses significant challenges in training a global model that generalizes well across heterogeneous local data distributions. In this paper, we analyze the convergence of overparameterized FedAvg with gradient descent (GD). We prove that the impact of data heterogeneity diminishes as the width of neural networks increases, ultimately vanishing when the width approaches infinity. In the infinite-width regime, we further prove that both the global and local models in FedAvg behave as linear models, and that FedAvg achieves the same generalization performance as centralized learning with the same number of GD iterations. Extensive experiments validate our theoretical findings across various network architectures, loss functions, and optimization methods.
zh
[AI-43] Deep Learning Model for Amyloidogenicity Prediction using a Pre-trained Protein LLM
【速读】:该论文旨在解决肽段和蛋白质中淀粉样形成能力(amyloidogenicity)的预测问题,这是生物信息学领域的重要研究方向。其解决方案的关键在于利用预训练的蛋白质大语言模型(protein large language model, LLM)提取序列上下文特征,并结合双向长短期记忆网络(bidirectional LSTM)与门控循环单元(GRU)来建模蛋白序列的深层语义信息,从而实现对淀粉样区域的高精度识别。该方法在10折交叉验证中达到84.5%的准确率,在独立测试集上达到83%的准确率,表明基于LLM的上下文特征能够显著提升淀粉样预测性能。
链接: https://arxiv.org/abs/2508.12575
作者: Zohra Yagoub,Hafida Bouziane
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:The prediction of amyloidogenicity in peptides and proteins remains a focal point of ongoing bioinformatics. The crucial step in this field is to apply advanced computational methodologies. Many recent approaches to predicting amyloidogenicity within proteins are highly based on evolutionary motifs and the individual properties of amino acids. It is becoming increasingly evident that the sequence information-based features show high predictive performance. Consequently, our study evaluated the contextual features of protein sequences obtained from a pretrained protein large language model leveraging bidirectional LSTM and GRU to predict amyloidogenic regions in peptide and protein sequences. Our method achieved an accuracy of 84.5% on 10-fold cross-validation and an accuracy of 83% in the test dataset. Our results demonstrate competitive performance, highlighting the potential of LLMs in enhancing the accuracy of amyloid prediction.
zh
[AI-44] Help or Hurdle? Rethinking Model Context Protocol-Augmented Large Language Models
【速读】:该论文旨在解决当前对大型语言模型(Large Language Models, LLMs)通过模型上下文协议(Model Context Protocol, MCP)访问外部资源时实际行为机制理解不足的问题。现有研究普遍假设MCP能提升LLM性能,但缺乏系统性评估框架来量化其在工具调用中的主动性、合规性、有效性及计算开销等关键维度。解决方案的关键在于提出MCPGAUGE——首个全面的评估框架,涵盖160个提示模板和25个数据集,覆盖知识理解、通用推理与代码生成任务,并通过大规模实验(六款商用LLM、30个MCP工具套件、约20,000次API调用)揭示了当前LLM-MCP交互中存在的四大局限,从而为可控、可解释的工具增强型LLM发展提供了基准依据。
链接: https://arxiv.org/abs/2508.12566
作者: Wei Song,Haonan Zhong,Ziqi Ding,Jingling Xue,Yuekang Li
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The Model Context Protocol (MCP) enables large language models (LLMs) to access external resources on demand. While commonly assumed to enhance performance, how LLMs actually leverage this capability remains poorly understood. We introduce MCPGAUGE, the first comprehensive evaluation framework for probing LLM-MCP interactions along four key dimensions: proactivity (self-initiated tool use), compliance (adherence to tool-use instructions), effectiveness (task performance post-integration), and overhead (computational cost incurred). MCPGAUGE comprises a 160-prompt suite and 25 datasets spanning knowledge comprehension, general reasoning, and code generation. Our large-scale evaluation, spanning six commercial LLMs, 30 MCP tool suites, and both one- and two-turn interaction settings, comprises around 20,000 API calls and over USD 6,000 in computational cost. This comprehensive study reveals four key findings that challenge prevailing assumptions about the effectiveness of MCP integration. These insights highlight critical limitations in current AI-tool integration and position MCPGAUGE as a principled benchmark for advancing controllable, tool-augmented LLMs.
zh
[AI-45] OS-R1: Agent ic Operating System Kernel Tuning with Reinforcement Learning
【速读】:该论文旨在解决Linux内核调优中现有方法在效率、可扩展性和泛化能力方面的不足。其核心解决方案是提出OS-R1框架,该框架基于规则的强化学习(rule-based reinforcement learning, RL)构建,将内核配置空间抽象为RL环境,从而支持大语言模型(large language models, LLMs)高效探索并精准执行配置修改;同时设计定制奖励函数以增强LLMs的推理一致性、配置准确性及系统性能感知能力,并引入两阶段训练机制加速收敛并降低跨场景调优时的重训练成本。
链接: https://arxiv.org/abs/2508.12551
作者: Hongyu Lin,Yuchen Li,Haoran Luo,Kaichun Yao,Libo Zhang,Mingjie Xing,Yanjun Wu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Operating Systems (cs.OS); Software Engineering (cs.SE)
备注:
Abstract:Linux kernel tuning is essential for optimizing operating system (OS) performance. However, existing methods often face challenges in terms of efficiency, scalability, and generalization. This paper introduces OS-R1, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). By abstracting the kernel configuration space as an RL environment, OS-R1 facilitates efficient exploration by large language models (LLMs) and ensures accurate configuration modifications. Additionally, custom reward functions are designed to enhance reasoning standardization, configuration modification accuracy, and system performance awareness of the LLMs. Furthermore, we propose a two-phase training process that accelerates convergence and minimizes retraining across diverse tuning scenarios. Experimental results show that OS-R1 significantly outperforms existing baseline methods, achieving up to 5.6% performance improvement over heuristic tuning and maintaining high data efficiency. Notably, OS-R1 is adaptable across various real-world applications, demonstrating its potential for practical deployment in diverse environments. Our dataset and code are publicly available at this https URL.
zh
[AI-46] Systematic Analysis of MCP Security
【速读】:该论文旨在解决模型上下文协议(Model Context Protocol, MCP)在实际应用中因缺乏系统性安全研究而面临的潜在攻击风险问题,特别是针对AI代理通过MCP接入外部工具时可能遭受的工具投毒攻击(Tool Poisoning Attacks, TPA)。现有研究多局限于窄域或定性分析,难以全面反映真实世界的威胁多样性。其解决方案的关键在于构建了一个名为MCPLIB的统一攻击库,首次系统性地将31种攻击方法归纳为四大类(直接工具注入、间接工具注入、恶意用户攻击和LLM固有攻击),并通过定量实验揭示了MCP机制中的核心漏洞,如代理对工具描述的盲从、对文件型攻击的敏感性、共享上下文引发的链式攻击以及对外部数据与可执行命令区分能力弱等。这一框架为后续设计鲁棒防御策略和优化MCP安全机制提供了实证基础与理论支撑。
链接: https://arxiv.org/abs/2508.12538
作者: Yongjian Guo,Puzhuo Liu,Wanlun Ma,Zehang Deng,Xiaogang Zhu,Peng Di,Xi Xiao,Sheng Wen
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
备注:
Abstract:The Model Context Protocol (MCP) has emerged as a universal standard that enables AI agents to seamlessly connect with external tools, significantly enhancing their functionality. However, while MCP brings notable benefits, it also introduces significant vulnerabilities, such as Tool Poisoning Attacks (TPA), where hidden malicious instructions exploit the sycophancy of large language models (LLMs) to manipulate agent behavior. Despite these risks, current academic research on MCP security remains limited, with most studies focusing on narrow or qualitative analyses that fail to capture the diversity of real-world threats. To address this gap, we present the MCP Attack Library (MCPLIB), which categorizes and implements 31 distinct attack methods under four key classifications: direct tool injection, indirect tool injection, malicious user attacks, and LLM inherent attack. We further conduct a quantitative analysis of the efficacy of each attack. Our experiments reveal key insights into MCP vulnerabilities, including agents’ blind reliance on tool descriptions, sensitivity to file-based attacks, chain attacks exploiting shared context, and difficulty distinguishing external data from executable commands. These insights, validated through attack experiments, underscore the urgency for robust defense strategies and informed MCP design. Our contributions include 1) constructing a comprehensive MCP attack taxonomy, 2) introducing a unified attack framework MCPLIB, and 3) conducting empirical vulnerability analysis to enhance MCP security mechanisms. This work provides a foundational framework, supporting the secure evolution of MCP ecosystems.
zh
[AI-47] Defining and Benchmarking a Data-Centric Design Space for Brain Graph Construction
【速读】:该论文旨在解决当前基于功能磁共振成像(fMRI)构建脑图(brain graphs)时,普遍依赖僵化流程而忽视数据驱动型决策的问题。其核心挑战在于如何系统性地优化脑图构造过程中的关键数据处理步骤,以提升下游图机器学习任务(如分类)的性能。解决方案的关键在于从数据为中心(Data-Centric AI)视角出发,将脑图构建划分为三个阶段:时间信号处理、拓扑提取和图特征化,并对每个阶段中现有及改进技术的组合进行系统性评估与基准测试。研究表明,通过精心设计这些上游数据处理策略(如高振幅BOLD信号滤波、连接性稀疏化与统一方法、替代相关性度量以及多视图节点/边特征,包括滞后动态信息),可显著优于标准流水线,从而凸显了数据级配置在图神经网络用于神经影像分析中的决定性作用。
链接: https://arxiv.org/abs/2508.12533
作者: Qinwen Ge,Roza G. Bayrak,Anwar Said,Catie Chang,Xenofon Koutsoukos,Tyler Derr
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:The construction of brain graphs from functional Magnetic Resonance Imaging (fMRI) data plays a crucial role in enabling graph machine learning for neuroimaging. However, current practices often rely on rigid pipelines that overlook critical data-centric choices in how brain graphs are constructed. In this work, we adopt a Data-Centric AI perspective and systematically define and benchmark a data-centric design space for brain graph construction, constrasting with primarily model-centric prior work. We organize this design space into three stages: temporal signal processing, topology extraction, and graph featurization. Our contributions lie less in novel components and more in evaluating how combinations of existing and modified techniques influence downstream performance. Specifically, we study high-amplitude BOLD signal filtering, sparsification and unification strategies for connectivity, alternative correlation metrics, and multi-view node and edge features, such as incorporating lagged dynamics. Experiments on the HCP1200 and ABIDE datasets show that thoughtful data-centric configurations consistently improve classification accuracy over standard pipelines. These findings highlight the critical role of upstream data decisions and underscore the importance of systematically exploring the data-centric design space for graph-based neuroimaging. Our code is available at this https URL.
zh
[AI-48] Rethinking Safety in LLM Fine-tuning: An Optimization Perspective
【速读】:该论文试图解决语言模型微调过程中不可避免地损害安全性的问题,即模型在面对有害用户请求时拒绝响应的能力下降,即使使用无害数据集进行微调也是如此。研究表明,这种安全问题并非由固有权衡所致,而是源于不良的优化选择,如学习率、批量大小和梯度步骤等超参数设置不当。解决方案的关键在于通过合理选择这些训练超参数,将有害响应比例从16%降低至约5%,同时保持模型效用性能;进一步提出一种基于参数空间的指数移动平均(Exponential Moving Average, EMA)动量技术,通过构建稳定的优化路径来保留预训练模型的安全特性,从而在不依赖额外安全数据的情况下有效维持模型的安全性和性能。
链接: https://arxiv.org/abs/2508.12531
作者: Minseon Kim,Jin Myung Kwak,Lama Alssum,Bernard Ghanem,Philip Torr,David Krueger,Fazl Barez,Adel Bibi
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Fine-tuning language models is commonly believed to inevitably harm their safety, i.e., refusing to respond to harmful user requests, even when using harmless datasets, thus requiring additional safety measures. We challenge this belief through systematic testing, showing that poor optimization choices, rather than inherent trade-offs, often cause safety problems, measured as harmful responses to adversarial prompts. By properly selecting key training hyper-parameters, e.g., learning rate, batch size, and gradient steps, we reduce unsafe model responses from 16% to approximately 5%, as measured by keyword matching, while maintaining utility performance. Based on this observation, we propose a simple exponential moving average (EMA) momentum technique in parameter space that preserves safety performance by creating a stable optimization path and retains the original pre-trained model’s safety properties. Our experiments on the Llama families across multiple datasets (Dolly, Alpaca, ORCA) demonstrate that safety problems during fine-tuning can largely be avoided without specialized interventions, outperforming existing approaches that require additional safety data while offering practical guidelines for maintaining both model performance and safety during adaptation.
zh
[AI-49] Root Cause Analysis of Hydrogen Bond Separation in Spatio-Temporal Molecular Dynamics using Causal Models
【速读】:该论文旨在解决分子动力学模拟(Molecular Dynamics Simulations, MDS)中难以自动识别氢键形成与断裂事件的根本原因问题,即如何从复杂的时空数据中挖掘出导致氢键动态变化的关键因果变量。传统方法依赖人工分析轨迹数据,效率低且缺乏对机制的理解。其解决方案的关键在于引入基于因果建模的框架:将氢键的分离视为一种“干预”(intervention),并利用受变分自编码器启发的架构构建图形化因果模型,从而在不同样本间推断共享的动力学信息和潜在的因果结构;进一步通过分析因果模型联合分布的变化,识别驱动系统状态演变的根因变量。该方法实现了对未来多步氢键行为的预测,并揭示了影响分子相互作用条件分布变化的核心因素,为分子动态系统的根因分析提供了新视角。
链接: https://arxiv.org/abs/2508.12500
作者: Rahmat K. Adesunkanmi,Ashfaq Khokhar,Goce Trajcevski,Sohail Murad
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
备注: Submitted to ACM
Abstract:Molecular dynamics simulations (MDS) face challenges, including resource-heavy computations and the need to manually scan outputs to detect “interesting events,” such as the formation and persistence of hydrogen bonds between atoms of different molecules. A critical research gap lies in identifying the underlying causes of hydrogen bond formation and separation -understanding which interactions or prior events contribute to their emergence over time. With this challenge in mind, we propose leveraging spatio-temporal data analytics and machine learning models to enhance the detection of these phenomena. In this paper, our approach is inspired by causal modeling and aims to identify the root cause variables of hydrogen bond formation and separation events. Specifically, we treat the separation of hydrogen bonds as an “intervention” occurring and represent the causal structure of the bonding and separation events in the MDS as graphical causal models. These causal models are built using a variational autoencoder-inspired architecture that enables us to infer causal relationships across samples with diverse underlying causal graphs while leveraging shared dynamic information. We further include a step to infer the root causes of changes in the joint distribution of the causal models. By constructing causal models that capture shifts in the conditional distributions of molecular interactions during bond formation or separation, this framework provides a novel perspective on root cause analysis in molecular dynamic systems. We validate the efficacy of our model empirically on the atomic trajectories that used MDS for chiral separation, demonstrating that we can predict many steps in the future and also find the variables driving the observed changes in the system.
zh
[AI-50] Advanced DOA Regulation with a Whale-Optimized Fractional Order Fuzzy PID Framework
【速读】:该论文旨在解决麻醉过程中患者脑电双频谱指数(Bispectral Index, BIS)控制精度不足的问题,以实现更精准、个性化的麻醉深度管理。其解决方案的关键在于提出一种基于鲸鱼优化算法(Whale Optimization Algorithm, WOA)优化的分数阶模糊PID(Fractional Order Fuzzy PID, FOFPID)控制器,该控制器融合了模糊逻辑的自适应能力与分数阶微积分的精细调参特性,能够根据个体生理差异动态调整控制参数,并通过WOA对分数阶阶次及模糊隶属函数进行全局优化,从而显著提升系统响应速度和稳态精度,验证表明其在八名不同患者模型上均优于传统分数阶PID控制器。
链接: https://arxiv.org/abs/2508.12487
作者: Lida Shahbandari,Hossein Mohseni
机构: 未知
类目: Artificial Intelligence (cs.AI); Systems and Control (eess.SY)
备注:
Abstract:This study introduces a Fractional Order Fuzzy PID (FOFPID) controller that uses the Whale Optimization Algorithm (WOA) to manage the Bispectral Index (BIS), keeping it within the ideal range of forty to sixty. The FOFPID controller combines fuzzy logic for adapting to changes and fractional order dynamics for fine tuning. This allows it to adjust its control gains to handle a person’s unique physiology. The WOA helps fine tune the controller’s parameters, including the fractional orders and the fuzzy membership functions, which boosts its performance. Tested on models of eight different patient profiles, the FOFPID controller performed better than a standard Fractional Order PID (FOPID) controller. It achieved faster settling times, at two and a half minutes versus three point two minutes, and had a lower steady state error, at zero point five versus one point two. These outcomes show the FOFPID’s excellent strength and accuracy. It offers a scalable, artificial intelligence driven solution for automated anesthesia delivery that could enhance clinical practice and improve patient results.
zh
[AI-51] Cold-RL: Learning Cache Eviction with Offline Reinforcement Learning for NGINX
【速读】:该论文旨在解决Web代理(如NGINX)中传统最近最少使用(Least Recently Used, LRU)缓存淘汰策略在面对周期性突发流量和混合对象大小时易发生缓存抖动(thrash)的问题。其核心解决方案是提出Cold-RL,一种基于深度强化学习的缓存淘汰策略,通过在NGINX中集成一个轻量级的双深度Q网络(dueling Deep Q-Network),以ONNX格式部署为sidecar服务,并在严格微秒级延迟预算内完成决策。该方法在每次淘汰时采样K个最久未使用的对象,提取六种轻量特征(包括年龄、大小、命中次数、到达间隔时间、剩余TTL及最后源服务器响应时间),并输出待淘汰对象的位掩码;若推理超时则立即回退至原生LRU。训练阶段利用NGINX访问日志离线模拟缓存行为,奖励函数设计为:若保留对象在TTL过期前被再次命中,则获得1分。实验表明,Cold-RL在不同缓存容量下显著提升命中率,且推理开销低于2%,满足严格的SLO要求。
链接: https://arxiv.org/abs/2508.12485
作者: Aayush Gupta,Arpit Bhayani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Databases (cs.DB); Networking and Internet Architecture (cs.NI)
备注: 8 pages, 4 figures (system architecture, eviction path, training pipeline, and DQN algorithm), 2 tables. Code available at this https URL
Abstract:Web proxies such as NGINX commonly rely on least-recently-used (LRU) eviction, which is size agnostic and can thrash under periodic bursts and mixed object sizes. We introduce Cold-RL, a learned eviction policy for NGINX that replaces LRU’s forced-expire path with a dueling Deep Q-Network served by an ONNX sidecar within a strict microsecond budget. On each eviction, Cold-RL samples the K least-recently-used objects, extracts six lightweight features (age, size, hit count, inter-arrival time, remaining TTL, and last origin RTT), and requests a bitmask of victims; a hard timeout of 500 microseconds triggers immediate fallback to native LRU. Policies are trained offline by replaying NGINX access logs through a cache simulator with a simple reward: a retained object earns one point if it is hit again before TTL expiry. We compare against LRU, LFU, size-based, adaptive LRU, and a hybrid baseline on two adversarial workloads. With a 25 MB cache, Cold-RL raises hit ratio from 0.1436 to 0.3538, a 146 percent improvement over the best classical baseline; at 100 MB, from 0.7530 to 0.8675, a 15 percent gain; and at 400 MB it matches classical methods (about 0.918). Inference adds less than 2 percent CPU overhead and keeps 95th percentile eviction latency within budget. To our knowledge, this is the first reinforcement learning eviction policy integrated into NGINX with strict SLOs.
zh
[AI-52] he Yokai Learning Environment: Tracking Beliefs Over Space and Time IJCAI2025
【速读】:该论文旨在解决当前协作式人工智能(Collaborative AI)发展中缺乏有效理论心智(Theory of Mind, ToM)能力的问题,特别是现有基准测试在被动观察场景下受限,且未评估智能体如何随时间建立和维持共同认知基础(common ground)。解决方案的关键在于提出Yokai学习环境(Yokai Learning Environment, YLE),这是一个基于合作纸牌游戏Yokai的多智能体强化学习(Multi-agent Reinforcement Learning, MARL)环境,要求智能体通过轮流查看隐藏卡片、基于颜色聚类以及利用提示进行接地性沟通来协同完成任务。YLE设计强调对动态信念追踪、记忆保持、伙伴泛化和高阶ToM能力的评估,从而推动对协作AI中核心认知机制的研究。
链接: https://arxiv.org/abs/2508.12480
作者: Constantin Ruhdorfer,Matteo Bortoletto,Andreas Bulling
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: Presented at the the ToM IJCAI 2025 Workshop
Abstract:Developing collaborative AI hinges on Theory of Mind (ToM) - the ability to reason about the beliefs of others to build and maintain common ground. Existing ToM benchmarks, however, are restricted to passive observer settings or lack an assessment of how agents establish and maintain common ground over time. To address these gaps, we introduce the Yokai Learning Environment (YLE) - a multi-agent reinforcement learning (RL) environment based on the cooperative card game Yokai. In the YLE, agents take turns peeking at hidden cards and moving them to form clusters based on colour. Success requires tracking evolving beliefs, remembering past observations, using hints as grounded communication, and maintaining common ground with teammates. Our evaluation yields two key findings: First, current RL agents struggle to solve the YLE, even when given access to perfect memory. Second, while belief modelling improves performance, agents are still unable to effectively generalise to unseen partners or form accurate beliefs over longer games, exposing a reliance on brittle conventions rather than robust belief tracking. We use the YLE to investigate research questions in belief modelling, memory, partner generalisation, and scaling to higher-order ToM.
zh
[AI-53] GALA: Can Graph-Augmented Large Language Model Agent ic Workflows Elevate Root Cause Analysis?
【速读】:该论文旨在解决微服务系统中根因分析(Root Cause Analysis, RCA)的挑战,即在异构遥测数据(如指标、日志和追踪)中快速准确地诊断故障,并提供可操作的修复建议。传统方法通常仅依赖单一模态数据或仅对可疑服务进行排序,难以生成具有因果逻辑且具备指导意义的诊断结果。其解决方案的关键在于提出了一种名为GALA的多模态框架,融合统计因果推断与大语言模型(Large Language Model, LLM)驱动的迭代推理机制,从而在提升根因识别准确性的同时,生成符合因果关系且人类可理解的修复建议,有效弥合自动化诊断与实际故障处理之间的鸿沟。
链接: https://arxiv.org/abs/2508.12472
作者: Yifang Tian,Yaming Liu,Zichun Chong,Zihang Huang,Hans-Arno Jacobsen
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages, 5 figures
Abstract:Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single modalities or merely rank suspect services, falling short of providing actionable diagnostic insights with remediation guidance. This paper introduces GALA, a novel multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning for enhanced RCA. Evaluated on an open-source benchmark, GALA achieves substantial improvements over state-of-the-art methods of up to 42.22% accuracy. Our novel human-guided LLM evaluation score shows GALA generates significantly more causally sound and actionable diagnostic outputs than existing methods. Through comprehensive experiments and a case study, we show that GALA bridges the gap between automated failure diagnosis and practical incident resolution by providing both accurate root cause identification and human-interpretable remediation guidance.
zh
[AI-54] A Robust Cross-Domain IDS using BiGRU-LSTM-Attention for Medical and Industrial IoT Security
【速读】:该论文旨在解决日益增长的医疗物联网(IoMT)和工业物联网(IIoT)互联所带来的复杂网络安全挑战,这些挑战威胁到敏感数据、患者安全及工业运营。解决方案的关键在于提出一种基于Transformer架构的新型入侵检测系统(IDS),名为BiGAT-ID,其核心创新是融合双向门控循环单元(BiGRU)、长短期记忆网络(LSTM)与多头注意力机制(MHA),从而有效捕捉双向时序依赖关系、建模序列模式并增强上下文特征表示能力。实验表明,该模型在CICIoMT2024和EdgeIIoTset两个基准数据集上分别达到99.13%和99.34%的检测准确率,并具备极低的推理延迟(每实例0.0002秒和0.0001秒),展现出跨域鲁棒性与高运行效率,适用于真实异构物联网环境中的部署。
链接: https://arxiv.org/abs/2508.12470
作者: Afrah Gueriani,Hamza Kheddar,Ahmed Cherif Mazari,Mohamed Chahine Ghanem
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注: 10 pages
Abstract:The increased Internet of Medical Things IoMT and the Industrial Internet of Things IIoT interconnectivity has introduced complex cybersecurity challenges, exposing sensitive data, patient safety, and industrial operations to advanced cyber threats. To mitigate these risks, this paper introduces a novel transformer-based intrusion detection system IDS, termed BiGAT-ID a hybrid model that combines bidirectional gated recurrent units BiGRU, long short-term memory LSTM networks, and multi-head attention MHA. The proposed architecture is designed to effectively capture bidirectional temporal dependencies, model sequential patterns, and enhance contextual feature representation. Extensive experiments on two benchmark datasets, CICIoMT2024 medical IoT and EdgeIIoTset industrial IoT demonstrate the model’s cross-domain robustness, achieving detection accuracies of 99.13 percent and 99.34 percent, respectively. Additionally, the model exhibits exceptional runtime efficiency, with inference times as low as 0.0002 seconds per instance in IoMT and 0.0001 seconds in IIoT scenarios. Coupled with a low false positive rate, BiGAT-ID proves to be a reliable and efficient IDS for deployment in real-world heterogeneous IoT environments
zh
[AI-55] actile Gesture Recognition with Built-in Joint Sensors for Industrial Robots
【速读】:该论文旨在解决人机协作(Human-Robot Collaboration, HRC)中手势识别依赖外部传感器(如视觉系统或机器人皮肤)的问题,提出了一种仅利用机器人内置关节传感器进行触觉感知与手势识别的方案。其关键在于采用基于短时傅里叶变换(Short-Time Fourier Transform, STFT)的频谱图(spectrogram)表示方法作为输入特征,显著提升了识别准确率,并在不同机器人姿态下展现出良好的泛化能力;实验表明,使用STFT2DCNN和STT3DCNN两种卷积神经网络架构,在Franka Emika Research机器人平台上实现了超过95%的接触检测与手势分类准确率,验证了无需外部传感器的触觉识别可行性。
链接: https://arxiv.org/abs/2508.12435
作者: Deqing Song,Weimin Yang,Maryam Rezayati,Hans Wernher van de Venn
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:While gesture recognition using vision or robot skins is an active research area in Human-Robot Collaboration (HRC), this paper explores deep learning methods relying solely on a robot’s built-in joint sensors, eliminating the need for external sensors. We evaluated various convolutional neural network (CNN) architectures and collected two datasets to study the impact of data representation and model architecture on the recognition accuracy. Our results show that spectrogram-based representations significantly improve accuracy, while model architecture plays a smaller role. We also tested generalization to new robot poses, where spectrogram-based models performed better. Implemented on a Franka Emika Research robot, two of our methods, STFT2DCNN and STT3DCNN, achieved over 95% accuracy in contact detection and gesture classification. These findings demonstrate the feasibility of external-sensor-free tactile recognition and promote further research toward cost-effective, scalable solutions for HRC.
zh
[AI-56] fCrit: A Visual Explanation System for Furniture Design Creative Support
【速读】:该论文旨在解决生成式 AI 在家具设计领域中缺乏可解释性且难以与用户认知方式对齐的问题,尤其是在创意实践中如何实现人本导向的可解释人工智能(Human-Centered Explainable AI, HCXAI)。解决方案的关键在于提出 fCrit 系统,其基于反思学习和形式化分析构建多智能体架构,并依托结构化的设计知识库,通过对话式交互将解释内容动态适配至用户的术语体系与认知框架,从而实现情境化、对话式且视觉锚定的 AI 支持机制。
链接: https://arxiv.org/abs/2508.12416
作者: Vuong Nguyen,Gabriel Vigliensoni
机构: 未知
类目: Human-Computer Interaction (cs.HC); Artificial Intelligence (cs.AI)
备注: In Proceedings of Explainable AI for the Arts Workshop 2025 (XAIxArts 2025) arXiv:2406.14485
Abstract:We introduce fCrit, a dialogue-based AI system designed to critique furniture design with a focus on explainability. Grounded in reflective learning and formal analysis, fCrit employs a multi-agent architecture informed by a structured design knowledge base. We argue that explainability in the arts should not only make AI reasoning transparent but also adapt to the ways users think and talk about their designs. We demonstrate how fCrit supports this process by tailoring explanations to users’ design language and cognitive framing. This work contributes to Human-Centered Explainable AI (HCXAI) in creative practice, advancing domain-specific methods for situated, dialogic, and visually grounded AI support.
zh
[AI-57] LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems
【速读】:该论文旨在解决多智能体系统(Multi-Agent Systems, MASs)在引入大语言模型(Large Language Models, LLMs)后所面临的可观测性(observability)挑战,尤其是现有框架通常仅关注单个智能体而忽视整个系统的故障检测与诊断问题。解决方案的关键在于提出LumiMAS这一新型MAS可观测性框架,其核心由三层组成:监控与日志层、异常检测层和异常解释层;该框架通过实时记录智能体活动日志、在MAS工作流中进行异常检测,并对异常进行分类与根因分析(Root Cause Analysis, RCA),从而实现对复杂系统级故障的精准识别与可解释性分析。
链接: https://arxiv.org/abs/2508.12412
作者: Ron Solomon,Yarin Yerushalmi Levi,Lior Vaknin,Eran Aizikovich,Amit Baras,Etai Ohana,Amit Giloni,Shamik Bose,Chiara Picardi,Yuval Elovici,Asaf Shabtai
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:The incorporation of large language models in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, anomaly detection layer, and anomaly explanation layer. LumiMAS’s first layer monitors MAS executions, creating detailed logs of the agents’ activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, and a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS’s effectiveness in failure detection, classification, and RCA.
zh
[AI-58] GraphCogent: Overcoming LLM s Working Memory Constraints via Multi-Agent Collaboration in Complex Graph Understanding
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在处理现实世界复杂图结构时表现不佳的问题,其核心挑战在于LLMs难以有效解析复杂的图拓扑结构并执行多步推理。解决方案的关键在于提出GraphCogent框架,该框架受人类工作记忆模型启发,将图推理分解为感知(sense)、缓冲(buffer)和执行(execute)三个专门的认知过程:通过感知模块对子图采样以标准化多样化的图文本表示,缓冲模块跨多种格式整合与索引图数据,执行模块结合工具调用与模型生成实现高效推理。这一设计显著提升了模型在真实场景下对大规模图的推理能力与效率。
链接: https://arxiv.org/abs/2508.12379
作者: Rongzheng Wang,Qizhi Chen,Yihong Huang,Yizhuo Ma,Muquan Li,Jiakai Li,Ke Qin,Guangchun Luo,Shuang Liang
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Large language models (LLMs) show promising performance on small-scale graph reasoning tasks but fail when handling real-world graphs with complex queries. This phenomenon stems from LLMs’ inability to effectively process complex graph topology and perform multi-step reasoning simultaneously. To address these limitations, we propose GraphCogent, a collaborative agent framework inspired by human Working Memory Model that decomposes graph reasoning into specialized cognitive processes: sense, buffer, and execute. The framework consists of three modules: Sensory Module standardizes diverse graph text representations via subgraph sampling, Buffer Module integrates and indexes graph data across multiple formats, and Execution Module combines tool calling and model generation for efficient reasoning. We also introduce Graph4real, a comprehensive benchmark contains with four domains of real-world graphs (Web, Social, Transportation, and Citation) to evaluate LLMs’ graph reasoning capabilities. Our Graph4real covers 21 different graph reasoning tasks, categorized into three types (Structural Querying, Algorithmic Reasoning, and Predictive Modeling tasks), with graph scales that are 10 times larger than existing benchmarks. Experiments show that Llama3.1-8B based GraphCogent achieves a 50% improvement over massive-scale LLMs like DeepSeek-R1 (671B). Compared to state-of-the-art agent-based baseline, our framework outperforms by 20% in accuracy while reducing token usage by 80% for in-toolset tasks and 30% for out-toolset tasks. Code will be available after review.
zh
[AI-59] Hierarchical knowledge guided fault intensity diagnosis of complex industrial systems
【速读】:该论文旨在解决故障强度诊断(Fault Intensity Diagnosis, FID)中现有方法依赖链式思维(chain of thought)而忽视目标类别间依赖关系的问题。其核心解决方案是提出一种受“思维树”(Tree of Thought)启发的分层知识引导故障强度诊断框架(Hierarchical Knowledge Guided Fault Intensity Diagnosis Framework, HKG),通过图卷积网络(Graph Convolutional Networks, GCNs)将类别表示的分层拓扑结构映射为一组相互关联的全局分层分类器,其中每个节点由类别的词嵌入(word embeddings)表示,并与表示学习提取的深层特征联合训练,实现端到端可学习。关键创新在于设计了一种重加权分层知识相关矩阵(Re-weighted Hierarchical Knowledge Correlation Matrix, Re-HKCM),将类间层次知识嵌入数据驱动的统计相关矩阵(Statistical Correlation Matrix, SCM)中,有效指导图卷积神经网络中节点的信息共享,缓解过平滑(over-smoothing)问题,从而提升诊断性能。
链接: https://arxiv.org/abs/2508.12375
作者: Yu Sha,Shuiping Gou,Bo Liu,Johannes Faber,Ningtao Liu,Stefan Schramm,Horst Stoecker,Thomas Steckenreiter,Domagoj Vnucec,Nadine Wetzstein,Andreas Widl,Kai Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 12 pages
Abstract:Fault intensity diagnosis (FID) plays a pivotal role in monitoring and maintaining mechanical devices within complex industrial systems. As current FID methods are based on chain of thought without considering dependencies among target classes. To capture and explore dependencies, we propose a hierarchical knowledge guided fault intensity diagnosis framework (HKG) inspired by the tree of thought, which is amenable to any representation learning methods. The HKG uses graph convolutional networks to map the hierarchical topological graph of class representations into a set of interdependent global hierarchical classifiers, where each node is denoted by word embeddings of a class. These global hierarchical classifiers are applied to learned deep features extracted by representation learning, allowing the entire model to be end-to-end learnable. In addition, we develop a re-weighted hierarchical knowledge correlation matrix (Re-HKCM) scheme by embedding inter-class hierarchical knowledge into a data-driven statistical correlation matrix (SCM) which effectively guides the information sharing of nodes in graphical convolutional neural networks and avoids over-smoothing issues. The Re-HKCM is derived from the SCM through a series of mathematical transformations. Extensive experiments are performed on four real-world datasets from different industrial domains (three cavitation datasets from SAMSON AG and one existing publicly) for FID, all showing superior results and outperform recent state-of-the-art FID methods.
zh
[AI-60] Navigating the Exploration-Exploitation Tradeoff in Inference-Time Scaling of Diffusion Models
【速读】:该论文旨在解决扩散模型(diffusion models)在推理阶段应用序贯蒙特卡洛(Sequential Monte Carlo, SMC)方法时面临的探索-利用权衡难题:早期噪声样本虽具高改进潜力但难以准确评估,而晚期样本虽可可靠评估却几乎不可逆。解决方案的关键在于从搜索算法角度出发,提出两种针对扩散模型独特生成动态和相变行为设计的策略——漏斗调度(Funnel Schedule)与自适应温度(Adaptive Temperature)。通过逐步减少维持的粒子数量并降低早期奖励的影响权重,该方法在不增加噪声函数评估次数的前提下显著提升样本质量。
链接: https://arxiv.org/abs/2508.12361
作者: Xun Su,Jianming Huang,Yang Yusen,Zhongxi Fang,Hiroyuki Kasai
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Statistics Theory (math.ST)
备注:
Abstract:Inference-time scaling has achieved remarkable success in language models, yet its adaptation to diffusion models remains underexplored. We observe that the efficacy of recent Sequential Monte Carlo (SMC)-based methods largely stems from globally fitting the The reward-tilted distribution, which inherently preserves diversity during multi-modal search. However, current applications of SMC to diffusion models face a fundamental dilemma: early-stage noise samples offer high potential for improvement but are difficult to evaluate accurately, whereas late-stage samples can be reliably assessed but are largely irreversible. To address this exploration-exploitation trade-off, we approach the problem from the perspective of the search algorithm and propose two strategies: Funnel Schedule and Adaptive Temperature. These simple yet effective methods are tailored to the unique generation dynamics and phase-transition behavior of diffusion models. By progressively reducing the number of maintained particles and down-weighting the influence of early-stage rewards, our methods significantly enhance sample quality without increasing the total number of Noise Function Evaluations. Experimental results on multiple benchmarks and state-of-the-art text-to-image diffusion models demonstrate that our approach outperforms previous baselines.
zh
[AI-61] Uncovering Systematic Failures of LLM s in Verifying Code Against Natural Language Specifications
【速读】:该论文旨在解决大语言模型(Large Language Models, LLMs)在评估代码是否符合自然语言需求说明时存在的系统性可靠性问题。研究发现,即使代码功能正确,LLMs 仍会频繁将其误判为不满足要求或存在潜在缺陷,且更复杂的提示工程策略(如引入解释和修正建议)反而加剧了误判率,暴露出当前LLM作为自动化代码审查工具的重大局限性。解决方案的关键在于识别并缓解此类误判的根本原因,并提出两种改进的提示策略以提升LLM在任务导向场景下的判断准确性,从而为有效利用LLM进行自动化代码审查提供实证依据与实践指导。
链接: https://arxiv.org/abs/2508.12358
作者: Haolin Jin,Huaming Chen
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted to the NIER track of the 40th IEEE/ACM International Conference on Automated Software Engineering (ASE 2025)
Abstract:Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation and review tasks. Software engineers often rely on LLMs to assess whether system code implementation satisfy task requirements, thereby enhancing code robustness and accuracy. However, it remains unclear whether LLMs can reliably determine whether the code complies fully with the given task descriptions, which is usually natural language specifications. In this paper, we uncover a systematic failure of LLMs in evaluating whether code aligns with natural language requirements. Specifically, with widely used benchmarks, we employ unified prompts to judge code correctness. Our results reveal that LLMs frequently misclassify correct code implementations as either ``not satisfying requirements’’ or containing potential defects. Surprisingly, more complex prompting, especially when leveraging prompt engineering techniques involving explanations and proposed corrections, leads to higher misjudgment rate, which highlights the critical reliability issues in using LLMs as code review assistants. We further analyze the root causes of these misjudgments, and propose two improved prompting strategies for mitigation. For the first time, our findings reveals unrecognized limitations in LLMs to match code with requirements. We also offer novel insights and practical guidance for effective use of LLMs in automated code review and task-oriented agent scenarios.
zh
[AI-62] A Large-Scale Web Search Dataset for Federated Online Learning to Rank CIKM2025
【速读】:该论文旨在解决联邦在线排序学习(Federated Online Learning to Rank, FOLTR)领域中基准测试不具现实性的问题。现有研究多依赖于对经典学习排序数据集的随机划分、模拟用户点击行为以及假设客户端同步参与,这忽略了真实场景中的用户分布异质性、真实交互行为和异步通信等关键因素,导致实验结果难以反映实际性能。其解决方案的关键在于提出AOL4FOLTR数据集——一个包含10,000名用户260万条查询记录的大规模网络搜索数据集,该数据集具备用户标识符、真实点击数据及查询时间戳,从而支持更真实的用户分区策略、行为建模以及异步联邦学习场景,显著提升了FOLTR研究的可验证性和实用性。
链接: https://arxiv.org/abs/2508.12353
作者: Marcel Gregoriadis,Jingwei Kang,Johan Pouwelse
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
备注: Accepted at CIKM 2025
Abstract:The centralized collection of search interaction logs for training ranking models raises significant privacy concerns. Federated Online Learning to Rank (FOLTR) offers a privacy-preserving alternative by enabling collaborative model training without sharing raw user data. However, benchmarks in FOLTR are largely based on random partitioning of classical learning-to-rank datasets, simulated user clicks, and the assumption of synchronous client participation. This oversimplifies real-world dynamics and undermines the realism of experimental results. We present AOL4FOLTR, a large-scale web search dataset with 2.6 million queries from 10,000 users. Our dataset addresses key limitations of existing benchmarks by including user identifiers, real click data, and query timestamps, enabling realistic user partitioning, behavior modeling, and asynchronous federated learning scenarios.
zh
[AI-63] Wisdom of the Crowd: Reinforcement Learning from Coevolutionary Collective Feedback
【速读】:该论文旨在解决强化学习(Reinforcement Learning, RL)在大语言模型(Large Language Models, LLMs)推理能力提升中对昂贵人工标注数据或复杂奖励模型的依赖问题,以及现有自反馈方法因单一模型能力局限导致的过度自信、奖励欺骗和训练崩溃等缺陷。其解决方案的关键在于提出一种基于协同进化集体反馈的强化学习框架(Reinforcement Learning from Coevolutionary Collective Feedback, RLCCF),通过最大化模型群体的集体一致性(Collective Consistency, CC)来优化多模型协同进化:该框架利用多个LLM的多样化输出分布与互补能力,以投票机制生成奖励信号,并根据每个模型的自我一致性(Self-Consistency, SC)得分动态加权其投票权重,从而实现无需外部监督的群体智能演进,显著提升了个体与群体的推理准确率。
链接: https://arxiv.org/abs/2508.12338
作者: Wenzhen Yuan,Shengji Tang,Weihao Lin,Jiacheng Ruan,Ganqu Cui,Bo Zhang,Tao Chen,Ting Liu,Yuzhuo Fu,Peng Ye,Lei Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has significantly enhanced the reasoning capabilities of large language models (LLMs), but its reliance on expensive human-labeled data or complex reward models severely limits scalability. While existing self-feedback methods aim to address this problem, they are constrained by the capabilities of a single model, which can lead to overconfidence in incorrect answers, reward hacking, and even training collapse. To this end, we propose Reinforcement Learning from Coevolutionary Collective Feedback (RLCCF), a novel RL framework that enables multi-model collaborative evolution without external supervision. Specifically, RLCCF optimizes the ability of a model collective by maximizing its Collective Consistency (CC), which jointly trains a diverse ensemble of LLMs and provides reward signals by voting on collective outputs. Moreover, each model’s vote is weighted by its Self-Consistency (SC) score, ensuring that more confident models contribute more to the collective decision. Benefiting from the diverse output distributions and complementary abilities of multiple LLMs, RLCCF enables the model collective to continuously enhance its reasoning ability through coevolution. Experiments on four mainstream open-source LLMs across four mathematical reasoning benchmarks demonstrate that our framework yields significant performance gains, achieving an average relative improvement of 16.72% in accuracy. Notably, RLCCF not only improves the performance of individual models but also enhances the group’s majority-voting accuracy by 4.51%, demonstrating its ability to extend the collective capability boundary of the model collective.
zh
[AI-64] Synchronization Dynamics of Heterogeneous Collaborative Multi-Agent AI Systems
【速读】:该论文旨在解决多智能体人工智能系统(Multi-Agent AI Systems)中协同行为的建模与优化问题,特别是如何在异构智能体之间实现高效协调与同步,以提升系统的可扩展性、适应性和可解释性。其解决方案的关键在于将 Kuramoto 模型(Kuramoto Model)引入多智能体框架,将AI代理视为具有相位和振幅动态的耦合振子,从而捕捉代理间的专业化分工、影响力传递与通信机制;同时通过引入序参量(Order Parameter)量化协调程度,并建立 Chain-of-Thought 提示(Chain-of-Thought Prompting)与同步现象之间的形式对应关系,实现了人类式迭代推理与群体智能涌现的统一建模。该物理启发的方法为设计和分析可扩展的多智能体系统提供了严格的数学基础。
链接: https://arxiv.org/abs/2508.12314
作者: Chiranjit Mitra
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Adaptation and Self-Organizing Systems (nlin.AO)
备注: 9 pages, 6 figures
Abstract:We present a novel interdisciplinary framework that bridges synchronization theory and multi-agent AI systems by adapting the Kuramoto model to describe the collective dynamics of heterogeneous AI agents engaged in complex task execution. By representing AI agents as coupled oscillators with both phase and amplitude dynamics, our model captures essential aspects of agent specialization, influence, and communication within networked systems. We introduce an order parameter to quantify the degree of coordination and synchronization, providing insights into how coupling strength, agent diversity, and network topology impact emergent collective behavior. Furthermore, we formalize a detailed correspondence between Chain-of-Thought prompting in AI reasoning and synchronization phenomena, unifying human-like iterative problem solving with emergent group intelligence. Through extensive simulations on all-to-all and deterministic scale-free networks, we demonstrate that increased coupling promotes robust synchronization despite heterogeneous agent capabilities, reflecting realistic collaborative AI scenarios. Our physics-informed approach establishes a rigorous mathematical foundation for designing, analyzing, and optimizing scalable, adaptive, and interpretable multi-agent AI systems. This work opens pathways for principled orchestration of agentic AI and lays the groundwork for future incorporation of learning dynamics and adaptive network architectures to further enhance system resilience and efficiency.
zh
[AI-65] Mutually Assured Deregulation
【速读】:该论文试图解决的问题是:在全球范围内,各国因担心在人工智能(Artificial Intelligence, AI)领域落后于竞争对手(如中国或美国),而采取“监管牺牲”策略,即通过削弱安全监管来加速AI发展,从而实现国家安全优势。然而,这种做法实际上加剧了全球性风险,而非提升安全。解决方案的关键在于摒弃“监管牺牲”的错误逻辑,转而建立更强有力的监管框架——因为AI能力扩散迅速、过度放松监管反而抑制创新并增加不可控风险(如信息战工具、生物武器民主化及不可控通用人工智能AGI的部署)。唯有通过国际合作与制度化治理,才能实现真正的安全与可持续发展。
链接: https://arxiv.org/abs/2508.12300
作者: Gilad Abiri
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:We have convinced ourselves that the way to make AI safe is to make it unsafe. Since 2022, policymakers worldwide have embraced the Regulation Sacrifice - the belief that dismantling safety oversight will deliver security through AI dominance. Fearing China or USA will gain advantage, nations rush to eliminate safeguards that might slow progress. This Essay reveals the fatal flaw: though AI poses national security challenges, the solution demands stronger regulatory frameworks, not weaker ones. A race without guardrails breeds shared danger, not competitive strength. The Regulation Sacrifice makes three false promises. First, it promises durable technological leads. But AI capabilities spread rapidly - performance gaps between U.S. and Chinese systems collapsed from 9 percent to 2 percent in thirteen months. When advantages evaporate in months, sacrificing permanent safety for temporary speed makes no sense. Second, it promises deregulation accelerates innovation. The opposite often proves true. Companies report well-designed governance streamlines development. Investment flows toward regulated markets. Clear rules reduce uncertainty; uncertain liability creates paralysis. Environmental standards did not kill the auto industry; they created Tesla and BYD. Third, enhanced national security through deregulation actually undermines security across all timeframes. Near term: it hands adversaries information warfare tools. Medium term: it democratizes bioweapon capabilities. Long term: it guarantees deployment of uncontrollable AGI systems. The Regulation Sacrifice persists because it serves powerful interests, not security. Tech companies prefer freedom to accountability. Politicians prefer simple stories to complex truths. This creates mutually assured deregulation, where each nation’s sprint for advantage guarantees collective vulnerability. The only way to win is not to play.
zh
[AI-66] HuBERT-VIC: Improving Noise-Robust Automatic Speech Recognition of Speech Foundation Model via Variance-Invariance-Covariance Regularization INTERSPEECH2025
【速读】:该论文旨在解决语音基础模型(Speech Foundation Models, SFMs)在噪声环境下鲁棒性不足的问题,即模型在仅使用干净语音数据训练后,面对噪声语音时性能显著下降。解决方案的关键在于提出HuBERT-VIC模型,通过引入方差、不变性和协方差正则化(Variance, In-variance, and Covariance Regularization, VICReg)目标函数,调整噪声语音表示的统计特性,使模型能够捕捉多样化的声学特征并提升跨不同噪声类型场景下的泛化能力。实验表明,相较于在噪声语音上预训练的基线模型,该方法在LibriSpeech测试集上的相对性能提升达23.3%(test-clean)和13.2%(test-other)。
链接: https://arxiv.org/abs/2508.12292
作者: Hyebin Ahn,Kangwook Jang,Hoirin Kim
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
备注: Accepted at Interspeech 2025
Abstract:Noise robustness in speech foundation models (SFMs) has been a critical challenge, as most models are primarily trained on clean data and experience performance degradation when the models are exposed to noisy speech. To address this issue, we propose HuBERT-VIC, a noise-robust SFM with variance, in-variance, and covariance regularization (VICReg) objectives. These objectives adjust the statistics of noisy speech representations, enabling the model to capture diverse acoustic characteristics and improving the generalization ability across different types of noise. When applied to HuBERT, our model shows relative performance improvements of 23.3% on LibriSpeech test-clean and 13.2% on test-other, compared to the baseline model pre-trained on noisy speech.
zh
[AI-67] RadarQA: Multi-modal Quality Analysis of Weather Radar Forecasts
【速读】:该论文旨在解决天气预报质量分析中传统评分指标在描述能力、可解释性及动态演变理解方面难以满足气象专家需求的问题。解决方案的关键在于提出了一种基于多模态大语言模型(Multi-modal Large Language Models, MLLMs)的雷达预报质量分析方法——RadarQA,其核心创新包括:构建涵盖单帧与序列、评分与评估场景的综合性多模态质量分析任务范式;设计融合人工专家标注与自动启发式规则的混合注释流程,从而建立具有不同难度层级的大规模数据集 RQA-70K;并采用多阶段训练策略迭代优化模型性能。实验表明,RadarQA 在所有评估设置下均优于现有通用 MLLMs,展现出提升天气预报质量分析能力的潜力。
链接: https://arxiv.org/abs/2508.12291
作者: Xuming He,Zhiyuan You,Junchao Gong,Couhua Liu,Xiaoyu Yue,Peiqin Zhuang,Wenlong Zhang,Lei Bai
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Quality analysis of weather forecasts is an essential topic in meteorology. Although traditional score-based evaluation metrics can quantify certain forecast errors, they are still far from meteorological experts in terms of descriptive capability, interpretability, and understanding of dynamic evolution. With the rapid development of Multi-modal Large Language Models (MLLMs), these models become potential tools to overcome the above challenges. In this work, we introduce an MLLM-based weather forecast analysis method, RadarQA, integrating key physical attributes with detailed assessment reports. We introduce a novel and comprehensive task paradigm for multi-modal quality analysis, encompassing both single frame and sequence, under both rating and assessment scenarios. To support training and benchmarking, we design a hybrid annotation pipeline that combines human expert labeling with automated heuristics. With such an annotation method, we construct RQA-70K, a large-scale dataset with varying difficulty levels for radar forecast quality evaluation. We further design a multi-stage training strategy that iteratively improves model performance at each stage. Extensive experiments show that RadarQA outperforms existing general MLLMs across all evaluation settings, highlighting its potential for advancing quality analysis in weather prediction.
zh
[AI-68] “My productivity is boosted but …” Demystifying Users Perception on AI Coding Assistants
【速读】:该论文旨在解决在生成式 AI (Generative AI) 编程助手(如 GitHub Copilot)广泛应用背景下,开发者真实价值诉求与批评点是什么,以及这些反馈揭示了他们在实际软件开发中怎样的需求和期望。其解决方案的关键在于:突破以往在受控或模拟环境中进行观察性研究的局限,转而分析来自 Visual Studio Code Marketplace 的大量一手用户评论,从而捕捉开发者在日常工作中真实的使用体验与观点;通过人工标注 32 个高安装量和高评论量的 AI 编程助手的评论,构建出涵盖功能、上下文感知能力、可定制性及资源效率等方面的用户关切分类体系,进而提炼出五项实用建议以指导 AI 编程助手的优化方向。
链接: https://arxiv.org/abs/2508.12285
作者: Yunbo Lyu,Zhou Yang,Jieke Shi,Jianming Chang,Yue Liu,David Lo
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注: 13 pages, Camera-Ready Version that will appear in ASE 2025
Abstract:This paper aims to explore fundamental questions in the era when AI coding assistants like GitHub Copilot are widely adopted: what do developers truly value and criticize in AI coding assistants, and what does this reveal about their needs and expectations in real-world software development? Unlike previous studies that conduct observational research in controlled and simulated environments, we analyze extensive, first-hand user reviews of AI coding assistants, which capture developers’ authentic perspectives and experiences drawn directly from their actual day-to-day work contexts. We identify 1,085 AI coding assistants from the Visual Studio Code Marketplace. Although they only account for 1.64% of all extensions, we observe a surge in these assistants: over 90% of them are released within the past two years. We then manually analyze the user reviews sampled from 32 AI coding assistants that have sufficient installations and reviews to construct a comprehensive taxonomy of user concerns and feedback about these assistants. We manually annotate each review’s attitude when mentioning certain aspects of coding assistants, yielding nuanced insights into user satisfaction and dissatisfaction regarding specific features, concerns, and overall tool performance. Built on top of the findings-including how users demand not just intelligent suggestions but also context-aware, customizable, and resource-efficient interactions-we propose five practical implications and suggestions to guide the enhancement of AI coding assistants that satisfy user needs.
zh
[AI-69] CRoC: Context Refactoring Contrast for Graph Anomaly Detection with Limited Supervision ECAI2025
【速读】:该论文旨在解决图异常检测(Graph Anomaly Detection, GAD)中因标注数据稀缺而导致模型训练困难的问题,尤其是在异常样本稀少、标注成本高且可能主动伪装以逃避检测的场景下。解决方案的关键在于提出Context Refactoring Contrast (CRoC) 框架,其核心创新包括:利用GAD固有的类别不平衡特性重构节点上下文,通过重组节点属性并保持交互模式生成增强图;分离编码异质关系并在消息传递过程中融合,提升模型对复杂交互语义的捕捉能力;同时结合对比学习范式,在有限标签条件下有效利用大量未标记数据,从而生成更具判别力的节点嵌入,显著增强模型对隐蔽异常的识别鲁棒性。
链接: https://arxiv.org/abs/2508.12278
作者: Siyue Xie,Da Sun Handason Tam,Wing Cheong Lau
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Accepted by ECAI 2025
Abstract:Graph Neural Networks (GNNs) are widely used as the engine for various graph-related tasks, with their effectiveness in analyzing graph-structured data. However, training robust GNNs often demands abundant labeled data, which is a critical bottleneck in real-world applications. This limitation severely impedes progress in Graph Anomaly Detection (GAD), where anomalies are inherently rare, costly to label, and may actively camouflage their patterns to evade detection. To address these problems, we propose Context Refactoring Contrast (CRoC), a simple yet effective framework that trains GNNs for GAD by jointly leveraging limited labeled and abundant unlabeled data. Different from previous works, CRoC exploits the class imbalance inherent in GAD to refactor the context of each node, which builds augmented graphs by recomposing the attributes of nodes while preserving their interaction patterns. Furthermore, CRoC encodes heterogeneous relations separately and integrates them into the message-passing process, enhancing the model’s capacity to capture complex interaction semantics. These operations preserve node semantics while encouraging robustness to adversarial camouflage, enabling GNNs to uncover intricate anomalous cases. In the training stage, CRoC is further integrated with the contrastive learning paradigm. This allows GNNs to effectively harness unlabeled data during joint training, producing richer, more discriminative node embeddings. CRoC is evaluated on seven real-world GAD datasets with varying scales. Extensive experiments demonstrate that CRoC achieves up to 14% AUC improvement over baseline GNNs and outperforms state-of-the-art GAD methods under limited-label settings.
zh
[AI-70] Mantis: A Simulation-Grounded Foundation Model for Disease Forecasting
【速读】:该论文旨在解决新兴传染病暴发或资源匮乏地区疾病预测中存在的局限性,如对特定疾病的依赖、定制化训练需求以及专家调参的复杂性。其解决方案的关键在于提出Mantis——一个完全基于机制模拟(mechanistic simulations)训练的基础模型(foundation model),该模型无需真实世界数据即可实现跨疾病、跨区域和跨结局的即插即用预测能力。Mantis利用超过4亿天的模拟疫情动态数据,涵盖多种病原体、传播方式、干预措施及监测偏差,不仅在6种疾病上超越了39个专家调优模型(包括美国疾病控制与预防中心CDC新冠预测枢纽中的全部模型),还展现出对未见传播机制的新疾病具有泛化能力,且具备机制可解释性(mechanistically interpretable),从而支持公共卫生决策者识别预测背后的潜在驱动因素,并提供长达8周的高精度预测,显著扩展了可用的响应时间窗口。
链接: https://arxiv.org/abs/2508.12260
作者: Carson Dudley,Reiden Magdaleno,Christopher Harding,Ananya Sharma,Emily Martin,Marisa Eisenberg
机构: 未知
类目: Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注: 10 pages, 4 figures
Abstract:Infectious disease forecasting in novel outbreaks or low resource settings has been limited by the need for disease-specific data, bespoke training, and expert tuning. We introduce Mantis, a foundation model trained entirely on mechanistic simulations, which enables out-of-the-box forecasting across diseases, regions, and outcomes, even in settings with limited historical data. Mantis is built on over 400 million simulated days of outbreak dynamics spanning diverse pathogens, transmission modes, interventions, and surveillance artifacts. Despite requiring no real-world data during training, Mantis outperformed 39 expert-tuned models we tested across six diseases, including all models in the CDC’s COVID-19 Forecast Hub. Mantis generalized to novel epidemiological regimes, including diseases with held-out transmission mechanisms, demonstrating that it captures fundamental contagion dynamics. Critically, Mantis is mechanistically interpretable, enabling public health decision-makers to identify the latent drivers behind its predictions. Finally, Mantis delivers accurate forecasts at 8-week horizons, more than doubling the actionable range of most models, enabling proactive public health planning. Together, these capabilities position Mantis as a foundation for next-generation disease forecasting systems: general, interpretable, and deployable where traditional models fail.
zh
[AI-71] Fortifying the Agent ic Web: A Unified Zero-Trust Architecture Against Logic-layer Threats
【速读】:该论文旨在解决Agentic Web(智能体网络)中的安全威胁问题,特别是针对LPCI(Lost or Compromised Credentials and Identity)攻击的防护不足。解决方案的关键在于构建一个统一的安全架构,其核心是基于零信任身份与访问管理(IAM)框架,通过使用去中心化标识符(Decentralized Identifiers, DIDs)和可验证凭证(Verifiable Credentials, VCs)建立丰富且可验证的智能体身份,并由协议无关的Agent Name Service(ANS)实现身份发现。此外,该架构引入多层信任织网(Trust Fabric),包含信任自适应运行环境(Trust-Adaptive Runtime Environments, TARE)、因果链审计(Causal Chain Auditing)以及行为证明的身份动态更新机制,从而在形式化安全模型中提供对LPCI攻击的可证明安全性保障。
链接: https://arxiv.org/abs/2508.12259
作者: Ken Huang,Yasir Mehmood,Hammad Atta,Jerry Huang,Muhammad Zeeshan Baig,Sree Bhargavi Balija
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET)
备注:
Abstract:This paper presents a Unified Security Architecture that fortifies the Agentic Web through a Zero-Trust IAM framework. This architecture is built on a foundation of rich, verifiable agent identities using Decentralized Identifiers (DIDs) and Verifiable Credentials (VCs), with discovery managed by a protocol-agnostic Agent Name Service (ANS). Security is operationalized through a multi-layered Trust Fabric which introduces significant innovations, including Trust-Adaptive Runtime Environments (TARE), Causal Chain Auditing, and Dynamic Identity with Behavioral Attestation. By explicitly linking the LPCI threat to these enhanced architectural countermeasures within a formal security model, we propose a comprehensive and forward-looking blueprint for a secure, resilient, and trustworthy agentic ecosystem. Our formal analysis demonstrates that the proposed architecture provides provable security guarantees against LPCI attacks with bounded probability of success.
zh
[AI-72] Interpreting Time Series Forecasts with LIME and SHAP: A Case Study on the Air Passengers Dataset
【速读】:该论文旨在解决时间序列预测中模型可解释性不足的问题,尤其是在传统统计模型(如ARIMA)与现代机器学习模型(如XGBoost)之间的权衡:前者虽具可解释性但难以捕捉非线性关系,后者虽精度高却常为“黑箱”。其解决方案的关键在于提出一种统一框架,将单变量时间序列转化为无泄漏的监督学习问题,并结合局部可解释模型无关解释(LIME)和SHapley加性解释(SHAP),实现对梯度提升树模型预测结果的后处理解释。该方法确保不破坏时序结构的前提下,识别出影响预测的核心特征(如12个月滞后项和季节编码),从而在保持高预测精度的同时提供可理解的因果依据。
链接: https://arxiv.org/abs/2508.12253
作者: Manish Shukla
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Methodology (stat.ME)
备注:
Abstract:Time-series forecasting underpins critical decisions across aviation, energy, retail and health. Classical autoregressive integrated moving average (ARIMA) models offer interpretability via coefficients but struggle with nonlinearities, whereas tree-based machine-learning models such as XGBoost deliver high accuracy but are often opaque. This paper presents a unified framework for interpreting time-series forecasts using local interpretable model-agnostic explanations (LIME) and SHapley additive exPlanations (SHAP). We convert a univariate series into a leakage-free supervised learning problem, train a gradient-boosted tree alongside an ARIMA baseline and apply post-hoc explainability. Using the Air Passengers dataset as a case study, we show that a small set of lagged features – particularly the twelve-month lag – and seasonal encodings explain most forecast variance. We contribute: (i) a methodology for applying LIME and SHAP to time series without violating chronology; (ii) theoretical exposition of the underlying algorithms; (iii) empirical evaluation with extensive analysis; and (iv) guidelines for practitioners.
zh
[AI-73] STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction
【速读】:该论文旨在解决长时空时间序列预测中复杂长期依赖关系建模效率低的问题,具体挑战包括:1)长期时间序列天然包含多尺度信息,难以高效提取;2)不同空间节点的多尺度时间信息高度相关,难以有效建模。解决方案的关键在于提出两种新型架构——Spatio-Temporal Multiscale Mamba (STM2) 和其增强版本 STM3:STM2 采用多尺度 Mamba 架构高效同步捕捉多尺度时序特征,并引入自适应图因果卷积网络学习复杂的多尺度时空依赖;STM3 进一步采用混合专家(Mixture-of-Experts)架构,结合更稳定的路由策略与因果对比学习策略,显著提升各专家对不同尺度模式的解耦能力与路由平滑性,从而实现更精准的长期预测。
链接: https://arxiv.org/abs/2508.12247
作者: Haolong Chen,Liang Zhang,Zhengyuan Xin,Guangxu Zhu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence includes multiscale information naturally which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose an efficient \textit\textbfSpatio-\textbfTemporal \textbfMultiscale \textbfMamba (STM2) that includes a multiscale Mamba architecture to capture the multiscale information efficiently and simultaneously, and an adaptive graph causal convolution network to learn the complex multiscale spatio-temporal dependency. STM2 includes hierarchical information aggregation for different-scale information that guarantees their distinguishability. To capture diverse temporal dynamics across all spatial nodes more efficiently, we further propose an enhanced version termed \textit\textbfSpatio-\textbfTemporal \textbfMixture of \textbfMultiscale \textbfMamba (STM3) that employs a special Mixture-of-Experts architecture, including a more stable routing strategy and a causal contrastive learning strategy to enhance the scale distinguishability. We prove that STM3 has much better routing smoothness and guarantees the pattern disentanglement for each expert successfully. Extensive experiments on real-world benchmarks demonstrate STM2/STM3’s superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction.
zh
[AI-74] LinkAnchor: An Autonomous LLM -Based Agent for Issue-to-Commit Link Recovery
【速读】:该论文旨在解决软件工程中问题到提交(issue-to-commit)链接恢复的难题,该任务对软件可追溯性和项目管理至关重要,但现有方法在实际应用中仍面临两大挑战:一是大型语言模型(LLM)受限于上下文窗口长度,难以处理长提交历史、大量问题评论和大规模代码库等丰富上下文信息;二是多数方法仅针对单个问题-提交对进行评估,无法高效处理包含数万次提交的真实项目。为应对这些问题,论文提出名为 LinkAnchor 的首个基于 LLM 的自主代理系统,其核心创新在于采用“懒加载”(lazy-access)架构,动态检索最相关的上下文数据以避免超出 token 限制,并通过自动定位目标提交而非穷举所有候选对,显著提升效率与准确性。实验表明,LinkAnchor 在 Hit@1 指标上优于当前最优方法 60–262%,并已开源发布,支持 GitHub 和 Jira 平台,具备良好的扩展性。
链接: https://arxiv.org/abs/2508.12232
作者: Arshia Akhavan,Alireza Hosseinpour,Abbas Heydarnoori,Mehdi Keshani
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注:
Abstract:Issue-to-commit link recovery plays an important role in software traceability and improves project management. However, it remains a challenging task. A study on GitHub shows that only 42.2% of the issues are correctly linked to their commits. This highlights the potential for further development and research in this area. Existing studies have employed various AI/ML-based approaches, and with the recent development of large language models, researchers have leveraged LLMs to tackle this problem. These approaches suffer from two main issues. First, LLMs are constrained by limited context windows and cannot ingest all of the available data sources, such as long commit histories, extensive issue comments, and large code repositories. Second, most methods operate on individual issue-commit pairs; that is, given a single issue-commit pair, they determine whether the commit resolves the issue. This quickly becomes impractical in real-world repositories containing tens of thousands of commits. To address these limitations, we present LinkAnchor, the first autonomous LLM-based agent designed for issue-to-commit link recovery. The lazy-access architecture of LinkAnchor enables the underlying LLM to access the rich context of software, spanning commits, issue comments, and code files, without exceeding the token limit by dynamically retrieving only the most relevant contextual data. Additionally, LinkAnchor is able to automatically pinpoint the target commit rather than exhaustively scoring every possible candidate. Our evaluations show that LinkAnchor outperforms state-of-the-art issue-to-commit link recovery approaches by 60-262% in Hit@1 score across all our case study projects. We also publicly release LinkAnchor as a ready-to-use tool, along with our replication package. LinkAnchor is designed and tested for GitHub and Jira, and is easily extendable to other platforms.
zh
[AI-75] Distribution Matching via Generalized Consistency Models
【速读】:该论文旨在解决生成对抗网络(Generative Adversarial Networks, GANs)在分布匹配任务中训练不稳定、易陷入模式崩溃(mode collapse)的问题。其解决方案的关键在于受连续归一化流(Continuous Normalizing Flow, CNF)中一致性模型(consistency models)的启发,提出一种新的分布匹配目标函数:该目标函数继承了CNF模型中直接优化范数(norm minimization)的优势,从而简化了训练过程并提升稳定性,同时保留了GAN对不同约束条件的适应能力,使模型在保持高效性的同时具备更强的灵活性和鲁棒性。
链接: https://arxiv.org/abs/2508.12222
作者: Sagar Shrestha,Rajesh Shrestha,Tri Nguyen,Subash Timilsina
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent advancement in generative models have demonstrated remarkable performance across various data modalities. Beyond their typical use in data synthesis, these models play a crucial role in distribution matching tasks such as latent variable modeling, domain translation, and domain adaptation. Generative Adversarial Networks (GANs) have emerged as the preferred method of distribution matching due to their efficacy in handling high-dimensional data and their flexibility in accommodating various constraints. However, GANs often encounter challenge in training due to their bi-level min-max optimization objective and susceptibility to mode collapse. In this work, we propose a novel approach for distribution matching inspired by the consistency models employed in Continuous Normalizing Flow (CNF). Our model inherits the advantages of CNF models, such as having a straight forward norm minimization objective, while remaining adaptable to different constraints similar to GANs. We provide theoretical validation of our proposed objective and demonstrate its performance through experiments on synthetic and real-world datasets.
zh
[AI-76] Unlearning at Scale: Implementing the Right to be Forgotten in Large Language Models
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在欧盟《通用数据保护条例》(GDPR)第17条“被遗忘权”框架下的可撤销性问题,即如何实现对特定训练数据的“遗忘”,同时保持模型性能和参数的一致性。其核心解决方案是将“遗忘学习”(unlearning)建模为一个可复现的系统工程问题:通过记录训练过程中的最小元数据(包括微批次ID哈希、随机数生成器种子、学习率、优化器步数计数器及累积边界),并在确定性计算环境(如固定栈空间和确定性内核)下重放训练尾部并仅过滤掉需遗忘的数据集(forget closure),即可获得与仅在保留数据上训练所得的比特级一致的模型和优化器状态。关键创新在于提出三种互补路径以满足延迟和可用性约束:(i) 基于微检查点或密集每步增量的精确回滚,(ii) 冻结主干时的分组适配器删除机制,以及 (iii) 曲率引导的反向更新结合短时保留微调,并通过审计门控机制控制是否升级至精确重放。
链接: https://arxiv.org/abs/2508.12220
作者: Abdullah X
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
备注: Preprint; 2 figures + several tables; includes appendix. Artifact/code link in paper
Abstract:We study the right to be forgotten (GDPR Art. 17) for large language models and frame unlearning as a reproducible systems problem. Our approach treats training as a deterministic program and logs a minimal per-microbatch record (ordered ID hash, RNG seed, learning-rate value, optimizer-step counter, and accumulation boundary). Under a pinned stack and deterministic kernels, replaying the training tail while filtering only the forget closure yields the same parameters as training on the retain set (bit-identical in the training dtype) when preconditions hold. To meet latency and availability constraints, we add complementary paths: (i) exact reverts of recent steps via micro-checkpoints or dense per-step deltas, (ii) cohort-scoped adapter deletion when the base is frozen, and (iii) a curvature-guided anti-update followed by a short retain-tune, audit-gated with escalation to exact replay. We report storage/latency budgets and a toy artifact validating mechanics; in a controlled run that satisfies the preconditions we demonstrate byte-identical equality of model and optimizer states.
zh
[AI-77] ProtTeX-CC: Activating In-Context Learning in Protein LLM via Two-Stage Instruction Compression
【速读】:该论文旨在解决当前蛋白大语言模型ProtTeX在处理多模态蛋白信息时存在的两个关键问题:一是序列与结构token拼接导致输入长度翻倍且破坏残基级对齐,二是受限于训练语料和上下文窗口,模型无法支持少样本情境下的上下文学习(in-context learning, ICL),从而限制了其泛化能力。解决方案的关键在于提出一种轻量级两阶段压缩框架ProtTeX-CC:第一阶段设计联合嵌入压缩机制,在残基层面融合序列与结构表示,将输入长度减半而不损失性能;第二阶段引入自压缩模块,通过可学习投影层将每个完整示例压缩至约16个token的潜在空间,实现平均演示长度从751降至不足16 tokens,整体提示长度压缩比达93.68%。该方法无需修改原模型架构,仅通过参数高效微调(PEFT)和单个可训练投影层即可显著提升少样本场景下的蛋白功能预测性能,域内基准提升2%,跨域数据集提升11%。
链接: https://arxiv.org/abs/2508.12212
作者: Chuanliu Fan,Zicheng Ma,Jun Gao,Nan Yu,Jun Zhang,Ziqiang Cao,Yi Qin Gao,Guohong Fu
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Quantitative Methods (q-bio.QM)
备注:
Abstract:Recent advances in protein large language models, such as ProtTeX, represent both side-chain amino acids and backbone structure as discrete token sequences of residue length. While this design enables unified modeling of multimodal protein information, it suffers from two major limitations: (1) The concatenation of sequence and structure tokens approximately doubles the protein length and breaks the intrinsic residue-level alignment between modalities. (2) Constrained by the training corpus and limited context window, ProtTeX is typically trained on single-protein inputs, rendering it incompatible with in-context learning (ICL) and thus limiting its generalization capability. To address these issues, we propose ProtTeX-CC, a lightweight two-stage compression framework designed to enhance ProtTeX under few-shot settings. We first design a joint embedding compression mechanism that fuses sequence and structure representations at the residue level, effectively reducing the protein input length by half without sacrificing performance. Then we propose a self-compression module that aggregates each full demonstration into the latent space of the last few linguistic tokens, reducing the average demonstration length from 751 tokens to less than 16 tokens. Compared to the original ProtTeX, our self-compression approach achieves a compression ratio of approximately 93.68% in the total prompt length under the 16-shot setting. Without modifying the backbone model, ProtTeX-CC introduces only a small number of additional parameters through PEFT-based tuning in the joint embedding compression stage and a single trainable projection layer in the self-compression stage. Extensive experiments on protein function prediction show that ProtTeX-CC improves performance on the in-domain benchmark by 2%, and generalizes well to the out-of-domain dataset with a performance gain of 11%.
zh
[AI-78] Improving Pre-Trained Vision-Language-Action Policies with Model-Based Search
【速读】:该论文旨在解决预训练视觉-语言-动作(Vision-Language-Action, VLA)模型在零样本部署于分布外(out-of-distribution)场景时产生的行为脆弱性与不安全性问题。其核心解决方案是提出了一种名为“视觉-语言-动作规划搜索”(Vision-Language-Action Planning Search, VLAPS)的新框架,关键在于将基于模型的搜索(model-based search)嵌入到VLA策略的推理过程中:通过利用VLA模型提供的动作先验(action priors)引导改进后的蒙特卡洛树搜索(Monte Carlo Tree Search, MCTS),结合环境模型进行高效探索,从而显著提升任务成功率;该方法不仅降低了对大规模无监督搜索的依赖,还实现了测试时计算资源的可控分配、环境先验知识的有效利用,并可集成传统规划与强化学习技术,最终在多项语言指令驱动的机器人任务中相较纯VLA基线提升高达67个百分点的成功率。
链接: https://arxiv.org/abs/2508.12211
作者: Cyrus Neary,Omar G. Younis,Artur Kuramshin,Ozgur Aslan,Glen Berseth
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Pre-trained vision-language-action (VLA) models offer a promising foundation for generalist robot policies, but often produce brittle behaviours or unsafe failures when deployed zero-shot in out-of-distribution scenarios. We present Vision-Language-Action Planning Search (VLAPS) – a novel framework and accompanying algorithms that embed model-based search into the inference procedure of pre-trained VLA policies to improve their performance on robotic tasks. Specifically, our method biases a modified Monte Carlo Tree Search (MCTS) algorithm – run using a model of the target environment – using action priors defined by the VLA policy. By using VLA-derived abstractions and priors in model-based search, VLAPS efficiently explores language-conditioned robotics tasks whose search spaces would otherwise be intractably large. Conversely, by integrating model-based search with the VLA policy’s inference procedure, VLAPS yields behaviours that are more performant than those obtained by directly following the VLA policy’s action predictions. VLAPS offers a principled framework to: i) control test-time compute in VLA models, ii) leverage a priori knowledge of the robotic environment, and iii) integrate established planning and reinforcement learning techniques into the VLA inference process. Across all experiments, VLAPS significantly outperforms VLA-only baselines on language-specified tasks that would otherwise be intractable for uninformed search algorithms, increasing success rates by as much as 67 percentage points.
zh
[AI-79] Self-Guided Action Diffusion
【速读】:该论文旨在解决扩散模型(diffusion model)在生成式机器人策略中因推理时搜索动作样本而导致的计算成本过高问题,尤其是在动作样本多样性增加时效率显著下降。其解决方案的关键在于提出了一种自引导动作扩散(self-guided action diffusion)方法,通过在每一步扩散过程中基于先前决策引导提议分布(proposal distribution),从而实现更高效的双向解码(bidirectional decoding)。该方法在保持近似最优性能的同时大幅降低推理开销,在严格采样预算下相较现有方法在动态任务中成功率提升高达70%。
链接: https://arxiv.org/abs/2508.12189
作者: Rhea Malhotra,Yuejiang Liu,Chelsea Finn
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Recent works have shown the promise of inference-time search over action samples for improving generative robot policies. In particular, optimizing cross-chunk coherence via bidirectional decoding has proven effective in boosting the consistency and reactivity of diffusion policies. However, this approach remains computationally expensive as the diversity of sampled actions grows. In this paper, we introduce self-guided action diffusion, a more efficient variant of bidirectional decoding tailored for diffusion-based policies. At the core of our method is to guide the proposal distribution at each diffusion step based on the prior decision. Experiments in simulation tasks show that the proposed self-guidance enables near-optimal performance at negligible inference cost. Notably, under a tight sampling budget, our method achieves up to 70% higher success rates than existing counterparts on challenging dynamic tasks. See project website at this https URL.
zh
[AI-80] RLNVR: Reinforcement Learning from Non-Verified Real-World Rewards
【速读】:该论文旨在解决在真实世界场景中训练语言模型时,因缺乏昂贵且需人工验证的奖励信号(reward signals)而导致的传统强化学习人类反馈(Reinforcement Learning from Human Feedback, RLHF)方法难以应用的问题。其核心解决方案是提出RLNVR(Reinforcement Learning from Non-Verified Rewards)框架,通过基线归一化(baseline normalization)和基于语义相似度的奖励迁移(semantic similarity-based reward transfer),有效利用噪声大但易获取的实际用户反馈(如社交平台上的互动数据),从而提升训练稳定性和内容质量。该方案的关键创新在于将GSPO(Group Sequence Policy Optimization)与可选的UED(Unsupervised Environment Design)课程设计相结合,在无需人工标注奖励的情况下实现更鲁棒、多样化的语言模型优化。
链接: https://arxiv.org/abs/2508.12165
作者: Rohit Krishnan,Jon Evans
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:This paper introduces RLNVR (Reinforcement Learning from Non-Verified Rewards), a framework for training language models using noisy, real-world feedback signals without requiring explicit human verification. Traditional RLHF requires expensive, verified reward signals that are impractical in many real-world domains. RLNVR addresses this challenge through baseline normalization and semantic similarity-based reward transfer. We demonstrate RLNVR through Walter, a prototype system that optimizes social media content generation using actual engagement data from Bluesky. Our experimental results show significant improvements in content quality and training stability, with comprehensive evaluation planned for future work. Positioning: We present a practical framework that combines RLNVR with GSPO (Group Sequence Policy Optimization) and an optional UED (Unsupervised Environment Design) curriculum to improve stability and diversity under noisy, implicit rewards. To our knowledge, combining GSPO-style normalization with a UED-style curriculum for LLM content generation from implicit social engagement has not been previously documented in this applied setting; we frame this as an applied integration rather than a new algorithm.
zh
[AI-81] AICRN: Attention-Integrated Convolutional Residual Network for Interpretable Electrocardiogram Analysis
【速读】:该论文旨在解决传统心电图(ECG)分析中因人为误差导致的注意力分散及检测效率低下的问题,同时提升心脏疾病诊断的精度与可解释性。其解决方案的关键在于提出一种新型深度学习架构——注意力集成卷积残差网络(AICRN),该架构融合了空间和通道注意力机制,以精准定位ECG特征的类型与空间位置,并通过卷积残差网络有效缓解梯度消失和爆炸问题,从而实现对PR间期、QT间期、QRS时限、心率、R波峰值振幅和T波振幅等关键参数的高精度回归分析。
链接: https://arxiv.org/abs/2508.12162
作者: J. M. I. H. Jayakody,A. M. H. H. Alahakoon,C. R. M. Perera,R. M. L. C. Srimal,Roshan Ragel,Vajira Thambawita,Isuru Nawinne
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:The paradigm of electrocardiogram (ECG) analysis has evolved into real-time digital analysis, facilitated by artificial intelligence (AI) and machine learning (ML), which has improved the diagnostic precision and predictive capacity of cardiac diseases. This work proposes a novel deep learning (DL) architecture called the attention-integrated convolutional residual network (AICRN) to regress key ECG parameters such as the PR interval, the QT interval, the QRS duration, the heart rate, the peak amplitude of the R wave, and the amplitude of the T wave for interpretable ECG analysis. Our architecture is specially designed with spatial and channel attention-related mechanisms to address the type and spatial location of the ECG features for regression. The models employ a convolutional residual network to address vanishing and exploding gradient problems. The designed system addresses traditional analysis challenges, such as loss of focus due to human errors, and facilitates the fast and easy detection of cardiac events, thereby reducing the manual efforts required to solve analysis tasks. AICRN models outperform existing models in parameter regression with higher precision. This work demonstrates that DL can play a crucial role in the interpretability and precision of ECG analysis, opening up new clinical applications for cardiac monitoring and management.
zh
[AI-82] MOVER: Multimodal Optimal Transport with Volume-based Embedding Regularization CIKM2025
【速读】:该论文旨在解决多模态学习中现有基于成对对比目标的方法在跨多种模态时难以泛化、且高维嵌入空间缺乏语义结构的问题。其解决方案的关键在于提出MOVER框架,通过结合基于最优传输的软对齐机制与基于体积的几何正则化策略(GAVE),实现模态无关的一致性对齐和结构化的多模态表示学习。该方法在文本-视频-音频检索任务中显著优于现有最先进方法,并展现出对未见模态组合更强的泛化能力和嵌入空间中的结构一致性。
链接: https://arxiv.org/abs/2508.12149
作者: Haochen You,Baojing Liu
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted as a conference paper at CIKM 2025
Abstract:Recent advances in multimodal learning have largely relied on pairwise contrastive objectives to align different modalities, such as text, video, and audio, in a shared embedding space. While effective in bi-modal setups, these approaches struggle to generalize across multiple modalities and often lack semantic structure in high-dimensional spaces. In this paper, we propose MOVER, a novel framework that combines optimal transport-based soft alignment with volume-based geometric regularization to build semantically aligned and structured multimodal representations. By integrating a transport-guided matching mechanism with a geometric volume minimization objective (GAVE), MOVER encourages consistent alignment across all modalities in a modality-agnostic manner. Experiments on text-video-audio retrieval tasks demonstrate that MOVER significantly outperforms prior state-of-the-art methods in both zero-shot and finetuned settings. Additional analysis shows improved generalization to unseen modality combinations and stronger structural consistency in the learned embedding space.
zh
[AI-83] Substituting Proof of Work in Blockchain with Training-Verified Collaborative Model Computation
【速读】:该论文旨在解决比特币工作量证明(Proof of Work, PoW)机制因高能耗和硬件效率低下而引发的可持续性问题。其解决方案的关键在于提出一种混合架构,将传统PoW替换为基于云的集中式协同训练框架:矿工通过贡献计算资源参与横向扩展机器学习模型的分段训练,中央服务器依据参数训练数量与模型损失下降程度评估贡献,并采用加权抽奖机制选出获胜矿工,授予数字签名证书作为区块链记账权凭证。该设计将能源消耗转化为具有社会价值的计算任务,同时借助数字签名与SHA-256哈希保障区块链完整性,实现了安全激励与实际计算进步的对齐。
链接: https://arxiv.org/abs/2508.12138
作者: Mohammad Ishzaz Asif Rafid,Morsalin Sakib
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Bitcoin’s Proof of Work (PoW) mechanism, while central to achieving decentralized consensus, has long been criticized for excessive energy use and hardware inefficiencies \citedevries2018bitcoin, truby2018decarbonizing. This paper introduces a hybrid architecture that replaces Bitcoin’s traditional PoW with a centralized, cloud-based collaborative training framework. In this model, miners contribute computing resources to train segments of horizontally scaled machine learning models on preprocessed datasets, ensuring privacy and generating meaningful outputs \citeli2017securing. A central server evaluates contributions using two metrics: number of parameters trained and reduction in model loss during each cycle. At the end of every cycle, a weighted lottery selects the winning miner, who receives a digitally signed certificate. This certificate serves as a verifiable substitute for PoW and grants the right to append a block to the blockchain \citenakamoto2008bitcoin. By integrating digital signatures and SHA-256 hashing \citenist2015sha, the system preserves blockchain integrity while redirecting energy toward productive computation. The proposed approach addresses the sustainability concerns of traditional mining by converting resource expenditure into socially valuable work, aligning security incentives with real-world computational progress.
zh
[AI-84] Overcoming Knowledge Discrepancies: Structuring Reasoning Threads through Knowledge Balancing in Interactive Scenarios
【速读】:该论文旨在解决当前推理模型在交互式问题求解场景中缺乏显式语义层次结构、用户与领域知识对齐不足,以及缺少有效机制来修剪推理链以提升效率的问题。这些问题导致模型输出冗长且通用,无法引导用户完成目标导向的推理步骤。解决方案的关键在于提出一种受原型启发的两阶段推理链评估(Reasoning-Threads-Evaluation, ReT-Eval)框架:第一阶段利用图神经网络从稀疏领域知识图谱中提取语义相关的知识结构,并融合大语言模型的内在知识以弥合知识差距;第二阶段采用奖励驱动策略对推理链进行评估与修剪,以保持语义连贯性并生成高效推理路径。
链接: https://arxiv.org/abs/2508.12100
作者: Daniel Burkhardt,Xiangwei Cheng
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 13 pages, 1 figure, 6 tables
Abstract:Reasoning in interactive problem solving scenarios requires models to construct reasoning threads that reflect user understanding and align with structured domain knowledge. However, current reasoning models often lack explicit semantic hierarchies, user-domain knowledge alignment, and principled mechanisms to prune reasoning threads for effectiveness. These limitations result in lengthy generic output that does not guide users through goal-oriented reasoning steps. To address this, we propose a prototype-inspired, two-phases Reasoning-Threads-Evaluation (ReT-Eval) framework, drawing inspiration from human-like reasoning strategies that emphasize structured knowledge reuse. In the first phase, semantically relevant knowledge structures are extracted from a sparse domain knowledge graph using a graph neural network and enriched with intrinsic large language model knowledge to resolve knowledge discrepancies. In the second phase, these threads are evaluated and pruned using a reward-guided strategy aimed at maintaining semantic coherence to generate effective reasoning threads. Experiments and expert evaluations show that ReT-Eval enhances user understanding and outperforms state-of-the-art reasoning models.
zh
[AI-85] MAPF-World: Action World Model for Multi-Agent Path Finding
【速读】:该论文旨在解决多智能体路径规划(Multi-agent Path Finding, MAPF)中现有去中心化学习求解器在复杂、长期规划场景下性能下降的问题,尤其是其对环境时序动态和智能体间依赖关系建模能力不足。解决方案的关键在于提出MAPF-World——一个自回归动作世界模型(autoregressive action world model),该模型通过统一情境理解与动作生成,显式建模环境的空间特征与时间依赖性,实现对未来状态和动作的预测,从而增强情境感知能力,并支持更协调、前瞻性的决策。此外,论文还引入基于真实场景的自动地图生成器以扩充基准测试,显著提升训练效率与泛化能力,实验表明MAPF-World在保持模型规模缩小96.5%、数据量减少92%的前提下,优于当前最先进学习型求解器,尤其在分布外场景中展现出卓越的零样本泛化性能。
链接: https://arxiv.org/abs/2508.12087
作者: Zhanjiang Yang,Meng Li,Yang Shen,Yueming Li,Lijun Sun
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent path finding (MAPF) is the problem of planning conflict-free paths from the designated start locations to goal positions for multiple agents. It underlies a variety of real-world tasks, including multi-robot coordination, robot-assisted logistics, and social navigation. Recent decentralized learnable solvers have shown great promise for large-scale MAPF, especially when leveraging foundation models and large datasets. However, these agents are reactive policy models and exhibit limited modeling of environmental temporal dynamics and inter-agent dependencies, resulting in performance degradation in complex, long-term planning scenarios. To address these limitations, we propose MAPF-World, an autoregressive action world model for MAPF that unifies situation understanding and action generation, guiding decisions beyond immediate local observations. It improves situational awareness by explicitly modeling environmental dynamics, including spatial features and temporal dependencies, through future state and actions prediction. By incorporating these predicted futures, MAPF-World enables more informed, coordinated, and far-sighted decision-making, especially in complex multi-agent settings. Furthermore, we augment MAPF benchmarks by introducing an automatic map generator grounded in real-world scenarios, capturing practical map layouts for training and evaluating MAPF solvers. Extensive experiments demonstrate that MAPF-World outperforms state-of-the-art learnable solvers, showcasing superior zero-shot generalization to out-of-distribution cases. Notably, MAPF-World is trained with a 96.5% smaller model size and 92% reduced data.
zh
[AI-86] Large Language Models Enable Personalized Nudges to Promote Carbon Offsetting Among Air Travellers
【速读】:该论文旨在解决如何通过个性化干预策略提升航空旅客自愿抵消飞行碳排放行为的问题。其核心挑战在于传统“助推”(nudge)策略效果受限于个体偏好差异,且缺乏大规模行为数据支持定制化方案。解决方案的关键在于利用大语言模型(Large Language Models, LLMs)模拟人类决策过程,生成基于个体特征的个性化“诱饵型”助推策略(decoy-based nudge strategies),从而在无需大量行为数据的前提下实现精准干预。实证结果显示,该方法显著优于统一推送策略,使抵消率提升3–7%,尤其有效激励低信任度的怀疑型旅客参与,年均额外减少230万吨二氧化碳排放,为航空业低碳转型提供了高效可行的技术路径。
链接: https://arxiv.org/abs/2508.12045
作者: Vladimir Maksimenko,Qingyao Xin,Prateek Gupta,Bin Zhang,Prateek Bansal
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Nudge strategies are effective tools for promoting sustainable behaviour, but their impact depends on individual preferences. By emulating human decision-making, large language models (LLMs) offer a cost-effective route for tailoring nudges without extensive behavioural datasets, yet this potential remains unexplored. Focusing on aviation, we use LLMs to design personalized decoy-based nudge strategies that encourage air travellers to voluntarily offset CO _2 emissions from flights, and validate their efficacy through 3495 surveys from China, Germany, India, Singapore, and the United States. Results show that LLM-informed personalized nudges are more effective than uniform settings, raising offsetting rates by 3-7 % and yielding an additional 2.3 million tonnes of CO _2 mitigated annually in aviation. This improvement is driven primarily by increased participation among sceptical travellers with low trust in offset programmes. Our study highlights the potential of LLM-driven personalized nudging strategies for boosting offsetting behaviours to accelerate aviation decarbonization.
zh
[AI-87] Active inference for action-unaware agents
【速读】:该论文旨在解决在基于主动推理(Active Inference)框架下的动作规划问题,具体聚焦于行动感知型(action-aware)与行动非感知型(action-unaware)代理在导航任务中的性能差异。其核心问题是:当代理无法直接访问自身动作信息(即缺乏传出副本信号,efference copy),仅能通过近期观测推断行为时,是否仍能实现与已知自身动作的代理相当的决策表现。解决方案的关键在于证明,即使在缺乏直接动作知识的情况下,行动非感知型代理也能通过优化预期自由能(expected free energy)来有效规划未来行为,并在两个导航任务中展现出与行动感知型代理相当的性能,表明对过去运动经验的间接推断足以支持高效的行为策略生成。
链接: https://arxiv.org/abs/2508.12027
作者: Filippo Torresan,Keisuke Suzuki,Ryota Kanai,Manuel Baltieri
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注: 59 pages, 47 figures
Abstract:Active inference is a formal approach to study cognition based on the notion that adaptive agents can be seen as engaging in a process of approximate Bayesian inference, via the minimisation of variational and expected free energies. Minimising the former provides an account of perceptual processes and learning as evidence accumulation, while minimising the latter describes how agents select their actions over time. In this way, adaptive agents are able to maximise the likelihood of preferred observations or states, given a generative model of the environment. In the literature, however, different strategies have been proposed to describe how agents can plan their future actions. While they all share the notion that some kind of expected free energy offers an appropriate way to score policies, sequences of actions, in terms of their desirability, there are different ways to consider the contribution of past motor experience to the agent’s future behaviour. In some approaches, agents are assumed to know their own actions, and use such knowledge to better plan for the future. In other approaches, agents are unaware of their actions, and must infer their motor behaviour from recent observations in order to plan for the future. This difference reflects a standard point of departure in two leading frameworks in motor control based on the presence, or not, of an efference copy signal representing knowledge about an agent’s own actions. In this work we compare the performances of action-aware and action-unaware agents in two navigations tasks, showing how action-unaware agents can achieve performances comparable to action-aware ones while at a severe disadvantage.
zh
[AI-88] AI Models for Depressive Disorder Detection and Diagnosis: A Review
【速读】:该论文旨在解决重度抑郁症(Major Depressive Disorder, MDD)诊断依赖主观临床评估、缺乏客观且可扩展的工具这一关键问题。其解决方案的核心在于系统性整合人工智能(Artificial Intelligence, AI)技术,通过构建一个新颖的分层分类体系(按临床任务、数据模态和计算模型类别进行结构化),全面梳理当前基于AI的抑郁检测与诊断方法。研究发现三大趋势:图神经网络(Graph Neural Networks, GNNs)在脑连接建模中占主导地位、大语言模型(Large Language Models, LLMs)在语言与对话数据分析中快速崛起,以及多模态融合、可解释性和算法公平性成为新兴焦点,从而为计算精神病学领域的未来研究提供明确路径。
链接: https://arxiv.org/abs/2508.12022
作者: Dorsa Macky Aleagha,Payam Zohari,Mostafa Haghir Chehreghani
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Major Depressive Disorder is one of the leading causes of disability worldwide, yet its diagnosis still depends largely on subjective clinical assessments. Integrating Artificial Intelligence (AI) holds promise for developing objective, scalable, and timely diagnostic tools. In this paper, we present a comprehensive survey of state-of-the-art AI methods for depression detection and diagnosis, based on a systematic review of 55 key studies. We introduce a novel hierarchical taxonomy that structures the field by primary clinical task (diagnosis vs. prediction), data modality (text, speech, neuroimaging, multimodal), and computational model class (e.g., graph neural networks, large language models, hybrid approaches). Our in-depth analysis reveals three major trends: the predominance of graph neural networks for modeling brain connectivity, the rise of large language models for linguistic and conversational data, and an emerging focus on multimodal fusion, explainability, and algorithmic fairness. Alongside methodological insights, we provide an overview of prominent public datasets and standard evaluation metrics as a practical guide for researchers. By synthesizing current advances and highlighting open challenges, this survey offers a comprehensive roadmap for future innovation in computational psychiatry.
zh
[AI-89] Predicting ChatGPT Use in Assignments: Implications for AI-Aware Assessment Design
【速读】:该论文旨在解决生成式 AI(Generative AI)工具(如 ChatGPT)在高等教育中对学生作业完成行为影响的量化研究空白问题,尤其关注其如何改变学生的实际学术实践。解决方案的关键在于基于来自388名大学生(主要来自俄罗斯,含国际学生)的调查数据,采用XGBoost算法构建预测模型,识别出影响ChatGPT使用频率的核心因素,包括学习习惯、学科偏好和对AI的态度,并通过二分类与多分类模型验证了其预测性能(最高达80.1%测试准确率)。该方法揭示了高频使用ChatGPT学习新概念可能引发过度依赖,进而削弱长期学术独立性,从而为制定学科特定指导方针和重构评估策略提供了实证依据,以实现技术赋能与学术严谨性的平衡。
链接: https://arxiv.org/abs/2508.12013
作者: Surajit Das,Aleksei Eliseev
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The rise of generative AI tools like ChatGPT has significantly reshaped education, sparking debates about their impact on learning outcomes and academic integrity. While prior research highlights opportunities and risks, there remains a lack of quantitative analysis of student behavior when completing assignments. Understanding how these tools influence real-world academic practices, particularly assignment preparation, is a pressing and timely research priority. This study addresses this gap by analyzing survey responses from 388 university students, primarily from Russia, including a subset of international participants. Using the XGBoost algorithm, we modeled predictors of ChatGPT usage in academic assignments. Key predictive factors included learning habits, subject preferences, and student attitudes toward AI. Our binary classifier demonstrated strong predictive performance, achieving 80.1% test accuracy, with 80.2% sensitivity and 79.9% specificity. The multiclass classifier achieved 64.5% test accuracy, 64.6% weighted precision, and 64.5% recall, with similar training scores, indicating potential data scarcity challenges. The study reveals that frequent use of ChatGPT for learning new concepts correlates with potential overreliance, raising concerns about long-term academic independence. These findings suggest that while generative AI can enhance access to knowledge, unchecked reliance may erode critical thinking and originality. We propose discipline-specific guidelines and reimagined assessment strategies to balance innovation with academic rigor. These insights can guide educators and policymakers in ethically and effectively integrating AI into education. Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI) Cite as: arXiv:2508.12013 [cs.CY] (or arXiv:2508.12013v1 [cs.CY] for this version) https://doi.org/10.48550/arXiv.2508.12013 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-90] Agent CDM: Enhancing Multi-Agent Collaborative Decision-Making via ACH-Inspired Structured Reasoning
【速读】:该论文旨在解决基于大语言模型(Large Language Models, LLMs)的多智能体系统(Multi-agent Systems, MAS)中协同决策(Collaborative Decision-Making, CDM)过程缺乏系统性方法的问题。现有方法要么采用“独裁式”策略,易受单一智能体认知偏见影响;要么依赖“投票式”机制,未能充分挖掘群体智能。其解决方案的关键在于提出AgentCDM框架,该框架受认知科学中竞争假设分析(Analysis of Competing Hypotheses, ACH)启发,引入结构化推理范式,将决策从被动答案选择转变为积极的假设评估与构建,并通过两阶段训练机制实现:第一阶段使用显式的ACH引导结构化推理,第二阶段逐步移除该引导以促进模型自主泛化。实验表明,该方法在多个基准数据集上达到最先进性能并具备强泛化能力,显著提升了MAS中协同决策的质量与鲁棒性。
链接: https://arxiv.org/abs/2508.11995
作者: Xuyang Zhao,Shiwan Zhao,Hualong Yu,Liting Zhang,Qicheng Li
机构: 未知
类目: Artificial Intelligence (cs.AI); Multiagent Systems (cs.MA)
备注:
Abstract:Multi-agent systems (MAS) powered by large language models (LLMs) hold significant promise for solving complex decision-making tasks. However, the core process of collaborative decision-making (CDM) within these systems remains underexplored. Existing approaches often rely on either dictatorial" strategies that are vulnerable to the cognitive biases of a single agent, or
voting-based" methods that fail to fully harness collective intelligence. To address these limitations, we propose \textbfAgentCDM, a structured framework for enhancing collaborative decision-making in LLM-based multi-agent systems. Drawing inspiration from the Analysis of Competing Hypotheses (ACH) in cognitive science, AgentCDM introduces a structured reasoning paradigm that systematically mitigates cognitive biases and shifts decision-making from passive answer selection to active hypothesis evaluation and construction. To internalize this reasoning process, we develop a two-stage training paradigm: the first stage uses explicit ACH-inspired scaffolding to guide the model through structured reasoning, while the second stage progressively removes this scaffolding to encourage autonomous generalization. Experiments on multiple benchmark datasets demonstrate that AgentCDM achieves state-of-the-art performance and exhibits strong generalization, validating its effectiveness in improving the quality and robustness of collaborative decisions in MAS.
zh
[AI-91] Modeling Relational Logic Circuits for And-Inverter Graph Convolutional Network
【速读】:该论文旨在解决实际复杂结构和大规模节点的And-Inverter Graphs(AIGs)在电子设计自动化(EDA)中难以准确建模的问题,尤其是现有方法无法同时捕捉AIG的功能特性与结构特性,且动态信息传播能力不足。解决方案的关键在于提出AIGer模型,其包含两个核心组件:一是节点逻辑特征初始化嵌入模块,将AND、NOT等逻辑节点映射到独立语义空间以实现有效的节点嵌入;二是AIG特征学习网络,采用异构图卷积网络设计动态关系权重矩阵并引入差异化信息聚合机制,从而更精准地保留原始结构和信息。该组合显著增强了对功能与结构特性的联合建模能力及消息传递效率,在信号概率预测(SSP)和真值表距离预测(TTDP)任务中均优于当前最优模型。
链接: https://arxiv.org/abs/2508.11991
作者: Weihao Sun
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:The automation of logic circuit design enhances chip performance, energy efficiency, and reliability, and is widely applied in the field of Electronic Design Automation (EDA).And-Inverter Graphs (AIGs) efficiently represent, optimize, and verify the functional characteristics of digital circuits, enhancing the efficiency of EDA this http URL to the complex structure and large scale of nodes in real-world AIGs, accurate modeling is challenging, leading to existing work lacking the ability to jointly model functional and structural characteristics, as well as insufficient dynamic information propagation this http URL address the aforementioned challenges, we propose this http URL, AIGer consists of two components: 1) Node logic feature initialization embedding component and 2) AIGs feature learning network this http URL node logic feature initialization embedding component projects logic nodes, such as AND and NOT, into independent semantic spaces, to enable effective node embedding for subsequent this http URL upon this, the AIGs feature learning network component employs a heterogeneous graph convolutional network, designing dynamic relationship weight matrices and differentiated information aggregation approaches to better represent the original structure and information of this http URL combination of these two components enhances AIGer’s ability to jointly model functional and structural characteristics and improves its message passing capability. Experimental results indicate that AIGer outperforms the current best models in the Signal Probability Prediction (SSP) task, improving MAE and MSE by 18.95% and 44.44%, respectively. In the Truth Table Distance Prediction (TTDP) task, AIGer achieves improvements of 33.57% and 14.79% in MAE and MSE, respectively, compared to the best-performing models.
zh
[AI-92] FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction
【速读】:该论文旨在解决当前大语言模型(Large Language Models, LLMs)代理在进行未来预测任务时缺乏大规模、动态且无数据污染的评估基准的问题。现有方法难以应对实时信息更新和准确答案获取的挑战,导致对代理在复杂环境中的适应性推理能力评估不足。解决方案的关键在于提出一个名为FutureX的动态、实时更新的评估基准,其通过自动化管道实现每日问题采集与答案收集,有效避免了数据污染,并支持对具备推理、搜索及外部工具集成能力的LLM代理进行全面评估。该基准覆盖多样化的未来预测场景,推动代理向专业人类分析师水平的预测思维和决策能力演进。
链接: https://arxiv.org/abs/2508.11987
作者: Zhiyuan Zeng,Jiashuo Liu,Siyuan Chen,Tianci He,Yali Liao,Jinpeng Wang,Zaiyuan Wang,Yang Yang,Lingyue Yin,Mingren Yin,Zhenwei Zhu,Tianle Cai,Zehui Chen,Jiecao Chen,Yantao Du,Xiang Gao,Jiacheng Guo,Liang Hu,Jianpeng Jiao,Xiangsheng Li,Jingkai Liu,Shuang Ni,Zhoufutu Wen,Ge Zhang,Kaiyuan Zhang,Xin Zhou,Jose Blanchet,Xipeng Qiu,Mengdi Wang,Wenhao Huang
机构: 未知
类目: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Technical report, 51 pages
Abstract:Future prediction is a complex task for LLM agents, requiring a high level of analytical thinking, information gathering, contextual understanding, and decision-making under uncertainty. Agents must not only gather and interpret vast amounts of dynamic information but also integrate diverse data sources, weigh uncertainties, and adapt predictions based on emerging trends, just as human experts do in fields like politics, economics, and finance. Despite its importance, no large-scale benchmark exists for evaluating agents on future prediction, largely due to challenges in handling real-time updates and retrieving timely, accurate answers. To address this, we introduce \textbfFutureX , a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is the largest and most diverse live benchmark for future prediction, supporting real-time daily updates and eliminating data contamination through an automated pipeline for question gathering and answer collection. We evaluate 25 LLM/agent models, including those with reasoning, search capabilities, and integration of external tools such as the open-source Deep Research Agent and closed-source Deep Research models. This comprehensive evaluation assesses agents’ adaptive reasoning and performance in dynamic environments. Additionally, we provide in-depth analyses of agents’ failure modes and performance pitfalls in future-oriented tasks, including the vulnerability to fake web pages and the temporal validity. Our goal is to establish a dynamic, contamination-free evaluation standard that drives the development of LLM agents capable of performing at the level of professional human analysts in complex reasoning and predictive thinking.
zh
[AI-93] Efficient Modular Learning through Naive LoRA Summation: Leverag ing Orthogonality in High-Dimensional Models
【速读】:该论文旨在解决多任务微调中因参数冲突导致的性能下降问题,尤其是在使用参数高效微调(Parameter-Efficient Fine-Tuning, PEFT)方法时,不同领域适配模块之间的干扰难以控制。解决方案的关键在于利用低秩适应(Low-Rank Adaptation, LoRA)模块的数学结构特性,提出通过简单加法组合独立训练的LoRA模块来实现跨域知识融合,其理论基础是基于叠加原理(superposition principle)假设:在不相交领域上独立训练的LoRA模块近似正交,因此可直接相加以获得良好性能。实验表明,这种“朴素求和”策略无需额外训练即可达到与合并数据微调相当的效果,并且可通过LoRA增量矩阵间的余弦相似度预测组合后的性能变化趋势,从而有效识别高阶组合中的干扰现象。
链接: https://arxiv.org/abs/2508.11985
作者: Zhanhao Cao,Clement Truong,Andrew Lizarraga
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: Preprint
Abstract:Recent advances in large language models are driven by scale, while parameter-efficient fine-tuning (PEFT) enables updating only a small fraction of parameters. Low-Rank Adaptation (LoRA) stores parameter deltas as the product of two small matrices, which makes them natural building blocks that can be composed. Motivated by the superposition principle, we hypothesize that independently trained LoRA modules on disjoint domains are approximately orthogonal and can be combined by simple addition. Using GPT-2 Small (117M) with LoRA rank 4 and alpha=64, we train adapters for three QA domains (math, medicine, finance). In pairwise tests, adding Math+Medicine adapters improves perplexity by -9.10% relative to merged-data fine-tuning, while Math+Finance and Finance+Medicine change by +4.54% and +27.56%, respectively. Across combinations, the RMS cosine similarity between LoRA deltas correlates positively and approximately linearly with the change in perplexity. Naive summation requires no additional training, can be applied in seconds, and achieves performance comparable to models trained on merged data, while clarifying when interference appears in higher-order compositions.
zh
[AI-94] BGRecall: A Generative Retrieval Model for E-commerce Recommendation Scenarios
【速读】:该论文旨在解决生成式推荐系统在电商场景中进行高效检索时面临的局限性,特别是由于自回归生成机制导致的序列依赖问题,使得模型难以在单次请求中无位置约束地生成多个推荐物品。解决方案的关键在于提出TBGRecall框架,其核心创新是引入**下一会话预测(Next Session Prediction, NSP)**机制,并通过将输入样本划分为多会话序列(每个序列包含一个会话标记和一组物品标记),重构生成式检索任务的建模方式;同时采用有限历史数据预训练与随机部分增量训练相结合的训练策略,显著提升训练效率并强化数据时效性优势,从而在公开基准和淘宝大规模工业数据集上实现优于现有方法的性能表现。
链接: https://arxiv.org/abs/2508.11977
作者: Zida Liang(1),Changfa Wu(2),Dunxian Huang(2),Weiqiang Sun(1),Ziyang Wang(2),Yuliang Yan(2),Jian Wu(2),Yuning Jiang(2),Bo Zheng(2),Ke Chen(2),Silu Zhou(2),Yu Zhang(2) ((1) Shanghai Jiaotong University, (2) Alibaba Inc.)
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: Both authors contributed equally to this research. Work done during internship at Alibaba. Corresponding author: Dunxian Huang ( this http URL @alibaba this http URL ). Affiliations: (1) Shanghai Jiaotong University, Shanghai, China; (2) Alibaba Inc
Abstract:Recommendation systems are essential tools in modern e-commerce, facilitating personalized user experiences by suggesting relevant products. Recent advancements in generative models have demonstrated potential in enhancing recommendation systems; however, these models often exhibit limitations in optimizing retrieval tasks, primarily due to their reliance on autoregressive generation mechanisms. Conventional approaches introduce sequential dependencies that impede efficient retrieval, as they are inherently unsuitable for generating multiple items without positional constraints within a single request session. To address these limitations, we propose TBGRecall, a framework integrating Next Session Prediction (NSP), designed to enhance generative retrieval models for e-commerce applications. Our framework reformulation involves partitioning input samples into multi-session sequences, where each sequence comprises a session token followed by a set of item tokens, and then further incorporate multiple optimizations tailored to the generative task in retrieval scenarios. In terms of training methodology, our pipeline integrates limited historical data pre-training with stochastic partial incremental training, significantly improving training efficiency and emphasizing the superiority of data recency over sheer data volume. Our extensive experiments, conducted on public benchmarks alongside a large-scale industrial dataset from TaoBao, show TBGRecall outperforms the state-of-the-art recommendation methods, and exhibits a clear scaling law trend. Ultimately, NSP represents a significant advancement in the effectiveness of generative recommendation systems for e-commerce applications.
zh
[AI-95] Chart-CoCa: Self-Improving Chart Understanding of Vision LMs via Code-Driven Synthesis and Candidate-Conditioned Answering CIKM2025
【速读】:该论文旨在解决视觉语言模型(Vision Language Models, VLMs)在图表理解任务中面临的两大挑战:一是难以生成准确的图表描述,二是缺乏复杂推理能力。针对这些问题,作者提出了一种完全自洽的改进范式,其关键在于两个创新:一是构建了一个基于代码生成与执行的图表合成流水线,能够自动创建对齐的图表-问题-答案三元组,从而确保合成数据的可靠性且无需人工标注;二是设计了一种候选条件化的回答机制,借鉴测试时扩展(test-time scaling)思想,在推理阶段为每个查询生成多个候选回答,并通过上下文整合策略合成最终答案,实现性能提升而无需外部模型或人工标签。实验表明,该方法在无监督条件下使VLM准确率最高提升15.50个百分点。
链接: https://arxiv.org/abs/2508.11975
作者: Gongyao Jiang,Qiong Luo
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: Accepted to CIKM 2025
Abstract:Vision Language Models (VLMs) often struggle with chart understanding tasks, particularly in accurate chart description and complex reasoning. Synthetic data generation is a promising solution, while usually facing the challenge of noise labels. To address this challenge, we first introduce a chart synthesis pipeline that generates aligned chart-question-answer triplets through code generation and execution, ensuring the reliability of synthetic data without human intervention. Furthermore, inspired by test-time scaling that increases inference budget and thereby improves performance, we design a candidate-conditioned answering process. The VLM first generates multiple responses per query, and then synthesizes the final answer by contextualizing these candidates. Experiments demonstrate significant improvements, with up to 15.50 points accuracy gain over the initial VLM, in a fully self-improving paradigm without either human-labeled data or external models.
zh
[AI-96] Rigorous Feature Importance Scores based on Shapley Value and Banzhaf Index
【速读】:该论文旨在解决当前基于博弈论的特征归因方法在高风险机器学习应用场景中对非弱归纳解释(non-WAXp)集合信息忽略的问题。现有方法通常仅以弱归纳解释(WAXp)作为特征重要性分配的特征函数,从而可能遗漏对对抗样本(AExs)排除能力具有关键作用的其他特征组合。解决方案的关键在于引入Shapley值和Banzhaf指数,构建两种新型特征重要性评分机制,在计算特征贡献时充分考虑非WAXp集合的信息,从而量化每个特征在排除对抗样本方面的有效性。该方法不仅拓展了特征归因的理论基础,还揭示了XPs与AExs之间的内在关联在特征重要性评估中的价值。
链接: https://arxiv.org/abs/2508.11959
作者: Xuanxiang Huang,Olivier Létoffé,Joao Marques-Silva
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Feature attribution methods based on game theory are ubiquitous in the field of eXplainable Artificial Intelligence (XAI). Recent works proposed rigorous feature attribution using logic-based explanations, specifically targeting high-stakes uses of machine learning (ML) models. Typically, such works exploit weak abductive explanation (WAXp) as the characteristic function to assign importance to features. However, one possible downside is that the contribution of non-WAXp sets is neglected. In fact, non-WAXp sets can also convey important information, because of the relationship between formal explanations (XPs) and adversarial examples (AExs). Accordingly, this paper leverages Shapley value and Banzhaf index to devise two novel feature importance scores. We take into account non-WAXp sets when computing feature contribution, and the novel scores quantify how effective each feature is at excluding AExs. Furthermore, the paper identifies properties and studies the computational complexity of the proposed scores.
zh
[AI-97] A Comprehensive Review of AI Agents : Transforming Possibilities in Technology and Beyond
【速读】:该论文旨在解决当前AI代理系统在整合认知、规划与交互能力方面面临的统一性难题,即如何构建既具备复杂环境感知与推理能力,又能实现高效自主决策和多智能体协同的通用型AI代理。其解决方案的关键在于系统性地梳理和融合认知科学启发的模型、分层强化学习框架以及基于大语言模型(Large Language Model, LLM)的推理机制,从而为下一代更鲁棒、可适应且可信的自主智能系统提供理论基础与技术路径。
链接: https://arxiv.org/abs/2508.11957
作者: Xiaodong Qu,Andrews Damoah,Joshua Sherwood,Peiyan Liu,Christian Shun Jin,Lulu Chen,Minjie Shen,Nawwaf Aleisa,Zeyuan Hou,Chenyu Zhang,Lifu Gao,Yanshu Li,Qikai Yang,Qun Wang,Cristabelle De Souza
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Artificial Intelligence (AI) agents have rapidly evolved from specialized, rule-based programs to versatile, learning-driven autonomous systems capable of perception, reasoning, and action in complex environments. The explosion of data, advances in deep learning, reinforcement learning, and multi-agent coordination have accelerated this transformation. Yet, designing and deploying unified AI agents that seamlessly integrate cognition, planning, and interaction remains a grand challenge. In this review, we systematically examine the architectural principles, foundational components, and emergent paradigms that define the landscape of contemporary AI agents. We synthesize insights from cognitive science-inspired models, hierarchical reinforcement learning frameworks, and large language model-based reasoning. Moreover, we discuss the pressing ethical, safety, and interpretability concerns associated with deploying these agents in real-world scenarios. By highlighting major breakthroughs, persistent challenges, and promising research directions, this review aims to guide the next generation of AI agent systems toward more robust, adaptable, and trustworthy autonomous intelligence.
zh
[AI-98] UniCast: A Unified Multimodal Prompting Framework for Time Series Forecasting
【速读】:该论文旨在解决当前时间序列基础模型(Time Series Foundation Models, TSFMs)在实际应用中受限于单一模态输入的问题,即忽略了现实场景中常伴随的时间序列数据所具有的丰富多模态上下文信息(如视觉和文本信号)。其解决方案的关键在于提出一种参数高效的多模态框架UniCast,通过软提示调优(soft prompt tuning)将预训练的视觉和文本编码器的特定模态嵌入与冻结的时间序列基础模型进行融合,从而在最小化参数更新的前提下实现跨模态交互,同时保留基础模型的强大泛化能力。实验表明,该方法显著优于现有所有单模态TSFM基线模型。
链接: https://arxiv.org/abs/2508.11954
作者: Sehyuk Park,Soyeon Caren Han,Eduard Hovy
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Time series forecasting is a foundational task across domains, such as finance, healthcare, and environmental monitoring. While recent advances in Time Series Foundation Models (TSFMs) have demonstrated strong generalisation through large-scale pretraining, existing models operate predominantly in a unimodal setting, ignoring the rich multimodal context, such as visual and textual signals, that often accompanies time series data in real-world scenarios. This paper introduces a novel parameter-efficient multimodal framework, UniCast, that extends TSFMs to jointly leverage time series, vision, and text modalities for enhanced forecasting performance. Our method integrates modality-specific embeddings from pretrained Vision and Text Encoders with a frozen TSFM via soft prompt tuning, enabling efficient adaptation with minimal parameter updates. This design not only preserves the generalisation strength of the foundation model but also enables effective cross-modal interaction. Extensive experiments across diverse time-series forecasting benchmarks demonstrate that UniCast consistently and significantly outperforms all existing TSFM baselines. The findings highlight the critical role of multimodal context in advancing the next generation of general-purpose time series forecasters.
zh
[AI-99] Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models
【速读】:该论文旨在解决大规模语言模型(Large Language Models, LLMs)在监督微调(Supervised Fine-Tuning, SFT)过程中数据混合策略的优化问题,即如何合理分配不同数据集的权重以最小化验证损失并提升整体与各领域性能。其解决方案的关键在于将数据混合建模为一个优化问题,通过参数化损失函数来刻画有效数据传递量,并结合微调缩放定律(scaling laws for fine-tuning)拟合关键参数,从而推导出最优数据权重。该方法在小规模数据混合实验中验证了有效性,平均每个领域的损失仅比网格搜索所得最佳权重高出0.66%,且能显著改善主流SFT数据集的重加权效果,具备向特定领域模型数据选择扩展的潜力。
链接: https://arxiv.org/abs/2508.11953
作者: Yuan Li,Zhengzhong Liu,Eric Xing
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.
zh
[AI-100] Extending Straight-Through Estimation for Robust Neural Networks on Analog CIM Hardware
【速读】:该论文旨在解决模拟计算内存(Analog Compute-In-Memory, CIM)架构在神经网络推理中因硬件诱导噪声导致的部署难题。现有噪声感知训练方法通常依赖理想化且可微分的噪声模型,难以刻画CIM硬件复杂变异特性。其解决方案的关键在于借鉴量化中的直通估计器(Straight-Through Estimator, STE)框架,将前向传播中的噪声模拟与反向传播中的梯度计算解耦,从而支持更精确但计算不可行的噪声建模,同时保持梯度方向信息、计算可行性与优化稳定性。这一方法在图像分类任务上实现最高5.3%准确率提升,在文本生成中降低0.72困惑度,训练速度提升2.2倍,并减少37.9%峰值内存占用。
链接: https://arxiv.org/abs/2508.11940
作者: Yuannuo Feng,Wenyong Zhou,Yuexi Lyu,Yixiang Zhang,Zhengwu Liu,Ngai Wong,Wang Kang
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Hardware Architecture (cs.AR)
备注: 4 pages, 5 figures, conference
Abstract:Analog Compute-In-Memory (CIM) architectures promise significant energy efficiency gains for neural network inference, but suffer from complex hardware-induced noise that poses major challenges for deployment. While noise-aware training methods have been proposed to address this issue, they typically rely on idealized and differentiable noise models that fail to capture the full complexity of analog CIM hardware variations. Motivated by the Straight-Through Estimator (STE) framework in quantization, we decouple forward noise simulation from backward gradient computation, enabling noise-aware training with more accurate but computationally intractable noise modeling in analog CIM systems. We provide theoretical analysis demonstrating that our approach preserves essential gradient directional information while maintaining computational tractability and optimization stability. Extensive experiments show that our extended STE framework achieves up to 5.3% accuracy improvement on image classification, 0.72 perplexity reduction on text generation, 2.2 \times speedup in training time, and 37.9% lower peak memory usage compared to standard noise-aware training methods.
zh
[AI-101] HPD: Hybrid Projection Decomposition for Robust State Space Models on Analog CIM Hardware
【速读】:该论文旨在解决计算内存(Compute-in-Memory, CIM)架构中器件非理想性导致的权重扰动问题,此类扰动会显著降低状态空间模型(State Space Models, SSMs)在推理过程中的准确性。解决方案的关键在于提出一种混合投影分解(Hybrid Projection Decomposition, HPD)策略,针对SSMs中最为敏感的最后输出投影层进行优化:通过奇异值分解(SVD)将原始权重矩阵拆分为U和Σ的乘积以适配现有硬件架构,同时将V矩阵移至数字域进行高精度校正,从而在不改变硬件兼容性的前提下实现对噪声干扰的有效补偿。实验表明,该方法在多种噪声条件下可使Mamba模型的困惑度降低高达99.57%,并在PIQA基准测试中提升常识推理准确率达96.67%。
链接: https://arxiv.org/abs/2508.11935
作者: Yuannuo Feng,Wenyong Zhou,Yuexi Lyu,Hanjie Liu,Zhengwu Liu,Ngai Wong,Wang Kang
机构: 未知
类目: Hardware Architecture (cs.AR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 4 pages, 5 figures, conference
Abstract:State Space Models (SSMs) are efficient alternatives to traditional sequence models, excelling at processing long sequences with lower computational complexity. Their reliance on matrix multiplications makes them ideal for compute-in-memory (CIM) architectures, which improve energy efficiency by computing within memory arrays. However, device non-idealities in CIM introduce weight perturbations that can degrade inference accuracy. In this paper, we systematically analyze the robustness of SSMs under noisy conditions, identifying that the final block and output projection layers are more susceptible to perturbations compared to other components. Building on these insights, we propose HPD, a Hybrid Projection Decomposition strategy for the last output projection layer. We replace the original weight matrix with the multiplication of U and \Sigma in its SVD to ensure compatibility with existing hardware architectures, while offloading V to digital hardware for precise and robust correction. Comprehensive tests on Mamba models show that our method reduces perplexity by up to 99.57% under various noise conditions compared to baseline models, with accuracy gains of up to 96.67% on the PIQA benchmark for commonsense reasoning.
zh
[AI-102] No More Blind Spots: Learning Vision-Based Omnidirectional Bipedal Locomotion for Challenging Terrain
【速读】:该论文旨在解决动态环境中双足机器人实现全向(omnidirectional)行走的难题,尤其在复杂室内场景或不平坦地形中,需要具备多方向感知与自适应控制能力。其核心挑战在于传统基于仿真到现实(sim-to-real)的强化学习(Reinforcement Learning, RL)方法因高计算成本难以渲染全向深度图像而不可行。解决方案的关键在于提出一种结合鲁棒盲控制器与视觉监督教师策略的分层学习框架:通过教师策略指导学生策略进行监督训练,且在训练中引入噪声增强的数据增强技术,从而避免了RL阶段对昂贵渲染的依赖,并提升了模型对真实环境的鲁棒性;同时,该方法显著加速了训练过程(最多提升10倍),最终实现了无需依赖复杂渲染即可完成视觉引导的全向双足行走,在仿真与实测中均验证了其有效性。
链接: https://arxiv.org/abs/2508.11929
作者: Mohitvishnu S. Gadde,Pranay Dugar,Ashish Malik,Alan Fern
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Effective bipedal locomotion in dynamic environments, such as cluttered indoor spaces or uneven terrain, requires agile and adaptive movement in all directions. This necessitates omnidirectional terrain sensing and a controller capable of processing such input. We present a learning framework for vision-based omnidirectional bipedal locomotion, enabling seamless movement using depth images. A key challenge is the high computational cost of rendering omnidirectional depth images in simulation, making traditional sim-to-real reinforcement learning (RL) impractical. Our method combines a robust blind controller with a teacher policy that supervises a vision-based student policy, trained on noise-augmented terrain data to avoid rendering costs during RL and ensure robustness. We also introduce a data augmentation technique for supervised student training, accelerating training by up to 10 times compared to conventional methods. Our framework is validated through simulation and real-world tests, demonstrating effective omnidirectional locomotion with minimal reliance on expensive rendering. This is, to the best of our knowledge, the first demonstration of vision-based omnidirectional bipedal locomotion, showcasing its adaptability to diverse terrains.
zh
[AI-103] Deciphering the Interplay between Attack and Protection Complexity in Privacy-Preserving Federated Learning
【速读】:该论文旨在解决联邦学习(Federated Learning, FL)在保护数据隐私方面面临的梯度逆向攻击(gradient inversion attacks)威胁,此类攻击可导致私有数据被重构,从而破坏隐私保障。解决方案的关键在于提出一个全新的理论框架,用于量化分析攻击复杂度(Attack Complexity)与保护复杂度(Protection Complexity)之间的权衡关系:其中攻击复杂度定义为攻击者在给定误差阈值下重建私有数据所需的最小计算和数据资源;保护复杂度则衡量隐私机制引入的预期失真。通过引入最大贝叶斯隐私(Maximum Bayesian Privacy, MBP),作者推导出保护复杂度的紧致理论边界,并揭示其随模型维度和隐私预算的变化规律;同时建立攻击复杂度的综合边界,阐明其与隐私泄露、梯度失真、模型维度及隐私水平的关系。该框架为设计更安全高效的隐私保护联邦学习系统提供了定量依据。
链接: https://arxiv.org/abs/2508.11907
作者: Xiaojin Zhang,Mingcong Xu,Yiming Li,Wei Chen,Qiang Yang
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI)
备注:
Abstract:Federated learning (FL) offers a promising paradigm for collaborative model training while preserving data privacy. However, its susceptibility to gradient inversion attacks poses a significant challenge, necessitating robust privacy protection mechanisms. This paper introduces a novel theoretical framework to decipher the intricate interplay between attack and protection complexities in privacy-preserving FL. We formally define “Attack Complexity” as the minimum computational and data resources an adversary requires to reconstruct private data below a given error threshold, and “Protection Complexity” as the expected distortion introduced by privacy mechanisms. Leveraging Maximum Bayesian Privacy (MBP), we derive tight theoretical bounds for protection complexity, demonstrating its scaling with model dimensionality and privacy budget. Furthermore, we establish comprehensive bounds for attack complexity, revealing its dependence on privacy leakage, gradient distortion, model dimension, and the chosen privacy level. Our findings quantitatively illuminate the fundamental trade-offs between privacy guarantees, system utility, and the effort required for both attacking and defending. This framework provides critical insights for designing more secure and efficient federated learning systems.
zh
[AI-104] QuarkMed Medical Foundation Model Technical Report
【速读】:该论文旨在解决医疗领域对高专业性、高准确性及可定制性的基础模型(foundation model)的迫切需求,以支持AI在医疗咨询、诊断报告辅助和医学搜索等场景中的可靠应用。其解决方案的关键在于:通过精心构建的医学数据处理流程、基于医学内容的检索增强生成(Retrieval-Augmented Generation, RAG)机制,以及大规模可验证的强化学习训练管道,从而开发出具备强大泛化能力的医疗基础模型QuarkMed,该模型在中文医师资格考试中达到70%的准确率,验证了其在多类医学基准上的高性能表现。
链接: https://arxiv.org/abs/2508.11894
作者: Ao Li,Bin Yan,Bingfeng Cai,Chenxi Li,Cunzhong Zhao,Fugen Yao,Gaoqiang Liu,Guanjun Jiang,Jian Xu,Liang Dong,Liansheng Sun,Rongshen Zhang,Xiaolei Gui,Xin Liu,Xin Shang,Yao Wu,Yu Cao,Zhenxin Ma,Zhuang Jia
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注: 20 pages
Abstract:Recent advancements in large language models have significantly accelerated their adoption in healthcare applications, including AI-powered medical consultations, diagnostic report assistance, and medical search tools. However, medical tasks often demand highly specialized knowledge, professional accuracy, and customization capabilities, necessitating a robust and reliable foundation model. QuarkMed addresses these needs by leveraging curated medical data processing, medical-content Retrieval-Augmented Generation (RAG), and a large-scale, verifiable reinforcement learning pipeline to develop a high-performance medical foundation model. The model achieved 70% accuracy on the Chinese Medical Licensing Examination, demonstrating strong generalization across diverse medical benchmarks. QuarkMed offers a powerful yet versatile personal medical AI solution, already serving over millions of users at this http URL.
zh
[AI-105] Integrating Symbolic RL Planning into a BDI-based Autonomous UAV Framework: System Integration and SIL Validation
【速读】:该论文旨在解决现代自主无人机任务中如何有效融合结构化符号规划与自适应强化学习(Reinforcement Learning, RL)的问题,以提升复杂动态环境下的决策可靠性与安全性。传统基于规则的架构虽具备良好的结构化推理能力,但在动态环境中难以实现自适应规划;而符号强化学习(Symbolic Reinforcement Learning, SRL)通过引入领域特定知识和约束条件(利用Planning Domain Definition Language, PDDL建模),显著增强了无人机(Unmanned Aerial Vehicle, UAV)决策的灵活性与鲁棒性。解决方案的关键在于提出AMAD-SRL框架——在原有自主无人机认知多智能体架构(Autonomous Mission Agents for Drones, AMAD)基础上,集成SRL模块实现任务规划阶段的平滑切换(从BDI驱动转向SRL驱动),并通过软件在环(Software-in-the-Loop, SIL)验证了模块间稳定集成、跨规划模式无缝过渡及任务性能一致性,实验表明目标获取场景下路径效率较覆盖基线提升约75%(以行程距离减少衡量)。
链接: https://arxiv.org/abs/2508.11890
作者: Sangwoo Jeon,Juchul Shin,YeonJe Cho,Gyeong-Tae Kim,Seongwoo Kim
机构: 未知
类目: Robotics (cs.RO); Artificial Intelligence (cs.AI)
备注:
Abstract:Modern autonomous drone missions increasingly require software frameworks capable of seamlessly integrating structured symbolic planning with adaptive reinforcement learning (RL). Although traditional rule-based architectures offer robust structured reasoning for drone autonomy, their capabilities fall short in dynamically complex operational environments that require adaptive symbolic planning. Symbolic RL (SRL), using the Planning Domain Definition Language (PDDL), explicitly integrates domain-specific knowledge and operational constraints, significantly improving the reliability and safety of unmanned aerial vehicle (UAV) decision making. In this study, we propose the AMAD-SRL framework, an extended and refined version of the Autonomous Mission Agents for Drones (AMAD) cognitive multi-agent architecture, enhanced with symbolic reinforcement learning for dynamic mission planning and execution. We validated our framework in a Software-in-the-Loop (SIL) environment structured identically to an intended Hardware-In-the-Loop Simulation (HILS) platform, ensuring seamless transition to real hardware. Experimental results demonstrate stable integration and interoperability of modules, successful transitions between BDI-driven and symbolic RL-driven planning phases, and consistent mission performance. Specifically, we evaluate a target acquisition scenario in which the UAV plans a surveillance path followed by a dynamic reentry path to secure the target while avoiding threat zones. In this SIL evaluation, mission efficiency improved by approximately 75% over a coverage-based baseline, measured by travel distance reduction. This study establishes a robust foundation for handling complex UAV missions and discusses directions for further enhancement and validation.
zh
[AI-106] Discovering Expert-Level Nash Equilibrium Algorithms with Large Language Models
【速读】:该论文旨在解决算法设计与分析中长期存在的难题:如何自动化地发现具有可证明性能保证的通用算法,而无需依赖繁琐且易出错的人工推理。传统方法在面对所有输入时证明算法性能需大量人工干预,难以规模化。其核心挑战在于将算法设计的创造性过程与形式化分析的严谨性有效融合。解决方案的关键在于提出LegoNE框架,该框架通过将用简单Python-like语言编写的算法自动转化为约束优化问题,并求解该问题以推导并证明算法的近似界,从而实现设计与分析的紧密耦合。此方法使大型语言模型在数小时内复现了人类耗时15年才完成的两人博弈最优算法,并首次为三人博弈设计出超越现有水平的新算法,展示了人机协同推进理论科学的新范式。
链接: https://arxiv.org/abs/2508.11874
作者: Hanyu Li,Dongchen Li,Xiaotie Deng
机构: 未知
类目: Computer Science and Game Theory (cs.GT); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Logic in Computer Science (cs.LO); Programming Languages (cs.PL)
备注:
Abstract:Algorithm design and analysis is a cornerstone of computer science, but it confronts a major challenge. Proving an algorithm’s performance guarantee across all inputs has traditionally required extensive and often error-prone human effort. While AI has shown great success in finding solutions to specific problem instances, automating the discovery of general algorithms with such provable guarantees has remained a significant barrier. This challenge stems from the difficulty of integrating the creative process of algorithm design with the rigorous process of formal analysis. To address this gap, we propose LegoNE, a framework that tightly fuses these two processes for the fundamental and notoriously difficult problem of computing approximate Nash equilibria. LegoNE automatically translates any algorithm written by a simple Python-like language into a constrained optimization problem. Solving this problem derives and proves the algorithm’s approximation bound. Using LegoNE, a state-of-the-art large language model rediscovered the state-of-the-art algorithm for two-player games within hours, a feat that had taken human researchers 15 years to achieve. For three-player games, the model discovered a novel algorithm surpassing all existing human-designed ones. This work demonstrates a new human-machine collaborative paradigm for theoretical science: humans reason at a higher-abstract level, using symbols to compress the search space, and AI explores within it, achieving what neither could alone.
zh
[AI-107] SimInterview: Transforming Business Education through Large Language Model-Based Simulated Multilingual Interview Training System
【速读】:该论文旨在解决传统商务面试准备教学中缺乏个性化和跨文化适应性训练的问题,尤其是在人工智能(AI)重塑劳动力市场的背景下,企业对求职者软技能与文化敏感度的要求日益提高。其解决方案的关键在于构建一个基于大语言模型(Large Language Model, LLM)的多语言模拟面试系统 SimInterview,该系统通过集成检索增强生成(Retrieval-Augmented Generation, RAG)技术动态匹配简历与岗位需求,并利用合成AI技术生成具备真实对话能力的虚拟招聘官,实现跨语言、个性化的实时面试演练。系统核心组件包括多个LLM(如OpenAI o3、Llama 4 Maverick、Gemma 3)、Whisper语音识别、GPT-SoVITS语音合成、Ditto人脸驱动模型及ChromaDB向量数据库,在英语和日语市场均验证了其在提升面试准备度、保持简历内容一致性以及用户满意度方面的有效性,同时提出可解释、可检测偏见并保留人工干预机制的负责任AI设计框架,以应对未来监管要求。
链接: https://arxiv.org/abs/2508.11873
作者: Truong Thanh Hung Nguyen,Tran Diem Quynh Nguyen,Hoang Loc Cao,Thi Cam Thanh Tran,Thi Cam Mai Truong,Hung Cao
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Published as a conference paper at ICEFM 2025
Abstract:Business interview preparation demands both solid theoretical grounding and refined soft skills, yet conventional classroom methods rarely deliver the individualized, culturally aware practice employers currently expect. This paper introduces SimInterview, a large language model (LLM)-based simulated multilingual interview training system designed for business professionals entering the AI-transformed labor market. Our system leverages an LLM agent and synthetic AI technologies to create realistic virtual recruiters capable of conducting personalized, real-time conversational interviews. The framework dynamically adapts interview scenarios using retrieval-augmented generation (RAG) to match individual resumes with specific job requirements across multiple languages. Built on LLMs (OpenAI o3, Llama 4 Maverick, Gemma 3), integrated with Whisper speech recognition, GPT-SoVITS voice synthesis, Ditto diffusion-based talking head generation model, and ChromaDB vector databases, our system significantly improves interview readiness across English and Japanese markets. Experiments with university-level candidates show that the system consistently aligns its assessments with job requirements, faithfully preserves resume content, and earns high satisfaction ratings, with the lightweight Gemma 3 model producing the most engaging conversations. Qualitative findings revealed that the standardized Japanese resume format improved document retrieval while diverse English resumes introduced additional variability, and they highlighted how cultural norms shape follow-up questioning strategies. Finally, we also outlined a contestable AI design that can explain, detect bias, and preserve human-in-the-loop to meet emerging regulatory expectations.
zh
[AI-108] Singing Syllabi with Virtual Avatars: Enhancing Student Engagement Through AI-Generated Music and Digital Embodiment
【速读】:该论文旨在解决传统文本型课程大纲(syllabus)在实际教学中难以被学生充分阅读和理解的问题,导致关键信息如课程政策和学习目标常被忽略。其解决方案的关键在于利用生成式 AI (Generative AI) 技术,将文本 syllabus 转化为包含虚拟角色演唱的音视频呈现形式,借助 HeyGem 等开源工具实现数字人以歌曲形式演绎课程内容,从而提升学生的兴趣、情感联结与关键信息的记忆效果。
链接: https://arxiv.org/abs/2508.11872
作者: Xinxing Wu
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM)
备注: 17 pages, 4 figures, 3 tables
Abstract:In practical teaching, we observe that few students thoroughly read or fully comprehend the information provided in traditional, text-based course syllabi. As a result, essential details, such as course policies and learning outcomes, are frequently overlooked. To address this challenge, in this paper, we propose a novel approach leveraging AI-generated singing and virtual avatars to present syllabi in a format that is more visually appealing, engaging, and memorable. Especially, we leveraged the open-source tool, HeyGem, to transform textual syllabi into audiovisual presentations, in which digital avatars perform the syllabus content as songs. The proposed approach aims to stimulate students’ curiosity, foster emotional connection, and enhance retention of critical course information. Student feedback indicated that AI-sung syllabi significantly improved awareness and recall of key course information.
zh
[AI-109] AI-Augmented CI/CD Pipelines: From Code Commit to Production with Autonomous Decisions
【速读】:该论文旨在解决现代软件交付过程中因人工决策环节(如处理不稳定的测试、选择回滚策略、调整功能标志、决定何时推广金丝雀发布)而导致的延迟和运维负担问题。其核心解决方案是提出AI增强型CI/CD流水线(AI-Augmented CI/CD Pipelines),通过大语言模型(Large Language Models, LLMs)和自主代理(autonomous agents)作为受策略约束的协作者,逐步过渡为决策主体。关键创新包括:一个可嵌入智能决策点的参考架构、基于策略即代码(policy-as-code)的决策分类与防护模式、分阶段自主性的信任层级框架、结合DevOps研究与评估(DORA)指标与AI特有指标的评估方法,以及一个工业级React 19微服务迁移案例,从而实现更高效、可控且可验证的自动化部署流程。
链接: https://arxiv.org/abs/2508.11867
作者: Mohammad Baqar,Saba Naqvi,Rajat Khanda
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: 13 Pages
Abstract:Modern software delivery has accelerated from quarterly releases to multiple deployments per day. While CI/CD tooling has matured, human decision points interpreting flaky tests, choosing rollback strategies, tuning feature flags, and deciding when to promote a canary remain major sources of latency and operational toil. We propose AI-Augmented CI/CD Pipelines, where large language models (LLMs) and autonomous agents act as policy-bounded co-pilots and progressively as decision makers. We contribute: (1) a reference architecture for embedding agentic decision points into CI/CD, (2) a decision taxonomy and policy-as-code guardrail pattern, (3) a trust-tier framework for staged autonomy, (4) an evaluation methodology using DevOps Research and Assessment ( DORA) metrics and AI-specific indicators, and (5) a detailed industrial-style case study migrating a React 19 microservice to an AI-augmented pipeline. We discuss ethics, verification, auditability, and threats to validity, and chart a roadmap for verifiable autonomy in production delivery systems.
zh
[AI-110] EvoCut: Strengthening Integer Programs via Evolution-Guided Language Models
【速读】:该论文旨在解决整数规划(Integer Programming)中加速求解的难题,特别是如何自动化生成高质量的加速割平面(acceleration cuts),以提升求解器性能。传统方法依赖人工设计割平面,不仅耗时且需深厚专业知识,难以推广。其解决方案的关键在于提出EvoCut框架,通过结合大语言模型(Large Language Models, LLMs)与进化搜索机制:首先利用LLM初始化多样化的候选割平面群体;随后基于验证集实证评估每条割平面在保留最优解的同时剪除分数解的能力;最后通过进化交叉和变异操作迭代优化群体。该方法无需人工干预即可生成、改进并验证具有泛化能力的割平面,显著降低求解器的最优性间隙(optimality gap),在固定时间内提升求解速度达4倍,并获得更优解。
链接: https://arxiv.org/abs/2508.11850
作者: Milad Yazdani,Mahdi Mostajabdaveh,Samin Aref,Zirui Zhou
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:Integer programming lies at the heart of crucial combinatorial optimization tasks but remains challenging due to its NP-hard nature. An effective approach for practically solving integer programs is the manual design of acceleration cuts, i.e. inequalities that improve solver performance. However, this creative process demands deep expertise and is yet to be automated. Our proposed framework, EvoCut, automates the generation of acceleration cuts by combining large language models (LLMs) with an evolutionary search. EvoCut (i) initializes a diverse population of candidate cuts via an LLM-based initializer agent; (ii) for each cut empirically evaluates both preservation of the optimal solution and its ability to cut off fractional solutions across a verification set; and (iii) iteratively refines the population through evolutionary crossover and mutation agents. We quantify each cut’s utility by its relative reduction in the solver’s optimality gap. Our comparisons against standard integer programming practice show that EvoCut reduces optimality gap by 17-57% within a fixed time. It obtains the same solutions up to 4 times as fast, and obtains higher-quality solutions within the same time limit. Requiring no human expert input, EvoCut reliably generates, improves, and empirically verifies cuts that generalize to unseen instances. The code is available at this https URL.
zh
[AI-111] What Matters for Bioacoustic Encoding
【速读】:该论文旨在解决生物声学(bioacoustics)领域中因标注数据有限而导致的下游任务性能受限问题,尤其是针对物种分类、个体识别、行为检测和发声谱系发现等多样化任务,缺乏一个通用的生物声学编码器(bioacoustic encoder)来提取可迁移的特征表示。解决方案的关键在于通过大规模实证研究,系统性地探索训练数据多样性与规模、模型架构与训练策略(training recipes)以及评估任务广度对编码器性能的影响;研究发现,采用自监督预训练结合混合生物声学与通用音频语料库的监督微调(supervised post-training),能够在26个不同数据集上实现最优的分布内与分布外性能表现,且强调了两个阶段数据多样性的关键作用。
链接: https://arxiv.org/abs/2508.11845
作者: Marius Miron,David Robinson,Milad Alizadeh,Ellen Gilsenan-McMahon,Gagan Narula,Olivier Pietquin,Matthieu Geist,Emmanuel Chemla,Maddie Cusimano,Felix Effenberger,Masato Hagiwara,Benjamin Hoffman,Sara Keen,Diane Kim,Jane Lawton,Jen-Yu Liu,Aza Raskin
机构: 未知
类目: ound (cs.SD); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG)
备注:
Abstract:Bioacoustics, the study of sounds produced by living organisms, plays a vital role in conservation, biodiversity monitoring, and behavioral studies. Many tasks in this field, such as species, individual, and behavior classification and detection, are well-suited to machine learning. However, they often suffer from limited annotated data, highlighting the need for a general-purpose bioacoustic encoder capable of extracting useful representations for diverse downstream tasks. Such encoders have been proposed before, but are often limited in scope due to a focus on a narrow range of species (typically birds), and a reliance on a single model architecture or training paradigm. Moreover, they are usually evaluated on a small set of tasks and datasets. In this work, we present a large-scale empirical study that covers aspects of bioacoustics that are relevant to research but have previously been scarcely considered: training data diversity and scale, model architectures and training recipes, and the breadth of evaluation tasks and datasets. We obtain encoders that are state-of-the-art on the existing and proposed benchmarks. We also identify what matters for training these encoders, such that this work can be extended when more data are available or better architectures are proposed. Specifically, across 26 datasets with tasks including species classification, detection, individual ID, and vocal repertoire discovery, we find self-supervised pre-training followed by supervised post-training on a mixed bioacoustics + general-audio corpus yields the strongest in- and out-of-distribution performance. We show the importance of data diversity in both stages. To support ongoing research and application, we will release the model checkpoints.
zh
[AI-112] Finite Automata Extraction: Low-data World Model Learning as Programs from Gameplay Video
【速读】:该论文旨在解决传统世界模型(World Models)在环境动态建模中可迁移性差与可解释性弱的问题,尤其针对基于神经网络的表示难以捕捉精确环境规则和泛化能力不足的局限。其解决方案的关键在于提出一种名为有限自动机提取(Finite Automata Extraction, FAE)的新方法,通过从游戏视频中学习到的程序化表示(使用一种新型领域特定语言 Retro Coder)来构建神经符号(neuro-symbolic)世界模型,从而实现更精确的环境建模和更强的代码泛化能力。
链接: https://arxiv.org/abs/2508.11836
作者: Dave Goel,Matthew Guzdial,Anurag Sarkar
机构: 未知
类目: Artificial Intelligence (cs.AI)
备注:
Abstract:World models are defined as a compressed spatial and temporal learned representation of an environment. The learned representation is typically a neural network, making transfer of the learned environment dynamics and explainability a challenge. In this paper, we propose an approach, Finite Automata Extraction (FAE), that learns a neuro-symbolic world model from gameplay video represented as programs in a novel domain-specific language (DSL): Retro Coder. Compared to prior world model approaches, FAE learns a more precise model of the environment and more general code than prior DSL-based approaches.
zh
[AI-113] Rethinking Autonomy: Preventing Failures in AI-Driven Software Engineering
【速读】:该论文旨在解决大型语言模型(Large Language Models, LLMs)在软件工程中辅助代码生成时所引入的安全与治理风险,包括不安全代码生成、幻觉输出、不可逆操作以及缺乏透明度和问责机制等问题。其核心解决方案是提出SAFE-AI框架,该框架以安全性(Safety)、可审计性(Auditability)、反馈机制(Feedback)和可解释性(Explainability)为四大支柱,集成护栏机制(guardrails)、沙箱环境(sandboxing)、运行时验证(runtime verification)、风险感知日志记录、人机协同系统(human-in-the-loop)及可解释AI技术,从而有效降低风险并增强对AI行为的信任与合规性。此外,论文还构建了一个新型AI行为分类体系,将AI行为划分为建议型、生成型、自主型和破坏型,用于指导风险评估与监管实践。
链接: https://arxiv.org/abs/2508.11824
作者: Satyam Kumar Navneet,Joydeep Chandra
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Performance (cs.PF)
备注:
Abstract:The integration of Large Language Models (LLMs) into software engineering has revolutionized code generation, enabling unprecedented productivity through promptware and autonomous AI agents. However, this transformation introduces significant risks, including insecure code generation, hallucinated outputs, irreversible actions, and a lack of transparency and accountability. Incidents like the Replit database deletion underscore the urgent need for robust safety and governance mechanisms. This paper comprehensively analyzes the inherent challenges of LLM-assisted code generation, such as vulnerability inheritance, overtrust, misinterpretation, and the absence of standardized validation and rollback protocols. To address these, we propose the SAFE-AI Framework, a holistic approach emphasizing Safety, Auditability, Feedback, and Explainability. The framework integrates guardrails, sandboxing, runtime verification, risk-aware logging, human-in-the-loop systems, and explainable AI techniques to mitigate risks while fostering trust and compliance. We introduce a novel taxonomy of AI behaviors categorizing suggestive, generative, autonomous, and destructive actions to guide risk assessment and oversight. Additionally, we identify open problems, including the lack of standardized benchmarks for code specific hallucinations and autonomy levels, and propose future research directions for hybrid verification, semantic guardrails, and proactive governance tools. Through detailed comparisons of autonomy control, prompt engineering, explainability, and governance frameworks, this paper provides a roadmap for responsible AI integration in software engineering, aligning with emerging regulations like the EU AI Act and Canada’s AIDA to ensure safe, transparent, and accountable AI-driven development.
zh
[AI-114] FairTabGen: Unifying Counterfactual and Causal Fairness in Synthetic Tabular Data Generation
【速读】:该论文旨在解决在隐私敏感且数据稀缺场景下,生成具有高统计效用的同时保障公平性的表格型合成数据的问题,尤其关注反事实公平性和因果公平性(counterfactual and causal fairness)的提升。其解决方案的关键在于提出 FairTabGen,一个基于大语言模型(large language model, LLM)的公平感知框架,通过将多种公平性定义整合进生成与评估双管道,并结合上下文学习(in-context learning)、提示优化(prompt refinement)和公平性感知的数据筛选策略,在仅使用原始数据不到20%的情况下,显著提升公平性指标(如人口均等性和路径特定因果效应)达10%,同时保持良好的统计效用,实现了低数据环境下公平与效用的高效平衡。
链接: https://arxiv.org/abs/2508.11810
作者: Nitish Nagesh,Salar Shakibhamedan,Mahdi Bagheri,Ziyu Wang,Nima TaheriNejad,Axel Jantsch,Amir M. Rahmani
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Generating synthetic data is crucial in privacy-sensitive, data-scarce settings, especially for tabular datasets widely used in real-world applications. A key challenge is improving counterfactual and causal fairness, while preserving high utility. We present FairTabGen, a fairness-aware large language model-based framework for tabular synthetic data generation. We integrate multiple fairness definitions including counterfactual and causal fairness into both its generation and evaluation pipelines. We use in-context learning, prompt refinement, and fairness-aware data curation to balance fairness and utility. Across diverse datasets, our method outperforms state-of-the-art GAN-based and LLM-based methods, achieving up to 10% improvements on fairness metrics such as demographic parity and path-specific causal effects while retaining statistical utility. Remarkably, it achieves these gains using less than 20% of the original data, highlighting its efficiency in low-data regimes. These results demonstrate a principled and practical approach for generating fair and useful synthetic tabular data.
zh
[AI-115] Uncalibrated Reasoning : GRPO Induces Overconfidence for Stochastic Outcomes
【速读】:该论文旨在解决当前强化学习(Reinforcement Learning, RL)方法在处理具有随机结果的可验证领域(如科学实验)时,语言模型的概率预测是否能够保持校准的问题。研究发现,Group Relative Policy Optimization (GRPO) 会导致对二元随机结果的过度自信预测,而 Proximal Policy Optimization (PPO) 和 REINFORCE Leave-One-Out (RLOO) 则能生成校准良好的模型。解决方案的关键在于移除 GRPO 中的组标准归一化(group standard normalization),这一改动修复了其概率校准偏差,并通过理论分析揭示了标准化导致过自信的根本原因。该发现为将 RL 应用于非确定性推理任务提供了重要指导。
链接: https://arxiv.org/abs/2508.11800
作者: Michael Bereket,Jure Leskovec
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics. Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with stochastic outcomes, like scientific experiments. Through applications to synthetic data and real-world biological experiments, we demonstrate that Group Relative Policy Optimization (GRPO) induces overconfident probability predictions for binary stochastic outcomes, while Proximal Policy Optimization (PPO) and REINFORCE Leave-One-Out (RLOO) yield well-calibrated models. We show that removing group standard normalization in GRPO fixes its miscalibration and provide a theoretical explanation for why normalization causes overconfidence. Our results provide new evidence against the use of standard normalization in GRPO and help pave the way for applications of RL for reasoning language models beyond deterministic domains.
zh
[AI-116] SafeSieve: From Heuristics to Experience in Progressive Pruning for LLM -based Multi-Agent Communication
【速读】:该论文旨在解决基于大语言模型(Large Language Model, LLM)的多智能体系统中因冗余通信和过高Token开销导致的效率低下问题。现有方法通常通过预训练图神经网络(Graph Neural Network, GNN)或贪心Top-k剪枝策略提升效率,但往往将任务前优化与任务后优化割裂,缺乏统一优化框架。其解决方案的关键在于提出SafeSieve算法,该算法采用一种渐进式、自适应的多智能体剪枝机制,结合初始LLM语义评估与累积性能反馈的双机制设计,实现从启发式初始化到经验驱动精炼的平滑过渡;同时引入0扩展聚类(0-extension clustering)技术,在保留结构连贯性智能体组的同时剔除无效通信链路,从而在保持高准确率(如SVAMP、HumanEval等基准上平均达94.01%)的前提下显著降低Token消耗(减少12.4%-27.8%),并增强对提示注入攻击的鲁棒性及异构环境下的部署经济性。
链接: https://arxiv.org/abs/2508.11733
作者: Ruijia Zhang,Xinyan Zhao,Ruixiang Wang,Sigen Chen,Guibin Zhang,An Zhang,Kun Wang,Qingsong Wen
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI)
备注: 7 pages for main content, 5 figures, 4 tables
Abstract:LLM-based multi-agent systems exhibit strong collaborative capabilities but often suffer from redundant communication and excessive token overhead. Existing methods typically enhance efficiency through pretrained GNNs or greedy algorithms, but often isolate pre- and post-task optimization, lacking a unified strategy. To this end, we present SafeSieve, a progressive and adaptive multi-agent pruning algorithm that dynamically refines the inter-agent communication through a novel dual-mechanism. SafeSieve integrates initial LLM-based semantic evaluation with accumulated performance feedback, enabling a smooth transition from heuristic initialization to experience-driven refinement. Unlike existing greedy Top-k pruning methods, SafeSieve employs 0-extension clustering to preserve structurally coherent agent groups while eliminating ineffective links. Experiments across benchmarks (SVAMP, HumanEval, etc.) showcase that SafeSieve achieves 94.01% average accuracy while reducing token usage by 12.4%-27.8%. Results further demonstrate robustness under prompt injection attacks (1.23% average accuracy drop). In heterogeneous settings, SafeSieve reduces deployment costs by 13.3% while maintaining performance. These results establish SafeSieve as a robust, efficient, and scalable framework for practical multi-agent systems. Our code can be found in this https URL.
zh
[AI-117] BRIEF: BRain-Inspired network connection search with Extensive temporal feature Fusion enhances disease classification
【速读】:该论文旨在解决功能磁共振成像(fMRI)基分类模型中存在的两个关键问题:一是网络架构设计依赖人工经验,缺乏自动化优化;二是特征空间融合方式单一(如简单拼接),缺乏跨特征间的互学习机制。其解决方案的核心在于提出一种脑启发式特征融合框架(BRIEF),通过引入改进的神经网络连接搜索(NCS)策略与基于Transformer的多特征融合模块实现突破:首先利用强化学习(Q-learning)将NCS建模为马尔可夫决策过程,动态优化各编码器内的网络结构以提取高层次特征向量;随后通过Transformer融合所有特征向量,有效捕捉不同脑区间的稳定/时变连接及多尺度依赖关系,并嵌入注意力机制提升模型可解释性。该方法在精神分裂症(SZ)和自闭症谱系障碍(ASD)的分类任务中显著优于21种主流算法,AUC分别达到91.5%和78.4%,首次实现了基于脑启发强化学习的自动网络架构优化与高效特征融合。
链接: https://arxiv.org/abs/2508.11732
作者: Xiangxiang Cui,Min Zhao,Dongmei Zhi,Shile Qi,Vince D Calhoun,Jing Sui
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Existing deep learning models for functional MRI-based classification have limitations in network architecture determination (relying on experience) and feature space fusion (mostly simple concatenation, lacking mutual learning). Inspired by the human brain’s mechanism of updating neural connections through learning and decision-making, we proposed a novel BRain-Inspired feature Fusion (BRIEF) framework, which is able to optimize network architecture automatically by incorporating an improved neural network connection search (NCS) strategy and a Transformer-based multi-feature fusion module. Specifically, we first extracted 4 types of fMRI temporal representations, i.e., time series (TCs), static/dynamic functional connection (FNC/dFNC), and multi-scale dispersion entropy (MsDE), to construct four encoders. Within each encoder, we employed a modified Q-learning to dynamically optimize the NCS to extract high-level feature vectors, where the NCS is formulated as a Markov Decision Process. Then, all feature vectors were fused via a Transformer, leveraging both stable/time-varying connections and multi-scale dependencies across different brain regions to achieve the final classification. Additionally, an attention module was embedded to improve interpretability. The classification performance of our proposed BRIEF was compared with 21 state-of-the-art models by discriminating two mental disorders from healthy controls: schizophrenia (SZ, n=1100) and autism spectrum disorder (ASD, n=1550). BRIEF demonstrated significant improvements of 2.2% to 12.1% compared to 21 algorithms, reaching an AUC of 91.5% - 0.6% for SZ and 78.4% - 0.5% for ASD, respectively. This is the first attempt to incorporate a brain-inspired, reinforcement learning strategy to optimize fMRI-based mental disorder classification, showing significant potential for identifying precise neuroimaging biomarkers.
zh
[AI-118] he Stories We Govern By: AI Risk and the Power of Imaginaries AAAI
【速读】:该论文试图解决的问题是:不同社会技术想象(sociotechnical imaginaries)对人工智能(Artificial Intelligence, AI)风险的建构如何影响治理决策与监管约束,进而限制了其他可能的治理路径。其解决方案的关键在于批判性地审视三种主导叙事——存在性风险倡导者、加速主义者和批判AI学者——并指出这些叙事在规范性未来愿景、对当前社会秩序的诊断、科技观及人类能动性认知上的差异,从而揭示它们如何嵌入特定的风险假设并推动政策制定。作者主张摒弃推测性的教条主义,转向以务实为基础的监管策略,以避免因单一确定性想象而排除多元治理可能性。
链接: https://arxiv.org/abs/2508.11729
作者: Ninell Oldenburg,Gleb Papyshev
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: 10 pages, accepted at the 8th AAAI/ACM Conference on AI, Ethics, and Society
Abstract:This paper examines how competing sociotechnical imaginaries of artificial intelligence (AI) risk shape governance decisions and regulatory constraints. Drawing on concepts from science and technology studies, we analyse three dominant narrative groups: existential risk proponents, who emphasise catastrophic AGI scenarios; accelerationists, who portray AI as a transformative force to be unleashed; and critical AI scholars, who foreground present-day harms rooted in systemic inequality. Through an analysis of representative manifesto-style texts, we explore how these imaginaries differ across four dimensions: normative visions of the future, diagnoses of the present social order, views on science and technology, and perceived human agency in managing AI risks. Our findings reveal how these narratives embed distinct assumptions about risk and have the potential to progress into policy-making processes by narrowing the space for alternative governance approaches. We argue against speculative dogmatism and for moving beyond deterministic imaginaries toward regulatory strategies that are grounded in pragmatism.
zh
[AI-119] Are AI Machines Making Humans Obsolete?
【速读】:该论文旨在解决生成式 AI(Generative AI, GenAI)在快速发展过程中所引发的潜在风险与挑战,包括因失控的机器学习导致的非预期后果及人类对其结果理解不足的问题。其解决方案的关键在于建立有效的控制机制,以系统性地管理 GenAI 的应用边界和行为逻辑,从而防范可能的反乌托邦后果,并确保其发展符合伦理规范与社会利益。
链接: https://arxiv.org/abs/2508.11719
作者: Matthias Scheutz
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注: Forthcoming in Ramana Kumar Vinjamuri (ed.) “Bridging the Gap between Mind and Machine”, Springer
Abstract:This chapter starts with a sketch of how we got to “generative AI” (GenAI) and a brief summary of the various impacts it had so far. It then discusses some of the opportunities of GenAI, followed by the challenges and dangers, including dystopian outcomes resulting from using uncontrolled machine learning and our failures to understand the results. It concludes with some suggestions for how to control GenAI and address its dangers.
zh
[AI-120] Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLM s KDD
【速读】:该论文旨在解决Excel公式语义运行时错误(semantic runtime errors)的自动修复问题,这一问题在初学者中尤为突出,且现有大语言模型(Large Language Models, LLMs)虽能解释错误,但缺乏有效手段进行自动修正。其关键解决方案在于构建一个高质量、覆盖常见错误类型的基准数据集,并提出一套可扩展的数据生成流水线:该流程基于少量人工标注的种子样本,结合少样本提示(few-shot prompting)与LLM-as-a-Judge验证框架,辅以执行一致性检查,确保生成数据的正确性和语义保真度,最终形成包含618个样本的基准数据集。此外,作者还设计了一种上下文感知的基线修复方法,利用LLM同时理解错误公式及其所在电子表格的上下文信息,从而提升修复准确性。
链接: https://arxiv.org/abs/2508.11715
作者: Ananya Singha,Harshita Sahijwani,Walt Williams,Emmanuel Aboah Boateng,Nick Hausman,Miguel Di Luca,Keegan Choudhury,Chaya Binet,Vu Le,Tianwei Chen,Oryan Rokeah Chen,Sulaiman Vesal,Sadid Hasan
机构: 未知
类目: oftware Engineering (cs.SE); Artificial Intelligence (cs.AI)
备注: Accepted at the KDD workshop on Evaluation and Trustworthiness of Agentic and Generative AI Models
Abstract:Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer promising assistance by explaining formula errors, the automated correction of these semantic runtime errors remains an open problem. A primary challenge to advancing models for such scenarios is the severe lack of high-quality, comprehensive datasets for training and rigorous evaluation. This paper addresses this gap by introducing a novel approach for constructing a benchmark dataset specifically designed for Excel formula repair. We propose a data generation pipeline, which leverages a small set of curated seed samples from online forums to synthetically expand the dataset. Our pipeline integrates few-shot prompting with LLMs and employs a robust \textitLLM-as-a-Judge validation framework, combined with execution-based checks to ensure the correctness and semantic fidelity of the generated data. This process produced a benchmark dataset of 618 high-quality samples, covering common runtime errors. Furthermore, we propose a context-aware baseline technique for Excel formula repair that utilizes LLMs to leverage both the faulty formula, and relevant spreadsheet context. We evaluate the performance of various LLMs (GPT-4o, GPT-4.1, Phi-3, Mistral) on our newly generated benchmark using execution-based metrics. Our analysis demonstrates the dataset’s quality through manual annotation and provides insights into error and function distributions. The proposed generation methodology is highly scalable and can be readily adapted to create evaluation benchmarks for similar code repair tasks in other low-resource programming languages.
zh
[AI-121] Enhancing GraphQL Security by Detecting Malicious Queries Using Large Language Models Sentence Transformers and Convolutional Neural Networks
【速读】:该论文旨在解决GraphQL API在灵活数据获取特性下引入的独特安全漏洞问题,这些问题传统API安全机制难以有效防护,例如恶意查询可能引发拒绝服务(Denial-of-Service, DoS)攻击、通过注入方式进行数据泄露(如SQL注入、OS命令注入和跨站脚本攻击XSS)等。解决方案的关键在于提出一种基于人工智能(AI)的实时检测方法,其核心是融合静态分析与多种机器学习技术:利用大型语言模型(Large Language Models, LLMs)动态配置schema规则,采用Sentence Transformers(SBERT和Doc2Vec)对查询负载进行上下文嵌入表示,并结合卷积神经网络(Convolutional Neural Networks, CNNs)、随机森林(Random Forests)和多层感知机(Multilayer Perceptrons)实现高精度分类。系统架构经过生产环境优化(如ONNX Runtime加速与并行处理),实验证明该方案能有效识别多种攻击类型并实现高效防御。
链接: https://arxiv.org/abs/2508.11711
作者: Irash Perera(1),Hiranya Abeyrathne(2),Sanjeewa Malalgoda(2),Arshardh Ifthikar(2) ((1) Department of Computer Science and Engineering, University of Moratuwa, Colombo, Sri Lanka, (2) WSO2, Colombo, Sri Lanka)
机构: 未知
类目: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:GraphQL’s flexibility, while beneficial for efficient data fetching, introduces unique security vulnerabilities that traditional API security mechanisms often fail to address. Malicious GraphQL queries can exploit the language’s dynamic nature, leading to denial-of-service attacks, data exfiltration through injection, and other exploits. Existing solutions, such as static analysis, rate limiting, and general-purpose Web Application Firewalls, offer limited protection against sophisticated, context-aware attacks. This paper presents a novel, AI-driven approach for real-time detection of malicious GraphQL queries. Our method combines static analysis with machine learning techniques, including Large Language Models (LLMs) for dynamic schema-based configuration, Sentence Transformers (SBERT and Doc2Vec) for contextual embedding of query payloads, and Convolutional Neural Networks (CNNs), Random Forests, and Multilayer Perceptrons for classification. We detail the system architecture, implementation strategies optimized for production environments (including ONNX Runtime optimization and parallel processing), and evaluate the performance of our detection models and the overall system under load. Results demonstrate high accuracy in detecting various threats, including SQL injection, OS command injection, and XSS exploits, alongside effective mitigation of DoS and SSRF attempts. This research contributes a robust and adaptable solution for enhancing GraphQL API security.
zh
[AI-122] Navigating the New Landscape: A Conceptual Model for Project-Based Assessment (PBA) in the Age of GenAI
【速读】:该论文旨在解决生成式人工智能(Generative Artificial Intelligence, GenAI)快速融入高等教育背景下,传统以成果为导向的项目式评估(Project-Based Assessment, PBA)所面临的挑战,包括作品真实性、学术诚信和学习成效验证等问题。其解决方案的关键在于构建一种以过程为导向的评估模型,强调多模态与多维度的评估设计,并将GenAI作为辅助工具用于个性化反馈,从而促进高阶思维能力的发展,同时确保评估过程可观察、可追踪且符合伦理规范。
链接: https://arxiv.org/abs/2508.11709
作者: Rajan Kadel,Samar Shailendra,Urvashi Rahul Saxena
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid integration of Generative Artificial Intelligence (GenAI) into higher education presents both opportunities and challenges for assessment design, particularly within Project-Based Assessment (PBA) contexts. Traditional assessment methods often emphasise the final product in the PBA, which can now be significantly influenced or created by GenAI tools, raising concerns regarding product authenticity, academic integrity, and learning validation. This paper advocates for a reimagined assessment model for Project-Based Learning (PBL) or a capstone project that prioritises process-oriented evaluation, multi-modal and multifaceted assessment design, and ethical engagement with GenAI to enable higher-order thinking. The model also emphasises the use of (GenAI-assisted) personalised feedback by a supervisor as an observance of the learning process during the project lifecycle. A use case scenario is provided to illustrate the application of the model in a capstone project setting. The paper concludes with recommendations for educators and curriculum designers to ensure that assessment practices remain robust, learner-centric, and integrity-driven in the evolving landscape of GenAI.
zh
[AI-123] Street Review: A Participatory AI-Based Framework for Assessing Streetscape Inclusivity
【速读】:该论文旨在解决城市公共街道空间在社会、人口和文化变迁背景下,如何系统评估其包容性(inclusivity)与可达性(accessibility)的问题。传统方法难以全面捕捉不同群体对街道环境的主观体验与物理特征之间的关联,因而限制了规划决策的科学性与公平性。解决方案的关键在于提出“Street Review”这一混合方法框架,融合参与式研究(participatory research)与基于人工智能(AI-based analysis)的图像识别技术,通过28名居民的半结构化访谈及图像评价,并结合约45,000张Mapillary街景图像的视觉分析,生成热力图等可视化工具,将用户主观感知与街道物理属性(如人行道、维护状况、绿化和座椅配置)进行定量关联。该方法强调通过精细化数据标注与共同生产策略(co-production strategies),提升机器学习模型对多元用户需求的响应能力,从而为城市规划者提供可操作的实证依据,推动更具包容性的街道设计与政策制定。
链接: https://arxiv.org/abs/2508.11708
作者: Rashid Mushkani,Shin Koseki
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Urban centers undergo social, demographic, and cultural changes that shape public street use and require systematic evaluation of public spaces. This study presents Street Review, a mixed-methods approach that combines participatory research with AI-based analysis to assess streetscape inclusivity. In Montréal, Canada, 28 residents participated in semi-directed interviews and image evaluations, supported by the analysis of approximately 45,000 street-view images from Mapillary. The approach produced visual analytics, such as heatmaps, to correlate subjective user ratings with physical attributes like sidewalk, maintenance, greenery, and seating. Findings reveal variations in perceptions of inclusivity and accessibility across demographic groups, demonstrating that incorporating diverse user feedback can enhance machine learning models through careful data-labeling and co-production strategies. The Street Review framework offers a systematic method for urban planners and policy analysts to inform planning, policy development, and management of public streets.
zh
[AI-124] Listening with Language Models: Using LLM s to Collect and Interpret Classroom Feedback
【速读】:该论文试图解决传统期末问卷调查无法为教师提供及时、详细且可操作的教学反馈的问题。其解决方案的关键在于利用生成式 AI(Generative AI)驱动的聊天机器人,通过设计一个包含 PromptDesigner、FeedbackCollector 和 FeedbackAnalyzer 三部分的系统,引导学生参与反思性对话,从而实现更具情境相关性和互动性的教学反馈收集与分析流程。
链接: https://arxiv.org/abs/2508.11707
作者: Sai Siddartha Maram,Ulia Zaman,Magy Seif El-Nasr
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Traditional end-of-quarter surveys often fail to provide instructors with timely, detailed, and actionable feedback about their teaching. In this paper, we explore how Large Language Model (LLM)-powered chatbots can reimagine the classroom feedback process by engaging students in reflective, conversational dialogues. Through the design and deployment of a three-part system-PromptDesigner, FeedbackCollector, and FeedbackAnalyzer-we conducted a pilot study across two graduate courses at UC Santa Cruz. Our findings suggest that LLM-based feedback systems offer richer insights, greater contextual relevance, and higher engagement compared to standard survey tools. Instructors valued the system’s adaptability, specificity, and ability to support mid-course adjustments, while students appreciated the conversational format and opportunity for elaboration. We conclude by discussing the design implications of using AI to facilitate more meaningful and responsive feedback in higher education.
zh
[AI-125] Centralized Permutation Equivariant Policy for Cooperative Multi-Agent Reinforcement Learning
【速读】:该论文旨在解决多智能体强化学习(Multi-Agent Reinforcement Learning, MARL)中集中训练与分散执行(Centralized Training with Decentralized Execution, CTDE)范式所面临的性能瓶颈问题:一方面,分散执行的策略因部分可观测性导致性能次优;另一方面,完全集中式的方案在智能体数量增加时面临可扩展性挑战。其解决方案的关键在于提出一种新的集中式训练与执行框架——中央化排列等变(Centralized Permutation Equivariant, CPE)学习,该方法利用轻量、可扩展且易于实现的全局-局部排列等变(Global-Local Permutation Equivariant, GLPE)网络架构,在训练和执行阶段均采用全集中式策略,从而兼顾性能与可扩展性。实验表明,CPE能无缝集成于价值分解和演员-评论家方法,并显著提升标准CTDE算法在MPE、SMAC和RWARE等协作基准上的表现,达到当前最优RWARE实现的水平。
链接: https://arxiv.org/abs/2508.11706
作者: Zhuofan Xu,Benedikt Bollig,Matthias Függer,Thomas Nowak,Vincent Le Dréau
机构: 未知
类目: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:The Centralized Training with Decentralized Execution (CTDE) paradigm has gained significant attention in multi-agent reinforcement learning (MARL) and is the foundation of many recent algorithms. However, decentralized policies operate under partial observability and often yield suboptimal performance compared to centralized policies, while fully centralized approaches typically face scalability challenges as the number of agents increases. We propose Centralized Permutation Equivariant (CPE) learning, a centralized training and execution framework that employs a fully centralized policy to overcome these limitations. Our approach leverages a novel permutation equivariant architecture, Global-Local Permutation Equivariant (GLPE) networks, that is lightweight, scalable, and easy to implement. Experiments show that CPE integrates seamlessly with both value decomposition and actor-critic methods, substantially improving the performance of standard CTDE algorithms across cooperative benchmarks including MPE, SMAC, and RWARE, and matching the performance of state-of-the-art RWARE implementations. Subjects: Multiagent Systems (cs.MA); Artificial Intelligence (cs.AI); Machine Learning (cs.LG) Cite as: arXiv:2508.11706 [cs.MA] (or arXiv:2508.11706v1 [cs.MA] for this version) https://doi.org/10.48550/arXiv.2508.11706 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-126] Next-Gen Education: Enhancing AI for Microlearning
【速读】:该论文旨在解决美国高校计算机科学教育中因新冠疫情后学生课堂出勤率和参与度下降所引发的教学效果下滑问题(即传统教学模式难以适应远程学习趋势下学生注意力分散与学习动力不足的挑战)。其解决方案的关键在于将微学习(microlearning)策略融入课程设计,通过将复杂知识点拆解为可管理的学习单元,并采用视频、测验、闪卡及情境练习等互动形式提升学习效率;同时,利用生成式 AI(如 ChatGPT)自动化生成辅助教学材料,显著降低教师内容开发负担,从而实现教学资源的高效更新与个性化支持,最终推动教育质量与学生参与度的双重提升。
链接: https://arxiv.org/abs/2508.11704
作者: Suman Saha,Fatemeh Rahbari,Farhan Sadique,Sri Krishna Chaitanya Velamakanni,Mahfuza Farooque,William J. Rothwell
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Emerging Technologies (cs.ET); Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
备注: Published and presented in 2025 ASEE Annual Conference and Exposition, 22 pages, 6 figures
Abstract:This paper explores integrating microlearning strategies into university curricula, particularly in computer science education, to counteract the decline in class attendance and engagement in US universities after COVID. As students increasingly opt for remote learning and recorded lectures, traditional educational approaches struggle to maintain engagement and effectiveness. Microlearning, which breaks complex subjects into manageable units, is proposed to address shorter attention spans and enhance educational outcomes. It uses interactive formats such as videos, quizzes, flashcards, and scenario-based exercises, which are especially beneficial for topics like algorithms and programming logic requiring deep understanding and ongoing practice. Adoption of microlearning is often limited by the effort needed to create such materials. This paper proposes leveraging AI tools, specifically ChatGPT, to reduce the workload for educators by automating the creation of supplementary materials. While AI can automate certain tasks, educators remain essential in guiding and shaping the learning process. This AI-enhanced approach ensures course content is kept current with the latest research and technology, with educators providing context and insights. By examining AI capabilities in microlearning, this study shows the potential to transform educational practices and outcomes in computer science, offering a practical model for combining advanced technology with established teaching methods.
zh
[AI-127] RefAdGen: High-Fidelity Advertising Image Generation
【速读】:该论文旨在解决当前生成式 AI (Generative AI) 技术在广告图像生成中面临的高保真度与效率难以兼得的问题,即现有方法要么需要针对每张参考产品图像进行大量微调才能保证质量,要么在多样化产品上难以维持一致性,从而限制了其在电商和营销领域的实际应用。解决方案的关键在于提出一个名为 RefAdGen 的新框架,其核心创新是采用解耦设计:通过在 U-Net 输入端注入产品掩码(product mask)实现精确的空间控制,并引入高效的注意力融合模块(Attention Fusion Module, AFM)以整合产品特征,从而有效平衡生成图像的保真度与计算效率。这一设计使得模型在未见过的产品和复杂真实场景图像上均能保持高质量输出,显著提升了泛化能力与实用性。
链接: https://arxiv.org/abs/2508.11695
作者: Yiyun Chen,Weikai Yang
机构: 未知
类目: Graphics (cs.GR); Artificial Intelligence (cs.AI)
备注:
Abstract:The rapid advancement of Artificial Intelligence Generated Content (AIGC) techniques has unlocked opportunities in generating diverse and compelling advertising images based on referenced product images and textual scene descriptions. This capability substantially reduces human labor and production costs in traditional marketing workflows. However, existing AIGC techniques either demand extensive fine-tuning for each referenced image to achieve high fidelity, or they struggle to maintain fidelity across diverse products, making them impractical for e-commerce and marketing industries. To tackle this limitation, we first construct AdProd-100K, a large-scale advertising image generation dataset. A key innovation in its construction is our dual data augmentation strategy, which fosters robust, 3D-aware representations crucial for realistic and high-fidelity image synthesis. Leveraging this dataset, we propose RefAdGen, a generation framework that achieves high fidelity through a decoupled design. The framework enforces precise spatial control by injecting a product mask at the U-Net input, and employs an efficient Attention Fusion Module (AFM) to integrate product features. This design effectively resolves the fidelity-efficiency dilemma present in existing methods. Extensive experiments demonstrate that RefAdGen achieves state-of-the-art performance, showcasing robust generalization by maintaining high fidelity and remarkable visual results for both unseen products and challenging real-world, in-the-wild images. This offers a scalable and cost-effective alternative to traditional workflows. Code and datasets are publicly available at this https URL.
zh
[AI-128] Real Time Child Abduction And Detection System
【速读】:该论文旨在解决儿童绑架事件在全球范围内对社区构成的重大威胁问题,尤其是通过实时检测潜在的儿童绑架场景来提升儿童安全。其解决方案的关键在于构建一个基于边缘计算的多智能体(Multi-Agent)检测与警报系统,每个智能体均部署在树莓派(Raspberry Pi)上并集成视觉-语言模型(Vision-Language Models, VLMs),以实现对复杂环境中儿童行为的高精度识别与语义理解。该架构通过多智能体协同处理视频流数据,在本地完成推理任务,显著降低延迟并增强隐私保护;同时结合Twilio API实现实时短信和WhatsApp通知,确保异常情况能被快速响应。实验表明,该系统具备接近实时的性能和高检测准确率,相较传统单模型方法具有更强的环境适应性和鲁棒性,且边缘部署方式提升了可扩展性和成本效益,适用于广泛场景的实际部署。
链接: https://arxiv.org/abs/2508.11690
作者: Tadisetty Sai Yashwanth,Yangalasetty Sruthi Royal,Vankayala Rajeshwari Shreya,Mayank Kashyap,Divyaprabha K N
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:Child safety continues to be a paramount concern worldwide, with child abduction posing significant threats to communities. This paper presents the development of an edge-based child abduction detection and alert system utilizing a multi-agent framework where each agent incorporates Vision-Language Models (VLMs) deployed on a Raspberry Pi. Leveraging the advanced capabilities of VLMs within individual agents of a multi-agent team, our system is trained to accurately detect and interpret complex interactions involving children in various environments in real-time. The multi-agent system is deployed on a Raspberry Pi connected to a webcam, forming an edge device capable of processing video feeds, thereby reducing latency and enhancing privacy. An integrated alert system utilizes the Twilio API to send immediate SMS and WhatsApp notifications, including calls and messages, when a potential child abduction event is detected. Experimental results demonstrate that the system achieves high accuracy in detecting potential abduction scenarios, with near real-time performance suitable for practical deployment. The multi-agent architecture enhances the system’s ability to process complex situational data, improving detection capabilities over traditional single-model approaches. The edge deployment ensures scalability and cost-effectiveness, making it accessible for widespread use. The proposed system offers a proactive solution to enhance child safety through continuous monitoring and rapid alerting, contributing a valuable tool in efforts to prevent child abductions.
zh
[AI-129] Adaptive Spiking with Plasticity for Energy Aware Neuromorphic Systems
【速读】:该论文旨在解决资源受限设备(如可穿戴设备)在始终在线应用场景下,如何实现超低功耗且保持智能计算能力的问题。其核心挑战在于神经形态计算系统中能量消耗与脉冲活动(spiking activity)密切相关,减少脉冲数量是控制能耗的关键策略。解决方案之关键在于提出ASPEN技术,通过在训练阶段引入神经元阈值的随机扰动,不仅增强了网络对不同阈值的鲁棒性(可在推理时动态调节),还起到了正则化作用,从而降低脉冲频率、提升泛化能力并实现无需复杂重训练或剪枝的能量调控。该方法以轻量级、可扩展的方式自适应调整神经元内在参数,显著降低尖峰计数和能耗,同时维持与当前先进方法相当的精度。
链接: https://arxiv.org/abs/2508.11689
作者: Eduardo Calle-Ortiz,Hui Guan,Deepak Ganesan,Phuc Nguyen
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注: 14 pages
Abstract:This paper presents ASPEN, a novel energy-aware technique for neuromorphic systems that could unleash the future of intelligent, always-on, ultra-low-power, and low-burden wearables. Our main research objectives are to explore the feasibility of neuromorphic computing for wearables, identify open research directions, and demonstrate the feasibility of developing an adaptive spiking technique for energy-aware computation, which can be game-changing for resource-constrained devices in always-on applications. As neuromorphic computing systems operate based on spike events, their energy consumption is closely related to spiking activity, i.e., each spike incurs computational and power costs; consequently, minimizing the number of spikes is a critical strategy for operating under constrained energy budgets. To support this goal, ASPEN utilizes stochastic perturbations to the neuronal threshold during training to not only enhance the network’s robustness across varying thresholds, which can be controlled at inference time, but also act as a regularizer that improves generalization, reduces spiking activity, and enables energy control without the need for complex retraining or pruning. More specifically, ASPEN adaptively adjusts intrinsic neuronal parameters as a lightweight and scalable technique for dynamic energy control without reconfiguring the entire model. Our evaluation on neuromorphic emulator and hardware shows that ASPEN significantly reduces spike counts and energy consumption while maintaining accuracy comparable to state-of-the-art methods.
zh
[AI-130] Future progress in artificial intelligence: A survey of expert opinion
【速读】:该论文旨在解决关于高阶机器智能(high-level machine intelligence)和超级智能AI(superintelligent AI)发展时间框架及其潜在风险的专家观点分布不明确的问题。其解决方案的关键在于设计了一份简短问卷,并在2012–2013年间向四类专家群体发放,从而量化专家对高阶机器智能在未来几十年内出现的概率估计、相关风险类型及演化速度的认知,最终得出中位数预测:2040–2050年有50%概率实现高阶机器智能,到2075年概率升至90%,此后不到30年可能迈向超级智能,且约三分之一专家认为这一进程可能对人类产生“负面”或“极其负面”的影响。
链接: https://arxiv.org/abs/2508.11681
作者: Vincent C. Müller,Nick Bostrom
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI)
备注:
Abstract:There is, in some quarters, concern about high-level machine intelligence and superintelligent AI coming up in a few decades, bringing with it significant risks for humanity. In other quarters, these issues are ignored or considered science fiction. We wanted to clarify what the distribution of opinions actually is, what probability the best experts currently assign to high-level machine intelligence coming up within a particular time-frame, which risks they see with that development, and how fast they see these developing. We thus designed a brief questionnaire and distributed it to four groups of experts in 2012/2013. The median estimate of respondents was for a one in two chance that high-level machine intelligence will be developed around 2040-2050, rising to a nine in ten chance by 2075. Experts expect that systems will move on to superintelligence in less than 30 years thereafter. They estimate the chance is about one in three that this development turns out to be ‘bad’ or ‘extremely bad’ for humanity.
zh
[AI-131] Comparative Analysis of Time Series Foundation Models for Demographic Forecasting: Enhancing Predictive Accuracy in US Population Dynamics
【速读】:该论文旨在解决传统时间序列预测模型在应对美国人口结构变化(如不同族裔群体的迁移、出生率和死亡率波动等)时精度不足的问题,尤其是在少数族裔群体历史数据稀疏的情况下难以有效建模。其解决方案的关键在于引入预训练的时间序列基础模型(Time Series Foundation Model, TimesFM),该模型通过大规模通用时间序列数据进行预训练,具备强大的泛化能力,在无需大量任务特定微调的前提下,显著提升了对多地区、多群体人口动态的预测准确性,尤其在低样本量的少数族裔群体中表现突出,从而为政策制定者提供更可靠的人口趋势分析工具。
链接: https://arxiv.org/abs/2508.11680
作者: Aditya Akella,Jonathan Farah
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注: 6 pages, 4 figures, 3 tables
Abstract:Demographic shifts, influenced by globalization, economic conditions, geopolitical events, and environmental factors, pose significant challenges for policymakers and researchers. Accurate demographic forecasting is essential for informed decision-making in areas such as urban planning, healthcare, and economic policy. This study explores the application of time series foundation models to predict demographic changes in the United States using datasets from the U.S. Census Bureau and Federal Reserve Economic Data (FRED). We evaluate the performance of the Time Series Foundation Model (TimesFM) against traditional baselines including Long Short-Term Memory (LSTM) networks, Autoregressive Integrated Moving Average (ARIMA), and Linear Regression. Our experiments across six demographically diverse states demonstrate that TimesFM achieves the lowest Mean Squared Error (MSE) in 86.67% of test cases, with particularly strong performance on minority populations with sparse historical data. These findings highlight the potential of pre-trained foundation models to enhance demographic analysis and inform proactive policy interventions without requiring extensive task-specific fine-tuning.
zh
[AI-132] Lifelong Learner: Discovering Versatile Neural Solvers for Vehicle Routing Problems
【速读】:该论文旨在解决当前深度学习方法在求解车辆路径问题(Vehicle Routing Problem, VRP)时泛化能力不足的问题,即大多数神经求解器在训练过程中局限于单一场景(如使用欧氏距离简化节点间关系、固定问题规模),导致其在不同实际应用场景中难以直接应用。解决方案的关键在于提出一种新颖的终身学习框架(Lifelong Learning Framework),其中包含一个基于Transformer架构的终身学习器(Lifelong Learner, LL),通过引入跨情境自注意力机制(inter-context self-attention mechanism)实现从先前VRP任务中学得的知识向后续任务的有效迁移;同时设计了一个动态情境调度器(Dynamic Context Scheduler, DCS),利用跨情境经验回放(cross-context experience replay)机制增强LL对历史策略的记忆与复用能力,从而显著提升模型在多样化VRP情境下的适应性和性能表现。
链接: https://arxiv.org/abs/2508.11679
作者: Shaodi Feng,Zhuoyi Lin,Jianan Zhou,Cong Zhang,Jingwen Li,Kuan-Wen Chen,Senthilnath Jayavelu,Yew-Soon Ong
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Robotics (cs.RO)
备注:
Abstract:Deep learning has been extensively explored to solve vehicle routing problems (VRPs), which yields a range of data-driven neural solvers with promising outcomes. However, most neural solvers are trained to tackle VRP instances in a relatively monotonous context, e.g., simplifying VRPs by using Euclidean distance between nodes and adhering to a single problem size, which harms their off-the-shelf application in different scenarios. To enhance their versatility, this paper presents a novel lifelong learning framework that incrementally trains a neural solver to manage VRPs in distinct contexts. Specifically, we propose a lifelong learner (LL), exploiting a Transformer network as the backbone, to solve a series of VRPs. The inter-context self-attention mechanism is proposed within LL to transfer the knowledge obtained from solving preceding VRPs into the succeeding ones. On top of that, we develop a dynamic context scheduler (DCS), employing the cross-context experience replay to further facilitate LL looking back on the attained policies of solving preceding VRPs. Extensive results on synthetic and benchmark instances (problem sizes up to 18k) show that our LL is capable of discovering effective policies for tackling generic VRPs in varying contexts, which outperforms other neural solvers and achieves the best performance for most VRPs.
zh
[AI-133] Learning Internal Biological Neuron Parameters and Complexity-Based Encoding for Improved Spiking Neural Networks Performance
【速读】:该论文旨在解决传统脉冲神经网络(Spiking Neural Networks, SNNs)在分类精度和可解释性方面的局限性,尤其是如何通过改进神经元模型和引入新的特征表示方法来提升性能。其解决方案的关键在于:首先,用一种生物启发的概率元神经元(probabilistic meta neuron)替代传统的感知机神经元模型,并联合学习内部参数,从而增强SNN的表达能力;其次,提出一种融合Lempel-Ziv复杂度(LZC)的新分类框架,利用LZC对结构规律性的捕捉能力与SNN的时间精度和生物合理性相结合,实现对时空神经数据的高效且可解释的分类。实验表明,该方法在不同训练策略下可使分类效率提升最高达11.00%,验证了额外学习神经元参数相较于仅优化权重输入的优势。
链接: https://arxiv.org/abs/2508.11674
作者: Zofia Rudnicka,Janusz Szczepanski,Agnieszka Pregowska
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
备注:
Abstract:This study introduces a novel approach by replacing the traditional perceptron neuron model with a biologically inspired probabilistic meta neuron, where the internal neuron parameters are jointly learned, leading to improved classification accuracy of spiking neural networks (SNNs). To validate this innovation, we implement and compare two SNN architectures: one based on standard leaky integrate-and-fire (LIF) neurons and another utilizing the proposed probabilistic meta neuron model. As a second key contribution, we present a new biologically inspired classification framework that uniquely integrates SNNs with Lempel-Ziv complexity (LZC) a measure closely related to entropy rate. By combining the temporal precision and biological plausibility of SNNs with the capacity of LZC to capture structural regularity, the proposed approach enables efficient and interpretable classification of spatiotemporal neural data, an aspect not addressed in existing works. We consider learning algorithms such as backpropagation, spike-timing-dependent plasticity (STDP), and the Tempotron learning rule. To explore neural dynamics, we use Poisson processes to model neuronal spike trains, a well-established method for simulating the stochastic firing behavior of biological neurons. Our results reveal that depending on the training method, the classifier’s efficiency can improve by up to 11.00%, highlighting the advantage of learning additional neuron parameters beyond the traditional focus on weighted inputs alone.
zh
[AI-134] LLM -Based Intelligent Agents for Music Recommendation: A Comparison with Classical Content-Based Filtering
【速读】:该论文旨在解决音乐流媒体平台中因信息过载导致的用户体验下降问题。其解决方案的关键在于引入基于Gemini和LLaMA系列的大语言模型(Large Language Models, LLMs)与智能代理(intelligent agents)构建多智能体个性化音乐推荐系统,相较于传统基于内容的推荐模型,在用户满意度、新颖性及计算效率方面均展现出显著优势,其中LLMs实现的满意度最高达89.32%,验证了其在音乐推荐场景中的潜力。
链接: https://arxiv.org/abs/2508.11671
作者: Ronald Carvalho Boadana,Ademir Guimarães da Costa Junior,Ricardo Rios,Fábio Santos da Silva
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multiagent Systems (cs.MA)
备注: 12 pages, in Portuguese language, 2 figures, 5 tables, 3 formulas. To be published in the Proceedings of the Encontro Nacional de Inteligência Artificial e Computacional (ENIAC 2025)
Abstract:The growing availability of music on streaming platforms has led to information overload for users. To address this issue and enhance the user experience, increasingly sophisticated recommendation systems have been proposed. This work investigates the use of Large Language Models (LLMs) from the Gemini and LLaMA families, combined with intelligent agents, in a multi-agent personalized music recommendation system. The results are compared with a traditional content-based recommendation model, considering user satisfaction, novelty, and computational efficiency. LLMs achieved satisfaction rates of up to \textit89,32%, indicating their promising potential in music recommendation systems.
zh
[AI-135] RRRA: Resampling and Reranking through a Retriever Adapter AAAI2026
【速读】:该论文旨在解决密集检索(dense retrieval)中因错误选择负样本(false negatives)而导致模型训练效果受限的问题。现有方法多依赖于基于正样本得分的启发式策略来筛选难负样本(hard negatives),但这类全局、与实例无关的方法常遗漏特定查询下的虚假负样本,从而影响模型性能。解决方案的关键在于提出一个可学习的适配器模块(adapter module),该模块通过监控双编码器(Bi-Encoder)表示来动态、上下文感知地估计某难负样本为虚假负样本的概率,并将此概率用于两个下游组件:一是训练阶段的重采样(resampling),对负样本进行重新加权;二是推理阶段的重排序(reranking),对Top-k召回文档进行重排。实验证明,该适配器增强框架在标准基准上持续优于强基线模型,验证了显式建模虚假负样本的有效性。
链接: https://arxiv.org/abs/2508.11670
作者: Bongsu Kim
机构: 未知
类目: Information Retrieval (cs.IR); Artificial Intelligence (cs.AI)
备注: 8 pages, 4 figures, submitted to AAAI 2026
Abstract:In dense retrieval, effective training hinges on selecting high quality hard negatives while avoiding false negatives. Recent methods apply heuristics based on positive document scores to identify hard negatives, improving both performance and interpretability. However, these global, example agnostic strategies often miss instance specific false negatives. To address this, we propose a learnable adapter module that monitors Bi-Encoder representations to estimate the likelihood that a hard negative is actually a false negative. This probability is modeled dynamically and contextually, enabling fine-grained, query specific judgments. The predicted scores are used in two downstream components: (1) resampling, where negatives are reweighted during training, and (2) reranking, where top-k retrieved documents are reordered at inference. Empirical results on standard benchmarks show that our adapter-enhanced framework consistently outperforms strong Bi-Encoder baselines, underscoring the benefit of explicit false negative modeling in dense retrieval.
zh
[AI-136] Collaborative Learning-Enhanced Lightweight Models for Predicting Arterial Blood Pressure Waveform in a Large-scale Perioperative Dataset
【速读】:该论文旨在解决嵌入式系统中非侵入式动脉血压(ABP)监测的实时性与模型轻量化难题,特别是在复杂临床环境中实现高精度、低延迟的ABP波形重建。其关键解决方案是提出一种参数仅0.89百万、计算负载为0.02 GFLOPS的轻量级网络sInvResUNet,并结合知识蒸馏协同学习策略(KDCL_sInvResUNet),在保证性能的同时显著降低推理开销,实现在嵌入式设备上每10秒输出仅需8.49毫秒的高效推断。该方案在大规模异质围术期数据集上验证了鲁棒性,平均绝对误差为10.06 mmHg,皮尔逊相关系数达0.88,为真实世界中的无创连续ABP监测提供了可部署的基础架构。
链接: https://arxiv.org/abs/2508.11669
作者: Wentao Li,Yonghu He,Kun Gao,Qing Liu,Yali Zheng
机构: 未知
类目: Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
备注:
Abstract:Noninvasive arterial blood pressure (ABP) monitoring is essential for patient management in critical care and perioperative settings, providing continuous assessment of cardiovascular hemodynamics with minimal risks. Numerous deep learning models have developed to reconstruct ABP waveform from noninvasively acquired physiological signals such as electrocardiogram and photoplethysmogram. However, limited research has addressed the issue of model performance and computational load for deployment on embedded systems. The study introduces a lightweight sInvResUNet, along with a collaborative learning scheme named KDCL_sInvResUNet. With only 0.89 million parameters and a computational load of 0.02 GFLOPS, real-time ABP estimation was successfully achieved on embedded devices with an inference time of just 8.49 milliseconds for a 10-second output. We performed subject-independent validation in a large-scale and heterogeneous perioperative dataset containing 1,257,141 data segments from 2,154 patients, with a wide BP range (41-257 mmHg for SBP, and 31-234 mmHg for DBP). The proposed KDCL_sInvResUNet achieved lightly better performance compared to large models, with a mean absolute error of 10.06 mmHg and mean Pearson correlation of 0.88 in tracking ABP changes. Despite these promising results, all deep learning models showed significant performance variations across different demographic and cardiovascular conditions, highlighting their limited ability to generalize across such a broad and diverse population. This study lays a foundation work for real-time, unobtrusive ABP monitoring in real-world perioperative settings, providing baseline for future advancements in this area.
zh
[AI-137] Generative AI in Training and Coaching: Redefining the Design Process of Learning Materials
【速读】:该论文试图解决的问题是:生成式 AI (Generative AI) 在教育场景中如何影响学习材料的设计流程,以及其对教学效率、教学法质量与培训师/教练角色转变的具体影响。解决方案的关键在于通过质性访谈识别出四个层面的实施路径——个体、组织、系统和战略层面——明确培训师与教练从内容创作者向引导者与内容审核者的角色转型,并强调在提升效率的同时需培养新技能以应对AI工具带来的挑战,同时关注AI拟人化对用户信任与期望的影响,从而实现GenAI工具的有效整合与可持续应用。
链接: https://arxiv.org/abs/2508.11662
作者: Alexander Komar,Marc-André Heidelmann,Kristina Schaaff
机构: 未知
类目: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC)
备注:
Abstract:Generative artificial intelligence (GenAI) is transforming education, redefining the role of trainers and coaches in learning environments. In our study, we explore how AI integrates into the design process of learning materials, assessing its impact on efficiency, pedagogical quality, and the evolving role of human trainers and coaches. Through qualitative interviews with professionals in education and corporate training, we identify the following key topics: trainers and coaches increasingly act as facilitators and content moderators rather than primary creators, efficiency gains allow for a stronger strategic focus but at the same time the new tools require new skills. Additionally, we analyze how the anthropomorphism of AI shapes user trust and expectations. From these insights, we derive how tools based on GenAI can successfully be implemented for trainers and coaches on an individual, organizational, systemic, and strategic level.
zh
[AI-138] oward Practical Equilibrium Propagation: Brain-inspired Recurrent Neural Network with Feedback Regulation and Residual Connections
【速读】:该论文旨在解决平衡传播(Equilibrium Propagation, EP)在实际应用中面临的不稳定性与计算成本过高问题,以及深度循环神经网络(Recurrent Neural Networks, RNNs)中因反馈路径较弱导致的梯度消失问题。解决方案的关键在于提出一种受大脑结构和动力学启发的反馈调节残差递归神经网络(Feedback-regulated REsidual recurrent neural network, FRE-RNN):通过反馈调节机制降低谱半径(spectral radius),显著提升收敛速度,从而将EP的计算开销和训练时间减少数个数量级;同时引入具有脑启发拓扑结构的残差连接,有效缓解深层RNN中因弱反馈路径引发的梯度消失问题,使EP在基准任务上的性能达到与反向传播(Backpropagation, BP)相当的水平。
链接: https://arxiv.org/abs/2508.11659
作者: Zhuo Liu,Tao Chen
机构: 未知
类目: Neural and Evolutionary Computing (cs.NE); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
备注:
Abstract:Brain-like intelligent systems need brain-like learning methods. Equilibrium Propagation (EP) is a biologically plausible learning framework with strong potential for brain-inspired computing hardware. However, existing im-plementations of EP suffer from instability and prohibi-tively high computational costs. Inspired by the structure and dynamics of the brain, we propose a biologically plau-sible Feedback-regulated REsidual recurrent neural network (FRE-RNN) and study its learning performance in EP framework. Feedback regulation enables rapid convergence by reducing the spectral radius. The improvement in con-vergence property reduces the computational cost and train-ing time of EP by orders of magnitude, delivering perfor-mance on par with backpropagation (BP) in benchmark tasks. Meanwhile, residual connections with brain-inspired topologies help alleviate the vanishing gradient problem that arises when feedback pathways are weak in deep RNNs. Our approach substantially enhances the applicabil-ity and practicality of EP in large-scale networks that un-derpin artificial intelligence. The techniques developed here also offer guidance to implementing in-situ learning in physical neural networks.
zh
[AI-139] Categorical Construction of Logically Verifiable Neural Architectures
【速读】:该论文旨在解决神经网络在推理过程中难以保证逻辑一致性的问题,即尽管其在模式识别方面表现优异,但在实际应用中常违反基本逻辑原则,导致不可靠的推理结果。解决方案的关键在于提出一种范畴论(category theory)框架,将逻辑理论建模为称为 Lawvere 理论的代数结构,并通过参数化映射的 2-范畴中的范畴代数将其转化为具有可证明逻辑保障的神经架构。与传统方法不同,该方案不是在训练阶段施加逻辑约束,而是将逻辑原理直接嵌入网络的拓扑结构中,从而从数学上杜绝逻辑违规的可能性。这一方法实现了逻辑理论与神经网络之间的双射对应关系,首次将范畴深度学习从几何对称性扩展至语义约束层面,为可信人工智能系统提供了严格的数学基础。
链接: https://arxiv.org/abs/2508.11647
作者: Logan Nye
机构: 未知
类目: Logic in Computer Science (cs.LO); Artificial Intelligence (cs.AI)
备注:
Abstract:Neural networks excel at pattern recognition but struggle with reliable logical reasoning, often violating basic logical principles during inference. We address this limitation by developing a categorical framework that systematically constructs neural architectures with provable logical guarantees. Our approach treats logical theories as algebraic structures called Lawvere theories, which we transform into neural networks using categorical algebra in the 2-category of parametric maps. Unlike existing methods that impose logical constraints during training, our categorical construction embeds logical principles directly into the network’s architectural structure, making logical violations mathematically impossible. We demonstrate this framework by constructing differentiable neural architectures for propositional logic that preserve boolean reasoning while remaining trainable via gradient descent. Our main theoretical result establishes a bijective correspondence between finitary logical theories and neural architectures, proving that every logically constrained network arises uniquely from our construction. This extends Categorical Deep Learning beyond geometric symmetries to semantic constraints, enabling automatic derivation of verified architectures from logical specifications. The framework provides mathematical foundations for trustworthy AI systems, with applications to theorem proving, formal verification, and safety-critical reasoning tasks requiring verifiable logical behavior.
zh
[AI-140] From Transthoracic to Transesophageal: Cross-Modality Generation using LoRA Diffusion MICCAI2025
【速读】:该论文旨在解决生成式 AI(Generative AI)在数据稀缺领域如经食管超声心动图(TEE)中应用受限的问题,尤其针对现有基于扩散模型的图像合成方法依赖大规模训练数据的瓶颈。其关键解决方案在于:利用已训练于经胸超声心动图(TTE)数据的掩码条件扩散模型(mask-conditioned diffusion backbone),通过仅使用少量新样本(<200帧)和极小参数量(低至10⁵)的适配器进行微调,结合低秩适应(Low-Rank Adaptation)与轻量级映射层MaskR²——后者可将未见过的掩码格式转换为预训练模型兼容的条件通道表示,从而实现对新解剖结构集合的有效迁移。实验表明,仅适配MLP层即可实现高质量TEE图像合成,并且混合少量真实TEE图像与合成数据能显著提升多类心腔分割任务中的Dice分数,尤其改善右心结构等罕见目标的识别性能。
链接: https://arxiv.org/abs/2508.13077
作者: Emmanuel Oladokun,Yuxuan Ou,Anna Novikova,Daria Kulikova,Sarina Thomas,Jurica Šprem,Vicente Grau
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注: MICCAI 2025; ASMUS
Abstract:Deep diffusion models excel at realistic image synthesis but demand large training sets-an obstacle in data-scarce domains like transesophageal echocardiography (TEE). While synthetic augmentation has boosted performance in transthoracic echo (TTE), TEE remains critically underrepresented, limiting the reach of deep learning in this high-impact modality. We address this gap by adapting a TTE-trained, mask-conditioned diffusion backbone to TEE with only a limited number of new cases and adapters as small as 10^5 parameters. Our pipeline combines Low-Rank Adaptation with MaskR ^2 , a lightweight remapping layer that aligns novel mask formats with the pretrained model’s conditioning channels. This design lets users adapt models to new datasets with a different set of anatomical structures to the base model’s original set. Through a targeted adaptation strategy, we find that adapting only MLP layers suffices for high-fidelity TEE synthesis. Finally, mixing less than 200 real TEE frames with our synthetic echoes improves the dice score on a multiclass segmentation task, particularly boosting performance on underrepresented right-heart structures. Our results demonstrate that (1) semantically controlled TEE images can be generated with low overhead, (2) MaskR ^2 effectively transforms unseen mask formats into compatible formats without damaging downstream task performance, and (3) our method generates images that are effective for improving performance on a downstream task of multiclass segmentation. Comments: MICCAI 2025; ASMUS Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI) Cite as: arXiv:2508.13077 [eess.IV] (or arXiv:2508.13077v1 [eess.IV] for this version) https://doi.org/10.48550/arXiv.2508.13077 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
zh
[AI-141] Learning local and global prototypes with optimal transport for unsupervised anomaly detection and localization
【速读】:该论文旨在解决无监督异常检测(Unsupervised Anomaly Detection, UAD)中如何有效建模正常样本的结构特性以提升异常识别精度的问题。其核心挑战在于,仅使用正常样本进行训练时,模型需自动学习到数据的内在组织规律,从而在测试阶段识别出偏离该结构的异常区域。解决方案的关键在于引入基于原型学习(prototype learning)的方法,并设计一种融合特征距离与空间位置信息的度量指标,通过最优传输(optimal transport)从预训练图像编码器提取的潜在表示中学习局部和全局原型。该机制能够强制施加结构约束,使原型更好地捕捉正常样本的空间与语义组织,从而增强对图像中不一致性(incoherencies)的敏感性,最终在工业图像异常检测基准上达到与强基线相当的性能。
链接: https://arxiv.org/abs/2508.12927
作者: Robin Trombetta,Carole Lartizien
机构: 未知
类目: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI)
备注:
Abstract:Unsupervised anomaly detection aims to detect defective parts of a sample by having access, during training, to a set of normal, i.e. defect-free, data. It has many applications in fields, such as industrial inspection or medical imaging, where acquiring labels is costly or when we want to avoid introducing biases in the type of anomalies that can be spotted. In this work, we propose a novel UAD method based on prototype learning and introduce a metric to compare a structured set of embeddings that balances a feature-based cost and a spatial-based cost. We leverage this metric to learn local and global prototypes with optimal transport from latent representations extracted with a pre-trained image encoder. We demonstrate that our approach can enforce a structural constraint when learning the prototypes, allowing to capture the underlying organization of the normal samples, thus improving the detection of incoherencies in images. Our model achieves performance that is on par with strong baselines on two reference benchmarks for anomaly detection on industrial images. The code is available at this https URL.
zh
[AI-142] A Unified Cortical Circuit Model with Divisive Normalization and Self-Excitation for Robust Representation and Memory Maintenance
【速读】:该论文旨在解决当前神经计算模型中缺乏统一框架的问题,即如何在同一神经回路中同时实现对噪声的鲁棒编码(noise-robust encoding)与信息的稳定维持(stable information maintenance),而这二者在现有模型中通常由独立机制分别处理。解决方案的关键在于提出一个包含 divisive normalization(除法归一化)与 self-excitation(自激发)相结合的递归神经电路,该电路在适当参数条件下可形成连续吸引子(continuous attractor),从而在刺激呈现期间实现输入比例化的稳定化,并在刺激消失后维持自持的记忆状态,为工作记忆与近似贝叶斯推理提供统一的数学基础。
链接: https://arxiv.org/abs/2508.12702
作者: Jie Su,Weiwei Wang,Zhaotian Gu,Dahui Wang,Tianyi Qian
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE)
备注: 15 pages, 4 figures
Abstract:Robust information representation and its persistent maintenance are fundamental for higher cognitive functions. Existing models employ distinct neural mechanisms to separately address noise-resistant processing or information maintenance, yet a unified framework integrating both operations remains elusive – a critical gap in understanding cortical computation. Here, we introduce a recurrent neural circuit that combines divisive normalization with self-excitation to achieve both robust encoding and stable retention of normalized inputs. Mathematical analysis shows that, for suitable parameter regimes, the system forms a continuous attractor with two key properties: (1) input-proportional stabilization during stimulus presentation; and (2) self-sustained memory states persisting after stimulus offset. We demonstrate the model’s versatility in two canonical tasks: (a) noise-robust encoding in a random-dot kinematogram (RDK) paradigm; and (b) approximate Bayesian belief updating in a probabilistic Wisconsin Card Sorting Test (pWCST). This work establishes a unified mathematical framework that bridges noise suppression, working memory, and approximate Bayesian inference within a single cortical microcircuit, offering fresh insights into the brain’s canonical computation and guiding the design of biologically plausible artificial neural architectures.
zh
[AI-143] A Generalized Genetic Random Field Method for the Genetic Association Analysis of Sequencing Data
【速读】:该论文旨在解决高维测序数据在复杂人类疾病关联分析中的统计挑战,尤其是如何有效检测罕见遗传变异(rare genetic variants)对疾病的影响。传统方法常需设定阈值来筛选变异,且难以处理多变异协同作用方向和效应大小不同的情况。其解决方案的关键在于提出一种广义遗传随机场(Generalized Genetic Random Field, GGRF)方法:该方法基于广义估计方程(Generalized Estimating Equation, GEE)框架,无需预设罕见变异的阈值,可同时检验多个变异在不同方向和效应强度下的联合影响,并适用于定量和二元表型;此外,GGRF具有良好的渐近性质,可在小样本测序数据中直接应用而无需进行小样本校正,从而提升了检测效能,尤其在罕见变异主导疾病病因的情境下优于常用方法SKAT。
链接: https://arxiv.org/abs/2508.12617
作者: Ming Li,Zihuai He,Min Zhang,Xiaowei Zhan,Changshuai Wei,Robert C Elston,Qing Lu
机构: 未知
类目: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:With the advance of high-throughput sequencing technologies, it has become feasible to investigate the influence of the entire spectrum of sequencing variations on complex human diseases. Although association studies utilizing the new sequencing technologies hold great promise to unravel novel genetic variants, especially rare genetic variants that contribute to human diseases, the statistical analysis of high-dimensional sequencing data remains a challenge. Advanced analytical methods are in great need to facilitate high-dimensional sequencing data analyses. In this article, we propose a generalized genetic random field (GGRF) method for association analyses of sequencing data. Like other similarity-based methods (e.g., SIMreg and SKAT), the new method has the advantages of avoiding the need to specify thresholds for rare variants and allowing for testing multiple variants acting in different directions and magnitude of effects. The method is built on the generalized estimating equation framework and thus accommodates a variety of disease phenotypes (e.g., quantitative and binary phenotypes). Moreover, it has a nice asymptotic property, and can be applied to small-scale sequencing data without need for small-sample adjustment. Through simulations, we demonstrate that the proposed GGRF attains an improved or comparable power over a commonly used method, SKAT, under various disease scenarios, especially when rare variants play a significant role in disease etiology. We further illustrate GGRF with an application to a real dataset from the Dallas Heart Study. By using GGRF, we were able to detect the association of two candidate genes, ANGPTL3 and ANGPTL4, with serum triglyceride.
zh
[AI-144] An Introduction to Sliced Optimal Transport
【速读】:该论文旨在解决经典最优传输(Optimal Transport, OT)在高维空间中计算复杂度高、难以规模化应用的问题。其核心解决方案是基于切片最优传输(Sliced Optimal Transport, SOT),通过将高维概率测度投影到一维子空间上,利用一维OT的高效可计算性来近似高维OT距离、巴氏中心(barycenters)、核函数及嵌入等结构。关键在于结合积分几何中的Radon变换实现有效投影,辅以统计估计与蒙特卡洛近似方法提升计算效率,并拓展至非线性投影、加权切片、不平衡/部分/多边际等复杂场景,从而在保持几何结构的同时显著降低计算成本,使其适用于机器学习、计算机视觉等多个领域的实际需求。
链接: https://arxiv.org/abs/2508.12519
作者: Khai Nguyen
机构: 未知
类目: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Computation (stat.CO); Methodology (stat.ME)
备注: 227 pages
Abstract:Sliced Optimal Transport (SOT) is a rapidly developing branch of optimal transport (OT) that exploits the tractability of one-dimensional OT problems. By combining tools from OT, integral geometry, and computational statistics, SOT enables fast and scalable computation of distances, barycenters, and kernels for probability measures, while retaining rich geometric structure. This paper provides a comprehensive review of SOT, covering its mathematical foundations, methodological advances, computational methods, and applications. We discuss key concepts of OT and one-dimensional OT, the role of tools from integral geometry such as Radon transform in projecting measures, and statistical techniques for estimating sliced distances. The paper further explores recent methodological advances, including non-linear projections, improved Monte Carlo approximations, statistical estimation techniques for one-dimensional optimal transport, weighted slicing techniques, and transportation plan estimation methods. Variational problems, such as minimum sliced Wasserstein estimation, barycenters, gradient flows, kernel constructions, and embeddings are examined alongside extensions to unbalanced, partial, multi-marginal, and Gromov-Wasserstein settings. Applications span machine learning, statistics, computer graphics and computer visions, highlighting SOT’s versatility as a practical computational tool. This work will be of interest to researchers and practitioners in machine learning, data sciences, and computational disciplines seeking efficient alternatives to classical OT.
zh
[AI-145] EXOTIC: An Exact Optimistic Tree-Based Algorithm for Min-Max Optimization
【速读】:该论文旨在解决凸-非凹(convex-non-concave)和非凸-凹(non-convex-concave)min-max优化问题中,传统基于梯度的方法难以找到全局最优解的问题。其核心挑战在于,此类问题的局部极值可能与全局最优解存在显著偏差,且缺乏有效的全局优化算法框架。解决方案的关键在于提出一个算法框架:首先通过变量重参数化将原问题转化为一种广义的极大极小形式(可视为Sion最小最大定理的推广),从而构建出适合全局搜索的形式;随后设计EXOTIC算法——该算法结合迭代凸优化求解器处理内层最小化,并利用分层树搜索策略对高层最大化进行乐观区域选择,基于内层近似解动态剪枝并聚焦于高潜力搜索空间。理论分析表明,EXOTIC的次优性间隙(optimality gap)受内层求解器调用次数、收敛速率及问题相关参数的控制,从而为非凸-凹和凸-非凹min-max优化提供了首个具有理论保障的全局最优算法。
链接: https://arxiv.org/abs/2508.12479
作者: Chinmay Maheshwari,Chinmay Pimpalkhare,Debasish Chatterjee
机构: 未知
类目: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Computer Science and Game Theory (cs.GT); Multiagent Systems (cs.MA); General Economics (econ.GN)
备注: 31 pages, 2 figures, 3 tables
Abstract:Min-max optimization arises in many domains such as game theory, adversarial machine learning, etc., with gradient-based methods as a typical computational tool. Beyond convex-concave min-max optimization, the solutions found by gradient-based methods may be arbitrarily far from global optima. In this work, we present an algorithmic apparatus for computing globally optimal solutions in convex-non-concave and non-convex-concave min-max optimization. For former, we employ a reformulation that transforms it into a non-concave-convex max-min optimization problem with suitably defined feasible sets and objective function. The new form can be viewed as a generalization of Sion’s minimax theorem. Next, we introduce EXOTIC-an Exact, Optimistic, Tree-based algorithm for solving the reformulated max-min problem. EXOTIC employs an iterative convex optimization solver to (approximately) solve the inner minimization and a hierarchical tree search for the outer maximization to optimistically select promising regions to search based on the approximate solution returned by convex optimization solver. We establish an upper bound on its optimality gap as a function of the number of calls to the inner solver, the solver’s convergence rate, and additional problem-dependent parameters. Both our algorithmic apparatus along with its accompanying theoretical analysis can also be applied for non-convex-concave min-max optimization. In addition, we propose a class of benchmark convex-non-concave min-max problems along with their analytical global solutions, providing a testbed for evaluating algorithms for min-max optimization. Empirically, EXOTIC outperforms gradient-based methods on this benchmark as well as on existing numerical benchmark problems from the literature. Finally, we demonstrate the utility of EXOTIC by computing security strategies in multi-player games with three or more players.
zh
[AI-146] Quantum Flow Matching
【速读】:该论文旨在解决量子系统中密度矩阵的高效制备与采样问题,以及如何在不进行昂贵电路重构的前提下实现复杂量子态的生成和可观测量的精确估计。其解决方案的关键在于提出量子流匹配(Quantum Flow Matching, QFM),这是一种完全基于量子电路的实现方法,能够对两个密度矩阵进行高效插值,并通过系统化的方式生成目标态样本,从而支持对量子系统的可扩展、高精度建模与计算,无需额外的电路重设计即可应用于多种量子物理场景,如调控磁化强度与纠缠熵、验证量子Jarzynski等式及加速超扩散破缺的研究。
链接: https://arxiv.org/abs/2508.12413
作者: Zidong Cui,Pan Zhang,Ying Tang
机构: 未知
类目: Quantum Physics (quant-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 15 pages, 11 figures
Abstract:Flow matching has rapidly become a dominant paradigm in classical generative modeling, offering an efficient way to interpolate between two complex distributions. We extend this idea to the quantum realm and introduce Quantum Flow Matching (QFM)-a fully quantum-circuit realization that offers efficient interpolation between two density matrices. QFM offers systematic preparation of density matrices and generation of samples for accurately estimating observables, and can be realized on a quantum computer without the need for costly circuit redesigns. We validate its versatility on a set of applications: (i) generating target states with prescribed magnetization and entanglement entropy, (ii) estimating nonequilibrium free-energy differences to test the quantum Jarzynski equality, and (iii) expediting the study on superdiffusion breakdown. These results position QFM as a unifying and promising framework for generative modeling across quantum systems.
zh
[AI-147] owards Generalizable Human Activity Recognition: A Survey
【速读】:该论文旨在解决基于惯性测量单元(Inertial Measurement Unit, IMU)的人体活动识别(Human Activity Recognition, HAR)在实际应用中泛化能力不足的问题,尤其是在用户、传感器位置或环境发生域偏移(domain shift)时性能显著下降的挑战。其解决方案的关键在于系统梳理和分类当前主流的两类方法:一是以模型为中心的方法,包括预训练、端到端学习以及大语言模型(Large Language Model, LLM)驱动的学习策略;二是以数据为中心的方法,涵盖多模态学习与数据增强技术。通过整合229篇研究论文和25个公开数据集,该综述不仅总结了现有技术进展,还指出了未来方向,如基础模型(foundation models)的应用、物理信息嵌入与情境感知推理、生成建模及资源高效训练与推理等,从而为提升IMU-HAR系统的通用性和实用性提供理论支撑与实践路径。
链接: https://arxiv.org/abs/2508.12213
作者: Yize Cai,Baoshen Guo,Flora Salim,Zhiqing Hong
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:As a critical component of Wearable AI, IMU-based Human Activity Recognition (HAR) has attracted increasing attention from both academia and industry in recent years. Although HAR performance has improved considerably in specific scenarios, its generalization capability remains a key barrier to widespread real-world adoption. For example, domain shifts caused by variations in users, sensor positions, or environments can significantly decrease the performance in practice. As a result, in this survey, we explore the rapidly evolving field of IMU-based generalizable HAR, reviewing 229 research papers alongside 25 publicly available datasets to provide a broad and insightful overview. We first present the background and overall framework of IMU-based HAR tasks, as well as the generalization-oriented training settings. Then, we categorize representative methodologies from two perspectives: (i) model-centric approaches, including pre-training method, end-to-end method, and large language model (LLM)-based learning method; and (ii) data-centric approaches, including multi-modal learning and data augmentation techniques. In addition, we summarize widely used datasets in this field, as well as relevant tools and benchmarks. Building on these methodological advances, the broad applicability of IMU-based HAR is also reviewed and discussed. Finally, we discuss persistent challenges (e.g., data scarcity, efficient training, and reliable evaluation) and also outline future directions for HAR, including the adoption of foundation and large language models, physics-informed and context-aware reasoning, generative modeling, and resource-efficient training and inference. The complete list of this survey is available at this https URL, which will be updated continuously.
zh
[AI-148] Exploring Multimodal AI Reasoning for Meteorological Forecasting from Skew-T Diagrams
【速读】:该论文旨在解决气象预报中基于大气探空图(Skew-T log-P diagram)的定量降水概率预测问题,传统方法依赖人工视觉推理,效率低且主观性强。解决方案的关键在于构建一个轻量级多模态AI助手,通过微调小型视觉语言模型(VLM)实现对探空图的结构化视觉理解,并结合链式思维(chain-of-thought)推理任务进行降水概率估计。该方法利用课程学习(curriculum learning)框架,先训练模型识别关键大气特征(如湿度、稳定度),再基于视觉定位结果进行逻辑推理,最终在仅使用静态大气剖面的前提下达到与主流数值天气预报(NWP)模型相当的预测性能,且具备良好的可解释性。
链接: https://arxiv.org/abs/2508.12198
作者: ChangJae Lee,Heecheol Yang,Jonghak Choi
机构: 未知
类目: Atmospheric and Oceanic Physics (physics.ao-ph); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 24 pages, 3 figures, 9 tables
Abstract:Forecasting from atmospheric soundings is a fundamental task in operational meteorology, often requiring structured visual reasoning over Skew-T log-P diagrams by human forecasters. While recent advances in Vision-Language Models (VLMs) have shown promise in other scientific domains, their application to meteorological diagram interpretation remains largely unexplored. In this study, we present a lightweight AI assistant that interprets Skew-T diagrams using a small language model (LM) and a small VLM fine-tuned to emulate human forecasters. Using a curriculum learning framework, we first train the models to identify key atmospheric features from diagrams through visual question answering, followed by chain-of-thought reasoning tasks that estimate precipitation probability based on the derived visual groundings. Model inputs include either textual summaries or generated Skew-T diagrams derived from operational Numerical Weather Prediction (NWP) forecasts, paired with three-hour precipitation observations from South Korea’s Auto Weather Stations network. Evaluation results demonstrate that the fine-tuned VLM achieves skill comparable to an operational NWP model, despite relying solely on static atmospheric profiles. Ablation studies reveal that visual grounding and reasoning supervision are critical for performance, while attention map analysis confirms that the model learns to focus on relevant meteorological features. These findings highlight the potential of compact, interpretable multimodal models to support weather forecasting tasks. The approach offers a computationally efficient alternative to large-scale systems, and future work could extend it to more complex applications.
zh
[AI-149] Generalized invariants meet constitutive neural networks: A novel framework for hyperelastic materials
【速读】:该论文旨在解决各向同性不可压缩超弹性材料中,如何从实验数据中自动识别合适的应变能函数及其依赖的不变量(invariant)这一核心挑战。传统方法通常依赖于预先设定的不变量组合或分步拟合流程,难以适应复杂材料行为且缺乏物理可解释性。本文的关键创新在于提出一种数据驱动的统一框架,通过单个神经网络架构同时学习最优不变量集合与对应的应变能函数形式,能够从连续的广义不变量族中灵活筛选出最适配材料响应的结构。该方法在橡胶和脑组织等典型材料上的验证表明,其不仅能恢复经典模型(如拉伸主导型),还能捕捉生物软组织的小变形非线性剪切响应,显著提升了预测精度与物理可解释性。
链接: https://arxiv.org/abs/2508.12063
作者: Denisa Martonová,Alain Goriely,Ellen Kuhl
机构: 未知
类目: oft Condensed Matter (cond-mat.soft); Artificial Intelligence (cs.AI)
备注:
Abstract:The major challenge in determining a hyperelastic model for a given material is the choice of invariants and the selection how the strain energy function depends functionally on these invariants. Here we introduce a new data-driven framework that simultaneously discovers appropriate invariants and constitutive models for isotropic incompressible hyperelastic materials. Our approach identifies both the most suitable invariants in a class of generalized invariants and the corresponding strain energy function directly from experimental observations. Unlike previous methods that rely on fixed invariant choices or sequential fitting procedures, our method integrates the discovery process into a single neural network architecture. By looking at a continuous family of possible invariants, the model can flexibly adapt to different material behaviors. We demonstrate the effectiveness of this approach using popular benchmark datasets for rubber and brain tissue. For rubber, the method recovers a stretch-dominated formulation consistent with classical models. For brain tissue, it identifies a formulation sensitive to small stretches, capturing the nonlinear shear response characteristic of soft biological matter. Compared to traditional and neural-network-based models, our framework provides improved predictive accuracy and interpretability across a wide range of deformation states. This unified strategy offers a robust tool for automated and physically meaningful model discovery in hyperelasticity.
zh
[AI-150] BConformeR: A Conformer Based on Mutual Sampling for Unified Prediction of Continuous and Discontinuous Antibody Binding Sites AAAI
【速读】:该论文旨在解决抗原上构象表位(conformational epitopes)预测准确率低的问题,这是当前基于计算的方法在疫苗设计、抗体开发及免疫机制研究中的一大瓶颈。其解决方案的关键在于提出一种基于构象(conformer)的深度学习模型,利用卷积神经网络(CNN)提取局部特征以增强线性表位(linear epitopes)的预测能力,同时引入Transformer模块捕捉抗原序列中的长程依赖关系,从而显著提升对构象表位的识别性能。实验表明,该模型在PCC、ROC-AUC、PR-AUC和F1分数等指标上均优于现有基线方法。
链接: https://arxiv.org/abs/2508.12029
作者: Zhangyu You,Jiahao Ma,Hongzong Li,Ye-Fan Hu,Jian-Dong Huang
机构: 未知
类目: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
备注: 16 pages, 7 figures, 5 tables, submitted to AAAI conference 2026
Abstract:Accurate prediction of antibody-binding sites (epitopes) on antigens is crucial for vaccine design, immunodiagnostics, therapeutic antibody development, antibody engineering, research into autoimmune and allergic diseases, and for advancing our understanding of immune responses. Despite in silico methods that have been proposed to predict both linear (continuous) and conformational (discontinuous) epitopes, they consistently underperform in predicting conformational epitopes. In this work, we propose a conformer-based model trained on antigen sequences derived from 1,080 antigen-antibody complexes, leveraging convolutional neural networks (CNNs) to extract local features and Transformers to capture long-range dependencies within antigen sequences. Ablation studies demonstrate that CNN enhances the prediction of linear epitopes, and the Transformer module improves the prediction of conformational epitopes. Experimental results show that our model outperforms existing baselines in terms of PCC, ROC-AUC, PR-AUC, and F1 scores on conformational epitopes.
zh
[AI-151] rack Component Failure Detection Using Data Analytics over existing STDS Track Circuit data
【速读】:该论文旨在解决传统轨道电路(Track Circuits, TC)在故障定位中的效率低、依赖人工经验的问题,通过自动化识别具体故障组件来提升维护精准性。其解决方案的关键在于利用一种名为“智能列车检测系统”(Smart Train Detection System, STDS)的交流轨道电路采集电流数据,并将其输入支持向量机(Support Vector Machine, SVM)分类器中进行训练与识别,从而自动判断15种不同故障类型(归属于3个更广泛的类别),实现对轨道电路中具体失效部件的准确诊断。该方法已在10个实际轨道电路现场数据上验证,且得到专家和维护人员的认可,所有测试用例均被正确分类。
链接: https://arxiv.org/abs/2508.11693
作者: Francisco López,Eduardo Di Santi,Clément Lefebvre,Nenad Mijatovic,Michele Pugnaloni,Victor Martín,Kenza Saiah
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: Peer-reviewed conference paper. Presented at ICROMA 2025 (International Conference on Railway Operations Modelling and Analysis), Dresden, Germany
Abstract:Track Circuits (TC) are the main signalling devices used to detect the presence of a train on a rail track. It has been used since the 19th century and nowadays there are many types depending on the technology. As a general classification, Track Circuits can be divided into 2 main groups, DC (Direct Current) and AC (Alternating Current) circuits. This work is focused on a particular AC track circuit, called “Smart Train Detection System” (STDS), designed with both high and low-frequency bands. This approach uses STDS current data applied to an SVM (support vector machine) classifier as a type of failure identifier. The main purpose of this work consists on determine automatically which is the component of the track that is failing to improve the maintenance action. Model was trained to classify 15 different failures that belong to 3 more general categories. The method was tested with field data from 10 different track circuits and validated by the STDS track circuit expert and maintainers. All use cases were correctly classified by the method.
zh
[AI-152] Scalable Technology-Agnostic Diagnosis and Predictive Maintenance for Point Machine using Deep Learning
【速读】:该论文旨在解决铁路道岔转换设备(Point Machine, PM)的故障检测问题,传统方法依赖多源输入并需人工设计特征,存在数据采集复杂、模型泛化能力差等问题。其关键解决方案是仅使用单一功率信号作为输入,通过深度学习模型自动识别PM运行中的异常状态,实现对障碍物、摩擦、电源故障和错位等主要失效模式的高精度分类(99.99%精确率,0.01%假阳性),且方法具有技术无关性和可扩展性,适用于多种电磁机械式PM,在真实环境与试验台中均验证有效;同时引入置信度量化机制(conformal prediction),确保输出结果符合ISO-17359标准,提升运维决策可靠性。
链接: https://arxiv.org/abs/2508.11692
作者: Eduardo Di Santi(1),Ruixiang Ci(2),Clément Lefebvre(1),Nenad Mijatovic(1),Michele Pugnaloni(1),Jonathan Brown(1),Victor Martín(1),Kenza Saiah(1) ((1) Digital and Integrated Systems, Alstom (2) Innovation and Smart Mobility, Alstom)
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI)
备注: Peer-reviewed conference paper. Presented at ICROMA 2025, Dresden, Germany. Conference: this https URL . Book of abstracts: this https URL . 8 pages, 6 figures, 1 table
Abstract:The Point Machine (PM) is a critical piece of railway equipment that switches train routes by diverting tracks through a switchblade. As with any critical safety equipment, a failure will halt operations leading to service disruptions; therefore, pre-emptive maintenance may avoid unnecessary interruptions by detecting anomalies before they become failures. Previous work relies on several inputs and crafting custom features by segmenting the signal. This not only adds additional requirements for data collection and processing, but it is also specific to the PM technology, the installed locations and operational conditions limiting scalability. Based on the available maintenance records, the main failure causes for PM are obstacles, friction, power source issues and misalignment. Those failures affect the energy consumption pattern of PMs, altering the usual (or healthy) shape of the power signal during the PM movement. In contrast to the current state-of-the-art, our method requires only one input. We apply a deep learning model to the power signal pattern to classify if the PM is nominal or associated with any failure type, achieving 99.99% precision, 0.01% false positives and negligible false negatives. Our methodology is generic and technology-agnostic, proven to be scalable on several electromechanical PM types deployed in both real-world and test bench environments. Finally, by using conformal prediction the maintainer gets a clear indication of the certainty of the system outputs, adding a confidence layer to operations and making the method compliant with the ISO-17359 standard.
zh
[AI-153] owards Generalizable Learning Models for EEG-Based Identification of Pain Perception
【速读】:该论文旨在解决脑电图(EEG)信号在个体间差异显著背景下,机器学习模型在疼痛感知识别任务中跨被试泛化能力不足的问题。当前研究多集中于单一被试内的模型性能,缺乏对跨被试通用性的系统评估,限制了其临床转化潜力。解决方案的关键在于:首先,构建一个包含108名被试的新型EEG数据集,并在此基础上对多种传统分类器与深度神经网络模型进行系统性比较;其次,发现深度学习模型(尤其是基于图结构的模型)在跨被试场景下表现更稳健,能够更好地捕捉受试者不变的EEG表征结构,从而提升模型的泛化能力。此外,作者开源了预处理后的数据集,为未来算法在相同泛化约束下的公平比较提供了标准化基准。
链接: https://arxiv.org/abs/2508.11691
作者: Mathis Rezzouk,Fabrice Gagnon,Alyson Champagne,Mathieu Roy,Philippe Albouy,Michel-Pierre Coll,Cem Subakan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注: 6 pages, 2 figures, 2 tables, MLSP IEEE conference
Abstract:EEG-based analysis of pain perception, enhanced by machine learning, reveals how the brain encodes pain by identifying neural patterns evoked by noxious stimulation. However, a major challenge that remains is the generalization of machine learning models across individuals, given the high cross-participant variability inherent to EEG signals and the limited focus on direct pain perception identification in current research. In this study, we systematically evaluate the performance of cross-participant generalization of a wide range of models, including traditional classifiers and deep neural classifiers for identifying the sensory modality of thermal pain and aversive auditory stimulation from EEG recordings. Using a novel dataset of EEG recordings from 108 participants, we benchmark model performance under both within- and cross-participant evaluation settings. Our findings show that traditional models suffered the largest drop from within- to cross-participant performance, while deep learning models proved more resilient, underscoring their potential for subject-invariant EEG decoding. Even though performance variability remained high, the strong results of the graph-based model highlight its potential to capture subject-invariant structure in EEG signals. On the other hand, we also share the preprocessed dataset used in this study, providing a standardized benchmark for evaluating future algorithms under the same generalization constraints.
zh
[AI-154] Age-Normalized HRV Features for Non-Invasive Glucose Prediction: A Pilot Sleep-Aware Machine Learning Study
【速读】:该论文旨在解决非侵入式血糖监测(non-invasive glucose monitoring)在糖尿病管理中的关键挑战,尤其是睡眠期间心率变异性(HRV)用于血糖预测时受年龄相关自主神经功能变化干扰的问题。其解决方案的关键在于提出了一种新颖的年龄标准化(age-normalization)技术,通过将原始HRV特征除以基于年龄缩放的因子来校正年龄效应,从而提升模型对血糖水平的预测精度。实验表明,采用该方法后,log-葡萄糖预测的决定系数R²达到0.161(平均绝对误差MAE=0.182),较未标准化特征提升25.6%,且系统性消融研究证实年龄标准化是性能提升的核心因素,同时睡眠阶段特异性HRV特征进一步增强了预测能力。
链接: https://arxiv.org/abs/2508.11682
作者: Md Basit Azam,Sarangthem Ibotombi Singh
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Non-invasive glucose monitoring remains a critical challenge in the management of diabetes. HRV during sleep shows promise for glucose prediction however, age-related autonomic changes significantly confound traditional HRV analyses. We analyzed 43 subjects with multi-modal data including sleep-stage specific ECG, HRV features, and clinical measurements. A novel age-normalization technique was applied to the HRV features by, dividing the raw values by age-scaled factors. BayesianRidge regression with 5-fold cross-validation was employed for log-glucose prediction. Age-normalized HRV features achieved R2 = 0.161 (MAE = 0.182) for log-glucose prediction, representing a 25.6% improvement over non-normalized features (R2 = 0.132). The top predictive features were hrv rem mean rr age normalized (r = 0.443, p = 0.004), hrv ds mean rr age normalized (r = 0.438, p = 0.005), and diastolic blood pressure (r = 0.437, p = 0.005). Systematic ablation studies confirmed age-normalization as the critical component, with sleep-stage specific features providing additional predictive value. Age-normalized HRV features significantly enhance glucose prediction accuracy compared with traditional approaches. This sleep-aware methodology addresses fundamental limitations in autonomic function assessment and suggests a preliminary feasibility for non-invasive glucose monitoring applications. However, these results require validation in larger cohorts before clinical consideration.
zh
[AI-155] Revealing Neurocognitive and Behavioral Patterns by Unsupervised Manifold Learning from Dynamic Brain Data
【速读】:该论文旨在解决动态脑数据(dynamic brain data)中海量且复杂的信息难以可靠提取的问题,尤其是在跨不同数据源时如何有效识别神经认知与行为模式。其解决方案的关键在于提出一种通用的无监督深度流形学习方法——基于脑动态卷积网络的嵌入(Brain-dynamic Convolutional-Network-based Embedding, BCNE),该方法通过解析数据中的时空相关性来捕捉脑状态轨迹,并在此基础上应用流形学习,从而实现对多种神经行为特征(如场景转换、记忆处理、学习阶段区分及主动/被动行为差异)的有效建模与可视化。
链接: https://arxiv.org/abs/2508.11672
作者: Zixia Zhou,Junyan Liu,Wei Emma Wu,Ruogu Fang,Sheng Liu,Qingyue Wei,Rui Yan,Yi Guo,Qian Tao,Yuanyuan Wang,Md Tauhidul Islam,Lei Xing
机构: 未知
类目: Neurons and Cognition (q-bio.NC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
备注:
Abstract:Dynamic brain data, teeming with biological and functional insights, are becoming increasingly accessible through advanced measurements, providing a gateway to understanding the inner workings of the brain in living subjects. However, the vast size and intricate complexity of the data also pose a daunting challenge in reliably extracting meaningful information across various data sources. This paper introduces a generalizable unsupervised deep manifold learning for exploration of neurocognitive and behavioral patterns. Unlike existing methods that extract patterns directly from the input data as in the existing methods, the proposed Brain-dynamic Convolutional-Network-based Embedding (BCNE) seeks to capture the brain-state trajectories by deciphering the temporospatial correlations within the data and subsequently applying manifold learning to this correlative representation. The performance of BCNE is showcased through the analysis of several important dynamic brain datasets. The results, both visual and quantitative, reveal a diverse array of intriguing and interpretable patterns. BCNE effectively delineates scene transitions, underscores the involvement of different brain regions in memory and narrative processing, distinguishes various stages of dynamic learning processes, and identifies differences between active and passive behaviors. BCNE provides an effective tool for exploring general neuroscience inquiries or individual-specific patterns.
zh
[AI-156] Vibe2Spike: Batteryless Wireless Tags for Vibration Sensing with Event Cameras and Spiking Networks
【速读】:该论文旨在解决现有传感方案在能量消耗、可扩展性和可靠性之间难以平衡的问题,尤其针对电池维护成本高、无线传输开销大以及数据处理复杂度高等挑战。其解决方案的关键在于提出了一种无电池、无线的振动感知框架Vibe2Spike,该框架利用可见光通信(Visible Light Communication, VLC)与脉冲神经网络(Spiking Neural Networks, SNNs)实现基于振动的活动识别:系统通过仅由压电盘、齐纳二极管和LED组成的超低成本标签捕获振动能量并发射稀疏光脉冲,无需电池或射频(RF)无线电;这些光学脉冲由事件相机捕捉,并使用经EONS框架优化的SNN模型进行分类。该方法实现了高精度(平均分类准确率达94.9%)与低功耗、高可扩展性的统一。
链接: https://arxiv.org/abs/2508.11640
作者: Danny Scott,William LaForest,Hritom Das,Ioannis Polykretis,Catherine D. Schuman,Charles Rizzo,James Plank,Sai Swaminathan
机构: 未知
类目: ignal Processing (eess.SP); Artificial Intelligence (cs.AI); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
备注: International Conference on Neuromorphic Systems (ICONS) 2025 9 pages, 7 images
Abstract:The deployment of dense, low-cost sensors is critical for realizing ubiquitous smart environments. However, existing sensing solutions struggle with the energy, scalability, and reliability trade-offs imposed by battery maintenance, wireless transmission overhead, and data processing complexity. In this work, we present Vibe2Spike, a novel battery-free, wireless sensing framework that enables vibration-based activity recognition using visible light communication (VLC) and spiking neural networks (SNNs). Our system uses ultra-low-cost tags composed only of a piezoelectric disc, a Zener diode, and an LED, which harvest vibration energy and emit sparse visible light spikes without requiring batteries or RF radios. These optical spikes are captured by event cameras and classified using optimized SNN models evolved via the EONS framework. We evaluate Vibe2Spike across five device classes, achieving 94.9% average classification fitness while analyzing the latency-accuracy trade-offs of different temporal binning strategies. Vibe2Spike demonstrates a scalable, and energy-efficient approach for enabling intelligent environments in a batteryless manner.
zh
机器学习
[LG-0] MDPO: Overcoming the Training-Inference Divide of Masked Diffusion Language Models
链接: https://arxiv.org/abs/2508.13148
作者: Haoyu He,Katrin Renz,Yong Cao,Andreas Geiger
类目: Machine Learning (cs.LG)
*备注:
Abstract:Diffusion language models, as a promising alternative to traditional autoregressive (AR) models, enable faster generation and richer conditioning on bidirectional context. However, they suffer from a key discrepancy between training and inference: during inference, MDLMs progressively reveal the structure of the generated sequence by producing fewer and fewer masked tokens, whereas this structure is ignored in training as tokens are masked at random. Although this discrepancy between training and inference can lead to suboptimal performance, it has been largely overlooked by previous works, leaving closing this gap between the two stages an open problem. To address this, we frame the problem of learning effective denoising trajectories as a sequential decision-making problem and use the resulting framework to apply reinforcement learning. We propose a novel Masked Diffusion Policy Optimization (MDPO) to exploit the Markov property diffusion possesses and explicitly train the model under the same progressive refining schedule used at inference. MDPO matches the performance of the previous state-of-the-art (SOTA) method with 60x fewer gradient updates, while achieving average improvements of 9.6% on MATH500 and 54.2% on Countdown over SOTA when trained within the same number of weight updates. Additionally, we improve the remasking strategy of MDLMs as a plug-in inference replacement to overcome the limitation that the model cannot refine tokens flexibly. This simple yet effective training-free strategy, what we refer to as RCR, consistently improves performance and yields additional gains when combined with MDPO. Our findings establish great potential for investigating the discrepancy between pre-training and inference of MDLMs. Code: this https URL. Project Page: this https URL.
[LG-1] raining Machine Learning Models on Human Spatio-temporal Mobility Data: An Experimental Study [Experiment Paper]
链接: https://arxiv.org/abs/2508.13135
作者: Yueyang Liu,Lance Kennedy,Ruochen Kong,Joon-Seok Kim,Andreas Züfle
类目: Machine Learning (cs.LG)
*备注:
Abstract:Individual-level human mobility prediction has emerged as a significant topic of research with applications in infectious disease monitoring, child, and elderly care. Existing studies predominantly focus on the microscopic aspects of human trajectories: such as predicting short-term trajectories or the next location visited, while offering limited attention to macro-level mobility patterns and the corresponding life routines. In this paper, we focus on an underexplored problem in human mobility prediction: determining the best practices to train a machine learning model using historical data to forecast an individuals complete trajectory over the next days and weeks. In this experiment paper, we undertake a comprehensive experimental analysis of diverse models, parameter configurations, and training strategies, accompanied by an in-depth examination of the statistical distribution inherent in human mobility patterns. Our empirical evaluations encompass both Long Short-Term Memory and Transformer-based architectures, and further investigate how incorporating individual life patterns can enhance the effectiveness of the prediction. We show that explicitly including semantic information such as day-of-the-week and user-specific historical information can help the model better understand individual patterns of life and improve predictions. Moreover, since the absence of explicit user information is often missing due to user privacy, we show that the sampling of users may exacerbate data skewness and result in a substantial loss in predictive accuracy. To mitigate data imbalance and preserve diversity, we apply user semantic clustering with stratified sampling to ensure that the sampled dataset remains representative. Our results further show that small-batch stochastic gradient optimization improves model performance, especially when human mobility training data is limited.
[LG-2] Causally-Guided Pairwise Transformer – Towards Foundational Digital Twins in Process Industry
链接: https://arxiv.org/abs/2508.13111
作者: Michael Mayr,Georgios C. Chasparis
类目: Machine Learning (cs.LG)
*备注: 12 pages, 2 figures, 4 tables
Abstract:Foundational modelling of multi-dimensional time-series data in industrial systems presents a central trade-off: channel-dependent (CD) models capture specific cross-variable dynamics but lack robustness and adaptability as model layers are commonly bound to the data dimensionality of the tackled use-case, while channel-independent (CI) models offer generality at the cost of modelling the explicit interactions crucial for system-level predictive regression tasks. To resolve this, we propose the Causally-Guided Pairwise Transformer (CGPT), a novel architecture that integrates a known causal graph as an inductive bias. The core of CGPT is built around a pairwise modeling paradigm, tackling the CD/CI conflict by decomposing the multidimensional data into pairs. The model uses channel-agnostic learnable layers where all parameter dimensions are independent of the number of variables. CGPT enforces a CD information flow at the pair-level and CI-like generalization across pairs. This approach disentangles complex system dynamics and results in a highly flexible architecture that ensures scalability and any-variate adaptability. We validate CGPT on a suite of synthetic and real-world industrial datasets on long-term and one-step forecasting tasks designed to simulate common industrial complexities. Results demonstrate that CGPT significantly outperforms both CI and CD baselines in predictive accuracy and shows competitive performance with end-to-end trained CD models while remaining agnostic to the problem dimensionality.
[LG-3] A Perfectly Truthful Calibration Measure
链接: https://arxiv.org/abs/2508.13100
作者: Jason Hartline,Lunjia Hu,Yifan Wu
类目: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML)
*备注:
Abstract:Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. Calibration measures quantify how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Although predicting the true probabilities guarantees perfect calibration, in reality, when calibration is evaluated on a finite sample, predicting the truth is not guaranteed to minimize any known calibration measure. All known calibration measures incentivize predictors to lie in order to appear more calibrated on a finite sample. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a perfectly truthful calibration measure in the batch setting: averaged two-bin calibration error (ATB). In addition to being truthful, ATB is sound, complete, continuous, and quadratically related to two existing calibration measures: the smooth calibration error (smCal) and the (lower) distance to calibration (distCal). The simplicity in our definition of ATB makes it efficient and straightforward to compute. ATB allows faster estimation algorithms with significantly easier implementations than smCal and distCal, achieving improved running time and simplicity for the calibration testing problem studied by Hu et al. (2024). We also introduce a general recipe for constructing truthful measures, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned l_2-ECE. Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS); Machine Learning (stat.ML) Cite as: arXiv:2508.13100 [cs.LG] (or arXiv:2508.13100v1 [cs.LG] for this version) https://doi.org/10.48550/arXiv.2508.13100 Focus to learn more arXiv-issued DOI via DataCite (pending registration)
[LG-4] Outlier Detection of Poisson-Distributed Targets Using a Seabed Sensor Network
链接: https://arxiv.org/abs/2508.13099
作者: Mingyu Kim,Daniel Stilwell,Jorge Jimenez
类目: Machine Learning (cs.LG); Information Theory (cs.IT)
*备注: IEEE OCEANS
Abstract:This paper presents a framework for classifying and detecting spatial commission outliers in maritime environments using seabed acoustic sensor networks and log Gaussian Cox processes (LGCPs). By modeling target arrivals as a mixture of normal and outlier processes, we estimate the probability that a newly observed event is an outlier. We propose a second-order approximation of this probability that incorporates both the mean and variance of the normal intensity function, providing improved classification accuracy compared to mean-only approaches. We analytically show that our method yields a tighter bound to the true probability using Jensen’s inequality. To enhance detection, we integrate a real-time, near-optimal sensor placement strategy that dynamically adjusts sensor locations based on the evolving outlier intensity. The proposed framework is validated using real ship traffic data near Norfolk, Virginia, where numerical results demonstrate the effectiveness of our approach in improving both classification performance and outlier detection through sensor deployment.
[LG-5] Denoising diffusion models for inverse design of inflatable structures with programmable deformations
链接: https://arxiv.org/abs/2508.13097
作者: Sara Karimi,Nikolaos N. Vlassis
类目: Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注: 21 pages, 12 figures
Abstract:Programmable structures are systems whose undeformed geometries and material property distributions are deliberately designed to achieve prescribed deformed configurations under specific loading conditions. Inflatable structures are a prominent example, using internal pressurization to realize large, nonlinear deformations in applications ranging from soft robotics and deployable aerospace systems to biomedical devices and adaptive architecture. We present a generative design framework based on denoising diffusion probabilistic models (DDPMs) for the inverse design of elastic structures undergoing large, nonlinear deformations under pressure-driven actuation. The method formulates the inverse design as a conditional generation task, using geometric descriptors of target deformed states as inputs and outputting image-based representations of the undeformed configuration. Representing these configurations as simple images is achieved by establishing a pre- and postprocessing pipeline that involves a fixed image processing, simulation setup, and descriptor extraction methods. Numerical experiments with scalar and higher-dimensional descriptors show that the framework can quickly produce diverse undeformed configurations that achieve the desired deformations when inflated, enabling parallel exploration of viable design candidates while accommodating complex constraints.
[LG-6] Seeing the Many: Exploring Parameter Distributions Conditioned on Features in Surrogates
链接: https://arxiv.org/abs/2508.13088
作者: Xiaohan Wang,Zhimin Li,Joshua A. Levine,Matthew Berger
类目: Machine Learning (cs.LG); Human-Computer Interaction (cs.HC)
*备注:
Abstract:Recently, neural surrogate models have emerged as a compelling alternative to traditional simulation workflows. This is accomplished by modeling the underlying function of scientific simulations, removing the need to run expensive simulations. Beyond just mapping from input parameter to output, surrogates have also been shown useful for inverse problems: output to input parameters. Inverse problems can be understood as search, where we aim to find parameters whose surrogate outputs contain a specified feature. Yet finding these parameters can be costly, especially for high-dimensional parameter spaces. Thus, existing surrogate-based solutions primarily focus on finding a small set of matching parameters, in the process overlooking the broader picture of plausible parameters. Our work aims to model and visualize the distribution of possible input parameters that produce a given output feature. To achieve this goal, we aim to address two challenges: (1) the approximation error inherent in the surrogate model and (2) forming the parameter distribution in an interactive manner. We model error via density estimation, reporting high density only if a given parameter configuration is close to training parameters, measured both over the input and output space. Our density estimate is used to form a prior belief on parameters, and when combined with a likelihood on features, gives us an efficient way to sample plausible parameter configurations that generate a target output feature. We demonstrate the usability of our solution through a visualization interface by performing feature-driven parameter analysis over the input parameter space of three simulation datasets. Source code is available at this https URL
[LG-7] Is This News Still Interesting to You?: Lifetime-aware Interest Matching for News Recommendation CIKM
链接: https://arxiv.org/abs/2508.13064
作者: Seongeun Ryu,Yunyong Ko,Sang-Wook Kim
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注: 10 pages, 7 figures, 4 tables, accepted at ACM International Conference on Information and Knowledge Management (CIKM)
Abstract:Personalized news recommendation aims to deliver news articles aligned with users’ interests, serving as a key solution to alleviate the problem of information overload on online news platforms. While prior work has improved interest matching through refined representations of news and users, the following time-related challenges remain underexplored: (C1) leveraging the age of clicked news to infer users’ interest persistence, and (C2) modeling the varying lifetime of news across topics and users. To jointly address these challenges, we propose a novel Lifetime-aware Interest Matching framework for nEws recommendation, named LIME, which incorporates three key strategies: (1) User-Topic lifetime-aware age representation to capture the relative age of news with respect to a user-topic pair, (2) Candidate-aware lifetime attention for generating temporally aligned user representation, and (3) Freshness-guided interest refinement for prioritizing valid candidate news at prediction time. Extensive experiments on two real-world datasets demonstrate that LIME consistently outperforms a wide range of state-of-the-art news recommendation methods, and its model agnostic strategies significantly improve recommendation accuracy.
[LG-8] Beyond Internal Data: Bounding and Estimating Fairness from Incomplete Data
链接: https://arxiv.org/abs/2508.13040
作者: Varsha Ramineni,Hossein A. Rahmani,Emine Yilmaz,David Barber
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
Abstract:Ensuring fairness in AI systems is critical, especially in high-stakes domains such as lending, hiring, and healthcare. This urgency is reflected in emerging global regulations that mandate fairness assessments and independent bias audits. However, procuring the necessary complete data for fairness testing remains a significant challenge. In industry settings, legal and privacy concerns restrict the collection of demographic data required to assess group disparities, and auditors face practical and cultural challenges in gaining access to data. In practice, data relevant for fairness testing is often split across separate sources: internal datasets held by institutions with predictive attributes, and external public datasets such as census data containing protected attributes, each providing only partial, marginal information. Our work seeks to leverage such available separate data to estimate model fairness when complete data is inaccessible. We propose utilising the available separate data to estimate a set of feasible joint distributions and then compute the set plausible fairness metrics. Through simulation and real experiments, we demonstrate that we can derive meaningful bounds on fairness metrics and obtain reliable estimates of the true metric. Our results demonstrate that this approach can serve as a practical and effective solution for fairness testing in real-world settings where access to complete data is restricted.
[LG-9] Design and Analysis of Robust Adaptive Filtering with the Hyperbolic Tangent Exponential Kernel M-Estimator Function for Active Noise Control
链接: https://arxiv.org/abs/2508.13018
作者: Iam Kim de S. Hermont,Andre R. Flores,Rodrigo C. de Lamare
类目: Machine Learning (cs.LG)
*备注: 12 figures, 11 pages
Abstract:In this work, we propose a robust adaptive filtering approach for active noise control applications in the presence of impulsive noise. In particular, we develop the filtered-x hyperbolic tangent exponential generalized Kernel M-estimate function (FXHEKM) robust adaptive algorithm. A statistical analysis of the proposed FXHEKM algorithm is carried out along with a study of its computational cost. In order to evaluate the proposed FXHEKM algorithm, the mean-square error (MSE) and the average noise reduction (ANR) performance metrics have been adopted. Numerical results show the efficiency of the proposed FXHEKM algorithm to cancel the presence of the additive spurious signals, such as \textbf \alpha -stable noises against competing algorithms.
[LG-10] Monte Carlo Functional Regularisation for Continual Learning
链接: https://arxiv.org/abs/2508.13006
作者: Pengcheng Hao,Menghao Waiyan William Zhu,Ercan Engin Kuruoglu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Continual learning (CL) is crucial for the adaptation of neural network models to new environments. Although outperforming weight-space regularisation approaches, the functional regularisation-based CL methods suffer from high computational costs and large linear approximation errors. In this work, we present a new functional regularisation CL framework, called MCFRCL, which approximates model prediction distributions by Monte Carlo (MC) sampling. Moreover, three continuous distributions are leveraged to capture the statistical characteristics of the MC samples via moment-based methods. Additionally, both the Wasserstein distance and the Kullback-Leibler (KL) distance are employed to construct the regularisation function. The proposed MCFRCL is evaluated against multiple benchmark methods on the MNIST and CIFAR datasets, with simulation results highlighting its effectiveness in both prediction accuracy and training efficiency.
[LG-11] Fairness-Aware Multi-view Evidential Learning with Adaptive Prior
链接: https://arxiv.org/abs/2508.12997
作者: Haishun Chen,Cai Xu,Jinlong Yu,Yilin Zhang,Ziyu Guan,Wei Zhao
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Multi-view evidential learning aims to integrate information from multiple views to improve prediction performance and provide trustworthy uncertainty esitimation. Most previous methods assume that view-specific evidence learning is naturally reliable. However, in practice, the evidence learning process tends to be biased. Through empirical analysis on real-world data, we reveal that samples tend to be assigned more evidence to support data-rich classes, thereby leading to unreliable uncertainty estimation in predictions. This motivates us to delve into a new Biased Evidential Multi-view Learning (BEML) problem. To this end, we propose Fairness-Aware Multi-view Evidential Learning (FAML). FAML first introduces an adaptive prior based on training trajectory, which acts as a regularization strategy to flexibly calibrate the biased evidence learning process. Furthermore, we explicitly incorporate a fairness constraint based on class-wise evidence variance to promote balanced evidence allocation. In the multi-view fusion stage, we propose an opinion alignment mechanism to mitigate view-specific bias across views, thereby encouraging the integration of consistent and mutually supportive evidence. Extensive experiments on five real-world multi-view datasets demonstrate that FAML achieves more balanced evidence allocation and improves both prediction performance and the reliability of uncertainty estimation compared to state-of-the-art methods.
[LG-12] Predicting the Performance of Graph Convolutional Networks with Spectral Properties of the Graph Laplacian
链接: https://arxiv.org/abs/2508.12993
作者: Shalima Binta Manir,Tim Oates
类目: Machine Learning (cs.LG)
*备注: 9 pages, 3 figures
Abstract:A common observation in the Graph Convolutional Network (GCN) literature is that stacking GCN layers may or may not result in better performance on tasks like node classification and edge prediction. We have found empirically that a graph’s algebraic connectivity, which is known as the Fiedler value, is a good predictor of GCN performance. Intuitively, graphs with similar Fiedler values have analogous structural properties, suggesting that the same filters and hyperparameters may yield similar results when used with GCNs, and that transfer learning may be more effective between graphs with similar algebraic connectivity. We explore this theoretically and empirically with experiments on synthetic and real graph data, including the Cora, CiteSeer and Polblogs datasets. We explore multiple ways of aggregating the Fiedler value for connected components in the graphs to arrive at a value for the entire graph, and show that it can be used to predict GCN performance. We also present theoretical arguments as to why the Fiedler value is a good predictor.
[LG-13] Fed-DPRoC:Communication-Efficient Differentially Private and Robust Federated Learning
链接: https://arxiv.org/abs/2508.12978
作者: Yue Xia,Tayyebeh Jahani-Nezhad,Rawad Bitar
类目: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
*备注:
Abstract:We propose Fed-DPRoC, a novel federated learning framework that simultaneously ensures differential privacy (DP), Byzantine robustness, and communication efficiency. We introduce the concept of robust-compatible compression, which enables users to compress DP-protected updates while maintaining the robustness of the aggregation rule. We instantiate our framework as RobAJoL, combining the Johnson-Lindenstrauss (JL) transform for compression with robust averaging for robust aggregation. We theoretically prove the compatibility of JL transform with robust averaging and show that RobAJoL preserves robustness guarantees, ensures DP, and reduces communication cost. Experiments on CIFAR-10 and Fashion MNIST validate our theoretical claims and demonstrate that RobAJoL outperforms existing methods in terms of robustness and utility under different Byzantine attacks.
[LG-14] SparseMap: A Sparse Tensor Accelerator Framework Based on Evolution Strategy
链接: https://arxiv.org/abs/2508.12906
作者: Boran Zhao,Haiming Zhai,Zihang Yuan,Hetian Liu,Tian Xia,Wenzhe Zhao,Pengju Ren
类目: Machine Learning (cs.LG)
*备注:
Abstract:The growing demand for sparse tensor algebra (SpTA) in machine learning and big data has driven the development of various sparse tensor accelerators. However, most existing manually designed accelerators are limited to specific scenarios, and it’s time-consuming and challenging to adjust a large number of design factors when scenarios change. Therefore, automating the design of SpTA accelerators is crucial. Nevertheless, previous works focus solely on either mapping (i.e., tiling communication and computation in space and time) or sparse strategy (i.e., bypassing zero elements for efficiency), leading to suboptimal designs due to the lack of comprehensive consideration of both. A unified framework that jointly optimizes both is urgently needed. However, integrating mapping and sparse strategies leads to a combinatorial explosion in the design space(e.g., as large as O(10^41) for the workload P_32 \times 64 \times Q_64 \times 48 = Z_32 \times 48 ). This vast search space renders most conventional optimization methods (e.g., particle swarm optimization, reinforcement learning and Monte Carlo tree search) inefficient. To address this challenge, we propose an evolution strategy-based sparse tensor accelerator optimization framework, called SparseMap. SparseMap constructing a more comprehensive design space with the consideration of both mapping and sparse strategy. We introduce a series of enhancements to genetic encoding and evolutionary operators, enabling SparseMap to efficiently explore the vast and diverse design space. We quantitatively compare SparseMap with prior works and classical optimization methods, demonstrating that SparseMap consistently finds superior solutions.
[LG-15] Learning In-context pmbn-grams with Transformers: Sub-pmbn-grams Are Near-stationary Points ICML2025
链接: https://arxiv.org/abs/2508.12837
作者: Aditya Varre,Gizem Yüce,Nicolas Flammarion
类目: Machine Learning (cs.LG)
*备注: ICML2025
Abstract:Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context n -gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent k -gram estimators (for k \leq n ), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub- n -grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of n -grams, characterized by discrete transitions between near-stationary solutions.
[LG-16] Efficient and Verifiable Privacy-Preserving Convolutional Computation for CNN Inference with Untrusted Clouds
链接: https://arxiv.org/abs/2508.12832
作者: Jinyu Lu,Xinrong Sun,Yunting Tao,Tong Ji,Fanyu Kong,Guoqiang Yang
类目: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
*备注:
Abstract:The widespread adoption of convolutional neural networks (CNNs) in resource-constrained scenarios has driven the development of Machine Learning as a Service (MLaaS) system. However, this approach is susceptible to privacy leakage, as the data sent from the client to the untrusted cloud server often contains sensitive information. Existing CNN privacy-preserving schemes, while effective in ensuring data confidentiality through homomorphic encryption and secret sharing, face efficiency bottlenecks, particularly in convolution operations. In this paper, we propose a novel verifiable privacy-preserving scheme tailored for CNN convolutional layers. Our scheme enables efficient encryption and decryption, allowing resource-constrained clients to securely offload computations to the untrusted cloud server. Additionally, we present a verification mechanism capable of detecting the correctness of the results with a success probability of at least 1-\frac1\left|Z\right| . Extensive experiments conducted on 10 datasets and various CNN models demonstrate that our scheme achieves speedups ranging 26 \times ~ \ 87\times compared to the original plaintext model while maintaining accuracy.
[LG-17] Wavy Transformer
链接: https://arxiv.org/abs/2508.12787
作者: Satoshi Noguchi,Yoshinobu Kawahara
类目: Machine Learning (cs.LG)
*备注: 25 pages, 5 figures
Abstract:Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.
[LG-18] Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling
链接: https://arxiv.org/abs/2508.12773
作者: Jiadong Chen,Xiao He,Hengyu Ye,Fuxin Jiang,Tieying Zhang,Jianjun Chen,Xiaofeng Gao
类目: Machine Learning (cs.LG)
*备注: 12 pages, 11 figures
Abstract:In the swiftly evolving domain of cloud computing, the advent of serverless systems underscores the crucial need for predictive auto-scaling systems. This necessity arises to ensure optimal resource allocation and maintain operational efficiency in inherently volatile environments. At the core of a predictive auto-scaling system is the workload forecasting model. Existing forecasting models struggle to quickly adapt to the dynamics in online workload streams and have difficulty capturing the complex periodicity brought by fine-grained, high-frequency forecasting tasks. Addressing this, we propose a novel online ensemble model, E3Former, for online workload forecasting in large-scale predictive auto-scaling. Our model synergizes the predictive capabilities of multiple subnetworks to surmount the limitations of single-model approaches, thus ensuring superior accuracy and robustness. Remarkably, it accomplishes this with a minimal increase in computational overhead, adhering to the lean operational ethos of serverless systems. Through extensive experimentation on real-world workload datasets, we establish the efficacy of our ensemble model. In online forecasting tasks, the proposed method reduces forecast error by an average of 10%, and its effectiveness is further demonstrated through a predictive auto-scaling test in the real-life online system. Currently, our method has been deployed within ByteDance’s Intelligent Horizontal Pod Auto-scaling (IHPA) platform, which supports the stable operation of over 30 applications, such as Douyin E-Comerce, TouTiao, and Volcano Engine. The predictive auto-scaling capacity reaching over 600,000 CPU cores. On the basis of essentially ensuring service quality, the predictive auto-scaling system can reduce resource utilization by over 40%.
[LG-19] Short-Term Forecasting of Energy Production and Consumption Using Extreme Learning Machine: A Comprehensive MIMO based ELM Approach
链接: https://arxiv.org/abs/2508.12764
作者: Cyril Voyant,Milan Despotovic,Luis Garcia-Gutierrez,Mohammed Asloune,Yves-Marie Saint-Drenan,Jean-Laurent Duchaud,hjuvan Antone Faggianelli,Elena Magliaro
类目: Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
*备注:
Abstract:A novel methodology for short-term energy forecasting using an Extreme Learning Machine ( \mathttELM ) is proposed. Using six years of hourly data collected in Corsica (France) from multiple energy sources (solar, wind, hydro, thermal, bioenergy, and imported electricity), our approach predicts both individual energy outputs and total production (\cyrincluding imports, which closely follow energy demand, modulo losses) through a Multi-Input Multi-Output ( \mathttMIMO ) architecture. To address non-stationarity and seasonal variability, sliding window techniques and cyclic time encoding are incorporated, enabling dynamic adaptation to fluctuations. The \mathttELM model significantly outperforms persistence-based forecasting, particularly for solar and thermal energy, achieving an \mathttnRMSE of 17.9% and 5.1% , respectively, with \mathttR^2 0.98 (1-hour horizon). The model maintains high accuracy up to five hours ahead, beyond which renewable energy sources become increasingly volatile. While \mathttMIMO provides marginal gains over Single-Input Single-Output ( \mathttSISO ) architectures and offers key advantages over deep learning methods such as \mathttLSTM , it provides a closed-form solution with lower computational demands, making it well-suited for real-time applications, including online learning. Beyond predictive accuracy, the proposed methodology is adaptable to various contexts and datasets, as it can be tuned to local constraints such as resource availability, grid characteristics, and market structures.
[LG-20] Constrained Centroid Clustering: A Novel Approach for Compact and Structured Partitioning
链接: https://arxiv.org/abs/2508.12758
作者: Sowmini Devi Veeramachaneni,Ramamurthy Garimella
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:This paper presents Constrained Centroid Clustering (CCC), a method that extends classical centroid-based clustering by enforcing a constraint on the maximum distance between the cluster center and the farthest point in the cluster. Using a Lagrangian formulation, we derive a closed-form solution that maintains interpretability while controlling cluster spread. To evaluate CCC, we conduct experiments on synthetic circular data with radial symmetry and uniform angular distribution. Using ring-wise, sector-wise, and joint entropy as evaluation metrics, we show that CCC achieves more compact clusters by reducing radial spread while preserving angular structure, outperforming standard methods such as K-means and GMM. The proposed approach is suitable for applications requiring structured clustering with spread control, including sensor networks, collaborative robotics, and interpretable pattern analysis.
[LG-21] Deep Semantic Inference over the Air: An Efficient Task-Oriented Communication System
链接: https://arxiv.org/abs/2508.12748
作者: Chenyang Wang,Roger Olsson,Stefan Forsström,Qing He
类目: Information Theory (cs.IT); Machine Learning (cs.LG)
*备注:
Abstract:Empowered by deep learning, semantic communication marks a paradigm shift from transmitting raw data to conveying task-relevant meaning, enabling more efficient and intelligent wireless systems. In this study, we explore a deep learning-based task-oriented communication framework that jointly considers classification performance, computational latency, and communication cost. We adopt ResNets-based models and evaluate them on the CIFAR-10 and CIFAR-100 datasets to simulate real-world classification tasks in wireless environments. We partition the model at various points to simulate split inference across a wireless channel. By varying the split location and the size of the transmitted semantic feature vector, we systematically analyze the trade-offs between task accuracy and resource efficiency. Experimental results show that, with appropriate model partitioning and semantic feature compression, the system can retain over 85% of baseline accuracy while significantly reducing both computational load and communication overhead.
[LG-22] A Multi-Resolution Benchmark Framework for Spatial Reasoning Assessment in Neural Networks
链接: https://arxiv.org/abs/2508.12741
作者: Manuela Imbriani,Gina Belmonte,Mieke Massink,Alessandro Tofani,Vincenzo Ciancia
类目: Machine Learning (cs.LG); Applied Physics (physics.app-ph); Medical Physics (physics.med-ph)
*备注:
Abstract:This paper presents preliminary results in the definition of a comprehensive benchmark framework designed to systematically evaluate spatial reasoning capabilities in neural networks, with a particular focus on morphological properties such as connectivity and distance relationships. The framework is currently being used to study the capabilities of nnU-Net, exploiting the spatial model checker VoxLogicA to generate two distinct categories of synthetic datasets: maze connectivity problems for topological analysis and spatial distance computation tasks for geometric understanding. Each category is evaluated across multiple resolutions to assess scalability and generalization properties. The automated pipeline encompasses a complete machine learning workflow including: synthetic dataset generation, standardized training with cross-validation, inference execution, and comprehensive evaluation using Dice coefficient and IoU (Intersection over Union) metrics. Preliminary experimental results demonstrate significant challenges in neural network spatial reasoning capabilities, revealing systematic failures in basic geometric and topological understanding tasks. The framework provides a reproducible experimental protocol, enabling researchers to identify specific limitations. Such limitations could be addressed through hybrid approaches combining neural networks with symbolic reasoning methods for improved spatial understanding in clinical applications, establishing a foundation for ongoing research into neural network spatial reasoning limitations and potential solutions.
[LG-23] A Hierarchical Surrogate Model for Efficient Multi-Task Parameter Learning in Closed-Loop Contro
链接: https://arxiv.org/abs/2508.12738
作者: Sebastian Hirt,Lukas Theiner,Maik Pfefferkorn,Rolf Findeisen
类目: ystems and Control (eess.SY); Machine Learning (cs.LG)
*备注: 8 pages, 4 figures, accepted for CDC 2025
Abstract:Many control problems require repeated tuning and adaptation of controllers across distinct closed-loop tasks, where data efficiency and adaptability are critical. We propose a hierarchical Bayesian optimization (BO) framework that is tailored to efficient controller parameter learning in sequential decision-making and control scenarios for distinct tasks. Instead of treating the closed-loop cost as a black-box, our method exploits structural knowledge of the underlying problem, consisting of a dynamical system, a control law, and an associated closed-loop cost function. We construct a hierarchical surrogate model using Gaussian processes that capture the closed-loop state evolution under different parameterizations, while the task-specific weighting and accumulation into the closed-loop cost are computed exactly via known closed-form expressions. This allows knowledge transfer and enhanced data efficiency between different closed-loop tasks. The proposed framework retains sublinear regret guarantees on par with standard black-box BO, while enabling multi-task or transfer learning. Simulation experiments with model predictive control demonstrate substantial benefits in both sample efficiency and adaptability when compared to purely black-box BO approaches.
[LG-24] Unlearning Comparator: A Visual Analytics System for Comparative Evaluation of Machine Unlearning Methods
链接: https://arxiv.org/abs/2508.12730
作者: Jaeung Lee,Suhyeon Yu,Yurim Jang,Simon S. Woo,Jaemin Jo
类目: Cryptography and Security (cs.CR); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注: Submitted to IEEE Transactions on Visualization and Computer Graphics (TVCG), under review. 15 pages. This work has been submitted to the IEEE for possible publication
Abstract:Machine Unlearning (MU) aims to remove target training data from a trained model so that the removed data no longer influences the model’s behavior, fulfilling “right to be forgotten” obligations under data privacy laws. Yet, we observe that researchers in this rapidly emerging field face challenges in analyzing and understanding the behavior of different MU methods, especially in terms of three fundamental principles in MU: accuracy, efficiency, and privacy. Consequently, they often rely on aggregate metrics and ad-hoc evaluations, making it difficult to accurately assess the trade-offs between methods. To fill this gap, we introduce a visual analytics system, Unlearning Comparator, designed to facilitate the systematic evaluation of MU methods. Our system supports two important tasks in the evaluation process: model comparison and attack simulation. First, it allows the user to compare the behaviors of two models, such as a model generated by a certain method and a retrained baseline, at class-, instance-, and layer-levels to better understand the changes made after unlearning. Second, our system simulates membership inference attacks (MIAs) to evaluate the privacy of a method, where an attacker attempts to determine whether specific data samples were part of the original training set. We evaluate our system through a case study visually analyzing prominent MU methods and demonstrate that it helps the user not only understand model behaviors but also gain insights that can inform the improvement of MU methods.
[LG-25] FedSODA: Federated Fine-tuning of LLM s via Similarity Group Pruning and Orchestrated Distillation Alignment
链接: https://arxiv.org/abs/2508.12727
作者: Manning Zhu,Songtao Guo,Pengzhan Zhou,Yansong Ning,Chang Han,Dewen Qiao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Federated fine-tuning (FFT) of large language models (LLMs) has recently emerged as a promising solution to enable domain-specific adaptation while preserving data privacy. Despite its benefits, FFT on resource-constrained clients relies on the high computational and memory demands of full-model fine-tuning, which limits the potential advancement. This paper presents FedSODA, a resource-efficient FFT framework that enables clients to adapt LLMs without accessing or storing the full model. Specifically, we first propose a similarity group pruning (SGP) module, which prunes redundant layers from the full LLM while retaining the most critical layers to preserve the model performance. Moreover, we introduce an orchestrated distillation alignment (ODA) module to reduce gradient divergence between the sub-LLM and the full LLM during FFT. Through the use of the QLoRA, clients only need to deploy quantized sub-LLMs and fine-tune lightweight adapters, significantly reducing local resource requirements. We conduct extensive experiments on three open-source LLMs across a variety of downstream tasks. The experimental results demonstrate that FedSODA reduces communication overhead by an average of 70.6%, decreases storage usage by 75.6%, and improves task accuracy by 3.1%, making it highly suitable for practical FFT applications under resource constraints.
[LG-26] BUILDA: A Thermal Building Data Generation Framework for Transfer Learning
链接: https://arxiv.org/abs/2508.12703
作者: Thomas Krug,Fabian Raisch,Dominik Aimer,Markus Wirnsberger,Ferdinand Sigg,Benjamin Schäfer,Benjamin Tischler
类目: Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Proceedings can be accessed at: this https URL
Abstract:Transfer learning (TL) can improve data-driven modeling of building thermal dynamics. Therefore, many new TL research areas emerge in the field, such as selecting the right source model for TL. However, these research directions require massive amounts of thermal building data which is lacking presently. Neither public datasets nor existing data generators meet the needs of TL research in terms of data quality and quantity. Moreover, existing data generation approaches typically require expert knowledge in building simulation. We present BuilDa, a thermal building data generation framework for producing synthetic data of adequate quality and quantity for TL research. The framework does not require profound building simulation knowledge to generate large volumes of data. BuilDa uses a single-zone Modelica model that is exported as a Functional Mock-up Unit (FMU) and simulated in Python. We demonstrate BuilDa by generating data and utilizing it for pretraining and fine-tuning TL models.
[LG-27] Adaptive Model-Predictive Control of a Soft Continuum Robot Using a Physics-Informed Neural Network Based on Cosserat Rod Theory
链接: https://arxiv.org/abs/2508.12681
作者: Johann Licher,Max Bartholdt,Henrik Krauss,Tim-Lukas Habich,Thomas Seel,Moritz Schappler
类目: Robotics (cs.RO); Machine Learning (cs.LG)
*备注: 20 pages, 15 figures
Abstract:Dynamic control of soft continuum robots (SCRs) holds great potential for expanding their applications, but remains a challenging problem due to the high computational demands of accurate dynamic models. While data-driven approaches like Koopman-operator-based methods have been proposed, they typically lack adaptability and cannot capture the full robot shape, limiting their applicability. This work introduces a real-time-capable nonlinear model-predictive control (MPC) framework for SCRs based on a domain-decoupled physics-informed neural network (DD-PINN) with adaptable bending stiffness. The DD-PINN serves as a surrogate for the dynamic Cosserat rod model with a speed-up factor of 44000. It is also used within an unscented Kalman filter for estimating the model states and bending compliance from end-effector position measurements. We implement a nonlinear evolutionary MPC running at 70 Hz on the GPU. In simulation, it demonstrates accurate tracking of dynamic trajectories and setpoint control with end-effector position errors below 3 mm (2.3% of the actuator’s length). In real-world experiments, the controller achieves similar accuracy and accelerations up to 3.55 m/s2.
[LG-28] DIT: Dimension Reduction View on Optimal NFT Rarity Meters
链接: https://arxiv.org/abs/2508.12671
作者: Dmitry Belousov,Yury Yanovich
类目: Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注:
Abstract:Non-fungible tokens (NFTs) have become a significant digital asset class, each uniquely representing virtual entities such as artworks. These tokens are stored in collections within smart contracts and are actively traded across platforms on Ethereum, Bitcoin, and Solana blockchains. The value of NFTs is closely tied to their distinctive characteristics that define rarity, leading to a growing interest in quantifying rarity within both industry and academia. While there are existing rarity meters for assessing NFT rarity, comparing them can be challenging without direct access to the underlying collection data. The Rating over all Rarities (ROAR) benchmark addresses this challenge by providing a standardized framework for evaluating NFT rarity. This paper explores a dimension reduction approach to rarity design, introducing new performance measures and meters, and evaluates them using the ROAR benchmark. Our contributions to the rarity meter design issue include developing an optimal rarity meter design using non-metric weighted multidimensional scaling, introducing Dissimilarity in Trades (DIT) as a performance measure inspired by dimension reduction techniques, and unveiling the non-interpretable rarity meter DIT, which demonstrates superior performance compared to existing methods.
[LG-29] FlowMol3: Flow Matching for 3D De Novo Small-Molecule Generation
链接: https://arxiv.org/abs/2508.12629
作者: Ian Dunn,David R. Koes
类目: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
*备注:
Abstract:A generative model capable of sampling realistic molecules with desired properties could accelerate chemical discovery across a wide range of applications. Toward this goal, significant effort has focused on developing models that jointly sample molecular topology and 3D structure. We present FlowMol3, an open-source, multi-modal flow matching model that advances the state of the art for all-atom, small-molecule generation. Its substantial performance gains over previous FlowMol versions are achieved without changes to the graph neural network architecture or the underlying flow matching formulation. Instead, FlowMol3’s improvements arise from three architecture-agnostic techniques that incur negligible computational cost: self-conditioning, fake atoms, and train-time geometry distortion. FlowMol3 achieves nearly 100% molecular validity for drug-like molecules with explicit hydrogens, more accurately reproduces the functional group composition and geometry of its training data, and does so with an order of magnitude fewer learnable parameters than comparable methods. We hypothesize that these techniques mitigate a general pathology affecting transport-based generative models, enabling detection and correction of distribution drift during inference. Our results highlight simple, transferable strategies for improving the stability and quality of diffusion- and flow-based molecular generative models.
[LG-30] A Self-Ensemble Inspired Approach for Effective Training of Binary-Weight Spiking Neural Networks
链接: https://arxiv.org/abs/2508.12609
作者: Qingyan Meng,Mingqing Xiao,Zhengyu Ma,Huihui Zhou,Yonghong Tian,Zhouchen Lin
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) are a promising approach to low-power applications on neuromorphic hardware due to their energy efficiency. However, training SNNs is challenging because of the non-differentiable spike generation function. To address this issue, the commonly used approach is to adopt the backpropagation through time framework, while assigning the gradient of the non-differentiable function with some surrogates. Similarly, Binary Neural Networks (BNNs) also face the non-differentiability problem and rely on approximating gradients. However, the deep relationship between these two fields and how their training techniques can benefit each other has not been systematically researched. Furthermore, training binary-weight SNNs is even more difficult. In this work, we present a novel perspective on the dynamics of SNNs and their close connection to BNNs through an analysis of the backpropagation process. We demonstrate that training a feedforward SNN can be viewed as training a self-ensemble of a binary-activation neural network with noise injection. Drawing from this new understanding of SNN dynamics, we introduce the Self-Ensemble Inspired training method for (Binary-Weight) SNNs (SEI-BWSNN), which achieves high-performance results with low latency even for the case of the 1-bit weights. Specifically, we leverage a structure of multiple shortcuts and a knowledge distillation-based training technique to improve the training of (binary-weight) SNNs. Notably, by binarizing FFN layers in a Transformer architecture, our approach achieves 82.52% accuracy on ImageNet with only 2 time steps, indicating the effectiveness of our methodology and the potential of binary-weight SNNs.
[LG-31] A Hybrid Surrogate for Electric Vehicle Parameter Estimation and Power Consumption via Physics-Informed Neural Operators
链接: https://arxiv.org/abs/2508.12602
作者: Hansol Lim,Jongseong Brad Choi,Jee Won Lee,Haeseong Jeoung,Minkyu Han
类目: Machine Learning (cs.LG)
*备注: This preprint corresponding to a manuscript has been submitted to a journal for potential publication
Abstract:We present a hybrid surrogate model for electric vehicle parameter estimation and power consumption. We combine our novel architecture Spectral Parameter Operator built on a Fourier Neural Operator backbone for global context and a differentiable physics module in the forward pass. From speed and acceleration alone, it outputs time-varying motor and regenerative braking efficiencies, as well as aerodynamic drag, rolling resistance, effective mass, and auxiliary power. These parameters drive a physics-embedded estimate of battery power, eliminating any separate physics-residual loss. The modular design lets representations converge to physically meaningful parameters that reflect the current state and condition of the vehicle. We evaluate on real-world logs from a Tesla Model 3, Tesla Model S, and the Kia EV9. The surrogate achieves a mean absolute error of 0.2kW (about 1% of average traction power at highway speeds) for Tesla vehicles and about 0.8kW on the Kia EV9. The framework is interpretable, and it generalizes well to unseen conditions, and sampling rates, making it practical for path optimization, eco-routing, on-board diagnostics, and prognostics health management.
[LG-32] Constructing Invariant and Equivariant Operations by Symmetric Tensor Network
链接: https://arxiv.org/abs/2508.12596
作者: Meng Zhang,Chao Wang,Hao Zhang,Shaojun Dong,Lixin He
类目: Machine Learning (cs.LG)
*备注:
Abstract:Design of neural networks that incorporate symmetry is crucial for geometric deep learning. Central to this effort is the development of invariant and equivariant operations. This works presents a systematic method for constructing valid invariant and equivariant operations. It can handle inputs and outputs in the form of Cartesian tensors with different rank, as well as spherical tensors with different types. In addition, our method features a graphical representation utilizing the symmetric tensor network, which simplifies both the proofs and constructions related to invariant and equivariant functions. We also apply this approach to design the equivariant interaction message for the geometry graph neural network, and equivariant machine learning model to learn the constitutive law of materials.
[LG-33] FLARE: Fast Low-rank Attention Routing Engine
链接: https://arxiv.org/abs/2508.12594
作者: Vedant Puri,Aditya Joglekar,Kevin Ferguson,Yu-hsuan Chen,Yongjie Jessica Zhang,Levent Burak Kara
类目: Machine Learning (cs.LG)
*备注:
Abstract:The quadratic complexity of self-attention limits its applicability and scalability on large unstructured meshes. We introduce Fast Low-rank Attention Routing Engine (FLARE), a linear complexity self-attention mechanism that routes attention through fixed-length latent sequences. Each attention head performs global communication among N tokens by projecting the input sequence onto a fixed length latent sequence of M \ll N tokens using learnable query tokens. By routing attention through a bottleneck sequence, FLARE learns a low-rank form of attention that can be applied at O(NM) cost. FLARE not only scales to unprecedented problem sizes, but also delivers superior accuracy compared to state-of-the-art neural PDE surrogates across diverse benchmarks. We also release a new additive manufacturing dataset to spur further research. Our code is available at this https URL.
[LG-34] Physics-informed deep operator network for traffic state estimation
链接: https://arxiv.org/abs/2508.12593
作者: Zhihao Li,Ting Wang,Guojian Zou,Ruofei Wang,Ye Li
类目: Machine Learning (cs.LG)
*备注: under review in Transportmetrica B: Transport Dynamics
Abstract:Traffic state estimation (TSE) fundamentally involves solving high-dimensional spatiotemporal partial differential equations (PDEs) governing traffic flow dynamics from limited, noisy measurements. While Physics-Informed Neural Networks (PINNs) enforce PDE constraints point-wise, this paper adopts a physics-informed deep operator network (PI-DeepONet) framework that reformulates TSE as an operator learning problem. Our approach trains a parameterized neural operator that maps sparse input data to the full spatiotemporal traffic state field, governed by the traffic flow conservation law. Crucially, unlike PINNs that enforce PDE constraints point-wise, PI-DeepONet integrates traffic flow conservation model and the fundamental diagram directly into the operator learning process, ensuring physical consistency while capturing congestion propagation, spatial correlations, and temporal evolution. Experiments on the NGSIM dataset demonstrate superior performance over state-of-the-art baselines. Further analysis reveals insights into optimal function generation strategies and branch network complexity. Additionally, the impact of input function generation methods and the number of functions on model performance is explored, highlighting the robustness and efficacy of proposed framework.
[LG-35] Data-driven particle dynamics: Structure-preserving coarse-graining for emergent behavior in non-equilibrium systems
链接: https://arxiv.org/abs/2508.12569
作者: Quercus Hernandez,Max Win,Thomas C. O’Connor,Paulo E. Arratia,Nathaniel Trask
类目: Machine Learning (cs.LG); Computational Engineering, Finance, and Science (cs.CE); Computational Physics (physics.comp-ph); Machine Learning (stat.ML)
*备注: 34 pages, 12 figures
Abstract:Multiscale systems are ubiquitous in science and technology, but are notoriously challenging to simulate as short spatiotemporal scales must be appropriately linked to emergent bulk physics. When expensive high-dimensional dynamical systems are coarse-grained into low-dimensional models, the entropic loss of information leads to emergent physics which are dissipative, history-dependent, and stochastic. To machine learn coarse-grained dynamics from time-series observations of particle trajectories, we propose a framework using the metriplectic bracket formalism that preserves these properties by construction; most notably, the framework guarantees discrete notions of the first and second laws of thermodynamics, conservation of momentum, and a discrete fluctuation-dissipation balance crucial for capturing non-equilibrium statistics. We introduce the mathematical framework abstractly before specializing to a particle discretization. As labels are generally unavailable for entropic state variables, we introduce a novel self-supervised learning strategy to identify emergent structural variables. We validate the method on benchmark systems and demonstrate its utility on two challenging examples: (1) coarse-graining star polymers at challenging levels of coarse-graining while preserving non-equilibrium statistics, and (2) learning models from high-speed video of colloidal suspensions that capture coupling between local rearrangement events and emergent stochastic dynamics. We provide open-source implementations in both PyTorch and LAMMPS, enabling large-scale inference and extensibility to diverse particle-based systems.
[LG-36] Deep Learning-Based Financial Time Series Forecasting via Sliding Window and Variational Mode Decomposition
链接: https://arxiv.org/abs/2508.12565
作者: Luke Li
类目: Machine Learning (cs.LG)
*备注:
Abstract:To address the complexity of financial time series, this paper proposes a forecasting model combining sliding window and variational mode decomposition (VMD) methods. Historical stock prices and relevant market indicators are used to construct datasets. VMD decomposes non-stationary financial time series into smoother subcomponents, improving model adaptability. The decomposed data is then input into a deep learning model for prediction. The study compares the forecasting effects of an LSTM model trained on VMD-processed sequences with those using raw time series, demonstrating better performance and stability.
[LG-37] Data-driven Trust Bootstrapping for Mobile Edge Computing-based Industrial IoT Services
链接: https://arxiv.org/abs/2508.12560
作者: Prabath Abeysekara,Hai Dong
类目: Cryptography and Security (cs.CR); Distributed, Parallel, and Cluster Computing (cs.DC); Machine Learning (cs.LG)
*备注: 15 pages
Abstract:We propose a data-driven and context-aware approach to bootstrap trustworthiness of homogeneous Internet of Things (IoT) services in Mobile Edge Computing (MEC) based industrial IoT (IIoT) systems. The proposed approach addresses key limitations in adapting existing trust bootstrapping approaches into MEC-based IIoT systems. These key limitations include, the lack of opportunity for a service consumer to interact with a lesser-known service over a prolonged period of time to get a robust measure of its trustworthiness, inability of service consumers to consistently interact with their peers to receive reliable recommendations of the trustworthiness of a lesser-known service as well as the impact of uneven context parameters in different MEC environments causing uneven trust environments for trust evaluation. In addition, the proposed approach also tackles the problem of data sparsity via enabling knowledge sharing among different MEC environments within a given MEC topology. To verify the effectiveness of the proposed approach, we carried out a comprehensive evaluation on two real-world datasets suitably adjusted to exhibit the context-dependent trust information accumulated in MEC environments within a given MEC topology. The experimental results affirmed the effectiveness of our approach and its suitability to bootstrap trustworthiness of services in MEC-based IIoT systems.
[LG-38] Illuminating LLM Coding Agents : Visual Analytics for Deeper Understanding and Enhancement
链接: https://arxiv.org/abs/2508.12555
作者: Junpeng Wang,Yuzhong Chen,Menghai Pan,Chin-Chia Michael Yeh,Mahashweta Das
类目: Machine Learning (cs.LG)
*备注: 11 pages, 10 figures
Abstract:Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain, AutoML, and AIDE, ML scientists still struggle to effectively review and adjust the agents’ coding process. The current approach of manually inspecting individual outputs is inefficient, making it difficult to track code evolution, compare coding iterations, and identify improvement opportunities. To address this challenge, we introduce a visual analytics system designed to enhance the examination of coding agent behaviors. Focusing on the AIDE framework, our system supports comparative analysis across three levels: (1) Code-Level Analysis, which reveals how the agent debugs and refines its code over iterations; (2) Process-Level Analysis, which contrasts different solution-seeking processes explored by the agent; and (3) LLM-Level Analysis, which highlights variations in coding behavior across different LLMs. By integrating these perspectives, our system enables ML scientists to gain a structured understanding of agent behaviors, facilitating more effective debugging and prompt engineering. Through case studies using coding agents to tackle popular Kaggle competitions, we demonstrate how our system provides valuable insights into the iterative coding process.
[LG-39] Results of the NeurIPS 2023 Neural MMO Competition on Multi-task Reinforcement Learning
链接: https://arxiv.org/abs/2508.12524
作者: Joseph Suárez,Kyoung Whan Choe,David Bloomin,Jianming Gao,Yunkun Li,Yao Feng,Saidinesh Pola,Kun Zhang,Yonghui Zhu,Nikhil Pinnaparaju,Hao Xiang Li,Nishaanth Kanna,Daniel Scott,Ryan Sullivan,Rose S. Shuman,Lucas de Alcântara,Herbie Bradley,Kirsty You,Bo Wu,Yuhao Jiang,Qimai Li,Jiaxin Chen,Louis Castricato,Xiaolong Zhu,Phillip Isola
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present the results of the NeurIPS 2023 Neural MMO Competition, which attracted over 200 participants and submissions. Participants trained goal-conditional policies that generalize to tasks, maps, and opponents never seen during training. The top solution achieved a score 4x higher than our baseline within 8 hours of training on a single 4090 GPU. We open-source everything relating to Neural MMO and the competition under the MIT license, including the policy weights and training code for our baseline and for the top submissions.
[LG-40] rust Region Constrained Measure Transport in Path Space for Stochastic Optimal Control and Inference
链接: https://arxiv.org/abs/2508.12511
作者: Denis Blessing,Julius Berner,Lorenz Richter,Carles Domingo-Enrich,Yuanqi Du,Arash Vahdat,Gerhard Neumann
类目: Machine Learning (cs.LG)
*备注:
Abstract:Solving stochastic optimal control problems with quadratic control costs can be viewed as approximating a target path space measure, e.g. via gradient-based optimization. In practice, however, this optimization is challenging in particular if the target measure differs substantially from the prior. In this work, we therefore approach the problem by iteratively solving constrained problems incorporating trust regions that aim for approaching the target measure gradually in a systematic way. It turns out that this trust region based strategy can be understood as a geometric annealing from the prior to the target measure, where, however, the incorporated trust regions lead to a principled and educated way of choosing the time steps in the annealing path. We demonstrate in multiple optimal control applications that our novel method can improve performance significantly, including tasks in diffusion-based sampling, transition path sampling, and fine-tuning of diffusion models.
[LG-41] Cost-Aware Contrastive Routing for LLM s
链接: https://arxiv.org/abs/2508.12491
作者: Reza Shirkavand,Shangqian Gao,Peiran Yu,Heng Huang
类目: Machine Learning (cs.LG)
*备注:
Abstract:We study cost-aware routing for large language models across diverse and dynamic pools of models. Existing approaches often overlook prompt-specific context, rely on expensive model profiling, assume a fixed set of experts, or use inefficient trial-and-error strategies. We introduce Cost-Spectrum Contrastive Routing (CSCR), a lightweight framework that maps both prompts and models into a shared embedding space to enable fast, cost-sensitive selection. CSCR uses compact, fast-to-compute logit footprints for open-source models and perplexity fingerprints for black-box APIs. A contrastive encoder is trained to favor the cheapest accurate expert within adaptive cost bands. At inference time, routing reduces to a single k-NN lookup via a FAISS index, requiring no retraining when the expert pool changes and enabling microsecond latency. Across multiple benchmarks, CSCR consistently outperforms baselines, improving the accuracy-cost tradeoff by up to 25%, while generalizing robustly to unseen LLMs and out-of-distribution prompts.
[LG-42] Local Cluster Cardinality Estimation for Adaptive Mean Shift
链接: https://arxiv.org/abs/2508.12450
作者: Étienne Pepin
类目: Machine Learning (cs.LG)
*备注: 24 pages, 9 figures
Abstract:This article presents an adaptive mean shift algorithm designed for datasets with varying local scale and cluster cardinality. Local distance distributions, from a point to all others, are used to estimate the cardinality of the local cluster by identifying a local minimum in the density of the distance distribution. Based on these cardinality estimates, local cluster parameters are then computed for the entire cluster in contrast to KDE-based methods, which provide insight only into localized regions of the cluster. During the mean shift execution, the cluster cardinality estimate is used to adaptively adjust the bandwidth and the mean shift kernel radius threshold. Our algorithm outperformed a recently proposed adaptive mean shift method on its original dataset and demonstrated competitive performance on a broader clustering benchmark.
[LG-43] Machine Learning-Based Manufacturing Cost Prediction from 2D Engineering Drawings via Geometric Features
链接: https://arxiv.org/abs/2508.12440
作者: Ahmet Bilal Arıkan,Şener Özönder,Mustafa Taha Koçyiğit,Hüseyin Oktay Altun,H. Kübra Küçükkartal,Murat Arslanoğlu,Fatih Çağırankaya,Berk Ayvaz
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present an integrated machine learning framework that transforms how manufacturing cost is estimated from 2D engineering drawings. Unlike traditional quotation workflows that require labor-intensive process planning, our approach about 200 geometric and statistical descriptors directly from 13,684 DWG drawings of automotive suspension and steering parts spanning 24 product groups. Gradient-boosted decision tree models (XGBoost, CatBoost, LightGBM) trained on these features achieve nearly 10% mean absolute percentage error across groups, demonstrating robust scalability beyond part-specific heuristics. By coupling cost prediction with explainability tools such as SHAP, the framework identifies geometric design drivers including rotated dimension maxima, arc statistics and divergence metrics, offering actionable insights for cost-aware design. This end-to-end CAD-to-cost pipeline shortens quotation lead times, ensures consistent and transparent cost assessments across part families and provides a deployable pathway toward real-time, ERP-integrated decision support in Industry 4.0 manufacturing environments.
[LG-44] Bi-Axial Transformers: Addressing the Increasing Complexity of EHR Classification
链接: https://arxiv.org/abs/2508.12418
作者: Rachael DeVries,Casper Christensen,Marie Lisandra Zepeda Mendoza,Ole Winther
类目: Machine Learning (cs.LG)
*备注: 18 pages, 7 figures. Submitted to the IEEE for possible publication
Abstract:Electronic Health Records (EHRs), the digital representation of a patient’s medical history, are a valuable resource for epidemiological and clinical research. They are also becoming increasingly complex, with recent trends indicating larger datasets, longer time series, and multi-modal integrations. Transformers, which have rapidly gained popularity due to their success in natural language processing and other domains, are well-suited to address these challenges due to their ability to model long-range dependencies and process data in parallel. But their application to EHR classification remains limited by data representations, which can reduce performance or fail to capture informative missingness. In this paper, we present the Bi-Axial Transformer (BAT), which attends to both the clinical variable and time point axes of EHR data to learn richer data relationships and address the difficulties of data sparsity. BAT achieves state-of-the-art performance on sepsis prediction and is competitive to top methods for mortality classification. In comparison to other transformers, BAT demonstrates increased robustness to data missingness, and learns unique sensor embeddings which can be used in transfer learning. Baseline models, which were previously located across multiple repositories or utilized deprecated libraries, were re-implemented with PyTorch and made available for reproduction and future benchmarking.
[LG-45] Convergence Analysis of the Lion Optimizer in Centralized and Distributed Settings
链接: https://arxiv.org/abs/2508.12327
作者: Wei Jiang,Lijun Zhang
类目: Machine Learning (cs.LG); Optimization and Control (math.OC)
*备注:
Abstract:In this paper, we analyze the convergence properties of the Lion optimizer. First, we establish that the Lion optimizer attains a convergence rate of \mathcalO(d^1/2T^-1/4) under standard assumptions, where d denotes the problem dimension and T is the iteration number. To further improve this rate, we introduce the Lion optimizer with variance reduction, resulting in an enhanced convergence rate of \mathcalO(d^1/2T^-1/3) . We then analyze in distributed settings, where the standard and variance reduced version of the distributed Lion can obtain the convergence rates of \mathcalO(d^1/2(nT)^-1/4) and \mathcalO(d^1/2(nT)^-1/3) , with n denoting the number of nodes. Furthermore, we investigate a communication-efficient variant of the distributed Lion that ensures sign compression in both communication directions. By employing the unbiased sign operations, the proposed Lion variant and its variance reduction counterpart, achieve convergence rates of \mathcalO\left( \max \left\fracd^1/4T^1/4, \fracd^1/10n^1/5T^1/5 \right\ \right) and \mathcalO\left( \fracd^1/4T^1/4 \right) , respectively.
[LG-46] DHG-Bench: A Comprehensive Benchmark on Deep Hypergraph Learning
链接: https://arxiv.org/abs/2508.12244
作者: Fan Li,Xiaoyang Wang,Wenjie Zhang,Ying Zhang,Xuemin Lin
类目: Machine Learning (cs.LG)
*备注: 22 pages, 5 figures
Abstract:Although conventional deep graph models have achieved great success in relational learning, their focus on pairwise relationships limits their capacity to learn pervasive higher-order interactions in real-world complex systems, which can be naturally modeled as hypergraphs. To tackle this, hypergraph neural networks (HNNs), the dominant approach in deep hypergraph learning (DHGL), has garnered substantial attention in recent years. Despite the proposal of numerous HNN methods, there is no comprehensive benchmark for HNNs, which creates a great obstacle to understanding the progress of DHGL in several aspects: (i) insufficient coverage of datasets, algorithms, and tasks; (ii) a narrow evaluation of algorithm performance; and (iii) inconsistent dataset usage, preprocessing, and experimental setups that hinder comparability. To fill the gap, we introduce DHG-Bench, the first comprehensive benchmark for DHGL. Specifically, DHG-Bench integrates 20 diverse datasets spanning node-, edge-, and graph-level tasks, along with 16 state-of-the-art HNN algorithms, under consistent data processing and experimental protocols. Our benchmark systematically investigates the characteristics of HNNs in terms of four dimensions: effectiveness, efficiency, robustness, and fairness. Further, to facilitate reproducible research, we have developed an easy-to-use library for training and evaluating different HNN methods. Extensive experiments conducted with DHG-Bench reveal both the strengths and inherent limitations of existing algorithms, offering valuable insights and directions for future research. The code is publicly available at: this https URL.
[LG-47] CC-Time: Cross-Model and Cross-Modality Time Series Forecasting
链接: https://arxiv.org/abs/2508.12235
作者: Peng Chen,Yihang Wang,Yang Shu,Yunyao Cheng,Kai Zhao,Zhongwen Rao,Lujia Pan,Bin Yang,Chenjuan Guo
类目: Machine Learning (cs.LG)
*备注:
Abstract:With the success of pre-trained language models (PLMs) in various application fields beyond natural language processing, language models have raised emerging attention in the field of time series forecasting (TSF) and have shown great prospects. However, current PLM-based TSF methods still fail to achieve satisfactory prediction accuracy matching the strong sequential modeling power of language models. To address this issue, we propose Cross-Model and Cross-Modality Learning with PLMs for time series forecasting (CC-Time). We explore the potential of PLMs for time series forecasting from two aspects: 1) what time series features could be modeled by PLMs, and 2) whether relying solely on PLMs is sufficient for building time series models. In the first aspect, CC-Time incorporates cross-modality learning to model temporal dependency and channel correlations in the language model from both time series sequences and their corresponding text descriptions. In the second aspect, CC-Time further proposes the cross-model fusion block to adaptively integrate knowledge from the PLMs and time series model to form a more comprehensive modeling of time series patterns. Extensive experiments on nine real-world datasets demonstrate that CC-Time achieves state-of-the-art prediction accuracy in both full-data training and few-shot learning situations.
[LG-48] Communication-Efficient Distributed Asynchronous ADMM
链接: https://arxiv.org/abs/2508.12233
作者: Sagar Shrestha
类目: Machine Learning (cs.LG); Signal Processing (eess.SP)
*备注:
Abstract:In distributed optimization and federated learning, asynchronous alternating direction method of multipliers (ADMM) serves as an attractive option for large-scale optimization, data privacy, straggler nodes and variety of objective functions. However, communication costs can become a major bottleneck when the nodes have limited communication budgets or when the data to be communicated is prohibitively large. In this work, we propose introducing coarse quantization to the data to be exchanged in aynchronous ADMM so as to reduce communication overhead for large-scale federated learning and distributed optimization applications. We experimentally verify the convergence of the proposed method for several distributed learning tasks, including neural networks.
[LG-49] Belief-Conditioned One-Step Diffusion: Real-Time Trajectory Planning with Just-Enough Sensing
链接: https://arxiv.org/abs/2508.12166
作者: Gokul Puthumanaillam,Aditya Penumarti,Manav Vora,Paulo Padrao,Jose Fuentes,Leonardo Bobadilla,Jane Shin,Melkior Ornik
类目: Robotics (cs.RO); Machine Learning (cs.LG); Systems and Control (eess.SY)
*备注: Accepted to CoRL 2025 (Conference on Robot Learning)
Abstract:Robots equipped with rich sensor suites can localize reliably in partially-observable environments, but powering every sensor continuously is wasteful and often infeasible. Belief-space planners address this by propagating pose-belief covariance through analytic models and switching sensors heuristically–a brittle, runtime-expensive approach. Data-driven approaches–including diffusion models–learn multi-modal trajectories from demonstrations, but presuppose an accurate, always-on state estimate. We address the largely open problem: for a given task in a mapped environment, which \textitminimal sensor subset must be active at each location to maintain state uncertainty \textitjust low enough to complete the task? Our key insight is that when a diffusion planner is explicitly conditioned on a pose-belief raster and a sensor mask, the spread of its denoising trajectories yields a calibrated, differentiable proxy for the expected localisation error. Building on this insight, we present Belief-Conditioned One-Step Diffusion (B-COD), the first planner that, in a 10 ms forward pass, returns a short-horizon trajectory, per-waypoint aleatoric variances, and a proxy for localisation error–eliminating external covariance rollouts. We show that this single proxy suffices for a soft-actor-critic to choose sensors online, optimising energy while bounding pose-covariance growth. We deploy B-COD in real-time marine trials on an unmanned surface vehicle and show that it reduces sensing energy consumption while matching the goal-reach performance of an always-on baseline.
[LG-50] DE-VAE: Revealing Uncertainty in Parametric and Inverse Projections with Variational Autoencoders using Differential Entropy
链接: https://arxiv.org/abs/2508.12145
作者: Frederik L. Dennig,Daniel A. Keim
类目: Machine Learning (cs.LG)
*备注: 5 pages, 3 figures, LaTeX
Abstract:Recently, autoencoders (AEs) have gained interest for creating parametric and invertible projections of multidimensional data. Parametric projections make it possible to embed new, unseen samples without recalculating the entire projection, while invertible projections allow the synthesis of new data instances. However, existing methods perform poorly when dealing with out-of-distribution samples in either the data or embedding space. Thus, we propose DE-VAE, an uncertainty-aware variational AE using differential entropy (DE) to improve the learned parametric and invertible projections. Given a fixed projection, we train DE-VAE to learn a mapping into 2D space and an inverse mapping back to the original space. We conduct quantitative and qualitative evaluations on four well-known datasets, using UMAP and t-SNE as baseline projection methods. Our findings show that DE-VAE can create parametric and inverse projections with comparable accuracy to other current AE-based approaches while enabling the analysis of embedding uncertainty.
[LG-51] me-Scale Coupling Between States and Parameters in Recurrent Neural Networks
链接: https://arxiv.org/abs/2508.12121
作者: Lorenzo Livi
类目: Machine Learning (cs.LG); Dynamical Systems (math.DS)
*备注:
Abstract:We study how gating mechanisms in recurrent neural networks (RNNs) implicitly induce adaptive learning-rate behavior, even when training is carried out with a fixed, global learning rate. This effect arises from the coupling between state-space time scales–parametrized by the gates–and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs, we obtain a first-order expansion that makes explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates not only control memory retention in the hidden states, but also act as data-driven preconditioners that adapt optimization trajectories in parameter space. We further draw formal analogies with learning-rate schedules, momentum, and adaptive methods such as Adam, showing that these optimization behaviors emerge naturally from gating. Numerical experiments confirm the validity of our perturbative analysis, supporting the view that gate-induced corrections remain small while exerting systematic effects on training dynamics. Overall, this work provides a unified dynamical-systems perspective on how gating couples state evolution with parameter updates, explaining why gated architectures achieve robust trainability and stability in practice.
[LG-52] Content Accuracy and Quality Aware Resource Allocation Based on LP-Guided DRL for ISAC-Driven AIGC Networks
链接: https://arxiv.org/abs/2508.12079
作者: Ningzhe Shi,Yiqing Zhou,Ling Liu,Jinglin Shi,Yihao Wu,Haiwei Shi,Hanxiao Yu
类目: Machine Learning (cs.LG)
*备注:
Abstract:Integrated sensing and communication (ISAC) can enhance artificial intelligence-generated content (AIGC) networks by providing efficient sensing and transmission. Existing AIGC services usually assume that the accuracy of the generated content can be ensured, given accurate input data and prompt, thus only the content generation quality (CGQ) is concerned. However, it is not applicable in ISAC-based AIGC networks, where content generation is based on inaccurate sensed data. Moreover, the AIGC model itself introduces generation errors, which depend on the number of generating steps (i.e., computing resources). To assess the quality of experience of ISAC-based AIGC services, we propose a content accuracy and quality aware service assessment metric (CAQA). Since allocating more resources to sensing and generating improves content accuracy but may reduce communication quality, and vice versa, this sensing-generating (computing)-communication three-dimensional resource tradeoff must be optimized to maximize the average CAQA (AvgCAQA) across all users with AIGC (CAQA-AIGC). This problem is NP-hard, with a large solution space that grows exponentially with users. To solve the CAQA-AIGC problem with low complexity, a linear programming (LP) guided deep reinforcement learning (DRL) algorithm with an action filter (LPDRL-F) is proposed. Through the LP-guided approach and the action filter, LPDRL-F can transform the original three-dimensional solution space to two dimensions, reducing complexity while improving the learning performance of DRL. Simulations show that compared to existing DRL and generative diffusion model algorithms without LP, LPDRL-F converges faster by over 60% and finds better resource allocation solutions, improving AvgCAQA by more than 14%. With LPDRL-F, CAQA-AIGC can achieve an improvement in AvgCAQA of more than 50% compared to existing schemes focusing solely on CGQ.
[LG-53] VARAN: Variational Inference for Self-Supervised Speech Models Fine-Tuning on Downstream Tasks
链接: https://arxiv.org/abs/2508.12061
作者: Daria Diatlova,Nikita Balagansky,Alexander Varlamov,Egor Spirin
类目: Machine Learning (cs.LG)
*备注:
Abstract:Conventional methods for aggregating layers in fine-tuned self-supervised speech models, such as using the final layer or weighted sum, suffer from information bottlenecks and static feature weighting for all dataset examples. We propose VARAN, a framework that dynamically tailors layer aggregation to individual inputs. By employing layer-specialized probing heads and data-dependent weighting, VARAN adaptively prioritizes layer’s features based on input. Evaluations on automatic speech recognition and speech emotion recognition tasks demonstrate VARAN’s superior performance, particularly when using the LoRA fine-tuning technique. The framework resolves the trade-off between preserving layer-specific information and enabling flexible feature utilization, advancing efficient adaptation of self-supervised speech representations.
[LG-54] Fairness Regularization in Federated Learning
链接: https://arxiv.org/abs/2508.12042
作者: Zahra Kharaghani,Ali Dadras,Tommy Löfstedt
类目: Machine Learning (cs.LG)
*备注: 25 pages
Abstract:Federated Learning (FL) has emerged as a vital paradigm in modern machine learning that enables collaborative training across decentralized data sources without exchanging raw data. This approach not only addresses privacy concerns but also allows access to overall substantially larger and potentially more diverse datasets, without the need for centralized storage or hardware resources. However, heterogeneity in client data may cause certain clients to have disproportionate impacts on the global model, leading to disparities in the clients’ performances. Fairness, therefore, becomes a crucial concern in FL and can be addressed in various ways. However, the effectiveness of existing fairness-aware methods, particularly in heterogeneous data settings, remains unclear, and the relationships between different approaches are not well understood. In this work, we focus on performance equitable fairness, which aims to minimize differences in performance across clients. We restrict our study to fairness-aware methods that explicitly regularize client losses, evaluating both existing and newly proposed approaches. We identify and theoretically explain connections between the investigated fairness methods, and empirically show that FairGrad (approximate) and FairGrad* (exact) (two variants of a gradient variance regularization method introduced here for performance equitable fairness) improve both fairness and overall model performance in heterogeneous data settings.
[LG-55] FedUHD: Unsupervised Federated Learning using Hyperdimensional Computing
链接: https://arxiv.org/abs/2508.12021
作者: You Hak Lee,Xiaofan Yu,Quanling Zhao,Flavio Ponzina,Tajana Rosing
类目: Machine Learning (cs.LG); Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC)
*备注:
Abstract:Unsupervised federated learning (UFL) has gained attention as a privacy-preserving, decentralized machine learning approach that eliminates the need for labor-intensive data labeling. However, UFL faces several challenges in practical applications: (1) non-independent and identically distributed (non-iid) data distribution across devices, (2) expensive computational and communication costs at the edge, and (3) vulnerability to communication noise. Previous UFL approaches have relied on deep neural networks (NN), which introduce substantial overhead in both computation and communication. In this paper, we propose FedUHD, the first UFL framework based on Hyperdimensional Computing (HDC). HDC is a brain-inspired computing scheme with lightweight training and inference operations, much smaller model size, and robustness to communication noise. FedUHD introduces two novel HDC-based designs to improve UFL performance. On the client side, a kNN-based cluster hypervector removal method addresses non-iid data samples by eliminating detrimental outliers. On the server side, a weighted HDC aggregation technique balances the non-iid data distribution across clients. Our experiments demonstrate that FedUHD achieves up to 173.6x and 612.7x better speedup and energy efficiency, respectively, in training, up to 271x lower communication cost, and 15.50% higher accuracy on average across diverse settings, along with superior robustness to various types of noise compared to state-of-the-art NN-based UFL approaches.
[LG-56] Optimizing Neural Architectures for Hindi Speech Separation and Enhancement in Noisy Environments
链接: https://arxiv.org/abs/2508.12009
作者: Arnav Ramamoorthy
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注: ICAD 2025
Abstract:This paper addresses the challenges of Hindi speech separation and enhancement using advanced neural network architectures, with a focus on edge devices. We propose a refined approach leveraging the DEMUCS model to overcome limitations of traditional methods, achieving substantial improvements in speech clarity and intelligibility. The model is fine-tuned with U-Net and LSTM layers, trained on a dataset of 400,000 Hindi speech clips augmented with ESC-50 and MS-SNSD for diverse acoustic environments. Evaluation using PESQ and STOI metrics shows superior performance, particularly under extreme noise conditions. To ensure deployment on resource-constrained devices like TWS earbuds, we explore quantization techniques to reduce computational requirements. This research highlights the effectiveness of customized AI algorithms for speech processing in Indian contexts and suggests future directions for optimizing edge-based architectures.
[LG-57] Universal Learning of Nonlinear Dynamics
链接: https://arxiv.org/abs/2508.11990
作者: Evan Dogariu,Anand Brahmbhatt,Elad Hazan
类目: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
*备注:
Abstract:We study the fundamental problem of learning a marginally stable unknown nonlinear dynamical system. We describe an algorithm for this problem, based on the technique of spectral filtering, which learns a mapping from past observations to the next based on a spectral representation of the system. Using techniques from online convex optimization, we prove vanishing prediction error for any nonlinear dynamical system that has finitely many marginally stable modes, with rates governed by a novel quantitative control-theoretic notion of learnability. The main technical component of our method is a new spectral filtering algorithm for linear dynamical systems, which incorporates past observations and applies to general noisy and marginally stable systems. This significantly generalizes the original spectral filtering algorithm to both asymmetric dynamics as well as incorporating noise correction, and is of independent interest.
[LG-58] Leverag ing Geometric Insights in Hyperbolic Triplet Loss for Improved Recommendations
链接: https://arxiv.org/abs/2508.11978
作者: Viacheslav Yusupov,Maxim Rakhuba,Evgeny Frolov
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Recent studies have demonstrated the potential of hyperbolic geometry for capturing complex patterns from interaction data in recommender systems. In this work, we introduce a novel hyperbolic recommendation model that uses geometrical insights to improve representation learning and increase computational stability at the same time. We reformulate the notion of hyperbolic distances to unlock additional representation capacity over conventional Euclidean space and learn more expressive user and item representations. To better capture user-items interactions, we construct a triplet loss that models ternary relations between users and their corresponding preferred and nonpreferred choices through a mix of pairwise interaction terms driven by the geometry of data. Our hyperbolic approach not only outperforms existing Euclidean and hyperbolic models but also reduces popularity bias, leading to more diverse and personalized recommendations.
[LG-59] Set-Valued Transformer Network for High-Emission Mobile Source Identification
链接: https://arxiv.org/abs/2508.11976
作者: Yunning Cao,Lihong Pei,Jian Guo,Yang Cao,Yu Kang,Yanlong Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Identifying high-emission vehicles is a crucial step in regulating urban pollution levels and formulating traffic emission reduction strategies. However, in practical monitoring data, the proportion of high-emission state data is significantly lower compared to normal emission states. This characteristic long-tailed distribution severely impedes the extraction of discriminative features for emission state identification during data mining. Furthermore, the highly nonlinear nature of vehicle emission states and the lack of relevant prior knowledge also pose significant challenges to the construction of identification this http URL address the aforementioned issues, we propose a Set-Valued Transformer Network (SVTN) to achieve comprehensive learning of discriminative features from high-emission samples, thereby enhancing detection accuracy. Specifically, this model first employs the transformer to measure the temporal similarity of micro-trip condition variations, thus constructing a mapping rule that projects the original high-dimensional emission data into a low-dimensional feature space. Next, a set-valued identification algorithm is used to probabilistically model the relationship between the generated feature vectors and their labels, providing an accurate metric criterion for the classification algorithm. To validate the effectiveness of our proposed approach, we conducted extensive experiments on the diesel vehicle monitoring data of Hefei city in 2020. The results demonstrate that our method achieves a 9.5% reduction in the missed detection rate for high-emission vehicles compared to the transformer-based baseline, highlighting its superior capability in accurately identifying high-emission mobile pollution sources.
[LG-60] Learning Marked Temporal Point Process Explanations based on Counterfactual and Factual Reasoning ECAI2025
链接: https://arxiv.org/abs/2508.11943
作者: Sishun Liu,Ke Deng,Xiuzhen Zhang,Yan Wang
类目: Machine Learning (cs.LG)
*备注: ECAI 2025 full version
Abstract:Neural network-based Marked Temporal Point Process (MTPP) models have been widely adopted to model event sequences in high-stakes applications, raising concerns about the trustworthiness of outputs from these models. This study focuses on Explanation for MTPP, aiming to identify the minimal and rational explanation, that is, the minimum subset of events in history, based on which the prediction accuracy of MTPP matches that based on full history to a great extent and better than that based on the complement of the subset. This study finds that directly defining Explanation for MTPP as counterfactual explanation or factual explanation can result in irrational explanations. To address this issue, we define Explanation for MTPP as a combination of counterfactual explanation and factual explanation. This study proposes Counterfactual and Factual Explainer for MTPP (CFF) to solve Explanation for MTPP with a series of deliberately designed techniques. Experiments demonstrate the correctness and superiority of CFF over baselines regarding explanation quality and processing efficiency.
[LG-61] M3OOD: Automatic Selection of Multimodal OOD Detectors
链接: https://arxiv.org/abs/2508.11936
作者: Yuehan Qin,Li Li,Defu Cao,Tiankai Yang,Yue Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Out-of-distribution (OOD) robustness is a critical challenge for modern machine learning systems, particularly as they increasingly operate in multimodal settings involving inputs like video, audio, and sensor data. Currently, many OOD detection methods have been proposed, each with different designs targeting various distribution shifts. A single OOD detector may not prevail across all the scenarios; therefore, how can we automatically select an ideal OOD detection model for different distribution shifts? Due to the inherent unsupervised nature of the OOD detection task, it is difficult to predict model performance and find a universally Best model. Also, systematically comparing models on the new unseen data is costly or even impractical. To address this challenge, we introduce M3OOD, a meta-learning-based framework for OOD detector selection in multimodal settings. Meta learning offers a solution by learning from historical model behaviors, enabling rapid adaptation to new data distribution shifts with minimal supervision. Our approach combines multimodal embeddings with handcrafted meta-features that capture distributional and cross-modal characteristics to represent datasets. By leveraging historical performance across diverse multimodal benchmarks, M3OOD can recommend suitable detectors for a new data distribution shift. Experimental evaluation demonstrates that M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.
[LG-62] An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction
链接: https://arxiv.org/abs/2508.11931
作者: Tim van Erven,Jack Mayo,Julia Olkhovskaya,Chen-Yu Wei
类目: Machine Learning (cs.LG)
*备注:
Abstract:We present an efficient algorithm for linear contextual bandits with adversarial losses and stochastic action sets. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves \tildeO(\min\d^2\sqrtT, \sqrtd^3T\log K) regret and runs in \textpoly(d,C,T) time, where d is the feature dimension, C is an upper bound on the number of linear constraints defining the action set in each round, K is an upper bound on the number of actions in each round, and T is number of rounds. This resolves the open question by Liu et al. (2023) on whether one can obtain \textpoly(d)\sqrtT regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets where the action sets can be described by a polynomial number of linear constraints, our algorithm is the first to achieve \textpoly(d)\sqrtT regret in polynomial time, while no prior algorithm achieves even o(T) regret in polynomial time to our knowledge. When a simulator is available, the regret bound can be improved to \tildeO(d\sqrtL^\star) , where L^\star is the cumulative loss of the best policy.
[LG-63] Scale-Disentangled spatiotemporal Modeling for Long-term Traffic Emission Forecasting
链接: https://arxiv.org/abs/2508.11923
作者: Yan Wu,Lihong Pei,Yukai Han,Yang Cao,Yu Kang,Yanlong Zhao
类目: Machine Learning (cs.LG)
*备注:
Abstract:Long-term traffic emission forecasting is crucial for the comprehensive management of urban air pollution. Traditional forecasting methods typically construct spatiotemporal graph models by mining spatiotemporal dependencies to predict emissions. However, due to the multi-scale entanglement of traffic emissions across time and space, these spatiotemporal graph modeling method tend to suffer from cascading error amplification during long-term inference. To address this issue, we propose a Scale-Disentangled Spatio-Temporal Modeling (SDSTM) framework for long-term traffic emission forecasting. It leverages the predictability differences across multiple scales to decompose and fuse features at different scales, while constraining them to remain independent yet complementary. Specifically, the model first introduces a dual-stream feature decomposition strategy based on the Koopman lifting operator. It lifts the scale-coupled spatiotemporal dynamical system into an infinite-dimensional linear space via Koopman operator, and delineates the predictability boundary using gated wavelet decomposition. Then a novel fusion mechanism is constructed, incorporating a dual-stream independence constraint based on cross-term loss to dynamically refine the dual-stream prediction results, suppress mutual interference, and enhance the accuracy of long-term traffic emission prediction. Extensive experiments conducted on a road-level traffic emission dataset within Xi’an’s Second Ring Road demonstrate that the proposed model achieves state-of-the-art performance.
[LG-64] Reduced-order modeling of Hamiltonian dynamics based on symplectic neural networks
链接: https://arxiv.org/abs/2508.11911
作者: Yongsheng Chen,Wei Guo,Qi Tang,Xinghui Zhong
类目: Numerical Analysis (math.NA); Machine Learning (cs.LG); Computational Physics (physics.comp-ph)
*备注:
Abstract:We introduce a novel data-driven symplectic induced-order modeling (ROM) framework for high-dimensional Hamiltonian systems that unifies latent-space discovery and dynamics learning within a single, end-to-end neural architecture. The encoder-decoder is built from Henon neural networks (HenonNets) and may be augmented with linear SGS-reflector layers. This yields an exact symplectic map between full and latent phase spaces. Latent dynamics are advanced by a symplectic flow map implemented as a HenonNet. This unified neural architecture ensures exact preservation of the underlying symplectic structure at the reduced-order level, significantly enhancing the fidelity and long-term stability of the resulting ROM. We validate our method through comprehensive numerical experiments on canonical Hamiltonian systems. The results demonstrate the method’s capability for accurate trajectory reconstruction, robust predictive performance beyond the training horizon, and accurate Hamiltonian preservation. These promising outcomes underscore the effectiveness and potential applicability of our symplectic ROM framework for complex dynamical systems across a broad range of scientific and engineering disciplines.
[LG-65] PCA- and SVM-Grad-CAM for Convolutional Neural Networks: Closed-form Jacobian Expression
链接: https://arxiv.org/abs/2508.11880
作者: Yuto Omae
类目: Machine Learning (cs.LG)
*备注: 15 pages
Abstract:Convolutional Neural Networks (CNNs) are an effective approach for classification tasks, particularly when the training dataset is large. Although CNNs have long been considered a black-box classification method, they can be used as a white-box method through visualization techniques such as Grad-CAM. When training samples are limited, incorporating a Principal Component Analysis (PCA) layer and/or a Support Vector Machine (SVM) classifier into a CNN can effectively improve classification performance. However, traditional Grad-CAM cannot be directly applied to PCA and/or SVM layers. It is important to generate attention regions for PCA and/or SVM layers in CNNs to facilitate the development of white-box methods. Therefore, we propose PCA-Grad-CAM'', a method for visualizing attention regions in PCA feature vectors, and
SVM-Grad-CAM’', a method for visualizing attention regions in an SVM classifier layer. To complete our methods analytically, it is necessary to solve the closed-form Jacobian consisting of partial derivatives from the last convolutional layer to the PCA and/or SVM layers. In this paper, we present the exact closed-form Jacobian and the visualization results of our methods applied to several major datasets.
[LG-66] Combinations of Fast Activation and Trigonometric Functions in Kolmogorov-Arnold Networks
链接: https://arxiv.org/abs/2508.11876
作者: Hoang-Thang Ta,Duy-Quy Thai,Phuong-Linh Tran-Thi
类目: Machine Learning (cs.LG)
*备注: 6pages
Abstract:For years, many neural networks have been developed based on the Kolmogorov-Arnold Representation Theorem (KART), which was created to address Hilbert’s 13th problem. Recently, relying on KART, Kolmogorov-Arnold Networks (KANs) have attracted attention from the research community, stimulating the use of polynomial functions such as B-splines and RBFs. However, these functions are not fully supported by GPU devices and are still considered less popular. In this paper, we propose the use of fast computational functions, such as ReLU and trigonometric functions (e.g., ReLU, sin, cos, arctan), as basis components in Kolmogorov-Arnold Networks (KANs). By integrating these function combinations into the network structure, we aim to enhance computational efficiency. Experimental results show that these combinations maintain competitive performance while offering potential improvements in training time and generalization.
[LG-67] On Balancing Sparsity with Reliable Connectivity in Distributed Network Design with Random K-out Graphs
链接: https://arxiv.org/abs/2508.11863
作者: Mansi Sood,Eray Can Elumar,Osman Yagan
类目: ocial and Information Networks (cs.SI); Information Theory (cs.IT); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI); Optimization and Control (math.OC)
*备注: Present extensive evaluation of connectivity and related properties of random K-out graphs with several use cases in network design. Subsumes earlier results in IEEE ISIT 2021, ICC 2021, and ICC 2023
Abstract:In several applications in distributed systems, an important design criterion is ensuring that the network is sparse, i.e., does not contain too many edges, while achieving reliable connectivity. Sparsity ensures communication overhead remains low, while reliable connectivity is tied to reliable communication and inference on decentralized data reservoirs and computational resources. A class of network models called random K-out graphs appear widely as a heuristic to balance connectivity and sparsity, especially in settings with limited trust, e.g., privacy-preserving aggregation of networked data in which networks are deployed. However, several questions remain regarding how to choose network parameters in response to different operational requirements, including the need to go beyond asymptotic results and the ability to model the stochastic and adversarial environments. To address this gap, we present theorems to inform the choice of network parameters that guarantee reliable connectivity in regimes where nodes can be finite or unreliable. We first derive upper and lower bounds for probability of connectivity in random K-out graphs when the number of nodes is finite. Next, we analyze the property of r-robustness, a stronger notion than connectivity that enables resilient consensus in the presence of malicious nodes. Finally, motivated by aggregation mechanisms based on pairwise masking, we model and analyze the impact of a subset of adversarial nodes, modeled as deletions, on connectivity and giant component size - metrics that are closely tied to privacy guarantees. Together, our results pave the way for end-to-end performance guarantees for a suite of algorithms for reliable inference on networks.
[LG-68] Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding
链接: https://arxiv.org/abs/2508.11818
作者: Zhifeng Kong,Arushi Goel,Joao Felipe Santos,Sreyan Ghosh,Rafael Valle,Wei Ping,Bryan Catanzaro
类目: ound (cs.SD); Machine Learning (cs.LG)
*备注:
Abstract:Chain-of-thought reasoning has demonstrated significant improvements in large language models and vision language models, yet its potential for audio language models remains largely unexplored. In this technical report, we take a preliminary step towards closing this gap. For better assessment of sound reasoning, we propose AF-Reasoning-Eval, a benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. To prepare training corpus for sound reasoning abilities, we propose automatic pipelines that transform existing audio question answering and classification data into explicit reasoning chains, yielding AF-CoT-Train with 1.24M samples. We study the effect of finetuning Audio Flamingo series on AF-CoT-Train and observe considerable improvements on several reasoning benchmarks, validating the effectiveness of chain-of-thought finetuning on advanced sound understanding.
[LG-69] Fed-Meta-Align: A Similarity-Aware Aggregation and Personalization Pipeline for Federated TinyML on Heterogeneous Data
链接: https://arxiv.org/abs/2508.11794
作者: Hemanth Macharla,Mayukha Pal
类目: Machine Learning (cs.LG)
*备注:
Abstract:Real-time fault classification in resource-constrained Internet of Things (IoT) devices is critical for industrial safety, yet training robust models in such heterogeneous environments remains a significant challenge. Standard Federated Learning (FL) often fails in the presence of non-IID data, leading to model divergence. This paper introduces Fed-Meta-Align, a novel four-phase framework designed to overcome these limitations through a sophisticated initialization and training pipeline. Our process begins by training a foundational model on a general public dataset to establish a competent starting point. This model then undergoes a serial meta-initialization phase, where it sequentially trains on a subset of IOT Device data to learn a heterogeneity-aware initialization that is already situated in a favorable region of the loss landscape. This informed model is subsequently refined in a parallel FL phase, which utilizes a dual-criterion aggregation mechanism that weights for IOT devices updates based on both local performance and cosine similarity alignment. Finally, an on-device personalization phase adapts the converged global model into a specialized expert for each IOT Device. Comprehensive experiments demonstrate that Fed-Meta-Align achieves an average test accuracy of 91.27% across heterogeneous IOT devices, outperforming personalized FedAvg and FedProx by up to 3.87% and 3.37% on electrical and mechanical fault datasets, respectively. This multi-stage approach of sequenced initialization and adaptive aggregation provides a robust pathway for deploying high-performance intelligence on diverse TinyML networks.
[LG-70] Ontology-Guided Query Expansion for Biomedical Document Retrieval using Large Language Models
链接: https://arxiv.org/abs/2508.11784
作者: Zabir Al Nazi,Vagelis Hristidis,Aaron Lawson McLean,Jannat Ara Meem,Md Taukir Azam Chowdhury
类目: Information Retrieval (cs.IR); Machine Learning (cs.LG)
*备注:
Abstract:Effective Question Answering (QA) on large biomedical document collections requires effective document retrieval techniques. The latter remains a challenging task due to the domain-specific vocabulary and semantic ambiguity in user queries. We propose BMQExpander, a novel ontology-aware query expansion pipeline that combines medical knowledge - definitions and relationships - from the UMLS Metathesaurus with the generative capabilities of large language models (LLMs) to enhance retrieval effectiveness. We implemented several state-of-the-art baselines, including sparse and dense retrievers, query expansion methods, and biomedical-specific solutions. We show that BMQExpander has superior retrieval performance on three popular biomedical Information Retrieval (IR) benchmarks: NFCorpus, TREC-COVID, and SciFact - with improvements of up to 22.1% in NDCG@10 over sparse baselines and up to 6.5% over the strongest baseline. Further, BMQExpander generalizes robustly under query perturbation settings, in contrast to supervised baselines, achieving up to 15.7% improvement over the strongest baseline. As a side contribution, we publish our paraphrased benchmarks. Finally, our qualitative analysis shows that BMQExpander has fewer hallucinations compared to other LLM-based query expansion baselines.
[LG-71] Causal Structure Learning in Hawkes Processes with Complex Latent Confounder Networks
链接: https://arxiv.org/abs/2508.11727
作者: Songyao Jin,Biwei Huang
类目: Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注:
Abstract:Multivariate Hawkes process provides a powerful framework for modeling temporal dependencies and event-driven interactions in complex systems. While existing methods primarily focus on uncovering causal structures among observed subprocesses, real-world systems are often only partially observed, with latent subprocesses posing significant challenges. In this paper, we show that continuous-time event sequences can be represented by a discrete-time model as the time interval shrinks, and we leverage this insight to establish necessary and sufficient conditions for identifying latent subprocesses and the causal influences. Accordingly, we propose a two-phase iterative algorithm that alternates between inferring causal relationships among discovered subprocesses and uncovering new latent subprocesses, guided by path-based conditions that guarantee identifiability. Experiments on both synthetic and real-world datasets show that our method effectively recovers causal structures despite the presence of latent subprocesses.
[LG-72] From Heuristics to Data: Quantifying Site Planning Layout Indicators with Deep Learning and Multi-Modal Data
链接: https://arxiv.org/abs/2508.11723
作者: Qian Cao,Jielin Chen,Junchao Zhao,Rudi Stouffs
类目: Machine Learning (cs.LG)
*备注: 42 pages, 32 figures, submitted to Environment and Planning B: Urban Analytics and City Science
Abstract:The spatial layout of urban sites shapes land-use efficiency and spatial organization. Traditional site planning often relies on experiential judgment and single-source data, limiting systematic quantification of multifunctional layouts. We propose a Site Planning Layout Indicator (SPLI) system, a data-driven framework integrating empirical knowledge with heterogeneous multi-source data to produce structured urban spatial information. The SPLI supports multimodal spatial data systems for analytics, inference, and retrieval by combining OpenStreetMap (OSM), Points of Interest (POI), building morphology, land use, and satellite imagery. It extends conventional metrics through five dimensions: (1) Hierarchical Building Function Classification, refining empirical systems into clear hierarchies; (2) Spatial Organization, quantifying seven layout patterns (e.g., symmetrical, concentric, axial-oriented); (3) Functional Diversity, transforming qualitative assessments into measurable indicators using Functional Ratio (FR) and Simpson Index (SI); (4) Accessibility to Essential Services, integrating facility distribution and transport networks for comprehensive accessibility metrics; and (5) Land Use Intensity, using Floor Area Ratio (FAR) and Building Coverage Ratio (BCR) to assess utilization efficiency. Data gaps are addressed through deep learning, including Relational Graph Neural Networks (RGNN) and Graph Neural Networks (GNN). Experiments show the SPLI improves functional classification accuracy and provides a standardized basis for automated, data-driven urban spatial analytics.
[LG-73] Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming
链接: https://arxiv.org/abs/2508.11703
作者: Vasileios Saketos,Sebastian Kaltenbach,Sergey Litvinov,Petros Koumoutsakos
类目: Neural and Evolutionary Computing (cs.NE); Machine Learning (cs.LG)
*备注:
Abstract:Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven, evolutionary process that relies on Cartesian Genetic Programming (CGP) and Large Language Models (LLM). We evaluate the contributions of both modalities (CGP and LLM) in discovering the Kalman filter under varying conditions. Our results demonstrate that our framework of CGP and LLM-assisted evolution converges to near-optimal solutions when Kalman optimality assumptions hold. When these assumptions are violated, our framework evolves interpretable alternatives that outperform the Kalman filter. These results demonstrate that combining evolutionary algorithms and generative models for interpretable, data-driven synthesis of simple computational modules is a potent approach for algorithmic discovery in scientific computing.
[LG-74] ransfer Learning for Neutrino Scattering: Domain Adaptation with GANs
链接: https://arxiv.org/abs/2508.12987
作者: Jose L. Bonilla,Krzysztof M. Graczyk,Artur M. Ankowski,Rwik Dharmapal Banerjee,Beata E. Kowal,Hemant Prasad,Jan T. Sobczyk
类目: High Energy Physics - Phenomenology (hep-ph); Machine Learning (cs.LG); High Energy Physics - Experiment (hep-ex); Nuclear Experiment (nucl-ex); Computational Physics (physics.comp-ph)
*备注: 17 pages, 17 figures
Abstract:We utilize transfer learning to extrapolate the physics knowledge encoded in a Generative Adversarial Network (GAN) model trained on synthetic charged-current (CC) neutrino-carbon inclusive scattering data. This base model is adapted to generate CC inclusive scattering events (lepton kinematics only) for neutrino-argon and antineutrino-carbon interactions. Furthermore, we assess the effectiveness of transfer learning in re-optimizing a custom model when new data comes from a different neutrino-nucleus interaction model. Our results demonstrate that transfer learning significantly outperforms training generative models from scratch. To study this, we consider two training data sets: one with 10,000 and another with 100,000 events. The models obtained via transfer learning perform well even with smaller training data. The proposed method provides a promising approach for constructing neutrino scattering event generators in scenarios where experimental data is sparse.
[LG-75] Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
链接: https://arxiv.org/abs/2508.12968
作者: Branislav Gerazov,Marcello Politi,Sébastien Bratières
类目: Audio and Speech Processing (eess.AS); Machine Learning (cs.LG)
*备注:
Abstract:We explore the performance of several state-of-the-art automatic speech recognition (ASR) models on a large-scale Arabic speech dataset, the SADA (Saudi Audio Dataset for Arabic), which contains 668 hours of high-quality audio from Saudi television shows. The dataset includes multiple dialects and environments, specifically a noisy subset that makes it particularly challenging for ASR. We evaluate the performance of the models on the SADA test set, and we explore the impact of fine-tuning, language models, as well as noise and denoising on their performance. We find that the best performing model is the MMS 1B model finetuned on SADA with a 4-gram language model that achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
[LG-76] Shapley Values: Paired-Sampling Approximations
链接: https://arxiv.org/abs/2508.12947
作者: Michael Mayer,Mario V. Wüthrich
类目: Machine Learning (stat.ML); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG)
*备注:
Abstract:Originally introduced in cooperative game theory, Shapley values have become a very popular tool to explain machine learning predictions. Based on Shapley’s fairness axioms, every input (feature component) gets a credit how it contributes to an output (prediction). These credits are then used to explain the prediction. The only limitation in computing the Shapley values (credits) for many different predictions is of computational nature. There are two popular sampling approximations, sampling KernelSHAP and sampling PermutationSHAP. Our first novel contributions are asymptotic normality results for these sampling approximations. Next, we show that the paired-sampling approaches provide exact results in case of interactions being of maximal order two. Furthermore, the paired-sampling PermutationSHAP possesses the additive recovery property, whereas its kernel counterpart does not.
[LG-77] Simulation-Based Inference: A Practical Guide
链接: https://arxiv.org/abs/2508.12939
作者: Michael Deistler,Jan Boelts,Peter Steinbach,Guy Moss,Thomas Moreau,Manuel Gloeckler,Pedro L. C. Rodrigues,Julia Linhart,Janne K. Lappalainen,Benjamin Kurt Miller,Pedro J. Gonçalves,Jan-Matthis Lueckmann,Cornelius Schröder,Jakob H. Macke
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:A central challenge in many areas of science and engineering is to identify model parameters that are consistent with prior knowledge and empirical data. Bayesian inference offers a principled framework for this task, but can be computationally prohibitive when models are defined by stochastic simulators. Simulation-based Inference (SBI) is a suite of methods developed to overcome this limitation, which has enabled scientific discoveries in fields such as particle physics, astrophysics, and neuroscience. The core idea of SBI is to train neural networks on data generated by a simulator, without requiring access to likelihood evaluations. Once trained, inference is amortized: The neural network can rapidly perform Bayesian inference on empirical observations without requiring additional training or simulations. In this tutorial, we provide a practical guide for practitioners aiming to apply SBI methods. We outline a structured SBI workflow and offer practical guidelines and diagnostic tools for every stage of the process – from setting up the simulator and prior, choosing and training inference networks, to performing inference and validating the results. We illustrate these steps through examples from astrophysics, psychophysics, and neuroscience. This tutorial empowers researchers to apply state-of-the-art SBI methods, facilitating efficient parameter inference for scientific discovery.
[LG-78] he path to a goal: Understanding soccer possessions via path signatures
链接: https://arxiv.org/abs/2508.12930
作者: David Hirnschall,Robert Bajons
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We present a novel framework for predicting next actions in soccer possessions by leveraging path signatures to encode their complex spatio-temporal structure. Unlike existing approaches, we do not rely on fixed historical windows and handcrafted features, but rather encode the entire recent possession, thereby avoiding the inclusion of potentially irrelevant or misleading historical information. Path signatures naturally capture the order and interaction of events, providing a mathematically grounded feature encoding for variable-length time series of irregular sampling frequencies without the necessity for manual feature engineering. Our proposed approach outperforms a transformer-based benchmark across various loss metrics and considerably reduces computational cost. Building on these results, we introduce a new possession evaluation metric based on well-established frameworks in soccer analytics, incorporating both predicted action type probabilities and action location. Our metric shows greater reliability than existing metrics in domain-specific comparisons. Finally, we validate our approach through a detailed analysis of the 2017/18 Premier League season and discuss further applications and future extensions.
[LG-79] Optimal Condition for Initialization Variance in Deep Neural Networks: An SGD Dynamics Perspective
链接: https://arxiv.org/abs/2508.12834
作者: Hiroshi Horii(SU),Sothea Has(KHM)
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Stochastic gradient descent (SGD), one of the most fundamental optimization algorithms in machine learning (ML), can be recast through a continuous-time approximation as a Fokker-Planck equation for Langevin dynamics, a viewpoint that has motivated many theoretical studies. Within this framework, we study the relationship between the quasi-stationary distribution derived from this equation and the initial distribution through the Kullback-Leibler (KL) divergence. As the quasi-steady-state distribution depends on the expected cost function, the KL divergence eventually reveals the connection between the expected cost function and the initialization distribution. By applying this to deep neural network models (DNNs), we can express the bounds of the expected loss function explicitly in terms of the initialization parameters. Then, by minimizing this bound, we obtain an optimal condition of the initialization variance in the Gaussian case. This result provides a concrete mathematical criterion, rather than a heuristic approach, to select the scale of weight initialization in DNNs. In addition, we experimentally confirm our theoretical results by using the classical SGD to train fully connected neural networks on the MNIST and Fashion-MNIST datasets. The result shows that if the variance of the initialization distribution satisfies our theoretical optimal condition, then the corresponding DNN model always achieves lower final training loss and higher test accuracy than the conventional He-normal initialization. Our work thus supplies a mathematically grounded indicator that guides the choice of initialization variance and clarifies its physical meaning of the dynamics of parameters in DNNs.
[LG-80] Unfolded Laplacian Spectral Embedding: A Theoretically Grounded Approach to Dynamic Network Representation
链接: https://arxiv.org/abs/2508.12674
作者: Haruka Ezoe,Hiroki Matsumoto,Ryohei Hisano
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Social and Information Networks (cs.SI)
*备注:
Abstract:Dynamic relational structures play a central role in many AI tasks, but their evolving nature presents challenges for consistent and interpretable representation. A common approach is to learn time-varying node embeddings, whose effectiveness depends on satisfying key stability properties. In this paper, we propose Unfolded Laplacian Spectral Embedding, a new method that extends the Unfolded Adjacency Spectral Embedding framework to normalized Laplacians while preserving both cross-sectional and longitudinal stability. We provide formal proof that our method satisfies these stability conditions. In addition, as a bonus of using the Laplacian matrix, we establish a new Cheeger-style inequality that connects the embeddings to the conductance of the underlying dynamic graphs. Empirical evaluations on synthetic and real-world datasets support our theoretical findings and demonstrate the strong performance of our method. These results establish a principled and stable framework for dynamic network representation grounded in spectral graph theory.
[LG-81] owards SISO Bistatic Sensing for ISAC
链接: https://arxiv.org/abs/2508.12614
作者: Zhongqin Wang,J. Andrew Zhang,Kai Wu,Min Xu,Y. Jay Guo
类目: ignal Processing (eess.SP); Human-Computer Interaction (cs.HC); Machine Learning (cs.LG)
*备注:
Abstract:Integrated Sensing and Communication (ISAC) is a key enabler for next-generation wireless systems. However, real-world deployment is often limited to low-cost, single-antenna transceivers. In such bistatic Single-Input Single-Output (SISO) setup, clock asynchrony introduces random phase offsets in Channel State Information (CSI), which cannot be mitigated using conventional multi-antenna methods. This work proposes WiDFS 3.0, a lightweight bistatic SISO sensing framework that enables accurate delay and Doppler estimation from distorted CSI by effectively suppressing Doppler mirroring ambiguity. It operates with only a single antenna at both the transmitter and receiver, making it suitable for low-complexity deployments. We propose a self-referencing cross-correlation (SRCC) method for SISO random phase removal and employ delay-domain beamforming to resolve Doppler ambiguity. The resulting unambiguous delay-Doppler-time features enable robust sensing with compact neural networks. Extensive experiments show that WiDFS 3.0 achieves accurate parameter estimation, with performance comparable to or even surpassing that of prior multi-antenna methods, especially in delay estimation. Validated under single- and multi-target scenarios, the extracted ambiguity-resolved features show strong sensing accuracy and generalization. For example, when deployed on the embedded-friendly MobileViT-XXS with only 1.3M parameters, WiDFS 3.0 consistently outperforms conventional features such as CSI amplitude, mirrored Doppler, and multi-receiver aggregated Doppler.
[LG-82] SimQFL: A Quantum Federated Learning Simulator with Real-Time Visualization
链接: https://arxiv.org/abs/2508.12477
作者: Ratun Rahman,Atit Pokharel,Md Raihan Uddin,Dinh C. Nguyen
类目: Quantum Physics (quant-ph); Machine Learning (cs.LG)
*备注:
Abstract:Quantum federated learning (QFL) is an emerging field that has the potential to revolutionize computation by taking advantage of quantum physics concepts in a distributed machine learning (ML) environment. However, the majority of available quantum simulators are primarily built for general quantum circuit simulation and do not include integrated support for machine learning tasks such as training, evaluation, and iterative optimization. Furthermore, designing and assessing quantum learning algorithms is still a difficult and resource-intensive task. Real-time updates are essential for observing model convergence, debugging quantum circuits, and making conscious choices during training with the use of limited resources. Furthermore, most current simulators fail to support the integration of user-specific data for training purposes, undermining the main purpose of using a simulator. In this study, we introduce SimQFL, a customized simulator that simplifies and accelerates QFL experiments in quantum network applications. SimQFL supports real-time, epoch-wise output development and visualization, allowing researchers to monitor the process of learning across each training round. Furthermore, SimQFL offers an intuitive and visually appealing interface that facilitates ease of use and seamless execution. Users can customize key variables such as the number of epochs, learning rates, number of clients, and quantum hyperparameters such as qubits and quantum layers, making the simulator suitable for various QFL applications. The system gives immediate feedback following each epoch by showing intermediate outcomes and dynamically illustrating learning curves. SimQFL is a practical and interactive platform enabling academics and developers to prototype, analyze, and tune quantum neural networks with greater transparency and control in distributed quantum networks.
[LG-83] ATLAS: AI-Native Receiver Test-and-Measurement by Leverag ing AI-Guided Search
链接: https://arxiv.org/abs/2508.12204
作者: Mauro Belgiovine,Suyash Pradhan,Johannes Lange,Michael Löhning,Kaushik Chowdhury
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Networking and Internet Architecture (cs.NI)
*备注: Accepted at IEEE PIMRC 2025
Abstract:Industry adoption of Artificial Intelligence (AI)-native wireless receivers, or even modular, Machine Learning (ML)-aided wireless signal processing blocks, has been slow. The main concern is the lack of explainability of these trained ML models and the significant risks posed to network functionalities in case of failures, especially since (i) testing on every exhaustive case is infeasible and (ii) the data used for model training may not be available. This paper proposes ATLAS, an AI-guided approach that generates a battery of tests for pre-trained AI-native receiver models and benchmarks the performance against a classical receiver architecture. Using gradient-based optimization, it avoids spanning the exhaustive set of all environment and channel conditions; instead, it generates the next test in an online manner to further probe specific configurations that offer the highest risk of failure. We implement and validate our approach by adopting the well-known DeepRx AI-native receiver model as well as a classical receiver using differentiable tensors in NVIDIA’s Sionna environment. ATLAS uncovers specific combinations of mobility, channel delay spread, and noise, where fully and partially trained variants of AI-native DeepRx perform suboptimally compared to the classical receivers. Our proposed method reduces the number of tests required per failure found by 19% compared to grid search for a 3-parameters input optimization problem, demonstrating greater efficiency. In contrast, the computational cost of the grid-based approach scales exponentially with the number of variables, making it increasingly impractical for high-dimensional problems.
[LG-84] Robust Data Fusion via Subsampling
链接: https://arxiv.org/abs/2508.12048
作者: Jing Wang,HaiYing Wang,Kun Chen
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:Data fusion and transfer learning are rapidly growing fields that enhance model performance for a target population by leveraging other related data sources or tasks. The challenges lie in the various potential heterogeneities between the target and external data, as well as various practical concerns that prevent a naïve data integration. We consider a realistic scenario where the target data is limited in size while the external data is large but contaminated with outliers; such data contamination, along with other computational and operational constraints, necessitates proper selection or subsampling of the external data for transfer learning. To our knowledge,transfer learning and subsampling under data contamination have not been thoroughly investigated. We address this gap by studying various transfer learning methods with subsamples of the external data, accounting for outliers deviating from the underlying true model due to arbitrary mean shifts. Two subsampling strategies are investigated: one aimed at reducing biases and the other at minimizing variances. Approaches to combine these strategies are also introduced to enhance the performance of the estimators. We provide non-asymptotic error bounds for the transfer learning estimators, clarifying the roles of sample sizes, signal strength, sampling rates, magnitude of outliers, and tail behaviors of model error distributions, among other factors. Extensive simulations show the superior performance of the proposed methods. Additionally, we apply our methods to analyze the risk of hard landings in A380 airplanes by utilizing data from other airplane types,demonstrating that robust transfer learning can improve estimation efficiency for relatively rare airplane types with the help of data from other types of airplanes.
[LG-85] Adversarial Robustness in Distributed Quantum Machine Learning
链接: https://arxiv.org/abs/2508.11848
作者: Pouya Kananian,Hans-Arno Jacobsen
类目: Quantum Physics (quant-ph); Emerging Technologies (cs.ET); Machine Learning (cs.LG)
*备注: This is a preprint of a book chapter that is planned to be published in “Quantum Robustness in Artificial Intelligence” by Springer Nature
Abstract:Studying adversarial robustness of quantum machine learning (QML) models is essential in order to understand their potential advantages over classical models and build trustworthy systems. Distributing QML models allows leveraging multiple quantum processors to overcome the limitations of individual devices and build scalable systems. However, this distribution can affect their adversarial robustness, potentially making them more vulnerable to new attacks. Key paradigms in distributed QML include federated learning, which, similar to classical models, involves training a shared model on local data and sending only the model updates, as well as circuit distribution methods inherent to quantum computing, such as circuit cutting and teleportation-based techniques. These quantum-specific methods enable the distributed execution of quantum circuits across multiple devices. This work reviews the differences between these distribution methods, summarizes existing approaches on the adversarial robustness of QML models when distributed using each paradigm, and discusses open questions in this area.
[LG-86] Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
链接: https://arxiv.org/abs/2508.11847
作者: Jenny Y. Huang,Yunyi Shen,Dennis Wei,Tamara Broderick
类目: Machine Learning (stat.ML); Machine Learning (cs.LG)
*备注:
Abstract:We propose a method for evaluating the robustness of a widely used LLM ranking system – the Bradley–Terry ranking system – to dropping a worst-case very small fraction of evaluation data. Our approach is computationally fast and easy to adopt. When we apply our method to matchups from two popular human-preference platforms, Chatbot Arena and MT-Bench, we find that the Bradley–Terry rankings of top-performing models are remarkably sensitive to the removal of a small fraction of evaluations. Our framework also identifies the specific evaluations most responsible for such ranking flips, allowing for inspections of these influential preferences. We observe that the rankings derived from MT-Bench preferences are notably more robust than those from Chatbot Arena, likely due to MT-bench’s use of expert annotators and carefully constructed prompts. Finally, we find that rankings based on crowdsourced human-evaluated systems are just as sensitive as those based on LLM-as-a-judge evaluations, where in both, dropping as little as 0.02% of the total evaluations in the dataset can change the top-ranked model.
[LG-87] BaMANI: Bayesian Multi-Algorithm causal Network Inference
链接: https://arxiv.org/abs/2508.11741
作者: Habibolla Latifizadeh,Anika C. Pirkey,Alanna Gould,David J. Klinke II
类目: Machine Learning (stat.ML); Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
*备注: 12 pages, 6 figures
Abstract:Improved computational power has enabled different disciplines to predict causal relationships among modeled variables using Bayesian network inference. While many alternative algorithms have been proposed to improve the efficiency and reliability of network prediction, the predicted causal networks reflect the generative process but also bear an opaque imprint of the specific computational algorithm used. Following a ``wisdom of the crowds" strategy, we developed an ensemble learning approach to marginalize the impact of a single algorithm on Bayesian causal network inference. To introduce the approach, we first present the theoretical foundation of this framework. Next, we present a comprehensive implementation of the framework in terms of a new software tool called BaMANI (Bayesian Multi-Algorithm causal Network Inference). Finally, we describe a BaMANI use-case from biology, particularly within human breast cancer studies.
[LG-88] Enhancing Corrosion Resistance of Aluminum Alloys Through AI and ML Modeling
链接: https://arxiv.org/abs/2508.11685
作者: Farnaz Kaboudvand,Maham Khalid,Nydia Assaf,Vardaan Sahgal,Jon P. Ruffley,Brian J. McDermott
类目: ignal Processing (eess.SP); Materials Science (cond-mat.mtrl-sci); Machine Learning (cs.LG); Machine Learning (stat.ML)
*备注: Manuscript length: 11 pages, 6 figures
Abstract:Corrosion poses a significant challenge to the performance of aluminum alloys, particularly in marine environments. This study investigates the application of machine learning (ML) algorithms to predict and optimize corrosion resistance, utilizing a comprehensive open-source dataset compiled from various sources. The dataset encompasses corrosion rate data and environmental conditions, preprocessed to standardize units and formats. We explored two different approaches, a direct approach, where the material’s composition and environmental conditions were used as inputs to predict corrosion rates; and an inverse approach, where corrosion rate served as the input to identify suitable material compositions as output. We employed and compared three distinct ML methodologies for forward predictions: Random Forest regression, optimized via grid search; a feed-forward neural network, utilizing ReLU activation and Adam optimization; and Gaussian Process Regression (GPR), implemented with GPyTorch and employing various kernel functions. The Random Forest and neural network models provided predictive capabilities based on elemental compositions and environmental conditions. Notably, Gaussian Process Regression demonstrated superior performance, particularly with hybrid kernel functions. Log-transformed GPR further refined predictions. This study highlights the efficacy of ML, particularly GPR, in predicting corrosion rates and material properties.
[LG-89] A Graph Neural Network based on a Functional Topology Model: Unveiling the Dynamic Mechanisms of Non-Suicidal Self-Injury in Single-Channel EEG
链接: https://arxiv.org/abs/2508.11684
作者: BG Tong
类目: ignal Processing (eess.SP); Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC)
*备注:
Abstract:Objective: This study proposes and preliminarily validates a novel “Functional-Energetic Topology Model” to uncover neurodynamic mechanisms of Non-Suicidal Self-Injury (NSSI), using Graph Neural Networks (GNNs) to decode brain network patterns from single-channel EEG in real-world this http URL: EEG data were collected over ~1 month from three adolescents with NSSI using a smartphone app and a portable Fp1 EEG headband during impulsive and non-impulsive states. A theory-driven GNN with seven functional nodes was built. Performance was evaluated via intra-subject (80/20 split) and leave-one-subject-out cross-validation (LOSOCV). GNNExplainer was used for this http URL: The model achieved high intra-subject accuracy (85%) and significantly above-chance cross-subject performance (approximately73.7%). Explainability analysis revealed a key finding: during NSSI states, a critical feedback loop regulating somatic sensation exhibits dysfunction and directional reversal. Specifically, the brain loses its ability to self-correct via negative bodily feedback, and the regulatory mechanism enters an “ineffective idling” this http URL: This work demonstrates the feasibility of applying theory-guided GNNs to sparse, single-channel EEG for decoding complex mental states. The identified “feedback loop reversal” offers a novel, dynamic, and computable model of NSSI mechanisms, paving the way for objective biomarkers and next-generation Digital Therapeutics (DTx).
[LG-90] Explainable Deep Neural Network for Multimodal ECG Signals: Intermediate vs Late Fusion
链接: https://arxiv.org/abs/2508.11666
作者: Timothy Oladunni,Ehimen Aneni
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:The limitations of unimodal deep learning models, particularly their tendency to overfit and limited generalizability, have renewed interest in multimodal fusion strategies. Multimodal deep neural networks (MDNN) have the capability of integrating diverse data domains and offer a promising solution for robust and accurate predictions. However, the optimal fusion strategy, intermediate fusion (feature-level) versus late fusion (decision-level) remains insufficiently examined, especially in high-stakes clinical contexts such as ECG-based cardiovascular disease (CVD) classification. This study investigates the comparative effectiveness of intermediate and late fusion strategies using ECG signals across three domains: time, frequency, and time-frequency. A series of experiments were conducted to identify the highest-performing fusion architecture. Results demonstrate that intermediate fusion consistently outperformed late fusion, achieving a peak accuracy of 97 percent, with Cohen’s d 0.8 relative to standalone models and d = 0.40 compared to late fusion. Interpretability analyses using saliency maps reveal that both models align with the discretized ECG signals. Statistical dependency between the discretized ECG signals and corresponding saliency maps for each class was confirmed using Mutual Information (MI). The proposed ECG domain-based multimodal model offers superior predictive capability and enhanced explainability, crucial attributes in medical AI applications, surpassing state-of-the-art models.
[LG-91] Energy-Efficient Real-Time 4-Stage Sleep Classification at 10-Second Resolution: A Comprehensive Study
链接: https://arxiv.org/abs/2508.11664
作者: Zahra Mohammadi,Parnian Fazel,Siamak Mohammadi
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Sleep stage classification is crucial for diagnosing and managing disorders such as sleep apnea and insomnia. Conventional clinical methods like polysomnography are costly and impractical for long-term home use. We present an energy-efficient pipeline that detects four sleep stages (wake, REM, light, and deep) from a single-lead ECG. Two windowing strategies are introduced: (1) a 5-minute window with 30-second steps for machine-learning models that use handcrafted features, and (2) a 30-second window with 10-second steps for deep-learning models, enabling near-real-time 10-second resolution. Lightweight networks such as MobileNet-v1 reach 92 percent accuracy and 91 percent F1-score but still draw significant energy. We therefore design SleepLiteCNN, a custom model that achieves 89 percent accuracy and 89 percent F1-score while lowering energy use to 5.48 microjoules per inference at 45 nm. Applying eight-bit quantization preserves accuracy and further reduces power, and FPGA deployment confirms low resource usage. The proposed system offers a practical solution for continuous, wearable ECG-based sleep monitoring.
[LG-92] Unsupervised Pairwise Learning Optimization Framework for Cross-Corpus EEG-Based Emotion Recognition Based on Prototype Representation
链接: https://arxiv.org/abs/2508.11663
作者: Guangli Li,Canbiao Wu,Zhen Liang
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Affective computing is a rapidly developing interdisciplinary research direction in the field of brain-computer interface. In recent years, the introduction of deep learning technology has greatly promoted the development of the field of emotion recognition. However, due to physiological differences between subjects, as well as the variations in experimental environments and equipment, cross-corpus emotion recognition faces serious challenges, especially for samples near the decision boundary. To solve the above problems, we propose an optimization method based on domain adversarial transfer learning to fine-grained alignment of affective features, named Maximum classifier discrepancy with Pairwise Learning (McdPL) framework. In McdPL, we design a dual adversarial classifier (Ada classifier and RMS classifier), and apply a three-stage adversarial training to maximize classification discrepancy and minimize feature distribution to align controversy samples near the decision boundary. In the process of domain adversarial training, the two classifiers also maintain an adversarial relationship, ultimately enabling precise cross-corpus feature alignment. In addition, the introduction of pairwise learning transforms the classification problem of samples into a similarity problem between samples, alleviating the influence of label noise. We conducted systematic experimental evaluation of the model using publicly available SEED, SEED-IV and SEED-V databases. The results show that the McdPL model is superior to other baseline models in the cross-corpus emotion recognition task, and the average accuracy improvements of 4.76% and 3.97%, respectively. Our work provides a promising solution for emotion recognition cross-corpus. The source code is available at this https URL.
[LG-93] Robust Sparse Bayesian Learning Based on Minimum Error Entropy for Noisy High-Dimensional Brain Activity Decoding
链接: https://arxiv.org/abs/2508.11657
作者: Yuanhao Li,Badong Chen,Wenjun Bai,Yasuharu Koike,Okito Yamashita
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注:
Abstract:Objective: Sparse Bayesian learning provides an effective scheme to solve the high-dimensional problem in brain signal decoding. However, traditional assumptions regarding data distributions such as Gaussian and binomial are potentially inadequate to characterize the noisy signals of brain activity. Hence, this study aims to propose a robust sparse Bayesian learning framework to address noisy highdimensional brain activity decoding. Methods: Motivated by the commendable robustness of the minimum error entropy (MEE) criterion for handling complex data distributions, we proposed an MEE-based likelihood function to facilitate the accurate inference of sparse Bayesian learning in analyzing noisy brain datasets. Results: Our proposed approach was evaluated using two high-dimensional brain decoding tasks in regression and classification contexts, respectively. The experimental results showed that, our approach can realize superior decoding metrics and physiological patterns than the conventional and state-of-the-art methods. Conclusion: Utilizing the proposed MEE-based likelihood model, sparse Bayesian learning is empowered to simultaneously address the challenges of noise and high dimensionality in the brain decoding task. Significance: This work provides a powerful tool to realize robust brain decoding, advancing biomedical engineering applications such as brain-computer interface.
[LG-94] Inductive transfer learning from regression to classification in ECG analysis
链接: https://arxiv.org/abs/2508.11656
作者: Ridma Jayasundara,Ishan Fernando,Adeepa Fernando,Roshan Ragel,Vajira Thambawita,Isuru Nawinne
类目: ignal Processing (eess.SP); Machine Learning (cs.LG)
*备注: This manuscript is 15 pages with 4 tables and 5 figures. The manuscript is under review at Nature Scientific Reports
Abstract:Cardiovascular diseases (CVDs) are the leading cause of mortality worldwide, accounting for over 30% of global deaths according to the World Health Organization (WHO). Importantly, one-third of these deaths are preventable with timely and accurate diagnosis. The electrocardiogram (ECG), a non-invasive method for recording the electrical activity of the heart, is crucial for diagnosing CVDs. However, privacy concerns surrounding the use of patient ECG data in research have spurred interest in synthetic data, which preserves the statistical properties of real data without compromising patient confidentiality. This study explores the potential of synthetic ECG data for training deep learning models from regression to classification tasks and evaluates the feasibility of transfer learning to enhance classification performance on real ECG data. We experimented with popular deep learning models to predict four key cardiac parameters, namely, Heart Rate (HR), PR interval, QT interval, and QRS complex-using separate regression models. Subsequently, we leveraged these regression models for transfer learning to perform 5-class ECG signal classification. Our experiments systematically investigate whether transfer learning from regression to classification is viable, enabling better utilization of diverse open-access and synthetic ECG datasets. Our findings demonstrate that transfer learning from regression to classification improves classification performance, highlighting its potential to maximize the utility of available data and advance deep learning applications in this domain.
[LG-95] HetSyn: Versatile Timescale Integration in Spiking Neural Networks via Heterogeneous Synapses
链接: https://arxiv.org/abs/2508.11644
作者: Zhichao Deng,Zhikun Liu,Junxue Wang,Shengqian Chen,Xiang Wei,Qiang Yu
类目: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
*备注:
Abstract:Spiking Neural Networks (SNNs) offer a biologically plausible and energy-efficient framework for temporal information processing. However, existing studies overlook a fundamental property widely observed in biological neurons-synaptic heterogeneity, which plays a crucial role in temporal processing and cognitive capabilities. To bridge this gap, we introduce HetSyn, a generalized framework that models synaptic heterogeneity with synapse-specific time constants. This design shifts temporal integration from the membrane potential to the synaptic current, enabling versatile timescale integration and allowing the model to capture diverse synaptic dynamics. We implement HetSyn as HetSynLIF, an extended form of the leaky integrate-and-fire (LIF) model equipped with synapse-specific decay dynamics. By adjusting the parameter configuration, HetSynLIF can be specialized into vanilla LIF neurons, neurons with threshold adaptation, and neuron-level heterogeneous models. We demonstrate that HetSynLIF not only improves the performance of SNNs across a variety of tasks-including pattern generation, delayed match-to-sample, speech recognition, and visual recognition-but also exhibits strong robustness to noise, enhanced working memory performance, efficiency under limited neuron resources, and generalization across timescales. In addition, analysis of the learned synaptic time constants reveals trends consistent with empirical observations in biological synapses. These findings underscore the significance of synaptic heterogeneity in enabling efficient neural computation, offering new insights into brain-inspired temporal modeling.
[LG-96] ghtening the mixed integer linear formulation for the piecewise linear approximation in general dimensions
链接: https://arxiv.org/abs/2508.09395
作者: Quentin Ploussard,Xiang Li,Matija Pavičević
类目: Optimization and Control (math.OC); Computational Geometry (cs.CG); Discrete Mathematics (cs.DM); Machine Learning (cs.LG)
*备注: Added Acknowledgements and U.S. Government license disclaimer
Abstract:This paper addresses the problem of tightening the mixed-integer linear programming (MILP) formulation for continuous piecewise linear (CPWL) approximations of data sets in arbitrary dimensions. The MILP formulation leverages the difference-of-convex (DC) representation of CPWL functions. We introduce the concept of well-behaved CPWL interpolations and demonstrate that any CPWL interpolation of a data set has a well-behaved version. This result is critical to tighten the MILP problem. We present six different strategies to tighten the problem, which include fixing the values of some variables, introducing additional constraints, identifying small big-M parameter values and applying tighter variable bounds. These methods leverage key aspects of the DC representation and the inherent structure of well-behaved CPWL interpolations. Experimental results demonstrate that specific combinations of these tightening strategies lead to significant improvement in solution times, especially for tightening strategies that consider well-behaved CPWL solutions.
信息检索
[IR-0] D-RDW: Diversity-Driven Random Walks for News Recommender Systems
链接: https://arxiv.org/abs/2508.13035
作者: Runze Li,Lucien Heitz,Oana Inel,Abraham Bernstein
类目: Information Retrieval (cs.IR)
*备注: 6 pages
Abstract:This paper introduces Diversity-Driven RandomWalks (D-RDW), a lightweight algorithm and re-ranking technique that generates diverse news recommendations. D-RDW is a societal recommender, which combines the diversification capabilities of the traditional random walk algorithms with customizable target distributions of news article properties. In doing so, our model provides a transparent approach for editors to incorporate norms and values into the recommendation process. D-RDW shows enhanced performance across key diversity metrics that consider the articles’ sentiment and political party mentions when compared to state-of-the-art neural models. Furthermore, D-RDW proves to be more computationally efficient than existing approaches.
[IR-1] Informfully Recommenders – Reproducibility Framework for Diversity-aware Intra-session Recommendations
链接: https://arxiv.org/abs/2508.13019
作者: Lucien Heitz,Runze Li,Oana Inel,Abraham Bernstein
类目: Information Retrieval (cs.IR)
*备注: 10 pages
Abstract:Norm-aware recommender systems have gained increased attention, especially for diversity optimization. The recommender systems community has well-established experimentation pipelines that support reproducible evaluations by facilitating models’ benchmarking and comparisons against state-of-the-art methods. However, to the best of our knowledge, there is currently no reproducibility framework to support thorough norm-driven experimentation at the pre-processing, in-processing, post-processing, and evaluation stages of the recommender pipeline. To address this gap, we present Informfully Recommenders, a first step towards a normative reproducibility framework that focuses on diversity-aware design built on Cornac. Our extension provides an end-to-end solution for implementing and experimenting with normative and general-purpose diverse recommender systems that cover 1) dataset pre-processing, 2) diversity-optimized models, 3) dedicated intrasession item re-ranking, and 4) an extensive set of diversity metrics. We demonstrate the capabilities of our extension through an extensive offline experiment in the news domain.
[IR-2] Deep Research: A Survey of Autonomous Research Agents
链接: https://arxiv.org/abs/2508.12752
作者: Wenlin Zhang,Xiaopeng Li,Yingyi Zhang,Pengyue Jia,Yichao Wang,Huifeng Guo,Yong Liu,Xiangyu Zhao
类目: Information Retrieval (cs.IR)
*备注:
Abstract:The rapid advancement of large language models (LLMs) has driven the development of agentic systems capable of autonomously performing complex tasks. Despite their impressive capabilities, LLMs remain constrained by their internal knowledge boundaries. To overcome these limitations, the paradigm of deep research has been proposed, wherein agents actively engage in planning, retrieval, and synthesis to generate comprehensive and faithful analytical reports grounded in web-based evidence. In this survey, we provide a systematic overview of the deep research pipeline, which comprises four core stages: planning, question developing, web exploration, and report generation. For each stage, we analyze the key technical challenges and categorize representative methods developed to address them. Furthermore, we summarize recent advances in optimization techniques and benchmarks tailored for deep research. Finally, we discuss open challenges and promising research directions, aiming to chart a roadmap toward building more capable and trustworthy deep research agents.
[IR-3] Multi-Granularity Distribution Modeling for Video Watch Time Prediction via Exponential-Gaussian Mixture Network RECSYS’2025
链接: https://arxiv.org/abs/2508.12665
作者: Xu Zhao,Ruibo Ma,Jiaqi Chen,Weiqi Zhao,Ping Yang,Yao Hu
类目: Information Retrieval (cs.IR)
*备注: Accepted as oral full paper by RecSys’2025 conference
Abstract:Accurate watch time prediction is crucial for enhancing user engagement in streaming short-video platforms, although it is challenged by complex distribution characteristics across multi-granularity levels. Through systematic analysis of real-world industrial data, we uncover two critical challenges in watch time prediction from a distribution aspect: (1) coarse-grained skewness induced by a significant concentration of quick-skips1, (2) fine-grained diversity arising from various user-video interaction patterns. Consequently, we assume that the watch time follows the Exponential-Gaussian Mixture (EGM) distribution, where the exponential and Gaussian components respectively characterize the skewness and diversity. Accordingly, an Exponential-Gaussian Mixture Network (EGMN) is proposed for the parameterization of EGM distribution, which consists of two key modules: a hidden representation encoder and a mixture parameter generator. We conducted extensive offline experiments on public datasets and online A/B tests on the industrial short-video feeding scenario of Xiaohongshu App to validate the superiority of EGMN compared with existing state-of-the-art methods. Remarkably, comprehensive experimental results have proven that EGMN exhibits excellent distribution fitting ability across coarse-to-fine-grained levels. We open source related code on Github: this https URL.
[IR-4] Diagnostic-Guided Dynamic Profile Optimization for LLM -based User Simulators in Sequential Recommendation
链接: https://arxiv.org/abs/2508.12645
作者: Hongyang Liu,Zhu Sun,Tianjun Wei,Yan Wang,Jiajie Zhu,Xinghua Qu
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Recent advances in large language models (LLMs) have enabled realistic user simulators for developing and evaluating recommender systems (RSs). However, existing LLM-based simulators for RSs face two major limitations: (1) static and single-step prompt-based inference that leads to inaccurate and incomplete user profile construction; (2) unrealistic and single-round recommendation-feedback interaction pattern that fails to capture real-world scenarios. To address these limitations, we propose DGDPO (Diagnostic-Guided Dynamic Profile Optimization), a novel framework that constructs user profile through a dynamic and iterative optimization process to enhance the simulation fidelity. Specifically, DGDPO incorporates two core modules within each optimization loop: firstly, a specialized LLM-based diagnostic module, calibrated through our novel training strategy, accurately identifies specific defects in the user profile. Subsequently, a generalized LLM-based treatment module analyzes the diagnosed defect and generates targeted suggestions to refine the profile. Furthermore, unlike existing LLM-based user simulators that are limited to single-round interactions, we are the first to integrate DGDPO with sequential recommenders, enabling a bidirectional evolution where user profiles and recommendation strategies adapt to each other over multi-round interactions. Extensive experiments conducted on three real-world datasets demonstrate the effectiveness of our proposed framework.
[IR-5] jXBW: Fast Substructure Search in Large-Scale JSONL Datasets for Foundation Model Applications
链接: https://arxiv.org/abs/2508.12536
作者: Yasuo Tabei
类目: Databases (cs.DB); Data Structures and Algorithms (cs.DS); Information Retrieval (cs.IR)
*备注:
Abstract:Substructure search in JSON Lines (JSONL) datasets is essential for modern applications such as prompt engineering in foundation models, but existing methods suffer from prohibitive computational costs due to exhaustive tree traversal and subtree matching. We present jXBW, a fast method for substructure search on large-scale JSONL datasets. Our method makes three key technical contributions: (i) a merged tree representation built by merging trees of multiple JSON objects while preserving individual identities, (ii) a succinct data structure based on the eXtended Burrows-Wheeler Transform that enables efficient tree navigation and subpath search, and (iii) an efficient three-step substructure search algorithm that combines path decomposition, ancestor computation, and adaptive tree identifier collection to ensure correctness while avoiding exhaustive tree traversal. Experimental evaluation on real-world datasets demonstrates that jXBW consistently outperforms existing methods, achieving speedups of 16 \times for smaller datasets and up to 4,700 \times for larger datasets over tree-based approaches, and more than 6 \times 10 ^6 over XML-based processing while maintaining competitive memory usage.
[IR-6] Contrastive Multi-View Graph Hashing
链接: https://arxiv.org/abs/2508.12377
作者: Yang Xu,Zuliang Yang,Kai Ming Ting
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Multi-view graph data, which both captures node attributes and rich relational information from diverse sources, is becoming increasingly prevalent in various domains. The effective and efficient retrieval of such data is an important task. Although multi-view hashing techniques have offered a paradigm for fusing diverse information into compact binary codes, they typically assume attributes-based inputs per view. This makes them unsuitable for multi-view graph data, where effectively encoding and fusing complex topological information from multiple heterogeneous graph views to generate unified binary embeddings remains a significant challenge. In this work, we propose Contrastive Multi-view Graph Hashing (CMGHash), a novel end-to-end framework designed to learn unified and discriminative binary embeddings from multi-view graph data. CMGHash learns a consensus node representation space using a contrastive multi-view graph loss, which aims to pull k -nearest neighbors from all graphs closer while pushing away negative pairs, i.e., non-neighbor nodes. Moreover, we impose binarization constraints on this consensus space, enabling its conversion to a corresponding binary embedding space at minimal cost. Extensive experiments on several benchmark datasets demonstrate that CMGHash significantly outperforms existing approaches in terms of retrieval accuracy.
[IR-7] aoSR1: The Thinking Model for E-commerce Relevance Search
链接: https://arxiv.org/abs/2508.12365
作者: Chenhe Dong,Shaowei Yao,Pengkun Jiao,Jianhui Yang,Yiming Jin,Zerui Huang,Xiaojiang Zhou,Dan Ou,Haihong Tang
类目: Information Retrieval (cs.IR)
*备注:
Abstract:Query-product relevance prediction is a core task in e-commerce search. BERT-based models excel at semantic matching but lack complex reasoning capabilities. While Large Language Models (LLMs) are explored, most still use discriminative fine-tuning or distill to smaller models for deployment. We propose a framework to directly deploy LLMs for this task, addressing key challenges: Chain-of-Thought (CoT) error accumulation, discriminative hallucination, and deployment feasibility. Our framework, TaoSR1, involves three stages: (1) Supervised Fine-Tuning (SFT) with CoT to instill reasoning; (2) Offline sampling with a pass@N strategy and Direct Preference Optimization (DPO) to improve generation quality; and (3) Difficulty-based dynamic sampling with Group Relative Policy Optimization (GRPO) to mitigate discriminative hallucination. Additionally, post-CoT processing and a cumulative probability-based partitioning method enable efficient online deployment. TaoSR1 significantly outperforms baselines on offline datasets and achieves substantial gains in online side-by-side human evaluations, introducing a novel paradigm for applying CoT reasoning to relevance classification.